SmallK is a high-performance software package for constrained low-rank matrix approximation via nonnegative matrix factorization (NMF). NMF algorithms compute low-rank factors of a matrix, producing two nonnegative matrices whose product approximates the original matrix. The role of NMF in data analytics has been as significant as that of the singular value decomposition (SVD). However, because of its nonnegativity constraints, NMF yields results that are far more interpretable for many practical problems, such as image processing, chemometrics, bioinformatics, and topic modeling for text analytics. Our approach to solving the nonconvex NMF optimization problem has proven convergence properties and is one of the most efficient methods developed to date.
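To make the idea of "two nonnegative matrices whose product approximates the original matrix" concrete, here is a minimal pure-Python sketch. It uses the classic multiplicative update (MU) rules, chosen only because they are short. It is not SmallK's API and not the solver SmallK uses. The function names (`nmf_multiplicative`, `frobenius_error`), the small example matrix, and the parameters `iters`, `seed`, and `eps` are all assumptions made for illustration.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Frobenius norm of the residual A - W*H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf_multiplicative(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate a nonnegative m x n matrix A by W (m x k) times H (k x n).

    Illustrative multiplicative updates only; eps guards against division by zero.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Random strictly positive starting factors.
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), elementwise
        Wt = transpose(W)
        num = matmul(Wt, A)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num = matmul(A, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H


# A small nonnegative 4 x 3 matrix of rank 2 (hypothetical example data).
A = [[1, 2, 0],
     [0, 1, 3],
     [1, 3, 3],
     [2, 5, 3]]

W, H = nmf_multiplicative(A, k=2)
print("residual norm:", frobenius_error(A, W, H))
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, the factors stay nonnegative throughout. That constraint is what makes the columns of `W` and rows of `H` readable as additive parts.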
1.1. Distributed Versions
Recently open sourced: MPI-FAUN! Both MPI and OpenMP implementations of the MU, HALS, and ANLS/BPP based NMF algorithms are now available. The implementations can run off the shelf or can be easily integrated into other source code. These NMF algorithms are highly tuned to run on supercomputers. We have tested this software on NERSC as well as OLCF clusters. The OpenMP implementation has been tested on many different Linux variants with Intel processors. The library works well for both sparse and dense matrices.
Please visit MPI-FAUN for more information and source code.
1.2. Ground truth data for graph clustering and community detection
Community discovery is an important task for revealing structures in large networks. The massive size of contemporary social networks poses a tremendous challenge to the scalability of traditional graph clustering algorithms and the evaluation of discovered communities.
Please visit dblp ground truth data to obtain the data.
For U.S. patent data to test hybrid clustering of content and connection structure using joint NMF, go to patent data to view the readme.
1.3. Acknowledgements
This work was funded in part by the DARPA XDATA program under contract FA8750-12-2-0309. Our DARPA program manager is Mr. Wade Shen and our XDATA Principal Investigator is Prof. Haesun Park of the Georgia Institute of Technology. We would like to thank Rundong Du for the dblp ground truth data set and Dr. Ramakrishnan Kannan of ORNL for the MPI-FAUN! distributed code. Also, special thanks to Dr. Richard Boyd, Dr. Da Kuang, and Ashley Scripka-Beavers for their contributions to previous versions of this documentation and significant technical contributions. A final special thanks to Ethan Trewhitt, who created the Docker install and advised on how to use Sphinx and RTD for documentation.
1.4. Contact Info
For comments, questions, bug reports, suggestions, etc., contact: