On Fri, Dec 13, 2013 at 12:42 PM, Ron Ayoub <[email protected]> wrote:
> I'm doing some up-front research on implementing LSI and a choice of
> tools. I understand Mahout provides an out-of-core implementation of
> Stochastic SVD. The web site uses the words 'reasonable size problems'.
> Would a sparse matrix of 1,000,000 x 1,000,000 with some 250,000,000
> nonzero entries be out of the question?

For performance/accuracy assessment, Nathan's dissertation [1], pp. 139 and
on, is the best source I know of so far. Nathan compares performance and
assesses bottlenecks on at least two interesting data sets -- wiki and
wiki-max. He experienced the bottleneck in the matrix multiplication
operation (but he may have done the testing before certain improvements
were applied to the matrix-matrix part of the power iterations -- I am
still hazy on that).

[1] http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf

I have great hope that this bottleneck can be further addressed by taking
MapReduce out of the equation and replacing it with Bagel or GraphX
broadcast operations in the upcoming Spark 0.9. I plan to address that in
the Mahout-on-Spark part of the code, but I am still waiting for the Spark
project to rehash its graph-based computation approach (there is a sense
that GraphX should be superior in broadcast techniques to the existing
Bagel API in Spark).

> If so, what tools out there can do that? For instance, ARPACK.

AFAIK nobody to date has cared to do the comparison with ARPACK.

> Regardless, how does Mahout SSVD compare to ARPACK? These seem to be the
> options out there that I have found. Thanks.
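FWIW, to make this concrete, here is a rough single-machine NumPy sketch of
the randomized SVD scheme from [1] (the same family of algorithm that
Mahout SSVD implements in out-of-core form -- the function and parameter
names here are purely illustrative, not Mahout's API). The power-iteration
loop is the matrix-matrix step discussed above. On the size question, back
of the envelope: 250,000,000 nonzeros at roughly 12 bytes each (two 4-byte
indices plus a 4-byte value) is about 3 GB in coordinate form, so the input
itself does not sound outlandish.

import numpy as np
import scipy.sparse as sp

def ssvd(A, k, p=10, q=1, seed=0):
    """Toy randomized SVD per [1]; names and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    omega = rng.standard_normal((n, k + p))  # random test matrix
    Y = A @ omega                            # sample the range of A
    for _ in range(q):                       # power iterations: this is the
        Y = A @ (A.T @ Y)                    # matrix-matrix bottleneck
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for range(Y)
    B = (A.T @ Q).T                          # small (k+p) x n projection
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# toy usage, far smaller than 1M x 1M but at a similar ~2.5e-4 density
A = sp.random(20_000, 20_000, density=2.5e-4, format="csr", random_state=1)
U, s, Vt = ssvd(A, k=50)
print(U.shape, s.shape, Vt.shape)            # (20000, 50) (50,) (50, 20000)

A real implementation also re-orthogonalizes Y between power iterations for
numerical stability, which is part of what makes the distributed version of
that loop non-trivial.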
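And re the broadcast idea: the gist, in toy PySpark form (plain Spark
broadcast here, not Bagel or GraphX, and all names are mine), is to ship
the small dense factor to every worker once so that Y = A * Omega gets
computed against local row blocks, with nothing the size of A going through
a shuffle:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-multiply-sketch")
n, k = 1000, 25

# rows of A keyed by row index (dense rows purely for brevity)
rows = sc.parallelize([(i, np.random.randn(n)) for i in range(5000)])

# Omega is small (n x k): broadcast it once to every worker
omega = sc.broadcast(np.random.randn(n, k))

# each partition computes its slice of Y = A @ Omega locally, no shuffle
Y = rows.mapValues(lambda r: r @ omega.value)
print(Y.first()[1].shape)                    # (25,)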

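And since nobody has done the ARPACK comparison yet: SciPy wraps ARPACK as
scipy.sparse.linalg.svds, and scikit-learn ships a randomized SVD from the
same family as [1], so a first cut is a few lines. This is scaled down from
1,000,000 x 1,000,000 to laptop size at the same ~2.5e-4 density, and it
compares the algorithms, obviously not Mahout's out-of-core code:

import time
import scipy.sparse as sp
from scipy.sparse.linalg import svds               # ARPACK-backed
from sklearn.utils.extmath import randomized_svd   # Halko-style randomized

A = sp.random(50_000, 50_000, density=2.5e-4, format="csr", random_state=0)
k = 50

t0 = time.time()
U1, s1, Vt1 = svds(A, k=k)                         # ARPACK Lanczos
t_arpack = time.time() - t0

t0 = time.time()
U2, s2, Vt2 = randomized_svd(A, n_components=k, n_iter=2, random_state=0)
t_rand = time.time() - t0

# note: svds returns singular values in ascending order
print("ARPACK     %6.1fs  sigma_1 = %.4f" % (t_arpack, s1[-1]))
print("randomized %6.1fs  sigma_1 = %.4f" % (t_rand, s2[0]))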