The Lanczos implementation of SVD worked very well with my dense matrix. I ran several iterations to confirm that I had the top 3 eigenvectors of my matrix, and used these vectors to visualize the top principal components of my data.
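For reference, the visualization step is nothing more than projecting each row onto the three vectors to get (x, y, z) coordinates for plotting. A minimal sketch in plain Java (array names and shapes are only illustrative, this is not the Mahout code itself):

// Sketch only: project each data row onto the top-3 vectors to get
// 3-D coordinates for plotting. Names and shapes are illustrative.
public final class ProjectTop3 {

  /**
   * @param rows  data matrix, one observation per row (n x d)
   * @param basis top-3 vectors from the decomposition, one per row (3 x d)
   * @return n x 3 array of principal-component coordinates
   */
  public static double[][] project(double[][] rows, double[][] basis) {
    double[][] coords = new double[rows.length][basis.length];
    for (int i = 0; i < rows.length; i++) {
      for (int k = 0; k < basis.length; k++) {
        double dot = 0.0;
        for (int j = 0; j < rows[i].length; j++) {
          dot += rows[i][j] * basis[k][j];
        }
        coords[i][k] = dot;
      }
    }
    return coords;
  }
}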
As for the transpose code, I believe the last part could benefit from some feedback. In my implementation I spawn multiple jobs, one per split, so that no single node runs out of disk space. The last step sequentially combines the pieces into one sequence file, which is probably a bad approach; I only do it because I want to use the output in other Mahout jobs. Instead of running this slow combine step, I was thinking it would be better to keep the output in separate large chunks and run further jobs with Hadoop's MultiFileInputFormat. The problem with this, however, is that once a matrix is split, I do not know of any way to use the split sequence files in other Mahout jobs, other than writing dedicated Java code that specifies the multiple input files for the job (there is a sketch of what I mean below the quoted thread).

My questions are:

1. What would be the preferred way of storing large matrices, or even large files, on HDFS?
2. Is it efficient to perform many small mapred jobs on the same matrix, given that the jobs move to the data rather than the data moving to the jobs?

-Vincent

On Fri, May 6, 2011 at 4:18 PM, Ted Dunning <[email protected]> wrote:
>
> If you have the code and would like to contribute it, file a JIRA and attach
> a patch.
>
> It will be interesting to hear how the SVD proceeds. Such a large dense
> matrix is an unusual target for SVD.
>
> Also, it is possible to adapt the R version of random projection to never
> keep all of the large matrix in memory. Instead, only slices of the matrix
> are kept and the multiplications involved are done progressively. The
> results are kept in memory, but not the large matrix. This would probably
> make your sequential version fast enough to use. R may not be usable unless
> it can read the portions of your large matrix quickly using binary I/O.
>
> Also, I suspect that you are trying to get the transpose in order to
> decompose A' A. This is not necessary as far as I can tell since you can
> simply decompose A and use that to compute the decomposition of A' A even
> faster than you can compute the decomposition of A itself.
>
> On Fri, May 6, 2011 at 7:36 AM, Vincent Xue <[email protected]> wrote:
>
> > Because I am limited by my resources, I coded up a slower but effective
> > implementation of the transpose job that I could share. It avoids loading
> > all the data on to one node by transposing the matrix in pieces. The
> > slowest part of this is combining the pieces back to one matrix. :(
> >
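P.S. To make the "dedicated Java code specifying the multi input files" point concrete, this is roughly the kind of driver I had in mind (new mapreduce API): each chunk directory is added as its own input path instead of being merged first. The paths, the class name, and the missing mapper/reducer are only illustrative, this is not existing Mahout code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class MultiChunkDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "job over transposed matrix chunks");
    job.setJarByClass(MultiChunkDriver.class);

    // Each chunk of the transposed matrix is its own SequenceFile<IntWritable, VectorWritable>.
    // Add every chunk directory as an input path instead of combining them first.
    for (String chunk : args) {  // e.g. paths like /user/vincent/At/chunk-0000 (illustrative)
      FileInputFormat.addInputPath(job, new Path(chunk));
    }

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);
    // job.setMapperClass(...); job.setReducerClass(...);  // whatever the follow-up job needs

    FileOutputFormat.setOutputPath(job, new Path("/user/vincent/At-output"));  // illustrative
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With something like this the slow combine step disappears, but every downstream job would need a similar custom driver, which is exactly what I was hoping to avoid.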
