Hi Ted,

Thanks for your reply. I am doing clustering of 10^6 objects (hence an affinity matrix of that size) and expect 4,000-10,000 clusters. That is why I need that many eigenvectors.
Will SVD be faster in this case?

Aniruddha

On Jul 19, 2012, at 7:20 PM, "Ted Dunning" <[email protected]> wrote:

> Folks have done SVD on very large matrices with Mahout, but not necessarily
> for spectral clustering.
>
> Are you sure that you actually need 4000 vectors? As sparse as your data
> is, I would expect that no more than a few hundred are anything but
> statistical noise.
>
> On Thu, Jul 19, 2012 at 6:32 PM, Aniruddha Basak <[email protected]> wrote:
>
>> Thanks, Dmitriy, for your reply.
>> The matrix I am working on has 10-20 non-zero entries per row, so it is
>> very sparse.
>> I am trying to do spectral clustering, which involves eigendecomposition.
>> I am wondering whether anyone has tried to do spectral clustering with
>> Mahout for a very large affinity matrix (input).
>>
>> Aniruddha
>>
>> -----Original Message-----
>> From: Dmitriy Lyubimov [mailto:[email protected]]
>> Sent: Thursday, July 19, 2012 6:28 PM
>> To: [email protected]
>> Subject: Re: eigendecomposition of very large matrices
>>
>> Very significant sparsity may be a problem, though, for -q >= 1. Again, it
>> depends on the hardware you have and the number of non-zero elements in
>> the input, but -q=1 is still the most recommended setting here.
>>
>> On Thu, Jul 19, 2012 at 6:20 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> You may try SSVD:
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>
>>> But 4k eigenvectors (or rather, singular values) is still a lot, and it
>>> may push the precision outside our error estimates; I don't think we
>>> have a precision study for that many. You also need quite a bit of
>>> memory to compute that (not to mention flops). More realistically, you
>>> could try 1k singular values. You may try more if you have access to
>>> more powerful hardware than we did in the studies, but distributed
>>> computation time will grow at about k^1.5, i.e. faster than linear,
>>> even if you have enough nodes for the tasks.
>>>
>>> -d
>>>
>>> On Thu, Jul 19, 2012 at 6:12 PM, Aniruddha Basak <[email protected]> wrote:
>>>> Hi,
>>>> I am working on a clustering problem which involves determining the
>>>> largest "k" eigenvectors of a very large matrix. The matrices I work
>>>> on are typically of order 10^6 by 10^6.
>>>> Trying to do this using the Lanczos solver available in Mahout, I
>>>> found it is very slow: it takes around 1.5 minutes to compute each
>>>> eigenvector, so getting 4000 eigenvectors takes 100 hours, or about
>>>> 4 days!
>>>>
>>>> So I am looking for something faster to solve the eigendecomposition
>>>> problem for a very large sparse matrix. Please suggest what I should use.
>>>>
>>>> Thanks,
>>>> Aniruddha
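[Editor's note] For readers following the thread: the SSVD job Dmitriy recommends is a distributed implementation of randomized (stochastic) SVD. A minimal single-machine sketch of the same idea, written in NumPy rather than Mahout/Hadoop, is below; the function name, the small demo matrix, and the oversampling value are illustrative assumptions, but the `q` parameter plays the same role as the `-q` power-iteration flag discussed above.

```python
import numpy as np

def randomized_svd(A, k, q=1, oversample=10, seed=0):
    """Illustrative randomized-SVD sketch (the idea behind Mahout's SSVD).

    q is the number of power iterations, analogous to the -q flag in the
    thread; q=1 is the setting Dmitriy recommends.  Not Mahout's code.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random projection: Y's columns approximately span A's top-k column space.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    # Power iterations sharpen the spectrum; helps when singular values decay slowly.
    for _ in range(q):
        Y = A @ (A.T @ Y)
    # Orthonormal basis for the sampled range.
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small (k + oversample) x n projected problem.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Tiny demo on an exactly rank-5 matrix: the approximation should be near-exact.
m, n, r = 200, 150, 5
rng = np.random.default_rng(1)
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
U, s, Vt = randomized_svd(A, k=5, q=1)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

One consequence of the ~k^1.5 scaling Dmitriy mentions: going from 1k to 4k singular values costs roughly (4000/1000)^1.5 = 8x the distributed computation time, not 4x, which is why trimming k (as Ted suggests) matters more than it first appears.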
