I don't see the direct linear connection between dimensionality and the number of clusters. To wit, a hyper-cube of dimension d has 2^d corners. Put differently, the quasi-orthogonal packing capacity of d-dimensional space is considerably larger than d for d > 30 or so; in fact, the capacity grows exponentially with d. Since each cluster centroid only needs its own nearly-orthogonal direction, d dimensions can accommodate far more than d well-separated clusters.
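As a quick numerical illustration of that capacity (just a numpy sketch; the sizes here are arbitrary), random directions in even a couple hundred dimensions are already nearly orthogonal to one another:

# Sketch: sample many random directions in d = 200 dimensions and check
# that they are all nearly orthogonal (cosines concentrate near 1/sqrt(d)).
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 2000                  # 2000 directions in only 200 dimensions
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit vectors

G = np.abs(X @ X.T)               # |cosine| between every pair
np.fill_diagonal(G, 0.0)
print("typical |cosine|:", G.mean())    # around 0.06 for d = 200
print("worst pair |cosine|:", G.max())  # still far below 1

So 2000 directions fit into 200 dimensions with every pair staying well separated, which is the packing argument in miniature.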
Likewise, by the Johnson-Lindenstrauss lemma you can approximate pair-wise distances very well with dimension proportional to log n, where n is the number of points (not the original dimension). This implies that an orthogonal basis of size C log n for a small constant C should suffice to describe your data, and the log of 10^6 is only about 14. These two arguments (quasi-orthogonality and Johnson-Lindenstrauss) come to the same conclusion from opposite directions. I have a hard time believing that you need more than 200 dimensions here. This is especially true since you have such sparse vectors.
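If you want to see the Johnson-Lindenstrauss side of this concretely, here is a quick sketch (the sizes and the constant C = 15 are illustrative choices, not tuned values): project random points down to about C log n dimensions with a Gaussian matrix and compare pairwise distances before and after.

# Sketch: a Gaussian random projection to k ~ C log n dimensions
# approximately preserves all pairwise distances (Johnson-Lindenstrauss).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n, d = 500, 5000                  # n points in a high-dimensional space
X = rng.standard_normal((n, d))

k = int(15 * np.log(n))           # C = 15 gives k around 93 here
R = rng.standard_normal((d, k)) / np.sqrt(k)     # random projection
Y = X @ R

ratios = pdist(Y) / pdist(X)      # per-pair distance distortion
print(f"k={k}: distance ratios span [{ratios.min():.2f}, {ratios.max():.2f}]")

Note that 15 * log(10^6) is about 200, which is where the "no more than 200 dimensions" estimate comes from.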
On Thu, Jul 19, 2012 at 7:25 PM, Aniruddha Basak <[email protected]> wrote:

> Hi Ted,
> Thanks for your reply.
> I am doing clustering of 10^6 objects (thus an affinity matrix of that
> size) and expect 4000-10,000 clusters. That's why I need that many
> eigenvectors.
>
> Will SVD be faster in this case?
>
> Aniruddha
>
>
> On Jul 19, 2012, at 7:20 PM, "Ted Dunning" <[email protected]> wrote:
>
> > Folks have done SVD on very large matrices with Mahout, but not
> > necessarily for spectral clustering.
> >
> > Are you sure that you actually need 4000 vectors? As sparse as your
> > data is, I would expect that no more than a few hundred are anything
> > but statistical noise.
> >
> > On Thu, Jul 19, 2012 at 6:32 PM, Aniruddha Basak <[email protected]> wrote:
> >
> >> Thanks Dmitriy for your reply.
> >> The matrix I am working on has 10-20 non-zero entries per row, so it
> >> is very sparse.
> >> I am trying to do spectral clustering, which involves
> >> eigendecomposition. I am wondering whether anyone has tried to do
> >> spectral clustering using Mahout for a very large affinity matrix
> >> (input).
> >>
> >> Aniruddha
> >>
> >>
> >> -----Original Message-----
> >> From: Dmitriy Lyubimov [mailto:[email protected]]
> >> Sent: Thursday, July 19, 2012 6:28 PM
> >> To: [email protected]
> >> Subject: Re: eigendecomposition of very large matrices
> >>
> >> Very significant sparsity may be a problem, though, for -q >= 1
> >> parameters. Again, it depends on the hardware you have and the # of
> >> non-zero elements in the input, but -q=1 is still the most
> >> recommended setting here.
> >>
> >>
> >> On Thu, Jul 19, 2012 at 6:20 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> You may try SSVD:
> >>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> >>>
> >>> But 4k eigenvectors (or, rather, singular vectors) is still kind of
> >>> a lot and may push the precision out of the error estimates; I don't
> >>> think we have had a precision study for that many. You also need
> >>> quite a bit of memory to compute that (not to mention flops). More
> >>> realistically, you may try 1k singular values. You may try more if
> >>> you have access to more powerful hardware than we did in the
> >>> studies, but distributed computation time will grow at about k^1.5,
> >>> i.e. faster than linearly, even if you have enough nodes for the
> >>> tasks.
> >>>
> >>> -d
> >>>
> >>> On Thu, Jul 19, 2012 at 6:12 PM, Aniruddha Basak <[email protected]> wrote:
> >>>> Hi,
> >>>> I am working on a clustering problem which involves determining the
> >>>> largest "k" eigenvectors of a very large matrix. The matrices I
> >>>> work on are typically of the order of 10^6 by 10^6.
> >>>> Trying to do this using the Lanczos solver available in Mahout, I
> >>>> found it is very slow, taking around 1.5 minutes to compute each
> >>>> eigenvector. Hence to get 4000 eigenvectors, it takes 100 hours,
> >>>> or 4 days!
> >>>>
> >>>> So I am looking for something faster to solve the eigendecomposition
> >>>> problem for a very large sparse matrix. Please suggest what I
> >>>> should use.
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Aniruddha
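For what it's worth, here is a rough, scaled-down sketch of the stochastic projection idea in Python rather than Mahout (scikit-learn's randomized_svd; the synthetic matrix, k = 200, and the single power iteration are illustrative assumptions that mirror the numbers discussed above, not tuned recommendations):

# Sketch of the randomized SVD idea behind Mahout's SSVD, scaled down.
# The affinity matrix here is synthetic, with ~15 non-zeros per row to
# match the sparsity described in the thread.
import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd

n = 100_000                       # stand-in for the 10^6 x 10^6 matrix
A = sp.random(n, n, density=15 / n, format="csr", random_state=0)
A = (A + A.T) / 2                 # symmetrize, as for an affinity matrix

# k of a few hundred, with one power iteration (the -q=1 advice above).
U, s, Vt = randomized_svd(A, n_components=200, n_iter=1, random_state=0)
print(s[:10])                     # leading singular values

The cost of a run like this is dominated by a handful of passes over the sparse matrix, which is why it scales so much better than computing thousands of Lanczos vectors one at a time.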
