Dan, Nathan Halko's thesis did detailed comparisons of singular values between Mahout's Lanczos and SSVD. You can find a link to his dissertation in this list's archive (or perhaps he mentioned it on @dev; I can't remember off the top of my head).
Bottom line, the way I read the thesis: in his experiments on the Wikipedia dataset, Mahout's Lanczos loses precision around the 40th eigenvalue and runs out of memory (OOM) around the 60th. SSVD results, on the other hand, look good with q=1 and outstanding with q=2, and there is virtually no difference between q=2 and q=3.

He did not go beyond 100 or 200 eigenvalues though, I think. Computing 200 eigenvalues and beyond is fairly flops-intensive, but I think one could do it if one wanted to, and there is little evidence that precision would degrade at a faster rate than it did for the first 100 values.

It looks like Nathan tested with a pre-0.6 release, as he ran into some deficiencies with power iterations which were largely fixed in 0.6. But I can't speak for him.

If you find anything to add on top of this, it would certainly be a welcome testimony.

-d

On Tue, Mar 27, 2012 at 4:41 PM, Dan Brickley wrote:
> On 27 March 2012 18:22, Ted Dunning wrote:
>> The smallest eigenvalues are always problematic in large matrices.
>>
>> Any trick to expose them (such as the diagonal subtraction that you
>> mention) should work with any of our stuff as well.
>
> Thanks, Ted; much appreciated! I'll see if I can get a stronger grasp
> on those tricks via matlab/R/etc first, then take a look at the Mahout
> options (Lanczos or SSVD). Fairly nearby is a discussion with Shannon
> about the Spectral Clustering code, and whether investigating
> migration of that to SSVD might make sense. (c.f.
> https://issues.apache.org/jira/browse/MAHOUT-986 )
>
> Maybe this is as good an excuse as any to get my hands dirty with
> SSVD... Is there any strong reason to prefer Lanczos for this sort of
> thing? Depends on app? (mine won't care massively about super
> precision; speed would generally be more of a concern)
>
>
> cheers,
>
>
> Dan
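
P.S. For anyone who wants a feel for how the number of power iterations q affects singular-value precision without setting up Mahout, here is a minimal NumPy sketch of a randomized SVD with power iterations. To be clear about the assumptions: this is not Mahout's SSVD code and not Nathan's experimental setup. The function names (`randomized_singular_values`, `make_test_matrix`, `max_relative_error`), the oversampling default p=10, and the synthetic test matrix with geometrically decaying singular values are all made up for illustration.

```python
import numpy as np


def randomized_singular_values(A, k, p=10, q=0, seed=0):
    """Approximate the top-k singular values of A.

    k: number of singular values wanted
    p: oversampling (extra random directions); illustrative default
    q: number of power iterations
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Random projection onto k + p directions.
    omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ omega)
    # Power iterations, re-orthogonalizing at every step for stability.
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Singular values of the small projected matrix approximate those of A.
    B = Q.T @ A
    return np.linalg.svd(B, compute_uv=False)[:k]


def make_test_matrix(m=300, n=200, decay=0.9, seed=1):
    """Synthetic matrix with known singular values decay**0, decay**1, ..."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = decay ** np.arange(n)
    return (U * s) @ V.T, s


def max_relative_error(q, k=20):
    """Worst relative error over the top-k singular values for a given q."""
    A, s_true = make_test_matrix()
    s_est = randomized_singular_values(A, k, q=q)
    return float(np.max(np.abs(s_est - s_true[:k]) / s_true[:k]))


if __name__ == "__main__":
    for q in range(4):
        print(f"q={q}: max relative error = {max_relative_error(q):.3e}")
```

Running the script prints the worst-case relative error for q = 0 through 3 on the toy matrix. A slowly decaying spectrum like this one is where power iterations should matter most, but a toy matrix says nothing definite about the Wikipedia dataset, so treat it as intuition only.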
