My definition of "so": Mahout Lanczos and R yielding eigenvalues like (I'm inventing the numbers here because I don't remember the exact figures) 1834.58, 756.34, 325.67, 125.67 and providing very good recommendations in the recommender system, while SSVD gives eigenvalues of (invented numbers again) 723.56, 354.67, 111.67, 101.46 and provides nonsense recommendations... That's why I'm suspecting there might be a bug in the input code. Small changes in decimal places, or even in units, like 723.56 versus 730.78, would be reasonable. 1834 versus 723 is not. I'm putting these numbers in quarantine until I determine everything is OK with the input code.
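(To make the kind of sanity check I mean concrete: a quick numpy sketch on a made-up matrix, nothing to do with the real rating data. The squared singular values of a matrix must sum to its squared Frobenius norm, so comparing that norm across the two input pipelines is a cheap way to tell whether both solvers really saw the same matrix; a factor of ~2.5 in the leading value almost certainly means they did not.)

```python
import numpy as np

# Made-up input; stands in for the real rating matrix.
rng = np.random.default_rng(42)
A = rng.standard_normal((60, 40))

s = np.linalg.svd(A, compute_uv=False)

# Invariant: sum of squared singular values equals ||A||_F^2.
# Compute ||A||_F^2 directly in each input pipeline and compare:
# if the numbers differ, the solvers were fed different matrices.
fro2 = float(np.linalg.norm(A, "fro") ** 2)
spec2 = float(np.sum(s ** 2))
print(abs(spec2 - fro2) / fro2)  # ~0 up to rounding
```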
Thanks for the link to Halko's dissertation. I know it's a nice piece of work and a reference, and I had already given it a look, but I always have to do my own experiments: I have found so often that things don't work as expected on certain real cases that I always try to at least validate that what is in papers and dissertations also applies to my data. I'm aware SSVD is non-deterministic; I always check this kind of algorithm with several runs. Here are some results on the MovieLens 100k data using the R implementation of SSVD provided here (I hope there are no significant differences between the results of this implementation and Mahout's):
https://cwiki.apache.org/confluence/download/attachments/27832158/ssvd.R?version=1&modificationDate=1323358453000

The first line is the first 10 eigenvalues computed with R's svd. The next three runs are computed with ssvd.svd with q=0, and the last three with q=1:

> svd.r$d[1:10]
 [1] 640.63362 244.83635 217.84622 159.15360 158.21191 145.87261 126.57977 121.90770 106.82918  99.74794

[1] "three runs with q=0"
 [1] 640.63362 244.83613 217.84493 159.14512 158.20471 145.82572 126.42295 121.79764 105.99973  98.99649
 [1] 640.63362 244.83592 217.84568 159.13914 158.19299 145.84226 126.46651 121.73629 106.22892  99.11622
 [1] 640.63362 244.83590 217.84482 159.12955 158.19675 145.81728 126.47135 121.79920 106.45790  99.01242

[1] "three runs with q=1"
 [1] 640.63259 244.75889 217.66362 158.40002 157.61954 145.26448 125.25675 119.74266 104.16382  95.43547
 [1] 640.6327  244.7559  217.6805  158.6019  157.4059  144.9223  124.2859  119.1194  103.9104   96.6282
 [1] 640.63313 244.62599 217.67781 158.72475 157.13394 145.08462 125.33024 120.20984 102.45867  95.37994

I have repeated the runs several times with the same results... Maybe I'm still missing something else, but given these results I can't apply the rule "q=1 improves accuracy". At least I have to experiment; my guess is that it does depend on the dataset.
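(For anyone following along without R: here is the same q experiment as a minimal numpy sketch. The ssvd function is my own toy implementation of the Halko et al. randomized scheme, not Mahout's code, and the matrix is synthetic with a known, fast-decaying spectrum; on data like this, q=1 does improve the tail values, which is part of why I suspect the dataset matters.)

```python
import numpy as np

def ssvd(A, k, p=10, q=0, seed=None):
    """Minimal randomized SVD (Halko et al. scheme):
    k = rank wanted, p = oversampling, q = power iterations."""
    rng = np.random.default_rng(seed)
    # Stage A: orthonormal basis Q approximating range(A)
    Y = A @ rng.standard_normal((A.shape[1], k + p))
    Q, _ = np.linalg.qr(Y)
    # Power iterations sharpen Q when the spectrum decays slowly
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Stage B: exact SVD of the small projection B = Q^T A
    U_b, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_b)[:, :k], s[:k], Vt[:k, :]

# Synthetic 200x100 matrix with a known, geometrically decaying spectrum
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((200, 100)))
V, _ = np.linalg.qr(rng.standard_normal((100, 100)))
s_true = 1000.0 * 0.7 ** np.arange(100)
A = U @ np.diag(s_true) @ V.T

errs = {}
for q in (0, 1):
    _, s, _ = ssvd(A, k=10, p=10, q=q, seed=1)
    errs[q] = float(np.max(np.abs(s - s_true[:10]) / s_true[:10]))
    print(f"q={q}: max relative error in top 10 values: {errs[q]:.2e}")
```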
I would also like to repeat this comparison with Mahout's SSVD and my dataset and see what happens. Dmitriy, thank you very much for your attention and for sharing your thoughts with me. I really appreciate it.

Best,
Fernando.

2013/8/3 Dmitriy Lyubimov <[email protected]>

> On Fri, Aug 2, 2013 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> On Fri, Aug 2, 2013 at 2:52 PM, Fernando Fernández <
>> [email protected]> wrote:
>>
>>> I don't agree with k>10 being unlikely meaningful. I've used SVD in text
>>> mining problems where k~150 yielded best results (not only was it a good
>>> choice based on plotting the eigenvalues and seeing the elbow in the
>>> decay near 150, but checking results with different k's showed that
>>> around 150 they made much more sense). Currently I'm working on a
>>> recommender system and already have Lanczos running with k~50 producing
>>> best results, again based on visual exploration of the eigenvalues and
>>> exploring results one by one and seeing they were more meaningful.
>>> Current tests with SSVD are based on the latter, and when I say I'm not
>>> getting good results I mean Lanczos is working properly on the same
>>> problem (I've explored eigenvalues up to 150 and have a good decay) and
>>> SSVD is not. But as I said, this might be caused by some bug in the
>>> input process; it seems too strange to me that results are so different.
>>
>> Depends on how you define "so". But again, in that respect all i can
>> point to is the accuracy study by N. Halko, out of published work.
>
> I guess i can save you digging thru the Mahout wiki, here is the reference:
> http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
> Specifically, look at the eigenvalue chart comparison at page 179. This is
> run on Mahout's Lanczos and SSVD neck-to-neck. The order of accuracy for
> the first 40 values is claimed as "Order of accuracy is q = 3; q = 2;
> q = 1, lanczos, q = 0."
> (see source for details of the accuracy assessment).
>
> One thing i did not understand there is why Lanczos showed such an
> uncharacteristic fall-off for values between 40 and 60. I have always
> assumed q=1 was showing something much closer to reality after the first
> 40 values as well.
>
>>> I'll get back to this discussion when I figure it out :) . If you are
>>> curious about the numbers: 1MM rows by 150k columns for the text mining
>>> case and 18MM rows by 80k columns for the recommender.
>>>
>>> About p and q, I have been playing around with the movielens 100k
>>> dataset and found q>0 actually worsens results in terms of precision
>>> (nothing severe though, but it happens) and it's better to increase p a
>>> little in that particular case, so my guess is it depends a lot on the
>>> dataset, though I don't know how.
>>
>> This again sounds very strange. The algorithm is non-deterministic, which
>> means the errors you get in one run will be different from the errors in
>> another run, but honestly, you would be the first to report that power
>> iterations worsen the expectation of the error. All theoretical work and
>> practical estimates did not confirm that observation; in fact, quite a
>> bit to the contrary.
>>
>>> 2013/8/2 Dmitriy Lyubimov <[email protected]>
>>>
>>>> the only time you would not get good results is if the spectrum does
>>>> not have a good decay. Which is equivalent to mostly the same variance
>>>> in most of the original basis directions. This problem is similar to
>>>> the one that arises with PCA when you try to do dimensionality
>>>> reduction while retaining a certain %-tage of variance: in case of flat
>>>> spectrum decay, you'd need a much bigger k to retain the same amount of
>>>> variance in the dimensionally reduced projection. In that sense the
>>>> SSVD solution for a given k is as good as PCA gets for the same k.
>>>> Also, i believe (but am not 100% sure) "problems too small" exhibit
>>>> higher errors due to the law of large numbers.
>>>>
>>>> On Fri, Aug 2, 2013 at 10:41 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> if you use k > 40 you are already beating Lanczos for larger datasets.
>>>>> k>10 is unlikely meaningful. p need not be more than 15% of k (default
>>>>> is 15). use q=1; q>1 does not yield tangible improvements in the real
>>>>> world. Again, see Nathan Halko's dissertation on the accuracy
>>>>> comparison.
>>>>>
>>>>> On Fri, Aug 2, 2013 at 4:17 AM, Fernando Fernández <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Keeping Lanczos would be nice. Like I said, it's currently being used
>>>>>> in some projects with good results and I think it's easier to tune,
>>>>>> so it would be my first choice for future developments. I still need
>>>>>> to further test SSVD, especially because in the current example I'm
>>>>>> working on it yields very different results from Lanczos. We are
>>>>>> investigating whether it can be due to a bug when loading the data,
>>>>>> though the dimensions of the output seem ok, or whether it's a
>>>>>> question of increasing the p or q parameters. If it's a question of
>>>>>> increasing p and q, I think running times would make SSVD not viable.
>>>>>> I hope to be able to provide some comparison figures in terms of
>>>>>> precision and running time in a month or so.
>>>>>>
>>>>>> I hope that other users read this and say whether they are using
>>>>>> Lanczos.
>>>>>>
>>>>>> Best,
>>>>>> Fernando.
>>>>>>
>>>>>> 2013/8/2 Sebastian Schelter <[email protected]>
>>>>>>
>>>>>>> I would also be fine with keeping it if there is demand. I just
>>>>>>> proposed to deprecate it and nobody voted against that at that point
>>>>>>> in time.
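(Interjecting a quick illustration of the flat-decay point above, since it matches what I'm seeing: a numpy sketch with two invented 50-value spectra, showing how much more variance the top-k values retain when the decay is strong versus nearly flat.)

```python
import numpy as np

def retained_variance(s, k):
    """Fraction of total variance captured by the top-k singular values."""
    s = np.asarray(s, dtype=float)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

# Two invented spectra: strong geometric decay vs. nearly flat
fast = 100.0 * 0.5 ** np.arange(50)
flat = 100.0 * 0.98 ** np.arange(50)

for name, s in (("fast decay", fast), ("flat decay", flat)):
    print(f"{name}: top-10 values retain "
          f"{retained_variance(s, 10):.1%} of variance")
```

With a flat spectrum the same k=10 captures far less variance, so a much bigger k is needed for a comparable projection, exactly as described above.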
> >> > >> > > >> > >> > --sebastian > >> > >> > > >> > >> > > >> > >> > On 02.08.2013 03:12, Dmitriy Lyubimov wrote: > >> > >> > > There's a part of Nathan Halko's dissertation referenced on > >> > algorithm > >> > >> > page > >> > >> > > running comparison. In particular, he was not able to compute > >> more > >> > >> than > >> > >> > 40 > >> > >> > > eigenvectors with Lanczos on wikipedia dataset. You may refer > to > >> > that > >> > >> > > study. > >> > >> > > > >> > >> > > On the accuracy part, it was not observed that it was a > problem, > >> > >> assuming > >> > >> > > high level of random noise is not the case, at least not in > >> LSA-like > >> > >> > > application used there. > >> > >> > > > >> > >> > > That said, i am all for diversity of tools, I would actually be > >> +0 > >> > on > >> > >> > > deprecating Lanczos, it is not like we are lacking support for > >> it. > >> > >> SSVD > >> > >> > > could use improvements too. > >> > >> > > > >> > >> > > > >> > >> > > On Thu, Aug 1, 2013 at 3:15 AM, Fernando Fernández < > >> > >> > > [email protected]> wrote: > >> > >> > > > >> > >> > >> Hi everyone, > >> > >> > >> > >> > >> > >> Sorry if I duplicate the question but I've been looking for an > >> > answer > >> > >> > and I > >> > >> > >> haven't found an explanation other than it's not being used > >> > (together > >> > >> > with > >> > >> > >> some other algorithms). If it's been discussed in depth before > >> > maybe > >> > >> you > >> > >> > >> can point me to some link with the discussion. > >> > >> > >> > >> > >> > >> I have successfully used Lanczos in several projects and it's > >> been > >> > a > >> > >> > >> surprise to me finding that the main reason (according to what > >> I've > >> > >> read > >> > >> > >> that might not be the full story) is that it's not being used. 
>>>>>>>>> At the beginning I supposed it was because SSVD is supposed to be
>>>>>>>>> much faster with similar results, but after making some tests I
>>>>>>>>> have found that running times are similar or even worse than
>>>>>>>>> Lanczos for some configurations (I have tried several combinations
>>>>>>>>> of parameters, given the child processes enough memory, etc., and
>>>>>>>>> had no success in getting SSVD to run in even 3/4 of the time
>>>>>>>>> Lanczos takes, though there might be some combinations of
>>>>>>>>> parameters I have still not tried). It seems to be quite tricky to
>>>>>>>>> find a good combination of parameters for SSVD, and I have also
>>>>>>>>> seen a precision loss in some examples that makes me not confident
>>>>>>>>> in migrating from Lanczos to SSVD from now on. (How far can I
>>>>>>>>> trust results from a combination of parameters that runs in
>>>>>>>>> significantly less time, or at least in a good time?)
>>>>>>>>>
>>>>>>>>> Can someone convince me that SSVD is actually a better option than
>>>>>>>>> Lanczos? (I'm totally willing to be convinced... :) )
>>>>>>>>>
>>>>>>>>> Thank you very much in advance.
>>>>>>>>>
>>>>>>>>> Fernando.
