On Fri, Aug 2, 2013 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
> On Fri, Aug 2, 2013 at 2:52 PM, Fernando Fernández <[email protected]> wrote:
>
>> I don't agree with k>10 being unlikely meaningful. I've used SVD in text
>> mining problems where k~150 yielded the best results (not only was it a
>> good choice based on plotting the eigenvalues and seeing the elbow in the
>> decay near 150, but checking results with different k's showed that around
>> 150 made much more sense). Currently I'm working on a recommender system
>> and already have Lanczos running with k~50 producing the best results,
>> again based on visual exploration of the eigenvalues and on exploring
>> results one by one and seeing they were more meaningful. Current tests
>> with SSVD are based on the latter, and when I say I'm not getting good
>> results I mean that Lanczos is working properly on the same problem (I've
>> explored eigenvalues up to 150 and have a good decay) and SSVD is not.
>> But as I said, this might be caused by some bug in the input process; it
>> seems too strange to me that the results are so different.
>
> Depends on how you define "so". But again, in that respect all I can point
> to, out of published work, is the accuracy study by N. Halko.

I guess I can save you digging through the Mahout wiki; here is the reference:
http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf.
Specifically, look at the eigenvalue chart comparison on page 179. This runs
Mahout's Lanczos and SSVD neck and neck. The order of accuracy for the first
40 values is given as "Order of accuracy is q = 3; q = 2; q = 1, lanczos,
q = 0" (see the source for details of the accuracy assessment). One thing I
did not understand there is why Lanczos showed such an uncharacteristic
fall-off for values between 40 and 60. I have always assumed -q=1 was showing
something much closer to reality after the first 40 values as well.

>> I'll get back to this discussion when I figure it out :).
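Fernando's elbow heuristic for choosing k can be sketched outside of Mahout. Below is a minimal numpy illustration, not anything from the thread: the synthetic 1/i spectrum and the 90% energy threshold are arbitrary choices made here for demonstration.

```python
import numpy as np

# Minimal sketch of the "look at the decay" heuristic: build a matrix
# with a decaying spectrum, then pick k as the smallest rank that
# retains a chosen fraction of the spectral energy (sum of s_i^2).
rng = np.random.default_rng(0)
m, n, r = 500, 200, 200
decay = 1.0 / np.arange(1, r + 1)                 # s_i ~ 1/i (illustrative)
A = (rng.standard_normal((m, r)) * decay) @ rng.standard_normal((r, n))

s = np.linalg.svd(A, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.90)) + 1        # 0.90 is an arbitrary cutoff
print(f"k = {k}, retained energy = {energy[k-1]:.3f}")
```

With a flat spectrum the same threshold forces a much larger k, which is exactly the failure mode Dmitriy describes further down.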
>> If you are curious about the numbers: 1MM rows by 150k columns for the
>> text mining case, and 18MM rows by 80k columns for the recommender.
>>
>> About p and q: I have been playing around with the MovieLens 100k dataset
>> and found that q>0 actually worsens results in terms of precision (nothing
>> severe, but it happens) and that it's better to increase p a little in
>> that particular case. So my guess is that it depends a lot on the dataset,
>> though I don't know how.
>
> This again sounds very strange. The algorithm is non-deterministic, which
> means the errors you get in one run will be different from the errors in
> another run, but honestly, you would be the first to report that power
> iterations worsen the expectation of the error. All theoretical work and
> practical estimates do not confirm that observation; in fact, quite a bit
> to the contrary.
>
>> 2013/8/2 Dmitriy Lyubimov <[email protected]>
>>
>>> The only time you would not get good results is if the spectrum does not
>>> have a good decay, which is equivalent to mostly the same variance in
>>> most of the original basis directions. This problem is similar to the
>>> problem that arises with PCA when you try to do dimensionality reduction
>>> while retaining a certain percentage of variance: in the case of flat
>>> spectrum decay, you'd need a much bigger k to retain the same amount of
>>> variance in the dimensionally reduced projection. In that sense the SSVD
>>> solution for a given k is as good as PCA gets for the same k. Also, I
>>> believe (but am not 100% sure) that "too small" problems exhibit higher
>>> errors due to the law of large numbers.
>>>
>>> On Fri, Aug 2, 2013 at 10:41 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>>> If you use k > 40 you are already beating Lanczos for larger datasets.
>>>> k>10 is unlikely meaningful. p need not be more than 15% of k (the
>>>> default is 15). Use q=1; q>1 does not yield tangible improvements in
>>>> the real world.
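For readers following along, the k, p and q under discussion are the target rank, oversampling and power-iteration counts of the randomized SVD prototype (Halko et al.) that SSVD implements on Hadoop. A plain numpy sketch of that algorithm family, with a hypothetical `rand_svd` name; this is an illustration, not Mahout's actual code:

```python
import numpy as np

def rand_svd(A, k, p=15, q=1, seed=0):
    """Sketch of the randomized truncated SVD prototype (Halko et al.).

    k = target rank, p = oversampling, q = power iterations.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Capture the dominant range of A with a Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + p)))
    # Power iterations sharpen a slowly decaying spectrum;
    # re-orthogonalize after each multiply for numerical stability.
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix B = Q^T A.
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The loop is what q controls: each power iteration effectively raises the singular-value ratios to a higher power (the scheme works on (A A^T)^q A, whose spectrum is s_i^(2q+1)), which is why theory predicts q>0 should improve, not worsen, expected accuracy.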
>>>> Again, see Nathan Halko's dissertation for the accuracy comparison.
>>>>
>>>> On Fri, Aug 2, 2013 at 4:17 AM, Fernando Fernández <[email protected]> wrote:
>>>>
>>>>> Keeping Lanczos would be nice. Like I said, it's currently being used
>>>>> in some projects with good results, and I think it's easier to tune,
>>>>> so it would be my first choice for future developments. I still need
>>>>> to test SSVD further, especially because in the current example I'm
>>>>> working on it yields very different results from Lanczos. We are
>>>>> investigating whether that is due to a bug when loading the data
>>>>> (though the dimensions of the output seem OK) or whether it's a matter
>>>>> of increasing the p or q parameters. If it's a matter of increasing p
>>>>> and q, I think the running times would make SSVD not viable. I hope to
>>>>> be able to provide some comparison figures in terms of precision and
>>>>> running time in a month or so.
>>>>>
>>>>> I hope that other users read this and say whether they are using
>>>>> Lanczos.
>>>>>
>>>>> Best,
>>>>> Fernando.
>>>>>
>>>>> 2013/8/2 Sebastian Schelter <[email protected]>
>>>>>
>>>>>> I would also be fine with keeping it if there is demand. I just
>>>>>> proposed to deprecate it, and nobody voted against that at that point
>>>>>> in time.
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>> On 02.08.2013 03:12, Dmitriy Lyubimov wrote:
>>>>>>
>>>>>>> There's a part of Nathan Halko's dissertation, referenced on the
>>>>>>> algorithm page, running this comparison. In particular, he was not
>>>>>>> able to compute more than 40 eigenvectors with Lanczos on the
>>>>>>> Wikipedia dataset. You may refer to that study.
>>>>>>> On the accuracy part, it was not observed to be a problem, assuming
>>>>>>> a high level of random noise is not the case, at least not in the
>>>>>>> LSA-like application used there.
>>>>>>>
>>>>>>> That said, I am all for diversity of tools. I would actually be +0
>>>>>>> on deprecating Lanczos; it is not like we are lacking support for
>>>>>>> it. SSVD could use improvements too.
>>>>>>>
>>>>>>> On Thu, Aug 1, 2013 at 3:15 AM, Fernando Fernández <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Sorry if I duplicate the question, but I've been looking for an
>>>>>>>> answer and I haven't found an explanation other than that it's not
>>>>>>>> being used (together with some other algorithms). If it's been
>>>>>>>> discussed in depth before, maybe you can point me to some link with
>>>>>>>> the discussion.
>>>>>>>>
>>>>>>>> I have successfully used Lanczos in several projects, and it's been
>>>>>>>> a surprise to me to find that the main reason (according to what
>>>>>>>> I've read, which might not be the full story) is that it's not
>>>>>>>> being used. At the beginning I supposed it was because SSVD is
>>>>>>>> supposed to be much faster with similar results, but after some
>>>>>>>> tests I have found that running times are similar to or even worse
>>>>>>>> than Lanczos for some configurations (I have tried several
>>>>>>>> combinations of parameters, given the child processes enough
>>>>>>>> memory, etc., and had no success in running SSVD in even 3/4 of the
>>>>>>>> time Lanczos takes, though there might be some combinations of
>>>>>>>> parameters I have still not tried).
>>>>>>>> It seems to be quite tricky to find a good combination of
>>>>>>>> parameters for SSVD, and I have also seen a precision loss in some
>>>>>>>> examples that makes me not confident about migrating from Lanczos
>>>>>>>> to SSVD from now on (how far can I trust results from a combination
>>>>>>>> of parameters that runs in significantly less time, or at least a
>>>>>>>> good time?).
>>>>>>>>
>>>>>>>> Can someone convince me that SSVD is actually a better option than
>>>>>>>> Lanczos? (I'm totally willing to be convinced... :))
>>>>>>>>
>>>>>>>> Thank you very much in advance.
>>>>>>>>
>>>>>>>> Fernando.
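For the comparison figures Fernando promises, one scale-free precision metric is the relative Frobenius reconstruction error of each solver's rank-k output, scored against the Eckart-Young optimum set by the discarded singular values. A numpy sketch; the `relative_error` helper and the synthetic Gaussian matrix are illustrative assumptions, not output from either solver:

```python
import numpy as np

def relative_error(A, U, s, Vt):
    """Relative Frobenius error of a rank-k factorization A ~ U diag(s) Vt."""
    return np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)

# By Eckart-Young, the best achievable rank-k error is determined by the
# discarded singular values, so any solver's output can be scored against
# that floor on a synthetic matrix with a known spectrum.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
floor = np.sqrt(np.sum(s[k:] ** 2)) / np.linalg.norm(A)
score = relative_error(A, U[:, :k], s[:k], Vt[:k])
print(f"floor = {floor:.4f}, exact truncation scores = {score:.4f}")
```

Reporting each run's `score / floor` ratio would make Lanczos and SSVD numbers comparable across datasets and across choices of k.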
