Also, there are a few single-machine random projection implementations about to be committed. These will provide a middle ground for scalability.
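For intuition, here is a rough sketch of the general randomized-projection idea these are based on (plain NumPy, not the code being committed; the function name is illustrative, k is the requested rank and p the oversampling parameter):

# Rough sketch of the randomized-projection SVD idea (Halko/Martinsson/Tropp
# style). Not the Mahout code; names and structure here are illustrative only.
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Approximate rank-k SVD of A, with oversampling parameter p."""
    rng = np.random.RandomState(seed)
    m, n = A.shape
    # Sample the column space of A through a random (k + p)-dimensional projection.
    omega = rng.randn(n, k + p)
    Y = A.dot(omega)                        # m x (k + p)
    Q, _ = np.linalg.qr(Y)                  # orthonormal basis for the sample
    # Solve the small SVD in the reduced space, then map U back up.
    B = Q.T.dot(A)                          # (k + p) x n
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q.dot(U_small)
    return U[:, :k], s[:k], Vt[:k, :]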
As Dmitriy pointed out, with p = 0 and k = full rank these should work on any size matrix. That isn't very interesting of course, since in that case they devolve into doing a full-size in-memory SVD. When computing less than a full SVD, the approximation gets much better at higher dimensionality.

On Sat, Sep 24, 2011 at 2:51 PM, Lance Norskog <goks...@gmail.com> wrote:

> As a side note, there are also a few in-memory SVD implementations. There
> is a SingularValueDecomposition which uses "pre-Mahout" data structures.
> There are also a few Factorizer classes which are apparently SVD, but they
> only supply the right and left matrices and no singular values.
>
> What are the minimum sizes expected to "work" in these algorithms? Are they
> intended to be canonical implementations that are correct from "2x2" to
> "out of memory" or "numerical instability"?
>
> Lance
>
> On Fri, Sep 23, 2011 at 6:34 PM, Dan Brickley <dan...@danbri.org> wrote:
>
> > On 23 September 2011 16:03, Lance Norskog <goks...@gmail.com> wrote:
> > > Markus-
> > >
> > > Probably the best approach is to crosscheck your results, with live
> > > data of various sizes, against the R statistical system. (You will
> > > often get results with opposing signs.)
> >
> > So, that's exactly where I was, with Ruby and Matlab (<cheapskate>GNU
> > Octave</cheapskate>) taking the place of R there.
> >
> > It didn't help me that my grasp of the relevant linear algebra was
> > somewhat journalistic, for sure. But precisely because it was shaky,
> > I thought "right, let's stay sane since I'm not an expert either in
> > the maths, or in hadoop, or in mahout, so ... I'll take a simple tiny
> > testcase example, make sure I can run it in Octave and Ruby, ... and
> > use that to build out my understanding of Mahout's SVD".
> >
> > That turned out to be a disappointing learning experience, for reasons
> > recently summarised here. I was using a tiny example taken from
> > http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
> > because I thought that was a nice way of re-using a helpful writeup as
> > Mahout documentation. Bad idea, due to dataset size.
> >
> > Looking again at
> > https://cwiki.apache.org/MAHOUT/dimensional-reduction.html I see that
> > there is in fact a good sample dataset now: the mailing list stuff.
> > Maybe I'd missed it at the time. It deserves more attention, as a
> > common hub for documentation, user education, and for comparison
> > testing and sanity-checking against non-Mahout environments like R
> > etc. (Perhaps the EC2 aspect is an issue for non-Amazon users?) I'm
> > not sure whether "Overall, there are 6,094,444 key-value pairs in 283
> > files taking around 5.7GB of disk." makes it too big for many
> > non-Mahout environments. But the sooner there's a single dataset
> > people use to get started experimenting with Mahout SVD, the sooner
> > we'll avoid everyone revisiting the "I don't understand what Lanczos
> > has done..." thread.
> >
> > Should there be a FAQ on the Lanczos page?
> >
> > Q: Will this work with a test matrix of e.g. 5x8 size?
> > A: No, ... it needs to be substantially bigger, ...
> >
> > Q: How much bigger?
> > A: <... somebody write something here ...>
> >
> > cheers,
> >
> > Dan
>
>
> --
> Lance Norskog
> goks...@gmail.com
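P.S. On the "opposing signs" caveat in the quoted thread: that is expected, since singular vectors are only determined up to sign. A tiny sketch of why two correct implementations can disagree (plain NumPy here as a stand-in for R/Octave; the matrix values are just arbitrary example data, not from any Mahout test):

# Tiny illustration of the "opposing signs" caveat when cross-checking SVD
# output between tools (e.g. Mahout vs. R/Octave/NumPy).
import numpy as np

A = np.array([[5., 5., 0., 5.],
              [5., 0., 3., 4.],
              [3., 4., 0., 3.],
              [0., 0., 5., 3.],
              [5., 4., 4., 5.],
              [5., 4., 5., 5.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the sign of one left singular vector together with the matching right
# singular vector: the result is an equally valid decomposition of A.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1
Vt2[0, :] *= -1
assert np.allclose(U.dot(np.diag(s)).dot(Vt), A)
assert np.allclose(U2.dot(np.diag(s)).dot(Vt2), A)

So when comparing Mahout output against R or Octave, compare the singular values and the reconstruction, not the raw signs of the singular vectors.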