As a side note, there are also a few in-memory SVD implementations.  There
is a SingularValueDecomposition class which uses "pre-Mahout" data structures.
There are also a few Factorizer classes which appear to compute an SVD but
supply only the left and right matrices, with no singular values.

What are the minimum matrix sizes at which these algorithms are expected to
"work"? Are they intended to be canonical implementations that are correct
everywhere from "2x2" up to the point of "out of memory" or "numerical
instability"?
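
For what it's worth, here is a minimal sketch of the kind of crosscheck I
mean, in plain Python with nothing Mahout-specific in it. The helper names
(reconstruct, close, same_up_to_sign) and the 2x2 matrix are made up for
illustration; its SVD was worked out by hand. It checks two things: that
U * diag(s) * V^T gives back the input, and that two sets of singular
vectors agree once you allow each column to be negated, which is the
"opposing signs" effect mentioned in the quoted message below.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def reconstruct(u, s, v):
    """Return U * diag(s) * V^T, where the columns of u and v are the
    left and right singular vectors and s lists the singular values."""
    us = [[u_ij * s[j] for j, u_ij in enumerate(row)] for row in u]
    return matmul(us, transpose(v))

def close(a, b, tol=1e-9):
    """True if two equally sized matrices agree entry by entry within tol."""
    return all(abs(x - y) <= tol
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def same_up_to_sign(m1, m2, tol=1e-9):
    """True if each column of m2 equals the matching column of m1,
    or its negation, within tol."""
    for c1, c2 in zip(zip(*m1), zip(*m2)):
        # Pick the sign that makes the two columns point the same way.
        sign = 1.0 if sum(x * y for x, y in zip(c1, c2)) >= 0 else -1.0
        if any(abs(x - sign * y) > tol for x, y in zip(c1, c2)):
            return False
    return True

# Illustrative 2x2 input and its hand-derived SVD (an assumption of this
# sketch, not output from Mahout, R or Octave).
A = [[4.0, 0.0],
     [3.0, -5.0]]
r5, r2 = math.sqrt(5.0), math.sqrt(2.0)
U = [[1 / r5, 2 / r5],
     [2 / r5, -1 / r5]]
S = [math.sqrt(40.0), math.sqrt(10.0)]
V = [[1 / r2, 1 / r2],
     [-1 / r2, 1 / r2]]

# A second, equally valid SVD: negate the first column of both U and V.
U2 = [[-row[0], row[1]] for row in U]
V2 = [[-row[0], row[1]] for row in V]

print(close(reconstruct(U, S, V), A))
print(close(reconstruct(U2, S, V2), A))
print(close(U, U2), same_up_to_sign(U, U2))
```

The point of comparing this way is that an entry-by-entry comparison of the
vectors from two packages can fail even when both answers are correct, so
the reconstruction check and the sign-tolerant comparison are the ones worth
trusting.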

Lance

On Fri, Sep 23, 2011 at 6:34 PM, Dan Brickley <dan...@danbri.org> wrote:

> On 23 September 2011 16:03, Lance Norskog <goks...@gmail.com> wrote:
> > Markus-
> >
> > Probably the best approach is to crosscheck your results with live data of
> > various sizes with the R statistical system.  (You will often get results
> > with opposing signs.)
>
> So, that's exactly where I was, with Ruby and Matlab(<cheapskate>GNU
> Octave</cheapskate>) taking the place of R there.
>
> It didn't help me that my grasp of the relevant linear algebra was
> somewhat journalistic, for sure.  But precisely because it was shaky,
> I thought "right, let's stay sane since I'm not an expert either in
> the maths, or in hadoop, or in mahout, so ... I'll take a simple tiny
> testcase example, make sure I can run it in Octave and Ruby, ... and
> use that to build out my understanding of Mahout's SVD".
>
> That turned out to be a disappointing learning experience, for reasons
> recently summarised here.  I was using a tiny example taken from
> http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
> because I thought that was a nice way of re-using a helpful writeup as
> Mahout documentation. Bad idea due to dataset size.
>
> Looking again at
> https://cwiki.apache.org/MAHOUT/dimensional-reduction.html I see that
> there is in fact a good sample dataset now; the mailing list stuff.
> Maybe I'd missed it at the time. It deserves more attention, as a
> common hub for documentation, user education, and for comparison
> testing and sanity-checking against non-Mahout environments like R
> etc. (Perhaps the EC2 aspect is an issue for non-Amazon users?). I'm
> not sure if "Overall, there are 6,094,444 key-value pairs in 283
> files taking around 5.7GB of disk." makes it too big for many
> non-Mahout environments. But the sooner there's a single dataset
> people use to get started experimenting with Mahout SVD, the sooner
> we'll avoid everyone revisiting the "I don't understand what Lanczos
> has done..." thread.
>
> Should there be a FAQ on the Lanczos page?
>
> Q: Will this work with a test matrix of e.g. 5x8 size?
> A: No, ... it needs to be substantially bigger,...
>
> Q: How much bigger?
> A: <... somebody write something here ... >
>
> cheers,
>
> Dan
>



-- 
Lance Norskog
goks...@gmail.com
