On Sun, Aug 8, 2010 at 12:35 PM, Grant Ingersoll <[email protected]> wrote:
> Just to make sure I'm understanding, the docs for "clean SVD" at
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> are not correct, right?
>
> In looking at the code, the SVD command requires --Dmapred.input.dir (soon
> to be --input like everything else, see MAHOUT-461), a --tempDir, and
> --Dmapred.output.dir (soon to be --output). Then, in the cleansvd command,
> the --eigenInput should actually refer to the output directory, not the
> tempDir as the docs suggest, right?

Yeah, Sean switched it over to -Dmapred.input.dir a while back, and the docs were never updated to reflect that. The cleansvd command has correct docs, doesn't it? "Where corpusInput is the corpus from the previous step, eigenInput is the output from the previous step, and you have a new output path for clean output."

> Also, any recommendations on setting maxError and minEigenValue? What are
> the tradeoffs I'm making there? I mean, I suppose maxError is some measure
> of convergence and minEigenValue is just as it sounds, but what are the
> practical implications of those settings? Are the values in the example
> good starting points?

minEigenvalue is totally user-dependent. The output from the svd command (which might be in the logs, now?) will include the estimated eigenvalues (prior to cleaning), and you can use these as a guide for your application; in fact, you can decide to cut off your decomposition by eigenvalue instead of by rank. It is always "safe" to pick 0 as the minEigenvalue.

maxError is trickier. Lanczos sometimes spits out a few basis vectors which are repeats of previous ones, and are thus garbage. When this happens, you get some eigenvectors which have very high error (close to 1.0) and should be thrown out. In general, if your error is significantly less than (1 - 1/sqrt(N)), where N is the column cardinality of the input matrix, you're significantly closer than random to being an eigenvector.
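To make that cut concrete, here's a toy Python sketch (not Mahout's actual code; the eigenvalue/error pairs are made up for illustration) of the kind of filter cleansvd applies: drop any candidate whose error exceeds maxError or whose eigenvalue falls below minEigenvalue, with 1 - 1/sqrt(N) as the "no better than random" baseline to compare errors against.

```python
import math

def clean_candidates(candidates, max_error, min_eigenvalue):
    """Keep (eigenvalue, error) pairs that pass both cuts,
    analogous to cleansvd's filtering step."""
    return [(ev, err) for (ev, err) in candidates
            if err <= max_error and ev >= min_eigenvalue]

# Hypothetical Lanczos output for a matrix with N = 10,000 columns.
N = 10_000
random_baseline = 1.0 - 1.0 / math.sqrt(N)  # ~0.99: error of a random vector

candidates = [
    (152.3, 1e-12),  # large eigenvalue, very well converged
    (47.8,  0.02),   # middling eigenvalue, still fine
    (47.8,  0.97),   # repeated basis vector: error near 1.0, garbage
    (0.4,   1e-10),  # tiny eigenvalue, but well converged
]

# maxError = 0.9 throws out the near-duplicate; minEigenvalue = 0 keeps the rest.
kept = clean_candidates(candidates, max_error=0.9, min_eigenvalue=0.0)
```

Note that all three survivors have errors far below the ~0.99 random baseline, which is the sanity check the (1 - 1/sqrt(N)) rule of thumb gives you.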
What I tend to do is run it once with maxError = 0.9, minEigenvalue = 0, to get everybody who could possibly be a good eigenvector (subject to the rank constraints), look at the distribution of errors and eigenvalues, and then make a further cut based on that. Usually some of the very small and very large eigenvectors will have converged way better, and have error in the 1e-10 or smaller range. Some of those in the middle will drift toward 0.01 or a little worse, but unless you're only asking for, say, rank = 3 (lower rank means fewer iterations, and everything is less converged), even the one exactly in the middle shouldn't have an error greater than 0.1 or so.

It really depends on your data set, though. Those options are there for flexibility, in case you know what to cut based on.

  -jake

> Thanks,
> Grant
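That two-pass approach can be sketched as follows. This is again a hypothetical Python illustration, not Mahout's API, and the "cut at 10x the median error" rule in the second pass is just one example of a data-driven tightening you might choose after eyeballing the distribution; the numbers are made up.

```python
def loose_pass(candidates):
    """Pass 1: maxError = 0.9, minEigenvalue = 0 keeps anything plausible,
    dropping only the repeated-basis-vector garbage (error near 1.0)."""
    return [(ev, err) for (ev, err) in candidates if err <= 0.9 and ev >= 0.0]

def tighten(survivors, factor=10.0):
    """Pass 2: having looked at the error distribution, make a further cut,
    here at factor x the median error of the survivors."""
    errors = sorted(err for _, err in survivors)
    median = errors[len(errors) // 2]
    return [(ev, err) for (ev, err) in survivors if err <= factor * median]

# Hypothetical (eigenvalue, error) pairs from a loose first run.
candidates = [
    (300.0, 1e-12),  # extreme eigenvalue, converged way better
    (120.0, 0.002),
    (55.0,  0.004),
    (20.0,  0.005),
    (9.0,   0.05),   # converged noticeably worse than its neighbors
    (54.9,  0.95),   # duplicate basis vector, killed in pass 1
]

survivors = loose_pass(candidates)  # drops the err = 0.95 duplicate
final = tighten(survivors)          # drops the err = 0.05 straggler too
```

The point of the second pass is exactly what's described above: the first run is deliberately permissive so you can see the whole distribution before deciding where the real cut belongs.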
