On Sun, Aug 8, 2010 at 12:35 PM, Grant Ingersoll <[email protected]> wrote:
> Just to make sure I'm understanding, the docs for "clean SVD" at
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> are not correct, right?
>
> In looking at the code, the SVD command requires --Dmapred.input.dir (soon
> to be --input like everything else, see MAHOUT-461), a --tempDir, and
> --Dmapred.output.dir (soon to be --output). Then, in the cleansvd command,
> the --eigenInput should actually refer to the output directory, not the
> tempDir as the docs suggest, right?

Yeah, Sean switched it over to -Dmapred.input.dir a while back, and the docs were never updated to reflect that. The cleansvd command has correct docs, doesn't it? "Where corpusInput is the corpus from the previous step, eigenInput is the output from the previous step, and you have a new output path for clean output."

> Also, any recommendations on setting maxError and minEigenValue? What are
> the tradeoffs I'm making there? I mean, I suppose maxError is some measure
> of convergence and minEigenValue is just as it sounds, but what are the
> practical implications of those settings? Are the values in the example
> good starting points?

minEigenvalue is totally user-dependent. The output from the svd command (which might be in the logs, now?) will include the estimated eigenvalues (prior to cleaning), and you can use these as a guide for your application; in fact, you can decide to cut off your decomposition by eigenvalue instead of by rank. It is always "safe" to pick 0 as the minEigenvalue.

maxError is trickier. Lanczos sometimes spits out a few basis vectors which are repeats of previous ones, and are thus garbage. When this happens, you get some eigenvectors which have very high error (close to 1.0) and should be thrown out. In general, if your error is significantly less than (1 - 1/sqrt(N)), where N is the column cardinality of the input matrix, you're significantly closer than random to being an eigenvector.
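To make that cut concrete, here's a toy Python sketch (not Mahout's actual code; the eigenvalue/error pairs are made up for illustration) of the kind of filter cleansvd applies: drop any candidate whose error exceeds maxError or whose eigenvalue falls below minEigenvalue, with 1 - 1/sqrt(N) as the "no better than random" baseline to compare errors against.

```python
import math

def clean_candidates(candidates, max_error, min_eigenvalue):
    """Keep (eigenvalue, error) pairs that pass both cuts,
    analogous to cleansvd's filtering step."""
    return [(ev, err) for (ev, err) in candidates
            if err <= max_error and ev >= min_eigenvalue]

# Hypothetical Lanczos output for a matrix with N = 10,000 columns.
N = 10_000
random_baseline = 1.0 - 1.0 / math.sqrt(N)  # ~0.99: error of a random vector

candidates = [
    (152.3, 1e-12),  # large eigenvalue, very well converged
    (47.8,  0.02),   # middling eigenvalue, still fine
    (47.8,  0.97),   # repeated basis vector: error near 1.0, garbage
    (0.4,   1e-10),  # tiny eigenvalue, but well converged
]

# maxError = 0.9 throws out the near-duplicate; minEigenvalue = 0 keeps the rest.
kept = clean_candidates(candidates, max_error=0.9, min_eigenvalue=0.0)
```

Note that all three survivors have errors far below the ~0.99 random baseline, which is the sanity check the (1 - 1/sqrt(N)) rule of thumb gives you.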
What I tend to do is run it once with maxError = 0.9, minEigenvalue = 0, to get everybody who could possibly be a good eigenvector (subject to the rank constraints), look at the distribution of errors and eigenvalues, and then make a further cut based on that. Usually some of the very small and very large eigenvectors will have converged way better, and have error in the 1e-10 or smaller range. Some of those in the middle will drift toward 0.01 or a little worse, but unless you're only asking for, say, rank = 3 (lower rank means fewer iterations, and everything is less converged), even the one exactly in the middle shouldn't have an error greater than 0.1 or so.

It really depends on your data set, though. Those options are there for flexibility, in case you know what to cut based on.

  -jake

> Thanks,
> Grant
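That two-pass approach can be sketched as follows. This is again a hypothetical Python illustration, not Mahout's API, and the "cut at 10x the median error" rule in the second pass is just one example of a data-driven tightening you might choose after eyeballing the distribution; the numbers are made up.

```python
def loose_pass(candidates):
    """Pass 1: maxError = 0.9, minEigenvalue = 0 keeps anything plausible,
    dropping only the repeated-basis-vector garbage (error near 1.0)."""
    return [(ev, err) for (ev, err) in candidates if err <= 0.9 and ev >= 0.0]

def tighten(survivors, factor=10.0):
    """Pass 2: having looked at the error distribution, make a further cut,
    here at factor x the median error of the survivors."""
    errors = sorted(err for _, err in survivors)
    median = errors[len(errors) // 2]
    return [(ev, err) for (ev, err) in survivors if err <= factor * median]

# Hypothetical (eigenvalue, error) pairs from a loose first run.
candidates = [
    (300.0, 1e-12),  # extreme eigenvalue, converged way better
    (120.0, 0.002),
    (55.0,  0.004),
    (20.0,  0.005),
    (9.0,   0.05),   # converged noticeably worse than its neighbors
    (54.9,  0.95),   # duplicate basis vector, killed in pass 1
]

survivors = loose_pass(candidates)  # drops the err = 0.95 duplicate
final = tighten(survivors)          # drops the err = 0.05 straggler too
```

The point of the second pass is exactly what's described above: the first run is deliberately permissive so you can see the whole distribution before deciding where the real cut belongs.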
