Hey Erik,

The order of the output is fairly arbitrary: Lanczos finds both the very largest and the very smallest eigenvalues at the same time, and the order in which you store them is pretty much up to you. Do you want the biggest 20 eigenvalues/vectors (useful for making low-rank approximations), or the smallest ones (useful for cutting a graph into almost disconnected clusters)?
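Since the storage order is only a convention, ascending output can be flipped after the fact. Here is a minimal sketch in plain Python, assuming the eigenvalues and their vectors are already loaded as two parallel lists. The helper name `sort_eigenpairs` and the sample values are made up for illustration and are not part of Mahout.

```python
def sort_eigenpairs(eigenvalues, eigenvectors, descending=True):
    """Reorder eigenvalues and their matching eigenvectors by eigenvalue.

    Hypothetical helper: both inputs are parallel lists of equal length.
    """
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: eigenvalues[i],
                   reverse=descending)
    return ([eigenvalues[i] for i in order],
            [eigenvectors[i] for i in order])


# Illustrative values only: ascending order, as printed by the driver.
vals = [0.0, 0.02, 0.4, 1.0]
vecs = ["v0", "v1", "v2", "v3"]  # stand-ins for the actual vectors

top_vals, top_vecs = sort_eigenpairs(vals, vecs)
```

Passing `descending=False` keeps the smallest eigenvalues first, which is the order you would want for the graph-cutting case.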
You have to be careful in choosing via looking for an elbow - if it's near the middle (desiredRank / 2) you could be missing a bunch of good eigenvalues right above the elbow. In your case, it's looking like you've really got closer to 20 or so good large eigenvalues, and 20/200 is small enough that you most likely have all 20 of the largest 20, so cutting there seems pretty reasonable to me. If your elbow was in the middle, you would have to just bump up desiredRank, then throw away more.

For the cleansvd job, I'd run it first with really lax requirements (minEigenvalue 0, maxError 0.5), and see what the errors look like. Some of the eigenvectors you've gotten are really errors, and will fail the maxError test with errors of 0.99 or higher, and will get discarded. If you know (based on your first run) what eigenvalue is at the elbow, you can just set the minEigenvalue to be this, and you'll cut off everything below that, too.

I *think* that cleansvd spits out the final eigen-pairs in the descending order you want, but try and see.

Let me know if that works out for you!

-jake

On Sat, Jul 10, 2010 at 12:34 AM, Erik Frey <[email protected]> wrote:

> Hi all,
>
> I recently ran mahout's svd on a large text corpus following the helpful
> example written here:
> https://cwiki.apache.org/MAHOUT/dimensionalreduction.html
>
> Just a few questions about how I should best interpret the output:
>
> * I chose to calculate 200 singular vectors - as the driver was finishing
> up it printed out the eigenvalues and I was surprised to see them in
> ascending order. The first singular vector had an eigenvalue of zero, there
> was an elbow at ~dimension 180, and a sharp incline towards an eigenvalue of
> 1.0 at dimension 199. I was expecting these to be in declining order. Did
> I do something wrong?
>
> * Usually when choosing the number of dimensions I'd chop off at the elbow,
> but cleansvd seems to have a number of more specific options. Assuming my
> first run has gone correctly, are there rules of thumb I should follow for
> picking the min eigenvalue and max error?
>
> Thanks,
>
> Erik
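
P.S. The two-pass approach to minEigenvalue and maxError described in the reply can be sketched in plain Python. This is only an illustration of the filtering idea, not Mahout's implementation: the function `clean_eigenpairs`, the (eigenvalue, error) tuples, the inclusive comparisons, and the sample numbers are all assumptions, and the errors are taken as already computed.

```python
def clean_eigenpairs(pairs, min_eigenvalue=0.0, max_error=0.5):
    """Keep (eigenvalue, error) pairs that pass both thresholds.

    Hypothetical stand-in for the filtering step: a pair survives if its
    eigenvalue is at least min_eigenvalue and its error is at most
    max_error. Survivors are returned largest eigenvalue first.
    """
    kept = [p for p in pairs
            if p[0] >= min_eigenvalue and p[1] <= max_error]
    return sorted(kept, key=lambda p: p[0], reverse=True)


# Made-up (eigenvalue, error) pairs, for illustration only.
pairs = [(0.0, 0.99), (0.05, 0.3), (0.6, 0.01), (1.0, 0.001)]

# First pass: lax thresholds, just to see what the errors look like.
lax = clean_eigenpairs(pairs, min_eigenvalue=0.0, max_error=0.5)

# Second pass: cut at the eigenvalue seen at the elbow (0.5 here).
strict = clean_eigenpairs(pairs, min_eigenvalue=0.5, max_error=0.5)
```

The lax pass drops only the pair with error 0.99; the strict pass additionally drops everything below the chosen elbow value.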
