Hi Jake,

This worked. cleansvd removed a handful of vectors and the output was descending by weight. So far, everything looks good!
Thanks,
Erik

On Jul 9, 2010, at 3:58 PM, Jake Mannix wrote:

> Hey Erik,
>
> The order of output is fairly arbitrary: Lanczos gets both the very
> largest and very smallest eigenvalues at the same time, and the order you
> decide to store those is pretty much up to you (do you want the biggest 20
> eigenvalues/vectors - useful for making low-rank approximations, or do you
> want the smallest eigens - useful for cutting a graph into almost
> disconnected clusters).
>
> You have to be careful in choosing via looking for an elbow - if it's near
> the middle (desiredRank / 2) you could be missing a bunch of good
> eigenvalues right above the elbow. In your case, it's looking like you've
> really got closer to 20 or so good large eigenvalues, and 20/200 is small
> enough that you most likely have all 20 of the largest 20, so cutting there
> seems pretty reasonable to me. If your elbow was in the middle, you would
> have to just bump up desiredRank, then throw away more.
>
> For the cleansvd job, I'd run it first with really lax requirements
> (minEigenvalue 0, maxError 0.5), and see what the errors look like. Some of
> the eigenvectors you've gotten are really errors, and will fail the maxError
> test with errors of 0.99 or higher, and will get discarded. If you know
> (based on your first run) what eigenvalue is at the elbow, you can just set
> the minEigenvalue to be this, and you'll cut off everything below that, too.
> I *think* that cleansvd spits out the final eigen-pairs in the descending
> order you want, but try and see.
>
> Let me know if that works out for you!
>
> -jake
>
> On Sat, Jul 10, 2010 at 12:34 AM, Erik Frey wrote:
>
>> Hi all,
>>
>> I recently ran mahout's svd on a large text corpus following the helpful
>> example written here:
>> https://cwiki.apache.org/MAHOUT/dimensionalreduction.html
>>
>> Just a few questions about how I should best interpret the output:
>>
>> * I chose to calculate 200 singular vectors - as the driver was finishing
>> up it printed out the eigenvalues and I was surprised to see them in
>> ascending order. The first singular vector had an eigenvalue of zero, there
>> was an elbow at ~dimension 180, and a sharp incline towards an eigenvalue of
>> 1.0 at dimension 199. I was expecting these to be in declining order. Did
>> I do something wrong?
>>
>> * Usually when choosing the number of dimensions I'd chop off at the elbow,
>> but cleansvd seems to have a number of more specific options. Assuming my
>> first run has gone correctly, are there rules of thumb I should follow for
>> picking the min eigenvalue and max error?
>>
>> Thanks,
>>
>> Erik
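
For anyone reading this thread later, the filtering Jake describes (discard candidate eigenvectors whose error exceeds maxError, drop eigenvalues below minEigenvalue, return the rest in descending order) can be sketched in a few lines of plain Python. This is only an illustration and not Mahout's code: the function names `verify` and `clean` are hypothetical, and the error measure used here (one minus the absolute cosine of the angle between v and A*v) is an assumption about the kind of check such a job might perform.

```python
import math

def matvec(a, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def verify(a, v):
    """Return (estimated eigenvalue, error) for a candidate eigenvector v
    of a symmetric matrix a. Assumed error measure: 1 - |cos(v, A v)|."""
    av = matvec(a, v)
    norm_v = math.sqrt(dot(v, v))
    norm_av = math.sqrt(dot(av, av))
    if norm_v == 0.0:
        return 0.0, 1.0   # a zero vector is never a valid eigenvector
    if norm_av == 0.0:
        return 0.0, 0.0   # A v = 0: exact eigenvector with eigenvalue 0
    cos = dot(av, v) / (norm_av * norm_v)
    return norm_av / norm_v, 1.0 - abs(cos)

def clean(a, candidates, min_eigenvalue=0.0, max_error=0.5):
    """Keep candidates that pass both thresholds, largest eigenvalue first."""
    kept = []
    for v in candidates:
        eigenvalue, error = verify(a, v)
        if error <= max_error and eigenvalue >= min_eigenvalue:
            kept.append((eigenvalue, error, v))
    kept.sort(key=lambda t: t[0], reverse=True)
    return kept

# Toy symmetric matrix; [1, 1, 0] is deliberately not an eigenvector.
a = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.1]]
candidates = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

# First pass with lax thresholds, just to look at the errors.
for eigenvalue, error, v in clean(a, candidates):
    print(eigenvalue, error, v)

# Second pass with tighter thresholds chosen after inspecting the first.
for eigenvalue, error, v in clean(a, candidates, min_eigenvalue=0.5, max_error=0.01):
    print(eigenvalue, error, v)
```

The two passes mirror the suggested workflow: run once with lax settings to see the error distribution, then rerun with a minimum eigenvalue set at the elbow.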
