If one wanted the *smallest* eigenvectors in a (potentially rather
large but sparse) matrix, what are the options in Mahout currently?
Lanczos? ssvd?

Specifically the eigenvector corresponding to the second-smallest
eigenvalue... (there's always a boring zero one) of laplacian
re-representation of affinity matrix, in my case.

Reading around,
http://www.sfu.ca/personal/archives/richards/Pages/NAS.AJS-WDR.pdf

"While the Laplacian spectrum has many attractive theoretical
properties which make it very useful for partitioning large sparse
arrays, it has at least one drawback: the Fiedler eigenvector is
associated with the second smallest eigenvalue. The current method of
choice for finding a few eigenvectors of very large sparse matrices is
the Lanczos method, which tends not to find small eigenvectors as
quickly as it finds the largest eigenvectors (Pothen, et al, 1990). "

That paper also suggests a workaround for this, "The definition of L
shows that there is no loss of sparsity (except for the diagonal) and
that the sparse methods mentioned earlier can be applied to find all
or some of the eigenpairs. The requirement that we must find the
smallest eigenpairs is easily overcome by subtracting a suitably large
constant from the diagonal of -L (which subtracts that constant from
the eigenvalues without  hanging the eigenvectors).This guarantees
that the first eigenpairs returned by the Power Method or Lanczos
iteration are associated with the smallest eigenvalues of L."

There's a thesis at http://am.vsb.cz/theses/kabelikova_ing.pdf that
seems to try this (or a related) method, but I didn't dig into the
details.


If I understand the literature correctly (I've been poking around
spectral graph theory e.g. see
http://www.cs.purdue.edu/homes/dgleich/demos/matlab/spectral/spectral.html
for a fairly accessible matlab example), this second-smallest (aka
fiedler) eigenvector can be very interesting for finding structure in
graph datasets, even if intuitively we're drawn more towards the
larger eigenvectors for explaining bulk variation in the data. This is
all closely related to Mahout's spectral clustering component, but
from talking with Shannon I don't believe we currently do anything to
find the smallest eigenvectors there. Are Lanczos and SSVD Mahout's
two natural starting points for this kind of thing, or am I forgetting
some other machinery?

Thanks for any pointers or fixes to my understanding,

Dan

Reply via email to