I'm trying to do dimensionality reduction with SSVD and then run the reduced doc matrix through kmeans.
The Lanczos + ClusterDump test of SVD + kmeans uses A-hat = A^t V^t. Unfortunately this results in anonymous vectors in clusteredPoints after A-hat is run through kmeans. The doc ids are lost due to the transpose I assume?

In any case Dmitriy pointed out that this might have been done because Lanczos does not produce U. So I need to do US^-1? This would avoid the transpose and should preserve doc/row ids for kmeans? And doing the PCA in SSVD will weight things properly so I don't need the --halfSigma?

Please correct me if I'm wrong.

On Sep 5, 2012, at 4:59 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yes i have an option to output U * Sigma^0.5 already. But strictly speaking the way PCA space is defined would require just U*Sigma output. Or it is not worth it?

On Wed, Sep 5, 2012 at 4:56 PM, Ted Dunning <[email protected]> wrote:

> Yes. (A-M)V is U \Sigma. You may actually want something like U \sqrt
> \Sigma instead, though.
>
>
> On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Hello,
>>
>> I have a question w.r.t what to advise people in the SSVD manual for PCA.
>>
>> So we have
>>
>> (A-M) \approx U \Sigma V^t
>>
>> and strictly speaking since svd is reduced rank, we need to re-project
>> original data points as
>>
>> Y= (A-M)V
>>
>> However we can assume (A-M)V \approx U \Sigma, can't we? I.e. instead of
>> recomputing tough job of (A-M)V we can just advise to use U\Sigma or just U
>> in some cases, can't we?
>>
>> Thanks.
>> -d
>>
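
For reference, here is a small NumPy sketch of the identity discussed in the thread, (A-M)V = U \Sigma. It is not Mahout code: it assumes an exact in-memory SVD, where the identity holds exactly, whereas SSVD is approximate and would only match to within its error. The matrix sizes, the rank k, and all variable names are made up for illustration. Note that row i of each output corresponds to row i of A, so no transpose is involved and row order is kept.

```python
import numpy as np

# Toy data: 6 documents (rows) x 4 terms (columns). Sizes are arbitrary.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# M holds the column means, so A - M is the mean-centered matrix used for PCA.
M = A.mean(axis=0, keepdims=True)

# Exact thin SVD: A - M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A - M, full_matrices=False)

# Keep the top k singular triplets (reduced rank).
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

# Re-projection of the original points: Y = (A - M) V
Y = (A - M) @ V_k

# The cheaper alternative: U * Sigma (scale each column of U by its singular value).
US = U_k * s_k

# The "half sigma" variant: U * sqrt(Sigma)
US_half = U_k * np.sqrt(s_k)

# With an exact SVD the two routes to the PCA coordinates agree.
print(np.allclose(Y, US))
```

Either `US` or `US_half` could then be fed to a clustering step row by row; which scaling works better for kmeans is the open question in the thread, and this sketch does not settle it.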
