OK, thanks. The SSVD JUnit test with U*Sigma completes fine.
On Sep 5, 2012, at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:

Pat,

No, with SSVD you need just US, not US^-1 (or U*Sigma in other notation).
This is your dimensionally reduced output of your original document matrix
you've run with --pca option.

As Ted suggests, you may also use US^0.5 which is already produced by
providing --uHalfSigma (or its embedded setter analog). The keys of that
output (produced by getUPath() call) will already contain your Text document
ids as sequence file keys.

-d

On Wed, Sep 5, 2012 at 5:20 PM, Pat Ferrel <[email protected]> wrote:
> Trying to do dimensionality reduction with SSVD then running the new doc
> matrix through kmeans.
>
> The Lanczos + ClusterDump test of SVD + kmeans uses A-hat = A^t V^t.
> Unfortunately this results in anonymous vectors in clusteredPoints after
> A-hat is run through kmeans. The doc ids are lost due to the transpose I
> assume?
>
> In any case Dmitriy pointed out that this might have been done because
> Lanczos does not produce U.
>
> So I need to do US^-1? This would avoid the transpose and should preserve
> doc/row ids for kmeans? And doing the PCA in SSVD will weight things properly
> so I don't need the --halfSigma?
>
> Please correct me if I'm wrong.
>
>
> On Sep 5, 2012, at 4:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Yes i have an option to output U * Sigma^0.5 already.
>
> But strictly speaking the way PCA space is defined would require just
> U*Sigma output. Or it is not worth it?
>
>
> On Wed, Sep 5, 2012 at 4:56 PM, Ted Dunning <[email protected]> wrote:
>> Yes. (A-M)V is U \Sigma. You may actually want something like U \sqrt
>> \Sigma instead, though.
>>
>>
>> On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have a question w.r.t what to advise people in the SSVD manual for PCA.
>>>
>>> So we have
>>>
>>> (A-M) \approx U \Sigma V^t
>>>
>>> and strictly speaking since svd is reduced rank, we need to re-project
>>> original data points as
>>>
>>> Y= (A-M)V
>>>
>>> However we can assume (A-M)V \approx U \Sigma, can't we? I.e. instead of
>>> recomputing tough job of (A-M)V we can just advise to use U\Sigma or just U
>>> in some cases, can't we?
>>>
>>> Thanks.
>>> -d
>>>
>
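The identity discussed in the thread, (A-M)V = U \Sigma, can be checked numerically on a small dense matrix. The following is only a sketch under stated assumptions: it uses NumPy's exact SVD rather than Mahout's SSVD (for which the relation holds only approximately), and the function name pca_projections and the random test matrix are made up for illustration. Rows of U*Sigma stay in the same order as the rows of A, which is consistent with the document ids being preserved.

```python
import numpy as np

def pca_projections(a, k):
    """Return ((A-M)V, U*Sigma, U*sqrt(Sigma)) for a rank-k SVD of the centered matrix."""
    m = a.mean(axis=0)                      # column means: the M in (A - M)
    centered = a - m
    # Exact thin SVD (stand-in for SSVD, which is a randomized approximation)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    u_k, s_k, v_k = u[:, :k], s[:k], vt[:k].T
    y = centered @ v_k                      # explicit re-projection Y = (A - M) V
    us = u_k * s_k                          # U * Sigma, scales column j by sigma_j
    us_half = u_k * np.sqrt(s_k)            # U * Sigma^0.5, the variant Ted suggests
    return y, us, us_half

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 8))                # hypothetical 50 documents x 8 terms
y, us, us_half = pca_projections(a, 3)
print(np.allclose(y, us))                   # do the two projections agree?
```

With an exact SVD the two projections agree to floating-point precision, so the expensive product (A-M)V does not need to be recomputed; with SSVD the agreement is approximate.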
