When using Lanczos, the recommendation is to use the clean eigenvectors as a distributed row matrix (call it V):
    A-hat = A^t V^t

This is per the clusterdump tests DSVD and DSVD2.

When using SSVD, Dmitriy and Ted recommend:

    A-hat = US

When using PCA, it's also preferable to use --uHalfSigma when creating U with the SSVD solver. One difficulty is that to perform the multiplication you have to turn the singular values vector (the diagonal values) into a distributed row matrix or write your own multiply function, correct?

Questions:

1. For SSVD, can someone explain why US is preferred? Given A = USV^t, how can you ignore the effect of V^t? Is this only for PCA? In other words, if you did not use PCA weighting, would you ignore V^t?
2. For Lanczos, A-hat = A^t V^t seems to strip the doc ids during the transpose. Am I mistaken?
3. Shouldn't A-hat also be transposed before performing kmeans or other analysis?

> Dmitriy said:
>
> With SSVD you need just US (or U*Sigma in other notation). This is the
> dimensionally reduced output of the original document matrix that you ran
> with the --pca option. As Ted suggests, you may also use US^0.5, which is
> already produced by providing --uHalfSigma (or its embedded setter analog).
> The keys of that output (produced by the getUPath() call) will already
> contain your Text document ids as sequence file keys.
>
> -d
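
A small sketch may help with the "why US" and "how do I multiply by Sigma" questions. Since V has orthonormal columns, A V = U S V^t V = U S, so US is just the rows of A projected onto the right singular vectors; V^t is not ignored, it has already been applied. Multiplying by the diagonal Sigma also does not need a second matrix: it amounts to scaling column j of U by the j-th singular value, one row at a time, which leaves the row keys (doc ids) untouched. The code below is plain Python on made-up toy data, not Mahout API; the names `scale_by_sigma`, `matmul`, and the doc ids are hypothetical.

```python
import math

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def scale_by_sigma(u_rows, sigma, power=1.0):
    """Compute U * Sigma^power row by row, keeping the doc id keys.

    Sigma is diagonal, so no matrix for it is ever built: column j of
    every row is simply multiplied by sigma[j] ** power.
    """
    weights = [s ** power for s in sigma]
    return {doc_id: [u * w for u, w in zip(row, weights)]
            for doc_id, row in u_rows.items()}

# Toy factors (assumed for illustration): U and V have orthonormal columns.
u_rows = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.0, 0.0]}
sigma = [3.0, 2.0]
v = [[0.6, -0.8], [0.8, 0.6]]

us = scale_by_sigma(u_rows, sigma)                   # A-hat = US
us_half = scale_by_sigma(u_rows, sigma, power=0.5)   # US^0.5 variant

# Rebuild A = U S V^t, then check that projecting A onto V gives US back.
v_t = [list(col) for col in zip(*v)]
a = matmul(list(us.values()), v_t)
a_v = matmul(a, v)
same = all(math.isclose(x, y, abs_tol=1e-9)
           for r1, r2 in zip(a_v, us.values())
           for x, y in zip(r1, r2))
```

Because each row is scaled independently, the same idea should carry over to a single pass over the rows of a distributed U, with the sequence file keys passed through unchanged.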
