When using Lanczos the recommendation is to use the cleaned eigenvectors as a 
distributed row matrix--call it V.

A-hat = A^t V^t, per the clusterdump tests DSVD and DSVD2.

Dmitriy and Ted recommend when using SSVD to do:

A-hat = US 

When using PCA it's also preferable to pass --uHalfSigma so the SSVD solver 
creates U already scaled. One difficulty is that to perform the multiplication 
you have to turn the singular-values vector (the diagonal) into a distributed 
row matrix or write your own multiply function, correct?
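(A small numpy sketch of the point above, with made-up sizes: multiplying by a diagonal matrix of singular values is just a column-wise scaling of U, so no distributed diagonal matrix is actually needed.)

```python
import numpy as np

# Hypothetical small example: U is n x k, s holds the k singular values.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 3))
s = np.array([3.0, 2.0, 1.0])

# Forming US via an explicit diagonal matrix (what a distributed
# diag-matrix multiply would compute)...
US_dense = U @ np.diag(s)

# ...is the same as scaling each column of U by its singular value,
# which numpy broadcasting does with no extra matrix at all:
US_scaled = U * s

assert np.allclose(US_dense, US_scaled)
```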

Questions:
For SSVD, can someone explain why US is preferred? Given A = USV^t, how can you 
ignore the effect of V^t? Is this only for PCA? In other words, if you did not 
use PCA weighting, would you still ignore V^t?
For Lanczos, A-hat = A^t V^t seems to strip the doc ids during the transpose -- 
am I mistaken? Also, shouldn't A-hat be transposed before performing kmeans or 
other analysis?



> Dmitriy said
With SSVD you need just US (or U*Sigma in other notation).
This is the dimensionally reduced output of the original document
matrix you ran with the --pca option.

As Ted suggests, you may also use US^0.5, which is already produced by
providing --uHalfSigma (or its embedded setter analog). The keys of
that output (produced by the getUPath() call) will already contain your
Text document ids as sequence-file keys.

-d
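(A numpy sketch of why V^t can be dropped, as the reply above says: because the rows of V^t are orthonormal, right-multiplying by V^t preserves Euclidean distances between rows, and distance-based methods like k-means only see those distances. The matrix sizes here are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 10))

# Thin SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
US = U * s  # column-wise scaling, i.e. U @ diag(s)

# Vt has orthonormal rows (Vt @ Vt.T = I), so multiplying a row vector by
# Vt is an isometry: pairwise distances between rows of A equal those
# between rows of US. Clustering on US therefore gives the same result.
d_A = np.linalg.norm(A[0] - A[1])
d_US = np.linalg.norm(US[0] - US[1])
assert np.isclose(d_A, d_US)
```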
