U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".

WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance 
proportions for clustering and similarity (though I'm not sure this is 
strictly required with cosine distance), so I probably want to use U * Sigma 
rather than U * Sigma^(1/2).

Since I have no other reason to load U row by row, I could write another 
transform and keep it out of the Mahout core, but doing this in a patch seems 
trivial. Just create a new flag, something like --uSigma (the CLI option 
actually looks like the hardest part). For the API there needs to be a new 
setter, something like SSVDSolver#setComputeUSigma(true). Then do an extra 
flag check in the setup for the UJob, something like the following:

      // PROP_U_SIGMA is set from the --uSigma option or
      // SSVDSolver#setComputeUSigma(true)
      if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
        sValues = SSVDHelper.loadVector(sigmaPath, context.getConfiguration());
        // no need to take the sqrt for Sigma weighting:
        // sValues.assign(Functions.SQRT);
      }

The map already applies sValues to U, so that code would remain unchanged: 
sValues will simply hold the sigma values instead of the sqrt(sigma) values.

      if (sValues != null) {
        for (int i = 0; i < k; i++) {
          uRow.setQuick(i,
                        qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
        }
      } else {
        …

I'll give this a try and, if it seems reasonable, submit a patch.
 
On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
> 
> When using PCA it's also preferable to use --uHalfSigma to create U with the 
> SSVD solver. One difficulty is that to perform the multiplication you have to 
> turn the singular values vector (diagonal values) into a distributed row 
> matrix or write your own multiply function, correct?

You could do that, but why? Sigma is a diagonal matrix (which is
additionally encoded as a very short vector of singular values of
length k; say we denote it as 'sv'). Given that, there's absolutely 0
reason to encode it as a distributed row matrix.

Multiplication can be done on the fly as you load U, row by row:
U*Sigma[i,j]=U[i,j]*sv[j]
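
A minimal sketch of that row-by-row scaling, using plain Java arrays rather
than Mahout's Vector classes so it runs on its own. The class and method
names are made up for illustration and the numbers are toy values, not a
real SSVD output:

```java
import java.util.Arrays;

final class USigmaScaling {

    /** Scales one row of U in place: uRow[j] becomes U[i,j] * sv[j]. */
    static void scaleRow(double[] uRow, double[] sv) {
        if (uRow.length != sv.length) {
            throw new IllegalArgumentException("row length must equal k");
        }
        for (int j = 0; j < uRow.length; j++) {
            uRow[j] *= sv[j];
        }
    }

    public static void main(String[] args) {
        double[] sv = {4.0, 2.0, 1.0};   // singular values, length k (toy values)
        double[][] u = {                 // rows as they would be loaded one at a time
            {0.5, 0.5, 0.0},
            {0.5, -0.5, 1.0}
        };
        for (double[] row : u) {
            scaleRow(row, sv);           // apply Sigma while the row is in hand
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Only the k-element vector sv has to be held in memory; no second matrix is
ever built.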

One inconvenience with that approach is that it assumes you can freely
hack the code that loads the U matrix for further processing.

It is much easier to have SSVD output U*Sigma directly using the
same logic as above (requires a patch) or just have it output
U*Sigma^0.5 (does not require a patch).

You could even use U in some cases directly, but part of the problem
is that data variances will be normalized in all directions compared
to PCA space, which will affect actual distances between data points.
If you want to retain proportions of the directional variances as in
your original input, you need to use principal components with scaling
applied, i.e. U*Sigma.
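
A small illustration of how the scaling changes distances, again with plain
Java arrays and hypothetical names. The three rows and the singular values
are toy numbers chosen to make the effect obvious, not output from SSVD:

```java
final class ScalingEffect {

    /** Returns a scaled copy of one row: out[j] = row[j] * sv[j]. */
    static double[] scale(double[] row, double[] sv) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            out[j] = row[j] * sv[j];
        }
        return out;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int j = 0; j < a.length; j++) {
            dot += a[j] * b[j];
            na += a[j] * a[j];
            nb += b[j] * b[j];
        }
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        double[] sv = {3.0, 1.0};        // first direction carries more variance
        double[] a = {1.0, 1.0};
        double[] b = {1.0, -1.0};        // differs from a in the weak direction
        double[] c = {-1.0, 1.0};        // differs from a in the strong direction

        System.out.println("U:       d(a,b)=" + euclidean(a, b)
            + " d(a,c)=" + euclidean(a, c) + " cos(a,b)=" + cosine(a, b));

        double[] as = scale(a, sv), bs = scale(b, sv), cs = scale(c, sv);
        System.out.println("U*Sigma: d(a,b)=" + euclidean(as, bs)
            + " d(a,c)=" + euclidean(as, cs) + " cos(a,b)=" + cosine(as, bs));
    }
}
```

With these numbers a is equally far from b and c before scaling (distance 2
to each), but after scaling it is 2 from b and 6 from c. The cosine between
a and b also moves from 0 to 0.8, since per-column scaling is not a uniform
rescaling of the vectors, so cosine-based similarity is affected as well.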

