On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <[email protected]> wrote:
> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>
> However I still have no doc ids in U. Has anyone seen a case where they are 
> preserved?

That should not be the case. Row ids of U are inherited from the rows
of A (or at least they should be).

>
> For
>     BtJob.run(conf,
>                 inputPath,
>                 qPath,
>                 pcaMeanPath,
>                 btPath,
>                 minSplitSize,
>                 k,
>                 p,
>                 outerBlockHeight,
>                 q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>                 broadcast,
>                 labelType,
>                 q <= 0);
>
> inputPath here contains a distributedRowMatrix with text doc ids.
>
> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look 
> for them and BtJob is the only place the input matrix is used, the rest are 
> intermediates afaict and anyway don't have ids either.
>
> Is something in BtJob stripping them? It looks like ids are ignored in the MR 
> code but maybe it's hidden…
>
> Are the keys of U guaranteed to be the same as those of A? If so I could construct an
> index for A and use it on U but it would be nice to get them out of the 
> solver.

Yes, that's the idea.

The B^t matrix will not have the ids; I'm not sure why you are looking
there. You need the U matrix, which is computed by a separate job.
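
To make the key question concrete, here is a minimal sketch in plain Java
(not the Mahout UJob code) of the U*Sigma scaling quoted below,
(U*Sigma)[i,j] = U[i,j]*sv[j], applied row by row with each row key passed
through unchanged. The class USigmaSketch, the method scaleRows, and the
in-memory map standing in for the U output are all hypothetical names used
for illustration; a real job would read and write keyed rows through Hadoop.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for row-by-row U*Sigma scaling; not Mahout code. */
final class USigmaSketch {

  /**
   * Returns a new map with the same keys, in the same order, whose rows are
   * the rows of u with column j multiplied by sv[j]. The input is not modified.
   */
  static Map<String, double[]> scaleRows(Map<String, double[]> u, double[] sv) {
    Map<String, double[]> scaled = new LinkedHashMap<>();
    for (Map.Entry<String, double[]> e : u.entrySet()) {
      double[] row = e.getValue();
      if (row.length != sv.length) {
        throw new IllegalArgumentException(
            "row " + e.getKey() + " has " + row.length
                + " columns, expected " + sv.length);
      }
      double[] out = new double[row.length];
      for (int j = 0; j < row.length; j++) {
        out[j] = row[j] * sv[j]; // (U*Sigma)[i,j] = U[i,j] * sv[j]
      }
      scaled.put(e.getKey(), out); // the row key (doc id) is carried over as-is
    }
    return scaled;
  }

  public static void main(String[] args) {
    // Made-up rows keyed by text doc ids, assumed to match the keys of A.
    Map<String, double[]> u = new LinkedHashMap<>();
    u.put("doc-a", new double[] {1.0, 2.0});
    u.put("doc-b", new double[] {-0.5, 4.0});
    double[] sv = {10.0, 0.5}; // made-up singular values, length k = 2

    for (Map.Entry<String, double[]> e : scaleRows(u, sv).entrySet()) {
      System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
    }
  }
}
```

Because the scaling only touches the values, an index built on the keys of A
should keep working on the scaled output, assuming U really does carry A's
keys as described above.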

>
> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Yes, you got it; that's what I was proposing before. A very easy patch.
> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <[email protected]> wrote:
>
>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>
>> WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance
>> proportions for clustering and similarity (though I'm not sure this is
>> strictly required with cosine distance), so I probably want to use U * Sigma
>> rather than U * Sigma^(1/2).
>>
>> Since I have no other reason to load U row by row, I could write another
>> transform and keep it out of the Mahout core, but doing this in a patch
>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>> option actually looks like the hardest part). For the API there needs to be
>> a new setter, something like SSVDSolver#setComputeUSigma(true), and then an
>> extra flag check in the setup for the UJob, something like the following:
>>
>>      // set from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>>        sValues = SSVDHelper.loadVector(sigmaPath,
>>                                        context.getConfiguration());
>>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt for Sigma weighting
>>      }
>>
>> sValues is already applied to U in the map, which would remain unchanged
>> since the sigma weighted (instead of sqrt sigma) values will already be in
>> sValues.
>>
>>      if (sValues != null) {
>>        for (int i = 0; i < k; i++) {
>>          uRow.setQuick(i,
>>                        qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
>>        }
>>      } else {
>>        …
>>
>> I'll give this a try and if it seems reasonable submit a patch.
>>
>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>> have to turn the singular values vector (diagonal values) into a
>>> distributed row matrix or write your own multiply function, correct?
>>
>> You could do that, but why? Sigma is a diagonal matrix (which is
>> additionally encoded as a very short vector of singular values of
>> length k; say we denote it 'sv'). Given that, there's absolutely no
>> reason to encode it as a distributed row matrix.
>>
>> Multiplication can be done on the fly as you load U, row by row:
>> U*Sigma[i,j]=U[i,j]*sv[j]
>>
>> One inconvenience with that approach is that it assumes you can freely
>> hack the code that loads the U matrix for further processing.
>>
>> It is much easier to have SSVD output U*Sigma directly using the
>> same logic as above (requires a patch) or just have it output
>> U*Sigma^0.5 (does not require a patch).
>>
>> You could even use U in some cases directly, but part of the problem
>> is that data variances will be normalized in all directions compared
>> to PCA space, which will affect actual distances between data points.
>> If you want to retain proportions of the directional variances as in
>> your original input, you need to use principal components with scaling
>> applied, i.e. U*Sigma.
>>
>>
>>
>
