Yeah, we are probably talking about different things.


On Fri, Sep 7, 2012 at 1:49 PM, Pat Ferrel <[email protected]> wrote:
> I'm looking in the output. Since I'm not familiar with what the jobs do, I am 
> looking everywhere. One thing seems true: ids go in but they do not come 
> out in any output.
>
> I do find your comments instructive though. My first time messing with mahout 
> job internals.
>
> Maybe we have a terminology problem. The Keys are not the same as ids, right? 
> For one thing, the Key is an int and the ID is Text. The ID is part of the 
> row/vector and can be text or an int. I call them doc ids because that's what 
> I use them for.
>
> On Sep 7, 2012, at 1:34 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> More specifically, the way it works is that the Q matrix inherits the keys
> of A's rows (BtJob line 137), and U inherits the keys of Q (UJob line 128).
>
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <[email protected]> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>
>>> However I still have no doc ids in U. Has anyone seen a case where they are 
>>> preserved?
>>
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A (or should be, at least).
>>
>>>
>>> For
>>>    BtJob.run(conf,
>>>                inputPath,
>>>                qPath,
>>>                pcaMeanPath,
>>>                btPath,
>>>                minSplitSize,
>>>                k,
>>>                p,
>>>                outerBlockHeight,
>>>                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>                broadcast,
>>>                labelType,
>>>                q <= 0);
>>>
>>> inputPath here contains a DistributedRowMatrix with text doc ids.
>>>
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look 
>>> for them and BtJob is the only place the input matrix is used, the rest are 
>>> intermediates afaict and anyway don't have ids either.
>>>
>>> Is something in BtJob stripping them? It looks like ids are ignored in the 
>>> MR code but maybe it's hidden…
>>>
>>> Are the Keys of U guaranteed to be the same as A? If so I could construct 
>>> an index for A and use it on U but it would be nice to get them out of the 
>>> solver.
>>
>> Yes, that's the idea.
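
The fallback mentioned above (build an index over A and apply it to U) could be sketched roughly as follows. This is only an illustration: it assumes the int row keys of U really do match those of A, and it uses plain Java maps in place of Mahout or Hadoop types. The class and method names are hypothetical and not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: plain Java collections stand in for the sequence files of A and U.
// KeyIndexSketch and labelRows are hypothetical names, not Mahout API.
final class KeyIndexSketch {

    /**
     * Re-labels rows of U with doc ids, looked up by the int row key
     * that U is assumed to share with A.
     */
    static Map<String, double[]> labelRows(Map<Integer, String> docIdByKey,
                                           Map<Integer, double[]> uRowsByKey) {
        Map<String, double[]> labeled = new LinkedHashMap<>();
        for (Map.Entry<Integer, double[]> e : uRowsByKey.entrySet()) {
            String docId = docIdByKey.get(e.getKey());
            if (docId == null) {
                // A key present in U but not in A would break the assumption.
                throw new IllegalStateException("no doc id for key " + e.getKey());
            }
            labeled.put(docId, e.getValue());
        }
        return labeled;
    }
}
```

The index (docIdByKey) would be built once while reading A. The lookup fails loudly if U contains a key that A did not, since that would mean the keys are not preserved after all.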
>>
>> The B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is solved by another job.
>>
>>>
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> Yes, you got it, that's what I was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <[email protected]> wrote:
>>>
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>
>>>> WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance
>>>> proportions for clustering and similarity (though I'm not sure this is
>>>> strictly required with cosine distance), so I probably want to use
>>>> U * Sigma instead of U * Sigma^(1/2).
>>>>
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>>
>>>>     // set from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>>>>       sValues = SSVDHelper.loadVector(sigmaPath,
>>>>                                       context.getConfiguration());
>>>>       // no need to take the sqrt for Sigma weighting
>>>>       // sValues.assign(Functions.SQRT);
>>>>     }
>>>>
>>>> sValues is already applied to U in the map, which would remain unchanged
>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>> sValues.
>>>>
>>>>     if (sValues != null) {
>>>>       for (int i = 0; i < k; i++) {
>>>>         uRow.setQuick(i,
>>>>                       qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
>>>>       }
>>>>     } else {
>>>>       …
>>>>
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>>
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>>> have to turn the singular values vector (diagonal values) into a
>>>>> distributed row matrix or write your own multiply function, correct?
>>>>
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely 0
>>>> reason to encode it as a distributed row matrix.
>>>>
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>>
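
As a rough illustration of the row-by-row multiplication described above, here is a minimal sketch. It uses plain double arrays rather than Mahout's Vector types, and the names USigmaSketch and scaleRow are hypothetical; the numbers are made up for the example.

```java
import java.util.Arrays;

// Sketch only: plain arrays stand in for Mahout vectors; names are hypothetical.
final class USigmaSketch {

    /** Scales one row of U in place: uRow[j] = uRow[j] * sv[j]. */
    static void scaleRow(double[] uRow, double[] sv) {
        if (uRow.length != sv.length) {
            throw new IllegalArgumentException("row and sv must both have length k");
        }
        for (int j = 0; j < uRow.length; j++) {
            uRow[j] *= sv[j];
        }
    }

    public static void main(String[] args) {
        double[] sv = {3.0, 2.0};                 // singular values, k = 2 (made up)
        double[][] u = {{0.6, 0.8}, {0.8, -0.6}}; // rows of U as they are loaded (made up)
        for (double[] row : u) {
            scaleRow(row, sv);                    // apply Sigma on the fly, one row at a time
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because Sigma is diagonal, each row needs only k multiplications and nothing has to be materialized as a matrix.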
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>>
>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>> same logic as above (requires a patch) or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>>
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
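
To make the point about distances concrete, here is a small sketch with made-up numbers. It compares Euclidean distances between points using unscaled coordinates and coordinates weighted by singular values. The class name and the values are hypothetical, and the points are not taken from a real decomposition.

```java
// Sketch only: plain arrays, hypothetical names, made-up numbers.
final class ScaledDistanceSketch {

    /** Euclidean distance between a and b after scaling component j by weight[j]. */
    static double distance(double[] a, double[] b, double[] weight) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = (a[j] - b[j]) * weight[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] ones = {1.0, 1.0};         // no scaling: distances in U
        double[] sv = {10.0, 1.0};          // scaling by singular values: distances in U*Sigma
        double[] p = {1.0, 1.0};
        double[] alongFirst = {0.0, 1.0};   // differs from p only in component 0
        double[] alongSecond = {1.0, 0.0};  // differs from p only in component 1

        // In U both neighbours look equally far from p.
        System.out.println(distance(p, alongFirst, ones) + " vs " + distance(p, alongSecond, ones));
        // In U*Sigma the difference along the dominant direction counts more.
        System.out.println(distance(p, alongFirst, sv) + " vs " + distance(p, alongSecond, sv));
    }
}
```

With unit weights the two neighbours are equidistant from p. With the singular values applied, the one that differs along the first (larger) direction ends up ten times farther away, which is the proportion that plain U discards.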
>>>>
>>>>
>>>>
>>>
>
