Yeah, probably we are talking about different things.
On Fri, Sep 7, 2012 at 1:49 PM, Pat Ferrel <[email protected]> wrote:
> I'm looking in the output. Since I'm not familiar with what the jobs do, I am
> looking everywhere. One thing seems true: ids go in but they do not come
> out, in any output.
>
> I do find your comments instructive, though. My first time messing with Mahout
> job internals.
>
> Maybe we have a terminology problem. The keys are not the same as ids, right?
> The key is an int; the id is Text, for one thing. The id is part of the
> row/vector and can be text or an int. I call them doc ids because that's what
> I use them for.
>
> On Sep 7, 2012, at 1:34 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> More specifically, the way it works, the Q matrix inherits the keys of A's rows
> (BtJob line 137), and U inherits the keys of Q (UJob line 128).
>
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <[email protected]> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>
>>> However, I still have no doc ids in U. Has anyone seen a case where they are
>>> preserved?
>>
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A (should be, at least).
>>
>>>
>>> For
>>> BtJob.run(conf,
>>>           inputPath,
>>>           qPath,
>>>           pcaMeanPath,
>>>           btPath,
>>>           minSplitSize,
>>>           k,
>>>           p,
>>>           outerBlockHeight,
>>>           q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>           broadcast,
>>>           labelType,
>>>           q <= 0);
>>>
>>> inputPath here contains a DistributedRowMatrix with text doc ids.
>>>
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look
>>> for them, and BtJob is the only place the input matrix is used; the rest are
>>> intermediates afaict and anyway don't have ids either.
>>>
>>> Is something in BtJob stripping them? It looks like ids are ignored in the
>>> MR code but maybe it's hidden…
>>>
>>> Are the keys of U guaranteed to be the same as A's? If so, I could construct
>>> an index for A and use it on U, but it would be nice to get them out of the
>>> solver.
>>
>> Yes, that's the idea.
>>
>> The B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is solved by another job.
>>
>>>
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> Yes, you got it, that's what I was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <[email protected]> wrote:
>>>
>>>> U*Sigma[i,j] = U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>
>>>> WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance
>>>> proportions for doing clustering and similarity (though not sure if this is
>>>> strictly required with cosine distance), so I probably want to use U * Sigma
>>>> instead of sqrt Sigma.
>>>>
>>>> Since I have no other reason to load U row by row, I could write another
>>>> transform and keep it out of the Mahout core, but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter, something like SSVDSolver#setComputeUSigma(true), then an
>>>> extra flag check in the setup for the UJob, something like the following:
>>>>
>>>> if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>>>>   // set from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>   sValues = SSVDHelper.loadVector(sigmaPath,
>>>>                                   context.getConfiguration());
>>>>   // sValues.assign(Functions.SQRT); // no need to take the sqrt
>>>>   // for Sigma weighting
>>>> }
>>>>
>>>> sValues is already applied to U in the map, which would remain unchanged
>>>> since the sigma-weighted (instead of sqrt-sigma) values will already be in
>>>> sValues.
>>>>
>>>> if (sValues != null) {
>>>>   for (int i = 0; i < k; i++) {
>>>>     uRow.setQuick(i,
>>>>                   qRow.dot(uHat.viewColumn(i)) *
>>>>                       sValues.getQuick(i));
>>>>   }
>>>> } else {
>>>>   …
>>>>
>>>> I'll give this a try and, if it seems reasonable, submit a patch.
>>>>
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>>> have to turn the singular values vector (the diagonal values) into a
>>>>> distributed row matrix or write your own multiply function, correct?
>>>>
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely zero
>>>> reason to encode it as a distributed row matrix.
>>>>
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j] = U[i,j]*sv[j]
>>>>
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads the U matrix for further processing.
>>>>
>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>> same logic as above (requires a patch), or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>>
>>>> You could even use U directly in some cases, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain the proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>>
>>>>
>>>>
>>>
>
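
(For reference, a minimal, untested sketch of the "multiply on the fly" approach discussed in the quoted thread: read U row by row, keep the row key that U inherits from A (the doc id), and scale each element by the matching singular value. It assumes U is written as a SequenceFile of (key, VectorWritable) under a U/ subdirectory of the SSVD output, that SSVDHelper.loadVector works as in the UJob snippet above, and that the import path for SSVDHelper is the stochasticsvd package; the paths are placeholders.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper; // assumed package

public class UTimesSigmaOnTheFly {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path uPart = new Path("ssvd-out/U/part-r-00000"); // placeholder path
    Path sigmaPath = new Path("ssvd-out/sigma");      // placeholder path

    // 'sv' is the k-length vector of singular values (the diagonal of Sigma).
    Vector sv = SSVDHelper.loadVector(sigmaPath, conf);

    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), uPart, conf);
    try {
      // The key class is whatever A used (Text or IntWritable); U inherits it.
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      VectorWritable row = new VectorWritable();
      while (reader.next(key, row)) {
        Vector uRow = row.get();
        // Scale in place: (U*Sigma)[i,j] = U[i,j] * sv[j]
        for (int j = 0; j < sv.size(); j++) {
          uRow.setQuick(j, uRow.getQuick(j) * sv.getQuick(j));
        }
        // 'key' is the doc id carried over from the corresponding row of A.
        System.out.println(key + "\t" + uRow);
      }
    } finally {
      reader.close();
    }
  }
}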
