More specifically, the way it works: the Q matrix inherits the keys of the rows of A (BtJob line 137), and U inherits the keys of Q (UJob line 128).
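To illustrate the key inheritance described above, here is a minimal sketch in plain Java. It is not Mahout code: the class `KeyInheritanceSketch`, the method `mapRows`, and the toy row transforms are hypothetical stand-ins for the real jobs. It only shows the pattern in which each stage emits its output row under the same key as its input row, so keys present on A should come out unchanged on U.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; the real logic lives in BtJob and UJob.
final class KeyInheritanceSketch {

    // Applies a per-row transform and writes each result under the input row's key,
    // the way a map-only stage passes its input key through to its output.
    static Map<String, double[]> mapRows(Map<String, double[]> rows,
                                         UnaryOperator<double[]> transform) {
        Map<String, double[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : rows.entrySet()) {
            out.put(e.getKey(), transform.apply(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, double[]> a = new LinkedHashMap<>();
        a.put("doc-a", new double[] {1.0, 2.0});
        a.put("doc-b", new double[] {3.0, 4.0});

        // Stand-in for "Q inherits keys of A rows" (toy transform: copy the row).
        Map<String, double[]> q = mapRows(a, row -> row.clone());
        // Stand-in for "U inherits keys of Q" (toy transform: sum the row).
        Map<String, double[]> u = mapRows(q, row -> new double[] {row[0] + row[1]});

        System.out.println(u.keySet().equals(a.keySet())); // keys carried through both stages
    }
}
```

If the keys of U do not match those of A in practice, a check of this kind (comparing key sets of input and output) is one way to narrow down which stage drops them.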
On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <[email protected]> wrote:
>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>
>> However I still have no doc ids in U. Has anyone seen a case where they are
>> preserved?
>
> That should not be the case. Ids in rows of U are inherited from rows
> of A. (should be at least).
>
>>
>> For
>>     BtJob.run(conf,
>>               inputPath,
>>               qPath,
>>               pcaMeanPath,
>>               btPath,
>>               minSplitSize,
>>               k,
>>               p,
>>               outerBlockHeight,
>>               q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>               broadcast,
>>               labelType,
>>               q <= 0);
>>
>> inputPath here contains a distributedRowMatrix with text doc ids.
>>
>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look
>> for them and BtJob is the only place the input matrix is used, the rest are
>> intermediates afaict and anyway don't have ids either.
>>
>> Is something in BtJob stripping them? It looks like ids are ignored in the
>> MR code but maybe it's hidden…
>>
>> Are the Keys of U guaranteed to be the same as A? If so I could construct
>> an index for A and use it on U but it would be nice to get them out of the
>> solver.
>
> Yes, that's the idea.
>
> B^t matrix will not have the ids, not sure why you are looking
> there. You need the U matrix, which is solved by another job.
>
>>
>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Yes you got it, that's what I was proposing before. A very easy patch.
>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <[email protected]> wrote:
>>
>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>
>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>> proportions for doing clustering and similarity (though not sure if this is
>>> strictly required with cosine distance) I probably want to use U * Sigma
>>> instead of sqrt Sigma.
>>>
>>> Since I have no other reason to load U row by row I could write another
>>> transform and keep it out of the mahout core but doing this in a patch
>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>> option looks like the hardest part actually). For the API there needs to be
>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>> extra flag check in the setup for the UJob, something like the following
>>>
>>>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>>>       // set from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>       sValues = SSVDHelper.loadVector(sigmaPath,
>>>         context.getConfiguration());
>>>       // sValues.assign(Functions.SQRT); // no need to take the sqrt for Sigma weighting
>>>     }
>>>
>>> sValues is already applied to U in the map, which would remain unchanged
>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>> sValues.
>>>
>>>     if (sValues != null) {
>>>       for (int i = 0; i < k; i++) {
>>>         uRow.setQuick(i,
>>>           qRow.dot(uHat.viewColumn(i)) *
>>>           sValues.getQuick(i));
>>>       }
>>>     } else {
>>>       …
>>>
>>> I'll give this a try and if it seems reasonable submit a patch.
>>>
>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>> have to turn the singular values vector (diagonal values) into a
>>> distributed row matrix or write your own multiply function, correct?
>>>
>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>> additionally encoded as a very short vector of singular values of
>>> length k, say we denote it as 'sv'). Given that, there's absolutely 0
>>> reason to encode it as Distributed row matrix.
>>>
>>> Multiplication can be done on the fly as you load U, row by row:
>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>
>>> One inconvenience with that approach is that it assumes you can freely
>>> hack the code that loads U matrix for further processing.
>>>
>>> It is much easier to have SSVD output U*Sigma directly using the
>>> same logic as above (requires a patch) or just have it output
>>> U*Sigma^0.5 (does not require a patch).
>>>
>>> You could even use U in some cases directly, but part of the problem
>>> is that data variances will be normalized in all directions compared
>>> to PCA space, which will affect actual distances between data points.
>>> If you want to retain proportions of the directional variances as in
>>> your original input, you need to use principal components with scaling
>>> applied, i.e. U*Sigma.
>>>
>>>
>>>
>>
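
For reference, the row-by-row scaling discussed in the quoted thread, U*Sigma[i,j]=U[i,j]*sv[j], can be sketched in plain Java as below. This is only an illustration under stated assumptions: it uses bare arrays rather than Mahout's Vector types, and the class `USigmaSketch`, the method `scaleRow`, and the `halfSigma` switch are hypothetical names, not part of the SSVD code or of any patch.

```java
// Hypothetical illustration only; not taken from SSVDSolver or UJob.
final class USigmaSketch {

    // Scales one row of U by the singular values: out[j] = uRow[j] * sv[j].
    // With halfSigma = true it uses sqrt(sv[j]) instead, i.e. U * Sigma^0.5.
    static double[] scaleRow(double[] uRow, double[] sv, boolean halfSigma) {
        if (uRow.length != sv.length) {
            throw new IllegalArgumentException("row length must equal number of singular values");
        }
        double[] out = new double[uRow.length];
        for (int j = 0; j < uRow.length; j++) {
            double weight = halfSigma ? Math.sqrt(sv[j]) : sv[j];
            out[j] = uRow[j] * weight;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sv = {4.0, 9.0};   // toy singular values, k = 2
        double[] uRow = {1.0, 2.0}; // one toy row of U

        double[] full = scaleRow(uRow, sv, false); // row of U * Sigma
        double[] half = scaleRow(uRow, sv, true);  // row of U * Sigma^0.5
        System.out.println(full[0] + " " + full[1]);
        System.out.println(half[0] + " " + half[1]);
    }
}
```

Since Sigma is diagonal, each row can be scaled independently as it is loaded, which is why no distributed multiply is needed.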
