I guess there's one more clarification that might be useful. SSVD in fact creates a decomposition of rank (k+p), where p is called "oversampling"; the extra p dimensions capture additional plausible directions of high variance in the data, in case we guessed the first k wrong. It then throws away the last p singular values and vectors. That reduces rounding errors due to an imperfectly guessed projection.
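To make the shapes concrete, here is a small self-contained sketch (plain Java, not Mahout's actual implementation) of the bookkeeping: the random projection Y = A·Ω has k+p columns, so a QR of Y needs at least k+p rows of A, and after the small decomposition the last p singular values are discarded. The class and method names here are illustrative only.

```java
import java.util.Random;

/** Toy illustration of SSVD oversampling shapes; not Mahout code. */
public class OversamplingSketch {

    /** Y = A * Omega, where A is m x n and Omega is n x (k+p). */
    static double[][] project(double[][] a, double[][] omega) {
        int m = a.length, n = a[0].length, kp = omega[0].length;
        double[][] y = new double[m][kp];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < kp; j++)
                for (int t = 0; t < n; t++)
                    y[i][j] += a[i][t] * omega[t][j];
        return y;
    }

    /** Keep only the first k of the k+p computed singular values. */
    static double[] truncate(double[] singularValues, int k) {
        double[] kept = new double[k];
        System.arraycopy(singularValues, 0, kept, 0, k);
        return kept;
    }

    public static void main(String[] args) {
        int m = 100, n = 50, k = 10, p = 15;      // note k + p <= min(m, n)
        Random rnd = new Random(42);
        double[][] a = new double[m][n];
        double[][] omega = new double[n][k + p];  // random Gaussian projection
        for (double[] row : a)
            for (int j = 0; j < n; j++) row[j] = rnd.nextGaussian();
        for (double[] row : omega)
            for (int j = 0; j < k + p; j++) row[j] = rnd.nextGaussian();

        // Y is m x (k+p); a QR of Y is only possible when m >= k+p.
        double[][] y = project(a, omega);
        System.out.println(y.length + " x " + y[0].length);  // 100 x 25

        // Pretend these came from the SVD of the small projected matrix:
        double[] sv = new double[k + p];
        for (int i = 0; i < k + p; i++) sv[i] = k + p - i;
        System.out.println(truncate(sv, k).length);          // 10
    }
}
```

The point of the sketch is only the dimension arithmetic; the real solver does the QR and small SVD in a distributed, blocked fashion, which is where the per-split row constraint below comes from.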
In the corner case when k+p = m, we get a full-rank SVD. The assumption of SSVD is that m >> (k+p), but it will still work as long as (k+p) <= m, including the full-rank decomposition when p=0 and k=m. (It also assumes that n >> k+p.)

-d

On Fri, Aug 10, 2012 at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
> It happens because of internal constraints stemming from blocking. It
> happens when a split of A (the input) has fewer than (k+p) rows, at which
> point the blocks are too small (or rather, too short) to successfully
> perform a QR on.
>
> This also means, among other things, that k+p cannot be more than the
> total number of rows in your input.
>
> It is also possible that the input A is so wide, or k+p so big,
> that an arbitrary split does not fetch at least k+p rows of A, but
> I haven't seen such a case in practice yet. If that
> happens, there's an option to increase minSplitSize (which would
> undermine MR mapper efficiency somewhat). But I am pretty sure that is
> not your case.
>
> But if your input is shorter than k+p, then the case is too small for
> SSVD. In fact, it probably means you can solve the test directly in memory
> with any solver. You can still use SSVD with k=m and p=0 (I think) in
> this case and get an exact (non-reduced-rank) equivalent decomposition
> with no stochastic effects, but that is not really what it is for.
>
> Assuming your input is m x n, can you tell me please what your m, n, k
> and p are?
>
> thanks.
> -D
>
> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>> There seems to be some internal constraint on k and/or p, which is making a
>> test difficult.
>> The test has a very small input doc set, and choosing the
>> wrong k makes it very easy to get a failure with this message:
>>
>> java.lang.IllegalArgumentException: new m can't be less than n
>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>
>> I have a working test, but I had to add some docs to the test data and have
>> tried to reverse engineer the value for k (desiredRank). I came up with the
>> following, but I think it is only an accident that it works:
>>
>>     int p = 15; // default value for the CLI
>>     int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1; not sure why this works
>>
>> This seems likely to be an issue only because of the very small data set and
>> the relationship of rows to columns to p to k. But for the purposes of
>> creating a test, if someone (Dmitriy?) could tell me how to calculate a
>> reasonable p and k from the dimensions of the tiny data set, it would help.
>>
>> This test is derived from a non-active SVD test, but I'd be up for cleaning
>> it up and including it as an example in the working but non-active tests. I
>> also fixed a couple of trivial bugs in the non-active Lanczos tests, for what
>> it's worth.
>>
>>
>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Reading the "overview and usage" doc linked on that page,
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>> should help to clarify outputs and usage.
>>
>>
>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>> Quoth Grant Ingersoll:
>>>>> To put this in bin/mahout speak, this would look like, munging some names
>>>>> and taking liberties with the actual arguments to be passed in:
>>>>>
>>>>> bin/mahout svd (original -> svdOut)
>>>>> bin/mahout cleansvd ...
>>>>> bin/mahout transpose svdOut -> svdT
>>>>> bin/mahout transpose original -> originalT
>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>> bin/mahout kmeans newMatrix
>>>>
>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>>> Does SSVD require the EigenVerificationJob to clean the eigenvectors?
>>>
>>> No.
>>>
>>>> If so, where does SSVD put the equivalent of
>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V*,
>>>> but SSVD creates V, so should I transpose V* and then run it through the
>>>> EigenVerificationJob?
>>>
>>> No.
>>>
>>> SSVD is SVD, meaning it produces U and V with no further need to clean them.
>>>
>>>> I get errors when I do so; trying to figure out if I'm on the wrong track.
>>
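Pulling the thread's constraints together: the GivensThinSolver failure boils down to k+p exceeding the rows available, and Pat's reverse-engineered formula (desiredRank = sampleData.size() - p - 1) works precisely because it keeps k+p strictly below the row count m. Here is a hypothetical helper, not part of Mahout's API (the names `validate` and `maxRank` are made up), that encodes the rules Dmitriy describes:

```java
/** Hypothetical helpers encoding the thread's SSVD rank constraints; not Mahout API. */
public class SsvdRanks {

    /** Rejects rank parameters that violate k+p <= min(m, n). */
    static void validate(long m, long n, int k, int p) {
        if (k <= 0 || p < 0)
            throw new IllegalArgumentException("need k > 0 and p >= 0");
        if (k + p > m || k + p > n)
            throw new IllegalArgumentException(
                "k+p=" + (k + p) + " must not exceed min(m,n)=" + Math.min(m, n));
    }

    /** Largest usable k for a given row count m and oversampling p. */
    static int maxRank(int m, int p) {
        int k = m - p;
        if (k < 1)
            throw new IllegalArgumentException(
                "m=" + m + " rows is too small for oversampling p=" + p);
        return k;
    }

    public static void main(String[] args) {
        // A tiny test corpus like Pat's: m docs, CLI-default oversampling p=15.
        int m = 30, p = 15;
        int k = maxRank(m, p);          // 15: k+p = m exactly, the boundary case
        validate(m, m, k, p);           // passes
        System.out.println("k=" + k);

        // Pat's reverse-engineered formula stays strictly inside the bound:
        int desiredRank = m - p - 1;    // 14, so k+p = m-1 < m
        validate(m, m, desiredRank, p);
        System.out.println("desiredRank=" + desiredRank);
    }
}
```

Note this checks only the global row count; as Dmitriy points out, blocking also requires each input split to hold at least k+p rows, which may need a larger minSplitSize on wide or oddly split inputs.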
