It happens because of internal constraints stemming from blocking: a split of A (the input) has fewer than k+p rows, at which point the blocks are too small (or rather, too short) to successfully perform a QR on them.
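To make the rule concrete, here is a minimal Java sketch of the row-count check. It is not Mahout code: the class and method names (`SsvdRankCheck`, `fits`, `maxRank`) are hypothetical, and it assumes you already know how many rows of A the smallest split holds. For a tiny test input that lands in a single split, that would simply be the total row count m.

```java
// Illustrative only: these names are hypothetical and not part of Mahout.
final class SsvdRankCheck {

    // True when a split holding rowsInSplit rows of A is tall enough
    // for the blocked QR step, i.e. it has at least k + p rows.
    static boolean fits(int rowsInSplit, int k, int p) {
        return rowsInSplit >= k + p;
    }

    // Largest k that still satisfies rowsInSplit >= k + p for a given p;
    // returns 0 when the split has no more than p rows.
    static int maxRank(int rowsInSplit, int p) {
        return Math.max(0, rowsInSplit - p);
    }

    public static void main(String[] args) {
        int m = 20; // assumed: rows in the smallest split of A
        int p = 15; // oversampling parameter
        System.out.println("max k = " + maxRank(m, p));
        System.out.println("k=4 fits: " + fits(m, 4, p));
        System.out.println("k=6 fits: " + fits(m, 6, p));
    }
}
```

In this sketch, a split of 20 rows with p = 15 accepts any k up to 5 and rejects k = 6.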
The same constraint also means, among other things, that k+p cannot be more than the total number of rows in your input.

It is also possible that the input A is so wide, or k+p so big, that an arbitrary split does not fetch at least k+p rows of A, but I haven't seen such a case in practice yet. If that happens, there's an option to increase minSplitSize (which would undermine MR mapper efficiency somewhat). But I am pretty sure that is not your case.

If your input is shorter than k+p, then the case is too small for SSVD. In fact, it probably means you can solve it directly in memory with any solver. You can still use SSVD with k=m and p=0 (I think) in this case and get an exact (non-reduced rank) decomposition equivalent with no stochastic effects, but that is not really what it is for.

Assuming your input is m x n, can you please tell me what your m, n, k and p are?

Thanks.
-D

On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
> There seems to be some internal constraint on k and/or p, which is making a
> test difficult. The test has a very small input doc set and choosing the
> wrong k it is very easy to get a failure with this message:
>
> java.lang.IllegalArgumentException: new m can't be less than n
>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>
> I have a working test but I had to add some docs to the test data and have
> tried to reverse engineer the value for k (desiredRank). I came up with the
> following but I think it is only an accident that it works.
>
> int p = 15; //default value for CLI
> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1,
> ?????? not sure why this works
>
> This seems likely to be an issue only because of the very small data set and
> the relationship of rows to columns to p to k. But for the purposes of
> creating a test if someone (Dmitriy?) could tell me how to calculate a
> reasonable p and k from the dimensions of the tiny data set it would help.
>
> This test is derived from a non-active SVD test but I'd be up for cleaning it
> up and including it as an example in the working but non-active tests. I also
> fixed a couple trivial bugs in the non-active Lanczos tests for what it's
> worth.
>
>
> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Reading "overview and usage" doc linked on that page
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> should help to clarify outputs and usage.
>
>
> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>> Quoth Grant Ingersoll:
>>>> To put this in bin/mahout speak, this would look like, munging some names
>>>> and taking liberties with the actual argument to be passed in:
>>>>
>>>> bin/mahout svd (original -> svdOut)
>>>> bin/mahout cleansvd ...
>>>> bin/mahout transpose svdOut -> svdT
>>>> bin/mahout transpose original -> originalT
>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>> bin/mahout kmeans newMatrix
>>>
>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>> Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>
>> No
>>
>>> if so where does SSVD put the equivalent of
>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V*
>>> but SSVD creates V so should I transpose V* then run it through the
>>> EigenVerificationJob?
>> no
>>
>> SSVD is SVD, meaning it produces U and V with no further need to clean that
>>
>>> I get errors when I do so trying to figure out if I'm on the wrong track.
