This all stems from the fact that rank(thin SVD) <= rank(A). Since the thin SVD rank is really k+p, and rank(A) <= min(m,n), it follows that k+p must be <= min(m,n). In big-data settings m and n are typically large, so this is not a problem for SSVD.
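The bound above can be sketched as a simple pre-flight check. This is an illustrative snippet only; the helper names are hypothetical and not part of Mahout's API:

```java
// Sketch of the rank bound discussed above: a thin SVD of rank k+p
// can only exist when k+p <= min(m, n), since rank(A) <= min(m, n).
class RankBound {

    /** Largest admissible decomposition rank (k+p) for an m x n input. */
    static int maxDecompositionRank(int m, int n) {
        return Math.min(m, n);
    }

    /** True when the requested rank k plus oversampling p is feasible. */
    static boolean isFeasible(int m, int n, int k, int p) {
        return k + p <= maxDecompositionRank(m, n);
    }

    public static void main(String[] args) {
        // Typical big-data shape: m and n dwarf k+p, so the bound is moot.
        System.out.println(isFeasible(1000000, 50000, 100, 15)); // true
        // Tiny test corpus: 20 docs, k=10, p=15 -> 25 > 20, infeasible.
        System.out.println(isFeasible(20, 1000, 10, 15)); // false
    }
}
```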
And if m and n are of the same order as the desired rank, then the problem is either unsolvable (if large) or solvable in memory (if small).

On Fri, Aug 10, 2012 at 11:22 AM, Dmitriy Lyubimov <[email protected]> wrote:
> I guess there's one more clarification that might be useful.
>
> SSVD in fact creates a decomposition of rank (k+p), where p is called
> "oversampling"; it captures more plausible high-variance dimensions of
> the data in case we guessed the first k wrong. It then throws away the
> last p singular values and vectors. That reduces rounding errors due
> to an imperfectly guessed projection.
>
> In the corner case k+p = m, we get the corner case of full-rank SVD.
>
> SSVD assumes m >> (k+p), but it will still work as long as
> (k+p) <= m, including full-rank decomposition when p=0 and k=m. (It
> also assumes n >> k+p.)
>
> -d
>
> On Fri, Aug 10, 2012 at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> It happens because of internal constraints stemming from blocking: it
>> happens when a split of A (the input) has fewer than (k+p) rows, at
>> which point the blocks are too short to successfully perform a QR on.
>>
>> This also means, among other things, that k+p cannot exceed the total
>> number of rows in the input.
>>
>> It is also possible that the input A is so wide, or k+p so big, that
>> an arbitrary split does not fetch at least k+p rows of A, but I
>> haven't seen such cases in practice yet. If that happens, there's an
>> option to increase minSplitSize (which would undermine MR mapper
>> efficiency somewhat). But I am pretty sure that is not your case.
>>
>> But if your input is shorter than k+p, then it is a case too small for
>> SSVD. In fact, it probably means you can solve the test directly in
>> memory with any solver.
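The blocking constraint described above (each input split must contain at least k+p rows for the per-block QR to work) can be sketched roughly as follows. The helper names, the average-bytes-per-row estimate, and the arithmetic are illustrative assumptions, not Mahout code:

```java
// Hedged sketch of the split constraint: estimate how many rows land
// in one split, and what minSplitSize would guarantee at least k+p
// rows per split. Assumes rows of roughly uniform serialized size.
class SplitCheck {

    /** Approximate number of rows landing in one split of the given byte size. */
    static long rowsPerSplit(long splitSizeBytes, long avgBytesPerRow) {
        return splitSizeBytes / avgBytesPerRow;
    }

    /** Smallest split size (bytes) that fetches at least k+p rows. */
    static long minSplitSizeFor(int k, int p, long avgBytesPerRow) {
        return (long) (k + p) * avgBytesPerRow;
    }

    public static void main(String[] args) {
        // A 64 MB split of ~4 KB rows holds ~16384 rows: plenty for k+p=115.
        System.out.println(rowsPerSplit(64L << 20, 4096)); // 16384
        // If splits were too small, this is the floor to raise minSplitSize to.
        System.out.println(minSplitSizeFor(100, 15, 4096)); // 471040
    }
}
```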
>> You can still use SSVD with k=m and p=0 (I think) in this case and get
>> an exact (non-reduced-rank) decomposition with no stochastic effects,
>> but that is not really what it is for.
>>
>> Assuming your input is m x n, can you tell me please what your m, n, k
>> and p are?
>>
>> thanks.
>> -D
>>
>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>>> There seems to be some internal constraint on k and/or p, which is
>>> making a test difficult. The test has a very small input doc set, and
>>> choosing the wrong k makes it very easy to get a failure with this
>>> message:
>>>
>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>
>>> I have a working test, but I had to add some docs to the test data and
>>> have tried to reverse-engineer the value for k (desiredRank). I came
>>> up with the following, but I think it is only an accident that it
>>> works:
>>>
>>>     int p = 15; // default value for the CLI
>>>     int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1; not sure why this works
>>>
>>> This seems likely to be an issue only because of the very small data
>>> set and the relationship of rows to columns to p to k. But for the
>>> purposes of creating a test, if someone (Dmitriy?) could tell me how
>>> to calculate a reasonable p and k from the dimensions of the tiny data
>>> set, it would help.
>>>
>>> This test is derived from a non-active SVD test, but I'd be up for
>>> cleaning it up and including it as an example in the working but
>>> non-active tests. I also fixed a couple of trivial bugs in the
>>> non-active Lanczos tests, for what it's worth.
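Putting the thread's constraint k+p <= m together with Pat's question about picking parameters for a tiny corpus, one possible approach is to keep k and clamp the oversampling p down when the row count is small. This clamping policy is a hypothetical sketch, not what Mahout or the test above actually does:

```java
// Sketch of choosing p for a tiny test corpus under the constraint
// k+p <= m (m = number of docs/rows). Keeps the CLI default p=15 when
// the data allows it and shrinks it otherwise. Hypothetical policy.
class TinyParams {
    static final int DEFAULT_P = 15; // default oversampling in the CLI

    /** Oversampling clamped so that k+p never exceeds the row count m. */
    static int clampP(int m, int k) {
        return Math.max(0, Math.min(DEFAULT_P, m - k));
    }

    public static void main(String[] args) {
        System.out.println(clampP(1000, 100)); // 15: plenty of rows, keep default
        System.out.println(clampP(20, 10));    // 10: shrink p so k+p = 20 = m
        System.out.println(clampP(10, 10));    // 0: full-rank corner case, k = m
    }
}
```

Under this scheme the test's `desiredRank = sampleData.size() - p - 1` works simply because it lands at k+p = m-1 <= m, not because of anything special about the -1.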
>>>
>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> Reading the "overview and usage" doc linked on that page,
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>> should help clarify the outputs and usage.
>>>
>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>>> Quoth Grant Ingersoll:
>>>>>> To put this in bin/mahout speak, this would look like (munging some
>>>>>> names and taking liberties with the actual arguments to be passed in):
>>>>>>
>>>>>> bin/mahout svd (original -> svdOut)
>>>>>> bin/mahout cleansvd ...
>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>> bin/mahout transpose original -> originalT
>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>> bin/mahout kmeans newMatrix
>>>>>
>>>>> I'm trying to create a test case from testKmeansDSVD2 to use
>>>>> SSVDSolver. Does SSVD require the EigenVerificationJob to clean the
>>>>> eigenvectors?
>>>>
>>>> No.
>>>>
>>>>> If so, where does SSVD put the equivalent of
>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be
>>>>> in V*, but SSVD creates V, so should I transpose V* and then run it
>>>>> through the EigenVerificationJob?
>>>>
>>>> No.
>>>>
>>>> SSVD is SVD, meaning it produces U and V with no further need to
>>>> clean them.
>>>>
>>>>> I get errors when I do so; trying to figure out if I'm on the wrong
>>>>> track.
