The easy answer is to ensure (k+p) <= m. It is a mathematical constraint, not a peculiarity of the method.
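As a rough illustration of that constraint, here is a minimal sketch of picking k and p for a tiny input when m is known up front. The class and method names (SsvdRankCheck, fitRankToRows) are hypothetical and not part of Mahout; the caller supplies m, and the choice to keep k and shrink p first is only an assumption about what a test would want.

```java
// Hypothetical helper, not part of Mahout: clamps k and p so that k + p <= m.
final class SsvdRankCheck {

    /**
     * Returns {k, p} adjusted so that k + p <= m.
     * m is the number of rows of the input, known to the caller externally.
     */
    static int[] fitRankToRows(int m, int desiredK, int desiredP) {
        if (m < 1) {
            throw new IllegalArgumentException("m must be at least 1");
        }
        if (desiredK < 1 || desiredP < 0) {
            throw new IllegalArgumentException("need k >= 1 and p >= 0");
        }
        // The rank cannot exceed the number of rows.
        int k = Math.min(desiredK, m);
        // Oversampling gets whatever room is left under m (assumed policy).
        int p = Math.min(desiredP, m - k);
        return new int[] {k, p};
    }

    public static void main(String[] args) {
        int[] kp = fitRankToRows(20, 10, 15);
        System.out.println("k=" + kp[0] + ", p=" + kp[1]);
    }
}
```

With m smaller than the requested k, this degenerates to k=m and p=0. Note that it only checks the total row count; it says nothing about whether each individual split holds at least k+p rows.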
The only reason the solution doesn't warn you explicitly is that the DistributedRowMatrix format, which is just a sequence file of rows, does not give us an easy way to verify what m actually is before the solver iterates over it and runs into block size deficiency. So if you know m from external knowledge, it is easy to avoid being trapped by block height deficiency.

On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <[email protected]> wrote:
> This is only a test with some trivially simple data. I doubt there are any
> splits and yes it could easily be done in memory but that is not the purpose.
> It is based on testKmeansDSVD2, which is in
> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
> I've attached the modified and running version with testKmeansDSSVD
>
> As I said I don't think this is a real world test. It tests that the code
> runs, and it does. Getting the best results is not part of the scope. I just
> thought if there was an easy answer I could clean up the parameters for
> SSVDSolver.
>
> Since it is working I don't know that it's worth the effort unless people are
> likely to run into this with larger data sets.
>
> Thanks anyway.
>
>
> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> It happens because of internal constraints stemming from blocking. it
> happens when a split of A (input) has less than (k+p) rows at which
> point blocks are too small (or rather, too short) to successfully
> perform a QR on .
>
> This also means, among other things, k+p cannot be more than your
> total number of rows in the input.
>
> It is also possible that input A is way too wide or k+p is way too big
> so that an arbitrary split does not fetch at least k+p rows of A, but
> i haven't seen such cases in practice yet. If that
> happens, there's an option to increase minSplitSize (which would
> undermine MR mappers efficiency somewhat). But i am pretty sure it is
> not your case.
>
> But if your input is shorter than k+p, then it is a case too small for
> SSVD. in fact, it probably means you can solve test directly in memory
> with any solver. You can still use SSVD with k=m and p=0 (I think) in
> this case and get exact (non-reduced rank) decomposition equivalent
> with no stochastic effects, but that is not what it is for really.
>
> Assuming your input is m x n, can you tell me please what your m, n, k
> and p are?
>
> thanks.
> -D
>
> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>> There seems to be some internal constraint on k and/or p, which is making a
>> test difficult. The test has a very small input doc set and choosing the
>> wrong k it is very easy to get a failure with this message:
>>
>> java.lang.IllegalArgumentException: new m can't be less than n
>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>
>> I have a working test but I had to add some docs to the test data and have
>> tried to reverse engineer the value for k (desiredRank). I came up with the
>> following but I think it is only an accident that it works.
>>
>> int p = 15; //default value for CLI
>> int desiredRank = sampleData.size() - p - 1; //number of docs - p - 1, ?????? not sure why this works
>>
>> This seems likely to be an issue only because of the very small data set and
>> the relationship of rows to columns to p to k. But for the purposes of
>> creating a test if someone (Dmitriy?) could tell me how to calculate a
>> reasonable p and k from the dimensions of the tiny data set it would help.
>>
>> This test is derived from a non-active SVD test but I'd be up for cleaning
>> it up and including it as an example in the working but non-active tests. I
>> also fixed a couple trivial bugs in the non-active Lanczos tests for what
>> it's worth.
>>
>>
>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Reading "overview and usage" doc linked on that page
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>> should help to clarify outputs and usage.
>>
>>
>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>> Quoth Grant Ingersoll:
>>>>> To put this in bin/mahout speak, this would look like, munging some names
>>>>> and taking liberties with the actual argument to be passed in:
>>>>>
>>>>> bin/mahout svd (original -> svdOut)
>>>>> bin/mahout cleansvd ...
>>>>> bin/mahout transpose svdOut -> svdT
>>>>> bin/mahout transpose original -> originalT
>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>> bin/mahout kmeans newMatrix
>>>>
>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>>> Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>
>>> No
>>>
>>>> if so where does SSVD put the equivalent of
>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V*
>>>> but SSVD creates V so should I transpose V* then run it through the
>>>> EigenVerificationJob?
>>> no
>>>
>>> SSVD is SVD, meaning it produces U and V with no further need to clean that
>>>
>>>> I get errors when I do so trying to figure out if I'm on the wrong track.
