BTW, if you really are trying to reduce dimensionality, you may want to consider the --pca option with SSVD. That (I think) will preserve much more of the data variance than just a clean SVD (i.e., it essentially runs a PCA space transformation on your data rather than just an SVD).
-d

On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <[email protected]> wrote:
> Got it. Well on to some real and much larger data sets then…
>
> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> I think Mahout's Lanczos actually requires external knowledge of the
> input size too, in part for similar reasons. SSVD doesn't, because it
> has no "other" reasons to know the input size. The fundamental
> assumption rank(input) >= rank(thin SVD) still stands, but the method
> doesn't have a goal of verifying it explicitly (which would be kind of
> hard); instead it either produces 0 eigenvectors or runs into block
> deficiency.
>
> It is, however, hard to assert whether block deficiency stemmed from
> input size deficiency or split size deficiency, and neither situation
> is typical for real-life SSVD applications, hence the error message is
> somewhat vague.
>
> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> The easy answer is to ensure (k+p) <= m. It is a mathematical
>> constraint, not a peculiarity of the method.
>>
>> The only reason the solution doesn't warn you explicitly is that the
>> DistributedRowMatrix format, which is just a sequence file of rows,
>> gives us no easy way to verify what m actually is before the job
>> iterates over the input and runs into block size deficiency. So if you
>> know m from external knowledge, it is easy to avoid being trapped by
>> block height deficiency.
>>
>>
>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <[email protected]> wrote:
>>> This is only a test with some trivially simple data. I doubt there are any
>>> splits and yes it could easily be done in memory but that is not the
>>> purpose. It is based on testKmeansDSVD2, which is in
>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>> I've attached the modified and running version with testKmeansDSSVD
>>>
>>> As I said I don't think this is a real world test.
>>> It tests that the code runs, and it does. Getting the best results is
>>> not part of the scope. I just thought if there was an easy answer I
>>> could clean up the parameters for SSVDSolver.
>>>
>>> Since it is working I don't know that it's worth the effort unless people
>>> are likely to run into this with larger data sets.
>>>
>>> Thanks anyway.
>>>
>>>
>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> It happens because of internal constraints stemming from blocking. It
>>> happens when a split of A (input) has fewer than (k+p) rows, at which
>>> point blocks are too small (or rather, too short) to successfully
>>> perform a QR on them.
>>>
>>> This also means, among other things, that k+p cannot be more than the
>>> total number of rows in the input.
>>>
>>> It is also possible that input A is way too wide or k+p is way too big,
>>> so that an arbitrary split does not fetch at least k+p rows of A, but
>>> in practice I haven't seen such cases yet. If that happens, there's an
>>> option to increase minSplitSize (which would undermine MR mapper
>>> efficiency somewhat). But I am pretty sure that is not your case.
>>>
>>> But if your input is shorter than k+p, then the case is too small for
>>> SSVD. In fact, it probably means you can solve the test directly in
>>> memory with any solver. You can still use SSVD with k=m and p=0 (I
>>> think) in this case and get an exact (non-reduced rank) decomposition
>>> equivalent with no stochastic effects, but that is not really what it
>>> is for.
>>>
>>> Assuming your input is m x n, can you tell me please what your m, n, k
>>> and p are?
>>>
>>> thanks.
>>> -D
>>>
>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>>>> There seems to be some internal constraint on k and/or p, which is making
>>>> a test difficult.
>>>> The test has a very small input doc set, and with the wrong k it is
>>>> very easy to get a failure with this message:
>>>>
>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>
>>>> I have a working test but I had to add some docs to the test data and have
>>>> tried to reverse engineer the value for k (desiredRank). I came up with
>>>> the following but I think it is only an accident that it works.
>>>>
>>>> int p = 15; // default value for CLI
>>>> int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1,
>>>> // ?????? not sure why this works
>>>>
>>>> This seems likely to be an issue only because of the very small data set
>>>> and the relationship of rows to columns to p to k. But for the purposes of
>>>> creating a test, if someone (Dmitriy?) could tell me how to calculate a
>>>> reasonable p and k from the dimensions of the tiny data set it would help.
>>>>
>>>> This test is derived from a non-active SVD test but I'd be up for cleaning
>>>> it up and including it as an example in the working but non-active tests.
>>>> I also fixed a couple of trivial bugs in the non-active Lanczos tests, for
>>>> what it's worth.
>>>>
>>>>
>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>> Reading the "overview and usage" doc linked on that page
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>> should help to clarify outputs and usage.
>>>>
>>>>
>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>>>> Quoth Grant Ingersoll:
>>>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>>>> names and taking liberties with the actual argument to be passed in:
>>>>>>>
>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>> bin/mahout cleansvd ...
>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>> bin/mahout transpose original -> originalT
>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>> bin/mahout kmeans newMatrix
>>>>>>
>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>>>>> Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>
>>>>> No
>>>>>
>>>>>> if so where does SSVD put the equivalent of
>>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in
>>>>>> V* but SSVD creates V so should I transpose V* then run it through the
>>>>>> EigenVerificationJob?
>>>>> no
>>>>>
>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean
>>>>> that
>>>>>
>>>>>> I get errors when I do so trying to figure out if I'm on the wrong track.
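For anyone sizing k and p for a tiny test input, the (k+p) <= m rule from the thread can be sketched roughly as below. This is only an illustration under stated assumptions: the class SsvdRankSketch and its two methods are hypothetical names made up here and are not part of Mahout, and m (the row count) is assumed to be known up front, for example from sampleData.size() in the test.

```java
// Hypothetical helper, NOT part of Mahout: clamps a requested rank k and
// oversampling p so that k + p never exceeds the number of input rows m.
final class SsvdRankSketch {

    /** Largest usable rank for m input rows: the requested k, capped at m. */
    static int clampRank(int m, int requestedK) {
        if (m < 1 || requestedK < 1) {
            throw new IllegalArgumentException("m and k must both be >= 1");
        }
        return Math.min(requestedK, m);
    }

    /** Largest usable oversampling for a given k, so that k + p <= m. */
    static int clampOversampling(int m, int k, int requestedP) {
        if (requestedP < 0 || k < 1 || k > m) {
            throw new IllegalArgumentException("need 1 <= k <= m and p >= 0");
        }
        return Math.min(requestedP, m - k);
    }

    public static void main(String[] args) {
        int m = 20;                           // assumed number of rows (docs)
        int k = clampRank(m, 10);             // requested rank of 10
        int p = clampOversampling(m, k, 15);  // 15 is the CLI default quoted above
        System.out.println("k=" + k + ", p=" + p + ", k+p=" + (k + p));
    }
}
```

Note that this only covers the whole-input constraint. Per the explanation above, each split of A also needs at least k+p rows, which this sketch does not check because it knows nothing about how the input is split.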
