another piece of information is that according to investigation we did, using q=0 seems to underestimate singular values somewhat and produces a little better decay than optimal values would have. which means variance retained would err on the optimistic side a bit (showing a little bit more variance retained than actual).
also i think retaining 99% is reasonable with big corpuses. this desirable threshold could probably be reduced a little bit to accomodate pragmatic resource capacity. On Fri, Aug 10, 2012 at 2:42 PM, Dmitriy Lyubimov <[email protected]> wrote: > note that to estimate variance retained approximately only, you > probably don't need to run it with q=1 so you can run with q=0 and not > request either V or U but singular values only. That will reduce > running time dramatically (perhaps up to 2-5 times faster compared to > with q=1 and U and V). > > On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <[email protected]> wrote: >> With PCA, there's a metric, something called "variance retained". >> >> One idea of mine to estimate k is described in footnote discussion on >> page 5. While it is not possible to compute "PCA variance retained" >> metric exactly with an application of a thin SVD (the metric assumes >> use of a full rank SVD) it is possible to infer upper estimate for k >> given target variance retained (say, 99%) or try some sort of >> polynomialy approximated value for the sum of all singular values >> given visible decay. Probably requires some simple code in R or matlab >> to get reasonable estimate. >> >> This technique requires running PCA one time and then estimate >> sufficient k given singular values produced on your corpus. If the >> action is repetetive and corpus is not changing drastically, you may >> infer if you spending too much (or too little) on k for future uses. >> >> But pragmatically i just use the best k my cluster can compute in the >> time i need. my corpus is relatively small and i don't run full corpus >> run too often so i can afford some time spent. >> >> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <[email protected]> wrote: >>> The built-in PCA option is one reason I wanted to try SSVD. Building the >>> test was to make sure I understood the external matrix operations before >>> diving in. I expect one primary decision is how to choose k for reduction. >>> I'm hoping to get some noise rejection so not using it for reduced matrix >>> size so much. We are starting with m = 500,000 and a million or so docs. We >>> get many dups and low value docs in a small web crawl, so lots of noise. >>> >>> You mention in your paper: >>> >>> "The valueof k + p directly impacts running time and memory requirements. >>> k+p=500 is probably more than reasonable. Typically k + p >>> is taken within range 20…200" >>> >>> So I guess we might start with >>> -p 15 (default) >>> -q 1 >>> -k 200 >>> >>> Is there any use in hand inspecting the eigen vectors before choosing the >>> final k? If so do you get those by choosing k nearly = m or is something >>> like k = 1000 (or ?) good enough to for inspection? >>> >>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <[email protected]> wrote: >>> >>> BTW if you really are trying to reduce dimensionality, you may want to >>> consider --pca option with SSVD, that [i think] will provide with much >>> better preserved data variance then just clean SVD (i.e. essentially >>> run a PCA space transformation on your data rather than just SVD) >>> >>> -d >>> >>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <[email protected]> wrote: >>>> Got it. Well on to some real and much larger data sets then… >>>> >>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <[email protected]> wrote: >>>> >>>> i think actually Mahout's Lanczos requires external knowledge of input >>>> size too, in part for similar reasons. SSVD doesn't because it doesn't >>>> have "other" reasons to know input size but fundamental assumption >>>> rank(input)>=rank(thin SVD) still stands about the input but the >>>> method doesn't have a goal of verifying it explicitly (which would be >>>> kind of hard), and instead either produces 0 eigenvectors or runs into >>>> block deficiency. >>>> >>>> It is however hard to assert whether block deficiency stemmed from >>>> input size deficiency vs. split size deficiency, and neither of >>>> situations is typical for a real-life SSVD applications, hence error >>>> message is somewhat vague. >>>> >>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <[email protected]> >>>> wrote: >>>>> The easy answer is to ensure (k+p)<= m. It is mathematical constraint, >>>>> not a method pecularity. >>>>> >>>>> The only reason the solution doesn't warn you explicitly is because >>>>> DistributedRowMatrix format, which is just a sequence file of rows, >>>>> would not provide us with an easy way to verify what m actually is >>>>> before it actually iterates over it and runs into block size >>>>> deficiency. So if you now m as an external knowledge, it is easy to >>>>> avoid being trapped by block height defiicency. >>>>> >>>>> >>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <[email protected]> wrote: >>>>>> This is only a test with some trivially simple data. I doubt there are >>>>>> any splits and yes it could easily be done in memory but that is not the >>>>>> purpose. It is based on testKmeansDSVD2, which is in >>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java >>>>>> I've attached the modified and running version with testKmeansDSSVD >>>>>> >>>>>> As I said I don't think this is a real world test. It tests that the >>>>>> code runs, and it does. Getting the best results is not part of the >>>>>> scope. I just thought if there was an easy answer I could clean up the >>>>>> parameters for SSVDSolver. >>>>>> >>>>>> Since it is working I don't know that it's worth the effort unless >>>>>> people are likely to run into this with larger data sets. >>>>>> >>>>>> Thanks anyway. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote: >>>>>> >>>>>> It happens because of internal constraints stemming from blocking. it >>>>>> happens when a split of A (input) has less than (k+p) rows at which >>>>>> point blocks are too small (or rather, to short) to successfully >>>>>> perform a QR on . >>>>>> >>>>>> This also means, among other things, k+p cannot be more than your >>>>>> total number of rows in the input. >>>>>> >>>>>> It is also possible that input A is way too wide or k+p is way too big >>>>>> so that an arbitrary split does not fetch at least k+p rows of A, but >>>>>> in practice i haven't seen such cases in practice yet. If that >>>>>> happens, there's an option to increase minSplitSize (which would >>>>>> undermine MR mappers efficiency somewhat). But i am pretty sure it is >>>>>> not your case. >>>>>> >>>>>> But if your input is shorter than k+p, then it is a case too small for >>>>>> SSVD. in fact, it probably means you can solve test directly in memory >>>>>> with any solver. You can still use SSVD with k=m and p=0 (I think) in >>>>>> this case and get exact (non-reduced rank) decomposition equivalent >>>>>> with no stochastic effects, but that is not what it is for really. >>>>>> >>>>>> Assuming your input is m x n, can you tell me please what your m, n, k >>>>>> and p are? >>>>>> >>>>>> thanks. >>>>>> -D >>>>>> >>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote: >>>>>>> There seems to be some internal constraint on k and/or p, which is >>>>>>> making a test difficult. The test has a very small input doc set and >>>>>>> choosing the wrong k it is very easy to get a failure with this message: >>>>>>> >>>>>>> java.lang.IllegalArgumentException: new m can't be less than n >>>>>>> at >>>>>>> org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109) >>>>>>> >>>>>>> I have a working test but I had to add some docs to the test data and >>>>>>> have tried to reverse engineer the value for k (desiredRank). I came up >>>>>>> with the following but I think it is only an accident that it works. >>>>>>> >>>>>>> int p = 15; //default value for CLI >>>>>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, >>>>>>> ?????? not sure why this works >>>>>>> >>>>>>> This seems likely to be an issue only because of the very small data >>>>>>> set and the relationship of rows to columns to p to k. But for the >>>>>>> purposes of creating a test if someone (Dmitriy?) could tell me how to >>>>>>> calculate a reasonable p and k from the dimensions of the tiny data set >>>>>>> it would help. >>>>>>> >>>>>>> This test is derived from a non-active SVD test but I'd be up for >>>>>>> cleaning it up and including it as an example in the working but >>>>>>> non-active tests. I also fixed a couple trivial bugs in the non-active >>>>>>> Lanczos tests for what it's worth. >>>>>>> >>>>>>> >>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote: >>>>>>> >>>>>>> Reading "overview and usage" doc linked on that page >>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition >>>>>>> should help to clarify outputs and usage. >>>>>>> >>>>>>> >>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> >>>>>>> wrote: >>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> >>>>>>>> wrote: >>>>>>>>> Quoth Grant Ingersoll: >>>>>>>>>> To put this in bin/mahout speak, this would look like, munging some >>>>>>>>>> names and taking liberties with the actual argument to be passed in: >>>>>>>>>> >>>>>>>>>> bin/mahout svd (original -> svdOut) >>>>>>>>>> bin/mahout cleansvd ... >>>>>>>>>> bin/mahout transpose svdOut -> svdT >>>>>>>>>> bin/mahout transpose original -> originalT >>>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix >>>>>>>>>> bin/mahout kmeans newMatrix >>>>>>>>> >>>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use >>>>>>>>> SSVDSolver. Does SSVD require the EigenVerificationJob to clean the >>>>>>>>> eigen vectors? >>>>>>>> >>>>>>>> No >>>>>>>> >>>>>>>>> if so where does SSVD put the equivalent of >>>>>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be >>>>>>>>> in V* but SSVD creates V so should I transpose V* then run it through >>>>>>>>> the EigenVerificationJob? >>>>>>>> no >>>>>>>> >>>>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean >>>>>>>> that >>>>>>>> >>>>>>>>> I get errors when I do so trying to figure out if I'm on the wrong >>>>>>>>> track. >>>>>>> >>>>>> >>>>>> >>>> >>>
