On Fri, Aug 10, 2012 at 2:48 PM, Dmitriy Lyubimov <[email protected]> wrote:
> another piece of information is that, according to an investigation we
> did, using q=0 seems to underestimate the singular values somewhat and
> produces a slightly better decay than the optimal values would have, which
> means the variance retained would err on the optimistic side a bit
> (showing a little more variance retained than actual).
>
> also i think retaining 99% is reasonable with big corpuses. this
sorry, i meant "retaining 99% of data variance during dimensionality
reduction is probably unreasonable with big corpuses".

> desirable threshold could probably be reduced a little bit to
> accommodate pragmatic resource capacity.
>
> On Fri, Aug 10, 2012 at 2:42 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> note that to estimate variance retained only approximately, you
>> probably don't need to run with q=1, so you can run with q=0 and
>> request only the singular values, not U or V. That will reduce the
>> running time dramatically (perhaps 2-5 times faster than with q=1
>> and U and V).
>>
>> On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> With PCA, there's a metric called "variance retained".
>>>
>>> One idea of mine for estimating k is described in the footnote
>>> discussion on page 5. While it is not possible to compute the "PCA
>>> variance retained" metric exactly with a thin SVD (the metric assumes
>>> use of a full-rank SVD), it is possible to infer an upper estimate for
>>> k given a target variance retained (say, 99%), or to try some sort of
>>> polynomially approximated value for the sum of all singular values
>>> given the visible decay. This probably requires some simple code in R
>>> or matlab to get a reasonable estimate.
>>>
>>> This technique requires running PCA once and then estimating a
>>> sufficient k given the singular values produced on your corpus. If the
>>> action is repetitive and the corpus is not changing drastically, you
>>> may infer whether you are spending too much (or too little) on k for
>>> future uses.
>>>
>>> But pragmatically i just use the best k my cluster can compute in the
>>> time i need. my corpus is relatively small and i don't run a
>>> full-corpus run too often, so i can afford some time spent.
>>>
>>> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <[email protected]> wrote:
>>>> The built-in PCA option is one reason I wanted to try SSVD.
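The k-estimation idea above can be sketched in a few lines. The thread suggests simple code in R or matlab; below is an equivalent minimal pure-Python illustration (not from the thread and not part of Mahout): given the singular values from a single thin SSVD run, it returns the smallest k whose squared singular values reach a target fraction of the observed spectral energy. Note the caveat discussed above: a thin decomposition truncates the spectrum, so the denominator misses the tail and the estimate errs on the optimistic side.

```python
def estimate_k(sigmas, target=0.99):
    """Smallest k whose leading singular values capture `target` of the
    energy visible in a thin decomposition.

    Caveat from the thread: the thin SVD truncates the spectrum, so the
    total below underestimates the full-rank energy and the variance
    retained reads a bit optimistic."""
    energies = [s * s for s in sigmas]
    total = sum(energies)
    running = 0.0
    for k, e in enumerate(energies, start=1):
        running += e
        if running / total >= target:
            return k
    return len(sigmas)

# With a fast-decaying spectrum, a small k already retains 99%:
print(estimate_k([10.0, 5.0, 1.0, 0.1]))  # -> 2
```

Since only the singular values are needed here, this fits the q=0, no-U, no-V run suggested above for a cheap first estimate.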
>>>> Building the test was to make sure I understood the external matrix
>>>> operations before diving in. I expect one primary decision is how to
>>>> choose k for the reduction. I'm hoping to get some noise rejection,
>>>> so I'm not using it so much for reduced matrix size. We are starting
>>>> with m = 500,000 and a million or so docs. We get many dups and
>>>> low-value docs in a small web crawl, so lots of noise.
>>>>
>>>> You mention in your paper:
>>>>
>>>> "The value of k + p directly impacts running time and memory
>>>> requirements. k+p=500 is probably more than reasonable. Typically
>>>> k + p is taken within range 20…200"
>>>>
>>>> So I guess we might start with
>>>> -p 15 (default)
>>>> -q 1
>>>> -k 200
>>>>
>>>> Is there any use in hand-inspecting the eigenvectors before choosing
>>>> the final k? If so, do you get those by choosing k nearly = m, or is
>>>> something like k = 1000 (or ?) good enough for inspection?
>>>>
>>>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>> BTW if you really are trying to reduce dimensionality, you may want to
>>>> consider the --pca option with SSVD, which [i think] will provide much
>>>> better preserved data variance than just a clean SVD (i.e. essentially
>>>> run a PCA space transformation on your data rather than just SVD)
>>>>
>>>> -d
>>>>
>>>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <[email protected]> wrote:
>>>>> Got it. Well, on to some real and much larger data sets then…
>>>>>
>>>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>> i think actually Mahout's Lanczos requires external knowledge of the
>>>>> input size too, in part for similar reasons.
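Regarding the question above about inspecting results before fixing the final k: rather than rerunning with k near m, one cheap alternative is to look at the decay of the singular values from a single moderate-k run. A hypothetical helper (not part of Mahout, just an illustration) that tabulates the drop ratio between consecutive singular values, to eyeball where the spectrum flattens:

```python
def decay_table(sigmas):
    """Return (index, sigma, ratio-to-previous) for each singular value.

    A ratio staying near 1.0 over a long stretch suggests the spectrum
    has flattened and extra dimensions mostly add noise."""
    rows, prev = [], None
    for i, s in enumerate(sigmas, start=1):
        rows.append((i, s, 1.0 if prev is None else s / prev))
        prev = s
    return rows

for i, s, r in decay_table([8.0, 4.0, 3.8, 3.7]):
    print(f"k={i}  sigma={s:.2f}  drop={r:.2f}")
```

In the toy spectrum above, the sharp drop after the first value followed by ratios near 1.0 would suggest most of the structure is in the leading dimensions.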
>>>>> SSVD doesn't, because it doesn't have "other" reasons to know the
>>>>> input size, but the fundamental assumption rank(input) >= rank(thin
>>>>> SVD) still stands about the input; the method just doesn't have the
>>>>> goal of verifying it explicitly (which would be kind of hard), and
>>>>> instead either produces 0 eigenvectors or runs into block deficiency.
>>>>>
>>>>> It is, however, hard to assert whether block deficiency stemmed from
>>>>> input size deficiency vs. split size deficiency, and neither
>>>>> situation is typical for real-life SSVD applications, hence the
>>>>> error message is somewhat vague.
>>>>>
>>>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>> The easy answer is to ensure (k+p) <= m. It is a mathematical
>>>>>> constraint, not a peculiarity of the method.
>>>>>>
>>>>>> The only reason the solution doesn't warn you explicitly is that the
>>>>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>>>>> would not provide us with an easy way to verify what m actually is
>>>>>> before it iterates over the input and runs into block size
>>>>>> deficiency. So if you know m as external knowledge, it is easy to
>>>>>> avoid being trapped by block height deficiency.
>>>>>>
>>>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <[email protected]> wrote:
>>>>>>> This is only a test with some trivially simple data. I doubt there
>>>>>>> are any splits, and yes, it could easily be done in memory, but
>>>>>>> that is not the purpose. It is based on testKmeansDSVD2, which is in
>>>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>>>>> I've attached the modified and running version with testKmeansDSSVD.
>>>>>>>
>>>>>>> As I said, I don't think this is a real-world test. It tests that
>>>>>>> the code runs, and it does. Getting the best results is not part of
>>>>>>> the scope.
>>>>>>> I just thought if there was an easy answer I could clean up the
>>>>>>> parameters for SSVDSolver.
>>>>>>>
>>>>>>> Since it is working, I don't know that it's worth the effort unless
>>>>>>> people are likely to run into this with larger data sets.
>>>>>>>
>>>>>>> Thanks anyway.
>>>>>>>
>>>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>
>>>>>>> It happens because of internal constraints stemming from blocking.
>>>>>>> it happens when a split of A (the input) has fewer than (k+p) rows,
>>>>>>> at which point the blocks are too small (or rather, too short) to
>>>>>>> successfully perform a QR on.
>>>>>>>
>>>>>>> This also means, among other things, that k+p cannot be more than
>>>>>>> the total number of rows in your input.
>>>>>>>
>>>>>>> It is also possible that input A is way too wide, or k+p is way too
>>>>>>> big, so that an arbitrary split does not fetch at least k+p rows of
>>>>>>> A, but i haven't seen such cases in practice yet. If that happens,
>>>>>>> there's an option to increase minSplitSize (which would undermine
>>>>>>> MR mapper efficiency somewhat). But i am pretty sure that is not
>>>>>>> your case.
>>>>>>>
>>>>>>> But if your input is shorter than k+p, then the case is too small
>>>>>>> for SSVD. in fact, it probably means you can solve the test
>>>>>>> directly in memory with any solver. You can still use SSVD with k=m
>>>>>>> and p=0 (I think) in this case and get an exact (non-reduced-rank)
>>>>>>> decomposition equivalent with no stochastic effects, but that is
>>>>>>> not really what it is for.
>>>>>>>
>>>>>>> Assuming your input is m x n, can you tell me please what your m,
>>>>>>> n, k and p are?
>>>>>>>
>>>>>>> thanks.
>>>>>>> -D
>>>>>>>
>>>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>>>>>>>> There seems to be some internal constraint on k and/or p, which is
>>>>>>>> making a test difficult.
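The two failure conditions described above (k+p exceeding the total row count m, and a single split fetching fewer than k+p rows) can be checked up front when m and rough row sizes are known externally. A hedged sketch, not Mahout internals: the byte estimate and the safety factor are assumptions for illustration only.

```python
def ssvd_params_ok(m, k, p):
    # The mathematical constraint from the thread: k + p must not exceed
    # the number of rows m in the input.
    return k + p <= m

def split_ok(rows_per_split, k, p):
    # Each mapper's block needs at least k+p rows for its QR step;
    # otherwise SSVD fails with a block-deficiency error.
    return rows_per_split >= k + p

def suggested_min_split_bytes(avg_row_bytes, k, p, safety=2):
    # If splits come up short, raise minSplitSize to roughly this many
    # bytes so each split fetches k+p rows with some margin (at a cost
    # in mapper parallelism, as noted in the thread). avg_row_bytes is
    # an estimate you would measure from your own data.
    return safety * (k + p) * avg_row_bytes
```

For example, with k=200, p=15 and rows averaging ~1 KB, a split would need to cover at least 215 rows, so a minSplitSize of a few hundred kilobytes would give comfortable margin.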
>>>>>>>> The test has a very small input doc set, and choosing the wrong k
>>>>>>>> makes it very easy to get a failure with this message:
>>>>>>>>
>>>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>>
>>>>>>>> I have a working test, but I had to add some docs to the test data
>>>>>>>> and have tried to reverse-engineer the value for k (desiredRank).
>>>>>>>> I came up with the following, but I think it is only an accident
>>>>>>>> that it works:
>>>>>>>>
>>>>>>>> int p = 15; // default value for CLI
>>>>>>>> int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1,
>>>>>>>> ?????? not sure why this works
>>>>>>>>
>>>>>>>> This seems likely to be an issue only because of the very small
>>>>>>>> data set and the relationship of rows to columns to p to k. But
>>>>>>>> for the purposes of creating a test, if someone (Dmitriy?) could
>>>>>>>> tell me how to calculate a reasonable p and k from the dimensions
>>>>>>>> of the tiny data set, it would help.
>>>>>>>>
>>>>>>>> This test is derived from a non-active SVD test, but I'd be up for
>>>>>>>> cleaning it up and including it as an example in the working but
>>>>>>>> non-active tests. I also fixed a couple of trivial bugs in the
>>>>>>>> non-active Lanczos tests, for what it's worth.
>>>>>>>>
>>>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Reading the "overview and usage" doc linked on that page
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>>>>> should help to clarify outputs and usage.
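Returning to the desiredRank formula quoted above: it may be less of an accident than it looks. With k = sampleData.size() - p - 1, we get k + p = m - 1, which satisfies the (k+p) <= m constraint Dmitriy states elsewhere in this thread, with one row to spare. A small sketch (the value of m is hypothetical, chosen only to make the arithmetic concrete):

```python
m = 20   # hypothetical: number of docs in the tiny test corpus
p = 15   # the CLI default oversampling

desired_rank = m - p - 1        # the test's reverse-engineered formula
assert desired_rank + p == m - 1
assert desired_rank + p <= m    # so the constraint (k + p) <= m holds
print(desired_rank)             # -> 4
```

This also explains why adding docs to the test data made the test pass: it raised m until a positive k satisfying k + p <= m existed.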
>>>>>>>>
>>>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>>>>>>>> Quoth Grant Ingersoll:
>>>>>>>>>>> To put this in bin/mahout speak, this would look like, munging
>>>>>>>>>>> some names and taking liberties with the actual arguments to be
>>>>>>>>>>> passed in:
>>>>>>>>>>>
>>>>>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>>>>>> bin/mahout cleansvd ...
>>>>>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>>>>>> bin/mahout transpose original -> originalT
>>>>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>>>>
>>>>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use
>>>>>>>>>> SSVDSolver. Does SSVD require the EigenVerificationJob to clean
>>>>>>>>>> the eigenvectors?
>>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>>> If so, where does SSVD put the equivalent of
>>>>>>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they
>>>>>>>>>> should be in V*, but SSVD creates V, so should I transpose V*
>>>>>>>>>> and then run it through the EigenVerificationJob?
>>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>> SSVD is SVD, meaning it produces U and V with no further need to
>>>>>>>>> clean them.
>>>>>>>>>
>>>>>>>>>> I get errors when I do, so I'm trying to figure out if I'm on
>>>>>>>>>> the wrong track.
