On Fri, Aug 10, 2012 at 2:48 PM, Dmitriy Lyubimov <[email protected]> wrote:
> another piece of information is that, according to an investigation we
> did, using q=0 seems to underestimate the singular values somewhat and
> produces a slightly better decay than the optimal values would have, which
> means the variance retained would err on the optimistic side a bit
> (showing a little more variance retained than actual).
>
> also i think retaining 99% is reasonable with big corpuses. this
sorry, i meant "retaining 99% of data variance during dimensionality
reduction is probably unreasonable with big corpuses".

> desirable threshold could probably be reduced a little bit to
> accommodate pragmatic resource capacity.
>
> On Fri, Aug 10, 2012 at 2:42 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> note that to estimate variance retained only approximately, you
>> probably don't need to run with q=1, so you can run with q=0 and
>> request only the singular values, not U or V. That will reduce the
>> running time dramatically (perhaps 2-5 times faster than with q=1
>> and U and V).
>>
>> On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> With PCA, there's a metric called "variance retained".
>>>
>>> One idea of mine for estimating k is described in the footnote
>>> discussion on page 5. While it is not possible to compute the "PCA
>>> variance retained" metric exactly with a thin SVD (the metric assumes
>>> use of a full-rank SVD), it is possible to infer an upper estimate for
>>> k given a target variance retained (say, 99%), or to try some sort of
>>> polynomially approximated value for the sum of all singular values
>>> given the visible decay. This probably requires some simple code in R
>>> or matlab to get a reasonable estimate.
>>>
>>> This technique requires running PCA once and then estimating a
>>> sufficient k given the singular values produced on your corpus. If the
>>> action is repetitive and the corpus is not changing drastically, you
>>> may infer whether you are spending too much (or too little) on k for
>>> future uses.
>>>
>>> But pragmatically i just use the best k my cluster can compute in the
>>> time i need. my corpus is relatively small and i don't run a
>>> full-corpus run too often, so i can afford some time spent.
>>>
>>> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <[email protected]> wrote:
>>>> The built-in PCA option is one reason I wanted to try SSVD.
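The k-estimation idea above can be sketched in a few lines. The thread suggests simple code in R or matlab; below is an equivalent minimal pure-Python illustration (not from the thread and not part of Mahout): given the singular values from a single thin SSVD run, it returns the smallest k whose squared singular values reach a target fraction of the observed spectral energy. Note the caveat discussed above: a thin decomposition truncates the spectrum, so the denominator misses the tail and the estimate errs on the optimistic side.

```python
def estimate_k(sigmas, target=0.99):
    """Smallest k whose leading singular values capture `target` of the
    energy visible in a thin decomposition.

    Caveat from the thread: the thin SVD truncates the spectrum, so the
    total below underestimates the full-rank energy and the variance
    retained reads a bit optimistic."""
    energies = [s * s for s in sigmas]
    total = sum(energies)
    running = 0.0
    for k, e in enumerate(energies, start=1):
        running += e
        if running / total >= target:
            return k
    return len(sigmas)

# With a fast-decaying spectrum, a small k already retains 99%:
print(estimate_k([10.0, 5.0, 1.0, 0.1]))  # -> 2
```

Since only the singular values are needed here, this fits the q=0, no-U, no-V run suggested above for a cheap first estimate.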
>>>> Building the test was to make sure I understood the external matrix
>>>> operations before diving in. I expect one primary decision is how to
>>>> choose k for the reduction. I'm hoping to get some noise rejection,
>>>> so I'm not using it so much for reduced matrix size. We are starting
>>>> with m = 500,000 and a million or so docs. We get many dups and
>>>> low-value docs in a small web crawl, so lots of noise.
>>>>
>>>> You mention in your paper:
>>>>
>>>> "The value of k + p directly impacts running time and memory
>>>> requirements. k+p=500 is probably more than reasonable. Typically
>>>> k + p is taken within range 20…200"
>>>>
>>>> So I guess we might start with
>>>> -p 15 (default)
>>>> -q 1
>>>> -k 200
>>>>
>>>> Is there any use in hand-inspecting the eigenvectors before choosing
>>>> the final k? If so, do you get those by choosing k nearly = m, or is
>>>> something like k = 1000 (or ?) good enough for inspection?
>>>>
>>>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>> BTW if you really are trying to reduce dimensionality, you may want to
>>>> consider the --pca option with SSVD, which [i think] will provide much
>>>> better preserved data variance than just a clean SVD (i.e. essentially
>>>> run a PCA space transformation on your data rather than just SVD)
>>>>
>>>> -d
>>>>
>>>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <[email protected]> wrote:
>>>>> Got it. Well, on to some real and much larger data sets then…
>>>>>
>>>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>> i think actually Mahout's Lanczos requires external knowledge of the
>>>>> input size too, in part for similar reasons.
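Regarding the question above about inspecting results before fixing the final k: rather than rerunning with k near m, one cheap alternative is to look at the decay of the singular values from a single moderate-k run. A hypothetical helper (not part of Mahout, just an illustration) that tabulates the drop ratio between consecutive singular values, to eyeball where the spectrum flattens:

```python
def decay_table(sigmas):
    """Return (index, sigma, ratio-to-previous) for each singular value.

    A ratio staying near 1.0 over a long stretch suggests the spectrum
    has flattened and extra dimensions mostly add noise."""
    rows, prev = [], None
    for i, s in enumerate(sigmas, start=1):
        rows.append((i, s, 1.0 if prev is None else s / prev))
        prev = s
    return rows

for i, s, r in decay_table([8.0, 4.0, 3.8, 3.7]):
    print(f"k={i}  sigma={s:.2f}  drop={r:.2f}")
```

In the toy spectrum above, the sharp drop after the first value followed by ratios near 1.0 would suggest most of the structure is in the leading dimensions.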
>>>>> SSVD doesn't, because it doesn't have "other" reasons to know the
>>>>> input size, but the fundamental assumption rank(input) >= rank(thin
>>>>> SVD) still stands about the input; the method just doesn't have the
>>>>> goal of verifying it explicitly (which would be kind of hard), and
>>>>> instead either produces 0 eigenvectors or runs into block deficiency.
>>>>>
>>>>> It is, however, hard to assert whether block deficiency stemmed from
>>>>> input size deficiency vs. split size deficiency, and neither
>>>>> situation is typical for real-life SSVD applications, hence the
>>>>> error message is somewhat vague.
>>>>>
>>>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>> The easy answer is to ensure (k+p) <= m. It is a mathematical
>>>>>> constraint, not a peculiarity of the method.
>>>>>>
>>>>>> The only reason the solution doesn't warn you explicitly is that the
>>>>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>>>>> would not provide us with an easy way to verify what m actually is
>>>>>> before it iterates over the input and runs into block size
>>>>>> deficiency. So if you know m as external knowledge, it is easy to
>>>>>> avoid being trapped by block height deficiency.
>>>>>>
>>>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <[email protected]> wrote:
>>>>>>> This is only a test with some trivially simple data. I doubt there
>>>>>>> are any splits, and yes, it could easily be done in memory, but
>>>>>>> that is not the purpose. It is based on testKmeansDSVD2, which is in
>>>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>>>>> I've attached the modified and running version with testKmeansDSSVD.
>>>>>>>
>>>>>>> As I said, I don't think this is a real-world test. It tests that
>>>>>>> the code runs, and it does. Getting the best results is not part of
>>>>>>> the scope.
>>>>>>> I just thought if there was an easy answer I could clean up the
>>>>>>> parameters for SSVDSolver.
>>>>>>>
>>>>>>> Since it is working, I don't know that it's worth the effort unless
>>>>>>> people are likely to run into this with larger data sets.
>>>>>>>
>>>>>>> Thanks anyway.
>>>>>>>
>>>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>
>>>>>>> It happens because of internal constraints stemming from blocking.
>>>>>>> it happens when a split of A (the input) has fewer than (k+p) rows,
>>>>>>> at which point the blocks are too small (or rather, too short) to
>>>>>>> successfully perform a QR on.
>>>>>>>
>>>>>>> This also means, among other things, that k+p cannot be more than
>>>>>>> the total number of rows in your input.
>>>>>>>
>>>>>>> It is also possible that input A is way too wide, or k+p is way too
>>>>>>> big, so that an arbitrary split does not fetch at least k+p rows of
>>>>>>> A, but i haven't seen such cases in practice yet. If that happens,
>>>>>>> there's an option to increase minSplitSize (which would undermine
>>>>>>> MR mapper efficiency somewhat). But i am pretty sure that is not
>>>>>>> your case.
>>>>>>>
>>>>>>> But if your input is shorter than k+p, then the case is too small
>>>>>>> for SSVD. in fact, it probably means you can solve the test
>>>>>>> directly in memory with any solver. You can still use SSVD with k=m
>>>>>>> and p=0 (I think) in this case and get an exact (non-reduced-rank)
>>>>>>> decomposition equivalent with no stochastic effects, but that is
>>>>>>> not really what it is for.
>>>>>>>
>>>>>>> Assuming your input is m x n, can you tell me please what your m,
>>>>>>> n, k and p are?
>>>>>>>
>>>>>>> thanks.
>>>>>>> -D
>>>>>>>
>>>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>>>>>>>> There seems to be some internal constraint on k and/or p, which is
>>>>>>>> making a test difficult.
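The two failure conditions described above (k+p exceeding the total row count m, and a single split fetching fewer than k+p rows) can be checked up front when m and rough row sizes are known externally. A hedged sketch, not Mahout internals: the byte estimate and the safety factor are assumptions for illustration only.

```python
def ssvd_params_ok(m, k, p):
    # The mathematical constraint from the thread: k + p must not exceed
    # the number of rows m in the input.
    return k + p <= m

def split_ok(rows_per_split, k, p):
    # Each mapper's block needs at least k+p rows for its QR step;
    # otherwise SSVD fails with a block-deficiency error.
    return rows_per_split >= k + p

def suggested_min_split_bytes(avg_row_bytes, k, p, safety=2):
    # If splits come up short, raise minSplitSize to roughly this many
    # bytes so each split fetches k+p rows with some margin (at a cost
    # in mapper parallelism, as noted in the thread). avg_row_bytes is
    # an estimate you would measure from your own data.
    return safety * (k + p) * avg_row_bytes
```

For example, with k=200, p=15 and rows averaging ~1 KB, a split would need to cover at least 215 rows, so a minSplitSize of a few hundred kilobytes would give comfortable margin.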
>>>>>>>> The test has a very small input doc set, and choosing the wrong k
>>>>>>>> makes it very easy to get a failure with this message:
>>>>>>>>
>>>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>>
>>>>>>>> I have a working test, but I had to add some docs to the test data
>>>>>>>> and have tried to reverse-engineer the value for k (desiredRank).
>>>>>>>> I came up with the following, but I think it is only an accident
>>>>>>>> that it works:
>>>>>>>>
>>>>>>>> int p = 15; // default value for CLI
>>>>>>>> int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1,
>>>>>>>> ?????? not sure why this works
>>>>>>>>
>>>>>>>> This seems likely to be an issue only because of the very small
>>>>>>>> data set and the relationship of rows to columns to p to k. But
>>>>>>>> for the purposes of creating a test, if someone (Dmitriy?) could
>>>>>>>> tell me how to calculate a reasonable p and k from the dimensions
>>>>>>>> of the tiny data set, it would help.
>>>>>>>>
>>>>>>>> This test is derived from a non-active SVD test, but I'd be up for
>>>>>>>> cleaning it up and including it as an example in the working but
>>>>>>>> non-active tests. I also fixed a couple of trivial bugs in the
>>>>>>>> non-active Lanczos tests, for what it's worth.
>>>>>>>>
>>>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Reading the "overview and usage" doc linked on that page
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>>>>> should help to clarify outputs and usage.
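Returning to the desiredRank formula quoted above: it may be less of an accident than it looks. With k = sampleData.size() - p - 1, we get k + p = m - 1, which satisfies the (k+p) <= m constraint Dmitriy states elsewhere in this thread, with one row to spare. A small sketch (the value of m is hypothetical, chosen only to make the arithmetic concrete):

```python
m = 20   # hypothetical: number of docs in the tiny test corpus
p = 15   # the CLI default oversampling

desired_rank = m - p - 1        # the test's reverse-engineered formula
assert desired_rank + p == m - 1
assert desired_rank + p <= m    # so the constraint (k + p) <= m holds
print(desired_rank)             # -> 4
```

This also explains why adding docs to the test data made the test pass: it raised m until a positive k satisfying k + p <= m existed.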
>>>>>>>>
>>>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>>>>>>>> Quoth Grant Ingersoll:
>>>>>>>>>>> To put this in bin/mahout speak, this would look like, munging
>>>>>>>>>>> some names and taking liberties with the actual arguments to be
>>>>>>>>>>> passed in:
>>>>>>>>>>>
>>>>>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>>>>>> bin/mahout cleansvd ...
>>>>>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>>>>>> bin/mahout transpose original -> originalT
>>>>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>>>>
>>>>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use
>>>>>>>>>> SSVDSolver. Does SSVD require the EigenVerificationJob to clean
>>>>>>>>>> the eigenvectors?
>>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>>> If so, where does SSVD put the equivalent of
>>>>>>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they
>>>>>>>>>> should be in V*, but SSVD creates V, so should I transpose V*
>>>>>>>>>> and then run it through the EigenVerificationJob?
>>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>> SSVD is SVD, meaning it produces U and V with no further need to
>>>>>>>>> clean them.
>>>>>>>>>
>>>>>>>>>> I get errors when I do, so I'm trying to figure out if I'm on
>>>>>>>>>> the wrong track.
