I guess there's one more clarification that might be useful. SSVD in fact creates a decomposition of rank (k+p), where p is called "oversampling"; the extra p dimensions capture additional plausible directions of high variance in the data, in case we guessed the first k wrong. It then throws away the last p singular values and vectors. That reduces rounding errors due to an imperfectly guessed projection.
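To make the shapes concrete, here is a small self-contained sketch (plain Java, not Mahout's actual implementation) of the bookkeeping: the random projection Y = A·Ω has k+p columns, so a QR of Y needs at least k+p rows of A, and after the small decomposition the last p singular values are discarded. The class and method names here are illustrative only.

```java
import java.util.Random;

/** Toy illustration of SSVD oversampling shapes; not Mahout code. */
public class OversamplingSketch {

    /** Y = A * Omega, where A is m x n and Omega is n x (k+p). */
    static double[][] project(double[][] a, double[][] omega) {
        int m = a.length, n = a[0].length, kp = omega[0].length;
        double[][] y = new double[m][kp];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < kp; j++)
                for (int t = 0; t < n; t++)
                    y[i][j] += a[i][t] * omega[t][j];
        return y;
    }

    /** Keep only the first k of the k+p computed singular values. */
    static double[] truncate(double[] singularValues, int k) {
        double[] kept = new double[k];
        System.arraycopy(singularValues, 0, kept, 0, k);
        return kept;
    }

    public static void main(String[] args) {
        int m = 100, n = 50, k = 10, p = 15;      // note k + p <= min(m, n)
        Random rnd = new Random(42);
        double[][] a = new double[m][n];
        double[][] omega = new double[n][k + p];  // random Gaussian projection
        for (double[] row : a)
            for (int j = 0; j < n; j++) row[j] = rnd.nextGaussian();
        for (double[] row : omega)
            for (int j = 0; j < k + p; j++) row[j] = rnd.nextGaussian();

        // Y is m x (k+p); a QR of Y is only possible when m >= k+p.
        double[][] y = project(a, omega);
        System.out.println(y.length + " x " + y[0].length);  // 100 x 25

        // Pretend these came from the SVD of the small projected matrix:
        double[] sv = new double[k + p];
        for (int i = 0; i < k + p; i++) sv[i] = k + p - i;
        System.out.println(truncate(sv, k).length);          // 10
    }
}
```

The point of the sketch is only the dimension arithmetic; the real solver does the QR and small SVD in a distributed, blocked fashion, which is where the per-split row constraint below comes from.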
In the corner case when k+p = m, we get a full-rank SVD. The assumption of SSVD is that m >> (k+p), but it will still work as long as (k+p) <= m, including the full-rank decomposition when p=0 and k=m. (It also assumes that n >> k+p.)

-d

On Fri, Aug 10, 2012 at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
> It happens because of internal constraints stemming from blocking. It
> happens when a split of A (the input) has fewer than (k+p) rows, at which
> point the blocks are too small (or rather, too short) to successfully
> perform a QR on.
>
> This also means, among other things, that k+p cannot be more than the
> total number of rows in your input.
>
> It is also possible that the input A is so wide, or k+p so big,
> that an arbitrary split does not fetch at least k+p rows of A, but
> I haven't seen such a case in practice yet. If that
> happens, there's an option to increase minSplitSize (which would
> undermine MR mapper efficiency somewhat). But I am pretty sure that is
> not your case.
>
> But if your input is shorter than k+p, then the case is too small for
> SSVD. In fact, it probably means you can solve the test directly in memory
> with any solver. You can still use SSVD with k=m and p=0 (I think) in
> this case and get an exact (non-reduced-rank) equivalent decomposition
> with no stochastic effects, but that is not really what it is for.
>
> Assuming your input is m x n, can you tell me please what your m, n, k
> and p are?
>
> thanks.
> -D
>
> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>> There seems to be some internal constraint on k and/or p, which is making a
>> test difficult.
>> The test has a very small input doc set, and choosing the
>> wrong k makes it very easy to get a failure with this message:
>>
>> java.lang.IllegalArgumentException: new m can't be less than n
>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>
>> I have a working test, but I had to add some docs to the test data and have
>> tried to reverse engineer the value for k (desiredRank). I came up with the
>> following, but I think it is only an accident that it works:
>>
>>     int p = 15; // default value for the CLI
>>     int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1; not sure why this works
>>
>> This seems likely to be an issue only because of the very small data set and
>> the relationship of rows to columns to p to k. But for the purposes of
>> creating a test, if someone (Dmitriy?) could tell me how to calculate a
>> reasonable p and k from the dimensions of the tiny data set, it would help.
>>
>> This test is derived from a non-active SVD test, but I'd be up for cleaning
>> it up and including it as an example in the working but non-active tests. I
>> also fixed a couple of trivial bugs in the non-active Lanczos tests, for what
>> it's worth.
>>
>>
>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Reading the "overview and usage" doc linked on that page,
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>> should help to clarify outputs and usage.
>>
>>
>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>> Quoth Grant Ingersoll:
>>>>> To put this in bin/mahout speak, this would look like, munging some names
>>>>> and taking liberties with the actual arguments to be passed in:
>>>>>
>>>>> bin/mahout svd (original -> svdOut)
>>>>> bin/mahout cleansvd ...
>>>>> bin/mahout transpose svdOut -> svdT
>>>>> bin/mahout transpose original -> originalT
>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>> bin/mahout kmeans newMatrix
>>>>
>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>>> Does SSVD require the EigenVerificationJob to clean the eigenvectors?
>>>
>>> No.
>>>
>>>> If so, where does SSVD put the equivalent of
>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V*,
>>>> but SSVD creates V, so should I transpose V* and then run it through the
>>>> EigenVerificationJob?
>>>
>>> No.
>>>
>>> SSVD is SVD, meaning it produces U and V with no further need to clean them.
>>>
>>>> I get errors when I do so; trying to figure out if I'm on the wrong track.
>>
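Pulling the thread's constraints together: the GivensThinSolver failure boils down to k+p exceeding the rows available, and Pat's reverse-engineered formula (desiredRank = sampleData.size() - p - 1) works precisely because it keeps k+p strictly below the row count m. Here is a hypothetical helper, not part of Mahout's API (the names `validate` and `maxRank` are made up), that encodes the rules Dmitriy describes:

```java
/** Hypothetical helpers encoding the thread's SSVD rank constraints; not Mahout API. */
public class SsvdRanks {

    /** Rejects rank parameters that violate k+p <= min(m, n). */
    static void validate(long m, long n, int k, int p) {
        if (k <= 0 || p < 0)
            throw new IllegalArgumentException("need k > 0 and p >= 0");
        if (k + p > m || k + p > n)
            throw new IllegalArgumentException(
                "k+p=" + (k + p) + " must not exceed min(m,n)=" + Math.min(m, n));
    }

    /** Largest usable k for a given row count m and oversampling p. */
    static int maxRank(int m, int p) {
        int k = m - p;
        if (k < 1)
            throw new IllegalArgumentException(
                "m=" + m + " rows is too small for oversampling p=" + p);
        return k;
    }

    public static void main(String[] args) {
        // A tiny test corpus like Pat's: m docs, CLI-default oversampling p=15.
        int m = 30, p = 15;
        int k = maxRank(m, p);          // 15: k+p = m exactly, the boundary case
        validate(m, m, k, p);           // passes
        System.out.println("k=" + k);

        // Pat's reverse-engineered formula stays strictly inside the bound:
        int desiredRank = m - p - 1;    // 14, so k+p = m-1 < m
        validate(m, m, desiredRank, p);
        System.out.println("desiredRank=" + desiredRank);
    }
}
```

Note this checks only the global row count; as Dmitriy points out, blocking also requires each input split to hold at least k+p rows, which may need a larger minSplitSize on wide or oddly split inputs.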
