I guess some strategy like this will work for a small-size test:

k = ...  // desired decomposition rank
p = ...  // oversampling parameter
m = ...  // number of rows, known externally
if (k + p > m) {
  p = m - k;
  if (p < 0) {
    k += p;
    p = 0;
  }
}
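Spelled out as a compilable sketch (plain Java; the class and method names here are mine for illustration, not anything in Mahout):

```java
// Clamp a requested rank k and oversampling p so that k + p <= m,
// where m is the total number of rows of the input. Hypothetical
// helper, not part of the Mahout API.
public class RankClamp {

    // Returns {k, p} adjusted so that k + p <= m, both non-negative.
    static int[] clamp(int k, int p, int m) {
        if (k + p > m) {
            p = m - k;       // shrink the oversampling first
            if (p < 0) {     // k alone already exceeds m
                k += p;      // reduce k down to m
                p = 0;
            }
        }
        return new int[] {k, p};
    }

    public static void main(String[] args) {
        // e.g. a tiny test input with m = 100 rows
        int[] kp = clamp(200, 15, 100);
        System.out.println("k=" + kp[0] + " p=" + kp[1]); // k=100 p=0
    }
}
```

With m = 100, asking for k = 200, p = 15 collapses to k = 100, p = 0, i.e. the exact-decomposition corner case mentioned below.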
On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
> The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
> not a peculiarity of the method.
>
> The only reason the solver doesn't warn you explicitly is that the
> DistributedRowMatrix format, which is just a sequence file of rows,
> does not provide an easy way to verify what m actually is before it
> actually iterates over the input and runs into the block-size
> deficiency. So if you know m as external knowledge, it is easy to
> avoid being trapped by the block-height deficiency.
>
>
> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <[email protected]> wrote:
>> This is only a test with some trivially simple data. I doubt there are any
>> splits and yes it could easily be done in memory but that is not the
>> purpose. It is based on testKmeansDSVD2, which is in
>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>> I've attached the modified and running version with testKmeansDSSVD
>>
>> As I said I don't think this is a real world test. It tests that the code
>> runs, and it does. Getting the best results is not part of the scope. I just
>> thought if there was an easy answer I could clean up the parameters for
>> SSVDSolver.
>>
>> Since it is working I don't know that it's worth the effort unless people
>> are likely to run into this with larger data sets.
>>
>> Thanks anyway.
>>
>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> It happens because of internal constraints stemming from blocking. It
>> happens when a split of A (the input) has fewer than (k+p) rows, at
>> which point the blocks are too small (or rather, too short) to
>> successfully perform a QR on.
>>
>> This also means, among other things, k+p cannot be more than your
>> total number of rows in the input.
>>
>> It is also possible that the input A is so wide, or k+p so big, that
>> an arbitrary split does not fetch at least k+p rows of A, but I
>> haven't seen such cases in practice yet. If that happens, there's an
>> option to increase minSplitSize (which would undermine MR mapper
>> efficiency somewhat). But I am pretty sure that is not your case.
>>
>> But if your input is shorter than k+p, then the case is too small for
>> SSVD. In fact, it probably means you can solve the test directly in
>> memory with any solver. You could still use SSVD with k=m and p=0 (I
>> think) in this case and get an equivalent exact (non-reduced-rank)
>> decomposition with no stochastic effects, but that is not really what
>> it is for.
>>
>> Assuming your input is m x n, can you tell me please what your m, n, k
>> and p are?
>>
>> thanks.
>> -D
>>
>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <[email protected]> wrote:
>>> There seems to be some internal constraint on k and/or p, which is making a
>>> test difficult. The test has a very small input doc set, and with the
>>> wrong k it is very easy to get a failure with this message:
>>>
>>> java.lang.IllegalArgumentException: new m can't be less than n
>>> at
>>> org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>
>>> I have a working test but I had to add some docs to the test data and have
>>> tried to reverse engineer the value for k (desiredRank). I came up with the
>>> following but I think it is only an accident that it works.
>>>
>>> int p = 15; // default value for the CLI
>>> int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1;
>>> // not sure why this works
>>>
>>> This seems likely to be an issue only because of the very small data set
>>> and the relationship of rows to columns to p to k. But for the purposes of
>>> creating a test if someone (Dmitriy?) could tell me how to calculate a
>>> reasonable p and k from the dimensions of the tiny data set it would help.
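[Annotating inline: the reverse-engineered formula above is consistent with the (k+p) <= m constraint, i.e. k can be at most m - p, and the extra "- 1" just leaves a one-row margin. A minimal check, with my own names rather than Mahout's:]

```java
// Sanity check for the reverse-engineered rank formula: SSVD needs
// k + p <= m (rows of the input), so the largest admissible rank is
// m - p. Hypothetical helper, not part of Mahout.
public class SafeRank {

    // Largest k admissible for m rows and oversampling p.
    static int maxRank(int m, int p) {
        return m - p;
    }

    public static void main(String[] args) {
        int numDocs = 20;                  // tiny test corpus: m = 20 rows
        int p = 15;                        // the CLI default oversampling
        int desiredRank = numDocs - p - 1; // the formula from the test: 4
        // 4 <= 5, so the constraint holds with one row to spare
        System.out.println(desiredRank <= maxRank(numDocs, p)); // true
    }
}
```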
>>>
>>> This test is derived from a non-active SVD test, but I'd be up for cleaning
>>> it up and including it as an example among the working but non-active tests.
>>> I also fixed a couple of trivial bugs in the non-active Lanczos tests, for
>>> what it's worth.
>>>
>>>
>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> Reading "overview and usage" doc linked on that page
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>> should help to clarify outputs and usage.
>>>
>>>
>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <[email protected]> wrote:
>>>>> Quoth Grant Ingersoll:
>>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>>> names and taking liberties with the actual argument to be passed in:
>>>>>>
>>>>>> bin/mahout svd (original -> svdOut)
>>>>>> bin/mahout cleansvd ...
>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>> bin/mahout transpose original -> originalT
>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>> bin/mahout kmeans newMatrix
>>>>>
>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>>>> Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>
>>>> No
>>>>
>>>>> if so where does SSVD put the equivalent of
>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in
>>>>> V* but SSVD creates V so should I transpose V* then run it through the
>>>>> EigenVerificationJob?
>>>> no
>>>>
>>>> SSVD is SVD, meaning it produces U and V with no further need to clean them.
>>>>
>>>>> I get errors when I do, so I'm trying to figure out if I'm on the wrong track.
>>>
>>
>>