OMG, sorry but I am a complete idiot. RowId creates a "docIndex" in the matrix dir. Once I specified the full path to the Distributed Row Matrix, everything was fine.
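For anyone else who hits this: rowid writes two sequence files into its output directory, and only one of them is the Distributed Row Matrix. A rough sketch of the layout and the fix, using the b2/matrix path from later in this thread (the key/value classes shown are what RowIdJob writes, as far as I know):

    # rowid output directory:
    #   b2/matrix/docIndex   <IntWritable, Text>             row id -> original document key
    #   b2/matrix/matrix     <IntWritable, VectorWritable>   the DRM itself
    #
    # Point ssvd at the matrix file, not the directory, so docIndex
    # (whose values are Text) is never picked up as SSVD input:
    mahout ssvd -i b2/matrix/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t 10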
On Aug 20, 2012, at 11:09 AM, Dmitriy Lyubimov <[email protected]> wrote:

On Mon, Aug 20, 2012 at 11:03 AM, Dmitriy Lyubimov <[email protected]> wrote:
> Ok, this just means that something in the A input is not really adhering to the
> <Writable,VectorWritable> specification. In particular, there seems to be a file
> in the input path that has a <?,VectorWritable> pair in its input.

sorry, this should read

> there seems to be a file in the input path that has a <?,Text> pair in its input.

Input seems to have Text values somewhere.

> Can you check your input files for key/value types? Note that this includes the
> entire subtree of sequence files, not just the files in the input directory.
>
> Usually it is visible in the header of the sequence file (usually even if it is
> using compression).
>
> I am not quite sure what you mean by "rowid" processing.
>
>
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <[email protected]> wrote:
>> Getting an odd error on SSVD.
>>
>> Starting with the QJob I get 9 map tasks for the data set; 8 are run on the mini
>> cluster in parallel. Most of them complete with no errors, but there are usually
>> two map task failures for each QJob. They die with this error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.mahout.math.VectorWritable
>>   at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:416)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> The data was created using seq2sparse and then running rowid to create the input
>> matrix. The data was encoded as named vectors. These are the two differences I
>> could think of between how I ran it from the API and from the CLI.
>>
>>
>> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <[email protected]> wrote:
>>
>> -t Param
>>
>> I'm no Hadoop expert, but there are a couple of parameters for each node in a
>> cluster that specify the default number of mappers and reducers for that node.
>> There is a rule of thumb about how many mappers and reducers to run per core.
>> You can tweak them either way depending on your typical jobs.
>>
>> No idea what you mean about the total reducers being 1 for most configs. My very
>> small cluster at home, with 10 cores in three machines, is configured to produce
>> a conservative 10 mappers and 10 reducers, which is about what happens with
>> balanced jobs. The "reducers = 1" is probably for a non-clustered, one-machine
>> setup.
>>
>> I'm suspicious that the -t parameter is not needed, but I would definitely defer
>> to a Hadoop master. In any case I set it to 10 for my mini cluster.
>>
>> Variance Retained
>>
>> If one batch of data yields a greatly different estimate of VR than another, it
>> would be worth noticing, even if we don't know the actual error in it. To say
>> that your estimate of VR is valueless would require that we have some experience
>> with it, no?
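Regarding Dmitriy's suggestion above to check the key/value types of every sequence file under the input path: the class names are stored as plain strings near the top of each file's header, so a quick check needs nothing more than standard tools. A sketch, using the paths from this thread (the printed class names are only illustrative):

    # list every file under the input path, then peek at a header;
    # the key and value class names are readable even with compression on:
    hadoop fs -lsr b2/matrix
    hadoop fs -cat b2/matrix/docIndex | head -c 300 | strings
    #   org.apache.hadoop.io.IntWritable
    #   org.apache.hadoop.io.Text

mahout seqdumper will also report the key and value classes of a sequence file before dumping its entries, if you prefer a Mahout-native check.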
>>
>> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
>>>
>>> Switching from API to CLI
>>>
>>> The parameter -t is described in the PDF:
>>>
>>> --reduceTasks <int-value> optional. The number of reducers to use (where
>>> applicable); depends on the size of the Hadoop cluster. At this point it could
>>> also be overwritten by a standard Hadoop property using the -D option.
>>> 4. Probably always needs to be specified, as by default Hadoop would set it to
>>> 1, which is certainly far below the cluster capacity. Recommended value for
>>> this option: ~95% or ~190% of available reducer capacity to allow for
>>> opportunistic executions.
>>>
>>> The description above seems to say it will be taken from the Hadoop config if
>>> not specified, which is probably all most people would ever want. I am unclear
>>> why this is needed. I cannot run SSVD without specifying it; in other words, it
>>> does not seem to be optional?
>>
>> This parameter was made mandatory because people were repeatedly forgetting to
>> set the number of reducers and kept coming back with questions about why it was
>> running so slow. So there was an issue in 0.7 where I made it mandatory. I am
>> actually not sure how other Mahout methods ensure the number of reducers is ever
>> set to anything other than 1.
>>
>>>
>>> As a first try using the CLI I'm running with 295625 rows and 337258 columns
>>> using the following parameters, to get a sort of worst-case run time with
>>> best-case data output. The parameters will be tweaked later to get better
>>> dimensional reduction and runtime.
>>>
>>> mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
>>>
>>> Is there work being done to calculate the variance retained for the output, or
>>> should I calculate it myself?
>>
>> No, there's no work done on that, since it implies you are building your own
>> pipeline for a particular purpose. It also takes a lot of assumptions that may
>> or may not hold in a particular case, such as that you do something repeatedly
>> and the corpuses are of a similar nature. Also, I know of no paper that would do
>> it exactly the way I described, so there's no error estimate on either the
>> inequality approach or any sort of decay interpolation.
>>
>> It is not very difficult, though, to experiment a little with a subset of the
>> corpus and see what may work.
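To make the "~95% or ~190% of available reducer capacity" recommendation quoted above concrete, here is roughly how -t would be picked for the small cluster described earlier in the thread (10 reduce slots in total, i.e. whatever mapred.tasktracker.reduce.tasks.maximum sums to across the nodes). The numbers are just a worked example, not a benchmark:

    # reducer capacity = total reduce slots across the cluster, e.g. 10 here
    #   ~95%  of capacity -> -t 9    (one full wave of reducers)
    #   ~190% of capacity -> -t 19   (two waves, leaving room for opportunistic execution)
    mahout ssvd -i b2/matrix/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t 19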
