Switching from API to CLI
The parameter -t is described in the PDF as follows:
--reduceTasks <int-value> optional. The number of reducers to use (where
applicable); depends on the size of the Hadoop cluster. It can also be
overridden by a standard Hadoop property using the -D option. It probably
always needs to be specified, since by default Hadoop sets it to 1, which is
certainly far below the cluster capacity. Recommended value for this option:
~95% or ~190% of available reducer capacity, to allow for opportunistic
executions.
The description above seems to say the value will be taken from the Hadoop
config if not specified, which is probably all most people would ever want. I
am unclear why this option is needed: I cannot run SSVD without specifying it,
so in practice it does not seem to be optional.
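The recommended value above is simple arithmetic over the cluster's reducer
capacity. A minimal sketch (the node count and slots-per-node figures here are
assumptions for illustration, not from the docs):

```python
# Sketch of the recommended --reduceTasks arithmetic; the cluster numbers
# below (node count, reduce slots per node) are assumptions.
nodes = 10                  # assumed number of worker nodes
reduce_slots_per_node = 4   # assumed reduce slots per node
capacity = nodes * reduce_slots_per_node

conservative = int(capacity * 0.95)  # ~95%: a single wave of reducers
aggressive = int(capacity * 1.90)    # ~190%: two waves, room for speculative runs

print(conservative, aggressive)
```

Either figure could then be passed as the -t / --reduceTasks value on the
command line.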
As a first try using the CLI I'm running with 295625 rows and 337258 columns,
using the following parameters to get a sort of worst-case runtime with
best-case data output. The parameters will be tweaked later for better
dimensionality reduction and runtime.
mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on
cluster)
Is there work being done to calculate the variance retained for the output, or
should I calculate it myself?
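If it comes to calculating it yourself, one way is the ratio of the top-k
squared singular values to the squared Frobenius norm of the mean-centered
input. A minimal sketch in NumPy, with a toy matrix standing in for the real
data (the matrix shape, seed, and k below are assumptions):

```python
import numpy as np

# Toy stand-in for the input matrix; -pca mean-centers the input, so we do
# the same before taking singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
A = A - A.mean(axis=0)

# Full spectrum of the centered matrix (SSVD would give only the top k).
s = np.linalg.svd(A, compute_uv=False)

# Squared Frobenius norm of the centered matrix equals the sum of all s_i^2.
total = np.sum(A * A)

k = 5
retained = np.sum(s[:k] ** 2) / total  # fraction of variance kept by top-k
```

For a real run you would read the singular values from SSVD's sigma output,
but note the denominator is the total variance of the centered input, which
the top-k singular values alone do not give you; it has to be computed (or
accumulated) separately from the input matrix.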