Switching from API to CLI
The parameter -t is described in the PDF as follows:
--reduceTasks <int-value> optional. The number of reducers to use (where
applicable); depends on the size of the Hadoop cluster. It can also be
overridden by a standard Hadoop property using the -D option. It probably
always needs to be specified, since by default Hadoop sets it to 1, which is
certainly far below the cluster capacity. Recommended value for this option:
~95% or ~190% of available reducer capacity, to allow for opportunistic
executions.
The description above seems to say the value will be taken from the Hadoop
config if not specified, which is probably all most people would ever want. I
am unclear why this option is needed: I cannot run SSVD without specifying it,
so in practice it does not seem to be optional.
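The recommended value above is simple arithmetic over the cluster's reducer
capacity. A minimal sketch (the node count and slots-per-node figures here are
assumptions for illustration, not from the docs):

```python
# Sketch of the recommended --reduceTasks arithmetic; the cluster numbers
# below (node count, reduce slots per node) are assumptions.
nodes = 10                  # assumed number of worker nodes
reduce_slots_per_node = 4   # assumed reduce slots per node
capacity = nodes * reduce_slots_per_node

conservative = int(capacity * 0.95)  # ~95%: a single wave of reducers
aggressive = int(capacity * 1.90)    # ~190%: two waves, room for speculative runs

print(conservative, aggressive)
```

Either figure could then be passed as the -t / --reduceTasks value on the
command line.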
As a first try using the CLI I'm running with 295625 rows and 337258 columns,
using the following parameters to get a sort of worst-case runtime with
best-case data output. The parameters will be tweaked later for better
dimensionality reduction and runtime.
mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on
cluster)
Is there work being done to calculate the variance retained for the output, or
should I calculate it myself?
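If it comes to calculating it yourself, one way is the ratio of the top-k
squared singular values to the squared Frobenius norm of the mean-centered
input. A minimal sketch in NumPy, with a toy matrix standing in for the real
data (the matrix shape, seed, and k below are assumptions):

```python
import numpy as np

# Toy stand-in for the input matrix; -pca mean-centers the input, so we do
# the same before taking singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
A = A - A.mean(axis=0)

# Full spectrum of the centered matrix (SSVD would give only the top k).
s = np.linalg.svd(A, compute_uv=False)

# Squared Frobenius norm of the centered matrix equals the sum of all s_i^2.
total = np.sum(A * A)

k = 5
retained = np.sum(s[:k] ** 2) / total  # fraction of variance kept by top-k
```

For a real run you would read the singular values from SSVD's sigma output,
but note the denominator is the total variance of the centered input, which
the top-k singular values alone do not give you; it has to be computed (or
accumulated) separately from the input matrix.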