On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
>
> Switching from API to CLI
>
> the parameter -t is described in the PDF
>
> --reduceTasks <int-value> optional. The number of reducers to use (where applicable): depends on the size of the Hadoop cluster. At this point it could also be overwritten by a standard Hadoop property using the -D option.
> 4. Probably always needs to be specified, as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option: ~95% or ~190% of available reducer capacity to allow for opportunistic executions.
>
> The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear on why this is needed. I cannot run SSVD without specifying it; in other words, it does not seem to be optional?
This parameter was made mandatory because people were repeatedly forgetting to set the number of reducers and kept coming back with questions about why it was running so slowly. So there was an issue in 0.7 where I made it mandatory. I am actually not sure how other Mahout methods ensure that the number of reducers is always set to something other than 1.

>
> As a first try using the CLI, I'm running with 295625 rows and 337258 columns using the following parameters to get a sort of worst-case run time result with best-case data output. The parameters will be tweaked later to get better dimensional reduction and runtime.
>
> mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
>
> Is there work being done to calculate the variance retained for the output, or should I calculate it myself?

No, there's no work done on that, since it implies you are building your own pipeline for a particular purpose. It also rests on a lot of assumptions that may or may not hold in a particular case, such as that you run the computation repeatedly and the corpora are of a similar nature. Also, I know of no paper that does it exactly the way I described, so there's no error estimate for either the inequality approach or any sort of decay interpolation. It is not very difficult, though, to experiment a little with a subset of the corpus and see what may work.
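To make the "experiment with a subset" idea concrete: for a PCA run, the variance retained by the top-k components is the sum of the top-k squared singular values over the squared Frobenius norm of the mean-centered matrix. A minimal sketch of that offline check, assuming the subset fits in memory as a dense numpy array (this is plain numpy, not Mahout API):

```python
import numpy as np

def retained_variance(A, k):
    """Fraction of total variance captured by the top-k principal components."""
    Ac = A - A.mean(axis=0)               # column mean subtraction, as -pca does
    sigma = np.linalg.svd(Ac, compute_uv=False)
    total = np.sum(sigma ** 2)            # equals ||Ac||_F^2
    return np.sum(sigma[:k] ** 2) / total

# Stand-in for a row sample of the corpus; replace with rows drawn from b2/matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 500))
for k in (50, 100, 200):
    print(k, retained_variance(A, k))
```

On a real subset you would load sampled rows of b2/matrix instead of the random matrix; the k at which the ratio levels off is a reasonable guide for choosing -k on the full run.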

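And on the --reduceTasks point at the top of the thread: the quoted ~95% / ~190% recommendation is just arithmetic on the cluster's total reduce-slot capacity. A throwaway helper, purely illustrative (the names, and reading 95% as one wave of reducers versus 190% as two waves, are my assumptions, not anything in Mahout):

```python
# Illustrative only: derive a --reduceTasks value from cluster reduce capacity.
# The 0.95 / 1.90 factors come from the SSVD PDF quoted above.
def recommended_reducers(reduce_capacity: int, two_waves: bool = False) -> int:
    factor = 1.90 if two_waves else 0.95
    return max(1, round(reduce_capacity * factor))

print(recommended_reducers(100))                  # 95
print(recommended_reducers(100, two_waves=True))  # 190
```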