Also, I'd suggest starting with a much smaller k. It directly affects task running time, especially with q > 0. I think in Nathan's dissertation he ran the Wikipedia set with only 100 singular values, which already captures most of the spectrum if you look at the decay chart. In practice I don't know anyone who has run it with more than 200; it just reaches a point of diminishing returns around 100, where you pay quite a bit for very little additional information. (A small sketch of how to check the decay and retained variance for your own corpus is at the end of this message.)

On Aug 18, 2012 10:39 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> > Switching from API to CLI.
> >
> > The parameter -t is described in the PDF:
> >
> > --reduceTasks <int-value> optional. The number of reducers to use (where applicable); depends on the size of the Hadoop cluster. At this point it could also be overridden by a standard Hadoop property using the -D option. Probably always needs to be specified, as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option is ~95% or ~190% of available reducer capacity to allow for opportunistic executions.
> >
> > The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear why this is needed; I cannot run SSVD without specifying it, so in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly forgetting to set the number of reducers and kept coming back with questions about why it was running so slow. So there was an issue in 0.7 where I made it mandatory. I am actually not sure how other Mahout methods ensure the number of reducers is ever set to anything other than 1.
>
> > As a first try using the CLI, I'm running with 295625 rows and 337258 columns, using the following parameters to get a sort of worst-case run time with best-case data output. The parameters will be tweaked later to get better dimensional reduction and runtime.
> >
> > mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
> >
> > Is there work being done to calculate the variance retained for the output, or should I calculate it myself?
>
> No, there's no work being done on that, since it implies you are building your own pipeline for a particular purpose. It also takes a lot of assumptions that may or may not hold in a particular case, such as that you do something repeatedly and the corpuses are of a similar nature. Also, I know of no paper that would do it exactly the way I described, so there's no error estimate on either the inequality approach or any sort of decay interpolation.
>
> It is not very difficult, though, to experiment a little with your own data on a subset of the corpus and see what may work.
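
To make the reducer recommendation above concrete (the cluster numbers here are hypothetical): a cluster of 10 task trackers with 4 reduce slots each exposes 40 reduce slots in total, so -t 38 (about 95%, a single reduce wave) or -t 76 (about 190%, two waves with headroom for speculative/opportunistic executions) would be typical starting points.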

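As a minimal sketch of the "experiment a little" suggestion above: once you have singular values from a trial run, the variance retained at a given k is just the cumulative fraction of sigma_i^2. How you read the sigmas out of the SSVD output depends on your pipeline and Mahout version, so the values below are placeholders only.

import java.util.Arrays;

public class SpectrumDecay {
  public static void main(String[] args) {
    // Hypothetical singular values; substitute the ones from your own SSVD run.
    double[] sigma = {120.0, 95.0, 60.0, 41.0, 30.0, 22.0, 15.0, 9.0, 5.0, 2.0};

    Arrays.sort(sigma); // ascending; walked from the largest value down below

    double total = 0;
    for (double s : sigma) {
      total += s * s; // total "energy" of the computed part of the spectrum
    }

    double running = 0;
    for (int i = sigma.length - 1, k = 1; i >= 0; i--, k++) {
      running += sigma[i] * sigma[i];
      System.out.printf("k = %3d   fraction of computed energy retained = %.3f%n",
          k, running / total);
    }
  }
}

Plotting that curve for your corpus makes the diminishing-returns point easy to spot. Note that SSVD only computes the top k (plus oversampling) singular values, so the denominator above covers only the part of the spectrum you actually computed; for an absolute variance-retained figure you would compare against the squared Frobenius norm of the input matrix (mean-centered, when running with -pca) instead.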