I'll take a look, although it used to run on the output of seq2sparse; I made sure of it some time ago. This never happened before. Perhaps something got broken...

On Aug 19, 2012 1:06 AM, "Pat Ferrel" <[email protected]> wrote:
> -t Param
>
> I'm no Hadoop expert, but there are a couple of parameters for each node in a cluster that specify the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
>
> No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home, with 10 cores in three machines, is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered, one-machine setup.
>
> I'm suspicious that the -t parameter is not needed, but would definitely defer to a Hadoop master. In any case I set it to 10 for my mini cluster.
>
> Variance Retained
>
> If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no?
>
> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> > Switching from API to CLI
> >
> > The parameter -t is described in the PDF:
> >
> > --reduceTasks <int-value> optional. The number of reducers to use (where applicable): depends on the size of the hadoop cluster. At this point it could also be overwritten by a standard hadoop property using the -D option. Probably always needs to be specified, as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option: ~95% or ~190% of available reducer capacity to allow for opportunistic executions.
> >
> > The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear why this is needed; I cannot run SSVD without specifying it, so in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly forgetting to set the number of reducers and kept coming back with questions about why it was running so slow. So there was an issue in 0.7 where I made it mandatory. I am actually not sure how other Mahout methods ensure the reducer count is ever set to anything other than 1.
>
> >
> > As a first try using the CLI, I'm running with 295625 rows and 337258 columns using the following parameters, to get a sort of worst-case run time with best-case data output. The parameters will be tweaked later to get better dimensional reduction and runtime.
> >
> > mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
> >
> > Is there work being done to calculate the variance retained for the output, or should I calculate it myself?
>
> No, there's no work being done on that, since it implies you are building your own pipeline for a particular purpose. It also takes a lot of assumptions that may or may not hold in a particular case, such as that you do something repeatedly and the corpora are of a similar nature. Also, I know of no paper that does it exactly the way I described, so there's no error estimate on either the inequality approach or any sort of decay interpolation.
>
> It is not very difficult, though, to experiment a little with a subset of the corpus and see what may work.
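[A rough sketch of the calculation being discussed, nothing that exists in Mahout; the class and parameter names below are invented for illustration. It assumes you have the k singular values that the ssvd job writes out, and that you compute the total sum of squares of the mean-centered input yourself in a separate pass over the corpus.]

// Sketch only, not part of Mahout: estimate variance retained from the top-k
// singular values, given the total sum of squares of the mean-centered input.
public class VarianceRetainedSketch {

  /**
   * @param sigma             the k singular values of the mean-centered matrix
   * @param totalSumOfSquares squared Frobenius norm of the mean-centered matrix,
   *                          i.e. the sum of ALL squared singular values
   * @return fraction of total variance captured by the first k components
   */
  public static double varianceRetained(double[] sigma, double totalSumOfSquares) {
    double retained = 0.0;
    for (double s : sigma) {
      retained += s * s;   // each component contributes sigma_i^2 to the variance
    }
    return retained / totalSumOfSquares;
  }

  public static void main(String[] args) {
    double[] topK = {10.0, 5.0, 2.0};   // toy values, just to show the shape
    double totalSS = 150.0;             // would be measured from the actual corpus
    System.out.println(varianceRetained(topK, totalSS));   // prints 0.86
  }
}

Comparing that ratio across batches of similar data is the kind of small experiment suggested above.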
