I'll take a look, although it used to run on the output of seq2sparse; I made sure of it some time ago. This never happened before. Perhaps something got broken...

On Aug 19, 2012 1:06 AM, "Pat Ferrel" <[email protected]> wrote:
> -t Param
>
> I'm no Hadoop expert, but there are a couple of parameters for each node in a cluster that specify the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
>
> No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home, with 10 cores in three machines, is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered, one-machine setup.
>
> I'm suspicious that the -t parameter is not needed, but would definitely defer to a Hadoop master. In any case I set it to 10 for my mini cluster.
>
> Variance Retained
>
> If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no?
>
> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> > Switching from API to CLI
> >
> > The parameter -t is described in the PDF:
> >
> > --reduceTasks <int-value> optional. The number of reducers to use (where applicable): depends on the size of the hadoop cluster. At this point it could also be overwritten by a standard hadoop property using the -D option. Probably always needs to be specified, as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option: ~95% or ~190% of available reducer capacity to allow for opportunistic executions.
> >
> > The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear why this is needed; I cannot run SSVD without specifying it, so in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly forgetting to set the number of reducers and kept coming back with questions about why it was running so slow. So there was an issue in 0.7 where I made it mandatory. I am actually not sure how other Mahout methods ensure the reducer count is ever set to anything other than 1.
>
> >
> > As a first try using the CLI, I'm running with 295625 rows and 337258 columns using the following parameters, to get a sort of worst-case run time with best-case data output. The parameters will be tweaked later to get better dimensional reduction and runtime.
> >
> > mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
> >
> > Is there work being done to calculate the variance retained for the output, or should I calculate it myself?
>
> No, there's no work being done on that, since it implies you are building your own pipeline for a particular purpose. It also takes a lot of assumptions that may or may not hold in a particular case, such as that you do something repeatedly and the corpora are of a similar nature. Also, I know of no paper that does it exactly the way I described, so there's no error estimate on either the inequality approach or any sort of decay interpolation.
>
> It is not very difficult, though, to experiment a little with a subset of the corpus and see what may work.
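[A rough sketch of the calculation being discussed, nothing that exists in Mahout; the class and parameter names below are invented for illustration. It assumes you have the k singular values that the ssvd job writes out, and that you compute the total sum of squares of the mean-centered input yourself in a separate pass over the corpus.]

// Sketch only, not part of Mahout: estimate variance retained from the top-k
// singular values, given the total sum of squares of the mean-centered input.
public class VarianceRetainedSketch {

  /**
   * @param sigma             the k singular values of the mean-centered matrix
   * @param totalSumOfSquares squared Frobenius norm of the mean-centered matrix,
   *                          i.e. the sum of ALL squared singular values
   * @return fraction of total variance captured by the first k components
   */
  public static double varianceRetained(double[] sigma, double totalSumOfSquares) {
    double retained = 0.0;
    for (double s : sigma) {
      retained += s * s;   // each component contributes sigma_i^2 to the variance
    }
    return retained / totalSumOfSquares;
  }

  public static void main(String[] args) {
    double[] topK = {10.0, 5.0, 2.0};   // toy values, just to show the shape
    double totalSS = 150.0;             // would be measured from the actual corpus
    System.out.println(varianceRetained(topK, totalSS));   // prints 0.86
  }
}

Comparing that ratio across batches of similar data is the kind of small experiment suggested above.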
