On Aug 19, 2012 1:06 AM, "Pat Ferrel" <[email protected]> wrote:
>
> -t Param
>
> I'm no hadoop expert, but there are a couple of parameters for each node
> in a cluster that specify the default number of mappers and reducers for
> that node. There is a rule of thumb about how many mappers and reducers
> to run per core. You can tweak them either way depending on your typical
> jobs.
>
> No idea what you mean about the total reducers being 1 for most configs.
> My very small cluster at home, with 10 cores in three machines, is
> configured to produce a conservative 10 mappers and 10 reducers, which is
> about what happens with balanced jobs. The reducers = 1 is probably for a
> non-clustered, single-machine setup.

Yes, I agree. I was thinking the same and relying on people doing the right
thing initially, and life proved me wrong. Absolutely all the crews who
tried the method not only did not have reducers set up in their local
client conf, they also failed to use the -t parameter to fix it. And they
all failed to diagnose it on their own (i.e. by simply noticing it in the
job stats). I think it has something to do with the typical background of
our customers.
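
For reference, the "local client conf" here is just the usual Hadoop
mapred-site.xml. A minimal MRv1-style sketch (these are the stock Hadoop
1.x property names; the slot counts and the per-job reducer default are
illustrative only and need to be tuned to the actual hardware):

    <!-- per-tasktracker slot limits (illustrative values) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- client-side default number of reducers per job; this is the value
         that silently stays at 1 if nobody sets it -->
    <property>
      <name>mapred.reduce.tasks</name>
      <value>10</value>
    </property>

Even with that in place, the 0.7 ssvd CLI still expects -t explicitly,
which at least makes the choice visible up front.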

>
> I'm suspicious that the -t parameter is not needed but would definitely
> defer to a hadoop master. In any case I set it to 10 for my mini cluster.

The recommended value is ~95% of the cluster's reduce capacity, to leave
room for opportunistic execution. Although on bigger clusters I am far from
sure that that many reducers are really beneficial for a particular
problem; hence, again, being able to override the default on the command
line may be useful.

Also, one usually has more than one task slot per node, so I would expect
your cluster to be able to run up to 40 reducers, typically.
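
To make the arithmetic concrete (the slot count below is hypothetical,
purely to illustrate the 95% rule):

    # suppose the jobtracker reports 40 reduce slots in total
    #   single wave:  0.95 * 40 = 38
    #   two waves:    1.90 * 40 = 76   (the ~190% figure quoted further down)
    mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t 38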
>
> Variance Retained
>
> If one batch of data yields a greatly different estimate of VR than
> another, it would be worth noticing, even if we don't know the actual
> error in it. To say that your estimate of VR is valueless would require
> that we have some experience with it, no?

I am not saying it is valueless. Actually, I am hoping it is useful, or I
wouldn't include it in the howto. I am just saying it is something I leave
outside the scope of the method itself.
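
As a rough sketch of the kind of experiment I have in mind (this is not the
procedure from the howto, just the standard PCA bookkeeping, assuming the
mean-centered case): the variance retained by a rank-k cut is approximately

    VR(k) ≈ (sigma_1^2 + ... + sigma_k^2) / ||A_centered||_F^2

where the sigmas are the singular values SSVD outputs and the denominator,
the total variance of the mean-centered corpus, has to be accumulated over
the input in a separate pass. Computing VR(k) on a couple of subsets of the
corpus and comparing the numbers is usually enough to tell whether the
estimate is stable for a given kind of data.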

>
> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> > Switching from API to CLI
> >
> > the parameter -t is described in the PDF
> >
> > --reduceTasks <int-value> optional. The number of reducers to use
> > (where applicable): depends on the size of the hadoop cluster. At this
> > point it could also be overwritten by a standard hadoop property using
> > the -D option.
> > 4. Probably always needs to be specified, as by default Hadoop would
> > set it to 1, which is certainly far below the cluster capacity.
> > Recommended value for this option: ~95% or ~190% of available reducer
> > capacity to allow for opportunistic executions.
> >
> > The description above seems to say it will be taken from the hadoop
> > config if not specified, which is probably all most people would ever
> > want. I am unclear why this is needed? I cannot run SSVD without
> > specifying it; in other words, it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly
> forgetting to set the number of reducers and kept coming back with
> questions about why it was running so slow. So there was an issue in 0.7
> where I made it mandatory. I am actually not sure how other Mahout
> methods ensure that the reducer count is specified as something other
> than 1.
>
> >
> > As a first try using the CLI I'm running with 295625 rows and 337258
> > columns, using the following parameters to get a sort of worst-case run
> > time result with best-case data output. The parameters will be tweaked
> > later to get better dimensional reduction and runtime.
> >
> >    mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> >    on cluster)
> >
> > Is there work being done to calculate the variance retained for the
> > output or should I calculate it myself?
>
> No, there's no work done on that, since it implies you are building your
> own pipeline for a particular purpose. It also rests on a lot of
> assumptions that may or may not hold in a particular case, such as that
> you do something repeatedly and the corpuses are of a similar nature.
> Also, I know of no paper that does it exactly the way I described, so
> there's no error estimate on either the inequality approach or any sort
> of decay interpolation.
>
> It is not very difficult, though, to experiment a little with your data
> on a subset of the corpus and see what may work.
>
