-t Param

I'm no Hadoop expert, but there are a couple of parameters on each node in a 
cluster that specify the number of mapper and reducer slots for that 
node. There is a rule of thumb about how many mappers and reducers to allow 
per core, and you can tweak them either way depending on your typical jobs.
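
FWIW those per-node limits are the tasktracker "slot" settings. A quick way to 
see what a node is configured for, assuming a Hadoop 1.x / MRv1-style setup 
and the usual mapred-site.xml location (both assumptions, adjust for your 
distribution), is something like:

    # Sketch: show the per-node map/reduce slot limits, if set explicitly
    # (Hadoop 1.x / MRv1 property names assumed)
    grep -A 1 -E 'mapred.tasktracker.(map|reduce).tasks.maximum' \
        $HADOOP_CONF_DIR/mapred-site.xml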

No idea what you mean about the total reducers being 1 for most configs. My 
very small cluster at home, with 10 cores across three machines, is configured 
to produce a conservative 10 mappers and 10 reducers, which is about what 
happens with balanced jobs. The reducers = 1 default is probably for a 
non-clustered, single-machine setup.

I suspect the -t parameter is not really needed, but I would definitely defer 
to a Hadoop master on that. In any case I set it to 10 for my mini cluster.
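
For reference, that is just the invocation quoted further down with the 
reducer count filled in; by the PDF's ~95% / ~190% guideline a 10-slot cluster 
would land at roughly -t 9 or -t 19, so 10 is in the right ballpark:

    # Sketch: the same SSVD run as below, with -t set explicitly for a
    # cluster that has 10 reducer slots; paths are placeholders for my data
    mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t 10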

Variance Retained

If one batch of data yields a greatly different estimate of VR from another, it 
would be worth noticing, even if we don't know the actual error in it. To say 
that your estimate of VR is valueless would require that we have some 
experience with it, no?
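
(By VR I mean the usual PCA notion of retained/explained variance, roughly

    VR(k) ~= (sigma_1^2 + ... + sigma_k^2) / (total variance of the mean-centered data)

with the wrinkle that SSVD only produces the top k singular values, so the 
denominator has to be estimated somehow; I take that to be what the 
inequality / decay-interpolation remarks below are about.)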

On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <[email protected]> wrote:

On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
> 
> Switching from API to CLI
> 
> the parameter -t is described in the PDF
> 
> --reduceTasks <int-value> optional. The number of reducers to use (where
> applicable): depends on the size of the hadoop cluster. At this point it
> could also be overwritten by a standard hadoop property using -D option
> 4. Probably always needs to be specified as by default Hadoop would set it
> to 1, which is certainly far below the cluster capacity. Recommended value
> for this option ~ 95% or ~190% of available reducer capacity to allow for
> opportunistic executions.
> 
> The description above seems to say it will be taken from the hadoop
> config if not specified, which is probably all most people would ever
> want. I am unclear why this is needed. I cannot run SSVD without specifying
> it; in other words it does not seem to be optional?

This parameter was made mandatory because people were repeatedly forgetting to
set the number of reducers and kept coming back with questions about why it
was running so slow. So there was an issue in 0.7 where I made it mandatory.
I am actually not sure how other Mahout methods ensure the number of reducers
is specified rather than left at the default of 1.

> 
> As a first try using the CLI I'm running with 295625 rows and 337258
> columns using the following parameters, to get a sort of worst-case run-time
> result with best-case data output. The parameters will be tweaked later to
> get better dimensionality reduction and runtime.
> 
>    mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> on cluster)
> 
> Is there work being done to calculate the variance retained for the
> output or should I calculate it myself?

No, there's no work being done on that, since it implies you are building your
own pipeline for a particular purpose. It also rests on a lot of assumptions
that may or may not hold in a particular case, such as that you do something
repeatedly and the corpora are of a similar nature. Also, I know of no paper
that does it exactly the way I described, so there's no error estimate for
either the inequality approach or any sort of decay interpolation.

It is not very difficult, though, to experiment a little with a subset of the
corpus and see what may work.
