On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
>
> Switching from API to CLI
>
> the parameter -t is described in the PDF
>
> --reduceTasks <int-value> optional. The number of reducers to use (where applicable): depends on the size of the Hadoop cluster. At this point it could also be overwritten by a standard Hadoop property using the -D option.
> 4. Probably always needs to be specified, as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option: ~95% or ~190% of available reducer capacity to allow for opportunistic executions.
>
> The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear on why this is needed. I cannot run SSVD without specifying it; in other words, it does not seem to be optional?
This parameter was made mandatory because people were repeatedly forgetting to set the number of reducers and kept coming back with questions about why it was running so slowly. So there was an issue in 0.7 where I made it mandatory. I am actually not sure how other Mahout methods ensure that the number of reducers is always set to something other than 1.

>
> As a first try using the CLI, I'm running with 295625 rows and 337258 columns using the following parameters to get a sort of worst-case run time result with best-case data output. The parameters will be tweaked later to get better dimensional reduction and runtime.
>
> mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
>
> Is there work being done to calculate the variance retained for the output, or should I calculate it myself?

No, there's no work done on that, since it implies you are building your own pipeline for a particular purpose. It also rests on a lot of assumptions that may or may not hold in a particular case, such as that you run the computation repeatedly and the corpora are of a similar nature. Also, I know of no paper that does it exactly the way I described, so there's no error estimate for either the inequality approach or any sort of decay interpolation. It is not very difficult, though, to experiment a little with a subset of the corpus and see what may work.
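To make the "experiment with a subset" idea concrete: for a PCA run, the variance retained by the top-k components is the sum of the top-k squared singular values over the squared Frobenius norm of the mean-centered matrix. A minimal sketch of that offline check, assuming the subset fits in memory as a dense numpy array (this is plain numpy, not Mahout API):

```python
import numpy as np

def retained_variance(A, k):
    """Fraction of total variance captured by the top-k principal components."""
    Ac = A - A.mean(axis=0)               # column mean subtraction, as -pca does
    sigma = np.linalg.svd(Ac, compute_uv=False)
    total = np.sum(sigma ** 2)            # equals ||Ac||_F^2
    return np.sum(sigma[:k] ** 2) / total

# Stand-in for a row sample of the corpus; replace with rows drawn from b2/matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 500))
for k in (50, 100, 200):
    print(k, retained_variance(A, k))
```

On a real subset you would load sampled rows of b2/matrix instead of the random matrix; the k at which the ratio levels off is a reasonable guide for choosing -k on the full run.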

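And on the --reduceTasks point at the top of the thread: the quoted ~95% / ~190% recommendation is just arithmetic on the cluster's total reduce-slot capacity. A throwaway helper, purely illustrative (the names, and reading 95% as one wave of reducers versus 190% as two waves, are my assumptions, not anything in Mahout):

```python
# Illustrative only: derive a --reduceTasks value from cluster reduce capacity.
# The 0.95 / 1.90 factors come from the SSVD PDF quoted above.
def recommended_reducers(reduce_capacity: int, two_waves: bool = False) -> int:
    factor = 1.90 if two_waves else 0.95
    return max(1, round(reduce_capacity * factor))

print(recommended_reducers(100))                  # 95
print(recommended_reducers(100, two_waves=True))  # 190
```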