Also, I'd suggest starting with a much smaller k. It directly affects task running time, especially with q > 0. I think in Nathan's dissertation he ran the Wikipedia set with only 100 singular values, which already captures most of the spectrum if you look at the decay chart. In practice I don't know anyone who has run it with more than 200; it just reaches a point of diminishing returns around 100, where you pay quite a bit for very little additional information. (A small sketch of how to check the decay and retained variance for your own corpus is at the end of this message.)

On Aug 18, 2012 10:39 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> > Switching from API to CLI.
> >
> > The parameter -t is described in the PDF:
> >
> > --reduceTasks <int-value> optional. The number of reducers to use (where applicable); depends on the size of the Hadoop cluster. At this point it could also be overridden by a standard Hadoop property using the -D option. Probably always needs to be specified, as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option is ~95% or ~190% of available reducer capacity to allow for opportunistic executions.
> >
> > The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear why this is needed; I cannot run SSVD without specifying it, so in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly forgetting to set the number of reducers and kept coming back with questions about why it was running so slow. So there was an issue in 0.7 where I made it mandatory. I am actually not sure how other Mahout methods ensure the number of reducers is ever set to anything other than 1.
>
> > As a first try using the CLI, I'm running with 295625 rows and 337258 columns, using the following parameters to get a sort of worst-case run time with best-case data output. The parameters will be tweaked later to get better dimensional reduction and runtime.
> >
> > mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
> >
> > Is there work being done to calculate the variance retained for the output, or should I calculate it myself?
>
> No, there's no work being done on that, since it implies you are building your own pipeline for a particular purpose. It also takes a lot of assumptions that may or may not hold in a particular case, such as that you do something repeatedly and the corpuses are of a similar nature. Also, I know of no paper that would do it exactly the way I described, so there's no error estimate on either the inequality approach or any sort of decay interpolation.
>
> It is not very difficult, though, to experiment a little with your own data on a subset of the corpus and see what may work.
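
To make the reducer recommendation above concrete (the cluster numbers here are hypothetical): a cluster of 10 task trackers with 4 reduce slots each exposes 40 reduce slots in total, so -t 38 (about 95%, a single reduce wave) or -t 76 (about 190%, two waves with headroom for speculative/opportunistic executions) would be typical starting points.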

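As a minimal sketch of the "experiment a little" suggestion above: once you have singular values from a trial run, the variance retained at a given k is just the cumulative fraction of sigma_i^2. How you read the sigmas out of the SSVD output depends on your pipeline and Mahout version, so the values below are placeholders only.

import java.util.Arrays;

public class SpectrumDecay {
  public static void main(String[] args) {
    // Hypothetical singular values; substitute the ones from your own SSVD run.
    double[] sigma = {120.0, 95.0, 60.0, 41.0, 30.0, 22.0, 15.0, 9.0, 5.0, 2.0};

    Arrays.sort(sigma); // ascending; walked from the largest value down below

    double total = 0;
    for (double s : sigma) {
      total += s * s; // total "energy" of the computed part of the spectrum
    }

    double running = 0;
    for (int i = sigma.length - 1, k = 1; i >= 0; i--, k++) {
      running += sigma[i] * sigma[i];
      System.out.printf("k = %3d   fraction of computed energy retained = %.3f%n",
          k, running / total);
    }
  }
}

Plotting that curve for your corpus makes the diminishing-returns point easy to spot. Note that SSVD only computes the top k (plus oversampling) singular values, so the denominator above covers only the part of the spectrum you actually computed; for an absolute variance-retained figure you would compare against the squared Frobenius norm of the input matrix (mean-centered, when running with -pca) instead.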