Re: Mahout SSVD is too slow for highly dimensional data

Dmitriy Lyubimov Tue, 11 Jun 2013 12:45:09 -0700

What Ted said. k+p=1001 will make the per-task running time quite long.
Actually, I don't think anyone has attempted that many values, so I don't
even have a sense of how long it will take. It should still be CPU-bound
regardless, though.
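
To put a rough number on why k+p matters: in the randomized method the
projected matrix has one row per input row and k+p columns. The sketch below
only does that arithmetic for the 8000000-row input mentioned in this thread.
It assumes dense 8-byte doubles, and the helper name dense_size_gb is made up
for illustration; the actual on-disk sizes and blocking in Mahout will differ.

```python
def dense_size_gb(rows, cols, bytes_per_value=8):
    """Size in GB (10^9 bytes) of a dense rows x cols matrix of doubles."""
    return rows * cols * bytes_per_value / 1e9


ROWS = 8_000_000  # row count quoted for the Bag of Words input in this thread

if __name__ == "__main__":
    # Compare the requested setting (k=1000, p=1) with the suggested one.
    for k, p in [(1000, 1), (100, 15)]:
        print(f"k={k}, p={p}: about {dense_size_gb(ROWS, k + p):.2f} GB")
```

The ratio between the two settings is simply 1001/115, so the intermediate
data is close to nine times larger before any per-row QR work is counted.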


A much better trade-off is to ask for fewer values but get more precision in
them with a power iteration (-q 1). With k=1000 the power iteration step (ABt)
will definitely have a hard time with the multiplication, just because of the
amount of data to move around and sort.
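
A minimal in-memory sketch of that trade-off, assuming NumPy and a small
synthetic matrix rather than the distributed Mahout jobs. The function names
(randomized_svd, synthetic_matrix, max_relative_error) and the test matrix are
illustrative assumptions, not Mahout code; the loop body mirrors the A * B^T
step referred to above as ABt.

```python
import numpy as np


def randomized_svd(a, k, p=15, q=1, seed=0):
    """Approximate the top-k singular triplets of `a` (in-memory sketch)."""
    rng = np.random.default_rng(seed)
    n = a.shape[1]
    omega = rng.standard_normal((n, k + p))   # random projection, k+p columns
    q_mat, _ = np.linalg.qr(a @ omega)        # orthonormal basis of Y = A * Omega
    b = q_mat.T @ a                           # B = Q^T A, shape (k+p) x n
    for _ in range(q):
        q_mat, _ = np.linalg.qr(a @ b.T)      # power iteration: Y = A * B^T
        b = q_mat.T @ a
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    return (q_mat @ u_b)[:, :k], s[:k], vt[:k]


def synthetic_matrix(m=400, n=200, seed=1):
    """Test matrix with known, slowly decaying singular values 1/sqrt(i)."""
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((m, n)))
    v, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = 1.0 / np.sqrt(np.arange(1, n + 1))
    return (u * s) @ v.T, s


def max_relative_error(q, k=10, p=5):
    """Worst relative error among the top-k estimated singular values."""
    a, s_true = synthetic_matrix()
    _, s_est, _ = randomized_svd(a, k, p=p, q=q)
    return float(np.max(np.abs(s_est - s_true[:k]) / s_true[:k]))


if __name__ == "__main__":
    for q in (0, 1):
        print(f"q={q}: max relative error in top 10 singular values = "
              f"{max_relative_error(q):.3f}")
```

Running it compares q=0 with q=1 at the same k and p. The work per power
iteration grows with k+p, which is why a modest k plus one iteration is the
cheaper way to buy accuracy.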


On Tue, Jun 11, 2013 at 12:48 AM, Ted Dunning <[email protected]> wrote:

> Don't do that.
>
> Why do you think you need 1000 singular values?
>
> Have you tried with k=100, p=15?
>
> Quite seriously, I would expect that you would literally get just as good
> results for almost any real application with 100 singular vectors and 900
> orthogonal noise vectors.
>
>
> On Tue, Jun 11, 2013 at 9:39 AM, Yehia Zakaria <[email protected]> wrote:
>
> > Hi
> >
> > The requested rank (k) is 1000 and p is 1. The input size is 1.2
> > gigabytes.
> >
> > Thanks
> >
> >
> >
> > On Mon, Jun 10, 2013 at 9:28 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > What is the requested rank? This guy will not scale w.r.t. rank, only
> > > w.r.t. input size. Realistically you don't need k>100, p>15.
> > >
> > > What is the input size (A, in GB)?
> > >
> > >
> > > On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <[email protected]> wrote:
> > >
> > > > Hi All
> > > >
> > > > I am running Mahout SSVD (trunk version) with the pca option on the Bag
> > > > of Words dataset (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
> > > > This dataset has 8000000 instances (rows) and 100000 attributes
> > > > (columns). Mahout SSVD is too slow: it may take days to finish the
> > > > first phase of SSVD (Q-Job). I am running the code on a cluster of 16
> > > > machines, each with 8 cores and 32 GB of memory. Moreover, the CPU and
> > > > memory of the workers are not utilized at all. When running Mahout
> > > > SSVD on a smaller dataset (12500 rows and 5000 columns), it runs very
> > > > fast; the job finished in 2 minutes. Do you have any idea why Mahout
> > > > SSVD is so slow for high-dimensional data? And to what extent can SSVD
> > > > work efficiently (with respect to the number of rows and columns of
> > > > the input matrix)?
> > > >
> > > > Thanks
> > > > Yehia
> > > >
> > >
> >
>
