Is your input matrix dense?
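
If it is, that alone could explain the BtJob spill: at 100,000 columns a
dense row serializes to roughly 800 KB (100,000 doubles), and the row
blocks BtJob emits grow with k, so they can easily overflow the map-side
sort buffer. Bag-of-Words data should come in as sparse vectors. A
quick, untested way to eyeball a few input rows (the path below is a
placeholder for your actual input):

    # Dump the first rows of the SSVD input SequenceFile; the value
    # dump shows each vector's contents and size.
    mahout seqdumper -i /path/to/input/matrix | head -20
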
On Wed, Jun 12, 2013 at 9:54 AM, Yehia Zakaria <[email protected]> wrote:

> Thanks a lot, Ted and Dmitriy.
>
> Keeping k = 100 solved the problem, and the Q-Job passed successfully.
> I am evaluating Mahout SSVD performance, which is why I had chosen the
> rank to be 1000 (1% of the original number of attributes). But I
> encountered another exception, in BtJob:
>
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1060)
>     at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>     at java.io.DataOutputStream.writeDouble(DataOutputStream.java:242)
>     at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:150)
>     at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:80)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.write(SparseRowBlockWritable.java:81)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:100)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:84)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:916)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:576)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:88)
>     at org.a
>
> I searched for this and it seems to be an open issue in JIRA, but I am
> not sure how closely it is related to my exception:
>
> https://issues.apache.org/jira/browse/MAPREDUCE-5028
>
> Thanks,
> Yehia
>
>
> On Tue, Jun 11, 2013 at 10:43 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > What Ted said. k + p = 1001 will push per-task running time up quite
> > a bit. Actually, I don't think anyone has attempted that many values,
> > so I don't even have a sense of how long it would take. It should
> > still be CPU-bound regardless, though.
> >
> > A much better trade-off is to have fewer values but more precision in
> > them, with a power iteration (-q 1). The power-iteration step (ABt)
> > will definitely have a hard time multiplying with k = 1000, just
> > because of the amount of data to move around and sort.
> >
> >
> > On Tue, Jun 11, 2013 at 12:48 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Don't do that.
> > >
> > > Why do you think you need 1000 singular values?
> > >
> > > Have you tried k = 100, p = 15?
> > >
> > > Quite seriously, I would expect that you would get just as good
> > > results for almost any real application with 100 singular vectors
> > > and 900 orthogonal noise vectors.
> > >
> > >
> > > On Tue, Jun 11, 2013 at 9:39 AM, Yehia Zakaria <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > The requested rank (k) is 1000 and p is 1. The input size is
> > > > 1.2 gigabytes.
> > > >
> > > > Thanks
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 9:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > > What is the requested rank? This guy will not scale w.r.t.
> > > > > rank, only w.r.t. input size. Realistically you don't need
> > > > > k > 100, p > 15.
> > > > >
> > > > > What is the input size (A, in GB)?
> > > > >
> > > > > On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I am running Mahout SSVD (trunk version) with the pca option
> > > > > > on the Bag of Words dataset
> > > > > > (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). This
> > > > > > dataset has 8,000,000 instances (rows) and 100,000 attributes
> > > > > > (columns). Mahout SSVD is very slow; it may take days to
> > > > > > finish the first phase of SSVD (Q-Job). I am running the code
> > > > > > on a cluster of 16 machines, each with 8 cores and 32 GB of
> > > > > > memory. Moreover, the CPU and memory of the workers are not
> > > > > > utilized at all. When running Mahout SSVD on a smaller dataset
> > > > > > (12,500 rows and 5,000 columns), it is very fast; the job
> > > > > > finished in 2 minutes. Do you have any idea why Mahout SSVD
> > > > > > is so slow for high-dimensional data, and to what extent SSVD
> > > > > > can work efficiently (with respect to the number of rows and
> > > > > > columns of the input matrix)?
> > > > > >
> > > > > > Thanks,
> > > > > > Yehia
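
For reference, the parameters suggested above would look something like
this on the command line. This is an untested sketch: the paths,
reducer count, and tempDir are placeholders, and flag names can differ
between Mahout versions, so check "mahout ssvd --help" on your build:

    # Sketch: k=100 with extra oversampling and one power iteration,
    # per Ted's and Dmitriy's suggestions above.
    mahout ssvd \
      --input /path/to/input/matrix \
      --output /path/to/ssvd/output \
      --rank 100 \
      --oversampling 15 \
      --powerIter 1 \
      --pca true \
      --reduceTasks 32 \
      --tempDir /tmp/ssvd-work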

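As for the spill failure itself: if it is the large-record vs. sort
buffer problem described in MAPREDUCE-5028 (BtJob's serialized row
blocks grow with k), giving the map-side sort buffer and the task heap
more room may help. This is a guess, not a confirmed fix; the values
below are examples only, and the property names are the Hadoop 1.x ones
(MR2 uses mapreduce.task.io.sort.mb and mapreduce.map.java.opts):

    # Pass larger buffer/heap settings through the generic -D options;
    # io.sort.mb must fit inside the configured task heap.
    mahout ssvd -Dio.sort.mb=512 -Dmapred.child.java.opts=-Xmx2048m \
      --input /path/to/input/matrix --output /path/to/ssvd/output \
      --rank 100 --oversampling 15 --powerIter 1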