Is your input matrix dense?
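
If it is, that alone could explain the BtJob spill: at 100,000 columns a
dense row serializes to roughly 800 KB (100,000 doubles), and the row
blocks BtJob emits grow with k, so they can easily overflow the map-side
sort buffer. Bag-of-Words data should come in as sparse vectors. A
quick, untested way to eyeball a few input rows (the path below is a
placeholder for your actual input):

    # Dump the first rows of the SSVD input SequenceFile; the value
    # dump shows each vector's contents and size.
    mahout seqdumper -i /path/to/input/matrix | head -20
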
On Wed, Jun 12, 2013 at 9:54 AM, Yehia Zakaria <[email protected]> wrote:

> Thanks a lot, Ted and Dmitriy.
>
> Keeping k = 100 solved the problem, and the Q-Job passed successfully.
> I am evaluating Mahout SSVD performance, which is why I had chosen the
> rank to be 1000 (1% of the original number of attributes). But I
> encountered another exception, in BtJob:
>
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1060)
>     at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>     at java.io.DataOutputStream.writeDouble(DataOutputStream.java:242)
>     at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:150)
>     at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:80)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.write(SparseRowBlockWritable.java:81)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:100)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:84)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:916)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:576)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:88)
>     at org.a
>
> I searched for this and it seems to be an open issue in JIRA, but I am
> not sure how closely it is related to my exception:
>
> https://issues.apache.org/jira/browse/MAPREDUCE-5028
>
> Thanks,
> Yehia
>
>
> On Tue, Jun 11, 2013 at 10:43 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > What Ted said. k + p = 1001 will push per-task running time up quite
> > a bit. Actually, I don't think anyone has attempted that many values,
> > so I don't even have a sense of how long it would take. It should
> > still be CPU-bound regardless, though.
> >
> > A much better trade-off is to have fewer values but more precision in
> > them, with a power iteration (-q 1). The power-iteration step (ABt)
> > will definitely have a hard time multiplying with k = 1000, just
> > because of the amount of data to move around and sort.
> >
> >
> > On Tue, Jun 11, 2013 at 12:48 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Don't do that.
> > >
> > > Why do you think you need 1000 singular values?
> > >
> > > Have you tried k = 100, p = 15?
> > >
> > > Quite seriously, I would expect that you would get just as good
> > > results for almost any real application with 100 singular vectors
> > > and 900 orthogonal noise vectors.
> > >
> > >
> > > On Tue, Jun 11, 2013 at 9:39 AM, Yehia Zakaria <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > The requested rank (k) is 1000 and p is 1. The input size is
> > > > 1.2 gigabytes.
> > > >
> > > > Thanks
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 9:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > > What is the requested rank? This guy will not scale w.r.t.
> > > > > rank, only w.r.t. input size. Realistically you don't need
> > > > > k > 100, p > 15.
> > > > >
> > > > > What is the input size (A, in GB)?
> > > > >
> > > > > On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I am running Mahout SSVD (trunk version) with the pca option
> > > > > > on the Bag of Words dataset
> > > > > > (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). This
> > > > > > dataset has 8,000,000 instances (rows) and 100,000 attributes
> > > > > > (columns). Mahout SSVD is very slow; it may take days to
> > > > > > finish the first phase of SSVD (Q-Job). I am running the code
> > > > > > on a cluster of 16 machines, each with 8 cores and 32 GB of
> > > > > > memory. Moreover, the CPU and memory of the workers are not
> > > > > > utilized at all. When running Mahout SSVD on a smaller dataset
> > > > > > (12,500 rows and 5,000 columns), it is very fast; the job
> > > > > > finished in 2 minutes. Do you have any idea why Mahout SSVD
> > > > > > is so slow for high-dimensional data, and to what extent SSVD
> > > > > > can work efficiently (with respect to the number of rows and
> > > > > > columns of the input matrix)?
> > > > > >
> > > > > > Thanks,
> > > > > > Yehia
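
For reference, the parameters suggested above would look something like
this on the command line. This is an untested sketch: the paths,
reducer count, and tempDir are placeholders, and flag names can differ
between Mahout versions, so check "mahout ssvd --help" on your build:

    # Sketch: k=100 with extra oversampling and one power iteration,
    # per Ted's and Dmitriy's suggestions above.
    mahout ssvd \
      --input /path/to/input/matrix \
      --output /path/to/ssvd/output \
      --rank 100 \
      --oversampling 15 \
      --powerIter 1 \
      --pca true \
      --reduceTasks 32 \
      --tempDir /tmp/ssvd-work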

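As for the spill failure itself: if it is the large-record vs. sort
buffer problem described in MAPREDUCE-5028 (BtJob's serialized row
blocks grow with k), giving the map-side sort buffer and the task heap
more room may help. This is a guess, not a confirmed fix; the values
below are examples only, and the property names are the Hadoop 1.x ones
(MR2 uses mapreduce.task.io.sort.mb and mapreduce.map.java.opts):

    # Pass larger buffer/heap settings through the generic -D options;
    # io.sort.mb must fit inside the configured task heap.
    mahout ssvd -Dio.sort.mb=512 -Dmapred.child.java.opts=-Xmx2048m \
      --input /path/to/input/matrix --output /path/to/ssvd/output \
      --rank 100 --oversampling 15 --powerIter 1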