What Ted said. k+p=1001 will increase per-task running time quite a bit. Actually, I don't think anyone has attempted that many values, so I don't even have a sense of how long it will take. It should still be CPU-bound regardless, though.
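For a rough sense of scale, here is a back-of-the-envelope sketch comparing k=1000, p=1 with k=100, p=15 on the 8,000,000-row input reported in this thread. It assumes the projected intermediate is a dense m x (k+p) block of 8-byte doubles and that orthogonalizing it costs about 2*m*(k+p)^2 flops (the textbook figure for a tall QR). Both are my assumptions for illustration, not measurements of the actual Mahout jobs, and the function name is made up.

```python
# Back-of-the-envelope sizing of the dense m x (k+p) intermediate.
# Assumptions (not measured on Mahout): dense 8-byte doubles,
# and a QR cost of roughly 2 * m * (k+p)^2 flops for a tall matrix.

def projection_cost(m, k, p, bytes_per_value=8):
    """Return (width, dense size in GB, approximate QR flops) for an m x (k+p) block."""
    width = k + p
    size_gb = m * width * bytes_per_value / 1e9
    qr_flops = 2 * m * width ** 2
    return width, size_gb, qr_flops

M_ROWS = 8_000_000  # rows of the Bag of Words input, as reported in the thread

big = projection_cost(M_ROWS, k=1000, p=1)
small = projection_cost(M_ROWS, k=100, p=15)

for label, (width, size_gb, flops) in (("k=1000, p=1", big), ("k=100, p=15", small)):
    print(f"{label}: width={width}, dense size ~{size_gb:.1f} GB, QR ~{flops:.2e} flops")

print(f"size ratio  ~{big[1] / small[1]:.1f}x")
print(f"flops ratio ~{big[2] / small[2]:.1f}x")
```

Under these assumptions the data volume grows linearly with k+p while the orthogonalization work grows quadratically, which is why the rank matters much more than the input size here.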
A much better trade-off is to request fewer values but get more precision in them with a power iteration (-q 1). The power iteration step (ABt) will definitely have a hard time multiplying with k=1000, just because of the amount of data to move around and sort.

On Tue, Jun 11, 2013 at 12:48 AM, Ted Dunning <[email protected]> wrote:

> Don't do that.
>
> Why do you think you need 1000 singular values?
>
> Have you tried with k=100, p=15?
>
> Quite serious, I would expect that you would literally get just as good
> results for almost any real application with 100 singular vectors and 900
> orthogonal noise vectors.
>
>
> On Tue, Jun 11, 2013 at 9:39 AM, Yehia Zakaria <[email protected]> wrote:
>
> > Hi
> >
> > The requested rank (k) is 1000 and p is 1. The input size is 1.2
> > gigabyte.
> >
> > Thanks
> >
> >
> > On Mon, Jun 10, 2013 at 9:28 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > what is requested rank? This guy will not scale w.r.t rank, only w.r.t
> > > input size. Reallistically you don't need k>100, p >15.
> > >
> > > What is the input size (A in Gb?)
> > >
> > >
> > > On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <[email protected]>
> > > wrote:
> > >
> > > > Hi All
> > > >
> > > > I am running Mahout SSVD (trunk version) using pca option on Bag of
> > > > Words dataset (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
> > > > This dataset have 8000000 instances (rows) and 100000 attributes
> > > > (columns). Mahout SSVD is too slow, it may take days to finish the
> > > > first phase of SSVD (Q-Job). I am running the code on a cluster of
> > > > 16 machines, each one is 8 cores and 32 GB memory. Moreover, the CPU
> > > > and memory of the workers are not utilized at all. While running
> > > > Mahout SSVD on smaller dataset (12500 rows and 5000 columns), it
> > > > runs too fast, the job was finished in 2 minutes. Do you have any
> > > > idea why Mahout SSVD is too slow for high dimensional data? and to
> > > > what extent that SSVD can work efficiently (with respect to the
> > > > number of rows and columns of the input matrix)?
> > > >
> > > > Thanks
> > > > Yehia
