On Mon, Jun 10, 2013 at 11:28 AM, Dmitriy Lyubimov <[email protected]> wrote:

> What is the requested rank? This guy will not scale w.r.t. rank, only w.r.t.
> input size. Realistically you don't need k > 100, p > 15.
>
> What is the input size (A in GB)?
>
> On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <[email protected]> wrote:
>
>> Hi All
>>
>> I am running Mahout SSVD (trunk version) using the pca option on the Bag of
>> Words dataset (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). This
>> dataset has 8000000 instances (rows) and 100000 attributes (columns).
>> Mahout SSVD is too slow; it may take days to finish the first phase of
>> SSVD (Q-Job). I am running the code on a cluster of 16 machines, each with
>> 8 cores and 32 GB of memory. Moreover, the CPU and memory of the workers
>> are not utilized at all.

Also: this is suspicious. It is a CPU-bound job (memory requirements are
quite modest, though).

If your data are extremely sparse, and/or your Hadoop input splits are large
enough that a map task receives more rows than what is specified by -r
(default 30,000), then it spills Q blocks to disk for the second pass. This
may be more data if the requested k is greater than the average number of
non-zero elements per row. If you have enough memory, just bump up -r (or use
smaller Hadoop splits).

But the single most important thing is still (k+p). The CPU flops scale at
about O((k+p)^1.5). Since Hadoop splits linearly with respect to input, it is
not possible to split w.r.t. the flop increase commanded by (k+p) without
additional custom splitting tricks.

Don't use (k+p) > 100. Seriously. Especially for LSA. Whatever you do, your
LSA input will already be sufficiently varying w.r.t. general human knowledge
about the concepts in it, so approximate inference is quite sufficient here
IMO.

>> While running Mahout SSVD on a smaller dataset (12500 rows and 5000
>> columns), it runs very fast; the job finished in 2 minutes. Do you have
>> any idea why Mahout SSVD is so slow for high dimensional data, and to
>> what extent SSVD can work efficiently (with respect to the number of
>> rows and columns of the input matrix)?
>>
>> Thanks
>> Yehia
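
To put rough numbers on the two points in the reply above (the (k+p) scaling and the -r spill condition), here is a minimal back-of-the-envelope sketch. It is not Mahout code. The function names and the reference value of (k+p) = 100 are made up for illustration, the 1.5 exponent is just the approximate figure quoted in this thread, and -r is treated as a row count per map task.

```python
def relative_cpu_cost(k, p, ref_kp=100):
    """Rough CPU cost at rank k and oversampling p, relative to (k+p) = ref_kp.

    Uses the approximate O((k+p)^1.5) scaling quoted in the thread; the
    reference point ref_kp is an illustrative assumption, not a Mahout setting.
    """
    return ((k + p) / ref_kp) ** 1.5


def may_spill_q_blocks(rows_per_split, block_height=30000):
    """True if a map task would receive more rows than -r (block height).

    Per the thread, this is the case in which Q blocks are spilled to disk
    for the second pass. Treating -r as a row count is an assumption here.
    """
    return rows_per_split > block_height


if __name__ == "__main__":
    # How much more CPU work larger ranks imply, relative to (k+p) = 100.
    for k in (35, 85, 185, 385):
        print(f"k={k:4d} p=15 -> {relative_cpu_cost(k, 15):.2f}x")

    # Hypothetical split sizes, in rows per map task.
    for rows in (10_000, 30_000, 250_000):
        print(f"{rows:7d} rows/split -> may spill: {may_spill_q_blocks(rows)}")
```

Under this scaling, quadrupling (k+p) costs about eight times the CPU work, and adding more input splits does not spread that extra work out, which is why keeping (k+p) small matters more than cluster size.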
