Yeah, it sounds like something is wrong. 300MB is not huge. I have run problems of around 2G of input on 10 nodes and it doesn't take that long at all. Another researcher I knew was doing something similar with, I think, in the vicinity of 4-5B non-zeros.
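Concretely, the stripped-down run suggested below (same input, output and -k, keeping -U/-V/-t, dropping the split-size override along with -p and -r so the defaults kick in) would look something like this -- a sketch rather than a tested command:

  # defaults for p, r and split size; everything else as in the original run
  mahout ssvd \
    -i mf/tr_full.seq -o mf/out_full \
    -k 200 \
    -U true -V true -t 20 \
    --tempDir mf/tmp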
On Fri, Sep 14, 2012 at 2:41 PM, lei tang <[email protected]> wrote:
> there are around 100M non-zero entries. The sequence file size is not that
> huge, around 300M bytes.
>
> i'll check out your other options to see what is wrong.
>
> - Lei
>
> On Fri, Sep 14, 2012 at 2:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> most importantly, what's your number of non-zero elements (or input
>> sequence file size)?
>>
>> On Fri, Sep 14, 2012 at 2:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> > The Q job is actually the fastest and map-only. I'd say you drop all the
>> > optional parameters (including p) and use Mahout 0.7.
>> >
>> > Actually, reducing the split size is unlikely to help. The default split
>> > should be fine.
>> >
>> > i'd say running -k 10 on input of any size should result in a Q mapper
>> > task running in at most a couple of minutes.
>> >
>> > Using -k 200 -p 100 is fairly ambitious (mapper task running time will
>> > scale a little worse than proportionally to k+p).
>> >
>> > If you use -q 1 you will likely have more problems with the ABt job, and
>> > that may require some memory tuning...
>> >
>> > Otherwise, check the usual things -- memory, cluster capacity (do you
>> > actually have the capacity to run 100 mappers? Do they have at least 1G
>> > of RAM on -Xmx without hitting swap? Are you seeing GC thrashing? etc.)
>> >
>> > That said, your problem doesn't seem too big (judging from 100 mappers
>> > with a regular split size, that should be ok). With -k 100 and the default
>> > p you should expect a single Q task to run about 20-25 minutes,
>> > depending on your hardware. It is cpu-bound (or rather, mostly
>> > fpu-bound, assuming you have tackled memory issues etc.)
>> >
>> > On Fri, Sep 14, 2012 at 1:24 PM, lei tang <[email protected]> wrote:
>> >> Hi,
>> >>
>> >> I am using Mahout's SSVD (stochastic SVD) to factorize a huge sparse
>> >> matrix (around 30M x 1M). I used a modified version of the script from
>> >> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>> >> to store the input matrix as <key, value> pairs of integer and
>> >> VectorWritable (in particular, SequentialAccessSparseVector). Should I
>> >> change to RandomAccessSparseVector?
>> >>
>> >> I managed to run Mahout SSVD with the following specification:
>> >>
>> >> mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o
>> >> mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp
>> >>
>> >> I specified the max split size in order to get more mappers. However, the
>> >> first Q job does not seem to be moving. After 1 hour it is still at 12% with
>> >> 100 mappers. Is this expected? Should I change any parameter?
>> >>
>> >> Any suggestion is highly appreciated.
>> >>
>> >> - Lei
>> >>
>> >> P.S. I'm also reading the docs from
>> >> https://issues.apache.org/jira/browse/MAHOUT-376 in the hope that I can
>> >> figure out why it is so slow.
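If the memory check in the thread above (at least 1G on -Xmx per mapper, no swapping, no GC thrashing) turns out to be the issue, the mapper heap can be raised on the same command line through the generic -D mechanism, the same way the split-size override was passed. A sketch, assuming the MR1-era property name mapred.child.java.opts (newer Hadoop versions split this into separate map/reduce task properties):

  # bump the per-task JVM heap to 1G; property name assumes classic MapReduce (MR1)
  mahout ssvd -Dmapred.child.java.opts=-Xmx1024m \
    -i mf/tr_full.seq -o mf/out_full \
    -k 200 -U true -V true -t 20 --tempDir mf/tmp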
