Also, you can compare your performance experiments to Nathan Halko's here: http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf (pp. 110 onward).
They attempted very large problems, as much as 726 splits of 512 MB each with -k 100. (The default split size is what... 64 MB?) They had trouble tuning the ABt job (as expected; it looks like they had too much memory starvation and GC thrashing to run it efficiently), but I am not sure whether that was before the performance patches for the ABt job. It looks like that problem took them almost a day to run through with -q 1, and again, mostly because of the ABt multiplication. Extremely sparse inputs will cause more trouble for ABt, whereas denser inputs are less prone to problems with q > 0.

-d

On Fri, Sep 14, 2012 at 2:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Most importantly, what's your number of non-zero elements (or input
> sequence file size)?
>
> On Fri, Sep 14, 2012 at 2:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> The Q job is actually the fastest one, and it is map-only. I'd say drop
>> all the optional parameters (including p) and use Mahout 0.7.
>>
>> Reducing the split size is unlikely to help. The default split should be
>> fine.
>>
>> I'd say running -k 10 on input of any size should result in a Q mapper
>> task that runs for a couple of minutes at most.
>>
>> Using -k 200 -p 100 is fairly ambitious (mapper task running time will
>> scale a little worse than proportionally to k+p).
>>
>> If you use -q 1, you are likely to have more problems with the ABt job,
>> and that may require some memory tuning...
>>
>> Otherwise check the usual things: memory and cluster capacity. (Do you
>> actually have the capacity to run 100 mappers? Do they have at least 1G
>> of RAM on -Xmx without touching swap? Are you seeing GC thrashing?
>> etc.)
>>
>> That said, your problem doesn't seem too big (judging from 100 mappers
>> with a regular split size, that should be ok). With -k 100 and the
>> default p you should expect a single Q task to run about 20-25 minutes,
>> depending on your hardware. It is CPU-bound (or rather, mostly
>> FPU-bound, assuming you have tackled memory issues etc.).
>>
>>
>> On Fri, Sep 14, 2012 at 1:24 PM, lei tang <[email protected]> wrote:
>>> Hi,
>>>
>>> I am using Mahout's SSVD (stochastic SVD) to factorize a huge sparse
>>> matrix (around 30M x 1M). I used a modified version of the script at
>>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>>> to store the input matrix as <key, value> pairs of integer and
>>> VectorWritable (in particular, SequentialAccessSparseVector). Should I
>>> change to RandomAccessSparseVector?
>>>
>>> I managed to run Mahout SSVD with the following specification:
>>>
>>> mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp
>>>
>>> I specified the max split size in order to get more mappers. However,
>>> the first Q job does not seem to be moving. After 1 hour, it is still
>>> at 12% with 100 mappers. Is this expected? Should I change any
>>> parameter?
>>>
>>> Any suggestion is highly appreciated.
>>>
>>> - Lei
>>> P.S. I'm also reading the docs from
>>> https://issues.apache.org/jira/browse/MAHOUT-376 in the hope that I can
>>> figure out why it is so slow.
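
For a rough sanity check of the numbers in this thread, the two rules of thumb above (one mapper per input split, and Q task time growing at least proportionally to k+p) can be sketched as below. This is only back-of-envelope arithmetic, not anything Mahout computes: the function names are made up for illustration, the split count ignores Hadoop's block and file-boundary details, and the default oversampling of p = 15 is an assumption about the ssvd defaults that is not stated in the thread.

```python
import math


def estimate_splits(input_bytes: int, split_bytes: int) -> int:
    """Approximate number of map tasks, assuming one mapper per split.

    Hypothetical helper: real Hadoop split computation also depends on
    block size and file boundaries, so treat this as an estimate.
    """
    return math.ceil(input_bytes / split_bytes)


def q_task_minutes_lower_bound(k, p, ref_minutes=(20.0, 25.0),
                               ref_k=100, ref_p=15):
    """Scale the quoted 20-25 minute Q task figure linearly in (k + p).

    The reference point (-k 100, default p) comes from the thread;
    ref_p=15 is an ASSUMED default for -p. The thread says scaling is
    "a little worse than proportional", so the result is a lower bound.
    """
    factor = (k + p) / (ref_k + ref_p)
    return tuple(minutes * factor for minutes in ref_minutes)


if __name__ == "__main__":
    mb = 1024 * 1024
    # An input the size of 726 splits of 512 MB maps back to 726 splits.
    print(estimate_splits(726 * 512 * mb, 512 * mb))
    # Lower bound on single Q task time for -k 200 -p 100.
    low, high = q_task_minutes_lower_bound(200, 100)
    print(f"at least {low:.0f}-{high:.0f} minutes per Q task")
```

Under these assumptions, -k 200 -p 100 costs at least (200+100)/(100+15), or about 2.6 times, the per-task time of -k 100 with the default p, before any memory or GC problems are taken into account.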
