Hi, I am using Mahout's SSVD (stochastic SVD) to factorize a huge sparse matrix (around 30M x 1M). I used a modified version of the script from http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html to store the input matrix as <key, value> pairs, with integer keys and VectorWritable values (in particular, SequentialAccessSparseVector). Should I switch to RandomAccessSparseVector?
I managed to run Mahout SSVD with the following command:

    mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp

I set the max split size in order to get more mappers. However, the first Q-job does not seem to be making progress: after 1 hour it is still at 12%, with 100 mappers. Is this expected? Should I change any parameters? Any suggestion is highly appreciated.

- Lei

P.S. I'm also reading the docs attached to https://issues.apache.org/jira/browse/MAHOUT-376 in the hope of figuring out why it is so slow.
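
For reference, here is a rough back-of-the-envelope sizing of the parameters in the command above. It is only a sketch: it assumes that each block is held as a dense matrix of 8-byte doubles with -r rows and k + p columns, which I have not verified against the SSVD implementation. The function name dense_block_bytes is my own, not part of Mahout.

```python
def dense_block_bytes(block_height, k, p, bytes_per_value=8):
    """Size in bytes of one dense block_height x (k + p) block.

    Assumption: the block is stored densely as 8-byte doubles.
    """
    return block_height * (k + p) * bytes_per_value


# Values taken from the command line: -k 200 -p 100 -r 100000
k, p, r = 200, 100, 100_000

size = dense_block_bytes(r, k, p)
print(f"k + p = {k + p} columns")
print(f"one {r} x {k + p} dense block = {size / 1e6:.0f} MB")
```

If that assumption holds, each mapper would need room for a block of this size on top of its input split, so a smaller -r might be worth trying.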
