On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel <[email protected]> wrote:
> Dmitriy,
>
> 3) I can apply SSVD on a sample (0.1% of my data). But it fails with 100%
> of data. (Bt-job stops on a Map phase with "Java heap space" errors or
> "timeout" errors.)
> The input matrix is a sparse matrix 20,000,000 x 150,000 with ~0.03%
> non-zero values (8 GB total).

This should not happen if you use at least -Xmx1G for your MR tasks (it
looks like you do). In fact, I would be more worried about the ABt job
(since you use -q=1) -- those jobs are real memory hogs. Also, try to be a
bit less ambitious and run -k 100 first, although that would have no
measurable bearing on the memory required, only on the running time.

I also do not understand the rationale behind
-Dmapred.max.split.size=1000000; the default split size should be good
enough. But I have nothing definite to put my finger on in your
configuration. It is possible that you sometimes encounter extra-dense
vectors (a superactive user), which in your case may mean up to 20M
ratings, or 160MB per vector; but assuming -Xmx2G, k=200, and p=15, the
memory should be more than plenty.

The most useful advice, if you are set on running SVD on your data, is to
read through the operational setup here:
http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
pages 165 and on. Nathan conducted setups on inputs as big as 90 GB of very
sparse data. (I am guessing the ABt job has improved a little bit since
then, but it is still a bottleneck.)

> How I use it:
>
> ====================
> mahout-distribution-0.7/bin/mahout ssvd \
>   -i /tmp/pabramov/sparse/tfidf-vectors/ \
>   -o /tmp/pabramov/ssvd \
>   -k 200 \
>   -q 1 \
>   --reduceTasks 150 \
>   --tempDir /tmp/pabramov/tmp \
>   -Dmapred.max.split.size=1000000 \
>   -ow
> ====================
>
> Can't pass Bt-job... Should I decrease split.size and/or add extra
> params? Hadoop has 400 map and 300 reduce slots with 1 CPU core and 2 GB
> RAM per task.
> Q-job completes in 20 minutes.
>
> Many thanks in advance!
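For a sense of scale, the memory figures quoted above can be checked with quick back-of-the-envelope arithmetic. This is a hypothetical sketch, not part of the original thread: the 8-bytes-per-double figure is standard, but treating the projected block B as a dense (k+p) x n buffer is my assumption about the SSVD intermediate data, not something stated in the emails.

```python
# Rough memory estimates for the case discussed in the thread.
# Assumption (mine, not from the thread): doubles at 8 bytes each,
# and a dense (k+p) x n block for the projected matrix B.

BYTES_PER_DOUBLE = 8

# Worst-case "superactive user" vector: ~20M non-zero ratings,
# stored densely as doubles.
dense_vector_bytes = 20_000_000 * BYTES_PER_DOUBLE
print(dense_vector_bytes // 1_000_000)  # 160 (MB), matching the thread

# A dense (k+p) x n block of B for k=200, p=15, n=150,000 columns.
k, p, n = 200, 15, 150_000
b_block_bytes = (k + p) * n * BYTES_PER_DOUBLE
print(b_block_bytes // 1_000_000)  # 258 (MB)
```

Both figures fit comfortably inside a 2 GB task heap on their own, which supports the point above that the configuration, rather than the raw data size, is the more likely culprit.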
>
> Pavel
>
> ________________________________________
> From: Dmitriy Lyubimov [[email protected]]
> Sent: November 15, 2012, 21:53
> To: [email protected]
> Subject: Re: SSVD fails on seq2sparse output.
>
> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <[email protected]>
> wrote:
>
> > Many thanks in advance, any suggestion is highly appreciated. I don't
> > know what to do, CF produces inaccurate results for my tasks, SVD is
> > the only hope ))
>
> I am also doubtful about that (if you are trying to factorize your
> recommendation space). SVD has proven to be notoriously inadequate for
> that problem. ALS-WR would be a much better first stab.
>
> However, since you seem to be performing text analysis (seq2sparse), I
> don't immediately see how this is related to collaborative filtering.
> Perhaps if you told us more about your problem -- I am sure there are
> people on this list who could advise you on the best course of action.
>
> > Regards,
> > Pavel
