I ran it with 8 map and 4 reduce tasks allowed per machine using -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM).
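In mapred-site.xml terms that corresponds to roughly the following on each tasktracker (just a sketch, using the pre-YARN/CDH3 property names, values per the slot counts above):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- per-task heap is passed per job via -Dmapred.child.java.opts=-Xmx1024m -->

Worst case that is (8 + 4) tasks x 1 GB = 12 GB of task heap per 32 GB node, so we should be well clear of the swap issue you mention below.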
Most of the time was spent in QRReducer, not in ABtMapper. I was promised better monitoring for next week; I'll rerun the tests then and hopefully can provide more informative results.

--sebastian

On 18.12.2011 04:27, Dmitriy Lyubimov wrote:
> also make sure you are not hitting swap. If you specify 3G per task,
> and you have typical quad core commodity nodes, it probably means that
> you have at least 4 map / 4 reduce tasks configured. that's 8 tasks
> total, unless you have ~32G ram on these machines, i think you are in
> danger of hitting swap with settings that high when GC gets all
> excited and all. I never thought i would say that, but maybe you can
> run it a little bit more conservatively.
>
> my cluster is cdh3u0.
>
> -Dmitriy
>
> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> Sebastian,
>>
>> I think my commit got screwed up somehow. Here's a benchmark very close
>> to yours (I also added a broadcast option for B thru the distributed cache,
>> which seems to have been a 30% win in my situation, ok, i leave it on
>> by default).
>>
>> input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total size
>>
>> -rw-r--r--   3 dmitriy supergroup 2353201134 2011-12-17 17:17 /user/dmitriy/drm-sparse.seq
>>
>> This is probably a little smaller than your wikipedia test in terms of
>> dimensions but not in size.
>>
>> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT
>> --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1
>> -Dmapred.child.java.opts="-Xmx800m"
>>
>> so this is for 10 useful eigenvalues and U,V with 1 power iteration:
>>
>> with broadcast option:
>> QJob    1 min 11 s
>> BtJob   1 min 58 s
>> AB' Job 2 min  6 s
>> B' Job  1 min 51 s
>>
>> U Job - 0 min 12 s, V Job - 20 s (U and V run in parallel, not sequentially)
>>
>> without broadcast option (only affects the AB' step):
>>
>> QJob    1 min 12 s
>> BtJob   1 min 59 s
>> AB' Job 3 min  3 s
>> B' Job  2 min  0 s
>>
>> 10 nodes, 4 cpus each. (I had 35 splits of the original input, so everything
>> got allocated.) The cluster also runs our QA stuff, starting something else
>> almost every minute, so while not heavily loaded, the job/task trackers are
>> fairly busy setting up and tearing down.
>>
>> I'll commit shortly.
>>
>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[email protected]> wrote:
>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
>>>> Interesting.
>>>>
>>>> Well so how did your decomposing go?
>>>
>>> I tested the decomposition of the wikipedia pagelink graph (130M edges,
>>> 5.6M vertices, making approx. a quarter of a billion non-zeros in the
>>> symmetric adjacency matrix) on a 6-machine hadoop cluster.
>>>
>>> Got these running times for k = 10, p = 5 and one power iteration:
>>>
>>> Q-job     1 min, 41 sec
>>> Bt-job    9 min, 30 sec
>>> ABt-job  37 min, 22 sec
>>> Bt-job    9 min, 41 sec
>>> U-job            30 sec
>>>
>>> I think I'd need a couple more machines to handle the twitter graph
>>> though...
>>>
>>> --sebastian
>>>
>>>
>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I played with Mahout to decompose the adjacency matrices of large graphs
>>>>> lately. I stumbled on a paper by Christos Faloutsos that describes a
>>>>> variation of the Lanczos algorithm they use for this on top of Hadoop.
>>>>> They even explicitly mention Mahout:
>>>>>
>>>>> "Very recently (March 2010), the Mahout project [2] provides
>>>>> SVD on top of HADOOP. Due to insufficient documentation, we were not
>>>>> able to find the input format and run a head-to-head comparison. But,
>>>>> reading the source code, we discovered that Mahout suffers from two
>>>>> major issues: (a) it assumes that the vector (b, with n=O(billion)
>>>>> entries) fits in the memory of a single machine, and (b) it implements
>>>>> the full re-orthogonalization which is inefficient."
>>>>>
>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
>>>>>
>>>>> --sebastian
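For reference, the wikipedia run quoted above (k = 10, p = 5, one power iteration) maps onto the same CLI flags as Dmitriy's command. A sketch only; the paths and the -t (reduce tasks) value are placeholders:

    # paths and -t are placeholders; leaving out -br keeps the
    # B broadcast enabled (the default after Dmitriy's commit)
    ssvd -i /user/sebastian/wikipedia-pagelinks.seq -o /user/sebastian/SSVD-OUT \
      --tempDir /user/sebastian/temp -ow -k 10 -p 5 -q 1 -t 24 \
      -Dmapred.child.java.opts="-Xmx1024m"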
