Ok, the reducer then, hm. Then it's probably not memory: the pressure stays high mostly in the mappers, and the reducer actually doesn't use much memory at all. The problem I previously had was with memory pressure and load time in the AB' mappers, but the reducers were always fine; copy + sort + run has always taken about a minute there for this input, and nothing much has changed there.

Just in case, please check how many records the map phase outputs. If it is more than 1 per mapper, please note by how much and increase the -abth option to, say, 300,000, or until you no longer see multiple records output by the map tasks. Memory-wise it should be fine: it is still only roughly 24-36MB in total for k+p=15 in your case ((k+p) x abth x 8 bytes, packed into long swathes of double[], so there is no per-element object overhead). Although with the 922 patch it should not matter much; that was the point.
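A quick back-of-envelope check of that sizing (purely illustrative shell arithmetic; the 200,000 figure is only there as a point of comparison, not necessarily your current setting):

$ echo "abth=200000: $(( 15 * 200000 * 8 )) bytes, abth=300000: $(( 15 * 300000 * 8 )) bytes"
abth=200000: 24000000 bytes, abth=300000: 36000000 bytes

i.e. roughly 24MB at a 200,000 block height versus 36MB after the bump to 300,000; tiny either way next to the task heap.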
If you can, please notice whether it is the sort, the copy, or the reducer itself. If it is the sort, you may need to tweak the sort buffer size, especially if you didn't set up many reducers per job (but I assume you set the number of reducers to at least 20). By default those buffers are just 100m, which may turn out to be below the block size; I am not sure exactly how that would work out in this case. I think I bumped that up a little quite some time ago.
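If the sort does turn out to be the bottleneck, something along these lines is what I mean, reusing your k/p/heap settings (just a sketch: the 200m value is a guess to get the buffer above the block size, and I am assuming io.sort.mb, the stock Hadoop map-side sort buffer property with its 100m default, is the buffer in question; the -D properties ride through to the job config the same way the -Xmx one does in my run quoted below):

ssvd -i <your input> -o <output dir> --tempDir <temp dir> -ow \
  -k 10 -p 5 -q 1 -t 20 \
  -Dio.sort.mb=200 -Dmapred.child.java.opts=-Xmx1024m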
What else may be different? I am on CDH3u0 here. Your input seems to be about 25% wider, which may affect the AB' job map time in particular by at most that much (which in my case would be about a 15s difference), but hopefully not much else.

My setup is actually pretty meager: we have 10 quad-CPU nodes with 16G RAM each, and they are already running HBase on top of that, which is why I mostly avoid running jobs much over 500m of memory there, or I start swapping to disk. So memory doesn't seem to be the factor. (But my cluster is a bare-bones assembly sitting in the same rack, which is a big plus.) I also currently allow 4x4 map/reduce tasks per node.

If you can point me to where I can download your exact input (with no or minimal prep steps; I don't think I can spend much energy on that while at the office), I may try running on your exact input as well.

------

On a side note, I found an error in my work on broadcasting via the distributed cache. After patching it up, I actually see no difference whatsoever between direct streaming and using the distributed cache, just as I initially figured. My timing for the AB' multiplication hovers around 3 min 10 s, distributed cache or streaming. Since I did that patch anyway, I will be offering it for commit, but I will probably have to put it on the review board as it touches some iterator code that Sean previously crafted. The branch with the broadcast option enabled is called MAHOUT-922-2.

On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[email protected]> wrote:
> I ran it with 8 map and 4 reduce tasks allowed per machine using
> -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM).
>
> Most of the time was spent in QRReducer, not in ABtMapper. I was promised
> better monitoring next week; I'll rerun the tests and hopefully can
> provide more informative results.
>
> --sebastian
>
> On 18.12.2011 04:27, Dmitriy Lyubimov wrote:
>> Also make sure you are not hitting swap. If you specify 3G per task,
>> and you have typical quad-core commodity nodes, it probably means that
>> you have at least 4 map / 4 reduce tasks configured. That's 8 tasks
>> total; unless you have ~32G RAM on these machines, I think you are in
>> danger of hitting swap with settings that high when GC gets all
>> excited. I never thought I would say that, but maybe you can
>> run it a little bit more conservatively.
>>
>> My cluster is cdh3u0.
>>
>> -Dmitriy
>>
>> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Sebastian,
>>>
>>> I think my commit got screwed up somehow. Here's a benchmark very close
>>> to yours (I also added a broadcast option for B through the distributed
>>> cache, which seems to have been a 30% win in my situation, so I leave it
>>> on by default).
>>>
>>> Input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total
>>> size:
>>>
>>> -rw-r--r-- 3 dmitriy supergroup 2353201134 2011-12-17 17:17
>>> /user/dmitriy/drm-sparse.seq
>>>
>>> This is probably a little smaller than your wikipedia test in terms of
>>> dimensions but not in size.
>>>
>>> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT
>>> --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1
>>> -Dmapred.child.java.opts="-Xmx800m"
>>>
>>> So this is for 10 useful eigenvalues and U, V with 1 power iteration:
>>>
>>> With broadcast option:
>>> QJob 1 min 11 s
>>> BtJob 1 min 58 s
>>> AB' Job 2 min 6 s
>>> B'Job 1 min 51 s
>>>
>>> U Job - 0 min 12 s, V Job - 20 s (U and V run in parallel, not sequentially)
>>>
>>> Without broadcast option (only affects the AB' step):
>>>
>>> QJob 1 min 12 s
>>> BtJob 1 min 59 s
>>> AB' Job 3 min 3 s
>>> B'Job 2 min 0 s
>>>
>>> 10 nodes, 4 CPUs each. (I had 35 splits of original input, so everything
>>> got allocated.) The cluster also runs our QA stuff, starting something
>>> else almost every minute; with frequent other jobs, the job/task trackers
>>> are fairly busy setting up/tearing down even when not loaded.
>>>
>>> I'll commit shortly.
>>>
>>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[email protected]> wrote:
>>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
>>>>> Interesting.
>>>>>
>>>>> Well, so how did your decomposition go?
>>>>
>>>> I tested the decomposition of the wikipedia pagelink graph (130M edges,
>>>> 5.6M vertices, making approx. a quarter of a billion non-zeros in the
>>>> symmetric adjacency matrix) on a 6-machine hadoop cluster.
>>>>
>>>> Got these running times for k = 10, p = 5 and one power iteration:
>>>>
>>>> Q-job 1 min, 41 sec
>>>> Bt-job 9 min, 30 sec
>>>> ABt-job 37 min, 22 sec
>>>> Bt-job 9 min, 41 sec
>>>> U-job 30 sec
>>>>
>>>> I think I'd need a couple more machines to handle the twitter graph
>>>> though...
>>>>
>>>> --sebastian
>>>>
>>>>
>>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I played with Mahout to decompose the adjacency matrices of large
>>>>>> graphs lately. I stumbled on a paper by Christos Faloutsos that
>>>>>> describes a variation of the Lanczos algorithm they use for this on
>>>>>> top of Hadoop. They even explicitly mention Mahout:
>>>>>>
>>>>>> "Very recently (March 2010), the Mahout project [2] provides
>>>>>> SVD on top of HADOOP. Due to insufficient documentation, we were not
>>>>>> able to find the input format and run a head-to-head comparison. But,
>>>>>> reading the source code, we discovered that Mahout suffers from two
>>>>>> major issues: (a) it assumes that the vector (b, with n=O(billion)
>>>>>> entries) fits in the memory of a single machine, and (b) it implements
>>>>>> the full re-orthogonalization which is inefficient."
>>>>>>
>>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>
>>>>
>
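P.S. For scale, on point (a) in the quoted paper excerpt: a dense vector of a billion doubles already needs on the order of 8GB on a single machine before any copies (again just shell arithmetic, for illustration):

$ echo "$(( 1000000000 * 8 )) bytes"
8000000000 bytes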
