Also, one of the reasons the AB' job runs slower than the B' job is that we underload the reduce phase. The AB' job does roughly the same amount of flops as the B' job, but it splits its work more evenly between mappers and reducers. We are usually subconsciously accustomed to a pattern where reducers deal with already-aggregated results and do little more than save them into long files, but that's not the case with the AB' job. Ideally, if we load the reducers there to a maximum of 1 task per core (or, as Hadoop recommends, 95% of the available cores), we should approach the running time of the B' job. There's little effect on the B' job itself, though; it's all map-side.
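To be concrete, a sketch of what I mean (not tested as written; the paths are placeholders and 34 reducers is just an illustration for a hypothetical 36-core cluster, the point is only that --reduceTasks should sit near 95% of the cores you can give the reduce phase):

  hadoop jar mahout-core-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli \
    -Dmapred.child.java.opts=-Xmx1024m \
    --input <adjacency matrix on hdfs> --output <output dir> --tempDir <temp dir> \
    --rank 10 --oversampling 5 --powerIter 1 \
    --reduceTasks 34

This mirrors the --reduceTasks 24 run in the quoted thread below, just scaled to whatever your cluster can actually hold at once.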
Once I ran my test with # of mappers == # of reducers (35, vs. 20 previously), the AB' job running time dropped by another ~30 seconds on average (about a 15% reduction). So in your case you probably need to scale up the reduce phases as well and run as many reducers as Hadoop recommends (95% of your cluster's core capacity; watch memory, though -Xmx1024m should be plenty). Another thing I now remember I have enabled on my cluster is native LZO compression for MR jobs; in theory it helps the sort phase a little bit (see the P.S. below the quoted thread for a sketch of the relevant settings). Hope that helps.

-Dmitriy

On Mon, Dec 19, 2011 at 8:34 AM, Dmitriy Lyubimov <[email protected]> wrote:
> Oh that's a nice hardware you have out there then. Then you can indeed try
> 16 map tasks per node then if you have some CPU time to spare. I don't know
> if you will have enough slots to run all them at once though but that's
> nice. At this capacity you may probably even see some effects of having
> distributed cache enabled (not yet committed, MAHOUT-922-2 branch).
>
> On Dec 19, 2011 8:28 AM, "Sebastian Schelter" <[email protected]> wrote:
>>
>> I have 16 cores per machine, I'll try to decrease Xmx a little and see
>> if I can run more tasks.
>>
>> On 19.12.2011 17:14, Dmitriy Lyubimov wrote:
>> > No problem. We actually got one patch done that was useful for extra
>> > sparse
>> > cases, that's good. I think you still can achieve extra performance if
>> > you
>> > constrain to 1 task per core and have enough cores for all tasks to run
>> > at
>> > once.
>> > On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[email protected]> wrote:
>> >
>> >> Hi Dmitriy,
>> >>
>> >> I finally found the problem... I did not set --reduceTasks, I ran the
>> >> job again with 24 reduce tasks and was able to run the whole
>> >> decomposition in less than 10 minutes. Sorry for causing you so much
>> >> work.
>> >>
>> >> --sebastian
>> >>
>> >> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar
>> >> org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli
>> >> -Dmapred.tasktracker.map.tasks.maximum=8
>> >> -Dmapred.tasktracker.reduce.tasks.maximum=6
>> >> -Dmapred.child.java.opts=-Xmx1024m --input
>> >> hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7
>> >> --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV
>> >> false --powerIter 1 --reduceTasks 24
>> >>
>> >> Q-Job 1mins, 49sec
>> >> Bt-Job 2mins, 39sec
>> >> ABt-Job 3mins, 51sec
>> >> Bt-Job 2mins, 41sec
>> >> U-Job 29sec
>> >>
>> >> On 19.12.2011 09:09, Dmitriy Lyubimov wrote:
>> >>> Ok reducer then, hm. Then it's probably not the memory, for the
>> >>> pressure holds high mostly in the mapper only. Reducer actually
>> >>> doesn't use much memory at all. The problem i previously had was with
>> >>> memory pressure and load time in the AB' mappers but reducers were
>> >>> always ok, copy+sort+run has always taken about a minute there for
>> >>> this input, nothing changed much there still. --> just in case, please
>> >>> check how many records map phase outputs. If it is more than 1 per
>> >>> mapper, please note by how much and increase -abth option to say
>> >>> 300,000 or until you don't see the multiple records output by map
>> >>> tasks. it should be ok, it is still roughly 24mb for everything for
>> >>> k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[]
>> >>> swathes, so it doesn't take any object overhead so much). Although
>> >>> with 922 patch it should not matter much, that was the point.
>> >>>
>> >>> If you can, please notice if it is a sort, copy or the reducer itself?
>> >>> If it is the sort, you may need to tweak sort buffer size, especially
>> >>> if you didn't set up too many reducers per job (but i assume you set
>> >>> num reducers at at least 20). By default, those buffers are just 100m
>> >>> which may turn out below block size, i am not sure how exactly it
>> >>> would work out in this case then. i think i bumped that up a little
>> >>> some long time ago.
>> >>>
>> >>>
>> >>> What else may be different different? i am at CDH3u0 here. your input
>> >>> seems to be 25% wider, which may affect AB' job map time in particular
>> >>> by at most that, which in my case would constitute about 15s
>> >>> difference, but hopefully not so much anything else.
>> >>>
>> >>> My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G
>> >>> RAM each and they are already running hbase on top of it. which is why
>> >>> i mostly avoid running jobs much over 500m mem there, or i start
>> >>> scratching something on the disk. So memory doesn't seem to be the
>> >>> factor. (But my cluster is a barebone assembly sitting in the same
>> >>> rack which is a big plus). I also currently allow 4x4 map/reduce tasks
>> >>> per node.
>> >>>
>> >>> if you can point me where i can download your input exactly (without
>> >>> or with minimum prep steps, i think i can't spend much energy on that
>> >>> while at the office), i may try to try on your exact input as well.
>> >>>
>> >>>
>> >>> ------
>> >>> On a side note i found an error with my work on broadcasting via
>> >>> distributed cache. After patching it up, i actually see no difference
>> >>> whatsoever between direct streaming and using distributed cache, just
>> >>> like i initially figured. My timing for AB' multiplication revolves
>> >>> around 3 min 10 s, distributed cache or streaming. Since i did that
>> >>> patch anyway, i will be offering it for commit but i will probably
>> >>> have to put it on the review board as it touches some iterator stuff
>> >>> that Sean previously crafted. the branch name with broadcast option
>> >>> enabled is called MAHOUT-922-2.
>> >>>
>> >>>
>> >>>
>> >>> On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[email protected]>
>> >> wrote:
>> >>>> I ran it with 8 map and 4 reduce tasks allowed per machine using
>> >>>> -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM).
>> >>>>
>> >>>> Most of the time was spent in QRReducer not in ABtMapper. I was
>> >>>> promised
>> >>>> to get better monitoring next week, I'll rerun the tests and
>> >>>> hopefully
>> >>>> can provide more informative results.
>> >>>>
>> >>>> --sebastian
>> >>>>
>> >>>> On 18.12.2011 04:27, Dmitriy Lyubimov wrote:
>> >>>>> also make sure you are not hitting swap. If you specifiy 3G per
>> >>>>> task,
>> >>>>> and you have typical quad core commodity nodes, it probably means
>> >>>>> that
>> >>>>> you have at least 4 map / 4 reduce tasks configured. that's 8 task
>> >>>>> total, unless you have ~32G ram on these machines, i think you are
>> >>>>> in
>> >>>>> danger of hitting swap with settings that high when GC gets all
>> >>>>> excited and all. I never though i would say that, but maybe you can
>> >>>>> run it a little bit more conservatively.
>> >>>>>
>> >>>>> my cluster is cdh3u0.
>> >>>>>
>> >>>>> -Dmitriy
>> >>>>>
>> >>>>> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov
>> >>>>> <[email protected]>
>> >> wrote:
>> >>>>>> Sebastian,
>> >>>>>>
>> >>>>>> I think my commit got screwed somehow. Here's the benchmark very
>> >>>>>> close
>> >>>>>> to yours (I also added broadcast option for B thru distributed
>> >>>>>> cache
>> >>>>>> which seems to have been a 30% win in my situation, ok, i leave it
>> >>>>>> on
>> >>>>>> by default).
>> >>>>>>
>> >>>>>> input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row,
>> >> total size
>> >>>>>>
>> >>>>>> -rw-r--r-- 3 dmitriy supergroup 2353201134 2011-12-17 17:17
>> >>>>>> /user/dmitriy/drm-sparse.seq
>> >>>>>>
>> >>>>>> This is probably a little smaller than your wikipedia test in terms
>> >>>>>> of
>> >>>>>> dimensions but not in size.
>> >>>>>>
>> >>>>>> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT
>> >>>>>> --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1
>> >>>>>> -Dmapred.child.java.opts="-Xmx800m"
>> >>>>>>
>> >>>>>> so this is for 10 useful eigenvalues and U,V with 1 power
>> >>>>>> iteration:
>> >>>>>>
>> >>>>>> with broadcast option:
>> >>>>>> QJob 1 min 11s
>> >>>>>> BtJob 1 min 58s
>> >>>>>> AB' Job 2 min 6s
>> >>>>>> B'Job 1 min 51 s
>> >>>>>>
>> >>>>>> U Job - 0m 12 s, V job - 20 s (U and V run in parallel, not
>> >> sequential)
>> >>>>>>
>> >>>>>> without broadcast option (only affects AB' step)
>> >>>>>>
>> >>>>>> QJob 1 min 12s
>> >>>>>> BtJob 1 min 59s
>> >>>>>> AB' Job 3 min 3s
>> >>>>>> B'Job 2 min 0 s
>> >>>>>>
>> >>>>>> 10 nodes 4 cpu each. (I had 35 splits of original input, so
>> >>>>>> everything
>> >>>>>> got allocated). It also runs our QA stuff starting something else
>> >>>>>> almost every minute. frequent other jobs so while not loaded,
>> >>>>>> job/task
>> >>>>>> trackers are fairly busy setting up/tearing down.
>> >>>>>>
>> >>>>>> I'll commit shortly.
>> >>>>>>
>> >>>>>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter
>> >>>>>> <[email protected]>
>> >> wrote:
>> >>>>>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
>> >>>>>>>> Interesting.
>> >>>>>>>>
>> >>>>>>>> Well so how did your decomposing go?
>> >>>>>>>
>> >>>>>>> I tested the decomposition of the wikipedia pagelink graph (130M
>> >> edges,
>> >>>>>>> 5.6M vertices making approx. quarter of a billion non-zeros in the
>> >>>>>>> symmetric adjacency matrix) on a 6 machine hadoop cluster.
>> >>>>>>>
>> >>>>>>> Got these running times for k = 10, p = 5 and one power-iteration:
>> >>>>>>>
>> >>>>>>> Q-job 1mins, 41sec
>> >>>>>>> Bt-job 9mins, 30sec
>> >>>>>>> ABt-job 37mins, 22sec
>> >>>>>>> Bt-job 9mins, 41sec
>> >>>>>>> U-job 30sec
>> >>>>>>>
>> >>>>>>> I think I'd need a couple more machines to handle the twitter
>> >>>>>>> graph
>> >>>>>>> though...
>> >>>>>>>
>> >>>>>>> --sebastian
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <
>> >> [email protected]>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi there,
>> >>>>>>>>>
>> >>>>>>>>> I played with Mahout to decompose the adjacency matrices of
>> >>>>>>>>> large
>> >> graphs
>> >>>>>>>>> lately. I stumbled on a paper of Christos Faloutsos that
>> >>>>>>>>> describes
>> >> a
>> >>>>>>>>> variation of the Lanczos algorithm they use for this on top of
>> >> Hadoop.
>> >>>>>>>>> They even explicitly mention Mahout:
>> >>>>>>>>>
>> >>>>>>>>> "Very recently(March 2010), the Mahout project [2] provides
>> >>>>>>>>> SVD on top of HADOOP. Due to insufficient documentation, we were
>> >> not
>> >>>>>>>>> able to find the input format and run a head-to-head comparison.
>> >> But,
>> >>>>>>>>> reading the source code, we discovered that Mahout suffers from
>> >>>>>>>>> two
>> >>>>>>>>> major issues: (a) it assumes that the vector (b, with
>> >>>>>>>>> n=O(billion)
>> >>>>>>>>> entries) fits in the memory of a single machine, and (b) it
>> >> implements
>> >>>>>>>>> the full re-orthogonalization which is inefficient."
>> >>>>>>>>>
>> >>>>>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
>> >>>>>>>>>
>> >>>>>>>>> --sebastian
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>
>> >>
>> >>
>> >
>> >
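P.S. Re the LZO compression and sort-buffer remarks above, these are the knobs I mean, passed as -D options on the job command line (a sketch only; the property names are the 0.20.x/CDH3-era ones, the codec class assumes the hadoop-lzo native libraries are installed on the cluster, and 200m for io.sort.mb is just an illustrative value, not a recommendation):

  -Dmapred.compress.map.output=true \
  -Dmapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
  -Dio.sort.mb=200

Compressing the map output shrinks what the shuffle has to copy and sort, which is where the AB' reducers spend their copy+sort time, and a bigger sort buffer helps avoid extra spills when the map output per reducer gets large.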
