No problem. We actually got one patch done that was useful for extra-sparse cases, which is good. I think you can still achieve extra performance if you constrain to 1 task per core and have enough cores for all tasks to run at once.

On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[email protected]> wrote:
> Hi Dmitriy,
>
> I finally found the problem... I did not set --reduceTasks. I ran the job again with 24 reduce tasks and was able to run the whole decomposition in less than 10 minutes. Sorry for causing you so much work.
>
> --sebastian
>
> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar
>   org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli
>   -Dmapred.tasktracker.map.tasks.maximum=8
>   -Dmapred.tasktracker.reduce.tasks.maximum=6
>   -Dmapred.child.java.opts=-Xmx1024m
>   --input hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix
>   --output hdfs:///ssc/M922-7
>   --tempDir hdfs:///ssc/M922-tmp-7
>   --rank 10 --oversampling 5 --computeV false --powerIter 1 --reduceTasks 24
>
> Q-Job   1mins, 49sec
> Bt-Job  2mins, 39sec
> ABt-Job 3mins, 51sec
> Bt-Job  2mins, 41sec
> U-Job   29sec
>
> On 19.12.2011 09:09, Dmitriy Lyubimov wrote:
> > Ok, reducer then, hm. Then it's probably not the memory, for the pressure holds high mostly in the mapper only. The reducer actually doesn't use much memory at all. The problem i previously had was with memory pressure and load time in the AB' mappers, but reducers were always ok; copy+sort+run has always taken about a minute there for this input, and nothing much has changed there. --> Just in case, please check how many records the map phase outputs. If it is more than 1 per mapper, please note by how much and increase the -abth option to, say, 300,000, or until you no longer see multiple records output by the map tasks. It should be ok; it is still roughly 24mb for everything for k+p=15 in your case ((k+p) x abth x 8 bytes, packed as long double[] swathes, so it doesn't take much object overhead). Although with the 922 patch it should not matter much; that was the point.
> >
> > If you can, please notice whether it is the sort, the copy or the reducer itself. If it is the sort, you may need to tweak the sort buffer size, especially if you didn't set up too many reducers per job (but i assume you set num reducers to at least 20). By default, those buffers are just 100m, which may turn out to be below block size; i am not sure how exactly it would work out in that case. i think i bumped that up a little some time ago.
> >
> > What else may be different? i am at CDH3u0 here. Your input seems to be 25% wider, which may affect AB' job map time in particular by at most that much, which in my case would constitute about a 15s difference, but hopefully not much else.
> >
> > My setup is actually pretty meager: we have 10 quad-cpu nodes with 16G RAM each, and they are already running hbase on top of it, which is why i mostly avoid running jobs much over 500m mem there, or i start scratching something on the disk. So memory doesn't seem to be the factor. (But my cluster is a barebone assembly sitting in the same rack, which is a big plus.) I also currently allow 4x4 map/reduce tasks per node.
> >
> > If you can point me to where i can download your input exactly (with no or minimal prep steps; i don't think i can spend much energy on that while at the office), i may try it on your exact input as well.
> >
> > ------
> > On a side note, i found an error in my work on broadcasting via the distributed cache. After patching it up, i actually see no difference whatsoever between direct streaming and using the distributed cache, just like i initially figured. My timing for the AB' multiplication revolves around 3 min 10 s, distributed cache or streaming.
> > Since i did that patch anyway, i will be offering it for commit, but i will probably have to put it on the review board as it touches some iterator stuff that Sean previously crafted. The branch with the broadcast option enabled is called MAHOUT-922-2.
> >
> > On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[email protected]> wrote:
> >> I ran it with 8 map and 4 reduce tasks allowed per machine, using -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM).
> >>
> >> Most of the time was spent in QRReducer, not in ABtMapper. I was promised better monitoring next week; I'll rerun the tests and hopefully can provide more informative results.
> >>
> >> --sebastian
> >>
> >> On 18.12.2011 04:27, Dmitriy Lyubimov wrote:
> >>> Also make sure you are not hitting swap. If you specify 3G per task, and you have typical quad-core commodity nodes, it probably means that you have at least 4 map / 4 reduce tasks configured. That's 8 tasks total; unless you have ~32G ram on these machines, i think you are in danger of hitting swap with settings that high when GC gets all excited and all. I never thought i would say that, but maybe you can run it a little bit more conservatively.
> >>>
> >>> my cluster is cdh3u0.
> >>>
> >>> -Dmitriy
> >>>
> >>> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>> Sebastian,
> >>>>
> >>>> I think my commit got screwed up somehow. Here's the benchmark, very close to yours (I also added a broadcast option for B through the distributed cache, which seems to have been a 30% win in my situation, so i leave it on by default).
> >>>>
> >>>> input: a 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total size
> >>>>
> >>>> -rw-r--r--   3 dmitriy supergroup 2353201134 2011-12-17 17:17 /user/dmitriy/drm-sparse.seq
> >>>>
> >>>> This is probably a little smaller than your wikipedia test in terms of dimensions, but not in size.
> >>>>
> >>>> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT
> >>>>   --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1
> >>>>   -Dmapred.child.java.opts="-Xmx800m"
> >>>>
> >>>> So this is for 10 useful eigenvalues and U,V with 1 power iteration:
> >>>>
> >>>> with broadcast option:
> >>>> QJob    1 min 11s
> >>>> BtJob   1 min 58s
> >>>> AB' Job 2 min 6s
> >>>> B'Job   1 min 51s
> >>>>
> >>>> U Job 0m 12s, V Job 20s (U and V run in parallel, not sequentially)
> >>>>
> >>>> without broadcast option (only affects the AB' step):
> >>>> QJob    1 min 12s
> >>>> BtJob   1 min 59s
> >>>> AB' Job 3 min 3s
> >>>> B'Job   2 min 0s
> >>>>
> >>>> 10 nodes, 4 cpus each (I had 35 splits of the original input, so everything got allocated). The cluster also runs our QA stuff, starting something else almost every minute; with frequent other jobs, the job/task trackers are fairly busy setting up and tearing down even while not loaded.
> >>>>
> >>>> I'll commit shortly.
> >>>>
> >>>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[email protected]> wrote:
> >>>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
> >>>>>> Interesting.
> >>>>>>
> >>>>>> Well, so how did your decomposing go?
> >>>>>
> >>>>> I tested the decomposition of the wikipedia pagelink graph (130M edges, 5.6M vertices, making approx. a quarter of a billion non-zeros in the symmetric adjacency matrix) on a 6-machine hadoop cluster.
> >>>>>
> >>>>> Got these running times for k = 10, p = 5 and one power iteration:
> >>>>>
> >>>>> Q-job   1mins, 41sec
> >>>>> Bt-job  9mins, 30sec
> >>>>> ABt-job 37mins, 22sec
> >>>>> Bt-job  9mins, 41sec
> >>>>> U-job   30sec
> >>>>>
> >>>>> I think I'd need a couple more machines to handle the twitter graph though...
> >>>>>
> >>>>> --sebastian
> >>>>>
> >>>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi there,
> >>>>>>>
> >>>>>>> I've been playing with Mahout lately to decompose the adjacency matrices of large graphs. I stumbled on a paper by Christos Faloutsos that describes a variation of the Lanczos algorithm they use for this on top of Hadoop. They even explicitly mention Mahout:
> >>>>>>>
> >>>>>>> "Very recently(March 2010), the Mahout project [2] provides SVD on top of HADOOP. Due to insufficient documentation, we were not able to find the input format and run a head-to-head comparison. But, reading the source code, we discovered that Mahout suffers from two major issues: (a) it assumes that the vector (b, with n=O(billion) entries) fits in the memory of a single machine, and (b) it implements the full re-orthogonalization which is inefficient."
> >>>>>>>
> >>>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
> >>>>>>>
> >>>>>>> --sebastian
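
On the input-format question raised in the quoted paper excerpt: the SSVD job reads the same on-disk layout used throughout Mahout's distributed matrix code, a SequenceFile whose keys are row labels (a Writable; IntWritable is a common choice) and whose values are VectorWritable rows. Below is a minimal, hypothetical sketch of writing such a file; the class name, output path, dimensions and density are made up for illustration, while the Hadoop and Mahout classes are the era's real APIs.

    // Minimal sketch: write a sparse matrix in the SequenceFile<Writable, VectorWritable>
    // layout that SSVDCli reads. Path, dimensions and density are illustrative only.
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteSparseDrm {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/drm-sparse-sample.seq");   // hypothetical path

        int rows = 1000;
        int cols = 1000;
        int nnzPerRow = 40;                                  // as in the benchmark above
        Random rnd = new Random(1234L);

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, IntWritable.class, VectorWritable.class);
        try {
          IntWritable key = new IntWritable();
          VectorWritable value = new VectorWritable();
          for (int i = 0; i < rows; i++) {
            Vector row = new RandomAccessSparseVector(cols);
            for (int j = 0; j < nnzPerRow; j++) {
              // duplicate indices simply overwrite; good enough for a sample input
              row.setQuick(rnd.nextInt(cols), rnd.nextDouble());
            }
            key.set(i);
            value.set(row);
            writer.append(key, value);
          }
        } finally {
          writer.close();
        }
      }
    }

A file written this way can be pointed at directly via the --input flag of the SSVDCli invocation quoted earlier in the thread.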
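
The "(k+p) x abth x 8 bytes" estimate from the Dec 19 message works out as follows. This is just a quick sketch; the 200,000 value for the default -abth block height is an assumption made here, not something stated in the thread.

    // Back-of-the-envelope check of the "(k+p) x abth x 8 bytes" memory estimate.
    public class AbtBlockMemory {
      public static void main(String[] args) {
        int kPlusP = 10 + 5;                     // --rank 10 plus --oversampling 5
        long[] abthValues = {200000L, 300000L};  // assumed default vs. the suggested bump
        for (long abth : abthValues) {
          // packed double[] swathes, so essentially no per-entry object overhead
          long bytes = kPlusP * abth * 8L;
          System.out.printf("abth=%d -> roughly %d MB per AB' block%n", abth, bytes / 1000000L);
        }
      }
    }

Under that assumption the block comes to about 24 MB, matching the figure in the message, and bumping -abth to 300,000 raises it to about 36 MB, still small next to a child heap of 800m to 1024m.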
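
The swap warning in the Dec 18 message is simple arithmetic over task slots and child heap; a sketch using the figures mentioned in the thread, purely as a sanity check rather than a sizing tool.

    // Worst-case child-JVM heap commitment per node: slots x -Xmx.
    public class SlotHeapCheck {
      static void report(String label, int mapSlots, int reduceSlots, int xmxMb, int ramGb) {
        int committedMb = (mapSlots + reduceSlots) * xmxMb;   // assumes every slot is busy
        System.out.printf("%s: %d MB of heap committed on a %d GB node%n", label, committedMb, ramGb);
      }

      public static void main(String[] args) {
        // Sebastian's setting: 12 slots x 1024m is ~12 GB of 32 GB, comfortable.
        report("8 map + 4 reduce at -Xmx1024m", 8, 4, 1024, 32);
        // The hypothetical 3G-per-task case: 8 slots x 3072m is ~24 GB of 32 GB,
        // which leaves little headroom once GC, sort buffers and the OS pile on.
        report("4 map + 4 reduce at -Xmx3g", 4, 4, 3072, 32);
      }
    }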
