I have 16 cores per machine, so I'll try to decrease Xmx a little and see
if I can run more tasks.
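
(A rough sketch of what I have in mind, with made-up numbers and keeping one
task per core, e.g.

hadoop jar mahout-core-0.6-SNAPSHOT-job.jar \
  org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli \
  -Dmapred.tasktracker.map.tasks.maximum=12 \
  -Dmapred.tasktracker.reduce.tasks.maximum=4 \
  -Dmapred.child.java.opts=-Xmx768m \
  ... same remaining options as in the run quoted below ...

16 concurrent tasks at -Xmx768m is roughly 12GB of task heap on a 32GB
machine, so there should still be headroom; the 12/4 split and 768m are just
guesses to try.)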

On 19.12.2011 17:14, Dmitriy Lyubimov wrote:
> No problem. We actually got one patch done that was useful for extra-sparse
> cases, so that's good. I think you can still achieve extra performance if you
> constrain to 1 task per core and have enough cores for all tasks to run at
> once.
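> (Concretely, and only as a sketch with assumed numbers: on a 16-core node
> that would mean keeping mapred.tasktracker.map.tasks.maximum plus
> mapred.tasktracker.reduce.tasks.maximum at 16 or less, e.g. 12+4, and setting
> --reduceTasks no higher than the total number of reduce slots in the cluster
> so all reducers can run in a single wave.)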
> On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[email protected]> wrote:
> 
>> Hi Dmitriy,
>>
>> I finally found the problem... I did not set --reduceTasks. I ran the
>> job again with 24 reduce tasks and was able to run the whole
>> decomposition in less than 10 minutes. Sorry for causing you so much work.
>>
>> --sebastian
>>
>> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli
>> -Dmapred.tasktracker.map.tasks.maximum=8
>> -Dmapred.tasktracker.reduce.tasks.maximum=6
>> -Dmapred.child.java.opts=-Xmx1024m --input
>> hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7
>> --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV
>> false --powerIter 1 --reduceTasks 24
>>
>> Q-Job   1mins, 49sec
>> Bt-Job  2mins, 39sec
>> ABt-Job 3mins, 51sec
>> Bt-Job  2mins, 41sec
>> U-Job   29sec
>>
>> On 19.12.2011 09:09, Dmitriy Lyubimov wrote:
>>> Ok, the reducer then, hm. Then it's probably not the memory, since the
>>> pressure is high mostly in the mapper; the reducer actually
>>> doesn't use much memory at all. The problem I previously had was with
>>> memory pressure and load time in the AB' mappers, but the reducers were
>>> always ok; copy+sort+run has always taken about a minute there for
>>> this input, and nothing much has changed there. Just in case, please
>>> check how many records the map phase outputs. If it is more than 1 per
>>> mapper, please note by how much and increase the -abth option to, say,
>>> 300,000, or until you no longer see multiple records output by the map
>>> tasks. It should be ok: that is still roughly 24 MB in total for
>>> k+p=15 in your case ((k+p) x abth x 8 bytes, packed as long double[]
>>> swathes, so there is hardly any per-object overhead). Although with the
>>> 922 patch it should not matter much; that was the point.
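>>> (Spelling out that size estimate with assumed numbers: if the current
>>> block height is around 200,000 rows, a block is 15 x 200,000 x 8 bytes,
>>> i.e. the ~24 MB above; at -abth 300,000 it becomes 15 x 300,000 x 8 bytes,
>>> about 36 MB, still small next to a 1 GB task heap.)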
>>>
>>> If you can, please check whether it is the copy, the sort, or the reducer
>>> itself. If it is the sort, you may need to tweak the sort buffer size,
>>> especially if you didn't set up many reducers per job (but I assume you
>>> set the number of reducers to at least 20). By default, those buffers are
>>> just 100 MB, which may turn out to be below the block size; I am not sure
>>> how exactly it would work out in that case. I think I bumped that up a
>>> little quite some time ago.
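>>> (If I remember right, the relevant knob on this Hadoop version is
>>> io.sort.mb, so something like adding -Dio.sort.mb=256 to the job command
>>> line would be the thing to try; 256 is just an illustrative value.)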
>>>
>>>
>>> What else may be different? I am on CDH3u0 here. Your input
>>> seems to be 25% wider, which may affect the AB' job map time in particular
>>> by at most that much; in my case that would amount to about a 15s
>>> difference, but hopefully it doesn't affect much else.
>>>
>>> My setup is actually pretty meager: we have 10 quad-cpu nodes with 16G of
>>> RAM each, and they are already running HBase on top of that, which is why
>>> I mostly avoid running jobs much over 500m of memory there, or I start
>>> thrashing the disk. So memory doesn't seem to be the
>>> factor. (But my cluster is a barebone assembly sitting in the same
>>> rack, which is a big plus.) I also currently allow 4 map / 4 reduce tasks
>>> per node.
>>>
>>> If you can point me to where I can download your input exactly (with no
>>> or minimal prep steps; I don't think I can spend much energy on that
>>> while at the office), I may try it on your exact input as well.
>>>
>>>
>>> ------
>>> On a side note, I found an error in my work on broadcasting via the
>>> distributed cache. After patching it up, I actually see no difference
>>> whatsoever between direct streaming and using the distributed cache, just
>>> as I initially figured. My timing for the AB' multiplication revolves
>>> around 3 min 10 s, distributed cache or streaming. Since I did that
>>> patch anyway, I will be offering it for commit, but I will probably
>>> have to put it on the review board as it touches some iterator code
>>> that Sean previously crafted. The branch with the broadcast option
>>> enabled is called MAHOUT-922-2.
>>>
>>>
>>>
>>> On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[email protected]> wrote:
>>>> I ran it with 8 map and 4 reduce tasks allowed per machine using
>>>> -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM).
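>>>> (So at most 12 concurrent tasks at about 1 GB of heap each, roughly 12 GB
>>>> of task heap against 32 GB of RAM, which should leave plenty of room
>>>> before swapping.)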
>>>>
>>>> Most of the time was spent in QRReducer, not in ABtMapper. I have been
>>>> promised better monitoring next week; I'll rerun the tests then and
>>>> hopefully can provide more informative results.
>>>>
>>>> --sebastian
>>>>
>>>> On 18.12.2011 04:27, Dmitriy Lyubimov wrote:
>>>>> Also make sure you are not hitting swap. If you specify 3G per task
>>>>> and you have typical quad-core commodity nodes, it probably means that
>>>>> you have at least 4 map / 4 reduce tasks configured. That's 8 tasks
>>>>> total, so unless you have ~32G of RAM on these machines, I think you are
>>>>> in danger of hitting swap with settings that high once GC gets all
>>>>> excited. I never thought I would say this, but maybe you can run it a
>>>>> little more conservatively.
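>>>>> (The arithmetic behind that: 4 map + 4 reduce tasks at 3G each is
>>>>> already 24G of heap before GC and daemon overhead, which only fits
>>>>> comfortably if the node really has ~32G of RAM.)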
>>>>>
>>>>> My cluster is CDH3u0.
>>>>>
>>>>> -Dmitriy
>>>>>
>>>>> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>> Sebastian,
>>>>>>
>>>>>> I think my commit got screwed up somehow. Here's a benchmark very close
>>>>>> to yours (I also added a broadcast option for B through the distributed
>>>>>> cache, which seems to have been about a 30% win in my situation, so I
>>>>>> leave it on by default).
>>>>>>
>>>>>> input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row;
>>>>>> total size:
>>>>>>
>>>>>> -rw-r--r--   3 dmitriy supergroup 2353201134 2011-12-17 17:17
>>>>>> /user/dmitriy/drm-sparse.seq
>>>>>>
>>>>>> This is probably a little smaller than your wikipedia test in terms of
>>>>>> dimensions but not in size.
>>>>>>
>>>>>> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT
>>>>>> --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1
>>>>>> -Dmapred.child.java.opts="-Xmx800m"
>>>>>>
>>>>>> So this is for 10 useful eigenvalues and U, V with 1 power iteration:
>>>>>>
>>>>>> with broadcast option:
>>>>>> QJob 1 min 11s
>>>>>> BtJob 1 min 58s
>>>>>> AB' Job 2 min 6s
>>>>>> B'Job 1 min 51 s
>>>>>>
>>>>>> U Job - 0m 12 s, V Job - 20 s (U and V run in parallel, not sequentially)
>>>>>>
>>>>>> without broadcast option (only affects the AB' step):
>>>>>>
>>>>>> QJob 1 min 12s
>>>>>> BtJob 1 min 59s
>>>>>> AB' Job 3 min 3s
>>>>>> B'Job 2 min 0 s
>>>>>>
>>>>>> 10 nodes, 4 CPUs each. (I had 35 splits of the original input, so
>>>>>> everything got allocated.) The cluster also runs our QA stuff, starting
>>>>>> something else almost every minute; with frequent other jobs, the
>>>>>> job/task trackers are fairly busy setting up and tearing down even when
>>>>>> the cluster is not heavily loaded.
>>>>>>
>>>>>> I'll commit shortly.
>>>>>>
>>>>>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[email protected]> wrote:
>>>>>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
>>>>>>>> Interesting.
>>>>>>>>
>>>>>>>> Well so how did your decomposing go?
>>>>>>>
>>>>>>> I tested the decomposition of the Wikipedia pagelink graph (130M edges,
>>>>>>> 5.6M vertices; since the adjacency matrix is symmetric, each edge
>>>>>>> contributes two entries, making approximately a quarter of a billion
>>>>>>> non-zeros) on a 6-machine Hadoop cluster.
>>>>>>>
>>>>>>> Got these running times for k = 10, p = 5 and one power-iteration:
>>>>>>>
>>>>>>> Q-job    1mins, 41sec
>>>>>>> Bt-job   9mins, 30sec
>>>>>>> ABt-job  37mins, 22sec
>>>>>>> Bt-job   9mins, 41sec
>>>>>>> U-job    30sec
>>>>>>>
>>>>>>> I think I'd need a couple more machines to handle the twitter graph
>>>>>>> though...
>>>>>>>
>>>>>>> --sebastian
>>>>>>>
>>>>>>>
>>>>>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I have played with Mahout lately to decompose the adjacency matrices
>>>>>>>>> of large graphs. I stumbled on a paper by Christos Faloutsos that
>>>>>>>>> describes a variation of the Lanczos algorithm they use for this on
>>>>>>>>> top of Hadoop.
>>>>>>>>> They even explicitly mention Mahout:
>>>>>>>>>
>>>>>>>>> "Very recently(March 2010), the Mahout project [2] provides
>>>>>>>>> SVD on top of HADOOP. Due to insufficient documentation, we were
>> not
>>>>>>>>> able to find the input format and run a head-to-head comparison.
>> But,
>>>>>>>>> reading the source code, we discovered that Mahout suffers from two
>>>>>>>>> major issues: (a) it assumes that the vector (b, with n=O(billion)
>>>>>>>>> entries) fits in the memory of a single machine, and (b) it
>> implements
>>>>>>>>> the full re-orthogonalization which is inefficient."
>>>>>>>>>
>>>>>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
>>>>>>>>>
>>>>>>>>> --sebastian
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>
>>
> 
