OK, the reducer then, hm. Then it's probably not memory, since the
pressure is high mostly on the mapper side; the reducer actually
doesn't use much memory at all. The problem I previously had was with
memory pressure and load time in the AB' mappers, but the reducers
were always fine: copy+sort+run has always taken about a minute there
for this input, and nothing much has changed there. --> Just in case,
please check how many records the map phase outputs. If it is more
than 1 per mapper, please note by how much and increase the -abth
option to, say, 300,000, or until you no longer see multiple records
output by the map tasks. Memory-wise it should be fine; it is still
roughly 24 MB in total for k+p=15 in your case ((k+p) x abth x 8
bytes, packed as long double[] swathes, so there is next to no object
overhead). Although with the 922 patch it should not matter much;
that was the point.
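
To make the -abth arithmetic concrete: at 300,000 rows the same
formula gives 15 x 300,000 x 8 bytes ~ 36 MB per map task, which is
still small next to the -Xmx1024m heaps you are running with. On the
command line it would look roughly like this (just a sketch: the
paths are placeholders and I am reusing the invocation shape from my
run quoted below, with k=10, p=5 as in your test):

ssvd -i <your-input.seq> -o <output-dir> --tempDir <temp-dir> -ow \
     -k 10 -p 5 -q 1 -t 20 -abth 300000 \
     -Dmapred.child.java.opts="-Xmx1024m"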

If you can, please check whether the time goes into the copy, the
sort, or the reducer itself. If it is the sort, you may need to tweak
the sort buffer size, especially if you didn't set up many reducers
per job (though I assume you set the number of reducers to at least
20). By default those buffers are just 100 MB, which may turn out to
be below the block size; I am not sure exactly how that would work
out in this case. I think I bumped it up a little quite a while ago.
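
If it does turn out to be the sort, something along these lines is
what I mean (again just a sketch: io.sort.mb is the old 0.20-style
property name as on my CDH3 setup, 256 is only a starting guess to
get above your block size, and I am assuming the driver passes -D
properties through the same way it does the child opts):

ssvd -i <your-input.seq> -o <output-dir> --tempDir <temp-dir> -ow \
     -k 10 -p 5 -q 1 -t 20 -Dio.sort.mb=256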


What else may be different? I am on CDH3u0 here. Your input seems to
be about 25% wider, which may affect the AB' job map time in
particular by at most that much; in my case that would amount to
about a 15 s difference, but hopefully not much of anything else.

My setup is actually pretty meager: we have 10 quad-core nodes with
16 GB RAM each, and they are already running HBase on top of that,
which is why I mostly avoid giving jobs much more than 500 MB of
memory there, or something starts scratching the disk. So memory
doesn't seem to be the factor. (But my cluster is a bare-bones
assembly sitting in the same rack, which is a big plus.) I also
currently allow 4 map / 4 reduce tasks per node.

If you can point me to where I can download your input exactly (with
no or minimal prep steps; I don't think I can spend much energy on
that while at the office), I may try to run on your exact input as
well.


------
On a side note, I found an error in my work on broadcasting via the
distributed cache. After patching it up, I actually see no difference
whatsoever between direct streaming and using the distributed cache,
just as I initially figured. My timing for the AB' multiplication
hovers around 3 min 10 s either way, distributed cache or streaming.
Since I did the patch anyway, I will be offering it for commit, but I
will probably have to put it on the review board since it touches
some iterator code that Sean previously crafted. The branch with the
broadcast option enabled is called MAHOUT-922-2.
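
If you want to compare the two code paths on that branch yourself,
the toggle should look roughly like this (a sketch; I am assuming -br
accepts true/false the same way it appears in my invocation quoted
further down, and the paths are placeholders):

broadcast B via the distributed cache:
  ssvd -i <input.seq> -o <out> --tempDir <tmp> -ow -k 10 -p 1 -q 1 -t 20 -br true
direct streaming:
  ssvd -i <input.seq> -o <out> --tempDir <tmp> -ow -k 10 -p 1 -q 1 -t 20 -br false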



On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[email protected]> wrote:
> I ran it with 8 map and 4 reduce tasks allowed per machine using
> -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM).
>
> Most of the time was spent in QRReducer, not in ABtMapper. I was
> promised better monitoring for next week; I'll rerun the tests and
> hopefully can provide more informative results.
>
> --sebastian
>
> On 18.12.2011 04:27, Dmitriy Lyubimov wrote:
>> also make sure you are not hitting swap. If you specify 3G per task,
>> and you have typical quad-core commodity nodes, it probably means that
>> you have at least 4 map / 4 reduce tasks configured. That's 8 tasks
>> total, so unless you have ~32G RAM on these machines, I think you are
>> in danger of hitting swap with settings that high once GC gets all
>> excited. I never thought I would say this, but maybe you can run it a
>> little more conservatively.
>>
>> my cluster is cdh3u0.
>>
>> -Dmitriy
>>
>> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Sebastian,
>>>
>>> I think my commit got screwed up somehow. Here's a benchmark very close
>>> to yours (I also added a broadcast option for B through the distributed
>>> cache, which seems to have been a ~30% win in my situation, so I leave
>>> it on by default).
>>>
>>> input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total 
>>> size
>>>
>>> -rw-r--r--   3 dmitriy supergroup 2353201134 2011-12-17 17:17
>>> /user/dmitriy/drm-sparse.seq
>>>
>>> This is probably a little smaller than your wikipedia test in terms of
>>> dimensions but not in size.
>>>
>>> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT
>>> --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1
>>> -Dmapred.child.java.opts="-Xmx800m"
>>>
>>> so this is for 10 useful eigenvalues and U,V with 1 power iteration:
>>>
>>> with broadcast option:
>>> QJob 1 min 11s
>>> BtJob 1 min 58s
>>> AB' Job 2 min 6s
>>> B'Job 1 min 51 s
>>>
>>> U Job - 0m 12 s, V job - 20 s (U and V run in parallel, not sequential)
>>>
>>> without broadcast option (only affects AB' step)
>>>
>>> QJob 1 min 12s
>>> BtJob 1 min 59s
>>> AB' Job 3 min 3s
>>> B'Job 2 min 0 s
>>>
>>> 10 nodes, 4 CPUs each. (I had 35 splits of the original input, so
>>> everything got allocated.) The cluster also runs our QA stuff, which
>>> starts something else almost every minute; so while it is not heavily
>>> loaded, the job/task trackers are fairly busy setting up and tearing
>>> down.
>>>
>>> I'll commit shortly.
>>>
>>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[email protected]> wrote:
>>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
>>>>> Interesting.
>>>>>
>>>>> Well so how did your decomposing go?
>>>>
>>>> I tested the decomposition of the wikipedia pagelink graph (130M
>>>> edges, 5.6M vertices, making approx. a quarter of a billion non-zeros
>>>> in the symmetric adjacency matrix) on a 6-machine Hadoop cluster.
>>>>
>>>> Got these running times for k = 10, p = 5 and one power-iteration:
>>>>
>>>> Q-job    1mins, 41sec
>>>> Bt-job   9mins, 30sec
>>>> ABt-job  37mins, 22sec
>>>> Bt-job   9mins, 41sec
>>>> U-job    30sec
>>>>
>>>> I think I'd need a couple more machines to handle the twitter graph
>>>> though...
>>>>
>>>> --sebastian
>>>>
>>>>
>>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I played with Mahout to decompose the adjacency matrices of large graphs
>>>>>> lately. I stumbled on a paper of Christos Faloutsos that describes a
>>>>>> variation of the Lanczos algorithm they use for this on top of Hadoop.
>>>>>> They even explicitly mention Mahout:
>>>>>>
>>>>>> "Very recently(March 2010), the Mahout project [2] provides
>>>>>> SVD on top of HADOOP. Due to insufficient documentation, we were not
>>>>>> able to find the input format and run a head-to-head comparison. But,
>>>>>> reading the source code, we discovered that Mahout suffers from two
>>>>>> major issues: (a) it assumes that the vector (b, with n=O(billion)
>>>>>> entries) fits in the memory of a single machine, and (b) it implements
>>>>>> the full re-orthogonalization which is inefficient."
>>>>>>
>>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>
>>>>
>
