Hi Jean-Daniel,

Thanks for your reply.

My assumption is that the total network load during the reduce step is O(n), with n the 
number of nodes in the cluster. We saw a major performance loss in the reduce 
step when our network accidentally degraded to 100 Mbit (1 h vs. 13 minutes). 

With more nodes, I see two options: 

1) use switches with a higher switching capacity 
2) improve HBase/Hadoop's assignment of reduce tasks to the nodes that serve 
the corresponding HBase regions.
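Option 2 could, in principle, look something like the sketch below: if the scheduler knew which region server hosts the region behind each reduce partition, it could prefer launching that reduce task on the same node. Everything here (class name, method names, data structures) is hypothetical; nothing like this exists in Hadoop/HBase today, it's just a self-contained illustration of the idea.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch only: map reduce partitions to the host serving
// the corresponding region, so a scheduler could prefer that node.
public class ReduceLocalityHint {

    // region start key (sorted) -> host name of the region server serving it
    private final SortedMap<String, String> regionHosts;

    public ReduceLocalityHint(SortedMap<String, String> regionHosts) {
        this.regionHosts = regionHosts;
    }

    // With an HRegionPartitioner-style setup, partition i corresponds to the
    // i-th region in start-key order, so the preferred host for reduce task i
    // is simply the server of that region.
    public String preferredHost(int partition) {
        int i = 0;
        for (String host : regionHosts.values()) {
            if (i++ == partition) {
                return host;
            }
        }
        throw new IllegalArgumentException("no region for partition " + partition);
    }

    public static void main(String[] args) {
        SortedMap<String, String> regions = new TreeMap<>();
        regions.put("", "node1");   // first region (empty start key)
        regions.put("m", "node2");  // region starting at row key "m"
        ReduceLocalityHint hint = new ReduceLocalityHint(regions);
        System.out.println(hint.preferredHost(0)); // node1
        System.out.println(hint.preferredHost(1)); // node2
    }
}
```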

What do you think?

Sven

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel 
Cryans
Sent: Friday, April 8, 2011 18:04
To: [email protected]
Subject: Re: data locality for reducer writes?

Unfortunately it seems that there's nothing in the OutputFormat
interface that we could implement (like getSplits in the InputFormat)
to inform the JobTracker of the location of the regions. It kinda makes
sense, since when you're writing to HDFS in a "normal" MR job you
always write to the local DataNode (well, if there is one), but even
then the data is replicated to two other nodes. IMO even if we had that,
the gain would be marginal.
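For context, here is a toy re-implementation of the key routing that an HRegionPartitioner-style partitioner performs: each row key is sent to the partition of the region whose start-key range contains it, which is why each reducer ends up writing into exactly one region. This is a simplified, self-contained sketch for discussion, not the actual HBase class.

```java
import java.util.Arrays;

// Toy sketch of HRegionPartitioner-style routing: pick the partition of
// the region whose [startKey, nextStartKey) range contains the row key.
public class RegionKeyPartitioner {

    private final String[] startKeys; // sorted region start keys; startKeys[0] = ""

    public RegionKeyPartitioner(String[] startKeys) {
        this.startKeys = startKeys;
    }

    // Return the index of the last region whose start key is <= rowKey.
    public int getPartition(String rowKey) {
        int idx = Arrays.binarySearch(startKeys, rowKey);
        // Exact hit: that region. Miss: binarySearch returns -(insertionPoint)-1,
        // so the containing region sits just before the insertion point.
        return idx >= 0 ? idx : -(idx + 1) - 1;
    }

    public static void main(String[] args) {
        // Three regions: ["", "g"), ["g", "t"), ["t", end)
        RegionKeyPartitioner p = new RegionKeyPartitioner(new String[] {"", "g", "t"});
        System.out.println(p.getPartition("apple")); // 0
        System.out.println(p.getPartition("g"));     // 1
        System.out.println(p.getPartition("zebra")); // 2
    }
}
```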

J-D

On Fri, Apr 8, 2011 at 4:18 AM, Biedermann,S.,Fa. Post Direkt
<[email protected]> wrote:
> Hi,
>
>
>
> we have a number of Reducer tasks, each writing a bunch of rows into the
> latest HBase via Puts.
>
> What already works is that each Reducer only creates Puts for one single
> region, by using HRegionPartitioner.
>
>
>
> However, we are seeing that the Region flush itself is not local, but
> going to some other node in the cluster. This puts load on the network.
>
> We'd like to see that instead the Reducer would be run on the same node
> where the region is served.
>
>
>
> Is that possible?
>
> Any ideas or suggestions?
>
>
>
> Sven
>
>
