Hi John!
If your block is replicated to three nodes, the default block placement 
policy puts two replicas on one rack and the third on a different rack. 
Depending on the intra-rack and inter-rack network bandwidth available, 
writing with replication factor=3 may be almost as fast as writing with a 
lower factor or (more likely) slower. With replication factor=2, the default 
policy places the two replicas on different racks, so you wouldn't gain much 
over replication factor=3. So you can either:
1. Choose replication factor=1, or
2. Change the block placement policy so that, even with replication factor=2, 
it chooses two nodes in the same rack.
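
To make the rack argument above concrete, here is a rough sketch of a toy model. It is not Hadoop code: the class and method names are made up for illustration, and it assumes the first replica lands on the writer's rack and that replicas are forwarded along a pipeline in order. It only counts how many pipeline hops cross a rack boundary for replication factors 1 to 3.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a replica write pipeline; hypothetical names, not the Hadoop API. */
final class ReplicaPlacementSketch {

    /**
     * Rack id for each replica under a simplified "default" policy.
     * Assumption: rack 0 is the writer's rack, the second replica goes to
     * another rack (rack 1), and the third joins the second on rack 1.
     * Only meaningful for replication factors 1 to 3.
     */
    static List<Integer> defaultRacks(int replication) {
        List<Integer> racks = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            racks.add(i == 0 ? 0 : 1);
        }
        return racks;
    }

    /** Rack ids under a hypothetical custom policy that keeps every replica on rack 0. */
    static List<Integer> sameRackRacks(int replication) {
        List<Integer> racks = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            racks.add(0);
        }
        return racks;
    }

    /** Counts pipeline hops whose source and target replicas sit on different racks. */
    static int crossRackHops(List<Integer> racks) {
        int hops = 0;
        for (int i = 1; i < racks.size(); i++) {
            if (!racks.get(i).equals(racks.get(i - 1))) {
                hops++;
            }
        }
        return hops;
    }

    public static void main(String[] args) {
        for (int r = 1; r <= 3; r++) {
            System.out.println("replication=" + r
                    + " default cross-rack hops=" + crossRackHops(defaultRacks(r))
                    + " same-rack cross-rack hops=" + crossRackHops(sameRackRacks(r)));
        }
    }
}
```

In this model, replication factor 2 and 3 both pay for exactly one cross-rack hop under the default-style placement, which is why dropping from 3 to 2 buys little. Only replication factor 1, or a same-rack policy, avoids the cross-rack hop entirely.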

HTH
Ravi




________________________________
 From: Devaraj k <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Tuesday, July 2, 2013 1:00 AM
Subject: RE: intermediate results files
 


 
If you are 100% sure that all the data nodes will be available and healthy for 
that period of time, you can set the replication factor to 1 or any value <3.
 
Thanks
Devaraj k
 
From:John Lilley [mailto:[email protected]] 
Sent: 02 July 2013 04:40
To: [email protected]
Subject: RE: intermediate results files
 
I’ve seen some benchmarks where replication=1 runs at about 50MB/sec and 
replication=3 runs at about 33MB/sec, but I can’t seem to find that now.
John
 
From:Mohammad Tariq [mailto:[email protected]] 
Sent: Monday, July 01, 2013 5:03 PM
To: [email protected]
Subject: Re: intermediate results files
 
Hello John,
 
      IMHO, it doesn't matter. Your job will write the result just once. 
Replica creation is handled at the HDFS layer, so it has nothing to do with 
your job. Your job will still be writing at the same speed.


Warm Regards,
Tariq
cloudfront.blogspot.com
 
On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <[email protected]> wrote:
If my reducers are going to create results that are temporary in nature 
(consumed by the next processing stage) is it recommended to use a replication 
factor <3 to improve performance?  
Thanks
john
