Hello,

On 10.06.2013 at 15:36, Razen Al Harbi <[email protected]> wrote:
> I have deployed Hadoop on a cluster of 20 machines. I set the replication
> factor to one. When I put a file (larger than the HDFS block size) into HDFS,
> all the blocks are stored on the machine where the Hadoop put command is
> invoked.
>
> For a higher replication factor, I see the same behavior, but the replicated
> blocks are stored randomly on all the other machines.
>
> Is this normal behavior? If not, what would be the cause?

Yes, this is normal behavior. When an HDFS client happens to run on a host that is also a DataNode (always the case when a reducer writes its output), the first copy of each block is stored on that very node. This optimizes write latency: writing to a local disk is faster than writing across the network.

The second copy of the block is stored on a random host in another rack (if your cluster is configured to be rack-aware), to spread the data across failure domains. The third copy is stored on another random host in that same remote rack.

So your observations are correct.

Kai

--
Kai Voigt
[email protected]
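
If you want to confirm the placement yourself, here is a minimal sketch (not from the original thread) that uses the Hadoop FileSystem API to print which DataNodes hold each block of a file. The path /user/razen/bigfile.dat and the class name are placeholders; pass your own HDFS path as the first argument and make sure your core-site.xml/hdfs-site.xml are on the classpath.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
    public static void main(String[] args) throws Exception {
        // Path to inspect; "/user/razen/bigfile.dat" is only a placeholder.
        Path file = new Path(args.length > 0 ? args[0] : "/user/razen/bigfile.dat");

        // Reads fs.defaultFS from the configuration files on the classpath,
        // so this connects to your NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // With replication factor 1 and a client running on a DataNode,
        // every block should list that same host.
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("block " + i
                    + " offset=" + blocks[i].getOffset()
                    + " length=" + blocks[i].getLength()
                    + " hosts=" + Arrays.toString(blocks[i].getHosts()));
        }
        fs.close();
    }
}

From the command line, hdfs fsck /path/to/file -files -blocks -locations (or hadoop fsck on older releases) prints the same block-to-DataNode mapping without writing any code.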
