Hello,

On 10.06.2013 at 15:36, Razen Al Harbi <[email protected]> wrote:
> I have deployed Hadoop on a cluster of 20 machines. I set the replication
> factor to one. When I put a file (larger than the HDFS block size) into HDFS,
> all the blocks are stored on the machine where the Hadoop put command is
> invoked.
>
> For a higher replication factor, I see the same behavior, but the replicated
> blocks are stored randomly on all the other machines.
>
> Is this normal behavior? If not, what would be the cause?

Yes, this is normal behavior. When an HDFS client happens to run on a host that is also a DataNode (always the case when a reducer writes its output), the first copy of each block is stored on that very node. This optimizes write latency: writing to a local disk is faster than writing across the network.

The second copy of the block is stored on a random host in another rack (if your cluster is configured to be rack-aware), to spread the data across failure domains. The third copy is stored on another random host in that same remote rack.

So your observations are correct.

Kai

--
Kai Voigt
[email protected]
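
If you want to confirm the placement yourself, here is a minimal sketch (not from the original thread) that uses the Hadoop FileSystem API to print which DataNodes hold each block of a file. The path /user/razen/bigfile.dat and the class name are placeholders; pass your own HDFS path as the first argument and make sure your core-site.xml/hdfs-site.xml are on the classpath.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
    public static void main(String[] args) throws Exception {
        // Path to inspect; "/user/razen/bigfile.dat" is only a placeholder.
        Path file = new Path(args.length > 0 ? args[0] : "/user/razen/bigfile.dat");

        // Reads fs.defaultFS from the configuration files on the classpath,
        // so this connects to your NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // With replication factor 1 and a client running on a DataNode,
        // every block should list that same host.
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("block " + i
                    + " offset=" + blocks[i].getOffset()
                    + " length=" + blocks[i].getLength()
                    + " hosts=" + Arrays.toString(blocks[i].getHosts()));
        }
        fs.close();
    }
}

From the command line, hdfs fsck /path/to/file -files -blocks -locations (or hadoop fsck on older releases) prints the same block-to-DataNode mapping without writing any code.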
