Yeah, Kai is right. You can read more details at:
http://hadoop.apache.org/docs/stable/hdfs_design.html#Data+Replication

and right from the horse's mouth (pp. 70-75):
http://books.google.com/books?id=drbI_aro20oC&pg=PA51&lpg=PA51&dq=hadoop+replication+factor+1+definitive+guide&source=bl&ots=tZDeyhhZj1&sig=Xq-0WrYhOKnER1SDbnBTmbaEfdk&hl=en&sa=X&ei=Dtu1UdnsCcO_rQG8jICoAw&ved=0CE0Q6AEwBA#v=onepage&q=hadoop%20replication%20factor%201%20definitive%20guide&f=false

On Mon, Jun 10, 2013 at 9:47 AM, Kai Voigt <[email protected]> wrote:

> Hello,
>
> Am 10.06.2013 um 15:36 schrieb Razen Al Harbi <[email protected]>:
>
> > I have deployed Hadoop on a cluster of 20 machines. I set the
> > replication factor to one. When I put a file (larger than the HDFS
> > block size) into HDFS, all the blocks are stored on the machine where
> > the Hadoop put command is invoked.
> >
> > For higher replication factors, I see the same behavior, but the
> > replicated blocks are stored randomly on all the other machines.
> >
> > Is this normal behavior? If not, what would be the cause?
>
> Yes, this is normal behavior. When an HDFS client happens to run on a
> host that is also a DataNode (always the case when a reducer writes its
> output), the first copy of a block is stored on that very node. This
> optimizes latency: it's faster to write to a local disk than across the
> network.
>
> The second copy of the block is stored on a random host in another rack
> (if your cluster is configured to be rack-aware), to increase the
> distribution of the data.
>
> The third copy of the block is stored on another random host in that
> other rack.
>
> So your observations are correct.
>
> Kai
>
> --
> Kai Voigt
> [email protected]
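If you want to watch this happen yourself, here's a rough, untested sketch
against the plain FileSystem API. The path, class name, and the ~200 MB of
dummy data are just placeholders I picked so the file spans several blocks
at the default block size; the write uses a per-file replication factor of 1:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; write ~200 MB so the file spans several blocks.
        Path path = new Path("/tmp/placement-test");
        FSDataOutputStream out = fs.create(path, (short) 1);  // replication = 1
        byte[] chunk = new byte[1024 * 1024];
        for (int i = 0; i < 200; i++) {
            out.write(chunk);
        }
        out.close();

        // Ask the NameNode where each block of the file ended up.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + " -> "
                + Arrays.toString(block.getHosts()));
        }
    }
}

With replication 1 and the client running on a DataNode, every line should
print the same (local) host. You can get the same information without
writing any code via: hadoop fsck /tmp/placement-test -files -blocks -locations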

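And to reproduce the second observation (copies scattering at higher
replication factors), you can bump the replication of the existing file and
re-run the location loop. Continuing from the sketch above, purely
illustrative:

// Ask the NameNode to add two more copies of every block.
fs.setReplication(path, (short) 3);
// Re-replication happens asynchronously in the background, so re-run the
// getFileBlockLocations() loop after a few seconds to see the extra hosts.

The new replicas follow the placement policy Kai describes, so on a
rack-aware cluster they should land off the local rack.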