Hi, I would like to add another scenario: what are the steps for removing a dead node when the server has had an unrecoverable hard failure?
Thanks,
Ben

On Tuesday, February 12, 2013 7:30:57 AM UTC-8, sudhakara st wrote:
>
> The decommissioning process is controlled by an exclude file, which for
> HDFS is set by the *dfs.hosts.exclude* property, and for MapReduce by
> the *mapred.hosts.exclude* property. In most cases, there is one shared
> file, referred to as the exclude file. The path to this exclude file must
> be specified in the *dfs.hosts.exclude* configuration parameter when the
> namenode starts up.
>
> To remove nodes from the cluster:
> 1. Add the network addresses of the nodes to be decommissioned to the
>    exclude file.
> 2. Restart the MapReduce cluster to stop the tasktrackers on the nodes
>    being decommissioned.
> 3. Update the namenode with the new set of permitted datanodes, with this
>    command:
>    % hadoop dfsadmin -refreshNodes
> 4. Go to the web UI and check whether the admin state has changed to
>    "Decommission In Progress" for the datanodes being decommissioned.
>    They will start copying their blocks to other datanodes in the cluster.
> 5. When all the datanodes report their state as "Decommissioned," then all
>    the blocks have been replicated. Shut down the decommissioned nodes.
> 6. Remove the nodes from the include file, and run:
>    % hadoop dfsadmin -refreshNodes
> 7. Remove the nodes from the slaves file.
>
> Decommissioning datanodes a small percentage at a time (less than 2%)
> should not have a noticeable effect on the cluster. Still, it is better to
> pause MR jobs before triggering the decommission, to ensure no tasks are
> running on the nodes being decommissioned.
> If only a very small percentage of tasks is running on a decommissioning
> node, they can be resubmitted to other tasktrackers, but if the percentage
> of queued jobs is larger than the threshold, there is a chance of job
> failure. Once you have run 'hadoop dfsadmin -refreshNodes' and the
> decommission has started, you can resume the MR jobs.
>
> *Source: Hadoop: The Definitive Guide [Tom White]*
>
> On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan
> wrote:
>>
>> Hi Guys,
>>
>> What is the recommended way to remove one of the datanodes in a
>> production cluster by decommissioning that particular datanode? Please
>> guide me.
>>
>> -Dhanasekaran,
>>
>> Did I learn something today? If not, I wasted it.
>>
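
For illustration, steps 1 and 3 of the quoted procedure could be scripted roughly as below. This is only a sketch under assumptions: the function names, the host name dn42.example.com, and the exclude-file path are hypothetical, and the path should be whatever dfs.hosts.exclude points to on your namenode. Only the `hadoop dfsadmin -refreshNodes` command comes from the thread itself.

```shell
#!/bin/sh
# Sketch only: function names, host name and file path are hypothetical.

# Step 1: append a host to the exclude file unless it is already listed.
add_to_exclude() {
    host="$1"
    exclude_file="$2"
    # Create the file if it does not exist yet.
    touch "$exclude_file"
    # -x matches whole lines only, -F treats the host as a fixed string.
    if ! grep -qxF "$host" "$exclude_file"; then
        printf '%s\n' "$host" >> "$exclude_file"
    fi
}

# Step 3: ask the namenode to re-read its include/exclude files.
# Requires the hadoop CLI on PATH and HDFS admin rights.
refresh_nodes() {
    hadoop dfsadmin -refreshNodes
}

# Example invocation (not run here; path and host are assumptions):
#   add_to_exclude dn42.example.com /etc/hadoop/conf/dfs.exclude && refresh_nodes
```

Making the append idempotent means the script can be re-run safely for the same host. Checking the web UI for "Decommission In Progress" and "Decommissioned" (steps 4 and 5) is still left to the operator.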