Hi Manoj,

Existing data is not automatically redistributed when you add new DataNodes. Take a look at the 'hdfs balancer' command, which can be run as a separate administrative tool to rebalance data distribution across the DataNodes.
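For example, assuming a stock Apache Hadoop 2.x install (the 5% threshold and the bandwidth value below are illustrative, tune them for your cluster):

  # Optionally raise the per-DataNode bandwidth the balancer may use
  # (bytes per second; here ~10 MB/s) so it finishes in reasonable time
  # without starving running jobs
  hdfs dfsadmin -setBalancerBandwidth 10485760

  # Move block replicas until every DataNode's utilization is within
  # 5% of the cluster average (the default threshold is 10%)
  hdfs balancer -threshold 5

The balancer can be stopped and restarted at any point; it only moves block replicas between DataNodes and does not affect data availability while it runs.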
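As for dfs.datanode.fsdataset.volume.choosing.policy: that setting only controls how a single DataNode chooses among its own local disks when writing new blocks, not how blocks are spread across DataNodes, so it will not fix the imbalance you describe. For reference, a sketch of the hdfs-site.xml entry, with the class name as shipped in stock Apache Hadoop 2.x:

  <property>
    <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
    <!-- Prefer volumes with more free space over the default round-robin -->
    <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
  </property>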
From: Manoj Venkatesh <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, February 6, 2015 at 11:34 AM
To: "[email protected]" <[email protected]>
Subject: Adding datanodes to Hadoop cluster - Will data redistribute?

Dear Hadoop experts,

I have a Hadoop cluster of 8 nodes; 6 were added during cluster creation and 2 additional nodes were added later to increase disk and CPU capacity. What I see is that processing is shared amongst all the nodes, whereas storage is reaching capacity on the original 6 nodes while the newly added machines still have a relatively large amount of unoccupied storage. I was wondering if there is an automated way, or any way at all, of redistributing data so that all the nodes are equally utilized. I have checked the configuration parameter dfs.datanode.fsdataset.volume.choosing.policy, which has the options 'Round Robin' or 'Available Space'; are there any other configurations which need to be reviewed?

Thanks,
Manoj
