I think Brian gave the answer.
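For reference, this is roughly what it would look like in hdfs-site.xml on each datanode. The first property is the one Brian mentioned; the two available-space tuning properties and their default values below are from memory, so double-check the names against the 2.2.0 docs, and note the datanodes need a restart to pick the change up:

    <property>
      <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
      <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
    </property>
    <!-- Volumes whose free space differs by less than this many bytes are considered
         balanced and are written to round-robin; assuming the 10 GB default here. -->
    <property>
      <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
      <value>10737418240</value>
    </property>
    <!-- Fraction of new block writes sent to the volumes with more free space when the
         volumes are out of balance; assuming the 0.75 default here. -->
    <property>
      <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
      <value>0.75</value>
    </property>

Keep in mind the policy only affects where new blocks land; blocks already sitting on the 500GB disks won't be moved, so the volumes will only even out as new data gets written (or old data deleted).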
On Tue, Oct 7, 2014 at 9:13 PM, Brian C. Huffman <[email protected]> wrote:

> What about setting dfs.datanode.fsdataset.volume.choosing.policy to
> org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy?
>
> Would that help?
>
> Regards,
> Brian
>
> On 08/06/2014 05:23 PM, Adam Faris wrote:
>
>> The Hadoop balancer doesn't balance data across the local drives; it balances
>> data between datanodes on the grid, so running the balancer won't balance
>> data within the local datanode.
>>
>> The datanode process round-robins between the data directories on local disk,
>> so it's not unexpected to see the smaller drives fill faster. Typically
>> people run the same size drives within each compute node to prevent this
>> from happening.
>>
>> You could partition the 2TB drive into four 500GB partitions. This isn't
>> optimal, as you'll have 4 write threads pointing at a single disk, but it is
>> fairly simple to implement. Otherwise you'll want to physically rebuild
>> your 4 nodes so each node has equal amounts of storage.
>>
>> I'd also like to suggest, while you're restructuring the local filesystem, that
>> the tasktracker/nodemanager be given its own partition for writes. If the
>> tasktracker/nodemanager and datanode processes share a partition, the HDFS
>> free space will shrink and grow as the mappers spill to disk, because the
>> datanode reports back how much free space it has on its partitions.
>>
>> Good luck.
>>
>> On Aug 6, 2014, at 1:51 PM, Felix Chern <[email protected]> wrote:
>>
>>> Run the "hadoop balancer" command on the namenode. It's used for
>>> balancing skewed data.
>>> http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer
>>>
>>> On Aug 6, 2014, at 1:45 PM, Brian C. Huffman <[email protected]> wrote:
>>>
>>>> All,
>>>>
>>>> We currently have a Hadoop 2.2.0 cluster with the following characteristics:
>>>> - 4 nodes
>>>> - Each node is a datanode
>>>> - Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB
>>>> - HDFS replication factor of 3
>>>>
>>>> It appears that our 500GB disks are filling up first (the alternative
>>>> would be to put 4 times the number of blocks on the 2TB disk per node).
>>>> I'm concerned that once the 500GB disks fill, our performance will slow
>>>> down (fewer spindles being read/written at the same time per node). Is
>>>> this correct? Is there anything we can do to change this behavior?
>>>>
>>>> Thanks,
>>>> Brian
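Adam's point about keeping the nodemanager's scratch space off the HDFS data disks is also worth doing when you get the chance. A rough sketch of what that separation looks like, assuming hypothetical mount points /data1, /data2, /data3 for the HDFS volumes and a separate /scratch partition for intermediate data (substitute your actual layout):

In hdfs-site.xml:

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data1/hdfs,/data2/hdfs,/data3/hdfs</value>
    </property>

In yarn-site.xml:

    <!-- Keep shuffle/spill data and container logs on their own partition so they
         don't eat into the free space the datanode reports for its volumes. -->
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/scratch/yarn/local</value>
    </property>
    <property>
      <name>yarn.nodemanager.log-dirs</name>
      <value>/scratch/yarn/logs</value>
    </property>

With that split, map-side spills land on /scratch and the HDFS free space numbers stop bouncing around as jobs run.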
