I have an 8-node cluster and a table that is fairly well balanced, with on 
average 36 regions per node. When I run a mapreduce job on the cluster against 
this table, the data locality of the mappers is poor, e.g. 100 rack-local 
mappers and only 188 data-local mappers. I would expect nearly all of the 
mappers to be data local. DNS appears to be fine, i.e. the hostname in each 
split matches the hostname in the corresponding task attempt.
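
To dig further, I put together a quick per-region locality check, sketched 
below. It assumes the HBase 2.x client API, the default namespace layout under 
/hbase/data, and a placeholder table name "mytable": for each region it 
compares the hosting regionserver to the hosts of the HDFS blocks backing that 
region's HFiles.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionLocalityCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator =
             conn.getRegionLocator(TableName.valueOf("mytable"))) {
      FileSystem fs = FileSystem.get(conf);
      for (HRegionLocation loc : locator.getAllRegionLocations()) {
        String server = loc.getHostname();
        String encoded = loc.getRegion().getEncodedName();
        // Region directory for the default namespace (an assumption;
        // adjust if hbase.rootdir differs on your cluster).
        Path regionDir = new Path("/hbase/data/default/mytable/" + encoded);
        long localBlocks = 0, totalBlocks = 0;
        // Iterate the column family directories, then each HFile's blocks.
        for (FileStatus f : fs.listStatus(regionDir)) {
          if (!f.isDirectory()) continue;
          for (FileStatus hfile : fs.listStatus(f.getPath())) {
            for (BlockLocation b :
                 fs.getFileBlockLocations(hfile, 0, hfile.getLen())) {
              totalBlocks++;
              // Count the block as local if any replica lives on the
              // same host as the regionserver serving this region.
              for (String host : b.getHosts()) {
                if (host.equalsIgnoreCase(server)) { localBlocks++; break; }
              }
            }
          }
        }
        System.out.printf("%s on %s: %d/%d blocks local%n",
            encoded, server, localBlocks, totalBlocks);
      }
    }
  }
}

Both getAllRegionLocations() and getFileBlockLocations() are metadata-only 
calls, so this runs quickly even against a reasonably large table.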

The performance of the rack-local mappers is poor and causes overall scan 
performance to suffer.

The table isn't new, and from what I understand, locality should be restored 
over time as compactions rewrite each region's HFiles with the first replica 
placed on the local datanode. Are there other reasons for data locality to be 
poor, and is there any way to fix it?
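
One thing I was considering trying is forcing a major compaction, on the 
theory that rewriting every region's HFiles would bring the first replica back 
onto each local datanode. A sketch of what I had in mind (again HBase 2.x 
client API, "mytable" as a placeholder; equivalently major_compact 'mytable' 
from the hbase shell):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ForceMajorCompact {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Asynchronous request; the compaction proceeds in the background
      // on the regionservers.
      admin.majorCompact(TableName.valueOf("mytable"));
    }
  }
}

Is that a reasonable approach, or is there something cheaper?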
