>
> 1. HBase achieves data locality between store files and the RegionServer
> only if the RegionServer stays up for a long time. If there are too many
> region movements, or the server has been recycled recently, there is a
> high probability that store file blocks are not local to the region
> server. But getSplits always returns the RegionServer of the StoreFile.
> So in this scenario, does MapReduce lose its data locality?
>

Yes. You can't get data locality in this case, because mapreduce reads
from the regionserver and the data is not local to the regionserver. The
data moves from datanode->regionserver->mapreduce. If the blocks are not
local to the regionserver, you cannot avoid using the network for the
datanode->regionserver hop, even if the regionserver->mapreduce step is
local.
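
To make the two hops concrete, here is a toy model of that read path. It
is only a sketch: the class name ReadPath, the method networkHops, and the
host names are made up for illustration and are not part of HBase or
Hadoop. It assumes a hop is "local" when both ends are on the same host,
and that the datanode->regionserver hop is local whenever any replica of
the block lives on the regionserver's host.

```java
import java.util.Collection;

/** Hypothetical toy model of datanode -> regionserver -> map task reads. */
final class ReadPath {
    private ReadPath() {}

    /**
     * Counts how many of the two hops have to cross the network.
     *
     * @param replicaHosts     hosts holding a replica of the store file block
     * @param regionServerHost host of the regionserver serving the region
     * @param mapTaskHost      host where the map task runs
     */
    public static int networkHops(Collection<String> replicaHosts,
                                  String regionServerHost,
                                  String mapTaskHost) {
        int hops = 0;
        // Hop 1: the regionserver reads the block from HDFS.
        if (!replicaHosts.contains(regionServerHost)) {
            hops++;
        }
        // Hop 2: the map task reads rows from the regionserver.
        if (!regionServerHost.equals(mapTaskHost)) {
            hops++;
        }
        return hops;
    }
}
```

Note that in this model a map task running on a host that holds a replica
still pays for both hops if the regionserver is elsewhere and has no local
replica, because the task never reads the block from HDFS itself.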


> 2. As getSplits returns only the RegionServer, the MR job is not aware
> of the multiple replicas of the StoreFile block. It only accesses one
> block (which is local if the point above is not applicable). This can
> constrain the MR processing, as you cannot distribute the data processing
> in the best possible manner. Is this correct?
>

I think there's a misunderstanding. The mapreduce job does not read from
HDFS when using TableInputFormat. The mapreduce tasks use the HBase client
API to talk to a regionserver, and the *regionserver* reads from HDFS.

Also yes, the locality of data blocks to regionservers can be suboptimal,
and the locality of mapreduce tasks to regionservers can also be suboptimal.

> 3. A guess: since the MR processing goes through the RegionServer, it may
> impact the RegionServer's performance for other random operations?
>

Yes, absolutely. Some people use separate HBase clusters for mapreduce
versus real-time traffic for this reason. You can also try to limit the
rate of data consumption by your mapreduce job by reducing the number of
map tasks, or sleeping for short periods in your mapper, or any other hack
that will slow your job down.
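
As a rough sketch of the "sleep in your mapper" idea, something like the
helper below would cap how fast one map task pulls rows. RowThrottle and
its methods are hypothetical names of mine, not an HBase or Hadoop API,
and the rate you pass in is something you would have to tune yourself.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that caps the row rate of a single map task. */
final class RowThrottle {
    private final long nanosPerRow;
    private long nextAllowedNanos;

    public RowThrottle(double maxRowsPerSecond) {
        this(maxRowsPerSecond, System.nanoTime());
    }

    /** Same as above, but with an explicit start time (handy for testing). */
    public RowThrottle(double maxRowsPerSecond, long startNanos) {
        if (maxRowsPerSecond <= 0) {
            throw new IllegalArgumentException("maxRowsPerSecond must be > 0");
        }
        this.nanosPerRow = (long) (1_000_000_000L / maxRowsPerSecond);
        this.nextAllowedNanos = startNanos;
    }

    /**
     * Reserves a slot for one row and returns how many nanoseconds the
     * caller should wait before processing it (0 if it may go right away).
     */
    public long reserve(long nowNanos) {
        long wait = Math.max(0L, nextAllowedNanos - nowNanos);
        // Schedule the next slot one interval after this row's slot.
        nextAllowedNanos = Math.max(nextAllowedNanos, nowNanos) + nanosPerRow;
        return wait;
    }

    /** Blocks the calling thread until the next row may be processed. */
    public void acquire() throws InterruptedException {
        long wait = reserve(System.nanoTime());
        if (wait > 0) {
            TimeUnit.NANOSECONDS.sleep(wait);
        }
    }
}
```

You would create one RowThrottle per map task and call acquire() at the
top of map(). Each task throttles on its own, so the total load on the
regionservers is roughly the per-task rate times the number of map tasks
running at once.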

Good luck!
-Dave
