A write in HDFS (by default) places one replica on the local datanode, a second on a node in a different rack (when applicable), and a third on a different node in the same rack as the second. HBase gets data locality because the region servers are co-located with the datanodes, so after a compaction all blocks of the compacted HFile(s) are local. For bulkload you probably had an external process place the HFiles onto HDFS, and hence the locations of these HFiles' blocks are more or less random (from HBase's point of view).
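For what it's worth, you can see where a given HFile's blocks actually landed by asking the namenode directly and comparing the hosts against the region server serving that region. A rough sketch (the HFile path passed as the argument is just a placeholder; run it with the cluster's Hadoop configuration on the classpath):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HFileBlockLocations {
      public static void main(String[] args) throws Exception {
        // Path to an HFile under the table's directory in HDFS (placeholder).
        Path hfile = new Path(args[0]);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(hfile);

        // Ask the namenode which datanodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
      }
    }

If the hosts listed for the blocks rarely include the node hosting the region, you are seeing the random placement described above.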
Sometimes the HFiles need to be split again (if they do not fit the current region boundaries). In that case we could be smart and write the split HFiles to the correct datanodes to get data locality, but it seems we are not doing that.

-- Lars

________________________________
From: Scott Kuehn <[email protected]>
To: [email protected]
Sent: Wednesday, August 7, 2013 1:19 PM
Subject: Bulkloading impacts to block locality (0.94.6)

I'd like to improve block locality on a system where nearly 100% of data ingest is via bulkloading. Presently, I measure block locality by monitoring the hdfsBlocksLocalityIndex metric. On a 10 node cluster with block replication of 3, the block locality index is about 30%, which is what I'd expect to see from random block placement. Running a major compaction does not significantly improve the locality.

How can I maximize block locality in a bulkloading-based system?
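To make the compaction point in the reply above concrete: a major compaction asks the hosting region servers to rewrite their HFiles, and it is that rewrite which restores locality for the compacted data. A minimal sketch with the 0.94 client API (the table name "mytable" is just a placeholder); note the call only queues the compaction, which then runs asynchronously on the region servers:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class MajorCompactTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          // Queue a major compaction for every region of the table; each
          // region server rewrites its HFiles, so the first replica of the
          // new blocks lands on the local datanode.
          admin.majorCompact("mytable");  // placeholder table name
        } finally {
          admin.close();
        }
      }
    }

The same thing can be done from the HBase shell with major_compact 'mytable'.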
