Hi folks-
Something we've been using that's been working out pretty well with
client-loaded data (i.e., not "bulk loading" via creating your own StoreFiles)
is to pre-bucket the Puts by RegionServer in chunks.
When the client calls flushCommits() it will internally do the same thing and
bucket Puts by RegionServer, but when loading a lot of data it is more
efficient to make 1 RPC call to a RS delivering 100 Puts than 10 RPC calls
delivering 10 Puts apiece, for example. This per-RS iteration is not a "bug"
in the client, since it makes it easy to communicate with any RS in the
cluster, but the client doesn't know you are trying to batch-load.
The gist is to bucket every 50k-100k (or more) Puts and thereby reduce the
number of RS RPC calls. Your mileage may vary, so work out the optimal bucket
size with your data (see the chunking sketch below).
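For example, here's a minimal sketch of driving the load in fixed-size chunks;
BUCKET_SIZE and loadInChunks are just illustrative names, and it calls the
bucketAndPut method defined further down:

private static final int BUCKET_SIZE = 50000;  // illustrative; tune for your data

public void loadInChunks(HTable htable, List<Put> puts) throws IOException {
  for (int i = 0; i < puts.size(); i += BUCKET_SIZE) {
    int end = Math.min(i + BUCKET_SIZE, puts.size());
    // each chunk gets bucketed by RegionServer and flushed
    bucketAndPut(htable, puts.subList(i, end));
  }
}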
I'll add this to the HBase book and look into adding a utility method for this.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public void bucketAndPut(HTable htable, List<Put> puts) throws IOException {
  // group the Puts by the hostname of the RegionServer that will receive
  // them (could also use a Guava Multimap)
  Map<String, List<Put>> putMap = new HashMap<String, List<Put>>();
  for (Put put : puts) {
    HRegionLocation rl = htable.getRegionLocation(put.getRow());
    String hostname = rl.getServerAddress().getHostname();
    add(putMap, hostname, put);
  }
  // submit one list per RegionServer so the flush goes out in large
  // per-RS batches
  for (List<Put> puts2 : putMap.values()) {
    // adjust writeBuffer as necessary, or use the .batch method
    htable.put(puts2);
  }
  htable.flushCommits();
}

private void add(Map<String, List<Put>> putMap, String hostname, Put put) {
  List<Put> recs = putMap.get(hostname);
  if (recs == null) {
    recs = new ArrayList<Put>();
    putMap.put(hostname, recs);
  }
  recs.add(put);
}
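To get the most out of the per-RS batches, you'd typically also turn off
auto-flush and enlarge the client-side write buffer before loading. A rough
sketch (the 12 MB figure is just an assumption to tune):

// let Puts accumulate in the write buffer instead of flushing per call
htable.setAutoFlush(false);
// 12 MB is an assumption; size it for your Put sizes and bucket size
htable.setWriteBufferSize(1024L * 1024 * 12);
loadInChunks(htable, puts);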
Doug Meil
Chief Software Architect, Explorys
[email protected]