Hi folks-
Something we've been using that's been working out pretty well with
client-loaded data (i.e., not "bulk loading" via creating your own StoreFiles)
is to pre-bucket the Puts by RegionServer in chunks.
When the client calls flushCommits() it will internally do the same thing and
bucket Puts by RegionServer, but when loading a lot of data it is more
efficient to make 1 RPC call to a RS delivering 100 Puts than 10 RPC calls
delivering 10 Puts apiece, for example. This per-RS iteration is not a "bug"
in the client, since it makes it easy to communicate with any RS in the
cluster, but the client doesn't know you are trying to batch-load.
The gist is to bucket every 50k-100k (or more) Puts and thereby reduce the
number of RS RPC calls. Your mileage may vary, so work out the optimal bucket
size with your data (see the chunking sketch below).
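For example, here's a minimal sketch of driving the load in fixed-size chunks;
BUCKET_SIZE and loadInChunks are just illustrative names, and it calls the
bucketAndPut method defined further down:

private static final int BUCKET_SIZE = 50000;  // illustrative; tune for your data

public void loadInChunks(HTable htable, List<Put> puts) throws IOException {
  for (int i = 0; i < puts.size(); i += BUCKET_SIZE) {
    int end = Math.min(i + BUCKET_SIZE, puts.size());
    // each chunk gets bucketed by RegionServer and flushed
    bucketAndPut(htable, puts.subList(i, end));
  }
}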
I'll add this to the HBase book and look into adding a utility method for this.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public void bucketAndPut(HTable htable, List<Put> puts) throws IOException {
  // group the Puts by the hostname of the RegionServer that will receive
  // them (could also use a Guava Multimap)
  Map<String, List<Put>> putMap = new HashMap<String, List<Put>>();
  for (Put put : puts) {
    HRegionLocation rl = htable.getRegionLocation(put.getRow());
    String hostname = rl.getServerAddress().getHostname();
    add(putMap, hostname, put);
  }
  // submit one list per RegionServer so the flush goes out in large
  // per-RS batches
  for (List<Put> puts2 : putMap.values()) {
    // adjust writeBuffer as necessary, or use the .batch method
    htable.put(puts2);
  }
  htable.flushCommits();
}

private void add(Map<String, List<Put>> putMap, String hostname, Put put) {
  List<Put> recs = putMap.get(hostname);
  if (recs == null) {
    recs = new ArrayList<Put>();
    putMap.put(hostname, recs);
  }
  recs.add(put);
}
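To get the most out of the per-RS batches, you'd typically also turn off
auto-flush and enlarge the client-side write buffer before loading. A rough
sketch (the 12 MB figure is just an assumption to tune):

// let Puts accumulate in the write buffer instead of flushing per call
htable.setAutoFlush(false);
// 12 MB is an assumption; size it for your Put sizes and bucket size
htable.setWriteBufferSize(1024L * 1024 * 12);
loadInChunks(htable, puts);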
Doug Meil
Chief Software Architect, Explorys
[email protected]