I have a table whose keys are prefixed with a byte to help distribute the keys 
so scans don't hotspot.
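
For concreteness, here is a minimal sketch of what I mean by a prefix byte. The 
bucket count, the hash choice, and the names SaltedKey/salt are placeholders 
for illustration, not my actual code:

```java
import java.util.Arrays;

public class SaltedKey {
    // Hypothetical bucket count; assumed here only for illustration.
    public static final int BUCKETS = 16;

    // Prepends one prefix byte derived from a hash of the original key,
    // so that consecutive keys land in different prefix partitions.
    public static byte[] salt(byte[] key) {
        // floorMod keeps the bucket non-negative even for negative hashes.
        int bucket = Math.floorMod(Arrays.hashCode(key), BUCKETS);
        byte[] salted = new byte[key.length + 1];
        salted[0] = (byte) bucket;
        System.arraycopy(key, 0, salted, 1, key.length);
        return salted;
    }
}
```

The same key always maps to the same prefix, so a point lookup can recompute 
it, while a full scan has to visit every prefix partition.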

I also have a bunch of slave processes that scan the prefix partitions in 
parallel.  Currently each slave sets up its own hbase connection, scanner, 
etc.  Most of the slave processes finish their scan and return within 2-3 
seconds.  That time tends to be the same regardless of whether there's lots 
of data or very little.  So I think the roughly 2 sec overhead is there 
because each slave sets up a new connection on each request (I am unable to 
reuse connections in the slaves).

I'm wondering if I could remove some of that overhead by using the master 
(which can reuse its hbase connection) to determine the splits, and then 
delegating that information out to each slave.  I think I could possibly use 
TableInputFormat/TableRecordReader to accomplish this?  Would this route make 
sense?
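
To make the idea concrete, here is a rough sketch of the master computing one 
scan range per prefix byte without touching HBase at all. PrefixSplits, Range 
and rangesFor are made-up names, and it assumes the prefixes are the 
consecutive byte values 0..buckets-1:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixSplits {
    // One scan range: startRow is inclusive, stopRow is exclusive.
    public static final class Range {
        public final byte[] startRow;
        public final byte[] stopRow;

        public Range(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // Builds one range per prefix byte in [0, buckets).
    public static List<Range> rangesFor(int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be in 1..256");
        }
        List<Range> ranges = new ArrayList<>();
        for (int p = 0; p < buckets; p++) {
            byte[] start = { (byte) p };
            // Prefix 0xFF has no successor byte, so use an empty stop row,
            // which HBase treats as "scan to the end of the table".
            byte[] stop = (p == 255) ? new byte[0] : new byte[] { (byte) (p + 1) };
            ranges.add(new Range(start, stop));
        }
        return ranges;
    }
}
```

The master would send each Range to a slave, which would use it as the start 
and stop row of its scan.  This only covers working out the splits; the slave 
still has to open a scanner somehow.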
