I have a table whose keys are prefixed with a byte to help distribute the keys so scans don't hotspot.
I also have a number of slave processes that scan the prefix partitions in parallel. Currently each slave sets up its own HBase connection, scanner, etc. Most of the slave processes finish their scan and return within 2-3 seconds, and that time is about the same whether there is a lot of data or very little. So I suspect the roughly 2 seconds is overhead from each slave setting up a new connection on each request (I am unable to reuse connections in the slaves). I'm wondering if I could remove some of that overhead by using the master (which can reuse its HBase connection) to determine the splits, and then delegating that information to each slave. I think I could possibly use TableInputFormat/TableRecordReader to accomplish this. Would this route make sense?
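
To make the "master determines the splits" part concrete, here is a rough sketch of the range computation I have in mind. It has no HBase dependency and only builds the start/stop row keys for each one-byte prefix. The names PrefixSplits, Range and prefixRanges are made up for illustration, and the bucket count of 4 in main is an assumption, not my real setup. It relies on two things I believe hold for HBase scans: row keys compare as unsigned bytes, and an empty stop row means "scan to the end of the table".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes one scan range per one-byte key prefix.
public class PrefixSplits {

    /** One scan range: startRow is inclusive, stopRow is exclusive (empty = end of table). */
    public static final class Range {
        public final byte[] startRow;
        public final byte[] stopRow;

        Range(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /**
     * Builds ranges for prefixes 0 .. buckets-1.
     * Assumes keys are salted with a single leading byte in that range.
     */
    public static List<Range> prefixRanges(int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be between 1 and 256");
        }
        List<Range> ranges = new ArrayList<>(buckets);
        for (int p = 0; p < buckets; p++) {
            byte[] start = { (byte) p };
            // Prefix 0xFF has no one-byte successor, so leave the stop row empty.
            byte[] stop = (p == 255) ? new byte[0] : new byte[] { (byte) (p + 1) };
            ranges.add(new Range(start, stop));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The master would compute these once and send one range to each slave.
        for (Range r : prefixRanges(4)) {
            System.out.printf("start=%02x stop=%02x%n",
                    r.startRow[0] & 0xff, r.stopRow[0] & 0xff);
        }
    }
}
```

This only covers handing out the ranges. Each slave would still have to open a scanner over its range, so I'm not sure it removes the connection setup cost by itself.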
