Re: Parallel reading advice

Jean-Daniel Cryans Wed, 28 Nov 2012 02:10:46 -0800

Inline.

J-D

On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
<[email protected]> wrote:

> I have a table whose keys are prefixed with a byte to help distribute the
> keys so scans don't hotspot.
>
> I also have a bunch of slave processes that work to scan the prefix
> partitions in parallel.  Currently each slave sets up their own hbase
> connection, scanner, etc..  Most of the slave processes finish their scan
> and return within 2-3 seconds.  It tends to take the same amount of time
> regardless of whether there's lots of data, or very little.  So I think that 2
> sec overhead is there because each slave will set up a new connection on
> each request (I am unable to reuse connections in the slaves).
>

2 secs sounds way too high. I recommend you look into this and see where
the time is spent, as you may find underlying issues like misconfiguration.
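
One rough way to see where the time goes is to time each phase of a slave's
run separately. The sketch below is plain Java with no HBase calls in it;
PhaseTimer and the phase names are made up for illustration, and you would
wrap your own connection setup, scanner open, and row iteration in it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical helper: records wall-clock time per named phase. */
final class PhaseTimer {
    // Insertion-ordered so the report lists phases in the order they first ran.
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the phase, adds its elapsed time under the given name, and returns its result. */
    <T> T time(String phase, Callable<T> body) throws Exception {
        long start = System.nanoTime();
        try {
            return body.call();
        } finally {
            // Accumulate, so a phase that runs several times is summed.
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Elapsed milliseconds recorded for a phase, or 0 if it never ran. */
    long millis(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1_000_000;
    }

    /** One line per phase, e.g. "connect: 1800 ms". */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (String phase : nanosByPhase.keySet()) {
            sb.append(phase).append(": ").append(millis(phase)).append(" ms\n");
        }
        return sb.toString();
    }
}
```

If nearly all of the 2 seconds lands in the setup phase rather than in the
scan itself, that points at connection setup or configuration, not at the
amount of data.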

>
> I'm wondering if I could remove some of that overhead by using the master
> (which can reuse its hbase connection) to determine the splits, and then
> delegating that information out to each slave. I think I could possibly use
> TableInputFormat/TableRecordReader to accomplish this?  Would this route
> make sense?
>

I'm not sure what you're talking about here. Which master? HBase's, or is
there something in your infrastructure that's also called "master"? Also,
I'm not sure what you are trying to achieve by "determine the splits". Do you
mean finding the regions you need to contact from your slaves? Since this
is something done within the HBase client, doing it externally sounds
terribly hacky. BTW why can't you keep the connections around? Is it a
problem of JVMs being re-spawned? If so, there are techniques you can use
to keep them around for reuse and then you would also benefit from reusing
connections.
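
As a minimal sketch of the reuse idea, assuming the slaves can be turned
into long-lived worker JVMs that serve many requests: create the expensive
object once on first use and hand the same instance to every later request.
SharedResource is a hypothetical name, and T stands in for whatever client
object you want to keep around; no HBase API is shown here.

```java
import java.util.function.Supplier;

/** Hypothetical holder: one lazily created, expensive resource per worker JVM. */
final class SharedResource<T> {
    private final Supplier<T> factory;
    // volatile is required for the double-checked locking below to be safe.
    private volatile T instance;

    SharedResource(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on the first call only. */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // Assumes the factory never returns null.
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Only the first request in each worker pays the setup cost; the rest skip
straight to the scan.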

Hope this helps,

J-D
