RE: HBase parallel scanner performance

Bijieshan Thu, 19 Apr 2012 07:26:45 -0700

Hi Narendra,

I have a few doubts:

1. Which version you are using?
2. What's the size of each KeyValue?
3. Did you change the GC parameters in client side or server side? After 
changing the GC parameters, did you keep an eye on the GC logs? 

Thank you.

Regards,
Jieshan

-----Original Message-----
From: Narendra yadala [mailto:[email protected]] 
Sent: Thursday, April 19, 2012 8:04 PM
To: [email protected]
Subject: Re: HBase parallel scanner performance

Hi Michel

Yes, that is exactly what I do in step 2. I am aware of the reason for the
scanner timeout exceptions. It is the time between two consecutive
invocations of the next call on a specific scanner object. I increased the
scanner timeout to 10 min on the region server and still I keep seeing the
timeouts. So I reduced my scanner cache to 128.

Full table scan takes 130 seconds and there are 2.2 million rows in the
table as of now. Each row is around 2 KB in size. I measured time for the
full table scan by issuing `count` command from the hbase shell.

I kind of understood the fix that you are specifying, but do I need to
change the table structure to fix this problem? All I do is a n^2 operation
and even that fails with 10 different types of exceptions. It is mildly
annoying that I need to know all the low level storage details of HBase to
do such a simple operation. And this is happening for just 14 parallel
scanners. I am wondering what would happen when there are thousands of
parallel scanners.

Please let me know if there is any configuration param change which would
fix this issue.

Thanks a lot
Narendra

On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <[email protected]>wrote:

> So in your step 2 you have the following:
> FOREACH row IN TABLE alpha:
>     SELECT something
>     FROM TABLE alpha
>     WHERE alpha.url = row.url
>
> Right?
> And you are wondering why you are getting timeouts?
> ...
> ...
> And how long does it take to do a full table scan? ;-)
> (there's more, but that's the first thing you should see...)
>
> Try creating a second table where you invert the URL and key pair such
> that for each URL, you have a set of your alpha table's keys?
>
> Then you have the following...
> FOREACH row IN TABLE alpha:
>   FETCH key-set FROM beta
>   WHERE beta.rowkey = alpha.url
>
> Note I use FETCH to signify that you should get a single row in response.
>
> Does this make sense?
> ( your second table is actually and index of the URL column in your first
> table)
>
> HTH
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Apr 19, 2012, at 5:43 AM, Narendra yadala <[email protected]>
> wrote:
>
> > I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
> (4*32
> > GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
> > for maintaining our cluster. I have a single tweets table in which we
> store
> > the tweets, one tweet per row (it has millions of rows currently).
> >
> > Now I try to run a Java batch (not a map reduce) which does the
> following :
> >
> >   1. Open a scanner over the tweet table and read the tweets one after
> >   another. I set scanner caching to 128 rows as higher scanner caching is
> >   leading to ScannerTimeoutExceptions. I scan over the first 10k rows
> only.
> >   2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
> >   in that tweet and open another scanner over the tweets table to see who
> >   else shared that link. This involves getting rows having that URL from
> the
> >   entire table (not first 10k rows).
> >   3. Do similar stuff as in step 2 for hashtags
> >   (hashtagcolfamily:hashtagvalue).
> >   4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
> >   can be higher (thousands also) later.
> >
> >
> > When I run this batch I got the GC issue which is specified here
> >
> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> > Then I tried to turn on the MSLAB feature and changed the GC settings by
> > specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
> > Even after doing this, I am running into all kinds of IOExceptions
> > and SocketTimeoutExceptions.
> >
> > This Java batch opens approximately 7*2 (14) scanners open at a point in
> > time and still I am running into all kinds of troubles. I am wondering
> > whether I can have thousands of parallel scanners with HBase when I need
> to
> > scale.
> >
> > It would be great to know whether I can open thousands/millions of
> scanners
> > in parallel with HBase efficiently.
> >
> > Thanks
> > Narendra
>

RE: HBase parallel scanner performance

Reply via email to