Hi Jieshan

HBase version: 0.90.4-cdh3u3
The size of each KeyValue pair should not be more than 2 KB.
I changed the GC parameters on the server side. I have not looked into the
GC logs yet, but I have noticed that GC pauses the batch process every now
and then. How do I look at the server GC logs?
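
I am guessing that GC logging on the server side is enabled with something
like the lines below in hbase-env.sh (a sketch on my part; the log file
path is only an example), but please correct me if that is not the right
place to look:

  # sketch: standard HotSpot GC-logging flags; the file path is an example
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"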

Thanks
Narendra

On Thu, Apr 19, 2012 at 7:46 PM, Bijieshan <bijies...@huawei.com> wrote:

> Hi Narendra,
>
> I have a few doubts:
>
> 1. Which version are you using?
> 2. What's the size of each KeyValue?
> 3. Did you change the GC parameters on the client side or the server side?
> After changing the GC parameters, did you keep an eye on the GC logs?
>
> Thank you.
>
> Regards,
> Jieshan
>
> -----Original Message-----
> From: Narendra yadala [mailto:narendra.yad...@gmail.com]
> Sent: Thursday, April 19, 2012 8:04 PM
> To: user@hbase.apache.org
> Subject: Re: HBase parallel scanner performance
>
> Hi Michel
>
> Yes, that is exactly what I do in step 2. I am aware of the reason for the
> scanner timeout exceptions: too much time elapsing between two consecutive
> invocations of next() on a specific scanner object. I increased the scanner
> timeout to 10 minutes on the region server, but I still keep seeing the
> timeouts, so I reduced my scanner caching to 128 rows.
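>
> For reference, this is roughly what the two settings look like (a sketch;
> I believe the 0.90.x server-side property is
> hbase.regionserver.lease.period, please correct me if that name is wrong):
>
>   // client side: keep caching small so each next() returns within the lease
>   Scan scan = new Scan();
>   scan.setCaching(128);
>
>   <!-- hbase-site.xml on the region servers: 10 minute scanner lease -->
>   <property>
>     <name>hbase.regionserver.lease.period</name>
>     <value>600000</value>
>   </property>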
>
> Full table scan takes 130 seconds and there are 2.2 million rows in the
> table as of now. Each row is around 2 KB in size. I measured the time for
> the full table scan by issuing the `count` command from the HBase shell.
>
> I roughly understand the fix you are suggesting, but do I need to change
> the table structure to fix this problem? All I do is an n^2 operation, and
> even that fails with 10 different types of exceptions. It is mildly
> annoying that I need to know all the low-level storage details of HBase to
> do such a simple operation. And this is happening with just 14 parallel
> scanners; I am wondering what would happen when there are thousands of
> parallel scanners.
>
> Please let me know if there is any configuration param change which would
> fix this issue.
>
> Thanks a lot
> Narendra
>
> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>
> > So in your step 2 you have the following:
> > FOREACH row IN TABLE alpha:
> >     SELECT something
> >     FROM TABLE alpha
> >     WHERE alpha.url = row.url
> >
> > Right?
> > And you are wondering why you are getting timeouts?
> > ...
> > ...
> > And how long does it take to do a full table scan? ;-)
> > (there's more, but that's the first thing you should see...)
> >
> > Try creating a second table where you invert the URL and key pair such
> > that for each URL, you have a set of your alpha table's keys?
> >
> > Then you have the following...
> > FOREACH row IN TABLE alpha:
> >   FETCH key-set FROM beta
> >   WHERE beta.rowkey = alpha.url
> >
> > Note I use FETCH to signify that you should get a single row in response.
> >
> > Does this make sense?
> > (Your second table is actually an index of the URL column in your first
> > table.)
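> >
> > A rough sketch of what that could look like with the 0.90.x Java client
> > (the index table name "tweets_by_url" and the "keys" family are only
> > placeholders, and conf/url/tweetRowKey are assumed to be in scope):
> >
> >   HTable beta = new HTable(conf, "tweets_by_url");  // hypothetical index
> >
> >   // at load time: one index row per URL, one column per tweet row key
> >   Put idx = new Put(Bytes.toBytes(url));
> >   idx.add(Bytes.toBytes("keys"), tweetRowKey, HConstants.EMPTY_BYTE_ARRAY);
> >   beta.put(idx);
> >
> >   // in the batch: the inner full-table scan becomes a single Get
> >   Get get = new Get(Bytes.toBytes(url));
> >   Result keySet = beta.get(get);  // every tweet key that shared this URL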
> >
> > HTH
> >
> > Sent from a remote device. Please excuse any typos...
> >
> > Mike Segel
> >
> > On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com>
> > wrote:
> >
> > > I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
> > > (4*32 GB RAM and 4*6 TB disk space) cluster. We are using Cloudera
> > > distribution for maintaining our cluster. I have a single tweets table
> > > in which we store the tweets, one tweet per row (it has millions of
> > > rows currently).
> > >
> > > Now I try to run a Java batch (not a map reduce) which does the
> > > following:
> > >
> > >   1. Open a scanner over the tweets table and read the tweets one
> > >   after another. I set scanner caching to 128 rows as higher scanner
> > >   caching is leading to ScannerTimeoutExceptions. I scan over the
> > >   first 10k rows only.
> > >   2. For each tweet, extract the URLs (linkcolfamily:urlvalue) that
> > >   are in that tweet and open another scanner over the tweets table to
> > >   see who else shared that link. This involves getting rows having
> > >   that URL from the entire table (not just the first 10k rows); see
> > >   the sketch after this list.
> > >   3. Do similar stuff as in step 2 for hashtags
> > >   (hashtagcolfamily:hashtagvalue).
> > >   4. Do steps 1-3 in parallel for approximately 7-8 threads. This
> > >   number can be higher (thousands also) later.
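> > >
> > > To make it concrete, here is a simplified sketch of steps 1 and 2 (the
> > > SingleColumnValueFilter used in the inner scan is only to illustrate
> > > the lookup, not the exact code of the batch):
> > >
> > >   HTable tweets = new HTable(conf, "tweets");  // conf assumed in scope
> > >   Scan outer = new Scan();
> > >   outer.setCaching(128);                       // step 1
> > >   ResultScanner rs = tweets.getScanner(outer);
> > >   for (Result row : rs) {
> > >       byte[] url = row.getValue(Bytes.toBytes("linkcolfamily"),
> > >                                 Bytes.toBytes("urlvalue"));
> > >       if (url == null) continue;
> > >       // step 2: second scanner over the whole table for the same URL
> > >       Scan inner = new Scan();
> > >       inner.setCaching(128);
> > >       inner.setFilter(new SingleColumnValueFilter(
> > >           Bytes.toBytes("linkcolfamily"), Bytes.toBytes("urlvalue"),
> > >           CompareFilter.CompareOp.EQUAL, url));
> > >       ResultScanner sharers = tweets.getScanner(inner);
> > >       for (Result sharer : sharers) {
> > >           // ... who else shared that link ...
> > >       }
> > >       sharers.close();
> > >   }
> > >   rs.close();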
> > >
> > >
> > > When I run this batch I got the GC issue which is specified here:
> > > http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> > > Then I tried to turn on the MSLAB feature and changed the GC settings
> > > by specifying -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags.
> > > Even after doing this, I am running into all kinds of IOExceptions
> > > and SocketTimeoutExceptions.
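> > >
> > > For reference, turning MSLAB on amounts to something like this in
> > > hbase-site.xml (if I have the property name right; the chunk-size
> > > tuning from that blog post is left at its default here):
> > >
> > >   <property>
> > >     <name>hbase.hregion.memstore.mslab.enabled</name>
> > >     <value>true</value>
> > >   </property>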
> > >
> > > This Java batch has approximately 7*2 (14) scanners open at a point
> > > in time and still I am running into all kinds of troubles. I am
> > > wondering whether I can have thousands of parallel scanners with HBase
> > > when I need to scale.
> > >
> > > It would be great to know whether I can open thousands/millions of
> > > scanners in parallel with HBase efficiently.
> > >
> > > Thanks
> > > Narendra
> >
>
