Narendra,

Are you trying to solve a real problem, or is this a class project?
Your solution doesn't scale. It's a non-starter. 130 seconds per iteration times roughly 1 million iterations is how long? About 130 million seconds, which is ~36,000 hours, or over 4 years to complete. (The numbers are rough, but you get the idea...) And that's assuming your table is static and doesn't change.

I didn't even ask whether you were attempting any sort of server-side filtering, which would reduce the amount of data you send back to the client, because it's a moot point. Finer tuning is also moot.

So you insert a row into one table and then do n^2 operations to pull the data back out. The better solution is to insert the data into 2 tables, where you then have to do only 2n operations to get the same results. That's per thread, btw. So if you were running 10 threads, you would have 10n^2 operations versus 20n operations to get the same result set. For a million-row table that's 1*10^13 versus 2*10^7.

I don't believe I mentioned anything about HBase's internals, and this solution works for any NoSQL database.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 7:03 AM, Narendra yadala <[email protected]> wrote:

> Hi Michel
>
> Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions. It is the time between two consecutive invocations of the next call on a specific scanner object. I increased the scanner timeout to 10 min on the region server and still I keep seeing the timeouts, so I reduced my scanner cache to 128.
>
> A full table scan takes 130 seconds and there are 2.2 million rows in the table as of now. Each row is around 2 KB in size. I measured the time for the full table scan by issuing the `count` command from the hbase shell.
>
> I kind of understood the fix that you are suggesting, but do I need to change the table structure to fix this problem? All I do is an n^2 operation, and even that fails with 10 different types of exceptions. It is mildly annoying that I need to know all the low-level storage details of HBase to do such a simple operation. And this is happening for just 14 parallel scanners. I am wondering what would happen when there are thousands of parallel scanners.
>
> Please let me know if there is any configuration param change which would fix this issue.
>
> Thanks a lot
> Narendra
>
> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <[email protected]> wrote:
>
>> So in your step 2 you have the following:
>>
>>   FOREACH row IN TABLE alpha:
>>     SELECT something
>>     FROM TABLE alpha
>>     WHERE alpha.url = row.url
>>
>> Right?
>> And you are wondering why you are getting timeouts?
>> ...
>> ...
>> And how long does it take to do a full table scan? ;-)
>> (There's more, but that's the first thing you should see...)
>>
>> Try creating a second table where you invert the URL and key pair, such that for each URL you have a set of your alpha table's keys.
>>
>> Then you have the following...
>>
>>   FOREACH row IN TABLE alpha:
>>     FETCH key-set FROM beta
>>     WHERE beta.rowkey = alpha.url
>>
>> Note I use FETCH to signify that you should get a single row in response.
>>
>> Does this make sense?
>> (Your second table is actually an index of the URL column in your first table.)
>>
>> HTH
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Apr 19, 2012, at 5:43 AM, Narendra yadala <[email protected]> wrote:
>>
>>> I have an issue with my HBase cluster. We have a 4-node HBase/Hadoop cluster (4*32 GB RAM and 4*6 TB disk space). We are using the Cloudera distribution for maintaining our cluster. I have a single tweets table in which we store the tweets, one tweet per row (it has millions of rows currently).
>>>
>>> Now I try to run a Java batch (not a map reduce job) which does the following:
>>>
>>> 1. Open a scanner over the tweets table and read the tweets one after another. I set scanner caching to 128 rows, as higher scanner caching is leading to ScannerTimeoutExceptions. I scan over the first 10k rows only.
>>> 2. For each tweet, extract the URLs (linkcolfamily:urlvalue) that are in that tweet and open another scanner over the tweets table to see who else shared that link. This involves getting rows having that URL from the entire table (not just the first 10k rows).
>>> 3. Do similar stuff as in step 2 for hashtags (hashtagcolfamily:hashtagvalue).
>>> 4. Do steps 1-3 in parallel for approximately 7-8 threads. This number can be higher (thousands also) later.
>>>
>>> When I run this batch I hit the GC issue described here:
>>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
>>> Then I tried to turn on the MSLAB feature and changed the GC settings by specifying the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. Even after doing this, I am running into all kinds of IOExceptions and SocketTimeoutExceptions.
>>>
>>> This Java batch has approximately 7*2 (14) scanners open at any point in time, and still I am running into all kinds of trouble. I am wondering whether I can have thousands of parallel scanners with HBase when I need to scale.
>>>
>>> It would be great to know whether I can open thousands/millions of scanners in parallel with HBase efficiently.
>>>
>>> Thanks
>>> Narendra
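
[Editor's note: a minimal sketch of the write side of the two-table pattern described above, using the HBase Java client API of that era. Whenever a tweet row is written, the same code also writes the tweet's row key into an index row keyed by the URL, so the index row accumulates the set of tweet keys for that URL. The table name (tweet_url_index), the column family (keys), and the helper method are hypothetical, invented for illustration; only the pattern is from the thread.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlIndexWriter {

    // Hypothetical index schema: row key = URL, one column per tweet key.
    private static final byte[] KEYS_FAM = Bytes.toBytes("keys");

    // Call this from wherever the tweet row itself is written, so both
    // tables are kept in step (the "insert into 2 tables" part of 2n).
    public static void indexTweet(HTable urlIndex, String tweetRowKey,
                                  Iterable<String> urlsInTweet) throws Exception {
        for (String url : urlsInTweet) {
            Put put = new Put(Bytes.toBytes(url));            // index row key = the URL
            put.add(KEYS_FAM, Bytes.toBytes(tweetRowKey),     // qualifier = tweet row key
                    Bytes.toBytes(""));                       // value unused
            urlIndex.put(put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable urlIndex = new HTable(conf, "tweet_url_index"); // hypothetical table name
        indexTweet(urlIndex, "tweet#12345",
                   java.util.Arrays.asList("http://example.com/a"));
        urlIndex.close();
    }
}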

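[Editor's note: and a sketch of the read side, replacing the inner full-table scan with a single Get against that index, which is the 2n instead of n^2. The outer scan keeps caching at 128, the value Narendra already settled on to stay inside the scanner lease timeout. The table names, the keys family, and the assumption of one URL per tweet under linkcolfamily:urlvalue are illustrative rather than the actual schema; the hashtag lookup in step 3 would get the same treatment with its own index table.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WhoElseSharedIt {

    private static final byte[] LINK_FAM = Bytes.toBytes("linkcolfamily"); // from the original post
    private static final byte[] URL_QUAL = Bytes.toBytes("urlvalue");      // from the original post
    private static final byte[] KEYS_FAM = Bytes.toBytes("keys");          // hypothetical index family

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable tweets   = new HTable(conf, "tweets");            // hypothetical table name
        HTable urlIndex = new HTable(conf, "tweet_url_index");   // hypothetical table name

        Scan scan = new Scan();
        scan.setCaching(128);       // small enough batches per next() to beat the lease timeout
        scan.addFamily(LINK_FAM);   // only ship the columns this batch actually needs

        ResultScanner scanner = tweets.getScanner(scan);
        try {
            int seen = 0;
            for (Result tweet : scanner) {
                if (++seen > 10000) break;                      // first 10k rows only, as in the batch
                byte[] url = tweet.getValue(LINK_FAM, URL_QUAL);
                if (url == null) continue;
                // Single-row lookup against the index instead of a second full scan.
                Get get = new Get(url);
                get.addFamily(KEYS_FAM);
                Result keySet = urlIndex.get(get);
                // keySet's qualifiers are the row keys of every tweet that shared this URL.
            }
        } finally {
            scanner.close();
            tweets.close();
            urlIndex.close();
        }
    }
}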