No problem. One of the hardest things to do is to stay open to other design ideas and not become wedded to one.
I think once you get that working, you can start to look at your cluster.

On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote:

> Michael,
>
> I will do the redesign and build the index. Thanks a lot for the insights.
>
> Narendra
>
> On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Narendra,
>>
>> I think you are still missing the point.
>> 130 seconds to scan the table, per iteration.
>> Even if you have only 10K rows, that's 130 * 10^4 = 1.3 * 10^6 seconds, ~361 hours.
>>
>> Compare that to 10K rows where you then select a single row in your
>> sub-select that has a list of all of the associated rows.
>> You can then do n get()s based on the data in the index (if the data
>> wasn't in the index itself).
>>
>> Assuming that the data was in the index, that's one get(), which is
>> sub-second. Just to keep things simple, assume 1 second.
>> That's 10K seconds vs 1.3 million seconds (~3 hours vs 361 hours).
>> Actually it's more like 10 ms per get(), so it's 100 seconds to run your
>> code (2 minutes or so).
>>
>> Also, since you're doing less work, you put less strain on the system.
>>
>> Look, you're asking for help, yet you're fighting to maintain a bad design.
>> Building the index table shouldn't take you more than a day to think
>> through, design, and implement.
>>
>> So you tell me: 2 minutes vs 361 hours. Which would you choose?
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:
>>
>>> Michael,
>>>
>>> Thanks for the response. This is a real problem and not a class project.
>>> The boxes themselves cost 9k ;)
>>>
>>> I think there is some difference in understanding of the problem. The
>>> table has 2m rows, but I am looking at the latest 10k rows only in the
>>> outer for loop. Only in the inner for loop am I trying to get all rows
>>> that contain the URL given by the row in the outer for loop. The pseudo
>>> code is like this (all scanners have a caching of 128):
>>>
>>> ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // gets the entire row
>>> for (int index = 0; index < 10000; index++) {
>>>   Result tweet = outerScanner.next();
>>>   if (tweet == null) break;
>>>   NavigableMap<byte[], byte[]> linkFamilyMap =
>>>       tweet.getFamilyMap(Bytes.toBytes("link"));
>>>   // assuming only one link is there in the tweet
>>>   String url = Bytes.toString(linkFamilyMap.firstKey());
>>>   Scan linkScan = new Scan();
>>>   // get only the matching link column
>>>   linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url));
>>>   ResultScanner linkScanner = tweetTable.getScanner(linkScan);
>>>   // ideally this inner for loop takes ~2 sec per scan
>>>   for (Result linkResult = linkScanner.next(); linkResult != null;
>>>       linkResult = linkScanner.next()) {
>>>     // do something with the link
>>>   }
>>>   linkScanner.close();
>>>
>>>   // do a similar for loop for hashtags
>>> }
>>>
>>> Each of my inner for loops takes around 20 seconds (or more, depending
>>> on the number of rows returned by that particular scanner) for each of
>>> the 10k rows that I am processing, and this also triggers a lot of GC
>>> in turn. So it is 10000 * 40 seconds (4 days) for each thread. But the
>>> problem is that the batch process crashes before completion, throwing
>>> IOExceptions, SocketTimeoutExceptions, and sometimes OutOfMemoryErrors.
>>>
>>> I will definitely take the much more elegant approach that you
>>> mentioned eventually. I just wanted to get to the core of the issue
>>> before choosing the solution.
>>>
>>> Thanks again.
>>> Narendra
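For reference, the single-get() lookup argued for above might look like the following minimal sketch, written against the 0.92-era HBase client API. The url_index table, its tweets column family, and the example URL are hypothetical stand-ins: the assumed design is an index whose row key is the URL and whose column qualifiers are the row keys of the tweets that contain it.

    import java.io.IOException;
    import java.util.NavigableMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UrlIndexLookup {

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical index table: row key = URL, one qualifier in
            // the "tweets" family per tweet row key containing that URL.
            HTable index = new HTable(conf, "url_index");
            try {
                Get get = new Get(Bytes.toBytes("http://example.com/some/link"));
                get.addFamily(Bytes.toBytes("tweets"));
                // One point read instead of a 130-second full table scan.
                Result result = index.get(get);
                NavigableMap<byte[], byte[]> tweetKeys =
                        result.getFamilyMap(Bytes.toBytes("tweets"));
                if (tweetKeys != null) {
                    for (byte[] tweetRowKey : tweetKeys.keySet()) {
                        // Each qualifier is the row key of a tweet that
                        // shared the URL; do a get() on the tweets table
                        // if the index row lacks the data you need.
                        System.out.println(Bytes.toString(tweetRowKey));
                    }
                }
            } finally {
                index.close();
            }
        }
    }

If the tweet data needed by the batch is denormalized into the index row itself, the follow-up get()s on the tweets table disappear as well, which is the sub-second case described in the message above.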
>>> On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> Narendra,
>>>>
>>>> Are you trying to solve a real problem, or is this a class project?
>>>>
>>>> Your solution doesn't scale. It's a non-starter. 130 seconds per
>>>> iteration times 1 million iterations is how long? 130 million seconds,
>>>> which is ~36,000 hours, or over 4 years to complete.
>>>> (The numbers are rough, but you get the idea...)
>>>>
>>>> That's assuming that your table is static and doesn't change.
>>>>
>>>> I didn't even ask if you were attempting any sort of server-side
>>>> filtering, which would reduce the amount of data you send back to the
>>>> client, because it's a moot point.
>>>>
>>>> Finer tuning is also moot.
>>>>
>>>> So you insert a row in one table. You then do n^2 operations to pull
>>>> out data. The better solution is to insert data into 2 tables, where
>>>> you then have to do 2n operations to get the same results. That's per
>>>> thread, btw. So if you were running 10 threads, you would have 10n^2
>>>> operations versus 20n operations to get the same result set.
>>>>
>>>> A million-row table... that's 1*10^13 vs 2*10^7.
>>>>
>>>> I don't believe I mentioned anything about HBase's internals, and this
>>>> solution works for any NoSQL database.
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On Apr 19, 2012, at 7:03 AM, Narendra yadala <narendra.yad...@gmail.com> wrote:
>>>>
>>>>> Hi Michel,
>>>>>
>>>>> Yes, that is exactly what I do in step 2. I am aware of the reason
>>>>> for the scanner timeout exceptions: it is the time between two
>>>>> consecutive invocations of the next() call on a specific scanner
>>>>> object. I increased the scanner timeout to 10 min on the region
>>>>> server and still I keep seeing the timeouts, so I reduced my scanner
>>>>> caching to 128.
>>>>>
>>>>> A full table scan takes 130 seconds, and there are 2.2 million rows
>>>>> in the table as of now. Each row is around 2 KB in size. I measured
>>>>> the time for the full table scan by issuing the `count` command from
>>>>> the hbase shell.
>>>>>
>>>>> I kind of understood the fix that you are specifying, but do I need
>>>>> to change the table structure to fix this problem? All I do is an n^2
>>>>> operation, and even that fails with 10 different types of exceptions.
>>>>> It is mildly annoying that I need to know all the low-level storage
>>>>> details of HBase to do such a simple operation. And this is happening
>>>>> for just 14 parallel scanners. I am wondering what would happen when
>>>>> there are thousands of parallel scanners.
>>>>>
>>>>> Please let me know if there is any configuration param change which
>>>>> would fix this issue.
>>>>>
>>>>> Thanks a lot
>>>>> Narendra
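A note on the timeout mechanics discussed above: the scanner lease on the region server expires based on the gap between two next() calls, and each next() fetches one caching-sized batch, so the per-batch time budget is roughly caching multiplied by the per-row processing time. Below is a minimal sketch of the client-side knob, with illustrative values only; hbase.regionserver.lease.period is the relevant server-side property in this era and lives in hbase-site.xml on the region servers.

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanSettings {

        // Illustrative numbers: with the default 60s lease and ~20 ms of
        // client work per row, caching much above ~3000 rows per next()
        // risks a ScannerTimeoutException; 128 leaves ample headroom.
        public static Scan linkScan(String url) {
            Scan scan = new Scan();
            scan.setCaching(128); // rows fetched per next() RPC
            // Restrict the scan to the one column being matched.
            scan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url));
            return scan;
        }
    }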
>>>>> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> So in your step 2 you have the following:
>>>>>>
>>>>>>   FOREACH row IN TABLE alpha:
>>>>>>     SELECT something
>>>>>>     FROM TABLE alpha
>>>>>>     WHERE alpha.url = row.url
>>>>>>
>>>>>> Right?
>>>>>> And you are wondering why you are getting timeouts?
>>>>>> ...
>>>>>> And how long does it take to do a full table scan? ;-)
>>>>>> (There's more, but that's the first thing you should see...)
>>>>>>
>>>>>> Try creating a second table where you invert the URL and key pair,
>>>>>> such that for each URL you have a set of your alpha table's keys.
>>>>>>
>>>>>> Then you have the following...
>>>>>>
>>>>>>   FOREACH row IN TABLE alpha:
>>>>>>     FETCH key-set FROM beta
>>>>>>     WHERE beta.rowkey = alpha.url
>>>>>>
>>>>>> Note I use FETCH to signify that you should get a single row in
>>>>>> response.
>>>>>>
>>>>>> Does this make sense?
>>>>>> (Your second table is actually an index of the URL column in your
>>>>>> first table.)
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com> wrote:
>>>>>>
>>>>>>> I have an issue with my HBase cluster. We have a 4-node
>>>>>>> HBase/Hadoop cluster (4*32 GB RAM and 4*6 TB disk space). We are
>>>>>>> using the Cloudera distribution for maintaining our cluster. I have
>>>>>>> a single tweets table in which we store the tweets, one tweet per
>>>>>>> row (it has millions of rows currently).
>>>>>>>
>>>>>>> Now I try to run a Java batch (not a map reduce) which does the
>>>>>>> following:
>>>>>>>
>>>>>>> 1. Open a scanner over the tweets table and read the tweets one
>>>>>>> after another. I set scanner caching to 128 rows, as higher scanner
>>>>>>> caching leads to ScannerTimeoutExceptions. I scan over the first
>>>>>>> 10k rows only.
>>>>>>> 2. For each tweet, extract the URLs (linkcolfamily:urlvalue) that
>>>>>>> are in that tweet and open another scanner over the tweets table to
>>>>>>> see who else shared that link. This involves getting rows having
>>>>>>> that URL from the entire table (not just the first 10k rows).
>>>>>>> 3. Do the same as in step 2 for hashtags
>>>>>>> (hashtagcolfamily:hashtagvalue).
>>>>>>> 4. Do steps 1-3 in parallel in approximately 7-8 threads. This
>>>>>>> number can be higher (thousands also) later.
>>>>>>>
>>>>>>> When I run this batch I get the GC issue which is described here:
>>>>>>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
>>>>>>> Then I tried to turn on the MSLAB feature and changed the GC
>>>>>>> settings by specifying the -XX:+UseParNewGC and
>>>>>>> -XX:+UseConcMarkSweepGC JVM flags. Even after doing this, I am
>>>>>>> running into all kinds of IOExceptions and SocketTimeoutExceptions.
>>>>>>>
>>>>>>> This Java batch keeps approximately 7*2 (14) scanners open at any
>>>>>>> point in time, and still I am running into all kinds of trouble. I
>>>>>>> am wondering whether I can have thousands of parallel scanners with
>>>>>>> HBase when I need to scale.
>>>>>>>
>>>>>>> It would be great to know whether I can open thousands/millions of
>>>>>>> scanners in parallel with HBase efficiently.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Narendra
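To close the loop on the index-table suggestion that settles this thread: keeping the index current means writing two rows per tweet, roughly as in the sketch below, again against the 0.92-era client API. The table, family, and qualifier names are hypothetical, not the actual schema from the thread.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TweetWriter {

        private final HTable tweets;
        private final HTable urlIndex;

        public TweetWriter(Configuration conf) throws IOException {
            this.tweets = new HTable(conf, "tweets");
            this.urlIndex = new HTable(conf, "url_index");
        }

        /** Stores the tweet, then indexes it under the URL it contains. */
        public void write(String tweetId, String body, String url)
                throws IOException {
            Put tweet = new Put(Bytes.toBytes(tweetId));
            tweet.add(Bytes.toBytes("content"), Bytes.toBytes("body"),
                    Bytes.toBytes(body));
            tweet.add(Bytes.toBytes("link"), Bytes.toBytes(url),
                    Bytes.toBytes(url));
            tweets.put(tweet);

            // Index row: key = URL, one qualifier per tweet row key.
            // Finding everyone who shared a URL becomes a single get() on
            // this row instead of a scan of the whole tweets table.
            Put index = new Put(Bytes.toBytes(url));
            index.add(Bytes.toBytes("tweets"), Bytes.toBytes(tweetId),
                    Bytes.toBytes(System.currentTimeMillis())); // presence is the fact
            urlIndex.put(index);
        }
    }

The two puts are not atomic, so a crash between them can leave a tweet unindexed; since HBase offers no cross-table transactions, a periodic repair scan is the usual way to live with that.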