Re: Using Scans in parallel

Bryan Keller Sun, 09 Oct 2011 20:03:49 -0700

Sure. 2 region servers with 5 disks each. The table has 2 column families and
113 regions total for 2 million rows. I'm scanning just one of the families.
With the 8 parallel scanners the scan is about 4x faster than with the serial
scanner (roughly 20 minutes vs. 80 minutes).


On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha wrote:

> Interesting.
> 
> Hey Bryan, can you please share the stats about: how many Regions, how
> many Region Servers, time taken by Serial scanner and with 8 parallel
> scanners.
> 
> Himanshu
> 
> On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <[email protected]> wrote:
>> This is 100% reproducible for me, so I doubt it is related to random number 
>> generation.
>> 
>> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>> 
>>> How frequently does this happen?
>>> I did notice a while ago in the code that scanner ids are drawn just from a 
>>> Random number generator.
>>> 
>>> So in theory it would be possible that multiple concurrent scans draw the 
>>> same scanner id.
>>> 
>>> Since these are longs, this is astronomically unlikely, though (picking the 
>>> same number out of 2^64 just does not happen :) ).
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Bryan Keller <[email protected]>
>>> To: [email protected]
>>> Sent: Sunday, October 9, 2011 2:40 PM
>>> Subject: Re: Using Scans in parallel
>>> 
>>> This is just scanning (reads). I'll need to do more testing to find a 
>>> cause, hopefully it is something with my test.
>>> 
>>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>> 
>>>> Which version of HBase?
>>>> Are there concurrent inserts? If so, do you see splits in the log files 
>>>> happening while you do the scanning?
>>>> 
>>>> I am pretty sure this has nothing to do with concurrent scans.
>>>> 
>>>> From: Bryan Keller <[email protected]>
>>>> To: Bryan Keller <[email protected]>
>>>> Cc: [email protected]
>>>> Sent: Sunday, October 9, 2011 11:03 AM
>>>> Subject: Re: Using Scans in parallel
>>>> 
>>>> On further thought, it seems this might be a serious issue, as two 
>>>> unrelated processes within an application may be scanning the same table 
>>>> at the same time.
>>>> 
>>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>> 
>>>>> I was not able to get consistent results using multiple scanners in 
>>>>> parallel on a table. I implemented a counter test that used 8 scanners in 
>>>>> parallel on a table with 2m rows with 2k+ columns each, and the results 
>>>>> were not consistent. There were no errors thrown, but the count was off 
>>>>> by as much as 2%. Using a single thread gave the same (correct) result 
>>>>> every run.
>>>>> 
>>>>> I tried various approaches, such as creating an HTable and opening a 
>>>>> connection per thread, but I was not able to get stable results. I would 
>>>>> do some testing before using parallel scanners as described here.
>>>>> 
>>>>> 
>>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>> 
>>>>>> That's part of it, the other part is to get the region demarcations.
>>>>>> You can also just get the smallest and largest key of the table and pick 
>>>>>> other demarcations for your scans. Then your individual scans will 
>>>>>> likely cover multiple regions and regionservers.
>>>>>> 
>>>>>> 
>>>>>> Your threading model depends on your needs. If you are interested in the
>>>>>> lowest latency, you want to keep your regionservers busy for each query.
>>>>>> What exactly that means depends on your setup. Maybe you split up the 
>>>>>> overall scan so that no more than N scans are active at any regionserver.
>>>>>> 
>>>>>> If you're more interested in overall predictability, you might not want 
>>>>>> to parallelize each scan too much.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: Sam Seigal <[email protected]>
>>>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>>> Subject: Re: Using Scans in parallel
>>>>>> 
>>>>>> So the whole point of getting the region locations is to ensure that
>>>>>> there is one thread per region server?
>>>>>> 
>>>>>> 
>>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> 
>>>>>> wrote:
>>>>>>> Hi Sam,
>>>>>>> 
>>>>>>> 
>>>>>>> There were some attempts to build this in. In the end I think the exact 
>>>>>>> patterns are different based on what one is trying to achieve.
>>>>>>> Currently what you can do is get all the region locations 
>>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>>> get each region's start and end keys.
>>>>>>> Now you can issue parallel scans for as many regions as you want (by 
>>>>>>> creating a Scan object with the start and stop row set to the region's
>>>>>>> start and end key).
>>>>>>> You probably want to group the regions by regionserver and have one 
>>>>>>> thread per region server, or something.
>>>>>>> 
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> ________________________________
>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>>> Subject: Using Scans in parallel
>>>>>>> 
>>>>>>> Hi ,
>>>>>>> 
>>>>>>> Is there a known way to do Scans in parallel (in different
>>>>>>> threads even) and then sort/combine the output?
>>>>>>> 
>>>>>>> For a row key like:
>>>>>>> 
>>>>>>> prefix-event_type-event_id
>>>>>>> prefix-event_type-event_id
>>>>>>> 
>>>>>>> I want to declare two scan objects (for, say, event_type foo)
>>>>>>> 
>>>>>>> Scan 1 =>  0-foo
>>>>>>> Scan 2 =>  1-foo
>>>>>>> 
>>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>>> then merge the results ?
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> 
>>>>>>> Sam
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>>
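
For readers who want to try the approach Lars describes earlier in the
thread (split the key space at chosen demarcations, scan each range in its
own thread, then combine the results), here is a minimal Java sketch. It is
not HBase client code: a TreeMap stands in for the table, and
ParallelScanSketch, KeyRange, toRanges, countRange and countInParallel are
hypothetical names made up for this illustration. In a real client each task
would presumably build a Scan with the range's start and stop row instead of
slicing a map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelScanSketch {

    /** Half-open key range [start, stop); null means unbounded on that side. */
    static final class KeyRange {
        final String start;
        final String stop;

        KeyRange(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** Turns ascending split points into contiguous, non-overlapping ranges. */
    static List<KeyRange> toRanges(List<String> splitPoints) {
        List<KeyRange> ranges = new ArrayList<>();
        String prev = null;
        for (String split : splitPoints) {
            ranges.add(new KeyRange(prev, split));
            prev = split;
        }
        ranges.add(new KeyRange(prev, null)); // last range runs to the end
        return ranges;
    }

    /** Stand-in for one scan: counts the rows of the in-memory table in the range. */
    static long countRange(NavigableMap<String, String> table, KeyRange r) {
        NavigableMap<String, String> view = table;
        if (r.start != null) {
            view = view.tailMap(r.start, true);  // start key is inclusive
        }
        if (r.stop != null) {
            view = view.headMap(r.stop, false);  // stop key is exclusive
        }
        return view.size();
    }

    /** Runs one counting task per range on a fixed pool and sums the partial counts. */
    static long countInParallel(NavigableMap<String, String> table,
                                List<String> splitPoints,
                                int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (KeyRange r : toRanges(splitPoints)) {
                Callable<Long> task = () -> countRange(table, r);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that range is done
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row-%04d", i), "v");
        }
        List<String> splits = Arrays.asList("row-0250", "row-0500", "row-0750");
        long parallel = countInParallel(table, splits, 4);
        System.out.println(parallel == table.size() ? "counts match" : "MISMATCH");
    }
}
```

Because the ranges are half-open and contiguous, every key belongs to exactly
one task. Comparing the parallel total against a single-threaded count, as
Bryan did, is a cheap way to check that the ranges neither overlap nor leave
gaps.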

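As a rough check on the "astronomically unlikely" remark about scanner ids:
assuming ids are drawn uniformly and independently from 2^64 possible long
values (an assumption taken from the message, not verified against the HBase
source), the chance that any two of n concurrent scanners share an id is
approximately n(n-1)/2 divided by 2^64. The class and method names below are
hypothetical.

```java
class ScannerIdCollision {

    /**
     * Approximate probability that at least two of n uniformly random
     * 64-bit ids coincide. The approximation (number of pairs divided by
     * the number of possible ids) only holds while the result is tiny.
     */
    static double collisionProbability(long n) {
        double pairs = n * (n - 1) / 2.0;
        return pairs / Math.pow(2.0, 64);
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(8));
    }
}
```

For 8 scanners the estimate is on the order of 1e-18, which fits Bryan's
point that a 100% reproducible mismatch is unlikely to come from id
collisions.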