Re: Using Scans in parallel

Bryan Keller Mon, 10 Oct 2011 21:13:31 -0700

To follow up, the problem I was having w/ parallel scanners appears to be an 
issue w/ my app; I wasn't able to reproduce it in a more controlled test.
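
A controlled check along these lines does not need a cluster. The following is 
a minimal sketch, not the actual test that was run: a sorted in-memory map 
stands in for the table, and the Region class, countRange, and parallelCount 
are hypothetical names made up for illustration. It groups key ranges by server, 
runs one thread per server, and compares the parallel row count with a serial 
count.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelScanCheck {
    // Hypothetical stand-in for a region: a [startKey, endKey) range on a server.
    static final class Region {
        final String startKey, endKey, server;
        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    // Counts rows in [start, end); an empty end key means "to the end of the table".
    // Against a real table this would be one scan with a start and stop row.
    static long countRange(NavigableMap<String, String> table, String start, String end) {
        NavigableMap<String, String> tail = table.tailMap(start, true);
        return (end.isEmpty() ? tail : tail.headMap(end, false)).size();
    }

    // Groups regions by server and runs one task per server; each task scans
    // its own regions one after another. The table is only read, never modified.
    static long parallelCount(NavigableMap<String, String> table, List<Region> regions)
            throws InterruptedException, ExecutionException {
        Map<String, List<Region>> byServer = new HashMap<>();
        for (Region r : regions) {
            byServer.computeIfAbsent(r.server, s -> new ArrayList<>()).add(r);
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, byServer.size()));
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (List<Region> group : byServer.values()) {
                Callable<Long> task = () -> {
                    long n = 0;
                    for (Region r : group) {
                        n += countRange(table, r.startKey, r.endKey);
                    }
                    return n;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get();   // merge step: sum the partial counts
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 10_000; i++) {
            table.put(String.format("row-%05d", i), "v");
        }
        // Four disjoint ranges covering the whole key space, spread over two servers.
        List<Region> regions = Arrays.asList(
            new Region("", "row-02500", "rs1"),
            new Region("row-02500", "row-05000", "rs2"),
            new Region("row-05000", "row-07500", "rs1"),
            new Region("row-07500", "", "rs2"));
        long serial = countRange(table, "", "");
        long parallel = parallelCount(table, regions);
        System.out.println("serial=" + serial + " parallel=" + parallel);
    }
}
```

If the ranges are disjoint and cover the whole key space, the two counts should 
match on every run; a mismatch points at the range boundaries or at shared 
state between threads rather than at the scans themselves.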


On Oct 9, 2011, at 8:21 PM, Bryan Keller wrote:

> BTW, a MapReduce job can scan the table in 6 minutes (both column families), 
> including some processing. So that is the fastest approach.
> 
> On Oct 9, 2011, at 8:03 PM, Bryan Keller wrote:
> 
>> Sure. 2 region servers with 5 disks each. Table has 2 column families and 
>> 113 regions total for 2 million rows. I'm scanning just one of the families. 
>> Performance with the 8 parallel scanners is 4x faster than the serial 
>> scanner (roughly 20 minutes vs. 80 minutes).
>> 
>> On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha wrote:
>> 
>>> Interesting.
>>> 
>>> Hey Bryan, can you please share the stats about: how many Regions, how
>>> many Region Servers, time taken by Serial scanner and with 8 parallel
>>> scanners.
>>> 
>>> Himanshu
>>> 
>>> On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <[email protected]> wrote:
>>>> This is 100% reproducible for me, so I doubt it is related to random 
>>>> number generation.
>>>> 
>>>> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>>>> 
>>>>> How frequently does this happen?
>>>>> I did notice a while ago in the code that scanner ids are drawn just from 
>>>>> a Random number generator.
>>>>> 
>>>>> So in theory it would be possible that multiple concurrent scans draw the 
>>>>> same scanner id.
>>>>> 
>>>>> Since these are longs, this is astronomically unlikely, though (picking 
>>>>> the same number out of 2^64 just does not happen :) ).
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Bryan Keller <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Sunday, October 9, 2011 2:40 PM
>>>>> Subject: Re: Using Scans in parallel
>>>>> 
>>>>> This is just scanning (reads). I'll need to do more testing to find a 
>>>>> cause, hopefully it is something with my test.
>>>>> 
>>>>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>>>> 
>>>>>> Which version of HBase?
>>>>>> Are there concurrent inserts? If so, do you see splits in the log files 
>>>>>> happening while you do the scanning?
>>>>>> 
>>>>>> I am pretty sure this has nothing to do with concurrent scans.
>>>>>> 
>>>>>> From: Bryan Keller <[email protected]>
>>>>>> To: Bryan Keller <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Sent: Sunday, October 9, 2011 11:03 AM
>>>>>> Subject: Re: Using Scans in parallel
>>>>>> 
>>>>>> On further thought, it seems this might be a serious issue, as two 
>>>>>> unrelated processes within an application may be scanning the same table 
>>>>>> at the same time.
>>>>>> 
>>>>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>>>> 
>>>>>>> I was not able to get consistent results using multiple scanners in 
>>>>>>> parallel on a table. I implemented a counter test that used 8 scanners 
>>>>>>> in parallel on a table with 2 million rows with 2,000+ columns each, and 
>>>>>>> the results were not consistent. There were no errors thrown, but the 
>>>>>>> count was off by as much as 2%. Using a single thread gave the same 
>>>>>>> (correct) result every run.
>>>>>>> 
>>>>>>> I tried various approaches, such as creating an HTable and opening a 
>>>>>>> connection per thread, but I was not able to get stable results. I 
>>>>>>> would do some testing before using parallel scanners as described here.
>>>>>>> 
>>>>>>> 
>>>>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>>>> 
>>>>>>>> That's part of it, the other part is to get the region demarcations.
>>>>>>>> You can also just get the smallest and largest key of the table and 
>>>>>>>> pick other demarcations for your scans. Then your individual scans 
>>>>>>>> will likely cover multiple regions and regionservers.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Your threading model depends on your needs. If you are interested in 
>>>>>>>> lowest latency you want to keep your regionservers busy for each query.
>>>>>>>> What exactly that means depends on your setup. Maybe you split up the 
>>>>>>>> overall scan so that no more than N scans are active at any 
>>>>>>>> regionserver.
>>>>>>>> 
>>>>>>>> If you're more interested in overall predictability, you might not 
>>>>>>>> want to parallelize each scan too much.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ----- Original Message -----
>>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>>>>> Subject: Re: Using Scans in parallel
>>>>>>>> 
>>>>>>>> So the whole point of getting the region locations is to ensure that
>>>>>>>> there is one thread per region server?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>> Hi Sam,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> There were some attempts to build this in. In the end I think the 
>>>>>>>>> exact patterns are different based on what one is trying to achieve.
>>>>>>>>> Currently what you can do is get all the region locations 
>>>>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>>>>> get the regions' start and end keys.
>>>>>>>>> Now you can issue parallel scans for as many regions as you want (by 
>>>>>>>>> creating a Scan object with start and stop row set to the region's
>>>>>>>>> start and end key).
>>>>>>>>> You probably want to group the regions by regionserver and have one 
>>>>>>>>> thread per region server, or something.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- Lars
>>>>>>>>> ________________________________
>>>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>>>>> Subject: Using Scans in parallel
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Is there a known way to do Scans in parallel (even in different
>>>>>>>>> threads) and then sort/combine the output?
>>>>>>>>> 
>>>>>>>>> For a row key like:
>>>>>>>>> 
>>>>>>>>> prefix-event_type-event_id
>>>>>>>>> prefix-event_type-event_id
>>>>>>>>> 
>>>>>>>>> I want to declare two scan objects (for, say, event_type foo)
>>>>>>>>> 
>>>>>>>>> Scan 1 =>  0-foo
>>>>>>>>> Scan 2 =>  1-foo
>>>>>>>>> 
>>>>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>>>>> then merge the results?
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> 
>>>>>>>>> Sam
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>
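
On the scanner id point raised earlier in the thread, the "astronomically 
unlikely" estimate can be checked with the usual birthday-bound approximation. 
This is a rough sketch that assumes ids are independent and uniformly 
distributed over 2^bits values, which is an idealization and not a statement 
about how HBase actually generates them. The class and method names are 
hypothetical, and the bits parameter is there so a smaller effective id space 
can be tried.

```java
public class ScannerIdCollision {
    // Birthday-bound approximation, valid when n is far below 2^(bits/2):
    // P(at least one collision) ~= (n * (n - 1) / 2) / 2^bits
    static double collisionProbability(long n, int bits) {
        double pairs = n * (n - 1) / 2.0;   // number of distinct id pairs
        return pairs / Math.pow(2.0, bits);
    }

    public static void main(String[] args) {
        // 8 concurrent scanners, ids assumed uniform over 2^64 values
        System.out.println(collisionProbability(8, 64));
        // one million concurrent scanners, same assumption
        System.out.println(collisionProbability(1_000_000, 64));
    }
}
```

Under these assumptions, 8 concurrent scanners give a probability of roughly 
1.5e-18, far too small to account for a result that is off on every run.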
