Re: Using Scans in parallel

Bryan Keller Sun, 09 Oct 2011 17:50:34 -0700

This is 100% reproducible for me, so I doubt it is related to random number 
generation.


On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:

> How frequently does this happen?
> I did notice a while ago in the code that scanner ids are drawn just from a 
> Random number generator.
> 
> So in theory it would be possible that multiple concurrent scans draw the 
> same scanner id. 
> 
> Since these are longs, this is astronomically unlikely, though (picking the 
> same number out of 2^64 just does not happen :) ).
> 
> 
> 
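
To put a rough number on "astronomically unlikely": a minimal sketch of the birthday-bound approximation, assuming ids are drawn uniformly and independently from all 2^64 long values. The class and method names are hypothetical and are not part of HBase. For 8 concurrent scanners the result is on the order of 1e-18, which cannot explain a failure that reproduces on every run.

```java
class ScannerIdCollision {

    // Birthday-bound approximation: P(collision) ~ n(n-1)/2 / 2^bits.
    // Only a good estimate while n is far below 2^(bits/2).
    static double collisionProbability(int concurrentScanners, int idBits) {
        double n = concurrentScanners;          // use double to avoid int overflow
        double pairs = n * (n - 1) / 2.0;       // number of scanner pairs that could clash
        return pairs / Math.pow(2.0, idBits);
    }

    public static void main(String[] args) {
        // 8 scanners, as in the counter test described in this thread.
        System.out.println(collisionProbability(8, 64));
        // Even a million concurrent scanners stays very small.
        System.out.println(collisionProbability(1_000_000, 64));
    }
}
```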
> ________________________________
> From: Bryan Keller <[email protected]>
> To: [email protected]
> Sent: Sunday, October 9, 2011 2:40 PM
> Subject: Re: Using Scans in parallel
> 
> This is just scanning (reads). I'll need to do more testing to find a cause, 
> hopefully it is something with my test.
> 
> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
> 
>> Which version of HBase?
>> Are there concurrent inserts? If so, do you see splits in the log files 
>> happening while you do the scanning?
>> 
>> I am pretty sure this has nothing to do with concurrent scans.
>> 
>> From: Bryan Keller <[email protected]>
>> To: Bryan Keller <[email protected]>
>> Cc: [email protected]
>> Sent: Sunday, October 9, 2011 11:03 AM
>> Subject: Re: Using Scans in parallel
>> 
>> On further thought, it seems this might be a serious issue, as two unrelated 
>> processes within an application may be scanning the same table at the same 
>> time.
>> 
>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>> 
>>> I was not able to get consistent results using multiple scanners in 
>>> parallel on a table. I implemented a counter test that used 8 scanners in 
>>> parallel on a table with 2m rows with 2k+ columns each, and the results 
>>> were not consistent. There were no errors thrown, but the count was off by 
>>> as much as 2%. Using a single thread gave the same (correct) result every 
>>> run.
>>> 
>>> I tried various approaches, such as creating an HTable and opening a 
>>> connection per thread, but I was not able to get stable results. I would do 
>>> some testing before using parallel scanners as described here.
>>> 
>>> 
>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>> 
>>>> That's part of it, the other part is to get the region demarcations.
>>>> You can also just get the smallest and largest key of the table and pick 
>>>> other demarcations for your scans. Then your individual scans will likely 
>>>> cover multiple regions and regionservers.
>>>> 
>>>> 
>>>> Your threading model depends on your needs. If you are interested in lowest 
>>>> latency you want to keep your regionservers busy for each query.
>>>> What exactly that means depends on your setup. Maybe you split up the 
>>>> overall scan so that no more than N scans are active at any regionserver.
>>>> 
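
One way to read "no more than N scans are active at any regionserver" is a counting semaphore per server. The following is only a sketch under that assumption: the server names, simulateScan, and runThrottled are hypothetical, and the scan itself is simulated with a short sleep instead of a real HBase call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class PerServerThrottle {

    // Placeholder for a real range scan (hypothetical; it only sleeps briefly).
    static void simulateScan() {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs the given number of scans per server, with at most maxPerServer
    // running at once against any single server. Returns the highest
    // per-server concurrency that was actually observed.
    static int runThrottled(Map<String, Integer> rangesPerServer, int maxPerServer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (Map.Entry<String, Integer> entry : rangesPerServer.entrySet()) {
                Semaphore gate = new Semaphore(maxPerServer);   // one gate per server
                AtomicInteger running = new AtomicInteger();    // scans active on this server
                for (int i = 0; i < entry.getValue(); i++) {
                    Runnable scan = () -> {
                        gate.acquireUninterruptibly();
                        try {
                            peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                            simulateScan();
                        } finally {
                            running.decrementAndGet();
                            gate.release();
                        }
                    };
                    pending.add(pool.submit(scan));
                }
            }
            for (Future<?> f : pending) {
                f.get();                                        // wait and surface failures
            }
            return peak.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> ranges = new HashMap<>();
        ranges.put("rs1", 6);   // hypothetical server names and range counts
        ranges.put("rs2", 6);
        System.out.println("peak per-server concurrency: " + runThrottled(ranges, 2, 8));
    }
}
```

Tasks waiting on a gate still occupy a pool thread here, so a real implementation would probably queue work per server instead.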
>>>> If you're more interested in overall predictability, you might not want 
>>>> to parallelize each scan too much.
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: Sam Seigal <[email protected]>
>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>> Subject: Re: Using Scans in parallel
>>>> 
>>>> So the whole point of getting the region locations is to ensure that
>>>> there is one thread per region server ?
>>>> 
>>>> 
>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
>>>>> Hi Sam,
>>>>> 
>>>>> 
>>>>> There were some attempts to build this in. In the end I think the exact 
>>>>> patterns are different based on what one is trying to achieve.
>>>>> Currently what you can do is get all the region locations 
>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>> get the regions' start and end keys.
>>>>> Now you can issue parallel scans for as many regions as you want (by 
>>>>> creating a Scan object with start and stop row set to the region's
>>>>> start and end key).
>>>>> You probably want to group the regions by regionserver and have one 
>>>>> thread per region server, or something.
>>>>> 
>>>>> 
>>>>> -- Lars
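
The range-splitting idea can be sketched without the HBase client at all. The example below is a stand-in under stated assumptions: a sorted in-memory map plays the role of the table, the boundary keys play the role of region start/end keys, and all names (ParallelRangeCount, countRange, parallelCount) are hypothetical. Each range is half-open, [start, stop), so adjacent ranges neither overlap nor leave gaps, and the per-range counts should sum to the single-threaded count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeCount {

    // Counts rows in [start, stop). A null bound means "open-ended", similar
    // to the empty start key of the first region and empty end key of the last.
    static long countRange(NavigableMap<String, String> table, String start, String stop) {
        NavigableMap<String, String> view = table;
        if (start != null) view = view.tailMap(start, true);   // start row inclusive
        if (stop != null) view = view.headMap(stop, false);    // stop row exclusive
        long rows = 0;
        for (String ignored : view.keySet()) rows++;           // "scan" the range
        return rows;
    }

    // Splits the key space at the boundaries and counts every range as its own task.
    // The table must not be modified while the tasks run.
    static long parallelCount(NavigableMap<String, String> table, List<String> boundaries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            String start = null;
            for (int i = 0; i <= boundaries.size(); i++) {
                final String from = start;
                final String to = i < boundaries.size() ? boundaries.get(i) : null;
                Callable<Long> task = () -> countRange(table, from, to);
                parts.add(pool.submit(task));
                start = to;                                    // next range begins where this one ends
            }
            long total = 0;
            for (Future<Long> part : parts) total += part.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) table.put(String.format("row-%04d", i), "v");
        List<String> boundaries = Arrays.asList("row-0250", "row-0500", "row-0750");
        System.out.println("single: " + countRange(table, null, null));
        System.out.println("parallel: " + parallelCount(table, boundaries, 4));
    }
}
```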
>>>>> ________________________________
>>>>> From: Sam Seigal <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>> Subject: Using Scans in parallel
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Is there a known way to be able to do Scans in parallel (in different
>>>>> threads even) and then sort/combine the output?
>>>>> 
>>>>> For a row key like:
>>>>> 
>>>>> prefix-event_type-event_id
>>>>> prefix-event_type-event_id
>>>>> 
>>>>> I want to declare two scan objects (for say event_type foo)
>>>>> 
>>>>> Scan 1 =>  0-foo
>>>>> Scan 2 =>  1-foo
>>>>> 
>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>> then merge the results ?
>>>>> 
>>>>> Thank you,
>>>>> 
>>>>> Sam
>>>>> 
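
For the "sort/combine the output" part of the question, one possible approach is a k-way merge of the per-scan results, each of which is already sorted by row key. This is a sketch under assumptions: results are plain row-key strings, the leading prefix ends at the first '-', and the merged order should ignore that prefix. The names ScanMerge, mergeSorted, and withoutPrefix are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ScanMerge {

    // Merges lists that are each already sorted according to 'order'.
    static <T> List<T> mergeSorted(List<List<T>> parts, Comparator<? super T> order) {
        // Heap entries are {partIndex, position}, ordered by the element they point to.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> order.compare(parts.get(a[0]).get(a[1]), parts.get(b[0]).get(b[1])));
        for (int p = 0; p < parts.size(); p++) {
            if (!parts.get(p).isEmpty()) heap.add(new int[] {p, 0});
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<T> part = parts.get(head[0]);
            out.add(part.get(head[1]));
            // Advance within the same part, if it has more rows.
            if (head[1] + 1 < part.size()) heap.add(new int[] {head[0], head[1] + 1});
        }
        return out;
    }

    // Drops the leading prefix: "0-foo-0003" becomes "foo-0003".
    static String withoutPrefix(String rowKey) {
        return rowKey.substring(rowKey.indexOf('-') + 1);
    }

    public static void main(String[] args) {
        List<List<String>> scans = Arrays.asList(
            Arrays.asList("0-foo-0001", "0-foo-0004"),   // result of Scan 1
            Arrays.asList("1-foo-0002", "1-foo-0003"));  // result of Scan 2
        Comparator<String> byEvent = Comparator.comparing(ScanMerge::withoutPrefix);
        System.out.println(mergeSorted(scans, byEvent));
    }
}
```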
>>>> 
>>> 
>> 
>>
