Re: Using Scans in parallel

lars hofhansl Sun, 09 Oct 2011 13:13:44 -0700

Which version of HBase?
Are there concurrent inserts? If so, do you see splits in the log files 
happening while you do the scanning?



I am pretty sure this has nothing to do with concurrent scans.



________________________________
From: Bryan Keller <[email protected]>
To: Bryan Keller <[email protected]>
Cc: [email protected]
Sent: Sunday, October 9, 2011 11:03 AM
Subject: Re: Using Scans in parallel

On further thought, it seems this might be a serious issue, as two unrelated 
processes within an application may be scanning the same table at the same time.

On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:

> I was not able to get consistent results using multiple scanners in parallel 
> on a table. I implemented a counter test that used 8 scanners in parallel on 
> a table with 2m rows with 2k+ columns each, and the results were not 
> consistent. There were no errors thrown, but the count was off by as much as 
> 2%. Using a single thread gave the same (correct) result every run.
> 
> I tried various approaches, such as creating an HTable and opening a 
> connection per thread, but I was not able to get stable results. I would do 
> some testing before using parallel scanners as described here.
> 
> 
> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
> 
>> That's part of it, the other part is to get the region demarcations.
>> You can also just get the smallest and largest key of the table and pick 
>> other demarcations for your scans. Then your individual scans will likely 
>> cover multiple regions and regionservers.
>> 
>> 
>> Your threading model depends on your needs. If you are interested in lowest 
>> latency you want to keep your regionservers busy for each query.
>> What exactly that means depends on your setup. Maybe you split up the 
>> overall scan so that no more than N scans are active at any regionserver.
>> 
>> If you're more interested in overall predictability, you might not want to 
>> parallelize each scan too much.
>> 
>> 
>> 
>> ----- Original Message -----
>> From: Sam Seigal <[email protected]>
>> To: [email protected]; lars hofhansl <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Sent: Wednesday, October 5, 2011 6:18 PM
>> Subject: Re: Using Scans in parallel
>> 
>> So the whole point of getting the region locations is to ensure that
>> there is one thread per region server ?
>> 
>> 
>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
>>> Hi Sam,
>>> 
>>> 
>>> There were some attempts to build this in. In the end I think the exact 
>>> patterns are different based on what one is trying to achieve.
>>> Currently what you can do is get all the region locations 
>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>> get the regions' start and end keys.
>>> Now you can issue parallel scans for as many regions as you want (by creating 
>>> a Scan object with start and stop row set to the region's
>>> start and end key).
>>> You probably want to group the regions by regionserver and have one thread 
>>> per region server, or something.
>>> 
>>> 
>>> -- Lars
>>> ________________________________
>>> From: Sam Seigal <[email protected]>
>>> To: [email protected]
>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>> Subject: Using Scans in parallel
>>> 
>>> Hi ,
>>> 
>>> Is there a known way to do Scans in parallel (in different
>>> threads even) and then sort/combine the output ?
>>> 
>>> For a row key like:
>>> 
>>> prefix-event_type-event_id
>>> prefix-event_type-event_id
>>> 
>>> I want to declare two scan objects (for say event_type foo)
>>> 
>>> Scan 1 =>  0-foo
>>> Scan 2 =>  1-foo
>>> 
>>> execute the scans in parallel (maybe even in different threads) and
>>> then merge the results ?
>>> 
>>> Thank you,
>>> 
>>> Sam
>>> 
>> 
>
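
The approach discussed in this thread (pick demarcation keys, run one scan per
key range in its own thread, then combine the output) can be sketched roughly
as below. This is only an illustration under loud assumptions: it does not use
the HBase client at all. A read-only in-memory TreeMap stands in for the table,
and subMap(start, stop) stands in for a Scan with an inclusive start row and an
exclusive stop row. The class and method names (ParallelScanSketch, buildTable,
pickDemarcations, scanRange, parallelScan) are hypothetical. The point it shows
is that non-overlapping ranges scanned in parallel and concatenated in range
order should give exactly the rows, in the same order, that a single scan does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a TreeMap stands in for the HBase table; no HBase classes are used.
class ParallelScanSketch {

    // Builds a stand-in table with row keys such as 0-foo-00000000, 1-foo-00000000.
    static NavigableMap<String, String> buildTable(int prefixes, int rowsPerPrefix) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int p = 0; p < prefixes; p++) {
            for (int i = 0; i < rowsPerPrefix; i++) {
                table.put(String.format("%d-foo-%08d", p, i), "value");
            }
        }
        return table;
    }

    // Picks (parts - 1) existing keys, evenly spaced, as range boundaries.
    // Assumption: with a real table these would come from region start keys
    // or from keys chosen between the smallest and largest row key.
    static List<String> pickDemarcations(NavigableMap<String, String> table, int parts) {
        List<String> keys = new ArrayList<>(table.keySet());
        List<String> bounds = new ArrayList<>();
        if (keys.isEmpty()) {
            return bounds;
        }
        for (int i = 1; i < parts; i++) {
            bounds.add(keys.get((int) ((long) i * keys.size() / parts)));
        }
        return bounds;
    }

    // "Scans" the range [from, to); a null 'to' means scan to the end of the table.
    static List<String> scanRange(NavigableMap<String, String> table, String from, String to) {
        NavigableMap<String, String> range = (to == null)
                ? table.tailMap(from, true)
                : table.subMap(from, true, to, false);
        return new ArrayList<>(range.keySet());
    }

    // Runs one scan per range on a fixed thread pool and concatenates the
    // results in range order, so the merged list stays sorted by row key.
    static List<String> parallelScan(NavigableMap<String, String> table,
                                     List<String> bounds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            String start = ""; // the empty string sorts before every row key
            for (int i = 0; i <= bounds.size(); i++) {
                final String from = start;
                final String to = i < bounds.size() ? bounds.get(i) : null;
                Callable<List<String>> task = () -> scanRange(table, from, to);
                futures.add(pool.submit(task));
                start = to;
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A counter test like the one Bryan describes would compare the size (and
ideally the contents) of the parallelScan result against a single full scan.
In this stand-in nothing writes to the table during the scan; against a live
table, concurrent inserts or region splits could plausibly make the two
differ, which is why Lars asks about them.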
