Re: Using Scans in parallel

Himanshu Vashishtha Sun, 09 Oct 2011 15:06:44 -0700

I don't think a collision would go through without an exception in that
case. These scanner ids are generated from the Random instance in
HRegionServer. If two scanners drew the same scannerId, wouldn't one of
them get a LeaseStillHeldException in the addScanner method?


Himanshu
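
For a rough sense of the odds discussed in the quoted message below, here is
a minimal sketch of the birthday-bound approximation. It assumes the ids are
drawn independently and uniformly from a space of 2^bits values, which is an
idealization: java.util.Random has a 48-bit seed and its nextLong() cannot
return every possible long. The class and method names are made up for
illustration and are not part of HBase.

```java
public class ScannerIdCollision {

    /**
     * Approximate probability that at least two of n ids collide when each
     * id is drawn independently and uniformly from 2^bits values
     * (birthday bound: 1 - exp(-n(n-1) / (2 * 2^bits))).
     */
    public static double collisionProbability(long n, int bits) {
        double count = (double) n;
        double space = Math.pow(2.0, bits);
        // expm1 keeps precision when the exponent is very close to zero.
        return -Math.expm1(-count * (count - 1.0) / (2.0 * space));
    }

    public static void main(String[] args) {
        // Assumed numbers of scanners open at the same time on one server.
        for (long n : new long[] {8L, 1_000L, 1_000_000L}) {
            System.out.printf("%d scanners: 64-bit %.3e, 48-bit %.3e%n",
                    n, collisionProbability(n, 64), collisionProbability(n, 48));
        }
    }
}
```

Under these assumptions a collision among a handful of concurrent scanners
is far too rare to account for a count that is off by 2%.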

On Sun, Oct 9, 2011 at 3:53 PM, lars hofhansl <[email protected]> wrote:
> How frequently does this happen?
> I did notice a while ago in the code that scanner ids are drawn just from a 
> Random number generator.
>
> So in theory it would be possible that multiple concurrent scans draw the 
> same scanner id.
>
> Since these are longs, this is astronomically unlikely, though (picking the 
> same number out of 2^64 just does not happen :) ).
>
>
>
> ________________________________
> From: Bryan Keller <[email protected]>
> To: [email protected]
> Sent: Sunday, October 9, 2011 2:40 PM
> Subject: Re: Using Scans in parallel
>
> This is just scanning (reads). I'll need to do more testing to find a cause, 
> hopefully it is something with my test.
>
> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>
>> Which version of HBase?
>> Are there concurrent inserts? If so, do you see splits in the log files 
>> happening while you do the scanning?
>>
>> I am pretty sure this has nothing to do with concurrent scans.
>>
>> From: Bryan Keller <[email protected]>
>> To: Bryan Keller <[email protected]>
>> Cc: [email protected]
>> Sent: Sunday, October 9, 2011 11:03 AM
>> Subject: Re: Using Scans in parallel
>>
>> On further thought, it seems this might be a serious issue, as two unrelated 
>> processes within an application may be scanning the same table at the same 
>> time.
>>
>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>
>> > I was not able to get consistent results using multiple scanners in 
>> > parallel on a table. I implemented a counter test that used 8 scanners in 
>> > parallel on a table with 2m rows with 2k+ columns each, and the results 
>> > were not consistent. There were no errors thrown, but the count was off by 
>> > as much as 2%. Using a single thread gave the same (correct) result every 
>> > run.
>> >
>> > I tried various approaches, such as creating an HTable and opening a 
>> > connection per thread, but I was not able to get stable results. I would 
>> > do some testing before using parallel scanners as described here.
>> >
>> >
>> > On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>> >
>> >> That's part of it, the other part is to get the region demarcations.
>> >> You can also just get the smallest and largest key of the table and pick 
>> >> other demarcations for your scans. Then your individual scans will likely 
>> >> cover multiple regions and regionservers.
>> >>
>> >>
>> >> Your threading model depends on your needs. If you are interested in the 
>> >> lowest latency, you want to keep your regionservers busy for each query.
>> >> What exactly that means depends on your setup. Maybe you split up the 
>> >> overall scan so that no more than N scans are active at any regionserver.
>> >>
>> >> If you're more interested in overall predictability, you might not want 
>> >> to parallelize each scan too much.
>> >>
>> >>
>> >>
>> >> ----- Original Message -----
>> >> From: Sam Seigal <[email protected]>
>> >> To: [email protected]; lars hofhansl <[email protected]>
>> >> Cc: "[email protected]" <[email protected]>
>> >> Sent: Wednesday, October 5, 2011 6:18 PM
>> >> Subject: Re: Using Scans in parallel
>> >>
>> >> So the whole point of getting the region locations is to ensure that
>> >> there is one thread per region server ?
>> >>
>> >>
>> >> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
>> >>> Hi Sam,
>> >>>
>> >>>
>> >>> There were some attempts to build this in. In the end I think the exact 
>> >>> patterns are different based on what one is trying to achieve.
>> >>> Currently what you can do is get all the region locations 
>> >>> (HTable.getRegionLocations). From the HRegionInfos you can
>> >>> get the regions' start and end keys.
>> >>> Now you can issue parallel scans for as many regions as you want (by 
>> >>> creating a Scan object with start and stop row set to the region's
>> >>> start and end key).
>> >>> You probably want to group the regions by regionserver and have one 
>> >>> thread per region server, or something.
>> >>>
>> >>>
>> >>> -- Lars
>> >>> ________________________________
>> >>> From: Sam Seigal <[email protected]>
>> >>> To: [email protected]
>> >>> Sent: Wednesday, October 5, 2011 4:29 PM
>> >>> Subject: Using Scans in parallel
>> >>>
>> >>> Hi ,
>> >>>
>> >>> Is there a known way to be able to do Scans in parallel (in different
>> >>> threads even) and then sort/combine the output ?
>> >>>
>> >>> For a row key like:
>> >>>
>> >>> prefix-event_type-event_id
>> >>> prefix-event_type-event_id
>> >>>
>> >>> I want to declare two scan objects (for say event_id_type foo)
>> >>>
>> >>> Scan 1 =>  0-foo
>> >>> Scan 2 =>  1-foo
>> >>>
>> >>> execute the scans in parallel (maybe even in different threads) and
>> >>> then merge the results ?
>> >>>
>> >>> Thank you,
>> >>>
>> >>> Sam
>> >>>
>> >>
>> >
>>
>>
>>
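
The alternative mentioned in the thread (take the smallest and largest key
of the table and pick your own demarcations) could be sketched roughly as
follows. This uses plain Java only; the class and method names are
hypothetical, and treating row keys as unsigned big-endian numbers padded
with zero bytes on the right is an assumption that only gives even ranges
when keys are spread fairly uniformly.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScanRangeSplitter {

    /**
     * Returns pieces + 1 boundary keys running from firstKey to lastKey.
     * Each adjacent pair (b[i], b[i+1]) is a candidate start/stop row for
     * one scan. Boundaries can repeat when the key span is smaller than
     * the number of pieces.
     */
    public static List<byte[]> boundaries(byte[] firstKey, byte[] lastKey, int pieces) {
        if (pieces < 1) {
            throw new IllegalArgumentException("pieces must be >= 1");
        }
        // Pad both keys to the same width so they compare as plain numbers.
        int width = Math.max(firstKey.length, lastKey.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(firstKey, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(lastKey, width));
        if (lo.compareTo(hi) > 0) {
            throw new IllegalArgumentException("firstKey sorts after lastKey");
        }
        BigInteger span = hi.subtract(lo);
        List<byte[]> result = new ArrayList<>();
        for (int i = 0; i <= pieces; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(pieces));
            result.add(toFixedWidth(lo.add(offset), width));
        }
        return result;
    }

    /** Converts a non-negative value to exactly width bytes, big-endian. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> b = boundaries(new byte[] {0x00}, new byte[] {0x10}, 4);
        for (int i = 0; i + 1 < b.size(); i++) {
            // With HBase, each pair would become the start and stop row of
            // one Scan, run on its own thread.
            System.out.println(Arrays.toString(b.get(i)) + " .. "
                    + Arrays.toString(b.get(i + 1)));
        }
    }
}
```

Since a Scan's stop row is exclusive, the last range needs a stop row just
past the largest key (or no stop row at all) so that the final row is not
dropped, and adjacent ranges must share their boundary exactly. Checking
both is a reasonable first step when parallel counts disagree with a
single-threaded count.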
