BTW, a MapReduce job can scan the table in 6m (both column families), 
including some processing. So that is the fastest approach.
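
For anyone wanting to check the parallel-scanner pattern from the quoted messages below, here is a minimal sketch of the idea: split the key space at a set of boundary keys, count each range in its own thread, and compare the sum against a serial count. It does not use the HBase client at all. A TreeMap stands in for the table, and the class name, method names, and split keys are all hypothetical. In a real client the boundaries would presumably come from the region start and end keys, and each task would run its own Scan with a start and stop row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: the NavigableMap stands in for an HBase table and the split
// keys stand in for region start keys. No HBase API is called here.
class ParallelRangeScan {

    // Turn sorted split keys into half-open [start, stop) ranges that cover
    // the whole key space. A null bound means "open-ended".
    static List<String[]> toRanges(List<String> sortedSplitKeys) {
        List<String[]> ranges = new ArrayList<>();
        String start = null;
        for (String split : sortedSplitKeys) {
            ranges.add(new String[] {start, split});
            start = split;
        }
        ranges.add(new String[] {start, null});
        return ranges;
    }

    // Count the rows in one range (start inclusive, stop exclusive).
    // This plays the role of a single scan bounded by a start and stop row.
    static long countRange(NavigableMap<String, String> table, String start, String stop) {
        NavigableMap<String, String> view = table;
        if (start != null) {
            view = view.tailMap(start, true);
        }
        if (stop != null) {
            view = view.headMap(stop, false);
        }
        return view.size();
    }

    // Run one counting task per range on a fixed-size pool and sum the results.
    // The table is only read, never modified, while the tasks are running.
    static long parallelCount(NavigableMap<String, String> table,
                              List<String> sortedSplitKeys, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String[] range : toRanges(sortedSplitKeys)) {
                Callable<Long> task = () -> countRange(table, range[0], range[1]);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the ranges are half-open and contiguous, every key falls in exactly one range, so the parallel total should always equal the serial count. If a real parallel count drifts from the serial one, overlapping or missing range boundaries are one thing worth ruling out first.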

On Oct 9, 2011, at 8:03 PM, Bryan Keller wrote:

> Sure. 2 region servers with 5 disks each. Table has 2 column families and 113 
> regions total for 2m rows. I'm scanning just one of the families. Performance 
> with the 8 parallel scanners is 4x faster than the serial scanner (20m vs 80m 
> roughly).
> 
> On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha wrote:
> 
>> Interesting.
>> 
>> Hey Bryan, can you please share the stats about: how many Regions, how
>> many Region Servers, time taken by Serial scanner and with 8 parallel
>> scanners.
>> 
>> Himanshu
>> 
>> On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <[email protected]> wrote:
>>> This is 100% reproducible for me, so I doubt it is related to random number 
>>> generation.
>>> 
>>> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>>> 
>>>> How frequently does this happen?
>>>> I did notice a while ago in the code that scanner ids are drawn just from 
>>>> a Random number generator.
>>>> 
>>>> So in theory it would be possible that multiple concurrent scans draw the 
>>>> same scanner id.
>>>> 
>>>> Since these are longs, this is astronomically unlikely, though (picking 
>>>> the same number out of 2^64 just does not happen :) ).
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <[email protected]>
>>>> To: [email protected]
>>>> Sent: Sunday, October 9, 2011 2:40 PM
>>>> Subject: Re: Using Scans in parallel
>>>> 
>>>> This is just scanning (reads). I'll need to do more testing to find a 
>>>> cause, hopefully it is something with my test.
>>>> 
>>>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>>> 
>>>>> Which version of HBase?
>>>>> Are there concurrent inserts? If so, do you see splits in the log files 
>>>>> happening while you do the scanning?
>>>>> 
>>>>> I am pretty sure this has nothing to do with concurrent scans.
>>>>> 
>>>>> From: Bryan Keller <[email protected]>
>>>>> To: Bryan Keller <[email protected]>
>>>>> Cc: [email protected]
>>>>> Sent: Sunday, October 9, 2011 11:03 AM
>>>>> Subject: Re: Using Scans in parallel
>>>>> 
>>>>> On further thought, it seems this might be a serious issue, as two 
>>>>> unrelated processes within an application may be scanning the same table 
>>>>> at the same time.
>>>>> 
>>>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>>> 
>>>>>> I was not able to get consistent results using multiple scanners in 
>>>>>> parallel on a table. I implemented a counter test that used 8 scanners 
>>>>>> in parallel on a table with 2m rows with 2k+ columns each, and the 
>>>>>> results were not consistent. There were no errors thrown, but the count 
>>>>>> was off by as much as 2%. Using a single thread gave the same (correct) 
>>>>>> result every run.
>>>>>> 
>>>>>> I tried various approaches, such as creating an HTable and opening a 
>>>>>> connection per thread, but I was not able to get stable results. I would 
>>>>>> do some testing before using parallel scanners as described here.
>>>>>> 
>>>>>> 
>>>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>>> 
>>>>>>> That's part of it, the other part is to get the region demarcations.
>>>>>>> You can also just get the smallest and largest key of the table and 
>>>>>>> pick other demarcations for your scans. Then your individual scans will 
>>>>>>> likely cover multiple regions and regionservers.
>>>>>>> 
>>>>>>> 
>>>>>>> Your threading model depends on your needs. If you are interested in 
>>>>>>> the lowest latency, you want to keep your regionservers busy for each 
>>>>>>> query.
>>>>>>> What exactly that means depends on your setup. Maybe you split up the 
>>>>>>> overall scan so that no more than N scans are active at any 
>>>>>>> regionserver.
>>>>>>> 
>>>>>>> If you're more interested in overall predictability, you might not want 
>>>>>>> to parallelize each scan too much.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>>>> Subject: Re: Using Scans in parallel
>>>>>>> 
>>>>>>> So the whole point of getting the region locations is to ensure that
>>>>>>> there is one thread per region server ?
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> 
>>>>>>> wrote:
>>>>>>>> Hi Sam,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> There were some attempts to build this in. In the end I think the 
>>>>>>>> exact patterns are different based on what one is trying to achieve.
>>>>>>>> Currently what you can do is get all the region locations 
>>>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>>>> get the regions' start and end keys.
>>>>>>>> Now you can issue parallel scans for as many regions as you want (by 
>>>>>>>> creating a Scan object with start and stop row set to the region's
>>>>>>>> start and end key).
>>>>>>>> You probably want to group the regions by regionserver and have one 
>>>>>>>> thread per region server, or something.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> ________________________________
>>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>>>> Subject: Using Scans in parallel
>>>>>>>> 
>>>>>>>> Hi ,
>>>>>>>> 
>>>>>>>> Is there a known way to do Scans in parallel (in different
>>>>>>>> threads even) and then sort/combine the output ?
>>>>>>>> 
>>>>>>>> For a row key like:
>>>>>>>> 
>>>>>>>> prefix-event_type-event_id
>>>>>>>> prefix-event_type-event_id
>>>>>>>> 
>>>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>>> 
>>>>>>>> Scan 1 =>  0-foo
>>>>>>>> Scan 2 =>  1-foo
>>>>>>>> 
>>>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>>>> then merge the results ?
>>>>>>>> 
>>>>>>>> Thank you,
>>>>>>>> 
>>>>>>>> Sam
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
> 
