Sure. 2 region servers with 5 disks each. The table has 2 column families and 113 regions total for 2 million rows. I'm scanning just one of the families. With the 8 parallel scanners the scan is about 4x faster than with the serial scanner (roughly 20 minutes vs. 80 minutes).

On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha wrote:

> Interesting.
>
> Hey Bryan, can you please share the stats about: how many Regions, how
> many Region Servers, time taken by Serial scanner and with 8 parallel
> scanners.
>
> Himanshu
>
> On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <[email protected]> wrote:
>> This is 100% reproducible for me, so I doubt it is related to random number
>> generation.
>>
>> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>>
>>> How frequently does this happen?
>>> I did notice a while ago in the code that scanner ids are drawn just from a
>>> Random number generator.
>>>
>>> So in theory it would be possible that multiple concurrent scans draw the
>>> same scanner id.
>>>
>>> Since these are longs, this is astronomically unlikely, though (picking the
>>> same number of 2^64, just does not happen :) ).
>>>
>>> ________________________________
>>> From: Bryan Keller <[email protected]>
>>> To: [email protected]
>>> Sent: Sunday, October 9, 2011 2:40 PM
>>> Subject: Re: Using Scans in parallel
>>>
>>> This is just scanning (reads). I'll need to do more testing to find a
>>> cause, hopefully it is something with my test.
>>>
>>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>>
>>>> Which version of HBase?
>>>> Are there concurrent inserts? If so, do you see splits in the log files
>>>> happening while you do the scanning?
>>>>
>>>> I am pretty sure this has nothing to do with concurrent scans.
>>>>
>>>> From: Bryan Keller <[email protected]>
>>>> To: Bryan Keller <[email protected]>
>>>> Cc: [email protected]
>>>> Sent: Sunday, October 9, 2011 11:03 AM
>>>> Subject: Re: Using Scans in parallel
>>>>
>>>> On further thought, it seems this might be a serious issue, as two
>>>> unrelated processes within an application may be scanning the same table
>>>> at the same time.
>>>>
>>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>>
>>>>> I was not able to get consistent results using multiple scanners in
>>>>> parallel on a table. I implemented a counter test that used 8 scanners in
>>>>> parallel on a table with 2m rows with 2k+ columns each, and the results
>>>>> were not consistent. There were no errors thrown, but the count was off
>>>>> by as much as 2%. Using a single thread gave the same (correct) result
>>>>> every run.
>>>>>
>>>>> I tried various approaches, such as creating an HTable and opening a
>>>>> connection per thread, but I was not able to get stable results. I would
>>>>> do some testing before using parallel scanners as described here.
>>>>>
>>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>>
>>>>>> That's part of it, the other part is to get the region demarcations.
>>>>>> You can also just get the smallest and largest key of the table and pick
>>>>>> other demarcations for your scans. Then your individual scans will
>>>>>> likely cover multiple regions and regionservers.
>>>>>>
>>>>>> Your threading model depends on your needs. If you interested in lowest
>>>>>> latency you want to keep your regionservers busy for each query.
>>>>>> What exactly that means depends on your setup. Maybe you split up the
>>>>>> overall scan so that no more than N scans are active at any regionserver.
>>>>>>
>>>>>> If you're more interested in overall predictability, you might not want
>>>>>> parallelize each scan too much.
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: Sam Seigal <[email protected]>
>>>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>>> Subject: Re: Using Scans in parallel
>>>>>>
>>>>>> So the whole point of getting the region locations is to ensure that
>>>>>> there is one thread per region server ?
>>>>>>
>>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
>>>>>>> Hi Sam,
>>>>>>>
>>>>>>> There were some attempts to build this in. In the end I think the exact
>>>>>>> patterns are different based on what one is trying to achieve.
>>>>>>> Currently what you can do is getting all the region locations
>>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>>> get the regions start and end keys.
>>>>>>> Now you can issue parallel scan for as many regions as you want (by
>>>>>>> create a Scan object with start and row set to the region's
>>>>>>> start and end key).
>>>>>>> You probably want to group the regions by regionserver and have one
>>>>>>> thread per region server, or something.
>>>>>>>
>>>>>>> -- Lars
>>>>>>> ________________________________
>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>>> Subject: Using Scans in parallel
>>>>>>>
>>>>>>> Hi ,
>>>>>>>
>>>>>>> Is there a known way to be able to do Scan's in parallel (in different
>>>>>>> threads even) and then sort/combine the output ?
>>>>>>>
>>>>>>> For a row key like:
>>>>>>>
>>>>>>> prefix-event_type-event_id
>>>>>>> prefix-event_type-event_id
>>>>>>>
>>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>>
>>>>>>> Scan 1 => 0-foo
>>>>>>> Scan 2 => 1-foo
>>>>>>>
>>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>>> then merge the results ?
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Sam
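
For anyone reading the thread later, here is a minimal sketch of the approach Lars describes above: take the sorted region start keys, turn them into non-overlapping [start, stop) ranges, run one counting scan per range on a thread pool, and add up the partial counts. It deliberately does not call HBase. An in-memory TreeMap stands in for the table so the sketch runs on its own, and the names KeyRange, toRanges, countRange and parallelCount are made up for illustration. With a real client, each range would presumably become a Scan with that start and stop row, and the start keys would come from the region locations. It illustrates the range-splitting pattern only and does not reproduce or explain the inconsistent counts reported in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelScanSketch {

    /** Half-open key range [startKey, stopKey); a null stopKey means "to the end of the table". */
    static final class KeyRange {
        final String startKey;
        final String stopKey;

        KeyRange(String startKey, String stopKey) {
            this.startKey = startKey;
            this.stopKey = stopKey;
        }
    }

    /** Turns sorted region start keys into contiguous, non-overlapping ranges. */
    static List<KeyRange> toRanges(List<String> sortedStartKeys) {
        List<KeyRange> ranges = new ArrayList<>();
        for (int i = 0; i < sortedStartKeys.size(); i++) {
            String stop = (i + 1 < sortedStartKeys.size()) ? sortedStartKeys.get(i + 1) : null;
            ranges.add(new KeyRange(sortedStartKeys.get(i), stop));
        }
        return ranges;
    }

    /** Counts the rows in one range. Stand-in for iterating a scanner over that range. */
    static long countRange(NavigableMap<String, String> table, KeyRange range) {
        NavigableMap<String, String> slice = (range.stopKey == null)
                ? table.tailMap(range.startKey, true)
                : table.subMap(range.startKey, true, range.stopKey, false);
        return slice.size();
    }

    /** Runs one counting task per range on a fixed-size pool and sums the partial counts. */
    static long parallelCount(NavigableMap<String, String> table, List<KeyRange> ranges, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (KeyRange range : ranges) {
                Callable<Long> task = () -> countRange(table, range);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // blocks until that range has been counted
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake table: 1000 rows with zero-padded keys so they sort lexicographically.
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row%04d", i), "value");
        }
        // Hypothetical region start keys; the first region starts at the empty key.
        List<String> startKeys = Arrays.asList("", "row0250", "row0500", "row0750");
        long parallel = parallelCount(table, toRanges(startKeys), 4);
        System.out.println("serial=" + table.size() + " parallel=" + parallel);
    }
}
```

Because the ranges are half-open and each stop key is the next start key, every row falls into exactly one range, so the parallel total should match a serial count as long as the table is not changing underneath the scans. Comparing the two totals, as main does, is the same kind of check as the counter test described earlier in the thread.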
