BTW, a map reduce job can scan the table in 6m (both column families), including some processing. So that is the fastest approach.
On Oct 9, 2011, at 8:03 PM, Bryan Keller wrote:

> Sure. 2 region servers with 5 disks each. Table has 2 column families and 113
> regions total for 2m rows. I'm scanning just one of the families. Performance
> with the 8 parallel scanners is 4x faster than the serial scanner (20m vs 80m
> roughly).
>
> On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha wrote:
>
>> Interesting.
>>
>> Hey Bryan, can you please share the stats about: how many Regions, how
>> many Region Servers, time taken by Serial scanner and with 8 parallel
>> scanners.
>>
>> Himanshu
>>
>> On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <[email protected]> wrote:
>>> This is 100% reproducible for me, so I doubt it is related to random number
>>> generation.
>>>
>>> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>>>
>>>> How frequently does this happen?
>>>> I did notice a while ago in the code that scanner ids are drawn just from
>>>> a Random number generator.
>>>>
>>>> So in theory it would be possible that multiple concurrent scans draw the
>>>> same scanner id.
>>>>
>>>> Since these are longs, this is astronomically unlikely, though (picking
>>>> the same number of 2^64, just does not happen :) ).
>>>>
>>>> ________________________________
>>>> From: Bryan Keller <[email protected]>
>>>> To: [email protected]
>>>> Sent: Sunday, October 9, 2011 2:40 PM
>>>> Subject: Re: Using Scans in parallel
>>>>
>>>> This is just scanning (reads). I'll need to do more testing to find a
>>>> cause, hopefully it is something with my test.
>>>>
>>>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>>>
>>>>> Which version of HBase?
>>>>> Are there concurrent inserts? If so, do you see splits in the log files
>>>>> happening while you do the scanning?
>>>>>
>>>>> I am pretty sure this has nothing to do with concurrent scans.
>>>>>
>>>>> From: Bryan Keller <[email protected]>
>>>>> To: Bryan Keller <[email protected]>
>>>>> Cc: [email protected]
>>>>> Sent: Sunday, October 9, 2011 11:03 AM
>>>>> Subject: Re: Using Scans in parallel
>>>>>
>>>>> On further thought, it seems this might be a serious issue, as two
>>>>> unrelated processes within an application may be scanning the same table
>>>>> at the same time.
>>>>>
>>>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>>>
>>>>>> I was not able to get consistent results using multiple scanners in
>>>>>> parallel on a table. I implemented a counter test that used 8 scanners
>>>>>> in parallel on a table with 2m rows with 2k+ columns each, and the
>>>>>> results were not consistent. There were no errors thrown, but the count
>>>>>> was off by as much as 2%. Using a single thread gave the same (correct)
>>>>>> result every run.
>>>>>>
>>>>>> I tried various approaches, such as creating an HTable and opening a
>>>>>> connection per thread, but I was not able to get stable results. I would
>>>>>> do some testing before using parallel scanners as described here.
>>>>>>
>>>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>>>
>>>>>>> That's part of it, the other part is to get the region demarcations.
>>>>>>> You can also just get the smallest and largest key of the table and
>>>>>>> pick other demarcations for your scans. Then your individual scans will
>>>>>>> likely cover multiple regions and regionservers.
>>>>>>>
>>>>>>> Your threading model depends on your needs. If you're interested in the
>>>>>>> lowest latency, you want to keep your regionservers busy for each query.
>>>>>>> What exactly that means depends on your setup. Maybe you split up the
>>>>>>> overall scan so that no more than N scans are active at any
>>>>>>> regionserver.
>>>>>>>
>>>>>>> If you're more interested in overall predictability, you might not want
>>>>>>> to parallelize each scan too much.
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>>>> Subject: Re: Using Scans in parallel
>>>>>>>
>>>>>>> So the whole point of getting the region locations is to ensure that
>>>>>>> there is one thread per region server ?
>>>>>>>
>>>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]>
>>>>>>> wrote:
>>>>>>>> Hi Sam,
>>>>>>>>
>>>>>>>> There were some attempts to build this in. In the end I think the
>>>>>>>> exact patterns are different based on what one is trying to achieve.
>>>>>>>> Currently what you can do is get all the region locations
>>>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>>>> get the regions' start and end keys.
>>>>>>>> Now you can issue parallel scans for as many regions as you want (by
>>>>>>>> creating a Scan object with start and stop row set to the region's
>>>>>>>> start and end key).
>>>>>>>> You probably want to group the regions by regionserver and have one
>>>>>>>> thread per region server, or something.
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>> ________________________________
>>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>>>> Subject: Using Scans in parallel
>>>>>>>>
>>>>>>>> Hi ,
>>>>>>>>
>>>>>>>> Is there a known way to be able to do Scan's in parallel (in different
>>>>>>>> threads even) and then sort/combine the output ?
>>>>>>>>
>>>>>>>> For a row key like:
>>>>>>>>
>>>>>>>> prefix-event_type-event_id
>>>>>>>> prefix-event_type-event_id
>>>>>>>>
>>>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>>>
>>>>>>>> Scan 1 => 0-foo
>>>>>>>> Scan 2 => 1-foo
>>>>>>>>
>>>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>>>> then merge the results ?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Sam
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>
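For anyone wanting to reproduce the serial vs. parallel count comparison from this thread, here is a minimal sketch of the range-partitioned pattern Lars describes: pick demarcation keys, run one scan per key range on a thread pool, then combine the per-range results. It deliberately does not call HBase. A TreeMap stands in for the sorted table, and the names ParallelScanSketch, KeyRange, toRanges, countRange and parallelCount are all hypothetical. In a real client each task would presumably open its own scanner with the range's start and stop row, as described above; that part is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelScanSketch {

    /** Hypothetical key range: start is inclusive, end is exclusive (null = end of table). */
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Turns sorted split keys into contiguous, non-overlapping ranges covering the whole key space. */
    static List<KeyRange> toRanges(List<String> sortedSplitKeys) {
        List<KeyRange> ranges = new ArrayList<>();
        String start = ""; // the empty string sorts before every other key
        for (String split : sortedSplitKeys) {
            ranges.add(new KeyRange(start, split));
            start = split;
        }
        ranges.add(new KeyRange(start, null));
        return ranges;
    }

    /** Simulated scan of one range: walks the keys with start <= key < end and counts them. */
    static long countRange(NavigableMap<String, String> table, KeyRange range) {
        NavigableMap<String, String> slice = (range.end == null)
                ? table.tailMap(range.start, true)
                : table.subMap(range.start, true, range.end, false);
        long count = 0;
        for (String ignored : slice.keySet()) {
            count++;
        }
        return count;
    }

    /** Runs one counting task per range on a fixed-size pool and sums the partial counts. */
    static long parallelCount(NavigableMap<String, String> table, List<KeyRange> ranges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (KeyRange range : ranges) {
                // The map is only read here, never modified, so sharing it across threads is safe.
                tasks.add(() -> countRange(table, range));
            }
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(tasks)) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a range scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Row keys follow the prefix-event_type-event_id shape from the original question.
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 10_000; i++) {
            table.put(String.format("%d-foo-%05d", i % 2, i), "value");
        }
        List<KeyRange> ranges = toRanges(Arrays.asList("0-foo-05000", "1", "1-foo-05000"));
        long serial = countRange(table, new KeyRange("", null));
        long parallel = parallelCount(table, ranges, 8);
        System.out.println("serial=" + serial + " parallel=" + parallel);
    }
}
```

Because the ranges are contiguous and half-open, every key falls into exactly one range, so the parallel total should always equal the serial count in this simulation. If a real cluster shows a mismatch like the one reported above, checking that the demarcations neither overlap nor leave gaps seems like a reasonable first step.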
