To follow up: the problem I was having with parallel scanners appears to be an issue with my app. I wasn't able to reproduce it in a more controlled test.
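
For anyone who wants to sanity-check the pattern discussed in this thread (cut the key space into contiguous ranges, scan each range in its own thread, then combine the results), here is a minimal sketch of a counter test. It is only an illustration under loud assumptions: a `TreeMap` stands in for the table, so none of this is the HBase client API, and the names `ParallelScanSketch`, `buildTable`, `splitKeys`, `range`, `serialCount` and `parallelCount` are all made up for the example. Each range is start-inclusive and stop-exclusive, so no row is counted twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScanSketch {

    // Stand-in for a table: sorted row keys -> value. This is NOT HBase.
    public static NavigableMap<String, Integer> buildTable(int rows) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            table.put(String.format("row-%07d", i), i);
        }
        return table;
    }

    // Pick (parts - 1) split keys that cut the key space into contiguous ranges.
    public static List<String> splitKeys(NavigableMap<String, Integer> table, int parts) {
        List<String> keys = new ArrayList<>(table.navigableKeySet());
        List<String> splits = new ArrayList<>();
        if (keys.isEmpty()) {
            return splits;
        }
        for (int p = 1; p < parts; p++) {
            splits.add(keys.get((int) ((long) keys.size() * p / parts)));
        }
        return splits;
    }

    // View of [start, stop); a null bound means "open ended" on that side.
    static NavigableMap<String, Integer> range(
            NavigableMap<String, Integer> table, String start, String stop) {
        if (start == null && stop == null) return table;
        if (start == null) return table.headMap(stop, false);
        if (stop == null) return table.tailMap(start, true);
        return table.subMap(start, true, stop, false);
    }

    // Baseline: one thread walks every row.
    public static long serialCount(NavigableMap<String, Integer> table) {
        long n = 0;
        for (String ignored : table.keySet()) n++;
        return n;
    }

    // One task per key range; the per-range counts are summed at the end.
    // The table is only read here, never modified, while the tasks run.
    public static long parallelCount(NavigableMap<String, Integer> table, int threads)
            throws Exception {
        List<String> splits = splitKeys(table, threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i <= splits.size(); i++) {
                String start = (i == 0) ? null : splits.get(i - 1);
                String stop = (i == splits.size()) ? null : splits.get(i);
                futures.add(pool.submit(() -> {
                    long n = 0;
                    for (String ignored : range(table, start, stop).keySet()) n++;
                    return n;
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) total += f.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, Integer> table = buildTable(100_000);
        System.out.println("serial:   " + serialCount(table));
        System.out.println("parallel: " + parallelCount(table, 8));
    }
}
```

If the two counts ever differ in a harness like this, the usual suspects are overlapping or missing range boundaries, or state shared between threads in the calling code, rather than the scans themselves.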

On Oct 9, 2011, at 8:21 PM, Bryan Keller wrote:

> BTW, a map reduce job can scan the table in 6 minutes (both column
> families), including some processing. So that is the fastest approach.
>
> On Oct 9, 2011, at 8:03 PM, Bryan Keller wrote:
>
>> Sure. 2 region servers with 5 disks each. Table has 2 column families and
>> 113 regions total for 2 million rows. I'm scanning just one of the
>> families. Performance with the 8 parallel scanners is 4x faster than the
>> serial scanner (roughly 20 minutes vs. 80 minutes).
>>
>> On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha wrote:
>>
>>> Interesting.
>>>
>>> Hey Bryan, can you please share the stats about: how many Regions, how
>>> many Region Servers, time taken by Serial scanner and with 8 parallel
>>> scanners.
>>>
>>> Himanshu
>>>
>>> On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <[email protected]> wrote:
>>>> This is 100% reproducible for me, so I doubt it is related to random
>>>> number generation.
>>>>
>>>> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>>>>
>>>>> How frequently does this happen?
>>>>> I did notice a while ago in the code that scanner ids are drawn just
>>>>> from a Random number generator.
>>>>>
>>>>> So in theory it would be possible that multiple concurrent scans draw
>>>>> the same scanner id.
>>>>>
>>>>> Since these are longs, this is astronomically unlikely, though (picking
>>>>> the same number out of 2^64 just does not happen :) ).
>>>>>
>>>>> ________________________________
>>>>> From: Bryan Keller <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Sunday, October 9, 2011 2:40 PM
>>>>> Subject: Re: Using Scans in parallel
>>>>>
>>>>> This is just scanning (reads). I'll need to do more testing to find a
>>>>> cause; hopefully it is something with my test.
>>>>>
>>>>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>>>>
>>>>>> Which version of HBase?
>>>>>> Are there concurrent inserts? If so, do you see splits in the log
>>>>>> files happening while you do the scanning?
>>>>>>
>>>>>> I am pretty sure this has nothing to do with concurrent scans.
>>>>>>
>>>>>> From: Bryan Keller <[email protected]>
>>>>>> To: Bryan Keller <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Sent: Sunday, October 9, 2011 11:03 AM
>>>>>> Subject: Re: Using Scans in parallel
>>>>>>
>>>>>> On further thought, it seems this might be a serious issue, as two
>>>>>> unrelated processes within an application may be scanning the same
>>>>>> table at the same time.
>>>>>>
>>>>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>>>>
>>>>>>> I was not able to get consistent results using multiple scanners in
>>>>>>> parallel on a table. I implemented a counter test that used 8
>>>>>>> scanners in parallel on a table with 2 million rows with 2k+ columns
>>>>>>> each, and the results were not consistent. There were no errors
>>>>>>> thrown, but the count was off by as much as 2%. Using a single
>>>>>>> thread gave the same (correct) result every run.
>>>>>>>
>>>>>>> I tried various approaches, such as creating an HTable and opening a
>>>>>>> connection per thread, but I was not able to get stable results. I
>>>>>>> would do some testing before using parallel scanners as described
>>>>>>> here.
>>>>>>>
>>>>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>>>>
>>>>>>>> That's part of it; the other part is to get the region demarcations.
>>>>>>>> You can also just get the smallest and largest key of the table and
>>>>>>>> pick other demarcations for your scans. Then your individual scans
>>>>>>>> will likely cover multiple regions and regionservers.
>>>>>>>>
>>>>>>>> Your threading model depends on your needs. If you are interested in
>>>>>>>> the lowest latency, you want to keep your regionservers busy for
>>>>>>>> each query. What exactly that means depends on your setup. Maybe you
>>>>>>>> split up the overall scan so that no more than N scans are active at
>>>>>>>> any regionserver.
>>>>>>>>
>>>>>>>> If you're more interested in overall predictability, you might not
>>>>>>>> want to parallelize each scan too much.
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>>> To: [email protected]; lars hofhansl <[email protected]>
>>>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>>>>> Subject: Re: Using Scans in parallel
>>>>>>>>
>>>>>>>> So the whole point of getting the region locations is to ensure that
>>>>>>>> there is one thread per region server?
>>>>>>>>
>>>>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
>>>>>>>>> Hi Sam,
>>>>>>>>>
>>>>>>>>> There were some attempts to build this in. In the end I think the
>>>>>>>>> exact patterns are different based on what one is trying to achieve.
>>>>>>>>> Currently what you can do is get all the region locations
>>>>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>>>>> get the regions' start and end keys.
>>>>>>>>> Now you can issue parallel scans for as many regions as you want (by
>>>>>>>>> creating a Scan object with start and stop row set to the region's
>>>>>>>>> start and end key).
>>>>>>>>> You probably want to group the regions by regionserver and have one
>>>>>>>>> thread per region server, or something.
>>>>>>>>>
>>>>>>>>> -- Lars
>>>>>>>>> ________________________________
>>>>>>>>> From: Sam Seigal <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>>>>> Subject: Using Scans in parallel
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is there a known way to do Scans in parallel (in different
>>>>>>>>> threads even) and then sort/combine the output?
>>>>>>>>>
>>>>>>>>> For a row key like:
>>>>>>>>>
>>>>>>>>> prefix-event_type-event_id
>>>>>>>>> prefix-event_type-event_id
>>>>>>>>>
>>>>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>>>>
>>>>>>>>> Scan 1 => 0-foo
>>>>>>>>> Scan 2 => 1-foo
>>>>>>>>>
>>>>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>>>>> then merge the results?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>>
>>>>>>>>> Sam
