Re: Using Scans in parallel

Bryan Keller Sun, 09 Oct 2011 14:40:39 -0700

This is just scanning (reads). I'll need to do more testing to find a cause; 
hopefully it is something with my test.


On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:

> Which version of HBase?
> Are there concurrent inserts? If so, do you see splits in the log files 
> happening while you do the scanning?
> 
> I am pretty sure this has nothing to do with concurrent scans.
> 
> From: Bryan Keller <[email protected]>
> To: Bryan Keller <[email protected]>
> Cc: [email protected]
> Sent: Sunday, October 9, 2011 11:03 AM
> Subject: Re: Using Scans in parallel
> 
> On further thought, it seems this might be a serious issue, as two unrelated 
> processes within an application may be scanning the same table at the same 
> time.
> 
> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
> 
> > I was not able to get consistent results using multiple scanners in 
> > parallel on a table. I implemented a counter test that used 8 scanners in 
> > parallel on a table with 2m rows with 2k+ columns each, and the results 
> > were not consistent. There were no errors thrown, but the count was off by 
> > as much as 2%. Using a single thread gave the same (correct) result every 
> > run.
> > 
> > I tried various approaches, such as creating an HTable and opening a 
> > connection per thread, but I was not able to get stable results. I would do 
> > some testing before using parallel scanners as described here.
> > 
> > 
> > On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
> > 
> >> That's part of it, the other part is to get the region demarcations.
> >> You can also just get the smallest and largest key of the table and pick 
> >> other demarcations for your scans. Then your individual scans will likely 
> >> cover multiple regions and regionservers.
> >> 
> >> 
> >> Your threading model depends on your needs. If you are interested in lowest 
> >> latency you want to keep your regionservers busy for each query.
> >> What exactly that means depends on your setup. Maybe you split up the 
> >> overall scan so that no more than N scans are active at any regionserver.
> >> 
> >> If you're more interested in overall predictability, you might not want 
> >> to parallelize each scan too much.
> >> 
> >> 
> >> 
> >> ----- Original Message -----
> >> From: Sam Seigal <[email protected]>
> >> To: [email protected]; lars hofhansl <[email protected]>
> >> Cc: "[email protected]" <[email protected]>
> >> Sent: Wednesday, October 5, 2011 6:18 PM
> >> Subject: Re: Using Scans in parallel
> >> 
> >> So the whole point of getting the region locations is to ensure that
> >> there is one thread per region server ?
> >> 
> >> 
> >> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
> >>> Hi Sam,
> >>> 
> >>> 
> >>> There were some attempts to build this in. In the end I think the exact 
> >>> patterns are different based on what one is trying to achieve.
> >>> Currently what you can do is get all the region locations 
> >>> (HTable.getRegionLocations). From the HRegionInfos you can
> >>> get the regions' start and end keys.
> >>> Now you can issue parallel scans for as many regions as you want (by 
> >>> creating a Scan object with start and stop row set to the region's
> >>> start and end key).
> >>> You probably want to group the regions by regionserver and have one 
> >>> thread per regionserver, or something.
> >>> 
> >>> 
> >>> -- Lars
> >>> ________________________________
> >>> From: Sam Seigal <[email protected]>
> >>> To: [email protected]
> >>> Sent: Wednesday, October 5, 2011 4:29 PM
> >>> Subject: Using Scans in parallel
> >>> 
> >>> Hi,
> >>> 
> >>> Is there a known way to do Scans in parallel (in different
> >>> threads even) and then sort/combine the output?
> >>> 
> >>> For a row key like:
> >>> 
> >>> prefix-event_type-event_id
> >>> prefix-event_type-event_id
> >>> 
> >>> I want to declare two scan objects (for say event_type foo)
> >>> 
> >>> Scan 1 =>  0-foo
> >>> Scan 2 =>  1-foo
> >>> 
> >>> execute the scans in parallel (maybe even in different threads) and
> >>> then merge the results ?
> >>> 
> >>> Thank you,
> >>> 
> >>> Sam
> >>> 
> >> 
> > 
> 
> 
>
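
Editorial sketch of the pattern discussed in the thread above: pick
demarcation keys, run one scan per key range in its own thread, and combine
the results. This is a minimal illustration in plain Java, not HBase client
code and not anyone's implementation from the thread. A sorted in-memory
NavigableMap stands in for the table, and the names ParallelRangeScan,
scanInParallel and range are hypothetical. It assumes the split keys are
given in ascending order. Each range is half-open (start inclusive, stop
exclusive), which is the same convention HBase's Scan uses for its start
and stop rows, so adjacent ranges neither overlap nor leave gaps. It does
not attempt to reproduce or explain the inconsistent counts reported above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRangeScan {

    // Returns the view of the table covering [start, stop).
    // A null start means "from the first key"; a null stop means "to the end".
    static NavigableMap<String, String> range(NavigableMap<String, String> table,
                                              String start, String stop) {
        if (start == null && stop == null) return table;
        if (start == null) return table.headMap(stop, false);
        if (stop == null) return table.tailMap(start, true);
        return table.subMap(start, true, stop, false);
    }

    // Scans the ranges delimited by the (ascending) split keys in parallel
    // and returns all row keys. The table is only read, never modified.
    static List<String> scanInParallel(NavigableMap<String, String> table,
                                       List<String> splits,
                                       int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> parts = new ArrayList<>();
            // n split keys produce n + 1 ranges.
            for (int i = 0; i <= splits.size(); i++) {
                final String start = (i == 0) ? null : splits.get(i - 1);
                final String stop = (i == splits.size()) ? null : splits.get(i);
                Callable<List<String>> task =
                    () -> new ArrayList<>(range(table, start, stop).keySet());
                parts.add(pool.submit(task));
            }
            // The ranges are disjoint and were submitted in key order, so
            // appending the partial results in that order keeps the combined
            // output sorted without a separate merge step.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> part : parts) {
                merged.addAll(part.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

With split keys such as "row-0250", "row-0500" and "row-0750", the call
scanInParallel(table, splits, 4) runs four range scans. Comparing the size
of the returned list against a single-threaded pass over the same table is
the same kind of consistency check as the counter test described above.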
