This is just scanning (reads). I'll need to do more testing to find the cause; hopefully it is something in my test.

On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:

> Which version of HBase?
> Are there concurrent inserts? If so, do you see splits in the log files
> happening while you do the scanning?
>
> I am pretty sure this has nothing to do with concurrent scans.
>
> From: Bryan Keller <[email protected]>
> To: Bryan Keller <[email protected]>
> Cc: [email protected]
> Sent: Sunday, October 9, 2011 11:03 AM
> Subject: Re: Using Scans in parallel
>
> On further thought, it seems this might be a serious issue, as two unrelated
> processes within an application may be scanning the same table at the same
> time.
>
> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>
> > I was not able to get consistent results using multiple scanners in
> > parallel on a table. I implemented a counter test that used 8 scanners in
> > parallel on a table with 2m rows with 2k+ columns each, and the results
> > were not consistent. There were no errors thrown, but the count was off by
> > as much as 2%. Using a single thread gave the same (correct) result every
> > run.
> >
> > I tried various approaches, such as creating an HTable and opening a
> > connection per thread, but I was not able to get stable results. I would do
> > some testing before using parallel scanners as described here.
> >
> >
> > On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
> >
> >> That's part of it, the other part is to get the region demarcations.
> >> You can also just get the smallest and largest key of the table and pick
> >> other demarcations for your scans. Then your individual scans will likely
> >> cover multiple regions and regionservers.
> >>
> >>
> >> Your threading model depends on your needs. If you are interested in lowest
> >> latency you want to keep your regionservers busy for each query.
> >> What exactly that means depends on your setup. Maybe you split up the
> >> overall scan so that no more than N scans are active at any regionserver.
> >>
> >> If you're more interested in overall predictability, you might not want
> >> to parallelize each scan too much.
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Sam Seigal <[email protected]>
> >> To: [email protected]; lars hofhansl <[email protected]>
> >> Cc: "[email protected]" <[email protected]>
> >> Sent: Wednesday, October 5, 2011 6:18 PM
> >> Subject: Re: Using Scans in parallel
> >>
> >> So the whole point of getting the region locations is to ensure that
> >> there is one thread per region server?
> >>
> >>
> >> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[email protected]> wrote:
> >>> Hi Sam,
> >>>
> >>>
> >>> There were some attempts to build this in. In the end I think the exact
> >>> patterns are different based on what one is trying to achieve.
> >>> Currently what you can do is get all the region locations
> >>> (HTable.getRegionLocations). From the HRegionInfos you can
> >>> get the region's start and end keys.
> >>> Now you can issue parallel scans for as many regions as you want (by
> >>> creating a Scan object with start and stop row set to the region's
> >>> start and end key).
> >>> You probably want to group the regions by regionserver and have one
> >>> thread per region server, or something.
> >>>
> >>>
> >>> -- Lars
> >>> ________________________________
> >>> From: Sam Seigal <[email protected]>
> >>> To: [email protected]
> >>> Sent: Wednesday, October 5, 2011 4:29 PM
> >>> Subject: Using Scans in parallel
> >>>
> >>> Hi,
> >>>
> >>> Is there a known way to be able to do Scans in parallel (in different
> >>> threads even) and then sort/combine the output?
> >>>
> >>> For a row key like:
> >>>
> >>> prefix-event_type-event_id
> >>> prefix-event_type-event_id
> >>>
> >>> I want to declare two scan objects (for say event_id_type foo)
> >>>
> >>> Scan 1 => 0-foo
> >>> Scan 2 => 1-foo
> >>>
> >>> execute the scans in parallel (maybe even in different threads) and
> >>> then merge the results?
> >>>
> >>> Thank you,
> >>>
> >>> Sam
> >>>
> >>
> >
> >
>
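
For anyone following the thread, here is a minimal sketch of the pattern discussed above: one scan per key range, run in separate threads, results merged in key order, plus the kind of count check Bryan describes. It is not HBase client code. The sorted list ROWS is a hypothetical in-memory stand-in for a table, and scan, prefix_range, demarcations and parallel_scan are illustrative names that do not exist in any HBase API. Because nothing here talks to a region server, it cannot reproduce the inconsistent counts reported above; it only shows the shape of the test.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

# Hypothetical stand-in for a table: row keys of the form
# prefix-event_type-event_id, kept in sorted order as a region would keep them.
ROWS = sorted(
    f"{prefix}-{etype}-{eid:04d}"
    for prefix in range(4)
    for etype in ("bar", "foo")
    for eid in range(250)
)


def scan(start_row, stop_row):
    """Return keys in [start_row, stop_row), like a scan with start and stop rows."""
    lo = bisect_left(ROWS, start_row)
    hi = bisect_left(ROWS, stop_row)
    return ROWS[lo:hi]


def prefix_range(prefix):
    """Start/stop rows covering every key that begins with `prefix`."""
    # Bump the last character to get the smallest key that sorts after the prefix.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def demarcations(n):
    """Split the whole key space into n contiguous, non-overlapping ranges."""
    step = len(ROWS) // n
    cuts = [ROWS[i * step] for i in range(1, n)]
    starts = [ROWS[0]] + cuts
    stops = cuts + [ROWS[-1] + "\x00"]  # stop just past the largest key
    return list(zip(starts, stops))


def parallel_scan(ranges, max_workers=8):
    """Run one scan per range in its own thread and merge results in key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(lambda r: scan(*r), ranges))
    # Each part is already sorted, so a k-way merge keeps the output sorted.
    return list(merge(*parts))


if __name__ == "__main__":
    # Sam's case: two prefix scans for event type foo, merged.
    foo_rows = parallel_scan([prefix_range("0-foo"), prefix_range("1-foo")])
    print(len(foo_rows), "rows for event type foo under prefixes 0 and 1")

    # Bryan's counter test: 8 parallel scans versus one single-threaded scan.
    parallel_count = len(parallel_scan(demarcations(8)))
    single_count = len(scan(ROWS[0], ROWS[-1] + "\x00"))
    print("parallel:", parallel_count, "single:", single_count)
```

Against a real cluster the ranges would come from the region start and end keys (or from hand-picked demarcations, as Lars suggests), and a count that differs between the parallel and single-threaded runs is the symptom to look for.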
