Re: Scan vs Parallel scan.

Guillermo Ortiz Sun, 14 Sep 2014 11:22:02 -0700

I don't have the code here. But I created a class RegionScanner, this class
does a complete scan of a region. So I have to set the start and stop keys.
the start and stop key are the limits of that region.


El domingo, 14 de septiembre de 2014, Anoop John <[email protected]>
escribió:

> Again full code snippet can better speak.
>
> But not getting what u r doing with below code
>
> private List<RegionScanner> generatePartitions() {
>         List<RegionScanner> regionScanners = new
> ArrayList<RegionScanner>();
>         byte[] startKey;
>         byte[] stopKey;
>         HConnection connection = null;
>         HBaseAdmin hbaseAdmin = null;
>         try {
>             connection = HConnectionManager.
> createConnection(HBaseConfiguration.create());
>             hbaseAdmin = new HBaseAdmin(connection);
>             List<HRegionInfo> regions =
> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>             RegionScanner regionScanner = null;
>             for (HRegionInfo region : regions) {
>
>                 startKey = region.getStartKey();
>                 stopKey = region.getEndKey();
>
>                 regionScanner = new RegionScanner(startKey, stopKey,
> scanConfiguration);
>                 // regionScanner = createRegionScanner(startKey, stopKey);
>                 if (regionScanner != null) {
>                     regionScanners.add(regionScanner);
>                 }
>             }
>
> And I execute the RegionScanner with this:
> public List<Result> call() throws Exception {
>         HConnection connection =
> HConnectionManager.
> createConnection(HBaseConfiguration.create());
>         HTableInterface table =
> connection.getTable(configuration.getTable());
>
>     Scan scan = new Scan(startKey, stopKey);
>         scan.setBatch(configuration.getBatch());
>         scan.setCaching(configuration.getCaching());
>         ResultScanner resultScanner = table.getScanner(scan);
>
>
> What is this part?
> new RegionScanner(startKey, stopKey,
> scanConfiguration);
>
>
> >>Scan scan = new Scan(startKey, stopKey);
>         scan.setBatch(configuration.
> getBatch());
>         scan.setCaching(configuration.getCaching());
>         ResultScanner resultScanner = table.getScanner(scan);
>
>
> And not setting start and stop rows to this Scan object? !!
>
>
> Sorry If I missed some parts from ur code.
>
> -Anoop-
>
>
> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz <[email protected]
> <javascript:;>>
> wrote:
>
> > I don't have the code here,, but I'll put the code in a couple of days. I
> > have to check the executeservice again! I don't remember exactly how I
> did.
> >
> > I'm using Hbase 0.98.
> >
> > El domingo, 14 de septiembre de 2014, lars hofhansl <[email protected]
> <javascript:;>>
> > escribió:
> >
> > > What specific version of 0.94 are you using?
> > >
> > > In general, if you have multiple spindles (disks) and/or multiple CPU
> > > cores at the region server you should benefits from keeping multiple
> > region
> > > server handler threads busy. I have experimented with this before and
> > saw a
> > > close to linear speed up (up to the point where all disks/core were
> > busy).
> > > Obviously this also assuming this is the only load you throw at the
> > servers
> > > at this point.
> > >
> > > Can you post your complete code to pastebin? Maybe even with some code
> to
> > > seed the data?
> > > How do you run your callables? Did you configure the ExecuteService
> > > correctly (assuming you use one to run your callables)?
> > >
> > > Then we can run it and have a look.
> > >
> > > Thanks.
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Guillermo Ortiz <[email protected] <javascript:;>
> <javascript:;>>
> > > To: "[email protected] <javascript:;> <javascript:;>" <
> [email protected] <javascript:;>
> > > <javascript:;>>
> > > Cc:
> > > Sent: Saturday, September 13, 2014 4:49 PM
> > > Subject: Re: Scan vs Parallel scan.
> > >
> > > What am I missing??
> > >
> > >
> > >
> > >
> > > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <[email protected]
> <javascript:;>
> > > <javascript:;>>:
> > >
> > > > For an partial scan, I guess that I call to the RS to get data, it
> > starts
> > > > looking in the store files and recollecting the data. (It doesn't
> write
> > > to
> > > > the blockcache in both cases). It has ready the data and it gives to
> > the
> > > > client the data step by step, I mean,,, it depends the caching and
> > > batching
> > > > parameters.
> > > >
> > > > Big differences that I see...
> > > > I'm opening more connections to the Table, one for Region.
> > > >
> > > > I should check the single table scan, it looks like it does partial
> > scans
> > > > sequentially. Since you can see on the HBase Master how the request
> > > > increase one after another, not all in the same time.
> > > >
> > > > 2014-09-12 15:23 GMT+02:00 Michael Segel <[email protected]
> <javascript:;>
> > > <javascript:;>>:
> > > >
> > > >> It doesn’t matter which RS, but that you have 1 thread for each
> > region.
> > > >>
> > > >> So for each thread, what’s happening.
> > > >> Step by step, what is the code doing.
> > > >>
> > > >> Now you’re comparing this against a single table scan, right?
> > > >> What’s happening in the table scan…?
> > > >>
> > > >>
> > > >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <[email protected]
> <javascript:;>
> > > <javascript:;>>
> > > >> wrote:
> > > >>
> > > >> > Right, My table for example has keys between 0-9. in three regions
> > > >> > 0-2,3-7,7-9
> > > >> > I lauch three partial scans in parallel. The scans that I'm
> > executing
> > > >> are:
> > > >> > scan(0,2), scan(3,7), scan(7,9).
> > > >> > Each region is if a different RS, so each thread goes to different
> > RS.
> > > >> It's
> > > >> > not exactly like that, but on the benchmark case it's like it's
> > > working.
> > > >> >
> > > >> > Really the code will execute a thread for each Region not for each
> > > >> > RegionServer. But in the test I only have two regions for
> > > regionServer.
> > > >> I
> > > >> > dont' think that's an important point, there're two threads for
> RS.
> > > >> >
> > > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel <
> [email protected] <javascript:;>
> > > <javascript:;>>:
> > > >> >
> > > >> >> Ok, lets again take a step back…
> > > >> >>
> > > >> >> So you are comparing your partial scan(s) against a full table
> > scan?
> > > >> >>
> > > >> >> If I understood your question, you launch 3 partial scans where
> you
> > > set
> > > >> >> the start row and then end row of each scan, right?
> > > >> >>
> > > >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <
> [email protected] <javascript:;>
> > > <javascript:;>>
> > > >> wrote:
> > > >> >>
> > > >> >>> Okay, then, the partial scan doesn't work as I think.
> > > >> >>> How could it exceed the limit of a single region if I calculate
> > the
> > > >> >> limits?
> > > >> >>>
> > > >> >>>
> > > >> >>> The only bad point that I see it's that If a region server has
> > three
> > > >> >>> regions of the same table,  I'm executing three partial scans
> > about
> > > >> this
> > > >> >> RS
> > > >> >>> and they could compete for resources (network, etc..) on this
> > node.
> > > >> It'd
> > > >> >> be
> > > >> >>> better to have one thread for RS. But, that doesn't answer your
> > > >> >> questions.
> > > >> >>>
> > > >> >>> I keep thinking...
> > > >> >>>
> > > >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <
> > [email protected] <javascript:;>
> > > <javascript:;>>:
> > > >> >>>
> > > >> >>>> Hi,
> > > >> >>>>
> > > >> >>>> I wanted to take a step back from the actual code and to stop
> and
> > > >> think
> > > >> >>>> about what you are doing and what HBase is doing under the
> > covers.
> > > >> >>>>
> > > >> >>>> So in your code, you are asking HBase to do 3 separate scans
> and
> > > then
> > > >> >> you
> > > >> >>>> take the result set back and join it.
> > > >> >>>>
> > > >> >>>> What does HBase do when it does a range scan?
> > > >> >>>> What happens when that range scan exceeds a single region?
> > > >> >>>>
> > > >> >>>> If you answer those questions… you’ll have your answer.
> > > >> >>>>
> > > >> >>>> HTH
> > > >> >>>>
> > > >> >>>> -Mike
> > > >> >>>>
> > > >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <
> > [email protected] <javascript:;>
> > > <javascript:;>>
> > > >> >> wrote:
> > > >> >>>>
> > > >> >>>>> It's not all the code, I set things like these as well:
> > > >> >>>>> scan.setMaxVersions();
> > > >> >>>>> scan.setCacheBlocks(false);
> > > >> >>>>> ...
> > > >> >>>>>
> > > >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <
> [email protected] <javascript:;>
> > > <javascript:;>>:
> > > >> >>>>>
> > > >> >>>>>> yes, that is. I have changed the HBase version to 0.98
> > > >> >>>>>>
> > > >> >>>>>> I got the start and stop keys with this method:
> > > >> >>>>>> private List<RegionScanner> generatePartitions() {
> > > >> >>>>>>      List<RegionScanner> regionScanners = new
> > > >> >>>>>> ArrayList<RegionScanner>();
> > > >> >>>>>>      byte[] startKey;
> > > >> >>>>>>      byte[] stopKey;
> > > >> >>>>>>      HConnection connection = null;
> > > >> >>>>>>      HBaseAdmin hbaseAdmin = null;
> > > >> >>>>>>      try {
> > > >> >>>>>>          connection = HConnectionManager.
> > > >> >>>>>> createConnection(HBaseConfiguration.create());
> > > >> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> > > >> >>>>>>          List<HRegionInfo> regions =
> > > >> >>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > > >> >>>>>>          RegionScanner regionScanner = null;
> > > >> >>>>>>          for (HRegionInfo region : regions) {
> > > >> >>>>>>
> > > >> >>>>>>              startKey = region.getStartKey();
> > > >> >>>>>>              stopKey = region.getEndKey();
> > > >> >>>>>>
> > > >> >>>>>>              regionScanner = new RegionScanner(startKey,
> > stopKey,
> > > >> >>>>>> scanConfiguration);
> > > >> >>>>>>              // regionScanner = createRegionScanner(startKey,
> > > >> >>>> stopKey);
> > > >> >>>>>>              if (regionScanner != null) {
> > > >> >>>>>>                  regionScanners.add(regionScanner);
> > > >> >>>>>>              }
> > > >> >>>>>>          }
> > > >> >>>>>>
> > > >> >>>>>> And I execute the RegionScanner with this:
> > > >> >>>>>> public List<Result> call() throws Exception {
> > > >> >>>>>>      HConnection connection =
> > > >> >>>>>>
> > HConnectionManager.createConnection(HBaseConfiguration.create());
> > > >> >>>>>>      HTableInterface table =
> > > >> >>>>>> connection.getTable(configuration.getTable());
> > > >> >>>>>>
> > > >> >>>>>>  Scan scan = new Scan(startKey, stopKey);
> > > >> >>>>>>      scan.setBatch(configuration.getBatch());
> > > >> >>>>>>      scan.setCaching(configuration.getCaching());
> > > >> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
> > > >> >>>>>>
> > > >> >>>>>>      List<Result> results = new ArrayList<Result>();
> > > >> >>>>>>      for (Result result : resultScanner) {
> > > >> >>>>>>          results.add(result);
> > > >> >>>>>>      }
> > > >> >>>>>>
> > > >> >>>>>>      connection.close();
> > > >> >>>>>>      table.close();
> > > >> >>>>>>
> > > >> >>>>>>      return results;
> > > >> >>>>>>  }
> > > >> >>>>>>
> > > >> >>>>>> They implement Callable.
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <
> > > [email protected] <javascript:;> <javascript:;>
> > > >> >:
> > > >> >>>>>>
> > > >> >>>>>>> Lets take a step back….
> > > >> >>>>>>>
> > > >> >>>>>>> Your parallel scan is having the client create N threads
> where
> > > in
> > > >> >> each
> > > >> >>>>>>> thread, you’re doing a partial scan of the table where each
> > > >> partial
> > > >> >>>> scan
> > > >> >>>>>>> takes the first and last row of each region?
> > > >> >>>>>>>
> > > >> >>>>>>> Is that correct?
> > > >> >>>>>>>
> > > >> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <
> > > >> [email protected] <javascript:;> <javascript:;>>
> > > >> >>>>>>> wrote:
> > > >> >>>>>>>
> > > >> >>>>>>>> I was checking a little bit more about,, I checked the
> > cluster
> > > >> and
> > > >> >>>> data
> > > >> >>>>>>> is
> > > >> >>>>>>>> store in three different regions servers, each one in a
> > > >> differente
> > > >> >>>> node.
> > > >> >>>>>>>> So, I guess the threads go to different hard-disks.
> > > >> >>>>>>>>
> > > >> >>>>>>>> If someone has an idea or suggestion.. why it's faster a
> > single
> > > >> scan
> > > >> >>>>>>> than
> > > >> >>>>>>>> this implementation. I based on this implementation
> > > >> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> > > >> >>>>>>>>
> > > >> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <
> > > [email protected] <javascript:;> <javascript:;>
> > > >> >:
> > > >> >>>>>>>>
> > > >> >>>>>>>>> I'm working with HBase 0.94 for this case,, I'll try with
> > > 0.98,
> > > >> >>>>>>> although
> > > >> >>>>>>>>> there is not difference.
> > > >> >>>>>>>>> I disabled the table and disabled the blockcache for that
> > > family
> > > >> >> and
> > > >> >>>> I
> > > >> >>>>>>> put
> > > >> >>>>>>>>> scan.setBlockcache(false) as well for both cases.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I think that it's not possible that I executing an
> complete
> > > scan
> > > >> >> for
> > > >> >>>>>>> each
> > > >> >>>>>>>>> thread since my data are the type:
> > > >> >>>>>>>>> 000001 f:q value=1
> > > >> >>>>>>>>> 000002 f:q value=2
> > > >> >>>>>>>>> 000003 f:q value=3
> > > >> >>>>>>>>> ...
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I add all the values and get the same result on a single
> > scan
> > > >> than
> > > >> >> a
> > > >> >>>>>>>>> distributed, so, I guess that DistributedScan did well.
> > > >> >>>>>>>>> The count from the hbase shell takes about 10-15seconds, I
> > > don't
> > > >> >>>>>>> remember,
> > > >> >>>>>>>>> but like 4x  of the scan time.
> > > >> >>>>>>>>> I'm not using any filter for the scans.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> This is the way I calculate number of regions/scans
> > > >> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> > > >> >>>>>>>>>     List<RegionScanner> regionScanners = new
> > > >> >>>>>>>>> ArrayList<RegionScanner>();
> > > >> >>>>>>>>>     byte[] startKey;
> > > >> >>>>>>>>>     byte[] stopKey;
> > > >> >>>>>>>>>     HConnection connection = null;
> > > >> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> > > >> >>>>>>>>>     try {
> > > >> >>>>>>>>>         connection =
> > > >> >>>>>>>>>
> > > >> HConnectionManager.createConnection(HBaseConfiguration.create());
> > > >> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> > > >> >>>>>>>>>         List<HRegionInfo> regions =
> > > >> >>>>>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > > >> >>>>>>>>>         RegionScanner regionScanner = null;
> > > >> >>>>>>>>>         for (HRegionInfo region : regions) {
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>             startKey = region.getStartKey();
> > > >> >>>>>>>>>             stopKey = region.getEndKey();
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>             regionScanner = new RegionScanner(startKey,
> > > stopKey,
> > > >> >>>>>>>>> scanConfiguration);
> > > >> >>>>>>>>>             // regionScanner =
> createRegionScanner(startKey,
> > > >> >>>>>>> stopKey);
> > > >> >>>>>>>>>             if (regionScanner != null) {
> > > >> >>>>>>>>>                 regionScanners.add(regionScanner);
> > > >> >>>>>>>>>             }
> > > >> >>>>>>>>>         }
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I did some test for a tiny table and I think that the
> range
> > > for
> > > >> >> each
> > > >> >>>>>>> scan
> > > >> >>>>>>>>> works fine. Although, I though that it was interesting
> that
> > > the
> > > >> >> time
> > > >> >>>>>>> when I
> > > >> >>>>>>>>> execute distributed scan is about 6x.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I'm going to check about the hard disks, but I think that
> > ti's
> > > >> >> right.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]
> <javascript:;>
> > > <javascript:;>>:
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>> Which version of HBase?
> > > >> >>>>>>>>>> Can you show us the code?
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as
> long
> > as
> > > >> the
> > > >> >>>>>>> single
> > > >> >>>>>>>>>> scan, which is suspicious because you say you have 6
> > regions.
> > > >> >>>>>>>>>> Are you sure you're not accidentally scanning all the
> data
> > in
> > > >> each
> > > >> >>>> of
> > > >> >>>>>>>>>> your parallel scans?
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> -- Lars
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> ________________________________
> > > >> >>>>>>>>>> From: Guillermo Ortiz <[email protected]
> <javascript:;>
> > <javascript:;>>
> > > >> >>>>>>>>>> To: "[email protected] <javascript:;>
> <javascript:;>" <
> > > [email protected] <javascript:;> <javascript:;>>
> > > >> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> > > >> >>>>>>>>>> Subject: Scan vs Parallel scan.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Hi,
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> I developed an distributed scan, I create an thread for
> > each
> > > >> >> region.
> > > >> >>>>>>> After
> > > >> >>>>>>>>>> that, I've tried to get some times Scan vs
> DistributedScan.
> > > >> >>>>>>>>>> I have disabled blockcache in my table. My cluster has 3
> > > region
> > > >> >>>>>>> servers
> > > >> >>>>>>>>>> with 2 regions each one, in total there are 100.000 rows
> > and
> > > >> >>>> execute a
> > > >> >>>>>>>>>> complete scan.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> My partitions are
> > > >> >>>>>>>>>> -01666 -> request 16665
> > > >> >>>>>>>>>> 016666-033332 -> request 16666
> > > >> >>>>>>>>>> 033332-049998 -> request 16666
> > > >> >>>>>>>>>> 049998-066664 -> request 16666
> > > >> >>>>>>>>>> 066664-083330 -> request 16666
> > > >> >>>>>>>>>> 083330- -> request 16671
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS
> 100000
> > > >> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN
> > > >> >>>>>>> PARALLEL:22089ms,Counter:2 ->
> > > >> >>>>>>>>>> Caching 10
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS
> 100000
> > > >> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN
> > > >> >>>>>>> PARALJEL:16598ms,Counter:2 ->
> > > >> >>>>>>>>>> Caching 100
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS
> 100000
> > > >> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN
> > > >> >>>>>>> PARALLEL:16497ms,Counter:2 ->
> > > >> >>>>>>>>>> Caching 1000
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS
> 100000
> > > >> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN
> > > >> >> NORMAL:68288ms,Counter:2
> > > >> >>>>>>> ->
> > > >> >>>>>>>>>> Caching 1
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS
> 100000
> > > >> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN
> > > >> >> NORMAL:2646ms,Counter:2
> > > >> >>>> ->
> > > >> >>>>>>>>>> Caching 100
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS
> 100000
> > > >> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN
> > > >> >> NORMAL:3903ms,Counter:2
> > > >> >>>> ->
> > > >> >>>>>>>>>> Caching 1000
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Parallel scan works much worse than simple scan,, and I
> > don't
> > > >> know
> > > >> >>>> why
> > > >> >>>>>>>>>> it's
> > > >> >>>>>>>>>> so fast,, it's really much faster than execute an "count"
> > > from
> > > >> >> hbase
> > > >> >>>>>>>>>> shell,
> > > >> >>>>>>>>>> what it doesn't look pretty notmal. The only time that it
> > > works
> > > >> >>>> better
> > > >> >>>>>>>>>> parallel is when I execute a normal scan with caching 1.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Any clue about it?
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>
> > > >> >>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Scan vs Parallel scan.

Reply via email to