Re: Scan vs Parallel scan.

Anoop John Sun, 14 Sep 2014 02:59:03 -0700

Again, the full code would explain this better.

But I don't understand what you are doing with the code below.


private List<RegionScanner> generatePartitions() {
        List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
        byte[] startKey;
        byte[] stopKey;
        HConnection connection = null;
        HBaseAdmin hbaseAdmin = null;
        try {
            connection = HConnectionManager.createConnection(HBaseConfiguration.create());
            hbaseAdmin = new HBaseAdmin(connection);
            List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
            RegionScanner regionScanner = null;
            for (HRegionInfo region : regions) {

                startKey = region.getStartKey();
                stopKey = region.getEndKey();

                regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
                // regionScanner = createRegionScanner(startKey, stopKey);
                if (regionScanner != null) {
                    regionScanners.add(regionScanner);
                }
            }

And I execute the RegionScanner with this:
public List<Result> call() throws Exception {
        HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        HTableInterface table = connection.getTable(configuration.getTable());

        Scan scan = new Scan(startKey, stopKey);
        scan.setBatch(configuration.getBatch());
        scan.setCaching(configuration.getCaching());
        ResultScanner resultScanner = table.getScanner(scan);


What is this part?
new RegionScanner(startKey, stopKey, scanConfiguration);


>>Scan scan = new Scan(startKey, stopKey);
        scan.setBatch(configuration.getBatch());
        scan.setCaching(configuration.getCaching());
        ResultScanner resultScanner = table.getScanner(scan);


And are you not setting the start and stop rows on this Scan object?!
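
For reference, here is a minimal sketch of the range logic that per-region partitioning relies on. It assumes the usual HBase convention that a scan covers the half-open interval [start row, stop row), and that an empty key means "no bound" on that side (as with the first region's start key and the last region's end key). It uses only the JDK (Java 9+ for Arrays.compareUnsigned). The class KeyRanges, its Range helper and the method names are hypothetical, and the split points are the ones from the partition list in the original post.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyRanges {
    /** Half-open row-key range [start, stop); an empty array means unbounded on that side. */
    static final class Range {
        final byte[] start;
        final byte[] stop;

        Range(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }

        boolean contains(byte[] row) {
            // Unsigned lexicographic byte comparison, start inclusive, stop exclusive.
            boolean afterStart = start.length == 0 || Arrays.compareUnsigned(row, start) >= 0;
            boolean beforeStop = stop.length == 0 || Arrays.compareUnsigned(row, stop) < 0;
            return afterStart && beforeStop;
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Builds one range per region from the sorted split points between regions. */
    static List<Range> fromSplitPoints(String... splits) {
        List<Range> ranges = new ArrayList<>();
        byte[] previous = new byte[0];                 // first region: open start
        for (String split : splits) {
            ranges.add(new Range(previous, bytes(split)));
            previous = bytes(split);
        }
        ranges.add(new Range(previous, new byte[0]));  // last region: open end
        return ranges;
    }

    /** Counts how many ranges claim the row; 1 for every row means no overlap and no gap. */
    static int owners(List<Range> ranges, String row) {
        int count = 0;
        for (Range r : ranges) {
            if (r.contains(bytes(row))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Range> ranges = fromSplitPoints("016666", "033332", "049998", "066664", "083330");
        System.out.println(ranges.size() + " ranges");
        System.out.println("owners of 033332: " + owners(ranges, "033332"));
    }
}
```

If each parallel scan really receives one such range, every row is read by exactly one scan. A row that falls exactly on a split point belongs only to the range that starts there, because the stop row is exclusive.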


Sorry if I missed some parts of your code.

-Anoop-


On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz <[email protected]>
wrote:

> I don't have the code here, but I'll post it in a couple of days. I
> have to check the ExecutorService again! I don't remember exactly how I did it.
>
> I'm using HBase 0.98.
>
> On Sunday, 14 September 2014, lars hofhansl <[email protected]> wrote:
>
> > What specific version of 0.94 are you using?
> >
> > In general, if you have multiple spindles (disks) and/or multiple CPU
> > cores at the region server you should benefit from keeping multiple
> > region server handler threads busy. I have experimented with this before
> > and saw a close to linear speedup (up to the point where all disks/cores
> > were busy). Obviously this also assumes this is the only load you throw
> > at the servers at this point.
> >
> > Can you post your complete code to pastebin? Maybe even with some code to
> > seed the data?
> > How do you run your callables? Did you configure the ExecutorService
> > correctly (assuming you use one to run your callables)?
> >
> > Then we can run it and have a look.
> >
> > Thanks.
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Guillermo Ortiz <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Sent: Saturday, September 13, 2014 4:49 PM
> > Subject: Re: Scan vs Parallel scan.
> >
> > What am I missing??
> >
> >
> >
> >
> > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
> >
> > > For a partial scan, I guess that I call the RS to get data, and it
> > > starts looking in the store files and collecting the data. (It doesn't
> > > write to the blockcache in either case.) It has the data ready and gives
> > > it to the client step by step; I mean, it depends on the caching and
> > > batching parameters.
> > >
> > > Big differences that I see...
> > > I'm opening more connections to the table, one per region.
> > >
> > > I should check the single table scan; it looks like it does partial
> > > scans sequentially, since you can see on the HBase Master how the
> > > requests increase one after another, not all at the same time.
> > >
> > > 2014-09-12 15:23 GMT+02:00 Michael Segel <[email protected]>:
> > >
> > >> It doesn’t matter which RS, but that you have 1 thread for each region.
> > >>
> > >> So for each thread, what’s happening?
> > >> Step by step, what is the code doing?
> > >>
> > >> Now you’re comparing this against a single table scan, right?
> > >> What’s happening in the table scan…?
> > >>
> > >>
> > >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <[email protected]> wrote:
> > >>
> > >> > Right. My table, for example, has keys between 0-9 in three regions:
> > >> > 0-2, 3-7, 7-9.
> > >> > I launch three partial scans in parallel. The scans that I'm executing
> > >> > are: scan(0,2), scan(3,7), scan(7,9).
> > >> > Each region is on a different RS, so each thread goes to a different
> > >> > RS. It's not exactly like that, but in the benchmark case that's how
> > >> > it's working.
> > >> >
> > >> > Really the code will execute a thread for each region, not for each
> > >> > RegionServer. But in the test I only have two regions per RegionServer.
> > >> > I don't think that's an important point; there are two threads per RS.
> > >> >
> > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel <[email protected]>:
> > >> >
> > >> >> Ok, let's again take a step back…
> > >> >>
> > >> >> So you are comparing your partial scan(s) against a full table scan?
> > >> >>
> > >> >> If I understood your question, you launch 3 partial scans where you
> > >> >> set the start row and the end row of each scan, right?
> > >> >>
> > >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:
> > >> >>
> > >> >>> Okay, then, the partial scan doesn't work as I think.
> > >> >>> How could it exceed the limit of a single region if I calculate the
> > >> >>> limits?
> > >> >>>
> > >> >>>
> > >> >>> The only bad point that I see is that if a region server has three
> > >> >>> regions of the same table, I'm executing three partial scans against
> > >> >>> this RS and they could compete for resources (network, etc.) on this
> > >> >>> node. It'd be better to have one thread per RS. But that doesn't
> > >> >>> answer your questions.
> > >> >>>
> > >> >>> I keep thinking...
> > >> >>>
> > >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
> > >> >>>
> > >> >>>> Hi,
> > >> >>>>
> > >> >>>> I wanted to take a step back from the actual code and to stop and
> > >> >>>> think about what you are doing and what HBase is doing under the
> > >> >>>> covers.
> > >> >>>>
> > >> >>>> So in your code, you are asking HBase to do 3 separate scans and
> > >> >>>> then you take the result set back and join it.
> > >> >>>>
> > >> >>>> What does HBase do when it does a range scan?
> > >> >>>> What happens when that range scan exceeds a single region?
> > >> >>>>
> > >> >>>> If you answer those questions… you’ll have your answer.
> > >> >>>>
> > >> >>>> HTH
> > >> >>>>
> > >> >>>> -Mike
> > >> >>>>
> > >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
> > >> >>>>
> > >> >>>>> It's not all the code, I set things like these as well:
> > >> >>>>> scan.setMaxVersions();
> > >> >>>>> scan.setCacheBlocks(false);
> > >> >>>>> ...
> > >> >>>>>
> > >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
> > >> >>>>>
> > >> >>>>>> Yes, that's it. I have changed the HBase version to 0.98.
> > >> >>>>>>
> > >> >>>>>> I got the start and stop keys with this method:
> > >> >>>>>> private List<RegionScanner> generatePartitions() {
> > >> >>>>>>      List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> > >> >>>>>>      byte[] startKey;
> > >> >>>>>>      byte[] stopKey;
> > >> >>>>>>      HConnection connection = null;
> > >> >>>>>>      HBaseAdmin hbaseAdmin = null;
> > >> >>>>>>      try {
> > >> >>>>>>          connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> > >> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> > >> >>>>>>          List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > >> >>>>>>          RegionScanner regionScanner = null;
> > >> >>>>>>          for (HRegionInfo region : regions) {
> > >> >>>>>>
> > >> >>>>>>              startKey = region.getStartKey();
> > >> >>>>>>              stopKey = region.getEndKey();
> > >> >>>>>>
> > >> >>>>>>              regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> > >> >>>>>>              // regionScanner = createRegionScanner(startKey, stopKey);
> > >> >>>>>>              if (regionScanner != null) {
> > >> >>>>>>                  regionScanners.add(regionScanner);
> > >> >>>>>>              }
> > >> >>>>>>          }
> > >> >>>>>>
> > >> >>>>>> And I execute the RegionScanner with this:
> > >> >>>>>> public List<Result> call() throws Exception {
> > >> >>>>>>      HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> > >> >>>>>>      HTableInterface table = connection.getTable(configuration.getTable());
> > >> >>>>>>
> > >> >>>>>>      Scan scan = new Scan(startKey, stopKey);
> > >> >>>>>>      scan.setBatch(configuration.getBatch());
> > >> >>>>>>      scan.setCaching(configuration.getCaching());
> > >> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
> > >> >>>>>>
> > >> >>>>>>      List<Result> results = new ArrayList<Result>();
> > >> >>>>>>      for (Result result : resultScanner) {
> > >> >>>>>>          results.add(result);
> > >> >>>>>>      }
> > >> >>>>>>
> > >> >>>>>>      connection.close();
> > >> >>>>>>      table.close();
> > >> >>>>>>
> > >> >>>>>>      return results;
> > >> >>>>>>  }
> > >> >>>>>>
> > >> >>>>>> They implement Callable.
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
> > >> >>>>>>
> > >> >>>>>>> Let's take a step back….
> > >> >>>>>>>
> > >> >>>>>>> Your parallel scan is having the client create N threads where in
> > >> >>>>>>> each thread, you’re doing a partial scan of the table where each
> > >> >>>>>>> partial scan takes the first and last row of each region?
> > >> >>>>>>>
> > >> >>>>>>> Is that correct?
> > >> >>>>>>>
> > >> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
> > >> >>>>>>>
> > >> >>>>>>>> I was checking a little bit more. I checked the cluster and the
> > >> >>>>>>>> data is stored in three different region servers, each one on a
> > >> >>>>>>>> different node. So, I guess the threads go to different hard disks.
> > >> >>>>>>>>
> > >> >>>>>>>> Does anyone have an idea or suggestion why a single scan is faster
> > >> >>>>>>>> than this implementation? I based it on this implementation:
> > >> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> > >> >>>>>>>>
> > >> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
> > >> >>>>>>>>
> > >> >>>>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98,
> > >> >>>>>>>>> although there is no difference.
> > >> >>>>>>>>> I disabled the table, disabled the blockcache for that family, and
> > >> >>>>>>>>> set scan.setCacheBlocks(false) as well for both cases.
> > >> >>>>>>>>>
> > >> >>>>>>>>> I don't think it's possible that I'm executing a complete scan in
> > >> >>>>>>>>> each thread, since my data looks like this:
> > >> >>>>>>>>> 000001 f:q value=1
> > >> >>>>>>>>> 000002 f:q value=2
> > >> >>>>>>>>> 000003 f:q value=3
> > >> >>>>>>>>> ...
> > >> >>>>>>>>>
> > >> >>>>>>>>> I add up all the values and get the same result with a single scan
> > >> >>>>>>>>> as with a distributed one, so I guess that DistributedScan worked
> > >> >>>>>>>>> correctly.
> > >> >>>>>>>>> The count from the hbase shell takes about 10-15 seconds (I don't
> > >> >>>>>>>>> remember exactly), roughly 4x the scan time.
> > >> >>>>>>>>> I'm not using any filter for the scans.
> > >> >>>>>>>>>
> > >> >>>>>>>>> This is the way I calculate number of regions/scans
> > >> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> > >> >>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> > >> >>>>>>>>>     byte[] startKey;
> > >> >>>>>>>>>     byte[] stopKey;
> > >> >>>>>>>>>     HConnection connection = null;
> > >> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> > >> >>>>>>>>>     try {
> > >> >>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> > >> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> > >> >>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > >> >>>>>>>>>         RegionScanner regionScanner = null;
> > >> >>>>>>>>>         for (HRegionInfo region : regions) {
> > >> >>>>>>>>>
> > >> >>>>>>>>>             startKey = region.getStartKey();
> > >> >>>>>>>>>             stopKey = region.getEndKey();
> > >> >>>>>>>>>
> > >> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> > >> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
> > >> >>>>>>>>>             if (regionScanner != null) {
> > >> >>>>>>>>>                 regionScanners.add(regionScanner);
> > >> >>>>>>>>>             }
> > >> >>>>>>>>>         }
> > >> >>>>>>>>>
> > >> >>>>>>>>> I did some tests on a tiny table and I think that the range for each
> > >> >>>>>>>>> scan works fine. Still, I thought it was interesting that the time
> > >> >>>>>>>>> when I execute the distributed scan is about 6x.
> > >> >>>>>>>>>
> > >> >>>>>>>>> I'm going to check the hard disks, but I think it's right.
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Which version of HBase?
> > >> >>>>>>>>>> Can you show us the code?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as
> > >> >>>>>>>>>> the single scan, which is suspicious because you say you have 6
> > >> >>>>>>>>>> regions.
> > >> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in
> > >> >>>>>>>>>> each of your parallel scans?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> -- Lars
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> ________________________________
> > >> >>>>>>>>>> From: Guillermo Ortiz <[email protected]>
> > >> >>>>>>>>>> To: "[email protected]" <[email protected]>
> > >> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> > >> >>>>>>>>>> Subject: Scan vs Parallel scan.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Hi,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> I developed a distributed scan that creates a thread for each
> > >> >>>>>>>>>> region. After that, I tried to compare timings for Scan vs
> > >> >>>>>>>>>> DistributedScan.
> > >> >>>>>>>>>> I have disabled the blockcache in my table. My cluster has 3 region
> > >> >>>>>>>>>> servers with 2 regions each; in total there are 100,000 rows, and I
> > >> >>>>>>>>>> execute a complete scan.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> My partitions are
> > >> >>>>>>>>>> -01666 -> request 16665
> > >> >>>>>>>>>> 016666-033332 -> request 16666
> > >> >>>>>>>>>> 033332-049998 -> request 16666
> > >> >>>>>>>>>> 049998-066664 -> request 16666
> > >> >>>>>>>>>> 066664-083330 -> request 16666
> > >> >>>>>>>>>> 083330- -> request 16671
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> The parallel scan performs much worse than the simple scan, and I
> > >> >>>>>>>>>> don't know why the simple scan is so fast; it's really much faster
> > >> >>>>>>>>>> than executing a "count" from the hbase shell, which doesn't look
> > >> >>>>>>>>>> normal. The only time the parallel scan works better is when I
> > >> >>>>>>>>>> execute the normal scan with caching 1.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Any clue about it?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>
> > >> >>>>
> > >> >>>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >
> >
>
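
On the ExecutorService question raised in the thread, the following is a minimal, JDK-only sketch of running one Callable per region on a fixed-size pool and merging the results. The names ParallelPartitions, runAll and fakeScan are hypothetical, and fakeScan merely stands in for a real per-region scan, so this is not the poster's code. What it illustrates is that the pool needs at least as many threads as partitions for the scans to overlap. With a single-threaded pool, or with get() called right after each submit, the tasks would run one after another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPartitions {
    /** Runs all tasks on a fixed-size pool and concatenates their results in task order. */
    static <T> List<T> runAll(List<Callable<List<T>>> tasks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> merged = new ArrayList<>();
            // invokeAll blocks until every task has finished; futures are returned in task order.
            for (Future<List<T>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get()); // a failed task surfaces here as ExecutionException
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    /** Hypothetical stand-in for a per-region scan: returns the integers in [from, to). */
    static Callable<List<Integer>> fakeScan(int from, int to) {
        return () -> {
            List<Integer> rows = new ArrayList<>();
            for (int i = from; i < to; i++) {
                rows.add(i);
            }
            return rows;
        };
    }

    public static void main(String[] args) throws Exception {
        List<Callable<List<Integer>>> tasks = new ArrayList<>();
        tasks.add(fakeScan(0, 3));
        tasks.add(fakeScan(3, 7));
        tasks.add(fakeScan(7, 10));
        // One thread per partition so that all three "scans" can run at the same time.
        System.out.println(runAll(tasks, tasks.size()));
    }
}
```

Running main prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. In a real client, the expensive setup (such as creating a connection) would normally be shared across tasks rather than repeated inside each call(), but that is a design choice outside this sketch.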
