Re: Scan vs Parallel scan.

Guillermo Ortiz Sat, 13 Sep 2014 16:50:45 -0700

What am I missing??

2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <[email protected]>:


> For a partial scan, I guess that I call the RS to get data, and it starts
> looking in the store files and collecting the data. (It doesn't write to
> the blockcache in either case.) Once it has the data ready, it returns it to
> the client step by step, I mean, depending on the caching and batching
> parameters.
>
> The big differences that I see...
> I'm opening more connections to the table, one per region.
>
> I should check the single table scan; it looks like it does the partial scans
> sequentially, since you can see on the HBase Master how the requests
> increase one after another, not all at the same time.
>
> 2014-09-12 15:23 GMT+02:00 Michael Segel <[email protected]>:
>
>> It doesn’t matter which RS, but that you have 1 thread for each region.
>>
>> So for each thread, what’s happening?
>> Step by step, what is the code doing?
>>
>> Now you’re comparing this against a single table scan, right?
>> What’s happening in the table scan…?
>>
>>
>> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <[email protected]>
>> wrote:
>>
>> > Right. My table, for example, has keys between 0-9 in three regions:
>> > 0-2, 3-7, 7-9.
>> > I launch three partial scans in parallel. The scans that I'm executing are:
>> > scan(0,2), scan(3,7), scan(7,9).
>> > Each region is on a different RS, so each thread goes to a different RS. It's
>> > not exactly like that, but in the benchmark case that's how it's working.
>> >
>> > Really the code will execute a thread for each region, not for each
>> > RegionServer. But in the test I only have two regions per RegionServer. I
>> > don't think that's an important point; there are two threads per RS.
>> >
>> > 2014-09-12 14:48 GMT+02:00 Michael Segel <[email protected]>:
>> >
>> >> Ok, lets again take a step back…
>> >>
>> >> So you are comparing your partial scan(s) against a full table scan?
>> >>
>> >> If I understood your question, you launch 3 partial scans where you set
>> >> the start row and then end row of each scan, right?
>> >>
>> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]>
>> wrote:
>> >>
>> >>> Okay, then the partial scan doesn't work the way I think.
>> >>> How could it exceed the limit of a single region if I calculate the limits?
>> >>>
>> >>>
>> >>> The only bad point that I see is that if a region server has three
>> >>> regions of the same table, I'm executing three partial scans against this RS
>> >>> and they could compete for resources (network, etc.) on this node. It'd be
>> >>> better to have one thread per RS. But that doesn't answer your questions.
>> >>>
>> >>> I keep thinking...
>> >>>
>> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I wanted to take a step back from the actual code and to stop and think
>> >>>> about what you are doing and what HBase is doing under the covers.
>> >>>>
>> >>>> So in your code, you are asking HBase to do 3 separate scans and then you
>> >>>> take the result set back and join it.
>> >>>>
>> >>>> What does HBase do when it does a range scan?
>> >>>> What happens when that range scan exceeds a single region?
>> >>>>
>> >>>> If you answer those questions… you’ll have your answer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]>
>> >> wrote:
>> >>>>
>> >>>>> It's not all the code, I set things like these as well:
>> >>>>> scan.setMaxVersions();
>> >>>>> scan.setCacheBlocks(false);
>> >>>>> ...
>> >>>>>
>> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
>> >>>>>
>> >>>>>> Yes, that's right. I have changed the HBase version to 0.98.
>> >>>>>>
>> >>>>>> I got the start and stop keys with this method:
>> >>>>>> private List<RegionScanner> generatePartitions() {
>> >>>>>>      List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>>>>>      byte[] startKey;
>> >>>>>>      byte[] stopKey;
>> >>>>>>      HConnection connection = null;
>> >>>>>>      HBaseAdmin hbaseAdmin = null;
>> >>>>>>      try {
>> >>>>>>          connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
>> >>>>>>          List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>>>>>          RegionScanner regionScanner = null;
>> >>>>>>          for (HRegionInfo region : regions) {
>> >>>>>>
>> >>>>>>              startKey = region.getStartKey();
>> >>>>>>              stopKey = region.getEndKey();
>> >>>>>>
>> >>>>>>              regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> >>>>>>              // regionScanner = createRegionScanner(startKey, stopKey);
>> >>>>>>              if (regionScanner != null) {
>> >>>>>>                  regionScanners.add(regionScanner);
>> >>>>>>              }
>> >>>>>>          }
>> >>>>>>
>> >>>>>> And I execute the RegionScanner with this:
>> >>>>>> public List<Result> call() throws Exception {
>> >>>>>>      HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>      HTableInterface table = connection.getTable(configuration.getTable());
>> >>>>>>
>> >>>>>>      Scan scan = new Scan(startKey, stopKey);
>> >>>>>>      scan.setBatch(configuration.getBatch());
>> >>>>>>      scan.setCaching(configuration.getCaching());
>> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
>> >>>>>>
>> >>>>>>      List<Result> results = new ArrayList<Result>();
>> >>>>>>      for (Result result : resultScanner) {
>> >>>>>>          results.add(result);
>> >>>>>>      }
>> >>>>>>
>> >>>>>>      // Release the scanner and the table before the connection they depend on.
>> >>>>>>      resultScanner.close();
>> >>>>>>      table.close();
>> >>>>>>      connection.close();
>> >>>>>>
>> >>>>>>      return results;
>> >>>>>>  }
>> >>>>>>
>> >>>>>> They implement Callable.
>> >>>>>>
>> >>>>>>
>> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]
>> >:
>> >>>>>>
>> >>>>>>> Lets take a step back….
>> >>>>>>>
>> >>>>>>> Your parallel scan is having the client create N threads where in each
>> >>>>>>> thread, you’re doing a partial scan of the table where each partial scan
>> >>>>>>> takes the first and last row of each region?
>> >>>>>>>
>> >>>>>>> Is that correct?
>> >>>>>>>
>> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <
>> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> I was checking a little bit more. I checked the cluster and the data is
>> >>>>>>>> stored on three different region servers, each one on a different node.
>> >>>>>>>> So, I guess the threads go to different hard disks.
>> >>>>>>>>
>> >>>>>>>> Does anyone have an idea or suggestion why a single scan is faster than
>> >>>>>>>> this implementation? I based it on this implementation:
>> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>> >>>>>>>>
>> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]
>> >:
>> >>>>>>>>
>> >>>>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
>> >>>>>>>>> there shouldn't be a difference.
>> >>>>>>>>> I disabled the table and disabled the blockcache for that family, and I
>> >>>>>>>>> put scan.setCacheBlocks(false) as well for both cases.
>> >>>>>>>>>
>> >>>>>>>>> I don't think it's possible that I'm executing a complete scan in each
>> >>>>>>>>> thread, since my data looks like this:
>> >>>>>>>>> 000001 f:q value=1
>> >>>>>>>>> 000002 f:q value=2
>> >>>>>>>>> 000003 f:q value=3
>> >>>>>>>>> ...
>> >>>>>>>>>
>> >>>>>>>>> I add up all the values and get the same result with a single scan as with
>> >>>>>>>>> the distributed one, so I guess that DistributedScan worked correctly.
>> >>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't remember
>> >>>>>>>>> exactly, but something like 4x the scan time.
>> >>>>>>>>> I'm not using any filter for the scans.
>> >>>>>>>>>
>> >>>>>>>>> This is how I calculate the number of regions/scans:
>> >>>>>>>>> private List<RegionScanner> generatePartitions() {
>> >>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>>>>>>>>     byte[] startKey;
>> >>>>>>>>>     byte[] stopKey;
>> >>>>>>>>>     HConnection connection = null;
>> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>> >>>>>>>>>     try {
>> >>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>> >>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>>>>>>>>         RegionScanner regionScanner = null;
>> >>>>>>>>>         for (HRegionInfo region : regions) {
>> >>>>>>>>>
>> >>>>>>>>>             startKey = region.getStartKey();
>> >>>>>>>>>             stopKey = region.getEndKey();
>> >>>>>>>>>
>> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>> >>>>>>>>>             if (regionScanner != null) {
>> >>>>>>>>>                 regionScanners.add(regionScanner);
>> >>>>>>>>>             }
>> >>>>>>>>>         }
>> >>>>>>>>>
>> >>>>>>>>> I did some tests on a tiny table and I think that the range for each scan
>> >>>>>>>>> works fine. Although, I thought it was interesting that the time when I
>> >>>>>>>>> execute the distributed scan is about 6x.
>> >>>>>>>>>
>> >>>>>>>>> I'm going to check the hard disks, but I think that it's right.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>> >>>>>>>>>
>> >>>>>>>>>> Which version of HBase?
>> >>>>>>>>>> Can you show us the code?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>> >>>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>> >>>>>>>>>> your parallel scans?
>> >>>>>>>>>>
>> >>>>>>>>>> -- Lars
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> ________________________________
>> >>>>>>>>>> From: Guillermo Ortiz <[email protected]>
>> >>>>>>>>>> To: "[email protected]" <[email protected]>
>> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>> >>>>>>>>>> Subject: Scan vs Parallel scan.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> I developed a distributed scan; I create a thread for each region. After
>> >>>>>>>>>> that, I tried to compare timings of Scan vs DistributedScan.
>> >>>>>>>>>> I have disabled the blockcache on my table. My cluster has 3 region servers
>> >>>>>>>>>> with 2 regions each; in total there are 100,000 rows, and I execute a
>> >>>>>>>>>> complete scan.
>> >>>>>>>>>>
>> >>>>>>>>>> My partitions are
>> >>>>>>>>>> -016666 -> request 16665
>> >>>>>>>>>> 016666-033332 -> request 16666
>> >>>>>>>>>> 033332-049998 -> request 16666
>> >>>>>>>>>> 049998-066664 -> request 16666
>> >>>>>>>>>> 066664-083330 -> request 16666
>> >>>>>>>>>> 083330- -> request 16671
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>> >>>>>>>>>>
>> >>>>>>>>>> Parallel scan performs much worse than the simple scan, and I don't know
>> >>>>>>>>>> why the simple scan is so fast; it's really much faster than executing a
>> >>>>>>>>>> "count" from the hbase shell, which doesn't look normal. The only time
>> >>>>>>>>>> that parallel works better is when I execute the normal scan with
>> >>>>>>>>>> caching 1.
>> >>>>>>>>>>
>> >>>>>>>>>> Any clue about it?
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>
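
For reference, the sanity check described in the thread (one task per region range, with the summed values compared against a single full scan) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the thread: it uses no HBase API at all, a TreeMap stands in for the table, and the names ParallelRangeScanSketch, KeyRange, view, scanRange, parallelSum, buildTable and sampleRanges are all made up for the example. The split points are the region boundaries listed in the original message, with the first one read as 016666. An empty string plays the role of HBase's empty start/end key (unbounded). It checks only that the ranges cover every row exactly once; it says nothing about timing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeScanSketch {

    /** Hypothetical holder for one region's [start, stop) range; "" means unbounded. */
    static final class KeyRange {
        final String start;
        final String stop;
        KeyRange(String start, String stop) { this.start = start; this.stop = stop; }
    }

    /** Returns the rows in [start, stop), start inclusive and stop exclusive. */
    static NavigableMap<String, Long> view(NavigableMap<String, Long> table, KeyRange range) {
        NavigableMap<String, Long> v = table;
        if (!range.start.isEmpty()) v = v.tailMap(range.start, true);
        if (!range.stop.isEmpty()) v = v.headMap(range.stop, false);
        return v;
    }

    /** "Scans" one range of the in-memory stand-in table and sums its values. */
    static long scanRange(NavigableMap<String, Long> table, KeyRange range) {
        long sum = 0;
        for (long value : view(table, range).values()) sum += value;
        return sum;
    }

    /** Runs one task per range on a shared pool and adds up the partial sums. */
    static long parallelSum(NavigableMap<String, Long> table, List<KeyRange> ranges) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (KeyRange range : ranges) {
                Callable<Long> task = () -> scanRange(table, range);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) total += future.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Builds rows 000001..rows with value == row number, like the data in the thread. */
    static NavigableMap<String, Long> buildTable(int rows) {
        NavigableMap<String, Long> table = new TreeMap<>();
        for (int i = 1; i <= rows; i++) table.put(String.format("%06d", i), (long) i);
        return table;
    }

    /** Six ranges built from the region boundaries quoted in the thread (assumed). */
    static List<KeyRange> sampleRanges() {
        String[] bounds = {"", "016666", "033332", "049998", "066664", "083330", ""};
        List<KeyRange> ranges = new ArrayList<>();
        for (int i = 0; i + 1 < bounds.length; i++) {
            ranges.add(new KeyRange(bounds[i], bounds[i + 1]));
        }
        return ranges;
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, Long> table = buildTable(100000);
        long single = scanRange(table, new KeyRange("", ""));
        long parallel = parallelSum(table, sampleRanges());
        System.out.println("single=" + single + " parallel=" + parallel
                + " match=" + (single == parallel));
    }
}
```

If the two sums match and the per-range row counts agree with the request counts in the original post, the ranges are not overlapping or rescanning the whole table, so the slowdown would have to come from somewhere else in each thread's work.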
