OK, let's again take a step back… So you are comparing your partial scan(s) against a full table scan?
If I understood your question, you launch 3 partial scans where you set the start row and the end row of each scan, right?

On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:

> Okay, then the partial scan doesn't work the way I thought.
> How could it exceed the limit of a single region if I calculate the limits?
>
> The only bad point that I see is that if a region server hosts three
> regions of the same table, I'm executing three partial scans against that RS
> and they could compete for resources (network, etc.) on that node. It'd be
> better to have one thread per RS. But that doesn't answer your questions.
>
> I keep thinking...
>
> 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
>
>> Hi,
>>
>> I wanted to take a step back from the actual code and to stop and think
>> about what you are doing and what HBase is doing under the covers.
>>
>> So in your code, you are asking HBase to do 3 separate scans and then you
>> take the result set back and join it.
>>
>> What does HBase do when it does a range scan?
>> What happens when that range scan exceeds a single region?
>>
>> If you answer those questions… you'll have your answer.
>>
>> HTH
>>
>> -Mike
>>
>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
>>
>>> It's not all the code; I set things like these as well:
>>> scan.setMaxVersions();
>>> scan.setCacheBlocks(false);
>>> ...
>>>
>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>
>>>> Yes, that is correct. I have changed the HBase version to 0.98.
>>>>
>>>> I get the start and stop keys with this method:
>>>>
>>>> private List<RegionScanner> generatePartitions() {
>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>     byte[] startKey;
>>>>     byte[] stopKey;
>>>>     HConnection connection = null;
>>>>     HBaseAdmin hbaseAdmin = null;
>>>>     try {
>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>         RegionScanner regionScanner = null;
>>>>         for (HRegionInfo region : regions) {
>>>>
>>>>             startKey = region.getStartKey();
>>>>             stopKey = region.getEndKey();
>>>>
>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>             if (regionScanner != null) {
>>>>                 regionScanners.add(regionScanner);
>>>>             }
>>>>         }
>>>>
>>>> And I execute the RegionScanner with this:
>>>>
>>>> public List<Result> call() throws Exception {
>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>>>>
>>>>     Scan scan = new Scan(startKey, stopKey);
>>>>     scan.setBatch(configuration.getBatch());
>>>>     scan.setCaching(configuration.getCaching());
>>>>     ResultScanner resultScanner = table.getScanner(scan);
>>>>
>>>>     List<Result> results = new ArrayList<Result>();
>>>>     for (Result result : resultScanner) {
>>>>         results.add(result);
>>>>     }
>>>>
>>>>     connection.close();
>>>>     table.close();
>>>>
>>>>     return results;
>>>> }
>>>>
>>>> They implement Callable.
>>>>
>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
>>>>
>>>>> Let's take a step back….
>>>>>
>>>>> Your parallel scan is having the client create N threads where, in each
>>>>> thread, you're doing a partial scan of the table, and each partial scan
>>>>> takes the first and last row of a region?
>>>>>
>>>>> Is that correct?
>>>>>
>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>>>>>
>>>>>> I was checking a little bit more. I checked the cluster and the data
>>>>>> is stored in three different region servers, each one on a different node.
>>>>>> So, I guess the threads go to different hard disks.
>>>>>>
>>>>>> If someone has an idea or suggestion... why is a single scan faster than
>>>>>> this implementation? I based it on this implementation:
>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>>>>
>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>>>>
>>>>>>> I'm working with HBase 0.94 for this case; I'll try with 0.98, although
>>>>>>> there is no difference.
>>>>>>> I disabled the table and disabled the block cache for that family, and I set
>>>>>>> scan.setCacheBlocks(false) as well for both cases.
>>>>>>>
>>>>>>> I think it's not possible that I'm executing a complete scan in each
>>>>>>> thread, since my data are of the type:
>>>>>>> 000001 f:q value=1
>>>>>>> 000002 f:q value=2
>>>>>>> 000003 f:q value=3
>>>>>>> ...
>>>>>>>
>>>>>>> I add up all the values and get the same result from a single scan as from
>>>>>>> the distributed one, so I guess the DistributedScan did its job correctly.
>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't remember
>>>>>>> exactly, but roughly 4x the scan time.
>>>>>>> I'm not using any filter for the scans.
>>>>>>>
>>>>>>> This is the way I calculate the number of regions/scans:
>>>>>>>
>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>     byte[] startKey;
>>>>>>>     byte[] stopKey;
>>>>>>>     HConnection connection = null;
>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>     try {
>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>
>>>>>>>             startKey = region.getStartKey();
>>>>>>>             stopKey = region.getEndKey();
>>>>>>>
>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>             if (regionScanner != null) {
>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>             }
>>>>>>>         }
>>>>>>>
>>>>>>> I did some tests on a tiny table and I think the range for each scan
>>>>>>> works fine. Still, I found it interesting that the time when I execute
>>>>>>> the distributed scan is about 6x.
>>>>>>>
>>>>>>> I'm going to check the hard disks, but I think they're fine.
>>>>>>>
>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>>>>>>>
>>>>>>>> Which version of HBase?
>>>>>>>> Can you show us the code?
>>>>>>>>
>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>>>>>>>> your parallel scans?
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: Guillermo Ortiz <[email protected]>
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>>>> Subject: Scan vs Parallel scan.
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I developed a distributed scan; I create a thread for each region. After
>>>>>>>> that, I tried to take some timings: Scan vs DistributedScan.
>>>>>>>> I have disabled the block cache on my table. My cluster has 3 region servers
>>>>>>>> with 2 regions each; in total there are 100,000 rows, and I execute a
>>>>>>>> complete scan.
>>>>>>>>
>>>>>>>> My partitions are:
>>>>>>>> -01666 -> request 16665
>>>>>>>> 016666-033332 -> request 16666
>>>>>>>> 033332-049998 -> request 16666
>>>>>>>> 049998-066664 -> request 16666
>>>>>>>> 066664-083330 -> request 16666
>>>>>>>> 083330- -> request 16671
>>>>>>>>
>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>>>>
>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>>>>
>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>>>>
>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>>>>
>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>>>>
>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>>>>>>
>>>>>>>> The parallel scan performs much worse than the simple scan, and I don't know
>>>>>>>> why the simple scan is so fast; it's actually much faster than executing a
>>>>>>>> "count" from the hbase shell, which doesn't look normal. The only case where
>>>>>>>> the parallel scan wins is against a normal scan with caching 1.
>>>>>>>>
>>>>>>>> Any clue about it?
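
To make the pattern concrete, here is a minimal sketch of the per-region parallel scan we have been discussing, using the same 0.98-era client API that appears in the quoted code. The class name, table name, pool size, and row counting are illustrative only, not your implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ParallelRegionScanSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        final TableName tableName = TableName.valueOf("test");   // illustrative table name

        final HConnection connection = HConnectionManager.createConnection(conf);
        HBaseAdmin admin = new HBaseAdmin(connection);
        List<HRegionInfo> regions = admin.getTableRegions(tableName);

        // One task per region; each Scan is bounded by that region's start/end keys,
        // so no single task should ever cross a region boundary.
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        List<Future<Long>> futures = new ArrayList<Future<Long>>();
        for (final HRegionInfo region : regions) {
            futures.add(pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    HTableInterface table = connection.getTable(tableName);
                    try {
                        Scan scan = new Scan(region.getStartKey(), region.getEndKey());
                        scan.setCaching(1000);
                        scan.setCacheBlocks(false);
                        ResultScanner scanner = table.getScanner(scan);
                        long rows = 0;
                        for (Result result : scanner) {
                            rows++;            // count rows instead of buffering every Result in memory
                        }
                        scanner.close();
                        return rows;
                    } finally {
                        table.close();
                    }
                }
            }));
        }

        long total = 0;
        for (Future<Long> future : futures) {
            total += future.get();             // join the partial results
        }
        pool.shutdown();
        admin.close();
        connection.close();

        System.out.println("rows scanned: " + total);
    }
}

Note that this sketch shares a single HConnection across the worker threads rather than creating a new one inside each call(); that alone can matter when each task only scans roughly 16k small rows, since connection setup is not free.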
