Re: Scan vs Parallel scan.

Guillermo Ortiz Fri, 12 Sep 2014 06:05:57 -0700

Right. My table, for example, has keys between 0-9 in three regions:
0-2, 3-7, 7-9.
I launch three partial scans in parallel. The scans that I'm executing are:
scan(0,2), scan(3,7), scan(7,9).
Each region is on a different RS, so each thread goes to a different RS. It's
not exactly like that, but in the benchmark case that is effectively how it works.
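
For reference, an HBase scan treats its start row as inclusive and its stop row as exclusive by default, and region boundaries follow the same rule, so scans built from consecutive region boundaries should see each row exactly once. Below is a minimal sketch of that half-open behaviour. It uses a plain TreeMap as a stand-in for the table and calls no HBase API; the class HalfOpenRangeSketch, the method scan, and the single-digit keys and boundaries are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class HalfOpenRangeSketch {

    // Stand-in for a table: sorted row keys "0".."9", each mapped to its numeric value.
    static TreeMap<String, Integer> sampleTable() {
        TreeMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i <= 9; i++) {
            table.put(Integer.toString(i), i);
        }
        return table;
    }

    // Returns the row keys in [startRow, stopRow). An empty string means "unbounded",
    // mirroring the empty start/end keys of the first and last region.
    static List<String> scan(TreeMap<String, Integer> table, String startRow, String stopRow) {
        List<String> rows = new ArrayList<>();
        for (String key : table.keySet()) {
            boolean afterStart = startRow.isEmpty() || key.compareTo(startRow) >= 0;
            boolean beforeStop = stopRow.isEmpty() || key.compareTo(stopRow) < 0;
            if (afterStart && beforeStop) {
                rows.add(key);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> table = sampleTable();
        // Hypothetical region boundaries: ["", "3"), ["3", "7"), ["7", "").
        String[][] regions = { {"", "3"}, {"3", "7"}, {"7", ""} };
        int total = 0;
        for (String[] region : regions) {
            List<String> rows = scan(table, region[0], region[1]);
            total += rows.size();
            System.out.println("[" + region[0] + ", " + region[1] + ") -> " + rows);
        }
        System.out.println("rows seen: " + total + " of " + table.size());
    }
}
```

Because the boundary key "3" belongs only to the second range and "7" only to the third, no row is read twice even though neighbouring ranges share a boundary value.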


Really, the code starts a thread for each region, not for each
RegionServer. But in the test I only have two regions per RegionServer. I
don't think that's an important point; there are two threads per RS.
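
The difference between one thread per region and one thread per RegionServer can be sketched as follows. This is only an illustration under stated assumptions: it uses plain java.util.concurrent and no HBase API, the class PerServerScanSketch, the type RegionRange, the server names and the integer keys are all hypothetical, and fakeScan merely sums keys in place of a real partial scan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerServerScanSketch {

    // Hypothetical description of one region: hosting server plus [startKey, stopKey).
    static final class RegionRange {
        final String server;
        final int startKey;
        final int stopKey;

        RegionRange(String server, int startKey, int stopKey) {
            this.server = server;
            this.startKey = startKey;
            this.stopKey = stopKey;
        }
    }

    // Groups regions by the server that hosts them, preserving input order.
    static Map<String, List<RegionRange>> groupByServer(List<RegionRange> regions) {
        Map<String, List<RegionRange>> byServer = new LinkedHashMap<>();
        for (RegionRange region : regions) {
            byServer.computeIfAbsent(region.server, s -> new ArrayList<>()).add(region);
        }
        return byServer;
    }

    // Stand-in for a partial scan: sums the keys in [startKey, stopKey).
    static long fakeScan(RegionRange region) {
        long sum = 0;
        for (int key = region.startKey; key < region.stopKey; key++) {
            sum += key;
        }
        return sum;
    }

    // Runs one task per server; each task scans that server's regions one after another.
    static long scanOneThreadPerServer(List<RegionRange> regions) {
        Map<String, List<RegionRange>> byServer = groupByServer(regions);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, byServer.size()));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<RegionRange> serverRegions : byServer.values()) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (RegionRange region : serverRegions) {
                        sum += fakeScan(region);
                    }
                    return sum;
                };
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scan task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    static List<RegionRange> sampleRegions() {
        List<RegionRange> regions = new ArrayList<>();
        String[] servers = { "rs1", "rs2", "rs3" };
        for (int i = 0; i < 6; i++) {
            // Two consecutive regions of 10 keys each per server.
            regions.add(new RegionRange(servers[i / 2], i * 10, (i + 1) * 10));
        }
        return regions;
    }

    public static void main(String[] args) {
        System.out.println("sum of keys: " + scanOneThreadPerServer(sampleRegions()));
    }
}
```

With six regions spread over three servers, this runs three tasks instead of six, so two scans never hit the same server at the same time. Whether that helps in a real cluster would have to be measured.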

2014-09-12 14:48 GMT+02:00 Michael Segel <[email protected]>:

> OK, let's again take a step back…
>
> So you are comparing your partial scan(s) against a full table scan?
>
> If I understood your question, you launch 3 partial scans where you set
> the start row and the end row of each scan, right?
>
> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:
>
> > Okay, then the partial scan doesn't work the way I thought.
> > How could it exceed the limits of a single region if I calculate the limits?
> >
> >
> > The only drawback I see is that if a region server has three regions of
> > the same table, I'm executing three partial scans against this RS, and
> > they could compete for resources (network, etc.) on this node. It'd be
> > better to have one thread per RS. But that doesn't answer your questions.
> >
> > I keep thinking...
> >
> > 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
> >
> >> Hi,
> >>
> >> I wanted to take a step back from the actual code and to stop and think
> >> about what you are doing and what HBase is doing under the covers.
> >>
> >> So in your code, you are asking HBase to do 3 separate scans and then you
> >> take the result set back and join it.
> >>
> >> What does HBase do when it does a range scan?
> >> What happens when that range scan exceeds a single region?
> >>
> >> If you answer those questions… you’ll have your answer.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
> >>
> >>> That's not all the code; I set things like these as well:
> >>> scan.setMaxVersions();
> >>> scan.setCacheBlocks(false);
> >>> ...
> >>>
> >>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
> >>>
> >>>> Yes, that's right. I have changed the HBase version to 0.98.
> >>>>
> >>>> I got the start and stop keys with this method:
> >>>> private List<RegionScanner> generatePartitions() {
> >>>>       List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >>>>       byte[] startKey;
> >>>>       byte[] stopKey;
> >>>>       HConnection connection = null;
> >>>>       HBaseAdmin hbaseAdmin = null;
> >>>>       try {
> >>>>           connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>           hbaseAdmin = new HBaseAdmin(connection);
> >>>>           List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>           RegionScanner regionScanner = null;
> >>>>           for (HRegionInfo region : regions) {
> >>>>
> >>>>               startKey = region.getStartKey();
> >>>>               stopKey = region.getEndKey();
> >>>>
> >>>>               regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >>>>               // regionScanner = createRegionScanner(startKey, stopKey);
> >>>>               if (regionScanner != null) {
> >>>>                   regionScanners.add(regionScanner);
> >>>>               }
> >>>>           }
> >>>>
> >>>> And I execute the RegionScanner with this:
> >>>> public List<Result> call() throws Exception {
> >>>>       HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>       HTableInterface table = connection.getTable(configuration.getTable());
> >>>>
> >>>>       Scan scan = new Scan(startKey, stopKey);
> >>>>       scan.setBatch(configuration.getBatch());
> >>>>       scan.setCaching(configuration.getCaching());
> >>>>       ResultScanner resultScanner = table.getScanner(scan);
> >>>>
> >>>>       List<Result> results = new ArrayList<Result>();
> >>>>       for (Result result : resultScanner) {
> >>>>           results.add(result);
> >>>>       }
> >>>>
> >>>>       // Release resources innermost first: scanner, then table, then connection.
> >>>>       resultScanner.close();
> >>>>       table.close();
> >>>>       connection.close();
> >>>>
> >>>>       return results;
> >>>> }
> >>>>
> >>>> They implement Callable.
> >>>>
> >>>>
> >>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
> >>>>
> >>>>> Let's take a step back…
> >>>>>
> >>>>> Your parallel scan has the client create N threads, where in each
> >>>>> thread you’re doing a partial scan of the table, and each partial scan
> >>>>> takes the first and last row of each region?
> >>>>>
> >>>>> Is that correct?
> >>>>>
> >>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
> >>>>>
> >>>>>> I checked a little bit more. I looked at the cluster, and the data is
> >>>>>> stored on three different region servers, each one on a different node.
> >>>>>> So, I guess the threads go to different hard disks.
> >>>>>>
> >>>>>> Does anyone have an idea or suggestion why a single scan is faster than
> >>>>>> this implementation? I based it on this implementation:
> >>>>>> https://github.com/zygm0nt/hbase-distributed-search
> >>>>>>
> >>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
> >>>>>>
> >>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
> >>>>>>> I don't think there is a difference.
> >>>>>>> I disabled the table and disabled the block cache for that family, and I
> >>>>>>> set scan.setCacheBlocks(false) as well for both cases.
> >>>>>>>
> >>>>>>> I don't think it's possible that I'm executing a complete scan in each
> >>>>>>> thread, since my data looks like this:
> >>>>>>> 000001 f:q value=1
> >>>>>>> 000002 f:q value=2
> >>>>>>> 000003 f:q value=3
> >>>>>>> ...
> >>>>>>>
> >>>>>>> I add up all the values and get the same result with a single scan as
> >>>>>>> with a distributed one, so I guess that DistributedScan worked correctly.
> >>>>>>> The count from the hbase shell takes about 10-15 seconds (I don't remember
> >>>>>>> exactly), something like 4x the scan time.
> >>>>>>> I'm not using any filter for the scans.
> >>>>>>>
> >>>>>>> This is the way I calculate the number of regions/scans:
> >>>>>>> private List<RegionScanner> generatePartitions() {
> >>>>>>>      List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >>>>>>>      byte[] startKey;
> >>>>>>>      byte[] stopKey;
> >>>>>>>      HConnection connection = null;
> >>>>>>>      HBaseAdmin hbaseAdmin = null;
> >>>>>>>      try {
> >>>>>>>          connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> >>>>>>>          List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>>>>          RegionScanner regionScanner = null;
> >>>>>>>          for (HRegionInfo region : regions) {
> >>>>>>>
> >>>>>>>              startKey = region.getStartKey();
> >>>>>>>              stopKey = region.getEndKey();
> >>>>>>>
> >>>>>>>              regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >>>>>>>              // regionScanner = createRegionScanner(startKey, stopKey);
> >>>>>>>              if (regionScanner != null) {
> >>>>>>>                  regionScanners.add(regionScanner);
> >>>>>>>              }
> >>>>>>>          }
> >>>>>>>
> >>>>>>> I did some tests on a tiny table, and I think that the range for each
> >>>>>>> scan works fine. Still, I thought it was interesting that the distributed
> >>>>>>> scan takes about 6x as long.
> >>>>>>>
> >>>>>>> I'm going to check the hard disks, but I think that it's right.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
> >>>>>>>
> >>>>>>>> Which version of HBase?
> >>>>>>>> Can you show us the code?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
> >>>>>>>> scan, which is suspicious because you say you have 6 regions.
> >>>>>>>> Are you sure you're not accidentally scanning all the data in each of
> >>>>>>>> your parallel scans?
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: Guillermo Ortiz <[email protected]>
> >>>>>>>> To: "[email protected]" <[email protected]>
> >>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>>>>>>> Subject: Scan vs Parallel scan.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I developed a distributed scan; I create a thread for each region. After
> >>>>>>>> that, I tried to compare the timings of Scan vs DistributedScan.
> >>>>>>>> I have disabled the block cache in my table. My cluster has 3 region
> >>>>>>>> servers with 2 regions each; in total there are 100.000 rows, and I
> >>>>>>>> execute a complete scan.
> >>>>>>>>
> >>>>>>>> My partitions are
> >>>>>>>> -01666 -> request 16665
> >>>>>>>> 016666-033332 -> request 16666
> >>>>>>>> 033332-049998 -> request 16666
> >>>>>>>> 049998-066664 -> request 16666
> >>>>>>>> 066664-083330 -> request 16666
> >>>>>>>> 083330- -> request 16671
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> >>>>>>>>
> >>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> >>>>>>>>
> >>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> >>>>>>>>
> >>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> >>>>>>>>
> >>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> >>>>>>>>
> >>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> >>>>>>>>
> >>>>>>>> Parallel scan performs much worse than the simple scan, and I don't know
> >>>>>>>> why the simple scan is so fast. It's really much faster than executing a
> >>>>>>>> "count" from the hbase shell, which doesn't look normal. The only time
> >>>>>>>> the parallel scan does better is when I execute the normal scan with
> >>>>>>>> caching 1.
> >>>>>>>>
> >>>>>>>> Any clue about it?
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>
> >>
>
>
