That's not all the code; I set things like these as well: scan.setMaxVersions(); scan.setCacheBlocks(false); ...
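For reference, here is a consolidated sketch of the per-region Scan setup, pieced together from the snippets quoted below. It's only a sketch: the class and parameter names are mine, and the caching/batch values are just the ones under test.

    import org.apache.hadoop.hbase.client.Scan;

    public final class ScanFactory {
        // Sketch: build the Scan each region thread runs. startKey/stopKey are
        // the region boundaries returned by HBaseAdmin.getTableRegions().
        public static Scan buildRegionScan(byte[] startKey, byte[] stopKey,
                                           int caching, int batch) {
            Scan scan = new Scan(startKey, stopKey); // scan only this region's range
            scan.setMaxVersions();      // read all versions, as mentioned above
            scan.setCacheBlocks(false); // a full scan shouldn't churn the block cache
            scan.setCaching(caching);   // rows fetched per RPC round trip
            scan.setBatch(batch);       // cells per Result, for very wide rows
            return scan;
        }
    }

A few more sketches (a region-boundary sanity check, the executor driving the callables, and a reworked call() that shares one connection) follow after the quoted thread below.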
2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:

> Yes, that's it. I have changed the HBase version to 0.98.
>
> I get the start and stop keys with this method:
>
>     private List<RegionScanner> generatePartitions() {
>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>         byte[] startKey;
>         byte[] stopKey;
>         HConnection connection = null;
>         HBaseAdmin hbaseAdmin = null;
>         try {
>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>             hbaseAdmin = new HBaseAdmin(connection);
>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>             RegionScanner regionScanner = null;
>             for (HRegionInfo region : regions) {
>                 startKey = region.getStartKey();
>                 stopKey = region.getEndKey();
>
>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>                 // regionScanner = createRegionScanner(startKey, stopKey);
>                 if (regionScanner != null) {
>                     regionScanners.add(regionScanner);
>                 }
>             }
>
> And I execute the RegionScanner with this:
>
>     public List<Result> call() throws Exception {
>         HConnection connection =
>                 HConnectionManager.createConnection(HBaseConfiguration.create());
>         HTableInterface table = connection.getTable(configuration.getTable());
>
>         Scan scan = new Scan(startKey, stopKey);
>         scan.setBatch(configuration.getBatch());
>         scan.setCaching(configuration.getCaching());
>         ResultScanner resultScanner = table.getScanner(scan);
>
>         List<Result> results = new ArrayList<Result>();
>         for (Result result : resultScanner) {
>             results.add(result);
>         }
>
>         table.close();
>         connection.close();
>
>         return results;
>     }
>
> They implement Callable.
>
> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
>
>> Let's take a step back…
>>
>> Your parallel scan is having the client create N threads where in each
>> thread, you're doing a partial scan of the table where each partial scan
>> takes the first and last row of each region?
>>
>> Is that correct?
>>
>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>>
>> > I was checking a little bit more. I checked the cluster, and the data
>> > is stored in three different region servers, each one on a different
>> > node. So I guess the threads go to different hard disks.
>> >
>> > If someone has an idea or suggestion... why is a single scan faster
>> > than this implementation? I based it on this one:
>> > https://github.com/zygm0nt/hbase-distributed-search
>> >
>> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>> >
>> >> I'm working with HBase 0.94 in this case. I'll try with 0.98,
>> >> although there is no difference.
>> >> I disabled the table and turned off the block cache for that family,
>> >> and I set scan.setCacheBlocks(false) as well in both cases.
>> >>
>> >> I don't think I'm executing a complete scan in each thread, since my
>> >> data are of this type:
>> >> 000001 f:q value=1
>> >> 000002 f:q value=2
>> >> 000003 f:q value=3
>> >> ...
>> >>
>> >> I add up all the values and get the same result with a single scan as
>> >> with the distributed one, so I guess the DistributedScan works
>> >> correctly.
>> >> The count from the hbase shell takes about 10-15 seconds, I don't
>> >> remember exactly, but around 4x the scan time.
>> >> I'm not using any filter for the scans.
>> >>
>> >> This is the way I calculate the number of regions/scans:
>> >>
>> >>     private List<RegionScanner> generatePartitions() {
>> >>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>         byte[] startKey;
>> >>         byte[] stopKey;
>> >>         HConnection connection = null;
>> >>         HBaseAdmin hbaseAdmin = null;
>> >>         try {
>> >>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>             hbaseAdmin = new HBaseAdmin(connection);
>> >>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>             RegionScanner regionScanner = null;
>> >>             for (HRegionInfo region : regions) {
>> >>                 startKey = region.getStartKey();
>> >>                 stopKey = region.getEndKey();
>> >>
>> >>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> >>                 // regionScanner = createRegionScanner(startKey, stopKey);
>> >>                 if (regionScanner != null) {
>> >>                     regionScanners.add(regionScanner);
>> >>                 }
>> >>             }
>> >>
>> >> I did some tests with a tiny table and I think the range for each scan
>> >> works fine. Although, I thought it was interesting that the distributed
>> >> scan takes about 6x the time.
>> >>
>> >> I'm going to check the hard disks, but I think they're fine.
>> >>
>> >> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>> >>
>> >>> Which version of HBase?
>> >>> Can you show us the code?
>> >>>
>> >>> Your parallel scan with caching 100 takes about 6x as long as the
>> >>> single scan, which is suspicious because you say you have 6 regions.
>> >>> Are you sure you're not accidentally scanning all the data in each of
>> >>> your parallel scans?
>> >>>
>> >>> -- Lars
>> >>>
>> >>> ________________________________
>> >>> From: Guillermo Ortiz <[email protected]>
>> >>> To: "[email protected]" <[email protected]>
>> >>> Sent: Wednesday, September 10, 2014 1:40 AM
>> >>> Subject: Scan vs Parallel scan.
>> >>>
>> >>> Hi,
>> >>>
>> >>> I developed a distributed scan; I create a thread for each region.
>> >>> After that, I tried to compare times for Scan vs. DistributedScan.
>> >>> I have disabled the block cache on my table. My cluster has 3 region
>> >>> servers with 2 regions each; in total there are 100,000 rows, and I
>> >>> execute a complete scan.
>> >>>
>> >>> My partitions are:
>> >>> -016666        -> request 16665
>> >>> 016666-033332  -> request 16666
>> >>> 033332-049998  -> request 16666
>> >>> 049998-066664  -> request 16666
>> >>> 066664-083330  -> request 16666
>> >>> 083330-        -> request 16671
>> >>>
>> >>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>> >>>
>> >>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>> >>>
>> >>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>> >>>
>> >>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>> >>>
>> >>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>> >>>
>> >>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>> >>>
>> >>> The parallel scan works much worse than the simple scan, and I don't
>> >>> know why the simple scan is so fast; it's much faster than executing
>> >>> a "count" from the hbase shell, which doesn't look normal. The only
>> >>> case where the parallel version wins is against a normal scan with
>> >>> caching 1.
>> >>>
>> >>> Any clue about it?
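Since the main suspicion in the thread is that each thread might be scanning more than its own region, here is a quick sanity check that the region boundaries tile the table exactly once. A sketch only: it assumes getTableRegions() returns the regions in key order, and PartitionCheck/checkPartitions are hypothetical names.

    import java.util.List;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class PartitionCheck {
        // Each region's start key must equal the previous region's end key,
        // the first start key must be empty, and the last end key must be empty.
        public static void checkPartitions(HBaseAdmin admin, byte[] tableName)
                throws Exception {
            List<HRegionInfo> regions = admin.getTableRegions(tableName);
            byte[] expectedStart = HConstants.EMPTY_START_ROW;
            for (HRegionInfo region : regions) {
                if (!Bytes.equals(region.getStartKey(), expectedStart)) {
                    throw new IllegalStateException(
                            "gap or overlap before " + region.getRegionNameAsString());
                }
                expectedStart = region.getEndKey();
            }
            if (expectedStart.length != 0) {
                throw new IllegalStateException("last region has a non-empty end key");
            }
        }
    }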
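The thread only says the RegionScanners "implement Callable", so here is one guess at the driver, assuming RegionScanner implements Callable<List<Result>> as the quoted call() suggests. Note that RegionScanner here is the poster's own class, which shadows HBase's internal org.apache.hadoop.hbase.regionserver.RegionScanner interface; renaming it would avoid confusion.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.hbase.client.Result;

    public final class ParallelScanDriver {
        // Run one callable per region and merge the buffered results.
        public static List<Result> scanInParallel(List<RegionScanner> scanners)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(scanners.size());
            try {
                List<Future<List<Result>>> futures = pool.invokeAll(scanners);
                List<Result> all = new ArrayList<Result>();
                for (Future<List<Result>> future : futures) {
                    all.addAll(future.get()); // blocks until that region finishes
                }
                return all;
            } finally {
                pool.shutdown();
            }
        }
    }

One thing worth noting about this design: every thread buffers its whole region into a List<Result> and the driver merges the lists at the end, so the client holds all 100,000 rows in memory before doing anything with them, whereas a single scan streams rows as they arrive.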
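Finally, two details stand out in the quoted call(): each thread creates its own HConnection, which in 0.94/0.98 means its own ZooKeeper session and meta-cache warm-up, and the connection was closed before the table. Below is a sketch of call() reworked to use a single shared connection created once by the driver and handed to every callable; sharedConnection and configuration are illustrative field names.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public List<Result> call() throws Exception {
        // One HConnection for the whole job; only the lightweight table
        // handle is created per thread.
        HTableInterface table = sharedConnection.getTable(configuration.getTable());
        try {
            Scan scan = new Scan(startKey, stopKey);
            scan.setBatch(configuration.getBatch());
            scan.setCaching(configuration.getCaching());
            scan.setCacheBlocks(false);

            List<Result> results = new ArrayList<Result>();
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    results.add(result);
                }
            } finally {
                scanner.close(); // release the server-side scanner
            }
            return results;
        } finally {
            table.close(); // close the table; the shared connection stays open
        }
    }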
