I'm working with HBase 0.94 in this case; I'll try with 0.98, although
there is no difference.
I disabled the table and disabled the block cache for that column family,
and I also set scan.setCacheBlocks(false) in both cases.
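In case it helps, this is roughly how each scan is configured (a minimal
sketch, assuming the 0.94 client API; scanConfiguration.getTable() is the
same table reference used in the code below, the other names are
illustrative, and the enclosing method is assumed to declare throws
IOException):

    HTable table = new HTable(HBaseConfiguration.create(), scanConfiguration.getTable());
    Scan scan = new Scan(startKey, stopKey);
    scan.setCacheBlocks(false); // don't go through the block cache for this scan
    scan.setCaching(100);       // rows fetched per RPC; this value was varied in the tests
    ResultScanner results = table.getScanner(scan);
    for (Result result : results) {
        // read f:q and add its value to a running total
    }
    results.close();
    table.close();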
I don't think it's possible that I'm executing a complete scan in each
thread, since my data looks like this:
000001 f:q value=1
000002 f:q value=2
000003 f:q value=3
...
I add up all the values and get the same result from a single scan as from
the distributed one, so I guess that the DistributedScan works correctly.
The count from the hbase shell takes about 10-15 seconds, I don't remember
exactly, but roughly 4x the scan time.
I'm not using any filter for the scans.
This is the way I calculate the number of regions/scans:
private List<RegionScanner> generatePartitions() {
    List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
    byte[] startKey;
    byte[] stopKey;
    HConnection connection = null;
    HBaseAdmin hbaseAdmin = null;
    try {
        connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        hbaseAdmin = new HBaseAdmin(connection);
        // One scan range per region of the table.
        List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
        RegionScanner regionScanner = null;
        for (HRegionInfo region : regions) {
            startKey = region.getStartKey();
            stopKey = region.getEndKey();
            regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
            // regionScanner = createRegionScanner(startKey, stopKey);
            if (regionScanner != null) {
                regionScanners.add(regionScanner);
            }
        }
    } catch (IOException e) {
        // log / rethrow as appropriate
    } finally {
        // close hbaseAdmin and connection here (both are Closeable)
    }
    return regionScanners;
}
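Each RegionScanner is then run in its own thread, roughly like this (a
minimal sketch, assuming RegionScanner is a Callable<Long> that returns the
number of rows it read; the executor code here is illustrative, not my
actual implementation):

    ExecutorService executor = Executors.newFixedThreadPool(regionScanners.size());
    List<Future<Long>> futures = new ArrayList<Future<Long>>();
    for (RegionScanner regionScanner : regionScanners) {
        futures.add(executor.submit(regionScanner)); // one thread per region
    }
    long totalRows = 0;
    try {
        for (Future<Long> future : futures) {
            totalRows += future.get(); // wait for each region scan to finish
        }
    } catch (Exception e) {
        // log / handle
    } finally {
        executor.shutdown();
    }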
I did some tests on a tiny table and I think that the range for each scan
works fine. Still, I found it interesting that the distributed scan takes
about 6x as long.
I'm going to check the hard disks, but I think they're fine.
2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
> Which version of HBase?
> Can you show us the code?
>
>
> Your parallel scan with caching 100 takes about 6x as long as the single
> scan, which is suspicious because you say you have 6 regions.
> Are you sure you're not accidentally scanning all the data in each of your
> parallel scans?
>
> -- Lars
>
>
>
> ________________________________
> From: Guillermo Ortiz <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wednesday, September 10, 2014 1:40 AM
> Subject: Scan vs Parallel scan.
>
>
> Hi,
>
> I developed a distributed scan; I create a thread for each region. After
> that, I've tried to compare times for Scan vs DistributedScan.
> I have disabled the block cache in my table. My cluster has 3 region servers
> with 2 regions each; in total there are 100,000 rows and I execute a
> complete scan.
>
> My partitions are
> -016666 -> request 16665
> 016666-033332 -> request 16666
> 033332-049998 -> request 16666
> 049998-066664 -> request 16666
> 066664-083330 -> request 16666
> 083330- -> request 16671
>
>
> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
> Caching 10
>
> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 ->
> Caching 100
>
> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
> Caching 1000
>
> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
> Caching 1
>
> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
> Caching 100
>
> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
> Caching 1000
>
> The parallel scan works much worse than the simple scan, and I don't know
> why the simple scan is so fast; it's really much faster than executing a
> "count" from the hbase shell, which doesn't look normal. The only time the
> parallel scan works better is when the normal scan uses caching 1.
>
> Any clue about it?
>