That's not all the code; I set things like these as well: scan.setMaxVersions(); scan.setCacheBlocks(false); ...
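For reference, here is a consolidated sketch of the per-region Scan setup, pieced together from the snippets quoted below. It's only a sketch: the class and parameter names are mine, and the caching/batch values are just the ones under test.

    import org.apache.hadoop.hbase.client.Scan;

    public final class ScanFactory {
        // Sketch: build the Scan each region thread runs. startKey/stopKey are
        // the region boundaries returned by HBaseAdmin.getTableRegions().
        public static Scan buildRegionScan(byte[] startKey, byte[] stopKey,
                                           int caching, int batch) {
            Scan scan = new Scan(startKey, stopKey); // scan only this region's range
            scan.setMaxVersions();      // read all versions, as mentioned above
            scan.setCacheBlocks(false); // a full scan shouldn't churn the block cache
            scan.setCaching(caching);   // rows fetched per RPC round trip
            scan.setBatch(batch);       // cells per Result, for very wide rows
            return scan;
        }
    }

A few more sketches (a region-boundary sanity check, the executor driving the callables, and a reworked call() that shares one connection) follow after the quoted thread below.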
2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:

> Yes, that's it. I have changed the HBase version to 0.98.
>
> I get the start and stop keys with this method:
>
>     private List<RegionScanner> generatePartitions() {
>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>         byte[] startKey;
>         byte[] stopKey;
>         HConnection connection = null;
>         HBaseAdmin hbaseAdmin = null;
>         try {
>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>             hbaseAdmin = new HBaseAdmin(connection);
>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>             RegionScanner regionScanner = null;
>             for (HRegionInfo region : regions) {
>                 startKey = region.getStartKey();
>                 stopKey = region.getEndKey();
>
>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>                 // regionScanner = createRegionScanner(startKey, stopKey);
>                 if (regionScanner != null) {
>                     regionScanners.add(regionScanner);
>                 }
>             }
>
> And I execute the RegionScanner with this:
>
>     public List<Result> call() throws Exception {
>         HConnection connection =
>                 HConnectionManager.createConnection(HBaseConfiguration.create());
>         HTableInterface table = connection.getTable(configuration.getTable());
>
>         Scan scan = new Scan(startKey, stopKey);
>         scan.setBatch(configuration.getBatch());
>         scan.setCaching(configuration.getCaching());
>         ResultScanner resultScanner = table.getScanner(scan);
>
>         List<Result> results = new ArrayList<Result>();
>         for (Result result : resultScanner) {
>             results.add(result);
>         }
>
>         table.close();
>         connection.close();
>
>         return results;
>     }
>
> They implement Callable.
>
> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
>
>> Let's take a step back…
>>
>> Your parallel scan is having the client create N threads where in each
>> thread, you're doing a partial scan of the table where each partial scan
>> takes the first and last row of each region?
>>
>> Is that correct?
>>
>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>>
>> > I was checking a little bit more. I checked the cluster, and the data
>> > is stored in three different region servers, each one on a different
>> > node. So I guess the threads go to different hard disks.
>> >
>> > If someone has an idea or suggestion... why is a single scan faster
>> > than this implementation? I based it on this one:
>> > https://github.com/zygm0nt/hbase-distributed-search
>> >
>> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>> >
>> >> I'm working with HBase 0.94 in this case. I'll try with 0.98,
>> >> although there is no difference.
>> >> I disabled the table and turned off the block cache for that family,
>> >> and I set scan.setCacheBlocks(false) as well in both cases.
>> >>
>> >> I don't think I'm executing a complete scan in each thread, since my
>> >> data are of this type:
>> >> 000001 f:q value=1
>> >> 000002 f:q value=2
>> >> 000003 f:q value=3
>> >> ...
>> >>
>> >> I add up all the values and get the same result with a single scan as
>> >> with the distributed one, so I guess the DistributedScan works
>> >> correctly.
>> >> The count from the hbase shell takes about 10-15 seconds, I don't
>> >> remember exactly, but around 4x the scan time.
>> >> I'm not using any filter for the scans.
>> >>
>> >> This is the way I calculate the number of regions/scans:
>> >>
>> >>     private List<RegionScanner> generatePartitions() {
>> >>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>         byte[] startKey;
>> >>         byte[] stopKey;
>> >>         HConnection connection = null;
>> >>         HBaseAdmin hbaseAdmin = null;
>> >>         try {
>> >>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>             hbaseAdmin = new HBaseAdmin(connection);
>> >>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>             RegionScanner regionScanner = null;
>> >>             for (HRegionInfo region : regions) {
>> >>                 startKey = region.getStartKey();
>> >>                 stopKey = region.getEndKey();
>> >>
>> >>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> >>                 // regionScanner = createRegionScanner(startKey, stopKey);
>> >>                 if (regionScanner != null) {
>> >>                     regionScanners.add(regionScanner);
>> >>                 }
>> >>             }
>> >>
>> >> I did some tests with a tiny table and I think the range for each scan
>> >> works fine. Although, I thought it was interesting that the distributed
>> >> scan takes about 6x the time.
>> >>
>> >> I'm going to check the hard disks, but I think they're fine.
>> >>
>> >> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>> >>
>> >>> Which version of HBase?
>> >>> Can you show us the code?
>> >>>
>> >>> Your parallel scan with caching 100 takes about 6x as long as the
>> >>> single scan, which is suspicious because you say you have 6 regions.
>> >>> Are you sure you're not accidentally scanning all the data in each of
>> >>> your parallel scans?
>> >>>
>> >>> -- Lars
>> >>>
>> >>> ________________________________
>> >>> From: Guillermo Ortiz <[email protected]>
>> >>> To: "[email protected]" <[email protected]>
>> >>> Sent: Wednesday, September 10, 2014 1:40 AM
>> >>> Subject: Scan vs Parallel scan.
>> >>>
>> >>> Hi,
>> >>>
>> >>> I developed a distributed scan; I create a thread for each region.
>> >>> After that, I tried to compare times for Scan vs. DistributedScan.
>> >>> I have disabled the block cache on my table. My cluster has 3 region
>> >>> servers with 2 regions each; in total there are 100,000 rows, and I
>> >>> execute a complete scan.
>> >>>
>> >>> My partitions are:
>> >>> -016666        -> request 16665
>> >>> 016666-033332  -> request 16666
>> >>> 033332-049998  -> request 16666
>> >>> 049998-066664  -> request 16666
>> >>> 066664-083330  -> request 16666
>> >>> 083330-        -> request 16671
>> >>>
>> >>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>> >>>
>> >>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>> >>>
>> >>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>> >>>
>> >>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>> >>>
>> >>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>> >>>
>> >>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>> >>>
>> >>> The parallel scan works much worse than the simple scan, and I don't
>> >>> know why the simple scan is so fast; it's much faster than executing
>> >>> a "count" from the hbase shell, which doesn't look normal. The only
>> >>> case where the parallel version wins is against a normal scan with
>> >>> caching 1.
>> >>>
>> >>> Any clue about it?
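Since the main suspicion in the thread is that each thread might be scanning more than its own region, here is a quick sanity check that the region boundaries tile the table exactly once. A sketch only: it assumes getTableRegions() returns the regions in key order, and PartitionCheck/checkPartitions are hypothetical names.

    import java.util.List;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class PartitionCheck {
        // Each region's start key must equal the previous region's end key,
        // the first start key must be empty, and the last end key must be empty.
        public static void checkPartitions(HBaseAdmin admin, byte[] tableName)
                throws Exception {
            List<HRegionInfo> regions = admin.getTableRegions(tableName);
            byte[] expectedStart = HConstants.EMPTY_START_ROW;
            for (HRegionInfo region : regions) {
                if (!Bytes.equals(region.getStartKey(), expectedStart)) {
                    throw new IllegalStateException(
                            "gap or overlap before " + region.getRegionNameAsString());
                }
                expectedStart = region.getEndKey();
            }
            if (expectedStart.length != 0) {
                throw new IllegalStateException("last region has a non-empty end key");
            }
        }
    }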
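The thread only says the RegionScanners "implement Callable", so here is one guess at the driver, assuming RegionScanner implements Callable<List<Result>> as the quoted call() suggests. Note that RegionScanner here is the poster's own class, which shadows HBase's internal org.apache.hadoop.hbase.regionserver.RegionScanner interface; renaming it would avoid confusion.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.hbase.client.Result;

    public final class ParallelScanDriver {
        // Run one callable per region and merge the buffered results.
        public static List<Result> scanInParallel(List<RegionScanner> scanners)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(scanners.size());
            try {
                List<Future<List<Result>>> futures = pool.invokeAll(scanners);
                List<Result> all = new ArrayList<Result>();
                for (Future<List<Result>> future : futures) {
                    all.addAll(future.get()); // blocks until that region finishes
                }
                return all;
            } finally {
                pool.shutdown();
            }
        }
    }

One thing worth noting about this design: every thread buffers its whole region into a List<Result> and the driver merges the lists at the end, so the client holds all 100,000 rows in memory before doing anything with them, whereas a single scan streams rows as they arrive.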
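Finally, two details stand out in the quoted call(): each thread creates its own HConnection, which in 0.94/0.98 means its own ZooKeeper session and meta-cache warm-up, and the connection was closed before the table. Below is a sketch of call() reworked to use a single shared connection created once by the driver and handed to every callable; sharedConnection and configuration are illustrative field names.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public List<Result> call() throws Exception {
        // One HConnection for the whole job; only the lightweight table
        // handle is created per thread.
        HTableInterface table = sharedConnection.getTable(configuration.getTable());
        try {
            Scan scan = new Scan(startKey, stopKey);
            scan.setBatch(configuration.getBatch());
            scan.setCaching(configuration.getCaching());
            scan.setCacheBlocks(false);

            List<Result> results = new ArrayList<Result>();
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    results.add(result);
                }
            } finally {
                scanner.close(); // release the server-side scanner
            }
            return results;
        } finally {
            table.close(); // close the table; the shared connection stays open
        }
    }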
