Yes, that's right. I have changed the HBase version to 0.98.
I got the start and stop keys with this method:
private List<RegionScanner> generatePartitions() {
    List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
    HConnection connection = null;
    HBaseAdmin hbaseAdmin = null;
    try {
        connection =
            HConnectionManager.createConnection(HBaseConfiguration.create());
        hbaseAdmin = new HBaseAdmin(connection);
        // One partition per region: each scan covers the [startKey, endKey)
        // range of a single region.
        List<HRegionInfo> regions =
            hbaseAdmin.getTableRegions(scanConfiguration.getTable());
        for (HRegionInfo region : regions) {
            byte[] startKey = region.getStartKey();
            byte[] stopKey = region.getEndKey();
            // regionScanner = createRegionScanner(startKey, stopKey);
            regionScanners.add(
                new RegionScanner(startKey, stopKey, scanConfiguration));
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        try {
            if (hbaseAdmin != null) hbaseAdmin.close();
            if (connection != null) connection.close();
        } catch (IOException e) {
            // ignore close failures
        }
    }
    return regionScanners;
}
And I execute the RegionScanner with this:
public List<Result> call() throws Exception {
    HConnection connection =
        HConnectionManager.createConnection(HBaseConfiguration.create());
    HTableInterface table = connection.getTable(configuration.getTable());
    try {
        Scan scan = new Scan(startKey, stopKey);
        scan.setBatch(configuration.getBatch());
        scan.setCaching(configuration.getCaching());
        ResultScanner resultScanner = table.getScanner(scan);
        List<Result> results = new ArrayList<Result>();
        try {
            for (Result result : resultScanner) {
                results.add(result);
            }
        } finally {
            resultScanner.close();
        }
        return results;
    } finally {
        // Close the table first, then the connection.
        table.close();
        connection.close();
    }
}
The RegionScanners implement Callable.
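In case it's useful, the driver that runs them in parallel looks roughly like this (a simplified sketch, not my exact code; the method name distributedScan, the pool size, and the way results are merged are just illustrative):

public List<Result> distributedScan() throws Exception {
    List<RegionScanner> regionScanners = generatePartitions();
    // One thread per region scanner; the pool size is illustrative.
    ExecutorService executor =
        Executors.newFixedThreadPool(regionScanners.size());
    try {
        // invokeAll() blocks until every partial scan has finished.
        List<Future<List<Result>>> futures = executor.invokeAll(regionScanners);
        List<Result> allResults = new ArrayList<Result>();
        for (Future<List<Result>> future : futures) {
            allResults.addAll(future.get());
        }
        return allResults;
    } finally {
        executor.shutdown();
    }
}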
2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
> Let's take a step back…
>
> Your parallel scan is having the client create N threads where in each
> thread, you’re doing a partial scan of the table where each partial scan
> takes the first and last row of each region?
>
> Is that correct?
>
> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>
> > I was checking it a little bit more. I checked the cluster and the data
> > is stored in three different region servers, each one on a different
> > node. So, I guess the threads go to different hard disks.
> >
> > If someone has an idea or suggestion about why a single scan is faster
> > than this implementation... I based mine on this implementation:
> > https://github.com/zygm0nt/hbase-distributed-search
> >
> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
> >
> >> I'm working with HBase 0.94 in this case; I'll try with 0.98, although
> >> there shouldn't be any difference.
> >> I disabled the table and disabled the block cache for that family, and I
> >> set scan.setCacheBlocks(false) as well in both cases.
> >>
> >> I think it's not possible that I'm executing a complete scan in each
> >> thread, since my data are of the type:
> >> 000001 f:q value=1
> >> 000002 f:q value=2
> >> 000003 f:q value=3
> >> ...
> >>
> >> I add up all the values and get the same result with a single scan as
> >> with a distributed one, so I guess the DistributedScan worked correctly.
> >> The count from the hbase shell takes about 10-15 seconds, I don't
> >> remember exactly, but something like 4x the scan time.
> >> I'm not using any filter for the scans.
> >>
> >> This is the way I calculate the number of regions/scans:
> >> private List<RegionScanner> generatePartitions() {
> >>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >>     byte[] startKey;
> >>     byte[] stopKey;
> >>     HConnection connection = null;
> >>     HBaseAdmin hbaseAdmin = null;
> >>     try {
> >>         connection =
> >>             HConnectionManager.createConnection(HBaseConfiguration.create());
> >>         hbaseAdmin = new HBaseAdmin(connection);
> >>         List<HRegionInfo> regions =
> >>             hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>         RegionScanner regionScanner = null;
> >>         for (HRegionInfo region : regions) {
> >>             startKey = region.getStartKey();
> >>             stopKey = region.getEndKey();
> >>             regionScanner = new RegionScanner(startKey, stopKey,
> >>                 scanConfiguration);
> >>             // regionScanner = createRegionScanner(startKey, stopKey);
> >>             if (regionScanner != null) {
> >>                 regionScanners.add(regionScanner);
> >>             }
> >>         }
> >>
> >> I did some tests on a tiny table and I think that the range for each
> >> scan works fine. Still, I found it interesting that the time when I
> >> execute the distributed scan is about 6x.
> >>
> >> I'm going to check the hard disks, but I think they're fine.
> >>
> >>
> >>
> >>
> >> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
> >>
> >>> Which version of HBase?
> >>> Can you show us the code?
> >>>
> >>>
> >>> Your parallel scan with caching 100 takes about 6x as long as the
> >>> single scan, which is suspicious because you say you have 6 regions.
> >>> Are you sure you're not accidentally scanning all the data in each of
> >>> your parallel scans?
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Guillermo Ortiz <[email protected]>
> >>> To: "[email protected]" <[email protected]>
> >>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>> Subject: Scan vs Parallel scan.
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I developed a distributed scan that creates a thread for each region.
> >>> After that, I've tried to compare times for Scan vs. DistributedScan.
> >>> I have disabled the block cache on my table. My cluster has 3 region
> >>> servers with 2 regions each; in total there are 100,000 rows, and I
> >>> execute a complete scan.
> >>>
> >>> My partitions are:
> >>> -016666 -> request 16665
> >>> 016666-033332 -> request 16666
> >>> 033332-049998 -> request 16666
> >>> 049998-066664 -> request 16666
> >>> 066664-083330 -> request 16666
> >>> 083330- -> request 16671
> >>>
> >>>
> >>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2
> >>> -> Caching 10
> >>>
> >>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2
> >>> -> Caching 100
> >>>
> >>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2
> >>> -> Caching 1000
> >>>
> >>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2
> >>> -> Caching 1
> >>>
> >>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2
> >>> -> Caching 100
> >>>
> >>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2
> >>> -> Caching 1000
> >>>
> >>> The parallel scan works much worse than the simple scan, and I don't
> >>> know why the simple scan is so fast; it's really much faster than
> >>> executing a "count" from the hbase shell, which doesn't look very
> >>> normal. The only time the parallel version works better is against a
> >>> normal scan with caching 1.
> >>>
> >>> Any clue about it?
> >>>
> >>
> >>
>
>