Right. For example, say my table has keys 0-9 in three regions: 0-2, 3-7, 7-9. I launch three partial scans in parallel: scan(0,2), scan(3,7), scan(7,9). Each region is on a different RS, so each thread goes to a different RS. The real setup is not exactly like that, but that is how it works in the benchmark case.
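
To make the ranges concrete, here is a minimal sketch of that layout. It does not use HBase at all: the table is simulated with a TreeMap, and the class and method names (ParallelRangeScan, buildTable, partialScan, scanAll) are made up for illustration. The only HBase behavior it assumes is that a Scan's start row is inclusive and its stop row is exclusive, which is why the ranges are written here as [0,3), [3,7), [7,10) so that no key is skipped or read twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScan {

    // Stand-in for the table: sorted row key -> value (hypothetical, not an HBase API).
    static NavigableMap<Integer, Integer> buildTable() {
        NavigableMap<Integer, Integer> table = new TreeMap<>();
        for (int key = 0; key <= 9; key++) {
            table.put(key, key);
        }
        return table;
    }

    // One partial scan: start key inclusive, stop key exclusive.
    static Callable<List<Integer>> partialScan(NavigableMap<Integer, Integer> table,
                                               int start, int stop) {
        return () -> new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    // Runs one task per range in its own thread and merges results in range order.
    static List<Integer> scanAll(NavigableMap<Integer, Integer> table, int[][] ranges) {
        ExecutorService pool = Executors.newFixedThreadPool(ranges.length);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int[] range : ranges) {
                futures.add(pool.submit(partialScan(table, range[0], range[1])));
            }
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[][] ranges = { {0, 3}, {3, 7}, {7, 10} };
        List<Integer> rows = scanAll(buildTable(), ranges);
        System.out.println(rows.size() + " rows: " + rows);
    }
}
```

With half-open ranges the merged result contains each of the 10 keys exactly once. If the ranges were treated as inclusive on both ends, key 7 would be read by two of the scans.
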
Actually, the code executes one thread per region, not per RegionServer. But in the test I only have two regions per RegionServer. I don't think that's an important point; there are just two threads per RS.

2014-09-12 14:48 GMT+02:00 Michael Segel <[email protected]>:

> Ok, let's again take a step back…
>
> So you are comparing your partial scan(s) against a full table scan?
>
> If I understood your question, you launch 3 partial scans where you set
> the start row and then end row of each scan, right?
>
> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:
>
> > Okay, then, the partial scan doesn't work as I think.
> > How could it exceed the limit of a single region if I calculate the limits?
> >
> > The only bad point that I see is that if a region server has three
> > regions of the same table, I'm executing three partial scans against this RS
> > and they could compete for resources (network, etc.) on this node. It'd be
> > better to have one thread per RS. But that doesn't answer your questions.
> >
> > I keep thinking...
> >
> > 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
> >
> >> Hi,
> >>
> >> I wanted to take a step back from the actual code and to stop and think
> >> about what you are doing and what HBase is doing under the covers.
> >>
> >> So in your code, you are asking HBase to do 3 separate scans and then you
> >> take the result set back and join it.
> >>
> >> What does HBase do when it does a range scan?
> >> What happens when that range scan exceeds a single region?
> >>
> >> If you answer those questions… you'll have your answer.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
> >>
> >>> It's not all the code, I set things like these as well:
> >>> scan.setMaxVersions();
> >>> scan.setCacheBlocks(false);
> >>> ...
> >>>
> >>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
> >>>
> >>>> Yes, that's it. I have changed the HBase version to 0.98.
> >>>>
> >>>> I got the start and stop keys with this method:
> >>>> private List<RegionScanner> generatePartitions() {
> >>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >>>>     byte[] startKey;
> >>>>     byte[] stopKey;
> >>>>     HConnection connection = null;
> >>>>     HBaseAdmin hbaseAdmin = null;
> >>>>     try {
> >>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>         hbaseAdmin = new HBaseAdmin(connection);
> >>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>         RegionScanner regionScanner = null;
> >>>>         for (HRegionInfo region : regions) {
> >>>>
> >>>>             startKey = region.getStartKey();
> >>>>             stopKey = region.getEndKey();
> >>>>
> >>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >>>>             // regionScanner = createRegionScanner(startKey, stopKey);
> >>>>             if (regionScanner != null) {
> >>>>                 regionScanners.add(regionScanner);
> >>>>             }
> >>>>         }
> >>>>
> >>>> And I execute the RegionScanner with this:
> >>>> public List<Result> call() throws Exception {
> >>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>     HTableInterface table = connection.getTable(configuration.getTable());
> >>>>
> >>>>     Scan scan = new Scan(startKey, stopKey);
> >>>>     scan.setBatch(configuration.getBatch());
> >>>>     scan.setCaching(configuration.getCaching());
> >>>>     ResultScanner resultScanner = table.getScanner(scan);
> >>>>
> >>>>     List<Result> results = new ArrayList<Result>();
> >>>>     for (Result result : resultScanner) {
> >>>>         results.add(result);
> >>>>     }
> >>>>
> >>>>     connection.close();
> >>>>     table.close();
> >>>>
> >>>>     return results;
> >>>> }
> >>>>
> >>>> They implement Callable.
> >>>>
> >>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
> >>>>
> >>>>> Let's take a step back…
> >>>>>
> >>>>> Your parallel scan is having the client create N threads where in each
> >>>>> thread, you're doing a partial scan of the table where each partial scan
> >>>>> takes the first and last row of each region?
> >>>>>
> >>>>> Is that correct?
> >>>>>
> >>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
> >>>>>
> >>>>>> I looked into this a little more. I checked the cluster and the data is
> >>>>>> stored on three different region servers, each one on a different node.
> >>>>>> So I guess the threads go to different hard disks.
> >>>>>>
> >>>>>> Does anyone have an idea or suggestion as to why a single scan is faster
> >>>>>> than this implementation? I based it on this implementation:
> >>>>>> https://github.com/zygm0nt/hbase-distributed-search
> >>>>>>
> >>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
> >>>>>>
> >>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
> >>>>>>> I expect no difference.
> >>>>>>> I disabled the table and disabled the blockcache for that family, and I
> >>>>>>> set scan.setCacheBlocks(false) as well for both cases.
> >>>>>>>
> >>>>>>> I don't think it's possible that I'm executing a complete scan in each
> >>>>>>> thread, since my data look like this:
> >>>>>>> 000001 f:q value=1
> >>>>>>> 000002 f:q value=2
> >>>>>>> 000003 f:q value=3
> >>>>>>> ...
> >>>>>>>
> >>>>>>> I add up all the values and get the same result from a single scan as
> >>>>>>> from a distributed one, so I guess that DistributedScan did well.
> >>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
> >>>>>>> remember exactly, but about 4x the scan time.
> >>>>>>> I'm not using any filter for the scans.
> >>>>>>>
> >>>>>>> This is the way I calculate the number of regions/scans:
> >>>>>>> private List<RegionScanner> generatePartitions() {
> >>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >>>>>>>     byte[] startKey;
> >>>>>>>     byte[] stopKey;
> >>>>>>>     HConnection connection = null;
> >>>>>>>     HBaseAdmin hbaseAdmin = null;
> >>>>>>>     try {
> >>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> >>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>>>>         RegionScanner regionScanner = null;
> >>>>>>>         for (HRegionInfo region : regions) {
> >>>>>>>
> >>>>>>>             startKey = region.getStartKey();
> >>>>>>>             stopKey = region.getEndKey();
> >>>>>>>
> >>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
> >>>>>>>             if (regionScanner != null) {
> >>>>>>>                 regionScanners.add(regionScanner);
> >>>>>>>             }
> >>>>>>>         }
> >>>>>>>
> >>>>>>> I did some tests on a tiny table and I think that the range for each
> >>>>>>> scan works fine. Still, I thought it was interesting that the time when I
> >>>>>>> execute the distributed scan is about 6x.
> >>>>>>>
> >>>>>>> I'm going to check the hard disks, but I think that it's right.
> >>>>>>>
> >>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
> >>>>>>>
> >>>>>>>> Which version of HBase?
> >>>>>>>> Can you show us the code?
> >>>>>>>>
> >>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
> >>>>>>>> scan, which is suspicious because you say you have 6 regions.
> >>>>>>>> Are you sure you're not accidentally scanning all the data in each of
> >>>>>>>> your parallel scans?
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: Guillermo Ortiz <[email protected]>
> >>>>>>>> To: "[email protected]" <[email protected]>
> >>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>>>>>>> Subject: Scan vs Parallel scan.
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I developed a distributed scan that creates a thread for each region.
> >>>>>>>> After that, I tried to compare timings of Scan vs DistributedScan.
> >>>>>>>> I have disabled the blockcache in my table. My cluster has 3 region
> >>>>>>>> servers with 2 regions each; in total there are 100.000 rows, and I
> >>>>>>>> execute a complete scan.
> >>>>>>>>
> >>>>>>>> My partitions are
> >>>>>>>> -01666 -> request 16665
> >>>>>>>> 016666-033332 -> request 16666
> >>>>>>>> 033332-049998 -> request 16666
> >>>>>>>> 049998-066664 -> request 16666
> >>>>>>>> 066664-083330 -> request 16666
> >>>>>>>> 083330- -> request 16671
> >>>>>>>>
> >>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> >>>>>>>>
> >>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> >>>>>>>>
> >>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> >>>>>>>>
> >>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> >>>>>>>>
> >>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> >>>>>>>>
> >>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> >>>>>>>>
> >>>>>>>> The parallel scan performs much worse than the simple scan, and I don't
> >>>>>>>> know why the simple scan is so fast. It's much faster than executing a
> >>>>>>>> "count" from the hbase shell, which doesn't look normal. The only case
> >>>>>>>> where the parallel scan wins is against a normal scan with caching 1.
> >>>>>>>>
> >>>>>>>> Any clue about it?
