OK, let's again take a step back… So you are comparing your partial scan(s) against a full table scan?
If I understood your question, you launch 3 partial scans where you set the start row and the end row of each scan, right?

On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:

> Okay, then the partial scan doesn't work the way I thought.
> How could it exceed the limit of a single region if I calculate the limits?
>
> The only bad point that I see is that if a region server hosts three
> regions of the same table, I'm executing three partial scans against that RS
> and they could compete for resources (network, etc.) on that node. It'd be
> better to have one thread per RS. But that doesn't answer your questions.
>
> I keep thinking...
>
> 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
>
>> Hi,
>>
>> I wanted to take a step back from the actual code and to stop and think
>> about what you are doing and what HBase is doing under the covers.
>>
>> So in your code, you are asking HBase to do 3 separate scans and then you
>> take the result set back and join it.
>>
>> What does HBase do when it does a range scan?
>> What happens when that range scan exceeds a single region?
>>
>> If you answer those questions… you'll have your answer.
>>
>> HTH
>>
>> -Mike
>>
>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
>>
>>> It's not all the code; I set things like these as well:
>>> scan.setMaxVersions();
>>> scan.setCacheBlocks(false);
>>> ...
>>>
>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>
>>>> Yes, that is correct. I have changed the HBase version to 0.98.
>>>>
>>>> I get the start and stop keys with this method:
>>>>
>>>> private List<RegionScanner> generatePartitions() {
>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>     byte[] startKey;
>>>>     byte[] stopKey;
>>>>     HConnection connection = null;
>>>>     HBaseAdmin hbaseAdmin = null;
>>>>     try {
>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>         RegionScanner regionScanner = null;
>>>>         for (HRegionInfo region : regions) {
>>>>
>>>>             startKey = region.getStartKey();
>>>>             stopKey = region.getEndKey();
>>>>
>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>             if (regionScanner != null) {
>>>>                 regionScanners.add(regionScanner);
>>>>             }
>>>>         }
>>>>
>>>> And I execute the RegionScanner with this:
>>>>
>>>> public List<Result> call() throws Exception {
>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>>>>
>>>>     Scan scan = new Scan(startKey, stopKey);
>>>>     scan.setBatch(configuration.getBatch());
>>>>     scan.setCaching(configuration.getCaching());
>>>>     ResultScanner resultScanner = table.getScanner(scan);
>>>>
>>>>     List<Result> results = new ArrayList<Result>();
>>>>     for (Result result : resultScanner) {
>>>>         results.add(result);
>>>>     }
>>>>
>>>>     connection.close();
>>>>     table.close();
>>>>
>>>>     return results;
>>>> }
>>>>
>>>> They implement Callable.
>>>>
>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
>>>>
>>>>> Let's take a step back….
>>>>>
>>>>> Your parallel scan is having the client create N threads where, in each
>>>>> thread, you're doing a partial scan of the table, and each partial scan
>>>>> takes the first and last row of a region?
>>>>>
>>>>> Is that correct?
>>>>>
>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>>>>>
>>>>>> I was checking a little bit more. I checked the cluster and the data
>>>>>> is stored in three different region servers, each one on a different node.
>>>>>> So, I guess the threads go to different hard disks.
>>>>>>
>>>>>> If someone has an idea or suggestion... why is a single scan faster than
>>>>>> this implementation? I based it on this implementation:
>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>>>>
>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>>>>
>>>>>>> I'm working with HBase 0.94 for this case; I'll try with 0.98, although
>>>>>>> there is no difference.
>>>>>>> I disabled the table and disabled the block cache for that family, and I set
>>>>>>> scan.setCacheBlocks(false) as well for both cases.
>>>>>>>
>>>>>>> I think it's not possible that I'm executing a complete scan in each
>>>>>>> thread, since my data are of the type:
>>>>>>> 000001 f:q value=1
>>>>>>> 000002 f:q value=2
>>>>>>> 000003 f:q value=3
>>>>>>> ...
>>>>>>>
>>>>>>> I add up all the values and get the same result from a single scan as from
>>>>>>> the distributed one, so I guess the DistributedScan did its job correctly.
>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't remember
>>>>>>> exactly, but roughly 4x the scan time.
>>>>>>> I'm not using any filter for the scans.
>>>>>>>
>>>>>>> This is the way I calculate the number of regions/scans:
>>>>>>>
>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>     byte[] startKey;
>>>>>>>     byte[] stopKey;
>>>>>>>     HConnection connection = null;
>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>     try {
>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>
>>>>>>>             startKey = region.getStartKey();
>>>>>>>             stopKey = region.getEndKey();
>>>>>>>
>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>             if (regionScanner != null) {
>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>             }
>>>>>>>         }
>>>>>>>
>>>>>>> I did some tests on a tiny table and I think the range for each scan
>>>>>>> works fine. Still, I found it interesting that the time when I execute
>>>>>>> the distributed scan is about 6x.
>>>>>>>
>>>>>>> I'm going to check the hard disks, but I think they're fine.
>>>>>>>
>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>>>>>>>
>>>>>>>> Which version of HBase?
>>>>>>>> Can you show us the code?
>>>>>>>>
>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>>>>>>>> your parallel scans?
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: Guillermo Ortiz <[email protected]>
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>>>> Subject: Scan vs Parallel scan.
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I developed a distributed scan; I create a thread for each region. After
>>>>>>>> that, I tried to take some timings: Scan vs DistributedScan.
>>>>>>>> I have disabled the block cache on my table. My cluster has 3 region servers
>>>>>>>> with 2 regions each; in total there are 100,000 rows, and I execute a
>>>>>>>> complete scan.
>>>>>>>>
>>>>>>>> My partitions are:
>>>>>>>> -01666 -> request 16665
>>>>>>>> 016666-033332 -> request 16666
>>>>>>>> 033332-049998 -> request 16666
>>>>>>>> 049998-066664 -> request 16666
>>>>>>>> 066664-083330 -> request 16666
>>>>>>>> 083330- -> request 16671
>>>>>>>>
>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>>>>
>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>>>>
>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>>>>
>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>>>>
>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>>>>
>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>>>>>>
>>>>>>>> The parallel scan performs much worse than the simple scan, and I don't know
>>>>>>>> why the simple scan is so fast; it's actually much faster than executing a
>>>>>>>> "count" from the hbase shell, which doesn't look normal. The only case where
>>>>>>>> the parallel scan wins is against a normal scan with caching 1.
>>>>>>>>
>>>>>>>> Any clue about it?
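
To make the pattern concrete, here is a minimal sketch of the per-region parallel scan we have been discussing, using the same 0.98-era client API that appears in the quoted code. The class name, table name, pool size, and row counting are illustrative only, not your implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ParallelRegionScanSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        final TableName tableName = TableName.valueOf("test");   // illustrative table name

        final HConnection connection = HConnectionManager.createConnection(conf);
        HBaseAdmin admin = new HBaseAdmin(connection);
        List<HRegionInfo> regions = admin.getTableRegions(tableName);

        // One task per region; each Scan is bounded by that region's start/end keys,
        // so no single task should ever cross a region boundary.
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        List<Future<Long>> futures = new ArrayList<Future<Long>>();
        for (final HRegionInfo region : regions) {
            futures.add(pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    HTableInterface table = connection.getTable(tableName);
                    try {
                        Scan scan = new Scan(region.getStartKey(), region.getEndKey());
                        scan.setCaching(1000);
                        scan.setCacheBlocks(false);
                        ResultScanner scanner = table.getScanner(scan);
                        long rows = 0;
                        for (Result result : scanner) {
                            rows++;            // count rows instead of buffering every Result in memory
                        }
                        scanner.close();
                        return rows;
                    } finally {
                        table.close();
                    }
                }
            }));
        }

        long total = 0;
        for (Future<Long> future : futures) {
            total += future.get();             // join the partial results
        }
        pool.shutdown();
        admin.close();
        connection.close();

        System.out.println("rows scanned: " + total);
    }
}

Note that this sketch shares a single HConnection across the worker threads rather than creating a new one inside each call(); that alone can matter when each task only scans roughly 16k small rows, since connection setup is not free.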
