What am I missing??

2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
> For a partial scan, I guess that I call the RS to get data, it starts
> looking in the store files and collecting the data. (It doesn't write to
> the blockcache in either case.) It has the data ready and it gives the
> client the data step by step, I mean, it depends on the caching and batching
> parameters.
>
> Big differences that I see...
> I'm opening more connections to the table, one per region.
>
> I should check the single table scan, it looks like it does partial scans
> sequentially, since you can see on the HBase Master how the requests
> increase one after another, not all at the same time.
>
> 2014-09-12 15:23 GMT+02:00 Michael Segel <[email protected]>:
>
>> It doesn’t matter which RS, but that you have 1 thread for each region.
>>
>> So for each thread, what’s happening.
>> Step by step, what is the code doing.
>>
>> Now you’re comparing this against a single table scan, right?
>> What’s happening in the table scan…?
>>
>>
>> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <[email protected]> wrote:
>>
>> > Right, my table for example has keys between 0-9, in three regions
>> > 0-2,3-7,7-9
>> > I launch three partial scans in parallel. The scans that I'm executing are:
>> > scan(0,2), scan(3,7), scan(7,9).
>> > Each region is on a different RS, so each thread goes to a different RS. It's
>> > not exactly like that, but in the benchmark case it's like it's working.
>> >
>> > Really the code will execute a thread for each region, not for each
>> > RegionServer. But in the test I only have two regions per RegionServer. I
>> > don't think that's an important point, there are two threads per RS.
>> >
>> > 2014-09-12 14:48 GMT+02:00 Michael Segel <[email protected]>:
>> >
>> >> Ok, lets again take a step back…
>> >>
>> >> So you are comparing your partial scan(s) against a full table scan?
>> >>
>> >> If I understood your question, you launch 3 partial scans where you set
>> >> the start row and then end row of each scan, right?
>> >>
>> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:
>> >>
>> >>> Okay, then, the partial scan doesn't work as I think.
>> >>> How could it exceed the limit of a single region if I calculate the limits?
>> >>>
>> >>>
>> >>> The only bad point that I see is that if a region server has three
>> >>> regions of the same table, I'm executing three partial scans against this RS
>> >>> and they could compete for resources (network, etc.) on this node. It'd be
>> >>> better to have one thread per RS. But that doesn't answer your questions.
>> >>>
>> >>> I keep thinking...
>> >>>
>> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I wanted to take a step back from the actual code and to stop and think
>> >>>> about what you are doing and what HBase is doing under the covers.
>> >>>>
>> >>>> So in your code, you are asking HBase to do 3 separate scans and then you
>> >>>> take the result set back and join it.
>> >>>>
>> >>>> What does HBase do when it does a range scan?
>> >>>> What happens when that range scan exceeds a single region?
>> >>>>
>> >>>> If you answer those questions… you’ll have your answer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
>> >>>>
>> >>>>> It's not all the code, I set things like these as well:
>> >>>>> scan.setMaxVersions();
>> >>>>> scan.setCacheBlocks(false);
>> >>>>> ...
>> >>>>>
>> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
>> >>>>>
>> >>>>>> Yes, that's it. I have changed the HBase version to 0.98
>> >>>>>>
>> >>>>>> I got the start and stop keys with this method:
>> >>>>>> private List<RegionScanner> generatePartitions() {
>> >>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>>>>>     byte[] startKey;
>> >>>>>>     byte[] stopKey;
>> >>>>>>     HConnection connection = null;
>> >>>>>>     HBaseAdmin hbaseAdmin = null;
>> >>>>>>     try {
>> >>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>> >>>>>>         List<HRegionInfo> regions =
>> >>>>>>             hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>>>>>         RegionScanner regionScanner = null;
>> >>>>>>         for (HRegionInfo region : regions) {
>> >>>>>>
>> >>>>>>             startKey = region.getStartKey();
>> >>>>>>             stopKey = region.getEndKey();
>> >>>>>>
>> >>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> >>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>> >>>>>>             if (regionScanner != null) {
>> >>>>>>                 regionScanners.add(regionScanner);
>> >>>>>>             }
>> >>>>>>         }
>> >>>>>>
>> >>>>>> And I execute the RegionScanner with this:
>> >>>>>> public List<Result> call() throws Exception {
>> >>>>>>     HConnection connection =
>> >>>>>>         HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>> >>>>>>
>> >>>>>>     Scan scan = new Scan(startKey, stopKey);
>> >>>>>>     scan.setBatch(configuration.getBatch());
>> >>>>>>     scan.setCaching(configuration.getCaching());
>> >>>>>>     ResultScanner resultScanner = table.getScanner(scan);
>> >>>>>>
>> >>>>>>     List<Result> results = new ArrayList<Result>();
>> >>>>>>     for (Result result : resultScanner) {
>> >>>>>>         results.add(result);
>> >>>>>>     }
>> >>>>>>
>> >>>>>>     connection.close();
>> >>>>>>     table.close();
>> >>>>>>
>> >>>>>>     return results;
>> >>>>>> }
>> >>>>>>
>> >>>>>> They implement Callable.
>> >>>>>>
>> >>>>>>
>> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
>> >>>>>>
>> >>>>>>> Lets take a step back….
>> >>>>>>>
>> >>>>>>> Your parallel scan is having the client create N threads where in each
>> >>>>>>> thread, you’re doing a partial scan of the table where each partial scan
>> >>>>>>> takes the first and last row of each region?
>> >>>>>>>
>> >>>>>>> Is that correct?
>> >>>>>>>
>> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> I was checking a little bit more, I checked the cluster and the data is
>> >>>>>>>> stored in three different region servers, each one on a different node.
>> >>>>>>>> So, I guess the threads go to different hard disks.
>> >>>>>>>>
>> >>>>>>>> If someone has an idea or suggestion... why is a single scan faster than
>> >>>>>>>> this implementation? I based it on this implementation:
>> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>> >>>>>>>>
>> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>> >>>>>>>>
>> >>>>>>>>> I'm working with HBase 0.94 for this case, I'll try with 0.98, although
>> >>>>>>>>> there is no difference.
>> >>>>>>>>> I disabled the table and disabled the blockcache for that family and I put
>> >>>>>>>>> scan.setCacheBlocks(false) as well for both cases.
>> >>>>>>>>>
>> >>>>>>>>> I think that it's not possible that I'm executing a complete scan for each
>> >>>>>>>>> thread since my data is of the type:
>> >>>>>>>>> 000001 f:q value=1
>> >>>>>>>>> 000002 f:q value=2
>> >>>>>>>>> 000003 f:q value=3
>> >>>>>>>>> ...
>> >>>>>>>>>
>> >>>>>>>>> I add all the values and get the same result on a single scan as on a
>> >>>>>>>>> distributed one, so I guess that DistributedScan did well.
>> >>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't remember,
>> >>>>>>>>> but like 4x of the scan time.
>> >>>>>>>>> I'm not using any filter for the scans.
>> >>>>>>>>>
>> >>>>>>>>> This is the way I calculate the number of regions/scans:
>> >>>>>>>>> private List<RegionScanner> generatePartitions() {
>> >>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>>>>>>>>     byte[] startKey;
>> >>>>>>>>>     byte[] stopKey;
>> >>>>>>>>>     HConnection connection = null;
>> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>> >>>>>>>>>     try {
>> >>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>> >>>>>>>>>         List<HRegionInfo> regions =
>> >>>>>>>>>             hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>>>>>>>>         RegionScanner regionScanner = null;
>> >>>>>>>>>         for (HRegionInfo region : regions) {
>> >>>>>>>>>
>> >>>>>>>>>             startKey = region.getStartKey();
>> >>>>>>>>>             stopKey = region.getEndKey();
>> >>>>>>>>>
>> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>> >>>>>>>>>             if (regionScanner != null) {
>> >>>>>>>>>                 regionScanners.add(regionScanner);
>> >>>>>>>>>             }
>> >>>>>>>>>         }
>> >>>>>>>>>
>> >>>>>>>>> I did some tests on a tiny table and I think that the range for each scan
>> >>>>>>>>> works fine. Although, I thought that it was interesting that the time when I
>> >>>>>>>>> execute the distributed scan is about 6x.
>> >>>>>>>>>
>> >>>>>>>>> I'm going to check the hard disks, but I think that it's right.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>> >>>>>>>>>
>> >>>>>>>>>> Which version of HBase?
>> >>>>>>>>>> Can you show us the code?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>> >>>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>> >>>>>>>>>> your parallel scans?
>> >>>>>>>>>>
>> >>>>>>>>>> -- Lars
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> ________________________________
>> >>>>>>>>>> From: Guillermo Ortiz <[email protected]>
>> >>>>>>>>>> To: "[email protected]" <[email protected]>
>> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>> >>>>>>>>>> Subject: Scan vs Parallel scan.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> I developed a distributed scan, I create a thread for each region. After
>> >>>>>>>>>> that, I've tried to get some timings for Scan vs DistributedScan.
>> >>>>>>>>>> I have disabled the blockcache on my table. My cluster has 3 region servers
>> >>>>>>>>>> with 2 regions each, in total there are 100.000 rows and I execute a
>> >>>>>>>>>> complete scan.
>> >>>>>>>>>>
>> >>>>>>>>>> My partitions are
>> >>>>>>>>>> -016666 -> request 16665
>> >>>>>>>>>> 016666-033332 -> request 16666
>> >>>>>>>>>> 033332-049998 -> request 16666
>> >>>>>>>>>> 049998-066664 -> request 16666
>> >>>>>>>>>> 066664-083330 -> request 16666
>> >>>>>>>>>> 083330- -> request 16671
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>> >>>>>>>>>>
>> >>>>>>>>>> Parallel scan works much worse than simple scan, and I don't know why
>> >>>>>>>>>> it's so fast; it's really much faster than executing a "count" from the
>> >>>>>>>>>> hbase shell, which doesn't look very normal. The only time that parallel
>> >>>>>>>>>> works better is when I execute a normal scan with caching 1.
>> >>>>>>>>>>
>> >>>>>>>>>> Any clue about it?
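
A minimal sketch of the per-region partitioning discussed in this thread, in case it helps to check the ranges in isolation. It does not use any HBase classes: a TreeMap stands in for the table, and the names PartitionedScanSketch, buildTable, range and parallelCounts are hypothetical. It assumes half-open ranges [start, stop) with an empty key meaning "unbounded", which is how region start/end keys and Scan start/stop rows behave, and it shares one thread pool across all partitions instead of building resources per task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the thread's per-region scan; no HBase API is involved.
class PartitionedScanSketch {

    // Builds an in-memory "table" with zero-padded row keys 000001..n, like the thread's data.
    static NavigableMap<String, Integer> buildTable(int n) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 1; i <= n; i++) {
            table.put(String.format(Locale.ROOT, "%06d", i), i);
        }
        return table;
    }

    // Returns the rows in [start, stop); an empty string means unbounded on that side.
    static NavigableMap<String, Integer> range(NavigableMap<String, Integer> table,
                                               String start, String stop) {
        NavigableMap<String, Integer> view = table;
        if (!start.isEmpty()) {
            view = view.tailMap(start, true);   // start key is inclusive
        }
        if (!stop.isEmpty()) {
            view = view.headMap(stop, false);   // stop key is exclusive
        }
        return view;
    }

    // Runs one read-only task per partition on a shared pool; returns per-partition row counts.
    static List<Integer> parallelCounts(NavigableMap<String, Integer> table,
                                        String[] boundaries) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(boundaries.length - 1);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.length; i++) {
                final String start = boundaries[i];
                final String stop = boundaries[i + 1];
                Callable<Integer> task = () -> range(table, start, stop).size();
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, Integer> table = buildTable(100000);
        // Region boundaries as listed in the first message of the thread.
        String[] boundaries = {"", "016666", "033332", "049998", "066664", "083330", ""};
        List<Integer> counts = parallelCounts(table, boundaries);
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        System.out.println(counts + " total=" + total);
    }
}
```

With those boundaries the per-partition counts come out as 16665, 16666, 16666, 16666, 16666 and 16671, which add up to 100000, so the ranges themselves do not overlap. The sketch only checks the partitioning; it says nothing about where the time goes on a real cluster.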
