I don't have the code here, but I created a class, RegionScanner, that performs a complete scan of a single region. That is why I have to set the start and stop keys: they are the boundaries of that region.
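
To make the idea concrete without the real code at hand, here is a minimal sketch of one scan task per region range, run through an ExecutorService. It uses only the JDK: a TreeMap stands in for the HBase table, and ParallelRangeScanSketch, KeyRange, scanRange and parallelScan are hypothetical names, not the actual RegionScanner class. As with HRegionInfo.getStartKey()/getEndKey(), an empty key means "unbounded" on that side. Note that the tasks here share one read-only table, whereas the call() method quoted below opens a new HConnection per task; whether that difference explains the timings is only an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeScanSketch {

    /** One region-like range [startKey, stopKey); "" means unbounded on that side. */
    static final class KeyRange {
        final String startKey;
        final String stopKey;

        KeyRange(String startKey, String stopKey) {
            this.startKey = startKey;
            this.stopKey = stopKey;
        }
    }

    /** Stand-in for a per-region scan: returns only the rows inside the range. */
    static List<Integer> scanRange(NavigableMap<String, Integer> table, KeyRange range) {
        NavigableMap<String, Integer> view = table;
        if (!range.startKey.isEmpty()) {
            view = view.tailMap(range.startKey, true);   // start key is inclusive
        }
        if (!range.stopKey.isEmpty()) {
            view = view.headMap(range.stopKey, false);   // stop key is exclusive
        }
        return new ArrayList<>(view.values());
    }

    /** Submits one task per range to a fixed pool and merges results in range order. */
    static List<Integer> parallelScan(NavigableMap<String, Integer> table,
                                      List<KeyRange> ranges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (KeyRange range : ranges) {
                Callable<List<Integer>> task = () -> scanRange(table, range);
                futures.add(pool.submit(task));
            }
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                merged.addAll(future.get());             // blocks until that range is done
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a range scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Nine rows with zero-padded keys, split into three contiguous ranges.
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 1; i <= 9; i++) {
            table.put(String.format("%06d", i), i);
        }
        List<KeyRange> ranges = Arrays.asList(
                new KeyRange("", "000004"),
                new KeyRange("000004", "000007"),
                new KeyRange("000007", ""));
        List<Integer> rows = parallelScan(table, ranges, ranges.size());
        int sum = 0;
        for (int value : rows) {
            sum += value;
        }
        System.out.println(rows.size() + " rows, sum " + sum);
    }
}
```

Because each range is half-open and the ranges are contiguous, every row is read by exactly one task, so the merged row count and sum should match a single full scan. That is the same sanity check I used to convince myself the parallel version is not scanning the whole table in every thread.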

On Sunday, September 14, 2014, Anoop John <[email protected]> wrote:

> Again, a full code snippet would speak better.
>
> But I am not getting what you are doing with the code below:
>
> private List<RegionScanner> generatePartitions() {
>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>     byte[] startKey;
>     byte[] stopKey;
>     HConnection connection = null;
>     HBaseAdmin hbaseAdmin = null;
>     try {
>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>         hbaseAdmin = new HBaseAdmin(connection);
>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>         RegionScanner regionScanner = null;
>         for (HRegionInfo region : regions) {
>
>             startKey = region.getStartKey();
>             stopKey = region.getEndKey();
>
>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>             // regionScanner = createRegionScanner(startKey, stopKey);
>             if (regionScanner != null) {
>                 regionScanners.add(regionScanner);
>             }
>         }
>
> And I execute the RegionScanner with this:
> public List<Result> call() throws Exception {
>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>     HTableInterface table = connection.getTable(configuration.getTable());
>
>     Scan scan = new Scan(startKey, stopKey);
>     scan.setBatch(configuration.getBatch());
>     scan.setCaching(configuration.getCaching());
>     ResultScanner resultScanner = table.getScanner(scan);
>
>
> What is this part?
>     new RegionScanner(startKey, stopKey, scanConfiguration);
>
>
> >> Scan scan = new Scan(startKey, stopKey);
>     scan.setBatch(configuration.getBatch());
>     scan.setCaching(configuration.getCaching());
>     ResultScanner resultScanner = table.getScanner(scan);
>
>
> And you are not setting start and stop rows on this Scan object?!
>
>
> Sorry if I missed some parts of your code.
>
> -Anoop-
>
>
> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz <[email protected]> wrote:
>
>> I don't have the code here, but I'll post it in a couple of days. I have
>> to check the ExecutorService again! I don't remember exactly how I did it.
>>
>> I'm using HBase 0.98.
>>
>> On Sunday, September 14, 2014, lars hofhansl <[email protected]> wrote:
>>
>>> What specific version of 0.94 are you using?
>>>
>>> In general, if you have multiple spindles (disks) and/or multiple CPU
>>> cores at the region server, you should benefit from keeping multiple
>>> region server handler threads busy. I have experimented with this before
>>> and saw a close to linear speed-up (up to the point where all disks/cores
>>> were busy). Obviously this also assumes that this is the only load you
>>> throw at the servers at this point.
>>>
>>> Can you post your complete code to pastebin? Maybe even with some code to
>>> seed the data?
>>> How do you run your callables? Did you configure the ExecutorService
>>> correctly (assuming you use one to run your callables)?
>>>
>>> Then we can run it and have a look.
>>>
>>> Thanks.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Guillermo Ortiz <[email protected]>
>>> To: "[email protected]" <[email protected]>
>>> Cc:
>>> Sent: Saturday, September 13, 2014 4:49 PM
>>> Subject: Re: Scan vs Parallel scan.
>>>
>>> What am I missing??
>>>
>>>
>>> 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>
>>>> For a partial scan, I guess that I call the RS to get data, and it starts
>>>> looking in the store files and collecting the data. (It doesn't write to
>>>> the blockcache in either case.) It has the data ready and gives it to the
>>>> client step by step; I mean, it depends on the caching and batching
>>>> parameters.
>>>>
>>>> Big differences that I see...
>>>> I'm opening more connections to the table, one per region.
>>>>
>>>> I should check the single table scan; it looks like it does partial scans
>>>> sequentially, since you can see on the HBase Master how the requests
>>>> increase one after another, not all at the same time.
>>>>
>>>> 2014-09-12 15:23 GMT+02:00 Michael Segel <[email protected]>:
>>>>
>>>>> It doesn’t matter which RS, but that you have 1 thread for each region.
>>>>>
>>>>> So for each thread, what’s happening.
>>>>> Step by step, what is the code doing.
>>>>>
>>>>> Now you’re comparing this against a single table scan, right?
>>>>> What’s happening in the table scan…?
>>>>>
>>>>>
>>>>> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <[email protected]> wrote:
>>>>>
>>>>>> Right. My table, for example, has keys between 0-9 in three regions:
>>>>>> 0-2,3-7,7-9
>>>>>> I launch three partial scans in parallel. The scans that I'm executing
>>>>>> are:
>>>>>> scan(0,2), scan(3,7), scan(7,9).
>>>>>> Each region is on a different RS, so each thread goes to a different RS.
>>>>>> It's not exactly like that, but in the benchmark case that is how it's
>>>>>> working.
>>>>>>
>>>>>> Really the code will execute a thread for each region, not for each
>>>>>> RegionServer. But in the test I only have two regions per RegionServer.
>>>>>> I don't think that's an important point; there are two threads per RS.
>>>>>>
>>>>>> 2014-09-12 14:48 GMT+02:00 Michael Segel <[email protected]>:
>>>>>>
>>>>>>> Ok, lets again take a step back…
>>>>>>>
>>>>>>> So you are comparing your partial scan(s) against a full table scan?
>>>>>>>
>>>>>>> If I understood your question, you launch 3 partial scans where you set
>>>>>>> the start row and then end row of each scan, right?
>>>>>>>
>>>>>>> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <[email protected]> wrote:
>>>>>>>
>>>>>>>> Okay, then, the partial scan doesn't work as I think.
>>>>>>>> How could it exceed the limit of a single region if I calculate the
>>>>>>>> limits?
>>>>>>>>
>>>>>>>>
>>>>>>>> The only bad point that I see is that if a region server has three
>>>>>>>> regions of the same table, I'm executing three partial scans against
>>>>>>>> this RS and they could compete for resources (network, etc.) on this
>>>>>>>> node. It'd be better to have one thread per RS. But that doesn't answer
>>>>>>>> your questions.
>>>>>>>>
>>>>>>>> I keep thinking...
>>>>>>>>
>>>>>>>> 2014-09-12 9:40 GMT+02:00 Michael Segel <[email protected]>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I wanted to take a step back from the actual code and to stop and think
>>>>>>>>> about what you are doing and what HBase is doing under the covers.
>>>>>>>>>
>>>>>>>>> So in your code, you are asking HBase to do 3 separate scans and then
>>>>>>>>> you take the result set back and join it.
>>>>>>>>>
>>>>>>>>> What does HBase do when it does a range scan?
>>>>>>>>> What happens when that range scan exceeds a single region?
>>>>>>>>>
>>>>>>>>> If you answer those questions… you’ll have your answer.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It's not all the code, I set things like these as well:
>>>>>>>>>> scan.setMaxVersions();
>>>>>>>>>> scan.setCacheBlocks(false);
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> yes, that is. I have changed the HBase version to 0.98
>>>>>>>>>>>
>>>>>>>>>>> I got the start and stop keys with this method:
>>>>>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>>>>>     byte[] startKey;
>>>>>>>>>>>     byte[] stopKey;
>>>>>>>>>>>     HConnection connection = null;
>>>>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>>>>>     try {
>>>>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>>>>>
>>>>>>>>>>>             startKey = region.getStartKey();
>>>>>>>>>>>             stopKey = region.getEndKey();
>>>>>>>>>>>
>>>>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>>>>>             if (regionScanner != null) {
>>>>>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>>>>>             }
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>> And I execute the RegionScanner with this:
>>>>>>>>>>> public List<Result> call() throws Exception {
>>>>>>>>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>>>>>>>>>>>
>>>>>>>>>>>     Scan scan = new Scan(startKey, stopKey);
>>>>>>>>>>>     scan.setBatch(configuration.getBatch());
>>>>>>>>>>>     scan.setCaching(configuration.getCaching());
>>>>>>>>>>>     ResultScanner resultScanner = table.getScanner(scan);
>>>>>>>>>>>
>>>>>>>>>>>     List<Result> results = new ArrayList<Result>();
>>>>>>>>>>>     for (Result result : resultScanner) {
>>>>>>>>>>>         results.add(result);
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>>     connection.close();
>>>>>>>>>>>     table.close();
>>>>>>>>>>>
>>>>>>>>>>>     return results;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> They implement Callable.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> Lets take a step back….
>>>>>>>>>>>>
>>>>>>>>>>>> Your parallel scan is having the client create N threads where in each
>>>>>>>>>>>> thread, you’re doing a partial scan of the table where each partial scan
>>>>>>>>>>>> takes the first and last row of each region?
>>>>>>>>>>>>
>>>>>>>>>>>> Is that correct?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I was checking a little bit more. I checked the cluster, and the data is
>>>>>>>>>>>>> stored in three different region servers, each one on a different node.
>>>>>>>>>>>>> So, I guess the threads go to different hard disks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If someone has an idea or suggestion about why a single scan is faster
>>>>>>>>>>>>> than this implementation... I based it on this implementation:
>>>>>>>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <[email protected]>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
>>>>>>>>>>>>>> there is no difference.
>>>>>>>>>>>>>> I disabled the table and disabled the blockcache for that family, and I
>>>>>>>>>>>>>> put scan.setCacheBlocks(false) as well for both cases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think that it's not possible that I'm executing a complete scan for
>>>>>>>>>>>>>> each thread, since my data are of this type:
>>>>>>>>>>>>>> 000001 f:q value=1
>>>>>>>>>>>>>> 000002 f:q value=2
>>>>>>>>>>>>>> 000003 f:q value=3
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I add all the values and get the same result on a single scan as on a
>>>>>>>>>>>>>> distributed one, so I guess that DistributedScan did well.
>>>>>>>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
>>>>>>>>>>>>>> remember, but like 4x of the scan time.
>>>>>>>>>>>>>> I'm not using any filter for the scans.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is the way I calculate the number of regions/scans:
>>>>>>>>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>>>>>>>>     byte[] startKey;
>>>>>>>>>>>>>>     byte[] stopKey;
>>>>>>>>>>>>>>     HConnection connection = null;
>>>>>>>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>>>>>>>>     try {
>>>>>>>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>             startKey = region.getStartKey();
>>>>>>>>>>>>>>             stopKey = region.getEndKey();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>>>>>>>>             if (regionScanner != null) {
>>>>>>>>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did some tests on a tiny table and I think that the range for each
>>>>>>>>>>>>>> scan works fine. Although, I thought it was interesting that the time
>>>>>>>>>>>>>> when I execute the distributed scan is about 6x.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm going to check the hard disks, but I think that it's right.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <[email protected]>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Which version of HBase?
>>>>>>>>>>>>>>> Can you show us the code?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>>>>>>>>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>>>>>>>>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>>>>>>>>>>>>>>> your parallel scans?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Lars
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>> From: Guillermo Ortiz <[email protected]>
>>>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>>>>>>>>>>> Subject: Scan vs Parallel scan.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I developed a distributed scan; I create a thread for each region. After
>>>>>>>>>>>>>>> that, I tried to get some timings of Scan vs DistributedScan.
>>>>>>>>>>>>>>> I have disabled the blockcache in my table. My cluster has 3 region
>>>>>>>>>>>>>>> servers with 2 regions each; in total there are 100,000 rows and I
>>>>>>>>>>>>>>> execute a complete scan.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My partitions are
>>>>>>>>>>>>>>> -01666 -> request 16665
>>>>>>>>>>>>>>> 016666-033332 -> request 16666
>>>>>>>>>>>>>>> 033332-049998 -> request 16666
>>>>>>>>>>>>>>> 049998-066664 -> request 16666
>>>>>>>>>>>>>>> 066664-083330 -> request 16666
>>>>>>>>>>>>>>> 083330- -> request 16671
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The parallel scan works much worse than the simple scan, and I don't know
>>>>>>>>>>>>>>> why the simple scan is so fast; it's really much faster than executing a
>>>>>>>>>>>>>>> "count" from the hbase shell, which doesn't look normal. The only time
>>>>>>>>>>>>>>> the parallel scan works better is when I execute the normal scan with
>>>>>>>>>>>>>>> caching 1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any clue about it?
