Hey Peter, I am benchmarking our 3-node cluster now and trying to optimize for scanning. Using the PerformanceEvaluation tool, I did a random write to populate 5M rows (I believe they are 1k each, but whatever the tool does by default).
I am seeing 33k records per second (which I believe to be too low) with the following:

scan.setCacheBlocks(true);
scan.setCaching(10000);

It might be worth using the PE tool (http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) to load, as then you are using a known table and content to compare against. I am running a 3-node cluster (2x quad-core, 6x250G SATA, 24GB mem with 6G on the RS).

HTH,
Tim

On Thu, Jan 26, 2012 at 3:39 PM, Peter Wolf <opus...@gmail.com> wrote:
> Thank you Doug and Geoff,
>
> After following your advice I am now up to about 100 rows a second. Is
> that considered fast for HBase?
>
> My data is not big, and I only have 100,000's of rows in my table at
> the moment.
>
> Do I still have a tuning problem? How fast should I expect?
>
> Thanks
>
> Peter
>
> On 1/25/12 2:32 PM, Doug Meil wrote:
>>
>> Thanks Geoff! No apology required, that's good stuff. I'll update the
>> book with that param.
>>
>> On 1/25/12 2:17 PM, "Geoff Hendrey" <ghend...@decarta.com> wrote:
>>
>>> Sorry for jumping in late, and perhaps out of context, but I'm
>>> pasting in some findings (reported to this list by us a while back)
>>> that helped us get scans to perform very fast. Adjusting
>>> hbase.client.prefetch.limit was critical for us:
>>> ========================
>>> It's even more mysterious than we think. There is a lack of
>>> documentation (or perhaps a lack of know-how). Apparently there are
>>> two factors that decide the performance of a scan:
>>>
>>> 1. Scanner cache, as we know it - we always had scanner caching set
>>> to 1, but this is different from the prefetch limit.
>>> 2. hbase.client.prefetch.limit - this is the meta caching limit. It
>>> defaults to 10, prefetching 10 region locations every time we hit a
>>> region that has not already been pre-warmed.
>>>
>>> The "hbase.client.prefetch.limit" is passed along to the client code
>>> to prefetch the next 10 region locations:
>>>
>>> int rows = Math.min(rowLimit,
>>>     configuration.getInt("hbase.meta.scanner.caching", 100));
>>>
>>> The "rows" variable mins out at 10, so we always prefetch at most 10
>>> region boundaries. Hence every new region boundary that has not
>>> already been pre-warmed fetches the next 10 region locations,
>>> resulting in a slow first query followed by quick responses. This is
>>> basically pre-warming the meta cache, not the region cache.
>>>
>>> -----Original Message-----
>>> From: Jeff Whiting [mailto:je...@qualtrics.com]
>>> Sent: Wednesday, January 25, 2012 10:09 AM
>>> To: user@hbase.apache.org
>>> Subject: Re: Speeding up Scans
>>>
>>> Does it make sense to have better defaults so the performance out of
>>> the box is better?
>>>
>>> ~Jeff
>>>
>>> On 1/25/2012 8:06 AM, Peter Wolf wrote:
>>>>
>>>> Ah ha! I appear to be insane ;-)
>>>>
>>>> Adding the following sped things up quite a bit:
>>>>
>>>> scan.setCacheBlocks(true);
>>>> scan.setCaching(1000);
>>>>
>>>> Thank you, it was a duh!
>>>>
>>>> P
>>>>
>>>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>>>>
>>>>> Hi there-
>>>>>
>>>>> Quick sanity check: what caching level are you using? (default is
>>>>> 1) I know this is basic, but it's always good to double-check.
>>>>>
>>>>> If "language" is already in the lead position of the rowkey, why
>>>>> use the filter?
>>>>>
>>>>> As for EC2, that's a wildcard.
>>>>>
>>>>> On 1/25/12 7:56 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I am looking for advice on speeding up my Scanning.
>>>>>>
>>>>>> I want to iterate over all rows where a particular column
>>>>>> (language) equals a particular value ("JA").
>>>>>>
>>>>>> I am already creating my row keys using that column in the first
>>>>>> bytes.
>>>>>>
>>>>>> And I do my scans using partial row matching, like this...
>>>>>>
>>>>>> public static byte[] calculateStartRowKey(String language) {
>>>>>>     int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>     byte[] language2 = Bytes.toBytes(languageHash);
>>>>>>     byte[] accountID2 = Bytes.toBytes(0);
>>>>>>     byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>     return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>> }
>>>>>>
>>>>>> public static byte[] calculateEndRowKey(String language) {
>>>>>>     int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>     byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>>>     byte[] accountID2 = Bytes.toBytes(0);
>>>>>>     byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>     return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>> }
>>>>>>
>>>>>> Scan scan = new Scan(calculateStartRowKey(language),
>>>>>>     calculateEndRowKey(language));
>>>>>>
>>>>>> Since I am using a hash value for the string, I need to re-check
>>>>>> the column to make sure that some other string does not get the
>>>>>> same hash value:
>>>>>>
>>>>>> Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>>>     languageCol, CompareFilter.CompareOp.EQUAL,
>>>>>>     Bytes.toBytes(language));
>>>>>> scan.setFilter(filter);
>>>>>>
>>>>>> I am using the Cloudera 0.90.4 release, and a cluster of 3
>>>>>> machines on EC2.
>>>>>>
>>>>>> I think that this should be really fast, but it is not. Any advice
>>>>>> on how to debug/speed it up?
>>>>>>
>>>>>> Thanks
>>>>>> Peter
>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> je...@qualtrics.com
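
For anyone landing on this thread later, here is a minimal, self-contained
sketch that pulls the advice above together: Peter's hash-prefixed key range,
the scanner caching and block caching Tim and Doug recommend, the collision
re-check filter, and the client-side knob Geoff describes. The table name
("mytable"), column family ("result"), and qualifier ("language") are
placeholders, not Peter's actual schema, and it targets the 0.90-era client
API used in this thread; treat it as a starting point rather than a tuned
configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class LanguageScan {

    // Placeholder names standing in for the schema discussed above.
    private static final byte[] RESULT_FAMILY = Bytes.toBytes("result");
    private static final byte[] LANGUAGE_COL  = Bytes.toBytes("language");

    // Start key: [hash(language)][accountID=0][timestamp=0], as in the thread.
    static byte[] startRowKey(String language) {
        int h = language.length() > 0 ? language.hashCode() : 0;
        return Bytes.add(Bytes.toBytes(h), Bytes.toBytes(0), Bytes.toBytes(0));
    }

    // Stop key: bump the hash by one; Scan's stop row is exclusive, so this
    // covers every key carrying the target hash prefix.
    static byte[] endRowKey(String language) {
        int h = language.length() > 0 ? language.hashCode() : 0;
        return Bytes.add(Bytes.toBytes(h + 1), Bytes.toBytes(0), Bytes.toBytes(0));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Geoff's client-side meta prefetch; 10 is already the default,
        // raise it if scans cross many cold region boundaries.
        conf.setInt("hbase.client.prefetch.limit", 10);

        HTable table = new HTable(conf, "mytable"); // placeholder table name
        String language = "JA";

        Scan scan = new Scan(startRowKey(language), endRowKey(language));
        scan.setCaching(1000);     // rows fetched per RPC (default is 1)
        scan.setCacheBlocks(true); // keep scanned blocks in the block cache

        // Re-check the column value: two strings can share a hashCode.
        scan.setFilter(new SingleColumnValueFilter(RESULT_FAMILY, LANGUAGE_COL,
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language)));

        ResultScanner scanner = table.getScanner(scan);
        try {
            long count = 0;
            for (Result r : scanner) {
                count++; // process the row here
            }
            System.out.println("matched rows: " + count);
        } finally {
            scanner.close();
            table.close();
        }
    }
}

Two trade-offs worth keeping in mind when adapting this: a larger
scan.setCaching value means fewer RPCs but more client memory and a longer
pause per next() batch, and setCacheBlocks(true) mainly pays off when the
same data is scanned repeatedly; for a one-off full-table scan it can evict
hotter data from the block cache. To load a comparable test table the way
Tim describes, the PE tool can be driven from the command line, e.g.
something like "hbase org.apache.hadoop.hbase.PerformanceEvaluation
randomWrite 5" (each client writes on the order of 1M rows of ~1000-byte
values, so 5 clients roughly matches Tim's 5M rows).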