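If you're running a full scan (which is what the PE scan test does) on a table that doesn't fit in the block cache, calling setCacheBlocks(true) is the last thing you want to do, unless you fancy massive cache churn: every block the scan reads pushes out one that was cached before it, and nothing stays around long enough to be read again.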
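To make the churn concrete, here is a toy sketch in plain Java with no HBase dependency. The LruCache class and the secondPassHits helper are made up for illustration, and a bare LRU map is a deliberately simplified stand-in for the real block cache, so treat the numbers as an intuition aid rather than a measurement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of block cache churn; NOT HBase's actual block cache implementation. */
class BlockCacheChurnSketch {

    /** Minimal LRU cache keyed by block id, built on LinkedHashMap's access order. */
    static final class LruCache extends LinkedHashMap<Integer, Boolean> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = keep entries in access order, eldest first
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
            return size() > capacity; // evict the least recently used block when full
        }
    }

    /**
     * Scans blocks [0, tableBlocks) in order, twice, caching every block that
     * misses, and returns how many cache hits the second pass gets.
     */
    static int secondPassHits(int cacheBlocks, int tableBlocks) {
        LruCache cache = new LruCache(cacheBlocks);
        int hits = 0;
        for (int pass = 0; pass < 2; pass++) {
            for (int block = 0; block < tableBlocks; block++) {
                boolean hit = cache.get(block) != null;
                if (hit && pass == 1) {
                    hits++;
                }
                if (!hit) {
                    cache.put(block, Boolean.TRUE);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: a cache of 100 blocks against tables of 80 and 150 blocks.
        System.out.println("table fits in cache:   " + secondPassHits(100, 80));
        System.out.println("table larger than cache: " + secondPassHits(100, 150));
    }
}
```

With these made-up sizes the second pass gets 80 hits when the table fits and 0 hits when it doesn't: each block is evicted before the scan comes back to it, so caching buys nothing and only displaces whatever other readers had in the cache.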
33k does sound awfully low.

J-D

On Thu, Jan 26, 2012 at 6:54 AM, Tim Robertson <[email protected]> wrote:
> Hey Peter,
>
> I am trying to benchmark our 3 node cluster now and trying to optimize
> for scanning.
> Using the PerformanceEvaluation tool I did a random write to populate
> 5M rows (I believe they are 1k each but whatever the tool does by
> default).
>
> I am seeing 33k records per second (which I believe to be too low)
> with the following.
> scan.setCacheBlocks(true);
> scan.setCaching(10000);
>
> It might be worth using the PE
> (http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) tool to
> load, as then you are using a known table and content to compare
> against.
>
> I am running a 3 node cluster (2xquad core, 6x250G SATA, 24GB mem with 6G on
> RS).
>
> HTH,
> Tim
>
> On Thu, Jan 26, 2012 at 3:39 PM, Peter Wolf <[email protected]> wrote:
>> Thank you Doug and Geoff,
>>
>> After following your advice I am now up to about 100 rows a second. Is that
>> considered fast for HBase?
>>
>> My data is not big, and I only have 100,000's of rows in my table at the
>> moment.
>>
>> Do I still have a tuning problem? How fast should I expect?
>>
>> Thanks
>>
>> Peter
>>
>> On 1/25/12 2:32 PM, Doug Meil wrote:
>>>
>>> Thanks Geoff! No apology required, that's good stuff. I'll update the
>>> book with that param.
>>>
>>> On 1/25/12 2:17 PM, "Geoff Hendrey"<[email protected]> wrote:
>>>
>>>> Sorry for jumping in late, and perhaps out of context, but I'm pasting
>>>> in some findings (reported to this list by us a while back) that helped
>>>> us to get scans to perform very fast. Adjusting
>>>> hbase.client.prefetch.limit was critical for us.:
>>>> ========================
>>>> It's even more mysterious than we think. There is lack of documentation
>>>> (or perhaps lack of know how). Apparently there are 2 factors that
>>>> decide the performance of scan.
>>>>
>>>> 1. Scanner cache as we know - We always had scanner caching set to
>>>> 1, but this is different than pre fetch limit
>>>> 2. hbase.client.prefetch.limit - This is meta caching limit
>>>> defaults to 10 to prefetch 10 region locations every time we scan that
>>>> is not already been pre-warmed
>>>>
>>>> the "hbase.client.prefetch.limit" is passed along to the client code to
>>>> prefetch the next 10 region locations.
>>>>
>>>> int rows = Math.min(rowLimit,
>>>> configuration.getInt("hbase.meta.scanner.caching", 100));
>>>>
>>>> the "row" variable mins to 10 and always prefetch at most 10 region
>>>> boundaries. Hence every new region boundary that is not already been
>>>> pre-warmed fetch the next 10 region locations resulting in 1st slow
>>>> query followed by quick responses. This is basically pre-warming the
>>>> meta not region cache.
>>>>
>>>> -----Original Message-----
>>>> From: Jeff Whiting [mailto:[email protected]]
>>>> Sent: Wednesday, January 25, 2012 10:09 AM
>>>> To: [email protected]
>>>> Subject: Re: Speeding up Scans
>>>>
>>>> Does it make sense to have better defaults so the performance out of the
>>>> box is better?
>>>>
>>>> ~Jeff
>>>>
>>>> On 1/25/2012 8:06 AM, Peter Wolf wrote:
>>>>>
>>>>> Ah ha! I appear to be insane ;-)
>>>>>
>>>>> Adding the following speeded things up quite a bit
>>>>>
>>>>> scan.setCacheBlocks(true);
>>>>> scan.setCaching(1000);
>>>>>
>>>>> Thank you, it was a duh!
>>>>>
>>>>> P
>>>>>
>>>>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>>>>>
>>>>>> Hi there-
>>>>>>
>>>>>> Quick sanity check: what caching level are you using? (default is 1) I
>>>>>> know this is basic, but it's always good to double-check.
>>>>>>
>>>>>> If "language" is already in the lead position of the rowkey, why use the
>>>>>> filter?
>>>>>>
>>>>>> As for EC2, that's a wildcard.
>>>>>>
>>>>>> On 1/25/12 7:56 AM, "Peter Wolf"<[email protected]> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I am looking for advice on speeding up my Scanning.
>>>>>>>
>>>>>>> I want to iterate over all rows where a particular column (language)
>>>>>>> equals a particular value ("JA").
>>>>>>>
>>>>>>> I am already creating my row keys using that column in the first bytes.
>>>>>>>
>>>>>>> And I do my scans using partial row matching, like this...
>>>>>>>
>>>>>>>     public static byte[] calculateStartRowKey(String language) {
>>>>>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>>         byte[] language2 = Bytes.toBytes(languageHash);
>>>>>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>>>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>>>     }
>>>>>>>
>>>>>>>     public static byte[] calculateEndRowKey(String language) {
>>>>>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>>         byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>>>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>>>     }
>>>>>>>
>>>>>>>     Scan scan = new Scan(calculateStartRowKey(language),
>>>>>>>         calculateEndRowKey(language));
>>>>>>>
>>>>>>> Since I am using a hash value for the string, I need to re-check the
>>>>>>> column to make sure that some other string does not get the same hash
>>>>>>> value
>>>>>>>
>>>>>>>     Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>>>>         languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>>>>>     scan.setFilter(filter);
>>>>>>>
>>>>>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>>>>>> EC2.
>>>>>>>
>>>>>>> I think that this should be really fast, but it is not. Any advice on
>>>>>>> how to debug/speed it up?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Peter
>>>>>>>
>>>> --
>>>> Jeff Whiting
>>>> Qualtrics Senior Software Engineer
>>>> [email protected]
>>>>
>>>
>>
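
For anyone who wants to poke at the start/stop row key construction from Peter's original message at the bottom of the thread without a cluster, here is a rough standalone sketch. It swaps HBase's Bytes.toBytes(int) and Bytes.add for java.nio.ByteBuffer (both give 4-byte big-endian ints), and it assumes the real row keys use the same 12-byte layout as the scan keys (int hash, int account id, int timestamp), which the thread does not confirm. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Standalone sketch of the hash-prefixed start/stop row keys; not the poster's actual code. */
class RowKeyRangeSketch {

    /** Assumed layout: 4-byte language hash, 4-byte account id, 4-byte timestamp, big-endian. */
    static byte[] rowKey(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    /** Same hashing rule as in the thread: empty strings hash to 0. */
    static int hashOf(String language) {
        return language.length() > 0 ? language.hashCode() : 0;
    }

    /** First possible key for the language: its hash followed by zeros. */
    static byte[] startRowKey(String language) {
        return rowKey(hashOf(language), 0, 0);
    }

    /** Exclusive stop key: the next hash value followed by zeros. */
    static byte[] stopRowKey(String language) {
        return rowKey(hashOf(language) + 1, 0, 0);
    }

    /** True if key lies in [start, stop) under unsigned lexicographic byte order. */
    static boolean inRange(byte[] key, byte[] start, byte[] stop) {
        return Arrays.compareUnsigned(key, start) >= 0
                && Arrays.compareUnsigned(key, stop) < 0;
    }

    public static void main(String[] args) {
        byte[] start = startRowKey("JA");
        byte[] stop = stopRowKey("JA");
        // Hypothetical rows: account 42 at timestamp 1000, one per language.
        byte[] jaRow = rowKey(hashOf("JA"), 42, 1000);
        byte[] enRow = rowKey(hashOf("EN"), 42, 1000);
        System.out.println("JA row in range: " + inRange(jaRow, start, stop));
        System.out.println("EN row in range: " + inRange(enRow, start, stop));
    }
}
```

Every row sharing the 4-byte hash prefix falls inside the range, so the range alone cannot tell apart two languages whose hashes collide. That is the case the SingleColumnValueFilter in the original message guards against.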
