Re: Speeding up Scans

Jean-Daniel Cryans Thu, 26 Jan 2012 10:09:58 -0800

If you're running a full scan (what PE scan does) on a table that
doesn't fit in the block cache, setting setCacheBlocks(true) is the
last thing you want to do (unless you fancy getting massive cache
churn).
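
To make the churn concrete, here is a minimal sketch that uses a plain LRU map as a stand-in for the block cache. It is not HBase's actual cache implementation, and the class and method names (CacheChurnSketch, hitsForFullScans) are made up for illustration. It replays repeated sequential full scans and counts cache hits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheChurnSketch {
    /** Tiny LRU map standing in for a block cache (an assumption, not HBase code). */
    static final class LruCache extends LinkedHashMap<Integer, Boolean> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true gives LRU ordering
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
            return size() > capacity; // evict the least recently used block
        }
    }

    /** Scans blocks 0..tableBlocks-1 sequentially, `passes` times; returns cache hits. */
    static long hitsForFullScans(int tableBlocks, int cacheBlocks, int passes) {
        LruCache cache = new LruCache(cacheBlocks);
        long hits = 0;
        for (int p = 0; p < passes; p++) {
            for (int block = 0; block < tableBlocks; block++) {
                if (cache.get(block) != null) {
                    hits++;
                } else {
                    cache.put(block, Boolean.TRUE); // caching the block evicts an older one
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Table 10x larger than the cache: every block is evicted before it is read again.
        System.out.println(hitsForFullScans(1000, 100, 3)); // prints 0
        // Table that fits in the cache: passes 2 and 3 are served entirely from cache.
        System.out.println(hitsForFullScans(100, 1000, 3)); // prints 200
    }
}
```

When the table is larger than the cache, a sequential scan pays for every insertion and eviction without ever getting a hit, and it also pushes out whatever other readers had cached.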


33k does sound awfully low.

J-D

On Thu, Jan 26, 2012 at 6:54 AM, Tim Robertson
<[email protected]> wrote:
> Hey Peter,
>
> I am trying to benchmark our 3 node cluster now and trying to optimize
> for scanning.
> Using the PerformanceEvaluation tool I did a random write to populate
> 5M rows (I believe they are 1k each but whatever the tool does by
> default).
>
> I am seeing 33k records per second (which I believe to be too low)
> with the following.
>    scan.setCacheBlocks(true);
>    scan.setCaching(10000);
>
> It might be worth using the PE
> (http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) tool to
> load, as then you are using a known table and content to compare
> against.
>
> I am running a 3 node cluster (2x quad core, 6x250G SATA, 24GB mem with 6G on
> RS).
>
> HTH,
> Tim
>
>
>
> On Thu, Jan 26, 2012 at 3:39 PM, Peter Wolf <[email protected]> wrote:
>> Thank you Doug and Geoff,
>>
>> After following your advice I am now up to about 100 rows a second.  Is that
>> considered fast for HBase?
>>
>> My data is not big, and I only have 100,000's of rows in my table at the
>> moment.
>>
>> Do I still have a tuning problem?  How fast should I expect?
>>
>> Thanks
>>
>> Peter
>>
>>
>>
>> On 1/25/12 2:32 PM, Doug Meil wrote:
>>>
>>> Thanks Geoff!  No apology required, that's good stuff.  I'll update the
>>> book with that param.
>>>
>>>
>>>
>>>
>>> On 1/25/12 2:17 PM, "Geoff Hendrey"<[email protected]>  wrote:
>>>
>>>> Sorry for jumping in late, and perhaps out of context, but I'm pasting
>>>> in some findings  (reported to this list by us a while back) that helped
>>>> us to get scans to perform very fast. Adjusting
>>>> hbase.client.prefetch.limit was critical for us:
>>>> ========================
>>>> It's even more mysterious than we think. There is a lack of
>>>> documentation (or perhaps a lack of know-how). Apparently there are 2
>>>> factors that decide the performance of a scan.
>>>>
>>>> 1.      Scanner caching, as we know. We always had scanner caching set
>>>> to 1, but this is different from the prefetch limit.
>>>> 2.      hbase.client.prefetch.limit - This is the meta caching limit. It
>>>> defaults to 10, so 10 region locations are prefetched every time we
>>>> scan a region that has not already been pre-warmed.
>>>>
>>>> the "hbase.client.prefetch.limit" is passed along to the client code to
>>>> prefetch the next 10 region locations.
>>>>
>>>> int rows = Math.min(rowLimit,
>>>> configuration.getInt("hbase.meta.scanner.caching", 100));
>>>>
>>>> the "rows" variable is capped at 10, so at most 10 region boundaries
>>>> are ever prefetched. Hence every new region boundary that has not
>>>> already been pre-warmed triggers a fetch of the next 10 region
>>>> locations, resulting in a slow first query followed by quick responses.
>>>> This is basically pre-warming the meta cache, not the region cache.
>>>>
>>>> -----Original Message-----
>>>> From: Jeff Whiting [mailto:[email protected]]
>>>> Sent: Wednesday, January 25, 2012 10:09 AM
>>>> To: [email protected]
>>>> Subject: Re: Speeding up Scans
>>>>
>>>> Does it make sense to have better defaults so the performance out of the
>>>> box is better?
>>>>
>>>> ~Jeff
>>>>
>>>> On 1/25/2012 8:06 AM, Peter Wolf wrote:
>>>>>
>>>>> Ah ha!  I appear to be insane ;-)
>>>>>
>>>>> Adding the following sped things up quite a bit
>>>>>
>>>>>         scan.setCacheBlocks(true);
>>>>>         scan.setCaching(1000);
>>>>>
>>>>> Thank you, it was a duh!
>>>>>
>>>>> P
>>>>>
>>>>>
>>>>>
>>>>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>>>>>
>>>>>> Hi there-
>>>>>>
>>>>>> Quick sanity check:  what caching level are you using?  (default is 1)
>>>>>> I know this is basic, but it's always good to double-check.
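
A rough illustration of why the caching level matters so much: scanner caching sets how many rows come back per round trip to the region server. The sketch below only does the arithmetic. The class and method names are made up for illustration, and it ignores the extra calls a real client makes to open and close scanners and to cross region boundaries.

```java
public class ScannerCachingSketch {
    /** Round trips needed to fetch `rows` rows when each call returns up to `caching` rows. */
    static long roundTrips(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 100_000; // hypothetical table size, for illustration only
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching + " -> "
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

With caching at 1, every row costs a network round trip, so latency rather than disk or CPU tends to dominate the scan time.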
>>>>>>
>>>>>> If "language" is already in the lead position of the rowkey, why use
>>>>>> the filter?
>>>>>>
>>>>>> As for EC2, that's a wildcard.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1/25/12 7:56 AM, "Peter Wolf"<[email protected]>   wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I am looking for advice on speeding up my Scanning.
>>>>>>>
>>>>>>> I want to iterate over all rows where a particular column (language)
>>>>>>> equals a particular value ("JA").
>>>>>>>
>>>>>>> I am already creating my row keys using that column in the first
>>>>>>> bytes.
>>>>>>>
>>>>>>> And I do my scans using partial row matching, like this...
>>>>>>>
>>>>>>>      public static byte[] calculateStartRowKey(String language) {
>>>>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>>>      }
>>>>>>>
>>>>>>>      public static byte[] calculateEndRowKey(String language) {
>>>>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>>>      }
>>>>>>>
>>>>>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>>>>>> calculateEndRowKey(language));
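
For readers following along without an HBase classpath, the key layout above can be sketched with java.nio.ByteBuffer, which writes ints as 4 big-endian bytes, the same encoding Bytes.toBytes(int) uses. The class and helper names here are made up for illustration. HBase orders row keys as unsigned bytes, so the sketch compares the keys that way. It also shows one edge case in the "hash + 1" end key: if the hash were -1, the end key would wrap around and sort before the start key.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RowKeyRangeSketch {
    /** 12-byte key: 4-byte big-endian language hash, then accountID and timestamp ints. */
    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    static byte[] startKey(String language) {
        int hash = language.length() > 0 ? language.hashCode() : 0;
        return key(hash, 0, 0);
    }

    static byte[] endKey(String language) {
        int hash = language.length() > 0 ? language.hashCode() : 0;
        return key(hash + 1, 0, 0); // exclusive upper bound: first key of the next hash
    }

    public static void main(String[] args) {
        byte[] start = startKey("JA");
        byte[] end = endKey("JA");
        System.out.println(Arrays.toString(start));
        // Row keys sort as unsigned bytes, so compare them that way.
        System.out.println(Arrays.compareUnsigned(start, end) < 0); // prints true

        // Edge case: a hash of -1 (0xFFFFFFFF) wraps to 0, so "end" sorts before "start".
        System.out.println(Arrays.compareUnsigned(key(-1, 0, 0), key(0, 0, 0)) > 0); // prints true
    }
}
```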
>>>>>>>
>>>>>>>
>>>>>>> Since I am using a hash value for the string, I need to re-check the
>>>>>>> column to make sure that some other string does not get the same hash
>>>>>>> value
>>>>>>>
>>>>>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>>>> languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>>>>>
>>>>>>>      scan.setFilter(filter);
>>>>>>>
>>>>>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>>>>>> EC2.
>>>>>>>
>>>>>>> I think that this should be really fast, but it is not.  Any advice on
>>>>>>> how to debug/speed it up?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>> --
>>>> Jeff Whiting
>>>> Qualtrics Senior Software Engineer
>>>> [email protected]
>>>>
>>>>
>>>
>>
