Hey Peter,

I am benchmarking our 3 node cluster now, trying to optimize for
scanning. Using the PerformanceEvaluation tool I did a random write to
populate 5M rows (I believe they are 1k each, or whatever the tool
writes by default).

I am seeing 33k records per second (which I believe to be too low)
with the following.
    scan.setCacheBlocks(true);
    scan.setCaching(10000);
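
For reference, the loop I am timing is roughly the sketch below. It
assumes the default PE table name "TestTable" and the plain HTable
client API; the class wrapper is only there so it compiles standalone.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScanTiming {
        public static void main(String[] args) throws IOException {
            // "TestTable" is the table PerformanceEvaluation populates by default.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "TestTable");
            Scan scan = new Scan();
            scan.setCacheBlocks(true);
            scan.setCaching(10000);

            long rows = 0;
            long start = System.currentTimeMillis();
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    rows++;  // just count rows, no per-cell work
                }
            } finally {
                scanner.close();
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(rows + " rows in " + elapsed + " ms");
            table.close();
        }
    }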

It might be worth using the PE
(http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) tool to
load, as then you are using a known table and content to compare
against.
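
(If memory serves, the load step is something along the lines of "hbase
org.apache.hadoop.hbase.PerformanceEvaluation randomWrite <nclients>",
but check the wiki page above for the exact options.)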

I am running a 3 node cluster (2x quad core, 6x 250GB SATA, 24GB mem with 6GB on the RS).

HTH,
Tim



On Thu, Jan 26, 2012 at 3:39 PM, Peter Wolf <opus...@gmail.com> wrote:
> Thank you Doug and Geoff,
>
> After following your advice I am now up to about 100 rows a second.  Is that
> considered fast for HBase?
>
> My data is not big, and I only have 100,000's of rows in my table at the
> moment.
>
> Do I still have a tuning problem?  How fast should I expect?
>
> Thanks
>
> Peter
>
>
>
> On 1/25/12 2:32 PM, Doug Meil wrote:
>>
>> Thanks Geoff!  No apology required, that's good stuff.  I'll update the
>> book with that param.
>>
>>
>>
>>
>> On 1/25/12 2:17 PM, "Geoff Hendrey" <ghend...@decarta.com> wrote:
>>
>>> Sorry for jumping in late, and perhaps out of context, but I'm pasting
>>> in some findings (reported to this list by us a while back) that helped
>>> us get scans to perform very fast. Adjusting
>>> hbase.client.prefetch.limit was critical for us:
>>> ========================
>>> It's even more mysterious than we think. There is a lack of
>>> documentation (or perhaps a lack of know-how). Apparently there are 2
>>> factors that decide scan performance:
>>>
>>> 1.      Scanner caching, as we know - we always had scanner caching set
>>> to 1, but this is different from the prefetch limit
>>> 2.      hbase.client.prefetch.limit - this is the META caching limit; it
>>> defaults to 10, so 10 region locations are prefetched whenever a scan
>>> hits a region that has not already been pre-warmed
>>>
>>> The "hbase.client.prefetch.limit" is passed along to the client code to
>>> prefetch the next 10 region locations:
>>>
>>>     int rows = Math.min(rowLimit,
>>>         configuration.getInt("hbase.meta.scanner.caching", 100));
>>>
>>> The "rows" variable is capped at 10 (the rowLimit), so at most 10 region
>>> boundaries are prefetched at a time. Hence every new region boundary
>>> that has not already been pre-warmed fetches the next 10 region
>>> locations, resulting in one slow query followed by quick responses.
>>> This is basically pre-warming the META cache, not the region cache.
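>>>
>>> A minimal sketch of raising it on the client side (the value 50 and the
>>> table name are just placeholders; it can equally be set in the client's
>>> hbase-site.xml):
>>>
>>>     Configuration conf = HBaseConfiguration.create();
>>>     // prefetch more region locations per META lookup (default is 10)
>>>     conf.setInt("hbase.client.prefetch.limit", 50);
>>>     HTable table = new HTable(conf, "mytable");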
>>>
>>> -----Original Message-----
>>> From: Jeff Whiting [mailto:je...@qualtrics.com]
>>> Sent: Wednesday, January 25, 2012 10:09 AM
>>> To: user@hbase.apache.org
>>> Subject: Re: Speeding up Scans
>>>
>>> Does it make sense to have better defaults so the performance out of the
>>> box is better?
>>>
>>> ~Jeff
>>>
>>> On 1/25/2012 8:06 AM, Peter Wolf wrote:
>>>>
>>>> Ah ha!  I appear to be insane ;-)
>>>>
>>>> Adding the following sped things up quite a bit
>>>>
>>>>         scan.setCacheBlocks(true);
>>>>         scan.setCaching(1000);
>>>>
>>>> Thank you, it was a duh!
>>>>
>>>> P
>>>>
>>>>
>>>>
>>>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>>>>
>>>>> Hi there-
>>>>>
>>>>> Quick sanity check:  what caching level are you using?  (default is 1)
>>>>> I know this is basic, but it's always good to double-check.
>>>>>
>>>>> If "language" is already in the lead position of the rowkey, why use
>>>>> the filter?
>>>>>
>>>>> As for EC2, that's a wildcard.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1/25/12 7:56 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I am looking for advice on speeding up my Scanning.
>>>>>>
>>>>>> I want to iterate over all rows where a particular column (language)
>>>>>> equals a particular value ("JA").
>>>>>>
>>>>>> I am already creating my row keys using that column in the first
>>>>>> bytes.
>>>>>>
>>>>>> And I do my scans using partial row matching, like this...
>>>>>>
>>>>>>      public static byte[] calculateStartRowKey(String language) {
>>>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>>      }
>>>>>>
>>>>>>      public static byte[] calculateEndRowKey(String language) {
>>>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>>      }
>>>>>>
>>>>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>>>>>          calculateEndRowKey(language));
>>>>>>
>>>>>>
>>>>>> Since I am using a hash value for the string, I need to re-check the
>>>>>> column to make sure that some other string does not get the same
>>>>>> hash value.
>>>>>>
>>>>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>>>          languageCol, CompareFilter.CompareOp.EQUAL,
>>>>>>          Bytes.toBytes(language));
>>>>>>
>>>>>>      scan.setFilter(filter);
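>>>>>>
>>>>>> (For completeness, the iteration itself is roughly the sketch below;
>>>>>> "table" is my HTable handle.)
>>>>>>
>>>>>>      ResultScanner scanner = table.getScanner(scan);
>>>>>>      try {
>>>>>>          for (Result result : scanner) {
>>>>>>              // each result is one row whose language column matched
>>>>>>              byte[] rowKey = result.getRow();
>>>>>>          }
>>>>>>      } finally {
>>>>>>          scanner.close();
>>>>>>      }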
>>>>>>
>>>>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines
>>>>>> on EC2.
>>>>>>
>>>>>> I think that this should be really fast, but it is not.  Any advice
>>>>>> on how to debug/speed it up?
>>>>>>
>>>>>> Thanks
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> je...@qualtrics.com
>>>
>>>
>>
>
