Thank you Doug and Geoff,

After following your advice I am now up to about 100 rows a second. Is that considered fast for HBase?

My data is not big, and I only have hundreds of thousands of rows in my table at the moment.

Do I still have a tuning problem?  How fast should I expect it to be?

Thanks
Peter



On 1/25/12 2:32 PM, Doug Meil wrote:
Thanks Geoff!  No apology required, that's good stuff.  I'll update the
book with that param.




On 1/25/12 2:17 PM, "Geoff Hendrey"<[email protected]>  wrote:

Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings (reported to this list by us a while back) that helped
us get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we thought. There is a lack of
documentation (or perhaps a lack of know-how). Apparently there are 2
factors that decide the performance of a scan.

1.      Scanner caching, as we know. We always had scanner caching set
to 1, but this is different from the prefetch limit.
2.      hbase.client.prefetch.limit. This is the meta caching limit. It
defaults to 10, so 10 region locations are prefetched every time we scan
a region that has not already been pre-warmed.

The "hbase.client.prefetch.limit" value is passed along to the client
code to prefetch the next 10 region locations.

    int rows = Math.min(rowLimit,
        configuration.getInt("hbase.meta.scanner.caching", 100));

The "rows" variable is capped at 10, so at most 10 region boundaries
are prefetched at a time. Hence every new region boundary that has not
already been pre-warmed triggers a fetch of the next 10 region
locations, resulting in a slow first query followed by quick responses.
This is basically pre-warming the meta cache, not the region cache.
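
To make the interaction described above concrete, here is a minimal sketch in plain Java (no HBase classes). It assumes, as the quoted line suggests, that the number of region locations fetched per meta lookup is the smaller of the prefetch limit and hbase.meta.scanner.caching. The class and method names are hypothetical, and the lookup count is a simplified model for illustration, not a measurement.

```java
// Hypothetical model of the prefetch behaviour described above; not HBase code.
final class MetaPrefetchSketch {

    // Mirrors the quoted Math.min(...) line: region locations fetched per meta lookup.
    static int rowsPerLookup(int prefetchLimit, int metaScannerCaching) {
        return Math.min(prefetchLimit, metaScannerCaching);
    }

    // Simplifying assumption: a scan crossing `regions` cold regions needs one
    // meta lookup per batch of prefetched locations (ceiling division).
    static int metaLookups(int regions, int prefetchLimit, int metaScannerCaching) {
        int perLookup = rowsPerLookup(prefetchLimit, metaScannerCaching);
        return (regions + perLookup - 1) / perLookup;
    }

    public static void main(String[] args) {
        // Defaults mentioned in this thread: prefetch limit 10, meta scanner caching 100.
        System.out.println(rowsPerLookup(10, 100));
        // 250 regions is an arbitrary illustrative figure.
        System.out.println(metaLookups(250, 10, 100));
        System.out.println(metaLookups(250, 100, 100));
    }
}
```

Under this model, raising the prefetch limit only helps up to the value of hbase.meta.scanner.caching, because the smaller of the two wins.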

-----Original Message-----
From: Jeff Whiting [mailto:[email protected]]
Sent: Wednesday, January 25, 2012 10:09 AM
To: [email protected]
Subject: Re: Speeding up Scans

Does it make sense to have better defaults so the performance out of the
box is better?

~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:
Ah ha!  I appear to be insane ;-)

Adding the following sped things up quite a bit:

         scan.setCacheBlocks(true);
         scan.setCaching(1000);
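
A rough sketch of why the caching value matters so much, in plain Java with no HBase classes. It assumes each fetch from the region server is one round trip returning up to "caching" rows, and it ignores any extra call needed to detect the end of the scan. The class name, method name, and the 100,000-row figure are illustrative assumptions.

```java
// Hypothetical back-of-the-envelope model; not HBase code.
final class ScanCachingSketch {

    // Approximate number of client-to-server fetches for a scan that returns
    // `rows` rows when each fetch carries up to `caching` rows.
    static long fetches(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 100_000L; // illustrative table size
        System.out.println(fetches(rows, 1));    // caching left at the default of 1
        System.out.println(fetches(rows, 1000)); // caching raised to 1000
    }
}
```

With these assumptions, a caching value of 1 costs one round trip per row, while 1000 cuts the same scan to about a hundred round trips.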

Thank you, it was a duh!

P



On 1/25/12 8:13 AM, Doug Meil wrote:
Hi there-

Quick sanity check:  what caching level are you using?  (default is 1)
I know this is basic, but it's always good to double-check.

If "language" is already in the lead position of the rowkey, why use
the filter?

As for EC2, that's a wildcard.





On 1/25/12 7:56 AM, "Peter Wolf"<[email protected]>   wrote:

Hello all,

I am looking for advice on speeding up my Scanning.

I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").

I am already creating my row keys using that column in the first bytes.
And I do my scans using partial row matching, like this...

      public static byte[] calculateStartRowKey(String language) {
          int languageHash = language.length() > 0 ? language.hashCode() : 0;
          byte[] language2 = Bytes.toBytes(languageHash);
          byte[] accountID2 = Bytes.toBytes(0);
          byte[] timestamp2 = Bytes.toBytes(0);
          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
      }

      public static byte[] calculateEndRowKey(String language) {
          int languageHash = language.length() > 0 ? language.hashCode() : 0;
          byte[] language2 = Bytes.toBytes(languageHash + 1);
          byte[] accountID2 = Bytes.toBytes(0);
          byte[] timestamp2 = Bytes.toBytes(0);
          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
      }

      Scan scan = new Scan(calculateStartRowKey(language),
              calculateEndRowKey(language));
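
The key layout above can be checked without a cluster. The sketch below is a plain-Java stand-in that uses ByteBuffer in place of HBase's Bytes helper, on the assumption that both write a 4-byte big-endian int. It builds the 12-byte start and end keys and confirms that an ordinary row for the same language sorts inside the range when keys are compared as unsigned bytes. The class and method names are hypothetical, and Arrays.compareUnsigned needs Java 9 or later.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Plain-Java stand-in for the row-key layout above; not HBase code.
final class RowKeySketch {

    // 12-byte key: language hash, account ID, timestamp, each a big-endian int.
    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    // Start of the range: first possible key for this language hash.
    static byte[] startKey(String language) {
        return key(language.hashCode(), 0, 0);
    }

    // End of the range: first possible key for the next hash value.
    static byte[] endKey(String language) {
        return key(language.hashCode() + 1, 0, 0);
    }

    // True when start <= row < end under unsigned lexicographic byte order.
    static boolean inRange(byte[] row, byte[] start, byte[] end) {
        return Arrays.compareUnsigned(start, row) <= 0
                && Arrays.compareUnsigned(row, end) < 0;
    }

    public static void main(String[] args) {
        byte[] start = startKey("JA");
        byte[] end = endKey("JA");
        // Account ID 5 and timestamp 7 are arbitrary illustrative values.
        byte[] row = key("JA".hashCode(), 5, 7);
        System.out.println(Arrays.toString(start));
        System.out.println(inRange(row, start, end));
    }
}
```

One thing this makes visible: because "".hashCode() is already 0, the length check in the original methods does not change the result.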


Since I am using a hash value for the string, I need to re-check the
column to make sure that some other string does not get the same hash
value.

      Filter filter = new SingleColumnValueFilter(resultFamily, languageCol,
              CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
      scan.setFilter(filter);
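
A small illustration of the collision concern behind the filter above: two different strings can have the same String.hashCode(), so they would share the same leading row-key bytes. "Aa" and "BB" are a well-known colliding pair used here purely as an example; they are not language codes from this thread, and the class and method names are hypothetical.

```java
// Shows that distinct strings can share a hashCode, which is why the
// column value is re-checked after the row-key range scan.
final class HashCollisionSketch {

    // True when two different strings would land in the same row-key prefix.
    static boolean samePrefix(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println(samePrefix("Aa", "BB")); // true
    }
}
```

If the real set of language codes is small and known to be collision-free, the range scan alone already selects the right rows, so the filter becomes a safety net.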

I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
EC2.

I think that this should be really fast, but it is not.  Any advice on
how to debug/speed it up?

Thanks
Peter





--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]



