Re: Speeding up Scans

Peter Wolf Wed, 25 Jan 2012 11:58:47 -0800

Interesting,

I added this, and my scan did speed up somewhat


        conf.setInt("hbase.client.prefetch.limit",100);
        hTable = new HTable(conf, tableName);

What does this environment variable really control, and how should it be set to an appropriate value? What is a region, and how does it map to lines, families and columns? What are the tradeoffs for making it big?


Peter



On 1/25/12 2:32 PM, Doug Meil wrote:

Thanks Geoff!  No apology required, that's good stuff.  I'll update the
book with that param.




On 1/25/12 2:17 PM, "Geoff Hendrey"<[email protected]>  wrote:

Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings  (reported to this list by us a while back) that helped
us to get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we think. There is a lack of documentation
(or perhaps a lack of know-how). Apparently there are 2 factors that
decide the performance of a scan.

1.      Scanner caching, as we know. We always had scanner caching set
to 1, but this is different from the prefetch limit.
2.      hbase.client.prefetch.limit. This is the meta caching limit. It
defaults to 10, so 10 region locations are prefetched every time a scan
reaches a region that has not already been pre-warmed.

The "hbase.client.prefetch.limit" value is passed along to the client
code to prefetch the next 10 region locations.

int rows = Math.min(rowLimit,
configuration.getInt("hbase.meta.scanner.caching", 100));

The "rows" variable is capped at 10, so at most 10 region boundaries are
prefetched at a time. Hence every new region boundary that has not
already been pre-warmed triggers a fetch of the next 10 region
locations, resulting in a slow first query followed by quick responses.
This is basically pre-warming the meta cache, not the region cache.
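
The capping described above can be illustrated without a cluster. The following is a minimal sketch, not HBase's client code: the class and method names are hypothetical, the defaults of 10 and 100 are the ones quoted in this message, the table size is made up, and the lookup count assumes a simplified model in which a sequential scan triggers one meta prefetch each time it reaches a region whose location is not yet cached.

```java
// Hypothetical illustration; not part of the HBase client.
final class PrefetchLimitSketch {

    // Mirrors the quoted Math.min(...) line: the number of meta rows fetched
    // per lookup is the smaller of the prefetch limit and the meta scanner caching.
    static int rowsPerMetaLookup(int prefetchLimit, int metaScannerCaching) {
        return Math.min(prefetchLimit, metaScannerCaching);
    }

    // Simplified model (an assumption): one meta lookup per batch of region
    // locations while scanning sequentially across regionCount regions.
    static int metaLookups(int regionCount, int rowsPerLookup) {
        return (regionCount + rowsPerLookup - 1) / rowsPerLookup; // ceiling division
    }

    public static void main(String[] args) {
        int regions = 250; // made-up table size for illustration
        for (int limit : new int[] {10, 100}) {
            int rows = rowsPerMetaLookup(limit, 100);
            System.out.println("prefetch limit " + limit + ": " + rows
                    + " locations per lookup, " + metaLookups(regions, rows)
                    + " meta lookups");
        }
    }
}
```

Going by the quoted line alone, raising the limit past the hbase.meta.scanner.caching value would have no further effect unless that value is raised as well.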

-----Original Message-----
From: Jeff Whiting [mailto:[email protected]]
Sent: Wednesday, January 25, 2012 10:09 AM
To: [email protected]
Subject: Re: Speeding up Scans

Does it make sense to have better defaults so the performance out of the
box is better?

~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:

Ah ha!  I appear to be insane ;-)

Adding the following sped things up quite a bit:

         scan.setCacheBlocks(true);
         scan.setCaching(1000);
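
Why the caching value matters so much can be sketched with simple arithmetic. Assuming each scanner round trip returns at most "caching" rows (this ignores result-size limits and region boundaries, and the class below is hypothetical), the number of round trips is roughly the row count divided by the caching value:

```java
// Hypothetical back-of-the-envelope model; not an HBase API.
final class ScannerCachingSketch {

    // Round trips needed to stream rowCount rows when each call to the
    // region server returns at most 'caching' rows.
    static long roundTrips(long rowCount, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rowCount + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // made-up scan size for illustration
        System.out.println("caching=1:    " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=1000: " + roundTrips(rows, 1000) + " round trips");
    }
}
```

The tradeoff is that a larger value means each batch holds more rows in client memory and more time passes between calls to the server, so very large values are worth testing rather than assuming.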

Thank you, it was a duh!

P



On 1/25/12 8:13 AM, Doug Meil wrote:

Hi there-

Quick sanity check:  what caching level are you using?  (default is 1)  I
know this is basic, but it's always good to double-check.

If "language" is already in the lead position of the rowkey, why use the
filter?

As for EC2, that's a wildcard.





On 1/25/12 7:56 AM, "Peter Wolf"<[email protected]>   wrote:

Hello all,

I am looking for advice on speeding up my Scanning.

I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").

I am already creating my row keys using that column in the first bytes.

And I do my scans using partial row matching, like this...

      public static byte[] calculateStartRowKey(String language) {
          int languageHash = language.length() > 0 ? language.hashCode() : 0;
          byte[] language2 = Bytes.toBytes(languageHash);
          byte[] accountID2 = Bytes.toBytes(0);
          byte[] timestamp2 = Bytes.toBytes(0);
          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
      }

      public static byte[] calculateEndRowKey(String language) {
          int languageHash = language.length() > 0 ? language.hashCode() : 0;
          byte[] language2 = Bytes.toBytes(languageHash + 1);
          byte[] accountID2 = Bytes.toBytes(0);
          byte[] timestamp2 = Bytes.toBytes(0);
          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
      }

      Scan scan = new Scan(calculateStartRowKey(language),
calculateEndRowKey(language));
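
The key layout used above can be reproduced with only the JDK, which makes the scan bounds easy to inspect. This is a sketch under assumptions: ByteBuffer (big-endian by default) stands in for HBase's Bytes.toBytes(int), the class and method names are hypothetical, and the unsigned comparison mimics the byte-wise lexicographic ordering HBase uses for row keys. It also exposes one edge case: a hash of -1 is all 0xFF bytes, so adding 1 wraps to 0 and the stop key sorts before the start key.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the row-key helpers above, using only the JDK.
final class RowKeyRangeSketch {

    // 4-byte hash prefix + 4-byte account id + 4-byte timestamp, big-endian.
    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash).putInt(accountId).putInt(timestamp)
                .array();
    }

    static int hash(String language) {
        return language.length() > 0 ? language.hashCode() : 0;
    }

    static byte[] startKey(String language) {
        return key(hash(language), 0, 0);
    }

    // Exclusive upper bound: the first key of the next hash prefix.
    static byte[] stopKey(String language) {
        return key(hash(language) + 1, 0, 0);
    }

    // Unsigned, byte-wise lexicographic comparison.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(startKey("JA"))); // 000009370000000000000000
        System.out.println(hex(stopKey("JA")));  // 000009380000000000000000
        // Edge case: hash -1 wraps to 0, so "stop" sorts before "start".
        System.out.println(compareUnsigned(key(-1, 0, 0), key(-1 + 1, 0, 0)) > 0); // true
    }
}
```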


Since I am using a hash value for the string, I need to re-check the
column to make sure that some other string does not get the same hash
value

      Filter filter = new SingleColumnValueFilter(resultFamily, languageCol,
              CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
      scan.setFilter(filter);
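
The re-check is needed because distinct strings can share a String.hashCode() value. A small JDK-only demonstration follows. The class name is hypothetical, and "Aa"/"BB" is a well-known colliding pair rather than a real language code:

```java
// Hypothetical demonstration of why a hash prefix alone cannot identify a string.
final class HashCollisionSketch {

    // True when two strings would land under the same 4-byte hash prefix.
    static boolean samePrefix(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());        // 2112
        System.out.println("BB".hashCode());        // 2112
        System.out.println(samePrefix("Aa", "BB")); // true
        System.out.println("Aa".equals("BB"));      // false
    }
}
```

Rows for two colliding strings would fall inside the same start/stop key range, so a value check such as the filter above (or an equivalent client-side comparison) is what tells them apart.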

I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
EC2.

I think that this should be really fast, but it is not.  Any advice on
how to debug/speed it up?

Thanks
Peter

--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]
