Interesting,
I added this, and my scan did speed up somewhat
conf.setInt("hbase.client.prefetch.limit",100);
hTable = new HTable(conf, tableName);
What does this configuration property really control, and how should it be
set to an appropriate value? What is a region, and how does it map to
rows, families and columns? What are the tradeoffs for making it big?
Peter
On 1/25/12 2:32 PM, Doug Meil wrote:
Thanks Geoff! No apology required, that's good stuff. I'll update the
book with that param.
On 1/25/12 2:17 PM, "Geoff Hendrey"<[email protected]> wrote:
Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings (reported to this list by us a while back) that helped
us to get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we thought. There is a lack of documentation
(or perhaps a lack of know-how). Apparently two factors decide the
performance of a scan:
1. Scanner caching, as we know. We always had scanner caching set to
   1, but this is different from the prefetch limit.
2. hbase.client.prefetch.limit. This is the meta caching limit. It
   defaults to 10, so 10 region locations are prefetched every time a scan
   reaches a region that has not already been pre-warmed.
The "hbase.client.prefetch.limit" value is passed along to the client code
to prefetch the next 10 region locations:
    int rows = Math.min(rowLimit,
        configuration.getInt("hbase.meta.scanner.caching", 100));
The "rows" variable comes out as 10, so at most 10 region boundaries are
prefetched at a time. Hence every new region boundary that has not already
been pre-warmed triggers a fetch of the next 10 region locations, resulting
in a slow first query followed by quick responses. This is basically
pre-warming the meta cache, not the region cache.
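
To make the arithmetic above concrete, here is a minimal sketch in plain Java with no HBase dependency. The class and method names are hypothetical, and it assumes that rowLimit in the quoted line carries the hbase.client.prefetch.limit value. The lookup count is a simplified model (one meta lookup per batch of uncached region locations), not the actual client code.

```java
final class PrefetchModel {
    // Mirrors the quoted Math.min(...) line: the smaller of the two settings wins.
    static int effectivePrefetch(int prefetchLimit, int metaScannerCaching) {
        return Math.min(prefetchLimit, metaScannerCaching);
    }

    // Simplified assumption: each meta lookup warms one batch of region
    // locations, so a scan touching N uncached regions needs ceil(N / batch) lookups.
    static int metaLookups(int regionsTouched, int prefetchLimit, int metaScannerCaching) {
        int batch = effectivePrefetch(prefetchLimit, metaScannerCaching);
        if (batch < 1) {
            throw new IllegalArgumentException("batch size must be at least 1");
        }
        return (regionsTouched + batch - 1) / batch;
    }
}
```

Under this model, a scan crossing 250 cold regions needs 25 meta lookups with the default limit of 10, and 3 with the limit raised to 100 (the meta scanner caching default of 100 then becomes the cap).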
-----Original Message-----
From: Jeff Whiting [mailto:[email protected]]
Sent: Wednesday, January 25, 2012 10:09 AM
To: [email protected]
Subject: Re: Speeding up Scans
Does it make sense to have better defaults so the performance out of the
box is better?
~Jeff
On 1/25/2012 8:06 AM, Peter Wolf wrote:
Ah ha! I appear to be insane ;-)
Adding the following sped things up quite a bit
scan.setCacheBlocks(true);
scan.setCaching(1000);
Thank you, it was a duh!
P
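
A rough way to see why setCaching matters: with caching set to N, each trip to the region server can return up to N rows, so the number of round trips shrinks by about that factor. The sketch below is a simplified model with a hypothetical class name. It ignores region boundaries, the final empty fetch, and the scanner open/close calls.

```java
final class ScanRpcModel {
    // Approximate number of fetch round trips needed to stream totalRows
    // when each round trip returns at most `caching` rows.
    static long fetchCalls(long totalRows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be at least 1");
        }
        return (totalRows + caching - 1) / caching;
    }
}
```

For a million matching rows, this gives 1,000,000 round trips at the default caching of 1 and 1,000 at a caching of 1000. Larger values cost more client and server memory per fetch, so the setting is a tradeoff rather than "bigger is always better".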
On 1/25/12 8:13 AM, Doug Meil wrote:
Hi there-
Quick sanity check: what caching level are you using? (The default is 1.)
I know this is basic, but it's always good to double-check.
If "language" is already in the lead position of the rowkey, why use the
filter?
As for EC2, that's a wildcard.
On 1/25/12 7:56 AM, "Peter Wolf"<[email protected]> wrote:
Hello all,
I am looking for advice on speeding up my Scanning.
I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").
I am already creating my row keys using that column in the first bytes.
And I do my scans using partial row matching, like this...
    public static byte[] calculateStartRowKey(String language) {
        int languageHash = language.length() > 0 ? language.hashCode() : 0;
        byte[] language2 = Bytes.toBytes(languageHash);
        byte[] accountID2 = Bytes.toBytes(0);
        byte[] timestamp2 = Bytes.toBytes(0);
        return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
    }

    public static byte[] calculateEndRowKey(String language) {
        int languageHash = language.length() > 0 ? language.hashCode() : 0;
        byte[] language2 = Bytes.toBytes(languageHash + 1);
        byte[] accountID2 = Bytes.toBytes(0);
        byte[] timestamp2 = Bytes.toBytes(0);
        return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
    }

    Scan scan = new Scan(calculateStartRowKey(language),
        calculateEndRowKey(language));
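
The start and stop keys above can be reproduced without HBase to check that the range is well formed. This sketch assumes that Bytes.toBytes(int) yields 4 big-endian bytes (which ByteBuffer's default order matches) and that row keys are ordered as unsigned bytes. The class and helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class LanguageKeyRange {
    // Same hashing rule as the code above: empty strings map to 0.
    static int hash(String language) {
        return language.length() > 0 ? language.hashCode() : 0;
    }

    // 12-byte key: language hash, then zeroed account ID and timestamp.
    static byte[] key(int languageHash) {
        return ByteBuffer.allocate(12).putInt(languageHash).putInt(0).putInt(0).array();
    }

    static byte[] startRow(String language) {
        return key(hash(language));
    }

    // The stop row of a scan is exclusive, so hash + 1 bounds the prefix.
    static byte[] stopRow(String language) {
        return key(hash(language) + 1);
    }

    // True when start sorts strictly before stop in unsigned byte order.
    static boolean isNonEmptyRange(byte[] start, byte[] stop) {
        return Arrays.compareUnsigned(start, stop) < 0;
    }
}
```

For "JA" the range is non-empty. The one edge case under these assumptions is a hash of -1 (bytes FF FF FF FF): adding 1 wraps to 0, the stop key then sorts before the start key, and the scan would return nothing.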
Since I am using a hash value for the string, I need to re-check the
column to make sure that some other string does not get the same hash
value.
    Filter filter = new SingleColumnValueFilter(resultFamily,
        languageCol, CompareFilter.CompareOp.EQUAL,
        Bytes.toBytes(language));
    scan.setFilter(filter);
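
The collision concern is real for String.hashCode(): for example, "Aa" and "BB" hash to the same value, so both would land under the same key prefix. The sketch below is a client-side illustration of the same equality check on the stored bytes. It is not the filter's implementation, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CollisionCheck {
    // Two different strings that would share a row key prefix.
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    // Keeps only the stored values whose UTF-8 bytes equal the wanted value,
    // which is the comparison an EQUAL filter on the column amounts to.
    static List<String> exactMatches(List<String> storedValues, String wanted) {
        byte[] target = wanted.getBytes(StandardCharsets.UTF_8);
        List<String> kept = new ArrayList<>();
        for (String value : storedValues) {
            if (Arrays.equals(value.getBytes(StandardCharsets.UTF_8), target)) {
                kept.add(value);
            }
        }
        return kept;
    }
}
```

Since collisions within a hash prefix should be rare, the filter normally rejects very few rows, so it is unlikely to be the main cost of the scan.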
I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
EC2.
I think that this should be really fast, but it is not. Any advice on
how to debug/speed it up?
Thanks
Peter
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]