Hello all,
I am looking for advice on speeding up my scans.
I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").
I already build my row keys with that column in the leading bytes,
and I do my scans using partial row key matching, like this:
    public static byte[] calculateStartRowKey(String language) {
        int languageHash = language.length() > 0 ? language.hashCode() : 0;
        byte[] language2 = Bytes.toBytes(languageHash);
        byte[] accountID2 = Bytes.toBytes(0);
        byte[] timestamp2 = Bytes.toBytes(0);
        return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
    }

    public static byte[] calculateEndRowKey(String language) {
        int languageHash = language.length() > 0 ? language.hashCode() : 0;
        byte[] language2 = Bytes.toBytes(languageHash + 1);
        byte[] accountID2 = Bytes.toBytes(0);
        byte[] timestamp2 = Bytes.toBytes(0);
        return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
    }

    Scan scan = new Scan(calculateStartRowKey(language),
            calculateEndRowKey(language));
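To sanity-check the range itself, here is a minimal standalone sketch (plain Java, no HBase dependency) of the key layout above. It assumes that Bytes.toBytes(int) writes 4 big-endian bytes and that row keys are ordered by unsigned byte comparison; the class and helper names (KeyRangeSketch, key, compareUnsigned, rangeIsValid) are made up for illustration and are not part of HBase.

```java
import java.nio.ByteBuffer;

public class KeyRangeSketch {
    // 12-byte key: big-endian language hash, then zeroed accountID and
    // timestamp slots (assumed to be 4-byte ints, as in the code above).
    static byte[] key(int languageHash) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash).putInt(0).putInt(0).array();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True when the start key sorts strictly before the end key.
    static boolean rangeIsValid(int languageHash) {
        return compareUnsigned(key(languageHash), key(languageHash + 1)) < 0;
    }

    public static void main(String[] args) {
        System.out.println("JA: " + rangeIsValid("JA".hashCode()));
        System.out.println("hash -1: " + rangeIsValid(-1));
    }
}
```

Under those assumptions the range is well formed for "JA", but a language whose hash is -1 would produce an end key (all zero bytes) that sorts before the start key, so that case would need special handling.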
Since I am using a hash value for the string, I need to re-check the
column to make sure that some other string does not get the same hash value:
    Filter filter = new SingleColumnValueFilter(resultFamily,
            languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
    scan.setFilter(filter);
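As a small illustration of why the re-check is needed, the sketch below shows two different strings with the same String.hashCode() value. The class and method names are hypothetical, and "Aa" / "BB" are just a well-known colliding pair, not values from my table.

```java
public class HashCollisionSketch {
    // True when two distinct strings share the same hashCode().
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // Both strings hash to 2112, so they would land in the same key range.
        System.out.println(collides("Aa", "BB"));
    }
}
```

Rows for both strings would fall inside the same start/end key range, which is why the column value filter stays on the scan.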
I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on EC2.
I think that this should be really fast, but it is not. Any advice on
how to debug/speed it up?
Thanks
Peter