No problem! That's one of the tips in the Performance chapter of the book/refGuide. It's always a good thing to double-check, because even the most experienced folks sometimes forget the simple stuff.
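
For anyone finding this in the archives: the tip in question is scanner caching, which controls how many rows the client fetches per trip to the region server. Below is a back-of-the-envelope sketch of why it matters so much. It is a simplified model, not HBase code: it assumes each round trip returns at most `caching` rows and ignores region boundaries and the final empty fetch. The class name, method name, and row count are made up for illustration.

```java
public class ScanCachingEstimate {

    // Simplified model: one client/server round trip returns up to `caching` rows.
    public static long estimatedRoundTrips(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical number of rows matched by the scan

        // caching = 1 (the default mentioned in this thread): one trip per row
        System.out.println(estimatedRoundTrips(rows, 1));    // prints 1000000

        // caching = 1000 (the value that fixed it for Peter)
        System.out.println(estimatedRoundTrips(rows, 1000)); // prints 1000
    }
}
```

Under this model the same scan drops from a million round trips to a thousand, so the per-call network latency stops dominating. The trade-off is more memory per fetch on both client and server.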
On 1/25/12 10:06 AM, "Peter Wolf" <[email protected]> wrote:

>Ah ha! I appear to be insane ;-)
>
>Adding the following speeded things up quite a bit
>
>    scan.setCacheBlocks(true);
>    scan.setCaching(1000);
>
>Thank you, it was a duh!
>
>P
>
>On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check: what caching level are you using? (default is 1)
>> I know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use the
>> filter?
>>
>> As for EC2, that's a wildcard.
>>
>> On 1/25/12 7:56 AM, "Peter Wolf"<[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I am looking for advice on speeding up my Scanning.
>>>
>>> I want to iterate over all rows where a particular column (language)
>>> equals a particular value ("JA").
>>>
>>> I am already creating my row keys using that column in the first bytes.
>>> And I do my scans using partial row matching, like this...
>>>
>>>     public static byte[] calculateStartRowKey(String language) {
>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>         byte[] language2 = Bytes.toBytes(languageHash);
>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>     }
>>>
>>>     public static byte[] calculateEndRowKey(String language) {
>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>         byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>     }
>>>
>>>     Scan scan = new Scan(calculateStartRowKey(language),
>>>         calculateEndRowKey(language));
>>>
>>> Since I am using a hash value for the string, I need to re-check the
>>> column to make sure that some other string does not get the same hash
>>> value
>>>
>>>     Filter filter = new SingleColumnValueFilter(resultFamily,
>>>         languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>     scan.setFilter(filter);
>>>
>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>> EC2.
>>>
>>> I think that this should be really fast, but it is not. Any advice on
>>> how to debug/speed it up?
>>>
>>> Thanks
>>> Peter
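
A footnote on the start/end row keys in the original question, for readers who want to see what the scan range looks like on the wire. This is a rough stand-alone sketch, not the poster's code: it swaps in `java.nio.ByteBuffer` for the HBase `Bytes` helper so it runs without a cluster, and it assumes the key is three big-endian 4-byte ints (language hash, account ID, timestamp), which is what the calls in the post produce. The class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class LanguageKeyRange {

    // 12-byte big-endian key: [languageHash][accountID = 0][timestamp = 0]
    public static byte[] key(int languageHash) {
        return ByteBuffer.allocate(12).putInt(languageHash).putInt(0).putInt(0).array();
    }

    // Same hash rule as in the original post
    public static int hash(String language) {
        return language.length() > 0 ? language.hashCode() : 0;
    }

    public static byte[] startRow(String language) {
        return key(hash(language));
    }

    public static byte[] stopRow(String language) {
        return key(hash(language) + 1);
    }

    // Unsigned byte-by-byte comparison, the order in which row keys are sorted
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] start = startRow("JA");
        byte[] stop = stopRow("JA");
        System.out.println(Arrays.toString(start)); // [0, 0, 9, 55, 0, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(stop));  // [0, 0, 9, 56, 0, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(compareUnsigned(start, stop) < 0); // true
    }
}
```

For "JA" the range covers exactly the keys whose first four bytes equal the hash 2359. One corner this makes visible: a string whose hash is -1 would give a stop key of all zeros, which sorts before its start key, so that case would need special handling. The filter in the post is still needed, because two different strings can share a hash.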
