Re: Speeding up Scans

Doug Meil Wed, 25 Jan 2012 08:22:21 -0800

No problem!  That's one of the tips in the Performance chapter of the
book/refGuide - always a good thing to double-check because even the most
experienced folks sometimes forget the simple stuff.
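
For context, scanner caching sets how many rows the client pulls back per
call to the region server, so the number of round trips falls roughly in
proportion to the caching value. A minimal sketch of that arithmetic in
plain Java follows. The class ScanRoundTrips, the method roundTrips, and the
one-million-row table are hypothetical, and a real scan adds overhead that
this ignores.

```java
// Hypothetical helper, not part of HBase: estimates how many fetch calls a
// scan needs when each call returns at most `caching` rows.
final class ScanRoundTrips {

    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        // Ceiling division: a partly filled last batch still costs one call.
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // assumed table size, for illustration only
        System.out.println("caching=1    -> " + roundTrips(rows, 1) + " calls");
        System.out.println("caching=1000 -> " + roundTrips(rows, 1000) + " calls");
    }
}
```

Larger caching values also mean more rows held in client memory per call, so
the value is a trade-off rather than "bigger is always better".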




On 1/25/12 10:06 AM, "Peter Wolf" <[email protected]> wrote:

>Ah ha!  I appear to be insane ;-)
>
>Adding the following sped things up quite a bit:
>
>         scan.setCacheBlocks(true);
>         scan.setCaching(1000);
>
>Thank you, it was a duh!
>
>P
>
>
>
>On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check:  what caching level are you using?  (default is 1)
>> I know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use the
>> filter?
>>
>> As for EC2, that's a wildcard.
>>
>>
>>
>>
>>
>> On 1/25/12 7:56 AM, "Peter Wolf" <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I am looking for advice on speeding up my Scanning.
>>>
>>> I want to iterate over all rows where a particular column (language)
>>> equals a particular value ("JA").
>>>
>>> I am already creating my row keys using that column in the first bytes.
>>> And I do my scans using partial row matching, like this...
>>>
>>>      public static byte[] calculateStartRowKey(String language) {
>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      public static byte[] calculateEndRowKey(String language) {
>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>> calculateEndRowKey(language));
>>>
>>>
>>> Since I am using a hash value for the string, I need to re-check the
>>> column to make sure that some other string does not get the same hash
>>> value:
>>>
>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>> languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>      scan.setFilter(filter);
>>>
>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>> EC2.
>>>
>>> I think that this should be really fast, but it is not.  Any advice on
>>> how to debug/speed it up?
>>>
>>> Thanks
>>> Peter
>>>
>>>
>>>
>>>
>>>
>>
>
>
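
A note on the start and stop keys in the original question: Bytes.toBytes(int)
writes four big-endian bytes, and HBase orders row keys as unsigned bytes, so
hash + 1 gives a usable stop key for almost every hash. The sketch below
mirrors that key layout in plain Java to check the ordering. RowKeyRange and
its methods are hypothetical stand-ins, not HBase API. The one case it flags
is a hash of -1, where the stop key wraps to all zeros and sorts before the
start key.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-in for the key helpers in the question; uses only the JDK.
final class RowKeyRange {

    // Same rule as the question: empty string hashes to 0.
    static int hash(String language) {
        return language.length() > 0 ? language.hashCode() : 0;
    }

    // 12-byte key: languageHash, accountID, timestamp, each a big-endian int
    // (ByteBuffer is big-endian by default).
    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    static byte[] startKey(String language) {
        return key(hash(language), 0, 0);
    }

    static byte[] endKey(String language) {
        return key(hash(language) + 1, 0, 0);
    }

    // True when start sorts strictly before end in unsigned byte order.
    static boolean isValidRange(byte[] start, byte[] end) {
        return Arrays.compareUnsigned(start, end) < 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidRange(startKey("JA"), endKey("JA")));
        // A hash of -1 is 0xFFFFFFFF; adding 1 wraps the prefix to 0x00000000.
        System.out.println(isValidRange(key(-1, 0, 0), key(0, 0, 0)));
    }
}
```

For "JA" the hash is 2359, so the range is well-formed. A scan built from an
inverted pair of keys would need separate handling, for example by leaving
the stop row empty.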
