Hi James,

I do plan on looking more closely at Phoenix for SQL access to HBase.
Thanks.
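For reference, a minimal sketch of what James's Phoenix suggestion (below) might look like over a vid,sid,event-style composite key, through the Phoenix JDBC driver. The connection string, the events table, and all values are assumptions for illustration, not code from this thread; whether the skip scan is actually chosen depends on Phoenix's query plan.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSkipScanSketch {
    public static void main(String[] args) throws Exception {
        // "localhost" stands in for the cluster's ZooKeeper quorum; the
        // Phoenix JDBC driver registers itself when its jar is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {

            // Hypothetical table modeling the vid,sid,event row key as a
            // composite primary key instead of a delimited string.
            stmt.execute("CREATE TABLE IF NOT EXISTS events ("
                + " vid VARCHAR NOT NULL,"
                + " sid VARCHAR NOT NULL,"
                + " event VARCHAR NOT NULL"
                + " CONSTRAINT pk PRIMARY KEY (vid, sid, event))");

            // vid and event are pinned while sid is left open; Phoenix can
            // plan a skip scan over the key space rather than filtering a
            // full range scan client-side.
            ResultSet rs = stmt.executeQuery(
                "SELECT vid, sid, event FROM events"
                + " WHERE vid = 'v123' AND event = 'Logon'");
            while (rs.next()) {
                System.out.println(rs.getString("sid"));
            }
        }
    }
}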
-----Original Message-----
From: James Taylor [mailto:[email protected]]
Sent: Saturday, June 22, 2013 1:18 PM
To: [email protected]
Subject: Re: Scan performance

Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a
SQL skin over HBase? It has a skip scan that will let you model a
multi-part row key and skip through it efficiently, as you've described.
Take a look at this blog for more info:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1
Regards,
James

On Jun 22, 2013, at 6:29 AM, "lars hofhansl" <[email protected]> wrote:

> Yep, generally you should design your keys such that start/stopKey can
> efficiently narrow the scope.
>
> If that really cannot be done (and you should try hard), the 2nd-best
> option is "skip scans".
>
> Filters in HBase allow for providing the scanner framework with hints on
> where to go next.
> They can skip to the next column (to avoid looking at many versions), to
> the next row (to avoid looking at many columns), or they can provide a
> custom seek hint to a specific key value. The latter is what
> FuzzyRowFilter does.
>
> -- Lars
>
> ________________________________
> From: Anoop John <[email protected]>
> To: [email protected]
> Sent: Friday, June 21, 2013 11:58 PM
> Subject: Re: Scan performance
>
> Have a look at FuzzyRowFilter.
>
> -Anoop-
>
> On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <[email protected]> wrote:
>
>> I understand more, but have additional questions about the internals...
>>
>> So, in this example I have 6000 rows x 40 columns in this table. In
>> this test my startRow and stopRow do not narrow the scan criteria;
>> therefore, all 6000x40 KVs must be included in the search and thus read
>> from disk and into memory.
>>
>> The first filter that I used was:
>>
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>>
>> This means that HBase must look for the qualifier column on all 6000
>> rows. As you mention, I could add certain columns to a different cf;
>> but unfortunately, in my case there is no small set of columns that
>> needs to be compared (filtered on). I could try to use indexes so that
>> a complete row key can be calculated from a secondary index in order to
>> perform a faster search against data in a primary table, but this
>> requires additional tables and maintenance that I would like to avoid.
>>
>> I did try a row key filter with a regex, hoping that it would limit the
>> number of rows that were read from disk:
>>
>> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL,
>>     new RegexStringComparator(row_regexpr));
>>
>> My row keys are something like: vid,sid,event. sid is not known at
>> query time, so I can use a regex similar to vid,.*,Logon where Logon is
>> the event that I am looking for in a particular visit. In my test data
>> this should have narrowed the scan to 1 row x 40 columns. The best I
>> could do for start/stop row is vid,0 and vid,~ respectively. I guess
>> that is still going to cause all 6000 rows to be scanned, but the
>> filtering should be more specific with the row key filter. However, I
>> did not see any performance improvement. Anything obvious?
>>
>> Do you have any other ideas to help with performance when the row key
>> is vid,sid,event and sid is not known at query time, which leaves a gap
>> in the start/stop row? Too bad a regex can't be used in the start/stop
>> row specification. That's really what I need.
>>
>> Thanks again.
>> -Tony
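As a concrete illustration of the FuzzyRowFilter suggestion above, here is a minimal sketch for the vid,sid,event case. FuzzyRowFilter matches at fixed byte offsets, so the sketch assumes a fixed-width binary key layout (8-byte vid, 8-byte sid, then the event name); the table name and values are illustrative assumptions, and the event is compared only up to the template's length.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyRowScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events"); // table name is an assumption

        // Assumed fixed-width layout: 8-byte vid, 8-byte sid, then the event
        // name. The sid bytes in the template are placeholders (zeros).
        byte[] vid = Bytes.toBytes(12345L);
        byte[] template = Bytes.add(vid, new byte[8], Bytes.toBytes("Logon"));

        // Mask semantics: 0 = byte must match, 1 = any byte.
        // Mark the unknown sid positions (bytes 8..15) as fuzzy.
        byte[] mask = new byte[template.length];
        Arrays.fill(mask, 8, 16, (byte) 1);

        // The filter hands the scanner SEEK_NEXT_USING_HINT hints, so whole
        // non-matching sid ranges are seeked over instead of read and filtered.
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
            Arrays.asList(new Pair<byte[], byte[]>(template, mask))));

        ResultScanner rs = table.getScanner(scan);
        for (Result r : rs) {
            System.out.println(Bytes.toStringBinary(r.getRow()));
        }
        rs.close();
        table.close();
    }
}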
>> -----Original Message-----
>> From: Vladimir Rodionov [mailto:[email protected]]
>> Sent: Friday, June 21, 2013 8:00 PM
>> To: [email protected]; lars hofhansl
>> Subject: RE: Scan performance
>>
>> Lars,
>> I thought that the column family is the locality group, and that
>> placing columns which are frequently accessed together into the same
>> column family (locality group) is the obvious performance-improvement
>> tip. What are the "essential column families" for in this context?
>>
>> As for the original question: unless you place your column into a
>> separate column family in Table 2, you will need to scan (load from
>> disk if not cached) ~40x more data for the second table (because you
>> have 40 columns). This may explain why you see such a difference in
>> execution time if all data needs to be loaded first from HDFS.
>>
>> Best regards,
>> Vladimir Rodionov
>> Principal Platform Engineer
>> Carrier IQ, www.carrieriq.com
>> e-mail: [email protected]
>>
>> ________________________________________
>> From: lars hofhansl [[email protected]]
>> Sent: Friday, June 21, 2013 3:37 PM
>> To: [email protected]
>> Subject: Re: Scan performance
>>
>> HBase is a key-value (KV) store. Each column is stored in its own KV; a
>> row is just a set of KVs that happen to have the same row key (which is
>> the first part of the key). I tried to summarize this here:
>> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>>
>> In the StoreFiles all KVs are sorted in row/column order, but HBase
>> still needs to skip over many KVs in order to "reach" the next row, so
>> more disk and memory IO is needed.
>>
>> If you are using 0.94, there is a new feature called "essential column
>> families". If you always search by the same column, you can place that
>> one in its own column family and all other columns in another column
>> family. In that case your scan performance should be close to
>> identical.
>>
>> -- Lars
>> ________________________________
>>
>> From: Tony Dean <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Friday, June 21, 2013 2:08 PM
>> Subject: Scan performance
>>
>> Hi,
>>
>> I hope that you can shed some light on these 2 scenarios below.
>>
>> I have 2 small tables of 6000 rows.
>> Table 1 has only 1 column in each of its rows.
>> Table 2 has 40 columns in each of its rows.
>> Other than that, the two tables are identical.
>>
>> In both tables there is only 1 row that contains a matching column that
>> I am filtering on, and the Scan performs correctly in both cases by
>> returning only the single result.
>>
>> The code looks something like the following:
>>
>> Scan scan = new Scan(startRow, stopRow); // start/stop rows include all 6000 rows
>> scan.addColumn(cf, qualifier); // only return the column that I am interested in
>>                                // (should only be in 1 row, and only 1 version)
>>
>> Filter f1 = new InclusiveStopFilter(stopRow);
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>> scan.setFilter(new FilterList(f1, f2));
>>
>> scan.setTimeRange(0, Long.MAX_VALUE);
>> scan.setMaxVersions(1);
>>
>> ResultScanner rs = t.getScanner(scan);
>> for (Result result : rs) {
>>     ...
>> }
>>
>> For table 1, rs.next() takes about 30ms.
>> For table 2, rs.next() takes about 180ms.
>>
>> Both are returning the exact same result. Why is it taking so much
>> longer on table 2 to get the same result? The scan depth is the same;
>> the only difference is the column width.
>> But I'm filtering on a single column and returning only that column.
>>
>> Am I missing something? As I increase the number of columns, the
>> response time gets worse. I do expect the response time to get worse
>> when increasing the number of rows, but not when increasing the number
>> of columns, since I'm returning only 1 column in both cases.
>>
>> I appreciate any comments that you have.
>>
>> -Tony
>>
>> Tony Dean
>> SAS Institute Inc.
>> Principal Software Developer
>> 919-531-6704 ...
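A minimal sketch of lars's "essential column families" suggestion above (HBase 0.94.5 or later): keep the column you filter on in a small family so that the wide family is only loaded for rows that pass the filter. The family names, table name, and values are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EssentialCfSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable t = new HTable(conf, "table2");   // table name is an assumption

        byte[] meta = Bytes.toBytes("meta");     // small family holding the filtered column
        byte[] data = Bytes.toBytes("data");     // wide family with the other 40 columns
        byte[] qualifier = Bytes.toBytes("q");
        byte[] value = Bytes.toBytes("v");

        Scan scan = new Scan();
        scan.addColumn(meta, qualifier);
        scan.addFamily(data);

        SingleColumnValueFilter f = new SingleColumnValueFilter(
            meta, qualifier, CompareFilter.CompareOp.EQUAL, value);
        // With filterIfMissing set, only "meta" is treated as essential, so
        // with on-demand loading the "data" blocks are fetched lazily, and
        // only for rows that pass the filter.
        f.setFilterIfMissing(true);
        scan.setFilter(f);
        scan.setLoadColumnFamiliesOnDemand(true);

        ResultScanner rs = t.getScanner(scan);
        for (Result r : rs) {
            System.out.println(Bytes.toStringBinary(r.getRow()));
        }
        rs.close();
        t.close();
    }
}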
