RE: Scan performance

Tony Dean Wed, 03 Jul 2013 08:00:28 -0700

Thanks Ted.

-----Original Message-----
From: Ted Yu [mailto:[email protected]] 
Sent: Tuesday, July 02, 2013 6:11 PM
To: [email protected]
Subject: Re: Scan performance


Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <[email protected]> wrote:

> The following information is what I discovered from Scan performance 
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are 
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row.  The 
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon    event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result 
> is ~5ms.  This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at 
> the time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to 
> get Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case 2 by using 
> FuzzyRowFilter instead of SingleColumnValueFilter.  It's a good 
> candidate since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single 
> result was ~5ms (pretty much the same response time as case 1 where I 
> knew the complete row key).
>
> I didn't expect such an improvement.  Can you explain how 
> FuzzyRowFilter optimizes scanning rows from disk?  In my case it needs 
> to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon".  Then it can just 
> stop after that; thereby optimizing the scan, correct?  So, 
> optimization using FuzzyRowFilter is very dependent upon the data that you 
> are scanning.
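
One way to picture the speed-up asked about above: a filter that hands the scanner a seek hint lets it jump over runs of non-matching keys instead of reading them. The sketch below is a toy model in plain Java, not HBase code. FuzzySeekSketch, fuzzyScan and sampleKeys are hypothetical names, and the row keys are made-up strings in the position1,position2,position3 layout. The real FuzzyRowFilter works on bytes with a per-byte mask, so treat this only as an illustration of the seek idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a seek-hint filter over sorted row keys; not HBase code. */
class FuzzySeekSketch {
    /** Number of row keys examined by the last call to fuzzyScan(). */
    static int touched;

    /** Builds keys "vid,sid-NNNN,event" for every sid/event pair (made-up data). */
    static TreeSet<String> sampleKeys(int sids, String[] events) {
        TreeSet<String> keys = new TreeSet<>();
        for (int s = 0; s < sids; s++)
            for (String e : events)
                keys.add(String.format("vid,sid-%04d,%s", s, e));
        return keys;
    }

    /** Returns keys matching prefix,<any position2>,event, seeking past non-matching runs. */
    static List<String> fuzzyScan(TreeSet<String> keys, String prefix, String event) {
        List<String> hits = new ArrayList<>();
        touched = 0;
        String cur = keys.ceiling(prefix + ",");
        while (cur != null && cur.startsWith(prefix + ",")) {
            touched++;
            String[] p = cur.split(",", 3);   // position1, position2, position3
            int cmp = p[2].compareTo(event);
            if (cmp == 0) {
                hits.add(cur);
                cur = keys.higher(cur);
            } else if (cmp < 0) {
                // Same position2: jump forward to the wanted position3.
                cur = keys.ceiling(p[0] + "," + p[1] + "," + event);
            } else {
                // Past the wanted position3: jump to the next position2.
                // '-' sorts right after ',', so this hint is past every "p0,p1,..." key.
                cur = keys.ceiling(p[0] + "," + p[1] + "-");
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] events = {"Alert", "Browse", "Logoff", "Logon", "Purchase", "Search", "View"};
        TreeSet<String> keys = sampleKeys(1000, events);
        List<String> hits = fuzzyScan(keys, "vid", "Logon");
        System.out.println(hits.size() + " hits, examined " + touched + " of " + keys.size() + " keys");
    }
}
```

In this model the scan does not stop once position3 passes "Logon"; it seeks to the next position2 value and carries on. How many keys get skipped depends on how many rows share each position2 value, which fits the observation that the gain depends on the data being scanned.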
>
> Thanks for any insight.
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:[email protected]]
> Sent: Monday, June 24, 2013 5:05 PM
> To: [email protected]
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block 
> size (64k by default). Otherwise it still needs to touch each block.
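
The block-size point above can be checked with a little arithmetic. The sketch below is a rough model, assuming rows are packed back to back and that skipping a row means seeking straight to the start of the next one; BlockSkipSketch and the row sizes are hypothetical, and only the 64k block size comes from the message. A block has to be read whenever a row starts in it, because the row key must be checked.

```java
import java.util.HashSet;
import java.util.Set;

/** Rough model of which HFile blocks a row-skipping filter still has to read. */
class BlockSkipSketch {
    /** Counts blocks that contain at least one row start (those must be read to check the key). */
    static long blocksTouched(int rows, long rowBytes, long blockBytes) {
        Set<Long> blocks = new HashSet<>();
        for (int r = 0; r < rows; r++) {
            blocks.add(r * rowBytes / blockBytes);  // block index holding the start of row r
        }
        return blocks.size();
    }

    /** Total number of blocks needed to hold all rows. */
    static long totalBlocks(int rows, long rowBytes, long blockBytes) {
        return (rows * rowBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long block = 64 * 1024;                       // default block size from the message
        long[] rowSizes = {4 * 1024, 256 * 1024};     // assumed row sizes, for illustration only
        for (long rowBytes : rowSizes) {
            System.out.println("row size " + rowBytes + " bytes: "
                + blocksTouched(6000, rowBytes, block) + " of "
                + totalBlocks(6000, rowBytes, block) + " blocks touched");
        }
    }
}
```

With rows smaller than a block, every block holds a row start, so all of them are read; only when a row spans several blocks can the blocks in between be skipped.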
>
> An HTable does some priming when it is created. The region information 
> for all tables could be substantial, so it does not make much sense to 
> prime the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable and/or 
> HConnection you should be OK.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Tony Dean <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl < 
> [email protected]>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for a window to swap the HBase jars in the cluster for ones 
> that support FuzzyRowFilter so that I can try it out.  In the meantime, I'm 
> wondering why RowFilter with a regex is not more helpful.  I'm guessing that 
> FuzzyRowFilter helps with disk IO while RowFilter just filters after 
> the disk IO has completed.  Also, I turned on the row-level bloom filter, 
> which does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to 
> prime client connection information and such so that the first client 
> query isn't miserably slow.  After the first query, response times do 
> get considerably better due to caching necessary information.  Is 
> there a way to get around this first initial hit?  I assume any such 
> priming would have to be application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:[email protected]]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: [email protected]
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but 
> want to return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
>
>
>
> ________________________________
> From: Vladimir Rodionov <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl < 
> [email protected]>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I thought that a column family is the locality group, and that placing 
> columns which are frequently accessed together into the same column 
> family (locality group) is the obvious performance improvement tip. 
> What are the "essential column families" for in this context?
>
> As for the original question: unless you place your column into a 
> separate column family in Table 2, you will need to scan (load from 
> disk if not cached) ~40x more data for the second table (because you 
> have 40 columns). This may explain why you see such a difference in 
> execution time if all data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: lars hofhansl [[email protected]]
> Sent: Friday, June 21, 2013 3:37 PM
> To: [email protected]
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV; 
> a row is just a set of KVs that happen to have the same row key (which is 
> the first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase 
> still needs to skip over many KVs in order to "reach" the next row. So 
> more disk and memory IO is needed.
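
The effect of the KV layout described above can be sketched with a toy model: one sorted map keyed by row and column, walked in order. This is plain Java, not the HBase reader (which can also seek rather than step), and KvLayoutSketch, build and scanOneColumn are hypothetical names. The 6000 rows and the 1 versus 40 columns come from the original question.

```java
import java.util.TreeMap;

/** Toy model of a StoreFile as one sorted map of "row/column" -> value. */
class KvLayoutSketch {
    /** Builds rows x cols KVs, sorted in row/column order. */
    static TreeMap<String, String> build(int rows, int cols) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                store.put(String.format("row-%05d/col-%02d", r, c), "v");
        return store;
    }

    /** Walks all KVs in order; returns {KVs visited, KVs belonging to the wanted column}. */
    static int[] scanOneColumn(TreeMap<String, String> store, String column) {
        int visited = 0, returned = 0;
        for (String key : store.keySet()) {
            visited++;                                   // every KV on the way is stepped over
            if (key.endsWith("/" + column)) returned++;  // only this column is of interest
        }
        return new int[] {visited, returned};
    }

    public static void main(String[] args) {
        for (int cols : new int[] {1, 40}) {
            int[] r = scanOneColumn(build(6000, cols), "col-00");
            System.out.println(cols + " column(s) per row: visited " + r[0]
                + " KVs to find " + r[1]);
        }
    }
}
```

Under this model the 40-column table steps over 40 times as many KVs to reach the same 6000 values of the one column.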
>
> If you are using 0.94 there is a new feature, "essential column families". 
> If you always search by the same column you can place that one in its 
> own column family and all other columns in another column family. In 
> that case your scan performance should be close to identical.
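
The layout suggested above can be sketched the same way: the filtered column lives in a small family of its own, and the other family is read only for rows that pass the filter. This is a toy model in plain Java, not HBase code; EssentialFamilySketch and its methods are hypothetical names, and the split into 1 filtered column plus 39 others is an assumption based on the 40-column table in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of filtering on one small column family and loading the other on demand. */
class EssentialFamilySketch {
    /** KVs read by the last call to scan(). */
    static int kvsRead;

    /** Filter family: one KV per row; only matchRow carries the wanted value. */
    static TreeMap<String, String> filterFamily(int rows, int matchRow) {
        TreeMap<String, String> f = new TreeMap<>();
        for (int r = 0; r < rows; r++)
            f.put(String.format("row-%05d", r), r == matchRow ? "wanted" : "other");
        return f;
    }

    /** Data family: cols KVs per row. */
    static TreeMap<String, List<String>> dataFamily(int rows, int cols) {
        TreeMap<String, List<String>> d = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            List<String> kvs = new ArrayList<>();
            for (int c = 0; c < cols; c++) kvs.add("col-" + c);
            d.put(String.format("row-%05d", r), kvs);
        }
        return d;
    }

    /** Reads every KV of the filter family, but the data family only for matching rows. */
    static List<String> scan(TreeMap<String, String> f, TreeMap<String, List<String>> d, String wanted) {
        kvsRead = 0;
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : f.entrySet()) {
            kvsRead++;
            if (e.getValue().equals(wanted)) {
                kvsRead += d.get(e.getKey()).size();  // other columns loaded on demand
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> m = scan(filterFamily(6000, 1234), dataFamily(6000, 39), "wanted");
        System.out.println(m + " found after reading " + kvsRead + " KVs");
    }
}
```

In this model the scan reads the 6000 KVs of the filter family plus the 39 KVs of the one matching row, which is close to what the single-column table costs.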
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on.   And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> Scan scan = new Scan(startRow, stopRow);  // start/stop rows should include all 6000 rows
> scan.addColumn(cf, qualifier);  // only return the column that I am interested in
>                                 // (should only be in 1 row and only 1 version)
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result : rs) {
>     // iterate the results (only one is expected)
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result.  Why is it taking so much longer
> on table 2 to get the same result?  The scan depth is the same.  The only
> difference is the column width.  But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something?  As I increase the number of columns, the response
> time gets worse.  I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in
> both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704          ...
>
