RE: Scan performance

Tony Dean Tue, 02 Jul 2013 14:32:06 -0700

The following information is what I discovered from Scan performance testing.


Setup
-------
row key format:
position1,position2,position3
where position1 is a fixed literal, and position2 and position3 are variable 
data.

I have created data with 6000 rows with ~40 columns in each row.  The table 
contains only 1 column family.

The row that I want to query is:
vid,sid-0,Logon    event:customer value=?

-------

Case 1:
use fully qualified row specification in start/stop row key (e.g., 
vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.

avg response time to get Scan iterator and iterate the single result is ~5ms.  
This is expected.


Case 2:
This is the normal case where position2 in the row key is unknown at the time 
of the query: vid,?,Logon.
Using a SingleColumnValueFilter in the Scan, the avg response time to get Scan 
iterator and iterate the single result is ~100ms.
This is the use case that I'm trying to improve upon.

Case 3:
After upgrading to 0.94.8 I was able to change Case 2 by using FuzzyRowFilter 
instead of SingleColumnValueFilter.  It's a good candidate since I know 
position1 and position3.
The avg response time to get Scan iterator and iterate the single result was 
~5ms (pretty much the same response time as case 1 where I knew the complete 
row key).

I didn't expect such an improvement.  Can you explain how FuzzyRowFilter 
optimizes scanning rows from disk?  In my case it needs to scan rows 
(vid,?,xxxx) until xxxx is greater than "Logon".  Then it can just stop after 
that; thereby optimizing the scan, correct?  So, optimization using 
FuzzyRowFilter is very dependent upon the data that you are scanning.
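
For readers following along, here is a minimal sketch of the key/mask idea behind the fuzzy match in Case 3. It uses plain Java only and is not the filter's own code. The class and method names (FuzzyKeySketch, fuzzyKey, fuzzyMask, matches) are hypothetical, the '?' marker is just a convention of this sketch, and it assumes position2 is always exactly 5 bytes wide, because a mask describes fixed byte positions. The mask convention assumed here is 0 for "this byte must match" and 1 for "any byte". In HBase the resulting pair would be handed to FuzzyRowFilter as a list of (key, mask) pairs. As far as I understand it, the filter can also hint the scanner toward the next row key that could match, so non-matching ranges can be skipped rather than read row by row.

```java
import java.nio.charset.StandardCharsets;

public class FuzzyKeySketch {
    // Mask convention assumed in this sketch: 0 = byte must match, 1 = any byte.
    static final byte FIXED = 0;
    static final byte WILDCARD = 1;

    /** Template key bytes; '?' is only a placeholder in wildcard positions. */
    static byte[] fuzzyKey(String template) {
        return template.getBytes(StandardCharsets.US_ASCII);
    }

    /** Mask with 1 wherever the template has '?', 0 everywhere else. */
    static byte[] fuzzyMask(String template) {
        byte[] mask = new byte[template.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = template.charAt(i) == '?' ? WILDCARD : FIXED;
        }
        return mask;
    }

    /** True when every fixed position of the row equals the template byte. */
    static boolean matches(byte[] row, byte[] key, byte[] mask) {
        if (row.length < key.length) {
            return false; // simplification of this sketch: shorter rows never match
        }
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == FIXED && row[i] != key[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // ASSUMPTION: position2 ("sid-0", "sid-1", ...) is always 5 bytes wide.
        String template = "vid,?????,Logon";
        byte[] key = fuzzyKey(template);
        byte[] mask = fuzzyMask(template);
        String[] rows = {"vid,sid-0,Logon", "vid,sid-1,Logoff", "vid,sid-2,Logon"};
        for (String row : rows) {
            byte[] rowBytes = row.getBytes(StandardCharsets.US_ASCII);
            System.out.println(row + " -> " + matches(rowBytes, key, mask));
        }
    }
}
```

With the three sample rows, the two Logon rows match and the Logoff row does not. If position2 can vary in length, a single key/mask pair like this would not cover every row, and one pair per possible length would be needed.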

Thanks for any insight.


-----Original Message-----
From: lars hofhansl [mailto:[email protected]] 
Sent: Monday, June 24, 2013 5:05 PM
To: [email protected]
Subject: Re: Scan performance

RowFilter can help. It depends on the setup.
RowFilter skips all columns of a row when the row key does not match.
That will help with IO *if* your rows are larger than the HFile block size (64k 
by default). Otherwise it still needs to touch each block.
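
To put rough numbers on the block-size point, here is a back-of-the-envelope sketch. The 64k block size and the 40 columns per row come from this thread. The 100 bytes per KV is purely an assumption for illustration, since nobody measured it here, and the class and method names are hypothetical.

```java
public class BlockMathSketch {
    /** Whole rows that fit in one HFile block for an assumed per-KV size. */
    static long rowsPerBlock(long blockBytes, int columnsPerRow, long assumedBytesPerKv) {
        long rowBytes = columnsPerRow * assumedBytesPerKv;
        return blockBytes / rowBytes;
    }

    public static void main(String[] args) {
        long blockBytes = 64L * 1024; // default HFile block size mentioned above
        int columns = 40;             // columns per row in the test table
        long bytesPerKv = 100;        // ASSUMPTION: not measured in this thread
        // 65536 / (40 * 100) = 16 whole rows per block
        System.out.println("rows per block: " + rowsPerBlock(blockBytes, columns, bytesPerKv));
    }
}
```

Under that assumption a block holds about 16 rows, so skipping the remaining columns of a non-matching row saves per-KV work but not the block read itself. A result of 0 would mean a single row spans more than one block, which is the case where skipping can save IO.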

An HTable does some priming when it is created. The region information for all 
tables could be substantial, so it does not make much sense to prime the cache 
for all tables.
How are you using the client? If you pre-create and reuse an HTable and/or 
HConnection you should be OK.
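
A generic sketch of the "create once, then reuse" pattern, in plain Java. It does not use the HBase client API: the Lazy holder and the string standing in for a table handle are hypothetical, and in a real client the factory would be whatever code constructs the HTable or HConnection.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ReuseSketch {
    /** Builds the expensive object on first use and returns the same instance afterwards. */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> factory;
        private T instance;

        Lazy(Supplier<T> factory) {
            this.factory = factory;
        }

        @Override
        public synchronized T get() {
            if (instance == null) {
                instance = factory.get();
            }
            return instance;
        }
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        // Stand-in for an expensive client handle (hypothetical).
        Lazy<String> table = new Lazy<>(() -> "table-handle-" + created.incrementAndGet());

        table.get(); // prime once at application startup
        System.out.println(table.get() + " created " + created.get() + " time(s)");
    }
}
```

Calling get() during startup moves the one-time setup cost out of the first user-facing query. Since HTable instances are not meant to be shared between threads, a multi-threaded client would keep one such holder per thread rather than one global instance.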


-- Lars



________________________________
 From: Tony Dean <[email protected]>
To: "[email protected]" <[email protected]>; lars hofhansl 
<[email protected]> 
Sent: Monday, June 24, 2013 1:48 PM
Subject: RE: Scan performance
 

Lars,
I'm waiting for a window to swap in HBase jars on the cluster that support 
FuzzyRowFilter so that I can try it out.  In the meantime, I'm wondering why 
the RowFilter regex is not more helpful.  I'm guessing that FuzzyRowFilter helps 
with disk IO while RowFilter just filters after the disk IO has completed.  
Also, I turned on the row-level bloom filter, which does not seem to help either.

On a different performance note, I'm wondering if there is a way to prime 
client connection information and such so that the first client query isn't 
miserably slow.  After the first query, response times do get considerably 
better due to caching necessary information.  Is there a way to get around this 
first initial hit?  I assume any such priming would have to be application 
specific.

Thanks.

-----Original Message-----
From: lars hofhansl [mailto:[email protected]] 
Sent: Saturday, June 22, 2013 9:24 AM
To: [email protected]
Subject: Re: Scan performance

"essential column families" help when you filter on one column but want to 
return *other* columns for the rows that matched the column.

Check out HBASE-5416.

-- Lars



________________________________
From: Vladimir Rodionov <[email protected]>
To: "[email protected]" <[email protected]>; lars hofhansl 
<[email protected]> 
Sent: Friday, June 21, 2013 5:00 PM
Subject: RE: Scan performance


Lars,
I thought that a column family is the locality group, and that placing columns 
which are frequently accessed together into the same column family (locality 
group) is the obvious performance improvement tip. What are the "essential 
column families" for in this context?

As for the original question: unless you place your column into a separate 
column family in Table 2, you will need to scan (load from disk if not cached) 
~40x more data for the second table (because you have 40 columns). This may 
explain why you see such a difference in execution time if all data needs to 
be loaded first from HDFS.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: lars hofhansl [[email protected]]
Sent: Friday, June 21, 2013 3:37 PM
To: [email protected]
Subject: Re: Scan performance

HBase is a key value (KV) store. Each column is stored in its own KV; a row is 
just a set of KVs that happen to share the same row key (which is the first 
part of the key).
I tried to summarize this here: 
http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

In the StoreFiles all KVs are sorted in row/column order, but HBase still needs 
to skip over many KVs in order to "reach" the next row. So more disk and memory 
IO is needed.

If you are using 0.94 there is a new feature, "essential column families". If 
you always search by the same column you can place that one in its own column 
family and all other columns in another column family. In that case your scan 
performance should be close to identical.
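
A rough count of the KVs a full scan has to walk in each layout, using the numbers from this thread (6000 rows, 40 columns, 1 matching row). This is a simplified model with hypothetical names: it ignores block granularity, caching, and multiple versions, and it assumes the non-filtered columns are loaded only for rows that match.

```java
public class KvCountSketch {
    /** KVs walked when every column lives in a single column family. */
    static long singleFamily(long rows, int columnsPerRow) {
        return rows * columnsPerRow;
    }

    /** KVs walked when the filtered column has its own family and the rest load only for matches. */
    static long splitFamilies(long rows, int columnsPerRow, long matchingRows) {
        return rows + matchingRows * (columnsPerRow - 1);
    }

    public static void main(String[] args) {
        long rows = 6000;
        System.out.println("table 1 (1 column):       " + singleFamily(rows, 1));      // 6000
        System.out.println("table 2 (40 columns):     " + singleFamily(rows, 40));     // 240000
        System.out.println("table 2, split families:  " + splitFamilies(rows, 40, 1)); // 6039
    }
}
```

In this model the split layout walks 6039 KVs instead of 240000, which is why the scan time should end up close to that of the single-column table.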


-- Lars
________________________________

From: Tony Dean <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance




Hi,

I hope that you can shed some light on these 2 scenarios below.

I have 2 small tables of 6000 rows.
Table 1 has only 1 column in each of its rows.
Table 2 has 40 columns in each of its rows.
Other than that the two tables are identical.

In both tables there is only 1 row that contains a matching column that I am 
filtering on.   And the Scan performs correctly in both cases by returning only 
the single result.

The code looks something like the following:

// the start/stop rows should include all 6000 rows
Scan scan = new Scan(startRow, stopRow);
// only return the column that I am interested in
// (should only be in 1 row and only 1 version)
scan.addColumn(cf, qualifier);

Filter f1 = new InclusiveStopFilter(stopRow);
Filter f2 = new SingleColumnValueFilter(cf, qualifier,
    CompareFilter.CompareOp.EQUAL, value);
scan.setFilter(new FilterList(f1, f2));

scan.setTimeRange(0, Long.MAX_VALUE);
scan.setMaxVersions(1);

ResultScanner rs = t.getScanner(scan);
for (Result result : rs)
{
    // process the single matching result
}
rs.close();  // release the scanner when done

For table 1, rs.next() takes about 30ms.
For table 2, rs.next() takes about 180ms.

Both are returning the exact same result.  Why is it taking so much longer on 
table 2 to get the same result?  The scan depth is the same.  The only 
difference is the column width.  But I’m filtering on a single column and 
returning only that column.

Am I missing something?  As I increase the number of columns, the response time 
gets worse.  I do expect the response time to get worse when increasing the 
number of rows, but not by increasing the number of columns since I’m returning 
only 1 column in both cases.

I appreciate any comments that you have.

-Tony



Tony Dean
SAS Institute Inc.
Principal Software Developer
919-531-6704          …

