Thanks Ted.

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Tuesday, July 02, 2013 6:11 PM
To: [email protected]
Subject: Re: Scan performance

Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <[email protected]> wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row. The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result
> is ~5ms. This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at
> the time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to
> get Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case 2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter. It's a good
> candidate since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single
> result was ~5ms (pretty much the same response time as case 1 where I
> knew the complete row key).
>
> I didn't expect such an improvement. Can you explain how
> FuzzyRowFilter optimizes scanning rows from disk? In my case it needs
> to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon". Then it can just
> stop after that; thereby optimizing the scan, correct? So,
> optimization using FuzzyRowFilter is very dependent upon the data that
> you are scanning.
>
> Thanks for any insight.
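
To make the FuzzyRowFilter question above concrete, here is a minimal sketch of the matching rule in plain Java, with no HBase dependency. It assumes fixed-width key positions (FuzzyRowFilter works on byte offsets, so position2 would need a known length), and the class FuzzyMatchSketch, its helpers, and the sample keys are hypothetical names made up for illustration, not HBase API. A mask byte of 0 means the key byte at that offset must match; 1 means any byte is accepted. The real filter also gives the scanner a hint for the next row key that could match, which is what lets it skip ahead instead of reading every row; this sketch only shows the match test.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FuzzyMatchSketch {

    // Hypothetical helper: '?' in the pattern marks a byte that may hold anything.
    static byte[] maskFor(String pattern) {
        byte[] mask = new byte[pattern.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (pattern.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    // A row matches when every fixed (mask == 0) offset equals the pattern byte.
    // Bytes past the end of the pattern are not checked.
    static boolean matches(byte[] row, byte[] fuzzyKey, byte[] mask) {
        if (row.length < fuzzyKey.length) {
            return false;
        }
        for (int i = 0; i < fuzzyKey.length; i++) {
            if (mask[i] == 0 && row[i] != fuzzyKey[i]) {
                return false;
            }
        }
        return true;
    }

    // Applies the match test to each row key and keeps the hits, in input order.
    static List<String> select(List<String> rows, String pattern) {
        byte[] key = pattern.getBytes(StandardCharsets.US_ASCII);
        byte[] mask = maskFor(pattern);
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            if (matches(row.getBytes(StandardCharsets.US_ASCII), key, mask)) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up keys: 3-byte position1, 5-byte position2, then the event name.
        List<String> rows = Arrays.asList(
                "vid,sid-0,Logon",
                "vid,sid-1,Logoff",
                "vid,sid-1,Logon",
                "vid,sid-2,Purchase");
        // position1 and position3 are known, position2 is not.
        System.out.println(select(rows, "vid,?????,Logon"));
        // prints [vid,sid-0,Logon, vid,sid-1,Logon]
    }
}
```

Because only the fixed offsets are compared, how much a fuzzy scan saves depends on how the keys are laid out and distributed, which fits the observation that the gain is data dependent.
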
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:[email protected]]
> Sent: Monday, June 24, 2013 5:05 PM
> To: [email protected]
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block
> size (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information
> for all tables could be substantial, so it does not make much sense to
> prime the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable and/or
> HConnection you should be OK.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Tony Dean <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl <
> [email protected]>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for some time to exchange out hbase jars in cluster (that
> support FuzzyRow filter) in order to try out. In the meantime, I'm
> wondering why RowFilter regex is not more helpful. I'm guessing that
> FuzzyRow filter helps in disk io while Row filter just filters after
> the disk io has completed. Also, I turned on row level bloom filter
> which does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to
> prime client connection information and such so that the first client
> query isn't miserably slow. After the first query, response times do
> get considerably better due to caching necessary information. Is
> there a way to get around this first initial hit? I assume any such
> priming would have to be application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:[email protected]]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: [email protected]
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but
> want to return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
>
>
>
> ________________________________
> From: Vladimir Rodionov <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl <
> [email protected]>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I thought that column family is the locality group and placing
> columns which are frequently accessed together into the same column
> family (locality group) is the obvious performance improvement tip.
> What are the "essential column families" for in this context?
>
> As for original question.. Unless you place your column into a
> separate column family in Table 2, you will need to scan (load from
> disk if not cached) ~ 40x more data for the second table (because you
> have 40 columns). This may explain why you see such a difference in
> execution time if all data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: lars hofhansl [[email protected]]
> Sent: Friday, June 21, 2013 3:37 PM
> To: [email protected]
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV,
> a row is just a set of KVs that happen to have the same row key (which is
> the first part of the key).
> (I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html)
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase
> still needs to skip over many KVs in order to "reach" the next row. So
> more disk and memory IO is needed.
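
The point about skipping over KVs can be illustrated with a deliberately simplified model in plain Java. This is a sketch under stated assumptions, not HBase internals: a TreeMap keyed by "row/qualifier" stands in for one sorted store file, and the class KeyValueScanSketch, the key format, and the "hit"/"miss" values are all made up for illustration. The scan walks every entry in order and checks only one qualifier, so the number of entries visited grows with the column count even though a single value comes back.

```java
import java.util.Map;
import java.util.TreeMap;

class KeyValueScanSketch {

    // Builds a sorted "row/qualifier" -> value map. Exactly one row (row 42)
    // holds the value "hit" in col00; every other cell holds "miss".
    static TreeMap<String, String> buildTable(int rows, int columnsPerRow) {
        TreeMap<String, String> kvs = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                String key = String.format("row%05d/col%02d", r, c);
                kvs.put(key, (r == 42 && c == 0) ? "hit" : "miss");
            }
        }
        return kvs;
    }

    // Walks all entries in key order. Returns {entries visited, entries matched}.
    static int[] scan(TreeMap<String, String> kvs, String qualifier, String wanted) {
        int visited = 0;
        int matched = 0;
        for (Map.Entry<String, String> e : kvs.entrySet()) {
            visited++;
            if (e.getKey().endsWith("/" + qualifier) && e.getValue().equals(wanted)) {
                matched++;
            }
        }
        return new int[] {visited, matched};
    }

    public static void main(String[] args) {
        int[] narrow = scan(buildTable(6000, 1), "col00", "hit");
        int[] wide = scan(buildTable(6000, 40), "col00", "hit");
        System.out.println("1 column/row:   visited " + narrow[0] + ", matched " + narrow[1]);
        System.out.println("40 columns/row: visited " + wide[0] + ", matched " + wide[1]);
        // prints:
        // 1 column/row:   visited 6000, matched 1
        // 40 columns/row: visited 240000, matched 1
    }
}
```

In this model the wide table costs 40 times as many visits for the same single result, which is the same ratio as the "~ 40x more data" estimate earlier in the thread.
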
>
> If you are using 0.94 there is a new feature "essential column families".
> If you always search by the same column you can place that one in its
> own column family and all other columns in another column family. In
> that case your scan performance should be close to identical.
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on. And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> // the start/stop rows should include all 6000 rows
> Scan scan = new Scan(startRow, stopRow);
> // only return the column that I am interested in
> // (should only be in 1 row and only 1 version)
> scan.addColumn(cf, qualifier);
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result : rs) {
>
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result. Why is it taking so much longer
> on table 2 to get the same result? The scan depth is the same. The only
> difference is the column width. But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something? As I increase the number of columns, the response
> time gets worse.
> I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704 ...
>
