Hi James,

I do plan on looking more closely at Phoenix for SQL access to HBase.
Thanks.
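For reference, a minimal sketch of what James's Phoenix suggestion (below) might look like over a vid,sid,event-style composite key, through the Phoenix JDBC driver. The connection string, the events table, and all values are assumptions for illustration, not code from this thread; whether the skip scan is actually chosen depends on Phoenix's query plan.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSkipScanSketch {
    public static void main(String[] args) throws Exception {
        // "localhost" stands in for the cluster's ZooKeeper quorum; the
        // Phoenix JDBC driver registers itself when its jar is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {

            // Hypothetical table modeling the vid,sid,event row key as a
            // composite primary key instead of a delimited string.
            stmt.execute("CREATE TABLE IF NOT EXISTS events ("
                + " vid VARCHAR NOT NULL,"
                + " sid VARCHAR NOT NULL,"
                + " event VARCHAR NOT NULL"
                + " CONSTRAINT pk PRIMARY KEY (vid, sid, event))");

            // vid and event are pinned while sid is left open; Phoenix can
            // plan a skip scan over the key space rather than filtering a
            // full range scan client-side.
            ResultSet rs = stmt.executeQuery(
                "SELECT vid, sid, event FROM events"
                + " WHERE vid = 'v123' AND event = 'Logon'");
            while (rs.next()) {
                System.out.println(rs.getString("sid"));
            }
        }
    }
}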
-----Original Message-----
From: James Taylor [mailto:[email protected]]
Sent: Saturday, June 22, 2013 1:18 PM
To: [email protected]
Subject: Re: Scan performance

Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a
SQL skin over HBase? It has a skip scan that will let you model a
multi-part row key and skip through it efficiently, as you've described.
Take a look at this blog for more info:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1
Regards,
James

On Jun 22, 2013, at 6:29 AM, "lars hofhansl" <[email protected]> wrote:

> Yep, generally you should design your keys such that start/stopKey can
> efficiently narrow the scope.
>
> If that really cannot be done (and you should try hard), the 2nd-best
> option is "skip scans".
>
> Filters in HBase allow for providing the scanner framework with hints on
> where to go next.
> They can skip to the next column (to avoid looking at many versions), to
> the next row (to avoid looking at many columns), or they can provide a
> custom seek hint to a specific key value. The latter is what
> FuzzyRowFilter does.
>
> -- Lars
>
> ________________________________
> From: Anoop John <[email protected]>
> To: [email protected]
> Sent: Friday, June 21, 2013 11:58 PM
> Subject: Re: Scan performance
>
> Have a look at FuzzyRowFilter.
>
> -Anoop-
>
> On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <[email protected]> wrote:
>
>> I understand more, but have additional questions about the internals...
>>
>> So, in this example I have 6000 rows x 40 columns in this table. In
>> this test my startRow and stopRow do not narrow the scan criteria;
>> therefore, all 6000x40 KVs must be included in the search and thus read
>> from disk and into memory.
>>
>> The first filter that I used was:
>>
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>>
>> This means that HBase must look for the qualifier column on all 6000
>> rows. As you mention, I could add certain columns to a different cf;
>> but unfortunately, in my case there is no small set of columns that
>> needs to be compared (filtered on). I could try to use indexes so that
>> a complete row key can be calculated from a secondary index in order to
>> perform a faster search against data in a primary table, but this
>> requires additional tables and maintenance that I would like to avoid.
>>
>> I did try a row key filter with a regex, hoping that it would limit the
>> number of rows that were read from disk:
>>
>> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL,
>>     new RegexStringComparator(row_regexpr));
>>
>> My row keys are something like: vid,sid,event. sid is not known at
>> query time, so I can use a regex similar to vid,.*,Logon where Logon is
>> the event that I am looking for in a particular visit. In my test data
>> this should have narrowed the scan to 1 row x 40 columns. The best I
>> could do for start/stop row is vid,0 and vid,~ respectively. I guess
>> that is still going to cause all 6000 rows to be scanned, but the
>> filtering should be more specific with the row key filter. However, I
>> did not see any performance improvement. Anything obvious?
>>
>> Do you have any other ideas to help with performance when the row key
>> is vid,sid,event and sid is not known at query time, which leaves a gap
>> in the start/stop row? Too bad a regex can't be used in the start/stop
>> row specification. That's really what I need.
>>
>> Thanks again.
>> -Tony
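As a concrete illustration of the FuzzyRowFilter suggestion above, here is a minimal sketch for the vid,sid,event case. FuzzyRowFilter matches at fixed byte offsets, so the sketch assumes a fixed-width binary key layout (8-byte vid, 8-byte sid, then the event name); the table name and values are illustrative assumptions, and the event is compared only up to the template's length.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyRowScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events"); // table name is an assumption

        // Assumed fixed-width layout: 8-byte vid, 8-byte sid, then the event
        // name. The sid bytes in the template are placeholders (zeros).
        byte[] vid = Bytes.toBytes(12345L);
        byte[] template = Bytes.add(vid, new byte[8], Bytes.toBytes("Logon"));

        // Mask semantics: 0 = byte must match, 1 = any byte.
        // Mark the unknown sid positions (bytes 8..15) as fuzzy.
        byte[] mask = new byte[template.length];
        Arrays.fill(mask, 8, 16, (byte) 1);

        // The filter hands the scanner SEEK_NEXT_USING_HINT hints, so whole
        // non-matching sid ranges are seeked over instead of read and filtered.
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
            Arrays.asList(new Pair<byte[], byte[]>(template, mask))));

        ResultScanner rs = table.getScanner(scan);
        for (Result r : rs) {
            System.out.println(Bytes.toStringBinary(r.getRow()));
        }
        rs.close();
        table.close();
    }
}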
>> -----Original Message-----
>> From: Vladimir Rodionov [mailto:[email protected]]
>> Sent: Friday, June 21, 2013 8:00 PM
>> To: [email protected]; lars hofhansl
>> Subject: RE: Scan performance
>>
>> Lars,
>> I thought that the column family is the locality group, and that
>> placing columns which are frequently accessed together into the same
>> column family (locality group) is the obvious performance-improvement
>> tip. What are the "essential column families" for in this context?
>>
>> As for the original question: unless you place your column into a
>> separate column family in Table 2, you will need to scan (load from
>> disk if not cached) ~40x more data for the second table (because you
>> have 40 columns). This may explain why you see such a difference in
>> execution time if all data needs to be loaded first from HDFS.
>>
>> Best regards,
>> Vladimir Rodionov
>> Principal Platform Engineer
>> Carrier IQ, www.carrieriq.com
>> e-mail: [email protected]
>>
>> ________________________________________
>> From: lars hofhansl [[email protected]]
>> Sent: Friday, June 21, 2013 3:37 PM
>> To: [email protected]
>> Subject: Re: Scan performance
>>
>> HBase is a key-value (KV) store. Each column is stored in its own KV; a
>> row is just a set of KVs that happen to have the same row key (which is
>> the first part of the key). I tried to summarize this here:
>> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>>
>> In the StoreFiles all KVs are sorted in row/column order, but HBase
>> still needs to skip over many KVs in order to "reach" the next row, so
>> more disk and memory IO is needed.
>>
>> If you are using 0.94, there is a new feature called "essential column
>> families". If you always search by the same column, you can place that
>> one in its own column family and all other columns in another column
>> family. In that case your scan performance should be close to
>> identical.
>>
>> -- Lars
>> ________________________________
>>
>> From: Tony Dean <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Friday, June 21, 2013 2:08 PM
>> Subject: Scan performance
>>
>> Hi,
>>
>> I hope that you can shed some light on these 2 scenarios below.
>>
>> I have 2 small tables of 6000 rows.
>> Table 1 has only 1 column in each of its rows.
>> Table 2 has 40 columns in each of its rows.
>> Other than that, the two tables are identical.
>>
>> In both tables there is only 1 row that contains a matching column that
>> I am filtering on, and the Scan performs correctly in both cases by
>> returning only the single result.
>>
>> The code looks something like the following:
>>
>> Scan scan = new Scan(startRow, stopRow); // start/stop rows include all 6000 rows
>> scan.addColumn(cf, qualifier); // only return the column that I am interested in
>>                                // (should only be in 1 row, and only 1 version)
>>
>> Filter f1 = new InclusiveStopFilter(stopRow);
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>> scan.setFilter(new FilterList(f1, f2));
>>
>> scan.setTimeRange(0, Long.MAX_VALUE);
>> scan.setMaxVersions(1);
>>
>> ResultScanner rs = t.getScanner(scan);
>> for (Result result : rs) {
>>     ...
>> }
>>
>> For table 1, rs.next() takes about 30ms.
>> For table 2, rs.next() takes about 180ms.
>>
>> Both are returning the exact same result. Why is it taking so much
>> longer on table 2 to get the same result? The scan depth is the same;
>> the only difference is the column width.
>> But I'm filtering on a single column and returning only that column.
>>
>> Am I missing something? As I increase the number of columns, the
>> response time gets worse. I do expect the response time to get worse
>> when increasing the number of rows, but not when increasing the number
>> of columns, since I'm returning only 1 column in both cases.
>>
>> I appreciate any comments that you have.
>>
>> -Tony
>>
>> Tony Dean
>> SAS Institute Inc.
>> Principal Software Developer
>> 919-531-6704 ...
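A minimal sketch of lars's "essential column families" suggestion above (HBase 0.94.5 or later): keep the column you filter on in a small family so that the wide family is only loaded for rows that pass the filter. The family names, table name, and values are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EssentialCfSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable t = new HTable(conf, "table2");   // table name is an assumption

        byte[] meta = Bytes.toBytes("meta");     // small family holding the filtered column
        byte[] data = Bytes.toBytes("data");     // wide family with the other 40 columns
        byte[] qualifier = Bytes.toBytes("q");
        byte[] value = Bytes.toBytes("v");

        Scan scan = new Scan();
        scan.addColumn(meta, qualifier);
        scan.addFamily(data);

        SingleColumnValueFilter f = new SingleColumnValueFilter(
            meta, qualifier, CompareFilter.CompareOp.EQUAL, value);
        // With filterIfMissing set, only "meta" is treated as essential, so
        // with on-demand loading the "data" blocks are fetched lazily, and
        // only for rows that pass the filter.
        f.setFilterIfMissing(true);
        scan.setFilter(f);
        scan.setLoadColumnFamiliesOnDemand(true);

        ResultScanner rs = t.getScanner(scan);
        for (Result r : rs) {
            System.out.println(Bytes.toStringBinary(r.getRow()));
        }
        rs.close();
        t.close();
    }
}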
