I think this is exactly what Christian is trying to (and should be trying to) avoid ;).
I can't imagine a use case where you need to filter something and can do it with a server-side filter, and yet you'd want to try to do it on the client side... Filtering on the client side when you can do it on the server side just feels wrong. Especially given that there's a lot of data in HBase (otherwise why would you use it).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Thu, Aug 2, 2012 at 7:09 PM, Matt Corgan <[email protected]> wrote:

> Also Christian, don't forget you can read all the rows back to the client
> and do the filtering there using whatever logic you like. HBase Filters
> can be thought of as an optimization (predicate push-down) over client-side
> filtering. Pulling all the rows over the network will be slower, but I
> don't think we know enough about your data or speed requirements to rule it
> out.
>
> On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau <[email protected]> wrote:
>
> > Hi Christian!
> >
> > Setting secondary indexes aside and assuming you are going with "heavy
> > scans", you can try the two following things to make it much faster. If
> > this is appropriate to your situation, of course.
> >
> > 1.
> >
> > > Is there a more elegant way to collect rows within time range X?
> > > (Unfortunately, the date attribute is not equal to the timestamp that
> > > is stored by hbase automatically.)
> >
> > Can you set the timestamp of the Puts to the one you have in the row key,
> > instead of relying on the one that HBase sets automatically (current ts)?
> > If you can, this will improve reading speed a lot by letting you set a
> > time range on the scanner. It depends on how you are writing your data,
> > of course, but I assume that you mostly write data in a "time-increasing"
> > manner.
> >
> > 2.
> >
> > If your userId has fixed length, or you can change it so that it has
> > fixed length, then you can actually use something like a "wildcard" in
> > the row key.
> > There's a way in a Filter implementation to fast-forward to the record
> > with a specific row key and by doing this skip many records. This might
> > be used as follows:
> > * suppose your userId is 5 characters in length
> > * suppose you are scanning for records with time between 2012-08-01
> >   and 2012-08-08
> > * when you are scanning records and you hit e.g. key
> >   "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you
> >   can tell the scanner from your filter to fast-forward to key
> >   "aaaab_2012-08-01", because you know that all remaining records of
> >   user "aaaaa" don't fall into the interval you need (as the time for
> >   its records will be >= 2012-08-09).
> >
> > As of now, I believe you will have to implement a custom filter to do
> > that. Pointer:
> > org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> > I believe I implemented a similar thing some time ago. If this idea
> > works for you I could look for the implementation and share it if it
> > helps, or maybe even simply add it to the HBase codebase.
> >
> > Hope this helps,
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <[email protected]> wrote:
> >
> > > Excuse my double posting.
> > > Here is the complete mail:
> > >
> > > OK,
> > >
> > > at first I will try the scans.
> > >
> > > If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2)
> > > to be able to use coprocessors.
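Alex's fast-forward idea can be sketched outside HBase as the hint computation such a custom filter would return from its "next key hint" hook. This is a simplified illustration, not code from the thread: the class and method names, and the fixed 5-character userId, are assumptions, and the last-character increment ignores carry (e.g. a userId ending in 'z').

```java
// Sketch of the SEEK_NEXT_USING_HINT idea: given the key just seen, whose
// date lies past the wanted range, compute the key to fast-forward to.
// A real implementation would return this from the Filter's hint method
// alongside ReturnCode.SEEK_NEXT_USING_HINT.
public class SeekHint {
    static final int USER_ID_LEN = 5; // assumed fixed-width userId

    // "aaaaa_2012-08-09_..." + range start "2012-08-01" -> "aaaab_2012-08-01"
    public static String nextKeyHint(String currentKey, String rangeStart) {
        String userId = currentKey.substring(0, USER_ID_LEN);
        // next possible userId: increment the last character (no carry handling)
        char[] next = userId.toCharArray();
        next[USER_ID_LEN - 1]++;
        return new String(next) + "_" + rangeStart;
    }

    public static void main(String[] args) {
        // skip the rest of user "aaaaa" and jump straight to "aaaab" at the
        // start of the queried time range
        System.out.println(nextKeyHint("aaaaa_2012-08-09_3jh345j345kjh", "2012-08-01"));
    }
}
```

The scanner then seeks directly to the hinted key instead of iterating over every remaining row of the current user.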
> > > Currently I'm stuck at the scans because it requires two steps
> > > (therefore maybe some kind of filter chaining is required).
> > >
> > > The key: userId-dateInMillis-sessionId
> > >
> > > At first I need to extract dateInMillis with a regex or substring
> > > (using special delimiters for the date).
> > >
> > > Second, the extracted value must be parsed to Long and set to a
> > > RowFilter comparator like this:
> > >
> > > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL,
> > >     new BinaryComparator(Bytes.toBytes((Long) dateInMillis))));
> > >
> > > How to chain that?
> > > Do I have to write a custom filter?
> > > (Would like to avoid that due to deployment.)
> > >
> > > regards
> > > Chris
> > >
> > > ----- Original message -----
> > > From: Michael Segel <[email protected]>
> > > To: [email protected]
> > > CC:
> > > Sent: 13:52 Wednesday, 1 August 2012
> > > Subject: Re: How to query by rowKey-infix
> > >
> > > Actually with coprocessors you can create a secondary index in short
> > > order. Then your cost is going to be 2 fetches. Trying to do a partial
> > > table scan will be more expensive.
> > >
> > > On Jul 31, 2012, at 12:41 PM, Matt Corgan <[email protected]> wrote:
> > >
> > > > When deciding between a table scan vs a secondary index, you should
> > > > try to estimate what percent of the underlying data blocks will be
> > > > used in the query. By default, each block is 64KB.
> > > >
> > > > If each user's data is small and you are fitting multiple users per
> > > > block, then you're going to need all the blocks, so a tablescan is
> > > > better because it's simpler. If each user has 1MB+ of data then you
> > > > will want to pick out the individual blocks relevant to each date.
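The two-step check Christian describes (extract the dateInMillis infix, then compare it as a long) can be sketched in plain Java as the predicate a custom Filter's row-key check might apply; for combining several *existing* filters, HBase provides FilterList. The key layout (userId-dateInMillis-sessionId with '-' delimiters and no '-' inside userId) and all names below are assumptions for illustration, not code from the thread.

```java
// Plain-Java version of the infix check: pull the dateInMillis field out of
// a key shaped like "userId-dateInMillis-sessionId" and compare it against a
// lower bound. Inside HBase this logic would live in a custom Filter applied
// to each row key, since a stock RowFilter compares the whole key, not an infix.
public class InfixDateCheck {
    // true if the dateInMillis infix of the key is >= the given lower bound
    public static boolean dateAtLeast(String rowKey, long lowerBoundMillis) {
        int first = rowKey.indexOf('-');                // end of userId
        int second = rowKey.indexOf('-', first + 1);    // end of dateInMillis
        long dateInMillis = Long.parseLong(rowKey.substring(first + 1, second));
        return dateInMillis >= lowerBoundMillis;
    }

    public static void main(String[] args) {
        System.out.println(dateAtLeast("user1-1343865600000-sess42", 1343000000000L));
        System.out.println(dateAtLeast("user1-1000-sess42", 2000L));
    }
}
```

Fixed-width fields would make this even simpler (substring at known offsets, or a byte-wise comparison without parsing at all).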
> > > > The secondary index will help you go directly to those sparse
> > > > blocks, but with a cost in complexity, consistency, and extra
> > > > denormalized data that knocks primary data out of your block cache.
> > > >
> > > > If latency is not a concern, I would start with the table scan. If
> > > > that's too slow you add the secondary index, and if you still need
> > > > it faster you do the primary key lookups in parallel as Jerry
> > > > mentions.
> > > >
> > > > Matt
> > > >
> > > > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <[email protected]> wrote:
> > > >
> > > >> Hi Chris:
> > > >>
> > > >> I'm thinking about building a secondary index for primary key
> > > >> lookup, then querying using the primary keys in parallel.
> > > >>
> > > >> I'm interested to see if there are other options too.
> > > >>
> > > >> Best Regards,
> > > >>
> > > >> Jerry
> > > >>
> > > >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <[email protected]> wrote:
> > > >>
> > > >>> Hello there,
> > > >>>
> > > >>> I designed a row key for queries that need best performance (~100 ms),
> > > >>> which looks like this:
> > > >>>
> > > >>> userId-date-sessionId
> > > >>>
> > > >>> These queries (scans) are always based on a userId and sometimes
> > > >>> additionally on a date, too.
> > > >>> That's no problem with the key above.
> > > >>>
> > > >>> However, another kind of query shall be based on a given time range
> > > >>> where the outermost left userId is not given or known.
> > > >>> In this case I need to get all rows covering the given time range
> > > >>> with their date to create a daily report.
> > > >>>
> > > >>> As I can't set wildcards at the beginning of a left-based index for
> > > >>> the scan, I only see the possibility of scanning the index of the
> > > >>> whole table to collect the rowKeys that are inside the time range
> > > >>> I'm interested in.
> > > >>>
> > > >>> Is there a more elegant way to collect rows within time range X?
> > > >>> (Unfortunately, the date attribute is not equal to the timestamp
> > > >>> that is stored by hbase automatically.)
> > > >>>
> > > >>> Could/should one maybe leverage some kind of row key caching to
> > > >>> accelerate the collection process?
> > > >>> Is that covered by the block cache?
> > > >>>
> > > >>> Thanks in advance for any advice.
> > > >>>
> > > >>> regards
> > > >>> Chris
> > > >>>
> > > >>
> > >
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
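Alex's suggestion 1 above (use the date already encoded in the row key as the explicit Put timestamp, so a scanner's time range can skip non-matching data) can be sketched as follows. The key layout "userId_dateInMillis_sessionId" with '_' delimiters is an assumption for illustration; the HBase calls it would feed are noted in comments rather than compiled in.

```java
// Derive the timestamp for a Put from the date field already present in the
// row key. In HBase client code the result would be passed as the explicit
// version timestamp when writing (Put with an explicit ts for the cell) and
// later exploited on reads via Scan#setTimeRange(from, to), which lets the
// region server skip store files whose time ranges don't overlap the query.
public class KeyTimestamp {
    // Extract dateInMillis from a key like "aaaaa_1343865600000_sess42".
    public static long tsFromKey(String rowKey) {
        int first = rowKey.indexOf('_');                // end of userId
        int second = rowKey.indexOf('_', first + 1);    // end of dateInMillis
        return Long.parseLong(rowKey.substring(first + 1, second));
    }

    public static void main(String[] args) {
        // value to use as the cell timestamp instead of HBase's "current ts"
        System.out.println(tsFromKey("aaaaa_1343865600000_sess42"));
    }
}
```

This works best when writes arrive in roughly time-increasing order, as Alex notes, so each store file covers a narrow, disjoint slice of time.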
