I think this is exactly what Christian is trying to (and should be trying
to) avoid ;).

I can't imagine a use case where you need to filter something, can do it
with (at least) a server-side filter, and yet want to do it on the client
side... Doing the filtering on the client side when you can do it on the
server side just feels wrong, especially given that there's a lot of data
in HBase (otherwise why would you use it?).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

On Thu, Aug 2, 2012 at 7:09 PM, Matt Corgan <[email protected]> wrote:

> Also Christian, don't forget you can read all the rows back to the client
> and do the filtering there using whatever logic you like.  HBase Filters
> can be thought of as an optimization (predicate push-down) over client-side
> filtering.  Pulling all the rows over the network will be slower, but I
> don't think we know enough about your data or speed requirements to rule it
> out.
>
>
> On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau <[email protected]> wrote:
>
> > Hi Christian!
> >
> > Setting secondary indexes aside and assuming you are going with "heavy
> > scans", you can try the following two things to make them much faster, if
> > they are appropriate to your situation, of course.
> >
> > 1.
> >
> > > Is there a more elegant way to collect rows within time range X?
> > > (Unfortunately, the date attribute is not equal to the timestamp that
> > > is stored by hbase automatically.)
> >
> > Can you set the timestamp of the Puts to the one you have in the row key,
> > instead of relying on the one that HBase sets automatically (the current
> > ts)? If you can, this lets you set a time range on the scanner, which
> > will improve reading speed a lot. It depends on how you are writing your
> > data, of course, but I assume that you mostly write data in a
> > "time-increasing" manner.
> >
> > 2.
> >
> > If your userId has a fixed length, or you can change it so that it has a
> > fixed length, then you can actually use something like a "wildcard" in
> > the row key. There's a way in a Filter implementation to fast-forward to
> > the record with a specific row key and, by doing this, skip many records.
> > This might be used as follows:
> > * suppose your userId is 5 characters in length
> > * suppose you are scanning for records with time between 2012-08-01
> > and 2012-08-08
> > * when you are scanning records and you hit e.g. the key
> > "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can
> > tell the scanner from your filter to fast-forward to the key
> > "aaaab_2012-08-01", because you know that all remaining records of user
> > "aaaaa" don't fall into the interval you need (as the time for its
> > records will be >= 2012-08-09).
> >
> > As of now, I believe you will have to implement your own custom filter to
> > do that.
> > Pointer:
> > org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> > I believe I implemented a similar thing some time ago. If this idea works
> > for you, I could look for the implementation and share it if it helps. Or
> > maybe even simply add it to the HBase codebase.
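A minimal sketch of the hint computation described above, in plain Java
without any HBase dependency. The class and method names (SeekHintSketch,
seekHint, nextPrefix) are made up for illustration. The key layout
userId_date_sessionId with a 5-character ASCII userId and ISO dates is an
assumption taken from the example key. In a real custom filter, logic like
this would presumably supply the key to seek to whenever the filter answers
with SEEK_NEXT_USING_HINT; this is not the implementation mentioned above.

```java
class SeekHintSketch {
    // Assumed layout: <userId, fixed length>_<yyyy-MM-dd>_<sessionId>
    static final int USER_ID_LENGTH = 5;
    static final int DATE_LENGTH = 10;

    /**
     * Smallest string that sorts after every string starting with the given
     * prefix (ASCII assumed), or null if no such string exists.
     */
    static String nextPrefix(String prefix) {
        char[] chars = prefix.toCharArray();
        for (int i = chars.length - 1; i >= 0; i--) {
            if (chars[i] < 0x7F) {
                chars[i]++;
                // Anything after position i is dropped (carry).
                return new String(chars, 0, i + 1);
            }
        }
        return null;
    }

    /**
     * Returns the row key to fast-forward to, or null if no seek is needed
     * (the row is inside the range) or possible (no later user can exist).
     * Dates are compared as strings, which works for the yyyy-MM-dd format.
     */
    static String seekHint(String rowKey, String startDate, String endDate) {
        String userId = rowKey.substring(0, USER_ID_LENGTH);
        int dateStart = USER_ID_LENGTH + 1;
        String date = rowKey.substring(dateStart, dateStart + DATE_LENGTH);

        if (date.compareTo(startDate) < 0) {
            // Too early: jump forward within the same user.
            return userId + "_" + startDate;
        }
        if (date.compareTo(endDate) > 0) {
            // Past the range: all remaining rows of this user are too late,
            // so jump to the range start of the next possible user.
            String next = nextPrefix(userId);
            return next == null ? null : next + "_" + startDate;
        }
        return null; // inside the range, keep the row
    }

    public static void main(String[] args) {
        System.out.println(
            seekHint("aaaaa_2012-08-09_3jh345j345kjh", "2012-08-01", "2012-08-08"));
    }
}
```

With the example from the list above, the key "aaaaa_2012-08-09_3jh345j345kjh"
yields the hint "aaaab_2012-08-01". The sketch also jumps forward within a
user when the date is still before the range, which is an addition of mine.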
> >
> > Hope this helps,
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> >
> >
> > On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <[email protected]> wrote:
> >
> > >
> > >
> > > Excuse my double posting.
> > > Here is the complete mail:
> > >
> > >
> > > OK,
> > >
> > > at first I will try the scans.
> > >
> > > If that's too slow I will have to upgrade hbase (currently
> > > 0.90.4-cdh3u2) to be able to use coprocessors.
> > >
> > >
> > > Currently I'm stuck at the scans because they require two steps
> > > (therefore maybe some kind of filter chaining is required)
> > >
> > >
> > > The key:  userId-dateInMillis-sessionId
> > >
> > > First I need to extract dateInMillis with a regex or substring (using
> > > special delimiters for the date)
> > >
> > > Second, the extracted value must be parsed to a Long and set on a
> > > RowFilter comparator like this:
> > >
> > > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> > > BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> > >
> > > How do I chain that?
> > > Do I have to write a custom filter?
> > > (Would like to avoid that due to deployment)
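A small sketch of the two steps described above (extract dateInMillis from
the key, then compare it as a Long), in plain Java so it runs without HBase.
The names RowKeyDateSketch, extractDateMillis and inRange are hypothetical,
and it assumes the userId itself contains no "-". This is the per-row check
that a client-side loop or a custom filter would have to apply. Note that
RowFilter with a BinaryComparator compares against the whole row key, not
against an infix of it.

```java
class RowKeyDateSketch {
    /** Extracts dateInMillis from a key of the form userId-dateInMillis-sessionId. */
    static long extractDateMillis(String rowKey) {
        // Limit 3 keeps any further "-" inside the sessionId part.
        String[] parts = rowKey.split("-", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("unexpected row key: " + rowKey);
        }
        return Long.parseLong(parts[1]);
    }

    /** True if the key's date lies in [fromInclusive, toExclusive). */
    static boolean inRange(String rowKey, long fromInclusive, long toExclusive) {
        long date = extractDateMillis(rowKey);
        return date >= fromInclusive && date < toExclusive;
    }

    public static void main(String[] args) {
        long from = 1343779200000L; // 2012-08-01T00:00:00Z
        long to = from + 24L * 60 * 60 * 1000; // one day later
        System.out.println(inRange("user42-1343779200000-abc-def", from, to));
    }
}
```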
> > >
> > > regards
> > > Chris
> > >
> > > ----- Original Message -----
> > > From: Michael Segel <[email protected]>
> > > To: [email protected]
> > > CC:
> > > Sent: 13:52 Wednesday, August 1, 2012
> > > Subject: Re: How to query by rowKey-infix
> > >
> > > Actually, with coprocessors you can create a secondary index in short
> > > order. Then your cost is going to be 2 fetches. Trying to do a partial
> > > table scan will be more expensive.
> > >
> > > On Jul 31, 2012, at 12:41 PM, Matt Corgan <[email protected]> wrote:
> > >
> > > > When deciding between a table scan vs secondary index, you should try
> > > > to estimate what percent of the underlying data blocks will be used in
> > > > the query.  By default, each block is 64KB.
> > > >
> > > > If each user's data is small and you are fitting multiple users per
> > > > block, then you're going to need all the blocks, so a table scan is
> > > > better because it's simpler.  If each user has 1MB+ data then you will
> > > > want to pick out the individual blocks relevant to each date.  The
> > > > secondary index will help you go directly to those sparse blocks, but
> > > > with a cost in complexity, consistency, and extra denormalized data
> > > > that knocks primary data out of your block cache.
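To make the block estimate above a bit more concrete, here is a rough
back-of-the-envelope model in Java. It is only an illustration under my own
assumptions: every user has some data in the queried range, a user's rows for
the range are contiguous, and one extra block is counted for a range that
straddles a block boundary. The names BlockUsageEstimate and
fractionOfBlocksTouched are made up.

```java
class BlockUsageEstimate {
    static final long BLOCK_SIZE = 64 * 1024; // default block size from the thread

    /**
     * Rough fraction of a user's blocks that a date-range query touches.
     * bytesPerUser: average stored data per user.
     * fractionInRange: share of a user's data that falls into the date range.
     */
    static double fractionOfBlocksTouched(long bytesPerUser, double fractionInRange) {
        if (bytesPerUser <= BLOCK_SIZE) {
            // Several users share a block, so practically every block holds
            // some matching rows: a full table scan reads nothing in vain.
            return 1.0;
        }
        long blocksPerUser = (bytesPerUser + BLOCK_SIZE - 1) / BLOCK_SIZE;
        // +1 because the matching rows may straddle a block boundary.
        long touched = (long) Math.ceil(fractionInRange * blocksPerUser) + 1;
        return Math.min(1.0, (double) touched / blocksPerUser);
    }

    public static void main(String[] args) {
        // 10KB per user: all blocks are needed.
        System.out.println(fractionOfBlocksTouched(10 * 1024, 0.05));
        // 1MB per user (16 blocks), 1/16 of the data in range: 2 of 16 blocks.
        System.out.println(fractionOfBlocksTouched(1024 * 1024, 0.0625));
    }
}
```

A result close to 1.0 points towards the plain table scan. A small fraction
means most blocks would be read in vain, which is where an index or a seeking
filter could pay off.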
> > > >
> > > > If latency is not a concern, I would start with the table scan.  If
> > > > that's too slow you add the secondary index, and if you still need it
> > > > faster you do the primary key lookups in parallel as Jerry mentions.
> > > >
> > > > Matt
> > > >
> > > > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <[email protected]> wrote:
> > > >
> > > >> Hi Chris:
> > > >>
> > > >> I'm thinking about building a secondary index for primary key
> > > >> lookup, then querying using the primary keys in parallel.
> > > >>
> > > >> I'm interested to see whether there are other options too.
> > > >>
> > > >> Best Regards,
> > > >>
> > > >> Jerry
> > > >>
> > > >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <[email protected]>
> > > >> wrote:
> > > >>
> > > >>> Hello there,
> > > >>>
> > > >>> I designed a row key for queries that need best performance (~100 ms)
> > > >>> which looks like this:
> > > >>>
> > > >>> userId-date-sessionId
> > > >>>
> > > >>> These queries (scans) are always based on a userId and sometimes
> > > >>> additionally on a date, too.
> > > >>> That's no problem with the key above.
> > > >>>
> > > >>> However, another kind of query shall be based on a given time range
> > > >>> where the leftmost userId is not given or known.
> > > >>> In this case I need to get all rows covering the given time range
> > > >>> with their date to create a daily report.
> > > >>>
> > > >>> As I can't set wildcards at the beginning of a left-based index for
> > > >>> the scan, I only see the possibility to scan the index of the whole
> > > >>> table to collect the rowKeys that are inside the time range I'm
> > > >>> interested in.
> > > >>>
> > > >>> Is there a more elegant way to collect rows within time range X?
> > > >>> (Unfortunately, the date attribute is not equal to the timestamp that
> > > >>> is stored by hbase automatically.)
> > > >>>
> > > >>> Could/should one maybe leverage some kind of row key caching to
> > > >>> accelerate the collection process?
> > > >>> Is that covered by the block cache?
> > > >>>
> > > >>> Thanks in advance for any advice.
> > > >>>
> > > >>> regards
> > > >>> Chris
> > > >>>
> > > >>
> > >
> >
> >
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> >
>



