Hi Alex,

Apart from the query I mentioned in my last email, I have so far
implemented the following queries using filters and coprocessors:

1. Getting all the records for a customer.
2. Performing min/max/avg/sum aggregation for a customer using
coprocessors. I am also storing some of the data as BigDecimal to get
accurate floating-point calculations.
3. Performing min/max/avg/sum aggregation for a customer within a given
time range using coprocessors.
4. Filtering the data for a customer within a given time range on the
basis of column values. The filtering on column values can be matching a
string value or performing a range-based numerical comparison.
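For context, most of these queries boil down to a prefix + range scan over
the row keys. A minimal plain-Java sketch of that idea (the
"<customerid>_<yyyymmdd>" key layout and the in-memory "scan" are
illustrative assumptions, not our actual schema):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CustomerScanSketch {
    // Hypothetical row-key layout: "<customerId>_<yyyyMMdd>".
    static String rowKey(String customerId, String day) {
        return customerId + "_" + day;
    }

    // HBase scans a lexicographically sorted key range [startRow, stopRow);
    // the stop row is exclusive, as with Scan.setStartRow/setStopRow.
    static List<String> scan(List<String> sortedRows, String startRow, String stopRow) {
        List<String> out = new ArrayList<>();
        for (String r : sortedRows) {
            if (r.compareTo(startRow) >= 0 && r.compareTo(stopRow) < 0) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            rowKey("cust01", "20120801"),
            rowKey("cust01", "20120803"),
            rowKey("cust01", "20120805"),
            rowKey("cust02", "20120803"));
        // All records for cust01 in [20120802, 20120805): 20120801 sorts
        // before the start row and 20120805 hits the exclusive stop row.
        System.out.println(scan(rows, "cust01_20120802", "cust01_20120805"));
        // [cust01_20120803]
    }
}
```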

Basically, as per our current requirements, all the queries have a
customerid, and most of the queries have a time range as well. We are not
in prod yet; all of this effort is part of a POC.

2. Can you set timestamp on Puts the same as timestamp "assigned" to your
record by app logic?
Anil: Wow! This sounds like an awesome idea. Actually, my data is
immutable, so at present I was putting 0 as the timestamp for all the
data. I will definitely try this. Currently I run the bulk loader to load
the data, so I think it will be a small change.
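If it helps to see why this works: HBase keeps the min/max cell timestamp
per HFile, so a scan with setTimeRange() can skip whole files whose range
doesn't overlap the query's. A toy sketch of that overlap check (the
yyyymmdd-style numbers are just illustrative stand-ins for real
epoch-millis timestamps):

```java
public class HFileTimeRangeSketch {
    // Toy model of the pruning HBase can do once cell timestamps are
    // meaningful: each HFile records the min/max timestamp of its cells,
    // and a scan with setTimeRange(startTs, stopTs) can skip files whose
    // range doesn't overlap. HBase time ranges are [start, stop).
    static boolean mustRead(long fileMinTs, long fileMaxTs,
                            long scanStartTs, long scanStopTs) {
        return fileMinTs < scanStopTs && fileMaxTs >= scanStartTs;
    }

    public static void main(String[] args) {
        // When data is written with increasing timestamps, older files fall
        // wholly outside a recent time range and are skipped entirely.
        System.out.println(mustRead(20120801L, 20120802L, 20120803L, 20120806L)); // false: skipped
        System.out.println(mustRead(20120803L, 20120805L, 20120803L, 20120806L)); // true: must read
    }
}
```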

Yes, I would love to take a shot at developing a range-based
FuzzyRowFilter. However, first I am going to try setting the timestamps.
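In the meantime, per the suggestion in the JIRA reply quoted below, the
actionid range can be covered by enumerating one fuzzy rule per value. A
small sketch that just builds the printable rule patterns (the real
FuzzyRowFilter takes byte[] key/mask pairs, which I'm omitting here):

```java
import java.util.ArrayList;
import java.util.List;

public class FuzzyRuleSketch {
    // One fuzzy rule per actionid value, e.g. ??????00200 ... ??????00350
    // for the 11-char userid+actionid key: '?' marks the 6 "don't care"
    // userid bytes, followed by the fixed 5-digit actionid.
    // Note 200..350 inclusive is actually 151 rules.
    static List<String> rulesForActionIdRange(int from, int to) {
        List<String> rules = new ArrayList<>();
        for (int a = from; a <= to; a++) {
            rules.add("??????" + String.format("%05d", a));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> rules = rulesForActionIdRange(200, 350);
        System.out.println(rules.size());                // 151
        System.out.println(rules.get(0));                // ??????00200
        System.out.println(rules.get(rules.size() - 1)); // ??????00350
    }
}
```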

Thanks for a very helpful discussion. Let me know when you create the
JIRA for the range-based FuzzyRowFilter.

Thanks,
Anil Gupta

On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau <[email protected]> wrote:

> @Michael,
>
> This is not a simple partial key scan. Take this example of rows:
>
> aaaaa_100001_20120801
> aaaaa_100001_20120802
> aaaaa_100001_20120802
> aaaaa_100001_20120803
> aaaaa_100001_20120804
> aaaaa_100001_20120805
> aaaaa_100002_20120801
> aaaaa_100002_20120802
> aaaaa_100002_20120802
> aaaaa_100002_20120803
> aaaaa_100002_20120804
> aaaaa_100002_20120805
>
> where aaaaa is userId, 10000x is actionId, and 201208xx is a timestamp. If
> the query is to select actions in the range 20120803-20120805 (in this case
> the last 3 days), then when the scan encounters row:
>
> aaaaa_100001_20120801
>
> it "knows" it can fast-forward the scan to "aaaaa_100001_20120803" and
> skip some records (in practice, this may mean skipping really a LOT of
> records).
>
>
> @Anil,
>
> > Sample Query: I want to get all the event which happened in last month.
>
> 1. What other queries do you do? Just trying to understand why this row key
> format was chosen.
>
> 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> record by app logic? If you can, then this is the first thing to try:
> perform the scan with the help of scan.setTimeRange(startTs, stopTs).
> Depending on how you write the data, this may help a lot with the reading
> speed by ts, because that way whole HFiles may be skipped from reading
> based on ts. I don't know enough about your data to judge, but:
>   * in case you have not a lot of users, most of which have a long history
> of interaction with your system (i.e. there are a lot of records for a
> specific "userX_actionY"), and
>   * if you write data with monotonically increasing timestamps, and
>   * your regions are not too big,
> then this might help you, as it will increase the chance that some of the
> HFiles will contain data *all of which* doesn't fall into the time interval
> you select by. Otherwise, if written data items with different timestamps
> are very well spread across the HFiles, the chance that some HFiles are
> skipped from reading is very small. I believe Lars George has illustrated
> this in one of his presentations, but I couldn't find it quickly.
>
> > something like FuzzyRowFilter with range
>
> Yes, something like this looks like it would be very valuable. It would be
> interesting to implement too. Let's see if I find the time for that in my
> work plan. If you want to try it yourself, go for it! Let me know if you
> need help in that case ;)
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel <[email protected]> wrote:
>
> > What row keys are you skipping?
> >
> > Using your example...
> > You have a start row of 00000000200, and an end key of
> > xFFxFFxFFxFFxFFxFF00350.
> > Note that you could also write that end key as xFF(1..6) 01, since it
> > looks like you're trying to match the 00 in positions 7 and 8 of your
> > numeric string.
> >
> > Assuming that when you say ? you mean that you expect to have a character
> > in that spot and that your row key is exactly 11 characters in length.
> >
> > While you may not return all the rows in that range, you still have to
> > check the row key, unless I am missing something.
> >
> > So what am I missing?
> >
> > On Aug 17, 2012, at 3:42 PM, Alex Baranau <[email protected]>
> > wrote:
> >
> > > There was a question [1] in a
> > > https://issues.apache.org/jira/browse/HBASE-6509 JIRA comment; it makes
> > > more sense to answer it here.
> > >
> > > With the current FuzzyRowFilter I believe the only way to approach the
> > > problem is to add 150 fuzzy rules to the filter: ??????00200,
> > > ??????00201, ..., ??????00350.
> > >
> > > As for the performance of this approach, I can say the following:
> > > * there are two "checks" happening for each processed row key (i.e.
> > > those row keys we don't skip)
> > > * the first one performs a simple check of whether the given row key
> > > satisfies the fuzzy rule, and also determines whether there's a next
> > > row key to advance to (if this one doesn't satisfy). The check takes at
> > > most O(n), where n is the length of the fuzzy rule. I.e. this is done
> > > in one simple loop, which can be broken before all bytes are checked.
> > > For m rules this is O(m*n).
> > > * the second piece calculates the next row key to provide as a hint
> > > for fast-forwarding. We again check all rules, finding the smallest
> > > hint. This operation is also done in one loop, i.e. O(m*n) here as
> > > well.
> > >
> > > With 150 fuzzy rules of length 11, applying the filter is equivalent
> > > to a loop of simple checks through 150*11*2 ~ 3000 elements. This
> > > might look like a lot, but it can work quite fast. So I'd just try it.
> > >
> > > As for an extension which would be more efficient, it makes sense to
> > > consider implementing it. Let me think more about it and get back to
> > > you with a JIRA issue :). But I'd suggest trying the existing
> > > FuzzyRowFilter first. The output (performance) would give us some food
> > > for thought, or it may even turn out to be acceptable for you
> > > (hopefully).
> > >
> > >> Can i run this kind of filter on HBase0.92 without doing any
> > >> significant update to the cluster
> > >
> > > Until the next release, you'll have to use the FuzzyRowFilter as any
> > > other custom filter. Just grab the patch from HBASE-6509 and copy the
> > > filter. No need to patch & rebuild HBase.
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> > >
> > > [1]
> > >
> > > Anil Gupta added a comment - 18/Aug/12 04:37
> > > Hi Alex,
> > > I have a question related to this filter. I have a similar filtering
> > > requirement which would be an extension to FuzzyRowFilter.
> > > Suppose I have the following structure of row keys: userid_actionid,
> > > where userid is 6 digits and actionid is 5 digits. I would like to get
> > > all the rows with actionid between 00200 and 00350. With the current
> > > FuzzyRowFilter I can search for all the rows with a particular
> > > actionid. Instead of searching for a particular actionid, I would like
> > > to search for a range of actionids. Does this use case sound like an
> > > extension to the current FuzzyRowFilter? Can I run this kind of filter
> > > on HBase 0.92 without doing any significant update to the cluster? If
> > > I develop this kind of filter, what is needed to run it on all the
> > > RSs?
> > > Thanks,
> > > Anil
> >
> >
>



-- 
Thanks & Regards,
Anil Gupta
