Hi Alex,

Thanks for creating the JIRA.
On Monday, I completed testing the time-range filtering using timestamps,
and IMO the results seem satisfactory (if not great). The table has 34
million records (average row size is 1.21 KB), and in 136 seconds I get the
entire result of a query that returns 225 rows.
I am running an HBase 0.92, 8-node cluster on VMware hypervisor. Each node
has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my setup
hosts 2 slave instances (2 VMs running DataNode, NodeManager,
RegionServer). I have allocated only 1200 MB for the RegionServers. I
haven't modified the HDFS or HBase block size. Considering the below-par
hardware configuration of the cluster, does the performance sound OK for
timestamp filtering?
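For reference, the query I timed is essentially a time-range scan. A minimal sketch of it (the HBase calls are shown as comments since they need a live cluster; only the window computation runs standalone, and table/family names would be ours):

```java
import java.util.Calendar;

public class TimeRangeQuery {
    // Computes the [start, stop) millisecond window for the previous
    // calendar month, relative to the given "now".
    static long[] lastMonthWindow(Calendar now) {
        Calendar stop = (Calendar) now.clone();
        stop.set(Calendar.DAY_OF_MONTH, 1);
        stop.set(Calendar.HOUR_OF_DAY, 0);
        stop.set(Calendar.MINUTE, 0);
        stop.set(Calendar.SECOND, 0);
        stop.set(Calendar.MILLISECOND, 0);
        Calendar start = (Calendar) stop.clone();
        start.add(Calendar.MONTH, -1);
        return new long[] { start.getTimeInMillis(), stop.getTimeInMillis() };
    }

    // With the record time stored as the cell timestamp, the scan is simply:
    //   Scan scan = new Scan();
    //   scan.setTimeRange(window[0], window[1]);
    //   ResultScanner scanner = table.getScanner(scan);
}
```

Scan.setTimeRange takes an inclusive start and an exclusive stop, so the [start, stop) window above maps onto it directly.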

Thanks,
Anil

On Mon, Aug 20, 2012 at 1:07 PM, Alex Baranau <[email protected]> wrote:

> Created: https://issues.apache.org/jira/browse/HBASE-6618
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Sat, Aug 18, 2012 at 5:02 PM, anil gupta <[email protected]> wrote:
>
> > Hi Alex,
> >
> > Apart from the query which I mentioned in my last email, I have so far
> > implemented the following queries using filters and coprocessors:
> >
> > 1. Get all the records for a customer.
> > 2. Perform min, max, avg, and sum aggregation for a customer using
> > coprocessors. I am storing some of the data as BigDecimal as well, to do
> > accurate floating-point calculations.
> > 3. Perform min, max, avg, and sum aggregation for a customer within a
> > given time range using coprocessors.
> > 4. Filter the data for a customer within a given time range on the basis
> > of column values. The filtering on column values can be matching a string
> > value or doing range-based numerical comparison.
> >
> > Basically, as per our current requirements, all the queries have
> > customerid and most of them have a time range as well. We are not in prod
> > yet; all of this effort is part of a POC.
> >
> > 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> > record by app logic?
> > Anil: Wow! This sounds like an awesome idea. Actually, my data is
> > non-mutable, so at present I was putting 0 as the timestamp for all the
> > data. I will definitely try this. Currently, I run the bulkloader to load
> > the data, so I think it's going to be a small change.
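> > In the bulk loader, the change is essentially to derive the cell
> > timestamp from the record's own date field instead of letting it default.
> > A minimal sketch (the KeyValue line is illustrative only, and the
> > family/qualifier names would be ours):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class RecordTimestamp {
    // Parses the record's own yyyyMMdd date field into epoch millis,
    // so it can be used as the cell timestamp on the Put/KeyValue.
    static long toMillis(String yyyyMMdd) {
        try {
            return new SimpleDateFormat("yyyyMMdd").parse(yyyyMMdd).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad date field: " + yyyyMMdd, e);
        }
    }

    // In the bulk-load mapper the cell would then be built roughly as:
    //   new KeyValue(rowKeyBytes, FAMILY, QUALIFIER,
    //                toMillis(dateField), valueBytes);
}
```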
> >
> > Yes, I would love to take a stab at developing a range-based
> > FuzzyRowFilter myself. However, first I am going to try putting in the
> > timestamp.
> >
> > Thanks for a very helpful discussion. Let me know when you create the
> > JIRA for the range-based FuzzyRowFilter.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau
> > <[email protected]> wrote:
> >
> > > @Michael,
> > >
> > > This is not a simple partial key scan. Take this example of rows:
> > >
> > > aaaaa_100001_20120801
> > > aaaaa_100001_20120802
> > > aaaaa_100001_20120802
> > > aaaaa_100001_20120803
> > > aaaaa_100001_20120804
> > > aaaaa_100001_20120805
> > > aaaaa_100002_20120801
> > > aaaaa_100002_20120802
> > > aaaaa_100002_20120802
> > > aaaaa_100002_20120803
> > > aaaaa_100002_20120804
> > > aaaaa_100002_20120805
> > >
> > > where aaaaa is userId, 10000x is actionId, and 201208xx is a timestamp.
> > > If the query is to select actions in the range 20120803-20120805 (in
> > > this case the last 3 days), then when the scan encounters row:
> > >
> > > aaaaa_100001_20120801
> > >
> > > it "knows" it can fast-forward the scan to "aaaaa_100001_20120803" and
> > > skip some records (in practice, this may mean skipping really a LOT of
> > > records).
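> > > A minimal sketch of that seek-hint computation (string-based, for the
> > > row key layout above; this is an illustration, not the actual filter
> > > implementation):

```java
public class TimeRangeHint {
    // For a row key of the form userId_actionId_timestamp, returns the key
    // the scanner could seek to when the key's timestamp sorts below the
    // start of the queried range; otherwise returns the key unchanged.
    static String nextHint(String rowKey, String startTs) {
        int sep = rowKey.lastIndexOf('_');
        String ts = rowKey.substring(sep + 1);
        if (ts.compareTo(startTs) < 0) {
            // Jump straight to the first possibly-matching key for this
            // userId_actionId prefix.
            return rowKey.substring(0, sep + 1) + startTs;
        }
        return rowKey;
    }
}
```

> > > Here nextHint("aaaaa_100001_20120801", "20120803") yields
> > > "aaaaa_100001_20120803", skipping the 0801 and 0802 rows entirely.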
> > >
> > >
> > > @Anil,
> > >
> > > > Sample Query: I want to get all the events which happened in the
> > > > last month.
> > >
> > > 1. What other queries do you do? Just trying to understand why this
> > > row key format was chosen.
> > >
> > > 2. Can you set the timestamp on Puts the same as the timestamp
> > > "assigned" to your record by app logic? If you can, then this is the
> > > first thing to try: perform the scan with the help of
> > > scan.setTimeRange(startTs, stopTs). Depending on how you write the
> > > data, this may help a lot with reading speed by ts, because that way
> > > you may skip whole HFiles from reading based on ts. I don't know a lot
> > > about your data to judge, but:
> > >   * in case you have not a lot of users, most of which have a long
> > > history of interaction with your system (i.e. there are a lot of
> > > records for a specific "userX_actionY"), and
> > >   * if you write data with monotonically increasing timestamps, and
> > >   * your regions are not too big,
> > > then this might help you, as it will increase the chance that some of
> > > the HFiles will contain data *all of which* doesn't fall into the time
> > > interval you select by. Otherwise, if written data items with
> > > different timestamps are very well spread across the HFiles, the
> > > chance that some HFiles are skipped from reading is very small. I
> > > believe Lars George has illustrated this in one of his presentations,
> > > but I couldn't find it quickly.
> > >
> > > > something like FuzzyRowFilter with range
> > >
> > > Yes, something like this looks like it would be very valuable. It
> > > would be interesting to implement too. Let's see if I can find the
> > > time for that in my work plan. If you want to try it yourself, go for
> > > it! Let me know if you need help in that case ;)
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> > > On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel
> > > <[email protected]> wrote:
> > >
> > > > What row keys are you skipping?
> > > >
> > > > Using your example...
> > > > You have a start row of 00000000200, and an end key of
> > > > xFFxFFxFFxFFxFFxFF00350.
> > > > Note that you could also write that end key as xFF(1..6) 01, since
> > > > it looks like you're trying to match the 00 in positions 7 and 8 of
> > > > your numeric string.
> > > >
> > > > Assuming that when you say ? you mean that you expect to have a
> > > > character in that spot, and that your row key is exactly 11
> > > > characters in length.
> > > >
> > > > While you may not return all the rows in that range, you still have
> > > > to check the row key, unless I am missing something.
> > > >
> > > > So what am I missing?
> > > >
> > > > On Aug 17, 2012, at 3:42 PM, Alex Baranau <[email protected]>
> > > > wrote:
> > > >
> > > > > There was a question [1] in a JIRA comment on
> > > > > https://issues.apache.org/jira/browse/HBASE-6509; it makes more
> > > > > sense to answer it here.
> > > > >
> > > > > With the current FuzzyRowFilter, I believe the only way to
> > > > > approach the problem is to add 150 fuzzy rules to the filter:
> > > > > ??????00200, ??????00201, ..., ??????00350.
> > > > >
> > > > > As for the performance of this approach, I can say the following:
> > > > > * there are two "checks" happening for each processed row key
> > > > > (i.e. those row keys we don't skip)
> > > > > * the first one performs a simple check of whether the given row
> > > > > key satisfies the fuzzy rule, and also determines whether there's
> > > > > a next row key to advance to (if this one doesn't satisfy). The
> > > > > check takes at most O(n), where n is the length of the fuzzy rule.
> > > > > I.e. this is done in one simple loop, which can be broken before
> > > > > all bytes are checked. For m rules this is O(m*n).
> > > > > * the second piece calculates the next row key to provide as a
> > > > > hint for fast-forwarding. We again check all rules, finding the
> > > > > smallest hint. This operation is also done in one loop, i.e.
> > > > > O(m*n) here as well.
> > > > >
> > > > > With 150 fuzzy rules of length 11, applying the filter is
> > > > > equivalent to a loop of simple checks through 150*11*2 ~ 3300
> > > > > elements. This might look like a lot, but it can work quite fast.
> > > > > So I'd just try it.
> > > > >
> > > > > As for an extension which would be more efficient, it makes sense
> > > > > to consider implementing it. Let me think more about it and get
> > > > > back to you with the JIRA issue :). But I'd suggest trying the
> > > > > existing FuzzyRowFilter first. The output (performance) would give
> > > > > us some food for thought, or it may even turn out to be acceptable
> > > > > for you (hopefully).
> > > > >
> > > > >> Can I run this kind of filter on HBase 0.92 without doing any
> > > > >> significant update to the cluster?
> > > > >
> > > > > Until the next release, you'll have to use the FuzzyRowFilter as
> > > > > any other custom filter. Just grab the patch from HBASE-6509 and
> > > > > copy the filter. No need to patch & rebuild HBase.
> > > > >
> > > > > Alex Baranau
> > > > > ------
> > > > > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > > > >
> > > > > [1]
> > > > >
> > > > > Anil Gupta added a comment - 18/Aug/12 04:37
> > > > > Hi Alex,
> > > > > I have a question related to this filter. I have a similar
> > > > > filtering requirement which will be an extension to FuzzyRowFilter.
> > > > > Suppose I have the following structure of row keys:
> > > > > userid_actionid, where userid is 6 digits and actionid is 5
> > > > > digits. I would like to get all the rows with actionid between
> > > > > 00200 and 00350. With the current FuzzyRowFilter I can search for
> > > > > all the rows for a particular actionid. Instead of searching for a
> > > > > particular actionid, I would like to search for a range of
> > > > > actionids.
> > > > > Does this use case sound like an extension to the current
> > > > > FuzzyRowFilter? Can I run this kind of filter on HBase 0.92
> > > > > without doing any significant update to the cluster? If I develop
> > > > > this kind of filter, then what is needed to run it on all the
> > > > > RS's?
> > > > > Thanks,
> > > > > Anil
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>



-- 
Thanks & Regards,
Anil Gupta
