Created: https://issues.apache.org/jira/browse/HBASE-6618
Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Sat, Aug 18, 2012 at 5:02 PM, anil gupta <[email protected]> wrote:

> Hi Alex,
>
> Apart from the query which I mentioned in my last email, so far I have
> implemented the following queries using filters and coprocessors:
>
> 1. Getting all the records for a customer.
> 2. Perform min, max, avg, sum aggregation for a customer using coprocessors.
>    I am also storing some of the data as BigDecimal to do accurate
>    floating point calculations.
> 3. Perform min, max, avg, sum aggregation for a customer within a given
>    time-range using coprocessors.
> 4. Filter that data for a customer within a given time-range on the basis
>    of column values. The filtering on column values can be matching a string
>    value or it can be doing range based numerical comparison.
>
> Basically, as per our current requirement all the queries have customerid
> and most of the queries have timerange also. We are not in prod yet. All of
> this effort is part of a POC.
>
> 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> record by app logic?
> Anil: Wow! This sounds like an awesome idea. Actually, my data is
> non-mutable so at present I was putting 0 as the timestamp for all the
> data. I will definitely try this stuff. Currently, I run bulkloader to load
> the data so I think it's gonna be a small change.
>
> Yes, I would love to give a try from my side for developing a range based
> FuzzyRowFilter. However, first I am going to try putting in the timestamp.
>
> Thanks for a very helpful discussion. Let me know when you create the JIRA
> for range-based FuzzyRowFilter.
>
> Thanks,
> Anil Gupta
>
> On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau <[email protected]> wrote:
>
> > @Michael,
> >
> > This is not a simple partial key scan.
> > Take this example of rows:
> >
> > aaaaa_100001_20120801
> > aaaaa_100001_20120802
> > aaaaa_100001_20120802
> > aaaaa_100001_20120803
> > aaaaa_100001_20120804
> > aaaaa_100001_20120805
> > aaaaa_100002_20120801
> > aaaaa_100002_20120802
> > aaaaa_100002_20120802
> > aaaaa_100002_20120803
> > aaaaa_100002_20120804
> > aaaaa_100002_20120805
> >
> > where aaaaa is userId, 10000x is actionId and 201208xx is a timestamp. If
> > the query is to select actions in the range 20120803-20120805 (in this case
> > the last 3 days), then when the scan encounters row:
> >
> > aaaaa_100001_20120801
> >
> > it "knows" it can fast forward scanning to "aaaaa_100001_20120803", and
> > skip some records (in practice, this may mean skipping really a LOT of
> > records).
> >
> >
> > @Anil,
> >
> > > Sample Query: I want to get all the events which happened in last month.
> >
> > 1. What other queries do you do? Just trying to understand why this row key
> > format was chosen.
> >
> > 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> > record by app logic? If you can, then this is the first thing to try and
> > perform scan with the help of scan.setTimeRange(startTs, stopTs). Depending
> > on how you write the data this may help a lot with the reading speed by ts,
> > because that way you may skip whole HFiles from reading based on ts. I
> > don't know enough about your data to judge, but:
> > * in case you have not a lot of users, most of which have a long history
> >   of interaction with your system (i.e. there are a lot of records for
> >   specific "userX_actionY") and
> > * if you write data with monotonically increasing timestamp
> > * your regions are not too big
> > then this might help you, as it will increase the chance that some of the
> > HFiles will contain data *all of which* doesn't fall into the time interval
> > you select by.
> > Otherwise, if written data items with different timestamps
> > are very well spread across the HFiles, the chance that some HFiles are
> > skipped from reading is very small. I believe Lars George has illustrated
> > this in one of his presentations, but I couldn't find it quickly.
> >
> > > something like FuzzyRowFilter with range
> >
> > Yes, something like this looks like it would be very valuable. It would be
> > interesting to implement too. Let's see if I find the time for that in my
> > work plan. If you want to try it by yourself, go for it! Let me know if you
> > need help in that case ;)
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel <[email protected]> wrote:
> >
> > > What row keys are you skipping?
> > >
> > > Using your example...
> > > You have a start row of 00000000200, and an end key of
> > > xFFxFFxFFxFFxFFxFF00350.
> > > Note that you could also write that end key as xFF(1..6) 01 since it looks
> > > like you're trying to match the 00 in positions 7 and 8 of your numeric
> > > string.
> > >
> > > Assuming that when you say ? you mean that you expect to have a character
> > > in that spot and that your row key is exactly 11 characters in length.
> > >
> > > While you may not return all the rows in that range, you do have to still
> > > check the row key, unless I am missing something.
> > >
> > > So what am I missing?
> > >
> > > On Aug 17, 2012, at 3:42 PM, Alex Baranau <[email protected]> wrote:
> > >
> > > > There was a question [1] in a
> > > > https://issues.apache.org/jira/browse/HBASE-6509 JIRA comment; it makes
> > > > more sense to answer it here.
> > > >
> > > > With the current FuzzyRowFilter I believe the only way to approach the
> > > > problem is to add 150 fuzzy rules to the filter: ??????00200, ??????00201,
> > > > ..., ??????00350.
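To make the "one rule per actionid" idea concrete, here is a minimal, dependency-free sketch that builds the rule keys and masks for a range of action ids. The class and method names (FuzzyRuleBuilder, Rule, rulesForActionRange) are hypothetical and not part of HBase; the mask convention (1 = position may hold any byte, 0 = position must match) is an assumption modeled on the HBASE-6509 filter, so check it against the patch before reusing it. Note that the inclusive range 00200..00350 gives 151 rules, close to the 150 quoted in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FuzzyRuleBuilder {
    /** One fuzzy rule: key bytes plus a mask of the same length (hypothetical type). */
    static final class Rule {
        final byte[] key;
        final byte[] mask;
        Rule(byte[] key, byte[] mask) { this.key = key; this.mask = mask; }
    }

    /** Builds one rule per action id in [from, to], e.g. "??????00200". */
    static List<Rule> rulesForActionRange(int userIdLen, int actionIdLen, int from, int to) {
        List<Rule> rules = new ArrayList<>();
        for (int action = from; action <= to; action++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < userIdLen; i++) sb.append('?');   // placeholder bytes
            sb.append(String.format(Locale.ROOT, "%0" + actionIdLen + "d", action));
            byte[] key = sb.toString().getBytes(StandardCharsets.US_ASCII);
            byte[] mask = new byte[key.length];                   // all 0 = fixed
            for (int i = 0; i < userIdLen; i++) mask[i] = 1;      // assumed: 1 = any byte
            rules.add(new Rule(key, mask));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<Rule> rules = rulesForActionRange(6, 5, 200, 350);
        System.out.println(rules.size() + " rules, first = "
            + new String(rules.get(0).key, StandardCharsets.US_ASCII));
    }
}
```

With the real filter, each key/mask pair would presumably be handed to FuzzyRowFilter in whatever form the patch expects; that wiring is left out here on purpose.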
> > > >
> > > > As for performance of this approach I can say the following:
> > > > * there are two "checks" happening for each processed row key (i.e. those
> > > >   row keys we don't skip)
> > > > * the first one performs a simple check of whether the given row key
> > > >   satisfies the fuzzy rule and also determines if there's a next row key to
> > > >   advance to (if this one doesn't satisfy). The check takes at most O(n),
> > > >   where n is the length of the fuzzy rule. I.e. this is done in one simple
> > > >   loop, which can be broken before all bytes are checked. For m rules this
> > > >   will be O(m*n).
> > > > * the second piece calculates the next row key to provide it as a hint for
> > > >   fast-forwarding. We again check all rules and find the smallest hint.
> > > >   The operation is also done in one loop, i.e. O(m*n) here as well.
> > > >
> > > > With 150 fuzzy rules of length 11, applying the filter is equivalent to
> > > > a loop with simple checks through 150*11*2 ~ 3000 elements. This might look
> > > > like a lot, but can work quite fast. So I'd just try it.
> > > >
> > > > As for an extension which will be more efficient, it makes sense to consider
> > > > implementing it. Let me think more about it and get back with the JIRA
> > > > issue to you :). But I'd suggest you try the existing FuzzyRowFilter first.
> > > > The output (performance) would give us some food for thought, or may
> > > > even turn out to be acceptable for you (hopefully).
> > > >
> > > > > Can I run this kind of filter on HBase0.92 without doing any significant
> > > > > update to the cluster
> > > >
> > > > Until the next release, you'll have to use the FuzzyRowFilter as any other
> > > > custom filter. Just grab the patch from HBASE-6509 and copy the filter. No
> > > > need to patch & rebuild HBase.
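The first "check" described above can be sketched as a plain loop with an early exit, plus the worst-case arithmetic from the message. This is a simplified illustration, not the code from the HBASE-6509 patch: the names (FuzzyMatch, satisfies, satisfiesAny, worstCaseComparisons) are hypothetical, the mask convention (0 = fixed byte, 1 = any byte) is an assumption, rows are assumed to be at least as long as the rule, and the next-row-key hint calculation is left out.

```java
import java.nio.charset.StandardCharsets;

class FuzzyMatch {
    /** O(n) check of one rule; stops at the first fixed byte that differs. */
    static boolean satisfies(byte[] row, byte[] key, byte[] mask) {
        if (row.length < key.length) return false;  // simplifying assumption
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == 0 && row[i] != key[i]) return false;  // fixed byte mismatch
        }
        return true;
    }

    /** O(m*n) check over m rules: true as soon as any rule matches. */
    static boolean satisfiesAny(byte[] row, byte[][] keys, byte[][] masks) {
        for (int r = 0; r < keys.length; r++) {
            if (satisfies(row, keys[r], masks[r])) return true;
        }
        return false;
    }

    /** Upper bound on byte comparisons for the two passes (match + hint). */
    static long worstCaseComparisons(int rules, int ruleLength) {
        return 2L * rules * ruleLength;
    }

    public static void main(String[] args) {
        byte[] key = "??????00200".getBytes(StandardCharsets.US_ASCII);
        byte[] mask = {1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0};
        byte[] row = "aaaaaa00200".getBytes(StandardCharsets.US_ASCII);
        System.out.println(satisfies(row, key, mask));
        System.out.println(worstCaseComparisons(150, 11));
    }
}
```

For 150 rules of length 11 the bound works out to 3300 comparisons per processed row key, which is the "~ 3000" figure in the message.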
> > > >
> > > > Alex Baranau
> > > > ------
> > > > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > > >
> > > > [1]
> > > >
> > > > Anil Gupta added a comment - 18/Aug/12 04:37
> > > > Hi Alex,
> > > > I have a question related to this filter. I have a similar filtering
> > > > requirement which will be an extension to FuzzyRowFilter.
> > > > Suppose I have the following structure of rowkeys: userid_actionid, where
> > > > userid is 6 digits and actionid is 5 digits. I would like to get all
> > > > the rows with actionid between 00200 and 00350. With the current
> > > > FuzzyRowFilter I can search for all the rows with a particular actionid.
> > > > Instead of searching for a particular actionid I would like to search for
> > > > a range of actionids.
> > > > Does this use case sound like an extension to the current FuzzyRowFilter?
> > > > Can I run this kind of filter on HBase0.92 without doing any significant
> > > > update to the cluster? If I develop this kind of filter then what is needed
> > > > to run it on all the RS's?
> > > > Thanks,
> > > > Anil
> > > >
> > >
> >
> --
> Thanks & Regards,
> Anil Gupta
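The fast-forward behaviour described earlier in the thread for the aaaaa_100001_20120801 example can be sketched as a small hint function. This is only an illustration of the idea behind a range-based filter, not an existing HBase class: DateRangeHint and nextHint are hypothetical names, the key layout userId_actionId_yyyyMMdd is taken from the example rows, and only the "date is before the range" case is handled (seeking to the next userId/actionId prefix once the date is past the range is left out).

```java
class DateRangeHint {
    /**
     * Returns the row key to seek to when the row's date is before startDate,
     * or null when no forward jump within the same prefix applies.
     * Dates are fixed-width yyyyMMdd strings, so string order equals date order.
     */
    static String nextHint(String rowKey, String startDate, String stopDate) {
        int sep = rowKey.lastIndexOf('_');
        String prefix = rowKey.substring(0, sep + 1);  // "userId_actionId_"
        String date = rowKey.substring(sep + 1);
        if (date.compareTo(startDate) < 0) {
            return prefix + startDate;                 // skip rows before the range
        }
        // In range (date <= stopDate) or past it; the latter would need a seek
        // to the next prefix, which this sketch does not cover.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(nextHint("aaaaa_100001_20120801", "20120803", "20120805"));
    }
}
```

For the row aaaaa_100001_20120801 and the range 20120803-20120805 the hint is aaaaa_100001_20120803, matching the example in the thread.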
