JM - I am searching for the top N URLs by date+category, so this rowkey does work well for my purpose.

Cristofer - I realize that having the raw date at the beginning of the rowkey sends all of a day's writes to the same region server. Maybe I could have the rowkey start with the category (which is more evenly distributed) and put the date in the column qualifier. I just went through the slides; they were very enlightening. Thanks for that.
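
To make the hot-spotting concern concrete, here is a rough sketch in plain Java (no HBase client involved). It only counts how many distinct leading key components one day's writes would have under each layout, which is a crude proxy for how the writes spread over the key space; actual region boundaries depend on how the table is split. The class name, the sample categories and the sample date are all made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

final class KeyPrefixSpread {
    // Leading component of a '|'-delimited key: the part that decides
    // where the row lands in the sorted key space.
    static String leadingComponent(String rowKey) {
        int i = rowKey.indexOf('|');
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // Number of distinct leading components across a batch of keys.
    static int distinctPrefixes(String[] rowKeys) {
        Set<String> prefixes = new TreeSet<>();
        for (String key : rowKeys) {
            prefixes.add(leadingComponent(key));
        }
        return prefixes.size();
    }

    public static void main(String[] args) {
        String date = "20120724";                        // hypothetical sample day
        String[] categories = {"news", "sports", "tech"}; // hypothetical categories

        String[] dateFirst = new String[categories.length];
        String[] categoryFirst = new String[categories.length];
        for (int i = 0; i < categories.length; i++) {
            dateFirst[i] = date + "|" + categories[i];
            // Category-first: the date would live in the column qualifier,
            // so it is not part of the key here.
            categoryFirst[i] = categories[i];
        }

        System.out.println("date-first prefixes:     " + distinctPrefixes(dateFirst));
        System.out.println("category-first prefixes: " + distinctPrefixes(categoryFirst));
    }
}
```

With the date first, every write for the day shares a single prefix; with the category first, there are as many prefixes as categories.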
Thanks again!

On Tue, Jul 24, 2012 at 7:59 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> Hi Hari,
>
> Why do you think it's wasteful?
>
> Let's imagine this situation.
> Key=<date>|<category>|<padded_visits>|<url> Value = nothing.
>
> And this one:
> Key=<url> Value = <date>|<category>|<padded_visits>
>
> Both situations will, in the end, represent almost the same size in the
> database.
>
> You can also do something like this:
> Key=<url> ColumnFamilyName=<date> Value=<category>|<padded_visits>
>
> Just that the first option will allow you to retrieve the information
> you are looking for very quickly.
>
> Now, are you sure that this key is really what you need? What will be
> the access model for your database? With the key you are using, you
> will have to search by date first. So if you want to find all the
> entries for one URL, you will have to scan the entire table, jumping
> to the next date each time you find it.
>
> If you are searching by date, then this key is good.
>
> So you really need first to think about the way you are going to read
> your data, and then you will be able to design a key to match your
> needs.
>
> JM
>
> 2012/7/24, Minh Duc Nguyen <[email protected]>:
>> Hari,
>>
>> According to the HBase book: http://hbase.apache.org/book.html#dm.sort
>>
>> All data model operations HBase return data in sorted order. First by row,
>> then by ColumnFamily, followed by column qualifier, and finally timestamp
>> (sorted in reverse, so newest records are returned first).
>>
>> ~ Minh
>>
>> On Tue, Jul 24, 2012 at 9:50 AM, Hari Prasanna <[email protected]> wrote:
>>
>>> Hello -
>>>
>>> I'm using HBase for web server log processing and I'm trying to save
>>> the top N urls per category per day in a sorted manner in HBase. From
>>> what I've read, the only sortable structure that HBase offers is the
>>> lexicographic sort in the row keys. So, here is the rowkey format I'm
>>> currently using
>>> <date>|<category>|<padded_visits>|<url>
>>> where, padded_visits = Long.MAX_VALUE - visits
>>>
>>> This seems wasteful because of the long rowkeys. Is there any other
>>> approach to maintain sorted results in HBase?
>>>
>>> Thanks
>>> Hari Prasanna
>>>
>>

--
Hari
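
For reference, the rowkey format from the original question can be sketched in plain Java as below. This is only an illustration, not the code actually in use. It assumes keys are built as ASCII strings, that visit counts are non-negative, and that the inverted count is zero-padded to 19 digits (the width of Long.MAX_VALUE) so that lexicographic order matches numeric order. The class name, sample URLs and date format are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TopUrlRowKey {
    // Invert the count so larger visit counts sort first, then zero-pad
    // to 19 digits so that string order agrees with numeric order.
    static String paddedVisits(long visits) {
        return String.format("%019d", Long.MAX_VALUE - visits);
    }

    // <date>|<category>|<padded_visits>|<url>
    static String rowKey(String date, String category, long visits, String url) {
        return date + "|" + category + "|" + paddedVisits(visits) + "|" + url;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(rowKey("20120724", "news", 15, "http://example.com/a"));
        keys.add(rowKey("20120724", "news", 4200, "http://example.com/b"));
        keys.add(rowKey("20120724", "news", 310, "http://example.com/c"));

        // Sorting the strings mimics the lexicographic row order for
        // ASCII keys: the most visited URL comes out first.
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key);
        }
    }
}
```

With keys in this shape, reading the top N for one date+category amounts to scanning the first N rows that start with `<date>|<category>|`.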
