Hello Hari! If the only goal is to keep the results sorted, then yes, that's it: you have to rely on lexicographic order. One alternative would be to keep date|category as the RowKey and store your N URLs as columns of a Column Family, with padded_visits as the Column Qualifier and the URL as the value, since qualifiers within a row are kept sorted as well. In the end, it depends on how you need to access your log data.
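To illustrate the qualifier-based layout, here is a minimal sketch in plain Java, with no HBase client involved. It assumes the qualifier is the 8-byte big-endian encoding of Long.MAX_VALUE - visits, and it uses a TreeMap with unsigned byte comparison as a stand-in for how the qualifiers of one date|category row would be ordered. The class name, method names and sample URLs are made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class TopUrlsRow {
    // Hypothetical helper: qualifier = big-endian bytes of (Long.MAX_VALUE - visits),
    // so a larger visit count produces a smaller qualifier and sorts first.
    public static byte[] qualifier(long visits) {
        return ByteBuffer.allocate(Long.BYTES).putLong(Long.MAX_VALUE - visits).array();
    }

    // Recover the visit count from a qualifier built by qualifier().
    public static long visitsOf(byte[] qualifier) {
        return Long.MAX_VALUE - ByteBuffer.wrap(qualifier).getLong();
    }

    public static void main(String[] args) {
        // Unsigned lexicographic byte order, used here only to simulate sorted qualifiers.
        Comparator<byte[]> byteOrder = Arrays::compareUnsigned;
        // One "row" (e.g. date|category) holding qualifier -> URL pairs.
        TreeMap<byte[], String> row = new TreeMap<>(byteOrder);
        row.put(qualifier(120), "/home");
        row.put(qualifier(4500), "/products");
        row.put(qualifier(37), "/about");

        // Iteration follows qualifier order, i.e. most visited URL first.
        for (Map.Entry<byte[], String> e : row.entrySet()) {
            System.out.println(visitsOf(e.getKey()) + " " + e.getValue());
        }
    }
}
```

Running it lists /products (4500), then /home (120), then /about (37). Note that in this sketch two URLs with the same visit count would map to the same qualifier, so a real schema would probably need something extra in the qualifier (the URL itself or a hash of it) to keep them apart.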
Wasteful is relative... if you have to keep all those fields, storing them in the RowKey, the Column Qualifier or the value has the same 'physical' result: all of these values are repeated for every row. I'm not sure that last sentence is clear, but Lars George made a good diagram that explains it. It's in his HBase book, and also in this presentation: http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012 (see slides 14 and 15). If storage is a hard constraint, you can try to work with reduced data: one or two bytes can represent a good number of distinct categories, and if you know a theoretical upper limit for the total number of visits, you can probably use a type narrower than a Long. Also, are you aware of the effect of having a raw date at the start of your RowKey?

Regards,
Cristofer

-----Original Message-----
From: Hari Prasanna [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 10:51
To: [email protected]
Subject: Schema for sorted results

Hello -

I'm using HBase for web server log processing and I'm trying to save the top N URLs per category per day in a sorted manner in HBase. From what I've read, the only sortable structure that HBase offers is the lexicographic sort of the row keys. So, here is the rowkey format I'm currently using:

<date>|<category>|<padded_visits>|<url>

where padded_visits = Long.MAX_VALUE - visits

This seems wasteful because of the long rowkeys. Is there any other approach to maintain sorted results in HBase?

Thanks
Hari Prasanna
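As a rough illustration of the "reduced data" suggestion in the reply, the sketch below packs the same rowkey fields into fixed-width binary form. Everything specific here is an assumption made for the example: the date stored as a 4-byte yyyymmdd integer, a 2-byte category id taken from some lookup table maintained elsewhere, and visits assumed to fit in an int so that Integer.MAX_VALUE - visits can replace the Long version. All values are assumed non-negative, so that byte order matches numeric order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CompactRowKey {
    // Hypothetical layout: [date: 4 bytes][categoryId: 2 bytes][padded visits: 4 bytes][url bytes].
    // Fixed-width fields are written big-endian, so no '|' separators are needed.
    public static byte[] rowKey(int yyyymmdd, short categoryId, int visits, String url) {
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + Short.BYTES + Integer.BYTES + urlBytes.length)
                .putInt(yyyymmdd)
                .putShort(categoryId)
                .putInt(Integer.MAX_VALUE - visits) // more visits -> smaller value -> sorts first
                .put(urlBytes)
                .array();
    }

    public static void main(String[] args) {
        byte[] busy = rowKey(20120724, (short) 3, 4500, "/products");
        byte[] quiet = rowKey(20120724, (short) 3, 37, "/about");

        // A negative result means the busier URL sorts before the quieter one.
        System.out.println(Arrays.compareUnsigned(busy, quiet) < 0);
        // The prefix before the URL is always 10 bytes in this layout.
        System.out.println(busy.length - "/products".length());
    }
}
```

The fixed prefix is 10 bytes here, compared with the date, category name and 19-digit padded number in the string form of the key. Whether the saving is worth the loss of human-readable keys depends on the access pattern and on how the category ids are managed.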
