Hello Hari!

If the goal is just to keep the results sorted, then yes, that's it: you have 
to rely on lexicographic order. One alternative would be to use date|category 
as the RowKey and store your N URLs as columns of a Column Family, with 
padded_visits as the Column Qualifier and the URL as the value. In the end, it 
depends on how you need to access your log data. 
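
A rough sketch of that alternative layout, in plain Java and without the HBase 
client. The class and method names are made up for illustration, and the 
sorted map only stands in for the byte-wise ordering HBase applies to 
qualifiers inside a row; it assumes visits is never negative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper, not part of the HBase API: it only models how the
// qualifiers of one row would be ordered.
class TopUrlsRow {

    // RowKey "date|category", e.g. "20120724|sports".
    static byte[] rowKey(String date, String category) {
        return (date + "|" + category).getBytes(StandardCharsets.UTF_8);
    }

    // Column Qualifier: Long.MAX_VALUE - visits as 8 big-endian bytes,
    // so a higher visit count gives a smaller qualifier.
    static byte[] qualifier(long visits) {
        return ByteBuffer.allocate(Long.BYTES).putLong(Long.MAX_VALUE - visits).array();
    }

    // Returns the URLs in the order the qualifiers sort within the row.
    static List<String> urlsInQualifierOrder(String[] urls, long[] visits) {
        Comparator<byte[]> byteOrder = Arrays::compareUnsigned; // unsigned byte-wise, Java 9+
        TreeMap<byte[], String> columns = new TreeMap<>(byteOrder);
        for (int i = 0; i < urls.length; i++) {
            // Caveat: two URLs with the same visit count collide on the same qualifier.
            columns.put(qualifier(visits[i]), urls[i]);
        }
        return new ArrayList<>(columns.values());
    }

    public static void main(String[] args) {
        String[] urls = {"/a", "/b", "/c"};
        long[] visits = {5L, 100L, 42L};
        System.out.println(urlsInQualifierOrder(urls, visits));
    }
}
```

With the sample data in main, the most visited URL comes first. If ties are 
possible, one option would be to append the URL to the qualifier so that each 
column stays unique.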

Wasteful is relative... if you have to keep all those fields, storing them as 
part of the RowKey, the Column Qualifier or the value has the same 'physical' 
result: all these values are repeated for every row. I don't know if that last 
sentence is clear, but Lars George made a good diagram that explains it. It's 
in his HBase book, and also in this presentation: 
http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
(check slides 14 and 15). 

If storage is a hard constraint, you can try to work with reduced data... one 
or two bytes can represent a good number of distinct categories, and if you 
know a theoretical limit for the total number of visits you can probably work 
with a type narrower than a Long.
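
As a sketch of what that reduction could look like, assuming categories are 
mapped to numeric ids somewhere else (that lookup is not shown) and that 
visits never exceed Integer.MAX_VALUE. Both assumptions and all names here are 
hypothetical; adjust the widths to your real limits.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: fixed-width encodings that still sort correctly
// under unsigned byte-wise comparison.
class CompactKeyParts {

    // Category as a 2-byte id: room for 65,536 distinct categories (0..65535).
    static byte[] categoryId(int id) {
        if (id < 0 || id > 0xFFFF) {
            throw new IllegalArgumentException("category id out of range: " + id);
        }
        return ByteBuffer.allocate(Short.BYTES).putShort((short) id).array();
    }

    // Visits as 4 bytes instead of 8: Integer.MAX_VALUE - visits, big-endian,
    // so a higher visit count still sorts first.
    static byte[] paddedVisits(int visits) {
        if (visits < 0) {
            throw new IllegalArgumentException("visits must not be negative: " + visits);
        }
        return ByteBuffer.allocate(Integer.BYTES).putInt(Integer.MAX_VALUE - visits).array();
    }
}
```

That brings category plus padded_visits down to 6 bytes per entry, at the cost 
of keeping the category-to-id mapping somewhere.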

Also, are you aware of the effect of having a raw date as the start of your 
RowKey? 

Regards,
Cristofer

-----Original Message-----
From: Hari Prasanna [mailto:[email protected]] 
Sent: Tuesday, July 24, 2012 10:51
To: [email protected]
Subject: Schema for sorted results

Hello -

I'm using HBase for web server log processing and I'm trying to save the top N 
urls per category per day in a sorted manner in HBase. From what I've read, the 
only sortable structure that HBase offers is the lexicographic sort in the row 
keys. So, here is the rowkey format I'm currently using 
<date>|<category>|<padded_visits>|<url>
where,  padded_visits = Long.MAX_VALUE - visits

This seems wasteful because of the long rowkeys. Is there any other approach to 
maintain sorted results in HBase?

Thanks
Hari Prasanna
