JM - I am searching for the top N URLs by date+category, so this rowkey
does work well for my purpose.
Cristofer - I realize that having the raw date at the beginning of the
rowkey makes all the writes in a day rush to the same region server.
Maybe I could have the rowkey start with the category (which is more
evenly distributed) and put the date in the column qualifier.
I just went through the slides. They were very enlightening, thanks for that.
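
A rough sketch of the category-first idea, in plain Java with no HBase
client calls. The class and method names are made up for illustration,
and it assumes visit counts are non-negative and that all key parts are
ASCII. Note that this variant keeps the date in the rowkey (right after
the category) instead of moving it to the column qualifier, so that the
top N for one category and day is still a simple prefix scan:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, not part of HBase.
class RowKeySketch {

    // Invert the count so that larger visit counts sort first, then
    // zero-pad to 19 digits (the width of Long.MAX_VALUE) so that
    // lexicographic order matches numeric order.
    // Assumes visits >= 0.
    public static String paddedVisits(long visits) {
        return String.format("%019d", Long.MAX_VALUE - visits);
    }

    // Category first to spread a day's writes across categories,
    // then date, inverted visits, and url.
    public static String rowKey(String category, String date,
                                long visits, String url) {
        return category + "|" + date + "|" + paddedVisits(visits) + "|" + url;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(rowKey("news", "20120724", 15, "/a"));
        keys.add(rowKey("news", "20120724", 1200, "/b"));
        keys.add(rowKey("news", "20120724", 300, "/c"));

        // String order equals byte order for ASCII keys, which is
        // how HBase would order these rows.
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key);
        }
    }
}
```

Sorting these keys puts the url with the most visits first within the
category and day, which is the order a scan over that prefix would see.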

Thanks again!

On Tue, Jul 24, 2012 at 7:59 PM, Jean-Marc Spaggiari
<[email protected]> wrote:
> Hi Hari,
>
> Why do you think it's wasteful?
>
> Let's imagine this situation.
> Key=<date>|<category>|<padded_visits>|<url> Value = nothing.
>
> And this one:
> Key=<url> Value = <date>|<category>|<padded_visits>
>
> Both situations will, in the end, take up almost the same amount of
> space in the database.
>
> You can also do something like this:
> Key=<url> ColumnFamilyName=<date> Value=<category>|<padded_visits>
>
> The difference is that the first option will let you retrieve the
> information you are looking for very quickly.
>
> Now, are you sure that this key is really what you need? What will be
> the access model for your database? With the key you are using, you
> will have to search by date first. So if you want to find all the
> entries for one URL, you will have to scan the entire table, jumping
> to the next date each time you find it.
>
> If you are searching by date, then this key is good.
>
> So you really need to think first about the way you are going to read
> your data, and then, you will be able to design a key to match your
> needs.
>
> JM
>
> 2012/7/24, Minh Duc Nguyen <[email protected]>:
>> Hari,
>>
>>    According to the HBase book: http://hbase.apache.org/book.html#dm.sort
>>
>> All data model operations HBase return data in sorted order. First by row,
>> then by ColumnFamily, followed by column qualifier, and finally timestamp
>> (sorted in reverse, so newest records are returned first).
>>
>>     ~ Minh
>>
>> On Tue, Jul 24, 2012 at 9:50 AM, Hari Prasanna <[email protected]> wrote:
>>
>>> Hello -
>>>
>>> I'm using HBase for web server log processing and I'm trying to save
>>> the top N urls per category per day in a sorted manner in HBase. From
>>> what I've read, the only sortable structure that HBase offers is the
>>> lexicographic sort in the row keys. So, here is the rowkey format I'm
>>> currently using
>>> <date>|<category>|<padded_visits>|<url>
>>> where,  padded_visits = Long.MAX_VALUE - visits
>>>
>>> This seems wasteful because of the long rowkeys. Is there any other
>>> approach to maintain sorted results in HBase?
>>>
>>> Thanks
>>> Hari Prasanna
>>>
>>



-- 
Hari
