We use HBase for exactly this kind of workload and it works great. It all 
depends on how you design your data model in HBase. Based on your use case, 
define a good row key that embeds the timestamp. Your counters could be column 
qualifiers. Yes, aggregation needs to be done in your own code.
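A minimal sketch of that design (Python, with the table simulated as an in-memory sorted map; the row-key layout and helper names here are illustrative assumptions, not actual HBase API):

```python
# Sketch of the row-key design: metric_key + '#' + yyyymmdd, so that a
# range scan over one metric covers a contiguous slice of days.
# The "table" here is a plain dict standing in for HBase; counters live
# under column qualifiers like 'counter3'.

table = {
    "pageviews#20110901": {"counter3": 120},
    "pageviews#20110915": {"counter3": 80},
    "pageviews#20111002": {"counter3": 999},   # outside the range below
    "signups#20110910":   {"counter3": 5},
}

def scan(start_key, stop_key):
    """Simulate an HBase range scan: rows with start_key <= key < stop_key."""
    for key in sorted(table):
        if start_key <= key < stop_key:
            yield key, table[key]

def sum_counter(metric_key, start_day, stop_day, qualifier="counter3"):
    """Client-side aggregation: the scan returns the rows, we do the sum."""
    total = 0
    for _, columns in scan("%s#%s" % (metric_key, start_day),
                           "%s#%s" % (metric_key, stop_day)):
        total += columns.get(qualifier, 0)
    return total

print(sum_counter("pageviews", "20110901", "20111001"))  # 200
```

Because the timestamp sorts lexicographically inside the key, the scan only touches the rows for the requested window; the grouping and summing happen on the client, as noted above.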

In fact, we went a step further and even do deep dives using HBase. E.g., you 
may want not only a count of all IPs that hit your site on a given day but also 
the list of IPs itself.
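As an illustration of that deep-dive pattern (again a Python simulation; storing one column qualifier per IP is one common HBase idiom, not necessarily the exact schema described above):

```python
# One row per day; one column qualifier per IP under an 'ip' family.
# A single row read then yields both the distinct-IP count and the IP list.

day_row = {  # simulated row for key '20111113'
    "ip:10.0.0.1": 3,      # value = hits from that IP that day
    "ip:10.0.0.2": 1,
    "ip:192.168.1.9": 7,
}

def ips_for_day(row):
    """The qualifier names are the data: strip the family prefix to get IPs."""
    return sorted(q.split(":", 1)[1] for q in row if q.startswith("ip:"))

ips = ips_for_day(day_row)
print(len(ips))  # 3 -> distinct IPs that day
print(ips)       # the IP list itself
```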

Thanks
Vinod

Sent from my iPhone.

On Nov 13, 2011, at 3:39 PM, <[email protected]> wrote:

> 
> On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:
> 
>> Hi everyone, I have a question about HBase.
>> 
>> * Background:
>> I'm working on an analytics project and, so far, we are using MySQL as the
>> DBMS and Hadoop for data processing and aggregation. Currently, we collect
>> analytics data over HTTP and push it to Hadoop. Every day (in fact, every
>> night :P) we run Hadoop jobs that summarize the data into one-day series as
>> needed by each report (not relational; one denormalized table per report).
>> 
>> Each report table's structure looks something like:
>> * metric_key (text)
>> * timestamp
>> * counter1
>> * counter2
>> * counter3
>> * counter4
>> 
>> Querying this data is very straightforward in SQL systems: grouping by
>> metric_key, filtering by date, and using aggregation functions on the
>> counters to calculate factors and coefficients.
>> 
>> * Problem: as happens to everyone, the data is getting too big to fit on a
>> single SQL machine and performance is dropping. Currently, we receive about
>> 600k events per day, summarized (some get grouped, some get discarded) down
>> to ~350k metrics (metric_key+timestamp pairs).
>> 
>> * Question: reading the book, forums, and mailing list, I can't find any
>> clues about aggregation over arbitrary time-series slices. So, is there any
>> way to query HBase to get the sum of counter3 between 2011-09-01 and
>> 2011-10-01 for every metric_key?
>> 
> 
>> (I mean something like WHERE date > :date_low AND date < :date_high GROUP BY
>> metric_key)
>> 
>> I know that using the timestamp as part of the key allows range scanning the
>> table to fetch the rows. But I have no clue whether there is any way to do
>> sums in HBase. Or, if there is no way, is it crazy to do these aggregation
>> calculations on top of it, after querying?
>> 
> 
> In short, yes, you can. But you have to perform the grouping and summing in 
> your own code after scanning the rows between 2011-09-01 and 2011-10-01.
> 
> But if you only need to perform daily sums, you can partition your data by 
> date and then perform the aggregation using Java MapReduce, Hive, or Pig.
> 
> -- Victor
> 
>> These are web reports, so using Hadoop (Pig/Hive) to render this data is
>> completely ruled out.
>> 
>> -- 
>> Regards,
>> Samuel García.
> 
