We use HBase for exactly this kind of workload and it works great. It depends on how you design your data model in HBase. Based on your use case, define a good row key that includes the timestamp. Your counters could be column qualifiers. Yes, the aggregation needs to be done in code.
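A minimal sketch of that row-key idea (the names and the `|` separator are illustrative assumptions, not from the thread): concatenating the metric key with a zero-padded date means all rows for one metric sort chronologically, so a time slice becomes a contiguous range of row keys.

```java
public class RowKeys {
    // Zero-padded yyyyMMdd keeps lexicographic order equal to chronological
    // order, so rows for one metric and date range sit next to each other.
    // counter1..counter4 would then be column qualifiers in one family.
    public static String buildRowKey(String metricKey, String yyyymmdd) {
        return metricKey + "|" + yyyymmdd;
    }

    public static void main(String[] args) {
        System.out.println(buildRowKey("homepage", "20110901")); // homepage|20110901
    }
}
```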
In fact, we went a step further and even do deep dives using HBase. E.g. you not only want a count of all the IPs that hit your site on a day, but also the list of IPs itself.

Thanks,
Vinod

Sent from my iPhone.

On Nov 13, 2011, at 3:39 PM, <[email protected]> wrote:

> On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:
>
>> Hi everyone, I have a question about HBase.
>>
>> * Background:
>> I'm working on an analytics project and, so far, we are using MySQL as the
>> DBMS and Hadoop for data processing and aggregation. We collect analytics
>> data over HTTP and push it to Hadoop. Every day (in fact, every night :P) we
>> run Hadoop jobs that summarize the data into one-day series as needed by
>> each report (not relational; one denormalized table per report).
>>
>> Every report table structure is something like:
>> * metric_key (text)
>> * timestamp
>> * counter1
>> * counter2
>> * counter3
>> * counter4
>>
>> Querying this data is very straightforward in SQL systems: grouping by
>> metric_key, filtering by date, and using aggregation functions on the
>> counters to calculate factors and coefficients.
>>
>> * Problem: as for everyone, the data is getting too big to fit on a single
>> SQL machine and performance is dropping. We currently receive about 600k
>> events per day, summarized (some get grouped, some get discarded) to ~350k
>> metrics (metric_key + timestamp pairs).
>>
>> * Question: reading the book, forums, and mailing list, I can't find any
>> clues about aggregation over arbitrary time-series slices. So, is there any
>> way to query HBase to get the sum of counter3 between 2011-09-01 and
>> 2011-10-01 for every metric_key?
>> (I mean something like WHERE date > :date_low AND date < :date_high GROUP BY
>> metric_key.)
>>
>> I know that using the timestamp as part of the key allows range scans over
>> the table to fetch the rows. But I have no clue whether there is any way to
>> do sums in HBase.
>> Or, in case there is no way, is it crazy to do these aggregation
>> calculations on top of it, after querying?
>
> In short: yes, you can. But you have to perform the grouping and the sum in
> your own code after scanning the rows between 2011-09-01 and 2011-10-01.
>
> If you only need to perform daily sums, you can partition your data by
> date, then perform the aggregation using Java map-reduce, Hive, or Pig.
>
> -- Victor
>
>> These are web reports, so using Hadoop (Pig/Hive) to render this data is
>> completely ruled out.
>>
>> --
>> Regards,
>> Samuel García.
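As a hedged illustration of Victor's point (a sketch, not the HBase API): HBase keeps rows sorted by key, so a scan between two row keys returns the time slice, and the client sums counter3 itself. Here a `TreeMap` stands in for the sorted table, with `subMap(start, stop)` playing the role of `Scan(startRow, stopRow)`; the key layout and names are assumptions.

```java
import java.util.TreeMap;

public class CounterSum {
    // TreeMap stands in for HBase's lexicographically sorted rows;
    // subMap(fromInclusive, toExclusive) mimics a scan over [startRow, stopRow).
    public static long sumCounter3(TreeMap<String, long[]> rows,
                                   String metricKey, String from, String to) {
        long sum = 0;
        for (long[] counters : rows.subMap(metricKey + "|" + from,
                                           metricKey + "|" + to).values()) {
            sum += counters[2]; // index 2 == counter3
        }
        return sum;
    }

    public static void main(String[] args) {
        TreeMap<String, long[]> rows = new TreeMap<>();
        rows.put("m|20110901", new long[]{1, 2, 5, 4});
        rows.put("m|20110915", new long[]{1, 2, 7, 4});
        rows.put("m|20111002", new long[]{1, 2, 9, 4}); // outside the slice
        System.out.println(sumCounter3(rows, "m", "20110901", "20111001")); // 12
    }
}
```

Against a real cluster, the same shape holds: scan the key range, read the counter3 qualifier from each `Result`, and accumulate client-side.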
