On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:

> Hi everyone, I had a question about HBase.
>
> * Background:
> I'm working on an analytics project and, so far, we are using MySQL as the
> DBMS and Hadoop for data processing and aggregation. Currently we collect
> analytics data over HTTP and push it to Hadoop. Every day (in fact, every
> night :P) we run Hadoop jobs that summarize the data into one-day series,
> as needed by each report (not relational; one denormalized table per
> report).
>
> Each report table has a structure something like:
> * metric_key (text)
> * timestamp
> * counter1
> * counter2
> * counter3
> * counter4
>
> Querying this data is very straightforward in SQL systems: group by
> metric_key, filter by date, and use aggregation functions on the counters
> to calculate factors and coefficients.
>
> * Problem: as for everyone, the data has grown too big to fit on a single
> SQL machine and performance is dropping. Currently we receive about 600k
> events per day, summarized (some get grouped, some get discarded) to ~350k
> metrics (metric_key + timestamp pairs).
>
> * Question: reading the book, the forums, and this mailing list, I can't
> find any clues about aggregation over arbitrary time-series slices. So, is
> there any way to query HBase to get the sum of counter3 between 2011-09-01
> and 2011-10-01 for every metric_key?
> (I mean something like WHERE date > :date_low AND date < :date_high GROUP
> BY metric_key.)
>
> I know that using the timestamp as part of the row key allows range scans
> over the table to fetch the rows. But I have no clue whether there is any
> way to do sums in HBase. Or, if there is no way, would it be crazy to do
> these aggregation calculations on top of it, after querying?

In short: yes, you can, but you have to perform the grouping and summing in
your own code after scanning the rows between 2011-09-01 and 2011-10-01 (see
the sketch below).

If you only need daily sums, you can also partition your data by date and
then perform the aggregation with Java MapReduce, Hive, or Pig.

-- Victor

> These are web reports, so using Hadoop (Pig/Hive) to render this data is
> completely ruled out.
>
> --
> Regards,
> Samuel García.
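A minimal sketch of the client-side scan-and-sum Victor describes. It
assumes a hypothetical schema that is not from the thread: row keys of the
form yyyy-MM-dd/metric_key (date first, so one bounded scan covers the
slice), a table named "report", and a column family "c" whose counters are
stored as 8-byte longs.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class Counter3RangeSum {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "report"); // hypothetical table name

    // Bound the scan by date. The stop row is exclusive, so use the day
    // after the last day we want included.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("2011-09-01"));
    scan.setStopRow(Bytes.toBytes("2011-10-02"));
    scan.addColumn(Bytes.toBytes("c"), Bytes.toBytes("counter3")); // hypothetical family/qualifier
    scan.setCaching(1000); // fetch rows in batches to cut down on RPCs

    // The GROUP BY happens here, on the client: one running sum per metric_key.
    Map<String, Long> sums = new HashMap<String, Long>();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        String row = Bytes.toString(r.getRow());
        String metricKey = row.substring(row.indexOf('/') + 1); // assumes yyyy-MM-dd/metric_key keys
        byte[] value = r.getValue(Bytes.toBytes("c"), Bytes.toBytes("counter3"));
        long prev = sums.containsKey(metricKey) ? sums.get(metricKey) : 0L;
        sums.put(metricKey, prev + Bytes.toLong(value));
      }
    } finally {
      scanner.close();
      table.close();
    }

    for (Map.Entry<String, Long> e : sums.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}

The key layout is a real trade-off: a date-first key makes an arbitrary time
slice one cheap scan but concentrates writes on the newest dates, while a
metric-first key (metric_key/yyyy-MM-dd) spreads the write load but needs
one short scan per metric.

For Victor's second suggestion, daily pre-aggregation, a bare-bones
MapReduce job might look like the following. It assumes tab-separated input
lines of metric_key, date, counter1..counter4; the field layout and paths
are assumptions, not from the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sums counter3 per (metric_key, date) over lines of the assumed form:
// metric_key \t date \t counter1 \t counter2 \t counter3 \t counter4
public class DailySum {

  public static class SumMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      // Key on metric_key + date so each reduce group is one daily point.
      ctx.write(new Text(f[0] + "\t" + f[1]),
                new LongWritable(Long.parseLong(f[4])));
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "daily-counter3-sum");
    job.setJarByClass(DailySum.class);
    job.setMapperClass(SumMapper.class);
    job.setCombinerClass(SumReducer.class); // sums are associative, so the reducer doubles as combiner
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. one dated input dir per day
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}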
