Hi everyone, I have a question about HBase.

* Background:
I'm working on an analytics project and, so far, we are using MySQL as our
DBMS and Hadoop for data processing and aggregation. Currently, we collect
analytics data over HTTP and push it to Hadoop. Every day (in fact, every
night :P) we run Hadoop jobs that summarize the data into one-day series,
as needed by each report (not relational: one denormalized table per
report).

Each report table has a structure something like:
* metric_key (text)
* timestamp
* counter1
* counter2
* counter3
* counter4

Querying this data is very straightforward in SQL systems: grouping by
metric_key, filtering by date, and using aggregate functions on the
counters to calculate factors and coefficients.

* Problem: as happens to everyone, the data has grown too big to fit on a
single SQL machine and performance is dropping. Currently, we receive about
600k events per day, summarized (some get grouped, some get discarded) into
~350k metrics (metric_key + timestamp pairs).

* Question: reading the book, the forums, and the mailing list, I haven't
found any clues about aggregation over arbitrary time-series slices. So, is
there any way to query HBase to get the sum of counter3 between 2011-09-01
and 2011-10-01 for every metric_key?

(I mean something like WHERE date > :date_low AND date < :date_high GROUP
BY metric_key)

I know that using the timestamp as part of the row key allows range scans
over the table to fetch the rows. But I have no clue whether there is any
way to do sums in HBase. Or, if there isn't, would it be crazy to do these
aggregation calculations on top of it, after querying?

These are web reports, so using Hadoop (Pig/Hive) to render this data is
completely ruled out.

-- 
Regards,
Samuel García.
