We use HBase for exactly this kind of workload and it works great. It all 
depends on how you design your data model in HBase. Based on your use case, 
define a good row key that embeds the timestamp. Your counters could be column 
qualifiers. Yes, aggregation needs to be done in your own code.
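A minimal sketch of that design (Python, with the table simulated as an in-memory sorted map; the row-key layout and helper names here are illustrative assumptions, not actual HBase API):

```python
# Sketch of the row-key design: metric_key + '#' + yyyymmdd, so that a
# range scan over one metric covers a contiguous slice of days.
# The "table" here is a plain dict standing in for HBase; counters live
# under column qualifiers like 'counter3'.

table = {
    "pageviews#20110901": {"counter3": 120},
    "pageviews#20110915": {"counter3": 80},
    "pageviews#20111002": {"counter3": 999},   # outside the range below
    "signups#20110910":   {"counter3": 5},
}

def scan(start_key, stop_key):
    """Simulate an HBase range scan: rows with start_key <= key < stop_key."""
    for key in sorted(table):
        if start_key <= key < stop_key:
            yield key, table[key]

def sum_counter(metric_key, start_day, stop_day, qualifier="counter3"):
    """Client-side aggregation: the scan returns the rows, we do the sum."""
    total = 0
    for _, columns in scan("%s#%s" % (metric_key, start_day),
                           "%s#%s" % (metric_key, stop_day)):
        total += columns.get(qualifier, 0)
    return total

print(sum_counter("pageviews", "20110901", "20111001"))  # 200
```

Because the timestamp sorts lexicographically inside the key, the scan only touches the rows for the requested window; the grouping and summing happen on the client, as noted above.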

In fact, we went a step further and even do deep dives using HBase. E.g., you 
may want not only a count of all IPs that hit your site on a given day but also 
the list of IPs itself.
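As an illustration of that deep-dive pattern (again a Python simulation; storing one column qualifier per IP is one common HBase idiom, not necessarily the exact schema described above):

```python
# One row per day; one column qualifier per IP under an 'ip' family.
# A single row read then yields both the distinct-IP count and the IP list.

day_row = {  # simulated row for key '20111113'
    "ip:10.0.0.1": 3,      # value = hits from that IP that day
    "ip:10.0.0.2": 1,
    "ip:192.168.1.9": 7,
}

def ips_for_day(row):
    """The qualifier names are the data: strip the family prefix to get IPs."""
    return sorted(q.split(":", 1)[1] for q in row if q.startswith("ip:"))

ips = ips_for_day(day_row)
print(len(ips))  # 3 -> distinct IPs that day
print(ips)       # the IP list itself
```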

Thanks
Vinod

Sent from my iPhone.

On Nov 13, 2011, at 3:39 PM, <[email protected]> wrote:

> 
> On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:
> 
>> Hi everyone, I have a question about HBase.
>> 
>> * Background:
>> I'm working on an analytics project and, so far, we are using MySQL as the
>> DBMS and Hadoop for data processing and aggregation. Currently, we collect
>> analytics data over HTTP and push it to Hadoop. Every day (in fact, every
>> night :P) we run Hadoop jobs that summarize the data into one-day series as
>> needed by each report (not relational; one denormalized table per report).
>> 
>> Each report table's structure looks something like:
>> * metric_key (text)
>> * timestamp
>> * counter1
>> * counter2
>> * counter3
>> * counter4
>> 
>> Querying this data is very straightforward in SQL systems: grouping by
>> metric_key, filtering by date, and using aggregation functions on the
>> counters to calculate factors and coefficients.
>> 
>> * Problem: as happens to everyone, the data is getting too big to fit on a
>> single SQL machine and performance is dropping. Currently, we receive about
>> 600k events per day, summarized (some get grouped, some get discarded) down
>> to ~350k metrics (metric_key+timestamp pairs).
>> 
>> * Question: reading the book, forums, and mailing list, I can't find any
>> clues about aggregation over arbitrary time-series slices. So, is there any
>> way to query HBase to get the sum of counter3 between 2011-09-01 and
>> 2011-10-01 for every metric_key?
>> 
> 
>> (I mean something like WHERE date > :date_low AND date < :date_high GROUP BY
>> metric_key)
>> 
>> I know that using the timestamp as part of the key allows range scanning the
>> table to fetch the rows. But I have no clue whether there is any way to do
>> sums in HBase. Or, if there is no way, is it crazy to do these aggregation
>> calculations on top of it, after querying?
>> 
> 
> In short, yes, you can. But you have to perform the grouping and summing in 
> your own code after scanning the rows between 2011-09-01 and 2011-10-01.
> 
> But if you only need to perform daily sums, you can partition your data by 
> date and then perform the aggregation using Java MapReduce, Hive, or Pig.
> 
> -- Victor
> 
>> These are web reports, so using Hadoop (Pig/Hive) to render this data is
>> completely ruled out.
>> 
>> -- 
>> Regards,
>> Samuel García.
> 
