Re: Web analytics and HBase

Doug Meil Mon, 14 Nov 2011 07:37:55 -0800

Hi there-

See...
http://hbase.apache.org/book.html#rowkey.design
http://hbase.apache.org/book.html#mapreduce.example.summary
http://hbase.apache.org/book.html#precreate.regions



Focus especially on the rowkey section, because it mentions OpenTSDB
specifically.
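
To make the rowkey point concrete, here is a minimal sketch of a composite row key that leads with the metric key and ends with a fixed-width timestamp, so that writes arriving at the same moment are not all sorted into one region. The layout and the names make_row_key and scan_bounds are assumptions for illustration only; this is not OpenTSDB's actual schema, and it assumes metric keys never contain a NUL byte.

```python
import struct
from typing import Tuple


def make_row_key(metric_key: str, epoch_seconds: int) -> bytes:
    """Hypothetical composite key: metric key, NUL separator, 8-byte timestamp."""
    # Big-endian fixed-width encoding makes byte order match numeric order,
    # so rows for one metric sort by time.
    return metric_key.encode("utf-8") + b"\x00" + struct.pack(">Q", epoch_seconds)


def scan_bounds(metric_key: str, start: int, stop: int) -> Tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) row keys for one metric's time range."""
    return make_row_key(metric_key, start), make_row_key(metric_key, stop)
```

With keys like these, fetching one metric's rows for a date range is a single range scan between the two bounds; scanning every metric for a date range means one such scan per metric, or a full scan with a filter.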




On 11/13/11 11:10 PM, "[email protected]" <[email protected]> wrote:

>A note: storing time series data in HBase can cause hot spots and
>splits... have you looked at OpenTSDB?
>
>Inder
>
>-----Original Message-----
>From: <[email protected]>
>Date: Sun, 13 Nov 2011 23:39:51
>To: <[email protected]>
>Reply-To: [email protected]
>Subject: Re: Web analytics and HBase
>
>
>On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:
>
>> Hi everyone, I have a question about HBase.
>> 
>> * Background:
>> I'm working on an analytics project and, so far, we are using MySQL as the
>> DBMS and Hadoop for data processing and aggregation. Currently, we collect
>> analytics data over HTTP and push it to Hadoop. Every day (in fact, every
>> night :P) we run Hadoop jobs to summarize the data into one-day series as
>> needed by every report (not relational, one denormalized table for every
>> report).
>> 
>> Every report table structure is something like
>> * metric_key (text)
>> * timestamp
>> * counter1
>> * counter2
>> * counter3
>> * counter4
>> 
>> Querying this data is very straightforward in SQL systems: grouping by
>> metric_key, filtering by date, and using aggregation functions on the
>> counters to calculate factors and coefficients.
>> 
>> * Problem: as for everyone, the data gets too big to fit on a single SQL
>> machine and performance is dropping. Currently, we receive about 600k
>> events per day, summarized (some get grouped, some get discarded) to ~350k
>> metrics (metric_key+timestamp pairs).
>> 
>> * Question: reading the book, forum, and mailing list, I can't find any
>> clues about aggregation based on arbitrary time series slices. So, is there
>> any way to query HBase to get the counter3 sum between 2011-09-01 and
>> 2011-10-01 for every metric_key?
>> 
>> (I mean something like where date > :date_low AND date < :date_high group
>> by metric_key)
>> 
>> I know that using the timestamp as part of the key allows a range scan of
>> the table to fetch the rows. But I have no clue whether there is any way to
>> do sums in HBase. Or, if there is no way, is it crazy to do these
>> aggregation calculations on top of it, after querying?
>> 
>
>In short, yes, you can. But you have to perform the grouping and sum in
>your own code after scanning the rows between 2011-09-01 and 2011-10-01.
>
>But if you only need to perform daily sums, you can partition your data by
>date, then perform the aggregation using Java MapReduce, Hive, or Pig.
>
>-- Victor
>
>> These are web reports, so using Hadoop (Pig/Hive) to render this data is
>> ruled out.
>> 
>> -- 
>> Un saludo,
>> Samuel García.
>
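
Victor's suggestion in the quoted thread, grouping and summing in your own code after scanning the date range, might look roughly like the sketch below. The rows list is a made-up stand-in for the results of an HBase scan (no HBase client is called), the function name sum_counter_by_metric is hypothetical, and treating the range as half-open (start included, end excluded) is an assumption about what "between" should mean.

```python
from collections import defaultdict
from datetime import date


def sum_counter_by_metric(rows, counter, date_low, date_high):
    """Sum `counter` per metric_key for rows with date_low <= day < date_high."""
    totals = defaultdict(int)
    for metric_key, day, counters in rows:
        if date_low <= day < date_high:
            # A row that lacks the counter contributes nothing to the sum.
            totals[metric_key] += counters.get(counter, 0)
    return dict(totals)


# Illustrative rows only: (metric_key, day, {counter name: value}).
rows = [
    ("home", date(2011, 9, 1), {"counter3": 5}),
    ("home", date(2011, 9, 15), {"counter3": 7}),
    ("cart", date(2011, 9, 30), {"counter3": 2}),
    ("home", date(2011, 10, 1), {"counter3": 100}),  # on the excluded end date
]
totals = sum_counter_by_metric(rows, "counter3", date(2011, 9, 1), date(2011, 10, 1))
```

This is the client-side equivalent of the SQL "group by metric_key" with a date filter. It only needs one pass over the scanned rows, but every matching row still has to travel to the client.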
