We're new to HBase, but somewhat familiar with its core concepts. We use
MySQL now, and have also used Cassandra for portions of our code. We feel
that HBase is a better fit because of its tight integration with MapReduce
and the proven stability of the underlying Hadoop system.

We run an advertising network in which we collect several thousand pieces of
analytical data per second. This obviously scales poorly in MySQL. Our
initial gut feeling is to do something like the following with HBase. Let me
know if we are on the right track.

Load our detailed raw stats, containing all of our verbose data, into HBase.
From there, we can run MapReduce jobs to create hourly, daily, monthly, etc.
rollups of the data as needed for our different front-end interfaces. The
rollups would be stored already formatted the way we need them, so we don't
have to do any further processing at display time. This would also give us
the flexibility to create new views with new rollup metrics, since we keep
all of the raw data and can MapReduce it again any way we need.
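
To make the rollup idea concrete, here is a rough sketch in plain Python of
the map and reduce steps we have in mind. It does not touch HBase or Hadoop
at all; the event layout, the channel names, and the function names are all
made up for illustration, and a real job would read from and write to HBase
tables instead of Python lists and dicts.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw events: (unix_timestamp, channel, metric, value).
RAW_EVENTS = [
    (1302652800, "channel-a", "impressions", 1),  # 2011-04-13 00:00 UTC
    (1302652860, "channel-a", "clicks", 1),       # 2011-04-13 00:01 UTC
    (1302656400, "channel-a", "impressions", 1),  # 2011-04-13 01:00 UTC
    (1302739200, "channel-b", "impressions", 1),  # 2011-04-14 00:00 UTC
]

# Time-bucket formats for each rollup granularity (assumed naming).
BUCKET_FORMATS = {"hourly": "%Y%m%d%H", "daily": "%Y%m%d", "monthly": "%Y%m"}


def map_event(event, granularity):
    """Map step: turn one raw event into a ((channel, bucket, metric), value) pair."""
    ts, channel, metric, value = event
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    bucket = when.strftime(BUCKET_FORMATS[granularity])
    return (channel, bucket, metric), value


def rollup(events, granularity):
    """Reduce step: sum the values of all events that share the same key."""
    totals = defaultdict(int)
    for event in events:
        key, value = map_event(event, granularity)
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    for name in BUCKET_FORMATS:
        print(name, rollup(RAW_EVENTS, name))
```

Because the raw events are kept, adding a new view would just mean running
the same kind of job again with a different key.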

For simple graphs and a more real-time view of simple data like clicks and
impressions, we thought about simply incrementing hourly, daily, and monthly
counters for a user or revenue channel.
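
As a rough sketch of the counter idea, each event would bump one counter per
time granularity. The key layout below (entity, metric, granularity, time
bucket joined with "|") is only an assumption for illustration, and an
in-memory Counter stands in for the HBase table.

```python
from collections import Counter
from datetime import datetime, timezone


def counter_keys(entity, metric, ts):
    """Build the hourly, daily, and monthly counter keys for one event.

    The key layout is hypothetical, not a recommended HBase row key design.
    """
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return [
        f"{entity}|{metric}|h|{t:%Y%m%d%H}",
        f"{entity}|{metric}|d|{t:%Y%m%d}",
        f"{entity}|{metric}|m|{t:%Y%m}",
    ]


def record(counters, entity, metric, ts, amount=1):
    """Increment every counter affected by one event.

    `counters` is an in-memory stand-in; against a real store each += would
    need to be an atomic increment so concurrent writers do not lose updates.
    """
    for key in counter_keys(entity, metric, ts):
        counters[key] += amount


if __name__ == "__main__":
    counters = Counter()
    record(counters, "user42", "clicks", 1302652800)  # 2011-04-13 00:00 UTC
    record(counters, "user42", "clicks", 1302656400)  # 2011-04-13 01:00 UTC
    for key in sorted(counters):
        print(key, counters[key])
```

Reading a graph would then be a matter of fetching the keys for the time
range in question, with no aggregation at display time.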

The other consideration is getting the data into HBase. We were looking at
adding variables to our URLs so we can aggregate the Apache logs from each
of our front-end application servers. Alternatively, we could simply do the
inserts straight into HBase using PHP and Thrift. I'm guessing the first
approach is faster, but again, I may be overlooking other issues.
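
For the log-based route, this is roughly what pulling our URL variables back
out of an Apache access log line could look like. It assumes the common or
combined log format, and the tracking URL and parameter names (uid, channel,
event) are invented for the example.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the leading fields of an Apache common/combined log line.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def extract_event(line):
    """Return the request time, status, and URL variables, or None if unparseable."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    query = parse_qs(urlsplit(match.group("url")).query)
    return {
        "time": match.group("time"),
        "status": int(match.group("status")),
        # Keep only the first value of each repeated parameter.
        "params": {name: values[0] for name, values in query.items()},
    }


if __name__ == "__main__":
    # Hypothetical log line with made-up tracking parameters.
    sample = (
        '10.0.0.1 - - [13/Apr/2011:00:00:00 +0000] '
        '"GET /track.gif?uid=42&channel=a&event=click HTTP/1.1" 200 43'
    )
    print(extract_event(sample))
```

Events extracted this way could be batched and loaded periodically, rather
than written one at a time on each request.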

Does this basic data strategy sound solid? Any suggestions or potential
pitfalls? I would love some advice from those more seasoned in handling
large-volume analytical datasets.

Thanks guys

Brian
-- 
View this message in context: 
http://old.nabble.com/hbase-architecture-question-tp31374398p31374398.html
Sent from the HBase User mailing list archive at Nabble.com.
