We're new to HBase, but somewhat familiar with its core concepts. We use MySQL now, and have also used Cassandra for portions of our code. We feel HBase is a better fit because of its tight integration with MapReduce and the proven stability of the underlying Hadoop system.
We run an advertising network in which we collect several thousand pieces of analytical data per second. This obviously scales poorly in MySQL. Our initial gut feeling is to do something like the following with HBase; let me know if we are on the right track.

Aggregate our detailed raw stats into HBase tables that contain all of our verbose data. From there, we can run MapReduce jobs to create hourly, daily, monthly, etc. rollups of our data as needed for our different front-end interfaces. We'd store the rollups pre-formatted for display so we don't have to do any further processing when we hit display time. This would also give us the flexibility to create new views with new rollup metrics later, since we keep all of our raw data and can MapReduce it any way we need.

For simple graphs and a more realtime view of simple data like clicks and impressions, we thought about simply incrementing hourly, daily, and monthly counters per user or revenue channel.

The other consideration is getting the data into HBase. We were looking at adding variables to our URLs so we can aggregate the Apache logs from each of our front-end application servers. Alternatively, we could do the inserts straight into HBase using PHP and Thrift. I'm guessing the first scenario is more efficient speed-wise, but again, I may be overlooking other issues.

Does this basic data strategy sound solid? Any suggestions or potential pitfalls? I would love some advice from those more seasoned in handling large-volume analytical datasets.

Thanks guys,

Brian

--
View this message in context: http://old.nabble.com/hbase-architecture-question-tp31374398p31374398.html
Sent from the HBase User mailing list archive at Nabble.com.
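P.S. To make the counter idea concrete, here is a minimal sketch of how one click could fan out into hourly, daily, and monthly counter increments. The schema here (channel id in the row key, a `stats:clicks` column, time-bucketed keys) is just an assumed example for discussion, not anything HBase prescribes; in real code each pair would become an atomic HBase increment rather than a print.

```python
from datetime import datetime, timezone

def counter_cells(channel, ts):
    """Build (row_key, column) pairs for hourly/daily/monthly counters.

    Row keys bucket a channel by time granularity, so one click or
    impression turns into three counter increments.
    """
    buckets = {
        "hourly":  ts.strftime("%Y%m%d%H"),
        "daily":   ts.strftime("%Y%m%d"),
        "monthly": ts.strftime("%Y%m"),
    }
    return [(f"{channel}:{gran}:{bucket}", "stats:clicks")
            for gran, bucket in buckets.items()]

# Example: one click event fans out to three counter increments.
ts = datetime(2011, 4, 8, 14, 30, tzinfo=timezone.utc)
for row, col in counter_cells("channel42", ts):
    # In real code this would be an HBase incrementColumnValue call;
    # here we just show the keys being written to.
    print(row, col)
```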
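The rollup step we're imagining is basically a word-count-style MapReduce over the raw table. Here is a tiny in-memory sketch of what an hourly clicks rollup would compute; the event fields (channel, ts, clicks) are invented for illustration:

```python
from collections import defaultdict

# Raw events as they'd sit in the verbose table; field names are
# made-up examples, not a real schema.
events = [
    {"channel": "a", "ts": "2011-04-08T14:05", "clicks": 1},
    {"channel": "a", "ts": "2011-04-08T14:40", "clicks": 2},
    {"channel": "a", "ts": "2011-04-08T15:10", "clicks": 1},
    {"channel": "b", "ts": "2011-04-08T14:59", "clicks": 5},
]

def map_phase(event):
    # Key on (channel, hour) so the reduce step yields hourly totals.
    hour = event["ts"][:13]  # "2011-04-08T14"
    yield (event["channel"], hour), event["clicks"]

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

rollup = reduce_phase(kv for e in events for kv in map_phase(e))
print(rollup[("a", "2011-04-08T14")])  # 3
```

The daily and monthly rollups would just re-key on a shorter time prefix.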
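And for the Apache-log ingestion path, the idea is that our tracked variables ride along in the request URL and get recovered from the logs afterwards. A rough sketch, assuming a combined-format log line and made-up variable names (`ch`, `ev`):

```python
import re
from urllib.parse import urlparse, parse_qs

# One Apache combined-log line with analytics variables appended to
# the request URL; "ch" and "ev" are assumed example parameters.
line = ('10.0.0.1 - - [08/Apr/2011:14:30:00 +0000] '
        '"GET /px.gif?ch=channel42&ev=click HTTP/1.1" 200 43')

def parse_event(log_line):
    """Pull the tracked variables back out of a logged request URL."""
    m = re.search(r'"GET (\S+) HTTP', log_line)
    if not m:
        return None
    qs = parse_qs(urlparse(m.group(1)).query)
    # parse_qs returns lists of values; flatten the single-value case.
    return {k: v[0] for k, v in qs.items()}

print(parse_event(line))  # {'ch': 'channel42', 'ev': 'click'}
```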
