I'm using HBase for similar stats. Some things I've learned:

- date/time as the key is good, because that way it's very easy to get the last N results (for a chart, for example), and it's much more scalable than timestamps
- several column families on one date/time are useful
- use different tables for different levels of aggregation (hour, date, week, month, year)
- you can increment long values when you need to know a total: http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue(byte[], byte[], byte[], long)
- MapReduce (MR) jobs are a good and scalable way of processing this type of data
- data size is unlimited, so it's fine to write to multiple tables
- optimize for the reads you're going to make, not for writes

To import some of our logs, I'm using a Java program which is called via logrotate every 10 minutes (but be careful with that one, because if the HBase client freezes, as happened to me after the 0.20.4 upgrade, memory can fill up very quickly).
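
As a rough illustration of the date/time keys and per-level tables mentioned above, here is a minimal sketch in plain Java with no HBase dependency. The class name RowKeys, the key formats, the use of UTC, and the zero-padded placement/creative ids are all assumptions made for the example, not something taken from an actual setup.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.IsoFields;
import java.util.Locale;

// Hypothetical helper (not part of HBase): builds string row keys for
// per-level aggregation tables from an event time in epoch milliseconds.
final class RowKeys {
    // Fixed-width, most-significant-field-first patterns, so that
    // lexicographic key order equals chronological order.
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("uuuuMMddHH", Locale.ROOT);
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("uuuuMMdd", Locale.ROOT);
    private static final DateTimeFormatter MONTH =
            DateTimeFormatter.ofPattern("uuuuMM", Locale.ROOT);

    private RowKeys() {}

    // Assumption: all keys are computed in UTC.
    private static ZonedDateTime utc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
    }

    static String hourKey(long epochMillis) {
        return HOUR.format(utc(epochMillis));
    }

    static String dayKey(long epochMillis) {
        return DAY.format(utc(epochMillis));
    }

    static String monthKey(long epochMillis) {
        return MONTH.format(utc(epochMillis));
    }

    // ISO week key: week-based year plus two-digit week number.
    static String weekKey(long epochMillis) {
        ZonedDateTime t = utc(epochMillis);
        return String.format(Locale.ROOT, "%04dW%02d",
                t.get(IsoFields.WEEK_BASED_YEAR),
                t.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR));
    }

    // Composite key for date/placement/creative reports. Ids are zero-padded
    // (non-negative ids assumed) so that string order matches numeric order.
    static String compositeKey(String dateKey, long placementId, long creativeId) {
        return String.format(Locale.ROOT, "%s/%010d/%010d",
                dateKey, placementId, creativeId);
    }
}
```

Each key would then be used as the row in the table for its aggregation level, with the total bumped through the incrementColumnValue call linked above. Because the keys sort chronologically, the last N results for a chart come out of a single range scan.
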
There's also a Python project for analytical data: http://github.com/zohmg/zohmg

Hope that helps,
-- Viktors

On Tue, May 25, 2010 at 12:44 AM, Alex Thurlow <[email protected]> wrote:
> Hi list,
> With HBase's great write speed, I was thinking it would be a good thing
> to switch an app that logs to a database to logging to HBase. I couldn't
> really find anyone else who's using it that way though. Are there reasons I
> shouldn't? If I should, how should I structure my data?
>
> It's basically going to be data for an ad server, so the relevant stuff
> would be the timestamp, the id of the ad placement, and the id of the
> creative that showed. Some other data would be stored, but I wouldn't need
> to search on it.
>
> I would be wanting to make reports out of that data by date, date/placement
> id, date/creative id, date/placementid/creativeid
>
> Should I just log with the timestamp as the key and then pull the whole
> range and filter when I need the data or should I log everything three times
> so I can pull by whichever key I need?
>
> I'm fairly new to HBase, although I've used Cassandra some, so I have an
> idea of how this kind of works. I just can't quite get my head around the
> right way to use it for this purpose.
>
> Thanks,
> -Alex
>
>

--
http://rotanovs.com - personal blog | http://www.hitgeist.com - fastest growing websites
