Hi,

If you're not going to access this logged data as-is from HBase, but will instead use it to produce statistics etc. (perhaps via MR jobs), then I wouldn't recommend using either date/time or a timestamp as the row key. That makes HBase write all logs to a single RegionServer at any given time, so at any moment only one box carries the hot load whereas the others just wait their turn. I'd suggest using some random value for the key instead, like a UUID.
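Something along these lines, as a minimal sketch against the 0.20-era client API (the "ad_log" table, "event" family, and column names are made up for illustration):

import java.util.UUID;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AdLogWriter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "ad_log");

    // A random UUID as the row key spreads writes across all regions
    // instead of hammering whichever region owns "now".
    Put put = new Put(Bytes.toBytes(UUID.randomUUID().toString()));
    put.add(Bytes.toBytes("event"), Bytes.toBytes("ts"),
        Bytes.toBytes(System.currentTimeMillis()));
    put.add(Bytes.toBytes("event"), Bytes.toBytes("placement_id"),
        Bytes.toBytes("placement-123"));
    put.add(Bytes.toBytes("event"), Bytes.toBytes("creative_id"),
        Bytes.toBytes("creative-456"));
    table.put(put);
    table.flushCommits();
  }
}

The trade-off is that you give up cheap time-range scans, so reporting would lean on full scans / MR jobs rather than key-range reads.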
Alex Baranau
http://sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/
http://search-hadoop.com - Search Hadoop, HDFS, MapReduce, HBase, and other related projects.

On Tue, May 25, 2010 at 2:32 AM, Viktors Rotanovs <[email protected]> wrote:

> I'm using HBase for similar stats; some things I've learned:
> - date/time as the key is good because that way it's very easy to get the
>   last N results (for a chart, for example), and it's much more scalable
>   than timestamps
> - several column families on one date/time are useful
> - as are different tables for different levels of aggregation (hour,
>   date, week, month, year)
> - you can increment long values when you need to know a total:
>   http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue(byte[], byte[], byte[], long)
> - MR jobs are a good and scalable way of processing this type of data
> - data size is unlimited, so it's fine to write to multiple tables
> - optimize for the reads you're going to make, not for writes.
>
> To import some of our logs, I'm using a Java program which is called
> via logrotate every 10 minutes (but be careful with that one, because
> if the HBase client freezes, as happened to me after the 0.20.4 upgrade,
> memory can fill up very quickly).
>
> There's also a Python project for analytical data:
> http://github.com/zohmg/zohmg
>
> Hope that helps,
> -- Viktors
>
> On Tue, May 25, 2010 at 12:44 AM, Alex Thurlow <[email protected]> wrote:
> > Hi list,
> > With HBase's great write speed, I was thinking it would be a good thing
> > to switch an app that logs to a database to logging to HBase. I couldn't
> > really find anyone else who's using it that way, though. Are there reasons I
> > shouldn't? If I should, how should I structure my data?
> >
> > It's basically going to be data for an ad server, so the relevant stuff
> > would be the timestamp, the id of the ad placement, and the id of the
> > creative that showed. Some other data would be stored, but I wouldn't need
> > to search on it.
> >
> > I would want to make reports out of that data by date, date/placement id,
> > date/creative id, and date/placement id/creative id.
> >
> > Should I just log with the timestamp as the key and then pull the whole
> > range and filter when I need the data, or should I log everything three times
> > so I can pull by whichever key I need?
> >
> > I'm fairly new to HBase, although I've used Cassandra some, so I have an
> > idea of how this kind of works. I just can't quite get my head around the
> > right way to use it for this purpose.
> >
> > Thanks,
> > -Alex
>
> --
> http://rotanovs.com - personal blog | http://www.hitgeist.com - fastest growing websites
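P.S. For what it's worth, here's a minimal sketch of the counter pattern Viktors links to above (the "hourly_stats" table, "stats" family, and the yyyyMMddHH row-key scheme are made-up examples, not anything prescribed by the thread):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ImpressionCounter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "hourly_stats");

    // Row key = hour bucket (yyyyMMddHH), one counter column per placement.
    // incrementColumnValue is atomic on the RegionServer side, so many
    // concurrent writers can bump the same total without read-modify-write races.
    long total = table.incrementColumnValue(
        Bytes.toBytes("2010052502"),        // the hour this impression falls into
        Bytes.toBytes("stats"),             // column family
        Bytes.toBytes("placement_42"),      // one counter per placement id
        1L);                                // amount to add
    System.out.println("impressions this hour so far: " + total);
  }
}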
