Hey Guys,

Before I get to my thoughts on the rowkey design, here's some background info about the problem we are trying to tackle.

We are producing about 60TB of data a year (uncompressed). Most of this data is collected continuously from various detectors around our facility. The vast majority of it is numerical (either scalar or 1-dimensional arrays). Detectors are either polled at regular intervals or they send their measurements asynchronously. We currently store all this information in regular compressed files, with headers (metadata) kept in a relational database. Managing (moving, archiving) this amount of data is slowly becoming a nuisance, so we are investigating other solutions, including distributed databases like HBase.

We have a number of requirements based on our users' access patterns - the most important one being (ridiculously) fast sequential, time-sorted access to selected metric(s). We also want to decimate the data for live viewing (visually compressing billions of time series points into a manageable size without losing the "shape" of the original dataset). We are already doing this in our custom middleware server, but it looks like a problem that could be tackled by MapReduce.
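To make the decimation idea concrete, here is a minimal sketch of one common approach: keep the min and max of each bucket of points so spikes survive the reduction. The function name and bucketing strategy are illustrative assumptions, not our actual middleware code:

```python
def decimate(points, target_buckets):
    """Reduce a time-sorted list of (timestamp, value) pairs.

    Keeps the minimum and maximum point of each bucket, so the
    visual "shape" (peaks and troughs) of the series is preserved
    while the point count shrinks to at most 2 * target_buckets.
    """
    if len(points) <= 2 * target_buckets:
        return list(points)
    out = []
    bucket_size = len(points) / target_buckets
    for b in range(target_buckets):
        start = int(b * bucket_size)
        end = int((b + 1) * bucket_size)
        chunk = points[start:end]
        lo = min(chunk, key=lambda p: p[1])
        hi = max(chunk, key=lambda p: p[1])
        # Emit the extremes in time order so the output stays sorted.
        out.extend(sorted({lo, hi}, key=lambda p: p[0]))
    return out
```

Because the buckets are contiguous slices of an already time-sorted input, the output is globally time-sorted as well - a property a MapReduce version would also want to preserve per metric.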

Now, back to the subject matter. I examined both OpenTSDB (http://opentsdb.net/schema.html) and HBaseWD (http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/), and although their general ideas look right, neither one is a perfect fit for us. The former correctly distributes writes across the entire cluster, but sequential, time-ordered reads for the same metric end up being localized to a relatively small number of regions (with enough data collected over a long period of time, the reads will hit just one, maybe two regions). The latter also distributes writes across the entire cluster, but reads require BUCKET_COUNT (BC) scans, and the results are almost guaranteed to be out of order across buckets (they are in the correct relative order within each bucket).

I was thinking about a rowkey design that takes another time dimension into consideration (the timestamp, or some form of it, already follows the metric name itself) - for example the hour of the day. This value, ranging from 1 to 24, would be prefixed to each rowkey (i.e. 7-metric.name-1349585333). On its own this is obviously a terrible design, because a handful of regions would be overloaded based on the hour of the day. However, we can use the metric name's hash modded by the bucket count (24 in this case) to come up with a new starting prefix base for each metric. We then add the real hour of day to the base and subtract BC if the value exceeds BC. This way all writes are still distributed evenly across the system ... and so are the reads, assuming we're reading more than one hour's worth of data, which is almost always true in our case. We still have to do BC scans when reading 24+ hours of data, but the data within and across the buckets is always correctly time-sorted. We can also limit the scan count based on the selected time range (i.e. if someone asks for data for a given metric between 7am and 10am, we'll only have to do 3 scans for those three full hours).
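To sanity-check my own reasoning, here's a small sketch of the scheme in Python. The function names and key encoding (one prefix byte, then metric name, then big-endian timestamp) are my illustrative assumptions, not an existing API; the "add the hour, wrap at BC" step is just modular arithmetic:

```python
import hashlib
import struct

BUCKET_COUNT = 24  # one bucket per hour of the day

def metric_base(metric):
    # Stable hash of the metric name, modded by the bucket count.
    # (hashlib rather than built-in hash(), so the base is the same
    # across processes and restarts.)
    digest = hashlib.md5(metric.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % BUCKET_COUNT

def make_rowkey(metric, ts):
    # Rotate the metric's base prefix by the hour of day, wrapping at BC
    # ("add the real hour, subtract BC if it overflows" == mod BC).
    hour = (ts // 3600) % 24
    prefix = (metric_base(metric) + hour) % BUCKET_COUNT
    return struct.pack("B", prefix) + metric.encode("utf-8") + b"-" + struct.pack(">q", ts)

def scan_prefixes(metric, start_ts, end_ts):
    """Ordered bucket prefixes a read of [start_ts, end_ts) must scan.

    The scan count is bounded by the number of whole hours in the
    range, and never exceeds BUCKET_COUNT.
    """
    first_hour = start_ts // 3600
    last_hour = (end_ts - 1) // 3600
    n = min(last_hour - first_hour + 1, BUCKET_COUNT)
    base = metric_base(metric)
    return [(base + (first_hour + h) % 24) % BUCKET_COUNT for h in range(n)]
```

For the 7am-to-10am example, scan_prefixes returns exactly 3 prefixes (one per full hour), and successive hours of the same metric land on successive buckets, which is what spreads both writes and multi-hour reads across the cluster.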

I'm a complete newb when it comes to distributed databases, so if I'm way off on this please set me straight.

Bartek
