Hey Guys,
Before I get to my thoughts on the rowkey design, here's some background
info about the problem we are trying to tackle.
We are producing about 60TB of data a year (uncompressed). Most of this
data is collected continuously from various detectors around our
facility. The vast majority of it is numerical (either scalar or a
1-dimensional array). Detectors are either polled at regular intervals
or send their measurements asynchronously. We are
currently using regular compressed files to store all this information
with headers (metadata) stored in a relational database. Managing
(moving, archiving) this amount of data is slowly becoming a nuisance,
so we are investigating other solutions, including distributed
databases like HBase.
We have a number of requirements based on our users' access patterns -
the most important one being (ridiculously) fast sequential, time-sorted
access for selected metric(s). We also want to decimate the data for
live viewing (visually compressing billions of time-series points into a
manageable size without losing the "shape" of the original dataset).
We are already doing this in our custom middleware server,
but this looks like a problem that could be tackled by MapReduce.
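Just to illustrate what I mean by decimation, here is a simple
per-bucket min/max pass (made-up example code, not our actual
middleware - the class and method names are placeholders):

import java.util.ArrayList;
import java.util.List;

public class Decimator {
    // Collapse a long series into at most 2 points per bucket by
    // keeping each bucket's min and max, which preserves spikes
    // (the "shape"). points[i] = { timestamp, value }, time-sorted.
    static double[][] decimate(double[][] points, int buckets) {
        List<double[]> out = new ArrayList<>();
        int perBucket = Math.max(1, points.length / buckets);
        for (int start = 0; start < points.length; start += perBucket) {
            int end = Math.min(start + perBucket, points.length);
            double[] min = points[start], max = points[start];
            for (int i = start; i < end; i++) {
                if (points[i][1] < min[1]) min = points[i];
                if (points[i][1] > max[1]) max = points[i];
            }
            // Emit the bucket's extremes in time order.
            if (min == max) {
                out.add(min);
            } else if (min[0] < max[0]) {
                out.add(min); out.add(max);
            } else {
                out.add(max); out.add(min);
            }
        }
        return out.toArray(new double[0][]);
    }
}

Each bucket is independent of the others, which is why this feels like
a natural fit for MapReduce.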
Now, back to the subject matter. I examined both
OpenTSDB(http://opentsdb.net/schema.html) and HBaseWD
(http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/)
solutions and although their general ideas look right, neither one of
them looks like a perfect fit for us. The former correctly distributes
writes across the entire cluster, but the sequential, time ordered reads
for the same metric end up being localized to a relatively small number
of regions (with enough collected data over a long period of time the
reads will hit just one, maybe two regions). The latter also distributes
the writes across the entire cluster, but the reads require BUCKET_COUNT
(BC) number of scans and they are almost guaranteed to be out of order
across multiple buckets (they are in the correct relative order within
each bucket).
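To make the comparison concrete, here is roughly how I understand the
two key layouts (heavily simplified - real OpenTSDB keys also carry tag
ids, and HBaseWD does its prefixing through a pluggable distributor, so
treat this as a sketch of the idea, not their actual code):

import java.nio.ByteBuffer;
import java.util.Arrays;

public class KeyLayouts {
    static final int BUCKET_COUNT = 24; // illustrative

    // OpenTSDB-style: metric id first, then a coarse base timestamp,
    // so one metric's rows sort contiguously by time (great for
    // ordered scans, but a long time range for one metric lives in
    // only a few regions).
    static byte[] opentsdbStyleKey(int metricId, int baseTimestamp) {
        return ByteBuffer.allocate(8)
                .putInt(metricId).putInt(baseTimestamp).array();
    }

    // HBaseWD-style: prefix the original key with hash % BUCKET_COUNT.
    // Writes spread across buckets, but reading one metric requires
    // BUCKET_COUNT scans whose results interleave out of order.
    static byte[] hbasewdStyleKey(byte[] originalKey) {
        int bucket = Math.floorMod(Arrays.hashCode(originalKey), BUCKET_COUNT);
        ByteBuffer buf = ByteBuffer.allocate(1 + originalKey.length);
        buf.put((byte) bucket).put(originalKey);
        return buf.array();
    }
}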
I was thinking about a rowkey design that takes another, coarser time
dimension into consideration in addition to the timestamp that follows
the metric name - for example, the hour of the day. This value, ranging
from 0 to 23, would be prefixed to each rowkey (e.g.
7-metric.name-1349585333). On its own this is obviously a terrible
design, because a handful of regions would be overloaded depending on
the hour of the day. However, we can use the metric name's hash modded
with the bucket count (BC, 24 in this case) as a per-metric starting
prefix base, then add the real hour of the day to that base and
subtract BC if the result exceeds BC - in other words,
prefix = (hash(metric) + hour) mod BC. This way all writes are still
distributed evenly across the system ... and so are the reads, assuming
we are reading more than one hour's worth of data, which is almost
always true in our case. We still have to do BC scans if reading 24+
hours of data, but the data within and across the buckets is always
correctly time-sorted. We can also limit the scan count based on the
selected time range (e.g. if someone asks for data for a given metric
between 7am and 10am, we only have to do 3 scans covering those three
full hours).
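Here is a minimal sketch of what I have in mind (the salting formula
is the only real content - the key format, zero-padding widths, and
helper names are all made up for illustration):

import java.util.ArrayList;
import java.util.List;

public class SaltedHourKeys {
    static final int BC = 24; // bucket count = hours per day

    // Per-metric base, rotated by the hour of day and wrapped back
    // into [0, BC). Different metrics start at different bases, so at
    // any given hour the writes are spread across all buckets.
    static int bucket(String metric, long epochSeconds) {
        int base = Math.floorMod(metric.hashCode(), BC);
        int hour = (int) ((epochSeconds / 3600) % 24);
        return (base + hour) % BC;
    }

    // e.g. "07-metric.name-1349585333" (zero-padded so keys sort
    // correctly as strings)
    static String rowkey(String metric, long epochSeconds) {
        return String.format("%02d-%s-%010d",
                bucket(metric, epochSeconds), metric, epochSeconds);
    }

    // One (startKey, stopKeyExclusive) pair per hour touched by
    // [from, to): e.g. 7am-10am -> 3 scans. For ranges past 24 hours
    // this would be collapsed into at most BC scans, one per bucket.
    static List<String[]> scanRanges(String metric, long from, long to) {
        List<String[]> ranges = new ArrayList<>();
        for (long hourStart = (from / 3600) * 3600; hourStart < to;
                hourStart += 3600) {
            int b = bucket(metric, hourStart);
            long lo = Math.max(hourStart, from);
            long hi = Math.min(hourStart + 3600, to);
            ranges.add(new String[] {
                String.format("%02d-%s-%010d", b, metric, lo),
                String.format("%02d-%s-%010d", b, metric, hi)
            });
        }
        return ranges;
    }
}

The stop key reuses the starting hour's bucket prefix with the next
hour's timestamp, so each scan stays inside a single bucket and stops
just past that bucket's last row for the hour.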
I'm a complete newb when it comes to distributed databases, so if I'm
way off on this, please set me straight.
Bartek