Alex,
This might be an interesting use of the time dimension in HBase. Every value in
HBase is uniquely identified by a set of coordinates:
- table
- row key
- column family
- column qualifier
- timestamp
So you can have two different values that share every coordinate except the
timestamp. For your example, that could be:
- table: econ
- row key: "indicatorABC"
- column family: cf1
- column qualifier: "reporting_2011-10-01"
first value:
- timestamp: "2011-11-01 00:00:00.000"
- value: 2
second value:
- timestamp: "2011-12-01 00:00:00.000"
- value: 2.5
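HBase stores timestamps as long values in milliseconds since the epoch, so the dates above would be converted before writing. A minimal sketch of the coordinate model in plain Python (not the HBase client API; the `cells` dict and the `to_millis` and `put` helpers are made up for illustration, and midnight UTC is an assumption):

```python
from datetime import datetime, timezone

def to_millis(date_str):
    """Convert 'YYYY-MM-DD' (taken as midnight UTC) to epoch milliseconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# (table, row key, column family, column qualifier) -> {timestamp_ms: value}
cells = {}

def put(table, row, family, qualifier, release_date, value):
    """Store a value under its coordinates, versioned by release date."""
    coords = (table, row, family, qualifier)
    cells.setdefault(coords, {})[to_millis(release_date)] = value

# Same coordinates, two timestamps: the first release and its revision.
put("econ", "indicatorABC", "cf1", "reporting_2011-10-01", "2011-11-01", 2)
put("econ", "indicatorABC", "cf1", "reporting_2011-10-01", "2011-12-01", 2.5)
```

Both writes land under one key and differ only in timestamp, which is the property the rest of this scheme relies on.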
I.e., if you load the data so that the timestamp on each value is its release date, then
you can model this in a natural way. By default, reads in HBase only give you the latest
value, but you can tell a get or a scan to "time travel" by restricting it to a time range,
so that it only reports values as of an older date; you could say "tell me what the data
would have said on Nov 1, 2011".
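The "time travel" read can be sketched the same way. This is a plain-Python model, not HBase code; `millis` and `read` are hypothetical names. In the Java client this corresponds, as far as I know, to setting a time range on a Get or Scan (`setTimeRange`), whose upper bound is exclusive, so the sketch uses an exclusive bound too:

```python
from datetime import datetime, timezone

def millis(year, month, day):
    """Epoch milliseconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)

# Versions of the single cell "reporting_2011-10-01": release timestamp -> value
versions = {millis(2011, 11, 1): 2, millis(2011, 12, 1): 2.5}

def read(versions, max_stamp=None):
    """Newest value with timestamp < max_stamp; newest overall if max_stamp is None."""
    eligible = [ts for ts in versions if max_stamp is None or ts < max_stamp]
    if not eligible:
        return None
    return versions[max(eligible)]

print(read(versions))                       # 2.5  (default: latest version)
print(read(versions, millis(2011, 11, 2)))  # 2    (what was known on 2011-11-01)
print(read(versions, millis(2011, 11, 1)))  # None (nothing released yet)
```

Because the bound is exclusive, asking for the state "on Nov 1" means passing a bound just after the Nov 1 release, here midnight on Nov 2.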
(Also, by default, HBase only keeps a limited number of versions of each value
(3), but you can configure the column family to keep as many as you need.)
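To tie this back to the two series in the question: with every version retained, the ex-post series [3] is the newest version of each cell and the known-to-market series [2] is the oldest. A small sketch in plain Python, where the dict layout and function names are made up for illustration, ISO date strings stand in for HBase's millisecond timestamps, and the sample is limited to the reporting months whose first release appears in the question:

```python
# column qualifier -> {release date: value}
# ISO date strings sort chronologically, so min/max pick oldest/newest.
cells = {
    "reporting_2011-11-01": {"2011-12-01": 1},
    "reporting_2011-10-01": {"2011-11-01": 2, "2011-12-01": 2.5},
    "reporting_2011-09-01": {"2011-10-01": 3, "2011-11-01": 3.5},
    "reporting_2011-08-01": {"2011-09-01": 4, "2011-10-01": 4.5},
}

def ex_post(cells):
    """Latest version of every cell: what a default read returns."""
    return {q: versions[max(versions)] for q, versions in cells.items()}

def known_to_market(cells):
    """Oldest version of every cell: the value as first released."""
    return {q: versions[min(versions)] for q, versions in cells.items()}

for qualifier in sorted(cells, reverse=True):
    print(qualifier, known_to_market(cells)[qualifier], ex_post(cells)[qualifier])
```

Getting the oldest version out of HBase itself means asking the read for all versions rather than just the latest, which is why the retention setting matters here.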
There are some downsides to using the time dimension explicitly like this:
- If you backdate writes and also work with deletes, you could get some weird
behavior depending on when compaction runs: a delete marker hides every value
with an older timestamp until a major compaction clears it out.
- If you have lots of versions of things, the server still has to read over
these when you scan, which makes things slower. (Probably doesn't apply if you
only have a couple historical versions of any given value.)
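The delete/backdating interaction in the first point can be illustrated with a toy model. This is a deliberately simplified sketch in plain Python, not how HBase is implemented; the `Cell` class and `scenario` function are hypothetical. It assumes only that a delete marker masks puts with timestamps at or below it and that a major compaction drops both the marker and the masked values:

```python
class Cell:
    """Toy model of one cell: versioned puts plus delete markers."""

    def __init__(self):
        self.puts = {}        # timestamp -> value
        self.tombstones = []  # delete markers; each masks puts with ts <= marker

    def put(self, ts, value):
        self.puts[ts] = value

    def delete(self, ts):
        self.tombstones.append(ts)

    def read(self):
        """Newest value not masked by a delete marker, or None."""
        cutoff = max(self.tombstones, default=None)
        visible = [ts for ts in self.puts if cutoff is None or ts > cutoff]
        return self.puts[max(visible)] if visible else None

    def major_compact(self):
        """Physically drop masked puts, then drop the markers themselves."""
        cutoff = max(self.tombstones, default=None)
        if cutoff is not None:
            self.puts = {ts: v for ts, v in self.puts.items() if ts > cutoff}
        self.tombstones = []

def scenario(compact_before_put):
    cell = Cell()
    cell.put(200, "old")
    cell.delete(300)               # delete issued "now"
    if compact_before_put:
        cell.major_compact()
    cell.put(100, "backdated")     # backdated write arriving after the delete
    return cell.read()

print(scenario(compact_before_put=False))  # None: the marker still hides the put
print(scenario(compact_before_put=True))   # backdated: the marker is already gone
```

The client issues the same operations in both runs; only the timing of the compaction differs, and so does the result.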
All the usual caveats apply: don't bother with HBase unless you've got some
serious size in your data (e.g. terabytes) and need to support a heavy load of
real-time updates and queries. Otherwise, go with something simpler to operate,
like a relational database, CouchDB, etc.
Ian
On Feb 6, 2013, at 2:24 PM, Alex Grund wrote:
Hi,
I am a newbie to NoSQL databases and I am wondering how to model a
specific case with HBase.
What I want to model is economic time series, such as the
unemployment rate in a given country.
The complicated thing is this: values of an economic time series can,
but do not have to, be revised.
An example:
Imagine the time series is published monthly, on the first day of a
month, with the value for the previous month, like this:
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
(where "release" is the date of release and "reporting" is the date of
the month the "value" refers to. Read: "On Dec 1, 2011, the
unemployment rate for Nov 2011 was reported to be 1".)
Now imagine that on every release, the value for the previous month
is revised, like this:
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5
Read: On Oct 1, 2011, the unemployment rate was reported to be "3"
for Sep, and the revised value for Aug was reported to be "4.5".
The most recent observation (release) ex-post is: [1]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
Since the data is not revised further than one month back, the whole
series ex-post would look like this: [3]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5
Whereas the "known-to-market" series would look like this: [2]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Those are the series I want to get from the db.
How would you model this with HBase? Is HBase suitable for that
application? Or, more generally, a column-oriented DB?
Or is a relational approach a better fit?
Thanks!