Re: How would you model this in Hbase?

James Taylor Wed, 06 Feb 2013 14:01:44 -0800

Another approach would be to use Phoenix (http://github.com/forcedotcom/phoenix). You can model your schema as you would in the relational world, but you get the horizontal scalability of HBase.


    James


On 02/06/2013 01:49 PM, Michael Segel wrote:

Overloading the time stamp aka the versions of the cell is really not a good 
idea.

Yeah, I know opinions are like A.... everyone has one. ;-)

But you have to be aware that if someone decides to delete some data... well 
one tombstone marker for the column, goodbye all of the versions of the cell.
(Any ideas on a clean easy way to remove that tombstone?  ;-)
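
To make the tombstone point concrete, here is a minimal sketch in plain Python. It is a toy model, not the HBase client API: `versions`, `delete_column` and `visible_versions` are hypothetical names, and it assumes a column delete marker masks every version whose timestamp is at or before the marker's.

```python
# One column's versions: timestamp (epoch ms) -> value.
versions = {
    1320105600000: 2,    # 2011-11-01 00:00 UTC
    1322697600000: 2.5,  # 2011-12-01 00:00 UTC
}
tombstone = None  # timestamp of the column delete marker, if any


def delete_column(timestamp):
    """Place a column tombstone; it masks every version at or before `timestamp`."""
    global tombstone
    tombstone = timestamp if tombstone is None else max(tombstone, timestamp)


def visible_versions():
    """Return the versions a reader would still see."""
    return {t: v for t, v in versions.items() if tombstone is None or t > tombstone}


before = visible_versions()      # both versions are readable
delete_column(1360188104000)     # a delete stamped "now" (Feb 2013)
after = visible_versions()       # nothing is readable any more
```

A single delete stamped with the current time hides both historical values at once, which is the risk when the versions are the data model.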

You're better off using other methods of adding dimension to your cell. One 
that works well is using Avro.

When I teach a course on HBase, I do talk about cells in the schema design 
section of the course. I talk about the ability to use versioning as a way 
to add dimension, and then tell the students that this really isn't a good idea 
...

-Just saying...

On Feb 6, 2013, at 3:05 PM, Ian Varley wrote:

Alex,

This might be an interesting use of the time dimension in HBase. Every value in 
HBase is uniquely represented by a set of coordinates:

- table
- row key
- column family
- column qualifier
- timestamp

So, you can have two different values that have all the same coordinates, 
except their timestamp. So for your example, that could be:

- table: econ
- row key: "indicatorABC"
- column family: cf1
- column qualifier: "reporting_2011-10-01"

first value:
- timestamp: "2011-11-01 00:00:00.000"
- value: 2

second value:
- timestamp: "2011-12-01 00:00:00.000"
- value: 2.5

I.e., if you load the data such that the timestamps on the values represent the release date, then 
you can model this in a natural way. By default, reads in HBase will only give you the latest 
value, but you can manually tell a scanner to give you "time travel" by only reporting 
values as of an older date; so you could say "tell me what the data would have said on 
11/01".
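
A minimal sketch of that idea in plain Python, using the example coordinates above. This is a toy in-memory model rather than the HBase client API; `table`, `put`, `get` and `ts` are hypothetical helpers, and timestamps are assumed to be epoch milliseconds in UTC.

```python
from datetime import datetime, timezone


def ts(text):
    """Convert 'YYYY-MM-DD' into epoch milliseconds (UTC)."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


# (row key, column family, column qualifier) -> {timestamp_ms: value}
table = {}


def put(row, family, qualifier, timestamp, value):
    """Store a value under the given coordinates and timestamp."""
    table.setdefault((row, family, qualifier), {})[timestamp] = value


def get(row, family, qualifier, as_of=None):
    """Return the newest value, or the newest one with timestamp <= as_of."""
    versions = table.get((row, family, qualifier), {})
    visible = [t for t in versions if as_of is None or t <= as_of]
    return versions[max(visible)] if visible else None


coords = ("indicatorABC", "cf1", "reporting_2011-10-01")
put(*coords, ts("2011-11-01"), 2)     # first value, released Nov 1
put(*coords, ts("2011-12-01"), 2.5)   # revised value, released Dec 1

latest = get(*coords)                              # default read: newest version
as_of_nov = get(*coords, as_of=ts("2011-11-01"))   # "time travel" to Nov 1
```

Here `latest` is the revised value and `as_of_nov` is the value as first released; a read as of a date before the first release returns nothing.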

(Also, by default, HBase only keeps a limited number of historical versions 
(3), but you can tell it to keep all versions.)

There are some downsides to using the time dimension explicitly like this:
- If you back date things and also work with deletes, you could get some weird 
behavior depending on when compaction runs.
- If you have lots of versions of things, the server still has to read over 
these when you scan, which makes things slower. (Probably doesn't apply if you 
only have a couple historical versions of any given value.)

All the usual caveats apply: don't bother with HBase unless you've got some 
serious size in your data (e.g. TB) and need to support a heavy load of 
real-time updates and queries. Otherwise, go with something simpler to operate 
like a relational database, couchdb, etc.

Ian

On Feb 6, 2013, at 2:24 PM, Alex Grund wrote:

Hi,

I am a newbie in nosql-databases and I am wondering how to model a
specific case with Hbase.

What I want to model are economic time series, such as the
unemployment rate in a given country.

The complicated thing is this: values of an economic time series can,
but do not have to, be revised.

An example:

Imagine the time series is published monthly, on the first day of each
month, with the value for the previous month, like this:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

(where "release" is the date of release and "reporting" is the date of
the month the "value" refers to. Read: "On Dec 1, 2011, the
unemployment rate for Nov 2011 was reported to be 1".)

Now imagine that every release also revises the value published in the
previous release, like this:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Read: On Oct 1, 2011, the unemployment rate was reported to be "3"
for Sep, and the revised value for Aug was reported to be "4.5".

The most recent observation (release) ex-post is:  [1]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Since the data is not revised more than one month back, the whole
series ex-post would look like this: [3]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Whereas the "known-to-market" series would look like this: [2]

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

Those are the series I want to get from the db.
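
Independent of the storage engine, the three series can be stated as small functions over (release, reporting, value) rows. The following is a minimal Python sketch under two assumptions taken from the example: dates are "YYYY/MM/DD" strings, and the first value for a month is always released exactly one month later. The function names are hypothetical.

```python
# (release, reporting, value) rows copied from the revised example above.
observations = [
    ("2011/12/01", "2011/11/01", 1),
    ("2011/12/01", "2011/10/01", 2.5),
    ("2011/11/01", "2011/10/01", 2),
    ("2011/11/01", "2011/09/01", 3.5),
    ("2011/10/01", "2011/09/01", 3),
    ("2011/10/01", "2011/08/01", 4.5),
    ("2011/09/01", "2011/08/01", 4),
    ("2011/09/01", "2011/07/01", 5.5),
]


def month_index(date):
    """Map 'YYYY/MM/DD' to a running month count so months can be subtracted."""
    year, month, _ = date.split("/")
    return int(year) * 12 + int(month)


def most_recent_release(rows):
    """[1] Every row published in the newest release."""
    newest = max(release for release, _, _ in rows)
    return [row for row in rows if row[0] == newest]


def ex_post(rows):
    """[3] For each reporting month, the row from its latest release."""
    latest = {}
    for release, reporting, value in rows:
        if reporting not in latest or release > latest[reporting][0]:
            latest[reporting] = (release, reporting, value)
    return sorted(latest.values(), key=lambda r: r[1], reverse=True)


def known_to_market(rows):
    """[2] First releases only: reporting month is one month before release."""
    first = [r for r in rows if month_index(r[0]) - month_index(r[1]) == 1]
    return sorted(first, key=lambda r: r[1], reverse=True)


series_1 = most_recent_release(observations)
series_2 = known_to_market(observations)
series_3 = ex_post(observations)
```

Whatever schema is chosen, it needs to keep both the release date and the reporting date for every value, since each of the three series selects on a different combination of the two.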


How would you model this with Hbase? Is Hbase suitable for that
application? Or in general, a column oriented DB?

Or is a relational approach a better fit?


Thanks!

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
