Hi,
When I first started learning about HBase, I compared the way new values
are set to the way a tool like Subversion works: when you set a new value
you don't overwrite the old one, you simply create a new version.
Just like in Subversion, you can then retrieve the old value at a later
moment and, that way, see the situation as it was at an earlier date.
(The only real variation to the SVN model is that HBase only retains the
last N versions of a cell.)
There is however one situation where this comparison really fails: When you
do a delete on a cell.
If you want to retrieve an earlier state of something from Subversion, and
that thing has been deleted in the current version, you can still get it
back.
With HBase, however, deleting a cell places a tombstone at a specific
timestamp, and internally the older values are still present.
But when you try to retrieve such an older value, you still get an empty
result back (i.e. no such cell).
The direct consequence of the currently implemented model is that an
application can never retrieve the correct state of a row at an older
timestamp once a delete with a later timestamp has occurred on any of its
cells.
Example:
I create a table with one row:
> create 't1', 'cf'
> put 't1', 'rowid', 'cf:1', 'One', 1000
> put 't1', 'rowid', 'cf:2', 'Two', 2000
> put 't1', 'rowid', 'cf:3', 'Three', 3000
> get 't1', 'rowid' , {TIMERANGE => [0,3500]}
COLUMN CELL
cf:1 timestamp=1000, value=One
cf:2 timestamp=2000, value=Two
cf:3 timestamp=3000, value=Three
3 row(s) in 0.0150 seconds
Then I delete a cell at a later timestamp:
> delete 't1', 'rowid', 'cf:1', 4000
Now, if I retrieve the row as of time 3500, I would expect to still see the
same values as above.
This is however the reality:
> get 't1', 'rowid' , {TIMERANGE => [0,3500]}
COLUMN CELL
cf:2 timestamp=2000, value=Two
cf:3 timestamp=3000, value=Three
2 row(s) in 0.0120 seconds
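To make the difference concrete, the behaviour above can be mimicked with a
small toy model in Python. This is only a sketch of the semantics as observed
in the shell session, not HBase code: the class ToyCellStore, its methods and
the time_travel flag are hypothetical names made up for this illustration. It
assumes that a tombstone at time T masks every version of that column with a
timestamp <= T, and that a time range is [min, max) with max exclusive.

```python
# Toy model of the observed semantics; NOT how HBase is implemented.
# ToyCellStore and the time_travel flag are hypothetical, for illustration only.
class ToyCellStore:
    def __init__(self):
        self.puts = {}        # column -> list of (timestamp, value)
        self.tombstones = {}  # column -> list of delete timestamps

    def put(self, column, value, ts):
        self.puts.setdefault(column, []).append((ts, value))

    def delete(self, column, ts):
        # Nothing is removed; a marker is recorded at timestamp ts.
        self.tombstones.setdefault(column, []).append(ts)

    def get(self, min_ts, max_ts, time_travel=False):
        """Newest value per column with min_ts <= timestamp < max_ts."""
        result = {}
        for column, versions in self.puts.items():
            markers = self.tombstones.get(column, [])
            if time_travel:
                # Subversion-like view: ignore tombstones placed at or after
                # the upper bound of the requested time range.
                markers = [m for m in markers if m < max_ts]
            visible = [
                (ts, value)
                for ts, value in versions
                if min_ts <= ts < max_ts
                # Assumption: a tombstone masks all versions at or before it.
                and not any(ts <= m for m in markers)
            ]
            if visible:
                result[column] = max(visible, key=lambda pair: pair[0])[1]
        return result


store = ToyCellStore()
store.put('cf:1', 'One', 1000)
store.put('cf:2', 'Two', 2000)
store.put('cf:3', 'Three', 3000)
store.delete('cf:1', 4000)

# What the shell session shows: cf:1 is hidden even though the
# tombstone (4000) lies outside the requested range [0, 3500).
print(store.get(0, 3500))                    # {'cf:2': 'Two', 'cf:3': 'Three'}

# What I expected: the row as it looked at time 3500.
print(store.get(0, 3500, time_travel=True))  # {'cf:1': 'One', 'cf:2': 'Two', 'cf:3': 'Three'}
```

The only difference between the two reads is whether tombstones newer than
the requested time range are allowed to hide older values.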
Why has it been designed/implemented like this?
What is the logic behind this model?
--
Best regards / Met vriendelijke groeten,
Niels Basjes