The ID can be made fixed length, so that the scan would work.
On 08/30/2013 04:58 PM, Ted Yu wrote:
Is your ID fixed length or variable length ?
If the length is fixed, you can specify ID/0 as the start row in scan.
Cheers
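A minimal sketch of the scan Ted suggests, assuming a row key made of a fixed-length ID followed by an 8-byte big-endian timestamp and using the current (2.x) HBase Java client; the table name and ID value are made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IdPrefixScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {   // hypothetical table name
      byte[] id = Bytes.toBytes("ID-0001");                            // fixed-length ID
      // Assumed row key layout: <fixed-length ID><8-byte big-endian timestamp>
      byte[] start = Bytes.add(id, Bytes.toBytes(0L));                 // "ID/0" as the start row
      byte[] stop  = Bytes.add(id, Bytes.toBytes(Long.MAX_VALUE));     // everything for this ID
      Scan scan = new Scan().withStartRow(start).withStopRow(stop);
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toStringBinary(r.getRow()));
        }
      }
    }
  }
}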
On Fri, Aug 30, 2013 at 5:42 AM, Henning Blohm <[email protected]> wrote:
Was gone for a few days. Sorry for not getting back to this until now. And
thanks for adding to the discussion!
The time used in the timestamp is the "natural" time (in ms resolution) as far as it is known. I.e. in the end it is of course some machine time, but the trigger to choose it is typically some human interaction. So there is a natural time to the events that update a row's data.
If timestamps happen to differ just by 1 ms, as unlikely as that may be,
this would still be valid.
And the timestamp is always set by the client (i.e. the app server) when
performing an HBase put. So it's never the region server time or something
slightly arbitrary.
To recap: The data model (even before mapping to HBase) is essentially
ID -> ( attribute -> ( time -> value ))
(where ID is a composite key consisting of some natural elements and some
surrogate part).
An event is something like "at time t, attribute x of ID attained value
z".
Events may enter the system out of chronological order!
Typical access patterns are:
(R1) "Get me all attributes of ID at time t"
(R2) "Get me a trails of attribute changes between time t0 and t1"
(W1) "Set x=z on ID for time t"
As said, we currently store data almost exactly the way I described the model above (and probably that's why I wrote it down the way I did), using the HBase timestamp to store the time dimension.
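For reference, this is roughly what W1 and R1 look like with that approach, i.e. with the client supplying the HBase cell timestamp (a sketch against the 2.x Java client; table, family, and qualifier names are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampVersioning {
  static final byte[] CF = Bytes.toBytes("a");           // hypothetical column family

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("entities"))) { // hypothetical table name
      byte[] id = Bytes.toBytes("ID-0001");
      long t = 1377861600000L;                           // the event's "natural" time in ms

      // W1: "Set x=z on ID for time t" -- the client supplies the cell timestamp.
      Put put = new Put(id);
      put.addColumn(CF, Bytes.toBytes("x"), t, Bytes.toBytes("z"));
      table.put(put);

      // R1: "Get me all attributes of ID at time t" -- newest version per column at or before t.
      Get get = new Get(id);
      get.readVersions(1);                               // one version per column ...
      get.setTimeRange(0, t + 1);                        // ... chosen among those with ts <= t
      Result result = table.get(get);
      byte[] x = result.getValue(CF, Bytes.toBytes("x"));
      System.out.println(x == null ? "no value" : Bytes.toString(x));
    }
  }
}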
Alternative: Adding the time dimension to the row key
-----------
That would mean: ID/time -> (attribute -> value)
That would imply either keeping copies of all attribute values in every (later) row, or only writing deltas and scanning over rows to collect the attribute values.
Let's assume the latter (for better storage and writing performance).
Wouldn't that mean rebuilding what HBase already does? Isn't there something HBase does more efficiently when performing R1, for example?
I.e.: assume I want to get the latest state of row ID. In that case I would need to scan from ID/0 to ID/<now> (or in reverse) to fish for all attribute values (assuming I don't know all expected attributes beforehand). Is that as efficient as an HBase get with max versions 1 and <now> as the timestamp?
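For comparison, under the ID/time row-key alternative with deltas, reconstructing the latest state would be an application-level merge over a scan, roughly like this (a sketch; the row key layout of fixed-length ID plus 8-byte timestamp and all names are assumptions):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DeltaScanLatestState {
  // Reconstruct "latest state of ID" when each row is ID/time and only deltas are written:
  // scan all rows of the ID and keep the newest value seen per attribute -- i.e. redo in the
  // application the merge HBase would otherwise do internally across cell versions.
  static Map<String, byte[]> latestState(Table table, byte[] id, long now)
      throws java.io.IOException {
    byte[] start = Bytes.add(id, Bytes.toBytes(0L));          // ID/0
    byte[] stop  = Bytes.add(id, Bytes.toBytes(now + 1));     // up to and including ID/<now>
    Scan scan = new Scan().withStartRow(start).withStopRow(stop);
    Map<String, byte[]> state = new HashMap<>();
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {                              // rows arrive in ascending time order
        for (Cell cell : r.rawCells()) {
          // later rows overwrite earlier values for the same attribute
          state.put(Bytes.toString(CellUtil.cloneQualifier(cell)), CellUtil.cloneValue(cell));
        }
      }
    }
    return state;
  }
}

With the built-in timestamps, that per-attribute merge is done server-side by the single get shown earlier.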
Thanks,
Henning
On 08/21/2013 01:11 PM, Michael Segel wrote:
I would have to disagree with Lars on this one...
It's really a bad design.
To your point, your data is temporal in nature. That is to say, time is
an element of your data and it should be part of your schema.
You have to remember that time is relative. When a row is entered into HBase, which time is used in the timestamp? The client(s)? The RS? Unless I am mistaken or the API has changed, you can set any arbitrary long value to be the timestamp for a given row/cell. Like I said, it's relative.
Since your data is temporal, what is the difference if the event happened at TS xxxxxxxx10 or xxxxxxxx11 (the point is that the TS differs by 1 in the least significant bit)? You could be trying to reference the same event.
To Lars' point, if you make time part of your key, you could end up with hot spots. It depends on your key design. If it's the least significant portion of the key, it's less of an issue. (clientX | action | TS) would be an example that sorts the data by client, by action type, then by timestamp. (EPOCH - TS) would put the most current first.
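A sketch of that key layout, using Long.MAX_VALUE as the "EPOCH" upper bound and assumed fixed-width client and action fields:

import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKey {
  // Build a row key of the form (clientX | action | reversed TS) so that rows sort
  // by client, then action type, then newest-first within each (client, action) pair.
  // The fixed widths of the client and action parts are assumptions for this sketch.
  static byte[] rowKey(String client, String action, long eventTimeMs) {
    byte[] clientPart = Bytes.toBytes(String.format("%-16s", client)); // pad to 16 bytes (assumption)
    byte[] actionPart = Bytes.toBytes(String.format("%-8s", action));  // pad to 8 bytes (assumption)
    long reversedTs = Long.MAX_VALUE - eventTimeMs;                    // "EPOCH - TS": newest sorts first
    return Bytes.add(clientPart, actionPart, Bytes.toBytes(reversedTs));
  }

  public static void main(String[] args) {
    byte[] key = rowKey("client-42", "UPDATE", System.currentTimeMillis());
    System.out.println(Bytes.toStringBinary(key));
  }
}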
When you try to take a short cut, it usually will bite you in the ass.
TANSTAAFL applies!
HTH
-Mike
On Aug 11, 2013, at 12:21 AM, lars hofhansl <[email protected]> wrote:
If you want deletes to work correctly you should enable
KEEP_DELETED_CELLS for your column families (I still think that should be
the default anyway).
Otherwise time-range queries will not be correct w.r.t. deleted data (specifically, you cannot get back at deleted data even if you specify a time range before the delete and even if your column family has unlimited versions).
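For reference, enabling KEEP_DELETED_CELLS on a column family looks roughly like this with the current (2.x) admin API (on older clients the corresponding setter lives on HColumnDescriptor); the table and family names are made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeepDeletedCells;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableKeepDeletedCells {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Keep deleted cells so time-range queries before the delete still see the data.
      ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
          .newBuilder(Bytes.toBytes("a"))                 // hypothetical column family name
          .setKeepDeletedCells(KeepDeletedCells.TRUE)
          .setMaxVersions(Integer.MAX_VALUE)              // unlimited versions, as in the thread
          .build();
      admin.modifyColumnFamily(TableName.valueOf("entities"), cf);  // hypothetical table name
    }
  }
}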
Depending on what your typical queries are, you might run into
performance issues. HBase sorts all versions of a KeyValue adjacent to each
other.
If you now want to query only along the latest data (the last version), HBase will have to skip a lot of other versions. In the worst case the latest versions of all KeyValues are on separate (HFile) blocks.
The question of whether to use the built-in timestamps or to model time as part of the row key (or even a time column) is an interesting one. Generally, the row key identifies your row. If you want a new row for each TS in your logical model, you should manage the time dimension yourself. Otherwise, if your logical model has identities (i.e. rows) with many versions, the built-in TS might be better.
-- Lars
________________________________
From: Henning Blohm <[email protected]>
To: user <[email protected]>
Sent: Saturday, August 10, 2013 6:26 AM
Subject: Using HBase timestamps as natural versioning
Hi,
we are managing some naturally time-versioned data in HBase. That is, there are change events that have a specific time set, and when such an event is handled, the data in HBase pertaining to exactly that point in time is updated.
So far we are using HBase timestamps to model the time dimension. All columns have an unlimited number of versions. That has worked ok so far, and HBase's way of providing access to data at a given time or time range seemed a natural fit.
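As an illustration, reading such a trail of changes for one attribute in a time range directly from the cell versions looks roughly like this (a sketch against the 2.x Java client; the method and all names are made up):

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeTrail {
  // Print the trail of changes of one attribute between t0 (inclusive) and t1 (exclusive),
  // relying on the HBase cell timestamps as the time dimension.
  static void printTrail(Table table, byte[] id, byte[] cf, byte[] attr, long t0, long t1)
      throws IOException {
    Get get = new Get(id);
    get.addColumn(cf, attr);
    get.readAllVersions();          // all stored versions of the column ...
    get.setTimeRange(t0, t1);       // ... restricted to the requested time range
    for (Cell cell : table.get(get).rawCells()) {
      System.out.println(cell.getTimestamp() + " -> "
          + Bytes.toString(CellUtil.cloneValue(cell)));
    }
  }
}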
We are aware of some tricky issues around timestamp handling (in particular in conjunction with deletes). As we need to migrate the HBase-stored data shortly (for other reasons), we are wondering whether our approach has some long-term drawbacks that we should pay attention to now, and whether we should possibly redesign our timestamp handling as well.
So my question is:
* Have you had problematic experiences with using HBase timestamps as the time dimension of your data (assuming it has some natural time-based versioning)?
* Is it generally better to model time-based versioning of data within
the data structure itself (e.g. in the row key) and why?
* In case you used HBase timestamps similar to the way we use them,
feedback on how that worked is welcome as well!
Thanks,
Henning
The opinions expressed here are mine, while they may reflect a
cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
--
Henning Blohm
*ZFabrik Software KG*
T: +49 6227 3984255
F: +49 6227 3984254
M: +49 1781891820
Lammstrasse 2 69190 Walldorf
[email protected] <mailto:[email protected]>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>