Is your ID fixed length or variable length? If the length is fixed, you can specify ID/0 as the start row in the scan.
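A minimal sketch of that scan in the HBase Java client, assuming a fixed-length 8-byte ID and a big-endian long time suffix in the row key (both assumptions, not stated in the thread):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical fixed-length (8-byte) ID; the time suffix is a big-endian long.
    byte[] id = Bytes.toBytes("CUST0001");
    byte[] start = Bytes.add(id, Bytes.toBytes(0L));             // ID/0
    byte[] stop = Bytes.add(id, Bytes.toBytes(Long.MAX_VALUE));  // past any ID/<time>
    Scan scan = new Scan(start, stop);                           // stop row is exclusive

Because the ID has a fixed width, every ID/<time> row for that ID sorts between those two boundaries and nothing else does.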
Cheers

On Fri, Aug 30, 2013 at 5:42 AM, Henning Blohm <[email protected]> wrote:

> Was gone for a few days. Sorry for not getting back to this until now. And thanks for adding to the discussion!
>
> The time used in the timestamp is the "natural" time (in ms resolution) as far as known. I.e. in the end it is of course some machine time, but the trigger to choose it is typically some human interaction. So there is a natural time to the events that update a row's data.
> If timestamps happen to differ by just 1 ms, as unlikely as that may be, this would still be valid.
> And the timestamp is always set by the client (i.e. the app server) when performing an HBase put. So it's never the region server time or something similarly arbitrary.
>
> To recap: the data model (even before mapping to HBase) is essentially
>
> ID -> ( attribute -> ( time -> value ))
>
> (where ID is a composite key consisting of some natural elements and some surrogate part).
>
> An event is something like "at time t, attribute x of ID attained value z".
>
> Events may enter the system out of chronological order!
>
> Typical access patterns are:
>
> (R1) "Get me all attributes of ID at time t"
> (R2) "Get me a trail of attribute changes between times t0 and t1"
> (W1) "Set x=z on ID for time t"
>
> As said, currently we store data almost exactly the way I described the model above (and probably that's why I wrote it down the way I did), using the HBase timestamp to store the time dimension.
>
>
> Alternative: Adding the time dimension to the row key
> -----------
>
> That would mean: ID/time -> (attribute -> value)
>
> That would imply either keeping copies of all (later) attribute values in all (later) rows, or putting only deltas and scanning over rows to collect attribute values.
>
> Let's assume the latter (for better storage and write performance).
>
> Wouldn't that mean rebuilding what HBase already does? Is there nothing HBase does more efficiently when performing R1, for example?
>
> I.e.: Assume I want to get the latest state of row ID. In that case I would need to scan from ID/0 to ID/<now> (or in reverse) to fish for all attribute values (assuming I don't know all expected attributes beforehand). Is that as efficient as an HBase get with max versions 1 and <now> as the time stamp?
>
> Thanks,
> Henning
>
>
> On 08/21/2013 01:11 PM, Michael Segel wrote:
>> I would have to disagree with Lars on this one...
>>
>> It's really a bad design.
>>
>> To your point, your data is temporal in nature. That is to say, time is an element of your data and it should be part of your schema.
>>
>> You have to remember that time is relative.
>>
>> When a row is entered into HBase, which time is used in the timestamp? The client's? The RS's? Unless I am mistaken or the API has changed, you can set any arbitrary long value as the timestamp for a given row/cell.
>> Like I said, it's relative.
>>
>> Since your data is temporal, what is the difference if the event happened at TS xxxxxxxx10 or xxxxxxxx11 (the point is that the TS differs by 1 in the least significant bit)? You could be trying to reference the same event.
>>
>> To Lars' point, if you make time part of your key, you could end up with hot spots. It depends on your key design. If it's the least significant portion of the key, it's less of an issue. (clientX | action | TS) would be an example that sorts the data by client, by action type, then by time stamp. (EPOCH - TS) would put the most current first.
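To make the (EPOCH - TS) idea above concrete, a rough sketch of such a composite key. The names and field widths are made up, and variable-length parts would need fixed-width padding or delimiters so the byte order matches the logical order:

    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative composite row key: client | action | reversed TS.
    // Subtracting the timestamp from Long.MAX_VALUE inverts the sort
    // order, so a plain forward scan returns the newest events first.
    // padTo() is a hypothetical helper that right-pads to a fixed width.
    byte[] rowKey = Bytes.add(
        Bytes.toBytes(padTo(client, 16)),   // assumed 16-byte client field
        Bytes.toBytes(padTo(action, 8)),    // assumed 8-byte action field
        Bytes.toBytes(Long.MAX_VALUE - tsMillis));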
>> When you try to take a shortcut, it usually will bite you in the ass.
>>
>> TANSTAAFL applies!
>>
>> HTH
>>
>> -Mike
>>
>> On Aug 11, 2013, at 12:21 AM, lars hofhansl <[email protected]> wrote:
>>
>>> If you want deletes to work correctly you should enable KEEP_DELETED_CELLS for your column families (I still think that should be the default anyway).
>>> Otherwise time-range queries will not be correct w.r.t. deleted data (specifically, you cannot get back at deleted data even if you specify a time range before the delete, and even if your column family has unlimited versions).
>>>
>>> Depending on what your typical queries are, you might run into performance issues. HBase sorts all versions of a KeyValue adjacent to each other.
>>> If you now want to query only along the latest data (the last version), HBase will have to skip a lot of other versions. In the worst case, the latest versions of all KeyValues are on separate (HFile) blocks.
>>>
>>> The question of whether to use the builtin timestamps or to model time as part of the row key (or even a time column) is an interesting one.
>>> Generally the row key identifies your row. If you want a new row for each TS in your logical model, you should manage the time dimension yourself.
>>> Otherwise, if you have identities (i.e. rows) with many versions, the builtin TS might be better.
>>>
>>> -- Lars
>>>
>>> ________________________________
>>> From: Henning Blohm <[email protected]>
>>> To: user <[email protected]>
>>> Sent: Saturday, August 10, 2013 6:26 AM
>>> Subject: Using HBase timestamps as natural versioning
>>>
>>> Hi,
>>>
>>> we are managing some naturally time-versioned data in HBase. That is, there are change events that have a specific time set, and when such an event is handled, data in HBase pertaining to the exact same point in time is updated.
>>>
>>> So far we are using HBase time stamps to model the time dimension. All columns have an unlimited number of versions. That has worked ok so far, and HBase's way of providing access to data at a given time or time range seemed a natural fit.
>>>
>>> We are aware of some tricky issues around timestamp handling (in particular in conjunction with deletes). As we need to migrate the HBase-stored data shortly (for other reasons), we are wondering whether our approach has some long-term drawbacks that we should pay attention to now, and possibly re-design our timestamp handling as well.
>>>
>>> So my questions are:
>>>
>>> * Is there problematic experience with using HBase timestamps as the time dimension of your data (assuming it has some natural time-based versioning)?
>>>
>>> * Is it generally better to model time-based versioning of data within the data structure itself (e.g. in the row key), and why?
>>>
>>> * In case you used HBase timestamps similar to the way we use them, feedback on how that worked is welcome as well!
>>>
>>> Thanks,
>>> Henning
>>>
>> The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com
>
>
> --
> Henning Blohm
>
> *ZFabrik Software KG*
>
> T: +49 6227 3984255
> F: +49 6227 3984254
> M: +49 1781891820
>
> Lammstrasse 2, 69190 Walldorf
>
> [email protected]
>
> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
> ZFabrik <http://www.zfabrik.de>
> Blog <http://www.z2-environment.net/blog>
> Z2-Environment <http://www.z2-environment.eu>
> Z2 Wiki <http://redmine.z2-environment.net>
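For reference, the builtin-timestamp variant of R1 that Henning compares against would look roughly like this. A sketch only; id and t are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;

    // R1 with builtin timestamps: the newest value of each column at or before t.
    Get stateAt(byte[] id, long t) throws IOException {
        Get get = new Get(id);
        get.setTimeRange(0, t + 1);  // upper bound is exclusive, so t itself is included
        get.setMaxVersions(1);       // one (the newest) version per column within the range
        return get;
    }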
