I would have to disagree with Lars on this one... It's really a bad design.
To your point, your data is temporal in nature. That is to say, time is an element of your data and it should be part of your schema.

You have to remember that time is relative. When a row is entered into HBase, which time is used in the timestamp? The client's? The RS's? Unless I am mistaken or the API has changed, you can set any arbitrary long value as the timestamp for a given row/cell. Like I said, it's relative.

Since your data is temporal, what is the difference if the event happened at TS xxxxxxxx10 or xxxxxxxxx11? (The point is that the TS is different by 1 in the least significant bit.) You could be trying to reference the same event.

To Lars's point, if you make time part of your key, you could end up with hot spots. It depends on your key design. If it's the least significant portion of the key, it's less of an issue. (clientX | action | TS) would be an example that would sort the data by client, by action type, then by timestamp. (EPOCH - TS) would put the most current first.

When you try to take a shortcut, it usually will bite you in the ass. TANSTAAFL applies!

HTH

-Mike

On Aug 11, 2013, at 12:21 AM, lars hofhansl <[email protected]> wrote:

> If you want deletes to work correctly you should enable KEEP_DELETED_CELLS
> for your column families (I still think that should be the default anyway).
> Otherwise time-range queries will not be correct w.r.t. deleted data
> (specifically you cannot get back at deleted data even if you specify a time
> range before the delete and even if your column family has unlimited versions).
>
> Depending on what your typical queries are, you might run into performance
> issues. HBase sorts all versions of a KeyValue adjacent to each other.
> If you now want to query only along the latest data (the last version), HBase
> will have to skip a lot of other versions. In the worst case the latest
> versions of all KeyValues are on separate (HFile) blocks.
>
> The question of whether to use the builtin timestamps or model the time as
> part of the row keys (or even a time-column), is an interesting one.
> Generally the row-key identifies your row. If you want a new row for each TS
> in your logical model you should manage the time dimension yourself.
> Otherwise if you have identities (i.e. rows) with many versions, the builtin TS
> might be better.
>
> -- Lars
>
> ________________________________
> From: Henning Blohm <[email protected]>
> To: user <[email protected]>
> Sent: Saturday, August 10, 2013 6:26 AM
> Subject: Using HBase timestamps as natural versioning
>
> Hi,
>
> we are managing some naturally time versioned data in HBase. That is,
> there are change events that have a specific time set and when such
> event is handled, data in HBase, pertaining to the exact same point in
> time, is updated.
>
> So far we are using HBase time stamps to model the time dimension. All
> columns have unlimited number of versions. That worked ok so far, and
> HBase's way of providing access to data at a given time or time range
> seemed a natural fit.
>
> We are aware of some tricky issues around timestamp handling (e.g. in
> particular in conjunction with deletes). As we need to migrate HBase
> stored data (for other reasons) shortly we are wondering, if our
> approach has some long-term drawbacks that we should pay attention to
> now and possibly re-design our timestamp handling as well.
>
> So my question is:
>
> * Is there problematic experience with using HBase timestamps as time
> dimension of your data (assuming it has some natural time-based versioning)?
>
> * Is it generally better to model time-based versioning of data within
> the data structure itself (e.g. in the row key) and why?
>
> * In case you used HBase timestamps similar to the way we use them,
> feedback on how that worked is welcome as well!
>
> Thanks,
> Henning
>

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
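
For anyone who wants to see the (clientX | action | TS) key with the (EPOCH - TS) trick spelled out, here is a minimal sketch in plain Python, with no HBase client involved. Everything in it is an assumption for illustration: the helper names make_row_key and reversed_ts are hypothetical, "EPOCH" is taken to mean the largest signed 64-bit value, the "|" delimiter is assumed never to appear in the client or action strings, and timestamps are assumed to be non-negative milliseconds. The timestamp goes last in the key, as a fixed-width big-endian long, so that byte-wise sorting matches numeric order.

```python
import struct

# Assumption: "EPOCH" in the post is read as the largest signed 64-bit
# value (Java's Long.MAX_VALUE), so that EPOCH - TS shrinks as TS grows.
LONG_MAX = 2**63 - 1


def reversed_ts(ts_millis: int) -> int:
    """Return EPOCH - TS, so newer timestamps become smaller numbers."""
    if not 0 <= ts_millis <= LONG_MAX:
        raise ValueError("timestamp must fit in a non-negative signed 64-bit long")
    return LONG_MAX - ts_millis


def make_row_key(client: str, action: str, ts_millis: int) -> bytes:
    """Build a hypothetical row key: client | action | (EPOCH - TS).

    The timestamp is the least significant part of the key and is packed
    as a fixed-width big-endian long, so lexicographic byte order equals
    numeric order for the non-negative values produced here.
    """
    prefix = client.encode("utf-8") + b"|" + action.encode("utf-8") + b"|"
    return prefix + struct.pack(">q", reversed_ts(ts_millis))


if __name__ == "__main__":
    events = [1000, 3000, 2000]
    keys = sorted(make_row_key("clientX", "login", t) for t in events)
    # Recover the original timestamps in key order: newest event comes first.
    for key in keys:
        (rev,) = struct.unpack(">q", key[-8:])
        print(key[:-8].decode("utf-8"), LONG_MAX - rev)
```

Because the client and action lead the key, a scan over the prefix "clientX|login|" would walk that client's events newest first, while writes from different clients land in different parts of the key space.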
