Is your ID fixed length or variable length? If the length is fixed, you can specify ID/0 as the start row in the scan.
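A minimal sketch of that scan in the HBase Java client, assuming a fixed-length 8-byte ID and a big-endian long time suffix in the row key (both assumptions, not stated in the thread):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical fixed-length (8-byte) ID; the time suffix is a big-endian long.
    byte[] id = Bytes.toBytes("CUST0001");
    byte[] start = Bytes.add(id, Bytes.toBytes(0L));             // ID/0
    byte[] stop = Bytes.add(id, Bytes.toBytes(Long.MAX_VALUE));  // past any ID/<time>
    Scan scan = new Scan(start, stop);                           // stop row is exclusive

Because the ID has a fixed width, every ID/<time> row for that ID sorts between those two boundaries and nothing else does.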
Cheers

On Fri, Aug 30, 2013 at 5:42 AM, Henning Blohm <[email protected]> wrote:

> Was gone for a few days. Sorry for not getting back to this until now. And thanks for adding to the discussion!
>
> The time used in the timestamp is the "natural" time (in ms resolution) as far as known. I.e. in the end it is of course some machine time, but the trigger to choose it is typically some human interaction. So there is a natural time to the events that update a row's data.
> If timestamps happen to differ by just 1 ms, as unlikely as that may be, this would still be valid.
> And the timestamp is always set by the client (i.e. the app server) when performing an HBase put. So it's never the region server time or something similarly arbitrary.
>
> To recap: the data model (even before mapping to HBase) is essentially
>
> ID -> ( attribute -> ( time -> value ))
>
> (where ID is a composite key consisting of some natural elements and some surrogate part).
>
> An event is something like "at time t, attribute x of ID attained value z".
>
> Events may enter the system out of chronological order!
>
> Typical access patterns are:
>
> (R1) "Get me all attributes of ID at time t"
> (R2) "Get me a trail of attribute changes between times t0 and t1"
> (W1) "Set x=z on ID for time t"
>
> As said, currently we store data almost exactly the way I described the model above (and probably that's why I wrote it down the way I did), using the HBase timestamp to store the time dimension.
>
>
> Alternative: Adding the time dimension to the row key
> -----------
>
> That would mean: ID/time -> (attribute -> value)
>
> That would imply either keeping copies of all (later) attribute values in all (later) rows, or putting only deltas and scanning over rows to collect attribute values.
>
> Let's assume the latter (for better storage and write performance).
>
> Wouldn't that mean rebuilding what HBase already does? Is there nothing HBase does more efficiently when performing R1, for example?
>
> I.e.: Assume I want to get the latest state of row ID. In that case I would need to scan from ID/0 to ID/<now> (or in reverse) to fish for all attribute values (assuming I don't know all expected attributes beforehand). Is that as efficient as an HBase get with max versions 1 and <now> as the time stamp?
>
> Thanks,
> Henning
>
>
> On 08/21/2013 01:11 PM, Michael Segel wrote:
>> I would have to disagree with Lars on this one...
>>
>> It's really a bad design.
>>
>> To your point, your data is temporal in nature. That is to say, time is an element of your data and it should be part of your schema.
>>
>> You have to remember that time is relative.
>>
>> When a row is entered into HBase, which time is used in the timestamp? The client's? The RS's? Unless I am mistaken or the API has changed, you can set any arbitrary long value as the timestamp for a given row/cell.
>> Like I said, it's relative.
>>
>> Since your data is temporal, what is the difference if the event happened at TS xxxxxxxx10 or xxxxxxxx11 (the point is that the TS differs by 1 in the least significant bit)? You could be trying to reference the same event.
>>
>> To Lars' point, if you make time part of your key, you could end up with hot spots. It depends on your key design. If it's the least significant portion of the key, it's less of an issue. (clientX | action | TS) would be an example that sorts the data by client, by action type, then by time stamp. (EPOCH - TS) would put the most current first.
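To make the (EPOCH - TS) idea above concrete, a rough sketch of such a composite key. The names and field widths are made up, and variable-length parts would need fixed-width padding or delimiters so the byte order matches the logical order:

    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative composite row key: client | action | reversed TS.
    // Subtracting the timestamp from Long.MAX_VALUE inverts the sort
    // order, so a plain forward scan returns the newest events first.
    // padTo() is a hypothetical helper that right-pads to a fixed width.
    byte[] rowKey = Bytes.add(
        Bytes.toBytes(padTo(client, 16)),   // assumed 16-byte client field
        Bytes.toBytes(padTo(action, 8)),    // assumed 8-byte action field
        Bytes.toBytes(Long.MAX_VALUE - tsMillis));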
>> When you try to take a shortcut, it usually will bite you in the ass.
>>
>> TANSTAAFL applies!
>>
>> HTH
>>
>> -Mike
>>
>> On Aug 11, 2013, at 12:21 AM, lars hofhansl <[email protected]> wrote:
>>
>>> If you want deletes to work correctly you should enable KEEP_DELETED_CELLS for your column families (I still think that should be the default anyway).
>>> Otherwise time-range queries will not be correct w.r.t. deleted data (specifically, you cannot get back at deleted data even if you specify a time range before the delete, and even if your column family has unlimited versions).
>>>
>>> Depending on what your typical queries are, you might run into performance issues. HBase sorts all versions of a KeyValue adjacent to each other.
>>> If you now want to query only along the latest data (the last version), HBase will have to skip a lot of other versions. In the worst case, the latest versions of all KeyValues are on separate (HFile) blocks.
>>>
>>> The question of whether to use the builtin timestamps or to model time as part of the row key (or even a time column) is an interesting one.
>>> Generally the row key identifies your row. If you want a new row for each TS in your logical model, you should manage the time dimension yourself.
>>> Otherwise, if you have identities (i.e. rows) with many versions, the builtin TS might be better.
>>>
>>> -- Lars
>>>
>>> ________________________________
>>> From: Henning Blohm <[email protected]>
>>> To: user <[email protected]>
>>> Sent: Saturday, August 10, 2013 6:26 AM
>>> Subject: Using HBase timestamps as natural versioning
>>>
>>> Hi,
>>>
>>> we are managing some naturally time-versioned data in HBase. That is, there are change events that have a specific time set, and when such an event is handled, data in HBase pertaining to the exact same point in time is updated.
>>>
>>> So far we are using HBase time stamps to model the time dimension. All columns have an unlimited number of versions. That has worked ok so far, and HBase's way of providing access to data at a given time or time range seemed a natural fit.
>>>
>>> We are aware of some tricky issues around timestamp handling (in particular in conjunction with deletes). As we need to migrate the HBase-stored data shortly (for other reasons), we are wondering whether our approach has some long-term drawbacks that we should pay attention to now, and possibly re-design our timestamp handling as well.
>>>
>>> So my questions are:
>>>
>>> * Is there problematic experience with using HBase timestamps as the time dimension of your data (assuming it has some natural time-based versioning)?
>>>
>>> * Is it generally better to model time-based versioning of data within the data structure itself (e.g. in the row key), and why?
>>>
>>> * In case you used HBase timestamps similar to the way we use them, feedback on how that worked is welcome as well!
>>>
>>> Thanks,
>>> Henning
>>>
>> The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com
>
>
> --
> Henning Blohm
>
> *ZFabrik Software KG*
>
> T: +49 6227 3984255
> F: +49 6227 3984254
> M: +49 1781891820
>
> Lammstrasse 2, 69190 Walldorf
>
> [email protected]
>
> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
> ZFabrik <http://www.zfabrik.de>
> Blog <http://www.z2-environment.net/blog>
> Z2-Environment <http://www.z2-environment.eu>
> Z2 Wiki <http://redmine.z2-environment.net>
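For reference, the builtin-timestamp variant of R1 that Henning compares against would look roughly like this. A sketch only; id and t are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;

    // R1 with builtin timestamps: the newest value of each column at or before t.
    Get stateAt(byte[] id, long t) throws IOException {
        Get get = new Get(id);
        get.setTimeRange(0, t + 1);  // upper bound is exclusive, so t itself is included
        get.setMaxVersions(1);       // one (the newest) version per column within the range
        return get;
    }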
