Hi there,

No offence meant, Ian. I might also be thinking in a way that is too trading-oriented.
You definitely want to have those numbers readily available, and not as a version. In retrospect, you will want to know by how much the actuals were off. Or you will want to run a trading strategy against the actuals ... It is the same with any of those macro figures. Revised and initially reported figures are two separate types of information, and there is (almost) always a revised figure. And when doing research, I wouldn't dare start with versioning unless it is absolutely clear that the original value is wrong, void and worthless.

Cheers

P.S. Pardon the double posting an hour ago.

On 07.02.2013 at 14:36, "Ian Varley" <[email protected]> wrote:

Overloading the time stamp aka the versions of the cell is really not a good idea.

I agree in general, guys (and noted the dangers in my original post). I'd note, however, that this may be one of the rare cases where this actually *isn't* overloading the timestamp. If you look at the OP's question, this really is two versions of a single value. The data originally came in as X, then a month later it's revised to Y.

If the majority of queries are going to just ask "what's the latest value", then this will make it easy in HBase, because that's the default behavior. And if you want to do a time travel query, that too is easy (you just set the max date you'd like to use). Doing either of those things with the reporting_month explicitly factored into the model (in the key, say) is harder. (Not impossible, just more complicated.)

In a relational database, you might model this as a simple "UPDATE econ SET value = '2.5' WHERE figure='unemployment' AND month_reporting = '2011-11-01'". But the downside there is you'd lose the old value, and wouldn't be able to time travel. But in HBase you can.

Overloading the timestamp is a terrible idea if you make it mean something other than "the date at which this data was valid". But that's not what's happening here; that's exactly what he's looking for.
Ian

On Feb 7, 2013, at 1:26 AM, Ulrich Staudinger wrote:

On 02/06/2013 01:49 PM, Michael Segel wrote:

Overloading the time stamp aka the versions of the cell is really not a good idea.

Fully agree.

Yeah, I know opinions are like A.... everyone has one. ;-)

Yeah, but some people share one.

But you have to be aware that if someone decides to delete some data... well, one tombstone marker for the column, and goodbye to all of the versions of the cell. (Any ideas on a clean, easy way to remove that tombstone? ;-) You're better off using other methods of adding dimension to your cell. One that works well is using Avro.

All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. TB) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate like a relational database, couchdb, etc.

While this is a valid point for just storing it and working on your own with data, there are reasons why you want to choose a data integration platform (more on this later).

Back to the root discussion. Why don't you simply identify the six different types of information per number:

- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure

The key would be: <figure name>_<month>, et voilà. I strongly advise against "overloading" the timestamping/versioning feature of HBase. You would still have to load the entire series and sort it by what you like, but that's not a problem with HBase.

<snip>
Thinking in ActiveQuant, you would store each of the columns above through its IArchiveWriter. Then you can seamlessly view/chart it in the ActiveQuant Master Server, making it available over CSV and SOAP to your corporate intranet or to Excel through the AQ plugin.
</snip>

--
Ulrich Staudinger
http://www.activequant.org
Connect online: https://www.xing.com/profile/Ulrich_Staudinger
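
For readers comparing the two modelling options debated in this thread, here is a rough sketch in plain Python. It is only an in-memory illustration, not HBase client code: `VersionedCell`, `row_key` and `revision_size` are hypothetical names, timestamps are assumed to be integers of the form YYYYMMDD, and the figures (8.75 revised to 8.5) are made up. The first part mimics the "versions of one cell" idea (latest value by default, time travel by capping the timestamp); the second mimics the explicit layout with a `<figure name>_<month>` key and separate columns for the initial and revised figure.

```python
from bisect import insort
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class VersionedCell:
    """Toy stand-in for one cell that keeps several timestamped versions."""
    versions: List[Tuple[int, float]] = field(default_factory=list)

    def put(self, timestamp: int, value: float) -> None:
        # Keep the (timestamp, value) pairs sorted by timestamp.
        insort(self.versions, (timestamp, value))

    def get(self, max_timestamp: Optional[int] = None) -> Optional[float]:
        # No cap: return the newest version ("what's the latest value").
        # With a cap: return the newest version at or before it ("time travel").
        # The cap is inclusive in this toy model.
        for ts, value in reversed(self.versions):
            if max_timestamp is None or ts <= max_timestamp:
                return value
        return None


def row_key(figure_name: str, month: str) -> str:
    """Explicit model: key is <figure name>_<month>."""
    return f"{figure_name}_{month}"


def revision_size(row: Dict[str, float]) -> float:
    """How far the revised figure moved from the initially reported one."""
    return row["revised_figure"] - row["figure"]


# Option 1: two versions of a single value (made-up numbers).
cell = VersionedCell()
cell.put(20111202, 8.75)   # initial release
cell.put(20120106, 8.5)    # revision a month later
latest = cell.get()
as_of_december = cell.get(max_timestamp=20111231)

# Option 2: initial and revised figure stored side by side under one key.
table = {
    row_key("unemployment", "2011-11"): {
        "release_date": 20111202,
        "figure": 8.75,
        "revision_date": 20120106,
        "revised_figure": 8.5,
    }
}
delta = revision_size(table["unemployment_2011-11"])
```

In the first option the "latest value" query needs no extra arguments, which matches the default behavior described above; in the second, both numbers are always directly readable, which is what you want when the size of the revision is itself the thing being studied.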
