Temporal Database (Almost)

james Sat, 04 Aug 2007 13:43:11 -0700

I have seen several discussions on implementing temporal databases. I am 
working on an application that needs something very similar, but slightly 
different from what I have seen discussed on the list. I would like a 
"repository" database that allows me to provide versioning and allow 
collaboration. I would like to structure it conceptually similarly to SVN or 
CVS. I would like to use some of Cayenne's abilities to accomplish this.




If anyone has some spare cycles to help, I am looking to hire a qualified 
Cayenne developer for some part-time work. Follow this link to job posting 
266326 on Odesk.com. 
http://www.odesk.com/console/j/opening.php?key=8bd7d42fda97bfde&tot=1&pos=0

(By the way, I think Odesk does a nice job providing a good contracting 
environment and have had a good experience so far.)




If you are just curious about what I am thinking, I have included a long-winded 
explanation of my initial ideas. If you are at all interested in this or have 
tried similar things, I would like to know what you may think.

James Jensen


... ... ... ... ...

(Understand that this is a first pass, and is intended as a starting point)


The databases that make up this system should behave similar to a repository 
such as SVN or CVS. There is one "central" repository database, and multiple 
client databases. A client can only connect to one repository, and a reasonable 
mechanism should be in place to prevent normal users from connecting a client 
to a different repository than the one for which it was originally created. 




All modifications by users are made to the client database. There will usually 
be one per laptop/user, although occasionally there could be more than one 
user. Each client application should have a password login process that 
identifies the user, and selects the correct local client db that is used for 
that user. (The application should also have a roll based security system to 
restrict/allow specific activities, like sales, installer, scheduler, admin., 
etc) For each workstation, a user must have a separate user db. I.e. Each 
client is identified by user and workstation id. When a user logs into a new 
workstation, that user must create a new client from the repository, and cannot 
directly access data on a different workstation that has not been synced. 
Information never flows directly from client to client.




Modifications to the repository are made by a client "syncing" with the 
repository. Like SVN, the client must first update any and all repository 
changes back to the local client by merging and choosing all changes. When all 
merges are complete and client has updated to the latest repository version, it 
is allowed to commit its changes to the repository. The commit process should 
be atomic, and the first one wins. I.e. If two commits from different clients 
start at the same time, the first to get the lock, checks repository version, 
does its update, which causes the repository version to advance, the second 
client's commit will fail because of the version mismatch, so the client will 
need to update again and then attempt to commit again.




The repository is intended to be a fairly passive part of the system. It 
consists of a database and some locking mechanism to prevent concurrent access. 
It is intended that the repository be some sort of web service (Like Cayenne 
makes possible). Connections from client to repository being made via web. The 
repository is only really visible through the clients. All application logic is 
managed by the client.




Versioning: 




The repository behavior is enabled by adding a few extra columns to most tables 
in the database, in addition to the ID column. Rows will be fully identified by 
ID, BIRTH_VER, RETIRED_VER, and CLIENT_ID. (Assuming Cayenne is used as the 
ORM, ID will actually be created by some custom code to combine the standard 
auto-increment ID plus the CLIENT_ID added as some upper bits to ensure that ID 
is always unique across the repository without requiring that the CLIENT_ID 
column always be used for all queries.)




The "current" version is a concept that applies to the entire database. Each 
table does not maintain it's own notion of current version. Even and odd 
versions are used to differentiate between synced data and client modified 
data. Even indicates data synced with the repository. Odd versions have been 
modified since the last sync.




When a record is created, BIRTH_VER is set to current version (some odd client 
ver) and RETIRED_VER is set to max-value or some other really big value called 
NOT_RETIRED that logically represents "beyond" the current version. 




When a record is deleted, a check is made to see if that retired version 
already exists, if it does, then the deleted data is just copied over the data 
with the same retired version. If a retired record doesn't exist yet with that 
version, retired is set to current version . When a record is modified, the 
original record is treated like a delete (some odd number), and the record is 
cloned with same id, with the death set to NOT_RETIRED. The birth could be set 
to current, but it would work even if it were not, queries would just be harder.



It is possible to "save" a version by simply incrementing the current version 
number to the next odd number, thus protecting all existing retired data from 
further changes and a new layer of history has been started. Also queries for 
the current snap shot just have to test for retired == NOT_RETIRED. 




When a client is updated. It uses its last updated even version as a base line 
and looks for any newer versions. If they exist, entries from tables of that 
version are copied to the client (maybe to tmp tables). Merging is not as 
simple as text merges, and requires special code that takes into account the 
specific tables being merged and presents appropriate choices to the user. As 
pieces of the merge are made they are give a new version number. When a 
successful set of merge data has been chosen/created, then the data is accepted 
as a new odd version and the last updated version is changed to the even 
version of the repository. To summarize, an update consists of first 
downloading a tmp version of latest modified records from the latest repository 
state, attempt to merge, if a successful merge is created, it can be accepted, 
or the merge process can just be rolled back a version and thrown away.




When a client wants to commit, if the "updated" version matches and no new odd 
versions have been created, then the client data is copied over to the 
repository, but with any odd version numbers moved to the "next" even version. 
I.e. Only records with even versions makeup the "official" repository. By 
committing, a version is being saved in the repository. (The original odd 
version records can also be copied to the repository and replicated to other 
clients if it is desired to give clients the opportunity to see all the 
versions used to merge into the repository. However, this expands the amount of 
data stored even further.)



Thanks for your interest and patience if you made it this far.

Temporal Database (Almost)

Reply via email to