[sqlite] Keeping a history of structured data

Tristan Van Berkom Thu, 17 Jul 2014 12:47:21 -0700

Hi all,

We've been pondering how to keep history of structured records without
storing the entirety of the structured record for each new revision, we
have some ideas on how to achieve this but I wanted to consult this list
in case we have overlooked some other strategies, and also just to see
if our proposed solution to the problem is sound.


First, here is a brief description of the nature of the data and the
requirements of it's usage, it revolves around the notion of 'Events'
which can have 'Participants'.

One record:


     E
    /|\
   / | \
  P P P...

One can assume that 'E' is a table of 'Events' where each Event has
a uid, and that 'P' is a table of participants which refer to the
'Event' uid they belong to (each participant consequently has an 'id'
which is only unique per 'Event' uid).

The objective is to keep a revisioned history of 'E' whenever 'E' has
changed, or any of it's 'P' counterparts have changed, ideally without
storing a duplicate row for the one 'E' and every one of its 'P's every
time anything changes.

The requirements are that:

 a.) The current revision of a given 'E' (Event) and it's
     'P' (Participant) counterparts be quickly searchable.

 b.) Every change to a given 'E', or any of it's 'P' counterparts
     must be recorded (may be implemented with triggers, may not
     be, the method of this is not a requirement).

 c.) The entirety of 'E' and every 'P' should not be duplicated for
     a modification of only the 'E' or any individual 'P' which belongs
     to the given 'E'.

 d.) Any 'E' can be queried for it's history (i.e. the full structured
     record of 'E' and all 'P' counterparts can be reconstructed for
     every version), the history of 'E' can be listed.

Our tentative solution for this flows as follows:

 1.) All history of 'E' and 'P' records are to be stored in
     separate tables - this is to satisfy the requirement 'a'
     by reducing the complexity of joining tables 'E' and 'P'
     (i.e. when querying, we are only interested in the latest
     revision).

     Let us refer to these tables as 'E(h)' and 'P(h)', i.e.
     the history of E and the history of P.


 2.) Every 'E' and every 'P' has it's own local revision

     Every 'E' revision must be unique for a given 'E' and every
     'P' revision must be unique for a given 'P's participation
     in a given 'E' (there is no other constraint on this revision
     except for it's uniqueness).


 3.) Another revision exists which is the master revision of the
     conceptually structured record 'E'.

     The nature of this revision is very specific, it is an integer
     which starts at 0 whenever any new 'E' is introduced, and it is
     incremented whenever the structured record 'E' changes in any
     way (I.e. the master revision is incremented if 'E' changes or
     if one or more of it's related 'P's change, or both).

     One row per 'E' will be stored in a separate table indicating
     the "latest" revision of any given 'E'. Let's refer to this
     table as 'E(r)', the "revision of E".


 4.) We will require 2 more tables to reconstruct the records for
     a given revision of the structured record:

       o The Event History Index contains a local revision / master
         revision tuple for every possible master revision of 'E', we
         shall refer to this as: E(hi)

       o The Participant History Index contains a local / master
         revision tuple for every possible master revision of 'E',
         it differs only from the above as it holds one revision per
         'P' of a given 'E'. Let's refer to this table as: P(hi)

     This table will obviously grow in proportion to the history,
     however as it is constructed mostly of master revisions (integers)
     and foreign keys, it should be an order of magnitude less than
     simply duplicating everything for every modification.


 5.) When a modification is made to 'E' or any of it's 'P's, the
     following will occur:

      o If the 'E' has changed:
         o Create a new copy of 'E' in 'E(h)'
         o Set a new local revision of 'E'
      o If any 'P' has changed:
         o Create a new copy of 'P' in 'P(h)'
         o Set a new local revision of 'P'

      o Create one entry in 'E(hi)' mapping the global revision to
        it's local revision
      o Create one entry for every related 'P' in 'P(hi)' mapping
        their local revisions to the current master revision

      o Increment the master revision for 'E' in 'E(r)', the current
        data in tables 'E' and 'P' are considered to be at the revision
        indicated by 'E(r)'


 6.) When listing the history of a given 'E' the following will occur:

      o Obtain the current master revision of 'E' from 'E(r)',
        let's call it $current

      o If $current is 0, there is no history yet, abort

      o Otherwise, let $rev be each revision from ($current - 1)
        down to 0

      o Construct a complete structured record of 'E' for every
        $rev, this will involve the following:

        - Obtain the local revision corresponding to $rev from E(hi)
        - Fetch the row from E(h) by it's local revision
        - Obtain the list of local revisions corresponding to $rev
          from P(hi)
        - Fetch the rows from P(hi) by their local revisions


We have considered at least 3 ways of revisioning structured records and
currently I think we are settled with the above, other approaches we
thought of include:

 o Copy all of the data into history every time anything changes,
   possibly the simplest technique but seems extremely wasteful

 o Accumulate individual changes in a single table, rows stored in that
   table would indicate the nature of changes effected, many (small)
   rows can be created for a single revision. Data would need to be
   reconstructed with extreme care, the system would be similar to how
   distributed source revisioning tools handle change sets. In my
   opinion, this is way too complex to be of value.

 o Use the same approach described in my email, but without tables
   E(hi) and P(hi), instead of storing explicit references to which row
   holds the valid local revision for the master revision, we can infer
   all of this with a statement such as:

     SELECT * FROM 'E(h)' AS event_history
       WHERE event_history.revision <= $rev
       ORDER BY event_history.revision DESCENDING
       LIMIT 1

   With this approach, the master revision is incremented and any
   records which have changed as a part of the new revision must have
   their local revisions set to the value of the master revision (and
   local revisions are then of course a sequence starting from 0 as
   described above).

   This approach almost holds it's ground however it requires an extra
   table to list which 'P' are valid participants for a given revision
   of 'E', since adding or removing any 'P' from the given 'E' is
   allowed and in itself causes a new logical revision.

   I consider this approach to be in second place, it may take a longer
   time to fetch each revision but it involves less tables and is
   seemingly simpler (also, fetching history records is not strictly
   required to be fast).

So in closing, are there some interesting approaches to this problem wehaven't thought of yet ?


Does the general approach outlined in this email sound like a good one ?

Any further insight into this problem would be greatly appreciated,
before we dive deep into the implementation.

Best Regards,
   -Tristan



_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Keeping a history of structured data

Reply via email to