On Fri, Jul 09, 2010 at 04:06:48PM +0200, Christian Boos wrote:
> [good post I should have replied to]
Well, what do you know, here I am again ;-) I could not leave this message unanswered. (However, I have cut it *all* out, so you don't know what it said.)

Christian, there are many reasons I am not supplying a full model design up-front. I know you understand it; it's a HUGE task. Even if you all don't want to graph all your resources as I do (I *DON'T* GET IT), maybe there still is something I can contribute to. I *FULLY* admit there are BIG challenges in refactoring the data model, but also keep in mind that I have been promoting a long-term perspective with research as the primary goal. I will happily discuss any part of the future database design with you on e-mail, including the intricate details of storage mechanisms, to the very best of my knowledge. Heck, we'll discuss it over a $drink some day before you know it :-)

In the meantime, here is a slightly further developed version of the previously mentioned model-level revisioning, to show use cases for views and triggers. I thought that might be cool to try.

--------------------------------------------------------------------------
I wrote and "tested" it for **SQLite** <3 JUST to please YOU, Christian!! <3
It was fun to do. I hope you all enjoy hearing about it.
--------------------------------------------------------------------------

TABLES:
  Resources  - let's just say Tickets
  Attributes - defines a "field" for a ticket, relationship to storage
  Storage    - same as ticket_custom
  Facts      - new table; holds the username of the person making the changes

If your mailer is not ASCII-capable, you should replace it!! :D

Views (nested, outermost first):
  ResourceValues
    AttributeRevisions
      AttributeValues
        ResourceAttributes

Tables:    Resources -> Attributes     -> Storage       -> Revisions         -> Facts
Trac eqv:  Ticket       (missing link)    ticket_custom    (forgot the name)    timeline

The "timeline" equivalent is only accurate in the sense that the Facts table now takes over its central job. Facts are the currency of this model: you *must* always obtain a valid fact_id and use it to lock the resources you are going to change (this ties all the modifications into groups, representable in the timeline).

CREATE VIEW get_attribute_revision_fact AS
    SELECT A.id AS attribute_id, F.id AS revision_fact
      FROM Attributes A
     INNER JOIN Resources R ON A.resource_id = R.id
     INNER JOIN Facts F ON COALESCE(R.lock_fact, 0) = F.id
                       AND F.end_time IS NULL;

An Attribute being the link between a resource (Ticket) and its storage (_custom), I only just *skimmed* a quarter of Carsten's post, but I think that's conceptually the same. Don't quote me, I only sketched up a minimal model.

For every row in Resources whose lock_fact is set to an integer corresponding to a rowid in Facts with end_time IS NULL, this view lists the related Attribute ids and the Fact that is currently locking them. The view is used to determine the owners of value change operations (insert, update, delete), depending on who holds the lock. It's just a long list of all attributes that are currently locked, derived from the resource locks in "real time".

A TRIGGER can now use such a VIEW as a value when INSERTing a row into the revisions table. It does require that you lock the resource with a valid fact_id (by INSERTing a row into Facts containing your username/id and leaving end_time as NULL, indicating that the operation is in progress).
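Quick aside, so the rest is easier to read: here is roughly the minimal schema I have in mind. Column names are my own shorthand, inferred from the examples further down (the real tables carry a bit more), so treat it as a sketch rather than the actual schema:

-- Sketch only; names inferred from the examples in this mail.
-- The actual prototype splits storage/revisions per datatype
-- (storage_text, revisions_text); one generic pair is enough here.

CREATE TABLE Facts (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES Facts(id),   -- for nested "changesets"
    author    TEXT NOT NULL,
    created   REAL,
    end_time  REAL                            -- NULL while the edit is open
);

CREATE TABLE Resources (
    id        INTEGER PRIMARY KEY,
    lock_fact INTEGER REFERENCES Facts(id)    -- NULL when unlocked
);

CREATE TABLE Attributes (
    id           INTEGER PRIMARY KEY,
    resource_id  INTEGER NOT NULL REFERENCES Resources(id),
    storage_type TEXT NOT NULL,               -- 'text', ...
    name         TEXT NOT NULL
);

CREATE TABLE storage (                        -- the ticket_custom equivalent
    id           INTEGER PRIMARY KEY,
    attribute_id INTEGER NOT NULL REFERENCES Attributes(id),
    fact_id      INTEGER REFERENCES Facts(id),-- fact that put the value here
    value        TEXT
);

CREATE TABLE revisions (
    id            INTEGER PRIMARY KEY,
    storage_id    INTEGER,
    attribute_id  INTEGER NOT NULL,
    owning_fact   INTEGER,                    -- fact that owned the previous value
    acting_fact   INTEGER,                    -- fact performing this change
    created       REAL,
    revision_type TEXT,                       -- 'create' / 'change' / 'delete'
    value         TEXT
);

Anyway, back to the locking.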
You would then have to supply this fact_id to a locking mechanism together with a resource id, whether directly table-based, or with views to lock across domains, or black magic, or possibly some other way with virtual tables or such... In this case you lock the resource by UPDATE-ing the lock_fact column in the Resources table with an open fact_id. You have to check that *YOU* are actually the one who obtained the lock, as is "traditional", and not just grab it blindly (we all know how that can go..):

UPDATE Resources SET lock_fact=x WHERE id=y AND lock_fact IS NULL;

If someone grabbed the lock before you, the AND lock_fact IS NULL will not match (note it has to be IS NULL; = NULL never matches anything). If this UPDATE actually modified a row, you are *guaranteed* to have the lock. Now select it right back to check the result:

SELECT lock_fact FROM Resources WHERE id=y;

If you get your own fact_id in return, all is good; if not: retry or abort!

Once the resource is locked, each row is auto-revisioned, tagged with the relevant fact_id, so you can afterwards work your way back down to all the changes clustered below a single edit. If the Facts table has a parent_id column, I THINK it should be possible to deal with nested "changesets" this way. I have *NOT* tried that in practice, but with manual intervention from the lock operator (say, a call chain of code that maintains a stack of facts) it should actually be "fairly easy" to track/backtrack: open a new fact by specifying the top stack element as parent (if any) and push the resulting new id onto the stack when moving inwards; pop the stack and update the fact with end_time=.. when retracting... .....I HAVE NOT TESTED THAT AT ALL, TO BE CLEAR.....

When you are done, you UPDATE your Fact with SET end_time=JULIANDAY('NOW'). A different trigger picks up this change and releases the resource rows themselves, if any were actually locked (a rough sketch of that release trigger is a bit further down). Storage is the ticket_custom equivalent in this case.

CREATE TRIGGER storage_changed_to_revision
AFTER UPDATE OF value ON storage
FOR EACH ROW
WHEN OLD.value <> NEW.value   -- no redundancy :D
BEGIN
    INSERT INTO revisions (storage_id, attribute_id, owning_fact,
                           acting_fact, created, revision_type, value)
    VALUES (OLD.rowid, OLD.attribute_id, OLD.fact_id,
            (SELECT revision_fact
               FROM get_attribute_revision_fact
              WHERE attribute_id = OLD.attribute_id),
            JULIANDAY('NOW'), 'change', NEW.value);

    -- Autotag the new data with the current locking fact_id.
    -- owning_fact above is the fact that put the previous value there;
    -- this obviously differs for the INSERT/DELETE triggers.
    UPDATE storage
       SET fact_id = (SELECT revision_fact
                        FROM get_attribute_revision_fact
                       WHERE attribute_id = OLD.attribute_id)
     WHERE rowid = OLD.rowid;
END;

This example will neither block a change NOR perform revisioning if the resource is not locked (but we are free to do what we want; that is, the admin is free to enforce it by generating and installing different trigger sets for the schema, depending on how strict s/he wants to be with that environment). That could easily be done from a command line tool.

The Storage table holds the current value and the fact_id that put it there (the fact holds the username, datetime, ..).

Note that this system keeps a *duplicate* of the current real value in the revision history. Any other way I tried I was doomed: offset issues, could not reconstruct cleanly in SQL, heavy joins or unions or some such, or I would end up with something really cumbersome in code. PLUS we would have had to do extra magic to get all the facts right (AFAICTW/OT). It might seem a little bit redundant, but when you think about it, it's really not.
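While I remember it, here is a minimal sketch of what that release trigger can look like (my sketch, assuming the single lock_fact column used above; the real one may well differ):

-- Sketch: when a Fact is closed (end_time set), drop every resource
-- lock that was held by that fact.
CREATE TRIGGER fact_closed_release_locks
AFTER UPDATE OF end_time ON Facts
FOR EACH ROW
WHEN NEW.end_time IS NOT NULL
BEGIN
    UPDATE Resources SET lock_fact = NULL WHERE lock_fact = OLD.id;
END;

Anyway, back to that "duplicate" value.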
It's just that the revisioning system is *one small step ahead* of where you might expect it to be, and that helps in reassembling a complete changelog from start -> current straight out of the revisions table. I actually think it turned out pretty neat (it's just a baby, ofc) :-)

When a storage row is first INSERTed, it is cloned to revisions *immediately*, with owning_fact NULL (indicating that nobody had put a value in there previously) and acting_fact set to the current locking fact. When it is UPDATEd, the new value is pushed to revisions with owning_fact being the fact that put the previous value there (the *INSERTing* fact the first time around) and acting_fact being the *UPDATING* fact. When it is deleted, the value NULL is pushed to revisions. That way, all lookups into history are done against one table that contains all the information needed for a full reconstruction (JOIN against Facts..).

The "ticket_custom" (storage) is a sort of "caching front-end" to the revisions table; it's an API. Storage holds _nothing_ we don't already have, but it serves as a central interface to *manipulate* and validate data, and "caches" the currently valid value for quick comparison and retrieval.

It is *OBVIOUSLY* a prototype. There is no sanity checking at all, etc. I do *NOT* know if it will work in the real world; all I know is that it works superficially. I have never used triggers for such tasks before, but I *have* seen it baked in the wild, albeit with ASE + procedures. SQLite is VERY fast if you need it to be.. I'm certain that a PRAGMA synchronous = OFF before a bulk edit and ON afterwards would literally FLY ;). Wouldn't it be worth a little more testing, you think???!?

A big side advantage, IF this meets the performance and practicality requirements, is that a piece of external code in whatever language can operate on the Trac database with full integrity and revisioning, as long as it follows the model API, even a shell script piping back and forth to SQLite in a loop (and it would all show up in the timeline). Well, **I** consider that a big advantage, but you guys don't even want a **GRAPH** WTF!!! :D

It is possible to create triggers that monitor the individual columns of the main storage tables (ticket etc.) and push the values to the SAME revisions table IF they change INDIVIDUALLY, even though the data is still stored exactly as it is today (we would obviously generate these triggers with a script). The middleware is then free to push out all the updates it can; the triggers will take care of revisioning only that which has changed.

(The lock_fact would be added to WorkEffort in my world, since that's the "toplevel" entity where you can do it in *one place*. For the current codebase, either a locking table with realm/id is established, or a lock_fact is introduced into Ticket, Milestone and the other tables to bring them into the revisioning system (they need their triggers, revisions table, etc.). Well, actually you could conceivably collapse *all* revisions into one table; do we really need to care about the datatype once revisioned?? Could we not just CAST() it back if we have to restore?? idk..)

It MUST be faster performance-wise to do this directly in SQLite, using its *tailored data monitoring mechanisms*, than to shuttle the data back and forth to the host via queries, no??? .. I have not tested, but I would absolutely think so. The locking drags things down of course, but it is easy to do even if you have to repeat it, and it even HAS *direct* benefits of its own (IMO)!
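Since the test run below refers to them, here is roughly what the INSERT companion trigger looks like (my sketch of it; the DELETE one is the mirror image, pushing a NULL value):

CREATE TRIGGER storage_created_to_revision
AFTER INSERT ON storage
FOR EACH ROW
BEGIN
    -- Clone the new value to revisions right away: owning_fact is NULL
    -- (nothing was there before), acting_fact is the current locking fact.
    INSERT INTO revisions (storage_id, attribute_id, owning_fact,
                           acting_fact, created, revision_type, value)
    VALUES (NEW.rowid, NEW.attribute_id, NULL,
            (SELECT revision_fact
               FROM get_attribute_revision_fact
              WHERE attribute_id = NEW.attribute_id),
            JULIANDAY('NOW'), 'create', NEW.value);

    -- Tag the stored row with the fact that created it.
    UPDATE storage
       SET fact_id = (SELECT revision_fact
                        FROM get_attribute_revision_fact
                       WHERE attribute_id = NEW.attribute_id)
     WHERE rowid = NEW.rowid;
END;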
Pre-generated test data helps us prove the model:

$ rm x.db; sqlite3 x.db < emptyschema.sql
$ ./datagen.pl -tickets 1000 -milestones 10 -changes 10 > dummy.sql
$ sqlite3 x.db < dummy.sql
$ sqlite3 x.db
sqlite> select count(*) from revisions_text;
21508
^^^^^
sqlite> select count(*) from facts;
3041
^^^^

**Note** the number of facts/revisions for a dataset of this size (above). Can YOU please explain to me what the correlation between those numbers is?!??? (Frankly I don't understand it *at all*... the datagen script just emits some SQL, afaik.)

sqlite> select resource_id, attribute_id, attribute_storage, revision_type, value from attributerevisions where attribute_id between 1 and 10;
1  1  text  create  Datagen Component 1
1  1  text  change  Datagen Component 1
1  1  text  change  Datagen Component 1
1  1  text  change  Datagen Component 1
1  2  text  create  Owner: 1
1  2  text  change  Owner: 1 upd 1
1  2  text  change  Owner: 1 upd 2
1  2  text  change  Owner: 1 upd 3
1  2  text  change  Owner: 1 upd 4
1  2  text  delete
1  3  text  create  Datagen Milestone 1
1  3  text  change  Datagen Milestone 1
1  3  text  change  Datagen Milestone 1
1  3  text  change  Datagen Milestone 1
1  3  text  change  Datagen Milestone 1
1  3  text  change  Datagen Milestone 1
1  4  text  create  Description for Mil
2  7  text  create  Datagen Milestone 2
2  7  text  change  Datagen Milestone 2
2  7  text  change  Datagen Milestone 2
2  7  text  change  Datagen Milestone 2
2  8  text  create  Description for Mil
2  8  text  change  Description for Mil
2  8  text  change  Description for Mil
2  8  text  change  Description for Mil
2  8  text  change  Description for Mil
2  8  text  change  Description for Mil

real    0m0.027s
user    0m0.024s
sys     0m0.000s

(Ofc storing the text "change" is redundant etc.; this is just to test.)

And for some revisioned write operations:

$ cat 10k.sql
PRAGMA synchronous = OFF;
BEGIN EXCLUSIVE;
-- Obtain a fact we can use to lock with
INSERT INTO Facts (author) VALUES ('CLI script');                      -- id 1
-- You are *required* to specify lock_fact on creation
INSERT INTO Resources (lock_fact) VALUES (1);                          -- id 1
-- Declare an attribute belonging to the above resource
INSERT INTO Attributes (resource_id, storage_type, name)
    VALUES (1, 'text', 'name');                                        -- id 1
-- And finally INSERT some data; it is copied to revisions.
INSERT INTO storage_text (attribute_id, value)
    VALUES (1, 'Contents of custom field');
-- Now run 10k updates to the same string.
UPDATE storage_text SET value='abcdef' WHERE id=1;
[.... total 10.000 copies of the previous UPDATE statement ....]
-- Wipe the data, close the fact.
DELETE FROM storage_text;
UPDATE Facts SET end_time = JULIANDAY('now') WHERE fact_id=1;
COMMIT;

$ time sqlite3 x.db < 10k.sql
real    0m1.769s
user    0m1.472s
sys     0m0.296s

sqlite> select resource_id, attribute_id, attribute_storage, value from attributerevisions;
-- that being a view over revisions and attributes..
1|1|text|Contents of custom field
1|1|text|abcdef
1|1|text|

The trigger does not revision the 10k changes, because the value is identical!

I have now wiped x.db and changed the trigger to revision unconditionally (removed the WHEN <>). The 2 extra rows are of course from the INSERT and DELETE triggers (the INSERT one is sketched above; they are almost identical)..

$ time sqlite3 x.db < 10k.sql
real    0m3.044s
user    0m2.024s
sys     0m0.984s

sqlite> select count(*) from attributerevisions;
10002

synchronous = ON, BEGIN .. COMMIT:
real    0m2.994s
user    0m1.944s
sys     0m1.036s

synchronous = ON, NO transaction:
real    0m46.058s
user    0m2.356s
sys     0m3.084s

(200+/sec synced without a transaction, high disk load on the puter... not so bad.)

AFAICT the features used here have been available since SQLite 2.5.0, released 2002 Jun 17 (?!??!!)
Certainly works on my debian stable. (I think I used 'em in 4.0 too?)

I will also quickly answer another key point that you raised, and that is the bridge to the data stored in the schema. The correct solution for storage is already in place (save for relations); what is missing is metadata. One proven solution is to provide the middleware with information about where the different data can be found (normal table or custom field) by documenting your domains *within* the model. The middleware must load all this information, which describes the available datatypes, table names, etc., and use it as a basis for decisions and for UI- and query-generation. Our case is not as complex as a full EAV/CR system; the question is only what minimal subset we can derive from this technology to get what we need.

http://jamia.bmjjournals.com/content/6/6/478.full

Christian, I will gladly devote some time to coding SQL, <3 <3 <3 *TO* *YOUR* *SPECS* <3 <3 <3 (wouldn't that be a lot more productive?)

(Sorry, there are probably a lot of mistakes/inaccuracies, technical and otherwise.. you'll have to live with it.. cause I *am* posting now :p)

I will be back online over the week-end (or after?). Nothing from me till then; I hope you enjoy this post meanwhile :-)

..... G*dd*mn Carsten, WHY would you reply wise words to a post I EXPLICITLY said I would NOT revive??.... ;(

Terje