Hi list \o/ Hope you all had a good weekend.
On Thu, Jul 15, 2010 at 07:13:47PM +0200, Carsten Klein wrote:
> To be frank, you do not get 1k+/s new issues nor do you modify 1k+/s
> issues at the same time, neither do you get 1k+ page edits or page adds in

I agree. However, this is not a full solution; it will require much more.
But *from* those early numbers, we can deduce that such a system will
probably perform OK when all the other triggers are in place (we have
quite a good margin before hitting any realistic limit, as you seem to
agree).

Remember that the tested numbers are not simultaneous edits by users. They
are "atomic changes" at storage level. One edit by a user is NOT one
change; it may be 0 or 5 or 20 or 100 changes, depending entirely on what
was done, how many custom fields are present, etc. What happens if you
rename a milestone with 1000 associated tickets, for example? First, one
update to the milestone table is done, and then all associated tickets are
updated: 1001 revisioned changes in a nanosecond. (See the first sketch
below.)

> What I do not understand, however, is why you reinvent the row lock
> mechanisms already available in the system. [...]
> [..post merge..]
> After having thought about its purpose for a while, I came to the
> conclusion that you want to lock a resource when a user starts for
> example editing a page or a ticket.

BEEEP! Think again! :D

The very first place I talk about the locking mechanism, I say this:

  [...] (to tie all the modifications to groups, representable in timeline)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The lock is *not* to be grabbed on read request! It is an exclusive write
lock, grabbed when the middleware *changes data* that is coupled to the
revisioning system, and released immediately when done.

The only real "locking" function it serves is to make sure that two
parallel processes cannot change rows in storage for the same resource at
the same time (we are not locking the row we are applying lock logic to,
but rather all rows *associated with it*, located in other tables).

The MAIN purpose is something entirely different, and I did not explain it
well enough in the OP. How would you, in one query, delete a row *and*
specify who did it? ;)

Think of it this way: when a trigger in the schema detects a change of any
type on a storage table (which means there is some data to revision), it
executes a block of SQL. Now the question becomes: how does the SQL block
in the trigger know which user is performing the query? To accurately
revision the data, we must store with it the name of the person or program
who messed with the values.

1. Tell the system who is making a change (make a new Fact)
2. Tell the revisioning system what resource(s) you plan to change, by
   locking them with your fact_id. The VIEW now lists your fact as
   modifying all relevant attributes.
3. Perform modifications
4. Release your Fact (this cascades a release of locks)

(These four steps are sketched in SQL below.)

The entire purpose of the VIEW is to present to the TRIGGERs this info
that they *cannot* know themselves. The trigger only has a pitiful link to
an attribute, nothing about the user.
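To make the milestone example concrete, this is roughly what happens at
storage level (table and column names after Trac's schema; the statements
are illustrative, not the actual implementation):

  -- One user action: rename 'milestone1' to 'milestone2'.
  UPDATE milestone SET name = 'milestone2' WHERE name = 'milestone1';
  UPDATE ticket SET milestone = 'milestone2' WHERE milestone = 'milestone1';

  -- 1 milestone row + 1000 ticket rows are touched; each row change
  -- fires a revision trigger, so the storage layer records 1001 atomic
  -- changes for a single edit.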
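And here is a minimal SQLite sketch of the four steps. All names (fact,
resource_lock, attribute, fact 7, resource 42) are invented for
illustration; the real model differs:

  PRAGMA foreign_keys = ON;  -- SQLite needs this for ON DELETE CASCADE

  CREATE TABLE fact (
      fact_id    INTEGER PRIMARY KEY,
      author     TEXT    NOT NULL,
      created_at INTEGER NOT NULL
  );

  CREATE TABLE resource_lock (
      resource_id INTEGER NOT NULL UNIQUE,  -- exclusive: one fact per resource
      fact_id     INTEGER NOT NULL REFERENCES fact(fact_id) ON DELETE CASCADE
  );

  CREATE TABLE attribute (
      attribute_id INTEGER PRIMARY KEY,
      resource_id  INTEGER NOT NULL,
      name         TEXT    NOT NULL,
      value        TEXT
  );

  -- 1. Tell the system who is making the change: record a new fact.
  INSERT INTO fact (fact_id, author, created_at)
  VALUES (7, 'terje', strftime('%s', 'now'));

  -- 2. Lock the resource about to change; the UNIQUE constraint makes a
  --    second, parallel lock attempt on the same resource fail.
  INSERT INTO resource_lock (resource_id, fact_id) VALUES (42, 7);

  -- 3. Perform the modifications; the revision triggers fire here and
  --    can map each changed row back to fact 7.
  UPDATE attribute SET value = 'new summary'
  WHERE resource_id = 42 AND name = 'summary';

  -- 4. Release the fact; the cascade drops its locks with it.
  DELETE FROM fact WHERE fact_id = 7;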
So we provide a facility for it to look at, the view, showing the
information the trigger has (attribute id) right next to the information
it needs (fact), so:

  SELECT revision_fact                   -- grab the info our trigger does not have
  FROM get_attribute_revision_fact      -- from the view
  WHERE attribute_id = OLD.attribute_id -- using the attribute id

If we had not locked the resource (in this way or some other), the trigger
could not obtain the author information with any query, because the
information does not exist!

The VIEW is a highly inefficient solution for a large dataset. The fastest
approach is to ATTACH DATABASE ':memory:' AS factprovider; you could then
use triggers to populate an in-memory table with the fact-attribute tuples
when locks are taken (we don't care about attribute links in other
database connections). This avoids a heavy lookup through the view (2
joins) for each revisioned row of data. I use a view because it's the
easiest for experimental work, and it works just fine with a small
dataset. (Both the view and the :memory: variant are sketched below.)

> [...] However, users might exist that will use this mechanism for
> indefinitely locking individual resources, so that eventually the
> administrator of the system will have to remove these locks again,
> manually.

Certainly, this is an issue. However, we are not grabbing the lock on
read, so it would occur only in the case where the middleware crashes or
for some other reason does not release its fact. We will have this problem
in any case; the difference is that we can now detect when it has
happened, because it *will* result in a stale lock. Now you might consider
that a disadvantage, but in *my* (messed up) world it's a major benefit. I
would **MUCH** rather know that something went to pieces (and be able to
track which edit, by which user, at which time) than simply have some
corrupt/missing data somewhere and not know about it. There are several
unexplored options that might help, like using an expiry timestamp on
locks. But inevitably you have to make a decision on what to do, whether
manually or automatically.

> Of course, this would require the db layer to be extended so as to allow
> the user of that system to select for update, which is currently not part
> of its implementation.

Absolutely, the db abstraction will change with the model. The biggest
issue with this solution is coming up with a clean way to generate an
equivalent system across engines. While simple in principle, the differing
support for AFTER UPDATE OF .. ON means that some engines, like mysql and
postgres!=dev, need to combine several triggers into one with IF blocks or
some other technique (see the last sketch below). Exactly how this is done
has not been researched at all. I take it for granted, however, that if
SQLite can provide the goods, it is possible to duplicate it in the larger
engines with other functionality (and I know this to be the case for quite
a few, but maybe not all; mysql4 can never be supported, for example,
since it does not have triggers afaik).

After sleeping on this stuff for some nights, I think I have concluded
that the double-fact system is redundant. A single column will do for this
purpose; the "linked list" does not add anything except practical help in
a few edge cases - it will be removed.

I've started work on a small trigger generator (not Python tho), and it
looks promising (SQLite only so far). I'll inform the list, and supply the
code for testing, as soon as it is *somewhat* more functional.
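To tie the pieces together, here is a rough sketch of what the view and a
revision trigger could look like, continuing the invented schema from the
earlier sketch (the real view is not necessarily shaped like this):

  -- Continuing the invented schema: a history table for attribute rows.
  CREATE TABLE attribute_revision (
      attribute_id  INTEGER NOT NULL,
      old_value     TEXT,
      revision_fact INTEGER NOT NULL   -- who made this change
  );

  -- The view maps what a trigger has (an attribute id) to what it needs
  -- (the fact): attribute -> lock -> fact, the two joins mentioned above.
  CREATE VIEW get_attribute_revision_fact AS
  SELECT a.attribute_id,
         f.fact_id AS revision_fact
  FROM attribute     AS a
  JOIN resource_lock AS rl ON rl.resource_id = a.resource_id
  JOIN fact          AS f  ON f.fact_id = rl.fact_id;

  -- The revision trigger stamps every row change with its author. If no
  -- lock (and thus no fact) exists, the NOT NULL constraint makes the
  -- write fail loudly instead of revisioning anonymously.
  CREATE TRIGGER attribute_revision_trg AFTER UPDATE OF value ON attribute
  BEGIN
      INSERT INTO attribute_revision (attribute_id, old_value, revision_fact)
      VALUES (OLD.attribute_id,
              OLD.value,
              (SELECT revision_fact
               FROM get_attribute_revision_fact
               WHERE attribute_id = OLD.attribute_id));
  END;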
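The :memory: variant could look something like the following. Note the
assumption baked in here: as far as I know, SQLite only lets TEMP triggers
reference tables across attached databases, so these maintenance triggers
(and any revision trigger reading the in-memory table) would presumably
have to be created per connection:

  -- Per-connection setup; ATTACH and TEMP objects are local to this
  -- connection, so other connections never see these tuples.
  ATTACH DATABASE ':memory:' AS factprovider;

  CREATE TABLE factprovider.attribute_fact (
      attribute_id INTEGER PRIMARY KEY,   -- single-row lookup, no joins
      fact_id      INTEGER NOT NULL
  );

  -- Assumption: only TEMP triggers may reach across databases in SQLite.
  CREATE TEMP TRIGGER lock_taken AFTER INSERT ON resource_lock
  BEGIN
      INSERT INTO factprovider.attribute_fact (attribute_id, fact_id)
      SELECT attribute_id, NEW.fact_id
      FROM attribute
      WHERE resource_id = NEW.resource_id;
  END;

  CREATE TEMP TRIGGER lock_released AFTER DELETE ON resource_lock
  BEGIN
      DELETE FROM factprovider.attribute_fact
      WHERE fact_id = OLD.fact_id;
  END;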
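For contrast, here is roughly the same revision trigger on an engine
without AFTER UPDATE OF, in MySQL syntax: the per-column filter has to
move from the trigger declaration into an IF block inside a single AFTER
UPDATE trigger (again purely illustrative; generating these portably is
exactly the unresearched part):

  DELIMITER //
  CREATE TRIGGER attribute_revision_trg AFTER UPDATE ON attribute
  FOR EACH ROW
  BEGIN
      -- No "UPDATE OF value" here; detect the column change by hand.
      -- <=> is MySQL's NULL-safe equality operator.
      IF NOT (OLD.value <=> NEW.value) THEN
          INSERT INTO attribute_revision (attribute_id, old_value, revision_fact)
          VALUES (OLD.attribute_id,
                  OLD.value,
                  (SELECT revision_fact
                   FROM get_attribute_revision_fact
                   WHERE attribute_id = OLD.attribute_id));
      END IF;
  END//
  DELIMITER ;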
(If the weather stays this bad, it'll be this week, I guess ;-)

Terje