G'day,

I've been mulling over this for a little while, but only got around to 
writing anything down last night. I may have made some errors, perhaps 
serious ones. Either way, I thought I'd just put it out there. This email 
contains an algorithm for permitting concurrent reader and writer access 
to an sqlite database. The algorithm has the following characteristics:
- File-based locking (rather than page or row-level locking) is retained
- The semantics of current operations remain the same, as far as I can tell
- A change to the sqlite file format would be required
- The solution is (probably) not backwards-compatible
- Multiple readers can access the database at the same time
- A single writer can prepare changes to the database while readers are 
still reading
- The writer locks out all readers and other writers while it performs a 
commit operation
- Blocking locks would be supported (and may be required in some places)
- Lock periods are short, even when transactions are held open for a long 
time. This may mean blocking locks are more appropriate than they were 
previously. A process that holds a database open for a second at a time to 
improve throughput will only block readers on the second boundary, 
although it will still block other writers for the full second duration.
- Inconsistency only occurs if a writer fails to apply its transaction to 
the main file at commit time, not if it fails while the transaction is still 
open.
- More syncing may be proposed than in the current algorithm, but I'm 
pretty picky about syncs
- The journal file is used for writer locking contention, so can't be 
erased at the end of each transaction. It can be erased safely only when 
there are no writers currently waiting.
- The algorithm doesn't allow modification of the main file (apart from 
journal-based recovery) until the commit phase of the writer algorithm.
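
For concreteness, when I say file-based locking I'm imagining whole-file 
POSIX fcntl() locks, blocking by default, roughly like the sketch below. 
This is an assumption of mine about the mechanism, not existing sqlite code:

/* Hypothetical helper: take, downgrade or release a lock over a whole
 * file using POSIX fcntl() record locks.  'type' is F_RDLCK, F_WRLCK
 * or F_UNLCK; the blocking behaviour comes from F_SETLKW. */
#include <fcntl.h>
#include <unistd.h>

static int whole_file_lock(int fd, short type)
{
    struct flock fl;
    fl.l_type   = type;
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;          /* zero length covers the whole file */
    return fcntl(fd, F_SETLKW, &fl);
}

Downgrading a write-lock to a read-lock is then just another call with 
F_RDLCK.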

Reader algorithm:
open main file
read-lock main file
Check for inconsistency marker in main file
If inconsistent
        release main file read-lock[1a]
        open journal file
        write-lock journal file
        write-lock main file[1b]
        perform journal playback routine[2]
        sync main file
        clear inconsistency marker in main file
        sync main file
        truncate journal file
        release journal file write-lock
        downgrade main file write-lock to read-lock
[3]
perform queries
flush block cache
release main file read-lock


Notes:
[1] Don't upgrade the read-lock to a write-lock, otherwise deadlock may 
occur with blocking locks. Never attempt to lock the journal file while you 
have a lock on the main file.
[2] May be a no-op, as a process that got the write-lock first may have 
already performed the playback.
[3] Consistency is now guaranteed, and a valid read-lock exists on the 
main file
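
To make the reader path concrete, here's roughly how I picture it in C, 
using the whole_file_lock() helper sketched earlier plus a one-byte 
inconsistency marker at a made-up header offset. The names, the offset and 
the stubbed-out playback routine are all assumptions of mine:

#include <fcntl.h>
#include <unistd.h>

#define MARKER_OFFSET 95   /* made-up header offset for the marker byte */

static int whole_file_lock(int fd, short type)   /* as sketched earlier */
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    return fcntl(fd, F_SETLKW, &fl);             /* blocking */
}

static int marker_is_set(int fd)
{
    unsigned char m = 0;
    pread(fd, &m, 1, MARKER_OFFSET);
    return m != 0;
}

/* Reader algorithm, as listed above.  Error handling omitted. */
static int reader(int db_fd, const char *journal_path)
{
    whole_file_lock(db_fd, F_RDLCK);           /* read-lock main file */

    if (marker_is_set(db_fd)) {                /* database inconsistent? */
        whole_file_lock(db_fd, F_UNLCK);       /* [1a] drop main lock first */

        int j_fd = open(journal_path, O_RDWR); /* then lock journal ... */
        whole_file_lock(j_fd, F_WRLCK);
        whole_file_lock(db_fd, F_WRLCK);       /* [1b] ... then main file */

        if (marker_is_set(db_fd)) {            /* [2] may already be clean */
            /* play_back_journal(db_fd, j_fd);    (stub) */
            fsync(db_fd);
            unsigned char zero = 0;
            pwrite(db_fd, &zero, 1, MARKER_OFFSET);  /* clear marker */
            fsync(db_fd);
            ftruncate(j_fd, 0);
        }
        whole_file_lock(j_fd, F_UNLCK);
        close(j_fd);
        whole_file_lock(db_fd, F_RDLCK);       /* downgrade to read-lock */
    }

    /* [3] database consistent, read-lock held: perform queries here */

    /* ... flush block cache ... */
    whole_file_lock(db_fd, F_UNLCK);           /* release main read-lock */
    return 0;
}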

Writer algorithm[1]:
open journal file
write-lock journal file
write-lock main file[2a]
Check for inconsistency marker in main file
If inconsistent
        perform journal playback routine
        sync main file
        clear inconsistency marker in main file
        sync main file
downgrade main file write-lock to read-lock[2b]
truncate journal
[3]
perform updates in-memory and in-journal only. Do not modify main file.
Commit:
sync journal file
upgrade main file read-lock to write-lock
write inconsistency marker into main file[4]
sync main file
write changes into main file
sync main file
clear inconsistency marker for main file
sync main file[5]
truncate journal file[6]

Notes:
[1] To avoid deadlock, a reader should never attempt to "become" a writer. 
The reader must first drop its read lock (with associated cache flush), 
and then begin the Writer algorithm steps.
[2] Write-lock the main file instead of read-locking it, as a precaution. 
If we only read-lock it, readers could open the database concurrently, see 
the same inconsistency we do, and get caught behind our transaction as 
writers trying to resolve it. By holding a write-lock up-front until we're 
sure no inconsistency exists, we avoid this degenerate case. This means a 
writer briefly write-locks the main file twice: once at the beginning of 
the transaction, and once at the end.
[3] At this point the writer has a write lock on the journal, a read lock 
on the database, and the journal is empty. No inconsistency exists in the 
database.
[4] The use of an inconsistency marker would be the main incompatibility 
between the proposed and existing sqlite implementations. I see this most 
probably as an inconspicuous boolean that could sneak into an sqlite 
database header.
[5] Why so many syncs? Well, to be safe. We want to know that all the data 
we might roll back into the main file is written to disk before we record 
the fact that that roll-back has to occur. We want to know that the 
inconsistency is noted before we start modifying the database, otherwise 
the modifications could be written but not the inconsistency record. We 
want to know that the data has been fully updated before we clear the 
inconsistency marker, and finally we want to know that the inconsistency 
marker is clear before we clear out the journal or allow the next writer 
to start modifying it. This may be excessive, and I think it's more detail 
than the current sqlite goes into, but I can't see any other way around 
it. The current sqlite should probably be performing similar procedures 
for handling its journal and its directory entry.
[6] No need to sync, since all inconsistency checks hinge off the 
inconsistency marker in the main file
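
And here's the commit step, mostly to illustrate the sync ordering from 
note [5]. Again it's only a sketch of how I imagine it: 
write_changes_to_main() is a hypothetical stand-in for copying the prepared 
pages across, and the marker offset is invented. The opening phase of the 
writer (journal lock, recovery, downgrade, truncate) would look much like 
the recovery branch of the reader sketch above.

#include <fcntl.h>
#include <unistd.h>

#define MARKER_OFFSET 95   /* same made-up header offset as before */

static int whole_file_lock(int fd, short type)   /* as sketched earlier */
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    return fcntl(fd, F_SETLKW, &fl);
}

static void set_marker(int fd, unsigned char value)
{
    pwrite(fd, &value, 1, MARKER_OFFSET);
}

/* Commit phase of the writer algorithm.  On entry the writer holds a
 * write-lock on the journal, a read-lock on the main file, and the
 * journal holds everything needed to reproduce the change. */
static int writer_commit(int db_fd, int j_fd)
{
    fsync(j_fd);                      /* journal safely on disk first */

    whole_file_lock(db_fd, F_WRLCK);  /* upgrade read-lock to write-lock */

    set_marker(db_fd, 1);             /* [4] note the inconsistency ... */
    fsync(db_fd);                     /* ... before touching any pages */

    /* write_changes_to_main(db_fd, j_fd);    (hypothetical stand-in) */
    fsync(db_fd);                     /* changes durable ... */

    set_marker(db_fd, 0);             /* ... before the marker is cleared */
    fsync(db_fd);                     /* [5] marker clear on disk ... */

    ftruncate(j_fd, 0);               /* [6] ... before the journal is
                                         reused; no sync needed here */

    whole_file_lock(db_fd, F_UNLCK);  /* release the main file */
    whole_file_lock(j_fd, F_UNLCK);   /* let the next writer in */
    return 0;
}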

So, now that it's out there... comments?
I think it addresses the most commonly expressed issue in the 
not-so-recent thread on this subject, which is concurrent read with a 
writer that has long(ish) transactions. It also deals with my own issues 
regarding blocking locks. You might have to put some more thought into it 
to work out how to do non-blocking locks in these algorithms (I haven't 
described any abort paths), but again I don't really think they're 
necessary in the general case given the algorithms outlined. The most 
important place to have non-blocking locks is probably the initial lock of 
the journal by the writer algorithm. That's the only lock that might be 
held up for a long time (i.e. longer than it takes to commit a 
transaction), because it's the one place a writer can get stuck behind 
another writer with a long transaction. I guess I would propose that that 
be the only 
place where non-blocking locks could be used and that blocking locks still 
be used there by default.
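
For what it's worth, the blocking/non-blocking distinction falls straight 
out of fcntl(): F_SETLKW waits, F_SETLK fails immediately. So the one place 
I'd expose a non-blocking option, the writer's initial journal lock, might 
look something like this (sketch only, names invented):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Take the writer's initial journal write-lock.  With block == 0 and
 * another writer already in a transaction, fail straight away instead
 * of waiting out that (possibly long) transaction. */
static int lock_journal_for_write(int j_fd, int block)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    if (fcntl(j_fd, block ? F_SETLKW : F_SETLK, &fl) == 0)
        return 0;                              /* got the lock */
    if (!block && (errno == EAGAIN || errno == EACCES))
        return 1;                              /* busy: caller may retry */
    return -1;                                 /* real error */
}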

Benjamin.
