G'day,
I thought I'd so somewhat of a code review on the lower-level pieces of
sqlite 3.0.1, so here goes:
v----- code review ----v
<file name="os_unix.c">
<comment lines="546-549" function="sqlite3OsWrite"
importance="high">
You use a loop here to try and write all data
synchronously to disk. I had to read it a couple of times before I
realised it was actually performing the check on write's return correctly,
but I am happy with it
Is there a reason why similar logic doesn't appear in
sqlite3OsRead?
Background: Under unix, multiple reads/writes may be
required for "slow" devices. One read/write should alway suffice for a
"fast" device. A disk is considered a fast device for this purpose. An nfs
mount would be considered a slow device.
</comment>
<comment function="sqlite3OsLock" importance="medium-buggsme">
I've always been uneasy about the interaction in sqlite
between the pager an os layer with regards to locking. It seems like
excessive coupling to me. The os layer is essentially written to know the
lock state of the pager, and then participate in a dance whereby the pager
and os always approximately agree on the lock status. The sqlite3OsLock
function is an example of this, where an attempt to grab a lock of a less
than or equal status to the current lock is a no-op. To me this means that
the pager is making uncessary calls and doesn't seem particularly aware of
its own state. The os layer knows too much about what's happening ("I know
that you didn't really mean to call this function"), and the pager doesn't
know enough ("I'll make this call, just in case"). I think that all such
no-op forms of os layer functions should be replaced with assertions and
the code above made self-contained.
The concept of upgrading and downgrading locks
transparently has also consistently bugged me, especially when I've wanted
to use blocking locks. It makes things much harder when the aren't
clearly-defined individual lifecycles for reader and writer locks, and
when writers are forced to share the early lifecycle of readers.
</comment>
<comment lines="742" function="sqlite3OsLock" importance="low">
Extraneous assert. The if condition tests this assert
already.
</comment>
<comment lines="743" function="sqlite3OsLock" importance="low">
Use the #define value rather than numeric value of
NO_LOCK.
</comment>
<comment lines="761" function="sqlite3OsLock"
importance="critical">
F_RDLCK should be replaced with F_WRLCK. Locking only with
F_RDLCK has no effect.
</comment>
<comment lines="804" function="sqlite3OsLock" importance="low">
Use the #define value rather than numeric value of
NO_LOCK.
</comment>
<comment lines="808" function="sqlite3OsLock" importance="low">
lock.l_len should be set to 1 when it needs to be that
value. Doing it at the top of function reduces code clarity and introduces
uncecessary asymmetry in conditional branches.
</comment>
<comment function="sqlite3OsLock" importance="medium-buggsme">
I'm nervy about the different locks that might be held by
an process in EXCLUSIVE_LOCK state depending on how it reached that state.
If it went through SHARED_LOCK it has SHARED_FIRST through SHARED_FIRST +
SHARED_SIZE write locked + PENDING_BYTE through PENDING_BYTE + 1 read
locked. If it came through the reserved state it also has a write lock on
RESERVED_BYTE through RESERVED_BYTE + 1. This appears to be dealt with in
the unlock code, but it grates a little. I actually don't like the SHARED
-> PENDING path at all, and think it should be removed for simplicity. It
effectively creates two versions of the pending and exclusive states,
respectively.
</comment>
<comment lines="826-829" function="sqlite3OsLock"
importance="low">
I don't like the setting of PENDING_LOCK here. Surely the
code would be clearer if it were set back when the pending lock was
obtained. I understand that you still want access to the "old value"
during the rest of the function, but couldn't you copy it?
</comment>
Ok, I haven't reviewed much past this point. I was hoping to get
in some comments on the pager itself which I haven't read, yet... but I've
been at this for a little too long now.
</file>
v---- analysis of locking model ----v
Just to get this straight in my head, this is the current unix locking
model (does this appear somewhere in comments?):
States
NO_LOCK = Nothing locked
SHARED_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE read-locked
RESERVED_LOCK = RESERVED_BYTE through RESERVED_BYTE + 1 write-locked +
locks of SHARED_LOCK state
PENDING_LOCK = PENDING_BYTE through PENDING_BYTE + 1 read-locked + either
locks of SHARED_LOCK state, or locks of RESERVED_LOCK state
EXCLUSIVE_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE
write-locked + locks of PENDING_LOCK state
Transitions
NO_LOCK -> SHARED_LOCK:
1. Pick up locks for pending state
2. Pick up locks for shared state
3. Drop locks for pending state
SHARED_LOCK -> RESERVED_LOCK:
1. Pick up locks for reserved state. Reserved lock is exclusive, so only
one process can be in reserved state at any one time but concurrency with
readers is ok.
SHARED_LOCK -> EXCLUSIVE_LOCK
1. Pick up locks for pending state
2. Pick up locks for exclusive state
RESERVED_LOCK -> EXCLUSIVE_LOCK
1. Pick up locks for pending state
2. Pick up locks for exclusive state
PENDING_LOCK -> EXCLUSIVE_LOCK
2. Pick up locks for exclusive state
This makes three main locking lifecycles:
1. NO_LOCK -> SHARED_LOCK -> NO_LOCK
2. NO_LOCK -> SHARED_LOCK -> SHARED_PENDING_LOCK -> SHARED_EXCLUSIVE_LOCK
-> NO_LOCK
3. NO_LOCK -> SHARED_LOCK -> RESERVED_LOCK -> RESERVED_PENDING_LOCK ->
RESERVED_EXCLUSIVE_LOCK -> NO_LOCK
Thoughts
My comments on compatability with a blocking lock mechanism:
* Path 1 is safe to replace with blocking locks. A strict sequence of
locks is followed and no lock upgrade occurs.
* Paths 2 and 3 are not safe. They put a pending lock on the database
while they still have a read lock on the shared section. When they
subsequently put a write lock on the shared section they can't guarantee
that all existing read locks will eventually clear. Other processes may be
trying the same manouver.
v----- My suggestion, from here down -----v
I think things are too complicated as they are, and don't map well to
blocking semantics. It's not far off, though, and fixes should be simple.
Here's what I think should be done:
Create separate locking lifecycles for a database reader and a database
writer. The shared "SHARED_LOCK" state does noone any good. I recommend
the following transitions:
1. NO_LOCK -> PENDING_LOCK -> SHARED_LOCK -> NO_LOCK
2. NO_LOCK -> RESERVED_LOCK -> PENDING_LOCK -> EXCLUSIVE_LOCK
Consider the following lock states:
NO_LOCK = Nothing locked
SHARED_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE read-locked
RESERVED_LOCK = RESERVED_BYTE through RESERVED_BYTE + 1 write-locked +
locks of SHARED_LOCK state
PENDING_LOCK = PENDING_BYTE through PENDING_BYTE + 1 write-locked only
EXCLUSIVE_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE
write-locked only
Consider the following transitions:
NO_LOCK -> SHARED_LOCK
1. Acquire exclusive pending lock
2. Acquire shared lock
NO_LOCK -> RESERVED_LOCK
1. Acquire reserved lock (don't do anything with pending locks)
2. Acquire shared lock
RESERVED_LOCK -> EXCLUSIVE_LOCK
1. Acquire pending lock
2. Upgrade shared lock to exclusive lock
3. Drop reserved and pending locks
* Use the reader lifecycle for SELECT statements outside of transactions.
Create the exclusive pending lock, acquire the reader lock, then drop the
pending lock. Reader locks must eventually be unlocked.
* When a transaction is started (explicitly or implicitly) drop any shared
locks back to the NO_LOCK state, then move to the RESERVED_LOCK state
immediately or on the first SQL statement inside the transaction. This
should be done by acquiring only the reserve lock.
* When actual modification to the database is required (either at commit
or page cache overflow time), acquire the pending lock exclusively, then
acquire the writer lock and drop the pending lock. Writer locks must
eventually be unlocked.
* All error cases must return to the NO_LOCK state unless recovery and
continuing on in the current lifecycle is possible.
I believe this set of states and transitions has all the positive
qualities associated with the current implementation. Additionally, it
supports blocking locks and to my mind is simpler. Here are the sequences:
Reader :-
1. Reader acquires pending lock so one of no writer, reserved writer, or
exclusive writer is true. No writer will not interfere, nor will reserved
writer. Exclusive writer will finish writing and unlock eventually.
2. Reader acquires shared lock so one of no writer, or reserved writer is
true. Read to your heart's content. Eventually reader will finish reading
and unlock
Writer :-
1. Writer acquires reserved lock so readers may or may not be present (and
will terminate eventually), and a writer in the exclusive state may be
present (and will terminate eventually)
2. Writer acquires shared lock so readers may or may not be present (and
will terminate eventually), and no other writer has a lock
3. Writer acquires pending lock so readers may or may not be present (and
will terminate eventually), new readers will be blocked out (and not begin
any read operations), and no other writer has a lock
4. Writer upgrades shared lock to exclusive lock so no readers are present
and no writers are present. Party time.
5. Writer unlocks the reserved lock and pending lock to let the next
writer get in line. That writer won't be able to become active until it
obtains a shared lock and we are done.
Is there any chance of this locking model going in before the public 3.0
release? If so, yippee for me. If not, are they compatible? Essentially
the same locks are used, so I think they will all behave so long as
blocking locks are not attempted with the current model. It's actually
getting late, now. I've been reviewing the locking model for about four
hours now to get to this point so I don't think I can spend any longer
thinking about compatibility issues. Hopefully what I've written to this
point is clear and correct.
Benjamin.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]