G'day, I thought I'd so somewhat of a code review on the lower-level pieces of sqlite 3.0.1, so here goes:
v----- code review ----v <file name="os_unix.c"> <comment lines="546-549" function="sqlite3OsWrite" importance="high"> You use a loop here to try and write all data synchronously to disk. I had to read it a couple of times before I realised it was actually performing the check on write's return correctly, but I am happy with it Is there a reason why similar logic doesn't appear in sqlite3OsRead? Background: Under unix, multiple reads/writes may be required for "slow" devices. One read/write should alway suffice for a "fast" device. A disk is considered a fast device for this purpose. An nfs mount would be considered a slow device. </comment> <comment function="sqlite3OsLock" importance="medium-buggsme"> I've always been uneasy about the interaction in sqlite between the pager an os layer with regards to locking. It seems like excessive coupling to me. The os layer is essentially written to know the lock state of the pager, and then participate in a dance whereby the pager and os always approximately agree on the lock status. The sqlite3OsLock function is an example of this, where an attempt to grab a lock of a less than or equal status to the current lock is a no-op. To me this means that the pager is making uncessary calls and doesn't seem particularly aware of its own state. The os layer knows too much about what's happening ("I know that you didn't really mean to call this function"), and the pager doesn't know enough ("I'll make this call, just in case"). I think that all such no-op forms of os layer functions should be replaced with assertions and the code above made self-contained. The concept of upgrading and downgrading locks transparently has also consistently bugged me, especially when I've wanted to use blocking locks. It makes things much harder when the aren't clearly-defined individual lifecycles for reader and writer locks, and when writers are forced to share the early lifecycle of readers. </comment> <comment lines="742" function="sqlite3OsLock" importance="low"> Extraneous assert. The if condition tests this assert already. </comment> <comment lines="743" function="sqlite3OsLock" importance="low"> Use the #define value rather than numeric value of NO_LOCK. </comment> <comment lines="761" function="sqlite3OsLock" importance="critical"> F_RDLCK should be replaced with F_WRLCK. Locking only with F_RDLCK has no effect. </comment> <comment lines="804" function="sqlite3OsLock" importance="low"> Use the #define value rather than numeric value of NO_LOCK. </comment> <comment lines="808" function="sqlite3OsLock" importance="low"> lock.l_len should be set to 1 when it needs to be that value. Doing it at the top of function reduces code clarity and introduces uncecessary asymmetry in conditional branches. </comment> <comment function="sqlite3OsLock" importance="medium-buggsme"> I'm nervy about the different locks that might be held by an process in EXCLUSIVE_LOCK state depending on how it reached that state. If it went through SHARED_LOCK it has SHARED_FIRST through SHARED_FIRST + SHARED_SIZE write locked + PENDING_BYTE through PENDING_BYTE + 1 read locked. If it came through the reserved state it also has a write lock on RESERVED_BYTE through RESERVED_BYTE + 1. This appears to be dealt with in the unlock code, but it grates a little. I actually don't like the SHARED -> PENDING path at all, and think it should be removed for simplicity. It effectively creates two versions of the pending and exclusive states, respectively. </comment> <comment lines="826-829" function="sqlite3OsLock" importance="low"> I don't like the setting of PENDING_LOCK here. Surely the code would be clearer if it were set back when the pending lock was obtained. I understand that you still want access to the "old value" during the rest of the function, but couldn't you copy it? </comment> Ok, I haven't reviewed much past this point. I was hoping to get in some comments on the pager itself which I haven't read, yet... but I've been at this for a little too long now. </file> v---- analysis of locking model ----v Just to get this straight in my head, this is the current unix locking model (does this appear somewhere in comments?): States NO_LOCK = Nothing locked SHARED_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE read-locked RESERVED_LOCK = RESERVED_BYTE through RESERVED_BYTE + 1 write-locked + locks of SHARED_LOCK state PENDING_LOCK = PENDING_BYTE through PENDING_BYTE + 1 read-locked + either locks of SHARED_LOCK state, or locks of RESERVED_LOCK state EXCLUSIVE_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE write-locked + locks of PENDING_LOCK state Transitions NO_LOCK -> SHARED_LOCK: 1. Pick up locks for pending state 2. Pick up locks for shared state 3. Drop locks for pending state SHARED_LOCK -> RESERVED_LOCK: 1. Pick up locks for reserved state. Reserved lock is exclusive, so only one process can be in reserved state at any one time but concurrency with readers is ok. SHARED_LOCK -> EXCLUSIVE_LOCK 1. Pick up locks for pending state 2. Pick up locks for exclusive state RESERVED_LOCK -> EXCLUSIVE_LOCK 1. Pick up locks for pending state 2. Pick up locks for exclusive state PENDING_LOCK -> EXCLUSIVE_LOCK 2. Pick up locks for exclusive state This makes three main locking lifecycles: 1. NO_LOCK -> SHARED_LOCK -> NO_LOCK 2. NO_LOCK -> SHARED_LOCK -> SHARED_PENDING_LOCK -> SHARED_EXCLUSIVE_LOCK -> NO_LOCK 3. NO_LOCK -> SHARED_LOCK -> RESERVED_LOCK -> RESERVED_PENDING_LOCK -> RESERVED_EXCLUSIVE_LOCK -> NO_LOCK Thoughts My comments on compatability with a blocking lock mechanism: * Path 1 is safe to replace with blocking locks. A strict sequence of locks is followed and no lock upgrade occurs. * Paths 2 and 3 are not safe. They put a pending lock on the database while they still have a read lock on the shared section. When they subsequently put a write lock on the shared section they can't guarantee that all existing read locks will eventually clear. Other processes may be trying the same manouver. v----- My suggestion, from here down -----v I think things are too complicated as they are, and don't map well to blocking semantics. It's not far off, though, and fixes should be simple. Here's what I think should be done: Create separate locking lifecycles for a database reader and a database writer. The shared "SHARED_LOCK" state does noone any good. I recommend the following transitions: 1. NO_LOCK -> PENDING_LOCK -> SHARED_LOCK -> NO_LOCK 2. NO_LOCK -> RESERVED_LOCK -> PENDING_LOCK -> EXCLUSIVE_LOCK Consider the following lock states: NO_LOCK = Nothing locked SHARED_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE read-locked RESERVED_LOCK = RESERVED_BYTE through RESERVED_BYTE + 1 write-locked + locks of SHARED_LOCK state PENDING_LOCK = PENDING_BYTE through PENDING_BYTE + 1 write-locked only EXCLUSIVE_LOCK = SHARED_FIRST through SHARED_FIRST + SHARED_SIZE write-locked only Consider the following transitions: NO_LOCK -> SHARED_LOCK 1. Acquire exclusive pending lock 2. Acquire shared lock NO_LOCK -> RESERVED_LOCK 1. Acquire reserved lock (don't do anything with pending locks) 2. Acquire shared lock RESERVED_LOCK -> EXCLUSIVE_LOCK 1. Acquire pending lock 2. Upgrade shared lock to exclusive lock 3. Drop reserved and pending locks * Use the reader lifecycle for SELECT statements outside of transactions. Create the exclusive pending lock, acquire the reader lock, then drop the pending lock. Reader locks must eventually be unlocked. * When a transaction is started (explicitly or implicitly) drop any shared locks back to the NO_LOCK state, then move to the RESERVED_LOCK state immediately or on the first SQL statement inside the transaction. This should be done by acquiring only the reserve lock. * When actual modification to the database is required (either at commit or page cache overflow time), acquire the pending lock exclusively, then acquire the writer lock and drop the pending lock. Writer locks must eventually be unlocked. * All error cases must return to the NO_LOCK state unless recovery and continuing on in the current lifecycle is possible. I believe this set of states and transitions has all the positive qualities associated with the current implementation. Additionally, it supports blocking locks and to my mind is simpler. Here are the sequences: Reader :- 1. Reader acquires pending lock so one of no writer, reserved writer, or exclusive writer is true. No writer will not interfere, nor will reserved writer. Exclusive writer will finish writing and unlock eventually. 2. Reader acquires shared lock so one of no writer, or reserved writer is true. Read to your heart's content. Eventually reader will finish reading and unlock Writer :- 1. Writer acquires reserved lock so readers may or may not be present (and will terminate eventually), and a writer in the exclusive state may be present (and will terminate eventually) 2. Writer acquires shared lock so readers may or may not be present (and will terminate eventually), and no other writer has a lock 3. Writer acquires pending lock so readers may or may not be present (and will terminate eventually), new readers will be blocked out (and not begin any read operations), and no other writer has a lock 4. Writer upgrades shared lock to exclusive lock so no readers are present and no writers are present. Party time. 5. Writer unlocks the reserved lock and pending lock to let the next writer get in line. That writer won't be able to become active until it obtains a shared lock and we are done. Is there any chance of this locking model going in before the public 3.0 release? If so, yippee for me. If not, are they compatible? Essentially the same locks are used, so I think they will all behave so long as blocking locks are not attempted with the current model. It's actually getting late, now. I've been reviewing the locking model for about four hours now to get to this point so I don't think I can spend any longer thinking about compatibility issues. Hopefully what I've written to this point is clear and correct. Benjamin. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]