I think the gist of Ben's proposal is as follows (please correct me if I am wrong):
Writes do not modify the main database file until they are ready to commit, meaning that readers can continue to read from the file during lengthy transactions. The journal contains the modified pages, not the original pages, and is thus a roll-forward journal rather than a roll-back journal.
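To make sure I have the distinction right, here is how I picture one journal record under each scheme (an illustrative C sketch of my own; the struct name and page size are made up, not Ben's actual format):

  /* Illustrative only: one record in the journal file. */
  #define PAGE_SIZE 1024            /* assumed page size */
  typedef struct JournalRecord {
    unsigned int pageno;            /* which page of the main database file */
    char aData[PAGE_SIZE];          /* one page image */
  } JournalRecord;

  /* Roll-back journal (current SQLite): aData holds the ORIGINAL page
  ** image; it is played back after a failure to undo partial changes
  ** already made to the main file.
  **
  ** Roll-forward journal (Ben's proposal): aData holds the MODIFIED page
  ** image; it is played back at commit time to apply the transaction,
  ** and the main file is untouched until then. */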
This email of mine mainly replies to the issues brought up by D. Richard Hipp in his first reply on the thread; most of it is not quoted.
First of all, I like the idea of a roll-forward rather than a roll-back journal, and understand the benefits. However, I have an alternate implementation to suggest which may address Richard's concerns while bringing some additional benefits. Forgive me if this comes out as naive.
The characteristics of my solution are largely similar to Ben's:
- Only one active writer at a time; other writers block until the first commits
- Any number of readers at the same time as a writer preparing a transaction
- The active writer blocks all readers (and writers) while it commits
- Inconsistency only if the active writer fails to apply changes to main at commit
- No changes to the main file until commit time
- Amount of disk space used is roughly the same
Other characteristics would be different:
- The main file format and journal file format would need zero changes
- Lifetime of journal file is the same, only until active writer finishes commit
- May be backwards compatible; if a previous SQLite version sees the same files, its default action should (maybe) cause the right thing to happen
- A new zero-length semaphore/lock file gets used for writer locking contention
- Another new zero-length file gets used as a flag to say a commit is in progress
- A separate page cache for writer from the readers (happens already, I think)
- An in-memory hash table would store offsets for journal pages so they can be found easily after a page is removed from the cache; if this starts to get large for massive updates, then an additional (third) new smallish temp file can be created which holds just that offset information, keeping RAM use down (DBM-like file?); see the sketch after this list
- Zero changes to b-tree layer and above
- All changes conceptually or actually in the pager layer, possibly a bit in the OS layer
- Changes may result in a new layer above the pager that emulates the pager API, or not
- Overall, probably fewer changes to SQLite code are necessary
- The only files needing locks are the main file and the semaphore/lock file; the journal file and the commit-flag file do not need their own locks
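Regarding the in-memory hash table of journal offsets, I mean something as simple as this (a rough C sketch; all the names here are mine and hypothetical, not actual SQLite code):

  #include <stdlib.h>

  /* Maps a main-file page number to the byte offset of that page's most
  ** recent image in the journal file, so the active writer can reread a
  ** changed page without a linear scan of the journal. */
  typedef struct PageOff PageOff;
  struct PageOff {
    unsigned int pageno;    /* page number within the main file */
    long offset;            /* byte offset of the page image in the journal */
    PageOff *pNext;         /* collision chain */
  };

  #define N_BUCKET 1024
  static PageOff *aBucket[N_BUCKET];

  /* Record (or update) the journal offset for a page. */
  static int setPageOffset(unsigned int pageno, long offset){
    PageOff *p;
    int h = pageno % N_BUCKET;
    for(p=aBucket[h]; p; p=p->pNext){
      if( p->pageno==pageno ){ p->offset = offset; return 0; }
    }
    p = malloc(sizeof(*p));
    if( p==0 ) return -1;
    p->pageno = pageno;
    p->offset = offset;
    p->pNext = aBucket[h];
    aBucket[h] = p;
    return 0;
  }

  /* Return the journal offset for a page, or -1 if the page has not
  ** been written during this transaction. */
  static long getPageOffset(unsigned int pageno){
    PageOff *p;
    for(p=aBucket[pageno % N_BUCKET]; p; p=p->pNext){
      if( p->pageno==pageno ) return p->offset;
    }
    return -1;
  }

If the table grows too big, these same pageno/offset pairs are what would spill into the third temp file.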
If my idea is implemented without changes, then older SQLite versions will be able to take over from the newer one under some circumstances, but not others. Everything will work right if there is no journal file, because the last transaction was committed or rolled back successfully. But if there is a journal file, then we have a potential problem: the journal has to be conditionally applied to or discarded from the main file, but the older SQLite wouldn't know that and will always apply it. So, as long as the newer SQLite user makes sure that any inconsistencies are fixed prior to using an older SQLite version with the main file, everything's okay.

Alternately, with some human intervention, an older SQLite can still be made to do the right thing. Basically, if the commit-flag file exists, then let the older SQLite run and "roll back" the journal file against the main file, which will actually finish the commit. If a journal file exists but a commit-flag file doesn't, then manually delete the journal file, which is the same as rolling back a transaction that was incompletely prepared.
----------------
Here's the algorithm, which describes a modified pager layer (I sketch a couple of the steps in C after each half)...
WRITER:
  obtain semaphore file write-lock
  obtain main file write-lock
  if commit-flag file exists (meaning, there is an inconsistency) then:
    if journal file exists then:
      play back journal file, writing its contents into main (with syncs)
    delete commit-flag file
  if journal file exists then:
    delete journal file
  if hash file exists then:
    delete hash file
  downgrade main file write-lock to read-lock
  create journal file
  create hash file if a large amount of data is to be changed
  prepare transaction:
    - all changed pages written to journal file (with syncs), none to main file
    - if necessary, store in hash file the offsets of many changed pages in journal
    - active writer process will read changed pages from journal if needed
  upgrade main file read-lock to write-lock
  if transaction is to be committed then:
    create commit-flag file
    play back journal file, writing its contents into main (with syncs)
    delete commit-flag file
  delete journal file
  delete hash file if it exists
  release main file write-lock
  release semaphore file write-lock
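Before the READER half: to make the commit step concrete, here is roughly what I picture for the playback, reusing the record layout from my first sketch (POSIX calls; the file names, the zero-based page numbering, and commitJournal() itself are all hypothetical):

  #include <fcntl.h>
  #include <unistd.h>

  #define PAGE_SIZE 1024  /* as in the first sketch */

  /* Apply the journal to the main file.  The commit-flag file is created
  ** first and deleted last, so a crash anywhere in between leaves the
  ** flag behind and tells the next process to roll the journal FORWARD
  ** rather than discard it. */
  static int commitJournal(int fdJournal, int fdMain){
    unsigned int pageno;                 /* pages numbered from 0 here */
    char aData[PAGE_SIZE];
    int fdFlag = open("db-commit", O_WRONLY|O_CREAT, 0644);
    if( fdFlag<0 ) return -1;
    close(fdFlag);                       /* create commit-flag file */
    lseek(fdJournal, 0L, SEEK_SET);
    while( read(fdJournal, &pageno, sizeof(pageno))==sizeof(pageno) ){
      if( read(fdJournal, aData, PAGE_SIZE)!=PAGE_SIZE ) return -1;
      lseek(fdMain, (long)pageno*PAGE_SIZE, SEEK_SET);
      if( write(fdMain, aData, PAGE_SIZE)!=PAGE_SIZE ) return -1;
      fsync(fdMain);                     /* "with syncs" */
    }
    unlink("db-commit");                 /* delete commit-flag file */
    unlink("db-journal");                /* delete journal file */
    return 0;
  }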
READER:
  open main file
  obtain main file read-lock
  if commit-flag file exists (meaning, there is an inconsistency) then:
    release main file read-lock
    obtain semaphore file write-lock
    obtain main file write-lock
    if commit-flag file exists (meaning, there is an inconsistency) then:
      if journal file exists then:
        play back journal file, writing its contents into main (with syncs)
      delete commit-flag file
    if journal file exists then:
      delete journal file
    if hash file exists then:
      delete hash file
    downgrade main file write-lock to read-lock
    release semaphore file write-lock
  perform queries
  release main file read-lock

-----------------
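One more concrete detail: the semaphore/lock-file steps above could be ordinary POSIX fcntl() byte-range locks on the zero-length file. A sketch (the file name and function name are mine):

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  /* Block until this process holds an exclusive lock on the zero-length
  ** semaphore file -- the "obtain semaphore file write-lock" step.
  ** Byte-range locks work fine on an empty file when l_len==0, which
  ** means "from l_start to the end of the file, however far it grows". */
  static int lockSemaphoreFile(const char *zName){
    struct flock lk;
    int fd = open(zName, O_RDWR|O_CREAT, 0644);
    if( fd<0 ) return -1;
    memset(&lk, 0, sizeof(lk));
    lk.l_type = F_WRLCK;                /* exclusive (write) lock */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;
    if( fcntl(fd, F_SETLKW, &lk)<0 ){   /* SETLKW: wait until granted */
      close(fd);
      return -1;
    }
    return fd;  /* caller keeps fd open; closing it drops the lock */
  }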
So, I think that outlines my idea. Any questions? Anything you don't understand in my explanation? Any flaws in the idea that I need to address?
Note that, if it would help speed at all, the journal file and hash file don't actually have to be deleted; they could be truncated instead. Only the commit-flag file needs to be added and deleted regularly. Also note that the hash file isn't necessary to play back a journal; it is only meant to save a linear scan following a cache bump in the active writer.
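For example, the writer could keep the journal's file descriptor open across transactions and just empty it (a trivial sketch, assuming POSIX ftruncate()):

  #include <unistd.h>

  /* Reuse the journal file rather than unlink()ing and recreating it;
  ** truncating to zero length has the same effect as deletion here. */
  static void resetJournal(int fdJournal){
    ftruncate(fdJournal, 0);
    lseek(fdJournal, 0L, SEEK_SET);
  }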
Thank you and have a good day. -- Darren Duncan