"Brian Smith" <[EMAIL PROTECTED]> wrote:
> When SQLite writes to the log file, it 
>
>  (1) writes all the data,
>  (2) fsyncs, then
>  (3) updates the page count in the header, and finally
>  (4) fsyncs again.
>
> Isn't it possible to change SQLite so that the steps 3
> and 4 are unnecessary?
> 

That depends on your filesystem.  On many modern file
systems you can safely omit 3 and 4.  And if the
xDeviceCharacteristics() method of the VFS implementation
for a particular filesystem reports SQLITE_IOCAP_SAFE_APPEND,
then SQLite skips steps 3 and 4.  Steps 3 and 4 are
also skipped if you set

   PRAGMA synchronous=NORMAL;

instead of the default

   PRAGMA synchronous=FULL;

It has been reported to us that by omitting steps 3 and
4 you get about a 30% speed improvement on MacOS X.

But without steps 3 and 4 and on some filesystems, a
power failure that occurs while the journal is being
written can result in database corruption.  The damage
happens like this:  The journal is written to the disk
out-of-order so that later parts of the journal are
written before some earlier parts.  And the power failure
occurs before the entire journal is written.  So you are
left with something like this:

     ........****....

Where "." indicates valid data, and "*" indicates random
trash that just happen to be on the disk - not information
that was written by SQLite.  Once power is restore, the
next SQLite process to open the file would see the "hot"
journal, notice that both the header and the "end mark"
are valid, but the stuff in the middle is not.  That
invalid stuff in the middle will get played back into
the database, corrupting it.

It is true that a sufficiently strong checksum might
detect the corruption in the middle.  SQLite does do
some checksumming to try to detect this sort of problem
when you specify PRAGMA synchronous=NORMAL.  But checksums
are not 100%.

The SAFE_APPEND attribute means that the filesystem does
not extend the size of the file until the content has been
safely written to oxide.  For filesystems that support
SAFE_APPEND, there is no possibility of getting corrupt
data in the file during a power failure.  I suspect that
most modern journalling filesystems are SAFE_APPEND, but
I'll continue to assume the worst until I know for sure.

Early versions of SQLite2 always assumed SAFE_APPEND. 
Then we got some reports from the field of corruption
following power loss that was ultimately traced back to the
scenario described above.  That's why we added the extra
fsyncs and checksums.

--
D. Richard Hipp <[EMAIL PROTECTED]>

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to