On May 22, 2008, at 6:50 PM, Bob Ebert wrote:

> I wonder if anyone is in a position to compare and contrast the
> journal_mode and synchronous settings with the various ext3 journal
> modes?
>
> Up until now we've been using ext3 with data=ordered, and sqlite3 with
> synchronous=normal, journal_mode=delete.  We're on an embedded system
> with a very high frequency of random power cycling.  In this mode,  
> we've
> seen an unusually large number of corrupted databases.
>
> My best theory right now is that we lose power after the ext3 metadata
> has written the delete of the -journal file, but before all of the  
> page
> overwrites for the db file are fully flushed, since these are done by
> two different processes in normal linux.

SQLite calls fsync() before it calls unlink() on the journal file.  So  
if fsync() is working as documented, all database data should be  
safely on oxide prior to the unlink().  We have seen instances before  
where fsync() returned long before the data was on oxide, so I would  
not be surprised by this.  You can often see this yourself by mounting  
a flash memory stick, writing a file on the stick, calling fsync()  
then watching the LED on the end of the stick continue to flash long  
after the fsync() has returned.

Please recognize that there really is nothing that SQLite can do to  
correct this problem.  SQLite depends on the operating system and/or  
disk controller living up to its end of the contract.  If the disk  
says all data is on oxide, SQLite has no choice but to believe it  
because SQLite has no way to independently verify the matter.  And if  
the disk controller and/or operating system is lying, and a power  
failure follows, the database can be corrupted.  There isn't much you  
can do when your hardware starts lying to you.

>  I believe if we lose power at
> this point, then after a restart ext3 will replay the journal delete,
> but will leave the main DB in a semi-written state, and thus corrupt  
> the
> DB.
>
> I'm wondering what impact switching to journal_mode=persist will  
> have on
> this scenario.  I believe this will change how the last step of the
> atomic commit (clearing the journal) is written to disk, and thus
> changes how ext3 will recover the file after a power loss.  Currently
> our IO scheduler doesn't guarantee ordered writes, so in theory the
> journal header clearing could still make it to disk before all the  
> main
> db pages.  We're in a position to adjust the IO scheduler if necessary
> to prevent this.

Once again, SQLite calls fsync() on the database prior to calling  
write() to clear the journal header.  So *if* fsync is doing its job,  
you should have no problems.  *If*.

Note the the whole point of these fsync() calls in SQLite is to act as  
an I/O barrier operation - to guarantee that all I/O operations that  
are issued prior to the fsync() complete prior to any I/O operations  
issued afterwards.  We must have an I/O barrier in order to preserve  
database integrity across a power failure.  If your fsync() is not  
work quite right, then all bets are off.  I don't know of anything  
SQLite can do to make the situation better.

>
>
> Are there any other potential holes or races between ext3 journal data
> and sqlite file contents that I should worry about?  Is  
> synchronous=full
> the only way to guarantee atomic commits under these conditions?
>                       

A synchronous=FULL commit goes like this:

       1.  Write all pages that will change into the journal
       2.  fsync() the journal
       3.  Overwrite the journal header to indicate that it is valid
       4.  fsync() the journal
       5.  Write changes into the database file.
       6.  fsync() the database files
       7.  Delete or truncate or overwrite the journal header  
(depending on journaling mode)

The difference between FULL and NORMAL is that NORMAL omits the  
fsync() on step 2.  That is the only difference.  The step-2 fsync is  
important on some filesystems, but on ext3 it is probably  
unnecessary.  So I don't think that going to synchronous=FULL is going  
to help you.

D. Richard Hipp
[EMAIL PROTECTED]



_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to