Hi All,

I was testing out SQLite with a framework I developed. I believe, while
running on Linux, transactions might not be durable when a power crash
happens immediately after a commit. I observed this using "SQLite version
3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps to
reproduce:

1. Use a Linux machine with an Ext4 filesystem (default mount options).
2. Create a database file, set journal_mode to DELETE, perform a
transaction using "begin transaction ... commit;".
3. Pull the power plug immediately after "commit;" returns.
4. Put plug back in, power on the machine, open database file with SQLite,
and examine whether the transaction has been executed.

Expected result: You always find that the transaction had been executed.
Observed result: You sometimes find that the transaction did not execute.
(To increase the chance that you end with the transaction not having
executed, create an Ext4 partition in an unmounted hard disk, mount the
partition with the "-o commit=30" mount option, and do the experiment with
the database in that partition).

Of course, it's possible that the transactions are durable, and I'm just
being totally bonkers, or running SQLite in some really wrong way. If not,
I believe that the problem lies in SQLite not doing an fsync() on the
parent directory after unlink()-ing the rollback journal file. It seems
that, in Ext4, unlink() is atomic, but not synchronous: the rollback
journal can exist after rebooting if a power crash follows the unlink().
Also, ftruncate() doesn't seem to be synchronous either (non-durability
even if the journal_mode is set to TRUNCATE).

I haven't examined how other filesystems treat unlink(). I haven't also
tested with other pragma options and other journal_modes. I expect it
wouldn't be a problem with WAL (it should be correct even if the log exists
after the reboot); however, with things like VACUUM-ing, the database size
would probably not be reduced if a power crash happened, because of
ftruncate() not being synchronous. More generally, I think some of the code
assumes that all non-write and non-create operations that modify a file (in
other words, unlink and ftruncate) are synchronous; but that's true in
Linux only if the partition is mounted with "-o dirsync". One more random
thought: if I'm not wrong, "SQLite crash-tests" currently tests
partial-flushing only with write() calls....

Again, sorry for the spam if I'm just being bonkers.

Thanks,
Thanu
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to