Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

Thanumalayan Sankaranarayana Pillai Tue, 21 May 2013 23:00:42 -0700

Thank you for your replies! I now fully understand (and appreciate) that
the "ACI" part of transactions is the most important. Also, I didn't notice
any of ACI being broken: SQLite guarantees those conditions really well.

However, just to be clear, my "potential bug" affects out-of-the-box Fedora
and Ubuntu, even with costly SCSI drives, and with the OS installed using
all default options. On the other hand, in these environments, SQLite
transactions will be immediately durable if fsync()-s are issued after the
ftruncate() or unlink() calls while resetting the rollback journal; I
verified this by modifying unixDelete() within SQLite, setting dirSync = 1
always. I'm writing the rest of this email assuming that this is an
interesting piece of news (and not something already known). Also,
confession: I am a grad student researching filesystems, and I haven’t
directly used SQLite much; I looked at SQLite because I observed some funny
disk activity. So if I am totally out-of-place with this rigmarole,
imagining things about SQLite that aren’t actually true, please forgive the
spam.

It is entirely true that desktop SATA drives with disk-caches, some bad
filesystems, and crazy OS configurations, all disrespect the entire concept
of flushing. But, the setup I verified used either Fedora 17 and Ubuntu
12.04, and either Ext4, Ext3, or Btrfs. I did my best to make sure that the
effects are neither due to a lying disk, nor because of configuration stuff
like disabling filesystem barriers. Moreover, *the last SQLite
transaction's durability is almost always delayed by 5 seconds* (it's not
something you observe only once in a thousand power crashes), if there are
no other applications flushing to the disk. If other applications do flush
to the disk, then the delay is only till when the next flush (of another
application) happens. The bug probably does not affect Windows and other
OSes - it deals with the "Unix" VFS exclusively.

Not calling fsync() might be based on the assumption that they are
synchronous. As far as I can tell, both the ftruncate() and unlink() system
calls are not synchronous (not immediately **issued to the disk**) in the
Linux distros I used, independent of the disk. I understand that such
semantics (of whether the system calls are synchronous) are often
non-standard; at other times, the documentation is not clear about what the
standards are. However, some manpages (http://linux.die.net/man/2/fsync for
ftruncate, and http://linux.die.net/man/8/mount for unlink: look for
"dirsync") suggest that most Linux distributions require an fsync() for
unlink and ftruncate to be synchronous. Some filesystems or distros might,
however, automatically flush, though: I haven't yet looked into the
behavior of other distributions such as RHEL, other Unix-like OSes such as
FreeBSD, or file systems such as XFS. But, I'll be looking into them soon,
as part of my research, so do let me know if you are interested in their
behavior.

Thanks for reading all the way through! Again, it might be a better
decision to not issue the fsync()-s after unlink and ftruncate: "D" is
probably less important than the extra overhead. I'm just trying to make
sure this is by design, and not an accident.

--
Thanu

On Tue, May 21, 2013 at 11:52 AM, Richard Hipp <d...@sqlite.org> wrote:
>
> On Tue, May 21, 2013 at 12:04 PM, Thanumalayan Sankaranarayana Pillai <
> madth...@cs.wisc.edu> wrote:
>
> > Hi all,
> >
> > Did anyone look into this? I might be setting some config option wrong,
> > so it would be great if you sent me a "you did something wrong" reply if
> > you feel that I might have the wrong config (or might be doing something
> > totally idiotic).
> >
>
> There are no configuration options set wrong.  You shouldn't ever have to
> set configuration options in order to get SQLite to work right.   I think
> instead you are just running up against fundamental limits of modern
> consumer-grade hardware.  I don't think there is anything anybody can do
> about it.
>
>
> >
> > I tested with a few other Linux machines and a few different SQLite
> > versions, and found that there is an upto-5-second delay (the default
> > filesystem commit interval) in the transactions getting durable (with
power
> > crashes, that is).
> >
> > Thanks in advance,
> > Thanu
> >
> > On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad <madth...@gmail.com
> > >wrote:
> >
> > > Hi All,
> > >
> > > I was testing out SQLite with a framework I developed. I believe,
while
> > > running on Linux, transactions might not be durable when a power crash
> > > happens immediately after a commit. I observed this using "SQLite
version
> > > 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64".
Steps
> > to
> > > reproduce:
> > >
> > > 1. Use a Linux machine with an Ext4 filesystem (default mount
options).
> > > 2. Create a database file, set journal_mode to DELETE, perform a
> > > transaction using "begin transaction ... commit;".
> > > 3. Pull the power plug immediately after "commit;" returns.
> > > 4. Put plug back in, power on the machine, open database file with
> > SQLite,
> > > and examine whether the transaction has been executed.
> > >
> > > Expected result: You always find that the transaction had been
executed.
> > > Observed result: You sometimes find that the transaction did not
execute.
> > > (To increase the chance that you end with the transaction not having
> > > executed, create an Ext4 partition in an unmounted hard disk, mount
the
> > > partition with the "-o commit=30" mount option, and do the experiment
> > with
> > > the database in that partition).
> > >
> > > Of course, it's possible that the transactions are durable, and I'm
just
> > > being totally bonkers, or running SQLite in some really wrong way. If
> > not,
> > > I believe that the problem lies in SQLite not doing an fsync() on the
> > > parent directory after unlink()-ing the rollback journal file. It
seems
> > > that, in Ext4, unlink() is atomic, but not synchronous: the rollback
> > > journal can exist after rebooting if a power crash follows the
unlink().
> > > Also, ftruncate() doesn't seem to be synchronous either
(non-durability
> > > even if the journal_mode is set to TRUNCATE).
> > >
> > > I haven't examined how other filesystems treat unlink(). I haven't
also
> > > tested with other pragma options and other journal_modes. I expect it
> > > wouldn't be a problem with WAL (it should be correct even if the log
> > exists
> > > after the reboot); however, with things like VACUUM-ing, the database
> > size
> > > would probably not be reduced if a power crash happened, because of
> > > ftruncate() not being synchronous. More generally, I think some of the
> > code
> > > assumes that all non-write and non-create operations that modify a
file
> > (in
> > > other words, unlink and ftruncate) are synchronous; but that's true in
> > > Linux only if the partition is mounted with "-o dirsync". One more
random
> > > thought: if I'm not wrong, "SQLite crash-tests" currently tests
> > > partial-flushing only with write() calls....
> > >
> > > Again, sorry for the spam if I'm just being bonkers.
> > >
> > > Thanks,
> > > Thanu
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@sqlite.org
> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >
>
>
>
> --
> D. Richard Hipp
> d...@sqlite.org
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

Reply via email to