On 2/10/19, Edwin Török <[email protected]> wrote:
> I was wondering what changes SQLite3 would need in light of the fsync
> problems discovered by the PostgreSQL community

We've been looking into this, even before your message arrived.

When fsync() returns non-zero, that is basically the operating system
telling you that it has already corrupted the database file.  Recovery
is not feasible in the general case.  An SQLITE_IOERR_FSYNC error is
nearly the same thing as SQLITE_CORRUPT.

Though complete recovery is not possible in general, we are looking at
proposals for making recovery more likely and making permanent
database damage less likely.

Do not panic.  Fsync() failures are exceedingly rare, especially if
you avoid the use of network filesystems.  Though we are working to
mitigate the adverse effects of fsync() failures, we believe you are
much more likely to get corruption due to hardware failures, cosmic
rays, or rogue applications scribbling into your database files using
fopen().

BACKGROUND

For readers who are new to this problem, the following is a quick summary:

All modern operating systems cache disk content in kernel memory.
When you do a read(), the OS first checks to see if it already has the
content you are trying to read in cache, and if it does it merely does
something like an memcpy() to make that content available to your
process.  If the content is not already in cache, it is loaded into
cache and then the same memcpy()-like operation occurs.  When you
write(), the new content overwrites the in-cache copy of the content
and the cache line is marked "dirty", but the new content is not
normally written back to persistent storage, at least not immediately.
The OS might chose to write back content to storage at any time, but
the application does not know when that might be.  The fsync() call is
suppose to force content to be written back to persistent storage.
When fsync() returns successfully, that means that there are no
"dirty" cache lines for the file in question and whatever content is
currently in cache exactly matches the content that is on persistent
storage.

So when fsync() works, that means that the cache and the disk agree
and are consistent.  But when fsync() fails, the situation is muddled.
The state of the system after a failed fsync() depends on the
operating system, the specific version of the operating system, and
the filesystem on which the file resides.

When an fsync() fails, the system might purge all cache lines
associated with that file.  This causes some or all of the write()
requests since the previous fsync() to roll back.  This is
non-compliant behavior according to POSIX, but we are told that this
is what the XFS and BTRFS filesystems on Linux do.  This behavior is
non-compliant, but it does seem to make the most sense, and if all
systems worked this way we could perhaps engineer SQLite so that
recovery would always be possible.  Unfortunately, not all systems
work this way.

The POSIX-compliant behavior after an fsync() fails is that the cache
lines are all marked "clean" and are retained.  The EXT4 filesystem on
Linux works this way, I am told.  This means that subsequent reads
will see the new content for a while, but when the cache lines are
eventually ejected due to memory pressure, or when the system reboots,
some of the content will silently revert to its old value.  Some of
the cache lines might have successfully been written to persistent
storage prior to the fsync() failure, and the content in those lines
will not revert.  But content that did not get back to persistent
storage will eventually revert to its original value.  So, after an
fsync() failure, database file content could change out from under the
application, but the application has no way of knowing if or when this
will happen.  This is an unrecoverable situation.

If process X is writing to a file and encounters an fsync() error and
hence shuts down, this leave the filesystem cache in an inconsistent
state - the content in cache is "clean" but does not reflect what is
actually in persistent storage.  If a second process Y comes along and
starts reading the same file, the Y process has no good way of knowing
that the file is in an inconsistent state.  The file content might
change out from under process Y as cache lines are ejected, and
process Y is never informed of this.  The fsync() error was delivered
to process X, so process Y has no way of knowing that anything ever
went amiss.

Direct I/O has been proposed as a way to work around these problems.
But direct I/O has substantial performance penalties.  Also, according
to the man-page: "The O_DIRECT flag may impose alignment restrictions
on the length and address of user-space buffers and the file offset of
I/Os. Linux alignment restrictions vary by filesystem and kernel
version.... [T]here is currently no filesystem-independent interrface
for an application to discover these restrictions for a given file or
filesystem."  And that is just for Linux. Probably other unix systems
have other rules. In other words, O_DIRECT is not portable - its
operation depends on the specific OS and filesystem you are using.
This seems like an unacceptable solution for a portable library like
SQLite, and so direct I/O is not really a viable solution.

-- 
D. Richard Hipp
[email protected]
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to