On 2/10/19, Edwin Török <[email protected]> wrote: > I was wondering what changes SQLite3 would need in light of the fsync > problems discovered by the PostgreSQL community
We've been looking into this, even before your message arrived. When fsync() returns non-zero, that is basically the operating system telling you that it has already corrupted the database file. Recovery is not feasible in the general case. An SQLITE_IOERR_FSYNC error is nearly the same thing as SQLITE_CORRUPT. Though complete recovery is not possible in general, we are looking at proposals for making recovery more likely and making permanent database damage less likely. Do not panic. Fsync() failures are exceedingly rare, especially if you avoid the use of network filesystems. Though we are working to mitigate the adverse effects of fsync() failures, we believe you are much more likely to get corruption due to hardware failures, cosmic rays, or rogue applications scribbling into your database files using fopen(). BACKGROUND For readers who are new to this problem, the following is a quick summary: All modern operating systems cache disk content in kernel memory. When you do a read(), the OS first checks to see if it already has the content you are trying to read in cache, and if it does it merely does something like an memcpy() to make that content available to your process. If the content is not already in cache, it is loaded into cache and then the same memcpy()-like operation occurs. When you write(), the new content overwrites the in-cache copy of the content and the cache line is marked "dirty", but the new content is not normally written back to persistent storage, at least not immediately. The OS might chose to write back content to storage at any time, but the application does not know when that might be. The fsync() call is suppose to force content to be written back to persistent storage. When fsync() returns successfully, that means that there are no "dirty" cache lines for the file in question and whatever content is currently in cache exactly matches the content that is on persistent storage. So when fsync() works, that means that the cache and the disk agree and are consistent. But when fsync() fails, the situation is muddled. The state of the system after a failed fsync() depends on the operating system, the specific version of the operating system, and the filesystem on which the file resides. When an fsync() fails, the system might purge all cache lines associated with that file. This causes some or all of the write() requests since the previous fsync() to roll back. This is non-compliant behavior according to POSIX, but we are told that this is what the XFS and BTRFS filesystems on Linux do. This behavior is non-compliant, but it does seem to make the most sense, and if all systems worked this way we could perhaps engineer SQLite so that recovery would always be possible. Unfortunately, not all systems work this way. The POSIX-compliant behavior after an fsync() fails is that the cache lines are all marked "clean" and are retained. The EXT4 filesystem on Linux works this way, I am told. This means that subsequent reads will see the new content for a while, but when the cache lines are eventually ejected due to memory pressure, or when the system reboots, some of the content will silently revert to its old value. Some of the cache lines might have successfully been written to persistent storage prior to the fsync() failure, and the content in those lines will not revert. But content that did not get back to persistent storage will eventually revert to its original value. So, after an fsync() failure, database file content could change out from under the application, but the application has no way of knowing if or when this will happen. This is an unrecoverable situation. If process X is writing to a file and encounters an fsync() error and hence shuts down, this leave the filesystem cache in an inconsistent state - the content in cache is "clean" but does not reflect what is actually in persistent storage. If a second process Y comes along and starts reading the same file, the Y process has no good way of knowing that the file is in an inconsistent state. The file content might change out from under process Y as cache lines are ejected, and process Y is never informed of this. The fsync() error was delivered to process X, so process Y has no way of knowing that anything ever went amiss. Direct I/O has been proposed as a way to work around these problems. But direct I/O has substantial performance penalties. Also, according to the man-page: "The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. Linux alignment restrictions vary by filesystem and kernel version.... [T]here is currently no filesystem-independent interrface for an application to discover these restrictions for a given file or filesystem." And that is just for Linux. Probably other unix systems have other rules. In other words, O_DIRECT is not portable - its operation depends on the specific OS and filesystem you are using. This seems like an unacceptable solution for a portable library like SQLite, and so direct I/O is not really a viable solution. -- D. Richard Hipp [email protected] _______________________________________________ sqlite-users mailing list [email protected] http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

