Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Anthony Iliopoulos
Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:
> On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund  wrote:
> > Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> > absurd, as is some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
> 
> Well, I admit that I wasn't entirely serious about that email, but I
> wasn't entirely not-serious either.  If you can't find reliably find
> out whether the contents of the file on disk are the same as the
> contents that the kernel is giving you when you call read(), then you
> are going to have a heck of a time building a reliable system.  If the
> kernel developers are determined to insist on these semantics (and,
> admittedly, I don't know whether that's the case - I've only read
> Anthony's remarks), then I don't really see what we can do except give
> up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the
respective linux kernel maintainers and open up this topic for
more detailed discussions. LSF/MM'18 is upcoming and it would
have been the perfect opportunity but it's past the CFP deadline.
It may still worth contacting the organizers to bring forward
the issue, and see if there is a chance to have someone from
Pg invited for further discussions.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ail...@altatus.com> 
> wrote:
> 
> >I honestly do not expect that keeping around the failed pages will
> >be an acceptable change for most kernels, and as such the
> >recommendation
> >will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep per inode information 
> about fatal failures that occurred around. Report errors until that bit is 
> explicitly cleared.  Yes, that keeps some memory around until unmount if 
> nobody clears it. But it's orders of magnitude less, and results in usable 
> semantics.

As discussed before, I think this could be acceptable, especially
if you pair it with an opt-in mechanism (only applications that
care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel
releases that are eating fsync() writeback errors even over sync().

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:
> On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> > What about having buffered IO with implied fsync() atomicity via
> > O_SYNC?
> 
> You're kidding, right?  We could also just add sleep(30)'s all over the
> tree, and hope that that'll solve the problem.  There's a reason we
> don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you can apply the same principle of selectively using O_SYNC
at times and places that you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for
all those kernels that don't keep the pages around, irrespective of future
fixes. Short of loading a kernel module and dealing with the problem directly,
the only other available options seem to be either O_SYNC, O_DIRECT or ignoring
the issue.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:
> Honestly I don't think there's *any* way to use the current interface
> to implement reliable operation. Even that embedded database using a
> single process and keeping every file open all the time (which means
> file descriptor limits limit its scalability) can be having silent
> corruption whenever some other process like a backup program comes
> along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear
the error), and that's a nasty bug that apparently went unnoticed for
a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at
least this issue, but similar fixes need to be adopted by many other
kernels (all those that mark failed pages as clean).

I honestly do not expect that keeping around the failed pages will
be an acceptable change for most kernels, and as such the recommendation
will probably be to coordinate in userspace for the fsync().

What about having buffered IO with implied fsync() atomicity via O_SYNC?
This would probably necessitate some helper threads that mask the
latency and present an async interface to the rest of PG, but sounds
less intrusive than going for DIO.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>
> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie"
pages may occupy. Not sure how helpful it would be in the long run
since this tends to be highly application-specific, and for something
with a large data footprint one would end up tuning this accordingly
in a system-wide manner. This has the potential to leave other
applications running in the same system with very little memory, in
cases where for example original application crashes and never clears
the error. Apart from that, further interfaces would need to be provided
for actually dealing with the error (again assuming non-transient
issues that may not be fixed transparently and that temporary issues
are taken care of by lower layers of the stack).

> Well, there seem to be kernels that seem to do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right
there (since 2017), mentioning that there can be no recovery from
EIO at the block layer and this needs to be done differently. No
idea how an application running on top of either FreeBSD or Illumos
would actually recover from this error (and clear it out), other
than remounting the fs in order to force dropping of relevant pages.
It does provide though indeed a persistent error indication that
would allow Pg to simply reliably panic. But again this does not
necessarily play well with other applications that may be using
the filesystem reliably at the same time, and are now faced with
EIO while their own writes succeed to be persisted.

Ideally, you'd want a (potentially persistent) indication of error
localized to a file region (mapping the corresponding failed writeback
pages). NetBSD is already implementing fsync_ranges(), which could
be a step in the right direction.

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have
a very simple process model (e.g. embedded DBs that can abort a
txn on fsync() EIO). I think that durability is a rather complex
cross-layer issue which has been grossly misunderstood similarly
in the past (e.g. see [1]). It seems that both the OS and DB
communities greatly benefit from a periodic reality check, and
I see this as an opportunity for strengthening the IO stack in
an end-to-end manner.

Best regards,
Anthony

[1] 
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:
> 
> I'd like a middle ground where the kernel lets us register our interest and
> tells us if it lost something, without us having to keep eight million FDs
> open for some long period. "Tell us about anything that happens under
> pgdata/" or an inotify-style per-directory-registration option. I'd even
> say that's ideal.

I see what you are saying. So basically you'd always maintain the
notification descriptor open, where the kernel would inject events
related to writeback failures of files under watch (potentially
enriched to contain info regarding the exact failed pages and
the file offset they map to). The kernel wouldn't even have to
maintain per-page bits to trace the errors, since they will be
consumed by the process that reads the events (or discarded,
when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal
with synchronizing writers and related issues (since this would
be merely a notification mechanism - not prevent any process
from continuing), which I understand would be rather intrusive
for the current Pg multi-process design.

But other than that, similarly this interface could in principle
be similarly implemented in the BSDs via kqueue(), I suppose,
to provide what you need.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
> On 9 April 2018 at 11:50, Anthony Iliopoulos <ail...@altatus.com> wrote:
> 
> > What you seem to be asking for is the capability of dropping
> > buffers over the (kernel) fence and idemnifying the application
> > from any further responsibility, i.e. a hard assurance
> > that either the kernel will persist the pages or it will
> > keep them around till the application recovers them
> > asynchronously, the filesystem is unmounted, or the system
> > is rebooted.
> >
> 
> That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of
this design decision: mainly, how you would recover memory when no
application is interested in clearing the error. At which point
other applications with different priorities will find this position
rather unreasonable since there can be no way out of it for them.
Good luck convincing any OS kernel upstream to go with this design.

> The whole _point_ of an Operating System should be that you can do exactly
> that. As a developer I should be able to call write() and fsync() and know
> that if both calls have succeeded then the result is on disk, no matter
> what another application has done in the meantime. If that's a "difficult"
> problem then that's the OS's problem, not mine. If the OS doesn't do that,
> it's _not_doing_its_job_.

No OS kernel that I know of provides any promises for atomicity of a
write()+fsync() sequence, unless one is using O_SYNC. It doesn't
provide you with isolation either, as this is delegated to userspace,
where processes that share a file should coordinate accordingly.

It's not a difficult problem, but rather the kernels provide a common
denominator of possible interfaces and designs that could accommodate
a wider range of potential application scenarios for which the kernel
cannot possibly anticipate requirements. There have been plenty of
experimental works for providing a transactional (ACID) filesystem
interface to applications. On the opposite end, there have been quite
a few commercial databases that completely bypass the kernel storage
stack. But I would assume it is reasonable to figure out something
between those two extremes that can work in a "portable" fashion.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos <ail...@altatus.com> wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >> On 8 April 2018 at 04:27, Craig Ringer <cr...@2ndquadrant.com> wrote:
> >> > On 8 April 2018 at 10:16, Thomas Munro <thomas.mu...@enterprisedb.com>
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible (according to freebsd that keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors so it is pointless to
> > keep them dirty). You'd need a specialized interface to clear-out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
> 
> Well firstly that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even unrecoverable errors for a single write
> doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment.
The contract between the block layer and the filesystem layer is
assumed to be that of, when an EIO is propagated up to the fs,
then you may assume that all possibilities for recovering have
been exhausted in lower layers of the stack. Mind you, I am not
claiming that this contract is either documented or necessarily
respected (in fact there have been studies on the error propagation
and handling of the block layer, see [1]). Let us assume that
this is the design contract though (which appears to be the case
across a number of open-source kernels), and if not - it's a bug.
In this case, indeed the specific write()s will never be able
to succeed in the future, at least not as long as the BIOs are
allocated to the specific failing LBAs.

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it. What's the
> point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases
where one may encounter localized unrecoverable medium errors
(specific to certain LBAs) that are non-maskable from the block
layer and below. That does not mean that the device is dropped
at all, so it does make sense to continue all other operations
to all other regions of the device that are functional. In cases
of total device failure, then the filesystem will prevent you
from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies though that apart from the kernel having
to keep around the dirtied-but-unrecoverable pages for an
unbounded time, that there's further an interface for obtaining
the exact failed pages so that you can read them back. This in
turn means that there needs to be an association between the
fsync() caller and the specific dirtied pages that the caller
intents to drain (for which we'd need an fsync_range(), among
other things). BTW, currently the failed writebacks are not
dropped from memory, but rather marked clean. They could be
lost though due to memory pressure or due to explicit request
(e.g. proc drop_caches), unless mlocked.

There is a clear responsibility of the application to keep
its buffers around until a successful fsync(). The kernels
do report the error (albeit with all the complexities of
dealing with the interface), at which point the application
may not assume that the write()s where ever even buffered
in the kernel page cache in the first place.

What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and idemnifying the application
from any further responsibility, i.e. a hard assurance
that either the kernel will persist the pages or it will
keep them around till the application recovers them
asynchronously, the filesystem is unmounted, or the system
is rebooted.

Best regards,
Anthony

[1] 
https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Anthony Iliopoulos
On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> On 8 April 2018 at 04:27, Craig Ringer  wrote:
> > On 8 April 2018 at 10:16, Thomas Munro 
> > wrote:
> >
> > If the kernel does writeback in the middle, how on earth is it supposed to
> > know we expect to reopen the file and check back later?
> >
> > Should it just remember "this file had an error" forever, and tell every
> > caller? In that case how could we recover? We'd need some new API to say
> > "yeah, ok already, I'm redoing all my work since the last good fsync() so
> > you can clear the error flag now". Otherwise it'd keep reporting an error
> > after we did redo to recover, too.
> 
> There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
> track of any errors. We just need fsync to accurately report whether
> all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been
written out, due to reasons outlined before. As such it may make
some sense to maintain some tracking regarding errors even after
marking failed dirty pages as clean (in fact it has been proposed,
but this introduces memory overhead).

> again the kernel needs to initiate i/o on all the dirty buffers and
> block until they complete successfully. If they complete successfully
> then nobody cares whether they had some failure in the past when i/o
> was initiated at some point in the past.

The question is, what should the kernel and application do in cases
where this is simply not possible (according to freebsd that keeps
dirty pages around after failure, for example, -EIO from the block
layer is a contract for unrecoverable errors so it is pointless to
keep them dirty). You'd need a specialized interface to clear-out
the errors (and drop the dirty pages), or potentially just remount
the filesystem.

> The problem is not that errors aren't been tracked correctly. The
> problem is that dirty buffers are being marked clean when they haven't
> been written out. They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".
> 
> As long as any error means the kernel has discarded writes then
> there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards writes
or not would not really help (see above). It is more a matter of
proper "reporting contract" between userspace and kernel, and tracking
would be a way for facilitating this vs. having a more complex userspace
scheme (as described by others in this thread) where synchronization
for fsync() is required in a multi-process application.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Anthony Iliopoulos
On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:
> On 3 April 2018 at 14:36, Anthony Iliopoulos <ail...@altatus.com> wrote:
> 
> > If EIO persists between invocations until explicitly cleared, a process
> > cannot possibly make any decision as to if it should clear the error
> 
> I still don't understand what "clear the error" means here. The writes
> still haven't been written out. We don't care about tracking errors,
> we just care whether all the writes to the file have been flushed to
> disk. By "clear the error" you mean throw away the dirty pages and
> revert part of the file to some old data? Why would anyone ever want
> that?

It means that the responsibility of recovering the data is passed
back to the application. The writes may never be able to be written
out. How would a kernel deal with that? Either discard the data
(and have the writer acknowledge) or buffer the data until reboot
and simply risk going OOM. It's not what someone would want, but
rather *need* to deal with, one way or the other. At least on the
application-level there's a fighting chance for restoring to a
consistent state. The kernel does not have that opportunity.

> > But instead of deconstructing and debating the semantics of the
> > current mechanism, why not come up with the ideal desired form of
> > error reporting/tracking granularity etc., and see how this may be
> > fitted into kernels as a new interface.
> 
> Because Postgres is portable software that won't be able to use some
> Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument, Pg is admittedly using non-portable
interfaces (e.g the sync_file_range()). While it's nice to avoid platform
specific hacks, expecting that the POSIX semantics will be consistent
across systems is simply a 90's pipe dream. While it would be lovely
to have really consistent interfaces for application writers, this is
simply not going to happen any time soon.

And since those problematic semantics of fsync() appear to be prevalent
in other systems as well that are not likely to be changed, you cannot
rely on preconception that once buffers are handed over to kernel you
have a guarantee that they will be eventually persisted no matter what.
(Why even bother having fsync() in that case? The kernel would eventually
evict and writeback dirty pages anyway. The point of reporting the error
back to the application is to give it a chance to recover - the kernel
could repeat "fsync()" itself internally if this would solve anything).

> reporting system anyways. It just needs to know when all writes have
> been synced to disk.

Well, it does know when *some* writes have *not* been synced to disk,
exactly because the responsibility is passed back to the application.
I do realize this puts more burden back to the application, but what
would a viable alternative be? Would you rather have a kernel that
risks periodically going OOM due to this design decision?

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Anthony Iliopoulos
On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:
> On 3 April 2018 at 11:35, Anthony Iliopoulos <ail...@altatus.com> wrote:
> > Hi Robert,
> >
> > Fully agree, and the errseq_t fixes have dealt exactly with the issue
> > of making sure that the error is reported to all file descriptors that
> > *happen to be open at the time of error*. But I think one would have a
> > hard time defending a modification to the kernel where this is further
> > extended to cover cases where:
> >
> > process A does write() on some file offset which fails writeback,
> > fsync() gets EIO and exit()s.
> >
> > process B does write() on some other offset which succeeds writeback,
> > but fsync() gets EIO due to (uncleared) failures of earlier process.
> 
> 
> Surely that's exactly what process B would want? If it calls fsync and
> gets a success and later finds out that the file is corrupt and didn't
> match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading
and writing to completely disjoint regions from those of process A,
and as such not really caring about earlier failures, only wanting
to ensure its own writes go all the way through. But even if it did
care, the file interfaces make no transactional guarantees. Even
without fsync() there is nothing preventing process B from reading
dirty pages from process A, and based on their content proceed to
to its own business and write/persist new data, while process A
further modifies the not-yet-flushed pages in-memory before flushing.
In this case you'd need explicit synchronization/locking between
the processes anyway, so why would fsync() be an exception?

> This seems like an attempt to co-opt fsync for a new and different
> purpose for which it's poorly designed. It's not an async error
> reporting mechanism for writes. It would be useless as that as any
> process could come along and open your file and eat the errors for
> writes you performed. An async error reporting mechanism would have to
> document which writes it was giving errors for and give you ways to
> control that.

The errseq_t fixes deal with that; errors will be reported to any
process that has an open fd, irrespective to who is the actual caller
of the fsync() that may have induced errors. This is anyway required
as the kernel may evict dirty pages on its own by doing writeback and
as such there needs to be a way to report errors on all open fds.

> The semantics described here are useless for everyone. For a program
> needing to know the error status of the writes it executed, it doesn't
> know which writes are included in which fsync call. For a program

If EIO persists between invocations until explicitly cleared, a process
cannot possibly make any decision as to if it should clear the error
and proceed or some other process will need to leverage that without
coordination, or which writes actually failed for that matter.
We would be back to the case of requiring explicit synchronization
between processes that care about this, in which case the processes
may as well synchronize over calling fsync() in the first place.

Having an opt-in persisting EIO per-fd would practically be a form
of "contract" between "cooperating" processes anyway.

But instead of deconstructing and debating the semantics of the
current mechanism, why not come up with the ideal desired form of
error reporting/tracking granularity etc., and see how this may be
fitted into kernels as a new interface.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Anthony Iliopoulos
Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:
> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ail...@altatus.com> wrote:
> > Given precisely that the dirty pages which cannot been written-out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
> 
> Like other people here, I think this is 100% unreasonable, starting
> with "the dirty pages which cannot been written out are practically
> thrown away".  Who decided that was OK, and on the basis of what
> wording in what specification?  I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux
glibc configuration and associated manpage are probably wrong in
stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
matches that of the flexibility allowed by not supporting SIO.
There's a long history of brokenness between linux and posix,
and I think there was never an intention of conforming to the
standard.

> throw away the user's data.  If the writes are going to fail, then let
> them keep on failing every time.  *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

I believe (as tried to explain earlier) there is a certain assumption
being made that the writer and original owner of data is responsible
for dealing with potential errors in order to avoid data loss (which
should be only of interest to the original writer anyway). It would
be very questionable for the interface to persist the error while
subsequent writes and fsyncs to different offsets may as well go through.
Another process may need to write into the file and fsync, while being
unaware of those newly introduced semantics is now faced with EIO
because some unrelated previous process failed some earlier writes
and did not bother to clear the error for those writes. In a similar
scenario where the second process is aware of the new semantics, it would
naturally go ahead and clear the global error in order to proceed
with its own write()+fsync(), which would essentially amount to the
same problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it.  If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem.  If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue
of making sure that the error is reported to all file descriptors that
*happen to be open at the time of error*. But I think one would have a
hard time defending a modification to the kernel where this is further
extended to cover cases where:

process A does write() on some file offset which fails writeback,
fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback,
but fsync() gets EIO due to (uncleared) failures of earlier process.

This would be a highly user-visible change of semantics from edge-
triggered to level-triggered behavior.

> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds.  Then

That's the only way to think about fsync() guarantees unless you
are on a kernel that keeps retrying to persist dirty pages. Assuming
such a model, after repeated and unrecoverable hard failures the
process would have to explicitly inform the kernel to drop the dirty
pages. All the process could do at that point is read back to userspace
the dirty/failed pages and attempt to rewrite them at a different place
(which is current possible too). Most applications would not bother
though to inform the kernel and drop the permanently failed pages;
and thus someone eventually would hit the case that a large amount
of failed writeback pages are running his server out of memory,
at which point people will complain that those semantics are completely
unreasonable.

> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page.  Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Not sure I understand this case. The application may indeed r

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
>
> fsync() doesn't reflect the status of given pages, however, it reflects
> the status of the file descriptor, and as such the file, on which it's
> called.  This notion that fsync() is actually only responsible for the
> changes which were made to a file since the last fsync() call is pure
> foolishness.  If we were able to pass a list of pages or data ranges to
> fsync() for it to verify they're on disk then perhaps things would be
> different, but we can't, all we can do is ask to "please flush all the
> dirty pages associated with this file descriptor, which represents this
> file we opened, to disk, and let us know if you were successful."
>
> Give us a way to ask "are these specific pages written out to persistant
> storage?" and we would certainly be happy to use it, and to repeatedly
> try to flush out pages which weren't synced to disk due to some
> transient error, and to track those cases and make sure that we don't
> incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy
interface but further changing its established semantics (no matter how
unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only
apply to pages that cover the requested file ranges. There are a few
caveats though:

(a) it still messes with the top-level error reporting so mixing it
with callers that use fsync() and do care about errors will produce
the same issue (clearing the error status).

(b) the error-reporting granularity is coarse (failure reporting applies
to the entire requested range so you still don't know which particular
pages/file sub-ranges failed writeback)

(c) the same "report and forget" semantics apply to repeated invocations
of the sync_file_range() call, so again action will need to be taken
upon first error encountered for the particular ranges.

> > The application will need to deal with that first error irrespective of
> > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > demarcates an epoch for which it reports potential errors, so the caller
> > needs to take responsibility for that particular epoch.
> 
> We do deal with that error- by realizing that it failed and later
> *retrying* the fsync(), which is when we get back an "all good!
> everything with this file descriptor you've opened is sync'd!" and
> happily expect that to be truth, when, in reality, it's an unfortunate
> lie and there are still pages associated with that file descriptor which
> are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work
though, exactly because the nature of the errors: even if the kernel
retained the dirty bits on the failed pages, retrying persisting them
on the same disk location would simply fail. Instead the kernel opts
for marking those pages clean (since there is no other recovery
strategy), and reporting once to the caller who can potentially deal
with it in some manner. It is sadly a bad and undocumented convention.

> Consider two independent programs where the first one writes to a file
> and then calls the second one whose job it is to go out and fsync(),
> perhaps async from the first, those files.  Is the second program
> supposed to go write to each page that the first one wrote to, in order
> to ensure that all the dirty bits are set so that the fsync() will
> actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather
than fsync(), but as long as an application needs to ensure data
are persisted to storage, it needs to retain those data in its heap
until fsync() is successful instead of discarding them and relying
on the kernel after write(). The pattern should be roughly like:
write() -> fsync() -> free(), rather than write() -> free() -> fsync().
For example, if a partition gets full upon fsync(), then the application
has a chance to persist the data in a different location, while
the kernel cannot possibly make this decision and recover.

> > Callers that are not affected by the potential outcome of fsync() and
> > do not react on errors, have no reason for calling it in the first place
> > (and thus masking failure from subsequent callers that may indeed care).
> 
> Reacting on an error from an fsync() call could, based on how it's
> documented and actually implemented in other OS's, mean "run another
> fsync() to see if the error has resolved itself."  Requiring that to
> mean "you have to go dirty all of the pages you previously dirtied to
> actually get a subsequent fsync() to do anything" is really just not
> reasonable- a given program may have no idea what was written to
> previously nor any particular reason to need to know, on the expectation
> that the fsync() call will flush any dirty pages, as it's 

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > > clear the error status in the few cases that the application that wants
> > > to gracefully deal with the case.
> > 
> > Given precisely that the dirty pages which cannot been written-out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
> 
> Meh^2.
> 
> "no reason" - except that there's absolutely no way to know what state
> the data is in. And that your application needs explicit handling of
> such failures. And that one FD might be used in a lots of different
> parts of the application, that fsyncs in one part of the application
> might be an ok failure, and in another not.  Requiring explicit actions
> to acknowledge "we've thrown away your data for unknown reason" seems
> entirely reasonable.

As long as fsync() indicates error on first invocation, the application
is fully aware that between this point of time and the last call to fsync()
data has been lost. Persisting this error any further does not change this
or add any new info - on the contrary it adds confusion as subsequent write()s
and fsync()s on other pages can succeed, but will be reported as failures.

The application will need to deal with that first error irrespective of
subsequent return codes from fsync(). Conceptually every fsync() invocation
demarcates an epoch for which it reports potential errors, so the caller
needs to take responsibility for that particular epoch.

Callers that are not affected by the potential outcome of fsync() and
do not react on errors, have no reason for calling it in the first place
(and thus masking failure from subsequent callers that may indeed care).

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:
> > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> > > Craig Ringer <cr...@2ndquadrant.com> writes:
> > > > So we should just use the big hammer here.
> > >
> > > And bitch, loudly and publicly, about how broken this kernel behavior is.
> > > If we make enough of a stink maybe it'll get fixed.
> > 
> > It is not likely to be fixed (beyond what has been done already with the
> > manpage patches and errseq_t fixes on the reporting level). The issue is,
> > the kernel needs to deal with hard IO errors at that level somehow, and
> > since those errors typically persist, re-dirtying the pages would not
> > really solve the problem (unless some filesystem remaps the request to a
> > different block, assuming the device is alive).
> 
> Throwing away the dirty pages *and* persisting the error seems a lot
> more reasonable. Then provide a fcntl (or whatever) extension that can
> clear the error status in the few cases that the application that wants
> to gracefully deal with the case.

Given precisely that the dirty pages which cannot been written-out are
practically thrown away, the semantics of fsync() (after the 4.13 fixes)
are essentially correct: the first call indicates that a writeback error
indeed occurred, while subsequent calls have no reason to indicate an error
(assuming no other errors occurred in the meantime).

The error reporting is thus consistent with the intended semantics (which
are sadly not properly documented). Repeated calls to fsync() simply do not
imply that the kernel will retry to writeback the previously-failed pages,
so the application needs to be aware of that. Persisting the error at the
fsync() level would essentially mean moving application policy into the
kernel.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> Craig Ringer  writes:
> > So we should just use the big hammer here.
>
> And bitch, loudly and publicly, about how broken this kernel behavior is.
> If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the
manpage patches and errseq_t fixes on the reporting level). The issue is,
the kernel needs to deal with hard IO errors at that level somehow, and
since those errors typically persist, re-dirtying the pages would not
really solve the problem (unless some filesystem remaps the request to a
different block, assuming the device is alive). Keeping around dirty
pages that cannot possibly be written out is essentially a memory leak,
as those pages would stay around even after the application has exited.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

> >> Yeah, I see why you want to PANIC.
> >
> > Indeed. Even doing that leaves question marks about all the kernel
> > versions before v4.13, which at this point is pretty much everything
> > out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.

Best regards,
Anthony