Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-29 Thread Craig Ringer
For archive readers, this thread is continued as
https://www.postgresql.org/message-id/20180427222842.in2e4mibx45zd...@alap3.anarazel.de
and there's a follow-up lwn article at
https://lwn.net/Articles/752613/ too.



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-26 Thread Thomas Munro
On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian  wrote:
> On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
>> > TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
>> > least on Linux. When fsync() returns success it means "all writes since the
>> > last fsync have hit disk" but we assume it means "all writes since the last
>> > SUCCESSFUL fsync have hit disk".
>>
>> > But then we retried the checkpoint, which retried the fsync(). The retry
>> > succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
>>
>> Random other thing we should look at: Some filesystems (nfs yes, xfs
>> ext4 no) flush writes at close(2). We check close()'s return code but just
>> log it... So close() counts as an fsync for such filesystems.
>
> Well, that's interesting.  You might remember that NFS does not reserve
> space for writes the way local file systems such as ext4/xfs do.  For that
> reason, we might be able to capture the out-of-space error on close and
> exit sooner for NFS.

It seems like some implementations flush on close and therefore
discover ENOSPC problems at that point, unless they have NFSv4 (RFC
3530) "write delegation" with a promise from the server that a certain
amount of space is available.  It seems like you can't count on that
in any way though, because it's the server that decides when to
delegate and how much space to promise is preallocated, not the
client.  So in userspace you always need to be able to handle errors,
including ENOSPC, returned by close(), and if you ignore that and
you're using an operating system that immediately incinerates all
evidence after telling you (so that a later fsync() doesn't fail),
you're in trouble.
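The practical upshot is easy to act on in userspace. A minimal sketch (Python used here for brevity; `durable_write` is an illustrative name, not PostgreSQL code) that propagates errors from write(), fsync() and, crucially, close():

```python
import os
import tempfile

def durable_write(path, data):
    """Write data, propagating errors from write(), fsync() AND close().

    On flush-on-close filesystems such as NFS, close() may be the first
    call to report ENOSPC, so its result must not be ignored.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)   # raises OSError (e.g. EIO) if writeback failed
    finally:
        os.close(fd)   # also raises OSError, e.g. ENOSPC on NFS

# usage: any of the three failure points surfaces as an exception
with tempfile.TemporaryDirectory() as d:
    durable_write(os.path.join(d, "segment"), b"hello")
```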

Some relevant code:

https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849
https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618

It looks like the bleeding edge of the NFS spec includes a new
ALLOCATE operation that should be able to support posix_fallocate()
(if we were to start using that for extending files):

https://tools.ietf.org/html/rfc7862#page-64

I'm not sure how reliable [posix_]fallocate is on NFS in general
though, and it seems that there are fall-back implementations of
posix_fallocate() that write zeros (or even just feign success?), which
probably won't do anything useful here if not also flushed (that
fallback strategy might only work on eager-reservation filesystems
that don't have direct fallocate support?).  So there are several layers
(libc, kernel, NFS client, NFS server) that'd need to be aligned for
that to work, and it's not clear how a humble userspace program is
supposed to know whether they are.
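A sketch of what calling it from userspace might look like (Python for brevity; `try_reserve` is an illustrative helper, and as noted above a libc fallback may feign success, so the reservation is only as real as the layers underneath):

```python
import errno
import os

def try_reserve(fd, size):
    """Attempt an eager space reservation via posix_fallocate().

    Returns True if the kernel/libc accepted the reservation, False if
    the operation is unsupported there.  A genuine ENOSPC is re-raised
    so the caller learns about exhaustion now, not at a later flush.
    """
    try:
        os.posix_fallocate(fd, 0, size)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            raise                  # real out-of-space: surface it now
        return False               # unsupported: fall back to write+fsync
```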

I guess if you could find a way to amortise the cost of extending
(like Oracle et al do by extending big container datafiles 10MB at a
time or whatever), then simply writing zeros and flushing when doing
that might work out OK, so you wouldn't need such a thing?  (Unless of
course it's a COW filesystem, but that's a different can of worms.)

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-25 Thread Craig Ringer
On 24 April 2018 at 04:14, Andres Freund  wrote:

> I'm at LSF/MM to discuss future behaviour of linux here, but that's how it
> is right now.


Interim LWN.net coverage of that can be found here:
https://lwn.net/Articles/752613/

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-23 Thread Bruce Momjian
On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
> > TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
> > least on Linux. When fsync() returns success it means "all writes since the
> > last fsync have hit disk" but we assume it means "all writes since the last
> > SUCCESSFUL fsync have hit disk".
> 
> > But then we retried the checkpoint, which retried the fsync(). The retry
> > succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
> 
> Random other thing we should look at: Some filesystems (nfs yes, xfs
> ext4 no) flush writes at close(2). We check close()'s return code but just
> log it... So close() counts as an fsync for such filesystems.

Well, that's interesting.  You might remember that NFS does not reserve
space for writes the way local file systems such as ext4/xfs do.  For that
reason, we might be able to capture the out-of-space error on close and
exit sooner for NFS.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-23 Thread Andres Freund
Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
> TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
> least on Linux. When fsync() returns success it means "all writes since the
> last fsync have hit disk" but we assume it means "all writes since the last
> SUCCESSFUL fsync have hit disk".

> But then we retried the checkpoint, which retried the fsync(). The retry
> succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.

Random other thing we should look at: Some filesystems (nfs yes, xfs
ext4 no) flush writes at close(2). We check close()'s return code but just
log it... So close() counts as an fsync for such filesystems.

I'm at LSF/MM to discuss future behaviour of linux here, but that's how it
is right now.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-20 Thread Bruce Momjian
On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 19:46, Bruce Momjian  wrote:
> 
> > So, if sync mode passes the write to NFS, and NFS pre-reserves write
> > space, and throws an error on reservation failure, that means that NFS
> > will not corrupt a cluster on out-of-space errors.
> 
> Yeah. I need to verify in a concrete test case.

Thanks.

> The thing is that write() is allowed to be asynchronous anyway. Most
> file systems choose to implement eager reservation of space, but it's
> not mandated. AFAICS that's largely a historical accident to keep
> applications happy, because FSes used to *allocate* the space at
> write() time too, and when they moved to delayed allocations, apps
> tended to break too easily unless they at least reserved space. NFS
> would have to do a round-trip on write() to reserve space.
> 
> The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:
> 
> "
>A successful return from write() does not make any guarantee that
>data has been committed to disk.  On some filesystems, including NFS,
>it does not even guarantee that space has successfully been reserved
>for the data.  In this case, some errors might be delayed until a
>future write(2), fsync(2), or even close(2).  The only way to be sure
>is to call fsync(2) after you are done writing all your data.
> "
> 
> ... and I'm inclined to believe it when it refuses to make guarantees.
> Especially lately.

Uh, even calling fsync after write isn't 100% safe, since the kernel
could have flushed the dirty pages to storage and failed, and a later
fsync would still succeed.  I realize newer kernels have that fixed for
files kept open during that operation, but that is the minority of installs.
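The consequence for application code is that a failed fsync() must be treated as fatal rather than retried. A minimal sketch of that policy (Python for brevity; `checkpoint_fsync` is illustrative, not the actual PostgreSQL code path):

```python
import os
import sys

def checkpoint_fsync(fd):
    """fsync() treating any failure as fatal.

    On Linux the error state on the dirty pages is cleared once it has
    been reported, so a *retried* fsync() can return success without the
    data ever having reached disk.  The only safe reaction is to abort
    ("PANIC") and let WAL replay reconstruct a consistent state.
    """
    try:
        os.fsync(fd)
    except OSError as e:
        sys.stderr.write("PANIC: could not fsync: %s\n" % e)
        os._exit(1)   # never retry; recover from WAL instead
```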

> The idea is that when the SAN's actual physically allocated storage
> gets to 40TB it starts telling you to go buy another rack of storage
> so you don't run out. You don't have to resize volumes, resize file
> systems, etc. All the storage space admin is centralized on the SAN
> and storage team, and your sysadmins, DBAs and app devs are none the
> wiser. You buy storage when you need it, not when the DBA demands they
> need a 200% free space margin just in case. Whether or not you agree
> with this philosophy or think it's sensible is kind of moot, because
> it's an extremely widespread model, and servers you work on may well
> be backed by thin provisioned storage _even if you don't know it_.


> Most FSes only touch the blocks on dirty writeback, or sometimes
> lazily as part of delayed allocation. So if your SAN is running out of
> space and there's 100MB free, each of your 100 FSes may have
> decremented its freelist by 2MB and be happily promising more space to
> apps on write() because, well, as far as they know they're only 50%
> full. When they all do dirty writeback and flush to storage, kaboom,
> there's nowhere to put some of the data.

I see what you are saying --- that the kernel is reserving the write
space from its free space, but the free space doesn't all exist.  I am
not sure how we can tell people to make sure the file system free space
is real.

> You'd have to actually force writes to each page through to the
> backing storage to know for sure the space existed. Yes, the docs say
> 
> "
>After a
>successful call to posix_fallocate(), subsequent writes to bytes in
>the specified range are guaranteed not to fail because of lack of
>disk space.
> "
> 
> ... but they're speaking from the filesystem's perspective. If the FS
> doesn't dirty and flush the actual blocks, a thin provisioned storage
> system won't know.
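A sketch of the one approach that does inform the backing store: extend by physically writing zeros and flushing them (Python for brevity; the chunk size and helper name are illustrative):

```python
import os

CHUNK = 64 * 1024  # illustrative extension granularity

def extend_with_zeros(fd, new_size):
    """Extend a file by writing real zero blocks and flushing them.

    Unlike a bare posix_fallocate(), this dirties actual blocks and
    pushes them through to the backing store, so thin-provisioned
    storage that is out of space has to say so (ENOSPC or EIO) here,
    not at some later unrelated flush.
    """
    pos = os.fstat(fd).st_size
    while pos < new_size:
        n = min(CHUNK, new_size - pos)
        os.pwrite(fd, b"\x00" * n, pos)
        pos += n
    os.fsync(fd)   # the flush is what forces the storage to commit space
```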

Frankly, in what cases will a write fail _for_ lack of free space?  It
could be a new WAL file (not recycled), or pages added to the end of
the table.

Is that it?  It doesn't sound too terrible.  If we can eliminate the
corruption due to free space exhaustion, it would be a big step
forward.

The next most common failure would be temporary storage failure or
storage communication failure.

Permanent storage failure is "game over" so we don't need to worry about
that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Craig Ringer
On 19 April 2018 at 07:31, Mark Kirkwood  wrote:
> On 19/04/18 00:45, Craig Ringer wrote:
>
>>
>> I guarantee you that when you create a 100GB EBS volume on AWS EC2,
>> you don't get 100GB of storage preallocated. AWS are probably pretty
>> good about not running out of backing store, though.
>>
>>
>
> Some db folks (used to, anyway) advise dd'ing to your freshly attached
> devices on AWS (for performance mainly, IIRC), but that would help prevent
> some failure scenarios for any thin provisioned storage (but probably really
> annoy the admins thereof).

This still makes a lot of sense on AWS EBS, particularly when using a
volume created from a non-empty snapshot. Performance of S3-snapshot
based EBS volumes is spectacularly awful, since they're copy-on-read.
Reading the whole volume helps a lot.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Mark Kirkwood

On 19/04/18 00:45, Craig Ringer wrote:



I guarantee you that when you create a 100GB EBS volume on AWS EC2,
you don't get 100GB of storage preallocated. AWS are probably pretty
good about not running out of backing store, though.




Some db folks (used to, anyway) advise dd'ing to your freshly attached 
devices on AWS (for performance mainly, IIRC), but that would help 
prevent some failure scenarios for any thin provisioned storage (but 
probably really annoy the admins thereof).


regards
Mark



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Bruce Momjian
On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:
> On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:
> > On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> > > That doesn't seem like a very practical way. It's better than nothing,
> > > of course, but I wonder how would that work with containers (where I
> > > think you may not have access to the kernel log at all). Also, I'm
> > > pretty sure the messages do change based on kernel version (and possibly
> > > filesystem) so parsing it reliably seems rather difficult. And we
> > > probably don't want to PANIC after I/O error on an unrelated device, so
> > > we'd need to understand which devices are related to PostgreSQL.
> 
> You can certainly have access to the kernel log in containers. I'd
> assume such a script wouldn't check various system logs but instead tail
> /dev/kmsg or such. Otherwise the variance between installations would be
> too big.

I was thinking 'dmesg', but the result is similar.

> There's not *that* many different types of error messages and they don't
> change that often. If we'd just detect errors for the most common FSs
> we'd probably be good. Detecting a few general storage layer messages
> wouldn't be that hard either; most things have been unified over the
> last ~8-10 years.

It is hard to know exactly what the message format should be for each
operating system because it is hard to generate them on demand, and we
would need to filter based on Postgres devices.

The other issue is that once you see a message during a checkpoint and
exit, you don't want to see that message again after the problem has
been fixed and the server restarted.  The simplest solution is to save
the output of the last check and look for only new entries.  I am
attaching a script I run every 15 minutes from cron that emails me any
unexpected kernel messages.

I am thinking we would need a contrib module with sample scripts for
various operating systems.

> > Replying to your specific case, I am not sure how we would use a script
> > to check for I/O errors/space-exhaustion if the postgres user doesn't
> > have access to it.
> 
> Not sure what you mean?
> 
> Space exhaustion can be checked when allocating space, FWIW. We'd just
> need to use posix_fallocate et al.

I was asking about cases where permissions prevent viewing of kernel
messages.  I think you can view them in containers, but in virtual
machines you might not have access to the host operating system's kernel
messages, and that might be where they are.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
:

[ "$#" -gt 1 -o \( "$#" -eq 1 -a "$1" != "-i" \) ] && echo "Usage:  $(basename $0) [-i]" 1>&2 && exit 1

# change this to a safe directory
TMP="$HOME"

dmesg -T |
egrep -v '\<(mounted filesystem with ordered data mode|MSI/MSI-X|lowering kernel.perf_event_max_sample_rate|promiscuous mode|bad checksum|fragment too large|short packet)\>' > $TMP/0

if [ "$1" != "-i" -a -e /u/dmesg.log ]
then	diff /u/dmesg.log $TMP/0 | sed -n 's/^> //p' > $TMP/1
	[ -s $TMP/1 ] && cat $TMP/1 | mail -s 'changed dmesg output' root
fi

[ -s $TMP/1 -o "$1" = "-i" ] && cp $TMP/0 /u/dmesg.log


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Bruce Momjian
On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 05:19, Bruce Momjian  wrote:
> > On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
> >> On 10 April 2018 at 02:59, Craig Ringer  wrote:
> >>
> >> > Nitpick: In most cases the kernel reserves disk space immediately,
> >> > before returning from write(). NFS seems to be the main exception
> >> > here.
> >>
> >> I'm kind of puzzled by this. Surely NFS servers store the data in the
> >> filesystem using write(2) or the in-kernel equivalent? So if the
> >> server is backed by a filesystem where write(2) preallocates space,
> >> surely the NFS server must behave as if it's preallocating as well? I
> >> would expect NFS to provide basically the same set of possible
> >> failures as the underlying filesystem (as long as you don't enable
> >> nosync, of course).
> >
> > I don't think the write is _sent_ to the NFS server at the time of the write,
> > so while the NFS side would reserve the space, it might not get the write
> > request until after we return write success to the process.
> 
> It should be sent if you're using sync mode.
> 
> From my reading of the docs, if you're using async mode you're already
> open to so many potential corruptions you might as well not bother.
> 
> I need to look into this more re NFS and expand the tests I have to
> cover that properly.

So, if sync mode passes the write to NFS, and NFS pre-reserves write
space, and throws an error on reservation failure, that means that NFS
will not corrupt a cluster on out-of-space errors.

So, what about thin provisioning?  I can understand sharing _free_ space
among file systems, but once a write arrives I assume it reserves the
space.  Is the problem that many thin provisioning systems don't have a
sync mode, so you can't force the write to appear on the device before
an fsync?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Craig Ringer
On 10 April 2018 at 20:15, Craig Ringer  wrote:
> On 10 April 2018 at 14:10, Michael Paquier  wrote:
>
>> Well, I think that there is place for improving reporting of failure
>> in file_utils.c for frontends, or at worst have an exit() for any kind
>> of critical failures equivalent to a PANIC.
>
> Yup.
>
> In the mean time, speaking of PANIC, here's the first cut patch to
> make Pg panic on fsync() failures. I need to do some closer review and
> testing, but it's presented here for anyone interested.
>
> I intentionally left some failures as ERROR not PANIC, where the
> entire operation is done as a unit, and an ERROR will cause us to
> retry the whole thing.
>
> For example, when we fsync() a temp file before we move it into place,
> there's no point panicing on failure, because we'll discard the temp
> file on ERROR and retry the whole thing.
>
> I've verified that it works as expected with some modifications to the
> test tool I've been using (pushed).
>
> The main downside is that if we panic in redo, we don't try again. We
> throw our toys and shut down. But arguably if we get the same I/O
> error again in redo, that's the right thing to do anyway, and quite
> likely safer than continuing to ERROR on checkpoints indefinitely.
>
> Patch attached.
>
> To be clear, this patch only deals with the issue of us retrying
> fsyncs when it turns out to be unsafe. This does NOT address any of
> the issues where we won't find out about writeback errors at all.

Thinking about this some more, it'll definitely need a GUC to force it
to continue despite a potential hazard. Otherwise we go backwards from
the status quo if we're in a position where uptime is vital and
correctness problems can be tolerated or repaired later. Kind of like
zero_damaged_pages, we'll need some sort of
continue_after_fsync_errors GUC.

Without that, we'll panic once, enter redo, and if the problem
persists we'll panic in redo and exit the startup process. That's not
going to help users.

I'll amend the patch accordingly as time permits.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Craig Ringer
On 18 April 2018 at 05:19, Bruce Momjian  wrote:
> On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
>> On 10 April 2018 at 02:59, Craig Ringer  wrote:
>>
>> > Nitpick: In most cases the kernel reserves disk space immediately,
>> > before returning from write(). NFS seems to be the main exception
>> > here.
>>
>> I'm kind of puzzled by this. Surely NFS servers store the data in the
>> filesystem using write(2) or the in-kernel equivalent? So if the
>> server is backed by a filesystem where write(2) preallocates space,
>> surely the NFS server must behave as if it's preallocating as well? I
>> would expect NFS to provide basically the same set of possible
>> failures as the underlying filesystem (as long as you don't enable
>> nosync, of course).
>
> I don't think the write is _sent_ to the NFS server at the time of the write,
> so while the NFS side would reserve the space, it might not get the write
> request until after we return write success to the process.

It should be sent if you're using sync mode.

From my reading of the docs, if you're using async mode you're already
open to so many potential corruptions you might as well not bother.

I need to look into this more re NFS and expand the tests I have to
cover that properly.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Bruce Momjian
On Tue, Apr 17, 2018 at 02:34:53PM -0700, Andres Freund wrote:
> On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:
> > Also, if we are relying on WAL, we have to make sure WAL is actually
> > safe with fsync, and I am betting only the O_DIRECT methods actually
> > are safe:
> > 
> > #wal_sync_method = fsync        # the default is the first option
> >                                 # supported by the operating system:
> >                                 #   open_datasync
> >  -->                            #   fdatasync (default on Linux)
> >  -->                            #   fsync
> >  -->                            #   fsync_writethrough
> >                                 #   open_sync
> > 
> > I am betting the marked wal_sync_method methods are not safe since there
> > is time between the write and fsync.
> 
> Hm? That's not really the issue though? One issue is that retries are
> not necessarily safe in buffered IO, the other that fsync might not
> report an error if the fd was closed and opened.

Well, we have been focusing on the delay between backend or
checkpoint writes and checkpoint fsyncs.  My point is that we have the
same problem in doing a write, _then_ fsync for the WAL.  Yes, the delay
is much shorter, but the issue still exists.  I realize that newer Linux
kernels will not have the problem since the file descriptor remains
open, but the problem exists with older/common Linux kernels.

> O_DIRECT is only used if wal archiving or streaming isn't used, which
> makes it pretty useless anyway.

Uh, don't 'open_datasync' and 'open_sync' fsync as part of the
write, meaning we can't lose the error report like we can with the
others?
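The intuition behind open_datasync can be sketched like this (Python for brevity; `wal_write` is an illustrative helper): with O_DSYNC there is no window between a buffered write and a later fsync in which an error can be lost, because write() itself blocks until the data is durable.

```python
import os

def wal_write(path, buf):
    """Write through an O_DSYNC descriptor, in the spirit of the
    open_datasync wal_sync_method: write() returns only once the data
    (and the metadata needed to read it back) are durable, so an I/O
    error surfaces on the write() call itself instead of on a separate,
    later fsync() whose report could be lost.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        return os.write(fd, buf)   # raises OSError immediately on failure
    finally:
        os.close(fd)
```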

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-17 Thread Bruce Momjian
On Mon, Apr  9, 2018 at 12:25:33PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund  wrote:
> > Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> > absurd, as is some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
> 
> +1
> 
> > We're talking about the storage layer returning an irresolvable
> > error. You're hosed even if we report it properly.  Yes, it'd be nice if
> > we could report it reliably.  But that doesn't change the fact that what
> > we're doing is ensuring that data is safely fsynced unless storage
> > fails, in which case it's not safely fsynced anyway.
> 
> Right. We seem to be implicitly assuming that there is a big
> difference between a problem in the storage layer that we could in
> principle detect, but don't, and any other problem in the storage
> layer. I've read articles claiming that technologies like SMART are
> not really reliable in a practical sense [1], so it seems to me that
> there is reason to doubt that this gap is all that big.
> 
> That said, I suspect that the problems with running out of disk space
> are serious practical problems. I have personally scoffed at stories
> involving Postgres databases corruption that gets attributed to
> running out of disk space. Looks like I was dead wrong.

Yes, I think we need to look at user expectations here.

If the device has a hardware write error, it is good to detect it, and
it might be permanent or temporary, e.g. NAS/NFS.  The longer the error
persists, the more likely the user will expect corruption.  However,
right now, an outage of any length could cause corruption, and it will
not be reported in all cases.

Running out of disk space is also something you don't expect to corrupt
your database --- you expect it to only prevent future writes.  It seems
NAS/NFS and any thin provisioned storage will have this problem, and
again, not always reported.

So, our initial action might just be to educate users that write errors
can cause silent corruption, and out-of-space errors on NAS/NFS and any
thin provisioned storage can cause corruption.

Kernel logs (not just Postgres logs) should be monitored for these
issues and fail-over/recovering might be necessary.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-17 Thread Andres Freund
On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:
> On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> > That doesn't seem like a very practical way. It's better than nothing,
> > of course, but I wonder how would that work with containers (where I
> > think you may not have access to the kernel log at all). Also, I'm
> > pretty sure the messages do change based on kernel version (and possibly
> > filesystem) so parsing it reliably seems rather difficult. And we
> > probably don't want to PANIC after I/O error on an unrelated device, so
> > we'd need to understand which devices are related to PostgreSQL.

You can certainly have access to the kernel log in containers. I'd
assume such a script wouldn't check various system logs but instead tail
/dev/kmsg or such. Otherwise the variance between installations would be
too big.

There's not *that* many different types of error messages and they don't
change that often. If we'd just detect errors for the most common FSs
we'd probably be good. Detecting a few general storage layer messages
wouldn't be that hard either; most things have been unified over the
last ~8-10 years.


> Replying to your specific case, I am not sure how we would use a script
> to check for I/O errors/space-exhaustion if the postgres user doesn't
> have access to it.

Not sure what you mean?

Space exhaustion can be checked when allocating space, FWIW. We'd just
need to use posix_fallocate et al.


> Does O_DIRECT work in such container cases?

Yes.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-17 Thread Andres Freund
On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:
> Also, if we are relying on WAL, we have to make sure WAL is actually
> safe with fsync, and I am betting only the O_DIRECT methods actually
> are safe:
> 
>   #wal_sync_method = fsync        # the default is the first option
>                                   # supported by the operating system:
>                                   #   open_datasync
>    -->                            #   fdatasync (default on Linux)
>    -->                            #   fsync
>    -->                            #   fsync_writethrough
>                                   #   open_sync
> 
> I am betting the marked wal_sync_method methods are not safe since there
> is time between the write and fsync.

Hm? That's not really the issue though? One issue is that retries are
not necessarily safe in buffered IO, the other that fsync might not
report an error if the fd was closed and opened.

O_DIRECT is only used if wal archiving or streaming isn't used, which
makes it pretty useless anyway.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-17 Thread Bruce Momjian
On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it sees something wrong.
> > 
> 
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.

Replying to your specific case, I am not sure how we would use a script
to check for I/O errors/space-exhaustion if the postgres user doesn't
have access to it.  Does O_DIRECT work in such container cases?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-17 Thread Bruce Momjian
On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it sees something wrong.
> > 
> 
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.

My more-considered crazy idea is to have a postgresql.conf setting like
archive_command that allows the administrator to specify a command that
will be run _after_ fsync but before the checkpoint is marked as
complete.  While we can have write flush errors before fsync and never
see the errors during fsync, we will not have write flush errors _after_
fsync that are associated with previous writes.

The script should check for I/O or space-exhaustion errors and return
failure in that case, in which case we can stop, or stop and crash
recover.  We could have an exit of 1 do the former, and an exit of 2 do
the latter.
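A sketch of the proposed hook's control flow (Python for brevity; the exit-code protocol and helper name are the proposal's, nothing like this exists in PostgreSQL today):

```python
import subprocess

def run_post_fsync_hook(command):
    """Run a hypothetical post-fsync, pre-checkpoint-complete hook.

    Exit-code protocol from the proposal: 0 = all clear, 1 = stop the
    server, 2 = stop and perform crash recovery.
    """
    rc = subprocess.run(command, shell=True).returncode
    if rc == 0:
        return "ok"
    if rc == 1:
        return "stop"
    return "stop-and-crash-recover"
```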

Also, if we are relying on WAL, we have to make sure WAL is actually
safe with fsync, and I am betting only the O_DIRECT methods actually
are safe:

#wal_sync_method = fsync        # the default is the first option
                                # supported by the operating system:
                                #   open_datasync
 -->                            #   fdatasync (default on Linux)
 -->                            #   fsync
 -->                            #   fsync_writethrough
                                #   open_sync

I am betting the marked wal_sync_method methods are not safe since there
is time between the write and fsync.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-17 Thread Bruce Momjian
On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
> On 10 April 2018 at 02:59, Craig Ringer  wrote:
> 
> > Nitpick: In most cases the kernel reserves disk space immediately,
> > before returning from write(). NFS seems to be the main exception
> > here.
> 
> I'm kind of puzzled by this. Surely NFS servers store the data in the
> filesystem using write(2) or the in-kernel equivalent? So if the
> server is backed by a filesystem where write(2) preallocates space
> surely the NFS server must behave as if it's preallocating as well? I
> would expect NFS to provide basically the same set of possible
> failures as the underlying filesystem (as long as you don't enable
> nosync of course).

I don't think the write is _sent_ to the NFS server at the time of the
write, so while the NFS side would reserve the space, it might not get
the write request until after we return write success to the process.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-11 Thread Jonathan Corbet
On Wed, 11 Apr 2018 07:29:09 -0700
Andres Freund  wrote:

> If that room can be found, I might be able to make it. Being in SF, I'm
> probably the physically closest PG dev involved in the discussion.

OK, I've dropped the PC a note; hopefully you'll be hearing from them.

jon



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-11 Thread Andres Freund
Hi,

On 2018-04-11 06:05:27 -0600, Jonathan Corbet wrote:
> The event is April 23-25 in Park City, Utah.  I bet that room could be
> found for somebody from the postgresql community, should there be
> somebody who would like to represent the group on this issue.  Let me
> know if an introduction or advocacy from my direction would be helpful.

If that room can be found, I might be able to make it. Being in SF, I'm
probably the physically closest PG dev involved in the discussion.

Thanks for chiming in,

Andres



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-11 Thread Jonathan Corbet
On Tue, 10 Apr 2018 17:40:05 +0200
Anthony Iliopoulos  wrote:

> LSF/MM'18 is upcoming and it would
> have been the perfect opportunity but it's past the CFP deadline.
> It may still be worth contacting the organizers to bring forward
> the issue, and see if there is a chance to have someone from
> Pg invited for further discussions.

FWIW, it is my current intention to be sure that the development
community is at least aware of the issue by the time LSFMM starts.

The event is April 23-25 in Park City, Utah.  I bet that room could be
found for somebody from the postgresql community, should there be
somebody who would like to represent the group on this issue.  Let me
know if an introduction or advocacy from my direction would be helpful.

jon



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Joshua D. Drake

On 04/10/2018 12:51 PM, Joshua D. Drake wrote:

-hackers,

The thread is picking up over on the ext4 list. They don't update their 
archives as often as we do, so I can't link to the discussion. What 
would be the preferred method of sharing the info?


Thanks to Anthony for this link:

http://lists.openwall.net/linux-ext4/2018/04/10/33

It isn't quite real time but it keeps things close enough.

jD




Thanks,

JD





--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Joshua D. Drake

-hackers,

The thread is picking up over on the ext4 list. They don't update their 
archives as often as we do, so I can't link to the discussion. What 
would be the preferred method of sharing the info?


Thanks,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Joshua D. Drake

-hackers,

I reached out to the Linux ext4 devs; here is ty...@mit.edu's response:

"""
Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to
do.  I'll try to give as unbiased a description as possible, but
certainly some of this is going to be filtered by my own biases no
matter how careful I can be.

First of all, what storage devices will do when they hit an exception
condition is quite non-deterministic.  For example, the vast majority
of SSD's are not power fail certified.  What this means is that if
they suffer a power drop while they are doing a GC, it is quite
possible for data written six months ago to be lost as a result.  The
LBA could potentially be far, far away from any LBAs that were
recently written, and there could have been multiple CACHE FLUSH
operations since the LBA in question was last written six months
ago.  No matter; for a consumer-grade SSD, it's possible for
that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that
the only way you can guarantee data robustness is to store multiple
copies and/or use erasure encoding, with some of the copies or shards
written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the
companies who require data robustness, and are either willing to pay
$$$ to an enterprise distro company like Red Hat, or command a large
enough paying customer base that they can afford to dictate terms to
an enterprise distro, or hire a consultant such as Christoph, or have
their own staffed Linux kernel teams, have tended to use O_DIRECT.  So
for better or for worse, there has not been as much investment in
buffered I/O and data robustness in the face of exception handling of
storage devices.

Next, the reason why fsync() has the behaviour that it does is that
one of the most common cases of I/O storage errors in buffered use
cases, certainly as seen by the community distros, is the user who
pulls out a USB stick while it is in use.  In that case, if there are
dirtied pages in the page cache, the question is what can you do?
Sooner or later the writes will time out, and if you leave the pages
dirty, then it effectively becomes a permanent memory leak.  You can't
unmount the file system --- that requires writing out all of the pages
such that the dirty bit is turned off.  And if you don't clear the
dirty bit on an I/O error, then they can never be cleaned.  You can't
even re-insert the USB stick; the re-inserted USB stick will get a new
block device.  Worse, when the USB stick was pulled, it will have
suffered a power drop, and see above about what could happen after a
power drop for non-power fail certified flash devices --- it goes
double for the cheap sh*t USB sticks found in the checkout aisle of
Micro Center.

So this is the explanation for why Linux handles I/O errors by
clearing the dirty bit after reporting the error up to user space.
And why there is not eagerness to solve the problem simply by "don't
clear the dirty bit".  For every one Postgres installation that might
have a better recovery after an I/O error, there's probably a thousand
clueless Fedora and Ubuntu users who will have a much worse user
experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be
switchable on a per-block device basis (or maybe a per-mount basis)
whether or not the dirty bit gets cleared after the error is reported
to userspace.  And perhaps there could be a new unmount flag that
causes all dirty pages to be wiped out, which could be used to recover
after a permanent loss of the block device.  But the question is who
is going to invest the time to make these changes?  If there is a
company who is willing to pay to commission this work, it's almost
certainly soluble.  Or if a company which has a kernel team on staff is
willing to direct an engineer to work on it, it certainly could be
solved.  But again, of the companies who have client code where we
care about robustness and proper handling of failed disk drives, and
which have a kernel team on staff, pretty much all of the ones I can
think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try
to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache
eviction algorithms, which drives them towards the design decision of
effectively implementing the page cache in userspace, and using
O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new
kernel feature implemented to meet your needs, we might be able to
refer you to a company or a consultant who might be able to do that
work.  Let me know off-line if that's the case...

- Ted
"""

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Greg Stark
On 10 April 2018 at 02:59, Craig Ringer  wrote:

> Nitpick: In most cases the kernel reserves disk space immediately,
> before returning from write(). NFS seems to be the main exception
> here.

I'm kind of puzzled by this. Surely NFS servers store the data in the
filesystem using write(2) or the in-kernel equivalent? So if the
server is backed by a filesystem where write(2) preallocates space
surely the NFS server must behave as if it's preallocating as well? I
would expect NFS to provide basically the same set of possible
failures as the underlying filesystem (as long as you don't enable
nosync of course).

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Anthony Iliopoulos
Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:
> On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund  wrote:
> > Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> > absurd, as is some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
> 
> Well, I admit that I wasn't entirely serious about that email, but I
> wasn't entirely not-serious either.  If you can't reliably find
> out whether the contents of the file on disk are the same as the
> contents that the kernel is giving you when you call read(), then you
> are going to have a heck of a time building a reliable system.  If the
> kernel developers are determined to insist on these semantics (and,
> admittedly, I don't know whether that's the case - I've only read
> Anthony's remarks), then I don't really see what we can do except give
> up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the
respective linux kernel maintainers and open up this topic for
more detailed discussions. LSF/MM'18 is upcoming and it would
have been the perfect opportunity but it's past the CFP deadline.
It may still be worth contacting the organizers to bring forward
the issue, and see if there is a chance to have someone from
Pg invited for further discussions.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer  wrote:
> ... but *only if they hit an I/O error* or they're on a FS that
> doesn't reserve space and hit ENOSPC.
>
> It still does 99% of the job. It still flushes all buffers to
> persistent storage and maintains write ordering. It may not detect and
> report failures to the user the way we'd expect it to, yes, and that's
> not great. But it's hardly "throw up our hands and give up" territory
> either. Also, at least for initdb, we can make initdb fsync() its own
> files before close(). Annoying but hardly the end of the world.

I think we'd need every child postgres process started by initdb to do
that individually, which I suspect would slow down initdb quite a lot.
Now admittedly for anybody other than a PostgreSQL developer that's
only a minor issue, and our regression tests mostly run with fsync=off
anyway.  But I have a strong suspicion that our assumptions about how
fsync() reports errors are baked into an awful lot of parts of the
system, and by the time we get unbaking them I think it's going to be
really surprising if we haven't done real harm to overall system
performance.

BTW, I took a look at the MariaDB source code to see whether they've
got this problem too and it sure looks like they do.
os_file_fsync_posix() retries the fsync in a loop with a 0.2 second
sleep after each retry.  It warns after 100 failures and fails an
assertion after 1000 failures.  It is hard to understand why they
would have written the code this way unless they expect errors
reported by fsync() to continue being reported until the underlying
condition is corrected.  But, it looks like they wouldn't have the
problem that we do with trying to reopen files to fsync() them later
-- I spot checked a few places where this code is invoked and in all
of those it looks like the file is already expected to be open.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Robert Haas
On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund  wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I
wasn't entirely not-serious either.  If you can't reliably find
out whether the contents of the file on disk are the same as the
contents that the kernel is giving you when you call read(), then you
are going to have a heck of a time building a reliable system.  If the
kernel developers are determined to insist on these semantics (and,
admittedly, I don't know whether that's the case - I've only read
Anthony's remarks), then I don't really see what we can do except give
up on buffered I/O (or on Linux).

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think
it's essential.  The only argument for the current Linux behavior that
has been so far advanced on this thread, at least as far as I can see,
is that if it kept retrying the buffers forever, it would be pointless
and might run the machine out of memory, so we might as well discard
them.  But previous comments have already illustrated that the kernel
is not really up against a wall there -- it could put individual
inodes into a permanent failure state when it discards their dirty
data, as you suggested, or it could do what others have suggested, and
what I think is better, which is to put the whole filesystem into a
permanent failure state that can be cleared by remounting the FS.
That could be done on an as-needed basis -- if the number of dirty
buffers you're holding onto for some filesystem becomes too large, put
the filesystem into infinite-fail mode and discard them all.  That
behavior would be pretty easy for administrators to understand and
would resolve the entire problem here provided that no PostgreSQL
processes survived the eventual remount.

I also don't really know what we mean by an "unresolvable" error.  If
the drive is beyond all hope, then it doesn't really make sense to
talk about whether the database stored on it is corrupt.  In general
we can't be sure that we'll even get an error - e.g. the system could
be idle and the drive could be on fire.  Maybe this is the case you
meant by "it'd be nice if we could report it reliably".  But at least
in my experience, that's typically not what's going on.  You get some
I/O errors and so you remount the filesystem, or reboot, or rebuild
the array, or ... something.  And then the errors go away and, at that
point, you want to run recovery and continue using your database.  In
this scenario, it matters *quite a bit* what the error reporting was
like during the period when failures were occurring.  In particular,
if the database was allowed to think that it had successfully
checkpointed when it didn't, you're going to start recovery from the
wrong place.

I'm going to shut up now because I'm telling you things that you
obviously already know, but this doesn't sound like a "near
irresolvable corner case".  When the storage goes bonkers, either
PostgreSQL and the kernel can interact in such a way that a checkpoint
can succeed without all of the relevant data getting persisted, or
they don't.  It sounds like right now they do, and I'm not really
clear that we have a reasonable idea how to fix that.  It does not
sound like a PANIC is sufficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Craig Ringer
On 10 April 2018 at 14:10, Michael Paquier  wrote:

> Well, I think that there is place for improving reporting of failure
> in file_utils.c for frontends, or at worst have an exit() for any kind
> of critical failures equivalent to a PANIC.

Yup.

In the mean time, speaking of PANIC, here's the first cut patch to
make Pg panic on fsync() failures. I need to do some closer review and
testing, but it's presented here for anyone interested.

I intentionally left some failures as ERROR not PANIC, where the
entire operation is done as a unit, and an ERROR will cause us to
retry the whole thing.

For example, when we fsync() a temp file before we move it into place,
there's no point panicing on failure, because we'll discard the temp
file on ERROR and retry the whole thing.

I've verified that it works as expected with some modifications to the
test tool I've been using (pushed).

The main downside is that if we panic in redo, we don't try again. We
throw our toys and shut down. But arguably if we get the same I/O
error again in redo, that's the right thing to do anyway, and quite
likely safer than continuing to ERROR on checkpoints indefinitely.

Patch attached.

To be clear, this patch only deals with the issue of us retrying
fsyncs when it turns out to be unsafe. This does NOT address any of
the issues where we won't find out about writeback errors at all.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 09d23dcd5ed87f5f01a43978942d8867e712e280 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 10 Apr 2018 14:08:32 +0800
Subject: [PATCH v1] PANIC when we detect a possible fsync I/O error instead of
 retrying fsync

Panic the server on fsync failure in places where we can't simply repeat the
whole operation on retry. Most importantly, panic when fsync fails during a
checkpoint.

This will result in log messages like:

PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
LOG:  0: checkpointer process (PID 10799) was terminated by signal 6: Aborted

and, if the condition persists during redo:

LOG:  0: checkpoint starting: end-of-recovery immediate
PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
LOG:  0: startup process (PID 10808) was terminated by signal 6: Aborted

Why?

In a number of places in PostgreSQL we responded to fsync() errors by retrying the
fsync(), expecting that this would force the operating system to repeat any
write attempts. The code assumed that fsync() would return an error on all
subsequent calls until any I/O error was resolved.

This is not what actually happens on some platforms, including Linux. The
operating system may give up and drop dirty buffers for async writes on the
floor and mark the page mapping as bad. The first fsync() clears any error flag
from the page entry and/or our file descriptor. So a subsequent fsync() returns
success, even though the data PostgreSQL wrote was really discarded.

We have no way to find out which writes failed, and no way to ask the kernel to
retry indefinitely, so all we can do is PANIC. Redo will attempt the write
again, and if it fails again, it will also PANIC.

This doesn't completely prevent fsync reliability issues, because it only
handles cases where the kernel actually reports the error to us. It's entirely
possible for a buffered write to be lost without causing fsync to report an
error at all (see discussion below). Work on addressing those issues and
documenting them is ongoing and will be committed separately.

See https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
---
 src/backend/access/heap/rewriteheap.c   |  6 +++---
 src/backend/access/transam/timeline.c   |  4 ++--
 src/backend/access/transam/twophase.c   |  2 +-
 src/backend/access/transam/xlog.c   |  4 ++--
 src/backend/replication/logical/snapbuild.c |  3 +++
 src/backend/storage/file/fd.c   |  6 ++
 src/backend/storage/smgr/md.c   | 22 --
 src/backend/utils/cache/relmapper.c |  2 +-
 8 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 8d3c861a33..0320baffec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -965,7 +965,7 @@ logical_end_heap_rewrite(RewriteState state)
 	while ((src = (RewriteMappingFile *) hash_seq_search(_status)) != NULL)
 	{
 		if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-			ereport(ERROR,
+			ereport(PANIC,
 	(errcode_for_file_access(),
 	 errmsg("could not fsync file \"%s\": %m", src->path)));
 		FileClose(src->vfd);
@@ -1180,7 +1180,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	 */
 	

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Michael Paquier
On Tue, Apr 10, 2018 at 01:37:19PM +0800, Craig Ringer wrote:
> On 10 April 2018 at 13:04, Michael Paquier  wrote:
>> And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
>> -S or fsync_pgdata would enter in those waters.
> 
> ... but *only if they hit an I/O error* or they're on a FS that
> doesn't reserve space and hit ENOSPC.

Sure.

> It still does 99% of the job. It still flushes all buffers to
> persistent storage and maintains write ordering. It may not detect and
> report failures to the user how we'd expect it to, yes, and that's not
> great. But it's hardly throw up our hands and give up territory
> either. Also, at least for initdb, we can make initdb fsync() its own
> files before close(). Annoying but hardly the end of the world.

Well, I think that there is place for improving reporting of failure
in file_utils.c for frontends, or at worst have an exit() for any kind
of critical failures equivalent to a PANIC.
--
Michael


signature.asc
Description: PGP signature


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 13:04, Michael Paquier  wrote:
> On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:
>> Another consequence of this behavior is that initdb -S is never reliable,
>> so pg_rewind's use of it doesn't actually fix the problem it was
>> intended to solve.  It also means that initdb itself isn't crash-safe,
>> since the data file changes are made by the backend but initdb itself
>> is doing the fsyncs, and initdb has no way of knowing what files the
>> backend is going to create and therefore can't -- even theoretically
>> -- open them first.
>
> And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
> -S or fsync_pgdata would enter in those waters.

... but *only if they hit an I/O error* or they're on a FS that
doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to
persistent storage and maintains write ordering. It may not detect and
report failures to the user the way we'd expect it to, yes, and that's
not great. But it's hardly "throw up our hands and give up" territory
either. Also, at least for initdb, we can make initdb fsync() its own
files before close(). Annoying but hardly the end of the world.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Michael Paquier
On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:
> Another consequence of this behavior is that initdb -S is never reliable,
> so pg_rewind's use of it doesn't actually fix the problem it was
> intended to solve.  It also means that initdb itself isn't crash-safe,
> since the data file changes are made by the backend but initdb itself
> is doing the fsyncs, and initdb has no way of knowing what files the
> backend is going to create and therefore can't -- even theoretically
> -- open them first.

And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
-S or fsync_pgdata would enter in those waters.
--
Michael


signature.asc
Description: PGP signature


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 08:41, Andreas Karlsson  wrote:
> On 04/09/2018 02:16 PM, Craig Ringer wrote:
>>
>> I'd like a middle ground where the kernel lets us register our interest
>> and tells us if it lost something, without us having to keep eight million
>> FDs open for some long period. "Tell us about anything that happens under
>> pgdata/" or an inotify-style per-directory-registration option. I'd even say
>> that's ideal.
>
>
> Could there be a risk of a race condition here where fsync incorrectly
> returns success before we get the notification that something went wrong?

We'd examine the notification queue only once all our checkpoint
fsync()s had succeeded, and before we updated the control file to
advance the redo position.

I'm intrigued by the suggestion upthread of using a kprobe or similar
to achieve this. It's a horrifying unportable hack that'd make kernel
people cry, and I don't know if we have any way to flush buffered
probe data to be sure we really get the news in time, but it's a cool
idea too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund


On April 9, 2018 6:59:03 PM PDT, Craig Ringer  wrote:
>On 10 April 2018 at 04:37, Andres Freund  wrote:
>> Hi,
>>
>> On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
>>> Maybe. I'd certainly prefer automated recovery from temporary I/O
>>> issues (like full disk on thin-provisioning) without the database
>>> crashing and restarting. But I'm not sure it's worth the effort.
>>
>> Oh, I agree on that one. But that's more a question of how we force
>the
>> kernel's hand on allocating disk space. In most cases the kernel
>> allocates the disk space immediately, even if delayed allocation is
>in
>> effect. For the cases where that's not the case (if there are current
>> ones, rather than just past bugs), we should be able to make sure
>that's
>> not an issue by pre-zeroing the data and/or using fallocate.
>
>Nitpick: In most cases the kernel reserves disk space immediately,
>before returning from write(). NFS seems to be the main exception
>here.
>
>EXT4 and XFS don't allocate until later, doing so by performing actual
>writes to FS metadata, initializing disk blocks, etc. So we won't
>notice errors that are only detectable at actual time of allocation,
>like thin provisioning problems, until after write() returns and we
>face the same writeback issues.
>
>So I reckon you're safe from space-related issues if you're not on NFS
>(and whyyy would you do that?) and not thinly provisioned. I'm sure
>there are other corner cases, but I don't see any reason to expect
>space-exhaustion-related corruption problems on a sensible FS backed
>by a sensible block device. I haven't tested things like quotas,
>verified how reliable space reservation is under concurrency, etc as
>yet.

How's that not solved by pre-zeroing and/or fallocate as I suggested above?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 04:37, Andres Freund  wrote:
> Hi,
>
> On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
>> Maybe. I'd certainly prefer automated recovery from temporary I/O
>> issues (like full disk on thin-provisioning) without the database
>> crashing and restarting. But I'm not sure it's worth the effort.
>
> Oh, I agree on that one. But that's more a question of how we force the
> kernel's hand on allocating disk space. In most cases the kernel
> allocates the disk space immediately, even if delayed allocation is in
> effect. For the cases where that's not the case (if there are current
> ones, rather than just past bugs), we should be able to make sure that's
> not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately,
before returning from write(). NFS seems to be the main exception
here.

EXT4 and XFS don't allocate until later, doing so by performing actual
writes to FS metadata, initializing disk blocks, etc. So we won't
notice errors that are only detectable at actual time of allocation,
like thin provisioning problems, until after write() returns and we
face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS
(and whyyy would you do that?) and not thinly provisioned. I'm sure
there are other corner cases, but I don't see any reason to expect
space-exhaustion-related corruption problems on a sensible FS backed
by a sensible block device. I haven't tested things like quotas,
verified how reliable space reservation is under concurrency, etc as
yet.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 04:25, Mark Dilger  wrote:

> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

Yes, it can, but not directly through the first error.

What can happen is that we think a block got written when it didn't.

If our in memory state diverges from our on disk state, we can make
subsequent WAL writes based on that wrong information. But that's
actually OK, since the standby will have replayed the original WAL
correctly.

I think the only time we'd run into trouble is if we evict the good
(but not written out) data from s_b and the fs buffer cache, then
later read in the old version of a block we failed to overwrite. Data
checksums (if enabled) might catch it unless the write left the whole
block stale. In that case we might generate a full page write with the
stale block and propagate that over WAL to the standby.

So I'd say standbys are relatively safe - very safe if the issue is
caught promptly, and less so over time. But AFAICS WAL-based
replication (physical or logical) is not a perfect defense for this.

However, remember, if your storage system is free of any sort of
overprovisioning, is on a non-network file system, and doesn't use
multipath (or sets it up right) this issue *is exceptionally unlikely
to affect you*.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer  wrote:
> On 10 April 2018 at 03:59, Andres Freund  wrote:
>> I don't think that's as hard as some people argued in this thread.  We
>> could very well open a pipe in postmaster with the write end open in
>> each subprocess, and the read end open only in checkpointer (and
>> postmaster, but unused there).  Whenever closing a file descriptor that
>> was dirtied in the current process, send it over the pipe to the
>> checkpointer. The checkpointer then can receive all those file
>> descriptors (making sure it's not above the limit, fsync(), close() ing
>> to make room if necessary).  The biggest complication would presumably
>> be to deduplicate the received filedescriptors for the same file,
>> without losing track of any errors.
>
> Yep. That'd be a cheaper way to do it, though it wouldn't work on
> Windows. Though we don't know how Windows behaves here at all yet.
>
> Prior discussion upthread had the checkpointer open()ing a file at the
> same time as a backend, before the backend writes to it. But passing
> the fd when the backend is done with it would be better.

How would that interlock with concurrent checkpoints?

I can see how to make that work if the share-fd-or-fsync-now logic
happens in smgrwrite() when called by FlushBuffer() while you hold
io_in_progress, but not if you defer it to some random time later.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 03:59, Andres Freund  wrote:
> On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
>> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
>> > You could make the argument that it's OK to forget if the entire file
>> > system goes away. But actually, why is that ok?
>>
>> I was going to say that it'd be okay to clear error flag on umount, since any
>> opened files would prevent unmounting; but, then I realized we need to 
>> consider
>> the case of close()ing all FDs then opening them later... in another process.
>
>> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
>> > notification descriptor open, where the kernel would inject events
>> > related to writeback failures of files under watch (potentially
>> > enriched to contain info regarding the exact failed pages and
>> > the file offset they map to).
>>
>> For postgres that'd require backend processes to open() a file such that,
>> following its close(), any writeback errors are "signalled" to the 
>> checkpointer
>> process...
>
> I don't think that's as hard as some people argued in this thread.  We
> could very well open a pipe in postmaster with the write end open in
> each subprocess, and the read end open only in checkpointer (and
> postmaster, but unused there).  Whenever closing a file descriptor that
> was dirtied in the current process, send it over the pipe to the
> checkpointer. The checkpointer then can receive all those file
> descriptors (making sure it's not above the limit, fsync(), close() ing
> to make room if necessary).  The biggest complication would presumably
> be to deduplicate the received filedescriptors for the same file,
>> without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on
Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the
same time as a backend, before the backend writes to it. But passing
the fd when the backend is done with it would be better.

We'd need a way to dup() the fd and pass it back to a backend when it
needed to reopen it sometimes, or just make sure to keep the oldest
copy of the fd when a backend reopens multiple times, but that's no
biggie.

We'd still have to fsync() out early in the checkpointer if we ran out
of space in our FD list, and initscripts would need to change our
ulimit or we'd have to do it ourselves in the checkpointer. But
neither seems insurmountable.
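One wrinkle with the pipe idea: on POSIX systems a file descriptor can only be handed to another process as SCM_RIGHTS ancillary data over an AF_UNIX socket, not a plain pipe. A minimal sketch (Python's socket/os wrappers over the same syscalls, needs Python 3.9+; the "relfilenode" tag is invented):

```python
import os
import socket
import tempfile

# "Backend" and "checkpointer" ends of an AF_UNIX socketpair; fd
# passing needs SCM_RIGHTS, which plain pipes don't support.
backend_end, ckpt_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"dirty page contents")
    dirty_path = f.name

# Backend: done writing; hand the descriptor to the checkpointer.
fd = os.open(dirty_path, os.O_RDWR)
socket.send_fds(backend_end, [b"relfilenode 16384"], [fd])
os.close(fd)  # the checkpointer's duplicated fd stays valid

# Checkpointer: receive the fd and fsync it, so any writeback error
# attributed to that file is reported to us rather than lost when the
# backend's descriptor went away.
msg, fds, _flags, _addr = socket.recv_fds(ckpt_end, 1024, 8)
os.fsync(fds[0])
os.close(fds[0])
os.unlink(dirty_path)
print(msg)  # b'relfilenode 16384'
```

The deduplication and fd-limit handling discussed above would sit on top of this on the receiving side.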

FWIW, I agree that this is a corner case, but it's getting to be a
pretty big corner with the spread of overcommitted, deduplicating SANs,
cloud storage, etc. Not all I/O errors indicate permanent hardware
faults, disk failures, etc, as I outlined earlier. I'm very curious to
know what AWS EBS's error semantics are, and other cloud network block
stores. (I posted on Amazon forums
https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0 but
nothing so far).

I'm also not particularly inclined to trust that all file systems will
always reliably reserve space without having some cases where they'll
fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the
direction the storage world is moving in, and whether this will become
a bigger issue over time.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andreas Karlsson

On 04/09/2018 02:16 PM, Craig Ringer wrote:
I'd like a middle ground where the kernel lets us register our interest 
and tells us if it lost something, without us having to keep eight 
million FDs open for some long period. "Tell us about anything that 
happens under pgdata/" or an inotify-style per-directory-registration 
option. I'd even say that's ideal.


Could there be a risk of a race condition here where fsync incorrectly 
returns success before we get the notification of that something went wrong?


Andreas



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro wrote:
> I wonder if anyone can tell us what Windows, AIX and HPUX do here.

I created a wiki page to track what we know (or think we know) about
fsync() on various operating systems:

https://wiki.postgresql.org/wiki/Fsync_Errors

If anyone has more information or sees mistakes, please go ahead and edit it.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>> Well, there seem to be kernels that do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
>
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does, though, indeed provide a persistent error indication that
> would allow Pg to simply and reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO even though their own writes were successfully persisted.

Right.  For anyone interested, here is the change you mentioned, and
an interesting one that came a bit earlier last year:

https://reviews.freebsd.org/rS316941 -- drop buffers after device goes away
https://reviews.freebsd.org/rS326029 -- update comment about EIO contract

Retrying may well be futile, but at least future fsync() calls won't
report success bogusly.  There may of course be more space-efficient
ways to represent that state as the comment implies, while never lying
to the user -- perhaps involving filesystem level or (pinned) inode
level errors that stop all writes until unmounted.  Something tells me
they won't resort to flaky fsync() error reporting.

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

> [1] 
> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Very interesting, thanks.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 2:25 PM, Tomas Vondra  wrote:
> 
> 
> 
> On 04/09/2018 11:08 PM, Andres Freund wrote:
>> Hi,
>> 
>> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>>> I can also imagine a master and standby that are similarly provisioned,
>>> and thus hit an out of disk error at around the same time, resulting in
>>> corruption on both, even if not the same corruption.
>> 
>> I think it's a grave mistake conflating ENOSPC issues (which we should
>> solve by making sure there's always enough space pre-allocated), with
>> EIO type errors.  The problem is different, the solution is different.

I'm happy to take your word for that.

> In any case, that certainly does not count as data corruption spreading
> from the master to standby.

Maybe not from the point of view of somebody looking at the code.  But a
user might see it differently.  If the data being loaded into the master
and getting replicated to the standby "causes" both to get corrupt, then
it seems like corruption spreading.  I put "causes" in quotes because there
is some argument to be made about "correlation does not prove cause" and so
forth, but it still feels like causation from an arms length perspective.
If there is a pattern of standby servers tending to fail more often right
around the time that the master fails, you'll have a hard time comforting
users, "hey, it's not technically causation."  If loading data into the
master causes the master to hit ENOSPC, and replicating that data to the
standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has
not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my
standby servers right soon now.

mark




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 11:08 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>> I can also imagine a master and standby that are similarly provisioned,
>> and thus hit an out of disk error at around the same time, resulting in
>> corruption on both, even if not the same corruption.
> 
> I think it's a grave mistake conflating ENOSPC issues (which we should
> solve by making sure there's always enough space pre-allocated), with
> EIO type errors.  The problem is different, the solution is different.
> 

In any case, that certainly does not count as data corruption spreading
from the master to standby.


-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
> I can also imagine a master and standby that are similarly provisioned,
> and thus hit an out of disk error at around the same time, resulting in
> corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should
solve by making sure there's always enough space pre-allocated), with
EIO type errors.  The problem is different, the solution is different.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 1:43 PM, Tomas Vondra  wrote:
> 
> 
> 
> On 04/09/2018 10:25 PM, Mark Dilger wrote:
>> 
>>> On Apr 9, 2018, at 12:13 PM, Andres Freund  wrote:
>>> 
>>> Hi,
>>> 
>>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>>> I think the simplest technological solution to this problem is to
>>>> rewrite the entire backend and all supporting processes to use
>>>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>>>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>>>> we'll now have to make shared_buffers much larger -- since we'll no
>>>> longer be benefiting from the OS cache -- we'll need to replace the
>>>> use of malloc() with an allocator that pulls from shared_buffers.
>>>> Plus, as noted, we'll need to totally rearchitect several of our
>>>> critical frontend tools.  Let's freeze all other development for the
>>>> next year while we work on that, and put out a notice that Linux is no
>>>> longer a supported platform for any existing release.  Before we do
>>>> that, we might want to check whether fsync() actually writes the data
>>>> to disk in a usable way even with O_DIRECT.  If not, we should just
>>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>>> platform.
>>> 
>>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>>> absurd, as is some of the proposed ways this is all supposed to
>>> work. But I think the case we're discussing is much closer to a near
>>> irresolvable corner case than anything else.
>>> 
>>> We're talking about the storage layer returning an irresolvable
>>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>>> we could report it reliably.  But that doesn't change the fact that what
>>> we're doing is ensuring that data is safely fsynced unless storage
>>> fails, in which case it's not safely fsynced anyway.
>> 
>> I was reading this thread up until now as meaning that the standby could
>> receive corrupt WAL data and become corrupted.  That seems a much bigger
>> problem than merely having the master become corrupted in some unrecoverable
>> way.  It is a long standing expectation that serious hardware problems on
>> the master can result in the master needing to be replaced.  But there has
>> not been an expectation that the one or more standby servers would be taken
>> down along with the master, leaving all copies of the database unusable.
>> If this bug corrupts the standby servers, too, then it is a whole different
>> class of problem than the one folks have come to expect.
>> 
>> Your comment reads as if this is a problem isolated to whichever server has
>> the problem, and will not get propagated to other servers.  Am I reading
>> that right?
>> 
>> Can anybody clarify this for non-core-hacker folks following along at home?
>> 
> 
> That's a good question. I don't see any guarantee it'd be isolated to
> the master node. Consider this example:
> 
> (0) checkpoint happens on the primary
> 
> (1) a page gets modified, a full-page gets written to WAL
> 
> (2) the page is written out to page cache
> 
> (3) writeback of that page fails (and gets discarded)
> 
> (4) we attempt to modify the page again, but we read the stale version
> 
> (5) we modify the stale version, writing the change to WAL
> 
> 
> The standby will get the full-page, and then a WAL from the stale page
> version. That doesn't seem like a story with a happy end, I guess. But I
> might be easily missing some protection built into the WAL ...

I can also imagine a master and standby that are similarly provisioned,
and thus hit an out of disk error at around the same time, resulting in
corruption on both, even if not the same corruption.  When choosing to
have one standby, or two standbys, or ten standbys, one needs to be able
to assume a certain amount of statistical independence between failures
on one server and failures on another.  If they are tightly correlated
dependent variables, then the conclusion that the probability of all
nodes failing simultaneously is vanishingly small becomes invalid.

mark


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 10:25 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 12:13 PM, Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>> I think the simplest technological solution to this problem is to
>>> rewrite the entire backend and all supporting processes to use
>>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>>> we'll now have to make shared_buffers much larger -- since we'll no
>>> longer be benefiting from the OS cache -- we'll need to replace the
>>> use of malloc() with an allocator that pulls from shared_buffers.
>>> Plus, as noted, we'll need to totally rearchitect several of our
>>> critical frontend tools.  Let's freeze all other development for the
>>> next year while we work on that, and put out a notice that Linux is no
>>> longer a supported platform for any existing release.  Before we do
>>> that, we might want to check whether fsync() actually writes the data
>>> to disk in a usable way even with O_DIRECT.  If not, we should just
>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>> platform.
>>
>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>> absurd, as is some of the proposed ways this is all supposed to
>> work. But I think the case we're discussing is much closer to a near
>> irresolvable corner case than anything else.
>>
>> We're talking about the storage layer returning an irresolvable
>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>> we could report it reliably.  But that doesn't change the fact that what
>> we're doing is ensuring that data is safely fsynced unless storage
>> fails, in which case it's not safely fsynced anyway.
> 
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.  That seems a much bigger
> problem than merely having the master become corrupted in some unrecoverable
> way.  It is a long standing expectation that serious hardware problems on
> the master can result in the master needing to be replaced.  But there has
> not been an expectation that the one or more standby servers would be taken
> down along with the master, leaving all copies of the database unusable.
> If this bug corrupts the standby servers, too, then it is a whole different
> class of problem than the one folks have come to expect.
> 
> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers.  Am I reading
> that right?
> 
> Can anybody clarify this for non-core-hacker folks following along at home?
> 

That's a good question. I don't see any guarantee it'd be isolated to
the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL


The standby will get the full-page, and then a WAL from the stale page
version. That doesn't seem like a story with a happy end, I guess. But I
might be easily missing some protection built into the WAL ...
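A toy model of that sequence (purely illustrative; all names invented) shows how the standby ends up replaying an update derived from the stale page even though it faithfully applied the full-page image first:

```python
# (0) state on disk after the last checkpoint
disk = {"page": "v0"}
cache = {}   # stands in for the kernel page cache
wal = []     # records shipped to the standby

# (1)+(2) modify the page: full-page image to WAL, new version to cache
cache["page"] = "v1"
wal.append(("full_page", "v1"))

# (3) writeback fails; the kernel drops the dirty page and clears the error
cache.pop("page")          # "v1" never reaches disk

# (4) the next read misses the cache and sees the stale on-disk version
current = cache.get("page", disk["page"])   # "v0", not "v1"

# (5) we modify the stale version and log the change
wal.append(("update", current + "+delta"))

# The standby applies the WAL exactly as written ...
standby_page = None
for _kind, payload in wal:
    standby_page = payload

# ... and ends up with a page derived from v0, even though the stream
# said the page was v1 just before the update was generated.
print(current, standby_page)   # v0 v0+delta
```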

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
> Maybe. I'd certainly prefer automated recovery from temporary I/O
> issues (like full disk on thin-provisioning) without the database
> crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the
kernel's hand on allocating disk space. In most cases the kernel
allocates the disk space immediately, even if delayed allocation is in
effect. For the cases where that's not the case (if there are current
ones, rather than just past bugs), we should be able to make sure that's
not an issue by pre-zeroing the data and/or using fallocate.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

I don't see that as a real problem here. For one the problematic
scenarios shouldn't readily apply, for another WAL is checksummed.

There's the problem that a new basebackup would potentially become
corrupted however. And similarly pg_rewind.

Note that I'm not saying that we and/or linux shouldn't change
anything. Just that the apocalypse isn't here.


> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers.  Am I reading
> that right?

I think that's basically right. There's cases where corruption could get
propagated, but they're not straightforward.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 10:04 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:
>> Isn't the expectation that when a fsync call fails, the next one will
>> retry writing the pages in the hope that it succeeds?
> 
> Some people expect that, I personally don't think it's a useful
> expectation.
> 

Maybe. I'd certainly prefer automated recovery from temporary I/O
issues (like full disk on thin-provisioning) without the database
crashing and restarting. But I'm not sure it's worth the effort.

And most importantly, it's rather delusional to think the kernel
developers are going to be enthusiastic about that approach ...

>
> We should just deal with this by crash-recovery. The big problem I
> see is that you always need to keep a file descriptor open for
> pretty much any file written to inside and outside of postgres, to be
> guaranteed to see errors. And that'd solve that. Even if retrying
> would work, I'd advocate for that (I've done so in the past, and I've
> written code in pg that panics on fsync failure...).
> 

Sure. And it's likely way less invasive from kernel perspective.

>
> What we'd need to do however is to clear that bit during crash 
> recovery... Which is interesting from a policy perspective. Could be 
> that other apps wouldn't want that.
> 

IMHO it'd be enough if a remount clears it.

>
> I also wonder if we couldn't just somewhere read each relevant
> mounted filesystem's errseq value. Whenever checkpointer notices
> before finishing a checkpoint that it has changed, do a crash
> restart.
> 

Hmm, that's an interesting idea, and it's about the only thing that
would help us on older kernels. There's a wb_err in address_space, but
that's at inode level. Not sure if there's something at fs level.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 12:13 PM, Andres Freund  wrote:
> 
> Hi,
> 
> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>> I think the simplest technological solution to this problem is to
>> rewrite the entire backend and all supporting processes to use
>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>> we'll now have to make shared_buffers much larger -- since we'll no
>> longer be benefiting from the OS cache -- we'll need to replace the
>> use of malloc() with an allocator that pulls from shared_buffers.
>> Plus, as noted, we'll need to totally rearchitect several of our
>> critical frontend tools.  Let's freeze all other development for the
>> next year while we work on that, and put out a notice that Linux is no
>> longer a supported platform for any existing release.  Before we do
>> that, we might want to check whether fsync() actually writes the data
>> to disk in a usable way even with O_DIRECT.  If not, we should just
>> de-support Linux entirely as a hopelessly broken and unsupportable
>> platform.
> 
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.
> 
> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could
receive corrupt WAL data and become corrupted.  That seems a much bigger
problem than merely having the master become corrupted in some unrecoverable
way.  It is a long standing expectation that serious hardware problems on
the master can result in the master needing to be replaced.  But there has
not been an expectation that the one or more standby servers would be taken
down along with the master, leaving all copies of the database unusable.
If this bug corrupts the standby servers, too, then it is a whole different
class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has
the problem, and will not get propagated to other servers.  Am I reading
that right?

Can anybody clarify this for non-core-hacker folks following along at home?


mark




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:
> Isn't the expectation that when a fsync call fails, the next one will
> retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful
expectation.

We should just deal with this by crash-recovery.  The big problem I see
is that you always need to keep a file descriptor open for pretty much
any file written to inside and outside of postgres, to be guaranteed to
see errors. And that'd solve that.  Even if retrying would work, I'd
advocate for that (I've done so in the past, and I've written code in pg
that panics on fsync failure...).

What we'd need to do however is to clear that bit during crash
recovery... Which is interesting from a policy perspective. Could be
that other apps wouldn't want that.

I also wonder if we couldn't just somewhere read each relevant mounted
filesystem's errseq value. Whenever checkpointer notices before
finishing a checkpoint that it has changed, do a crash restart.
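The errseq idea could look roughly like this (a hypothetical sketch: `read_fs_error_counter` stands in for a per-filesystem error counter that has no userspace API today):

```python
class Checkpointer:
    """Abort a checkpoint if the filesystem's error counter advanced."""

    def __init__(self, read_fs_error_counter):
        # hypothetical callable returning the mount's errseq-style counter
        self._read_counter = read_fs_error_counter

    def run_checkpoint(self, flush_all_buffers):
        before = self._read_counter()
        flush_all_buffers()
        if self._read_counter() != before:
            # some writeback failed somewhere on this filesystem since
            # the checkpoint began; don't mark it complete
            raise RuntimeError("writeback error during checkpoint: "
                               "PANIC and perform crash recovery")
        return "checkpoint complete"

# usage sketch with a fake counter
counter = [0]
ckpt = Checkpointer(lambda: counter[0])
print(ckpt.run_checkpoint(lambda: None))   # checkpoint complete
```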


Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> > You could make the argument that it's OK to forget if the entire file
> > system goes away. But actually, why is that ok?
> 
> I was going to say that it'd be okay to clear error flag on umount, since any
> opened files would prevent unmounting; but, then I realized we need to 
> consider
> the case of close()ing all FDs then opening them later... in another process.

> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> > notification descriptor open, where the kernel would inject events
> > related to writeback failures of files under watch (potentially
> > enriched to contain info regarding the exact failed pages and
> > the file offset they map to).
> 
> For postgres that'd require backend processes to open() a file such that,
> following its close(), any writeback errors are "signalled" to the 
> checkpointer
> process...

I don't think that's as hard as some people argued in this thread.  We
could very well open a pipe in postmaster with the write end open in
each subprocess, and the read end open only in checkpointer (and
postmaster, but unused there).  Whenever closing a file descriptor that
was dirtied in the current process, send it over the pipe to the
checkpointer. The checkpointer then can receive all those file
descriptors (making sure it's not above the limit, fsync(), close() ing
to make room if necessary).  The biggest complication would presumably
be to deduplicate the received filedescriptors for the same file,
without losing track of any errors.

Even better, we could do so via a dedicated worker. That'd quite
possibly end up as a performance benefit.


> I was going to say that's fine for postgres, since it chdir()s into its
> basedir, but actually not fine for nondefault tablespaces..

I think it'd be fair to open PG_VERSION of all created
tablespaces. Would require some hooks to signal checkpointer (or
whichever process) to do so when creating one, but it shouldn't be too
hard.  Some people would complain because they can't do some nasty hacks
anymore, but it'd also save peoples butts by preventing them from
accidentally unmounting.
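A sketch of the PG_VERSION pinning (names and layout invented): hold one long-lived fd per tablespace so that the mount can't disappear under a running cluster, and so the server keeps a descriptor through which writeback errors on that filesystem remain reportable.

```python
import os
import tempfile

def pin_tablespaces(tablespace_dirs):
    """Open and keep PG_VERSION of each tablespace for the server's life."""
    pinned = {}
    for d in tablespace_dirs:
        # an open fd pins the mount (a plain umount fails with EBUSY)
        # and gives us a handle the kernel reports errors against
        pinned[d] = os.open(os.path.join(d, "PG_VERSION"), os.O_RDONLY)
    return pinned

# usage sketch against a throwaway "tablespace"
ts_dir = tempfile.mkdtemp()
with open(os.path.join(ts_dir, "PG_VERSION"), "w") as f:
    f.write("11\n")
pinned = pin_tablespaces([ts_dir])
print(len(pinned))   # 1
```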

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 09:37 PM, Andres Freund wrote:
> 
> 
On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:
> 
>> I honestly do not expect that keeping around the failed pages will
>> be an acceptable change for most kernels, and as such the
>> recommendation
>> will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep around per-inode
> information about fatal failures that occurred. Report errors
> until that bit is explicitly cleared. Yes, that keeps some memory
> around until unmount if nobody clears it. But it's orders of
> magnitude less, and results in usable semantics.
>

Isn't the expectation that when an fsync call fails, the next one will
retry writing the pages in the hope that it succeeds?

Of course, it's also possible to do what you suggested, and simply mark
the inode as failed. In that case the next fsync can't possibly retry
the writes (e.g. after freeing some space on a thin-provisioned system),
but we'd get a reliable failure mode.
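That "reliable failure mode" is essentially what the thread converges on: never retry a failed fsync(). A hedged sketch of the policy (illustrative only, not the actual Postgres code; the later releases guard this behavior with the data_sync_retry GUC):

```python
import os, sys, tempfile

def fsync_or_panic(fd):
    """Treat any fsync() failure as fatal.  On Linux, a failed fsync()
    may have marked the dirty pages clean, so a retry that "succeeds"
    proves nothing; crash recovery from WAL is the only safe path."""
    try:
        os.fsync(fd)
        return True
    except OSError as e:
        # Never retry: the kernel may already have dropped the dirty data.
        sys.stderr.write("PANIC: could not fsync: %s\n" % e)
        raise

# Happy path: fsync of an ordinary temp file succeeds.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"checkpoint data")
    f.flush()
    ok = fsync_or_panic(f.fileno())
assert ok
```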

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  
> wrote:
> 
> >I honestly do not expect that keeping around the failed pages will
> >be an acceptable change for most kernels, and as such the
> >recommendation
> >will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep per-inode information 
> about fatal failures around. Report errors until that bit is 
> explicitly cleared.  Yes, that keeps some memory around until unmount if 
> nobody clears it. But it's orders of magnitude less, and results in usable 
> semantics.

As discussed before, I think this could be acceptable, especially
if you pair it with an opt-in mechanism (only applications that
care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel
releases that are eating fsync() writeback errors even over sync().

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
>> We already have dirty_bytes and dirty_background_bytes, for example. I
>> don't see why there couldn't be another limit defining how much dirty
>> data to allow before blocking writes altogether. I'm sure it's not that
>> simple, but you get the general idea - do not allow using all available
>> memory because of writeback issues, but don't throw the data away in
>> case it's just a temporary issue.
> 
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner. This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where, for example, the original application crashes and never clears
> the error. Apart from that, further interfaces would need to be provided
> for actually dealing with the error (again assuming non-transient
> issues that may not be fixed transparently and that temporary issues
> are taken care of by lower layers of the stack).
> 

I don't quite see how this is any different from other possible issues
when running multiple applications on the same system. One application
can generate a lot of dirty data, reaching dirty_bytes and forcing the
other applications on the same host to do synchronous writes.

Of course, you might argue that is a temporary condition - it will
resolve itself once the dirty pages get written to storage. In case of
an I/O issue, it is a permanent impact - it will not resolve itself
unless the I/O problem gets fixed.

I'm not sure what interfaces would need to be provided. Possibly something
that says "drop dirty pages for these files" after the application gets
killed or something. That makes sense, of course.

>> Well, there seem to be kernels that seem to do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
> 
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does, though, provide a persistent error indication that
> would allow Pg to simply and reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes were successfully persisted.
> 

In my experience when you have a persistent I/O error on a device, it
likely affects all applications using that device. So unmounting the fs
to clear the dirty pages seems like an acceptable solution to me.

I don't see what else the application could do. In a way, I'm suggesting
applications don't really want to be responsible for recovering (cleanup
of dirty pages etc.). We're more than happy to hand that over to the kernel,
e.g. because each kernel will do that differently. What we however do
want is reliable information about fsync outcome, which we need to
properly manage WAL, checkpoints etc.

> Ideally, you'd want a (potentially persistent) indication of error
> localized to a file region (mapping the corresponding failed writeback
> pages). NetBSD is already implementing fsync_ranges(), which could
> be a step in the right direction.
> 
>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
> 
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO). I think that durability is a rather complex
> cross-layer issue which has been grossly misunderstood similarly
> in the past (e.g. see [1]). It seems that both the OS and DB
> communities greatly benefit from a periodic reality check, and
> I see this as an opportunity for strengthening the IO stack in
> an end-to-end manner.
> 

Right. What I was getting at is that perhaps the current fsync()
behavior is not very practical for building actual applications.

> Best regards,
> Anthony
> 
> [1] 
> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
> 

Thanks. The paper looks interesting.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:
> On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> > What about having buffered IO with implied fsync() atomicity via
> > O_SYNC?
> 
> You're kidding, right?  We could also just add sleep(30)'s all over the
> tree, and hope that that'll solve the problem.  There's a reason we
> don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you can apply the same principle of selectively using O_SYNC
at times and places that you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for
all those kernels that don't keep the pages around, irrespective of future
fixes. Short of loading a kernel module and dealing with the problem directly,
the only other available options seem to be either O_SYNC, O_DIRECT or ignoring
the issue.
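For concreteness, the O_SYNC alternative looks like this (a hypothetical sketch, with an invented file name; the latency-masking helper threads Anthony mentions are omitted):

```python
import os, tempfile

# With O_SYNC, each write() blocks until the data (and required
# metadata) has been transferred to stable storage.  An I/O error then
# surfaces as EIO from the write() call itself, on this fd, instead of
# from a later fsync() whose error state the kernel may have discarded.
fd0, path = tempfile.mkstemp()
os.close(fd0)
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
try:
    n = os.write(fd, b"commit record\n")   # synchronous write
finally:
    os.close(fd)
    os.unlink(path)
assert n == 14
```

The cost Andres objects to is visible in the shape of the code: every write pays a device round-trip, which is why Postgres normally batches dirty data and fsyncs at checkpoint time instead.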

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Justin Pryzby
On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> You could make the argument that it's OK to forget if the entire file
> system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any
opened files would prevent unmounting; but then I realized we need to consider
the case of close()ing all FDs and then opening them later... in another process.

I was going to say that's fine for postgres, since it chdir()s into its
basedir, but actually not fine for nondefault tablespaces..

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> notification descriptor open, where the kernel would inject events
> related to writeback failures of files under watch (potentially
> enriched to contain info regarding the exact failed pages and
> the file offset they map to).

For postgres that'd require backend processes to open() a file such that,
following its close(), any writeback errors are "signalled" to the checkpointer
process...

Justin



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund


On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  wrote:

>I honestly do not expect that keeping around the failed pages will
>be an acceptable change for most kernels, and as such the
>recommendation
>will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information about 
fatal failures around. Report errors until that bit is explicitly 
cleared.  Yes, that keeps some memory around until unmount if nobody clears it. 
But it's orders of magnitude less, and results in usable semantics.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> What about having buffered IO with implied fsync() atomicity via
> O_SYNC?

You're kidding, right?  We could also just add sleep(30)'s all over the
tree, and hope that that'll solve the problem.  There's a reason we
don't permanently fsync everything. Namely that it'll be way too slow.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:
> Honestly I don't think there's *any* way to use the current interface
> to implement reliable operation. Even that embedded database using a
> single process and keeping every file open all the time (which means
> file descriptor limits limit its scalability) can be having silent
> corruption whenever some other process like a backup program comes
> along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear
the error), and that's a nasty bug that apparently went unnoticed for
a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at
least this issue, but similar fixes need to be adopted by many other
kernels (all those that mark failed pages as clean).
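The errseq_t mechanism referred to here can be modeled roughly as a per-inode error sequence that each fd samples at open and re-samples when it reports an error. This is a toy model for intuition only; the real kernel code also tracks a "seen" bit so that an error nobody has observed yet is still reported to the next opener, a detail omitted below:

```python
class Inode:
    """Toy model of the errseq_t mechanism added in Linux 4.13."""
    def __init__(self):
        self.errseq = 0          # bumped on each writeback failure

    def writeback_fails(self):
        # The kernel fails to write a dirty page; the page is still
        # marked clean, but the error is recorded, not thrown away.
        self.errseq += 1

class FileDesc:
    def __init__(self, inode):
        self.inode = inode
        self.sampled = inode.errseq   # snapshot taken at open() time

    def fsync(self):
        # Report EIO if the inode failed writeback since this fd last
        # checked; each fd sees a given error at most once.
        if self.inode.errseq != self.sampled:
            self.sampled = self.inode.errseq
            return "EIO"
        return "OK"

inode = Inode()
fd_a = FileDesc(inode)        # opened before the failure (e.g. a backend)
inode.writeback_fails()
fd_b = FileDesc(inode)        # opened after the failure
assert fd_a.fsync() == "EIO"  # the pre-existing opener is told about it
assert fd_a.fsync() == "OK"   # ...exactly once
assert fd_b.fsync() == "OK"   # in this simplified model only
```

The key contrast with the pre-4.13 behavior: the error is kept per inode and reported to every fd that predates it, instead of being cleared by whichever process happens to call fsync() first.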

I honestly do not expect that keeping around the failed pages will
be an acceptable change for most kernels, and as such the recommendation
will probably be to coordinate in userspace for the fsync().

What about having buffered IO with implied fsync() atomicity via O_SYNC?
This would probably necessitate some helper threads that mask the
latency and present an async interface to the rest of PG, but sounds
less intrusive than going for DIO.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Peter Geoghegan
On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund  wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

+1

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big
difference between a problem in the storage layer that we could in
principle detect, but don't, and any other problem in the storage
layer. I've read articles claiming that technologies like SMART are
not really reliable in a practical sense [1], so it seems to me that
there is reason to doubt that this gap is all that big.

That said, I suspect the problems with running out of disk space
are serious practical ones. I have personally scoffed at stories
involving Postgres database corruption that got attributed to
running out of disk space. Looks like I was dead wrong.

[1] https://danluu.com/file-consistency/ -- "Filesystem correctness"
-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 08:29 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake  wrote:
> 
>> We have plenty of YEARS of people not noticing this issue
> 
> I disagree.  I have noticed this problem, but blamed it on other things.
> For over five years now, I have had to tell customers not to use thin
> provisioning, and I have had to add code to postgres to refuse to perform
> inserts or updates if the disk volume is more than 80% full.  I have lost
> count of the number of customers who are running an older version of the
> product (because they refuse to upgrade) and come back with complaints that
> they ran out of disk and now their database is corrupt.  All this time, I
> have been blaming this on virtualization and thin provisioning.
> 

Yeah. There's a big difference between not noticing an issue because it
does not happen very often vs. attributing it to something else. If we
had the ability to revisit past data corruption cases, we would probably
discover a fair number of cases caused by this.

The other thing we probably need to acknowledge is that the environment
changes significantly - things like thin provisioning are likely to get
even more common, increasing the incidence of these issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
> I think the simplest technological solution to this problem is to
> rewrite the entire backend and all supporting processes to use
> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
> write a complete I/O scheduling system inside PostgreSQL.  Also, since
> we'll now have to make shared_buffers much larger -- since we'll no
> longer be benefiting from the OS cache -- we'll need to replace the
> use of malloc() with an allocator that pulls from shared_buffers.
> Plus, as noted, we'll need to totally rearchitect several of our
> critical frontend tools.  Let's freeze all other development for the
> next year while we work on that, and put out a notice that Linux is no
> longer a supported platform for any existing release.  Before we do
> that, we might want to check whether fsync() actually writes the data
> to disk in a usable way even with O_DIRECT.  If not, we should just
> de-support Linux entirely as a hopelessly broken and unsupportable
> platform.

Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
absurd, as is some of the proposed ways this is all supposed to
work. But I think the case we're discussing is much closer to a near
irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable
error. You're hosed even if we report it properly.  Yes, it'd be nice if
we could report it reliably.  But that doesn't change the fact that what
we're doing is ensuring that data is safely fsynced unless storage
fails, in which case it's not safely fsynced anyway.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas  wrote:
> Ouch.  If a process exits -- say, because the user typed \q into psql
> -- then you're talking about potentially calling fsync() on a really
> large number of file descriptors, flushing many gigabytes of data to
> disk.  And it may well be that you never actually wrote any data to
> any of those file descriptors -- those writes could have come from
> other backends.  Or you may have written a little bit of data through
> those FDs, but there could be lots of other data that you end up
> flushing incidentally.  Perfectly innocuous things like starting up a
> backend, running a few short queries, and then having that backend
> exit suddenly turn into something that could have a massive
> system-wide performance impact.
>
> Also, if a backend ever manages to exit without running through this
> code, or writes any dirty blocks afterward, then this still fails to
> fix the problem completely.  I guess that's probably avoidable -- we
> can put this late in the shutdown sequence and PANIC if it fails.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.

What about the bug we fixed in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e
?  Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres.  Craig's proposal wouldn't cover this
case, because there was no opportunity to run fsync() after the first
crash, and there's now no way to go back and fsync() any stuff we
didn't fsync() before, because the kernel may have already thrown away
the error state, or may lie to us and tell us everything is fine
(because our new fd wasn't opened early enough).  I can't find the
original discussion that led to that commit right now, so I'm not
exactly sure what scenarios we were thinking about.  But I think it
would at least be a problem if full_page_writes=off or if you had
previously started the server with fsync=off and now wish to switch to
fsync=on after completing a bulk load or similar.  Recovery can read a
page, see that it looks OK, and continue, and then a later fsync()
failure can revert that page to an earlier state and now your database
is corrupted -- and there's absolute no way to detect this because
write() gives you the new page contents later, fsync() doesn't feel
obliged to tell you about the error because your fd wasn't opened
early enough, and eventually the write can be discarded and you'll
revert back to the old page version with no errors ever being reported
anywhere.

Another consequence of this behavior is that initdb -S is never reliable,
so pg_rewind's use of it doesn't actually fix the problem it was
intended to solve.  It also means that initdb itself isn't crash-safe,
since the data file changes are made by the backend but initdb itself
is doing the fsyncs, and initdb has no way of knowing what files the
backend is going to create and therefore can't -- even theoretically
-- open them first.

What's being presented to us as the API contract that we should expect
from buffered I/O is that if you open a file and read() from it, call
fsync(), and get no error, the kernel may nevertheless decide that
some previous write that it never managed to flush can't be flushed,
and then revert the page to the contents it had at some point in the
past.  That's more or less equivalent to letting a malicious
adversary randomly overwrite database pages with plausible-looking but
incorrect contents without notice, and hoping you can still build a
reliable system.  You can avoid the problem if you can always open an
fd for every file you want to modify before it's written and hold on
to it until after it's fsync'd, but that's pretty hard to guarantee in
the face of kill -9.

I think the simplest technological solution to this problem is to
rewrite the entire backend and all supporting processes to use
O_DIRECT everywhere.  To maintain adequate performance, we'll have to
write a complete I/O scheduling system inside PostgreSQL.  Also, since
we'll now have to make shared_buffers much larger -- since we'll no
longer be benefiting from the OS cache -- we'll need to replace the
use of malloc() with an allocator that pulls from shared_buffers.
Plus, as noted, we'll need to totally rearchitect several of our
critical frontend tools.  Let's freeze all other development for the
next year while we work on that, and put out a notice that Linux is no
longer a supported platform for any existing release.  Before we do
that, we might want to check whether fsync() actually writes the data
to disk in a usable way even with O_DIRECT.  If not, we should just
de-support Linux entirely as a hopelessly broken and unsupportable
platform.

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake  wrote:

> We have plenty of YEARS of people not noticing this issue

I disagree.  I have noticed this problem, but blamed it on other things.
For over five years now, I have had to tell customers not to use thin
provisioning, and I have had to add code to postgres to refuse to perform
inserts or updates if the disk volume is more than 80% full.  I have lost
count of the number of customers who are running an older version of the
product (because they refuse to upgrade) and come back with complaints that
they ran out of disk and now their database is corrupt.  All this time, I
have been blaming this on virtualization and thin provisioning.

mark



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Gasper Zejn
On 09. 04. 2018 15:42, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
>> A crazy idea would be to have a daemon that checks the logs and
>> stops Postgres when it sees something wrong.
>>
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.
>
> regards
>

For a somewhat less (or more) crazy idea: creating a Linux kernel
module with kprobe/kretprobe that captures the file passed to fsync (or
even the byte range within the file) and the corresponding return value
shouldn't be that hard. Kprobes have been part of the Linux kernel for a
really long time, and at first glance it seems like they could be
backported to 2.6 too.

Then you could have stable log messages or implement some kind of "fsync
error log notification" via whatever is the most sane way to get this
out of kernel.

If the kernel is new enough and has eBPF support (seems like >=4.4),
using bcc-tools[1] should enable you to write a quick script to get
exactly that info via perf events[2].

Obviously, that's a stopgap solution ...


Kind regards,
Gasper


[1] https://github.com/iovisor/bcc
[2]
https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Joshua D. Drake

On 04/09/2018 09:45 AM, Robert Haas wrote:
> On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer  wrote:
>> In the mean time, I propose that we fsync() on close() before we age FDs out
>> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
>> we don't seem to have many better options. At least it'll only flush what we
>> actually wrote to the OS buffers not what we may have in shared_buffers. If
>> the bgwriter does the same thing, we should be 100% safe from this problem
>> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
>> full_page_writes options that people can turn off if they know the risks /
>> know their storage is safe / don't care.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.


I don't have a better option but whatever we do, it should be an optional
(GUC) change. We have plenty of YEARS of people not noticing this issue and
Robert's correct, if we go back to an era of things like stalls it is going
to look bad on us no matter how we describe the problem.

Thanks,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer  wrote:
> In the mean time, I propose that we fsync() on close() before we age FDs out
> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
> we don't seem to have many better options. At least it'll only flush what we
> actually wrote to the OS buffers not what we may have in shared_buffers. If
> the bgwriter does the same thing, we should be 100% safe from this problem
> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
> full_page_writes options that people can turn off if they know the risks /
> know their storage is safe / don't care.

Ouch.  If a process exits -- say, because the user typed \q into psql
-- then you're talking about potentially calling fsync() on a really
large number of file descriptors, flushing many gigabytes of data to
disk.  And it may well be that you never actually wrote any data to
any of those file descriptors -- those writes could have come from
other backends.  Or you may have written a little bit of data through
those FDs, but there could be lots of other data that you end up
flushing incidentally.  Perfectly innocuous things like starting up a
backend, running a few short queries, and then having that backend
exit suddenly turn into something that could have a massive
system-wide performance impact.

Also, if a backend ever manages to exit without running through this
code, or writes any dirty blocks afterward, then this still fails to
fix the problem completely.  I guess that's probably avoidable -- we
can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve
the problem.  We suffered for years because of ext3's desire to flush
the entire page cache whenever any single file was fsync()'d, which
was terrible.  Eventually ext4 became the norm, and the problem went
away.  Now we're going to deliberately insert logic to do a very
similar kind of terrible thing because the kernel developers have
decided that fsync() doesn't have to do what it says on the tin?  I
grant that there doesn't seem to be a better option, but I bet we're
going to have a lot of really unhappy users if we do this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 9 April 2018 at 15:22, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to be full of dirty buffers in the
first place to ensure that the system won't get memory starved if it
can't clean them fast enough. That isn't even about persistent
hardware errors. Even when the hardware is working perfectly it can
only flush buffers so fast.  The whole point of the kernel is to
abstract away shared resources. It's not like user space has any
better view of the situation here. If Postgres implemented all this
with direct I/O it would have exactly the same problem, only with less
visibility into what the rest of the system is doing. If every
application implemented its own buffer cache we would be back in the
same boat only with a fragmented memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where, for example, the original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The
application has done the writes and if it's calling fsync it wants to
wait until the filesystem can arrange for the write to be persisted.
If the application could manage without the persistence then it
wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes
succeed. There's no reason to think that wouldn't be possible
sometime. The filesystem could remap blocks or an administrator could
replace degraded raid device components. The only thing Postgres could
do to recover would be create a new file and move the data (reading
from the dirty buffer in memory!) to a new file anyways so we would
"clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as some kind of
communication channel for reporting the status of specific writes
back to user space. That's a much more complex problem and would have
an entirely different interface. I think this is why we're having so much
difficulty communicating.



> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does, though, provide a persistent error indication that
> would allow Pg to simply and reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes were successfully persisted.

Well if they're writing to the same file that had a previous error I
doubt there are many applications that would be happy to consider
their writes "persisted" when the file was corrupt. Ironically the
earlier discussion quoted talked about how applications that wanted
more granular communication would be using O_DIRECT -- but what we
have is fsync trying to be *too* granular such that it's impossible to
get any strong guarantees about anything with it.

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly I don't think there's *any* way to use the current interface
to implement reliable operation. Even that embedded database using a
single process and keeping every file open all the time (which means
file descriptor limits cap its scalability) can suffer silent
corruption whenever some other process, like a backup program, comes
along and calls fsync (or even sync?).



-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>
> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie"
pages may occupy. Not sure how helpful it would be in the long run
since this tends to be highly application-specific, and for something
with a large data footprint one would end up tuning this accordingly
in a system-wide manner. This has the potential to leave other
applications running in the same system with very little memory, in
cases where for example original application crashes and never clears
the error. Apart from that, further interfaces would need to be provided
for actually dealing with the error (again assuming non-transient
issues that may not be fixed transparently and that temporary issues
are taken care of by lower layers of the stack).

> Well, there seem to be kernels that do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right
there (since 2017), mentioning that there can be no recovery from
EIO at the block layer and this needs to be done differently. No
idea how an application running on top of either FreeBSD or Illumos
would actually recover from this error (and clear it out), other
than remounting the fs in order to force dropping of relevant pages.
It does provide though indeed a persistent error indication that
would allow Pg to simply reliably panic. But again this does not
necessarily play well with other applications that may be using
the filesystem reliably at the same time, and are now faced with
EIO while their own writes succeed to be persisted.

Ideally, you'd want a (potentially persistent) indication of error
localized to a file region (mapping the corresponding failed writeback
pages). NetBSD is already implementing fsync_range(), which could
be a step in the right direction.

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have
a very simple process model (e.g. embedded DBs that can abort a
txn on fsync() EIO). I think that durability is a rather complex
cross-layer issue which has been grossly misunderstood similarly
in the past (e.g. see [1]). It seems that both the OS and DB
communities greatly benefit from a periodic reality check, and
I see this as an opportunity for strengthening the IO stack in
an end-to-end manner.

Best regards,
Anthony

[1] 
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 04:00 AM, Craig Ringer wrote:
> On 9 April 2018 at 07:16, Andres Freund wrote:
>  
> 
> 
> I think the danger presented here is far smaller than some of the
> statements in this thread might make one think.
> 
> 
> Clearly it's not happening a huge amount or we'd have a lot of noise
> about Pg eating people's data, people shouting about how unreliable it
> is, etc. We don't. So it's not some earth shattering imminent threat to
> everyone's data. It's gone unnoticed, or the root cause unidentified,
> for a long time.
> 

Yeah, it clearly isn't the case that everything we do suddenly got
pointless. It's fairly annoying, though.

> I suspect we've written off a fair few issues in the past as "it's
> bad hardware" when actually, the hardware fault was the trigger for
> a Pg/kernel interaction bug. And blamed containers for things that
> weren't really the container's fault. But even so, if it were
> happening tons, we'd hear more noise.
> 

Right. Write errors are fairly rare, and we've probably ignored a fair
number of cases demonstrating this issue. It kinda reminds me of the
observation that not seeing returning planes with bullet holes in the
engine does not mean engines don't need armor [1].

[1]
https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Abhijit Menon-Sen
At 2018-04-09 15:42:35 +0200, tomas.von...@2ndquadrant.com wrote:
>
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it sees something wrong.
> > 
> 
> That doesn't seem like a very practical approach.

Not least because Craig's tests showed that you can't rely on *always*
getting an error message in the logs.

-- Abhijit



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> 
> A crazy idea would be to have a daemon that checks the logs and
> stops Postgres when it sees something wrong.
> 

That doesn't seem like a very practical approach. It's better than
nothing, of course, but I wonder how that would work with containers
(where I think you may not have access to the kernel log at all). Also,
I'm pretty sure the messages do change based on kernel version (and
possibly filesystem), so parsing them reliably seems rather difficult.
And we probably don't want to PANIC after an I/O error on an unrelated
device, so we'd need to understand which devices are related to
PostgreSQL.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
>> On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:
>>
>>> What you seem to be asking for is the capability of dropping
>>> buffers over the (kernel) fence and indemnifying the application
>>> from any further responsibility, i.e. a hard assurance
>>> that either the kernel will persist the pages or it will
>>> keep them around till the application recovers them
>>> asynchronously, the filesystem is unmounted, or the system
>>> is rebooted.
>>>
>>
>> That seems like a perfectly reasonable position to take, frankly.
> 
> Indeed, as long as you are willing to ignore the consequences of
> this design decision: mainly, how you would recover memory when no
> application is interested in clearing the error. At which point
> other applications with different priorities will find this position
> rather unreasonable since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate
after some of the writes failed and the data got lost. Because if it
can't, then recovering the memory is rather useless. It might be better
to stop the system in that case, forcing the system administrator to
resolve the issue somehow (fail-over to a replica, perform recovery from
the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I
don't see why there couldn't be another limit defining how much dirty
data to allow before blocking writes altogether. I'm sure it's not that
simple, but you get the general idea - do not allow using all available
memory because of writeback issues, but don't throw the data away in
case it's just a temporary issue.
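For reference, the writeback limits mentioned here are exposed under /proc/sys/vm on Linux; a small read-only sketch (assuming a Linux /proc is mounted):

```python
# The writeback knobs mentioned above live in /proc/sys/vm on Linux.
# This read-only sketch just inspects them; a *_bytes value of 0 means
# the corresponding *_ratio setting is in effect instead.
def read_vm_knob(name):
    with open("/proc/sys/vm/" + name) as f:
        return int(f.read())

knobs = {name: read_vm_knob(name)
         for name in ("dirty_bytes", "dirty_background_bytes",
                      "dirty_ratio", "dirty_background_ratio")}
```

The hypothetical extra limit Tomas describes would sit alongside these, capping how much failed-writeback ("zombie") memory may accumulate before writers block.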

> Good luck convincing any OS kernel upstream to go with this design.
> 

Well, there seem to be kernels that do exactly that already. At
least that's how I understand what this thread says about FreeBSD and
Illumos, for example. So it's not an entirely insane design, apparently.

The question is whether the current design makes it any easier for
user-space developers to build reliable systems. We have tried using it,
and unfortunately the answer seems to be "no" and "Use direct I/O and
manage everything on your own!"

>> The whole _point_ of an Operating System should be that you can do exactly
>> that. As a developer I should be able to call write() and fsync() and know
>> that if both calls have succeeded then the result is on disk, no matter
>> what another application has done in the meantime. If that's a "difficult"
>> problem then that's the OS's problem, not mine. If the OS doesn't do that,
>> it's _not_doing_its_job_.
> 
> No OS kernel that I know of provides any promises for atomicity of a
> write()+fsync() sequence, unless one is using O_SYNC. It doesn't
> provide you with isolation either, as this is delegated to userspace,
> where processes that share a file should coordinate accordingly.
> 

We can (and do) take care of the atomicity and isolation. Implementation
of those parts is obviously very application-specific, and we have WAL
and locks for that purpose. I/O on the other hand seems to be a generic
service provided by the OS - at least that's how we saw it until now.

> It's not a difficult problem, but rather the kernels provide a common
> denominator of possible interfaces and designs that could accommodate
> a wider range of potential application scenarios for which the kernel
> cannot possibly anticipate requirements. There have been plenty of
> experimental works for providing a transactional (ACID) filesystem
> interface to applications. On the opposite end, there have been quite
> a few commercial databases that completely bypass the kernel storage
> stack. But I would assume it is reasonable to figure out something
> between those two extremes that can work in a "portable" fashion.
> 

Users ask us about this quite often, actually. The question is usually
about "RAW devices" and performance, but ultimately it boils down to
buffered vs. direct I/O. So far our answer has been that we rely on the
kernel to do this reliably, because kernel developers know how to do it
correctly and we simply don't have the manpower to implement it ourselves
(portably, reliably, handling different types of storage, ...).

One has to wonder how many applications actually use this correctly,
considering PostgreSQL cares about data durability/consistency so much
and yet we've been misunderstanding how it works for 20+ years.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:
> 
> I'd like a middle ground where the kernel lets us register our interest and
> tells us if it lost something, without us having to keep eight million FDs
> open for some long period. "Tell us about anything that happens under
> pgdata/" or an inotify-style per-directory-registration option. I'd even
> say that's ideal.

I see what you are saying. So basically you'd always maintain the
notification descriptor open, where the kernel would inject events
related to writeback failures of files under watch (potentially
enriched to contain info regarding the exact failed pages and
the file offset they map to). The kernel wouldn't even have to
maintain per-page bits to trace the errors, since they will be
consumed by the process that reads the events (or discarded,
when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal
with synchronizing writers and related issues (since this would
be merely a notification mechanism, not something that prevents
any process from continuing)? I understand that would be rather
intrusive for the current Pg multi-process design.

But other than that, this interface could in principle be
implemented in the BSDs via kqueue(), I suppose, to provide
what you need.
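The closest existing analogue of this "register interest in a directory" model on Linux is inotify, which can be reached via ctypes (constants and struct layout below follow inotify(7)). Note that inotify only reports namespace/content events; there is no writeback-error event, which is exactly the missing piece being discussed:

```python
import ctypes
import ctypes.util
import os
import struct
import tempfile

# Watch a directory and observe a close-after-write event, the rough
# shape of the notification channel described above.  A hypothetical
# writeback-failure event would arrive on the same kind of descriptor.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
IN_CLOSE_WRITE = 0x00000008           # from <sys/inotify.h>

watch_dir = tempfile.mkdtemp()
ifd = libc.inotify_init()
libc.inotify_add_watch(ifd, watch_dir.encode(), IN_CLOSE_WRITE)

# Create and close a file under the watched directory, then read the event.
with open(os.path.join(watch_dir, "segment"), "wb") as f:
    f.write(b"x")
data = os.read(ifd, 4096)
wd, mask, cookie, name_len = struct.unpack_from("iIII", data)
got_close_write = bool(mask & IN_CLOSE_WRITE)
os.close(ifd)
```

As the surrounding discussion notes, events are consumed by whoever reads the descriptor, so per-page error state would not need to persist in the kernel once delivered.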

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
> On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:
> 
> > What you seem to be asking for is the capability of dropping
> > buffers over the (kernel) fence and indemnifying the application
> > from any further responsibility, i.e. a hard assurance
> > that either the kernel will persist the pages or it will
> > keep them around till the application recovers them
> > asynchronously, the filesystem is unmounted, or the system
> > is rebooted.
> >
> 
> That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of
this design decision: mainly, how you would recover memory when no
application is interested in clearing the error. At which point
other applications with different priorities will find this position
rather unreasonable since there can be no way out of it for them.
Good luck convincing any OS kernel upstream to go with this design.

> The whole _point_ of an Operating System should be that you can do exactly
> that. As a developer I should be able to call write() and fsync() and know
> that if both calls have succeeded then the result is on disk, no matter
> what another application has done in the meantime. If that's a "difficult"
> problem then that's the OS's problem, not mine. If the OS doesn't do that,
> it's _not_doing_its_job_.

No OS kernel that I know of provides any promises for atomicity of a
write()+fsync() sequence, unless one is using O_SYNC. It doesn't
provide you with isolation either, as this is delegated to userspace,
where processes that share a file should coordinate accordingly.
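The one hard guarantee conceded here can be shown directly: with O_SYNC each write() is synchronous, so a successful return already implies the data reached stable storage and there is no deferred writeback error left to lose (a sketch, not a recommendation on performance):

```python
import os
import tempfile

# With O_SYNC, write() either persists the data (and the metadata needed
# to read it back) before returning, or it fails right there -- the
# deferred-error window the rest of the thread worries about is gone.
fd0, path = tempfile.mkstemp()
os.close(fd0)

fd = os.open(path, os.O_WRONLY | os.O_SYNC)
written = os.write(fd, b"wal record")   # durable, or raises, before returning
os.close(fd)
```

This is essentially what Pg's wal_sync_method=open_sync option does for WAL, at the cost of losing the batching benefits of buffered writeback.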

It's not a difficult problem, but rather the kernels provide a common
denominator of possible interfaces and designs that could accommodate
a wider range of potential application scenarios for which the kernel
cannot possibly anticipate requirements. There have been plenty of
experimental works for providing a transactional (ACID) filesystem
interface to applications. On the opposite end, there have been quite
a few commercial databases that completely bypass the kernel storage
stack. But I would assume it is reasonable to figure out something
between those two extremes that can work in a "portable" fashion.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 9 April 2018 at 18:50, Anthony Iliopoulos  wrote:

>
> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s were ever even buffered
> in the kernel page cache in the first place.
>





> What you seem to be asking for is the capability of dropping
> buffers over the (kernel) fence and indemnifying the application
> from any further responsibility, i.e. a hard assurance
> that either the kernel will persist the pages or it will
> keep them around till the application recovers them
> asynchronously, the filesystem is unmounted, or the system
> is rebooted.
>

That's what Pg appears to assume now, yes.

Whether that's reasonable is a whole different topic.

I'd like a middle ground where the kernel lets us register our interest and
tells us if it lost something, without us having to keep eight million FDs
open for some long period. "Tell us about anything that happens under
pgdata/" or an inotify-style per-directory-registration option. I'd even
say that's ideal.

In the mean time, I propose that we fsync() on close() before we age FDs
out of the LRU on backends. Yes, that will hurt throughput and cause
stalls, but we don't seem to have many better options. At least it'll only
flush what we actually wrote to the OS buffers not what we may have in
shared_buffers. If the bgwriter does the same thing, we should be 100% safe
from this problem on 4.13+, and it'd be trivial to make it a GUC much like
the fsync or full_page_writes options that people can turn off if they know
the risks / know their storage is safe / don't care.

Some keen person who wants to later could optimise it by adding a fsync
worker thread pool in backends, so we don't block the main thread. Frankly
that might be a nice thing to have in the checkpointer anyway. But it's out
of scope for fixing this in durability terms.

I'm partway through a patch that makes fsync panic on errors now. Once
that's done, the next step will be to force fsync on close() in md and see
how we go with that.
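The fsync-on-eviction idea can be sketched with a toy fd cache (our own illustration, not Pg's md.c): fsync before evicting, and treat fsync failure as fatal so an error cannot be silently dropped between our close() and a later re-open:

```python
import os
import tempfile

# Toy backend-side fd cache: before an fd ages out of the LRU, flush it
# and panic on failure.  This only flushes what we wrote through this fd,
# not all of shared_buffers, matching the proposal above.
class FdCache:
    def __init__(self, limit):
        self.limit = limit
        self.fds = {}                      # path -> fd, oldest first

    def get(self, path):
        if path in self.fds:
            return self.fds[path]
        if len(self.fds) >= self.limit:    # evict the oldest entry
            old_path, old_fd = next(iter(self.fds.items()))
            try:
                os.fsync(old_fd)           # flush before we lose the fd
            except OSError:
                raise SystemExit("PANIC: fsync failed while evicting fd")
            os.close(old_fd)
            del self.fds[old_path]
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        self.fds[path] = fd
        return fd

d = tempfile.mkdtemp()
cache = FdCache(limit=1)
os.write(cache.get(os.path.join(d, "a")), b"segment data")
cache.get(os.path.join(d, "b"))            # fsync()s and closes "a" first
open_fds = len(cache.fds)
```

The throughput cost comes from turning every eviction into a synchronous flush, which is why the text suggests gating it behind a GUC.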

Thoughts?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Geoff Winkless
On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:

> What you seem to be asking for is the capability of dropping
> buffers over the (kernel) fence and indemnifying the application
> from any further responsibility, i.e. a hard assurance
> that either the kernel will persist the pages or it will
> keep them around till the application recovers them
> asynchronously, the filesystem is unmounted, or the system
> is rebooted.
>

That seems like a perfectly reasonable position to take, frankly.

The whole _point_ of an Operating System should be that you can do exactly
that. As a developer I should be able to call write() and fsync() and know
that if both calls have succeeded then the result is on disk, no matter
what another application has done in the meantime. If that's a "difficult"
problem then that's the OS's problem, not mine. If the OS doesn't do that,
it's _not_doing_its_job_.

Geoff


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos  wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >> On 8 April 2018 at 04:27, Craig Ringer  wrote:
> >> > On 8 April 2018 at 10:16, Thomas Munro 
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible (according to freebsd that keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors so it is pointless to
> > keep them dirty). You'd need a specialized interface to clear-out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
> 
> Well firstly that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even unrecoverable errors for a single write
> doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment.
The contract between the block layer and the filesystem layer is
assumed to be that of, when an EIO is propagated up to the fs,
then you may assume that all possibilities for recovering have
been exhausted in lower layers of the stack. Mind you, I am not
claiming that this contract is either documented or necessarily
respected (in fact there have been studies on the error propagation
and handling of the block layer, see [1]). Let us assume that
this is the design contract though (which appears to be the case
across a number of open-source kernels), and if not - it's a bug.
In this case, indeed the specific write()s will never be able
to succeed in the future, at least not as long as the BIOs are
allocated to the specific failing LBAs.

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it. What's the
> point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases
where one may encounter localized unrecoverable medium errors
(specific to certain LBAs) that are non-maskable from the block
layer and below. That does not mean that the device is dropped
at all, so it does make sense to continue all other operations
to all other regions of the device that are functional. In cases
of total device failure, then the filesystem will prevent you
from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having
to keep the dirtied-but-unrecoverable pages around for an
unbounded time, there would also need to be an interface for
obtaining the exact failed pages so that you can read them back.
This in turn means that there needs to be an association between
the fsync() caller and the specific dirtied pages that the caller
intends to drain (for which we'd need an fsync_range(), among
other things). BTW, currently the failed writebacks are not
dropped from memory, but rather marked clean. They could still be
lost, though, due to memory pressure or an explicit request
(e.g. /proc/sys/vm/drop_caches), unless mlocked.

There is a clear responsibility of the application to keep
its buffers around until a successful fsync(). The kernels
do report the error (albeit with all the complexities of
dealing with the interface), at which point the application
may not assume that the write()s were ever even buffered
in the kernel page cache in the first place.

What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance
that either the kernel will persist the pages or it will
keep them around till the application recovers them
asynchronously, the filesystem is unmounted, or the system
is rebooted.

Best regards,
Anthony

[1] 
https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 8 April 2018 at 22:47, Anthony Iliopoulos  wrote:
> On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
>> On 8 April 2018 at 04:27, Craig Ringer  wrote:
>> > On 8 April 2018 at 10:16, Thomas Munro 
>
> The question is, what should the kernel and application do in cases
> where this is simply not possible (according to freebsd that keeps
> dirty pages around after failure, for example, -EIO from the block
> layer is a contract for unrecoverable errors so it is pointless to
> keep them dirty). You'd need a specialized interface to clear-out
> the errors (and drop the dirty pages), or potentially just remount
> the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an
unrecoverable error. And even unrecoverable errors for a single write
doesn't mean the write will never be able to succeed in the future.
But secondly doesn't such an interface already exist? When the device
is dropped any dirty pages already get dropped with it. What's the
point in dropping them but keeping the failing device?

But just to underline the point. "pointless to keep them dirty" is
exactly backwards from the application's point of view. If the error
writing to persistent media really is unrecoverable then it's all the
more critical that the pages be kept so the data can be copied to some
other device. The last thing user space expects to happen is if the
data can't be written to persistent storage then also immediately
delete it from RAM. (And the *really* last thing user space expects is
for this to happen and return no error.)

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 10:06, Andres Freund  wrote:


>
> > And in many failure modes there's no reason to expect any data loss at
> all,
> > like:
> >
> > * Local disk fills up (seems to be safe already due to space reservation
> at
> > write() time)
>
> That definitely should be treated separately.
>

It is, because all the FSes I looked at reserve space before returning from
write(), even if they do delayed allocation. So they won't fail with ENOSPC
at fsync() time or silently due to lost errors on background writeback.
Otherwise we'd be hearing a LOT more noise about this.
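The reservation behaviour can also be requested explicitly: posix_fallocate() either reserves the blocks up front or fails with ENOSPC synchronously, rather than surfacing the error later at writeback or fsync time (a sketch of the syscall, not something Pg does for heap files):

```python
import errno
import os
import tempfile

# Force up-front space reservation: after a successful posix_fallocate(),
# later writes into that range cannot fail with ENOSPC on a local
# filesystem, which is the guarantee discussed above.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 16 * 8192)   # reserve sixteen 8 kB blocks
    reserved = True
except OSError as e:
    if e.errno != errno.ENOSPC:
        raise
    reserved = False                       # disk genuinely full
os.close(fd)
```

As the next paragraphs show, this guarantee is exactly what NFS and thin-provisioned block devices fail to provide, since the space the kernel "reserved" may not exist on the backing store.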


> > * Thin-provisioned storage backing local volume iSCSI or paravirt block
> > device fills up
> > * NFS volume fills up
>
> Those should be the same as the above.
>

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from
write(), even if mounted with the sync option. So we can get ENOSPC lazily
when the buffer writeback fails due to a full backing file system. This
then travels the same paths as EIO: we fsync(), ERROR, retry, appear to
succeed, and carry on with life losing the data. Or we never hear about the
error in the first place.

(There's a proposed extension that'd allow this, see
https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5,
but I see no mention of it in fs/nfs. All the reserve_space /
xdr_reserve_space stuff seems to be related to space in protocol messages
at a quick read.)

Thin provisioned storage could vary a fair bit depending on the
implementation. But the specific failure case I saw, prompting this thread,
was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(the HBA/iSCSI/whatever was not recorded by the looks, but IIRC it was
iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux *thought*
there was plenty of space on the volume; LVM thought it had plenty of
physical extents free and unallocated, XFS thought there was tons of free
space, etc. The space exhaustion manifested as I/O errors on flushes of
writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787

The immediate cause was that Linux's multipath driver didn't seem to
recognise the sense code as retryable, so it gave up and reported it to the
next layer up (LVM). LVM and XFS both seem to think that the lower layer is
responsible for retries, so they toss the write away, and tell any
interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on
checkpoints, but it only reported an error once per relfilenode. Once it
ran out of failed relfilenodes to cause the checkpoint to ERROR, it
"completed" a "successful" checkpoint and kept on running until the
resulting corruption started to manifest itself and it segfaulted some
time later. As we've now learned, there's no guarantee we'd even get the
news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we
didn't PANIC on WAL write failure and prevent the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news
of write failures reliably), there'd have been no corruption and no data
loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be
corrupted if my thin-provisioned volume runs out of space" I'd have said
"Surely not. PostgreSQL won't be corrupted by running out of disk space, it
orders writes carefully and forces flushes so that it will recover
gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the *default* for multipath is to give up on
errors anyway; see the queue_if_no_path and no_path_retry options.
(Hint: run PostgreSQL with "no_path_retry queue"). That's a sane default
if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.


I regularly see rather a lot of multipath systems, iSCSI systems, SAN
backed systems, etc. I think we need to be pretty clear that we expect them
to retry indefinitely, and if they report an I/O error we cannot reliably
handle it. We need to patch Pg to PANIC on any fsync() failure and document
that Pg won't notice some storage failure modes that might otherwise be
considered nonfatal or transient, so very specific storage configuration
and testing is required. (Not that anyone will do it).  Also warn against
running on NFS even with 

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andres Freund
On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:
> I suspect we've written off a fair few issues in the past as "it's bad
> hardware" when actually, the hardware fault was the trigger for a Pg/kernel
> interaction bug. And blamed containers for things that weren't really the
> container's fault. But even so, if it were happening tons, we'd hear more
> noise.

Agreed on that, but I think that's FAR more likely to be things like
multixacts, index structure corruption due to logic bugs etc.


> I've already been very surprised there when I learned that PostgreSQL
> completely ignores wholly absent relfilenodes. Specifically, if you
> unlink() a relation's backing relfilenode while Pg is down and that file
> has writes pending in the WAL. We merrily re-create it with uninitialized
> pages and go on our way. As Andres pointed out in an offlist discussion,
> redo isn't a consistency check, and it's not obliged to fail in such cases.
> We can say "well, don't do that then" and define away file losses from FS
> corruption etc as not our problem, the lower levels we expect to take care
> of this have failed.

And it'd be a realy bad idea to behave differently.


> And in many failure modes there's no reason to expect any data loss at all,
> like:
> 
> * Local disk fills up (seems to be safe already due to space reservation at
> write() time)

That definitely should be treated separately.


> * Thin-provisioned storage backing local volume iSCSI or paravirt block
> device fills up
> * NFS volume fills up

Those should be the same as the above.


> I think we need to think about a more robust path in future. But it's
> certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least
two orders of magnitude.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 07:16, Andres Freund  wrote:


>
> I think the danger presented here is far smaller than some of the
> statements in this thread might make one think.


Clearly it's not happening a huge amount or we'd have a lot of noise about
Pg eating people's data, people shouting about how unreliable it is, etc.
We don't. So it's not some earth shattering imminent threat to everyone's
data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad
hardware" when actually, the hardware fault was the trigger for a Pg/kernel
interaction bug. And blamed containers for things that weren't really the
container's fault. But even so, if it were happening tons, we'd hear more
noise.

I've already been very surprised there when I learned that PostgreSQL
completely ignores wholly absent relfilenodes. Specifically, if you
unlink() a relation's backing relfilenode while Pg is down and that file
has writes pending in the WAL. We merrily re-create it with uninitialized
pages and go on our way. As Andres pointed out in an offlist discussion,
redo isn't a consistency check, and it's not obliged to fail in such cases.
We can say "well, don't do that then" and define away file losses from FS
corruption etc as not our problem, the lower levels we expect to take care
of this have failed.

We have to look at what checkpoints are and are not supposed to promise,
and whether this is a problem we just define away as "not our problem, the
lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power
loss safety ONLY and do not attempt to protect against I/O errors of any
sort. In fact, I think we should likely amend the documentation for release
versions to say just that.

In all likelihood, once
> you've got an IO error that kernel level retries don't fix, your
> database is screwed.


Your database is going to be down or have interrupted service.  It's
possible you may have some unreadable data. This could result in localised
damage to one or more relations. That could affect FK relationships,
indexes, all sorts. If you're really unlucky you might lose something
critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all,
like:

* Local disk fills up (seems to be safe already due to space reservation at
write() time)
* Thin-provisioned storage backing local volume iSCSI or paravirt block
device fills up
* NFS volume fills up
* Multipath I/O error
* Interruption of connectivity to network block device
* Disk develops localized bad sector where we haven't previously written
data

Except for the ENOSPC on NFS, all the rest of the cases can be handled by
expecting the kernel to retry forever and not return until the block is
written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel *won't* retry forever in all these
cases, and doesn't seem to have a way to ask it to in all cases.

And if the user hasn't configured it for the right behaviour in terms of
I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

Whether fsync reports that or not is really
> somewhat besides the point. We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).
>

That's because reads don't make promises about what's committed and synced.
I think that's quite different.


> We should fix things so that reported errors are treated with crash
> recovery, and for the rest I think there's very fair arguments to be
> made that that's far outside postgres's remit.
>

Certainly for current versions.

I think we need to think about a more robust path in future. But it's
certainly not "stop the world" territory.

The docs need an update to indicate that we explicitly disclaim
responsibility for I/O errors on async writes, and that the kernel and I/O
stack must be configured never to give up on buffered writes. If it does,
that's not our problem anymore.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andres Freund
Hi,

On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:
> > On Apr 8, 2018, at 16:16, Andres Freund  wrote:
> > We don't panic that way when getting IO
> > errors during reads either, and they're more likely to be persistent
> > than errors during writes (because remapping on storage layer can fix
> > issues, but not during reads).
> 
> There is a distinction to be drawn there, though, because we
> immediately pass an error back to the client on a read, but a write
> problem in this situation can be masked for an extended period of
> time.

Only if you're "lucky" enough that your clients actually read that data,
and then you're somehow able to figure out across the whole stack that
these 0.001% of transactions that fail are due to IO errors. Or you also
need to do log analysis.

If you want to solve things like that you need regular reads of all your
data, including verifications etc.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 06:29, Bruce Momjian  wrote:


>
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.


Right.

Specifically, we need a way to ask the kernel at checkpoint time "was
everything written to [this set of files] flushed successfully since the
last time I asked, no matter who did the writing and no matter how the
writes were flushed?"

If the result is "no" we PANIC and redo. If the hardware/volume is screwed,
the user can fail over to a standby, do PITR, etc.

But we don't have any way to ask that reliably at present.
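To make the gap concrete, here is a minimal sketch (not PostgreSQL source, and with hypothetical file names) of the write-from-one-process, fsync-from-another pattern a checkpointer relies on. On a healthy filesystem it works; the danger discussed in this thread is that a writeback error occurring before the second descriptor is opened may never be reported to its fsync() at all:

```python
# Sketch: "backend" dirties a file and closes it; "checkpointer" later
# opens the same file and fsyncs, assuming that covers all prior writes.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "relfile")  # stand-in for a relfilenode

# "Backend" writes a page and closes its descriptor.
fd1 = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
os.write(fd1, b"\x00" * 8192)
os.close(fd1)

# "Checkpointer" opens the same file and fsyncs it.  It is asking "is
# everything written to this file durable?", but the kernel only promises
# to report errors observed for this open file description -- an earlier
# writeback failure may already have been consumed or dropped.
fd2 = os.open(path, os.O_WRONLY)
os.fsync(fd2)   # success here does NOT prove the earlier writeback succeeded
os.close(fd2)
print("checkpoint-style fsync returned success")
```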

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 05:28, Christophe Pettus  wrote:

>
> > On Apr 8, 2018, at 14:23, Greg Stark  wrote:
> >
> > They consider dirty filesystem buffers when there's
> > hardware failure preventing them from being written "a memory leak".
>
> That's not an irrational position.  File system buffers are *not*
> dedicated memory for file system caching; they're being used for that
> because no one has a better use for them at that moment.  If an inability
> to flush them to disk meant that they suddenly became pinned memory, a
> large copy operation to a yanked USB drive could result in the system
> having no more allocatable memory.  I guess in theory that they could swap
> them, but swapping out a file system buffer in hopes that sometime in the
> future it could be properly written doesn't seem very architecturally sound
> to me.
>

Yep.

Another example is a write to an NFS or iSCSI volume that goes away
forever. What if the app keeps write()ing in the hopes it'll come back, and
by the time the kernel starts reporting EIO for write(), it's already
saddled with a huge volume of dirty writeback buffers it can't get rid of
because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file
system goes away. But actually, why is that ok? What if it's remounted
again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but
what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your
app triggered I/O errors over vast numbers of small files? Again, the
kernel's left holding the ball.

It doesn't know if/when an app will return to check. It doesn't know how
long to remember the failure for. It doesn't know when all interested
clients have been informed and it can treat the fault as cleared/repaired,
either, so it'd have to *keep on reporting EIO for PostgreSQL's own writes
and fsyncs() indefinitely*, even once we do recovery.

The only way it could avoid that would be to keep the dirty writeback pages
around and flagged bad, then clear the flag when a new write() replaces the
same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state
forever for us, then forget it when we expect, all without actually telling
it anything about our activities or even that we still exist and are still
interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.

Even if we can somehow convince the kernel folks to add a new interface for
us that reports I/O errors to some listener, like an
inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting
errors in buffered async writes, we won't be able to rely on having it for
5-10 years, and only on Linux.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 16:16, Andres Freund  wrote:
> We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass 
an error back to the client on a read, but a write problem in this situation 
can be masked for an extended period of time.

That being said...

> There's a lot of not so great things here, but I don't think there's any
> need to panic.

No reason to panic, yes.  We can assume that if this was a very big persistent 
problem, it would be much more widely reported.  It would, however, be good to 
find a way to get the error surfaced back up to the client in a way that is not 
just monitoring the kernel logs.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andres Freund
On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:
> On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
> >
> > > On Apr 8, 2018, at 03:30, Craig Ringer 
> > > wrote:
> > >
> > > These are way more likely than bit flips or other storage level
> > > corruption, and things that we previously expected to detect and
> > > fail gracefully for.
> >
> > This is definitely bad, and it explains a few otherwise-inexplicable
> > corruption issues we've seen.  (And great work tracking it down!)  I
> > think it's important not to panic, though; PostgreSQL doesn't have a
> > reputation for horrible data integrity.  I'm not sure it makes sense
> > to do a major rearchitecting of the storage layer (especially with
> > pluggable storage coming along) to address this.  While the failure
> > modes are more common, the solution (a PITR backup) is one that an
> > installation should have anyway against media failures.
> 
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.  If we could stop Postgres
> when such errors happen, at least the administrator could fix the
> problem of fail-over to a standby.
> 
> A crazy idea would be to have a daemon that checks the logs and stops
> Postgres when it sees something wrong.

I think the danger presented here is far smaller than some of the
statements in this thread might make one think. In all likelihood, once
you've got an IO error that kernel level retries don't fix, your
database is screwed. Whether fsync reports that or not is really
somewhat besides the point. We don't panic that way when getting IO
errors during reads either, and they're more likely to be persistent
than errors during writes (because remapping on storage layer can fix
issues, but not during reads).

There's a lot of not so great things here, but I don't think there's any
need to panic.

We should fix things so that reported errors are treated with crash
recovery, and for the rest I think there's very fair arguments to be
made that that's far outside postgres's remit.

I think there's pretty good reasons to go to direct IO where supported,
but error handling doesn't strike me as a particularly good reason for
the move.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 15:29, Bruce Momjian  wrote:
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.

Yeah, it's bad.  In the short term, the best advice to installations is to 
monitor their kernel logs for errors (which very few do right now), and make 
sure they have a backup strategy which can encompass restoring from an error 
like this.  Even Craig's smart fix of patching the backup label to recover from 
a previous checkpoint doesn't do much good if we don't have WAL records back 
that far (or one of the required WAL records also took a hit).

In the longer term... O_DIRECT seems like the most plausible way out of this, 
but that might not be popular with people running on file systems or OSes that 
don't have this issue.  (Setting aside the daunting prospect of implementing 
that.)
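As an illustration of what the O_DIRECT route implies for the write path, here is a small sketch (an assumption-laden example, not a proposed implementation): buffers must be aligned and sized to the device block size, the page cache is bypassed, and a failure is reported to this write() call rather than to some later fsync(). Not every filesystem supports O_DIRECT, so the sketch falls back to buffered I/O if the open is refused:

```python
# Sketch: direct, synchronous write with an aligned buffer.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "directfile")
flags = os.O_CREAT | os.O_WRONLY | getattr(os, "O_DIRECT", 0) | os.O_SYNC

try:
    fd = os.open(path, flags, 0o600)
except OSError:
    # Filesystem refused O_DIRECT (e.g. tmpfs); retry buffered so the
    # sketch still runs.  A real server would treat this differently.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_SYNC, 0o600)

buf = mmap.mmap(-1, 4096)      # anonymous mmap gives a page-aligned buffer
buf.write(b"\x42" * 4096)
n = os.write(fd, buf)          # an error (ENOSPC, EIO) surfaces right here
os.close(fd)
print("wrote", n, "bytes")
```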

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Bruce Momjian
On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
>
> > On Apr 8, 2018, at 03:30, Craig Ringer 
> > wrote:
> >
> > These are way more likely than bit flips or other storage level
> > corruption, and things that we previously expected to detect and
> > fail gracefully for.
>
> This is definitely bad, and it explains a few otherwise-inexplicable
> corruption issues we've seen.  (And great work tracking it down!)  I
> think it's important not to panic, though; PostgreSQL doesn't have a
> reputation for horrible data integrity.  I'm not sure it makes sense
> to do a major rearchitecting of the storage layer (especially with
> pluggable storage coming along) to address this.  While the failure
> modes are more common, the solution (a PITR backup) is one that an
> installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping
Postgres at the time the kernel reports the errors to the kernel log, so
we are then returning potentially incorrect results and committing
transactions that might be wrong or lost.  If we could stop Postgres
when such errors happen, at least the administrator could fix the
problem of fail-over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops
Postgres when it sees something wrong.
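For what it's worth, the log-watching daemon could be sketched roughly like this. The error patterns and the stop command below are illustrative assumptions, not a vetted list of everything the kernel can emit:

```python
# Sketch: scan kernel log lines for writeback-failure messages and
# hard-stop the server when one appears (no shutdown checkpoint).
import re
import subprocess

# Hypothetical patterns; real kernel messages vary by driver and version.
ERROR_PATTERNS = re.compile(
    r"(I/O error|Buffer I/O error|lost page write|EIO)", re.IGNORECASE)

def scan_log_lines(lines):
    """Return the first line that looks like a writeback failure, else None."""
    for line in lines:
        if ERROR_PATTERNS.search(line):
            return line
    return None

def handle(log_lines, stop_cmd=("pg_ctl", "stop", "-m", "immediate")):
    """If a suspicious line is found, hard-kill the server and return it."""
    hit = scan_log_lines(log_lines)
    if hit is not None:
        # Immediate mode: do NOT allow a shutdown checkpoint to "succeed"
        # against storage that has already eaten writes.
        subprocess.run(stop_cmd, check=False)
    return hit

print(scan_log_lines(
    ["kernel: Buffer I/O error on dev dm-0, logical block 123"]))
```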

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Anthony Iliopoulos
On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> On 8 April 2018 at 04:27, Craig Ringer  wrote:
> > On 8 April 2018 at 10:16, Thomas Munro 
> > wrote:
> >
> > If the kernel does writeback in the middle, how on earth is it supposed to
> > know we expect to reopen the file and check back later?
> >
> > Should it just remember "this file had an error" forever, and tell every
> > caller? In that case how could we recover? We'd need some new API to say
> > "yeah, ok already, I'm redoing all my work since the last good fsync() so
> > you can clear the error flag now". Otherwise it'd keep reporting an error
> > after we did redo to recover, too.
> 
> There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
> track of any errors. We just need fsync to accurately report whether
> all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been
written out, due to reasons outlined before. As such it may make
some sense to maintain some tracking regarding errors even after
marking failed dirty pages as clean (in fact it has been proposed,
but this introduces memory overhead).

> again the kernel needs to initiate i/o on all the dirty buffers and
> block until they complete successfully. If they complete successfully
> then nobody cares whether they had some failure in the past when i/o
> was initiated at some point in the past.

The question is, what should the kernel and application do in cases
where this is simply not possible (according to freebsd that keeps
dirty pages around after failure, for example, -EIO from the block
layer is a contract for unrecoverable errors so it is pointless to
keep them dirty). You'd need a specialized interface to clear-out
the errors (and drop the dirty pages), or potentially just remount
the filesystem.

> The problem is not that errors aren't being tracked correctly. The
> problem is that dirty buffers are being marked clean when they haven't
> been written out. They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".
> 
> As long as any error means the kernel has discarded writes then
> there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards writes
or not would not really help (see above). It is more a matter of
proper "reporting contract" between userspace and kernel, and tracking
would be a way for facilitating this vs. having a more complex userspace
scheme (as described by others in this thread) where synchronization
for fsync() is required in a multi-process application.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 14:23, Greg Stark  wrote:
> 
> They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".

That's not an irrational position.  File system buffers are *not* dedicated 
memory for file system caching; they're being used for that because no one has 
a better use for them at that moment.  If an inability to flush them to disk 
meant that they suddenly became pinned memory, a large copy operation to a 
yanked USB drive could result in the system having no more allocatable memory.  
I guess in theory that they could swap them, but swapping out a file system 
buffer in hopes that sometime in the future it could be properly written 
doesn't seem very architecturally sound to me.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Greg Stark
On 8 April 2018 at 04:27, Craig Ringer  wrote:
> On 8 April 2018 at 10:16, Thomas Munro 
> wrote:
>
> If the kernel does writeback in the middle, how on earth is it supposed to
> know we expect to reopen the file and check back later?
>
> Should it just remember "this file had an error" forever, and tell every
> caller? In that case how could we recover? We'd need some new API to say
> "yeah, ok already, I'm redoing all my work since the last good fsync() so
> you can clear the error flag now". Otherwise it'd keep reporting an error
> after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
track of any errors. We just need fsync to accurately report whether
all the buffers in the file have been written out. When you call fsync
again the kernel needs to initiate i/o on all the dirty buffers and
block until they complete successfully. If they complete successfully
then nobody cares whether they had some failure in the past when i/o
was initiated at some point in the past.
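The retry behaviour under discussion can be sketched as follows. This loop encodes the assumption Greg describes: that a failed fsync leaves the dirty pages dirty, so a later successful fsync means everything reached disk. On the Linux behaviour reported in this thread, the kernel marks the pages clean after the failed attempt, so the "successful" retry proves nothing about the lost writes:

```python
# Sketch of the fsync()-retry pattern the thread concludes is unsafe.
import errno
import os
import tempfile
import time

def fsync_with_retry(fd, attempts=3, delay=0.1):
    """Naive retry loop -- the anti-pattern under discussion."""
    for _ in range(attempts):
        try:
            os.fsync(fd)
            return True   # "success" -- possibly AFTER data was dropped
        except OSError as e:
            if e.errno != errno.EIO:
                raise
            time.sleep(delay)  # hope the error was transient and retry
    return False

fd = os.open(os.path.join(tempfile.mkdtemp(), "f"),
             os.O_CREAT | os.O_RDWR, 0o600)
os.write(fd, b"data")
print(fsync_with_retry(fd))   # True on a healthy filesystem
```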

The problem is not that errors aren't being tracked correctly. The
problem is that dirty buffers are being marked clean when they haven't
been written out. They consider dirty filesystem buffers when there's
hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then
there's no real hope of any reliable operation through that interface.

Going to DIRECTIO is basically recognizing this. That the kernel
filesystem buffer provides no reliable interface so we need to
reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we
have the added barrier that we don't have as much information about
the hardware as the kernel has. We don't know where raid stripes begin
and end, how big the memory controller buffers are or how to tell when
they're full or empty or how to flush them. etc etc. We also don't
know what else is going on on the machine.

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 03:30, Craig Ringer  wrote:
> 
> These are way more likely than bit flips or other storage level corruption, 
> and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption 
issues we've seen.  (And great work tracking it down!)  I think it's important 
not to panic, though; PostgreSQL doesn't have a reputation for horrible data 
integrity.  I'm not sure it makes sense to do a major rearchitecting of the 
storage layer (especially with pluggable storage coming along) to address this. 
 While the failure modes are more common, the solution (a PITR backup) is one 
that an installation should have anyway against media failures.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 8 April 2018 at 17:41, Andreas Karlsson  wrote:

> On 04/08/2018 05:27 AM, Craig Ringer wrote:
>
>> More below, but here's an idea #5: decide InnoDB has the right idea, and
>> go to using a single massive blob file, or a few giant blobs.
>
>
> FYI: MySQL has by default one file per table these days. The old approach
> with one massive file was a maintenance headache so they changed the default
> some releases ago.
>
> https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html
>
>
Huh, thanks for the update.

We should see how they handle reliable flushing and see if they've looked
into it. If they haven't, we should give them a heads-up and if they have,
lets learn from them.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 8 April 2018 at 11:46, Christophe Pettus  wrote:

>
> > On Apr 7, 2018, at 20:27, Craig Ringer  wrote:
> >
> > Right now I think we're at option (4): If you see anything that smells
> like a write error in your kernel logs, hard-kill postgres with -m
> immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint
> since the logs, fake up a backup label to force redo to start from the last
> checkpoint before the error. Otherwise, it's safe to just let it start up
> again and do redo again.
>
> Before we spiral down into despair and excessive alcohol consumption, this
> is basically the same situation as a checksum failure or some other kind of
> uncorrected media-level error.  The bad part is that we have to find out
> from the kernel logs rather than from PostgreSQL directly.  But this does
> not strike me as otherwise significantly different from, say, an
> infrequently-accessed disk block reporting an uncorrectable error when we
> finally get around to reading it.
>

I don't entirely agree - because it affects ENOSPC, I/O errors on thin
provisioned storage, I/O errors on multipath storage, etc. (I identified
the original issue on a thin provisioned system that ran out of backing
space, mangling PostgreSQL in a way that made no sense at the time).

These are way more likely than bit flips or other storage level corruption,
and things that we previously expected to detect and fail gracefully for.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andreas Karlsson
On 04/08/2018 05:27 AM, Craig Ringer wrote:

> More below, but here's an idea #5: decide InnoDB has the right idea, and
> go to using a single massive blob file, or a few giant blobs.


FYI: MySQL has by default one file per table these days. The old 
approach with one massive file was a maintenance headache so they changed 
the default some releases ago.


https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html

Andreas



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-07 Thread Christophe Pettus

> On Apr 7, 2018, at 20:27, Craig Ringer  wrote:
> 
> Right now I think we're at option (4): If you see anything that smells like a 
> write error in your kernel logs, hard-kill postgres with -m immediate (do NOT 
> let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake 
> up a backup label to force redo to start from the last checkpoint before the 
> error. Otherwise, it's safe to just let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is 
basically the same situation as a checksum failure or some other kind of 
uncorrected media-level error.  The bad part is that we have to find out from 
the kernel logs rather than from PostgreSQL directly.  But this does not strike 
me as otherwise significantly different from, say, an infrequently-accessed 
disk block reporting an uncorrectable error when we finally get around to 
reading it.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-07 Thread Peter Geoghegan
On Sat, Apr 7, 2018 at 8:27 PM, Craig Ringer  wrote:
> More below, but here's an idea #5: decide InnoDB has the right idea, and go
> to using a single massive blob file, or a few giant blobs.
>
> We have a storage abstraction that makes this way, way less painful than it
> should be.
>
> We can virtualize relfilenodes into storage extents in relatively few big
> files. We could use sparse regions to make the addressing more convenient,
> but that makes copying and backup painful, so I'd rather not.
>
> Even one file per tablespace for persistent relation heaps, another for
> indexes, another for each fork type.

I'm not sure that we can do that now, since it would break the new
"Optimize btree insertions for common case of increasing values"
optimization. (I did mention this before it went in.)

I've asked Pavan to at least add a note to the nbtree README that
explains the high level theory behind the optimization, as part of
post-commit clean-up. I'll ask him to say something about how it might
affect extent-based storage, too.

-- 
Peter Geoghegan


