Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2018-04-08 Thread Kyotaro HORIGUCHI
At Mon, 9 Apr 2018 13:59:45 +0900, Michael Paquier  wrote 
in <20180409045945.gb1...@paquier.xyz>
> On Mon, Apr 09, 2018 at 01:26:54PM +0900, Kyotaro HORIGUCHI wrote:
> > Hello, I added this as Older Bugs in Open items. (I believe it's
> > legit.)
> 
> Yes, I think that's appropriate.  I am hesitating to do the same thing with
> all the other patches marked as bug fixes.

Thanks. I'm also not going to do so for the other "bug fix"es. The
reasons for this patch are that I myself recently had a suspected case
(on a customer site) and the patch is not so large. (This doesn't
mean it is easy, of course.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Fix pg_rewind which can be run as root user

2018-04-08 Thread Michael Paquier
Hi all,

I was just going through pg_rewind's code, and noticed the following
pearl:
/*
 * Don't allow pg_rewind to be run as root, to avoid overwriting the
 * ownership of files in the data directory. We need only check for root
 * -- any other user won't have sufficient permissions to modify files in
 * the data directory.
 */
#ifndef WIN32
	if (geteuid() == 0)
	{
		fprintf(stderr, _("cannot be executed by \"root\"\n"));
		fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
				progname);
	}
#endif

While it's nice to inform the user about the problem, that actually
does not prevent pg_rewind from running as root.  Attached is a patch,
which needs a back-patch down to 9.5.

Thanks,
--
Michael
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b9ea6a4c21..a1ab13963a 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -208,6 +208,7 @@ main(int argc, char **argv)
 		fprintf(stderr, _("cannot be executed by \"root\"\n"));
 		fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
 progname);
+		exit(1);
 	}
 #endif
 


signature.asc
Description: PGP signature


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2018-04-08 Thread Kyotaro HORIGUCHI
Hello, I added this as Older Bugs in Open items. (I believe it's legit.)

The latest patch still applies on the HEAD with some offsets.

At Tue, 23 Jan 2018 18:50:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180123.185000.232069302.horiguchi.kyot...@lab.ntt.co.jp>
> At Fri, 19 Jan 2018 18:24:56 +0900, Michael Paquier 
>  wrote in <20180119092456.ga1...@paquier.xyz>
> > On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> > > On the other hand if one logical record must be read from a single
> > > source, we need some means to deter WAL recycling on the
> > > primary side. Doing that within the primary side was rejected
> > > by consensus.
> > 
> > There is this approach of extending the message protocol as well so as
> > the primary can retain the segments it needs to keep around...
> 
> I haven't seen it in detail but it seems to mean that it is done
> by notifying something from the standby to the primary, without
> the standby side knowing what a WAL record is.
> 
> > > (There might be smarter way to do that, though.) To
> > > do that from the standby side, just retarding write feedbacks or
> > > this kind of special feedback would be needed. In any way it is
> > > necessary to introduce WAL-record awareness into WAL shipping
> > > process and it's seemingly large complexity.
> > 
> > Coming to logical slots, don't you need to be aware of the beginning of
> > the record on the primary to perform correctly decoding of tuples
> > through time travel? If the record is present across segments, it seems
> 
> I'm not sure what the time travel is, but the required segments
> seem to be kept safely under logical replication since the publisher
> moves restart_lsn not based on finished commits' LSN, but on the
> beginning LSN of running transactions. In other words, logical
> decoding doesn't need to track each record for the purpose of
> keeping segments since it already keeps track of the beginning
> of a transaction. Physical replication is totally unaware of
> WAL records; it just copies blobs in a convenient unit and only
> cares about LSNs. But that ignorance seems required to keep performance.
> 
> > to me that it matters. Andres knows the answer to that for sure, so I
> > would expect to be counter-argued in the next 12 hours or so.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] path toward faster partition pruning

2018-04-08 Thread David Rowley
While looking at the docs in [1], I saw that we still mention:

4. Ensure that the constraint_exclusion configuration parameter is not
disabled in postgresql.conf. If it is, queries will not be optimized
as desired.

This is no longer true. The attached patch removes it.

[1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Remove-mention-of-constraint_exclusion-in-partitioni.patch
Description: Binary data


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 10:06, Andres Freund  wrote:


>
> > And in many failure modes there's no reason to expect any data loss at
> all,
> > like:
> >
> > * Local disk fills up (seems to be safe already due to space reservation
> at
> > write() time)
>
> That definitely should be treated separately.
>

It is, because all the FSes I looked at reserve space before returning from
write(), even if they do delayed allocation. So they won't fail with ENOSPC
at fsync() time or silently due to lost errors on background writeback.
Otherwise we'd be hearing a LOT more noise about this.


> > * Thin-provisioned storage backing local volume iSCSI or paravirt block
> > device fills up
> > * NFS volume fills up
>
> Those should be the same as the above.
>

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from
write(), even if mounted with the sync option. So we can get ENOSPC lazily
when the buffer writeback fails due to a full backing file system. This
then travels the same paths as EIO: we fsync(), ERROR, retry, appear to
succeed, and carry on with life losing the data. Or we never hear about the
error in the first place.

(There's a proposed extension that'd allow this, see
https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5,
but I see no mention of it in fs/nfs. All the reserve_space /
xdr_reserve_space stuff seems to be related to space in protocol messages
at a quick read.)

Thin provisioned storage could vary a fair bit depending on the
implementation. But the specific failure case I saw, prompting this thread,
was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(the HBA/iSCSI/whatever was not recorded by the looks of it, but IIRC it
was iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux *thought*
there was plenty of space on the volume; LVM thought it had plenty of
physical extents free and unallocated, XFS thought there was tons of free
space, etc. The space exhaustion manifested as I/O errors on flushes of
writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787

The immediate cause was that Linux's multipath driver didn't seem to
recognise the sense code as retryable, so it gave up and reported it to the
next layer up (LVM). LVM and XFS both seem to think that the lower layer is
responsible for retries, so they toss the write away, and tell any
interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on
checkpoints, but it only reported an error once per relfilenode. Once it
ran out of failed relfilenodes to cause the checkpoint to ERROR, it
"completed" a "successful" checkpoint and kept on running until the
resulting corruption started to manifest itself and it segfaulted some
time later. As we've now learned, there's no guarantee we'd even get the
news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we
didn't PANIC on WAL write failure and prevent the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news
of write failures reliably), there'd have been no corruption and no data
loss despite the underlying storage issue.
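The PANIC behaviour Craig describes can be sketched in miniature. The
following standalone C sketch is illustrative only (the function name is
made up; it is not PostgreSQL's actual mdsync machinery): it treats the
first fsync() failure during a checkpoint as fatal rather than retrying,
since on Linux a retried fsync() can report success after the kernel has
already discarded the dirty pages.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative sketch only: flush one file during a checkpoint and treat
 * any fsync() failure as fatal.  Retrying is unsafe because the kernel
 * may clear the error state after reporting it once, so a second fsync()
 * can "succeed" even though the data never reached stable storage.
 */
static void
checkpoint_fsync_or_panic(const char *path)
{
	int		fd = open(path, O_RDWR);

	if (fd < 0)
	{
		fprintf(stderr, "PANIC: could not open \"%s\": %s\n",
				path, strerror(errno));
		exit(1);
	}
	if (fsync(fd) != 0)
	{
		/* Do NOT retry: force crash recovery from WAL instead. */
		fprintf(stderr, "PANIC: fsync of \"%s\" failed: %s\n",
				path, strerror(errno));
		exit(1);
	}
	close(fd);
}
```

On a healthy volume this is a no-op beyond the fsync() cost; the point is
that on failure it forces WAL replay instead of "completing" a checkpoint
whose data may be gone.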

If, prior to seeing this, you'd asked me "will my PostgreSQL database be
corrupted if my thin-provisioned volume runs out of space" I'd have said
"Surely not. PostgreSQL won't be corrupted by running out of disk space, it
orders writes carefully and forces flushes so that it will recover
gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the *default* for multipath is to give up on
errors anyway; see the queue_if_no_path and no_path_retry options.
(Hint: run PostgreSQL with no_path_retry set to queue). That's a sane default if
you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.
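For reference, the queueing behaviour recommended above is configured in
/etc/multipath.conf. This fragment is a hedged sketch based on
dm-multipath's documented no_path_retry option; verify the exact syntax
against your distribution's multipath.conf(5).

```
defaults {
    # Queue I/O indefinitely when all paths are lost, instead of
    # failing writes up the stack where buffered writeback may
    # silently drop them.
    no_path_retry    queue
}
```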


I regularly see rather a lot of multipath systems, iSCSI systems, SAN
backed systems, etc. I think we need to be pretty clear that we expect them
to retry indefinitely, and if they report an I/O error we cannot reliably
handle it. We need to patch Pg to PANIC on any fsync() failure and document
that Pg won't notice some storage failure modes that might otherwise be
considered nonfatal or transient, so very specific storage configuration
and testing is required. (Not that anyone will do it).  Also warn against
running on NFS even with 

Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-08 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Apr 08, 2018 at 06:35:39PM +, Tom Lane wrote:
>> Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

> This patch or one of its relatives has visibly broken parallel builds
> for me. "make -j 4 install" directly called after a configure complains:

Hm.  I'd tested "make -j all", but not going directly to "install".
Does it help if you add

$(SUBDIRS:%=install-%-recurse): | submake-generated-headers

to src/Makefile?

(That seems like a bit of a brute-force solution, though.  Anybody
have a better answer?)

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andres Freund
On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:
> I suspect we've written off a fair few issues in the past as "it's bad
> hardware" when actually, the hardware fault was the trigger for a Pg/kernel
> interaction bug. And blamed containers for things that weren't really the
> container's fault. But even so, if it were happening tons, we'd hear more
> noise.

Agreed on that, but I think that's FAR more likely to be things like
multixacts, index structure corruption due to logic bugs etc.


> I've already been very surprised when I learned that PostgreSQL
> completely ignores wholly absent relfilenodes. Specifically, if you
> unlink() a relation's backing relfilenode while Pg is down and that file
> has writes pending in the WAL, we merrily re-create it with uninitialized
> pages and go on our way. As Andres pointed out in an offlist discussion,
> redo isn't a consistency check, and it's not obliged to fail in such cases.
> We can say "well, don't do that then" and define away file losses from FS
> corruption etc as not our problem, the lower levels we expect to take care
> of this have failed.

And it'd be a really bad idea to behave differently.


> And in many failure modes there's no reason to expect any data loss at all,
> like:
> 
> * Local disk fills up (seems to be safe already due to space reservation at
> write() time)

That definitely should be treated separately.


> * Thin-provisioned storage backing local volume iSCSI or paravirt block
> device fills up
> * NFS volume fills up

Those should be the same as the above.


> I think we need to think about a more robust path in future. But it's
> certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least
two orders of magnitude.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 07:16, Andres Freund  wrote:


>
> I think the danger presented here is far smaller than some of the
> statements in this thread might make one think.


Clearly it's not happening a huge amount or we'd have a lot of noise about
Pg eating people's data, people shouting about how unreliable it is, etc.
We don't. So it's not some earth shattering imminent threat to everyone's
data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad
hardware" when actually, the hardware fault was the trigger for a Pg/kernel
interaction bug. And blamed containers for things that weren't really the
container's fault. But even so, if it were happening tons, we'd hear more
noise.

I've already been very surprised when I learned that PostgreSQL
completely ignores wholly absent relfilenodes. Specifically, if you
unlink() a relation's backing relfilenode while Pg is down and that file
has writes pending in the WAL, we merrily re-create it with uninitialized
pages and go on our way. As Andres pointed out in an offlist discussion,
redo isn't a consistency check, and it's not obliged to fail in such cases.
We can say "well, don't do that then" and define away file losses from FS
corruption etc as not our problem, the lower levels we expect to take care
of this have failed.

We have to look at what checkpoints are and are not supposed to promise,
and whether this is a problem we just define away as "not our problem, the
lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power
loss safety ONLY and do not attempt to protect against I/O errors of any
sort. In fact, I think we should likely amend the documentation for release
versions to say just that.

In all likelihood, once
> you've got an IO error that kernel level retries don't fix, your
> database is screwed.


Your database is going to be down or have interrupted service.  It's
possible you may have some unreadable data. This could result in localised
damage to one or more relations. That could affect FK relationships,
indexes, all sorts. If you're really unlucky you might lose something
critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all,
like:

* Local disk fills up (seems to be safe already due to space reservation at
write() time)
* Thin-provisioned storage backing local volume iSCSI or paravirt block
device fills up
* NFS volume fills up
* Multipath I/O error
* Interruption of connectivity to network block device
* Disk develops localized bad sector where we haven't previously written
data

Except for the ENOSPC on NFS, all the rest of the cases can be handled by
expecting the kernel to retry forever and not return until the block is
written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel *won't* retry forever in all these
cases, and doesn't seem to have a way to ask it to in all cases.

And if the user hasn't configured it for the right behaviour in terms of
I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

Whether fsync reports that or not is really
> somewhat beside the point. We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).
>

That's because reads don't make promises about what's committed and synced.
I think that's quite different.


> We should fix things so that reported errors are treated with crash
> recovery, and for the rest I think there's very fair arguments to be
> made that that's far outside postgres's remit.
>

Certainly for current versions.

I think we need to think about a more robust path in future. But it's
certainly not "stop the world" territory.

The docs need an update to indicate that we explicitly disclaim
responsibility for I/O errors on async writes, and that the kernel and I/O
stack must be configured never to give up on buffered writes. If it does,
that's not our problem anymore.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andres Freund
Hi,

On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:
> > On Apr 8, 2018, at 16:16, Andres Freund  wrote:
> > We don't panic that way when getting IO
> > errors during reads either, and they're more likely to be persistent
> > than errors during writes (because remapping on storage layer can fix
> > issues, but not during reads).
> 
> There is a distinction to be drawn there, though, because we
> immediately pass an error back to the client on a read, but a write
> problem in this situation can be masked for an extended period of
> time.

Only if you're "lucky" enough that your clients actually read that data,
and then you're somehow able to figure out across the whole stack that
these 0.001% of transactions that fail are due to IO errors. Or you also
need to do log analysis.

If you want to solve things like that you need regular reads of all your
data, including verifications etc.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 06:29, Bruce Momjian  wrote:


>
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.


Right.

Specifically, we need a way to ask the kernel at checkpoint time "was
everything written to [this set of files] flushed successfully since the
last time I asked, no matter who did the writing and no matter how the
writes were flushed?"

If the result is "no" we PANIC and redo. If the hardware/volume is screwed,
the user can fail over to a standby, do PITR, etc.

But we don't have any way to ask that reliably at present.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 9 April 2018 at 05:28, Christophe Pettus  wrote:

>
> > On Apr 8, 2018, at 14:23, Greg Stark  wrote:
> >
> > They consider dirty filesystem buffers when there's
> > hardware failure preventing them from being written "a memory leak".
>
> That's not an irrational position.  File system buffers are *not*
> dedicated memory for file system caching; they're being used for that
> because no one has a better use for them at that moment.  If an inability
> to flush them to disk meant that they suddenly became pinned memory, a
> large copy operation to a yanked USB drive could result in the system
> having no more allocatable memory.  I guess in theory that they could swap
> them, but swapping out a file system buffer in hopes that sometime in the
> future it could be properly written doesn't seem very architecturally sound
> to me.
>

Yep.

Another example is a write to an NFS or iSCSI volume that goes away
forever. What if the app keeps write()ing in the hopes it'll come back, and
by the time the kernel starts reporting EIO for write(), it's already
saddled with a huge volume of dirty writeback buffers it can't get rid of
because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file
system goes away. But actually, why is that ok? What if it's remounted
again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but
what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your
app triggered I/O errors over vast numbers of small files? Again, the
kernel's left holding the ball.

It doesn't know if/when an app will return to check. It doesn't know how
long to remember the failure for. It doesn't know when all interested
clients have been informed and it can treat the fault as cleared/repaired,
either, so it'd have to *keep on reporting EIO for PostgreSQL's own writes
and fsyncs() indefinitely*, even once we do recovery.

The only way it could avoid that would be to keep the dirty writeback pages
around and flagged bad, then clear the flag when a new write() replaces the
same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state
forever for us, then forget it when we expect, all without actually telling
it anything about our activities or even that we still exist and are still
interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.
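The "closed the files and gone away" pattern is easy to see in
miniature. This hedged sketch mimics what a checkpointer effectively
does when the data was written through a different, since-closed file
descriptor: the final fsync() runs on a descriptor opened after the
write, so on kernels that attach writeback errors to an open file
description, a failure in the window between close() and open() may
never be reported to anyone.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hedged illustration: write through one descriptor, close it, then
 * fsync() through a fresh descriptor, as a checkpointer effectively
 * does.  Returns the result of the final fsync().  If writeback failed
 * while no descriptor was open, many kernels have no open file
 * description to attach the error to, and this fsync() can still
 * return 0.
 */
static int
write_close_then_fsync(const char *path)
{
	int		rc;
	int		fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0 || write(fd, "hello", 5) != 5)
		return -1;
	close(fd);					/* the writer has "gone away" */

	fd = open(path, O_RDWR);	/* fresh file description, much later */
	if (fd < 0)
		return -1;
	rc = fsync(fd);				/* may miss errors from the closed fd's writes */
	close(fd);
	return rc;
}
```

On a healthy file system this returns 0; the interesting (and
undetectable, from here) case is when it returns 0 despite a lost
writeback in between.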

Even if we can somehow convince the kernel folks to add a new interface for
us that reports I/O errors to some listener, like an
inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting
errors in buffered async writes, we won't be able to rely on having it for
5-10 years, and only on Linux.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Warnings and uninitialized variables in TAP tests

2018-04-08 Thread Michael Paquier
Hi all,

While looking at the output of the TAP tests, I have seen warnings like
the following:
Use of uninitialized value $target_lsn in concatenation (.) or string at
/home/foo/git/postgres/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pm
line 1540.
[...]
./src/bin/pg_basebackup/tmp_check/log/regress_log_010_pg_basebackup:Use
of uninitialized value $str in print at
/home/foo/git/postgres/src/bin/pg_basebackup/../../../src/test/perl/TestLib.pm
line 244.

The first one shows up something like 30 times, and the second only once.

Attached is a patch to clean up all of these.  To find them, I simply
ran the following and they pop up:
find . -name "regress_log*" | xargs grep -i "uninitialized"

What I have found is harmless, but the warnings still pollute the logs.

Thanks,
--
Michael
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f502a2e3c7..0cd510eeea 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -48,7 +48,7 @@ ok(!-d "$tempdir/backup", 'backup directory was cleaned up');
 # but leave the data directory behind
 mkdir("$tempdir/backup")
 	or BAIL_OUT("unable to create $tempdir/backup");
-append_to_file("$tempdir/backup/dir-not-empty.txt");
+append_to_file("$tempdir/backup/dir-not-empty.txt", "Some data");
 
 $node->command_fails([ 'pg_basebackup', '-D', "$tempdir/backup", '-n' ],
 	'failing run with no-clean option');
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1bd91524d7..5a8f084efe 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1541,7 +1541,7 @@ sub wait_for_catchup
 	  . $standby_name . "'s "
 	  . $mode
 	  . "_lsn to pass "
-	  . $target_lsn . " on "
+	  . $lsn_expr . " on "
 	  . $self->name . "\n";
 	my $query =
 qq[SELECT $lsn_expr <= ${mode}_lsn FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';];




Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-08 Thread Michael Paquier
On Sun, Apr 08, 2018 at 06:35:39PM +, Tom Lane wrote:
> Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.
> 
> Traditionally, include/catalog/pg_foo.h contains extern declarations
> for functions in backend/catalog/pg_foo.c, in addition to its function
> as the authoritative definition of the pg_foo catalog's rowtype.
> In some cases, we'd been forced to split out those extern declarations
> into separate pg_foo_fn.h headers so that the catalog definitions
> could be #include'd by frontend code.  That problem is gone as of
> commit 9c0a0de4c, so let's undo the splits to make things less
> confusing.

This patch or one of its relatives has visibly broken parallel builds
for me. "make -j 4 install" directly called after a configure complains:
In file included from ../../src/include/catalog/catalog.h:22:0,
 from relpath.c:21:
../../src/include/catalog/pg_class.h:22:10: fatal error: catalog/pg_class_d.h: 
No such file or directory
 #include "catalog/pg_class_d.h"
  ^~
compilation terminated.

I did not take the time to look into the problem; that's just a
heads-up.  I am on HEAD at b3b7f789 so it is not like I lack any fixes.
Please note that "make -j 4", called directly, works.
--
Michael




Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-04-08 Thread Michael Paquier
On Fri, Apr 06, 2018 at 11:18:34AM +0200, Daniel Gustafsson wrote:
> Yep, I completely agree.  Attached are patches with the quotes removed and
> rebased since Oids were taken etc.

I still find this idea interesting for plugin authors.  However, as
feature freeze for v11 is in effect, I am marking the patch as returned
with feedback.
--
Michael




Re: Rewriting the test of pg_upgrade as a TAP test - take two

2018-04-08 Thread Michael Paquier
On Tue, Apr 03, 2018 at 03:04:10PM +0900, Michael Paquier wrote:
> At the end, it seems to me that what I am proposing is the most readable
> approach, and with proper documentation things should be handled
> cleanly...  Or is there an approach you have in mind that I do not foresee?

With feature freeze now in effect, I am marking this patch as
returned with feedback.  Please note that I have no plans to resubmit
this patch set again after this second attempt, as I have no idea where
we want things to go.
--
Michael




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 16:16, Andres Freund  wrote:
> We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass 
an error back to the client on a read, but a write problem in this situation 
can be masked for an extended period of time.

That being said...

> There's a lot of not so great things here, but I don't think there's any
> need to panic.

No reason to panic, yes.  We can assume that if this were a very big persistent 
problem, it would be much more widely reported.  It would, however, be good to 
find a way to get the error surfaced back up to the client in a way that is not 
just monitoring the kernel logs.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andres Freund
On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:
> On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
> >
> > > On Apr 8, 2018, at 03:30, Craig Ringer 
> > > wrote:
> > >
> > > These are way more likely than bit flips or other storage level
> > > corruption, and things that we previously expected to detect and
> > > fail gracefully for.
> >
> > This is definitely bad, and it explains a few otherwise-inexplicable
> > corruption issues we've seen.  (And great work tracking it down!)  I
> > think it's important not to panic, though; PostgreSQL doesn't have a
> > reputation for horrible data integrity.  I'm not sure it makes sense
> > to do a major rearchitecting of the storage layer (especially with
> > pluggable storage coming along) to address this.  While the failure
> > modes are more common, the solution (a PITR backup) is one that an
> > installation should have anyway against media failures.
> 
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.  If we could stop Postgres
> when such errors happen, at least the administrator could fix the
> problem or fail over to a standby.
> 
> A crazy idea would be to have a daemon that checks the logs and stops
> Postgres when it sees something wrong.

I think the danger presented here is far smaller than some of the
statements in this thread might make one think. In all likelihood, once
you've got an IO error that kernel level retries don't fix, your
database is screwed. Whether fsync reports that or not is really
somewhat beside the point. We don't panic that way when getting IO
errors during reads either, and they're more likely to be persistent
than errors during writes (because remapping on storage layer can fix
issues, but not during reads).

There's a lot of not so great things here, but I don't think there's any
need to panic.

We should fix things so that reported errors are treated with crash
recovery, and for the rest I think there's very fair arguments to be
made that that's far outside postgres's remit.

I think there's pretty good reasons to go to direct IO where supported,
but error handling doesn't strike me as a particularly good reason for
the move.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 15:29, Bruce Momjian  wrote:
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.

Yeah, it's bad.  In the short term, the best advice to installations is to 
monitor their kernel logs for errors (which very few do right now), and make 
sure they have a backup strategy which can encompass restoring from an error 
like this.  Even Craig's smart fix of patching the backup label to recover from 
a previous checkpoint doesn't do much good if we don't have WAL records back 
that far (or one of the required WAL records also took a hit).

In the longer term... O_DIRECT seems like the most plausible way out of this, 
but that might not be popular with people running on file systems or OSes that 
don't have this issue.  (Setting aside the daunting prospect of implementing 
that.)

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Bruce Momjian
On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
>
> > On Apr 8, 2018, at 03:30, Craig Ringer 
> > wrote:
> >
> > These are way more likely than bit flips or other storage level
> > corruption, and things that we previously expected to detect and
> > fail gracefully for.
>
> This is definitely bad, and it explains a few otherwise-inexplicable
> corruption issues we've seen.  (And great work tracking it down!)  I
> think it's important not to panic, though; PostgreSQL doesn't have a
> reputation for horrible data integrity.  I'm not sure it makes sense
> to do a major rearchitecting of the storage layer (especially with
> pluggable storage coming along) to address this.  While the failure
> modes are more common, the solution (a PITR backup) is one that an
> installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping
Postgres at the time the kernel reports the errors to the kernel log, so
we are then returning potentially incorrect results and committing
transactions that might be wrong or lost.  If we could stop Postgres
when such errors happen, at least the administrator could fix the
problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops
Postgres when it sees something wrong.
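Such a daemon could start out as little more than a pattern scan over new kernel log lines. The patterns below are illustrative guesses rather than a vetted list, and a real tool would tail /var/log/kern.log (or journald) and run an immediate-mode stop on a hit:

```python
import re

# Illustrative (not exhaustive) kernel messages that indicate lost writes.
ERROR_PATTERNS = [
    re.compile(r"Buffer I/O error"),
    re.compile(r"lost page write due to I/O error"),
    re.compile(r"blk_update_request: I/O error"),
]

def lines_needing_shutdown(log_lines):
    """Return the kernel log lines that should trigger an immediate stop."""
    return [line for line in log_lines
            if any(p.search(line) for p in ERROR_PATTERNS)]
```

A wrapper would then call something like `pg_ctl stop -m immediate` whenever this returns a non-empty list.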

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Anthony Iliopoulos
On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> On 8 April 2018 at 04:27, Craig Ringer  wrote:
> > On 8 April 2018 at 10:16, Thomas Munro 
> > wrote:
> >
> > If the kernel does writeback in the middle, how on earth is it supposed to
> > know we expect to reopen the file and check back later?
> >
> > Should it just remember "this file had an error" forever, and tell every
> > caller? In that case how could we recover? We'd need some new API to say
> > "yeah, ok already, I'm redoing all my work since the last good fsync() so
> > you can clear the error flag now". Otherwise it'd keep reporting an error
> > after we did redo to recover, too.
> 
> There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
> track of any errors. We just need fsync to accurately report whether
> all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been
written out, due to reasons outlined before. As such it may make
some sense to maintain some tracking regarding errors even after
marking failed dirty pages as clean (in fact it has been proposed,
but this introduces memory overhead).

> again the kernel needs to initiate i/o on all the dirty buffers and
> block until they complete successfully. If they complete successfully
> then nobody cares whether they had some failure in the past when i/o
> was initiated at some point in the past.

The question is, what should the kernel and application do in cases
where this is simply not possible (for example, FreeBSD keeps dirty
pages around after failure, yet -EIO from the block layer is a
contract for unrecoverable errors, so it is pointless to keep them
dirty). You'd need a specialized interface to clear out
the errors (and drop the dirty pages), or potentially just remount
the filesystem.

> The problem is not that errors aren't being tracked correctly. The
> problem is that dirty buffers are being marked clean when they haven't
> been written out. They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".
> 
> As long as any error means the kernel has discarded writes then
> there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards writes
or not would not really help (see above). It is more a matter of
proper "reporting contract" between userspace and kernel, and tracking
would be a way for facilitating this vs. having a more complex userspace
scheme (as described by others in this thread) where synchronization
for fsync() is required in a multi-process application.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 14:23, Greg Stark  wrote:
> 
> They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".

That's not an irrational position.  File system buffers are *not* dedicated 
memory for file system caching; they're being used for that because no one has 
a better use for them at that moment.  If an inability to flush them to disk 
meant that they suddenly became pinned memory, a large copy operation to a 
yanked USB drive could result in the system having no more allocatable memory.  
I guess in theory that they could swap them, but swapping out a file system 
buffer in hopes that sometime in the future it could be properly written 
doesn't seem very architecturally sound to me.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Greg Stark
On 8 April 2018 at 04:27, Craig Ringer  wrote:
> On 8 April 2018 at 10:16, Thomas Munro 
> wrote:
>
> If the kernel does writeback in the middle, how on earth is it supposed to
> know we expect to reopen the file and check back later?
>
> Should it just remember "this file had an error" forever, and tell every
> caller? In that case how could we recover? We'd need some new API to say
> "yeah, ok already, I'm redoing all my work since the last good fsync() so
> you can clear the error flag now". Otherwise it'd keep reporting an error
> after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
track of any errors. We just need fsync to accurately report whether
all the buffers in the file have been written out. When you call fsync
again the kernel needs to initiate i/o on all the dirty buffers and
block until they complete successfully. If they complete successfully
then nobody cares whether they had some failure in the past when i/o
was initiated at some point in the past.

The problem is not that errors aren't being tracked correctly. The
problem is that dirty buffers are being marked clean when they haven't
been written out. They consider dirty filesystem buffers when there's
hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then
there's no real hope of any reliable operation through that interface.

Going to DIRECTIO is basically recognizing this: the kernel
filesystem buffer cache provides no reliable interface, so we need to
reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we
have the added barrier that we don't have as much information about
the hardware as the kernel has. We don't know where raid stripes begin
and end, how big the memory controller buffers are or how to tell when
they're full or empty or how to flush them. etc etc. We also don't
know what else is going on on the machine.

-- 
greg



Re: WIP: a way forward on bootstrap data

2018-04-08 Thread Andres Freund


On April 8, 2018 10:19:59 AM PDT, Tom Lane  wrote:
>John Naylor  writes:
>> There were a couple more catalog changes that broke patch context, so
>> attached is version 16.
>
>Pushed with a few more adjustments (mostly, another round of
>copy-editing
>on bki.sgml).  Now we wait to see what the buildfarm thinks,
>particularly
>about the MSVC build ...
>
>Congratulations, and many thanks, to John for seeing this through!
>I know it's been a huge amount of work, but we've needed this for
>years.

Seconded and thirded and fourthed (?)!
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: WIP: Covering + unique indexes.

2018-04-08 Thread Peter Geoghegan
On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev  wrote:
> Thank you, fixed

I suggest that we remove some unneeded amcheck tests, as in the
attached patch. They don't seem to add anything.

-- 
Peter Geoghegan
From 0dbbee5bfff8816cddf86961bf4959192f62f1ff Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 8 Apr 2018 11:56:00 -0700
Subject: [PATCH] Remove some superfluous amcheck INCLUDE tests.

Repeating these tests adds unnecessary cycles, since no improvement in
test coverage is expected.

Cleanup from commit 8224de4f42ccf98e08db07b43d52fed72f962ebb.

Author: Peter Geoghegan
---
 contrib/amcheck/expected/check_btree.out | 20 +---
 contrib/amcheck/sql/check_btree.sql  |  6 +-
 2 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 2a06cce..ed80ac4 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -111,27 +111,9 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
  
 (1 row)
 
-SELECT bt_index_parent_check('bttest_multi_idx', true);
- bt_index_parent_check 

- 
-(1 row)
-
--- repeat same checks with index made by insertions
+-- repeat expansive test for index built using insertions
 TRUNCATE bttest_multi;
 INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 10) as i;
-SELECT bt_index_check('bttest_multi_idx');
- bt_index_check 
-
- 
-(1 row)
-
-SELECT bt_index_parent_check('bttest_multi_idx', true);
- bt_index_parent_check 

- 
-(1 row)
-
 SELECT bt_index_parent_check('bttest_multi_idx', true);
  bt_index_parent_check 
 ---
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index da2f131..4ca9d2d 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -65,15 +65,11 @@ COMMIT;
 SELECT bt_index_check('bttest_multi_idx');
 -- more expansive test for index with included columns
 SELECT bt_index_parent_check('bttest_multi_idx', true);
-SELECT bt_index_parent_check('bttest_multi_idx', true);
 
--- repeat same checks with index made by insertions
+-- repeat expansive test for index built using insertions
 TRUNCATE bttest_multi;
 INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 10) as i;
-SELECT bt_index_check('bttest_multi_idx');
 SELECT bt_index_parent_check('bttest_multi_idx', true);
-SELECT bt_index_parent_check('bttest_multi_idx', true);
-
 
 -- cleanup
 DROP TABLE bttest_a;
-- 
2.7.4



Re: WIP: Covering + unique indexes.

2018-04-08 Thread Teodor Sigaev

Thanks to everyone, pushed.


Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc 
(Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:


indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
   int   indnatts = tupleDescriptor->natts;


Thank you, fixed

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: a way forward on bootstrap data

2018-04-08 Thread John Naylor
On 4/9/18, Tom Lane  wrote:
> Congratulations, and many thanks, to John for seeing this through!
> I know it's been a huge amount of work, but we've needed this for years.

Many thanks for your review, advice, and additional hacking. Thanks
also to Álvaro for review and initial commits, and to all who
participated in previous discussion.

-John Naylor



Re: Verbosity of genbki.pl

2018-04-08 Thread Teodor Sigaev

1. Print nothing at all.  That's more in keeping with our modern
build practices, but maybe it's too big a change?

2. Print just one message like "Generating postgres.bki and related
files", and I guess a second one for fmgroids.h and related files.

I don't have a strong preference.  Opinions?


The second option, please. I'd like to see some sign that the stage got done.

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Verbosity of genbki.pl

2018-04-08 Thread Tom Lane
(This was discussed in the "way forward on bootstrap data" thread,
but we didn't do anything about it.)

Traditionally genbki.pl has printed "Writing foo" for every file
it writes out.  As of 372728b0d that is a heck of a lot more files
than it once was, so you get this:

$ make -j8 -s
Writing fmgroids.h
Writing fmgrprotos.h
Writing fmgrtab.c
Writing pg_proc_d.h
Writing pg_type_d.h
Writing pg_attribute_d.h
Writing pg_class_d.h
Writing pg_attrdef_d.h
Writing pg_constraint_d.h
Writing pg_inherits_d.h
Writing pg_index_d.h
Writing pg_operator_d.h
Writing pg_opfamily_d.h
Writing pg_opclass_d.h
Writing pg_am_d.h
Writing pg_amop_d.h
Writing pg_amproc_d.h
Writing pg_language_d.h
Writing pg_largeobject_metadata_d.h
Writing pg_largeobject_d.h
Writing pg_aggregate_d.h
Writing pg_statistic_ext_d.h
Writing pg_statistic_d.h
Writing pg_rewrite_d.h
Writing pg_trigger_d.h
Writing pg_event_trigger_d.h
Writing pg_description_d.h
Writing pg_cast_d.h
Writing pg_enum_d.h
Writing pg_namespace_d.h
Writing pg_conversion_d.h
Writing pg_depend_d.h
Writing pg_database_d.h
Writing pg_db_role_setting_d.h
Writing pg_tablespace_d.h
Writing pg_pltemplate_d.h
Writing pg_authid_d.h
Writing pg_auth_members_d.h
Writing pg_shdepend_d.h
Writing pg_shdescription_d.h
Writing pg_ts_config_d.h
Writing pg_ts_config_map_d.h
Writing pg_ts_dict_d.h
Writing pg_ts_parser_d.h
Writing pg_ts_template_d.h
Writing pg_extension_d.h
Writing pg_foreign_data_wrapper_d.h
Writing pg_foreign_server_d.h
Writing pg_user_mapping_d.h
Writing pg_foreign_table_d.h
Writing pg_policy_d.h
Writing pg_replication_origin_d.h
Writing pg_default_acl_d.h
Writing pg_init_privs_d.h
Writing pg_seclabel_d.h
Writing pg_shseclabel_d.h
Writing pg_collation_d.h
Writing pg_partitioned_table_d.h
Writing pg_range_d.h
Writing pg_transform_d.h
Writing pg_sequence_d.h
Writing pg_publication_d.h
Writing pg_publication_rel_d.h
Writing pg_subscription_d.h
Writing pg_subscription_rel_d.h
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
All of PostgreSQL successfully made. Ready to install.

which seems a tad excessive, especially in a -s run.

I think we should drop the per-file notices.  But there are a couple
of alternatives as to what to do exactly:

1. Print nothing at all.  That's more in keeping with our modern
build practices, but maybe it's too big a change?

2. Print just one message like "Generating postgres.bki and related
files", and I guess a second one for fmgroids.h and related files.

I don't have a strong preference.  Opinions?

regards, tom lane



Re: WIP: a way forward on bootstrap data

2018-04-08 Thread Tom Lane
John Naylor  writes:
> There were a couple more catalog changes that broke patch context, so
> attached is version 16.

Pushed with a few more adjustments (mostly, another round of copy-editing
on bki.sgml).  Now we wait to see what the buildfarm thinks, particularly
about the MSVC build ...

Congratulations, and many thanks, to John for seeing this through!
I know it's been a huge amount of work, but we've needed this for years.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Christophe Pettus

> On Apr 8, 2018, at 03:30, Craig Ringer  wrote:
> 
> These are way more likely than bit flips or other storage level corruption, 
> and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption 
issues we've seen.  (And great work tracking it down!)  I think it's important 
not to panic, though; PostgreSQL doesn't have a reputation for horrible data 
integrity.  I'm not sure it makes sense to do a major rearchitecting of the 
storage layer (especially with pluggable storage coming along) to address this. 
 While the failure modes are more common, the solution (a PITR backup) is one 
that an installation should have anyway against media failures.

--
-- Christophe Pettus
   x...@thebuild.com




Re: WIP: Covering + unique indexes.

2018-04-08 Thread Teodor Sigaev

Thank you, fixed

Jeff Janes wrote:
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev > wrote:


Thanks to everyone, pushed.


Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc 
(Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:


indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
   int   indnatts = tupleDescriptor->natts;

Cheers,

Jeff


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: PATCH: Configurable file mode mask

2018-04-08 Thread David Steele
Hi Michael,

On 4/6/18 10:20 AM, Michael Paquier wrote:
> On Fri, Apr 06, 2018 at 09:15:15AM -0400, Stephen Frost wrote:
>> I'll reply to David's last email (where the latest set of patches were
>> included) with my comments/suggestions and I expect we'll be able to get
>> those addressed today and have a final patch to post tonight, with an
>> eye towards committing it tomorrow.
> 
> The feature freeze is on the 8th, so I am going to have limited room to
> comment on things until that day.  If something gets committed, I am
> pretty sure that I'll get out of my pocket a couple of things to improve
> the feature and its interface anyway if of course you are ready to
> accept that.  

Improvements are always welcome!  The core focus of the patch hasn't
changed over the last year, but we've found better ways to implement it
over time.  I'm sure there's more we can do.

> I have limited my role to be a reviewer, so I refrained
> myself from writing any code ;)

Your dedicated and tireless review helped get this patch over the line.
Thank you very much for all your hard work.

-- 
-David
da...@pgmasters.net





Re: WIP: Covering + unique indexes.

2018-04-08 Thread Jeff Janes
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev  wrote:

> Thanks to everyone, pushed.
>
>
Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;

Cheers,

Jeff


Re: WIP: a way forward on bootstrap data

2018-04-08 Thread John Naylor
I wrote:

> I'll check back in 24 hours to see if everything still applies.

There were a couple more catalog changes that broke patch context, so
attached is version 16.

-John Naylor


v16-bootstrap-data-conversion.tar.gz
Description: GNU Zip compressed data


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 8 April 2018 at 17:41, Andreas Karlsson  wrote:

> On 04/08/2018 05:27 AM, Craig Ringer wrote:
>> More below, but here's an idea #5: decide InnoDB has the right idea, and
>> go to using a single massive blob file, or a few giant blobs.
>>
>
> FYI: MySQL has by default one file per table these days. The old approach
> with one massive file was a maintenance headache so they changed the default
> some releases ago.
>
> https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html
>
>
Huh, thanks for the update.

We should see how they handle reliable flushing and see if they've looked
into it. If they haven't, we should give them a heads-up, and if they have,
let's learn from them.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Craig Ringer
On 8 April 2018 at 11:46, Christophe Pettus  wrote:

>
> > On Apr 7, 2018, at 20:27, Craig Ringer  wrote:
> >
> > Right now I think we're at option (4): If you see anything that smells
> like a write error in your kernel logs, hard-kill postgres with -m
> immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint
> since the logs, fake up a backup label to force redo to start from the last
> checkpoint before the error. Otherwise, it's safe to just let it start up
> again and do redo again.
>
> Before we spiral down into despair and excessive alcohol consumption, this
> is basically the same situation as a checksum failure or some other kind of
> uncorrected media-level error.  The bad part is that we have to find out
> from the kernel logs rather than from PostgreSQL directly.  But this does
> not strike me as otherwise significantly different from, say, an
> infrequently-accessed disk block reporting an uncorrectable error when we
> finally get around to reading it.
>

I don't entirely agree - because it affects ENOSPC, I/O errors on thin
provisioned storage, I/O errors on multipath storage, etc. (I identified
the original issue on a thin provisioned system that ran out of backing
space, mangling PostgreSQL in a way that made no sense at the time).

These are way more likely than bit flips or other storage level corruption,
and things that we previously expected to detect and fail gracefully for.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Andreas Karlsson
On 04/08/2018 05:27 AM, Craig Ringer wrote:
> More below, but here's an idea #5: decide InnoDB has the right idea, and
> go to using a single massive blob file, or a few giant blobs.


FYI: MySQL has by default one file per table these days. The old 
approach with one massive file was a maintenance headache so they changed 
the default some releases ago.


https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html

Andreas



Re: Checkpoint not retrying failed fsync?

2018-04-08 Thread Thomas Munro
On Sun, Apr 8, 2018 at 5:36 PM, Amit Kapila  wrote:
> Won't in the success case, you need to delete each member (by
> something like bms_del_member) rather than just using bms_free?

Thanks for looking at this.  Yeah, if requests for segment numbers 0
and 1 were in "requests", and 0 succeeded but then 1 fails, my
previous patch would leave both in there to be retried next time
around.  I thought that was pretty harmless so I didn't worry about it
before, but of course you're right that it's not necessary to retry
the ones that succeeded, so we could remove them as we go.  New patch
attached.
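The bookkeeping described above can be sketched with a plain Python set standing in for PostgreSQL's Bitmapset (so bms_del_member becomes set.discard): each segment is dropped from the request set as its fsync succeeds, and failed ones stay behind for the next checkpoint attempt. This is an illustration of the idea, not the patch itself:

```python
def process_fsync_requests(requests, do_fsync):
    """Try to fsync every requested segment number.

    Segments whose fsync succeeds are removed from `requests` as we go;
    failed ones stay behind so the next cycle retries only them.
    Returns True if any fsync failed.
    """
    failed = False
    for segno in sorted(requests):   # iterate over a snapshot of the set
        try:
            do_fsync(segno)
        except OSError:
            failed = True            # keep segno for the next cycle
        else:
            requests.discard(segno)  # succeeded: no need to retry it
    return failed
```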

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Make-sure-we-don-t-forget-about-fsync-requests-af-v3.patch
Description: Binary data


Re: csv format for psql

2018-04-08 Thread Pavel Stehule
2018-04-07 14:23 GMT+02:00 Daniel Verite :

> Peter Eisentraut wrote:
>
> > Another thought: Isn't CSV just the same as unaligned output plus some
> > quoting?  Could we add a quote character setting and then define --csv
> > to be quote-character = " and fieldsep = , ?
>
> Plus footer set to off. So --csv would be like
> \pset format unaligned
> \pset fieldsep ','
> \pset footer off
> \pset unaligned_quote_character '"'
>
> I guess we'd add a \csv command that does these together.
>
> There would still be some discomfort with some of the issues
> discussed upthread. For instance
>   psql -F ';' --csv
> is likely to reset fieldsep to ',', giving a different outcome than
>   psql --csv -F';'
>
> And there's the point on how to come back from \csv to another
> output, given that it has overwritten "footer" that is used across
> all formats, and fieldsep used by unaligned.
> So a user might anyway need to figure out the
> intricacies behind \csv, if it's not an independent format.
>
> On the whole I'm inclined to resubmit the patch with
> fieldsep_csv and some minor changes based on the rest
> of the discussion.
>

+1


>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-08 Thread Jeevan Chalke
Hi,

At some places, I have observed that we are adding a partial path even when
rel's consider_parallel is false. Due to this, the added partial path has
parallel_safe set to false, and later the assertion in create_gather_path()
fails.

Attached is a patch to fix that.


On Sun, Apr 8, 2018 at 12:26 AM, Andreas Seltenreich 
wrote:

> Tom Lane writes:
> > Alvaro Herrera  writes:
> >> Andreas Seltenreich wrote:
> >>> as of 039eb6e92f:
> >>> TRAP: FailedAssertion("!(subpath->parallel_safe)", File:
> "pathnode.c", Line: 1813)
> >
> >> Uh, that's pretty strange -- that patch did not touch the planner at
> >> all, and the relcache.c changes are pretty trivial.
> >
> > I don't think Andreas meant that the bug is necessarily new in
> 039eb6e92f,
> > only that that's the version he happened to be testing.
>
> Right.  I'm not testing often enough yet to be able to report on commit
> granularity :-).  I'll try for a less ambiguous wording in future
> reports.
>
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 86f430e35583c6bfdd43c4f31e06ab83e8404e9a Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Sun, 8 Apr 2018 12:52:13 +0530
Subject: [PATCH] Add partial path only when rel's consider_parallel is true.

Otherwise, it will result in an assertion failure later in the
create_gather_path().
---
 src/backend/optimizer/path/allpaths.c  | 4 
 src/backend/optimizer/prep/prepunion.c | 2 +-
 src/backend/optimizer/util/pathnode.c  | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c4e4db1..aa24345 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2196,6 +2196,10 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		  pathkeys, required_outer));
 	}
 
+	/* If parallelism is not possible, return. */
+	if (!rel->consider_parallel || !bms_is_empty(required_outer))
+		return;
+
 	/* If consider_parallel is false, there should be no partial paths. */
 	Assert(sub_final_rel->consider_parallel ||
 		   sub_final_rel->partial_pathlist == NIL);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5236ab3..f54807d 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -331,7 +331,7 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
 		 * to build a partial path for this relation.  But there's no point in
 		 * considering any path but the cheapest.
 		 */
-		if (final_rel->partial_pathlist != NIL)
+		if (final_rel->consider_parallel && final_rel->partial_pathlist != NIL)
 		{
 			Path	   *partial_subpath;
 			Path	   *partial_path;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9..fa332de 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -770,6 +770,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	/* Check for query cancel. */
 	CHECK_FOR_INTERRUPTS();
 
+	/* Path to be added must be parallel safe. */
+	Assert(new_path->parallel_safe);
+
 	/*
 	 * As in add_path, throw out any paths which are dominated by the new
 	 * path, but throw out the new path if some existing path dominates it.
-- 
1.8.3.1



Re: lazy detoasting

2018-04-08 Thread Andrew Gierth
> "Chapman" == Chapman Flack  writes:

 Chapman> Please bear with me as I check my understanding of snapshot
 Chapman> management by looking at PersistHoldablePortal(). There's a
 Chapman> PushActiveSnapshot(queryDesc->snapshot) in there. Is that
 Chapman> because:

 Chapman> (d) some other reason I haven't thought of ?

It has to be pushed as the active snapshot so that it's the snapshot
that the executor uses to run the query to populate the tuplestore which
becomes the "held" portal content.

I think you're confusing the stack of active snapshots with the registry
of snapshots. The snapshot of a portal that's not currently running in
the executor won't be on the stack, but it will be registered (which is
enough to maintain the session's reported xmin, which is what prevents
visible data from being vacuumed away).

-- 
Andrew (irc:RhodiumToad)