Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
At Mon, 9 Apr 2018 13:59:45 +0900, Michael Paquier wrote in <20180409045945.gb1...@paquier.xyz>:

> On Mon, Apr 09, 2018 at 01:26:54PM +0900, Kyotaro HORIGUCHI wrote:
> > Hello, I added this as Older Bugs in Open items. (I believe it's
> > legit.)
>
> Yes, I think that's adapted. I am hesitating to do the same thing with
> all the other patches marked as bug fixes.

Thanks. I'm also not going to do so for the other "bug fix"es. The reasons for this patch are that I myself recently had a suspected case (on a customer site) and that the size is not so large. (This doesn't mean it is easy, of course.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Fix pg_rewind which can be run as root user
Hi all,

I was just going through pg_rewind's code, and noticed the following pearl:

	/*
	 * Don't allow pg_rewind to be run as root, to avoid overwriting the
	 * ownership of files in the data directory. We need only check for root
	 * -- any other user won't have sufficient permissions to modify files in
	 * the data directory.
	 */
#ifndef WIN32
	if (geteuid() == 0)
	{
		fprintf(stderr, _("cannot be executed by \"root\"\n"));
		fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
				progname);
	}
#endif

While that's nice to inform the user about the problem, it actually does not prevent pg_rewind from running as root. Attached is a patch, which needs a back-patch down to 9.5.

Thanks,
-- 
Michael

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b9ea6a4c21..a1ab13963a 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -208,6 +208,7 @@ main(int argc, char **argv)
 		fprintf(stderr, _("cannot be executed by \"root\"\n"));
 		fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
 				progname);
+		exit(1);
 	}
 #endif

signature.asc
Description: PGP signature
Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Hello, I added this as Older Bugs in Open items. (I believe it's legit.) The latest patch still applies on HEAD with some offsets.

At Tue, 23 Jan 2018 18:50:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20180123.185000.232069302.horiguchi.kyot...@lab.ntt.co.jp>:
> At Fri, 19 Jan 2018 18:24:56 +0900, Michael Paquier wrote in <20180119092456.ga1...@paquier.xyz>:
> > On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> > > On the other hand, if one logical record must be read from a single
> > > source, we need some means to deter WAL recycling on the
> > > primary side. Conducting that within the primary side is rejected
> > > by consensus.
> >
> > There is this approach of extending the message protocol as well so
> > that the primary can retain the segments it needs to keep around...
>
> I haven't seen it in detail, but it seems to mean that it is done
> by notifying something from the standby to the primary, without
> knowing what a WAL record is on the standby side.
>
> > > (There might be a smarter way to do that, though.) To
> > > do that from the standby side, retarding write feedback or
> > > this kind of special feedback would be needed. In any case it is
> > > necessary to introduce WAL-record awareness into the WAL shipping
> > > process, and that is seemingly large complexity.
> >
> > Coming to logical slots, don't you need to be aware of the beginning of
> > the record on the primary to perform correct decoding of tuples
> > through time travel? If the record is present across segments, it seems
>
> I'm not sure what the time travel is, but the required segments
> seem to be kept safely with logical replication, since the publisher
> moves restart_lsn not based on the LSN of finished commits, but on the
> beginning LSN of running transactions. In other words, logical
> decoding doesn't need to track each record for the purpose of
> keeping segments, since it is already keeping track of the beginning
> of a transaction. Physical replication is totally unaware of a
> WAL record; it just copies blobs in a convenient unit and only
> cares about LSNs. But that ignorance seems required to keep performance.
>
> > to me that it matters. Andres knows the answer to that for sure, so I
> > would expect to be counter-argued in the next 12 hours or so.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] path toward faster partition pruning
While looking at the docs in [1], I saw that we still mention:

4. Ensure that the constraint_exclusion configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired.

This is no longer true. The attached patch removes it.

[1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

0001-Remove-mention-of-constraint_exclusion-in-partitioni.patch
Description: Binary data
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 9 April 2018 at 10:06, Andres Freund wrote:

> > And in many failure modes there's no reason to expect any data loss at
> > all, like:
> >
> > * Local disk fills up (seems to be safe already due to space reservation
> > at write() time)
>
> That definitely should be treated separately.

It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. Otherwise we'd be hearing a LOT more noise about this.

> > * Thin-provisioned storage backing local volume iSCSI or paravirt block
> > device fills up
> > * NFS volume fills up
>
> Those should be the same as the above.

Unfortunately, they aren't. AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same paths as EIO: we fsync(), ERROR, retry, appear to succeed, and carry on with life losing the data. Or we never hear about the error in the first place.

(There's a proposed extension that'd allow this, see https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages, at a quick read.)

Thin-provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:

  xfs -> lvm2 -> multipath -> ??? -> SAN

(the HBA/iSCSI/whatever was not recorded by the looks of it, but IIRC it was iSCSI. I'm checking.)

The SAN ran out of space. Due to the use of thin provisioning, Linux *thought* there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc.
The space exhaustion manifested as I/O errors on flushes of writeback buffers. The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787

The immediate cause was that Linux's multipath driver didn't recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away and tell any interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to cause the checkpoint to ERROR, it "completed" a "successful" checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue. In this case, if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space", I'd have said "Surely not.
PostgreSQL won't be corrupted by running out of disk space; it orders writes carefully and forces flushes so that it will recover gracefully from write failures." Except not. I was very surprised.

BTW, it also turns out that the *default* for multipath is to give up on errors anyway; see the queue_if_no_path feature and the no_path_retry option. (Hint: run PostgreSQL with no_path_retry=queue.) That's a sane default if you use O_DIRECT|O_SYNC, and otherwise a pretty much data-eating setup. I regularly see rather a lot of multipath systems, iSCSI systems, SAN-backed systems, etc.

I think we need to be pretty clear that we expect them to retry indefinitely, and that if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it.) Also warn against running on NFS even with
Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.
Michael Paquier writes:
> On Sun, Apr 08, 2018 at 06:35:39PM +0000, Tom Lane wrote:
>> Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.
>
> This patch or one of its relatives has visibly broken parallel builds
> for me. "make -j 4 install" directly called after a configure complains:

Hm. I'd tested "make -j all", but not going directly to "install". Does it help if you add

$(SUBDIRS:%=install-%-recurse): | submake-generated-headers

to src/Makefile? (That seems like a bit of a brute-force solution, though. Anybody have a better answer?)

regards, tom lane
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:
> I suspect we've written off a fair few issues in the past as "it's bad
> hardware" when actually, the hardware fault was the trigger for a Pg/kernel
> interaction bug. And blamed containers for things that weren't really the
> container's fault. But even so, if it were happening tons, we'd hear more
> noise.

Agreed on that, but I think that's FAR more likely to be things like multixacts, index structure corruption due to logic bugs, etc.

> I've already been very surprised there when I learned that PostgreSQL
> completely ignores wholly absent relfilenodes. Specifically, if you
> unlink() a relation's backing relfilenode while Pg is down and that file
> has writes pending in the WAL, we merrily re-create it with uninitialized
> pages and go on our way. As Andres pointed out in an offlist discussion,
> redo isn't a consistency check, and it's not obliged to fail in such cases.
> We can say "well, don't do that then" and define away file losses from FS
> corruption etc as not our problem; the lower levels we expect to take care
> of this have failed.

And it'd be a really bad idea to behave differently.

> And in many failure modes there's no reason to expect any data loss at all,
> like:
>
> * Local disk fills up (seems to be safe already due to space reservation at
> write() time)

That definitely should be treated separately.

> * Thin-provisioned storage backing local volume iSCSI or paravirt block
> device fills up
> * NFS volume fills up

Those should be the same as the above.

> I think we need to think about a more robust path in future. But it's
> certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least two orders of magnitude.

Greetings,

Andres Freund
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 9 April 2018 at 07:16, Andres Freund wrote:

> I think the danger presented here is far smaller than some of the
> statements in this thread might make one think.

Clearly it's not happening a huge amount, or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth-shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

I've already been very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define away file losses from FS corruption etc as not our problem; the lower levels we expect to take care of this have failed.

We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as "not our problem, the lower level failed, we're not obliged to detect this and fail gracefully." We can choose to say that checkpoints are required to guarantee crash/power-loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.

> In all likelihood, once you've got an IO error that kernel level retries
> don't fix, your database is screwed.
Your database is going to be down or have interrupted service. It's possible you may have some unreadable data. This could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog contents. But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all, like:

* Local disk fills up (seems to be safe already due to space reservation at write() time)
* Thin-provisioned storage backing local volume, iSCSI or paravirt block device, fills up
* NFS volume fills up
* Multipath I/O error
* Interruption of connectivity to network block device
* Disk develops localized bad sector where we haven't previously written data

Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel *won't* retry forever in all these cases, and doesn't seem to have a way to ask it to in all cases. And if the user hasn't configured it for the right behaviour in terms of I/O error resilience, we don't find out about it. So it's not the end of the world, but it'd sure be nice to fix.

> Whether fsync reports that or not is really somewhat besides the point.
> We don't panic that way when getting IO errors during reads either, and
> they're more likely to be persistent than errors during writes (because
> remapping on storage layer can fix issues, but not during reads).

That's because reads don't make promises about what's committed and synced. I think that's quite different.

> We should fix things so that reported errors are treated with crash
> recovery, and for the rest I think there's very fair arguments to be
> made that that's far outside postgres's remit.

Certainly for current versions.
I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If it does give up, that's not our problem anymore.

-- 
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi,

On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:
> > On Apr 8, 2018, at 16:16, Andres Freund wrote:
> > We don't panic that way when getting IO
> > errors during reads either, and they're more likely to be persistent
> > than errors during writes (because remapping on storage layer can fix
> > issues, but not during reads).
>
> There is a distinction to be drawn there, though, because we
> immediately pass an error back to the client on a read, but a write
> problem in this situation can be masked for an extended period of
> time.

Only if you're "lucky" enough that your clients actually read that data, and you're then somehow able to figure out across the whole stack that those 0.001% of transactions that fail are due to IO errors. Or you also need to do log analysis. If you want to solve things like that, you need regular reads of all your data, including verification etc.

Greetings,

Andres Freund
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 9 April 2018 at 06:29, Bruce Momjian wrote:

> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.

Right. Specifically, we need a way to ask the kernel at checkpoint time "was everything written to [this set of files] flushed successfully since the last time I asked, no matter who did the writing and no matter how the writes were flushed?"

If the result is "no", we PANIC and redo. If the hardware/volume is screwed, the user can fail over to a standby, do PITR, etc. But we don't have any way to ask that reliably at present.

-- 
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 9 April 2018 at 05:28, Christophe Pettus wrote:

> > On Apr 8, 2018, at 14:23, Greg Stark wrote:
> >
> > They consider dirty filesystem buffers when there's
> > hardware failure preventing them from being written "a memory leak".
>
> That's not an irrational position. File system buffers are *not*
> dedicated memory for file system caching; they're being used for that
> because no one has a better use for them at that moment. If an inability
> to flush them to disk meant that they suddenly became pinned memory, a
> large copy operation to a yanked USB drive could result in the system
> having no more allocatable memory. I guess in theory they could swap
> them, but swapping out a file system buffer in hopes that sometime in the
> future it could be properly written doesn't seem very architecturally
> sound to me.

Yep.

Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hope it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that OK? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty-buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment? You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball. It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for.
It doesn't know when all interested clients have been informed so that it can treat the fault as cleared/repaired, either, so it'd have to *keep on reporting EIO for PostgreSQL's own writes and fsync()s indefinitely*, even once we do recovery. The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out. But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without our actually telling it anything about our activities, or even that we still exist and are still interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore. Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and then only on Linux.

-- 
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Warnings and uninitialized variables in TAP tests
Hi all,

While looking at the output of the TAP tests, I have seen warnings like the following:

Use of uninitialized value $target_lsn in concatenation (.) or string at /home/foo/git/postgres/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pm line 1540.
[...]
./src/bin/pg_basebackup/tmp_check/log/regress_log_010_pg_basebackup:Use of uninitialized value $str in print at /home/foo/git/postgres/src/bin/pg_basebackup/../../../src/test/perl/TestLib.pm line 244.

The first one shows up something like 30 times, and the second only once. Attached is a patch to clean up all that. In order to find them, I simply ran the following and they pop up:

find . -name "regress_log*" | xargs grep -i "uninitialized"

What I have found is harmless, but these warnings still pollute the logs.

Thanks,
-- 
Michael

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f502a2e3c7..0cd510eeea 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -48,7 +48,7 @@ ok(!-d "$tempdir/backup", 'backup directory was cleaned up');
 # but leave the data directory behind
 mkdir("$tempdir/backup") or BAIL_OUT("unable to create $tempdir/backup");
-append_to_file("$tempdir/backup/dir-not-empty.txt");
+append_to_file("$tempdir/backup/dir-not-empty.txt", "Some data");
 $node->command_fails([ 'pg_basebackup', '-D', "$tempdir/backup", '-n' ],
 	'failing run with no-clean option');
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1bd91524d7..5a8f084efe 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1541,7 +1541,7 @@ sub wait_for_catchup
 	  . $standby_name . "'s "
 	  . $mode . "_lsn to pass "
-	  . $target_lsn . " on "
+	  . $lsn_expr . " on "
 	  . $self->name . "\n";
 	my $query = qq[SELECT $lsn_expr <= ${mode}_lsn FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';];

signature.asc
Description: PGP signature
Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.
On Sun, Apr 08, 2018 at 06:35:39PM +0000, Tom Lane wrote:
> Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.
>
> Traditionally, include/catalog/pg_foo.h contains extern declarations
> for functions in backend/catalog/pg_foo.c, in addition to its function
> as the authoritative definition of the pg_foo catalog's rowtype.
> In some cases, we'd been forced to split out those extern declarations
> into separate pg_foo_fn.h headers so that the catalog definitions
> could be #include'd by frontend code. That problem is gone as of
> commit 9c0a0de4c, so let's undo the splits to make things less
> confusing.

This patch or one of its relatives has visibly broken parallel builds for me. "make -j 4 install" directly called after a configure complains:

In file included from ../../src/include/catalog/catalog.h:22:0,
                 from relpath.c:21:
../../src/include/catalog/pg_class.h:22:10: fatal error: catalog/pg_class_d.h: No such file or directory
 #include "catalog/pg_class_d.h"
          ^~
compilation terminated.

I did not take the time to look into the problem; that's just a heads-up. I am on HEAD at b3b7f789, so it is not like I lack any fixes. Please note that "make -j 4" directly called works.

-- 
Michael

signature.asc
Description: PGP signature
Re: [HACKERS] Optional message to user when terminating/cancelling backend
On Fri, Apr 06, 2018 at 11:18:34AM +0200, Daniel Gustafsson wrote:
> Yep, I completely agree. Attached are patches with the quotes removed and
> rebased since Oids were taken etc.

I still find this idea interesting for plugin authors. However, as feature freeze for v11 is in effect, I am marking the patch as returned with feedback.

-- 
Michael

signature.asc
Description: PGP signature
Re: Rewriting the test of pg_upgrade as a TAP test - take two
On Tue, Apr 03, 2018 at 03:04:10PM +0900, Michael Paquier wrote:
> At the end, it seems to me that what I am proposing is the most readable
> approach, and with proper documentation things should be handled
> finely... Or is there an approach you have in mind that I do not foresee?

With feature freeze now reached, I am marking this patch as returned with feedback. Please note that I have no plans to resubmit this patch set again after this second attempt, as I have no idea where we want things to go.

-- 
Michael

signature.asc
Description: PGP signature
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
> On Apr 8, 2018, at 16:16, Andres Freund wrote:
> We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.

That being said...

> There's a lot of not so great things here, but I don't think there's any
> need to panic.

No reason to panic, yes. We can assume that if this were a very big persistent problem, it would be much more widely reported. It would, however, be good to find a way to get the error surfaced back up to the client in a way that does not require monitoring the kernel logs.

--
-- Christophe Pettus
x...@thebuild.com
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:
> On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
> > > On Apr 8, 2018, at 03:30, Craig Ringer wrote:
> > >
> > > These are way more likely than bit flips or other storage level
> > > corruption, and things that we previously expected to detect and
> > > fail gracefully for.
> >
> > This is definitely bad, and it explains a few otherwise-inexplicable
> > corruption issues we've seen. (And great work tracking it down!) I
> > think it's important not to panic, though; PostgreSQL doesn't have a
> > reputation for horrible data integrity. I'm not sure it makes sense
> > to do a major rearchitecting of the storage layer (especially with
> > pluggable storage coming along) to address this. While the failure
> > modes are more common, the solution (a PITR backup) is one that an
> > installation should have anyway against media failures.
>
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost. If we could stop Postgres
> when such errors happen, at least the administrator could fix the
> problem or fail over to a standby.
>
> A crazy idea would be to have a daemon that checks the logs and stops
> Postgres when it sees something wrong.

I think the danger presented here is far smaller than some of the statements in this thread might make one think. In all likelihood, once you've got an IO error that kernel-level retries don't fix, your database is screwed. Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on the storage layer can fix issues during writes, but not during reads).

There's a lot of not-so-great things here, but I don't think there's any need to panic. We should fix things so that reported errors are treated with crash recovery, and for the rest I think there are very fair arguments to be made that that's far outside postgres's remit.

I think there are pretty good reasons to go to direct IO where supported, but error handling doesn't strike me as a particularly good reason for the move.

Greetings,

Andres Freund
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
> On Apr 8, 2018, at 15:29, Bruce Momjian wrote:
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.

Yeah, it's bad. In the short term, the best advice to installations is to monitor their kernel logs for errors (which very few do right now), and to make sure they have a backup strategy that can encompass restoring from an error like this. Even Craig's smart fix of patching the backup label to recover from a previous checkpoint doesn't do much good if we don't have WAL records back that far (or one of the required WAL records also took a hit).

In the longer term... O_DIRECT seems like the most plausible way out of this, but that might not be popular with people running on file systems or OSes that don't have this issue. (Setting aside the daunting prospect of implementing it.)

--
-- Christophe Pettus
x...@thebuild.com
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
> > On Apr 8, 2018, at 03:30, Craig Ringer wrote:
> >
> > These are way more likely than bit flips or other storage level
> > corruption, and things that we previously expected to detect and
> > fail gracefully for.
>
> This is definitely bad, and it explains a few otherwise-inexplicable
> corruption issues we've seen. (And great work tracking it down!) I
> think it's important not to panic, though; PostgreSQL doesn't have a
> reputation for horrible data integrity. I'm not sure it makes sense
> to do a major rearchitecting of the storage layer (especially with
> pluggable storage coming along) to address this. While the failure
> modes are more common, the solution (a PITR backup) is one that an
> installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

-- 
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote: > On 8 April 2018 at 04:27, Craig Ringer wrote: > > On 8 April 2018 at 10:16, Thomas Munro wrote: > > > > If the kernel does writeback in the middle, how on earth is it supposed to > > know we expect to reopen the file and check back later? > > > > Should it just remember "this file had an error" forever, and tell every > > caller? In that case how could we recover? We'd need some new API to say > > "yeah, ok already, I'm redoing all my work since the last good fsync() so > > you can clear the error flag now". Otherwise it'd keep reporting an error > > after we did redo to recover, too. > > There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep > track of any errors. We just need fsync to accurately report whether > all the buffers in the file have been written out. When you call fsync Instead, fsync() reports when some of the buffers have not been written out, due to reasons outlined before. As such it may make some sense to maintain some tracking regarding errors even after marking failed dirty pages as clean (in fact it has been proposed, but this introduces memory overhead). > again the kernel needs to initiate i/o on all the dirty buffers and > block until they complete successfully. If they complete successfully > then nobody cares whether they had some failure in the past when i/o > was initiated at some point in the past. The question is, what should the kernel and application do in cases where this is simply not possible (according to FreeBSD, which keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear out the errors (and drop the dirty pages), or potentially just remount the filesystem. > The problem is not that errors aren't being tracked correctly. 
The > problem is that dirty buffers are being marked clean when they haven't > been written out. They consider dirty filesystem buffers when there's > hardware failure preventing them from being written "a memory leak". > > As long as any error means the kernel has discarded writes then > there's no real hope of any reliable operation through that interface. This does not necessarily follow. Whether the kernel discards writes or not would not really help (see above). It is more a matter of proper "reporting contract" between userspace and kernel, and tracking would be a way for facilitating this vs. having a more complex userspace scheme (as described by others in this thread) where synchronization for fsync() is required in a multi-process application. Best regards, Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
> On Apr 8, 2018, at 14:23, Greg Stark wrote: > > They consider dirty filesystem buffers when there's > hardware failure preventing them from being written "a memory leak". That's not an irrational position. File system buffers are *not* dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me. -- Christophe Pettus x...@thebuild.com
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 8 April 2018 at 04:27, Craig Ringer wrote: > On 8 April 2018 at 10:16, Thomas Munro wrote: > > If the kernel does writeback in the middle, how on earth is it supposed to > know we expect to reopen the file and check back later? > > Should it just remember "this file had an error" forever, and tell every > caller? In that case how could we recover? We'd need some new API to say > "yeah, ok already, I'm redoing all my work since the last good fsync() so > you can clear the error flag now". Otherwise it'd keep reporting an error > after we did redo to recover, too. There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past. The problem is not that errors aren't being tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak". As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface. Going to DIRECTIO is basically recognizing this: the kernel filesystem buffer provides no reliable interface, so we need to reimplement it ourselves in user space. It's rather disheartening. Aside from having to do all that work we have the added barrier that we don't have as much information about the hardware as the kernel has. We don't know where raid stripes begin and end, how big the memory controller buffers are or how to tell when they're full or empty or how to flush them. etc etc. We also don't know what else is going on on the machine. 
-- greg
Re: WIP: a way forward on bootstrap data
On April 8, 2018 10:19:59 AM PDT, Tom Lane wrote: >John Naylor writes: >> There were a couple more catalog changes that broke patch context, so >> attached is version 16. > >Pushed with a few more adjustments (mostly, another round of >copy-editing >on bki.sgml). Now we wait to see what the buildfarm thinks, >particularly >about the MSVC build ... > >Congratulations, and many thanks, to John for seeing this through! >I know it's been a huge amount of work, but we've needed this for >years. Seconded and thirded and fourthed (?)! -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: WIP: Covering + unique indexes.
On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaevwrote: > Thank you, fixed I suggest that we remove some unneeded amcheck tests, as in the attached patch. They don't seem to add anything. -- Peter Geoghegan From 0dbbee5bfff8816cddf86961bf4959192f62f1ff Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sun, 8 Apr 2018 11:56:00 -0700 Subject: [PATCH] Remove some superfluous amcheck INCLUDE tests. Repeating these tests adds unnecessary cycles, since no improvement in test coverage is expected. Cleanup from commit 8224de4f42ccf98e08db07b43d52fed72f962ebb. Author: Peter Geoghegan --- contrib/amcheck/expected/check_btree.out | 20 +--- contrib/amcheck/sql/check_btree.sql | 6 +- 2 files changed, 2 insertions(+), 24 deletions(-) diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out index 2a06cce..ed80ac4 100644 --- a/contrib/amcheck/expected/check_btree.out +++ b/contrib/amcheck/expected/check_btree.out @@ -111,27 +111,9 @@ SELECT bt_index_parent_check('bttest_multi_idx', true); (1 row) -SELECT bt_index_parent_check('bttest_multi_idx', true); - bt_index_parent_check - -(1 row) - --- repeat same checks with index made by insertions +-- repeat expansive test for index built using insertions TRUNCATE bttest_multi; INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 10) as i; -SELECT bt_index_check('bttest_multi_idx'); - bt_index_check - - -(1 row) - -SELECT bt_index_parent_check('bttest_multi_idx', true); - bt_index_parent_check - -(1 row) - SELECT bt_index_parent_check('bttest_multi_idx', true); bt_index_parent_check --- diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql index da2f131..4ca9d2d 100644 --- a/contrib/amcheck/sql/check_btree.sql +++ b/contrib/amcheck/sql/check_btree.sql @@ -65,15 +65,11 @@ COMMIT; SELECT bt_index_check('bttest_multi_idx'); -- more expansive test for index with included columns SELECT bt_index_parent_check('bttest_multi_idx', true); -SELECT 
bt_index_parent_check('bttest_multi_idx', true); --- repeat same checks with index made by insertions +-- repeat expansive test for index built using insertions TRUNCATE bttest_multi; INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 10) as i; -SELECT bt_index_check('bttest_multi_idx'); SELECT bt_index_parent_check('bttest_multi_idx', true); -SELECT bt_index_parent_check('bttest_multi_idx', true); - -- cleanup DROP TABLE bttest_a; -- 2.7.4
Re: WIP: Covering + unique indexes.
Thanks to everyone, pushed. Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609: indextuple.c: In function 'index_truncate_tuple': indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable] int indnatts = tupleDescriptor->natts; Thank you, fixed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: a way forward on bootstrap data
On 4/9/18, Tom Lane wrote: > Congratulations, and many thanks, to John for seeing this through! > I know it's been a huge amount of work, but we've needed this for years. Many thanks for your review, advice, and additional hacking. Thanks also to Álvaro for review and initial commits, and to all who participated in previous discussion. -John Naylor
Re: Verbosity of genbki.pl
1. Print nothing at all. That's more in keeping with our modern build practices, but maybe it's too big a change? 2. Print just one message like "Generating postgres.bki and related files", and I guess a second one for fmgroids.h and related files. I don't have a strong preference. Opinions? The second option, please. I'd like to see some indication of the stage being done. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Verbosity of genbki.pl
(This was discussed in the "way forward on bootstrap data" thread, but we didn't do anything about it.) Traditionally genbki.pl has printed "Writing foo" for every file it writes out. As of 372728b0d that is a heck of a lot more files than it once was, so you get this: $ make -j8 -s Writing fmgroids.h Writing fmgrprotos.h Writing fmgrtab.c Writing pg_proc_d.h Writing pg_type_d.h Writing pg_attribute_d.h Writing pg_class_d.h Writing pg_attrdef_d.h Writing pg_constraint_d.h Writing pg_inherits_d.h Writing pg_index_d.h Writing pg_operator_d.h Writing pg_opfamily_d.h Writing pg_opclass_d.h Writing pg_am_d.h Writing pg_amop_d.h Writing pg_amproc_d.h Writing pg_language_d.h Writing pg_largeobject_metadata_d.h Writing pg_largeobject_d.h Writing pg_aggregate_d.h Writing pg_statistic_ext_d.h Writing pg_statistic_d.h Writing pg_rewrite_d.h Writing pg_trigger_d.h Writing pg_event_trigger_d.h Writing pg_description_d.h Writing pg_cast_d.h Writing pg_enum_d.h Writing pg_namespace_d.h Writing pg_conversion_d.h Writing pg_depend_d.h Writing pg_database_d.h Writing pg_db_role_setting_d.h Writing pg_tablespace_d.h Writing pg_pltemplate_d.h Writing pg_authid_d.h Writing pg_auth_members_d.h Writing pg_shdepend_d.h Writing pg_shdescription_d.h Writing pg_ts_config_d.h Writing pg_ts_config_map_d.h Writing pg_ts_dict_d.h Writing pg_ts_parser_d.h Writing pg_ts_template_d.h Writing pg_extension_d.h Writing pg_foreign_data_wrapper_d.h Writing pg_foreign_server_d.h Writing pg_user_mapping_d.h Writing pg_foreign_table_d.h Writing pg_policy_d.h Writing pg_replication_origin_d.h Writing pg_default_acl_d.h Writing pg_init_privs_d.h Writing pg_seclabel_d.h Writing pg_shseclabel_d.h Writing pg_collation_d.h Writing pg_partitioned_table_d.h Writing pg_range_d.h Writing pg_transform_d.h Writing pg_sequence_d.h Writing pg_publication_d.h Writing pg_publication_rel_d.h Writing pg_subscription_d.h Writing pg_subscription_rel_d.h Writing postgres.bki Writing schemapg.h Writing postgres.description 
Writing postgres.shdescription All of PostgreSQL successfully made. Ready to install. which seems a tad excessive, especially in a -s run. I think we should drop the per-file notices. But there are a couple of alternatives as to what to do exactly: 1. Print nothing at all. That's more in keeping with our modern build practices, but maybe it's too big a change? 2. Print just one message like "Generating postgres.bki and related files", and I guess a second one for fmgroids.h and related files. I don't have a strong preference. Opinions? regards, tom lane
Re: WIP: a way forward on bootstrap data
John Naylor writes: > There were a couple more catalog changes that broke patch context, so > attached is version 16. Pushed with a few more adjustments (mostly, another round of copy-editing on bki.sgml). Now we wait to see what the buildfarm thinks, particularly about the MSVC build ... Congratulations, and many thanks, to John for seeing this through! I know it's been a huge amount of work, but we've needed this for years. regards, tom lane
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
> On Apr 8, 2018, at 03:30, Craig Ringer wrote: > > These are way more likely than bit flips or other storage level corruption, > and things that we previously expected to detect and fail gracefully for. This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures. -- Christophe Pettus x...@thebuild.com
Re: WIP: Covering + unique indexes.
Thank you, fixed Jeff Janes wrote: On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev wrote: Thanks to everyone, pushed. Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609: indextuple.c: In function 'index_truncate_tuple': indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable] int indnatts = tupleDescriptor->natts; Cheers, Jeff -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: PATCH: Configurable file mode mask
Hi Michael, On 4/6/18 10:20 AM, Michael Paquier wrote: > On Fri, Apr 06, 2018 at 09:15:15AM -0400, Stephen Frost wrote: >> I'll reply to David's last email (where the latest set of patches were >> included) with my comments/suggestions and I expect we'll be able to get >> those addressed today and have a final patch to post tonight, with an >> eye towards committing it tomorrow. > > The feature freeze is on the 8th, so I am going to have limited room to > comment on things until that day. If something gets committed, I am > pretty sure that I'll get out of my pocket a couple of things to improve > the feature and its interface anyway if of course you are ready to > accept that. Improvements are always welcome! The core focus of the patch hasn't changed over the last year, but we've found better ways to implement it over time. I'm sure there's more we can do. > I have limited my role to be a reviewer, so I refrained > myself from writing any code ;) Your dedicated and tireless review helped get this patch over the line. Thank you very much for all your hard work. -- -David da...@pgmasters.net
Re: WIP: Covering + unique indexes.
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev wrote: > Thanks to everyone, pushed. > > Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609: indextuple.c: In function 'index_truncate_tuple': indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable] int indnatts = tupleDescriptor->natts; Cheers, Jeff
Re: WIP: a way forward on bootstrap data
I wrote: > I'll check back in 24 hours to see if everything still applies. There were a couple more catalog changes that broke patch context, so attached is version 16. -John Naylor v16-bootstrap-data-conversion.tar.gz Description: GNU Zip compressed data
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 8 April 2018 at 17:41, Andreas Karlsson wrote: > On 04/08/2018 05:27 AM, Craig Ringer wrote: > More below, but here's an > idea #5: decide InnoDB has the right idea, and > >> go to using a single massive blob file, or a few giant blobs. >> > > FYI: MySQL has by default one file per table these days. The old approach > with one massive file was a maintenance headache so they changed the default > some releases ago. > > https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html > > Huh, thanks for the update. We should see how they handle reliable flushing and see if they've looked into it. If they haven't, we should give them a heads-up, and if they have, let's learn from them. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 8 April 2018 at 11:46, Christophe Pettus wrote: > > > On Apr 7, 2018, at 20:27, Craig Ringer wrote: > > > > Right now I think we're at option (4): If you see anything that smells > like a write error in your kernel logs, hard-kill postgres with -m > immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint > since the logs, fake up a backup label to force redo to start from the last > checkpoint before the error. Otherwise, it's safe to just let it start up > again and do redo again. > > Before we spiral down into despair and excessive alcohol consumption, this > is basically the same situation as a checksum failure or some other kind of > uncorrected media-level error. The bad part is that we have to find out > from the kernel logs rather than from PostgreSQL directly. But this does > not strike me as otherwise significantly different from, say, an > infrequently-accessed disk block reporting an uncorrectable error when we > finally get around to reading it. > I don't entirely agree, because it affects ENOSPC, I/O errors on thin provisioned storage, I/O errors on multipath storage, etc. (I identified the original issue on a thin provisioned system that ran out of backing space, mangling PostgreSQL in a way that made no sense at the time.) These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 04/08/2018 05:27 AM, Craig Ringer wrote: > More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs. FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache so they changed the default some releases ago. https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html Andreas
Re: Checkpoint not retrying failed fsync?
On Sun, Apr 8, 2018 at 5:36 PM, Amit Kapila wrote: > Won't in the success case, you need to delete each member (by > something like bms_del_member) rather than just using bms_free? Thanks for looking at this. Yeah, if requests for segment numbers 0 and 1 were in "requests", and 0 succeeded but then 1 failed, my previous patch would leave both in there to be retried next time around. I thought that was pretty harmless so I didn't worry about it before, but of course you're right that it's not necessary to retry the ones that succeeded, so we could remove them as we go. New patch attached. -- Thomas Munro http://www.enterprisedb.com 0001-Make-sure-we-don-t-forget-about-fsync-requests-af-v3.patch Description: Binary data
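The behavior Amit asks for (drop each successfully synced segment from the pending set so only failures are retried next cycle) looks roughly like the sketch below. This is a simplified stand-in for the real Bitmapset/bms_del_member() code, and flaky_fsync is an invented callback, not the actual checkpointer logic.

```python
def process_fsync_requests(pending, do_fsync):
    """Try to fsync every pending segment; return only the ones that failed.

    Mirrors the patch's intent: successful members are removed as we go
    (the C code does this with bms_del_member), so a later retry cycle
    touches only the segments that actually need it.
    """
    failed = set()
    for seg in sorted(pending):
        try:
            do_fsync(seg)
        except OSError:
            failed.add(seg)
    return failed

disk_ok = False

def flaky_fsync(seg):
    # Hypothetical failure injection: segment 1 hits EIO on the first pass.
    if seg == 1 and not disk_ok:
        raise OSError("EIO while fsyncing segment 1")

first = process_fsync_requests({0, 1, 2}, flaky_fsync)   # only segment 1 fails
disk_ok = True
remaining = process_fsync_requests(first, flaky_fsync)   # retry touches only segment 1
```

Note that, per the fsync-error thread elsewhere in this digest, whether retrying fsync after EIO is even safe depends on the kernel; the sketch only covers the bookkeeping question.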
Re: csv format for psql
2018-04-07 14:23 GMT+02:00 Daniel Verite: > Peter Eisentraut wrote: > > > Another thought: Isn't CSV just the same as unaligned output plus some > > quoting? Could we add a quote character setting and then define --csv > > to be quote-character = " and fieldsep = , ? > > Plus footer set to off. So --csv would be like > \pset format unaligned > \pset fieldsep ',' > \pset footer off > \pset unaligned_quote_character '"' > > I guess we'd add a \csv command that does these together. > > There would still be some discomfort with some of the issues > discussed upthread. For instance > psql -F ';' --csv > is likely to reset fieldsep to ',' having then a different outcome than > psql --csv -F';' > > And there's the point on how to come back from \csv to another > output, given that it has overwritten "footer" that is used across > all formats, and fieldsep used by unaligned. > So a user might need to figure out anyway the > intricacies behind \csv, if it's not an independent format. > > On the whole I'm inclined to resubmit the patch with > fieldsep_csv and some minor changes based on the rest > of the discussion. > +1 > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite >
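Peter's observation that CSV is essentially unaligned output plus quoting can be illustrated outside psql. The sketch below is just an analogy using Python's csv module, not psql itself: the bare comma separator (what unaligned mode emits) breaks on embedded commas and quotes, and the quoting rules are exactly what CSV adds on top of the same separator.

```python
import csv
import io

rows = [["id", "note"], ["1", 'says "hi", twice']]

# fieldsep ',' with no quoting (analogous to unaligned mode): embedded
# commas and quotes are emitted verbatim, so the row is ambiguous.
unaligned = "\n".join(",".join(r) for r in rows)

# Same separator plus CSV quoting rules: the problem field gets wrapped
# in quotes and inner quotes are doubled.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()
```

This is why a hypothetical `\pset unaligned_quote_character '"'` would be enough to turn unaligned output into CSV for most data.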
Re: [sqlsmith] Failed assertion in create_gather_path
Hi, At some places, I have observed that we are adding a partial path even when rel's consider_parallel is false. Due to this, the partial path added has parallel_safe set to false and then later in create_gather_path() assertion fails. Attached patch to fix that. On Sun, Apr 8, 2018 at 12:26 AM, Andreas Seltenreichwrote: > Tom Lane writes: > > Alvaro Herrera writes: > >> Andreas Seltenreich wrote: > >>> as of 039eb6e92f: > >>> TRAP: FailedAssertion("!(subpath->parallel_safe)", File: > "pathnode.c", Line: 1813) > > > >> Uh, that's pretty strange -- that patch did not touch the planner at > >> all, and the relcache.c changes are pretty trivial. > > > > I don't think Andreas meant that the bug is necessarily new in > 039eb6e92f, > > only that that's the version he happened to be testing. > > Right. I'm not testing often enough yet to be able to report on commit > granularity :-). I'll try for a less ambiguos wording in future > reports. > > -- Jeevan Chalke Technical Architect, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company From 86f430e35583c6bfdd43c4f31e06ab83e8404e9a Mon Sep 17 00:00:00 2001 From: Jeevan Chalke Date: Sun, 8 Apr 2018 12:52:13 +0530 Subject: [PATCH] Add partial path only when rel's consider_parallel is true. Otherwise, it will result in an assertion failure later in the create_gather_path(). --- src/backend/optimizer/path/allpaths.c | 4 src/backend/optimizer/prep/prepunion.c | 2 +- src/backend/optimizer/util/pathnode.c | 3 +++ 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index c4e4db1..aa24345 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -2196,6 +2196,10 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel, pathkeys, required_outer)); } + /* If parallelism is not possible, return. 
*/ + if (!rel->consider_parallel || !bms_is_empty(required_outer)) + return; + /* If consider_parallel is false, there should be no partial paths. */ Assert(sub_final_rel->consider_parallel || sub_final_rel->partial_pathlist == NIL); diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c index 5236ab3..f54807d 100644 --- a/src/backend/optimizer/prep/prepunion.c +++ b/src/backend/optimizer/prep/prepunion.c @@ -331,7 +331,7 @@ recurse_set_operations(Node *setOp, PlannerInfo *root, * to build a partial path for this relation. But there's no point in * considering any path but the cheapest. */ - if (final_rel->partial_pathlist != NIL) + if (final_rel->consider_parallel && final_rel->partial_pathlist != NIL) { Path *partial_subpath; Path *partial_path; diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 416b3f9..fa332de 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -770,6 +770,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path) /* Check for query cancel. */ CHECK_FOR_INTERRUPTS(); + /* Path to be added must be parallel safe. */ + Assert(new_path->parallel_safe); + /* * As in add_path, throw out any paths which are dominated by the new * path, but throw out the new path if some existing path dominates it. -- 1.8.3.1
Re: lazy detoasting
> "Chapman" == Chapman Flackwrites: Chapman> Please bear with me as I check my understanding of snapshot Chapman> management by looking at PersistHoldablePortal(). There's a Chapman> PushActiveSnapshot(queryDesc->snapshot) in there. Is that Chapman> because: Chapman> (d) some other reason I haven't thought of ? It has to be pushed as the active snapshot so that it's the snapshot that the executor uses to run the query to populate the tuplestore which becomes the "held" portal content. I think you're confusing the stack of active snapshots with the registry of snapshots. The snapshot of a portal that's not currently running in the executor won't be on the stack, but it will be registered (which is enough to maintain the session's reported xmin, which is what prevents visible data from being vacuumed away). -- Andrew (irc:RhodiumToad)