Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: > On Fri, Mar 4, 2016 at 11:17 AM, Tom Lane wrote: >> Huh? Parallel workers are read-only; what would they be doing sending >> any of these messages? > Mumble. I have no idea what's happening here. OK, after inserting a bunch of debug logging I have figured out what is happening. The updates on trunc_stats_test et al, being updates, are done in the session's main backend. But we also have these queries: -- do a seqscan SELECT count(*) FROM tenk2; -- do an indexscan SELECT count(*) FROM tenk2 WHERE unique1 = 1; These can be, and are, done in parallel worker processes (and not necessarily the same one, either). AFAICT, the parallel worker processes send their stats messages to the stats collector more or less immediately after processing their queries. However, because of the rate-limiting logic in pgstat_report_stat, the main backend doesn't. The point of that "pg_sleep(1.0)" (which was actually added *after* wait_for_stats) is to ensure that the half-second delay in the rate limiter has been soaked up, and the stats messages sent, before we start waiting for the results to become visible in the stats collector's output. So the sequence of events when we get a failure looks like 1. parallel workers send stats updates for seqscan and indexscan on tenk2. 2. stats collector emits output files, probably as a result of an autovacuum request. 3. session's main backend finishes "pg_sleep(1.0)" and sends stats updates for what it's done lately, including the updates on trunc_stats_test et al. 4. wait_for_stats() observes that the tenk2 idx_scan count has already advanced and figures it need not wait at all. 5. We print stale stats for trunc_stats_test et al. So it appears to me that to make this robust, we need to adjust wait_for_stats to verify advances on *all three of* the tenk2 seq_scan count, the tenk2 idx_scan count, and at least one of the trunc_stats_test tables' counters, because those could be coming from three different backend processes. If we ever allow parallel workers to do writes, this will really become a mess. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Fri, Mar 4, 2016 at 11:17 AM, Tom Lane wrote: > Robert Haas writes: >> On Fri, Mar 4, 2016 at 11:03 AM, Tom Lane wrote: >>> Well, that would make the function more complicated, but maybe it's a >>> better answer. On the other hand, we know that the stats updates are >>> delivered in a deterministic order, so why not simply replace the >>> existing test in the wait function with one that looks for the truncation >>> updates? If we've gotten those, we must have gotten the earlier ones. > >> I'm not sure if that's actually true with parallel mode. I'm pretty >> sure the earlier workers will have terminated before the later ones >> start, but is that enough to guarantee that the stats collector sees >> the messages in that order? > > Huh? Parallel workers are read-only; what would they be doing sending > any of these messages? Mumble. I have no idea what's happening here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: > On Fri, Mar 4, 2016 at 11:03 AM, Tom Lane wrote: >> Well, that would make the function more complicated, but maybe it's a >> better answer. On the other hand, we know that the stats updates are >> delivered in a deterministic order, so why not simply replace the >> existing test in the wait function with one that looks for the truncation >> updates? If we've gotten those, we must have gotten the earlier ones. > I'm not sure if that's actually true with parallel mode. I'm pretty > sure the earlier workers will have terminated before the later ones > start, but is that enough to guarantee that the stats collector sees > the messages in that order? Huh? Parallel workers are read-only; what would they be doing sending any of these messages? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas wrote: > I'm not sure if that's actually true with parallel mode. I'm pretty > sure the earlier workers will have terminated before the later ones > start, but is that enough to guarantee that the stats collector sees > the messages in that order? Um. So if you have two queries that run in sequence, it's possible for workers of the first query to be still running when workers for the second query finish? That would be very strange. If that's not what you're saying, I don't understand what guarantees you say we don't have. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Fri, Mar 4, 2016 at 11:03 AM, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> That's what it looks like to me. I now think that the apparent >>> connection to parallel query is a mirage. The reason we've only >>> seen a few cases so far is that the flapping test is new: it >>> wad added in commit d42358efb16cc811, on 20 Feb. > >> It was added on Feb 20 all right, but of *last year*. It's been there >> working happily for a year now. > > Wup, you're right, failed to look closely enough at the commit log > entry. So that puts us back to wondering why exactly parallel query > is triggering this. Still, Robert's experiment with removing the > pg_sleep seems fairly conclusive: it is possible to get the failure > without parallel query. > >> Instead of adding another sleep function, another possibility is to add >> two booleans, one for the index counter and another for the truncate >> counters, and only terminate the sleep if both are true. I don't see >> any reason to make this test any slower than it already is. > > Well, that would make the function more complicated, but maybe it's a > better answer. On the other hand, we know that the stats updates are > delivered in a deterministic order, so why not simply replace the > existing test in the wait function with one that looks for the truncation > updates? If we've gotten those, we must have gotten the earlier ones. I'm not sure if that's actually true with parallel mode. I'm pretty sure the earlier workers will have terminated before the later ones start, but is that enough to guarantee that the stats collector sees the messages in that order? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Fri, Mar 4, 2016 at 10:33 AM, Tom Lane wrote: > Robert Haas writes: >> Sure. If you have an idea what the right thing to do is, please go >> ahead. > > Yeah, I'll modify the patch and commit sometime later today. OK, if you're basing that on the patch I sent upthread, please credit Rahila Syed as the original author of that code. (I modified it before posting, but only trivially.) Of course if you do something else, then never mind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Alvaro Herrera writes: > Tom Lane wrote: >> That's what it looks like to me. I now think that the apparent >> connection to parallel query is a mirage. The reason we've only >> seen a few cases so far is that the flapping test is new: it >> wad added in commit d42358efb16cc811, on 20 Feb. > It was added on Feb 20 all right, but of *last year*. It's been there > working happily for a year now. Wup, you're right, failed to look closely enough at the commit log entry. So that puts us back to wondering why exactly parallel query is triggering this. Still, Robert's experiment with removing the pg_sleep seems fairly conclusive: it is possible to get the failure without parallel query. > Instead of adding another sleep function, another possibility is to add > two booleans, one for the index counter and another for the truncate > counters, and only terminate the sleep if both are true. I don't see > any reason to make this test any slower than it already is. Well, that would make the function more complicated, but maybe it's a better answer. On the other hand, we know that the stats updates are delivered in a deterministic order, so why not simply replace the existing test in the wait function with one that looks for the truncation updates? If we've gotten those, we must have gotten the earlier ones. In any case, the real answer to making the test less slow is to get rid of that vestigial pg_sleep. I'm wondering why we failed to remove that when we put in the wait_for_stats function... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Tom Lane wrote: > Robert Haas writes: > > Sure. If you have an idea what the right thing to do is, please go > > ahead. > > Yeah, I'll modify the patch and commit sometime later today. > > > I actually don't have a clear idea what's going on here. I > > guess it's that the wait_for_stats() guarantees that the stats message > > from the index insertion has been received but the status messages > > from the "trunc" tables might not have gotten there yet. > > That's what it looks like to me. I now think that the apparent > connection to parallel query is a mirage. The reason we've only > seen a few cases so far is that the flapping test is new: it > wad added in commit d42358efb16cc811, on 20 Feb. If we left it > as-is, I think we'd eventually see the same failure without forcing > parallel mode. In fact, that's pretty much what you describe below, > isn't it? The pg_sleep is sort of half-bakedly substituting for > a proper wait. It was added on Feb 20 all right, but of *last year*. It's been there working happily for a year now. The reason I added the trunc test in the middle of the index update tests is that I dislike tests that sleep for long without real purpose; it seems pretty reasonable to me to have both sleeps actually be the same wait. Instead of adding another sleep function, another possibility is to add two booleans, one for the index counter and another for the truncate counters, and only terminate the sleep if both are true. I don't see any reason to make this test any slower than it already is. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: > Sure. If you have an idea what the right thing to do is, please go > ahead. Yeah, I'll modify the patch and commit sometime later today. > I actually don't have a clear idea what's going on here. I > guess it's that the wait_for_stats() guarantees that the stats message > from the index insertion has been received but the status messages > from the "trunc" tables might not have gotten there yet. That's what it looks like to me. I now think that the apparent connection to parallel query is a mirage. The reason we've only seen a few cases so far is that the flapping test is new: it wad added in commit d42358efb16cc811, on 20 Feb. If we left it as-is, I think we'd eventually see the same failure without forcing parallel mode. In fact, that's pretty much what you describe below, isn't it? The pg_sleep is sort of half-bakedly substituting for a proper wait. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Fri, Mar 4, 2016 at 12:46 AM, Tom Lane wrote: > Robert Haas writes: >> A couple of my colleagues have been looking into this. It's not >> entirely clear to me what's going on here yet, but it looks like the >> stats get there if you wait long enough. Rahila Syed was able to >> reproduce the problem and says that the attached patch fixes it. But >> I don't quite understand why this should fix it. > > I don't like this patch much. While the new function is not bad in > itself, it looks really weird to call it immediately after the other > wait function. And the reason for that, AFAICT, is that somebody dropped > the entire "truncation stats" test sequence into the middle of unrelated > tests, evidently in the vain hope that that way they could piggyback > on the existing wait. Which these failures are showing us is wrong. > > I think we should move all the inserted logic down so that it's not in the > middle of unrelated testing. Sure. If you have an idea what the right thing to do is, please go ahead. I actually don't have a clear idea what's going on here. I guess it's that the wait_for_stats() guarantees that the stats message from the index insertion has been received but the status messages from the "trunc" tables might not have gotten there yet. I thought maybe that works without parallelism because all of those messages are coming from the same backend, and therefore if you have the later one you must have all of the earlier ones, too. But if you're running some of the queries in parallel workers then it's possible for a stats message from a worker run later to arrive later. But it's not that after all, because when I run the regression tests with the pg_sleep removed, I get this: *** /Users/rhaas/pgsql/src/test/regress/expected/stats.out 2016-03-04 08:55:33.0 -0500 --- /Users/rhaas/pgsql/src/test/regress/results/stats.out 2016-03-04 09:00:29.0 -0500 *** *** 127,140 1 (1 row) - -- force the rate-limiting logic in pgstat_report_tabstat() to time out - -- and send a message - SELECT pg_sleep(1.0); - pg_sleep - -- - - (1 row) - -- wait for stats collector to update SELECT wait_for_stats(); wait_for_stats --- 127,132 *** *** 148,158 WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup ---+---+---+---++ ! trunc_stats_test | 3 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 4 | 2 | 1 | 1 | 0 ! trunc_stats_test2 | 1 | 0 | 0 | 1 | 0 ! trunc_stats_test3 | 4 | 0 | 0 | 2 | 2 ! trunc_stats_test4 | 2 | 0 | 0 | 0 | 2 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, --- 140,150 WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup ---+---+---+---++ ! trunc_stats_test | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test2 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test3 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test4 | 0 | 0 | 0 | 0 | 0 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, *** *** 163,169 WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? | ?column? | ?column? --+--+--+-- ! t| t| t| t (1 row) SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages, --- 155,161 WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? | ?column? | ?column? --+--+--+-- ! f| f| f| f (1 row) SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages, *** *** 172,178 WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? --+-- ! t| t (1 row) SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer --- 164,170 WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? --+-- ! t| f (1 row) SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer That looks suspiciously similar to the failure we're getting with the force_parallel_mode testing, but I'm still confused. >> BTW, this comment is obsolete: > >> -- force the rate-limiting logic in pgstat_report_tabstat() to time out >> -- and send a message >> SELECT pg_sleep(1.
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: > A couple of my colleagues have been looking into this. It's not > entirely clear to me what's going on here yet, but it looks like the > stats get there if you wait long enough. Rahila Syed was able to > reproduce the problem and says that the attached patch fixes it. But > I don't quite understand why this should fix it. I don't like this patch much. While the new function is not bad in itself, it looks really weird to call it immediately after the other wait function. And the reason for that, AFAICT, is that somebody dropped the entire "truncation stats" test sequence into the middle of unrelated tests, evidently in the vain hope that that way they could piggyback on the existing wait. Which these failures are showing us is wrong. I think we should move all the inserted logic down so that it's not in the middle of unrelated testing. > BTW, this comment is obsolete: > -- force the rate-limiting logic in pgstat_report_tabstat() to time out > -- and send a message > SELECT pg_sleep(1.0); > pg_sleep > -- > (1 row) > That function was renamed in commit > 93c701edc6c6f065cd25f77f63ab31aff085d6ac, but apparently Tom forgot to > grep for other uses of that identifier name. Duh :-(. Actually, do we need that sleep at all anymore? Seems like wait_for_stats ought to cover it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Thu, Mar 3, 2016 at 1:10 AM, Tom Lane wrote: > Noah Misch writes: >> I've modified buildfarm member mandrill to use force_parallel_mode=regress >> and >> max_parallel_degree=5; a full run passes. We'll now see if it intermittently >> fails the stats test, like Tom witnessed: >> http://www.postgresql.org/message-id/30385.1456077...@sss.pgh.pa.us > > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2016-03-02%2023%3A34%3A10 A couple of my colleagues have been looking into this. It's not entirely clear to me what's going on here yet, but it looks like the stats get there if you wait long enough. Rahila Syed was able to reproduce the problem and says that the attached patch fixes it. But I don't quite understand why this should fix it. BTW, this comment is obsolete: -- force the rate-limiting logic in pgstat_report_tabstat() to time out -- and send a message SELECT pg_sleep(1.0); pg_sleep -- (1 row) That function was renamed in commit 93c701edc6c6f065cd25f77f63ab31aff085d6ac, but apparently Tom forgot to grep for other uses of that identifier name. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company wait-for-trunc-stats.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Noah Misch writes: > I've modified buildfarm member mandrill to use force_parallel_mode=regress and > max_parallel_degree=5; a full run passes. We'll now see if it intermittently > fails the stats test, like Tom witnessed: > http://www.postgresql.org/message-id/30385.1456077...@sss.pgh.pa.us http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2016-03-02%2023%3A34%3A10 regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Sat, Feb 27, 2016 at 7:05 PM, Noah Misch wrote: > On Fri, Feb 26, 2016 at 04:16:58PM +0530, Robert Haas wrote: >> Committed these patches after revising the comment you wrote and >> adding documentation. > > I've modified buildfarm member mandrill to use force_parallel_mode=regress and > max_parallel_degree=5; a full run passes. We'll now see if it intermittently > fails the stats test, like Tom witnessed: > http://www.postgresql.org/message-id/30385.1456077...@sss.pgh.pa.us Thank you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Fri, Feb 26, 2016 at 04:16:58PM +0530, Robert Haas wrote: > Committed these patches after revising the comment you wrote and > adding documentation. I've modified buildfarm member mandrill to use force_parallel_mode=regress and max_parallel_degree=5; a full run passes. We'll now see if it intermittently fails the stats test, like Tom witnessed: http://www.postgresql.org/message-id/30385.1456077...@sss.pgh.pa.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Wed, Feb 24, 2016 at 12:59 PM, Thomas Munro wrote: > On Wed, Feb 24, 2016 at 5:48 PM, Thomas Munro > wrote: >> Here is a first pass at that. [...] > > On Wed, Feb 24, 2016 at 1:23 AM, Robert Haas wrote: >> file_fdw is parallel-safe, ... > > And here is a patch to apply on top of the last one, to make file_fdw > return true. But does it really work correctly under parallelism? Seems like it. Running the regression tests for file_fdw under force_parallel_mode=regress, max_parallel_degree>0 passes; you can verify that's actually doing something by using force_parallel_mode=on, which will result in some predictable failures. From a theoretical point of view, there's no reason I can see why reading a file shouldn't work just as well from a parallel worker as from the leader. They both have the same view of the filesystem, and in neither case are we trying to write any data; we're just trying to read it. Committed these patches after revising the comment you wrote and adding documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Wed, Feb 24, 2016 at 5:48 PM, Thomas Munro wrote: > Here is a first pass at that. [...] On Wed, Feb 24, 2016 at 1:23 AM, Robert Haas wrote: > file_fdw is parallel-safe, ... And here is a patch to apply on top of the last one, to make file_fdw return true. But does it really work correctly under parallelism? -- Thomas Munro http://www.enterprisedb.com file-fdw-parallel-safe.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 23, 2016 at 6:45 PM, Robert Haas wrote: > On Tue, Feb 23, 2016 at 2:06 AM, Tom Lane wrote: >> Robert Haas writes: Foreign tables are supposed to be categorically excluded from parallelism. Not sure why that's not working in this instance. >> >> BTW, I wonder where you think that's supposed to be enforced, because >> I sure can't find any such logic. >> >> I suppose that has_parallel_hazard() would be the logical place to >> notice foreign tables, but it currently doesn't even visit RTEs, >> much less contain any code to check if their tables are foreign. >> Or did you have another place in mind to do that? > > RTEs are checked in set_rel_consider_parallel(), and I thought there > was a check there related to foreign tables, but there isn't. Oops. > In view of 69d34408e5e7adcef8ef2f4e9c4f2919637e9a06, we shouldn't > blindly assume that foreign scans are not parallel-safe, but we can't > blindly assume the opposite either. Maybe we should assume that the > foreign scan is parallel-safe only if one or more of the new methods > introduced by the aforementioned commit are set, but actually that > doesn't seem quite right. That would tell us whether the scan itself > can be parallelized, not whether it's safe to run serially but within > a parallel worker. I think maybe we need a new FDW API that gets > called from set_rel_consider_parallel() with the root, rel, and rte as > arguments and which can return a Boolean. If the callback is not set, > assume false. Here is a first pass at that. The patch adds IsForeignScanParallelSafe to the FDW API. postgres_fdw returns false (unnecessary but useful to verify that the regression test breaks if you change it to true), and others don't provide the function so fall back to false. I suspect there may be opportunities to return true even if snapshots and uncommitted reads aren't magically coordinated among workers. For example: users of MongoDB-type systems and text files have no expectation of either snapshot semantics or transaction isolation in the first place, so doing stuff in parallel won't be any less safe than usual as far as visibility is concerned; postgres_fdw could in theory export/import snapshots and allow parallelism in limited cases if it can somehow prove there have been no uncommitted writes; and non-MVCC/snapshot RDBMSs might be OK in lower isolation levels if you haven't written anything or have explicitly opted in to uncommitted reads (otherwise you'd risk invisible deadlock against the leader when trying to read what it has written). Please also find attached a tiny patch to respect TEMP_CONFIG for contribs. -- Thomas Munro http://www.enterprisedb.com temp-config-for-contribs.patch Description: Binary data fdw-parallel-safe-api.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 23, 2016 at 11:47 AM, Tom Lane wrote: > Even if there were, it would not fix this bug, because AFAICS the only > thing that set_rel_consider_parallel is chartered to do is set the > per-relation consider_parallel flag. The failure that is happening in > that regression test with force_parallel_mode turned on happens because > standard_planner plasters a Gather node at the top of the plan, causing > the whole plan including the FDW access to happen inside a parallel > worker. The only way to prevent that is to clear the > wholePlanParallelSafe flag, which as far as I can tell (not that any of > this is documented worth a damn) isn't something that > set_rel_consider_parallel is supposed to do. Hmm. Well, if you tested it, or looked at the places where wholePlanParallelSafe is cleared, you would find that it DOES fix the bug. create_plan() clears wholePlanParallelSafe if the plan is not parallel-safe, and the plan won't be parallel-safe unless consider_parallel was set for the underlying relation. In case you'd like to test it for yourself, here's the PoC patch I wrote: diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index bcb668f..8a4179e 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -527,6 +527,11 @@ set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel, return; return; } + + /* Not for foreign tables. */ + if (rte->relkind == RELKIND_FOREIGN_TABLE) + return; + break; case RTE_SUBQUERY: Adding that makes the postgres_fdw case pass. > It looks to me like there is a good deal of fuzzy thinking here about the > difference between locally parallelizable and globally parallelizable > plans, ie Gather at the top vs Gather somewhere else. If you have a specific complaint, I'm happy to try to improve things, or you can. I think however that it is also possible that you haven't fully understood the code I've spent the last year or so developing yet, possibly because I haven't documented it well enough, but possibly also because you haven't spent much time looking on it yet. I'm glad you are, by the way, because I'm sure there are a bunch of things here that you can improve over what I was able to do, especially on the planner side of things, and that would be great. However, a bit of forbearance would be appreciated. > I also note with > dismay that turning force_parallel_mode on seems to pretty much disable > any testing of local parallelism. No, I don't think so. It doesn't push a Gather node on top of a plan that already contains a Gather, because such a plan isn't parallel_safe. Nor does it suppress generation of parallel paths otherwise. >> In view of 69d34408e5e7adcef8ef2f4e9c4f2919637e9a06, we shouldn't >> blindly assume that foreign scans are not parallel-safe, but we can't >> blindly assume the opposite either. Maybe we should assume that the >> foreign scan is parallel-safe only if one or more of the new methods >> introduced by the aforementioned commit are set, but actually that >> doesn't seem quite right. That would tell us whether the scan itself >> can be parallelized, not whether it's safe to run serially but within >> a parallel worker. I think maybe we need a new FDW API that gets >> called from set_rel_consider_parallel() with the root, rel, and rte as >> arguments and which can return a Boolean. If the callback is not set, >> assume false. > > Meh. As things stand, postgres_fdw would have to aver that it can't ever > be safely parallelized, which doesn't seem like a very satisfactory answer > even if there are other FDWs that work differently (and which would those > be? None that use a socket-style connection to an external server.) file_fdw is parallel-safe, and KaiGai posted a patch that makes it parallel-aware, though that would have needed more work than I'm willing to put in right now to make it committable. So in other words... > The commit you mention above seems to me to highlight the dangers of > accepting hook patches with no working use-case to back them up. > AFAICT it's basically useless for typical FDWs because of this > multiple-connection problem. ...I didn't ignore this principal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: > On Tue, Feb 23, 2016 at 2:06 AM, Tom Lane wrote: >> Robert Haas writes: >>> Foreign tables are supposed to be categorically excluded from >>> parallelism. Not sure why that's not working in this instance. >> BTW, I wonder where you think that's supposed to be enforced, because >> I sure can't find any such logic. > RTEs are checked in set_rel_consider_parallel(), and I thought there > was a check there related to foreign tables, but there isn't. Oops. Even if there were, it would not fix this bug, because AFAICS the only thing that set_rel_consider_parallel is chartered to do is set the per-relation consider_parallel flag. The failure that is happening in that regression test with force_parallel_mode turned on happens because standard_planner plasters a Gather node at the top of the plan, causing the whole plan including the FDW access to happen inside a parallel worker. The only way to prevent that is to clear the wholePlanParallelSafe flag, which as far as I can tell (not that any of this is documented worth a damn) isn't something that set_rel_consider_parallel is supposed to do. It looks to me like there is a good deal of fuzzy thinking here about the difference between locally parallelizable and globally parallelizable plans, ie Gather at the top vs Gather somewhere else. I also note with dismay that turning force_parallel_mode on seems to pretty much disable any testing of local parallelism. > In view of 69d34408e5e7adcef8ef2f4e9c4f2919637e9a06, we shouldn't > blindly assume that foreign scans are not parallel-safe, but we can't > blindly assume the opposite either. Maybe we should assume that the > foreign scan is parallel-safe only if one or more of the new methods > introduced by the aforementioned commit are set, but actually that > doesn't seem quite right. That would tell us whether the scan itself > can be parallelized, not whether it's safe to run serially but within > a parallel worker. I think maybe we need a new FDW API that gets > called from set_rel_consider_parallel() with the root, rel, and rte as > arguments and which can return a Boolean. If the callback is not set, > assume false. Meh. As things stand, postgres_fdw would have to aver that it can't ever be safely parallelized, which doesn't seem like a very satisfactory answer even if there are other FDWs that work differently (and which would those be? None that use a socket-style connection to an external server.) The commit you mention above seems to me to highlight the dangers of accepting hook patches with no working use-case to back them up. AFAICT it's basically useless for typical FDWs because of this multiple-connection problem. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 23, 2016 at 2:06 AM, Tom Lane wrote: > Robert Haas writes: >>> Foreign tables are supposed to be categorically excluded from >>> parallelism. Not sure why that's not working in this instance. > > BTW, I wonder where you think that's supposed to be enforced, because > I sure can't find any such logic. > > I suppose that has_parallel_hazard() would be the logical place to > notice foreign tables, but it currently doesn't even visit RTEs, > much less contain any code to check if their tables are foreign. > Or did you have another place in mind to do that? RTEs are checked in set_rel_consider_parallel(), and I thought there was a check there related to foreign tables, but there isn't. Oops. In view of 69d34408e5e7adcef8ef2f4e9c4f2919637e9a06, we shouldn't blindly assume that foreign scans are not parallel-safe, but we can't blindly assume the opposite either. Maybe we should assume that the foreign scan is parallel-safe only if one or more of the new methods introduced by the aforementioned commit are set, but actually that doesn't seem quite right. That would tell us whether the scan itself can be parallelized, not whether it's safe to run serially but within a parallel worker. I think maybe we need a new FDW API that gets called from set_rel_consider_parallel() with the root, rel, and rte as arguments and which can return a Boolean. If the callback is not set, assume false. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Thomas Munro writes: > On Tue, Feb 23, 2016 at 4:03 AM, Tom Lane wrote: >> I've not looked at the test case to see if this is exactly what's >> going wrong, but it's pretty easy to see how there might be a problem: >> consider a STABLE user-defined function that does a SELECT from a foreign >> table. If that function call gets pushed down into a parallel worker >> then it would fail as described. > I thought user defined functions were not a problem since it's the > user's responsibility to declare functions' parallel safety correctly. > The manual says: "In general, if a function is labeled as being safe > when it is restricted or unsafe, or if it is labeled as being > restricted when it is in fact unsafe, it may throw errors or produce > wrong answers when used in a parallel query"[1]. Hm. I'm not terribly happy with this its-the-users-problem approach to things, mainly because I have little confidence that somebody couldn't figure out a security exploit based on it. > The case of a plain old SELECT (as seen in the failing regression > test) is definitely a problem though and FDW access there needs to be > detected automatically. Yes, the problem we're actually seeing in that regression test is not dependent on a function wrapper. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 23, 2016 at 4:03 AM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Feb 22, 2016 at 11:02 AM, Thomas Munro >> wrote: >>> The postgres_fdw failure is a visibility-of-my-own-uncommitted-work >>> problem. The first command in a transaction updates a row via an FDW, >>> and then the SELECT expects to see the effects, but when run in a >>> background worker it creates a new FDW connection that can't see the >>> uncommitted UPDATE. > >> Foreign tables are supposed to be categorically excluded from >> parallelism. Not sure why that's not working in this instance. > > I've not looked at the test case to see if this is exactly what's > going wrong, but it's pretty easy to see how there might be a problem: > consider a STABLE user-defined function that does a SELECT from a foreign > table. If that function call gets pushed down into a parallel worker > then it would fail as described. I thought user defined functions were not a problem since it's the user's responsibility to declare functions' parallel safety correctly. The manual says: "In general, if a function is labeled as being safe when it is restricted or unsafe, or if it is labeled as being restricted when it is in fact unsafe, it may throw errors or produce wrong answers when used in a parallel query"[1]. Uncommitted changes on foreign tables are indeed invisible to functions declared as PARALLEL SAFE, when run with force_parallel_mode = on, max_parallel_degree = 2, but the default is UNSAFE and in that case the containing query is never parallelised. Perhaps the documentation could use a specific mention of this subtlety with FDWs in the PARALLEL section? The case of a plain old SELECT (as seen in the failing regression test) is definitely a problem though and FDW access there needs to be detected automatically. I also thought that has_parallel_hazard_walker might be the right place for that logic, as you suggested in your later message. [1] http://www.postgresql.org/docs/devel/static/sql-createfunction.html -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: >> Foreign tables are supposed to be categorically excluded from >> parallelism. Not sure why that's not working in this instance. BTW, I wonder where you think that's supposed to be enforced, because I sure can't find any such logic. I suppose that has_parallel_hazard() would be the logical place to notice foreign tables, but it currently doesn't even visit RTEs, much less contain any code to check if their tables are foreign. Or did you have another place in mind to do that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Robert Haas writes: > On Mon, Feb 22, 2016 at 11:02 AM, Thomas Munro > wrote: >> The postgres_fdw failure is a visibility-of-my-own-uncommitted-work >> problem. The first command in a transaction updates a row via an FDW, >> and then the SELECT expects to see the effects, but when run in a >> background worker it creates a new FDW connection that can't see the >> uncommitted UPDATE. > Foreign tables are supposed to be categorically excluded from > parallelism. Not sure why that's not working in this instance. I've not looked at the test case to see if this is exactly what's going wrong, but it's pretty easy to see how there might be a problem: consider a STABLE user-defined function that does a SELECT from a foreign table. If that function call gets pushed down into a parallel worker then it would fail as described. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Mon, Feb 22, 2016 at 11:02 AM, Thomas Munro wrote: > On Tue, Feb 16, 2016 at 12:12 PM, Noah Misch wrote: >> On Mon, Feb 15, 2016 at 06:07:48PM -0500, Tom Lane wrote: >>> Noah Misch writes: >>> > I configured a copy of animal "mandrill" that way and launched a test run. >>> > The postgres_fdw suite failed as attached. A manual "make -C contrib >>> > installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes >>> > on >>> > x86_64 and aarch64. Since contrib test suites don't recognize >>> > TEMP_CONFIG, >>> > check-world passes everywhere. >>> >>> Hm, is this with or without the ppc-related atomics fix you just found? >> >> Without those. The ppc64 GNU/Linux configuration used gcc, though, and the >> atomics change affects xlC only. Also, the postgres_fdw behavior does not >> appear probabilistic; it failed twenty times in a row. > > The postgres_fdw failure is a visibility-of-my-own-uncommitted-work > problem. The first command in a transaction updates a row via an FDW, > and then the SELECT expects to see the effects, but when run in a > background worker it creates a new FDW connection that can't see the > uncommitted UPDATE. > > I wonder if parallelism of queries involving an FDW should not be > allowed if your transaction has written through the FDW. Foreign tables are supposed to be categorically excluded from parallelism. Not sure why that's not working in this instance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 16, 2016 at 12:12 PM, Noah Misch wrote: > On Mon, Feb 15, 2016 at 06:07:48PM -0500, Tom Lane wrote: >> Noah Misch writes: >> > I configured a copy of animal "mandrill" that way and launched a test run. >> > The postgres_fdw suite failed as attached. A manual "make -C contrib >> > installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes on >> > x86_64 and aarch64. Since contrib test suites don't recognize TEMP_CONFIG, >> > check-world passes everywhere. >> >> Hm, is this with or without the ppc-related atomics fix you just found? > > Without those. The ppc64 GNU/Linux configuration used gcc, though, and the > atomics change affects xlC only. Also, the postgres_fdw behavior does not > appear probabilistic; it failed twenty times in a row. The postgres_fdw failure is a visibility-of-my-own-uncommitted-work problem. The first command in a transaction updates a row via an FDW, and then the SELECT expects to see the effects, but when run in a background worker it creates a new FDW connection that can't see the uncommitted UPDATE. I wonder if parallelism of queries involving an FDW should not be allowed if your transaction has written through the FDW. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On 02/15/2016 07:57 PM, Tom Lane wrote: Noah Misch writes: On Mon, Feb 15, 2016 at 07:31:40PM -0500, Robert Haas wrote: Oh, crap. I didn't realize that TEMP_CONFIG didn't affect the contrib test suites. Is there any reason for that, or is it just kinda where we ended up? To my knowledge, it's just the undesirable place we ended up. Yeah. +1 for fixing that, if it's not unreasonably painful. +1 for fixing it everywhere. Historical note: back when TEMP_CONFIG was implemented, the main regression set was just about the only set of tests the buildfarm ran using a temp install. That wasn't even available for contrib and the PLs, IIRC. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Noah Misch writes: > On Mon, Feb 15, 2016 at 07:31:40PM -0500, Robert Haas wrote: >> Oh, crap. I didn't realize that TEMP_CONFIG didn't affect the contrib >> test suites. Is there any reason for that, or is it just kinda where >> we ended up? > To my knowledge, it's just the undesirable place we ended up. Yeah. +1 for fixing that, if it's not unreasonably painful. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Mon, Feb 15, 2016 at 07:31:40PM -0500, Robert Haas wrote: > On Mon, Feb 15, 2016 at 5:52 PM, Noah Misch wrote: > > On Mon, Feb 08, 2016 at 02:49:27PM -0500, Robert Haas wrote: > >> force_parallel_mode=regress > >> max_parallel_degree=2 > >> > >> And then run this: make check-world > >> TEMP_CONFIG=/path/to/aforementioned/file > > I configured a copy of animal "mandrill" that way and launched a test run. > > The postgres_fdw suite failed as attached. A manual "make -C contrib > > installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes on > > x86_64 and aarch64. Since contrib test suites don't recognize TEMP_CONFIG, > > check-world passes everywhere. > > Oh, crap. I didn't realize that TEMP_CONFIG didn't affect the contrib > test suites. Is there any reason for that, or is it just kinda where > we ended up? To my knowledge, it's just the undesirable place we ended up. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Mon, Feb 15, 2016 at 5:52 PM, Noah Misch wrote: > On Mon, Feb 08, 2016 at 02:49:27PM -0500, Robert Haas wrote: >> Well, what I've done is push into the buildfarm code that will allow >> us to do *the most exhaustive* testing that I know how to do in an >> automated fashion. Which is to create a file that says this: >> >> force_parallel_mode=regress >> max_parallel_degree=2 >> >> And then run this: make check-world TEMP_CONFIG=/path/to/aforementioned/file >> >> Now, that is not going to find bugs in the deadlock.c portion of the >> group locking patch, but it's been wildly successful in finding bugs >> in other parts of the parallelism code, and there might well be a few >> more that we haven't found yet, which is why I'm hoping that we'll get >> this procedure running regularly either on all buildfarm machines, or >> on some subset of them, or on new animals that just do this. > > I configured a copy of animal "mandrill" that way and launched a test run. > The postgres_fdw suite failed as attached. A manual "make -C contrib > installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes on > x86_64 and aarch64. Since contrib test suites don't recognize TEMP_CONFIG, > check-world passes everywhere. Oh, crap. I didn't realize that TEMP_CONFIG didn't affect the contrib test suites. Is there any reason for that, or is it just kinda where we ended up? Retrying it the way you did it, I see the same errors here, so I think this isn't a PPC-specific problem, but just a problem in general. I've actually seen these kinds of errors before in earlier versions of the testing code that eventually became force_parallel_mode. I got fooled into believing I'd fixed the problem because of my confusion about how TEMP_CONFIG worked. I think this is more likely to be a bug in force_parallel_mode than a bug in the code that checks whether a normal parallel query is safe, but I'll have to track it down before I can say for sure. Thanks for testing this. It's not delightful to discover that I muffed this, but better to find it now than in 6 months. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Mon, Feb 15, 2016 at 06:07:48PM -0500, Tom Lane wrote: > Noah Misch writes: > > I configured a copy of animal "mandrill" that way and launched a test run. > > The postgres_fdw suite failed as attached. A manual "make -C contrib > > installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes on > > x86_64 and aarch64. Since contrib test suites don't recognize TEMP_CONFIG, > > check-world passes everywhere. > > Hm, is this with or without the ppc-related atomics fix you just found? Without those. The ppc64 GNU/Linux configuration used gcc, though, and the atomics change affects xlC only. Also, the postgres_fdw behavior does not appear probabilistic; it failed twenty times in a row. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
Noah Misch writes: > I configured a copy of animal "mandrill" that way and launched a test run. > The postgres_fdw suite failed as attached. A manual "make -C contrib > installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes on > x86_64 and aarch64. Since contrib test suites don't recognize TEMP_CONFIG, > check-world passes everywhere. Hm, is this with or without the ppc-related atomics fix you just found? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers