Re: TopoSort() fix
On Tue, Jul 30, 2019 at 2:10 PM Tom Lane wrote: > Sure. But I think what we can foresee is that if there are any bugs > reachable this way, they'd be reachable and need fixing regardless. > We've already established that parallel workers can take and release locks > that their leader isn't holding. Apparently, they won't take anything > stronger than RowExclusiveLock; but even AccessShare is enough to let a > process participate in all interesting behaviors of the lock manager, > including blocking, being blocked, and being released early by deadlock > resolution. And the advisory-lock functions are pretty darn thin wrappers > around the lock manager. So I'm finding it hard to see where there's > incremental risk, even if a user does intentionally bypass the parallel > safety markings. And what we get in return is an easier way to add tests > for this area. Sure, I was basically just asking whether you could foresee any crash-risk of the proposed change. It sounds like the answer is "no," so I'm fine with it on that basis. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Tue, Jul 30, 2019 at 1:44 PM Tom Lane wrote: >> Well, there'd be an actual isolation test that they work ;-), if you >> override the marking. Admittedly, one test case does not prove that >> there's no way to crash the system, but that can be said of most >> parts of Postgres. > True. I'm just talking about what we can foresee. Sure. But I think what we can foresee is that if there are any bugs reachable this way, they'd be reachable and need fixing regardless. We've already established that parallel workers can take and release locks that their leader isn't holding. Apparently, they won't take anything stronger than RowExclusiveLock; but even AccessShare is enough to let a process participate in all interesting behaviors of the lock manager, including blocking, being blocked, and being released early by deadlock resolution. And the advisory-lock functions are pretty darn thin wrappers around the lock manager. So I'm finding it hard to see where there's incremental risk, even if a user does intentionally bypass the parallel safety markings. And what we get in return is an easier way to add tests for this area. regards, tom lane
Re: TopoSort() fix
On Tue, Jul 30, 2019 at 1:44 PM Tom Lane wrote: > Well, there'd be an actual isolation test that they work ;-), if you > override the marking. Admittedly, one test case does not prove that > there's no way to crash the system, but that can be said of most > parts of Postgres. True. I'm just talking about what we can foresee. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Tue, Jul 30, 2019 at 1:36 PM Tom Lane wrote: >> In any case, my question at the moment is whether we need the belt-and- >> suspenders-too approach of having both non-parallel-safe marking and an >> explicit check inside these functions. We've largely moved away from >> hard-wired checks for e.g. superuserness, and surely these things are >> less dangerous than most formerly-superuser-only functions. > If we can't think of a way that the lack of these checks could crash > it, then I think it's OK to remove the hardwired checks. If we can, > I'd favor keeping them. Well, there'd be an actual isolation test that they work ;-), if you override the marking. Admittedly, one test case does not prove that there's no way to crash the system, but that can be said of most parts of Postgres. regards, tom lane
Re: TopoSort() fix
On Tue, Jul 30, 2019 at 1:36 PM Tom Lane wrote: > No, there's a sufficient reason why we should force advisory locks > to be taken in the leader process, namely that the behavior is totally > different if we don't: they will disappear at the end of the parallel > worker run, not at end of transaction or session as documented. Oh, good point. I forgot about that. > However, that argument doesn't seem to be a reason why the advisory-lock > functions couldn't be parallel-restricted rather than parallel-unsafe. Agreed. > In any case, my question at the moment is whether we need the belt-and- > suspenders-too approach of having both non-parallel-safe marking and an > explicit check inside these functions. We've largely moved away from > hard-wired checks for e.g. superuserness, and surely these things are > less dangerous than most formerly-superuser-only functions. If we can't think of a way that the lack of these checks could crash it, then I think it's OK to remove the hardwired checks. If we can, I'd favor keeping them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Tue, Jul 30, 2019 at 10:27 AM Tom Lane wrote: >> (BTW, why aren't these functions just "parallel restricted"?) > ... > But it is really pretty arguable whether we should feel responsible > for that problem. We could just decide that if you're doing that, and > you don't want the scenario described above to happen, you oughta mark > the function that contains this logic at least PARALLEL RESTRICTED, > and if you don't, then it's your fault for doing a dumb thing. I > believe when we were early on in the development of this we wanted to > be very conservative lest, ah, someone accuse us of not locking things > down well enough, but maybe at this point parallel query is a > sufficiently well-established concept that we should lighten up on > some cases where we took an overly-stringent line. If we take that > view, then I'm not sure why these functions couldn't be just marked > PARALLEL SAFE. No, there's a sufficient reason why we should force advisory locks to be taken in the leader process, namely that the behavior is totally different if we don't: they will disappear at the end of the parallel worker run, not at end of transaction or session as documented. However, that argument doesn't seem to be a reason why the advisory-lock functions couldn't be parallel-restricted rather than parallel-unsafe. In any case, my question at the moment is whether we need the belt-and- suspenders-too approach of having both non-parallel-safe marking and an explicit check inside these functions. We've largely moved away from hard-wired checks for e.g. superuserness, and surely these things are less dangerous than most formerly-superuser-only functions. regards, tom lane
Re: TopoSort() fix
On Tue, Jul 30, 2019 at 10:27 AM Tom Lane wrote: > I also looked into whether one could use SELECT FOR UPDATE/SHARE to get > stronger locks at a tuple level, but that's been blocked off as well. > You guys really did a pretty good job of locking that down. Thanks. We learned from the master. > After thinking about this for awhile, though, I believe it might be > reasonable to just remove PreventAdvisoryLocksInParallelMode() > altogether. The "parallel unsafe" markings on the advisory-lock > functions seem like adequate protection against somebody running > them in a parallel worker. If you defeat that by calling them from > a mislabeled-parallel-safe wrapper (as the proposed test case does), > then any negative consequences are on your own head. AFAICT the > only actual negative consequence is that the locks disappear the > moment the parallel worker exits, so we'd not be opening any large > holes even to people who rip off the safety cover. > > (BTW, why aren't these functions just "parallel restricted"?) I don't exactly remember why we installed all of these restrictions any more. You might be able to find some discussion of it by searching the archives. I believe we may have been concerned about the fact that group locking would cause advisory locks taken in one process not to conflict with the same advisory lock taken in some cooperating process, and maybe that would be unwelcome behavior for someone. For example, suppose the user defines a function that takes an advisory lock on the number 1, does a bunch of stuff that should never happen multiply at the same time, and then releases the lock. Without parallel query, that will work. With parallel query, it won't, because several workers running the same query might run the same function simultaneously and their locks won't conflict. But it is really pretty arguable whether we should feel responsible for that problem. We could just decide that if you're doing that, and you don't want the scenario described above to happen, you oughta mark the function that contains this logic at least PARALLEL RESTRICTED, and if you don't, then it's your fault for doing a dumb thing. I believe when we were early on in the development of this we wanted to be very conservative lest, ah, someone accuse us of not locking things down well enough, but maybe at this point parallel query is a sufficiently well-established concept that we should lighten up on some cases where we took an overly-stringent line. If we take that view, then I'm not sure why these functions couldn't be just marked PARALLEL SAFE. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Mon, Jul 29, 2019 at 9:48 PM Tom Lane wrote: >> I believe the only accessible route to taking any sort of new lock >> in a parallel worker is catalog lookups causing AccessShareLock on >> a catalog. > Can't the worker just query a previously-untouched table, maybe by > constructing a string and then using EXECUTE to execute it? Hm, yeah, looks like you could get a new AccessShareLock that way too. But not any exclusive lock. I also looked into whether one could use SELECT FOR UPDATE/SHARE to get stronger locks at a tuple level, but that's been blocked off as well. You guys really did a pretty good job of locking that down. After thinking about this for awhile, though, I believe it might be reasonable to just remove PreventAdvisoryLocksInParallelMode() altogether. The "parallel unsafe" markings on the advisory-lock functions seem like adequate protection against somebody running them in a parallel worker. If you defeat that by calling them from a mislabeled-parallel-safe wrapper (as the proposed test case does), then any negative consequences are on your own head. AFAICT the only actual negative consequence is that the locks disappear the moment the parallel worker exits, so we'd not be opening any large holes even to people who rip off the safety cover. (BTW, why aren't these functions just "parallel restricted"?) regards, tom lane
Re: TopoSort() fix
Alvaro Herrera writes: > On 2019-Jul-29, Tom Lane wrote: >> FYI, I just got done inventing a way to reach that code, and I have >> to suspect that it's impossible to do so in production, because under >> ordinary circumstances no parallel worker will take any exclusive lock >> that isn't already held by its leader. > Hmm, okay, so this wasn't a bug that would have bit anyone in practice, > yeah? That's reassuring. At the least, you'd have to go well out of your way to make it happen. regards, tom lane
Re: TopoSort() fix
On 2019-Jul-29, Tom Lane wrote: > FYI, I just got done inventing a way to reach that code, and I have > to suspect that it's impossible to do so in production, because under > ordinary circumstances no parallel worker will take any exclusive lock > that isn't already held by its leader. Hmm, okay, so this wasn't a bug that would have bit anyone in practice, yeah? That's reassuring. Thanks, -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: TopoSort() fix
On Mon, Jul 29, 2019 at 10:56:05AM -0400, Tom Lane wrote: > Michael Paquier writes: >> +1. Any volunteers? > > If Robert doesn't weigh in pretty soon, I'll take responsibility for it. Thanks Tom for taking care of it! -- Michael signature.asc Description: PGP signature
Re: TopoSort() fix
On Mon, Jul 29, 2019 at 9:48 PM Tom Lane wrote: > I tried that first. There are backstops preventing doing LOCK TABLE > in a worker, just like for advisory locks. > > I believe the only accessible route to taking any sort of new lock > in a parallel worker is catalog lookups causing AccessShareLock on > a catalog. Can't the worker just query a previously-untouched table, maybe by constructing a string and then using EXECUTE to execute it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Mon, Jul 29, 2019 at 5:55 PM Tom Lane wrote: >> FYI, I just got done inventing a way to reach that code, and I have >> to suspect that it's impossible to do so in production, because under >> ordinary circumstances no parallel worker will take any exclusive lock >> that isn't already held by its leader. (If you happen to know an >> easy counterexample, let's see it.) > I think the way you could make that happen would be to run a parallel > query that calls a user-defined function which does LOCK TABLE. I tried that first. There are backstops preventing doing LOCK TABLE in a worker, just like for advisory locks. I believe the only accessible route to taking any sort of new lock in a parallel worker is catalog lookups causing AccessShareLock on a catalog. In principle, maybe you could make a deadlock situation by combining parallel workers with something that takes AccessExclusiveLock on a catalog ... but making that into a reliable test case sounds about impossible, because AEL on a catalog will have all sorts of unpleasant side-effects, such as blocking isolationtester's own queries. (Getting it to work in a CLOBBER_CACHE_ALWAYS build seems right out, for instance.) regards, tom lane
Re: TopoSort() fix
On Mon, Jul 29, 2019 at 5:55 PM Tom Lane wrote: > FYI, I just got done inventing a way to reach that code, and I have > to suspect that it's impossible to do so in production, because under > ordinary circumstances no parallel worker will take any exclusive lock > that isn't already held by its leader. (If you happen to know an > easy counterexample, let's see it.) I think the way you could make that happen would be to run a parallel query that calls a user-defined function which does LOCK TABLE. > Anyway, armed with this, I was able to prove that HEAD just hangs up > on this test case; apparently the deadlock checker never detects that > the additional holders of the advisory lock need to be rearranged. > And removing that "break" fixes it. Nice! > So I'll go commit the break-ectomy, but what do people think about > testing this better? I think it's a great idea. I was never very happy with the amount of exercise I was able to give this code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
[ removing , as that mailing address seems to be MIA ] Robert Haas writes: > On Mon, Jul 29, 2019 at 10:56 AM Tom Lane wrote: >> If Robert doesn't weigh in pretty soon, I'll take responsibility for it. > That's fine, or if you prefer that I commit it, I will. FYI, I just got done inventing a way to reach that code, and I have to suspect that it's impossible to do so in production, because under ordinary circumstances no parallel worker will take any exclusive lock that isn't already held by its leader. (If you happen to know an easy counterexample, let's see it.) The attached heavily-hacked version of deadlock-soft.spec makes it go by forcing duplicate advisory locks to be taken in worker processes, which of course first requires disabling PreventAdvisoryLocksInParallelMode(). I kind of wonder if we should provide some debug-only, here-be-dragons way to disable that restriction so that we could make this an official regression test, because I'm now pretty suspicious that none of this code has ever executed before. Anyway, armed with this, I was able to prove that HEAD just hangs up on this test case; apparently the deadlock checker never detects that the additional holders of the advisory lock need to be rearranged. And removing that "break" fixes it. So I'll go commit the break-ectomy, but what do people think about testing this better? regards, tom lane diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c index ffd1970..1e815a2 100644 --- a/src/backend/utils/adt/lockfuncs.c +++ b/src/backend/utils/adt/lockfuncs.c @@ -658,10 +658,6 @@ pg_isolation_test_session_is_blocked(PG_FUNCTION_ARGS) static void PreventAdvisoryLocksInParallelMode(void) { - if (IsInParallelMode()) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TRANSACTION_STATE), - errmsg("cannot use advisory locks during a parallel operation"))); } /* diff --git a/src/test/isolation/expected/deadlock-soft.out b/src/test/isolation/expected/deadlock-soft.out index 24a35da..7abbd19 100644 --- a/src/test/isolation/expected/deadlock-soft.out +++ b/src/test/isolation/expected/deadlock-soft.out @@ -1,17 +1,47 @@ Parsed test spec with 4 sessions starting permutation: d1a1 d2a2 e1l e2l d1a2 d2a1 d1c e1c d2c e2c -step d1a1: LOCK TABLE a1 IN ACCESS SHARE MODE; -step d2a2: LOCK TABLE a2 IN ACCESS SHARE MODE; -step e1l: LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; -step e2l: LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; -step d1a2: LOCK TABLE a2 IN ACCESS SHARE MODE; -step d2a1: LOCK TABLE a1 IN ACCESS SHARE MODE; +step d1a1: select lock_share(1,x) from bigt limit 1; +lock_share + +1 +step d2a2: select lock_share(2,x) from bigt limit 1; +lock_share + +1 +step e1l: select lock_excl(1,x) from bigt limit 1; +step e2l: select lock_excl(2,x) from bigt limit 1; +step d1a2: SET force_parallel_mode = on; +set parallel_setup_cost=0; +set parallel_tuple_cost=0; +set min_parallel_table_scan_size=0; +set parallel_leader_participation=off; +set max_parallel_workers_per_gather=4; + select sum(lock_share(2,x)) from bigt; +step d2a1: SET force_parallel_mode = on; +set parallel_setup_cost=0; +set parallel_tuple_cost=0; +set min_parallel_table_scan_size=0; +set parallel_leader_participation=off; +set max_parallel_workers_per_gather=4; + select sum(lock_share(1,x)) from bigt; step d1a2: <... completed> +sum + +1 step d1c: COMMIT; step e1l: <... completed> -step e1c: COMMIT; +lock_excl + +1 step d2a1: <... completed> +sum + +1 +step e1c: COMMIT; step d2c: COMMIT; step e2l: <... completed> +lock_excl + +1 step e2c: COMMIT; diff --git a/src/test/isolation/specs/deadlock-soft.spec b/src/test/isolation/specs/deadlock-soft.spec index 49d16e0..b03f0c7 100644 --- a/src/test/isolation/specs/deadlock-soft.spec +++ b/src/test/isolation/specs/deadlock-soft.spec @@ -1,3 +1,12 @@ +# Test deadlock resolution with parallel process groups. + +# It's fairly hard to get parallel worker processes to block on locks, +# since generally they don't want any locks their leader didn't already +# take. We cheat like mad here by making a function that takes a lock, +# and is incorrectly marked immutable so that it can execute in a worker. +# (This also requires disabling the lockfuncs.c code that prevents that.) + +# Otherwise, this is morally equivalent to standard deadlock-soft.spec: # Four-process deadlock with two hard edges and two soft edges. # d2 waits for e1 (soft edge), e1 waits for d1 (hard edge), # d1 waits for e2 (soft edge), e2 waits for d2 (hard edge). @@ -6,35 +15,67 @@ setup { - CREATE TABLE a1 (); - CREATE TABLE a2 (); + create function lock_share(int,int) returns int language sql as + 'select pg_advisory_xact_lock_shared($1); select 1;' immutable parallel safe; + + create function lock_excl(int,int) returns int language sql as + 'select pg_advisory_xact_lock($1);
Re: TopoSort() fix
On Mon, Jul 29, 2019 at 10:56 AM Tom Lane wrote: > Michael Paquier writes: > > On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote: > >> I think this is a sufficiently obvious bug, and a sufficiently > >> obvious fix, that we should just fix it and not insist on getting > >> a reproducible test case first. I think a test case would soon > >> bit-rot anyway, and no longer exercise the problem. > >> I certainly do *not* want to wait so long that we miss the > >> upcoming minor releases. > > > +1. Any volunteers? > > If Robert doesn't weigh in pretty soon, I'll take responsibility for it. That's fine, or if you prefer that I commit it, I will. I just got back from a week's vacation and am only very gradually unburying myself from mounds of email. (Of course, the way pgsql-hackers is getting, there's sort of always a mound of email these days.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Michael Paquier writes: > On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote: >> I think this is a sufficiently obvious bug, and a sufficiently >> obvious fix, that we should just fix it and not insist on getting >> a reproducible test case first. I think a test case would soon >> bit-rot anyway, and no longer exercise the problem. >> I certainly do *not* want to wait so long that we miss the >> upcoming minor releases. > +1. Any volunteers? If Robert doesn't weigh in pretty soon, I'll take responsibility for it. regards, tom lane
Re: TopoSort() fix
On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote: > I think this is a sufficiently obvious bug, and a sufficiently > obvious fix, that we should just fix it and not insist on getting > a reproducible test case first. I think a test case would soon > bit-rot anyway, and no longer exercise the problem. > > I certainly do *not* want to wait so long that we miss the > upcoming minor releases. +1. Any volunteers? -- Michael signature.asc Description: PGP signature
Re: TopoSort() fix
Andres Freund writes: > On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote: >> Hello, is anybody looking into this issue? > I guess this is on Robert's docket otherwise. He's on vacation till > early next week... I think this is a sufficiently obvious bug, and a sufficiently obvious fix, that we should just fix it and not insist on getting a reproducible test case first. I think a test case would soon bit-rot anyway, and no longer exercise the problem. I certainly do *not* want to wait so long that we miss the upcoming minor releases. regards, tom lane
Re: TopoSort() fix
Hi, On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote: > On 2019-Jul-04, Rui Hai Jiang wrote: > > > I'll try to figure out some scenarios to do the test. A parallel process > > group is needed for the test. Rui, have you made any progress on this? > > Actually I was trying to do some testing against the locking mechanism. I > > happened to see this issue. > > Hello, is anybody looking into this issue? I guess this is on Robert's docket otherwise. He's on vacation till early next week... Greetings, Andres Freund
Re: TopoSort() fix
On 2019-Jul-04, Rui Hai Jiang wrote: > I'll try to figure out some scenarios to do the test. A parallel process > group is needed for the test. > > Actually I was trying to do some testing against the locking mechanism. I > happened to see this issue. Hello, is anybody looking into this issue? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: TopoSort() fix
I'll try to figure out some scenarios to do the test. A parallel process group is needed for the test. Actually I was trying to do some testing against the locking mechanism. I happened to see this issue. On Wed, Jul 3, 2019 at 9:38 PM Robert Haas wrote: > On Tue, Jul 2, 2019 at 11:23 AM Tom Lane wrote: > > Rui Hai Jiang writes: > > > I think I found an issue in the TopoSort() function. > > > > This indeed seems like a live bug introduced by a1c1af2a. > > Robert? > > This is pretty thoroughly swapped out of my head, but it looks like > that analysis might be correct. > > Is it practical to come up with a test case that demonstrates the problem? > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: TopoSort() fix
On Tue, Jul 2, 2019 at 11:23 AM Tom Lane wrote: > Rui Hai Jiang writes: > > I think I found an issue in the TopoSort() function. > > This indeed seems like a live bug introduced by a1c1af2a. > Robert? This is pretty thoroughly swapped out of my head, but it looks like that analysis might be correct. Is it practical to come up with a test case that demonstrates the problem? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
On Wed, Jul 03, 2019 at 10:41:59AM +0800, Rui Hai Jiang wrote: > Could the attached patch fix this issue? Or does any one else plan to fix > it? > > If people are busy and have not time, I can go ahead to fix it. To fix > this issue, do we need a patch for each official branch? Only a committer could merge any fix you produce. What you have sent looks fine to me, so let's wait for Robert, who has visiblu broken this part to comment. Back-patched versions are usually taken care of by the committer merging the fix, and by experience it is better to agree about the shape of a patch on HEAD before working on other branches. Depending on the review done, the patch's shape may change slightly... -- Michael signature.asc Description: PGP signature
Re: TopoSort() fix
Could the attached patch fix this issue? Or does any one else plan to fix it? If people are busy and have not time, I can go ahead to fix it. To fix this issue, do we need a patch for each official branch? Regards, Ruihai On Tue, Jul 2, 2019 at 11:23 PM Tom Lane wrote: > Rui Hai Jiang writes: > > I think I found an issue in the TopoSort() function. > > This indeed seems like a live bug introduced by a1c1af2a. > Robert? > > regards, tom lane >
Re: TopoSort() fix
Rui Hai Jiang writes: > I think I found an issue in the TopoSort() function. This indeed seems like a live bug introduced by a1c1af2a. Robert? regards, tom lane
TopoSort() fix
Hi Hackers, I think I found an issue in the TopoSort() function. As the comments say, /* . * .. If there are any other processes * in the same lock group on the queue, set their number of * beforeConstraints to -1 to indicate that they should be emitted * with their groupmates rather than considered separately. */ If the line "break;" exists, there is no chance to set beforeConstraints to -1 for other processes in the same lock group. So, I think we need delete the line "break;" . See the patch. I just took a look, and I found all the following versions have this line . postgresql-12beta2, postgresql-12beta1, postgresql-11.4, postgresql-11.3,postgresql-11.0, postgresql-10.9,postgresql-10.5, postgresql-10.0 Thanks, Ruihai TopoSort.patch Description: Binary data