Re: [HACKERS] Buildfarm failure and dubious coding in predicate.c
Thomas Munro writes:
> Ahh, I think I see it.  This is an EXEC_BACKEND build farm animal.
> Theory: After the backend we see had removed the scratch entry and
> before it had restored it, another backend started up and ran
> InitPredicateLocks(), which inserted a new scratch entry without
> interlocking.

Ouch.  Yes, I think you're probably right.  It needs to skip that if
IsUnderPostmaster.  Seems like there ought to be an Assert(!found)
there, too.  And I don't think I entirely like the fact that there
are no assertions about the found/not-found cases below, either.

Will fix, unless you're already on it?

			regards, tom lane
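The guard being discussed would look roughly like this -- a simplified
sketch of the relevant spot in InitPredicateLocks() in predicate.c,
not the committed patch:

    /*
     * Create the scratch entry only in the postmaster (or a standalone
     * backend).  In EXEC_BACKEND builds, children re-run this function
     * at startup, and must not re-insert the entry without holding the
     * partition lock -- that is the race described above.
     */
    if (!IsUnderPostmaster)
    {
        bool        found;

        hash_search_with_hash_value(PredicateLockTargetHash,
                                    &ScratchTargetTag,
                                    ScratchTargetTagHash,
                                    HASH_ENTER, &found);
        Assert(!found);
    }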
Re: [HACKERS] Buildfarm failure and dubious coding in predicate.c
On Tue, Jul 25, 2017 at 7:24 AM, Tom Lane wrote:
> Thomas Munro writes:
>> Ahh, I think I see it.  This is an EXEC_BACKEND build farm animal.
>> Theory: After the backend we see had removed the scratch entry and
>> before it had restored it, another backend started up and ran
>> InitPredicateLocks(), which inserted a new scratch entry without
>> interlocking.
>
> Ouch.  Yes, I think you're probably right.  It needs to skip that if
> IsUnderPostmaster.  Seems like there ought to be an Assert(!found)
> there, too.  And I don't think I entirely like the fact that there
> are no assertions about the found/not-found cases below, either.
>
> Will fix, unless you're already on it?

I was going to send a short patch that would test IsUnderPostmaster,
but I got lost down a rabbit hole trying to figure out how to make my
EXEC_BACKEND builds run on this machine...  Please go ahead.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Buildfarm failure and dubious coding in predicate.c
On Mon, Jul 24, 2017 at 11:51 AM, Thomas Munro wrote:
> On Sun, Jul 23, 2017 at 8:32 AM, Tom Lane wrote:
>> Meanwhile, it's still pretty unclear what happened yesterday on
>> culicidae.
>
> That failure is indeed baffling.  The only code that inserts
> (HASH_ENTER[_NULL]) into PredicateLockTargetHash:
>
> 1.  CreatePredicateLock().  It would be a bug if that ever tried to
> insert a { 0, 0, 0, 0 } tag, and in any case it holds
> SerializablePredicateLockListLock in LW_SHARED.
>
> 2.  TransferPredicateLocksToNewTarget(), which removes and restores
> the scratch entry and also explicitly inserts a transferred entry.
> It asserts that it holds SerializablePredicateLockListLock and is
> called only by PredicateLockPageSplit(), which acquires it in
> LW_EXCLUSIVE.
>
> 3.  DropAllPredicateLocksFromTable(), which removes and restores the
> scratch entry and also explicitly inserts a transferred entry.
> Acquires SerializablePredicateLockListLock in LW_EXCLUSIVE.

Ahh, I think I see it.  This is an EXEC_BACKEND build farm animal.
Theory: After the backend we see had removed the scratch entry and
before it had restored it, another backend started up and ran
InitPredicateLocks(), which inserted a new scratch entry without
interlocking.

-- 
Thomas Munro
http://www.enterprisedb.com
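The assertion that fired lives in RestoreScratchTarget().  Simplified
from memory of predicate.c (details abbreviated), the function looks
roughly like this:

    static void
    RestoreScratchTarget(bool lockheld)
    {
        bool        found;

        if (!lockheld)
            LWLockAcquire(ScratchPartitionLock, LW_EXCLUSIVE);

        hash_search_with_hash_value(PredicateLockTargetHash,
                                    &ScratchTargetTag,
                                    ScratchTargetTagHash,
                                    HASH_ENTER, &found);

        /*
         * Under the theory above, this fails because a freshly started
         * EXEC_BACKEND child re-ran InitPredicateLocks() and re-inserted
         * the scratch entry while this backend had it removed.
         */
        Assert(!found);

        if (!lockheld)
            LWLockRelease(ScratchPartitionLock);
    }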
Re: [HACKERS] Buildfarm failure and dubious coding in predicate.c
On Sun, Jul 23, 2017 at 8:32 AM, Tom Lane wrote:
> Meanwhile, it's still pretty unclear what happened yesterday on
> culicidae.

That failure is indeed baffling.  The only code that inserts
(HASH_ENTER[_NULL]) into PredicateLockTargetHash:

1.  CreatePredicateLock().  It would be a bug if that ever tried to
insert a { 0, 0, 0, 0 } tag, and in any case it holds
SerializablePredicateLockListLock in LW_SHARED.

2.  TransferPredicateLocksToNewTarget(), which removes and restores
the scratch entry and also explicitly inserts a transferred entry.
It asserts that it holds SerializablePredicateLockListLock and is
called only by PredicateLockPageSplit(), which acquires it in
LW_EXCLUSIVE.

3.  DropAllPredicateLocksFromTable(), which removes and restores the
scratch entry and also explicitly inserts a transferred entry.
Acquires SerializablePredicateLockListLock in LW_EXCLUSIVE.

I wondered if DropAllPredicateLocksFromTable() had itself inserted a
tag that accidentally looks like the scratch tag in between removing
and restoring, perhaps because the relation passed in had a bogus 0
DB OID etc, but it constructs a tag with
SET_PREDICATELOCKTARGETTAG_RELATION(heaptargettag, dbId, heapId),
which sets locktag_field3 to InvalidBlockNumber == -1, not 0, so that
can't explain it.

I wondered if a concurrent PredicateLockPageSplit() called
TransferPredicateLocksToNewTarget() using a newtargettag built from a
Relation that somehow had a bogus DB OID 0, rel OID 0 and newblkno 0,
but that doesn't help because SerializablePredicateLockListLock is
acquired at LW_EXCLUSIVE, so it can't run concurrently.

It looks a bit like something at a lower level needs to be broken
(GCC 6.3 was released 6 months ago; maybe it interacts badly with
some clever memory-model-dependent code of ours?) or something needs
to be trashing memory.

Here's the set of tests that ran concurrently with select_into, whose
backtrace we see ("DROP SCHEMA selinto_schema CASCADE;"):

parallel group (20 tests): select_distinct_on delete select_having
random btree_index select_distinct namespace update case hash_index
select_implicit subselect select_into arrays prepared_xacts
transactions portals aggregates join union

Of those, I see that prepared_xacts, portals and transactions
explicitly use SERIALIZABLE (which may or may not be important).  I
wonder if the thing to do here is to run select_into (or maybe just
its setup and tear-down, "DROP SCHEMA ...") concurrently with those
others in tight loops and burn some CPU.

-- 
Thomas Munro
http://www.enterprisedb.com
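The point about locktag_field3 can be seen in the tag-building macro;
roughly as in predicate_internals.h (quoted from memory):

    /*
     * A relation-level tag always has locktag_field3 =
     * InvalidBlockNumber (0xFFFFFFFF), so it can never equal the
     * all-zero scratch tag { 0, 0, 0, 0 }.
     */
    #define SET_PREDICATELOCKTARGETTAG_RELATION(locktag, dboid, reloid) \
        ((locktag).locktag_field1 = (dboid), \
         (locktag).locktag_field2 = (reloid), \
         (locktag).locktag_field3 = InvalidBlockNumber, \
         (locktag).locktag_field4 = InvalidOffsetNumber)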
Re: [HACKERS] Buildfarm failure and dubious coding in predicate.c
I wrote:
> And, while I'm looking at this ... isn't this "scratch target" logic
> just an ineffective attempt at waving a dead chicken?  It's assuming
> that freeing an entry in a shared hash table guarantees that it can
> insert another entry.  But that hash table is partitioned, meaning it
> has a separate freelist per partition.  So the extra entry only
> provides a guarantee that you can insert something into the same
> partition it's in, making it useless for this purpose AFAICS.

After further study I found the bit in get_hash_entry() about
"borrowing" entries from other freelists.  That makes all of this
safe after all, which is a good thing because I'd also found other
assumptions that would have been broken by the behavior I posited
above.  But I think that that code is desperately undercommented --
the reader could be forgiven for thinking that it was an optional
optimization, and not something that is critical for correctness of
several different callers.  I'm going to go improve the comments.

Meanwhile, it's still pretty unclear what happened yesterday on
culicidae.

			regards, tom lane
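The borrowing behavior in question works along these lines -- a
simplified sketch of get_hash_entry() in dynahash.c, where
pop_freelist() is a hypothetical helper standing in for the real
spinlock-protected freelist manipulation:

    static HASHELEMENT *
    get_hash_entry(HTAB *hashp, int freelist_idx)
    {
        /* first try the partition's own freelist */
        HASHELEMENT *elem = pop_freelist(hashp, freelist_idx);

        if (elem == NULL && IS_PARTITIONED(hashp->hctl))
        {
            /*
             * Out of entries here, but a partitioned table may "borrow"
             * a free element from another partition's freelist.  This is
             * not just an optimization: it is what makes "delete one
             * entry, then insert one" safe across partitions.
             */
            int         borrow = freelist_idx;

            for (;;)
            {
                borrow = (borrow + 1) % NUM_FREELISTS;
                if (borrow == freelist_idx)
                    break;      /* wrapped around: genuinely out of memory */
                if ((elem = pop_freelist(hashp, borrow)) != NULL)
                    break;
            }
        }
        return elem;
    }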
[HACKERS] Buildfarm failure and dubious coding in predicate.c
Buildfarm member culicidae just showed a transient failure in the 9.4
branch:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-07-21%2017%3A49%3A37

It's an assert trap, for which the buildfarm helpfully captured a
stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7fb8d388a3fa in __GI_abort () at abort.c:89
#2  0x558d34d90814 in ExceptionalCondition (
    conditionName=conditionName@entry=0x558d34df6e2d "!(!found)",
    errorType=errorType@entry=0x558d34dcef3c "FailedAssertion",
    fileName=fileName@entry=0x558d34f19dc0
      "/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c",
    lineNumber=lineNumber@entry=2023)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/utils/error/assert.c:54
#3  0x558d34c9374b in RestoreScratchTarget (lockheld=lockheld@entry=1 '\001')
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c:2023
#4  0x558d34c966c4 in DropAllPredicateLocksFromTable (transfer=1 '\001',
    relation=relation@entry=0x7fb8d4d3aa18)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c:2997
#5  TransferPredicateLocksToHeapRelation (relation=relation@entry=0x7fb8d4d3aa18)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c:3014
#6  0x558d34ac7a70 in index_drop (indexId=29755, concurrent=concurrent@entry=0 '\000')
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/index.c:1516
#7  0x558d34ac00f8 in doDeletion (flags=-1369083928, object=0x558d35c2c03c)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:1125
#8  deleteOneObject (object=0x558d35c2c03c, depRel=depRel@entry=0x7fffae656fe8,
    flags=flags@entry=0)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:1036
#9  0x558d34ac0545 in deleteObjectsInList (targetObjects=targetObjects@entry=0x558d35bae140,
    depRel=depRel@entry=0x7fffae656fe8, flags=flags@entry=0)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:227
#10 0x558d34ac06c8 in performMultipleDeletions (objects=objects@entry=0x558d35badef0,
    behavior=DROP_CASCADE, flags=flags@entry=0)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:366
#11 0x558d34b3e2e9 in RemoveObjects (stmt=stmt@entry=0x558d35bf5678)
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/commands/dropcmds.c:134
#12 0x558d34ca61f0 in ExecDropStmt (stmt=stmt@entry=0x558d35bf5678,
    isTopLevel=isTopLevel@entry=1 '\001')
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/tcop/utility.c:1364
#13 0x558d34ca8455 in ProcessUtilitySlow (parsetree=parsetree@entry=0x558d35bf5678,
    queryString=queryString@entry=0x558d35bf4b50 "DROP SCHEMA selinto_schema CASCADE;",
    context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0,
    dest=dest@entry=0x558d35bf5a20,
    completionTag=completionTag@entry=0x7fffae657710 "")
    at /home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/tcop/utility.c:1295

I've been staring at that for a little while, and I can't see any
logic error that would lead to the failure.
Clearly it'd be expected if two sessions tried to remove/reinsert the
"scratch target" concurrently, but the locking operations should be
enough to prevent that.  (Moreover, if that had happened, you'd have
expected an earlier assertion failure in one or the other of the
RemoveScratchTarget calls.)

Plausible explanations at this point seem to be:

1. Cosmic ray bit-flip.

2. There's some bug in the lock infrastructure, allowing two
processes to acquire an LWLock concurrently.

3. Logic error I'm missing.

Probably it's #3, but what?

And, while I'm looking at this ... isn't this "scratch target" logic
just an ineffective attempt at waving a dead chicken?  It's assuming
that freeing an entry in a shared hash table guarantees that it can
insert another entry.  But that hash table is partitioned, meaning it
has a separate freelist per partition.  So the extra entry only
provides a guarantee that you can insert something into the same
partition it's in, making it useless for this purpose AFAICS.

By the same token, I do not think I believe the nearby assumptions
that deleting one entry from PredicateLockHash guarantees we can
insert another one.  That hash is partitioned as well.

It looks to me like we either need to do a fairly significant rewrite
here, or to give up on making these hashtables partitioned.  Either
one is pretty annoying, considering the very low probability of
hitting this in practice.
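The "scratch target" dance under discussion, as used by
DropAllPredicateLocksFromTable() and
TransferPredicateLocksToNewTarget(), is roughly the following -- a
sketch with most details elided; newtargettag, hash and found are
illustrative locals, not the exact predicate.c variables:

    /* give back the pre-reserved scratch entry */
    RemoveScratchTarget(true);

    /* ... an insertion that is assumed unable to fail for lack of
     * shared memory, because the scratch entry was just freed: */
    hash_search_with_hash_value(PredicateLockTargetHash, &newtargettag,
                                hash, HASH_ENTER_NULL, &found);

    /* re-reserve the scratch entry */
    RestoreScratchTarget(true);

The worry above is that the freed entry and the new entry may land in
different partitions; the follow-up about get_hash_entry() borrowing
across freelists is what resolves it.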