Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-13 Thread Robert Haas
On Sun, Aug 13, 2017 at 3:05 PM, Tom Lane  wrote:
> Not having heard anyone arguing against that, I'll go make it so,
> ie AtEOXact_CatCache is toast in all branches.

Great, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-13 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 8, 2017 at 10:48 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane  wrote:
 In the meantime, I think my vote would be to remove AtEOXact_CatCache.

>>> In all supported branches?

>> Whatever we do about this issue, I don't feel a need to do it further
>> back than HEAD.  It's a non-problem except in an assert-enabled build,
>> and we don't recommend running those for production, only development.

> Sure, but people still do testing and development against older
> branches - bug fixes, for example.  It doesn't make much sense to me
> to leave code that we know does the wrong thing in the back branches.

Not having heard anyone arguing against that, I'll go make it so,
ie AtEOXact_CatCache is toast in all branches.

regards, tom lane




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-13 Thread Tom Lane
Andreas Seltenreich  writes:
> Tom Lane writes:
>> I wonder if Andreas would be interested in trying the randomly-timed-
>> SIGTERM thing with sqlsmith.

> So far, most of the core dumps generated are Jeevan's assertion failing
> with backtraces through SearchCatCacheList.  The rest is failing this
> assertion:
> TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: "portalmem.c", Line: 846)
> Example backtrace below.  They all happened during a rollback statement.
> Testing was done on master at 2336f84284.

Interesting.  I can reproduce this by forcing a FATAL exit right there,
e.g. by adding

if (pstmt->utilityStmt && IsA(pstmt->utilityStmt, TransactionStmt) &&
    ((TransactionStmt *) pstmt->utilityStmt)->kind == TRANS_STMT_ROLLBACK)
    InterruptPending = ProcDiePending = true;

before PortalRunMulti's CHECK_FOR_INTERRUPTS.  But it only fails if the
rollback is trying to exit a transaction that's already suffered an error.
The explanation seems to be:

1. Because we already aborted the transaction, xact.c is in state
blockState == TBLOCK_ABORT.

2. This causes AbortOutOfAnyTransaction to think that it can skip
doing AbortTransaction() and go straight to CleanupTransaction().

3. AtCleanup_Portals() is expecting that we already ran, and cleared,
the portal cleanup hook (PortalCleanup) for every live portal.  But
we have not done so for the active portal that we were in the midst
of running ROLLBACK in.  So it fails the mentioned assertion.

Thus, the basic problem is that we get to CleanupTransaction() without
having done PortalDrop on the portal we're running ROLLBACK in.

We could take this as an argument for simply removing that assertion,
which would mean that when AtCleanup_Portals calls PortalDrop, the cleanup
hook would get run, the same as it would have if exec_simple_query had had
a chance to drop the portal.  However, I'm still pretty afraid of allowing
arbitrary code to execute during CleanupTransaction().  What seems like
a better idea is to make AtCleanup_Portals just summarily clear the
cleanup hook, perhaps while emitting a warning.

I tried that (see portalmem.c changes in attached patch), and what I got
was this:

regression=# rollback;
WARNING:  skipping cleanup for portal ""
FATAL:  terminating connection due to administrator command
FATAL:  cannot drop active portal ""
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

or in the server log:

2017-08-13 13:49:50.312 EDT [8737] FATAL:  terminating connection due to administrator command
2017-08-13 13:49:50.312 EDT [8737] STATEMENT:  rollback;
2017-08-13 13:49:50.312 EDT [8737] WARNING:  skipping cleanup for portal ""
2017-08-13 13:49:50.312 EDT [8737] FATAL:  cannot drop active portal ""

Well, we could tolerate that maybe, but it doesn't seem like it's actually
acting nicely; the active portal is still creating problems for us.

After some further thought, I decided to try making
AbortOutOfAnyTransaction call AtAbort_Portals() in this situation, thus
basically doing the minimum part of AbortTransaction() needed to clean up
the active portal.  That almost worked --- my initial try got an assertion
failure in mcxt.c, because somebody was trying to drop the
CurrentMemoryContext.  So the minimum part of AbortTransaction that we
need here is really AtAbort_Memory + AtAbort_Portals.  After further
thought I decided it'd be a good idea to phrase that as an unconditional
AtAbort_Memory at the top of AbortOutOfAnyTransaction, thus making sure
we are in some valid context to start with; and then, in case the loop
itself doesn't have anything to do, we need AtCleanup_Memory() at the
bottom of the function to revert CurrentMemoryContext to TopMemoryContext.

In short, the attached patch seems to fix it nicely.  We definitely need
the xact.c changes, but the ones in portalmem.c are perhaps optional, as
in theory now the assertion will never be violated.  But I'm inclined
to keep the portalmem changes anyway, as that will make it more robust
if the situation ever happens in a non-assert build.

Note: there are a couple of comments elsewhere in portalmem.c that need
adjustment if we keep the portalmem changes.  I have not bothered to do
that yet, as it's just cosmetic.

Comments?

regards, tom lane

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 50c3c3b..1eabc67 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*** AbortOutOfAnyTransaction(void)
*** 4233,4238 
--- 4233,4241 
  {
  	TransactionState s = CurrentTransactionState;
  
+ 	/* Ensure we're not running in a doomed memory context */
+ 	AtAbort_Memory();
+ 
  	/*
  	 * Get out of any transaction or nested transaction
  	 */
*** 

Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-13 Thread Andreas Seltenreich
Tom Lane writes:
> I wonder if Andreas would be interested in trying the randomly-timed-
> SIGTERM thing with sqlsmith.

So far, most of the core dumps generated are Jeevan's assertion failing
with backtraces through SearchCatCacheList.  The rest is failing this
assertion:

TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: "portalmem.c", Line: 846)

Example backtrace below.  They all happened during a rollback statement.
Testing was done on master at 2336f84284.

regards,
Andreas

Core was generated by `postgres: smith regression [local] ROLLBACK'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x7f4c26d3240a in __GI_abort () at abort.c:89
#2  0x559d18897a73 in ExceptionalCondition 
(conditionName=conditionName@entry=0x559d18a81370 "!(portal->cleanup == ((void 
*)0))", errorType=errorType@entry=0x559d188e3f7d "FailedAssertion", 
fileName=fileName@entry=0x559d18a81013 "portalmem.c", 
lineNumber=lineNumber@entry=846) at assert.c:54
#3  0x559d188c42f1 in AtCleanup_Portals () at portalmem.c:846
#4  0x559d18536cb7 in CleanupTransaction () at xact.c:2652
#5  0x559d1853b825 in AbortOutOfAnyTransaction () at xact.c:4278
#6  0x559d188a7799 in ShutdownPostgres (code=, 
arg=) at postinit.c:1146
#7  0x559d1876b4e9 in shmem_exit (code=code@entry=1) at ipc.c:228
#8  0x559d1876b5fa in proc_exit_prepare (code=code@entry=1) at ipc.c:185
#9  0x559d1876b688 in proc_exit (code=code@entry=1) at ipc.c:102
#10 0x559d188999b1 in errfinish (dummy=) at elog.c:543
#11 0x559d1878fefa in ProcessInterrupts () at postgres.c:2841
#12 0x559d18790829 in ProcessInterrupts () at postgres.c:2828
#13 0x559d18795395 in PortalRunMulti (portal=portal@entry=0x559d197f2bf0, 
isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 
'\000', dest=dest@entry=0x559d19850c40, altdest=altdest@entry=0x559d19850c40, 
completionTag=completionTag@entry=0x7ffc04f1b560 "") at pquery.c:1239
#14 0x559d18796069 in PortalRun (portal=portal@entry=0x559d197f2bf0, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
run_once=run_once@entry=1 '\001', dest=dest@entry=0x559d19850c40, 
altdest=altdest@entry=0x559d19850c40, completionTag=0x7ffc04f1b560 "") at 
pquery.c:799
#15 0x559d18791dca in exec_simple_query (query_string=0x559d1984fe00 
"ROLLBACK;") at postgres.c:1099
#16 0x559d18793af1 in PostgresMain (argc=, 
argv=argv@entry=0x559d197fa078, dbname=, username=) at postgres.c:4090
#17 0x559d184a3428 in BackendRun (port=0x559d197e8f00) at postmaster.c:4357
#18 BackendStartup (port=0x559d197e8f00) at postmaster.c:4029
#19 ServerLoop () at postmaster.c:1753
#20 0x559d1871ad65 in PostmasterMain (argc=3, argv=0x559d197be5a0) at 
postmaster.c:1361
#21 0x559d184a4a6d in main (argc=3, argv=0x559d197be5a0) at main.c:228




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Michael Paquier
On Thu, Aug 10, 2017 at 9:00 PM, Robert Haas  wrote:
> On Thu, Aug 10, 2017 at 2:50 AM, Andreas Seltenreich  
> wrote:
>> Will do.  Won't miss this chance to try out discostu's extension
>> pg_rage_terminator[1] :-)
>> [1]  https://github.com/disco-stu/pg_rage_terminator
>
> Oh, that's *awesome*.

You can do that at query level using the planner hook:
https://github.com/michaelpq/pg_plugins/tree/master/pg_panic
The PANIC should be lowered to a FATAL though.
-- 
Michael




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Robert Haas
On Tue, Aug 8, 2017 at 10:48 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane  wrote:
>>> In the meantime, I think my vote would be to remove AtEOXact_CatCache.
>
>> In all supported branches?
>
> Whatever we do about this issue, I don't feel a need to do it further
> back than HEAD.  It's a non-problem except in an assert-enabled build,
> and we don't recommend running those for production, only development.

Sure, but people still do testing and development against older
branches - bug fixes, for example.  It doesn't make much sense to me
to leave code that we know does the wrong thing in the back branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 2:50 AM, Andreas Seltenreich  wrote:
> Will do.  Won't miss this chance to try out discostu's extension
> pg_rage_terminator[1] :-)
> [1]  https://github.com/disco-stu/pg_rage_terminator

Oh, that's *awesome*.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Andreas Seltenreich
Tom Lane writes:

> I wonder if Andreas would be interested in trying the randomly-timed-
> SIGTERM thing with sqlsmith.

Will do.  Won't miss this chance to try out discostu's extension
pg_rage_terminator[1] :-)

regards,
Andreas

Footnotes: 
[1]  https://github.com/disco-stu/pg_rage_terminator




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane  wrote:
>> In the meantime, I think my vote would be to remove AtEOXact_CatCache.

> In all supported branches?

Whatever we do about this issue, I don't feel a need to do it further
back than HEAD.  It's a non-problem except in an assert-enabled build,
and we don't recommend running those for production, only development.

regards, tom lane




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Robert Haas
On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane  wrote:
> In the meantime, I think my vote would be to remove AtEOXact_CatCache.

In all supported branches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane  wrote:
>> Yeah, I thought about weakening the assertions too, but I couldn't
>> see a fix of that kind that didn't seem mighty ad-hoc.

> More concretely, the present example seems no different than the
> ResourceOwner stuff emitting warnings on commit and remaining silent
> on abort.  We could make it complain on both commit and abort, but
> then it would fail spuriously because there's no other mechanism to
> release resources in the abort path, so we don't.  Similarly here, we
> have every reason to expect the check to pass during ERROR recovery
> but there is no reason to expect it to pass during FATAL recovery, yet
> as coded we will do the test anyway.  That's wrong.

I think that's arguing from expedience not principle.  We had every
reason to think it would pass during FATAL errors too, until we noticed
this specific misbehavior; and there is at least as much of an argument
that this is a bug in SearchCatCacheList as there is that the check
is too strong.

>> Now, there is some room to argue that AtEOXact_CatCache() is just
>> obsolete and we should remove it altogether; I don't think it's
>> caught a real bug since we invented resowners in 8.1.

> Yeah, the same thought crossed my mind.  IIUC, we'd crash if a
> catcache reference were acquired without CurrentResourceOwner being
> valid, so this is really just a belt-and-suspenders check.

Right.  It was worth keeping it around till we were sure all the bugs
were shaken out of ResourceOwners, but surely we crossed the point of
diminishing returns for that long ago.

> ...  I'm also not deadly opposed to
> redesigning the whole mechanism either, but I think that should be
> approached with a measure of caution: it might end up reducing
> reliability rather than increasing it.  I suggest in any case that we
> start with a surgical fix.

In the absence of clear evidence that there are similar bugs elsewhere,
I agree that redesigning FATAL exits would likely cause more problems
than it solves.  But I feel like it would be a good thing to test for.
I wonder if Andreas would be interested in trying the randomly-timed-
SIGTERM thing with sqlsmith.

In the meantime, I think my vote would be to remove AtEOXact_CatCache.

regards, tom lane




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Robert Haas
On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane  wrote:
> Yeah, I thought about weakening the assertions too, but I couldn't
> see a fix of that kind that didn't seem mighty ad-hoc.

I don't really see what's ad-hoc about skipping it in the case where a
FATAL is in progress.  I mean, skipping a sanity check only in the
cases where we know it might pass - and are OK with the fact that it
might not pass - seems to me to be an extremely difficult policy to
argue against on rational grounds.  That's just the definition of
writing correct sanity checks.

More concretely, the present example seems no different than the
ResourceOwner stuff emitting warnings on commit and remaining silent
on abort.  We could make it complain on both commit and abort, but
then it would fail spuriously because there's no other mechanism to
release resources in the abort path, so we don't.  Similarly here, we
have every reason to expect the check to pass during ERROR recovery
but there is no reason to expect it to pass during FATAL recovery, yet
as coded we will do the test anyway.  That's wrong.

> Now, there is some room to argue that AtEOXact_CatCache() is just
> obsolete and we should remove it altogether; I don't think it's
> caught a real bug since we invented resowners in 8.1.

Yeah, the same thought crossed my mind.  IIUC, we'd crash if a
catcache reference were acquired without CurrentResourceOwner being
valid, so this is really just a belt-and-suspenders check.

> But, again, the larger question to my mind is where else we may have
> similar issues.  There's certainly never been any methodical attempt
> to see whether elog(FATAL) will work everywhere.

IIRC, you raised a similar objection back when Bruce added
pg_terminate_backend(), which caused that change to be briefly
reverted before ultimately being put back.  Despite the hardening you
did back then, I think it's highly likely that bugs remain in that
path, and I am of course not opposed to trying to improve the
situation.  That having been said, the bugs that remain are probably
mostly quite low-probability, because otherwise we'd have found them
by now.  I think parallel query is likely to flush out a few of the
ones that remain by creating a new class of backends that terminate
after a single query and may get killed by the leader at any time;
that's how we discovered this issue.  Fuzz testing could be done, too,
e.g. run something like sqlsmith simultaneously in many sessions while
killing off backends at random.  I'm also not deadly opposed to
redesigning the whole mechanism either, but I think that should be
approached with a measure of caution: it might end up reducing
reliability rather than increasing it.  I suggest in any case that we
start with a surgical fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 8, 2017 at 11:36 AM, Tom Lane  wrote:
>> We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
>> of plain PG_TRY.  But I have an itchy feeling that there may be a lot
>> of places with similar issues.  Should we be revisiting the basic way
>> that elog(FATAL) works, to make it less unlike elog(ERROR)?

> ... Arguably, this assertion is merely overzealous; we could skip
> this processing when proc_exit_inprogress, for example.

Yeah, I thought about weakening the assertions too, but I couldn't
see a fix of that kind that didn't seem mighty ad-hoc.

Now, there is some room to argue that AtEOXact_CatCache() is just
obsolete and we should remove it altogether; I don't think it's
caught a real bug since we invented resowners in 8.1.  Short of that,
though, I'm not really happy with just arbitrarily weakening the
checks.

But, again, the larger question to my mind is where else we may have
similar issues.  There's certainly never been any methodical attempt
to see whether elog(FATAL) will work everywhere.

regards, tom lane




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Robert Haas
On Tue, Aug 8, 2017 at 11:36 AM, Tom Lane  wrote:
>> Looking at the stack trace and discussing it with my team members, we
>> observed that SearchCatCacheList() increments the refcount and then
>> decrements it at the end.  However, if the backend is hit with an
>> interrupt while inside the PG_TRY() block (after the refcount has been
>> incremented), the refcount is never decremented, which later triggers
>> the assertion failure.
>
> Hm.  So SearchCatCacheList has a PG_TRY block that is meant to release
> those refcounts, but if you hit the backend with a SIGTERM while it's
> in that function, control goes out through elog(FATAL) which doesn't
> execute the PG_CATCH cleanup.  But it does do AbortTransaction which
> calls AtEOXact_CatCache, and that is expecting that all the cache
> refcounts have reached zero.
>
> We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
> of plain PG_TRY.  But I have an itchy feeling that there may be a lot
> of places with similar issues.  Should we be revisiting the basic way
> that elog(FATAL) works, to make it less unlike elog(ERROR)?

I'm not sure what the technically best fix here is.  It strikes me
that the charter of FATAL ought to be to exit the backend cleanly,
safely, and expeditiously.  PG_ENSURE_ERROR_CLEANUP, or some similar
mechanism, is necessary when we need to correct the contents of shared
memory so that other backends can continue to function, but there's no
such problem here.  You're proposing to do more work in the FATAL
pathway so that the debugging cross-checks in the FATAL pathway will
pass, which is not an entirely palatable idea.  It's definitely
frustrating that the ERROR and FATAL pathways are so different - that
generated a surprising amount of the work around DSM - but I'm still
skeptical about a solution that involves doing more cleanup of what
are essentially irrelevant backend-local data structures in the FATAL
path.  Arguably, this assertion is merely overzealous; we could skip
this processing when proc_exit_inprogress, for example.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Tom Lane
Jeevan Chalke  writes:
> We have observed a random server crash (FailedAssertion) while running a
> few tests at our end.  A stack trace is attached.

> Looking at the stack trace and discussing it with my team members, we
> observed that SearchCatCacheList() increments the refcount and then
> decrements it at the end.  However, if the backend is hit with an
> interrupt while inside the PG_TRY() block (after the refcount has been
> incremented), the refcount is never decremented, which later triggers
> the assertion failure.

Hm.  So SearchCatCacheList has a PG_TRY block that is meant to release
those refcounts, but if you hit the backend with a SIGTERM while it's
in that function, control goes out through elog(FATAL) which doesn't
execute the PG_CATCH cleanup.  But it does do AbortTransaction which
calls AtEOXact_CatCache, and that is expecting that all the cache
refcounts have reached zero.

We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
of plain PG_TRY.  But I have an itchy feeling that there may be a lot
of places with similar issues.  Should we be revisiting the basic way
that elog(FATAL) works, to make it less unlike elog(ERROR)?

regards, tom lane




[HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-08 Thread Jeevan Chalke
Hi,

We have observed a random server crash (FailedAssertion) while running a
few tests at our end.  A stack trace is attached.

Looking at the stack trace and discussing it with my team members, we
observed that SearchCatCacheList() increments the refcount and then
decrements it at the end.  However, if the backend is hit with an
interrupt while inside the PG_TRY() block (after the refcount has been
incremented), the refcount is never decremented, which later triggers
the assertion failure.

To mimic the scenario, I have added a sleep in SearchCatCacheList() as
given below:

diff --git a/src/backend/utils/cache/catcache.c
b/src/backend/utils/cache/catcache.c
index e7e8e3b..eb6d4b5 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1520,6 +1520,9 @@ SearchCatCacheList(CatCache *cache,
hashValue = CatalogCacheComputeTupleHashValue(cache, ntp);
hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);

+   elog(INFO, "Sleeping for 0.1 seconds.");
+   pg_usleep(100000L);     /* 0.1 seconds */
+
    bucket = &cache->cc_bucket[hashIndex];
    dlist_foreach(iter, bucket)
    {

And then followed these steps to get a server crash:

-- Terminal 1
DROP TYPE typ;
DROP FUNCTION func(x int);

CREATE TYPE typ AS (X VARCHAR(50), Y INT);

CREATE OR REPLACE FUNCTION func(x int) RETURNS int AS $$
DECLARE
  rec typ;
  var2 numeric;
BEGIN
  RAISE NOTICE 'Function Called.';
  REC.X := 'Hello';
  REC.Y := 0;

  IF (rec.Y + var2) = 0 THEN
RAISE NOTICE 'Check Pass';
  END IF;

  RETURN 1;
END;
$$ LANGUAGE plpgsql;

SELECT pg_backend_pid();

SELECT func(1);

-- Terminal 2, should be run in parallel when SELECT func(1) is in progress
in terminal 1.
SELECT pg_terminate_backend(<pid from terminal 1>);


I thought it worth posting here to get others attention.

I have observed this on the master branch, but can also be reproducible on
back-branches.

Thanks
-- 
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Program terminated with signal 6, Aborted.
#0  0x7fdff78951d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
audit-libs-2.6.5-3.el7_3.1.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 
glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 
krb5-libs-1.14.1-27.el7_3.x86_64 libcap-ng-0.7.5-4.el7.x86_64 
libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 
libicu-50.1.2-15.el7.x86_64 libselinux-2.5-6.el7.x86_64 
libstdc++-4.8.5-11.el7.x86_64 nspr-4.13.1-1.0.el7_3.x86_64 
nss-3.28.2-1.6.el7_3.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
nss-util-3.28.2-1.1.el7_3.x86_64 openldap-2.4.40-13.el7.x86_64 
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pam-1.1.8-18.el7.x86_64 
pcre-8.32-15.el7_2.1.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x7fdff78951d7 in raise () from /lib64/libc.so.6
#1  0x7fdff78968c8 in abort () from /lib64/libc.so.6
#2  0x009d43ff in ExceptionalCondition (conditionName=0xc02255 
"!(ct->refcount == 0)", errorType=0xc020b4 "FailedAssertion", 
fileName=0xc02060 "catcache.c", lineNumber=567) at assert.c:54
#3  0x009b59e0 in AtEOXact_CatCache (isCommit=0 '\000') at 
catcache.c:567
#4  0x00538a8d in AbortTransaction () at xact.c:2613
#5  0x0053a95e in AbortOutOfAnyTransaction () at xact.c:4271
#6  0x009e8e99 in ShutdownPostgres (code=1, arg=0) at postinit.c:1146
#7  0x008456e1 in shmem_exit (code=1) at ipc.c:228
#8  0x008455d5 in proc_exit_prepare (code=1) at ipc.c:185
#9  0x00845543 in proc_exit (code=1) at ipc.c:102
#10 0x009d4bae in errfinish (dummy=0) at elog.c:543
#11 0x008761b4 in ProcessInterrupts () at postgres.c:2882
#12 0x009d4bea in errfinish (dummy=0) at elog.c:565
#13 0x009d7097 in elog_finish (elevel=17, fmt=0xc02678 "Sleeping for 
0.1 seconds.") at elog.c:1378
#14 0x009b765b in SearchCatCacheList (cache=0x27636d0, nkeys=1, 
v1=11604184, v2=0, v3=0, v4=0) at catcache.c:1523
#15 0x009cbd20 in SearchSysCacheList (cacheId=37, nkeys=1, 
key1=11604184, key2=0, key3=0, key4=0) at syscache.c:1338
#16 0x0056e22e in OpernameGetCandidates (names=0x27f3b08, oprkind=98 
'b', missing_schema_ok=0 '\000') at namespace.c:1576
#17 0x005f09a6 in oper (pstate=0x27f3ec0, opname=0x27f3b08, ltypeId=23, 
rtypeId=1700, noError=0 '\000', location=14) at parse_oper.c:414
#18 0x005f132d in make_op (pstate=0x27f3ec0, opname=0x27f3b08, 
ltree=0x27f40f0, rtree=0x27f4128, last_srf=0x0, location=14) at parse_oper.c:778
#19 0x005e5e8d in transformAExprOp (pstate=0x27f3ec0, a=0x27f3a60) at 
parse_expr.c:933
#20 0x005e454e in transformExprRecurse (pstate=0x27f3ec0, 
expr=0x27f3a60) at parse_expr.c:217
#21 0x005e5e44 in transformAExprOp (pstate=0x27f3ec0, a=0x27f3b78) at 
parse_expr.c:930
#22 0x005e454e in transformExprRecurse (pstate=0x27f3ec0,