Re: [HACKERS] What is happening on buildfarm member crake?

2014-02-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah.  If Robert's diagnosis is correct, and it sounds pretty plausible,
 then this is really just one instance of a bug that's probably pretty
 widespread in our signal handlers.  Somebody needs to go through 'em
 all and look for touches of shared memory.

 I haven't made a comprehensive study of every signal handler we have,
 [ but here's how to fix procsignal_sigusr1_handler ]

I've trawled all the remaining signal handlers (or at least everything
declared with SIGNAL_ARGS, which hopefully is all of them).  I find
the following bugs:

1. A couple of signal handlers do
	if (MyWalSnd)
		SetLatch(&MyWalSnd->latch);
which is fine as far as it goes, but the shutdown sequence in WalSndKill
has exactly the same bug you just fixed in ProcKill: it needs to clear
MyWalSnd before disowning the latch, not after.

2. WalRcvSigUsr1Handler and worker_spi_sighup fail to preserve errno.

Will fix, but the latter bug is a tad disappointing considering that
the coding rule about saving errno in signal handlers has been there
for a *long* time.
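
For reference, the two patterns at issue look roughly like this.  The names
(ExampleSlot, MyExampleSlot, example_sigusr1_handler, ExampleKill) are made up
for illustration; this is a sketch of the coding rules, not the actual
walsender or worker_spi code:

#include "postgres.h"

#include <errno.h>

#include "storage/latch.h"

/* Made-up stand-in for MyWalSnd / the worker's own shared slot. */
typedef struct ExampleSlot
{
	Latch		latch;
	/* ... other shared state ... */
} ExampleSlot;

static volatile ExampleSlot *MyExampleSlot = NULL;

/*
 * Bug 2's rule: a handler must save and restore errno, and should cope
 * with the global pointer being NULL.
 */
static void
example_sigusr1_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	if (MyExampleSlot)
		SetLatch(&MyExampleSlot->latch);

	errno = save_errno;
}

/*
 * Bug 1's rule: the cleanup routine (registered with on_shmem_exit) must
 * clear the global pointer *before* disowning the latch, so a signal
 * arriving in between sees NULL rather than a latch we no longer own.
 */
static void
ExampleKill(int code, Datum arg)
{
	volatile ExampleSlot *slot = MyExampleSlot;

	Assert(slot != NULL);

	MyExampleSlot = NULL;
	DisownLatch(&slot->latch);
}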

regards, tom lane




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah.  If Robert's diagnosis is correct, and it sounds pretty plausible,
 then this is really just one instance of a bug that's probably pretty
 widespread in our signal handlers.  Somebody needs to go through 'em
 all and look for touches of shared memory.

 I haven't made a comprehensive study of every signal handler we have,
 but looking at procsignal_sigusr1_handler, the list of functions that
 can get called from here is quite short: CheckProcSignal(),
 RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().

You forgot HandleCatchupInterrupt and HandleNotifyInterrupt, but those
are also safe because they do nothing once proc_exit_inprogress is set.

 I think maybe the best fix is to *clear* MyProc in
 ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
 CleanupProcSignalState, as shown in the attached patch.

This looks good to me in principle.  A couple minor beefs:

* The addition to CleanupProcSignalState could use a comment,
similar to the one you added in ProcKill.

* I think the code in ProcKill and AuxiliaryProcKill might be more
readable if the new local variable was named myproc (lower case).

 and we can easily add a NULL guard to the SetLatch() call in
 procsignal_sigusr1_handler, which the attached patch also does.

Um ... no such change actually visible in patch, but it's clearly
necessary.

 This might not be a complete fix to every problem of this type that
 exists anywhere in our code, but I think it's enough to make the world
 safe for procsignal_sigusr1_handler.

Yeah; at the least this should cut down on the buildfarm noise we
are seeing ATM.

 Assuming nobody objects too much to this basic approach, should I
 back-patch the parts of this that apply pre-9.4?

Yes, I think so.  We have seen some reports of irreproducible crashes
at process exit, and maybe this explains them.

regards, tom lane




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-31 Thread Robert Haas
On Fri, Jan 31, 2014 at 6:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This looks good to me in principle.  A couple minor beefs:

 * The addition to CleanupProcSignalState could use a comment,
 similar to the one you added in ProcKill.

OK.

 * I think the code in ProcKill and AuxiliaryProcKill might be more
 readable if the new local variable was named myproc (lower case).

grep indicates that naming is less common than what I picked, so I
chose to stick with what I picked.

 and we can easily add a NULL guard to the SetLatch() call in
 procsignal_sigusr1_handler, which the attached patch also does.

 Um ... no such change actually visible in patch, but it's clearly
 necessary.

Fixed.

 This might not be a complete fix to every problem of this type that
 exists anywhere in our code, but I think it's enough to make the world
 safe for procsignal_sigusr1_handler.

 Yeah; at the least this should cut down on the buildfarm noise we
 are seeing ATM.

 Assuming nobody objects too much to this basic approach, should I
 back-patch the parts of this that apply pre-9.4?

 Yes, I think so.  We have seen some reports of irreproducible crashes
 at process exit, and maybe this explains them.

OK, done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-27 Thread Robert Haas
On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah.  If Robert's diagnosis is correct, and it sounds pretty plausible,
 then this is really just one instance of a bug that's probably pretty
 widespread in our signal handlers.  Somebody needs to go through 'em
 all and look for touches of shared memory.

I haven't made a comprehensive study of every signal handler we have,
but looking at procsignal_sigusr1_handler, the list of functions that
can get called from here is quite short: CheckProcSignal(),
RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().
Taking those in reverse order:

- latch_sigusr1_handler() is fine.  Nothing down this path touches
shared memory; moreover, if we've already disowned our latch, the
waiting flag won't be set and this will do nothing at all.
- The call to SetLatch() is problematic as we already know.  This is
new code in 9.4.
- RecoveryConflictInterrupt() does nothing if proc_exit_inprogress is
set.  So it's fine.
- CheckProcSignal() also appears problematic.  If we've already
detached shared memory, MyProcSignalSlot will be pointing to garbage,
but we'll try to dereference it anyway.

I think maybe the best fix is to *clear* MyProc in
ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
CleanupProcSignalState, as shown in the attached patch.  Most places
that dereference those pointers already check that they aren't null,
and we can easily add a NULL guard to the SetLatch() call in
procsignal_sigusr1_handler, which the attached patch also does.

This might not be a complete fix to every problem of this type that
exists anywhere in our code, but I think it's enough to make the world
safe for procsignal_sigusr1_handler.  We also have a *large* number of
signal handlers that do little more than this:

	if (MyProc)
		SetLatch(&MyProc->procLatch);

...and this change should make all of those safe as well.  So I think
this is a pretty good start.
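
To spell out the guard in procsignal_sigusr1_handler, a simplified sketch of
the handler with the check in place might look like this; the dispatch logic
is elided and the exact shape of the real change may differ:

void
procsignal_sigusr1_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	/* ... CheckProcSignal()/RecoveryConflictInterrupt() dispatch elided ... */

	/*
	 * The NULL guard: once ProcKill has cleared MyProc, a SIGUSR1 that
	 * arrives after shared memory is detached falls through here instead
	 * of dereferencing a stale pointer.
	 */
	if (MyProc != NULL)
		SetLatch(&MyProc->procLatch);

	latch_sigusr1_handler();

	errno = save_errno;
}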

Assuming nobody objects too much to this basic approach, should I
back-patch the parts of this that apply pre-9.4?  The problem with
CleanupProcSignalState, at least, goes all the way back to 9.0, when
the signal-multiplexing infrastructure was introduced.  But the
probability of an actual crash must be pretty low, or I imagine we
would have noticed this sooner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 6ebabce..1372a7e 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -149,6 +149,8 @@ CleanupProcSignalState(int status, Datum arg)
 	slot = &ProcSignalSlots[pss_idx - 1];
 	Assert(slot == MyProcSignalSlot);
 
+	MyProcSignalSlot = NULL;
+
 	/* sanity check */
 	if (slot->pss_pid != MyProcPid)
 	{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ee6c24c..f64e1c4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -772,6 +772,7 @@ ProcKill(int code, Datum arg)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile PROC_HDR *procglobal = ProcGlobal;
+	PGPROC	   *proc;
 
 	Assert(MyProc != NULL);
 
@@ -796,31 +797,34 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
-	/* Release ownership of the process's latch, too */
-	DisownLatch(&MyProc->procLatch);
+	/*
+	 * Clear MyProc first; then disown the process latch.  This is so that
+	 * signal handlers won't try to clear the process latch after it's no
+	 * longer ours.
+	 */
+	proc = MyProc;
+	MyProc = NULL;
+	DisownLatch(&proc->procLatch);
 
 	SpinLockAcquire(ProcStructLock);
 
 	/* Return PGPROC structure (and semaphore) to appropriate freelist */
 	if (IsAnyAutoVacuumProcess())
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs;
-		procglobal->autovacFreeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs;
+		procglobal->autovacFreeProcs = proc;
 	}
 	else if (IsBackgroundWorker)
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs;
-		procglobal->bgworkerFreeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs;
+		procglobal->bgworkerFreeProcs = proc;
 	}
 	else
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
-		procglobal->freeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
+		procglobal->freeProcs = proc;
 	}
 
-	/* PGPROC struct isn't mine anymore */
-	MyProc = NULL;
-
 	/* Update shared estimate of spins_per_delay */
 	procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
 
@@ -849,6 +853,7 @@ AuxiliaryProcKill(int code, Datum arg)
 {
 	int			proctype = DatumGetInt32(arg);
 	PGPROC	   *auxproc PG_USED_FOR_ASSERTS_ONLY;
+	PGPROC	   *proc;
 
 	Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
 
@@ -859,16 +864,19 @@ AuxiliaryProcKill(int code, Datum arg)
 	/* Release any LW locks I am holding (see notes above) */
 

Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-25 Thread Andrew Dunstan


On 01/19/2014 08:22 PM, Robert Haas wrote:

Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit.  And indeed
that seems like the sort of thing that would result in dying horribly
in just this way.  The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either.  However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory.  But the dynamic background
workers stuff adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died.  And that could
happen even after we've detached shared memory.



Is anything happening about this? We're still getting quite a few of 
these: 
http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake


cheers

andrew




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-25 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/19/2014 08:22 PM, Robert Haas wrote:
 Hmm, that looks an awful lot like the SIGUSR1 signal handler is
 getting called after we've already completed shmem_exit.  And indeed
 that seems like the sort of thing that would result in dying horribly
 in just this way.  The obvious fix seems to be to check
 proc_exit_inprogress before doing anything that might touch shared
 memory, but there are a lot of other SIGUSR1 handlers that don't do
 that either.  However, in those cases, the likely cause of a SIGUSR1
 would be a sinval catchup interrupt or a recovery conflict, which
 aren't likely to be so far delayed that they arrive after we've
 already disconnected from shared memory.  But the dynamic background
 workers stuff adds a new possible cause of SIGUSR1: the postmaster
 letting us know that a child has started or died.  And that could
 happen even after we've detached shared memory.

 Is anything happening about this? We're still getting quite a few of 
 these: 
 http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake

Yeah.  If Robert's diagnosis is correct, and it sounds pretty plausible,
then this is really just one instance of a bug that's probably pretty
widespread in our signal handlers.  Somebody needs to go through 'em
all and look for touches of shared memory.

I'm not sure if we can just disable signal response the moment the
proc_exit_inprogress flag goes up, though.  In some cases such as lock
handling, it's likely that we need that functionality to keep working
for some part of the shutdown process.  We might end up having to disable
individual signal handlers at appropriate places.

Ick.
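
Disabling an individual handler at such a point needn't be anything fancier
than resetting it just before the unsafe window opens; purely as an
illustration, not a proposal for where to put it:

/*
 * Illustration only: stop reacting to SIGUSR1 just before the backend
 * detaches from shared memory, so a late postmaster notification can't
 * run a handler that touches freed state.  Where (and whether) to do
 * this is exactly the open question.
 */
pqsignal(SIGUSR1, SIG_IGN);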

regards, tom lane




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-25 Thread Andrew Dunstan


On 01/25/2014 05:04 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/19/2014 08:22 PM, Robert Haas wrote:

Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit.  And indeed
that seems like the sort of thing that would result in dying horribly
in just this way.  The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either.  However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory.  But the dynamic background
workers stuff adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died.  And that could
happen even after we've detached shared memory.

Is anything happening about this? We're still getting quite a few of
these:
http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake

Yeah.  If Robert's diagnosis is correct, and it sounds pretty plausible,
then this is really just one instance of a bug that's probably pretty
widespread in our signal handlers.  Somebody needs to go through 'em
all and look for touches of shared memory.

I'm not sure if we can just disable signal response the moment the
proc_exit_inprogress flag goes up, though.  In some cases such as lock
handling, it's likely that we need that functionality to keep working
for some part of the shutdown process.  We might end up having to disable
individual signal handlers at appropriate places.

Ick.




Yeah. Since we're now providing for user-defined backends, maybe we need 
some mechanism for white-listing handlers that can run in whole or part 
during shutdown.


cheers

andrew




[HACKERS] What is happening on buildfarm member crake?

2014-01-19 Thread Tom Lane
Since contrib/test_shm_mq went into the tree, crake has been crashing
intermittently (maybe one time in three) at the contribcheck step.
While there's not conclusive proof that test_shm_mq is at fault, the
circumstantial evidence is pretty strong.  For instance, in the
postmaster log for the last failure,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2014-01-19%2013%3A26%3A08
we see

LOG:  registering background worker test_shm_mq
LOG:  starting background worker process test_shm_mq
LOG:  worker process: test_shm_mq (PID 12585) exited with exit code 1
LOG:  unregistering background worker test_shm_mq
LOG:  registering background worker test_shm_mq
LOG:  registering background worker test_shm_mq
LOG:  registering background worker test_shm_mq
LOG:  starting background worker process test_shm_mq
LOG:  starting background worker process test_shm_mq
LOG:  starting background worker process test_shm_mq
LOG:  worker process: test_shm_mq (PID 12588) exited with exit code 1
LOG:  unregistering background worker test_shm_mq
LOG:  worker process: test_shm_mq (PID 12586) exited with exit code 1
LOG:  unregistering background worker test_shm_mq
LOG:  worker process: test_shm_mq (PID 12587) exited with exit code 1
LOG:  unregistering background worker test_shm_mq
LOG:  server process (PID 12584) was terminated by signal 11: Segmentation fault
LOG:  terminating any other active server processes

Comparing the PIDs makes it seem pretty likely that what's crashing is the
backend running the test_shm_mq test.  Since the connected psql doesn't
notice any problem, most likely the crash occurs during backend shutdown
after psql has disconnected (which would also explain why no current
query gets shown in the crash report).

crake is running F16 x86_64, which I don't have installed here.  So I
tried to reproduce this on RHEL6 and F19 x86_64 machines, with build
options matching what crake says it's using, with absolutely no success.
I don't think any other buildfarm critters are showing this either, which
makes it look like it's possibly specific to the particular compiler
version crake has got.  Or maybe there's some other factor needed to
trigger it.  Anyway, I wonder whether Andrew could get a stack trace from
the core dump next time it happens.

regards, tom lane




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-19 Thread Tom Lane
I wrote:
 I don't think any other buildfarm critters are showing this either,

Scratch that --- closer inspection of the buildfarm logs shows that
kouprey and lapwing have suffered identical failures, though much less
often than crake.  That lets out the theory that it's x86_64 specific, or
even 64-bit specific.  Still have no idea why I can't reproduce it here.

regards, tom lane




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-19 Thread Andrew Dunstan


On 01/19/2014 04:14 PM, Tom Lane wrote:

I wrote:

I don't think any other buildfarm critters are showing this either,

Scratch that --- closer inspection of the buildfarm logs shows that
kouprey and lapwing have suffered identical failures, though much less
often than crake.  That lets out the theory that it's x86_64 specific, or
even 64-bit specific.  Still have no idea why I can't reproduce it here.


I can give you an account on the machine if you want to play.

Also crake does produce backtraces  on core dumps, and they are at the 
bottom of the buildfarm log. The latest failure backtrace is reproduced 
below.


   == stack trace: 
/home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==
   [New LWP 12584]
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library /lib64/libthread_db.so.1.
   Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
   Program terminated with signal 11, Segmentation fault.
   #0  SetLatch (latch=0x1c) at pg_latch.c:509
   509  if (latch->is_set)
   #0  SetLatch (latch=0x1c) at pg_latch.c:509
   #1  0x0064c04e in procsignal_sigusr1_handler 
(postgres_signal_arg=<optimized out>) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
   #2  <signal handler called>
   #3  _dl_fini () at dl-fini.c:190
   #4  0x00361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, 
run_list_atexit=true) at exit.c:78
   #5  0x00361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
   #6  0x006485a6 in proc_exit (code=0) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
   #7  0x00663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, 
dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
   #8  0x0062220f in BackendRun (port=0x12d6bf0) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
   #9  BackendStartup (port=0x12d6bf0) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
   #10 ServerLoop () at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
   #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
   #12 0x0045e2e8 in main (argc=3, argv=0x12b7430) at 
/home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

cheers

andrew




Re: [HACKERS] What is happening on buildfarm member crake?

2014-01-19 Thread Robert Haas
On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan and...@dunslane.net wrote:
 Also crake does produce backtraces  on core dumps, and they are at the
 bottom of the buildfarm log. The latest failure backtrace is reproduced
 below.

== stack trace:
 /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==
[New LWP 12584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library /lib64/libthread_db.so.1.
Core was generated by `postgres: buildfarm
 contrib_regression_test_shm_mq'.
Program terminated with signal 11, Segmentation fault.
#0  SetLatch (latch=0x1c) at pg_latch.c:509
509  if (latch->is_set)
#0  SetLatch (latch=0x1c) at pg_latch.c:509
#1  0x0064c04e in procsignal_sigusr1_handler
 (postgres_signal_arg=<optimized out>) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
#2  <signal handler called>
#3  _dl_fini () at dl-fini.c:190
#4  0x00361ba39931 in __run_exit_handlers (status=0,
 listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
#5  0x00361ba399b5 in __GI_exit (status=<optimized out>) at
 exit.c:100
#6  0x006485a6 in proc_exit (code=0) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
#7  0x00663abb in PostgresMain (argc=<optimized out>,
 argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq",
 username=<optimized out>) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
#8  0x0062220f in BackendRun (port=0x12d6bf0) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
#9  BackendStartup (port=0x12d6bf0) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
#10 ServerLoop () at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
#11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
#12 0x0045e2e8 in main (argc=3, argv=0x12b7430) at
 /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit.  And indeed
that seems like the sort of thing that would result in dying horribly
in just this way.  The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either.  However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory.  But the dynamic background
workers stuff adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died.  And that could
happen even after we've detached shared memory.
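
Concretely, the "obvious fix" described above amounts to a guard of roughly
this shape in each affected handler; this is an illustrative sketch with a
made-up handler name, not a proposed patch:

#include "postgres.h"

#include <errno.h>

#include "storage/ipc.h"		/* proc_exit_inprogress */

/*
 * Once proc_exit has begun, the handler does nothing at all, so it cannot
 * touch shared memory that may already have been detached.
 */
static void
example_sigusr1_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	if (!proc_exit_inprogress)
	{
		/* ... normal SIGUSR1 processing, which may touch shared memory ... */
	}

	errno = save_errno;
}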

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

