Re: [HACKERS] 8.4-vintage problem in postmaster.c

2010-11-24 Thread Stefan Kaltenbrunner

On 11/15/2010 03:24 PM, Alvaro Herrera wrote:
 Excerpts from Tom Lane's message of Sat Nov 13 19:07:50 -0300 2010:
  Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
   On 11/13/2010 06:58 PM, Tom Lane wrote:
    Just looking at it, I think that the logic in canAcceptConnections got
    broken by somebody in 8.4, and then broken some more in 9.0: in some
    cases it will return an okay to proceed status without having checked
    for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
    PM_HOT_STANDBY state?  What version was actually running?

   I don't have too many details on the actual setup (working on that) but
   the box in question is running 8.4.2 and had no issues before the
   upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage
   looks plausible).

  Well, this failure would certainly involve a flood of connection
  attempts, so it's possible it's a pre-existing bug that they just did
  not happen to trip over before.  But the sequence of events that I'm
  thinking about is a smart shutdown attempt (SIGTERM to postmaster)
  while an online backup is in progress, followed by a flood of
  near-simultaneous connection attempts while the backup is still active.

 As far as I could gather from Stefan's description, I think this is
 pretty unlikely.  It seems to me that the too many children error
 message is very common in the 8.3 setup already, and the only reason
 they have a problem on 8.4 is that it crashes instead.

not sure if that is true - but 8.4 crashes whereas 8.3 just (seems to)
work - the issue is still there with 8_4_STABLE...

DEBUG3 level output (last few hours - 7MB in size) is available under
http://www.kaltenbrunner.cc/files/postgresql-2010-11-24_143513.log

From looking at the code I'm not immediately seeing what is going wrong
here but maybe somebody else has an idea.



Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 8.4-vintage problem in postmaster.c

2010-11-15 Thread Alvaro Herrera
Excerpts from Tom Lane's message of Sat Nov 13 19:07:50 -0300 2010:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
  On 11/13/2010 06:58 PM, Tom Lane wrote:
  Just looking at it, I think that the logic in canAcceptConnections got
  broken by somebody in 8.4, and then broken some more in 9.0: in some
  cases it will return an okay to proceed status without having checked
  for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
  PM_HOT_STANDBY state?  What version was actually running?
 
  I don't have too many details on the actual setup (working on that) but 
  the box in question is running 8.4.2 and had no issues before the 
  upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage 
  looks plausible).
 
 Well, this failure would certainly involve a flood of connection
 attempts, so it's possible it's a pre-existing bug that they just did
 not happen to trip over before.  But the sequence of events that I'm
 thinking about is a smart shutdown attempt (SIGTERM to postmaster)
 while an online backup is in progress, followed by a flood of
 near-simultaneous connection attempts while the backup is still active.

As far as I could gather from Stefan's description, I think this is
pretty unlikely.  It seems to me that the too many children error
message is very common in the 8.3 setup already, and the only reason
they have a problem on 8.4 is that it crashes instead.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] 8.4-vintage problem in postmaster.c

2010-11-13 Thread Tom Lane
Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 Stefan Kaltenbrunner reported a problem in postmaster via IM to me.  I
 thought I had nailed down the bug, but after more careful reading of the
 code, turns out I was wrong.

 The reported problem is that postmaster shuts itself down with this
 error message:

 2010-11-12 10:19:05 CET FATAL:  no free slots in PMChildFlags array

Just looking at it, I think that the logic in canAcceptConnections got
broken by somebody in 8.4, and then broken some more in 9.0: in some
cases it will return an okay to proceed status without having checked
for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
PM_HOT_STANDBY state?  What version was actually running?

regards, tom lane
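
The control-flow problem described above (an early "okay to proceed" return that skips the TOOMANY check) can be illustrated with a minimal sketch. This is not the postmaster.c source: the enums, the function names can_accept_suspect and can_accept_checked, and the max_children parameter are all hypothetical stand-ins, and the second function is only one possible ordering, not a proposed patch.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins; not the real postmaster.c definitions. */
typedef enum { PM_RUN, PM_WAIT_BACKUP, PM_SHUTDOWN } PMState;
typedef enum { CAC_OK, CAC_WAITBACKUP, CAC_SHUTDOWN, CAC_TOOMANY } CACState;

/* Suspected shape: a non-PM_RUN state returns before the limit is tested. */
static CACState
can_accept_suspect(PMState state, int children, int max_children)
{
    assert(children >= 0 && max_children > 0);
    if (state != PM_RUN)
    {
        if (state == PM_WAIT_BACKUP)
            return CAC_WAITBACKUP;  /* okay to proceed, limit never checked */
        return CAC_SHUTDOWN;
    }
    if (children >= max_children)
        return CAC_TOOMANY;
    return CAC_OK;
}

/* One possible ordering: remember the state result, always check the limit. */
static CACState
can_accept_checked(PMState state, int children, int max_children)
{
    CACState result = CAC_OK;

    assert(children >= 0 && max_children > 0);
    if (state == PM_WAIT_BACKUP)
        result = CAC_WAITBACKUP;
    else if (state != PM_RUN)
        return CAC_SHUTDOWN;
    if (children >= max_children)
        return CAC_TOOMANY;
    return result;
}
```

With the child count already at the limit, can_accept_suspect(PM_WAIT_BACKUP, 10, 10) still returns CAC_WAITBACKUP, while can_accept_checked(PM_WAIT_BACKUP, 10, 10) returns CAC_TOOMANY.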



Re: [HACKERS] 8.4-vintage problem in postmaster.c

2010-11-13 Thread Stefan Kaltenbrunner

On 11/13/2010 06:58 PM, Tom Lane wrote:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
  Stefan Kaltenbrunner reported a problem in postmaster via IM to me.  I
  thought I had nailed down the bug, but after more careful reading of the
  code, turns out I was wrong.

  The reported problem is that postmaster shuts itself down with this
  error message:

  2010-11-12 10:19:05 CET FATAL:  no free slots in PMChildFlags array

 Just looking at it, I think that the logic in canAcceptConnections got
 broken by somebody in 8.4, and then broken some more in 9.0: in some
 cases it will return an okay to proceed status without having checked
 for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
 PM_HOT_STANDBY state?  What version was actually running?

I don't have too many details on the actual setup (working on that) but
the box in question is running 8.4.2 and had no issues before the
upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage
looks plausible).



Stefan



Re: [HACKERS] 8.4-vintage problem in postmaster.c

2010-11-13 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 On 11/13/2010 06:58 PM, Tom Lane wrote:
 Just looking at it, I think that the logic in canAcceptConnections got
 broken by somebody in 8.4, and then broken some more in 9.0: in some
 cases it will return an okay to proceed status without having checked
 for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
 PM_HOT_STANDBY state?  What version was actually running?

 I don't have too many details on the actual setup (working on that) but 
 the box in question is running 8.4.2 and had no issues before the 
 upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage 
 looks plausible).

Well, this failure would certainly involve a flood of connection
attempts, so it's possible it's a pre-existing bug that they just did
not happen to trip over before.  But the sequence of events that I'm
thinking about is a smart shutdown attempt (SIGTERM to postmaster)
while an online backup is in progress, followed by a flood of
near-simultaneous connection attempts while the backup is still active.

regards, tom lane
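
The sequence of events above can be modelled as a toy simulation, purely to show why skipping the limit check during a connection flood would end in a slot-allocation failure rather than a clean rejection. Everything here is an assumption made for illustration: the Postmaster struct, MAX_CHILDREN, assign_slot and simulate_flood are hypothetical names, and the real slot array is sized differently.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHILDREN 4  /* hypothetical limit, also used as the slot-array size */

typedef struct
{
    bool wait_backup;               /* smart shutdown requested during a backup */
    bool slot_used[MAX_CHILDREN];
    int  children;
} Postmaster;

/* Returns the slot index, or -1 when every slot is taken. */
static int
assign_slot(Postmaster *pm)
{
    for (int i = 0; i < MAX_CHILDREN; i++)
    {
        if (!pm->slot_used[i])
        {
            pm->slot_used[i] = true;
            pm->children++;
            return i;
        }
    }
    return -1;
}

/* Returns how many connection attempts ran out of slots. */
static int
simulate_flood(bool wait_backup, int attempts)
{
    Postmaster pm = { .wait_backup = wait_backup };
    int slot_failures = 0;

    assert(attempts >= 0);
    for (int i = 0; i < attempts; i++)
    {
        /* Assumed bug: the limit is only enforced outside the wait-backup state. */
        if (!pm.wait_backup && pm.children >= MAX_CHILDREN)
            continue;               /* clean "too many children" rejection */
        if (assign_slot(&pm) < 0)
            slot_failures++;        /* where the real postmaster would report FATAL */
    }
    return slot_failures;
}
```

In this model simulate_flood(false, 10) returns 0, because the extra attempts are rejected before a slot is requested, while simulate_flood(true, 10) returns 6.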



Re: [HACKERS] 8.4-vintage problem in postmaster.c

2010-11-13 Thread Stefan Kaltenbrunner

On 11/13/2010 11:07 PM, Tom Lane wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
  On 11/13/2010 06:58 PM, Tom Lane wrote:
   Just looking at it, I think that the logic in canAcceptConnections got
   broken by somebody in 8.4, and then broken some more in 9.0: in some
   cases it will return an okay to proceed status without having checked
   for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
   PM_HOT_STANDBY state?  What version was actually running?

  I don't have too many details on the actual setup (working on that) but
  the box in question is running 8.4.2 and had no issues before the
  upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage
  looks plausible).

 Well, this failure would certainly involve a flood of connection
 attempts, so it's possible it's a pre-existing bug that they just did
 not happen to trip over before.  But the sequence of events that I'm

afaik this seems to be fairly reproducible on the current box so
something in 8.4 seems to trigger that issue more often now.

 thinking about is a smart shutdown attempt (SIGTERM to postmaster)
 while an online backup is in progress, followed by a flood of
 near-simultaneous connection attempts while the backup is still active.

interesting - but I don't think that the setup in question actually uses
online backups at all..




Stefan
