Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-12-09 Thread Heikki Linnakangas

On 03.12.2011 18:37, Heikki Linnakangas wrote:

One small change I'd like to make is to treat the loss of connection
more as a new top-level event, rather than as a new reason for query
cancellation. A lost connection is more like receiving SIGTERM, really.
Please take a look at the attached patch. I've been testing this by
doing SELECT pg_sleep(1), a from generate_series(1,1000) a;, and
killing the connection with killall -9 psql.


Ok, committed this.


One remaining issue I have with this is the wording of the error
message. The closest existing message we have is in old-mode COPY:

ereport(FATAL,
(errcode(ERRCODE_CONNECTION_FAILURE),
errmsg(connection lost during COPY to stdout)));

In the patch, I made the new message just connection lost, but does
anyone else have a better opinion on that? Perhaps it should be
connection lost while sending response to client. Or connection lost
during execution of statement, but that might not be accurate if we're
not executing a statement at the moment, but just sending a sync
message or similar.


I made the error message connection to client lost in the end.

I still wonder if it would be safe to simply elog(FATAL) in 
internal_flush(), but didn't dare to do that without more thorough analysis.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-12-03 Thread Heikki Linnakangas

On 19.10.2011 19:41, Greg Jaskiewicz wrote:

On 19 Oct 2011, at 18:28, Florian Pflug wrote:

All the other flags which indicate cancellation reasons are set from signal 
handers, I believe. We could of course mark as ClientConnectionLostPending as 
volatile just to be consistent. Not sure whether that's a good idea, or not. It 
might prevent a mistake should we ever add code to detect lost connections 
asynchronously (i.e., from somewhere else than pq_flush). And the cost is 
probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending 
before calling ProcessInterrupts(), so we only pay the cost of volatile if 
there's actually an interrupt pending. But I still think it's better to add 
qualifies such a volatile only when really necessary. A comment about why it 
*isn't* volatile is probably in order, though, so I'll add that in the next 
version of the patch.


Makes sense.

I had to ask, because it sticks out. And indeed there is a possibility that 
someone will one day will try to use from signal handler, etc.


Let's just mark it as volatile. It's not clear to me that we'll never 
set or read the flag while in a signal handler. We probably don't, but 
what if ImmediateInterruptOK is 'true', for example, just when we fail 
to send a response and try to set the flags. In that case, we have to be 
careful that ClientConnectionLost is set before InterruptPending (which 
we can only guarantee if both are volatile, I believe), otherwise a 
signal might arrive when we've just set InterruptPending, but not yet 
ClientConnectionLost. ProcessInterrupts() would clear InterruptPending, 
but not see ClientConnectionLost, so we would lose the interrupt.


That's not serious, and the window for that happening would be extremely 
small, and I don't think it can actually happen as the code stands, 
because we never try to send anything while ImmediateInterruptOK == 
true. But better safe than sorry.



One small change I'd like to make is to treat the loss of connection 
more as a new top-level event, rather than as a new reason for query 
cancellation. A lost connection is more like receiving SIGTERM, really. 
Please take a look at the attached patch. I've been testing this by 
doing SELECT pg_sleep(1), a from generate_series(1,1000) a;, and 
killing the connection with killall -9 psql.


One remaining issue I have with this is the wording of the error 
message. The closest existing message we have is in old-mode COPY:


ereport(FATAL,
(errcode(ERRCODE_CONNECTION_FAILURE),
 errmsg(connection lost during COPY to stdout)));

In the patch, I made the new message just connection lost, but does 
anyone else have a better opinion on that? Perhaps it should be 
connection lost while sending response to client. Or connection lost 
during execution of statement, but that might not be accurate if we're 
not executing a statement at the moment, but just sending a sync 
message or similar.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index b83a2ef..c36cf31 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1247,9 +1247,13 @@ internal_flush(void)
 
 			/*
 			 * We drop the buffered data anyway so that processing can
-			 * continue, even though we'll probably quit soon.
+			 * continue, even though we'll probably quit soon. We also
+			 * set a flag that'll cause the next CHECK_FOR_INTERRUPTS
+			 * to terminate the connection.
 			 */
 			PqSendStart = PqSendPointer = 0;
+			ClientConnectionLost = 1;
+			InterruptPending = 1;
 			return EOF;
 		}
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 976a832..13ea594 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2823,6 +2823,19 @@ ProcessInterrupts(void)
 	(errcode(ERRCODE_ADMIN_SHUTDOWN),
 			 errmsg(terminating connection due to administrator command)));
 	}
+	if (ClientConnectionLost)
+	{
+		QueryCancelPending = false;		/* lost connection trumps QueryCancel */
+		ImmediateInterruptOK = false;	/* not idle anymore */
+		DisableNotifyInterrupt();
+		DisableCatchupInterrupt();
+		/* As in quickdie, don't risk sending to client during auth */
+		if (ClientAuthInProgress  whereToSendOutput == DestRemote)
+			whereToSendOutput = DestNone;
+		ereport(FATAL,
+(errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg(connection lost)));
+	}
 	if (QueryCancelPending)
 	{
 		QueryCancelPending = false;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 9ce64e6..9417c7a 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -29,6 +29,7 @@ ProtocolVersion FrontendProtocol;
 volatile bool InterruptPending = false;
 volatile bool QueryCancelPending = false;
 volatile bool ProcDiePending = false;
+volatile bool ClientConnectionLost = false;
 volatile bool ImmediateInterruptOK = false;
 volatile uint32 

Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-27 Thread Bruce Momjian
Greg Jaskiewicz wrote:
 
 On 19 Oct 2011, at 18:28, Florian Pflug wrote:
  All the other flags which indicate cancellation reasons are set from signal 
  handers, I believe. We could of course mark as ClientConnectionLostPending 
  as volatile just to be consistent. Not sure whether that's a good idea, or 
  not. It might prevent a mistake should we ever add code to detect lost 
  connections asynchronously (i.e., from somewhere else than pq_flush). And 
  the cost is probably negligible, because CHECK_FOR_INTERRUPTS tests for 
  InterruptPending before calling ProcessInterrupts(), so we only pay the 
  cost of volatile if there's actually an interrupt pending. But I still 
  think it's better to add qualifies such a volatile only when really 
  necessary. A comment about why it *isn't* volatile is probably in order, 
  though, so I'll add that in the next version of the patch.
 
 Makes sense.
 
 I had to ask, because it sticks out. And indeed there is a possibility
 that someone will one day will try to use from signal handler, etc.

A C comment might help there.

--
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-19 Thread Florian Pflug
On Oct19, 2011, at 17:47 , Greg Jaskiewicz wrote:
 On 15 Oct 2011, at 11:31, Florian Pflug wrote:
 
 Ok, here's a first cut.
 
 So I looked at the patch, and first thing that pops out, 
 is lack of the volatile keyword before the ClientConnectionLostPending 
 variable is defined. Is that done on purpose ? Is that on purpose ?

That's on purpose. volatile is only necessary for variables which are either 
accessed from within signal handlers or which live in shared memory. Neither is 
true for ClientConnectionLostPending, so non-volatile should be fine.

 Otherwise the patch itself looks ok. 
 I haven't tested the code, just reviewed the patch itself. And it obviously 
 needs testing, should be easy to follow your original problem description. 

Yeah, further testing is on my todo list. The interesting case is probably what 
happens if the connection is dropped while there's already a cancellation 
request pending. And also the other way around - a cancellation request 
arriving after we've already discovered that the connection is gone.

 Btw, I just tried to do it through commitfest.postgresql.org , but before I 
 get my head around on how to add myself to the reviewer list there - I 
 thought I'll just send this response here.

You just need to click Edit Patch and put your name into the Reviewer field. 
You do need a postgres community account to edit patches, but the signup 
process is quite quick and painless AFAIR.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-19 Thread Greg Jaskiewicz

On 19 Oct 2011, at 17:54, Florian Pflug wrote:

 On Oct19, 2011, at 17:47 , Greg Jaskiewicz wrote:
 On 15 Oct 2011, at 11:31, Florian Pflug wrote:
 
 Ok, here's a first cut.
 
 So I looked at the patch, and first thing that pops out, 
 is lack of the volatile keyword before the ClientConnectionLostPending 
 variable is defined. Is that done on purpose ? Is that on purpose ?
 
 That's on purpose. volatile is only necessary for variables which are either 
 accessed from within signal handlers or which live in shared memory. Neither 
 is true for ClientConnectionLostPending, so non-volatile should be fine.
Ok, cool.
I'm aware of the reasons behind volatile, just noticed that some other flags 
used in similar way are marked as such. At the end of the day, this is just a 
hint to the compiler anyway. 

 
 Otherwise the patch itself looks ok. 
 I haven't tested the code, just reviewed the patch itself. And it obviously 
 needs testing, should be easy to follow your original problem description. 
 
 Yeah, further testing is on my todo list. The interesting case is probably 
 what happens if the connection is dropped while there's already a 
 cancellation request pending. And also the other way around - a cancellation 
 request arriving after we've already discovered that the connection is gone.
 
 Btw, I just tried to do it through commitfest.postgresql.org , but before I 
 get my head around on how to add myself to the reviewer list there - I 
 thought I'll just send this response here.
 
 You just need to click Edit Patch and put your name into the Reviewer 
 field. You do need a postgres community account to edit patches, but the 
 signup process is quite quick and painless AFAIR.


Ok, clicking 'edit patch' sounds a bit big. Probably better would be, to just 
be able to click on some sort of I'm in button/checkbox type of thing. 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-19 Thread Florian Pflug
On Oct19, 2011, at 18:05 , Greg Jaskiewicz wrote:
 On 19 Oct 2011, at 17:54, Florian Pflug wrote:
 
 On Oct19, 2011, at 17:47 , Greg Jaskiewicz wrote:
 On 15 Oct 2011, at 11:31, Florian Pflug wrote:
 
 Ok, here's a first cut.
 
 So I looked at the patch, and first thing that pops out, 
 is lack of the volatile keyword before the ClientConnectionLostPending 
 variable is defined. Is that done on purpose ? Is that on purpose ?
 
 That's on purpose. volatile is only necessary for variables which are either 
 accessed from within signal handlers or which live in shared memory. Neither 
 is true for ClientConnectionLostPending, so non-volatile should be fine.
 Ok, cool.
 I'm aware of the reasons behind volatile, just noticed that some other flags 
 used in similar way are marked as such. At the end of the day, this is just a 
 hint to the compiler anyway. 

All the other flags which indicate cancellation reasons are set from signal 
handers, I believe. We could of course mark as ClientConnectionLostPending as 
volatile just to be consistent. Not sure whether that's a good idea, or not. It 
might prevent a mistake should we ever add code to detect lost connections 
asynchronously (i.e., from somewhere else than pq_flush). And the cost is 
probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending 
before calling ProcessInterrupts(), so we only pay the cost of volatile if 
there's actually an interrupt pending. But I still think it's better to add 
qualifies such a volatile only when really necessary. A comment about why it 
*isn't* volatile is probably in order, though, so I'll add that in the next 
version of the patch.

best regards,
Florian Pflug

PS: Thanks for the review. It's very much appreciated!


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-19 Thread Greg Jaskiewicz

On 15 Oct 2011, at 11:31, Florian Pflug wrote:
 
 Ok, here's a first cut.

So I looked at the patch, and first thing that pops out, 
is lack of the volatile keyword before the ClientConnectionLostPending variable 
is defined. Is that done on purpose ? Is that on purpose ?


Otherwise the patch itself looks ok. 
I haven't tested the code, just reviewed the patch itself. And it obviously 
needs testing, should be easy to follow your original problem description. 


Btw, I just tried to do it through commitfest.postgresql.org , but before I get 
my head around on how to add myself to the reviewer list there - I thought I'll 
just send this response here.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-19 Thread Greg Jaskiewicz

On 19 Oct 2011, at 18:28, Florian Pflug wrote:
 All the other flags which indicate cancellation reasons are set from signal 
 handers, I believe. We could of course mark as ClientConnectionLostPending as 
 volatile just to be consistent. Not sure whether that's a good idea, or not. 
 It might prevent a mistake should we ever add code to detect lost connections 
 asynchronously (i.e., from somewhere else than pq_flush). And the cost is 
 probably negligible, because CHECK_FOR_INTERRUPTS tests for InterruptPending 
 before calling ProcessInterrupts(), so we only pay the cost of volatile if 
 there's actually an interrupt pending. But I still think it's better to add 
 qualifies such a volatile only when really necessary. A comment about why it 
 *isn't* volatile is probably in order, though, so I'll add that in the next 
 version of the patch.
 
Makes sense. 

I had to ask, because it sticks out. And indeed there is a possibility that 
someone will one day will try to use from signal handler, etc. 


 best regards,
 Florian Pflug
 
 PS: Thanks for the review. It's very much appreciated!
No problem, Got inspired by the talk I attended here at the pgconf.eu. Seems 
like a good idea to do these things, I have years of experience as developer 
and doing (relatively) well thanks to PostgreSQL - makes sense to contribute 
something back. 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-15 Thread Florian Pflug
On Oct11, 2011, at 09:21 , Magnus Hagander wrote:
 On Tue, Oct 11, 2011 at 03:29, Florian Pflug f...@phlo.org wrote:
 On Oct10, 2011, at 21:25 , Magnus Hagander wrote:
 On Thu, Oct 6, 2011 at 23:46, Florian Pflug f...@phlo.org wrote:
 It'd be nice to generally terminate a backend if the client vanishes, but 
 so
 far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately
 sends a signal *everytime* the fd becomes readable or writeable, not only 
 on
 EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We 
 could
 make the postmaster keep the fd's of around even after forking a backend, 
 and
 make it watch for broken connections using select(). But with a large 
 max_backends
 settings, we'd risk running out of fds in the postmaster...
 
 Ugh. Yeah. But at least catching it and terminating it when we *do*
 notice it's down would certainly make sense...
 
 I'll try to put together a patch that sets a flag if we discover a broken
 connection in pq_flush, and tests that flag in CHECK_FOR_INTERRUPTS. Unless 
 you
 wanna, of course.
 
 Please do, I won't have time to even think about it until after
 pgconf.eu anyway ;)

Ok, here's a first cut.

I've based this on how query cancellation due to recovery conflicts work -
internal_flush() sets QueryCancelPending and ClientConnectionLostPending.

If QueryCancelPending is set, CHECK_FOR_INTERRUPTS checks 
ClientConnectionLostPending, and if it's set it does ereport(FATAL).

I've only done light testing so far - basically the only case I've tested is
killing pg_basebackup while it's waiting for all required WAL to be archived.

best regards,
Florian Pflug



pg.discon_cancel.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-11 Thread Magnus Hagander
On Tue, Oct 11, 2011 at 03:29, Florian Pflug f...@phlo.org wrote:
 On Oct10, 2011, at 21:25 , Magnus Hagander wrote:
 On Thu, Oct 6, 2011 at 23:46, Florian Pflug f...@phlo.org wrote:
 It'd be nice to generally terminate a backend if the client vanishes, but so
 far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately
 sends a signal *everytime* the fd becomes readable or writeable, not only on
 EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We 
 could
 make the postmaster keep the fd's of around even after forking a backend, 
 and
 make it watch for broken connections using select(). But with a large 
 max_backends
 settings, we'd risk running out of fds in the postmaster...

 Ugh. Yeah. But at least catching it and terminating it when we *do*
 notice it's down would certainly make sense...

 I'll try to put together a patch that sets a flag if we discover a broken
 connection in pq_flush, and tests that flag in CHECK_FOR_INTERRUPTS. Unless 
 you
 wanna, of course.

Please do, I won't have time to even think about it until after
pgconf.eu anyway ;)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-10 Thread Magnus Hagander
On Thu, Oct 6, 2011 at 23:46, Florian Pflug f...@phlo.org wrote:
 On Oct6, 2011, at 21:48 , Magnus Hagander wrote:
 The question is, should we do more? To me, it'd make sense to terminate
 a backend once it's connection is gone. We could, for example, make
 pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a
 broken connection that same way as a SIGINT or SIGTERM.

 The problem here is that we're hanging at a place where we don't touch
 the socket. So we won't notice the socket is gone. We'd have to do a
 select() or something like that at regular intervals to make sure it's
 there, no?

 We do emit NOTICEs saying pg_stop_backup still waiting ...  repeatedly,
 so we should notice a dead connection sooner or later. When I tried, I even
 got a message in the log complaining about the broken pipe.

Ah, good point, that should be doable. Forgot about that message...


 As it stands, the interval between two NOTICEs grows exponentially - we
 send the first after waiting 5 second, the next after waiting 60 seconds,
 and then after waiting 120, 240, 480, ... seconds. This means that that the
 backend would in the worst case linger the same amount of time *after*
 pg_basebackup was cancelled that pg_basebackup waited for *before* it was
 cancelled.

 It'd be nice to generally terminate a backend if the client vanishes, but so
 far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately
 sends a signal *everytime* the fd becomes readable or writeable, not only on
 EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could
 make the postmaster keep the fd's of around even after forking a backend, and
 make it watch for broken connections using select(). But with a large 
 max_backends
 settings, we'd risk running out of fds in the postmaster...

Ugh. Yeah. But at least catching it and terminating it when we *do*
notice it's down would certainly make sense...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-10 Thread Florian Pflug
On Oct10, 2011, at 21:25 , Magnus Hagander wrote:
 On Thu, Oct 6, 2011 at 23:46, Florian Pflug f...@phlo.org wrote:
 It'd be nice to generally terminate a backend if the client vanishes, but so
 far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately
 sends a signal *everytime* the fd becomes readable or writeable, not only on
 EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could
 make the postmaster keep the fd's of around even after forking a backend, and
 make it watch for broken connections using select(). But with a large 
 max_backends
 settings, we'd risk running out of fds in the postmaster...
 
 Ugh. Yeah. But at least catching it and terminating it when we *do*
 notice it's down would certainly make sense...

I'll try to put together a patch that sets a flag if we discover a broken
connection in pq_flush, and tests that flag in CHECK_FOR_INTERRUPTS. Unless you
wanna, of course.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-06 Thread Florian Pflug
On Oct5, 2011, at 15:30 , Magnus Hagander wrote:
 When walsender calls out to do_pg_stop_backup() (during base backups),
 it is not possible to terminate the process with a SIGTERM - it
 requires a SIGKILL. This can leave unkillable backends for example if
 archive_mode is on and archive_command is failing (or not set). A
 similar thing would happen in other cases if walsender calls out to
 something that would block (do_pg_start_backup() for example), but the
 stop one is easy to provoke.

Hm, this seems to be related to another buglet I noticed a while ago,
but then forgot about again. If one terminates pg_basebackup while it's
waiting for all required WAL to be archived, the backend process only
exits once that waiting phase is over. If, like in your failure case,
archive_command fails indefinity (or isn't set), the backend process
stays around forever.

Your patch would improve that only insofar as it'd at least allow an
immediate shutdown request to succeed - as it stands, that doesn't work
because, as you mentioned, the blocked walsender doesn't handle SIGTERM.

The question is, should we do more? To me, it'd make sense to terminate
a backend once it's connection is gone. We could, for example, make
pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a
broken connection that same way as a SIGINT or SIGTERM.

Thoughts?

 ISTM one way to fix it is the attached, which is to have walsender set
 the global flags saying that we have received sigterm, which in turn
 causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly
 exit the process. AFAICT it works fine. Any holes in this approach?

Seems sensible, but my knowledge about these code paths is quite limited..

 Second, I wonder if we should add a SIGINT handler as well, that would
 make it possible to send a cancel signal. Given that the end result
 would be the same (at least if we want to keep with the walsender is
 simple path), I'm not sure it's necessary - but it would at least
 help those doing pg_cancel_backend()... thoughts?

If all that's needed is a simple SIGINT handler that sets one flag, it'd
say let's add it. 

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-06 Thread Magnus Hagander
On Thu, Oct 6, 2011 at 14:34, Florian Pflug f...@phlo.org wrote:
 On Oct5, 2011, at 15:30 , Magnus Hagander wrote:
 When walsender calls out to do_pg_stop_backup() (during base backups),
 it is not possible to terminate the process with a SIGTERM - it
 requires a SIGKILL. This can leave unkillable backends for example if
 archive_mode is on and archive_command is failing (or not set). A
 similar thing would happen in other cases if walsender calls out to
 something that would block (do_pg_start_backup() for example), but the
 stop one is easy to provoke.

 Hm, this seems to be related to another buglet I noticed a while ago,
 but then forgot about again. If one terminates pg_basebackup while it's
 waiting for all required WAL to be archived, the backend process only
 exits once that waiting phase is over. If, like in your failure case,
 archive_command fails indefinity (or isn't set), the backend process
 stays around forever.

Yes.


 Your patch would improve that only insofar as it'd at least allow an
 immediate shutdown request to succeed - as it stands, that doesn't work
 because, as you mentioned, the blocked walsender doesn't handle SIGTERM.

Exactly.


 The question is, should we do more? To me, it'd make sense to terminate
 a backend once it's connection is gone. We could, for example, make
 pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a
 broken connection that same way as a SIGINT or SIGTERM.

The problem here is that we're hanging at a place where we don't touch
the socket. So we won't notice the socket is gone. We'd have to do a
select() or something like that at regular intervals to make sure it's
there, no?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-06 Thread Magnus Hagander
On Thu, Oct 6, 2011 at 04:22, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Oct 5, 2011 at 10:30 PM, Magnus Hagander mag...@hagander.net wrote:
 When walsender calls out to do_pg_stop_backup() (during base backups),
 it is not possible to terminate the process with a SIGTERM - it
 requires a SIGKILL. This can leave unkillable backends for example if
 archive_mode is on and archive_command is failing (or not set). A
 similar thing would happen in other cases if walsender calls out to
 something that would block (do_pg_start_backup() for example), but the
 stop one is easy to provoke.

 Good catch!

 ISTM one way to fix it is the attached, which is to have walsender set
 the global flags saying that we have received sigterm, which in turn
 causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly
 exit the process. AFAICT it works fine. Any holes in this approach?

 Very simple patch. Looks fine.

Ok, thanks. Applied.


 Second, I wonder if we should add a SIGINT handler as well, that would
 make it possible to send a cancel signal. Given that the end result
 would be the same (at least if we want to keep with the walsender is
 simple path), I'm not sure it's necessary - but it would at least
 help those doing pg_cancel_backend()... thoughts?

 I don't think that's necessary because, as you suggested, there is no
 use case for *now*. We can add that handler when someone proposes
 the feature which requires that.

Yeah. It's certainly not something backpatch-worthy.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-06 Thread Florian Pflug
On Oct6, 2011, at 21:48 , Magnus Hagander wrote:
 The question is, should we do more? To me, it'd make sense to terminate
 a backend once it's connection is gone. We could, for example, make
 pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a
 broken connection that same way as a SIGINT or SIGTERM.
 
 The problem here is that we're hanging at a place where we don't touch
 the socket. So we won't notice the socket is gone. We'd have to do a
 select() or something like that at regular intervals to make sure it's
 there, no?

We do emit NOTICEs saying pg_stop_backup still waiting ...  repeatedly,
so we should notice a dead connection sooner or later. When I tried, I even
got a message in the log complaining about the broken pipe.

As it stands, the interval between two NOTICEs grows exponentially - we
send the first after waiting 5 second, the next after waiting 60 seconds,
and then after waiting 120, 240, 480, ... seconds. This means that that the
backend would in the worst case linger the same amount of time *after*
pg_basebackup was cancelled that pg_basebackup waited for *before* it was
cancelled.

It'd be nice to generally terminate a backend if the client vanishes, but so
far I haven't had any bright ideas. Using FASYNC and F_SETOWN unfortunately
sends a signal *everytime* the fd becomes readable or writeable, not only on
EOF. Doing select() in CHECK_FOR_INTERRUPTS seems far too expensive. We could
make the postmaster keep the fd's of around even after forking a backend, and
make it watch for broken connections using select(). But with a large 
max_backends
settings, we'd risk running out of fds in the postmaster...

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in walsender when calling out to do_pg_stop_backup (and others?)

2011-10-05 Thread Fujii Masao
On Wed, Oct 5, 2011 at 10:30 PM, Magnus Hagander mag...@hagander.net wrote:
 When walsender calls out to do_pg_stop_backup() (during base backups),
 it is not possible to terminate the process with a SIGTERM - it
 requires a SIGKILL. This can leave unkillable backends for example if
 archive_mode is on and archive_command is failing (or not set). A
 similar thing would happen in other cases if walsender calls out to
 something that would block (do_pg_start_backup() for example), but the
 stop one is easy to provoke.

Good catch!

 ISTM one way to fix it is the attached, which is to have walsender set
 the global flags saying that we have received sigterm, which in turn
 causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly
 exit the process. AFAICT it works fine. Any holes in this approach?

Very simple patch. Looks fine.

 Second, I wonder if we should add a SIGINT handler as well, that would
 make it possible to send a cancel signal. Given that the end result
 would be the same (at least if we want to keep with the walsender is
 simple path), I'm not sure it's necessary - but it would at least
 help those doing pg_cancel_backend()... thoughts?

I don't think that's necessary because, as you suggested, there is no
use case for *now*. We can add that handler when someone proposes
the feature which requires that.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers