Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-24 Thread Jeff Janes
On Fri, Sep 9, 2011 at 2:51 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Thu, Sep 8, 2011 at 03:22, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 I remember we had bugs whereby an encoding conversion would fail,
 leading to elog trying to report this problem, but this attempt would
 also incur a conversion step, failing recursively until elog's stack got
 full.  I'm not saying this is impossible to solve, just something to
 keep in mind.

 Looking at elog.c, this only seems to apply to messages sent to the
 client from a backend connection. No conversion is done for log
 messages.

 Can we do something like: pass through ASCII characters unchanged, and
 output anything with the high-bit set as \xhexdigithexdigit?  That
 might be garbled in some cases, but the goal here is not perfection.
 We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
 clue where to start looking for the problem.

 Or we might just replace them with '?'. This has the advantage of not
 expanding query length 4x if it does happen to be corrupted. The vast
 majority of queries are ASCII-only anyway.

Should this patch be reviewed as is, or will the substitution of
non-ASCII be implemented?

It seems like everyone agrees that this feature is wanted, but Tom is
still very much opposed to the general approach to implement it, as
being too dangerous.
Is it the reviewer's job to try to convince him otherwise?

Thanks,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-24 Thread Robert Haas
On Sat, Sep 24, 2011 at 1:51 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 It seems like everyone agrees that this feature is wanted, but Tom is
 still very much opposed to the general approach to implement it, as
 being too dangerous.
 Is it the reviewer's job to try to convince him otherwise?

It's not the reviewer's job to convince Tom of anything in particular,
but I think it's helpful for them to state their opinion, whatever it
may be (agreeing with Tom, disagreeing with Tom, or whatever).

IMHO, the most compelling argument against the OP's approach made so
far is the encoding issue.  I was hoping someone (maybe the OP?) would
have an opinion on that, an idea how to work around it, or something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-24 Thread Marti Raudsepp
On Sat, Sep 24, 2011 at 22:23, Robert Haas robertmh...@gmail.com wrote:
 It's not the reviewer's job to convince Tom of anything in particular,
 but I think it's helpful for them to state their opinion, whatever it
 may be (agreeing with Tom, disagreeing with Tom, or whatever).

My opinion is that this can be made safe enough and I explained why I
think so here: http://archives.postgresql.org/pgsql-hackers/2011-09/msg00308.php

Launching another process to read 1kB out of shared memory and print
it to log sounds like overkill. But if that's deemed necessary, I'm
willing to code it up too.

However, I now realize that it does make sense to write a separate
simpler function for the crashed backend case with no
vbeentry-st_changecount check loops, no checkUser, etc. That would be
more robust and easier to review.

I'll try to send a new patch implementing this in a few days.

 IMHO, the most compelling argument against the OP's approach made so
 far is the encoding issue.  I was hoping someone (maybe the OP?) would
 have an opinion on that, an idea how to work around it, or something.

I propsed replacing non-ASCII characters with '?' earlier. That would
be simpler to code, but obviously wouldn't preserve non-ASCII
characters in case the crash has anything to do with those. Since
nobody else weighed in on the '\x##' vs '?' choice, I didn't implement
it yet; but I will in my next submission.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-09 Thread Marti Raudsepp
On Thu, Sep 8, 2011 at 03:22, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 I remember we had bugs whereby an encoding conversion would fail,
 leading to elog trying to report this problem, but this attempt would
 also incur a conversion step, failing recursively until elog's stack got
 full.  I'm not saying this is impossible to solve, just something to
 keep in mind.

Looking at elog.c, this only seems to apply to messages sent to the
client from a backend connection. No conversion is done for log
messages.

 Can we do something like: pass through ASCII characters unchanged, and
 output anything with the high-bit set as \xhexdigithexdigit?  That
 might be garbled in some cases, but the goal here is not perfection.
 We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
 clue where to start looking for the problem.

Or we might just replace them with '?'. This has the advantage of not
expanding query length 4x if it does happen to be corrupted. The vast
majority of queries are ASCII-only anyway.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mar sep 06 19:57:07 -0300 2011:
 On Tue, Sep 6, 2011 at 6:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  And I doubt
  that the goal is worth taking risks for.
 
  I am unable to count the number of times that I have had a customer
  come to me and say well, the backend crashed.  And I go look at
  their logs and I have no idea what happened.
 
  gdb and print debug_query_string?
 
 Surely you're kidding.  These are customer systems which I frequently
 don't even have access to.  They don't always have gdb installed
 (sometimes they are Windows systems) and if they do the customer isn't
 likely to know how to use it, and even if they do they don't think the
 better of us for needing such a tool to troubleshoot a crash.

I'm with Robert on this ... been there, got that look.


 TBH, I'm very unclear what could cause the postmaster to go belly-up
 copying a bounded amount of data out of shared memory for logging
 purposes only.  It's surely possible to make the code safe against any
 sequence of bytes that might be found there.

A mishandled encoding conversion could be problematic, so that needs to
be carefully considered (perhaps just shut off unconditionally).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of mar sep 06 19:57:07 -0300 2011:
 TBH, I'm very unclear what could cause the postmaster to go belly-up
 copying a bounded amount of data out of shared memory for logging
 purposes only.  It's surely possible to make the code safe against any
 sequence of bytes that might be found there.

 A mishandled encoding conversion could be problematic, so that needs to
 be carefully considered (perhaps just shut off unconditionally).

That, and the question of exactly what makes the amount bounded, and
probably six other things that could go wrong.  But I'm sure Andrew
won't be pleased with a proposal to inject unknown-encoding data into
the logs.

I remain of the opinion that this needs to be kept out of the postmaster.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 3:42 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 TBH, I'm very unclear what could cause the postmaster to go belly-up
 copying a bounded amount of data out of shared memory for logging
 purposes only.  It's surely possible to make the code safe against any
 sequence of bytes that might be found there.

 A mishandled encoding conversion could be problematic, so that needs to
 be carefully considered (perhaps just shut off unconditionally).

It's not really a problem for the postmaster if something just plain
old fails.  Where we get into trouble is if it manages to (a) crash,
(b) take an excessive amount of time to complete, or (c) screw up the
postmaster state in some way we can't recover from.  But if any of
those are an issue then, yeah, just shut it off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 4:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of mar sep 06 19:57:07 -0300 2011:
 TBH, I'm very unclear what could cause the postmaster to go belly-up
 copying a bounded amount of data out of shared memory for logging
 purposes only.  It's surely possible to make the code safe against any
 sequence of bytes that might be found there.

 A mishandled encoding conversion could be problematic, so that needs to
 be carefully considered (perhaps just shut off unconditionally).

 That, and the question of exactly what makes the amount bounded, and

The fact that it's a fixed-size chunk of shmem.  We won't copy more
bytes than the size of the buffer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Sep 7, 2011 at 3:42 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 A mishandled encoding conversion could be problematic, so that needs to
 be carefully considered (perhaps just shut off unconditionally).

 It's not really a problem for the postmaster if something just plain
 old fails.  Where we get into trouble is if it manages to (a) crash,
 (b) take an excessive amount of time to complete, or (c) screw up the
 postmaster state in some way we can't recover from.  But if any of
 those are an issue then, yeah, just shut it off.

Keep in mind that in the postmaster, elog(ERROR) *is* a crash.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 4:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Sep 7, 2011 at 3:42 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 A mishandled encoding conversion could be problematic, so that needs to
 be carefully considered (perhaps just shut off unconditionally).

 It's not really a problem for the postmaster if something just plain
 old fails.  Where we get into trouble is if it manages to (a) crash,
 (b) take an excessive amount of time to complete, or (c) screw up the
 postmaster state in some way we can't recover from.  But if any of
 those are an issue then, yeah, just shut it off.

 Keep in mind that in the postmaster, elog(ERROR) *is* a crash.

Right... but a function that returns -1 to indicate that something
didn't work should be OK, as long as whatever it does is otherwise
extremely boring.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Marti Raudsepp
On Wed, Sep 7, 2011 at 22:42, Alvaro Herrera alvhe...@commandprompt.com wrote:
 A mishandled encoding conversion could be problematic, so that needs to
 be carefully considered (perhaps just shut off unconditionally).

Are you referring to log file encoding or something else? The log file
is already potentially mixed-encoding, as different databases may have
different encodings and backends just dump bytes to it in their
current encoding.

pg_verify_mbstr() could potentially be used with noError=true if we
can figure out the backend's current encoding, but that indeed sounds
like something to avoid.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Sep 7, 2011 at 4:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Keep in mind that in the postmaster, elog(ERROR) *is* a crash.

 Right... but a function that returns -1 to indicate that something
 didn't work should be OK, as long as whatever it does is otherwise
 extremely boring.

The more functionality you put in the postmaster, the more likely it is
to trip over an elog(ERROR) somewhere.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Alvaro Herrera
Excerpts from Marti Raudsepp's message of mié sep 07 18:09:32 -0300 2011:
 On Wed, Sep 7, 2011 at 22:42, Alvaro Herrera alvhe...@commandprompt.com 
 wrote:
  A mishandled encoding conversion could be problematic, so that needs to
  be carefully considered (perhaps just shut off unconditionally).
 
 Are you referring to log file encoding or something else? The log file
 is already potentially mixed-encoding, as different databases may have
 different encodings and backends just dump bytes to it in their
 current encoding.

I am referring to the fact that whatever the backend puts in shared
memory is going to be in its encoding setting, which may not necessarily
match the postmaster's.  And if it doesn't, the postmaster might try to
convert it using settings not valid for the string, possibly leading to
crashes.

I remember we had bugs whereby an encoding conversion would fail,
leading to elog trying to report this problem, but this attempt would
also incur a conversion step, failing recursively until elog's stack got
full.  I'm not saying this is impossible to solve, just something to
keep in mind.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Marti Raudsepp's message of mié sep 07 18:09:32 -0300 2011:
 On Wed, Sep 7, 2011 at 22:42, Alvaro Herrera alvhe...@commandprompt.com 
 wrote:
  A mishandled encoding conversion could be problematic, so that needs to
  be carefully considered (perhaps just shut off unconditionally).

 Are you referring to log file encoding or something else? The log file
 is already potentially mixed-encoding, as different databases may have
 different encodings and backends just dump bytes to it in their
 current encoding.

 I am referring to the fact that whatever the backend puts in shared
 memory is going to be in its encoding setting, which may not necessarily
 match the postmaster's.  And if it doesn't, the postmaster might try to
 convert it using settings not valid for the string, possibly leading to
 crashes.

 I remember we had bugs whereby an encoding conversion would fail,
 leading to elog trying to report this problem, but this attempt would
 also incur a conversion step, failing recursively until elog's stack got
 full.  I'm not saying this is impossible to solve, just something to
 keep in mind.

Can we do something like: pass through ASCII characters unchanged, and
output anything with the high-bit set as \xhexdigithexdigit?  That
might be garbled in some cases, but the goal here is not perfection.
We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
clue where to start looking for the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Marti Raudsepp
Hi list,

This patch adds the backend's current running query to the backend
crash message.

The crashing query is often a good starting point in debugging the
cause, and much more easily accessible than core dumps.

Example output:
LOG:  server process (PID 31451) was terminated by signal 9: Killed
DETAIL:  Running query: DO LANGUAGE plpythonu 'import os;os.kill(os.getpid(),9)'

The message Running query might not be entirely true, as it might be
e.g. a vacuum activity string too. But it sounds better than Activity
string or anything else I could come up with.

Also refactored pgstat_get_backend_current_activity() such that it
always returns consistent results. (Previously it returned a pointer to
shared memory that could theoretically change and was vulnerable to
races) The function can also deal with shared memory corruption (if
isCrashed is true), so that corruption doesn't cause a postmaster crash
or hang.

I'm not happy with the indenting depth in the function, but it would
be hard to split anything out.

Regards,
Marti
From 0eec93a7b9d59ab9d7e101f515b0c72c0c7c383f Mon Sep 17 00:00:00 2001
From: Marti Raudsepp ma...@juffo.org
Date: Fri, 2 Sep 2011 03:04:24 +0300
Subject: [PATCH] Log crashed backend's query (activity string)

The crashing query is often a good starting point in debugging the
cause, and much more easily accessible than core dumps.

Example output:
LOG:  server process (PID 31451) was terminated by signal 9: Killed
DETAIL:  Running query: DO LANGUAGE plpythonu 'import os;os.kill(os.getpid(),9)'

Also refactored pgstat_get_backend_current_activity() such that it
always returns consistent results. (Previously it returned a pointer to
shared memory that could theoretically change and was vulnerable to
races) The function can also deal with shared memory corruption (if
isCrashed is true), so that corruption doesn't cause a postmaster crash
or hang.
---
 src/backend/postmaster/pgstat.c |   58 ---
 src/backend/postmaster/postmaster.c |   17 +++---
 src/backend/storage/lmgr/deadlock.c |2 +-
 src/include/pgstat.h|2 +-
 4 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be32ca8..3b13e58 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2681,22 +2681,20 @@ pgstat_read_current_status(void)
  *	and so will provide current information regardless of the age of our
  *	transaction's snapshot of the status array.
  *
- *	It is the caller's responsibility to invoke this only for backends whose
- *	state is expected to remain stable while the result is in use.	The
- *	only current use is in deadlock reporting, where we can expect that
- *	the target backend is blocked on a lock.  (There are corner cases
- *	where the target's wait could get aborted while we are looking at it,
- *	but the very worst consequence is to return a pointer to a string
- *	that's been changed, so we won't worry too much.)
+ *	If isCrashed is true, we try to do it in a way that's safe from shared
+ *	memory corruption. This is used for reporting the crashed backend's query.
  *
  *	Note: return strings for special cases match pg_stat_get_backend_activity.
  * --
  */
 const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
+pgstat_get_backend_current_activity(int pid, bool checkUser, bool isCrashed)
 {
 	PgBackendStatus *beentry;
 	int			i;
+	int			maxlen = pgstat_track_activity_query_size;
+	char	   *result = NULL; /* silence uninitialized warning */
+	char	   *buffer = NULL;
 
 	beentry = BackendStatusArray;
 	for (i = 1; i = MaxBackends; i++)
@@ -2720,8 +2718,40 @@ pgstat_get_backend_current_activity(int pid, bool checkUser)
 
 			found = (vbeentry-st_procpid == pid);
 
+			if (found)
+			{
+char	   *activity = vbeentry-st_activity;
+
+if (checkUser  !superuser()  beentry-st_userid != GetUserId())
+	result = insufficient privilege;
+/*
+ * Don't access activity pointer before we verify that it
+ * falls into BackendActivityBuffer. To make sure that the
+ * string's ending is contained within the buffer, the string
+ * can start at offset (MaxBackends - 1) at most.
+ */
+else if (activity  BackendActivityBuffer ||
+		 activity  (BackendActivityBuffer +
+	 mul_size(maxlen, MaxBackends - 1)))
+	result = command string corrupt;
+else if (*(activity) == '\0')
+	result = command string not enabled;
+else
+{
+	if (!buffer)
+		buffer = (char *) palloc(maxlen);
+
+	strncpy(buffer, activity, maxlen);
+	buffer[maxlen - 1] = '\0';
+	result = buffer;
+}
+			}
+
+			/* Don't check changecount oddness if the backend crashed -- it
+			 * might be corrupt and stuck at an odd value
+			 */
 			if (save_changecount == vbeentry-st_changecount 
-(save_changecount  1) == 0)
+(isCrashed || (save_changecount  1) == 0))
 break;
 
 			

Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Robert Haas
On Tue, Sep 6, 2011 at 4:52 PM, Marti Raudsepp ma...@juffo.org wrote:
 This patch adds the backend's current running query to the backend
 crash message.

 The crashing query is often a good starting point in debugging the
 cause, and much more easily accessible than core dumps.

 Example output:
 LOG:  server process (PID 31451) was terminated by signal 9: Killed
 DETAIL:  Running query: DO LANGUAGE plpythonu 'import 
 os;os.kill(os.getpid(),9)'

 The message Running query might not be entirely true, as it might be
 e.g. a vacuum activity string too. But it sounds better than Activity
 string or anything else I could come up with.

 Also refactored pgstat_get_backend_current_activity() such that it
 always returns consistent results. (Previously it returned a pointer to
 shared memory that could theoretically change and was vulnerable to
 races) The function can also deal with shared memory corruption (if
 isCrashed is true), so that corruption doesn't cause a postmaster crash
 or hang.

I haven't looked at the patch, but boy would this save me a lot of
support headaches if we can make it work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Tom Lane
Marti Raudsepp ma...@juffo.org writes:
 This patch adds the backend's current running query to the backend
 crash message.

Sorry, this patch is entirely unacceptable.  We cannot have the
postmaster's functioning depending on the contents of shared memory
still being valid ... most especially not when we know that somebody
just crashed, and could have corrupted the shared memory in arbitrary
ways.  No, I don't think your attempts to validate the data are
adequate, nor do I believe they can be made adequate.  And I doubt
that the goal is worth taking risks for.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Robert Haas
On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Marti Raudsepp ma...@juffo.org writes:
 This patch adds the backend's current running query to the backend
 crash message.

 Sorry, this patch is entirely unacceptable.  We cannot have the
 postmaster's functioning depending on the contents of shared memory
 still being valid ... most especially not when we know that somebody
 just crashed, and could have corrupted the shared memory in arbitrary
 ways.  No, I don't think your attempts to validate the data are
 adequate, nor do I believe they can be made adequate.

Why and why not?

 And I doubt
 that the goal is worth taking risks for.

I am unable to count the number of times that I have had a customer
come to me and say well, the backend crashed.  And I go look at
their logs and I have no idea what happened.  So then I tell them to
include %p in log_line_prefix and set log_min_duration_statement=0 and
call me if it happens again.  This is a huge nuisance and a serious
interference with attempts to do meaningful troubleshooting.  When it
doesn't add days or weeks to the time to resolution, it's because it
prevents resolution altogether.  We really, really need something like
this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Marti Raudsepp
On Wed, Sep 7, 2011 at 00:34, Tom Lane t...@sss.pgh.pa.us wrote:
 No, I don't think your attempts to validate the data are
 adequate, nor do I believe they can be made adequate.

Can you think of any concrete situations that would fail the current validation?

As far as I can tell, the only pointer I'm following from the shm
region is vbeentry-st_activity, which is being validated that it
still points to the BackendActivityBuffer.

The rest -- BackendStatusArray, BackendActivityBuffer, MaxBackends and
pgstat_track_activity_query_size -- are process-local variables and
initialized at startup time. Given that other backends cannot change
these variables, nor deallocate postmaster's shm regions, we can take
for granted that they still point to the shm region.

Whether the data contained in shm is corrupt or not, we really don't
care. If we don't find the matching PID, we just return a placeholder
string. In the worst case, the activity buffer is corrupt and the user
gets a kilobyte of garbage in their log file.

Ok, well there's a worse case -- if a backend keeps incrementing
vbeentry-st_changecount at a faster rate. However, this seems pretty
unlikely if the backend has already terminated. We could terminate
other backends *first* before logging the query to make sure they
can't touch it.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And I doubt
 that the goal is worth taking risks for.

 I am unable to count the number of times that I have had a customer
 come to me and say well, the backend crashed.  And I go look at
 their logs and I have no idea what happened.

gdb and print debug_query_string?

I don't dispute that this would be nice to have.  But I don't think that
it's sane to compromise the postmaster's reliability in order to print
information of doubtful accuracy.

If you want to do something that doesn't violate the system's basic
design goals, think about setting up a SIGSEGV handler that tries to
print debug_query_string via elog before crashing.  It might well crash
too, but it won't be risking taking out more of the database with it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Marti Raudsepp ma...@juffo.org writes:
 This patch adds the backend's current running query to the
 backend crash message.

 Sorry, this patch is entirely unacceptable.  We cannot have the
 postmaster's functioning depending on the contents of shared
 memory still being valid ... most especially not when we know
 that somebody just crashed, and could have corrupted the shared
 memory in arbitrary ways.  No, I don't think your attempts to
 validate the data are adequate, nor do I believe they can be made
 adequate.
 
 Why and why not?
 
 And I doubt
 that the goal is worth taking risks for.
 
 I am unable to count the number of times that I have had a
 customer come to me and say well, the backend crashed.  And I go
 look at their logs and I have no idea what happened.  So then I
 tell them to include %p in log_line_prefix and set
 log_min_duration_statement=0 and call me if it happens again. 
 This is a huge nuisance and a serious interference with attempts
 to do meaningful troubleshooting.  When it doesn't add days or
 weeks to the time to resolution, it's because it prevents
 resolution altogether.  We really, really need something like
 this.
 
I haven't had this experience more than a few times, but a few is
enough to recognize how painful it can be.  It seems we're brave
enough to log *some* information at crash time, in spite of the risk
that memory may be corrupted in unpredictable ways.  Sure, there is
a slim chance that when you think you're writing to the log you've
actually got a handle to a segment of a heap file, but that chance
is extremely slim -- and if that's where you're at you've probably
already written a 'segfault' message there, anyway.  My gut feel is
this would allow diagnosis in a timely fashion often enough to save
more data than it puts at risk, to say nothing of people's time.
 
I don't know whether the patch on the table is coded as defensively
as it should be given the perilous times the new code would come
into play, but I don't think the idea should be rejected out of
hand.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-06 Thread Robert Haas
On Tue, Sep 6, 2011 at 6:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And I doubt
 that the goal is worth taking risks for.

 I am unable to count the number of times that I have had a customer
 come to me and say well, the backend crashed.  And I go look at
 their logs and I have no idea what happened.

 gdb and print debug_query_string?

Surely you're kidding.  These are customer systems which I frequently
don't even have access to.  They don't always have gdb installed
(sometimes they are Windows systems) and if they do the customer isn't
likely to know how to use it, and even if they do they don't think the
better of us for needing such a tool to troubleshoot a crash.  Even if
none of that were an issue, gdb is only going to work if you attach it
before the crash or have a core dump available.  Typically you don't
know the crash is going to happen and core dumps aren't enabled
anyway.

 I don't dispute that this would be nice to have.  But I don't think that
 it's sane to compromise the postmaster's reliability in order to print
 information of doubtful accuracy.

In practice, I think very few crashes will clobber it.  A lot of
crashes are going to be caused by a null pointer deference in some
random part of the program, an assertion failure, the OOM killer, etc.
 It's certainly POSSIBLE that it could get clobbered, but it shouldn't
be very likely; and as Marti says, with proper defensive coding, the
worst case scenario if it does happen should be some log garbage.

 If you want to do something that doesn't violate the system's basic
 design goals, think about setting up a SIGSEGV handler that tries to
 print debug_query_string via elog before crashing.  It might well crash
 too, but it won't be risking taking out more of the database with it.

I don't think that's adequate.  You need to trap a lot more than just
SIGSEGV to catch all the crashes - there's also SIGABRT and SIGILL and
a bunch of other ones, including SIGKILL.  I think you really, really
need something that executes outside the context of the dying process.

TBH, I'm very unclear what could cause the postmaster to go belly-up
copying a bounded amount of data out of shared memory for logging
purposes only.  It's surely possible to make the code safe against any
sequence of bytes that might be found there.  The only real danger
seems to be that the memory access itself might trigger a segmentation
fault of some sort - but how is that going to happen?  The child can't
unmap the address space in the parent, can it?  If it's a real danger,
perhaps we could fork off a dedicated child process just to read the
relevant portion of shared memory and emit a log message - but I'm not
seeing what plausible scenario that would guard against.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers