Re: [HACKERS] standalone backend PANICs during recovery

2016-09-16 Thread Bernd Helmle


--On 3. September 2016 at 02:05:00 -0400 Tom Lane  wrote:

>> Well, the "use case" was a crashed hot standby at a customer site (it
>> kept PANICing after restarting), where I decided to have a look through
>> the recovery process with gdb. The exact PANIC was
> 
>> 2016-03-29 15:12:40 CEST [16097]: [26-1] user=,db=,xid=0,vxid=1/0 FATAL:
>> unexpected GIN leaf action: 0
> 
> BTW, that didn't happen to be on big-endian hardware, did it?

You're right, this was RHEL7 on a POWER7 machine.

-- 
Thanks

Bernd




Re: [HACKERS] standalone backend PANICs during recovery

2016-09-03 Thread Tom Lane
Bernd Helmle  writes:
> Well, the "use case" was a crashed hot standby at a customer site (it kept
> PANICing after restarting), where I decided to have a look through the
> recovery process with gdb. The exact PANIC was

> 2016-03-29 15:12:40 CEST [16097]: [26-1] user=,db=,xid=0,vxid=1/0 FATAL:
> unexpected GIN leaf action: 0

BTW, that didn't happen to be on big-endian hardware, did it?

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-31 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Aug 30, 2016 at 11:00 PM, Tom Lane  wrote:
>> Thinking about it some more ... what we actually need to prevent, AFAICS,
>> is standby_mode becoming true in a standalone backend.

> I have spent some time playing with that and you are right. Only
> standby_mode = on is able to trigger a failure here, and the first one is
> in WaitForWALToBecomeAvailable(). I'd just put the check at the end of
> readRecoveryCommandFile(); this will avoid extra thinking should
> readRecoveryCommandFile() ever be moved around. That's unlikely to happen,
> but it is cheap insurance.

Pushed with minor adjustments --- I moved the check to be closer to
existing sanity checks in readRecoveryCommandFile, tweaked the wording
(the man page for postgres refers to --single as "single-user server"
mode) and added an errcode.
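
(For reference, the committed check presumably ended up looking roughly like
the sketch below. This is a reconstruction based on the patch quoted
downthread and the description above; the exact placement, errcode, and
message wording are assumptions rather than a quote of the commit.)

    /*
     * Reconstruction (assumed final form): reject standby_mode in a
     * single-user server, with an explicit errcode.
     */
    if (StandbyModeRequested && !IsUnderPostmaster)
        ereport(FATAL,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("standby mode is not supported by single-user servers")));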

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Michael Paquier
On Tue, Aug 30, 2016 at 11:00 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> Does the attached suit better then?
>
> Thinking about it some more ... what we actually need to prevent, AFAICS,
> is standby_mode becoming true in a standalone backend.  If you don't set
> that, then the process will do PITR recovery, but I'm not aware that there
> is any problem with running that standalone, and maybe it's useful to
> allow it for debugging purposes.  So I'm thinking more like
>
> /*
>  * Check for recovery control file, and if so set up state for offline
>  * recovery
>  */
> readRecoveryCommandFile();
>
> +   /*
> +    * We don't support standby_mode in standalone backends; that
> +    * requires other processes such as WAL receivers to be alive.
> +    */
> +   if (StandbyModeRequested && !IsUnderPostmaster)
> +       ereport(FATAL, ...);
> +
> /*
>  * Save archive_cleanup_command in shared memory so that other processes
>  * can see it.
>  */
>
> or we could put the check near the bottom of readRecoveryCommandFile.
> Not sure which placement is clearer.

I have spent some time playing with that and you are right. Only
standby_mode = on is able to trigger a failure here, and the first one is
in WaitForWALToBecomeAvailable(). I'd just put the check at the end of
readRecoveryCommandFile(); this will avoid extra thinking should
readRecoveryCommandFile() ever be moved around. That's unlikely to happen,
but it is cheap insurance.

I have added a test for that using 001_stream_rep.pl. Note that it is not
actually a good idea to push this test: without the check in place, the
execution of postgres would just hang and prevent the test suite from
continuing. But it makes it easier for anybody looking at this patch
to exercise the pattern prevented here.
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index acd95aa..b44fed0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5216,6 +5216,14 @@ readRecoveryCommandFile(void)
 	}
 
 	FreeConfigVariables(head);
+
+	/*
+	 * We don't support standby_mode in standalone backends; that
+	 * requires other processes such as the WAL receiver to be alive.
+	 */
+	if (StandbyModeRequested && !IsUnderPostmaster)
+		ereport(FATAL,
+				(errmsg("standby mode is not supported for standalone backends")));
 }
 
 /*
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index fd71095..7b47301 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 4;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -18,6 +18,11 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+
+# Standby mode not supported for standalone backends
+command_fails(['postgres', '--single', '-D', $node_standby_1->data_dir],
+			  'no standby mode support for standalone backends');
+
 $node_standby_1->start;
 
 # Take backup of standby 1 (not mandatory, but useful to check if



Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Tom Lane
Michael Paquier  writes:
> Does the attached suit better then?

Thinking about it some more ... what we actually need to prevent, AFAICS,
is standby_mode becoming true in a standalone backend.  If you don't set
that, then the process will do PITR recovery, but I'm not aware that there
is any problem with running that standalone, and maybe it's useful to
allow it for debugging purposes.  So I'm thinking more like

/*
 * Check for recovery control file, and if so set up state for offline
 * recovery
 */
readRecoveryCommandFile();

+   /*
+    * We don't support standby_mode in standalone backends; that
+    * requires other processes such as WAL receivers to be alive.
+    */
+   if (StandbyModeRequested && !IsUnderPostmaster)
+       ereport(FATAL, ...);
+
/*
 * Save archive_cleanup_command in shared memory so that other processes
 * can see it.
 */

or we could put the check near the bottom of readRecoveryCommandFile.
Not sure which placement is clearer.

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Michael Paquier
On Tue, Aug 30, 2016 at 10:24 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Tue, Aug 30, 2016 at 9:48 PM, Tom Lane  wrote:
>>> Hm, StartupXLOG seems like a pretty random place to check that, especially
>>> since doing it there requires an extra stat() call.  Why didn't you just
>>> make readRecoveryCommandFile() error out?
>
>> Well, the idea is to do the check before doing anything to PGDATA,
>> leaving it intact, in particular before the post-crash fsync().
>
> I don't see anything very exciting between the beginning of StartupXLOG
> and readRecoveryCommandFile.  In particular, doing the fsync seems like
> a perfectly harmless and maybe-good thing.  If there were some operation
> with potentially bad side-effects in that range, it would be dangerous
> anyway because of the risk of readRecoveryCommandFile erroring out due
> to invalid contents of recovery.conf.

Does the attached suit better then?
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index acd95aa..5426f75 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4957,6 +4957,16 @@ readRecoveryCommandFile(void)
  errmsg("could not open recovery command file \"%s\": %m",
 		RECOVERY_COMMAND_FILE)));
 	}
+	else if (!IsPostmasterEnvironment)
+	{
+		/*
+		 * Prevent a standalone process from starting if recovery is requested.
+		 * A lot of code paths in recovery depend on the assumption that this is
+		 * not the case, so recovery would just fail badly.
+		 */
+		ereport(FATAL,
+				(errmsg("recovery.conf is not allowed in a standalone process")));
+	}
 
 	/*
 	 * Since we're asking ParseConfigFp() to report errors as FATAL, there's



Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Aug 30, 2016 at 9:48 PM, Tom Lane  wrote:
>> Hm, StartupXLOG seems like a pretty random place to check that, especially
>> since doing it there requires an extra stat() call.  Why didn't you just
>> make readRecoveryCommandFile() error out?

> Well, the idea is to do the check before doing anything on PGDATA and
> leave it intact, particularly the post-crash fsync().

I don't see anything very exciting between the beginning of StartupXLOG
and readRecoveryCommandFile.  In particular, doing the fsync seems like
a perfectly harmless and maybe-good thing.  If there were some operation
with potentially bad side-effects in that range, it would be dangerous
anyway because of the risk of readRecoveryCommandFile erroring out due
to invalid contents of recovery.conf.

This might be an argument for re-ordering what we're doing in StartupXLOG,
but that seems like an independent discussion.

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Michael Paquier
On Tue, Aug 30, 2016 at 9:48 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Wed, Aug 24, 2016 at 5:07 PM, Bernd Helmle  wrote:
>>> That said, I'm okay if --single is not intended to bring up a hot standby.
>>> There are many other ways to debug such problems.
>
>> This patch is still on the CF app:
>> https://commitfest.postgresql.org/10/610/
>> Even after thinking about it, my opinion is still the same. Let's
>> prevent a standalone backend from starting if it sees recovery.conf.
>> Attached is a patch, and the CF entry has been switched to ready for
>> committer so this can move on.
>
> Hm, StartupXLOG seems like a pretty random place to check that, especially
> since doing it there requires an extra stat() call.  Why didn't you just
> make readRecoveryCommandFile() error out?

Well, the idea is to do the check before doing anything to PGDATA,
leaving it intact, in particular before the post-crash fsync().
-- 
Michael




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Aug 24, 2016 at 5:07 PM, Bernd Helmle  wrote:
>> That said, I'm okay if --single is not intended to bring up a hot standby.
>> There are many other ways to debug such problems.

> This patch is still on the CF app:
> https://commitfest.postgresql.org/10/610/
> Even after thinking about it, my opinion is still the same. Let's
> prevent a standalone backend from starting if it sees recovery.conf.
> Attached is a patch, and the CF entry has been switched to ready for
> committer so this can move on.

Hm, StartupXLOG seems like a pretty random place to check that, especially
since doing it there requires an extra stat() call.  Why didn't you just
make readRecoveryCommandFile() error out?

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-30 Thread Michael Paquier
On Wed, Aug 24, 2016 at 5:07 PM, Bernd Helmle  wrote:
> That said, I'm okay if --single is not intended to bring up a hot standby.
> There are many other ways to debug such problems.

This patch is still on the CF app:
https://commitfest.postgresql.org/10/610/
Even after thinking about it, my opinion is still the same. Let's
prevent a standalone backend from starting if it sees recovery.conf.
Attached is a patch, and the CF entry has been switched to ready for
committer so this can move on.
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index acd95aa..3d88727 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5930,6 +5930,18 @@ StartupXLOG(void)
 	struct stat st;
 
 	/*
+	 * Prevent a standalone process from starting if recovery is requested. A
+	 * lot of code paths in recovery depend on the assumption that this is not
+	 * the case.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		if (stat(RECOVERY_COMMAND_FILE, &st) == 0)
+			ereport(FATAL,
+	(errmsg("recovery.conf is not allowed with a standalone process")));
+	}
+
+	/*
 	 * Read control file and check XLOG status looks valid.
 	 *
 	 * Note: in most control paths, *ControlFile is already valid and we need



Re: [HACKERS] standalone backend PANICs during recovery

2016-08-24 Thread Bernd Helmle


--On 20. August 2016 12:41:48 -0400 Tom Lane  wrote:

> So at this point I'm pretty baffled as to what the actual use-case is
> here.  I am tempted to say that a standalone backend should refuse to
> start at all if a recovery.conf file is present.  If we do want to
> allow the case, we need some careful thought about what it should do.

Well, the "use case" was a crashed hot standby at a customer site (it kept
PANICing after restarting), where I decided to have a look through the
recovery process with gdb. The exact PANIC was

2016-03-29 15:12:40 CEST [16097]: [26-1] user=,db=,xid=0,vxid=1/0 FATAL:
unexpected GIN leaf action: 0

I had the idea that it would be quick and dirty to use a single backend. I
was surprised that this time it PANIC'ed differently.

That said, I'm okay if --single is not intended to bring up a hot standby.
There are many other ways to debug such problems.

-- 
Thanks

Bernd




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-21 Thread Michael Paquier
On Sun, Aug 21, 2016 at 1:41 AM, Tom Lane  wrote:
> So at this point I'm pretty baffled as to what the actual use-case is
> here.

Perhaps it is easier to attach a debugger in this case to analyze the problem?

> I am tempted to say that a standalone backend should refuse to
> start at all if a recovery.conf file is present.  If we do want to
> allow the case, we need some careful thought about what it should do.

+1 for preventing an instance in --single mode from starting if
recovery.conf is present. It is better to put safeguards in place than
to get unwanted behavior.
-- 
Michael




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-20 Thread Tom Lane
I wrote:
> In short, I don't think control should have been here at all.  My proposal
> for a fix is to force EnableHotStandby to remain false in a standalone
> backend.

I tried to reproduce Bernd's problem by starting a standalone backend in
a data directory that was configured as a hot standby slave, and soon
found that there are much bigger issues than this.  The startup sequence
soon tries to wait for WAL to arrive, which in HEAD uses

WaitLatch(&XLogCtl->recoveryWakeupLatch,
          WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
          5000L);

which immediately elog(FATAL)s because a standalone backend has no parent
postmaster and so postmaster_alive_fds[] isn't set.  But if it didn't do
that, it'd wait forever because of course there is no active WAL receiver
process that would ever provide more WAL.
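
(For illustration only: the constraint boils down to something like the
sketch below. This is not the actual latch implementation, whose check and
error message differ; it just shows why waiting on WL_POSTMASTER_DEATH
cannot work without a postmaster.)

    /*
     * Illustrative sketch: a standalone backend never initializes
     * postmaster_alive_fds[], so a wait that includes WL_POSTMASTER_DEATH
     * has nothing to monitor and has to bail out.
     */
    if ((wakeEvents & WL_POSTMASTER_DEATH) && !IsUnderPostmaster)
        elog(FATAL, "cannot wait on postmaster death in a single-user server");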

The only way that you'd ever get to a command prompt is if somebody made a
promotion trigger file, which would cause the startup code to promote the
cluster into master status, which does not really seem like something that
would be a good idea in Bernd's proposed use case of "investigating a
problem".

Alternatively, if we were to force standby_mode off in a standalone
backend, it would come to the command prompt right away but again it would
have effectively promoted the cluster to master.  That is certainly not
something we should ever do automatically.

So at this point I'm pretty baffled as to what the actual use-case is
here.  I am tempted to say that a standalone backend should refuse to
start at all if a recovery.conf file is present.  If we do want to
allow the case, we need some careful thought about what it should do.

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-19 Thread Tom Lane
[ got around to looking at this finally ]

Alvaro Herrera  writes:
> Bernd Helmle wrote:
>> While investigating a problem on a streaming hot standby instance at a
>> customer site, I got the following when using a standalone backend:
>> 
>> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
>> inconsistent data
>> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
>> 1663/65588/65613;
>> 
>> The responsible PANIC is triggered here:
>> src/backend/access/nbtree/nbtxlog.c:555

> This PANIC was introduced in the commit below.  AFAICS your proposed
> patch and rationale are correct.

I'm not very convinced by this.  What seems to me to be the core issue is
how it can make sense for InHotStandby to read as TRUE in a standalone
backend.  The best-case situation if that's true is that the standalone
backend will do a lot of unnecessary work on the baseless assumption that
there are other backends it has to keep things valid for.  The worst-case
situation is that you trip over some bug and are unable to complete
recovery, even though you would have if the hot-standby support hadn't
been activated.

In short, I don't think control should have been here at all.  My proposal
for a fix is to force EnableHotStandby to remain false in a standalone
backend.
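
(Concretely, that proposal amounts to a one-line guard along the lines of
the sketch below; this is a minimal sketch of the idea rather than committed
code, assuming the existing EnableHotStandby and IsUnderPostmaster globals.)

    /* Sketch of the proposal: never enable hot standby without a postmaster. */
    if (!IsUnderPostmaster)
        EnableHotStandby = false;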

regards, tom lane




Re: [HACKERS] standalone backend PANICs during recovery

2016-04-14 Thread Bernd Helmle

--On 5. April 2016 21:50:17 -0400 Robert Haas  wrote:

> If nobody's going to commit this right away, this should be added to
> the next CommitFest so we don't forget it.

Done.

-- 
Thanks

Bernd




Re: [HACKERS] standalone backend PANICs during recovery

2016-04-05 Thread Robert Haas
On Sat, Apr 2, 2016 at 5:57 PM, Alvaro Herrera  wrote:
> Bernd Helmle wrote:
>> While investigating a problem on a streaming hot standby instance at a
>> customer site, I got the following when using a standalone backend:
>>
>> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
>> inconsistent data
>> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
>> 1663/65588/65613;
>>
>> The responsible PANIC is triggered here:
>>
>> src/backend/access/nbtree/nbtxlog.c:555
>>
>> btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
>
> This PANIC was introduced in the commit below.  AFAICS your proposed
> patch and rationale are correct.

If nobody's going to commit this right away, this should be added to
the next CommitFest so we don't forget it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] standalone backend PANICs during recovery

2016-04-02 Thread Alvaro Herrera
Bernd Helmle wrote:
> While investigating a problem on a streaming hot standby instance at a
> customer site, I got the following when using a standalone backend:
> 
> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
> inconsistent data
> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
> 1663/65588/65613;
> 
> The responsible PANIC is triggered here:
> 
> src/backend/access/nbtree/nbtxlog.c:555
> 
> btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)

This PANIC was introduced in the commit below.  AFAICS your proposed
patch and rationale are correct.


commit f786e91a75b2f64527dcf321e754b6448fcad7fe
Author: Tom Lane 
AuthorDate: Fri Aug 3 15:41:18 2012 -0400
CommitDate: Fri Aug 3 15:41:18 2012 -0400

Improve underdocumented btree_xlog_delete_get_latestRemovedXid() code.

As noted by Noah Misch, btree_xlog_delete_get_latestRemovedXid is
critically dependent on the assumption that it's examining a consistent
state of the database.  This was undocumented though, so the
seemingly-unrelated check for no active HS sessions might be thought to be
merely an optional optimization.  Improve comments, and add an explicit
check of reachedConsistency just to be sure.

This function returns InvalidTransactionId (thereby killing all HS
transactions) in several cases that are not nearly unlikely enough for my
taste.  This commit doesn't attempt to fix those deficiencies, just
document them.

Back-patch to 9.2, not from any real functional need but just to keep the
branches more closely synced to simplify possible future back-patching.
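
(The explicit check mentioned in that commit message is the test whose PANIC
Bernd reported; paraphrased, with the exact code in nbtxlog.c possibly
differing slightly, it is:)

    /* Per the commit message above: refuse to run before consistency is reached. */
    if (!reachedConsistency)
        elog(PANIC,
             "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");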



-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

