Re: [HACKERS] Postgres abort found in 9.3.11

2016-11-22 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

The setup uses a hot-standby architecture, and the issue is seen during a
normal run under a normal load of 50% insert and 50% delete operations.
During startup of the standby node, we copy the data directory from the active
postgres using pg_basebackup.

Meanwhile, we are trying to create a test bed for people to try.
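
A test bed of this kind could start from a generated SQL script. The following
is only a sketch, not the workload actually used in this report: the table
name abort_test, its columns, and both function names are hypothetical. It
only emits SQL text (roughly half INSERT, half DELETE) that could be fed to
psql on the active node while a standby replays the resulting WAL.

```python
import random


def mixed_workload(n_ops, seed=0, table="abort_test"):
    """Yield n_ops SQL statements, roughly half INSERT and half DELETE.

    The table name and its (id, payload) columns are hypothetical.
    """
    rng = random.Random(seed)   # fixed seed so a run can be repeated exactly
    live = []                   # ids currently present in the table
    next_id = 1
    for _ in range(n_ops):
        # Delete only when a row exists; otherwise fall back to INSERT.
        if live and rng.random() < 0.5:
            victim = live.pop(rng.randrange(len(live)))
            yield f"DELETE FROM {table} WHERE id = {victim};"
        else:
            live.append(next_id)
            yield f"INSERT INTO {table} (id, payload) VALUES ({next_id}, 'x');"
            next_id += 1


def build_script(n_ops, seed=0, table="abort_test"):
    """Return a complete SQL script: table creation plus the mixed workload."""
    # The primary key gives the table a btree index for the deletes to touch.
    header = (
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id integer PRIMARY KEY, payload text);"
    )
    return "\n".join([header, *mixed_workload(n_ops, seed, table)])


if __name__ == "__main__":
    print(build_script(1000))
```

Whether such a load is enough to trigger the PANIC on the standby is not
known; the fixed seed is there only so that others can replay the same
statement sequence.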

Regards,
Sandhya

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, November 22, 2016 1:47 AM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> As suggested by you, we upgraded the postgres to version 9.3.14. Also we 
> removed all the patches we had applied before. But the issue is still 
> observed in the latest version as well.

Can you make a test case for other people to try?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres abort found in 9.3.11

2016-11-21 Thread Tom Lane
"K S, Sandhya (Nokia - IN/Bangalore)"  writes:
> As suggested by you, we upgraded the postgres to version 9.3.14. Also we 
> removed all the patches we had applied before. But the issue is still 
> observed in the latest version as well.

Can you make a test case for other people to try?

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-11-21 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

As you suggested, we upgraded postgres to version 9.3.14. We also removed all
the patches we had applied before. But the issue is still observed in the
latest version.

The issue is seen during a normal run and is observed only on the standby node.

This time as well, the same error log is observed.
node-1 postgres[8743]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
cannot operate with inconsistent data

Could you please share any input that would help us proceed further?

Regards,
Sandhya
 
-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Friday, September 16, 2016 1:29 AM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> We tried to replicate the scenario without our patch(exiting postmaster) and 
> still we were able to see the issue.

> Same error was seen this time as well.
> node-0 postgres[8243]: [1-2] HINT:  Is another postmaster already running on 
> port 5433? If not, wait a few seconds and retry.  
> node-1 postgres[8650]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
> cannot operate with inconsistent data

> Crash was not seen in 9.3.9 without the patch but it was reproduced in 9.3.11.
> So something specifically changed between 9.3.9 and 9.3.11 is causing the 
> issue.

Well, I looked through the git history from 9.3.9 to 9.3.11 and I don't
see anything that seems likely to explain a problem here.

If you can reproduce this, which it sounds like you can, maybe you could
create a self-contained test case for other people to try?

Also worth noting is that the current 9.3.x release is 9.3.14.  You
might save yourself some time by updating and seeing if it still
reproduces in 9.3.14.

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-15 Thread Tom Lane
"K S, Sandhya (Nokia - IN/Bangalore)"  writes:
> We tried to replicate the scenario without our patch(exiting postmaster) and 
> still we were able to see the issue.

> Same error was seen this time as well.
> node-0 postgres[8243]: [1-2] HINT:  Is another postmaster already running on 
> port 5433? If not, wait a few seconds and retry.  
> node-1 postgres[8650]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
> cannot operate with inconsistent data

> Crash was not seen in 9.3.9 without the patch but it was reproduced in 9.3.11.
> So something specifically changed between 9.3.9 and 9.3.11 is causing the 
> issue.

Well, I looked through the git history from 9.3.9 to 9.3.11 and I don't
see anything that seems likely to explain a problem here.

If you can reproduce this, which it sounds like you can, maybe you could
create a self-contained test case for other people to try?

Also worth noting is that the current 9.3.x release is 9.3.14.  You
might save yourself some time by updating and seeing if it still
reproduces in 9.3.14.

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-15 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

We tried to replicate the scenario without our patch (the one exiting the
postmaster) and were still able to see the issue.

The same error was seen this time as well.
node-0 postgres[8243]: [1-2] HINT:  Is another postmaster already running on 
port 5433? If not, wait a few seconds and retry.  
node-1 postgres[8650]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
cannot operate with inconsistent data

The crash was not seen in 9.3.9 without the patch, but it was reproduced in
9.3.11. So something that changed between 9.3.9 and 9.3.11 is causing the issue.

Thanks in advance!!!

Sandhya

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, September 06, 2016 5:04 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> I was able to find a patch file where there is a call to ExitPostmaster() in 
> postmaster.c . 

> @@ -3081,6 +3081,11 @@
> shmem_exit(1);
> reset_shared(PostPortNumber);
 
> +   /* recovery termination */
> +   ereport(FATAL,
> +   (errmsg("recovery termination due to process crash")));
> +   ExitPostmaster(99);
> +
> StartupPID = StartupDataBase();
> Assert(StartupPID != 0); 
> pmState = PM_STARTUP;

There's no such code in the community sources, and I can't say that
such a patch looks like a bright idea to me.  It would disable any
restart after a crash (not only during recovery).

If you're running a version with assorted random non-community patches,
we can't really offer much support for that.

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-06 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

I was able to find a patch file that adds a call to ExitPostmaster() in
postmaster.c.

@@ -3081,6 +3081,11 @@
    shmem_exit(1);
    reset_shared(PostPortNumber);
 
+   /* recovery termination */
+   ereport(FATAL,
+           (errmsg("recovery termination due to process crash")));
+   ExitPostmaster(99);
+
    StartupPID = StartupDataBase();
    Assert(StartupPID != 0);
    pmState = PM_STARTUP;

But this patch has been there since 2009, when Postgres was upgraded to 9.0. I
am checking why this patch was introduced in the first place.
The question remains why the issue is not seen in version 9.3.9 but exists in
9.3.11.

Also, the case of standalone recovery is taken care of by the introduction of
the patch file.

"err-3" is part of the postgres source code (nbtxlog.c). Two different lines
were combined, probably leading to confusion.
Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data
Aug 22 11:44:52.065971 crit node-1 postgres[8629]: [18-2] CONTEXT:  xlog redo 
delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;
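
For anyone reading the earlier excerpt in which the two entries run together,
a small helper can separate them again. This is only a sketch: the function
name is hypothetical, and it assumes every entry starts with a timestamp of
the form "Aug 22 11:44:52.065760" as in the lines above. It requires
Python 3.7 or later, where re.split accepts zero-width matches.

```python
import re

# Zero-width match just before a syslog-style timestamp such as
# "Aug 22 11:44:52.065760 " (the format is assumed from the excerpts above).
_TIMESTAMP = re.compile(r"(?=[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{6} )")


def split_fused_entries(text):
    """Split log text into one string per entry, even if entries were fused.

    Line wrapping is undone first; note that this also collapses runs of
    whitespace inside a message into single spaces.
    """
    flat = " ".join(text.split())
    return [part.strip() for part in _TIMESTAMP.split(flat) if part.strip()]


if __name__ == "__main__":
    fused = (
        "Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  \n"
        "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent \n"
        "dataAug 22 11:44:52.065971 crit CFPU-1 postgres[8629]: [18-2] CONTEXT:  xlog \n"
        "redo delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;"
    )
    for entry in split_fused_entries(fused):
        print(entry)
```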

Thanks in advance!!!
Sandhya


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, September 01, 2016 7:19 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> Our setup is a hot-standby architecture. This crash is occurring only on 
> stand-by node. Postgres continues to run without any issues on active node.
> Postmaster is waiting for a start and is throwing this message.

> Aug 22 11:44:21.462555 info node-0 postgres[8222]: [1-2] HINT:  Is another 
> postmaster already running on port 5433? If not, wait a few seconds and 
> retry.  
> Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
> btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent 
> dataAug 22 11:44:52.065971 crit CFPU-1 postgres[8629]: [18-2] CONTEXT:  xlog 
> redo delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;

Hmm, that HINT seems to be the tail end of a message indicating that the
postmaster is refusing to start because of an existing postmaster.  Why
is that appearing?  If you've got some script that's overeagerly launching
and killing postmasters, maybe that's the ultimate cause of problems.

The only method I've heard of for getting that get_latestRemovedXid
error is to try to launch a standalone backend (postgres --single)
in a standby server directory.  We don't support that, cf
https://www.postgresql.org/message-id/flat/00F0B2CEF6D0CEF8A90119D4%40eje.credativ.lan

BTW, I'm curious about the "err-3:" part.  That would not be expected
in any standard build of Postgres ... is this something custom modified?

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-06 Thread Tom Lane
"K S, Sandhya (Nokia - IN/Bangalore)"  writes:
> I was able to find a patch file where there is a call to ExitPostmaster() in 
> postmaster.c . 

> @@ -3081,6 +3081,11 @@
> shmem_exit(1);
> reset_shared(PostPortNumber);
 
> +   /* recovery termination */
> +   ereport(FATAL,
> +   (errmsg("recovery termination due to process crash")));
> +   ExitPostmaster(99);
> +
> StartupPID = StartupDataBase();
> Assert(StartupPID != 0); 
> pmState = PM_STARTUP;

There's no such code in the community sources, and I can't say that
such a patch looks like a bright idea to me.  It would disable any
restart after a crash (not only during recovery).
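
To illustrate the point about restarts, here is a toy model in Python of the
control flow in the hunk above. It is a sketch, not PostgreSQL code: the
function name and the step strings are made up for illustration, and it models
only the order of the calls visible in the patch, nothing else the postmaster
does.

```python
def crash_cleanup_steps(patched):
    """Return the calls made after a child crash, in order (toy model)."""
    steps = ["shmem_exit(1)", "reset_shared(PostPortNumber)"]
    if patched:
        # The added lines are unconditional, so the postmaster exits here
        # and the restart below is never reached, whatever the crash was.
        steps.append("postmaster exits (added lines)")
        return steps
    # Context lines of the hunk: the startup process is launched again.
    steps.append("StartupPID = StartupDataBase()")
    steps.append("pmState = PM_STARTUP")
    return steps


if __name__ == "__main__":
    for patched in (False, True):
        print("patched" if patched else "unpatched", crash_cleanup_steps(patched))
```

In the model, the patched path never reaches StartupDataBase(), which is the
restart being disabled.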

If you're running a version with assorted random non-community patches,
we can't really offer much support for that.

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-02 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello Tom,

Apologies for the delayed reply.

Our setup is a hot-standby architecture. This crash occurs only on the
standby node. Postgres continues to run without any issues on the active node.
The postmaster is waiting to start and is throwing this message.

Aug 22 11:44:21.462555 info node-0 postgres[8222]: [1-2] HINT:  Is another 
postmaster already running on port 5433? If not, wait a few seconds and retry.  
Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent 
dataAug 22 11:44:52.065971 crit CFPU-1 postgres[8629]: [18-2] CONTEXT:  xlog 
redo delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;
Aug 22 11:44:52.085486 info node-1 coredumper: Generating core file 

The standby postgres recovers automatically on the next restart. This is
because we always copy the database afresh from the active node on restart.

We implemented a patch to force-kill the walsender on the active side. This is
done to avoid a prolonged wait if the standby node is not reachable (e.g.
forced power-off or LAN cable removal). This implementation has existed for a
long time. However, the issue was observed only recently, after upgrading to
9.3.11. Do you think this force-kill of the walsender might lead to such issues
in the latest postgres?


Regards,
Sandhya

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, August 30, 2016 5:09 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> During the server restart, we are getting postgres crash with sigabrt. No 
> other operation being performed.
> Attached the backtrace.

What shows up in the postmaster log?

> The occurrence is occasional. The issue is seen once in 30~50 times.

Does it successfully restart if you try again?  If not, what are you
doing to recover?

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-01 Thread Tom Lane
"K S, Sandhya (Nokia - IN/Bangalore)"  writes:
> Our setup is a hot-standby architecture. This crash is occurring only on 
> stand-by node. Postgres continues to run without any issues on active node.
> Postmaster is waiting for a start and is throwing this message.

> Aug 22 11:44:21.462555 info node-0 postgres[8222]: [1-2] HINT:  Is another 
> postmaster already running on port 5433? If not, wait a few seconds and 
> retry.  
> Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
> btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent 
> dataAug 22 11:44:52.065971 crit CFPU-1 postgres[8629]: [18-2] CONTEXT:  xlog 
> redo delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;

Hmm, that HINT seems to be the tail end of a message indicating that the
postmaster is refusing to start because of an existing postmaster.  Why
is that appearing?  If you've got some script that's overeagerly launching
and killing postmasters, maybe that's the ultimate cause of problems.

The only method I've heard of for getting that get_latestRemovedXid
error is to try to launch a standalone backend (postgres --single)
in a standby server directory.  We don't support that, cf
https://www.postgresql.org/message-id/flat/00F0B2CEF6D0CEF8A90119D4%40eje.credativ.lan
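
A startup script could guard against that case before launching anything in
single-user mode. The following Python sketch is an illustration only: the
function names are hypothetical, and it assumes the 9.3-era layout in which a
standby's data directory contains a recovery.conf with standby_mode turned on.

```python
import os
import re

# An uncommented "standby_mode = on" style line (assumed recovery.conf syntax).
_STANDBY_ON = re.compile(
    r"^\s*standby_mode\s*=\s*'?(on|true|yes|1)'?",
    re.IGNORECASE | re.MULTILINE,
)


def standby_mode_enabled(conf_text):
    """Return True if the given recovery.conf text turns standby_mode on."""
    return bool(_STANDBY_ON.search(conf_text))


def looks_like_standby(data_dir):
    """Return True if data_dir has a recovery.conf that enables standby_mode."""
    path = os.path.join(data_dir, "recovery.conf")
    try:
        with open(path, encoding="utf-8") as fh:
            return standby_mode_enabled(fh.read())
    except FileNotFoundError:
        return False
```

A wrapper would call looks_like_standby() on the data directory and refuse to
run postgres --single when it returns True.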

BTW, I'm curious about the "err-3:" part.  That would not be expected
in any standard build of Postgres ... is this something custom modified?

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-08-30 Thread Tom Lane
"K S, Sandhya (Nokia - IN/Bangalore)"  writes:
> During the server restart, we are getting postgres crash with sigabrt. No 
> other operation being performed.
> Attached the backtrace.

What shows up in the postmaster log?

> The occurrence is occasional. The issue is seen once in 30~50 times.

Does it successfully restart if you try again?  If not, what are you
doing to recover?

regards, tom lane

