Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-22 Thread Masahiko Sawada
On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
 wrote:
> On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada  
> wrote:
>> I have a question. Since WALInsertLockRelease seems not to call
>> LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
>> instead, is that right?
>
> This looks like a copy-pasto from my side. Thanks for spotting it.

Okay, attached the latest version patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_do_pg_abort_backup_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-22 Thread Michael Paquier
On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada  wrote:
> I have a question. Since WALInsertLockRelease seems not to call
> LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
> instead, is that right?

This looks like a copy-pasto from my side. Thanks for spotting it.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-22 Thread Masahiko Sawada
On Fri, Sep 22, 2017 at 3:00 PM, Michael Paquier
 wrote:
> On Fri, Sep 22, 2017 at 11:34 AM, Masahiko Sawada  
> wrote:
>> You're right. I updated the patch so that it exits do_pg_abort_backup
>> if the state is NONE and setting the state to NONE in
>> do_pg_stop_backup before releasing the WAL insert lock.
>
> -/* Clean up session-level lock */
> +/*
> + * Clean up session-level lock. To avoid interrupting before changing
> + * the backup state by LWLockWaitForVar we change it while holding the
> + * WAL insert lock.
> + */
>  sessionBackupState = SESSION_BACKUP_NONE;
> You could just mention directly CHECK_FOR_INTERRUPTS here.

Thank you for the reviewing.
I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

> +/* Quick exit if we have done the backup */
> +if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
> +return;
>
> The patch contents look fine to me. I have also tested things in
> depths but could not find any problems. I also looked again at the
> backup start code, trying to see any flows between the moment the
> session backup lock is changed and the moment the callback to do
> backup cleanup is registered but I have not found any issues. I am
> marking this as ready for committer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-22 Thread Michael Paquier
On Fri, Sep 22, 2017 at 11:34 AM, Masahiko Sawada  wrote:
> You're right. I updated the patch so that it exits do_pg_abort_backup
> if the state is NONE and setting the state to NONE in
> do_pg_stop_backup before releasing the WAL insert lock.

-/* Clean up session-level lock */
+/*
+ * Clean up session-level lock. To avoid interrupting before changing
+ * the backup state by LWLockWaitForVar we change it while holding the
+ * WAL insert lock.
+ */
 sessionBackupState = SESSION_BACKUP_NONE;
You could just mention directly CHECK_FOR_INTERRUPTS here.

+/* Quick exit if we have done the backup */
+if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+return;

The patch contents look fine to me. I have also tested things in
depths but could not find any problems. I also looked again at the
backup start code, trying to see any flows between the moment the
session backup lock is changed and the moment the callback to do
backup cleanup is registered but I have not found any issues. I am
marking this as ready for committer.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-21 Thread Masahiko Sawada
On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
 wrote:
>
> Yep, but the deficiency is caused by the use before_shmem_exit() in
> the SQL-level functions present in 9.6~ which make the cleanup happen
> potentially twice. If you look at the code of basebackup.c, you would
> notice that the cleanups of the counters only happen within the
> PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
> enough.

Thank you for explaining. I understood and agreed..

On Fri, Sep 22, 2017 at 8:02 AM, Michael Paquier
 wrote:
> On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
>  wrote:
>> On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada  
>> wrote:
>>> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
>>>  wrote:
 +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
 +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
 +   XLogCtl->Insert.nonExclusiveBackups--;
 Hm, no, I don't agree. I think that instead you should just leave
 do_pg_abort_backup() immediately if sessionBackupState is set to
 SESSION_BACKUP_NONE. This variable is the link between the global
 counters and the session stopping the backup so I don't think that we
 should touch this assertion of this counter. I think that this method
 would be safe as well for backup start as pg_start_backup_callback
 takes care of any cleanup. Also because the counters are incremented
 before entering in the PG_ENSURE_ERROR_CLEANUP block, and
 sessionBackupState is updated just after leaving the block.
>>>
>>> I think that the assertion failure still can happen if the process
>>> aborts after decremented the counter and before setting to
>>> SESSION_BACKUP_NONE. Am I missing something?
>>
>> The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
>> the cleanup at this moment. So this happens when waiting for the
>> archives to be done, and the session flag is set to NONE at this
>> point.
>
> And actually, with two non-exclusive backups taken in parallel, it is
> still possible to fail on another assertion in do_pg_stop_backup()
> even with your patch. Imagine the following:
> 1) Two backups are taken, counter for non-exclusive backups is set at 2.
> 2) One backup is stopped, then interrupted. This causes the counter to
> be decremented twice, once in do_pg_stop_backup, and once when
> aborting. Counter is at 0, switching as well forcePageWrites to
> false..
> 3) The second backup stops, a new assertion failure is triggered.
> Without assertions the counter would get at -1.

You're right. I updated the patch so that it exits do_pg_abort_backup
if the state is NONE and setting the state to NONE in
do_pg_stop_backup before releasing the WAL insert lock.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_do_pg_abort_backup_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-21 Thread Michael Paquier
On Fri, Sep 22, 2017 at 10:41 AM, Craig Ringer  wrote:
> Another one to watch out for is that elog(...) and ereport(...) invoke
> CHECK_FOR_INTERRUPTS. That's given me exciting surprises before when
> combined with assertion checking and various exit cleanup hooks.

Ahah. Good point. In this case LWLockWaitForVar() is one thing to
worry about if LWLock tracing is enabled because it can log things
before holding the existing interrupts. This can be avoided easily in
the case of this issue by updating sessionBackupState before calling
WALInsertLockRelease and while holding the WAL insert lock. Surely
this meritates a comment in the patch we want here.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-21 Thread Craig Ringer
On 21 September 2017 at 16:56, Michael Paquier 
wrote:

> On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada 
> wrote:
> > On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
> >  wrote:
> >> On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada 
> wrote:
> >>> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
> >>> backup has been introduced, so we should back-patch to the all
> >>> supported versions.
> >>
> >> There is a typo here right? Non-exclusive backups have been introduced
> >> in 9.6. Why would a back-patch further down be needed?
> >
> > I think the non-exclusive backups infrastructure has been introduced
> > in 9.1 for pg_basebackup. I've not checked yet that it can be
> > reproduced using pg_basebackup in PG9.1 but I thought it could happen
> > as far as I checked the code.
>
> Yep, but the deficiency is caused by the use before_shmem_exit() in
> the SQL-level functions present in 9.6~ which make the cleanup happen
> potentially twice. If you look at the code of basebackup.c, you would
> notice that the cleanups of the counters only happen within the
> PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
> enough.
>
> >> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
> >> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
> >> +   XLogCtl->Insert.nonExclusiveBackups--;
> >> Hm, no, I don't agree. I think that instead you should just leave
> >> do_pg_abort_backup() immediately if sessionBackupState is set to
> >> SESSION_BACKUP_NONE. This variable is the link between the global
> >> counters and the session stopping the backup so I don't think that we
> >> should touch this assertion of this counter. I think that this method
> >> would be safe as well for backup start as pg_start_backup_callback
> >> takes care of any cleanup. Also because the counters are incremented
> >> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
> >> sessionBackupState is updated just after leaving the block.
> >
> > I think that the assertion failure still can happen if the process
> > aborts after decremented the counter and before setting to
> > SESSION_BACKUP_NONE. Am I missing something?
>
> The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
> the cleanup at this moment. So this happens when waiting for the
> archives to be done, and the session flag is set to NONE at this
> point.
>

Another one to watch out for is that elog(...) and ereport(...) invoke
CHECK_FOR_INTERRUPTS. That's given me exciting surprises before when
combined with assertion checking and various exit cleanup hooks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-21 Thread Michael Paquier
On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
 wrote:
> On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada  
> wrote:
>> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
>>  wrote:
>>> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
>>> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
>>> +   XLogCtl->Insert.nonExclusiveBackups--;
>>> Hm, no, I don't agree. I think that instead you should just leave
>>> do_pg_abort_backup() immediately if sessionBackupState is set to
>>> SESSION_BACKUP_NONE. This variable is the link between the global
>>> counters and the session stopping the backup so I don't think that we
>>> should touch this assertion of this counter. I think that this method
>>> would be safe as well for backup start as pg_start_backup_callback
>>> takes care of any cleanup. Also because the counters are incremented
>>> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
>>> sessionBackupState is updated just after leaving the block.
>>
>> I think that the assertion failure still can happen if the process
>> aborts after decremented the counter and before setting to
>> SESSION_BACKUP_NONE. Am I missing something?
>
> The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
> the cleanup at this moment. So this happens when waiting for the
> archives to be done, and the session flag is set to NONE at this
> point.

And actually, with two non-exclusive backups taken in parallel, it is
still possible to fail on another assertion in do_pg_stop_backup()
even with your patch. Imagine the following:
1) Two backups are taken, counter for non-exclusive backups is set at 2.
2) One backup is stopped, then interrupted. This causes the counter to
be decremented twice, once in do_pg_stop_backup, and once when
aborting. Counter is at 0, switching as well forcePageWrites to
false..
3) The second backup stops, a new assertion failure is triggered.
Without assertions the counter would get at -1.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-21 Thread Michael Paquier
On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada  wrote:
> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
>  wrote:
>> On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada  
>> wrote:
>>> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
>>> backup has been introduced, so we should back-patch to the all
>>> supported versions.
>>
>> There is a typo here right? Non-exclusive backups have been introduced
>> in 9.6. Why would a back-patch further down be needed?
>
> I think the non-exclusive backups infrastructure has been introduced
> in 9.1 for pg_basebackup. I've not checked yet that it can be
> reproduced using pg_basebackup in PG9.1 but I thought it could happen
> as far as I checked the code.

Yep, but the deficiency is caused by the use before_shmem_exit() in
the SQL-level functions present in 9.6~ which make the cleanup happen
potentially twice. If you look at the code of basebackup.c, you would
notice that the cleanups of the counters only happen within the
PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
enough.

>> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
>> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
>> +   XLogCtl->Insert.nonExclusiveBackups--;
>> Hm, no, I don't agree. I think that instead you should just leave
>> do_pg_abort_backup() immediately if sessionBackupState is set to
>> SESSION_BACKUP_NONE. This variable is the link between the global
>> counters and the session stopping the backup so I don't think that we
>> should touch this assertion of this counter. I think that this method
>> would be safe as well for backup start as pg_start_backup_callback
>> takes care of any cleanup. Also because the counters are incremented
>> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
>> sessionBackupState is updated just after leaving the block.
>
> I think that the assertion failure still can happen if the process
> aborts after decremented the counter and before setting to
> SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-21 Thread Masahiko Sawada
On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
 wrote:
> On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada  
> wrote:
>> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
>> backup has been introduced, so we should back-patch to the all
>> supported versions.
>
> There is a typo here right? Non-exclusive backups have been introduced
> in 9.6. Why would a back-patch further down be needed?

I think the non-exclusive backups infrastructure has been introduced
in 9.1 for pg_basebackup. I've not checked yet that it can be
reproduced using pg_basebackup in PG9.1 but I thought it could happen
as far as I checked the code.

>
>> I changed do_pg_abort_backup() so that it decrements
>> nonExclusiveBackups only if > 0. Feedback is very welcome.
>
> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
> +   XLogCtl->Insert.nonExclusiveBackups--;
> Hm, no, I don't agree. I think that instead you should just leave
> do_pg_abort_backup() immediately if sessionBackupState is set to
> SESSION_BACKUP_NONE. This variable is the link between the global
> counters and the session stopping the backup so I don't think that we
> should touch this assertion of this counter. I think that this method
> would be safe as well for backup start as pg_start_backup_callback
> takes care of any cleanup. Also because the counters are incremented
> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
> sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

>
> About the patch:
> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
> There is a typo on this line.
>
> Adding that to the next CF would be a good idea so as we don't forget
> about it. Feel free to add me as reviewer.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-20 Thread Michael Paquier
On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada  wrote:
> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
> backup has been introduced, so we should back-patch to the all
> supported versions.

There is a typo here right? Non-exclusive backups have been introduced
in 9.6. Why would a back-patch further down be needed?

> I changed do_pg_abort_backup() so that it decrements
> nonExclusiveBackups only if > 0. Feedback is very welcome.

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+   XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

About the patch:
+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
There is a typo on this line.

Adding that to the next CF would be a good idea so as we don't forget
about it. Feel free to add me as reviewer.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-09-20 Thread Masahiko Sawada
Hi,

I got an assert failure when executing pg_terminate_backend to the
process that waiting for WAL to be archived in non-exclusive backup
mode.

TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)",
File: "xlog.c", Line: 11123)

The one reproducing procedure is,
1. Start postgres with enabling the WAL archive
2. Execute pg_start_backup()
3. Revoke the access privileges of archive directory. (e.g., chown
root:root /path/to/archive)
4. Execute pg_stop_backup() and hangs
5. Execute pg_terminate_backend() to the process that is waiting for
WAL to be archived.
Got the assertion failure.

Perhaps we can reproduce it using pg_basebackup as well.

I think the cause of this is that do_pg_abort_backup can be called
even after we decremented XLogCtl->Insert.nonExclusiveBackups in
do_pg_stop_backup(). That is, do_pg_abort_backup can be called with
XLogCtl->Insert.nonExclusiveBackups = 0 when the transaction aborts
after processed the nonExclusiveBackups (e.g, during waiting for WAL
to be archived)
The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

I changed do_pg_abort_backup() so that it decrements
nonExclusiveBackups only if > 0. Feedback is very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0513471..b0381e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11165,8 +11165,15 @@ void
 do_pg_abort_backup(void)
 {
 	WALInsertLockAcquireExclusive();
-	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
-	XLogCtl->Insert.nonExclusiveBackups--;
+
+	/*
+	 * This can be called after we decremented nonExclusiveBackups in
+	 * do_pg_stop_backup. So prevent it to be negative value.
+	 */
+-	Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+	if (XLogCtl->Insert.nonExclusiveBackups > 0)
+		XLogCtl->Insert.nonExclusiveBackups--;
+
 
 	if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
 		XLogCtl->Insert.nonExclusiveBackups == 0)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers