Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-11 Thread Magnus Hagander
On Mon, Jul 11, 2016 at 3:05 PM, Michael Paquier 
wrote:

> On Mon, Jul 11, 2016 at 7:01 PM, Magnus Hagander 
> wrote:
> > But isn't this also a pre-existing bug in 9.5? Or did we change something
> > else that suddenly made it visible?
>
> What has been patched here is a defect caused by pg_start_backup(),
> and not pg_basebackup. In the case of the latter, ThisTimelineID gets
> set by GetStandbyFlushRecPtr() in the context of the WAL sender used
> to send the base backup. In short, this is only a defect of 9.6, where
> pg_start_backup() can be used on standbys for the first time for
> non-exclusive backups.
>
> So the issue does not actually pre-exist, since GetStandbyFlushRecPtr()
> plays its role in setting up the timeline ID.


Ah, that's where we get it from. Gotcha, makes sense. Thanks for confirming!

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-11 Thread Michael Paquier
On Mon, Jul 11, 2016 at 7:01 PM, Magnus Hagander  wrote:
> But isn't this also a pre-existing bug in 9.5? Or did we change something
> else that suddenly made it visible?

What has been patched here is a defect caused by pg_start_backup(),
and not pg_basebackup. In the case of the latter, ThisTimelineID gets
set by GetStandbyFlushRecPtr() in the context of the WAL sender used
to send the base backup. In short, this is only a defect of 9.6, where
pg_start_backup() can be used on standbys for the first time for
non-exclusive backups.

So the issue does not actually pre-exist, since GetStandbyFlushRecPtr()
plays its role in setting up the timeline ID.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-11 Thread Amit Kapila
On Mon, Jul 11, 2016 at 3:31 PM, Magnus Hagander  wrote:
>
>
> On Thu, Jul 7, 2016 at 8:38 AM, Michael Paquier 
> wrote:
>>
>> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
>>  wrote:
>> > After further analysis, the issue is that we retrieve the starttli from
>> > the ControlFile structure, but it was using ThisTimeLineID when writing
>> > the backup label.
>> >
>> > I've attached a very simple patch that fixes it.
>>
>> ThisTimeLineID is always set at 0 on purpose on a standby, so we
>> cannot rely on it (well it is set temporarily when recycling old
>> segments). At recovery when parsing the backup_label file there is no
>> actual use of the start segment name, so that's only a cosmetic
>> change. But surely it would be better to get that fixed, because
>> that's useful for debugging.
>>
>> While looking at your patch, I thought that it would have been
>> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
>> recovery, but what we really want to know here is the timeline of the
>> last REDO pointer, which is starttli, and that's more consistent with
>> the fact that we use startpoint when writing the backup_label file. In
>> short, +1 for this fix.
>>
>> I am adding that in the list of open items, adding Magnus in CC whose
>> commit for non-exclusive backups is at the origin of this defect.
>
>
> I agree this looks correct.
>
> But isn't this also a pre-existing bug in 9.5? Or did we change something
> else that suddenly made it visible?
>

I think the bug is pre-existing, but it has now become visible to users
through the new API.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-11 Thread Magnus Hagander
On Thu, Jul 7, 2016 at 8:38 AM, Michael Paquier 
wrote:

> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
>  wrote:
> > After further analysis, the issue is that we retrieve the starttli from
> > the ControlFile structure, but it was using ThisTimeLineID when writing
> > the backup label.
> >
> > I've attached a very simple patch that fixes it.
>
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
>
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
>
> I am adding that in the list of open items, adding Magnus in CC whose
> commit for non-exclusive backups is at the origin of this defect.
>

I agree this looks correct.

But isn't this also a pre-existing bug in 9.5? Or did we change something
else that suddenly made it visible?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-09 Thread Magnus Hagander
On Jul 9, 2016 4:52 AM, "Noah Misch"  wrote:
>
> On Thu, Jul 07, 2016 at 03:38:26PM +0900, Michael Paquier wrote:
> > On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> >  wrote:
> > > After further analysis, the issue is that we retrieve the starttli from
> > > the ControlFile structure, but it was using ThisTimeLineID when writing
> > > the backup label.
> > >
> > > I've attached a very simple patch that fixes it.
> >
> > ThisTimeLineID is always set at 0 on purpose on a standby, so we
> > cannot rely on it (well it is set temporarily when recycling old
> > segments). At recovery when parsing the backup_label file there is no
> > actual use of the start segment name, so that's only a cosmetic
> > change. But surely it would be better to get that fixed, because
> > that's useful for debugging.
> >
> > While looking at your patch, I thought that it would have been
> > tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> > recovery, but what we really want to know here is the timeline of the
> > last REDO pointer, which is starttli, and that's more consistent with
> > the fact that we use startpoint when writing the backup_label file. In
> > short, +1 for this fix.
> >
> > I am adding that in the list of open items, adding Magnus in CC whose
> > commit for non-exclusive backups is at the origin of this defect.
>
> [Action required within 72 hours.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item.  Magnus,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.
>
> [1] http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com

I'll take a look at this on Monday when I'm back home from Russia. It looks
like people have it under control, so hopefully that just means committing
the available solution in which case it'll be finished by then.

/Magnus


Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-08 Thread Amit Kapila
On Thu, Jul 7, 2016 at 12:08 PM, Michael Paquier
 wrote:
> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
>  wrote:
>> After further analysis, the issue is that we retrieve the starttli from
>> the ControlFile structure, but it was using ThisTimeLineID when writing
>> the backup label.
>>
>> I've attached a very simple patch that fixes it.
>
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
>
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
>

+1, the fix looks right to me as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-08 Thread Michael Paquier
On Fri, Jul 8, 2016 at 6:40 PM, Marco Nenciarini
 wrote:
> The resulting backup is working perfectly, because Postgres has no use
> for pg_stop_backup LSN, but this can confuse any tool that uses the stop
> LSN to figure out which WAL files are needed by the backup (in this case
> the only file needed is the one containing the start checkpoint).
>
> After some discussion with Álvaro, my proposal is to avoid that by
> returning the stoppoint as the maximum between the startpoint and the
> min_recovery_end_location, in case of backup from the standby.

You are facing a pattern similar to the problem reported already on
this thread by Horiguchi-san:
http://www.postgresql.org/message-id/20160609.215558.118976703.horiguchi.kyot...@lab.ntt.co.jp
And it seems to me that you are jumping to an incorrect conclusion:
what we'd want to do is to update the minimum recovery point a bit more
aggressively on a node in recovery in the case where no buffers are
flushed by other backends.
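
For reference, the proposal quoted above (not the approach argued for in this
reply) can be sketched roughly as follows. This is only an illustration: it
assumes the default 16 MB WAL segment size, the LSN values are made up, and
the helper names are hypothetical, not PostgreSQL functions. It assumes the
confusing case is a stop location that lies behind the start location, which
is what taking the maximum would guard against.

```python
# Illustrative sketch only; names and values are hypothetical.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    """Format a 64-bit LSN back into the 'hi/lo' hex notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def proposed_stop_point(start_point: int, min_recovery_end: int) -> int:
    """Stop location as proposed: never earlier than the start location."""
    return max(start_point, min_recovery_end)


def segments_needed(start_point: int, stop_point: int) -> range:
    """Segment numbers a backup tool would fetch for [start, stop]."""
    first = start_point // WAL_SEGMENT_SIZE
    last = stop_point // WAL_SEGMENT_SIZE
    return range(first, last + 1)


start = parse_lsn("0/5000028")    # made-up start location
min_end = parse_lsn("0/4FFFF00")  # made-up minimum recovery end, behind start

# Using the raw value, a tool computes an empty set of segments.
print(list(segments_needed(start, min_end)))

# With the proposed maximum, only the segment holding the start is needed.
stop = proposed_stop_point(start, min_end)
print(format_lsn(stop), list(segments_needed(start, stop)))
```

Whether the stop location should be clamped this way, or the minimum recovery
point updated more aggressively instead, is exactly what is being discussed
here.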
-- 
Michael




Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-06 Thread Michael Paquier
On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
 wrote:
> After further analysis, the issue is that we retrieve the starttli from
> the ControlFile structure, but it was using ThisTimeLineID when writing
> the backup label.
>
> I've attached a very simple patch that fixes it.

ThisTimeLineID is always set at 0 on purpose on a standby, so we
cannot rely on it (well it is set temporarily when recycling old
segments). At recovery when parsing the backup_label file there is no
actual use of the start segment name, so that's only a cosmetic
change. But surely it would be better to get that fixed, because
that's useful for debugging.
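
To illustrate why a zero timeline ID shows up in the start segment name, here
is a minimal sketch of how a WAL segment file name is built from a timeline ID
and an LSN. It assumes the default 16 MB segment size; the LSN is made up and
the Python helper names are hypothetical (they only mirror the naming scheme,
they are not PostgreSQL code).

```python
# Illustrative sketch only; names and values are hypothetical.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024                     # assumed default
SEGMENTS_PER_XLOG_ID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(tli: int, lsn: int) -> str:
    """Return the 24-character name of the segment containing the LSN."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        tli,
        segno // SEGMENTS_PER_XLOG_ID,
        segno % SEGMENTS_PER_XLOG_ID,
    )


start_lsn = parse_lsn("0/3000028")  # made-up start location

# With a timeline ID of 0 the first eight digits are all zeros.
print(wal_file_name(0, start_lsn))  # 000000000000000000000003

# With the timeline of the REDO pointer (say 1) the name is the expected one.
print(wal_file_name(1, start_lsn))  # 000000010000000000000003
```

Only the first eight hex digits differ, which matches the point that the
change is cosmetic for recovery but matters for anyone reading the file name.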

While looking at your patch, I thought that it would have been
tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
recovery, but what we really want to know here is the timeline of the
last REDO pointer, which is starttli, and that's more consistent with
the fact that we use startpoint when writing the backup_label file. In
short, +1 for this fix.

I am adding that in the list of open items, adding Magnus in CC whose
commit for non-exclusive backups is at the origin of this defect.
-- 
Michael

