Re: [HACKERS] Reading timeline from pg_control on replication slave

2017-10-30 Thread Michael Paquier
On Mon, Oct 30, 2017 at 2:48 AM, Craig Ringer  wrote:
> (I'd quite like ThisTimeLineID to go away as a global. It's messy and
> confusing, and I'd much rather it be fetched when needed).

Yeah, I agree. My take on the matter is that we could just use the
status data of the control file which is in shared memory as the only
writers to it are the checkpointer and the startup processes.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reading timeline from pg_control on replication slave

2017-10-29 Thread Craig Ringer
On 28 October 2017 at 06:09, Michael Paquier  wrote:
> On Fri, Oct 27, 2017 at 1:04 AM, Andrey Borodin  wrote:
>> I'm working on backups from replication salve in WAL-G [0]
>> Backups used to use result of pg_walfile_name(pg_start_backup(...)). Call to 
>> pg_start_backup() works nice, but "pg_walfile_name() cannot be executed 
>> during recovery."
>> This function has LSN as argument and reads TimeLineId from global state.
>> So I made a function[1] that, if on replica, reads timeline from pg_control 
>> file and formats WAL file name as is it was produces by pg_wal_filename(lsn).
>
> ThisTimeLineID is not something you can rely on for standby backends
> as it is not set during recovery.

That's not much of a concern really, you just have to ensure you call
GetXLogReplayRecPtr and set ThisTimeLineID.

(I'd quite like ThisTimeLineID to go away as a global. It's messy and
confusing, and I'd much rather it be fetched when needed).

However, that doesn't negate the rest of the issues you raised.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reading timeline from pg_control on replication slave

2017-10-28 Thread Andrey Borodin
Hi, Michael!

Thank you very much for these comments!

> 28 окт. 2017 г., в 3:09, Michael Paquier  
> написал(а):
> 
> ThisTimeLineID is not something you can rely on for standby backends
> as it is not set during recovery. That's the reason behind
> pg_walfile_name disabled during recovery. There are three things
> popping on top of my mind that one could think about:
> 1) Backups cannot be completed when started on a standby in recovery
> and when stopped after the standby has been promoted, meaning that its
> timeline has changed.
> 2) After a standby has been promoted, by using pg_start_backup, you
> issue a checkpoint which makes sure that the control file gets flushed
> with the new information, so when pg_start_backup returns to the
> caller you should have the correct timeline number because the outer
> function gets evaluated last.
> 3) Backups taken from cascading standbys, where a direct parent has
> been promoted.
> 
> 1) and 2) are actually not problems per the restrictions I am giving
> above, but 3) is. If I recall correctly, when a streaming standby does
> a timeline jump, a restart point is not immediately generated, so you
> could have the timeline on the control file not updated to the latest
> timeline value, meaning that you could have the WAL file name you use
> here referring to a previous timeline and not the newest one.
> 
> In short, yes, what you are doing is definitely risky in my opinion,
> particularly for complex cascading setups.

We are using TimeLineId from pg_control only to give a name to backup. Slightly 
stale timeline Id will not incur significant problems as long as pg_control is 
picked up after backup finalization.

But from your words I see that the safest option is to check timeline from 
pg_control after start and after stop. If this timelines differ - invalidate 
backup entirely. This does not seem too hard condition for invalidation, does 
it?

Best regards, Andrey Borodin.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reading timeline from pg_control on replication slave

2017-10-27 Thread Michael Paquier
On Fri, Oct 27, 2017 at 1:04 AM, Andrey Borodin  wrote:
> I'm working on backups from replication salve in WAL-G [0]
> Backups used to use result of pg_walfile_name(pg_start_backup(...)). Call to 
> pg_start_backup() works nice, but "pg_walfile_name() cannot be executed 
> during recovery."
> This function has LSN as argument and reads TimeLineId from global state.
> So I made a function[1] that, if on replica, reads timeline from pg_control 
> file and formats WAL file name as is it was produces by pg_wal_filename(lsn).

ThisTimeLineID is not something you can rely on for standby backends
as it is not set during recovery. That's the reason behind
pg_walfile_name disabled during recovery. There are three things
popping on top of my mind that one could think about:
1) Backups cannot be completed when started on a standby in recovery
and when stopped after the standby has been promoted, meaning that its
timeline has changed.
2) After a standby has been promoted, by using pg_start_backup, you
issue a checkpoint which makes sure that the control file gets flushed
with the new information, so when pg_start_backup returns to the
caller you should have the correct timeline number because the outer
function gets evaluated last.
3) Backups taken from cascading standbys, where a direct parent has
been promoted.

1) and 2) are actually not problems per the restrictions I am giving
above, but 3) is. If I recall correctly, when a streaming standby does
a timeline jump, a restart point is not immediately generated, so you
could have the timeline on the control file not updated to the latest
timeline value, meaning that you could have the WAL file name you use
here referring to a previous timeline and not the newest one.

In short, yes, what you are doing is definitely risky in my opinion,
particularly for complex cascading setups.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Reading timeline from pg_control on replication slave

2017-10-27 Thread Andrey Borodin
Hi, hackers!

I'm working on backups from replication salve in WAL-G [0]
Backups used to use result of pg_walfile_name(pg_start_backup(...)). Call to 
pg_start_backup() works nice, but "pg_walfile_name() cannot be executed during 
recovery."
This function has LSN as argument and reads TimeLineId from global state.
So I made a function[1] that, if on replica, reads timeline from pg_control 
file and formats WAL file name as is it was produces by pg_wal_filename(lsn).

Are there any serious dangers? Obviously, this hack is not crisp and clear. Is 
the risk of reading stale timeline really a problem? By reading TimeLineId from 
file I'm fighting those precautions in pg_walfile_name(..) which were 
implemented for a reason, I guess. 

Thanks for reading this. I'll be happy to hear any comments on the matter.

[0] https://github.com/wal-g/wal-g/pull/32
[1] 
https://github.com/wal-g/wal-g/pull/32/files#diff-455316c14fb04c9f9748a0798ad151d9R158

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers