Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> I can see cases where it *might* be worth it, but several backup tools
> support prefetch and/or parallelism which should be able to keep
> Postgres fed with WAL unless there is very high latency to the repo.
> I'm not sure the small performance boost (if any) would be enough to
> justify the extra code paths and tests.

...

> That said, if there is anything we can do in core to make these tools
> simpler or more performant, I'm all for it.  For instance, it might be
> good to have a return code (as suggested upstream) that indicates to
> Postgres that the segment in pg_wal is a good choice.  Some use cases
> might benefit enormously even if it doesn't make sense as a general case.

Well, if there are some use cases that would benefit greatly from it
then it may just be worth it.  That said, it doesn't sound like a common
enough ask to be something we'd prioritize for pgbackrest, at least not
without actual client demand/interest in it.  If someone else wants to
write the tests and the code and contribute that capability, we'd
certainly welcome it.

Thanks!

Stephen




Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread David Steele
On 8/8/18 11:45 AM, Stephen Frost wrote:
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 08/08/2018 04:08 PM, David Steele wrote:
>>> On 8/7/18 12:05 PM, Stephen Frost wrote:
>>>>> All I'm saying is that (assuming my understanding of RestoreArchivedFile is
>>>>> correct) we can't just do that in the current restore_command. We do need a
>>>>> way to ask the archive for some metadata/checksums, and restore_command is
>>>>> too late.
>>>>
>>>> Yeah, I see what you mean there.  An alternative might be to *not*
>>>> unlink the file, in case the restore command wants to check if it's
>>>> valid and matches what's in the archive, and instead restore the
>>>> requested file somewhere else and then perform an unlink/mv after
>>>> the restore command has been run.
>>> I don't see any cases (in master, at least) where RestoreArchivedFile()
>>> would unlink an existing WAL segment.  If you look at the call sites
>>> they all pass "RECOVERYHISTORY" or "RECOVERYXLOG" to the recovername
>>> parameter.
>>>
>>> RestoreArchivedFile() does overwrite the existing file after we decide
>>> that we prefer the restored version over the one in pg_wal, but that
>>> happens after restore_command returns.
>>>
>>> So as far as I can see, it could be possible for the restore_command to
>>> do something clever here.
>>
>> Oh, I see - I really was reading RestoreArchivedFile() wrong ...
>>
>> Yes, it seems possible to do this in restore_command. Currently it'd require
>> constructing the local segment path (replacing RECOVERYXLOG with the actual
>> segment filename), but that's solvable by adding a new pattern similar to
>> %p.
> 
> Nice, but I don't know that there's any need to add another %-flag,
> really.  David, do you think we'd really need that?

%p is already available for restore_command, so the path from %p and the
segment name from %f would do the trick.
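
To make that concrete, a minimal sketch of such a wrapper in Python,
wired up as restore_command = 'restore_wrapper.py "%f" "%p"'.  The
archive-checksum and archive-fetch commands here are hypothetical
stand-ins for whatever interface the backup tool actually provides:

#!/usr/bin/env python3
# Hypothetical restore_command wrapper, invoked with %f (segment name)
# and %p (path to restore to, e.g. pg_wal/RECOVERYXLOG).
import hashlib, os, shutil, subprocess, sys

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def main():
    segname, restore_path = sys.argv[1], sys.argv[2]          # %f, %p
    # The copy already in pg_wal lives next to the %p target.
    local_copy = os.path.join(os.path.dirname(restore_path), segname)

    if os.path.exists(local_copy):
        # "archive-checksum" stands in for asking the repo what it has
        # stored for this segment (pgbackrest, e.g., keeps sha256).
        want = subprocess.run(["archive-checksum", segname],
                              capture_output=True, text=True)
        if want.returncode == 0 and sha256(local_copy) == want.stdout.strip():
            shutil.copyfile(local_copy, restore_path)  # local copy is good
            return 0

    # Otherwise fetch from the archive; nonzero tells PG to try elsewhere.
    return subprocess.run(["archive-fetch", segname, restore_path]).returncode

if __name__ == "__main__":
    sys.exit(main())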

>> I'm not sure if packing all this into a single restore_command is a good
>> idea, because I'm pretty sure it'll lead to abhorrently monstrous bash
>> commands jammed into the restore_command. But then again, we kinda already
>> have that, and all archival systems already provide simple and correct
>> commands doing the right thing. And extending that would not be a big deal.
> 
> Running a bash snippet as restore_command is a bad idea already, imv.

Agreed.  Archiving is more complicated than backup in many ways and a
good implementation is far from trivial.

> The real PG backup tools already have much more complicated restore
> commands written in C, Perl, or Python, and could definitely implement
> this.  There's still a bit of a question as to if it'd really be worth
> it, but perhaps it would be in certain cases.

I can see cases where it *might* be worth it, but several backup tools
support prefetch and/or parallelism which should be able to keep
Postgres fed with WAL unless there is very high latency to the repo.
I'm not sure the small performance boost (if any) would be enough to
justify the extra code paths and tests.

That said, if there is anything we can do in core to make these tools
simpler or more performant, I'm all for it.  For instance, it might be
good to have a return code (as suggested upstream) that indicates to
Postgres that the segment in pg_wal is a good choice.  Some use cases
might benefit enormously even if it doesn't make sense as a general case.
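
If core did grow such a return code, the wrapper could signal it instead
of copying the file at all.  A purely speculative sketch - exit code 42
is invented for illustration, and no released Postgres assigns any
meaning to it:

#!/usr/bin/env python3
# Purely speculative sketch: a restore_command signalling "the copy
# already in pg_wal is known good" via a special exit code.  Exit code 42
# is invented for illustration; Postgres defines no such code today.
import os, sys

USE_PG_WAL = 42                      # hypothetical status value

def known_good(path):
    # Stand-in for a real check, e.g. a checksum round-trip to the repo.
    return os.path.exists(path)

segname, restore_path = sys.argv[1], sys.argv[2]   # %f, %p
local_copy = os.path.join(os.path.dirname(restore_path), segname)
if known_good(local_copy):
    sys.exit(USE_PG_WAL)             # tell the server to keep its pg_wal copy
sys.exit(1)  # a real tool would instead fetch the segment into %p here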

Regards,
-- 
-David
da...@pgmasters.net



Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 08/08/2018 04:08 PM, David Steele wrote:
> >On 8/7/18 12:05 PM, Stephen Frost wrote:
> >>>All I'm saying is that (assuming my understanding of RestoreArchivedFile is
> >>>correct) we can't just do that in the current restore_command. We do need a
> >>>way to ask the archive for some metadata/checksums, and restore_command is
> >>>too late.
> >>
> >>Yeah, I see what you mean there.  An alternative might be to *not*
> >>unlink the file, in case the restore command wants to check if it's
> >>valid and matches what's in the archive, and instead restore the
> >>requested file somewhere else and then perform an unlink/mv after
> >>the restore command has been run.
> >I don't see any cases (in master, at least) where RestoreArchivedFile()
> >would unlink an existing WAL segment.  If you look at the call sites
> >they all pass "RECOVERYHISTORY" or "RECOVERYXLOG" to the recovername
> >parameter.
> >
> >RestoreArchivedFile() does overwrite the existing file after we decide
> >that we prefer the restored version over the one in pg_wal, but that
> >happens after restore_command returns.
> >
> >So as far as I can see, it could be possible for the restore_command to
> >do something clever here.
> 
> Oh, I see - I really was reading RestoreArchivedFile() wrong ...
> 
> Yes, it seems possible to do this in restore_command. Currently it'd require
> constructing the local segment path (replacing RECOVERYXLOG with the actual
> segment filename), but that's solvable by adding a new pattern similar to
> %p.

Nice, but I don't know that there's any need to add another %-flag,
really.  David, do you think we'd really need that?

> I'm not sure if packing all this into a single restore_command is a good
> idea, because I'm pretty sure it'll lead to abhorrently monstrous bash
> commands jammed into the restore_command. But then again, we kinda already
> have that, and all archival systems already provide simple and correct
> commands doing the right thing. And extending that would not be a big deal.

Running a bash snippet as restore_command is a bad idea already, imv.
The real PG backup tools already have much more complicated restore
commands written in C, Perl, or Python, and could definitely implement
this.  There's still a bit of a question as to if it'd really be worth
it, but perhaps it would be in certain cases.

Thanks!

Stephen




Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread Tomas Vondra




On 08/08/2018 04:08 PM, David Steele wrote:
> On 8/7/18 12:05 PM, Stephen Frost wrote:
>>> All I'm saying is that (assuming my understanding of RestoreArchivedFile is
>>> correct) we can't just do that in the current restore_command. We do need a
>>> way to ask the archive for some metadata/checksums, and restore_command is
>>> too late.
>>
>> Yeah, I see what you mean there.  An alternative might be to *not*
>> unlink the file, in case the restore command wants to check if it's
>> valid and matches what's in the archive, and instead restore the
>> requested file somewhere else and then perform an unlink/mv after
>> the restore command has been run.
>
> I don't see any cases (in master, at least) where RestoreArchivedFile()
> would unlink an existing WAL segment.  If you look at the call sites
> they all pass "RECOVERYHISTORY" or "RECOVERYXLOG" to the recovername
> parameter.
>
> RestoreArchivedFile() does overwrite the existing file after we decide
> that we prefer the restored version over the one in pg_wal, but that
> happens after restore_command returns.
>
> So as far as I can see, it could be possible for the restore_command to
> do something clever here.




Oh, I see - I really was reading RestoreArchivedFile() wrong ...

Yes, it seems possible to do this in restore_command. Currently it'd
require constructing the local segment path (replacing RECOVERYXLOG with
the actual segment filename), but that's solvable by adding a new
pattern similar to %p.


I'm not sure if packing all this into a single restore_command is a good 
idea, because I'm pretty sure it'll lead to abhorrently monstrous bash 
commands jammed into the restore_command. But then again, we kinda 
already have that, and all archival systems already provide simple and 
correct commands doing the right thing. And extending that would not be 
a big deal.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread David Steele
On 8/7/18 12:05 PM, Stephen Frost wrote:
>>
>> All I'm saying is that (assuming my understanding of RestoreArchivedFile is
>> correct) we can't just do that in the current restore_command. We do need a
>> way to ask the archive for some metadata/checksums, and restore_command is
>> too late.
> 
> Yeah, I see what you mean there.  An alternative might be to *not*
> unlink the file, in case the restore command wants to check if it's
> valid and matches what's in the archive, and instead restore the
> requested file somewhere else and then perform an unlink/mv after
> the restore command has been run.
I don't see any cases (in master, at least) where RestoreArchivedFile()
would unlink an existing WAL segment.  If you look at the call sites
they all pass "RECOVERYHISTORY" or "RECOVERYXLOG" to the recovername
parameter.

RestoreArchivedFile() does overwrite the existing file after we decide
that we prefer the restored version over the one in pg_wal, but that
happens after restore_command returns.

So as far as I can see, it could be possible for the restore_command to
do something clever here.

Regards,
-- 
-David
da...@pgmasters.net



Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread David Steele
On 8/7/18 11:42 AM, Stephen Frost wrote:
> 
>>> CRC's are per WAL record, and possibly some WAL records might not be ok
>>> to replay, or at least we need to make sure that we replay the right set
>>> of WAL in the right order even when there are partial WAL files being
>>> given to PG (that aren't named that way...).  The more I think about
>>> this, I think we really need to avoid partial WAL files entirely- what
>>> are we going to do when we get to the end of one?  We'd need to request
>>> the full one from the restore command anyway, so seems like we should
>>> just go ahead and get it from the archive, the question is if there's an
>>> easy/cheap way to detect partial WAL files in pg_wal.
>>
>> As explained above, I don't think this is actually a problem. The checksums
>> do cover the whole file thanks to chaining, and there are ways to detect
>> partial segments. IMHO it's fine if we replay a segment and then find out it
>> was partial and that we need to fetch it from archive anyway and re-apply it
>> - it should not be very common case, except when the user does something
>> silly.
> 
> As long as we *do* go off and try to fetch that WAL file and replay it,
> and don't assume that the end of that partial WAL file means the end of
> WAL replay, then I think you may be right and that it'd be fine, but it
> does seem a bit risky to me.

This assumes that the local partial is a subset of the archived full WAL
segment, which should be true in most cases but I don't think we can
discount the possibility that it isn't.  Split-brain is certainly a way
to get to differing partials, though in that case things are already
pretty bad.

I've seen some pretty messed up situations and usually it is best to
treat the WAL archive as the ground truth.  If the archive_command is
smart enough not to overwrite WAL segments that already exist with
different versions then it should be a reliable record that all servers
can be replayed from (split-brains aside).  I think it's best to treat
the local WAL with some suspicion unless it is known to be good, i.e.
just restored from archive.

I do agree that most inconsistencies could be detected and throw an
error, but only if the WAL in the repository is examined, which means
making a round-trip there anyway.

At the very least, it seems that simply enabling "read from pg_wal
first" is not a good idea without making other changes to ensure it is
done correctly.

Regards,
-- 
-David
da...@pgmasters.net



Re: Standby trying "restore_command" before local WAL

2018-08-07 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> That's how I read this part of RestoreArchivedFile:
> 
> https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogarchive.c#L110
> 
> The very first thing it does is checking if the local file exists, and if it
> does then calling unlink().

Oh, yeah, once we've decided to go down that route we're doing that,
wasn't thinking it through completely, sorry for the noise.

> >>We'd need to invoke some other command before restore_command, which would do
> >>this comparison and decide whether to use local or remote WAL.
> >
> >Ugh, that sounds like a lot of additional complication that I'm not sure
> >would be all that useful or sensible for this particular case, if that's
> >what it would require.  I'd rather we try to figure out some way to get
> >rid of restore_command entirely instead of bolting on more things around
> >it.
> 
> The simpler the better of course.
> 
> All I'm saying is that (assuming my understanding of RestoreArchivedFile is
> correct) we can't just do that in the current restore_command. We do need a
> way to ask the archive for some metadata/checksums, and restore_command is
> too late.

Yeah, I see what you mean there.  An alternative might be to *not*
unlink the file, in case the restore command wants to check if it's
valid and matches what's in the archive, and instead restore the
requested file somewhere else and then perform an unlink/mv after
the restore command has been run.

I'm still not entirely sure that this is a particularly good idea, but
it would at least avoid having another command for a user to have to
configure.
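
As a rough sketch of that ordering (fetch under a temporary name,
validate, and only then replace the original), assuming a hypothetical
archive-fetch helper and a caller-supplied validation check:

import os, subprocess, tempfile

def restore_then_swap(segname, pg_wal_dir, is_valid):
    # Fetch the segment under a temporary name in the same directory, so
    # the existing file stays available for comparison until the end.
    final_path = os.path.join(pg_wal_dir, segname)
    fd, tmp_path = tempfile.mkstemp(prefix=segname + ".tmp.", dir=pg_wal_dir)
    os.close(fd)
    try:
        subprocess.run(["archive-fetch", segname, tmp_path], check=True)
        if not is_valid(tmp_path):
            raise ValueError("restored copy failed validation")
        # Atomic on POSIX; the old file is only unlinked at this point.
        os.replace(tmp_path, final_path)
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)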

Thanks!

Stephen




Re: Standby trying "restore_command" before local WAL

2018-08-07 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 08/06/2018 09:32 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >>On 08/06/2018 06:11 PM, Stephen Frost wrote:
> >WAL checksums are per WAL record, not across the whole file...  And,
> >yes, this seems like something we could probably write code to ensure
> >we're doing correctly, possibly even without changing the existing WAL
> >format or maybe we would have to, but either way, seems like additional
> >code that would need to be written and some serious thought put into it
> >to make sure it's correct.  I'm all for that, just to be clear, but what
> >I don't think we can do is move forward with a change to just prefer
> >pg_wal over restore command.
> 
> While the checksums are per-record, the record also contains xl_prev, so
> it's effectively a chain of checksums covering the whole file. The other
> thing you can do is look at the first record of the next segment and use the
> xl_prev to detect if the previous segment was not partial.
> 
> But I do agree doing this properly may require writing some new code and
> checking that those cases are detected correctly.

Right, might be possible but isn't something we necessarily guarantee
will always work correctly today in this situation where we've got old
WAL files that were pulled over a period of time (imagine that we copy
WAL file ABC before PG was done with it, but we don't get around to
copying WAL file DEF until much later after ABC has been completed and
DEF is only partial, or other such scenarios).  Remember, the scenario
we're talking about here is where someone did a pg_start_backup, took
their time copying all the files in PGDATA, including pg_wal, and then
at some later point ran pg_stop_backup.  We have no idea when those WAL
files were copied and they could have been anywhere between the
pg_start_backup point and the pg_stop_backup point.

> (Note: There was a discussion about replacing xl_prev with LSN of the
> current record, and IMHO that would work just as well, although it might be
> a bit more expensive for some operations.)

I haven't thought very much about this so don't have much of an opinion
on it one way or the other at this point.

> >CRC's are per WAL record, and possibly some WAL records might not be ok
> >to replay, or at least we need to make sure that we replay the right set
> >of WAL in the right order even when there are partial WAL files being
> >given to PG (that aren't named that way...).  The more I think about
> >this, I think we really need to avoid partial WAL files entirely- what
> >are we going to do when we get to the end of one?  We'd need to request
> >the full one from the restore command anyway, so seems like we should
> >just go ahead and get it from the archive, the question is if there's an
> >easy/cheap way to detect partial WAL files in pg_wal.
> 
> As explained above, I don't think this is actually a problem. The checksums
> do cover the whole file thanks to chaining, and there are ways to detect
> partial segments. IMHO it's fine if we replay a segment and then find out it
> was partial and that we need to fetch it from archive anyway and re-apply it
> - it should not be very common case, except when the user does something
> silly.

As long as we *do* go off and try to fetch that WAL file and replay it,
and don't assume that the end of that partial WAL file means the end of
WAL replay, then I think you may be right and that it'd be fine, but it
does seem a bit risky to me.

> >As I mention above, seems like what we should really be doing is having
> >a way to know when a WAL file is full but still in pg_wal for some
> >reason (hasn't been replayed yet due to intentional lag on the replica, or
> >unintentional lag on the replica, etc), and perhaps we even only do that
> >on replicas or have tools like pg_basebackup and pgbackrest do that when
> >they're doing a restore, so that it's clear to PG that all these files
> >are full WAL and they can replay them completely.
> 
> >IMHO we can deduce it from the first xl_prev of the next segment (assuming we
> have the next segment available locally, which we likely have except for the
> last one).

See above for why I don't think it's actually that simple..

> >I was actually just thinking about having pgbackrest do exactly that. :)
> >We already have the full sha256 checksum of every WAL file in the
> >pgbackrest repository, it seems like it wouldn't be too hard to
> >calculate the checksum of the files in pg_wal and ask the repo what the
> >checksums are for those WAL files and then use the files in pg_wal if
> >they checksums match.  I can already imagine the argument coming back
> >though- that'd require additional testing to ensure it's done correctly
> >and I'm really not sure that it'd end up being all that much better
> >except in some very special circumstances where there's a low bandwidth
> >connection to the repo and often a lot of WAL left over in 

Re: Standby trying "restore_command" before local WAL

2018-08-07 Thread Tomas Vondra




On 08/06/2018 09:32 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 08/06/2018 06:11 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>>>> On 08/06/2018 05:19 PM, Stephen Frost wrote:
>>>>> * David Steele (da...@pgmasters.net) wrote:
>>>>>> I think for the stated scenario (known good standby that has been
>>>>>> shutdown gracefully) it makes perfect sense to trust the contents of
>>>>>> pg_wal.  Call this scenario #1.
>>>>>>
>>>>>> An alternate scenario (#2) is that the data directory was copied using a
>>>>>> basic copy tool and the pg_wal directory was not excluded from the copy.
>>>>>> This means the contents of pg_wal will be in an inconsistent state.
>>>>>> The files that are there might be partials (not with the extension,
>>>>>> though) and you can easily have multiple partials.  You will almost
>>>>>> certainly not have everything you need to get to consistency.
>>>>
>>>> Yeah. But as Simon said, we do have fairly strong protections against
>>>> applying corrupted WAL - every record is CRC-checked. So why not
>>>> fall back to the restore_command only if the locally available WAL is
>>>> not fully consistent?
>>>
>>> "Corrupted" doesn't necessarily only mean "the data file was munged by
>>> the storage somehow."  In this case, corrupted could be an old and only
>>> partial WAL file, in which case we'd possibly be missing WAL that needs
>>> to be replayed to bring the cluster back to a valid state, no?
>>
>> Why wouldn't that be detected by checksums? Also, why wouldn't this be
>> handled correctly by the logic I proposed, i.e. falling back to remote
>> WAL segment if the file is incomplete/broken in some sense?
>
> WAL checksums are per WAL record, not across the whole file...  And,
> yes, this seems like something we could probably write code to ensure
> we're doing correctly, possibly even without changing the existing WAL
> format or maybe we would have to, but either way, seems like additional
> code that would need to be written and some serious thought put into it
> to make sure it's correct.  I'm all for that, just to be clear, but what
> I don't think we can do is move forward with a change to just prefer
> pg_wal over restore command.


While the checksums are per-record, the record also contains xl_prev, so
it's effectively a chain of checksums covering the whole file. The other
thing you can do is look at the first record of the next segment and use
the xl_prev to detect if the previous segment was not partial.

But I do agree doing this properly may require writing some new code and
checking that those cases are detected correctly.

(Note: There was a discussion about replacing xl_prev with the LSN of the
current record, and IMHO that would work just as well, although it might
be a bit more expensive for some operations.)
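
A sketch of that detection in Python, assuming the page and record
header layout of PostgreSQL 11 (little-endian); it only looks at the
first page of the next segment, unlike the backend's xlogreader:

import struct

XLP_FIRST_IS_CONTRECORD = 0x0001
# Layouts as in PostgreSQL 11, little-endian; a sketch, not a substitute
# for the backend's xlogreader.
LONG_PAGE_HDR = struct.Struct("<HHIQI4xQII")  # XLogLongPageHeaderData, 40 bytes
XLOG_RECORD   = struct.Struct("<IIQBB2xI")    # XLogRecord header, 24 bytes

def probe_next_segment(next_seg_path):
    """Inspect the first page of segment N+1.  Returns (complete, xl_prev):
    complete=True means segment N must have been filled to the boundary;
    otherwise xl_prev is the start LSN of N's last record, which the caller
    can verify against segment N itself."""
    with open(next_seg_path, "rb") as f:
        page = f.read(LONG_PAGE_HDR.size + XLOG_RECORD.size)
    if len(page) < LONG_PAGE_HDR.size + XLOG_RECORD.size:
        return False, None
    (magic, info, tli, pageaddr, rem_len,
     sysid, seg_size, blcksz) = LONG_PAGE_HDR.unpack(page[:LONG_PAGE_HDR.size])
    if info & XLP_FIRST_IS_CONTRECORD:
        # The last record of segment N continues into this file, so N was
        # written all the way to its end.
        return True, None
    tot_len, xid, xl_prev, xl_info, rmid, crc = XLOG_RECORD.unpack(
        page[LONG_PAGE_HDR.size:])
    if tot_len == 0:
        return False, None  # zeroed tail: no first record in this file yet
    return False, xl_prev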



>>>>>> But there's another good scenario (#3): where the pg_wal directory was
>>>>>> preloaded with all the WAL required to make the cluster consistent or
>>>>>> all the WAL that was available at restore time.  In this case, it would
>>>>>> make sense to prefer the contents of pg_wal and only switch to
>>>>>> restore_command after that has been exhausted.
>>>>>>
>>>>>> So, the choice of whether to prefer locally-stored or
>>>>>> restore_command-fetched WAL is context-dependent, in my mind.
>>>>>
>>>>> Agreed.
>>>>
>>>> Maybe, not sure.
>>>
>>> The argument that David makes above in scenario #2 certainly looks
>>> entirely likely to me and I don't think we've got any real protections
>>> against that.  The current common use-cases happen to work around the
>>> risk because tools like pg_basebackup ignore the existing pg_wal
>>> directory when doing the backup and instead populate it with exactly the
>>> correct WAL that's needed, and in cases where a restore command is
>>> specified will always pull back only valid WAL, but I don't think we can
>>> decide that this scenario (#2 from above):
>>
>> I'm probably missing something, but why couldn't we detect that using
>> CRC. Or make sure we can detect that, e.g. by adding some additional
>> info into each WAL segment?
>
> CRC's are per WAL record, and possibly some WAL records might not be ok
> to replay, or at least we need to make sure that we replay the right set
> of WAL in the right order even when there are partial WAL files being
> given to PG (that aren't named that way...).  The more I think about
> this, I think we really need to avoid partial WAL files entirely- what
> are we going to do when we get to the end of one?  We'd need to request
> the full one from the restore command anyway, so seems like we should
> just go ahead and get it from the archive, the question is if there's an
> easy/cheap way to detect partial WAL files in pg_wal.

As explained above, I don't think this is actually a problem. The
checksums do cover the whole file thanks to chaining, and there are ways
to detect partial segments. IMHO it's fine if we replay a segment and
then find out it was partial and that we need to fetch it from archive
anyway and re-apply it - it should not be a very common case, except
when the user does something silly.


>>>>>> Ideally we could have a default that is safe in each scenario with

Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 08/06/2018 06:11 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >>On 08/06/2018 05:19 PM, Stephen Frost wrote:
> >>>* David Steele (da...@pgmasters.net) wrote:
> >>>>I think for the stated scenario (known good standby that has been
> >>>>shutdown gracefully) it makes perfect sense to trust the contents of
> >>>>pg_wal.  Call this scenario #1.
> >>>>
> >>>>An alternate scenario (#2) is that the data directory was copied using a
> >>>>basic copy tool and the pg_wal directory was not excluded from the copy.
> >>>>This means the contents of pg_wal will be in an inconsistent state.
> >>>>The files that are there might be partials (not with the extension,
> >>>>though) and you can easily have multiple partials.  You will almost
> >>>>certainly not have everything you need to get to consistency.
> >>
> >>Yeah. But as Simon said, we do have fairly strong protections against applying
> >>corrupted WAL - every record is CRC-checked. So why not fall back to the
> >>restore_command only if the locally available WAL is not fully consistent?
> >
> >"Corrupted" doesn't necessarily only mean "the data file was munged by
> >the storage somehow."  In this case, corrupted could be an old and only
> >partial WAL file, in which case we'd possibly be missing WAL that needs
> >to be replayed to bring the cluster back to a valid state, no?
> 
> Why wouldn't that be detected by checksums? Also, why wouldn't this be
> handled correctly by the logic I proposed, i.e. falling back to remote WAL
> segment if the file is incomplete/broken in some sense?

WAL checksums are per WAL record, not across the whole file...  And,
yes, this seems like something we could probably write code to ensure
we're doing correctly, possibly even without changing the existing WAL
format or maybe we would have to, but either way, seems like additional
code that would need to be written and some serious thought put into it
to make sure it's correct.  I'm all for that, just to be clear, but what
I don't think we can do is move forward with a change to just prefer
pg_wal over restore command.

> >>>>But there's another good scenario (#3): where the pg_wal directory was
> >>>>preloaded with all the WAL required to make the cluster consistent or
> >>>>all the WAL that was available at restore time.  In this case, it would
> >>>>make sense to prefer the contents of pg_wal and only switch to
> >>>>restore_command after that has been exhausted.
> >>>>
> >>>>So, the choice of whether to prefer locally-stored or
> >>>>restore_command-fetched WAL is context-dependent, in my mind.
> >>>
> >>>Agreed.
> >>
> >>Maybe, not sure.
> >
> >The argument that David makes above in scenario #2 certainly looks
> >entirely likely to me and I don't think we've got any real protections
> >against that.  The current common use-cases happen to work around the
> >risk because tools like pg_basebackup ignore the existing pg_wal
> >directory when doing the backup and instead populate it with exactly the
> >correct WAL that's needed, and in cases where a restore command is
> >specified will always pull back only valid WAL, but I don't think we can
> >decide that this scenario (#2 from above):
> 
> I'm probably missing something, but why couldn't we detect that using CRC.
> Or make sure we can detect that, e.g. by adding some additional info into
> each WAL segment?

CRC's are per WAL record, and possibly some WAL records might not be ok
to replay, or at least we need to make sure that we replay the right set
of WAL in the right order even when there are partial WAL files being
given to PG (that aren't named that way...).  The more I think about
this, I think we really need to avoid partial WAL files entirely- what
are we going to do when we get to the end of one?  We'd need to request
the full one from the restore command anyway, so seems like we should
just go ahead and get it from the archive, the question is if there's an
easy/cheap way to detect partial WAL files in pg_wal.

> >>>>Ideally we could have a default that is safe in each scenario with
> >>>>perhaps an override if the user knows better.  Scenario #1 would allow
> >>>>WAL to be read from pg_wal by default, scenario #2 would prefer fetched
> >>>>WAL, and scenario #3 could use a GUC to override the default fetch
> >>>>behavior.
> >>>
> >>>Not sure how we'd be able to automatically realize which scenario we're
> >>>in though..?
> >>
> >>But do we need to know it? I mean, can't we try the local WAL first, use it
> >>if it passes the CRC checks (and possibly some other checks), and only
> >>fall back to the remote WAL if it's identified as broken?
> >
> >Maybe- but I think we need to be quite sure about that and I don't
> >believe that just checking the CRCs is enough.
> 
> Sure. So what else would we need to add to each WAL segment to make that
> possible? It needs to be cheap enough not to cause perf regression, but 

Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Tomas Vondra




On 08/06/2018 06:11 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 08/06/2018 05:19 PM, Stephen Frost wrote:
>>> * David Steele (da...@pgmasters.net) wrote:
>>>> I think for the stated scenario (known good standby that has been
>>>> shutdown gracefully) it makes perfect sense to trust the contents of
>>>> pg_wal.  Call this scenario #1.
>>>>
>>>> An alternate scenario (#2) is that the data directory was copied using a
>>>> basic copy tool and the pg_wal directory was not excluded from the copy.
>>>> This means the contents of pg_wal will be in an inconsistent state.
>>>> The files that are there might be partials (not with the extension,
>>>> though) and you can easily have multiple partials.  You will almost
>>>> certainly not have everything you need to get to consistency.
>>
>> Yeah. But as Simon said, we do have fairly strong protections against
>> applying corrupted WAL - every record is CRC-checked. So why not
>> fall back to the restore_command only if the locally available WAL is
>> not fully consistent?
>
> "Corrupted" doesn't necessarily only mean "the data file was munged by
> the storage somehow."  In this case, corrupted could be an old and only
> partial WAL file, in which case we'd possibly be missing WAL that needs
> to be replayed to bring the cluster back to a valid state, no?

Why wouldn't that be detected by checksums? Also, why wouldn't this be
handled correctly by the logic I proposed, i.e. falling back to remote
WAL segment if the file is incomplete/broken in some sense?



>>>> But there's another good scenario (#3): where the pg_wal directory was
>>>> preloaded with all the WAL required to make the cluster consistent or
>>>> all the WAL that was available at restore time.  In this case, it would
>>>> make sense to prefer the contents of pg_wal and only switch to
>>>> restore_command after that has been exhausted.
>>>>
>>>> So, the choice of whether to prefer locally-stored or
>>>> restore_command-fetched WAL is context-dependent, in my mind.
>>>
>>> Agreed.
>>
>> Maybe, not sure.
>
> The argument that David makes above in scenario #2 certainly looks
> entirely likely to me and I don't think we've got any real protections
> against that.  The current common use-cases happen to work around the
> risk because tools like pg_basebackup ignore the existing pg_wal
> directory when doing the backup and instead populate it with exactly the
> correct WAL that's needed, and in cases where a restore command is
> specified will always pull back only valid WAL, but I don't think we can
> decide that this scenario (#2 from above):

I'm probably missing something, but why couldn't we detect that using
CRC. Or make sure we can detect that, e.g. by adding some additional
info into each WAL segment?

>>>> Ideally we could have a default that is safe in each scenario with
>>>> perhaps an override if the user knows better.  Scenario #1 would allow
>>>> WAL to be read from pg_wal by default, scenario #2 would prefer fetched
>>>> WAL, and scenario #3 could use a GUC to override the default fetch behavior.
>>>
>>> Not sure how we'd be able to automatically realize which scenario we're
>>> in though..?
>>
>> But do we need to know it? I mean, can't we try the local WAL first, use it
>> if it passes the CRC checks (and possibly some other checks), and only
>> fall back to the remote WAL if it's identified as broken?
>
> Maybe- but I think we need to be quite sure about that and I don't
> believe that just checking the CRCs is enough.



Sure. So what else would we need to add to each WAL segment to make that
possible? It needs to be cheap enough not to cause a perf regression, but
e.g. a CRC of each segment + the amount of data actually stored in it
should be sufficient.

Another idea - what if we instead allow fetching additional information
about the archived WAL, and make the decision based on that? For example,
before restore_command we could request an md5 checksum from the archive,
compare it to the local WAL, and if they match then use the local one.
Otherwise request the remote one. That'd still be a win.
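
A minimal sketch of that round-trip in Python, with an archive-checksum
command standing in for whatever interface the archive would actually
expose:

import hashlib, subprocess

def local_segment_matches_archive(segname, local_path):
    # Ask the archive for its md5 of the segment; the "archive-checksum"
    # command is hypothetical.
    remote = subprocess.run(["archive-checksum", "--md5", segname],
                            capture_output=True, text=True)
    if remote.returncode != 0:
        return False                   # archive can't answer: fetch remotely
    h = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest() == remote.stdout.strip()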



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Jaime Casanova
On Mon, 6 Aug 2018 at 11:01, Stephen Frost  wrote:
>
> > What about the following cases?
> > 1. replica host crashed, and in pg_wal we have a few thousands WAL files.
>
> If this is the case then the replica was very far behind on replay,
> presumably, and in some of those cases rebuilding the replica might
> very well be faster than replaying all of that WAL.  This case does
> sound like it should be alright though.
>

it could also be a delayed standby, and in that case the replica will
have lots of valid WALs (apply delayed on purpose, and no longer on the
master); restarting from the archive in that case is a poor
solution...

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 08/06/2018 05:19 PM, Stephen Frost wrote:
> >* David Steele (da...@pgmasters.net) wrote:
> >>I think for the stated scenario (known good standby that has been
> >>shutdown gracefully) it makes perfect sense to trust the contents of
> >>pg_wal.  Call this scenario #1.
> >>
> >>An alternate scenario (#2) is that the data directory was copied using a
> >>basic copy tool and the pg_wal directory was not excluded from the copy.
> >>  This means the contents of pg_wal will be in an inconsistent state.
> >>The files that are there might be partials (not with the extension,
> >>though) and you can easily have multiple partials.  You will almost
> >>certainly not have everything you need to get to consistency.
> 
> Yeah. But as Simon said, we do have fairly strong protections against applying
> corrupted WAL - every record is CRC-checked. So why not fall back to the
> restore_command only if the locally available WAL is not fully consistent?

"Corrupted" doesn't necessairly only mean "the data file was munged by
the storage somehow."  In this case, corrupted could be an old and only
partial WAL file, in which case we'd possibly be missing WAL that needs
to be replayed to bring the cluster back to a valid state, no?

> >>But there's another good scenario (#3): where the pg_wal directory was
> >>preloaded with all the WAL required to make the cluster consistent or
> >>all the WAL that was available at restore time.  In this case, it would
> >>make sense to prefer the contents of pg_wal and only switch to
> >>restore_command after that has been exhausted.
> >>
> >>So, the choice of whether to prefer locally-stored or
> >>restore_command-fetched WAL is context-dependent, in my mind.
> >
> >Agreed.
> 
> Maybe, not sure.

The argument that David makes above in scenario #2 certainly looks
entirely likely to me and I don't think we've got any real protections
against that.  The current common use-cases happen to work around the
risk because tools like pg_basebackup ignore the existing pg_wal
directory when doing the backup and instead populate it with exactly the
correct WAL that's needed, and in cases where a restore command is
specified will always pull back only valid WAL, but I don't think we can
decide that this scenario (#2 from above):

#
pg_start_backup
copy all files (including pg_wal)
pg_stop_backup
have a recovery.conf with a valid and correct restore_command set
start PG
#

isn't something to worry about.  Maybe I'm all wet and changing our
existing preference to read from pg_wal first and only then go to
restore_command will still work properly with this scenario, but I don't
think that's the case.  I've not tested it though, just going off of
what I recall and what I understood those comments to be talking about.

If the above pseudo-code works just fine with preferring pg_wal over
restore_command, then fine, let's just make that change.  If it doesn't
though, then I don't think we can make that change and instead we need
to either figure out some way to determine what's acceptable to pull
from pg_wal and what we have to ask restore_command for, or ask the user
to explicitly tell us if we can use pg_wal first.

> >>Ideally we could have a default that is safe in each scenario with
> >>perhaps an override if the user knows better.  Scenario #1 would allow
> >>WAL to be read from pg_wal by default, scenario #2 would prefer fetched
> >>WAL, and scenario #3 could use a GUC to override the default fetch behavior.
> >
> >Not sure how we'd be able to automatically realize which scenario we're
> >in though..?
> 
> But do we need to know it? I mean, can't we try the local WAL first, use it
> if it passes the CRC checks (and possibly some other checks), and only
> fall back to the remote WAL if it's identified as broken?

Maybe- but I think we need to be quite sure about that and I don't
believe that just checking the CRCs is enough.

Thanks!

Stephen




Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Stephen Frost
Greetings,

* Alexander Kukushkin (cyberd...@gmail.com) wrote:
> 2018-07-31 20:25 GMT+02:00 Stephen Frost :
> > There's still a question here, at least from my perspective, as to which
> > is actually going to be faster to perform recovery based off of.  A good
> > restore command, which pre-fetches the WAL in parallel and gets it local
> > and on the same filesystem, meaning that the restore_command only has to
> > execute essentially a 'mv' and return back to PG for the next WAL file,
> > is really rather fast, compared to streaming that same data over the
> > network with a single TCP connection to the primary.  Of course, there's
> > a lot of variables there and it depends on the network speed between the
> > various pieces, but I've certainly had cases where a replica catches up
> > much faster using restore command than streaming from the primary.
> 
> Sure, mv is incredibly fast, but not calling external script/binary at
> all is still faster than calling it.

I don't believe I was disputing that, apologies if it came across that
way.  Certainly, reading files directly without going through restore
command is going to be faster than having to call restore command.  The
point I was attempting to make is that using restore command might be
(and in some cases, certainly is) faster than streaming from a primary.

> What about the following cases?
> 1. replica host crashed, and in pg_wal we have a few thousands WAL files.

If this is the case then the replica was very far behind on replay,
presumably, and in some of those cases rebuilding the replica might
very well be faster than replaying all of that WAL.  This case does
sound like it should be alright though.

> 2. we are creating a new replica with pg_basebackup -X stream, it
> takes a long time and again leaves a few thousands WAL files.

This is certainly typical and also should be a safe case and therefore
seems like a good case where we'd want to be able to tell the system to
use what's in pg_wal first- perhaps that could be an option in
recovery.conf which pg_basebackup and other tools that are managing the
pg_wal directory and ensuring that all the WAL there is valid would be
able to write into the recovery.conf.

> In both cases, if there is no restore_command in the recovery.conf,
> postgres will happily read WAL files from pg_wal and only when there
> is nothing left it will try to start streaming.
> 
> But, if restore_command is defined, it will always call the
> restore_command, for every single WAL file it wants to restore.
> If the restore_command exits with a non-zero exit code, postgres is
> happily restoring the file from pg_wal!
> And, only if the file is not there or not valid, postgres is trying to
> start streaming.

Yeah, I have to agree that it's not great that we don't seem to be
entirely consistent here, as Robert pointed out up-thread.

> From my point of view, there is no difference between having no
> restore_command and relying only on streaming replication and having
> a restore_command which always fails.
> Therefore I don't really understand why we stick to the
> "restore_command => pg_wal => streaming" and why it is not possible to
> change it to "pg_wal => restore_command => streaming" or maybe even
> (pg_wal => streaming => restore_command).

I don't think I disagreed anywhere about having the option.  There's a
good point to be made that if we can figure out what the right thing to
do is then we should just do that instead of having an option for it.

If there's any case where the pg_wal directory might have invalid WAL
to be replayed over top of the current cluster, though, then we
shouldn't just be using that WAL and instead should be asking the user
to let us know if the WAL is ok to use.  If we can know when the WAL is
invalid and ignore using it in those cases, then we should just go ahead
and do that, but I'm unconvinced that's actually the case in a situation
such as what David Steele described in his scenario #2.

> I am not sure about the last option, but in any case, before going to
> some remote place, postgres should try to find (and try to replay) the
> WAL file in the pg_wal.

Only if we know that it's valid to be replayed over the current cluster.

Thanks!

Stephen




Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Tomas Vondra

On 08/06/2018 05:19 PM, Stephen Frost wrote:
> Greetings,
>
> * David Steele (da...@pgmasters.net) wrote:
>> I think for the stated scenario (known good standby that has been
>> shutdown gracefully) it makes perfect sense to trust the contents of
>> pg_wal.  Call this scenario #1.
>>
>> An alternate scenario (#2) is that the data directory was copied using a
>> basic copy tool and the pg_wal directory was not excluded from the copy.
>> This means the contents of pg_wal will be in an inconsistent state.
>> The files that are there might be partials (not with the extension,
>> though) and you can easily have multiple partials.  You will almost
>> certainly not have everything you need to get to consistency.

Yeah. But as Simon said, we do have fairly strong protections against
applying corrupted WAL - every record is CRC-checked. So why not
fall back to the restore_command only if the locally available WAL is
not fully consistent?

>> But there's another good scenario (#3): where the pg_wal directory was
>> preloaded with all the WAL required to make the cluster consistent or
>> all the WAL that was available at restore time.  In this case, it would
>> make sense to prefer the contents of pg_wal and only switch to
>> restore_command after that has been exhausted.
>>
>> So, the choice of whether to prefer locally-stored or
>> restore_command-fetched WAL is context-dependent, in my mind.
>
> Agreed.

Maybe, not sure.

>> Ideally we could have a default that is safe in each scenario with
>> perhaps an override if the user knows better.  Scenario #1 would allow
>> WAL to be read from pg_wal by default, scenario #2 would prefer fetched
>> WAL, and scenario #3 could use a GUC to override the default fetch behavior.
>
> Not sure how we'd be able to automatically realize which scenario we're
> in though..?

But do we need to know it? I mean, can't we try the local WAL first, use
it if it passes the CRC checks (and possibly some other checks), and
only fall back to the remote WAL if it's identified as broken?



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Standby trying "restore_command" before local WAL

2018-08-06 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> I think for the stated scenario (known good standby that has been
> shutdown gracefully) it makes perfect sense to trust the contents of
> pg_wal.  Call this scenario #1.
> 
> An alternate scenario (#2) is that the data directory was copied using a
> basic copy tool and the pg_wal directory was not excluded from the copy.
>  This means the contents of pg_wal will be in an inconsistent state.
> The files that are there might be partials (not with the extension,
> though) and you can easily have multiple partials.  You will almost
> certainly not have everything you need to get to consistency.
> 
> But there's another good scenario (#3): where the pg_wal directory was
> preloaded with all the WAL required to make the cluster consistent or
> all the WAL that was available at restore time.  In this case, it would
> make sense to prefer the contents of pg_wal and only switch to
> restore_command after that has been exhausted.
> 
> So, the choice of whether to prefer locally-stored or
> restore_command-fetched WAL is context-dependent, in my mind.

Agreed.

> Ideally we could have a default that is safe in each scenario with
> perhaps an override if the user knows better.  Scenario #1 would allow
> WAL to be read from pg_wal by default, scenario #2 would prefer fetched
> WAL, and scenario #3 could use a GUC to override the default fetch behavior.

Not sure how we'd be able to automatically realize which scenario we're
in though..?

Thanks!

Stephen




Re: Standby trying "restore_command" before local WAL

2018-08-03 Thread Michael Paquier
On Tue, Jul 31, 2018 at 02:55:58PM +0200, Emre Hasegeli wrote:
> == The Workarounds ==
> 
> We can possibly work around this inside the "restore_command" or
> by delaying the archiving.  Working around inside the "restore_command"
> would involve checking whether the file exists under pg_wal/.  This
> should not be easy because the WAL file may be written partially.  It
> should be easier for Postgres to do this as it knows where to stop
> processing the local WAL.

It is also not that complicated to check if a WAL segment is properly
shaped by just running pg_waldump or such, so that would be fine for all
your cases with back-branches perhaps?
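
One wrinkle with that approach: pg_waldump typically stops with an error
even on a completely filled segment, once the last record continues into
the next file, so a check has to look at where it failed rather than
just at the exit code.  A rough heuristic in Python, assuming the
default 16MB segment size:

import re, subprocess

WAL_SEG_SIZE = 16 * 1024 * 1024            # assumes the default segment size

def seg_start_lsn(segname):
    # WAL file names are TLI(8) || log(8) || seg(8), all hex digits.
    log, seg = int(segname[8:16], 16), int(segname[16:24], 16)
    return (log << 32) + seg * WAL_SEG_SIZE

def segment_looks_complete(seg_path, segname):
    """Heuristic: pg_waldump decodes records until it hits one it cannot
    read and reports that LSN on stderr.  If it either reads the whole
    file cleanly or only trips over a record starting in the final page
    (one continuing into the next segment), treat the file as filled.
    A large record spanning the boundary can start earlier and defeat
    this, so it is only a heuristic."""
    r = subprocess.run(["pg_waldump", seg_path], capture_output=True, text=True)
    if r.returncode == 0:
        return True
    m = re.search(r"error in WAL record at ([0-9A-Fa-f]+)/([0-9A-Fa-f]+)",
                  r.stderr)
    if not m:
        return False
    fail_lsn = (int(m.group(1), 16) << 32) | int(m.group(2), 16)
    last_page = seg_start_lsn(segname) + WAL_SEG_SIZE - 8192
    return fail_lsn >= last_page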

> == The Change ==
> 
> This "restore_command" behavior is coming from the initial archiving
> and point-in-time-recovery implementation [2].   The code says
> "the reason is that the file in XLOGDIR could be an old, un-filled or
> partly-filled version that was copied and restored as part of
> backing up $PGDATA."  This was probably a good reason in 2004, but
> I don't think it still is.  AFAIK "pg_basebackup" eliminates this
> problem.

pg_basebackup is not the only backup solution, and though I'd like
folks to use it more, it can be a bottleneck and still comes with its
own limitations, for example when streaming tar data with multiple
tablespaces...

> Also, with this reasoning, we should also try streaming from the
> master before trying the local WAL, but AFAIU we don't.

... You have a point here, things are rather inconsistent by this
argument.  I have not worked on that in details, but at least
WaitForWALToBecomeAvailable() which enforces XLOG_FROM_ARCHIVE when the
current source is XLOG_FROM_PG_WAL would need to be changed.
--
Michael




Re: Standby trying "restore_command" before local WAL

2018-08-03 Thread David Steele
On 8/2/18 4:08 PM, Robert Haas wrote:
> On Wed, Aug 1, 2018 at 7:14 AM, Emre Hasegeli  wrote:
>>> There's still a question here, at least from my perspective, as to which
>>> is actually going to be faster to perform recovery based off of.  A good
>>> restore command, which pre-fetches the WAL in parallel and gets it local
>>> and on the same filesystem, meaning that the restore_command only has to
>>> execute essentially a 'mv' and return back to PG for the next WAL file,
>>> is really rather fast, compared to streaming that same data over the
>>> network with a single TCP connection to the primary.  Of course, there's
>>> a lot of variables there and it depends on the network speed between the
>>> various pieces, but I've certainly had cases where a replica catches up
>>> much faster using restore command than streaming from the primary.
>>
>> Trying "restore_command" before streaming replication is totally fine.
>> It is not likely that the same WAL would be in both places anyway.
>>
>> My problem is trying "restore_command" before the local WAL.  I
>> understand the historic reason of this design, but I don't think it is
>> expected behavior to anybody who is using "restore_command" together
>> with streaming replication.
> 
> Right.  I don't really understand the argument that this should be
> controlled by a GUC.  I could see having a GUC to choose between
> archiving-first and streaming-first, but if it's safe to use the WAL
> we've already got in pg_xlog, it seems like that should take priority
> over every other approach.  The comments lend themselves to a certain
> amount of doubt over whether we can actually trust the contents of
> pg_xlog, but if we can't, it seems like we just shouldn't use it - at
> all - ever.  It doesn't make much sense to me to say "hey, pg_xlog
> might have evil bad WAL in it that we shouldn't replay, so let's look
> for the same WAL elsewhere first, but then if we don't find it, we'll
> just replay the bad stuff."

I think for the stated scenario (known good standby that has been
shutdown gracefully) it makes perfect sense to trust the contents of
pg_wal.  Call this scenario #1.

An alternate scenario (#2) is that the data directory was copied using a
basic copy tool and the pg_wal directory was not excluded from the copy.
This means the contents of pg_wal will be in an inconsistent state.
The files that are there might be partials (not with the extension,
though) and you can easily have multiple partials.  You will almost
certainly not have everything you need to get to consistency.

But there's another good scenario (#3): where the pg_wal directory was
preloaded with all the WAL required to make the cluster consistent or
all the WAL that was available at restore time.  In this case, it would
make sense to prefer the contents of pg_wal and only switch to
restore_command after that has been exhausted.

So, the choice of whether to prefer locally-stored or
restore_command-fetched WAL is context-dependent, in my mind.

Ideally we could have a default that is safe in each scenario with
perhaps an override if the user knows better.  Scenario #1 would allow
WAL to be read from pg_wal by default, scenario #2 would prefer fetched
WAL, and scenario #3 could use a GUC to override the default fetch behavior.

Regards,
-- 
-David
da...@pgmasters.net



Re: Standby trying "restore_command" before local WAL

2018-08-03 Thread Simon Riggs
On 2 August 2018 at 21:08, Robert Haas  wrote:
> On Wed, Aug 1, 2018 at 7:14 AM, Emre Hasegeli  wrote:
>>> There's still a question here, at least from my perspective, as to which
>>> is actually going to be faster to perform recovery based off of.  A good
>>> restore command, which pre-fetches the WAL in parallel and gets it local
>>> and on the same filesystem, meaning that the restore_command only has to
>>> execute essentially a 'mv' and return back to PG for the next WAL file,
>>> is really rather fast, compared to streaming that same data over the
>>> network with a single TCP connection to the primary.  Of course, there's
>>> a lot of variables there and it depends on the network speed between the
>>> various pieces, but I've certainly had cases where a replica catches up
>>> much faster using restore command than streaming from the primary.
>>
>> Trying "restore_command" before streaming replication is totally fine.
>> It is not likely that the same WAL would be in both places anyway.
>>
>> My problem is trying "restore_command" before the local WAL.  I
>> understand the historic reason of this design, but I don't think it is
>> expected behavior to anybody who is using "restore_command" together
>> with streaming replication.
>
> Right.  I don't really understand the argument that this should be
> controlled by a GUC.  I could see having a GUC to choose between
> archiving-first and streaming-first, but if it's safe to use the WAL
> we've already got in pg_xlog, it seems like that should take priority
> over every other approach.  The comments lend themselves to a certain
> amount of doubt over whether we can actually trust the contents of
> pg_xlog, but if we can't, it seems like we just shouldn't use it - at
> all - ever.  It doesn't make much sense to me to say "hey, pg_xlog
> might have evil bad WAL in it that we shouldn't replay, so let's look
> for the same WAL elsewhere first, but then if we don't find it, we'll
> just replay the bad stuff."  I might be missing something, but that
> sounds a lot like "hey, this mysterious gloop I found might be rat
> poison, so let me go check if there's some fruit in the fruit bowl,
> but if I don't find any, I'm just going to eat the mysterious gloop."

The existing mechanism is designed to recover from data loss and looks
to still be a safe default.

If you have some data corruption somewhere you will want to trust the
archive copy rather than the pg_wal copy. If you don't have a copy in
the archive and the *only* copy you have is in pg_wal then we attempt
to trust it, bearing in mind each WAL record is CRC checked and
unlikely to pass if there is noticeable corruption. If you have strong
doubts about the contents of pg_wal, then you can simply delete the
files from pg_wal before you start, so you do effectively have a level
of control of how much you trust the files there.

That default wasn't ever changed when we introduced streaming, hence
the complaint.

I guess what we could say is, if the user has both streaming and
restore_command configured, then trust pg_wal. That would solve the
problem without a new parameter, but I can see cases where you might
want to still have control, but as Stephen says, a good
restore_command script will sort those cases out.

If we trust pg_wal over the archive, you still need to solve the
problem of what happens if pg_wal is behind the archive, so when you
hit end of pg_wal you would need to trap that error and flip back to
requesting any missing files, including the existing WAL file, from
the archive. That sounds like a challenge, parameter or not. A better
alternative might be to pre-scan each file in pg_wal and if it is
either not present or in some way corrupt, try to get from the
archive, so more of an exception handling situation, though still some
code.
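
Sketched in Python as a pre-scan pass, with looks_complete standing in
for some validity probe along the lines discussed elsewhere in the
thread (e.g. a pg_waldump-based check):

import os

def plan_wal_sources(pg_wal_dir, needed_segments, looks_complete):
    # Decide per segment, up front, whether to replay the pg_wal copy or
    # to go to the archive, instead of trapping the failure at replay time.
    plan = {}
    for segname in needed_segments:
        path = os.path.join(pg_wal_dir, segname)
        if os.path.exists(path) and looks_complete(path):
            plan[segname] = "pg_wal"
        else:
            plan[segname] = "archive"  # missing or suspect: use restore_command
    return plan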

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Standby trying "restore_command" before local WAL

2018-08-03 Thread Alexander Kukushkin
Hi,

2018-07-31 20:25 GMT+02:00 Stephen Frost:
>
>
> There's still a question here, at least from my perspective, as to which
> is actually going to be faster to perform recovery based off of.  A good
> restore command, which pre-fetches the WAL in parallel and gets it local
> and on the same filesystem, meaning that the restore_command only has to
> execute essentially a 'mv' and return back to PG for the next WAL file,
> is really rather fast, compared to streaming that same data over the
> network with a single TCP connection to the primary.  Of course, there's
> a lot of variables there and it depends on the network speed between the
> various pieces, but I've certainly had cases where a replica catches up
> much faster using restore command than streaming from the primary.


Sure, mv is incredibly fast, but not calling external script/binary at
all is still faster than calling it.

What about the following cases?
1. The replica host crashed, and in pg_wal we have a few thousand WAL files.
2. We are creating a new replica with pg_basebackup -X stream; it
takes a long time and again leaves a few thousand WAL files.

In both cases, if there is no restore_command in the recovery.conf,
postgres will happily read WAL files from pg_wal, and only when there
is nothing left will it try to start streaming.

But if restore_command is defined, postgres will always call the
restore_command, for every single WAL file it wants to restore.
If the restore_command exits with a non-zero exit code, postgres
happily restores the file from pg_wal!
And only if the file is not there or not valid does postgres try to
start streaming.

From my point of view, there is no difference between having no
restore_command and relying only on streaming replication, and having
a restore_command which always fails.
Therefore I don't really understand why we stick to
"restore_command => pg_wal => streaming" and why it is not possible to
change it to "pg_wal => restore_command => streaming" or maybe even
"pg_wal => streaming => restore_command".
I am not sure about the last option, but in any case, before going to
some remote place, postgres should try to find (and try to replay) the
WAL file in pg_wal.
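
To illustrate, that fallback can already be leaned on from a wrapper
script (a hypothetical sketch, assuming a copy-based archive at
/mnt/wal_archive; the partially-written-segment caveat discussed
elsewhere in the thread still applies):

    #!/bin/sh
    # decline_local.sh - exit non-zero when the requested segment is
    # already in pg_wal, so postgres falls back to its local copy
    # instead of re-downloading it; fetch from the archive otherwise.
    # Installed as: restore_command = '/usr/local/bin/decline_local.sh %f %p'
    WALFILE=$1                           # %f
    TARGET=$2                            # %p (relative to $PGDATA, the cwd)
    [ -f "pg_wal/$WALFILE" ] && exit 1   # let postgres use pg_wal
    exec cp "/mnt/wal_archive/$WALFILE" "$TARGET"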

Regards,
--
Alexander Kukushkin



Re: Standby trying "restore_command" before local WAL

2018-08-02 Thread Robert Haas
On Wed, Aug 1, 2018 at 7:14 AM, Emre Hasegeli wrote:
>> There's still a question here, at least from my perspective, as to which
>> is actually going to be faster to perform recovery based off of.  A good
>> restore command, which pre-fetches the WAL in parallel and gets it local
>> and on the same filesystem, meaning that the restore_command only has to
>> execute essentially a 'mv' and return back to PG for the next WAL file,
>> is really rather fast, compared to streaming that same data over the
>> network with a single TCP connection to the primary.  Of course, there's
>> a lot of variables there and it depends on the network speed between the
>> various pieces, but I've certainly had cases where a replica catches up
>> much faster using restore command than streaming from the primary.
>
> Trying "restore_command" before streaming replication is totally fine.
> It is not likely that the same WAL would be in both places anyway.
>
> My problem is trying "restore_command" before the local WAL.  I
> understand the historic reason for this design, but I don't think it is
> expected behavior for anybody who is using "restore_command" together
> with streaming replication.

Right.  I don't really understand the argument that this should be
controlled by a GUC.  I could see having a GUC to choose between
archiving-first and streaming-first, but if it's safe to use the WAL
we've already got in pg_xlog, it seems like that should take priority
over every other approach.  The comments lend themselves to a certain
amount of doubt over whether we can actually trust the contents of
pg_xlog, but if we can't, it seems like we just shouldn't use it - at
all - ever.  It doesn't make much sense to me to say "hey, pg_xlog
might have evil bad WAL in it that we shouldn't replay, so let's look
for the same WAL elsewhere first, but then if we don't find it, we'll
just replay the bad stuff."  I might be missing something, but that
sounds a lot like "hey, this mysterious gloop I found might be rat
poison, so let me go check if there's some fruit in the fruit bowl,
but if I don't find any, I'm just going to eat the mysterious gloop."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Standby trying "restore_command" before local WAL

2018-08-01 Thread Emre Hasegeli
> There's still a question here, at least from my perspective, as to which
> is actually going to be faster to perform recovery based off of.  A good
> restore command, which pre-fetches the WAL in parallel and gets it local
> and on the same filesystem, meaning that the restore_command only has to
> execute essentially a 'mv' and return back to PG for the next WAL file,
> is really rather fast, compared to streaming that same data over the
> network with a single TCP connection to the primary.  Of course, there's
> a lot of variables there and it depends on the network speed between the
> various pieces, but I've certainly had cases where a replica catches up
> much faster using restore command than streaming from the primary.

Trying "restore_command" before streaming replication is totally fine.
It is not likely that the same WAL would be on both places anyway.

My problem is trying "restore_command" before the local WAL.  I
understand the historic reason of this design, but I don't think it is
expected behavior to anybody who is using "restore_command" together
with streaming replication.

Of course speeding up "restore_command" is a good thing to do
independently.  Thank you for working on this.



Re: Standby trying "restore_command" before local WAL

2018-07-31 Thread Stephen Frost
Greetings,

* Sergei Kornilov (s...@zsrv.org) wrote:
> > As mentioned by others, it sounds like we could have an option to try
> > contacting the primary before running restore_command
> Why the primary?
> If we have restore_command on a slave (or during point-in-time recovery), we
> force using XLOG_FROM_ARCHIVE, even if the XLOG_FROM_PG_WAL source can
> provide the next WAL. As the xlog.c comment says [1]:

Right..

> > * We just successfully read a file in pg_wal. We prefer files in
> > * the archive over ones in pg_wal, so try the next file again
> > * from the archive first.
> 
> Do we have an actual reason why we prefer restore_command instead of
> using local WAL files first?

Yes, as discussed in the comments mentioned up-thread.

> Partially written WAL? Streaming replication can leave partially written WAL 
> and we can handle this correctly.

Sure, though even in that case there seems to be a reasonable use-case
here for an option to control whether restore_command is used to get the
next needed WAL or the primary should be asked for it first.

There's still a question here, at least from my perspective, as to which
is actually going to be faster to perform recovery based off of.  A good
restore command, which pre-fetches the WAL in parallel and gets it local
and on the same filesystem, meaning that the restore_command only has to
execute essentially a 'mv' and return back to PG for the next WAL file,
is really rather fast, compared to streaming that same data over the
network with a single TCP connection to the primary.  Of course, there's
a lot of variables there and it depends on the network speed between the
various pieces, but I've certainly had cases where a replica catches up
much faster using restore command than streaming from the primary.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Standby trying "restore_command" before local WAL

2018-07-31 Thread Sergei Kornilov
Hello

> As mentioned by others, it sounds like we could have an option to try
> contacting the primary before running restore_command
Why the primary?
If we have restore_command on a slave (or during point-in-time recovery), we
force using XLOG_FROM_ARCHIVE, even if the XLOG_FROM_PG_WAL source can provide
the next WAL. As the xlog.c comment says [1]:

> * We just successfully read a file in pg_wal. We prefer files in
> * the archive over ones in pg_wal, so try the next file again
> * from the archive first.

Do we have an actual reason why we prefer restore_command instead of using
local WAL files first?
Partially written WAL? Streaming replication can leave partially written WAL 
and we can handle this correctly.

Later (in the XLogFileReadAnyTLI caller) we change XLOG_FROM_ARCHIVE to
XLOG_FROM_ANY, but we still try XLOG_FROM_ARCHIVE first and only then look
in XLOG_FROM_PG_WAL. Or am I wrong?

regards, Sergei

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=493f1db7b97a94b882382fc3d2112634f56c86a3;hb=refs/heads/master#l12113



Re: Standby trying "restore_command" before local WAL

2018-07-31 Thread Stephen Frost
Greetings,

* Emre Hasegeli (e...@hasegeli.com) wrote:
> This issue came to our attention after we migrated an application from
> an object storage backend, and noticed that restarting a standby node
> takes hours or sometimes days.
> 
> We are using a shared WAL archive and find it practical to have
> "restore_command" configured in case we leave a standby offline
> for a long time.  However, during the short window while the standby is
> restarting, the master manages to archive a segment.  Then,
> the standby executes "restore_command" successfully, and continues
> downloading WAL from the archive, causing the recovery to take orders
> of magnitude longer.

Why is it taking so long for the restore command to return the WAL
file..?  Is the issue that the primary is generating enough WAL that
whatever you're using for the restore command isn't able to feed PG the
WAL fast enough?  We ran into exactly that problem with users of
pgbackrest and took a different approach- we parallelized pulling down
the WAL files into a local queue to be able to feed PG's replay as fast
as possible.  We've also moved to C for the parts of the archive-get
and archive-push commands that are run out of archive_command and
restore_command to make them as fast as possible (and are working to
finish rewriting the rest into C as well).
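
The queue idea is roughly the following (a toy sketch only, assuming
bash, the default 16MB segment size, no timeline switch, and a
copy-based archive at /mnt/wal_archive; pgbackrest's real
implementation is parallel, checksummed, and far more careful):

    #!/bin/bash
    # prefetch_restore.sh - toy prefetching restore_command:
    #   restore_command = '/usr/local/bin/prefetch_restore.sh %f %p'
    set -u
    WALFILE=$1 TARGET=$2
    ARCHIVE=/mnt/wal_archive       # assumed copy-based archive
    SPOOL=/var/spool/walprefetch   # local queue, same filesystem as pg_wal
    DEPTH=4                        # how many segments to fetch ahead
    mkdir -p "$SPOOL"

    # Name of the segment N slots after WALFILE; with 16MB segments there
    # are 256 segments per "log" value in the 24-hex-digit file name.
    next_seg() {
        local tli=${WALFILE:0:8} log=${WALFILE:8:8} seg=${WALFILE:16:8}
        local segno=$(( 16#$log * 256 + 16#$seg + $1 ))
        printf '%s%08X%08X' "$tli" $(( segno / 256 )) $(( segno % 256 ))
    }

    # Only prefetch for real segment names (skip .history requests), and
    # fetch to a temp name so a half-copied file is never served.
    if [[ $WALFILE =~ ^[0-9A-F]{24}$ ]]; then
        for i in $(seq 1 "$DEPTH"); do
            n=$(next_seg "$i")
            [ -e "$SPOOL/$n" ] || { cp "$ARCHIVE/$n" "$SPOOL/.$n.tmp" &&
                mv "$SPOOL/.$n.tmp" "$SPOOL/$n"; } 2>/dev/null &
        done
    fi

    # Serve the request: a local mv if already spooled, otherwise fetch
    # it directly (a non-zero exit tells postgres to try elsewhere).
    if [ -f "$SPOOL/$WALFILE" ]; then
        mv "$SPOOL/$WALFILE" "$TARGET"
    else
        cp "$ARCHIVE/$WALFILE" "$TARGET"
    fi

Once the queue is warm each foreground request costs little more than a
local mv, which is the effect described above.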

> == The Workarounds ==
> 
> We can possibly work around this inside the "restore_command" or
> by delaying the archiving.  Working around inside the "restore_command"
> would involve checking whether the file exists under pg_wal/.  This
> would not be easy because the WAL file may be partially written.  It
> should be easier for Postgres to do this as it knows where to stop
> processing the local WAL.

Yeah, handing PG a partially written WAL from the existing pg_wal
directory sounds like a rather bad idea..

> It should also be possible to work around this problem by delaying
> archiving using "wal_keep_segments", or replication slots, or simply
> with sleep().  However, none of those is the correct solution to
> the problem.  We don't need the master to keep more segments for
> the standbys.  We already have more than enough.

Right, you certainly don't want to hope that wal_keep_segments is
enough or to keep extra WAL on the primary.  You could use a replication
slot to avoid the wal_keep_segments risk, but then you're still going to
have potentially a lot of unnecessary WAL on the primary if the replica
has been offline.

> == The Change ==
> 
> This "restore_command" behavior is coming from the initial archiving
> and point-in-time-recovery implementation [2].   The code says
> "the reason is that the file in XLOGDIR could be an old, un-filled or
> partly-filled version that was copied and restored as part of
> backing up $PGDATA."  This was probably a good reason in 2004, but
> I don't think it still is.  AFAIK "pg_basebackup" eliminates this
> problem.  Also, with this reasoning, we should also try streaming from
> the master before trying the local WAL, but AFAIU we don't.

Just because pg_basebackup wouldn't necessairly create that risk doesn't
mean that the risk has gone away; pg_basebackup isn't the only way for a
replica to be brought up.

> If there is consensus on fixing this, I can try to prepare
> a patch.  The company I am currently working for is also interested
> in sponsoring a support company to fix this problem.

As mentioned by others, it sounds like we could have an option to try
contacting the primary before running restore_command (or, more
generally, an option which allows a user to specify their preference
regarding what PG should try to do first), but I'm trying to figure out
how going to the primary is going to solve your problem: if the replica
is offline for long enough, then the WAL isn't going to be available on
the primary as you outlined above, and the WAL won't be local either
(since the replica wasn't online while the primary was creating WAL..),
so the only option would be to go to the restore_command.

As such, it seems like a faster restore_command would end up being a
better and more general solution.

That said, I don't have any particular issue with a patch to allow the
user to tell PG to try asking the primary first.  Of course, that
wouldn't be available until v12.

Thanks!

Stephen


signature.asc
Description: PGP signature


Standby trying "restore_command" before local WAL

2018-07-31 Thread Emre Hasegeli
Currently the startup process tries the "restore_command" before
the WAL files available locally under pg_wal/ [1].  I believe we should
change this behavior.

== The Problem ==

This issue came to our attention after we migrated an application from
an object storage backend, and noticed that restarting a standby node
takes hours or sometimes days.

We are using a shared WAL archive and find it practical to have
"restore_command" configured in case we leave a standby offline
for a long time.  However, during the short window while the standby is
restarting, the master manages to archive a segment.  Then,
the standby executes "restore_command" successfully, and continues
downloading WAL from the archive, causing the recovery to take orders
of magnitude longer.

== The Workarounds ==

We can possibly work around this inside the "restore_command" or
by delaying the archiving.  Working around inside the "restore_command"
would involve checking whether the file exists under pg_wal/.  This
would not be easy because the WAL file may be partially written.  It
should be easier for Postgres to do this as it knows where to stop
processing the local WAL.

It should also be possible to work around this problem by delaying
archiving using "wal_keep_segments", or replication slots, or simply
with sleep().  However, none of those is the correct solution to
the problem.  We don't need the master to keep more segments for
the standbys.  We already have more than enough.

== The Change ==

This "restore_command" behavior is coming from the initial archiving
and point-in-time-recovery implementation [2].   The code says
"the reason is that the file in XLOGDIR could be an old, un-filled or
partly-filled version that was copied and restored as part of
backing up $PGDATA."  This was probably a good reason in 2004, but
I don't think it still is.  AFAIK "pg_basebackup" eliminates this
problem.  Also, with this reasoning, we should also try streaming from
the master before trying the local WAL, but AFAIU we don't.

If there is consensus on fixing this, I can try to prepare
a patch.  The company I am currently working for is also interested
in sponsoring a support company to fix this problem.

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlogarchive.c;h=5c6de4989c9a#l72

[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66ec2db7284