Re: Safe switchover

2020-07-13 Thread Jehan-Guillaume de Rorthais
On Fri, 10 Jul 2020 10:05:52 +1000
James Sewell  wrote:

> Hi all,
> 
> I’m trying to work out a procedure for a safe zero data loss switchover
> under (high) load, which allows the old master to be reconnected without
> the use of pgrewind.
> 
> Would the following be sane?
> 
> - open connection to database
> - smart shutdown master
> - terminate all other connections
> - wait for shutdown (archiving will finish)
> - Identify last archived WAL file name (ensure it’s on backup server.)
> - wait till a standby has applied this WAL
> - promote that standby
> - attach old master to new master

During graceful shutdown (smart or fast), the primary is waiting for standby to
catchup with replication before the full stop. But nothing will save you if the
designated standby disconnect by itself during the shutdown procedure.

I usually use:

- shutdown the primary
- use pg_controldata to check its last redo checkpoint 
  and/or use pg_waldump to find its shutdown checkpoint
- check logs on standby and replication status and make sure it received the
  shutdown checkpoint (pg_waldump on the standby and/or last received lsn)
- promote the standby
- setup the old primary as standby
- start the old primary as new standby.

Regards,




Re: Safe switchover

2020-07-13 Thread Stephen Frost
Greetings,

* Paul Förster (paul.foers...@gmail.com) wrote:
> > On 13. Jul, 2020, at 18:00, Stephen Frost  wrote:
> > Ah, yes, if you rebuild the replica from a backup (or from the primary),
> > then sure, that's pretty similar to the pgbackrest delta restore, except
> > that when using delta restore we're only rewriting files that have a
> > different SHA checksum after being scanned, and we're pulling from the
> > backup repo anything that's needed and not putting load on the primary.
> 
> so, from what I understand, pgbackrest bottom line merely reduces copy 
> overhead in such a particular case. *Kind of* like shutdown primary, rsync, 
> and then startup.

Yes, though unlike rsync, pgbackrest can also run in parallel.. :)

> > There's been a few discussions on -hackers about this, that'd probably
> > be the place to discuss it further..
> 
> I'm not hacker, I'm just a DBA. :-)

You're a PG user, so your input is (at least) as valuable regarding
things like this, imv. :)

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Safe switchover

2020-07-13 Thread Paul Förster
Hi Stephen,

> On 13. Jul, 2020, at 18:00, Stephen Frost  wrote:
> 
> A pgbackrest delta restore will scan the entire data directory and
> verify every file matches the last backup, or it'll replace the file
> with what was in the backup that's being used.  If there's an error
> during any of that, the restore will fail.

ok, I didn't know that. Thanks very much. I'll look into it.

> That re-validation of the entire data directory is a pretty huge
> difference compared to how pg_rewind works.

I agree.

> Ah, yes, if you rebuild the replica from a backup (or from the primary),
> then sure, that's pretty similar to the pgbackrest delta restore, except
> that when using delta restore we're only rewriting files that have a
> different SHA checksum after being scanned, and we're pulling from the
> backup repo anything that's needed and not putting load on the primary.

so, from what I understand, pgbackrest bottom line merely reduces copy overhead 
in such a particular case. *Kind of* like shutdown primary, rsync, and then 
startup.

> There's been a few discussions on -hackers about this, that'd probably
> be the place to discuss it further..

I'm not hacker, I'm just a DBA. :-)

Cheers,
Paul



Re: Safe switchover

2020-07-13 Thread Stephen Frost
Greetings,

* Paul Förster (paul.foers...@gmail.com) wrote:
> > On 13. Jul, 2020, at 17:47, Stephen Frost  wrote:
> > 
> > Sure, Patroni will handle the failover fine- but that's not what I was
> > referring to.  If the server crashes and you have no idea why or what
> > happened, I would strongly recommend against using pg_rewind to rebuild
> > it to be a replica as there's no validation happening- you might
> > failover to it much later and, if you're lucky, discover quickly that
> > some blocks had gotten corrupted or if you're unlucky not discover until
> > much later that something was corrupted when the crash happened.  Using
> > initdb -k is good, but PG is only going to check the block when it goes
> > to read it, which might not be until much later especially on a system
> > that's been rebuilt as a replica.
> 
> I see your point, yet, I'm not sure how pgbackrest could protect us from such 
> a situation.

A pgbackrest delta restore will scan the entire data directory and
verify every file matches the last backup, or it'll replace the file
with what was in the backup that's being used.  If there's an error
during any of that, the restore will fail.

That re-validation of the entire data directory is a pretty huge
difference compared to how pg_rewind works.

> > This seems like an independent question and I'm not really sure what is
> > meant here by 'reinit it with Patroni'.
> 
> reinit basically deletes the replica database cluster and triggers a new full 
> copy of the primary. You can either "patronictl reinit" or kill patroni, rm 
> -r ${PGDATA}, and start patroni. This is basically the same.

Ah, yes, if you rebuild the replica from a backup (or from the primary),
then sure, that's pretty similar to the pgbackrest delta restore, except
that when using delta restore we're only rewriting files that have a
different SHA checksum after being scanned, and we're pulling from the
backup repo anything that's needed and not putting load on the primary.

> > I agree that it'd be good to have -k on by default.
> 
> so, now, we're two. :-) Anyone else? ;-)

There's been a few discussions on -hackers about this, that'd probably
be the place to discuss it further..

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Safe switchover

2020-07-13 Thread Paul Förster
Hi Stephen,

> On 13. Jul, 2020, at 17:47, Stephen Frost  wrote:
> 
> Sure, Patroni will handle the failover fine- but that's not what I was
> referring to.  If the server crashes and you have no idea why or what
> happened, I would strongly recommend against using pg_rewind to rebuild
> it to be a replica as there's no validation happening- you might
> failover to it much later and, if you're lucky, discover quickly that
> some blocks had gotten corrupted or if you're unlucky not discover until
> much later that something was corrupted when the crash happened.  Using
> initdb -k is good, but PG is only going to check the block when it goes
> to read it, which might not be until much later especially on a system
> that's been rebuilt as a replica.

I see your point, yet, I'm not sure how pgbackrest could protect us from such a 
situation.

> This seems like an independent question and I'm not really sure what is
> meant here by 'reinit it with Patroni'.

reinit basically deletes the replica database cluster and triggers a new full 
copy of the primary. You can either "patronictl reinit" or kill patroni, rm -r 
${PGDATA}, and start patroni. This is basically the same.

> I agree that it'd be good to have -k on by default.

so, now, we're two. :-) Anyone else? ;-)

Cheers,
Paul



Re: Safe switchover

2020-07-13 Thread Stephen Frost
Greetings,

* Paul Förster (paul.foers...@gmail.com) wrote:
> > On 10. Jul, 2020, at 17:45, Stephen Frost  wrote:
> > Sure, if you know exactly why the former primary failed and have
> > confidence that nothing actually bad happened then pg_rewind can work
> > (though it's still not what I'd generally recommend).
> > 
> > If you don't actually know what happened to the former primary to cause
> > it to fail then I definitely wouldn't use pg_rewind on it since it
> > doesn't have any checks to make sure that the data is actually
> > generally consistent.  These days you could get a bit of a better
> > feeling by running pg_checksums against the data dir, but that's not
> > going to be as good as doing a pgbackrest delta restore when it comes to
> > making sure that everything is valid.
> 
> we use Netapp plus continuous archiving. To protect agains block corruption, 
> all our database clusters have been created with initdb -k. So they should 
> report block corruptions in the log.

I certainly encourage using initdb -k.

> The usual reason why a database cluster goes down is because the server is 
> shut down which initiates a switchover and is not problematic. If the server 
> goes down by a power outage, system crash or similar, then an automatic 
> failover is initiated, which, according to our experience, is also not 
> problematic. Patroni seems to handle both situations well.

Sure, Patroni will handle the failover fine- but that's not what I was
referring to.  If the server crashes and you have no idea why or what
happened, I would strongly recommend against using pg_rewind to rebuild
it to be a replica as there's no validation happening- you might
failover to it much later and, if you're lucky, discover quickly that
some blocks had gotten corrupted or if you're unlucky not discover until
much later that something was corrupted when the crash happened.  Using
initdb -k is good, but PG is only going to check the block when it goes
to read it, which might not be until much later especially on a system
that's been rebuilt as a replica.

> The worst case is, that both servers crash, which is pretty unlikely. So, the 
> worst case is that we have to perform a volume restore with Netapp and replay 
> the WAL files since that last snapshot. Should the replica database cluster 
> be damaged too, then we may need to reinit it with Patroni. This is 
> acceptable even for large database clusters because replication runs fast. 
> But the possibility is very, very small.

This seems like an independent question and I'm not really sure what is
meant here by 'reinit it with Patroni'.

> So, in my opinion, -k should be default and if one wanted to create a 
> non-checksummed database cluster, it would have to be stated on the command 
> line explicitly. This IMHO is a reasonable way to make people migrate to 
> checksums over time as database clusters are migrated.

I agree that it'd be good to have -k on by default.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Safe switchover

2020-07-10 Thread Paul Förster
Hi Stephen,

> On 10. Jul, 2020, at 17:45, Stephen Frost  wrote:
> 
> Sure, if you know exactly why the former primary failed and have
> confidence that nothing actually bad happened then pg_rewind can work
> (though it's still not what I'd generally recommend).
> 
> If you don't actually know what happened to the former primary to cause
> it to fail then I definitely wouldn't use pg_rewind on it since it
> doesn't have any checks to make sure that the data is actually
> generally consistent.  These days you could get a bit of a better
> feeling by running pg_checksums against the data dir, but that's not
> going to be as good as doing a pgbackrest delta restore when it comes to
> making sure that everything is valid.

we use Netapp plus continuous archiving. To protect agains block corruption, 
all our database clusters have been created with initdb -k. So they should 
report block corruptions in the log.

The usual reason why a database cluster goes down is because the server is shut 
down which initiates a switchover and is not problematic. If the server goes 
down by a power outage, system crash or similar, then an automatic failover is 
initiated, which, according to our experience, is also not problematic. Patroni 
seems to handle both situations well.

The worst case is, that both servers crash, which is pretty unlikely. So, the 
worst case is that we have to perform a volume restore with Netapp and replay 
the WAL files since that last snapshot. Should the replica database cluster be 
damaged too, then we may need to reinit it with Patroni. This is acceptable 
even for large database clusters because replication runs fast. But the 
possibility is very, very small.

Why the -k option of initdb isn't default anyway, is beyond me. Yes, I know the 
argument about pg_upgrade messages, which people can't seem to cope with for 
some reason, but I can't see the reasoning. If I wanted to do a pg_upgrade from 
an older non-checksummed database cluster to a new major version with 
checksums, then I'd do initdb explicitly without checksums and perform the 
upgrade. Then I would enable checksums and that's it from then on. It's a one 
time only simple command for each affected database cluster.

So, in my opinion, -k should be default and if one wanted to create a 
non-checksummed database cluster, it would have to be stated on the command 
line explicitly. This IMHO is a reasonable way to make people migrate to 
checksums over time as database clusters are migrated.

But then, that's only my opinion. There is no absolute truth.

Cheers,
Paul



Re: Safe switchover

2020-07-10 Thread Stephen Frost
Greetings,

* Paul Förster (paul.foers...@gmail.com) wrote:
> > On 10. Jul, 2020, at 17:29, Stephen Frost  wrote:
> > Patroni also has the option to use pgbackrest instead of pg_rewind.
> 
> right. Sorry, I forgot about that. We use pg_rewind which works great.

Sure, if you know exactly why the former primary failed and have
confidence that nothing actually bad happened then pg_rewind can work
(though it's still not what I'd generally recommend).

If you don't actually know what happened to the former primary to cause
it to fail then I definitely wouldn't use pg_rewind on it since it
doesn't have any checks to make sure that the data is actually
generally consistent.  These days you could get a bit of a better
feeling by running pg_checksums against the data dir, but that's not
going to be as good as doing a pgbackrest delta restore when it comes to
making sure that everything is valid.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Safe switchover

2020-07-10 Thread Paul Förster
Hi Stephen,

> On 10. Jul, 2020, at 17:29, Stephen Frost  wrote:
> 
> Patroni also has the option to use pgbackrest instead of pg_rewind.

right. Sorry, I forgot about that. We use pg_rewind which works great.

Cheers,
Paul




Re: Safe switchover

2020-07-10 Thread Stephen Frost
Greetings,

* Paul Förster (paul.foers...@gmail.com) wrote:
> I wouldn't work out the procedure myself, especially since there is a free 
> working solution already. It's dangerous if you do it yourself and make a 
> mistake.
> 
> In our company, we rely on Patroni (https://github.com/zalando/patroni). Yes, 
> it uses pg_rewind in the background but it does switchover and failover fast 
> and reliably (also for high load databases), handles timelines automatically 
> and so on.

Patroni also has the option to use pgbackrest instead of pg_rewind.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Safe switchover

2020-07-10 Thread Paul Förster
Hi James,

I wouldn't work out the procedure myself, especially since there is a free 
working solution already. It's dangerous if you do it yourself and make a 
mistake.

In our company, we rely on Patroni (https://github.com/zalando/patroni). Yes, 
it uses pg_rewind in the background but it does switchover and failover fast 
and reliably (also for high load databases), handles timelines automatically 
and so on.

For the DCS, we use etcd, but you can choose yourself, which one you want.

Cheers,
Paul


> On 10. Jul, 2020, at 09:18, James Sewell  wrote:
> 
> 
> - open connection to database
> - smart shutdown master
> - terminate all other connections
> - wait for shutdown (archiving will finish)
> 
> OK despite what it looked like from the code - upon testing it seems like 
> even a fast shutdown will wait for logs to be archived *as long as progress 
> is being made* (the same as smart).
> 
> 
> The contents of this email are confidential and may be subject to legal or 
> professional privilege and copyright. No representation is made that this 
> email is free of viruses or other defects. If you have received this 
> communication in error, you may not copy or distribute any part of it or 
> otherwise disclose its contents to anyone. Please advise the sender of your 
> incorrect receipt of this correspondence.





Re: Safe switchover

2020-07-10 Thread James Sewell
> - open connection to database
> - smart shutdown master
> - terminate all other connections
> - wait for shutdown (archiving will finish)
>

OK despite what it looked like from the code - upon testing it seems like
even a fast shutdown will wait for logs to be archived *as long as progress
is being made* (the same as smart).

-- 
The contents of this email are confidential and may be subject to legal or 
professional privilege and copyright. No representation is made that this 
email is free of viruses or other defects. If you have received this 
communication in error, you may not copy or distribute any part of it or 
otherwise disclose its contents to anyone. Please advise the sender of your 
incorrect receipt of this correspondence.


Safe switchover

2020-07-09 Thread James Sewell
Hi all,

I’m trying to work out a procedure for a safe zero data loss switchover
under (high) load, which allows the old master to be reconnected without
the use of pgrewind.

Would the following be sane?

- open connection to database
- smart shutdown master
- terminate all other connections
- wait for shutdown (archiving will finish)
- Identify last archived WAL file name (ensure it’s on backup server.)
- wait till a standby has applied this WAL
- promote that standby
- attach old master to new master

Cheers,
James Sewell
-- 
James Sewell,
*Chief Architect, Jirotech*
Suite 46, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
*P *(+61) 2 8099 9000 <(+61)%202%208099%209000>  *W* www.jirotech.com  *F *
(+61) 2 8099 9099 <(+61)%202%208099%209000>

-- 
The contents of this email are confidential and may be subject to legal or 
professional privilege and copyright. No representation is made that this 
email is free of viruses or other defects. If you have received this 
communication in error, you may not copy or distribute any part of it or 
otherwise disclose its contents to anyone. Please advise the sender of your 
incorrect receipt of this correspondence.