Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-17 Thread Craig Ringer
On 17 August 2016 at 05:18, Andres Freund  wrote:

> On 2016-08-08 10:59:20 +0800, Craig Ringer wrote:
> > Right. Though if we flush lazily I'm surprised the effect is that big,
> > you're the one who did the work and knows the significance of it.
>
> It will be. Either you're increasing bloat (by not increasing the
> slot's wal position / catalog xmin), or you're adding frequent syncs on
> an idle connection.
>


My thinking is that we should be able to do it lazily, like we do already
with feedback during apply of changes.  The problem is that right now we
can't tell the difference between confirmed_flush_lsn advances in response
to keepalives when there's no interesting upstream activity, and advances
when the client replays and confirms real activity of interest. So we can
add a new field in logical slots that tracks the last confirmed_flush_lsn
update that occurred as a result of an actual write to the client rather
than keepalive responses. No new resource retention is required, no new
client messages, no new protocol fields. Just one new field in a logical
slot.



* Add a new field, say last_write_lsn, in slots. A logical slot updates
this whenever an output plugin sends something to the client in response to
a callback. last_write_lsn is not advanced along with confirmed_flush_lsn
when we merely skip over data that's not of interest, such as writes to
other DBs or changes that are filtered out by the output plugin; it
advances only when the output plugin actually sends something to the client.

* A candidate_last_write_lsn type mechanism is needed to ensure we don't
flush out advances of last_write_lsn before we've got client feedback
confirming it flushed the changes resulting from the output plugin writes.
The same sort of logic as used for candidate_restart_lsn & restart_lsn will
work fine, but we don't have to make sure it's flushed like we do with
restart_lsn; we can just dirty the slot and wait for the next slot
checkpoint. It's pretty harmless if last_write_lsn is older than reality:
it just adds a small window where we won't detect lost changes.

* Clients like BDR and pglogical already send feedback lazily. They track
the server's flush position and send feedback for an upstream LSN once they
know the corresponding downstream writes and associated replication
origin advances have been flushed to disk. (As you know, having written
it.) Behaviour during normal apply doesn't need to change. Neither does
behaviour during idle; clients don't have to advance their replication
origin in response to server keepalives, though they may do so lazily.

* When a client starts a new decoding session we check last_write_lsn
against the LSN the client requests based on its replication origin. We
ERROR if last_write_lsn is newer than the LSN requested by the client,
indicating that the client is trying to replay changes that it, or someone
else using the same slot, has already seen and confirmed.

* catalog_xmin advances and WAL removal are NOT limited by last_write_lsn;
we can freely remove WAL after last_write_lsn and vacuum catalogs. On
reconnect we continue to skip forward to confirmed_flush_lsn if asked for
an older LSN, just like we currently do. The difference is that now we know
we're skipping data that wasn't of interest to the client, which is why it
didn't result in eager client-side replication origin advances.


Think of last_write_lsn as "the value of confirmed_flush_lsn last time the
client actually flushed something interesting". We can safely skip from any
value >= last_write_lsn to the current slot confirmed_flush_lsn if asked to
start replay at any LSN within that range. We CANNOT safely skip from any
LSN < last_write_lsn to confirmed_flush_lsn, since we know the client would
miss data it already received and confirmed but seems to have forgotten due
to a lying fsync(), restore from a snapshot backup, etc.
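To make the skip rule concrete, here's a minimal Python sketch (not
PostgreSQL code; everything beyond the names last_write_lsn and
confirmed_flush_lsn is invented for illustration, and LSNs are plain
integers) of how the proposed startup check could behave:

```python
# Illustrative sketch only -- not PostgreSQL code. Models the proposed
# last_write_lsn rule for a logical slot.

class LogicalSlot:
    def __init__(self):
        self.confirmed_flush_lsn = 0
        self.last_write_lsn = 0  # confirmed_flush_lsn as of the last real write

    def on_output_plugin_write(self, lsn):
        # Only data actually sent to the client advances last_write_lsn.
        self.last_write_lsn = max(self.last_write_lsn, lsn)

    def on_feedback(self, flushed_lsn):
        # Keepalive feedback still advances confirmed_flush_lsn freely,
        # so WAL removal and catalog_xmin advances are unaffected.
        self.confirmed_flush_lsn = max(self.confirmed_flush_lsn, flushed_lsn)

    def start_replication(self, requested_lsn):
        # Skipping forward is only safe from >= last_write_lsn.
        if requested_lsn < self.last_write_lsn:
            raise RuntimeError("client requested LSN %d but data up to %d was "
                               "already sent and confirmed" %
                               (requested_lsn, self.last_write_lsn))
        return max(requested_lsn, self.confirmed_flush_lsn)
```

A client asking to restart anywhere in [last_write_lsn, confirmed_flush_lsn]
is fast-forwarded as today; a client asking for anything older errors out
instead of silently losing data.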

We'd need more flushes on the upstream only if we were going to try to
guarantee that we detect all lost changes from a client, since
last_write_lsn would need flushing in response to every client feedback
message during apply (though not while idle). Even then the client could've
flushed more changes we haven't got feedback for yet, so it's not really
possible to totally prevent the problem. I don't think total prevention is
that important, though. A window since the last slot checkpoint where we
don't detect problems *if* the server has also crashed and restarted isn't
too bad, and it's a lot better than the current situation.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-16 Thread Andres Freund
On 2016-08-08 10:59:20 +0800, Craig Ringer wrote:
> Right. Though if we flush lazily I'm surprised the effect is that big,
> you're the one who did the work and knows the significance of it.

It will be. Either you're increasing bloat (by not increasing the
slot's wal position / catalog xmin), or you're adding frequent syncs on
an idle connection.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-07 Thread Craig Ringer
On 5 August 2016 at 14:07, Andres Freund  wrote:

>
> > > The simplest fix would be to require downstreams to flush their
> replication
> > > origin when they get a hot standby feedback message, before they send a
> > > reply with confirmation. That could be somewhat painful for
> performance, but
> > > can be alleviated somewhat by waiting for the downstream postgres to
> get
> > > around to doing a flush anyway and only forcing it if we're getting
> close to
> > > the walsender timeout. That's pretty much what BDR and pglogical do
> when
> > > applying transactions to avoid having to do a disk flush for each and
> every
> > > applied xact. Then we change START_REPLICATION ... LOGICAL so it
> ERRORs if
> > > you ask for a too-old LSN rather than silently ignoring it.
> >
> > That's basically just proposing to revert this broken optimization,
> > IIUC, and instead just try not to flush too often on the standby.
>
> The effect of the optimization is *massive* if you are replicating a
> less active database, or a less active subset of a database, in a
> cluster with lots of other activity. I don't think that can just be
> disregard, to protect against something with plenty of other failure
> scenarios.
>


Right. Though if we flush lazily I'm surprised the effect is that big;
you're the one who did the work and know its significance.

All I'm trying to say is that I think the current behaviour is too
dangerous. It doesn't just lead to failure, but to easy, undetectable,
silent failure when users perform common and simple tasks like starting a
snapshot or filesystem-level pg_start_backup() copy of a DB. The only
reason it can't happen for pg_basebackup too is that we omit slots during
pg_basebackup. That inconsistency between snapshot/fs-level copies and
pg_basebackup is unfortunate but understandable.

So I'm not saying "this whole idea must go". I'm saying I think it's too
permissive and needs to be able to be stricter about what it's allowed to
skip, so we can differentiate between "nothing interesting here" and "um, I
think someone else consumed data I needed, I'd better bail out now". I've
never been comfortable with the skipping behaviour and found it confusing
right from the start, but now I have definite cases where it can cause
silent inconsistencies and really think it needs to be modified.

Robert's point that we could keep track of the skippable range is IMO a
good one. An extra slot attribute recording the last LSN that resulted in
the output plugin doing a write to the client would be sufficient, at least
at this point. To anticipate future needs where we might want to allow
output plugins to ignore some things, control could be handed to the output
plugin by letting it explicitly advance that position via a function call
even when it performs no writes.

That way we can safely skip ahead if the client asks us for an LSN equal to
or after the last real data we sent them, but refuse to skip if we sent
them data after the LSN they're asking for.
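As a rough illustration of that slot attribute, here is a hedged Python
sketch; the class and method names are hypothetical, and advance() models
the possible future hook letting an output plugin move the position
explicitly even without a client write:

```python
# Hypothetical sketch: a slot attribute recording the last LSN at which
# the output plugin actually wrote to the client.

class SlotWriteTracker:
    def __init__(self):
        self.last_client_write_lsn = 0

    def record_write(self, lsn):
        # Called whenever the output plugin sends data to the client.
        self.last_client_write_lsn = max(self.last_client_write_lsn, lsn)

    def advance(self, lsn):
        # Explicit advance by the plugin, even with no client write.
        self.last_client_write_lsn = max(self.last_client_write_lsn, lsn)

    def may_skip_from(self, requested_lsn):
        # Safe to fast-forward only if the client has already seen
        # everything we ever actually sent up to its requested position.
        return requested_lsn >= self.last_client_write_lsn
```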

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-05 Thread Andres Freund
On 2016-08-05 00:12:41 -0400, Robert Haas wrote:
> > The cause is an optimisation intended to allow the downstream to avoid
> > having to do local writes and flushes when the upstream's activity isn't of
> > interest to it and doesn't result in replicated rows. When the upstream does
> > a bunch of writes to another database or otherwise produces WAL not of
> > interest to the downstream we send the downstream keepalive messages that
> > include the upstream's current xlog position and the client replies to
> > acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> > the downstream, we permit it to skip advancing its replication origin in
> > response to those keepalives. We continue to advance the confirmed_flush_lsn
> > and restart_lsn in the replication slot on the upstream so we can free WAL
> > that's not needed and move the catalog_xmin up. The replication origin on
> > the downstream falls behind the confirmed_flush_lsn on the upstream.
> 
> This seems entirely too clever.  The upstream could safely remember
> that if the downstream asks for WAL position X it's safe to begin
> streaming from WAL position Y because nothing in the middle is
> interesting, but it can hardly decide to unilaterally ignore the
> request position.
> 
> > The simplest fix would be to require downstreams to flush their replication
> > origin when they get a hot standby feedback message, before they send a
> > reply with confirmation. That could be somewhat painful for performance, but
> > can be alleviated somewhat by waiting for the downstream postgres to get
> > around to doing a flush anyway and only forcing it if we're getting close to
> > the walsender timeout. That's pretty much what BDR and pglogical do when
> > applying transactions to avoid having to do a disk flush for each and every
> > applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> > you ask for a too-old LSN rather than silently ignoring it.
> 
> That's basically just proposing to revert this broken optimization,
> IIUC, and instead just try not to flush too often on the standby.

The effect of the optimization is *massive* if you are replicating a
less active database, or a less active subset of a database, in a
cluster with lots of other activity. I don't think that can just be
disregarded to protect against something with plenty of other failure
scenarios.

Andres




Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-04 Thread Robert Haas
On Wed, Aug 3, 2016 at 9:39 AM, Craig Ringer  wrote:
> I think we have a bit of a problem with the behaviour specified for logical
> slots, one that makes it hard to prevent a outdated snapshot or backup of a
> logical-slot-using downstream from knowing it's missing a chunk of data
> that's been consumed from a slot. That's not great since slots are supposed
> to ensure a continuous, gapless data stream.
>
> If the downstream requests that logical decoding restarts at an LSN older
> than the slot's confirmed_flush_lsn, we silently ignore the client's request
> and start replay at the confirmed_flush_lsn. That's by design and fine
> normally, since we know the gap LSNs contained no transactions of interest
> to the downstream.

Wow, that sucks.

> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity isn't of
> interest to it and doesn't result in replicated rows. When the upstream does
> a bunch of writes to another database or otherwise produces WAL not of
> interest to the downstream we send the downstream keepalive messages that
> include the upstream's current xlog position and the client replies to
> acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> the downstream, we permit it to skip advancing its replication origin in
> response to those keepalives. We continue to advance the confirmed_flush_lsn
> and restart_lsn in the replication slot on the upstream so we can free WAL
> that's not needed and move the catalog_xmin up. The replication origin on
> the downstream falls behind the confirmed_flush_lsn on the upstream.

This seems entirely too clever.  The upstream could safely remember
that if the downstream asks for WAL position X it's safe to begin
streaming from WAL position Y because nothing in the middle is
interesting, but it can hardly decide to unilaterally ignore the
request position.

> The simplest fix would be to require downstreams to flush their replication
> origin when they get a hot standby feedback message, before they send a
> reply with confirmation. That could be somewhat painful for performance, but
> can be alleviated somewhat by waiting for the downstream postgres to get
> around to doing a flush anyway and only forcing it if we're getting close to
> the walsender timeout. That's pretty much what BDR and pglogical do when
> applying transactions to avoid having to do a disk flush for each and every
> applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> you ask for a too-old LSN rather than silently ignoring it.

That's basically just proposing to revert this broken optimization,
IIUC, and instead just try not to flush too often on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-03 Thread Craig Ringer
On 3 August 2016 at 21:55, Greg Stark  wrote:

> I didn't follow all of that but I wonder if it isn't just that when you
> restore from backup you should be creating a new slot?
>

Yes, you should.

But when you restore a Pg instance from a disk snapshot or similar, it
can't tell that it isn't the original instance. It has no way to know "I
shouldn't connect to this slot and consume data from it". Right now the
upstream has no way to tell it "that data's gone, sorry"; it just ignores
the downstream's requested start position and starts from wherever it
thinks the downstream is up to.

With physical replication we'll detect such a problem. With logical
replication we'll silently continue with an invisible gap in the history.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-03 Thread Greg Stark
I didn't follow all of that but I wonder if it isn't just that when you
restore from backup you should be creating a new slot?

On 3 Aug 2016 14:39, "Craig Ringer"  wrote:

> Hi all
>
> I think we have a bit of a problem with the behaviour specified for
> logical slots, one that makes it hard to prevent a outdated snapshot or
> backup of a logical-slot-using downstream from knowing it's missing a chunk
> of data that's been consumed from a slot. That's not great since slots are
> supposed to ensure a continuous, gapless data stream.
>
> If the downstream requests that logical decoding restarts at an LSN older
> than the slot's confirmed_flush_lsn, we silently ignore the client's
> request and start replay at the confirmed_flush_lsn. That's by design and
> fine normally, since we know the gap LSNs contained no transactions of
> interest to the downstream.
>
> But it's *bad* if the downstream is actually a copy of the original
> downstream that's been started from a restored backup/snapshot. In that
> case the downstream won't know that some other client, probably a newer
> instance of its self, consumed rows it should've seen. It'll merrily
> continue replaying and not know it isn't consistent.
>
> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity isn't of
> interest to it and doesn't result in replicated rows. When the upstream
> does a bunch of writes to another database or otherwise produces WAL not of
> interest to the downstream we send the downstream keepalive messages that
> include the upstream's current xlog position and the client replies to
> acknowledge it's seen the new LSN. But, so that we can avoid disk flushes
> on the downstream, we permit it to skip advancing its replication origin in
> response to those keepalives. We continue to advance the
> confirmed_flush_lsn and restart_lsn in the replication slot on the upstream
> so we can free WAL that's not needed and move the catalog_xmin up. The
> replication origin on the downstream falls behind the confirmed_flush_lsn
> on the upstream.
>
> This means that if the downstream exits/crashes before receiving some new
> row, its replication origin will tell it that it last replayed some LSN
> older than what it really did, and older than what the server retains.
> Logical decoding doesn't allow the client to "jump backwards" and replay
> anything older than the confirmed_lsn. Since we "know" that the gap cannot
> contain rows of interest, otherwise we'd have updated the replication
> origin, we just skip and start replay at the confirmed_flush_lsn.
>
> That means that if the downstream is restored from a backup it has no way
> of knowing it can't rejoin and become consistent because it can't tell the
> difference between "everything's fine, replication origin intentionally
> behind confirmed_flush_lsn due to activity not of interest" and "we've
> missed data consumed from this slot by some other peer and should refuse to
> continue replay".
>
> The simplest fix would be to require downstreams to flush their
> replication origin when they get a hot standby feedback message, before
> they send a reply with confirmation. That could be somewhat painful for
> performance, but can be alleviated somewhat by waiting for the downstream
> postgres to get around to doing a flush anyway and only forcing it if we're
> getting close to the walsender timeout. That's pretty much what BDR and
> pglogical do when applying transactions to avoid having to do a disk flush
> for each and every applied xact. Then we change START_REPLICATION ...
> LOGICAL so it ERRORs if you ask for a too-old LSN rather than silently
> ignoring it.
>
> This problem can also bite you if you restore a copy of a downstream (say,
> to look at since-deleted data) while the original happens to be
> disconnected for some reason. The copy connects to the upstream and
> consumes some data from the slot. Then when the original comes back on line
> it has no idea there's a gap in its time stream.
>
> This came up when investigating issues with people using snapshot-based
> BDR and pglogical backup/restore. It's a real-world problem that can result
> in silent data inconsistency.
>
> Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd
> still allow a downstream to find out if it's missed data?
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


[HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-03 Thread Craig Ringer
Hi all

I think we have a bit of a problem with the behaviour specified for logical
slots, one that makes it hard for an outdated snapshot or backup of a
logical-slot-using downstream to know it's missing a chunk of data that's
been consumed from a slot. That's not great since slots are supposed to
ensure a continuous, gapless data stream.

If the downstream requests that logical decoding restarts at an LSN older
than the slot's confirmed_flush_lsn, we silently ignore the client's
request and start replay at the confirmed_flush_lsn. That's by design and
fine normally, since we know the gap LSNs contained no transactions of
interest to the downstream.

But it's *bad* if the downstream is actually a copy of the original
downstream that's been started from a restored backup/snapshot. In that
case the downstream won't know that some other client, probably a newer
instance of itself, consumed rows it should've seen. It'll merrily
continue replaying and not know it isn't consistent.

The cause is an optimisation intended to allow the downstream to avoid
having to do local writes and flushes when the upstream's activity isn't of
interest to it and doesn't result in replicated rows. When the upstream
does a bunch of writes to another database or otherwise produces WAL not of
interest to the downstream we send the downstream keepalive messages that
include the upstream's current xlog position and the client replies to
acknowledge it's seen the new LSN. But, so that we can avoid disk flushes
on the downstream, we permit it to skip advancing its replication origin in
response to those keepalives. We continue to advance the
confirmed_flush_lsn and restart_lsn in the replication slot on the upstream
so we can free WAL that's not needed and move the catalog_xmin up. The
replication origin on the downstream falls behind the confirmed_flush_lsn
on the upstream.
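The divergence described above can be modelled with a small Python sketch
(purely illustrative; the class names and integer LSNs are invented):
acknowledging keepalives without a local flush lets the upstream slot
advance while the downstream's durable origin stays put.

```python
# Purely illustrative model of the optimisation above. After enough
# keepalives, the durable replication origin on the downstream lags the
# upstream's confirmed_flush_lsn -- exactly the state a restored backup
# wakes up in.

class Upstream:
    def __init__(self):
        self.confirmed_flush_lsn = 0

    def ack(self, lsn):
        self.confirmed_flush_lsn = max(self.confirmed_flush_lsn, lsn)

class Downstream:
    def __init__(self):
        self.origin_flushed_lsn = 0  # durable replication origin on disk

    def on_change_applied(self, upstream, lsn):
        # Real changes flush the origin along with the applied transaction.
        self.origin_flushed_lsn = lsn
        upstream.ack(lsn)

    def on_keepalive(self, upstream, lsn):
        # Acknowledge without a local flush: cheap, but the durable origin
        # now understates what was really confirmed to the upstream.
        upstream.ack(lsn)
```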

This means that if the downstream exits/crashes before receiving some new
row, its replication origin will tell it that it last replayed some LSN
older than what it really did, and older than what the server retains.
Logical decoding doesn't allow the client to "jump backwards" and replay
anything older than the confirmed_flush_lsn. Since we "know" that the gap
cannot contain rows of interest (otherwise we'd have updated the
replication origin), we just skip and start replay at the
confirmed_flush_lsn.

That means that if the downstream is restored from a backup it has no way
of knowing it can't rejoin and become consistent because it can't tell the
difference between "everything's fine, replication origin intentionally
behind confirmed_flush_lsn due to activity not of interest" and "we've
missed data consumed from this slot by some other peer and should refuse to
continue replay".

The simplest fix would be to require downstreams to flush their replication
origin when they get a hot standby feedback message, before they send a
reply with confirmation. That could be somewhat painful for performance,
but can be alleviated somewhat by waiting for the downstream postgres to
get around to doing a flush anyway and only forcing it if we're getting
close to the walsender timeout. That's pretty much what BDR and pglogical
do when applying transactions to avoid having to do a disk flush for each
and every applied xact. Then we change START_REPLICATION ... LOGICAL so it
ERRORs if you ask for a too-old LSN rather than silently ignoring it.
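A hedged downstream-side sketch of that fix, in Python with invented names
and illustrative timings: feedback is deferred so normal keepalives stay
cheap, but the replication origin is flushed, and only then confirmed, once
we get close to the walsender timeout.

```python
# Hedged sketch, not BDR/pglogical code. Defers the origin flush, but
# forces it (and only then sends confirmation) well before the walsender
# timeout would expire. The clock is injectable for testing.
import time

class FeedbackSender:
    def __init__(self, wal_sender_timeout=60.0, clock=time.monotonic):
        self.clock = clock
        self.force_after = wal_sender_timeout / 2  # flush well before timeout
        self.last_feedback = clock()
        self.pending_lsn = None  # received but not yet durably confirmed
        self.confirmed_lsn = 0

    def on_keepalive(self, lsn):
        self.pending_lsn = lsn
        if self.clock() - self.last_feedback >= self.force_after:
            self.flush_and_confirm()

    def flush_and_confirm(self):
        if self.pending_lsn is not None:
            # Flush the replication origin to disk here *before* sending the
            # confirmation, so the origin never lags what we've confirmed.
            self.confirmed_lsn = self.pending_lsn
            self.last_feedback = self.clock()
            self.pending_lsn = None
```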

This problem can also bite you if you restore a copy of a downstream (say,
to look at since-deleted data) while the original happens to be
disconnected for some reason. The copy connects to the upstream and
consumes some data from the slot. Then when the original comes back on line
it has no idea there's a gap in its time stream.

This came up when investigating issues with people using snapshot-based BDR
and pglogical backup/restore. It's a real-world problem that can result in
silent data inconsistency.

Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd still
allow a downstream to find out if it's missed data?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services