Replication conflicts not processed in ClientWrite

Magnus Hagander Mon, 04 Mar 2024 05:13:06 -0800

When a backend is blocked on writing data (such as with a network
error or a very slow client), indicated with wait event ClientWrite,
it appears to not properly notice that it's overrunning
max_standby_streaming_delay, and therefore does not cancel the
transaction on the backend.


I've reproduced this repeatedly on Ubuntu 20.04 with PostgreSQL 15 out
of the Debian packages. Curiously enough, if I install the debug
symbols and restart, in order to get a backtrace, it starts processing
the cancellation again and I can no longer reproduce it. So it sounds
like some timing issue around it.

My simple test was, with session 1 on the standby and session 2 on the primary:
Session 1: begin transaction isolation level repeatable read;
Session 1: select count(*) from testtable;
Session 2: alter table testtable rename to testtable2;
Session 1: select * from testtable t1 cross join testtable t2;
kill -STOP <the pid of session 1>

At this point, replication lag starts growing on the standby and it
never terminates the session.
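
To observe this state, queries along the following lines could be run
from a third session on the standby. This is only a sketch and was not
part of the original test: it assumes PostgreSQL 15 and the standard
statistics views. Note that the time-based figure also grows whenever
the primary is simply idle.

```sql
-- Run on the standby from a separate session (not part of the original test).

-- What is the stopped backend waiting on? ClientWrite is expected here.
SELECT pid, state, wait_event_type, wait_event, xact_start
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid();

-- Replay lag: WAL received but not yet replayed, and the time since the
-- last replayed commit.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_lag_bytes,
       now() - pg_last_xact_replay_timestamp()   AS replay_delay;

-- The limit that replay is expected to enforce.
SHOW max_standby_streaming_delay;

-- Conflict cancellations counted so far for this database.
SELECT datname, confl_lock, confl_snapshot
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```

If the behavior described above holds, the lag figures keep growing past
max_standby_streaming_delay while the conflict counters stay unchanged.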

If I then SIGCONT it, it will get terminated by replication conflict.

If I kill the session hard, the replication lag recovers immediately.

AFAICT if the conflict happens at ClientRead, for example, it's
picked up immediately, but there's something in ClientWrite that
prevents it.

My first thought would be OpenSSL, but this is reproducible both on
TLS-over-TCP and on Unix sockets.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
