PostgreSQL: Canceled on conflict out to old pivot

2023-11-26 Thread Wirch, Eduard
Hi

We have a PostgreSQL 15 server serving around 30 databases, one schema each
with the same layout. Each database is used by one application instance.
The application consistently uses transactions with isolation level
SERIALIZABLE to access the database, optimizing by using explicit read-only
transactions where applicable. Once the server reaches 100% load, we get an
increased number of serialization conflict errors. This is expected, due to
more concurrent access. But I fail to explain this kind of error:

ERROR: could not serialize access due to read/write dependencies among
transactions
  Detail: Reason code: Canceled on identification as a pivot, with conflict
out to old committed transaction 61866959.

There is a variation of the error:

PSQLException: ERROR: could not serialize access due to read/write
dependencies among transactions
  Detail: Reason code: Canceled on conflict out to old pivot 61940806.

We're logging the id, begin, and end of every transaction. Transaction
61940806 was committed without errors. The transaction responsible for the
above error was started 40 min later (and failed immediately). With
61866959 it is even more extreme: the first conflict error occurred 2.5 h
after 61866959 was committed.

The DB table access pattern is too complex to lay out here; roughly 20
tables are read from and written to. Transactions are usually short-lived;
the longest transaction that can occur runs about 1 min. My understanding
of serializable isolation is that only overlapping transactions can
conflict. I can be fairly sure that in the above cases there is no single
transaction that overlaps both with 61940806 and with the failing
transaction 40 min later. Such a long-running transaction would cause
different kinds of errors in our system ("out of shared memory", "You
might need to increase max_pred_locks_per_transaction").
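
For reference, the SIRead (predicate) locks that
max_pred_locks_per_transaction budgets for can be watched while the system
is under load, e.g. with a query along these lines:

```sql
-- Predicate (SIRead) locks currently held, grouped per relation.
SELECT relation::regclass AS rel, count(*) AS siread_locks
FROM pg_locks
WHERE mode = 'SIReadLock'
GROUP BY relation
ORDER BY siread_locks DESC;
```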

Why does PostgreSQL detect a conflict with a transaction that was
committed more than 1 h earlier? Can there be a long dependency chain
across many short-lived transactions? Does the high load prevent Postgres
from doing some cleanup?
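
For context, the minimal shape of such a read/write dependency (write
skew) between two overlapping serializable sessions looks like this; the
table and values are made up purely for illustration:

```sql
CREATE TABLE t (grp int, val int);  -- hypothetical table

-- Session A                           -- Session B (interleaved)
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(val) FROM t WHERE grp = 2;
                                       -- BEGIN ISOLATION LEVEL SERIALIZABLE;
                                       -- SELECT sum(val) FROM t WHERE grp = 1;
                                       -- INSERT INTO t VALUES (2, 10);
                                       -- COMMIT;
INSERT INTO t VALUES (1, 10);
COMMIT;  -- one of the two is aborted with SQLSTATE 40001
```

Each session reads the rows the other writes, forming the rw-dependency
cycle that SSI aborts. What I cannot explain is the time gap, not the
mechanism itself.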

Cheers,
Eduard


Re: replication primary writing infinite number of WAL files

2023-11-26 Thread Les
Peter J. Holzer (Sun, 26 Nov 2023, 12:30):

>
> > nov 24 10:20:19 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:19.691 CET [35] LOG:  checkpoints are occurring too frequently (23 seconds apart)
> > nov 24 10:20:42 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:42.938 CET [35] LOG:  checkpoints are occurring too frequently (23 seconds apart)
>
> Interesting. If the database writes 1.5 GB/s of WALs and max_wal_size is
> the default of 1GB, shouldn't there be a checkpoint about every 0.7
> seconds instead of just every 22 seconds?
>
>
That log is from the beginning of the problem, at 10:20. It started
slowly, then ramped up.

[image: image.png]

I'm not sure what happened at 10:32. At 10:38 we did the first server
restart and shut down all clients. After recovery at 10:40 it was writing
again. At 10:44 we dropped the slot, and the write rate went down to
5 MB/s within one minute.

On the second occasion it was writing at 3 GB/s for some time, but that
time we acted very quickly:

[image: image.png]

Network I/O was always about half of the disk I/O, very consistently (one
half of the data was going to one standby; the other half could not be
sent because of the slow network).

After dropping the slot, the write rate always went down within about one
minute. That was also very consistent: the writing stopped exactly when
the slot was dropped.
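
To catch this earlier next time, a query along these lines (on the
primary) shows how much WAL each slot is pinning:

```sql
-- WAL retained by each replication slot (run on the primary).
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots;
```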

After separating the DEV and PROD networks completely, yesterday we
created a new standby again (third try). No problems so far. I hope it
will stay that way. We are still investigating the applications (pg
clients), looking for bugs.

Thank you for your help!

   Laszlo


Re: pg_getnameinfo_all() failed: Temporary failure in name resolution

2023-11-26 Thread Ron Johnson
On Sun, Nov 26, 2023 at 6:51 AM Peter J. Holzer  wrote:
[snip]

> > But is not explanatory.  I want to see host names in the log file when
> > possible, not just IP addresses.
>
> So now that you have IP addresses again, are there any for which a
> reverse lookup doesn't work?
>

I've gone through /etc/hosts, and there are some mismatches between it and
DNS.  Maybe that's the cause of the errors... maybe not.  
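
To narrow it down: log_hostname makes the server do the same kind of
reverse lookup that getnameinfo(3) performs, so a few lines of Python can
check whether a given client address reverse-resolves (the address below
is just a placeholder):

```python
import socket

# Reverse-resolve an address the way log_hostname would (placeholder IP).
addr = "127.0.0.1"
try:
    host, _ = socket.getnameinfo((addr, 5432), socket.NI_NAMEREQD)
    print(f"{addr} -> {host}")
except socket.gaierror as e:
    # This is the kind of failure PostgreSQL reports as
    # "pg_getnameinfo_all() failed: Temporary failure in name resolution"
    print(f"{addr}: reverse lookup failed: {e}")
```

Running it for each client IP seen in the log should single out the hosts
whose PTR records (or /etc/hosts entries) are broken.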


Re: pg_getnameinfo_all() failed: Temporary failure in name resolution

2023-11-26 Thread Peter J. Holzer
On 2023-11-24 16:39:57 -0500, Ron Johnson wrote:
> On Fri, Nov 24, 2023 at 4:26 PM David G. Johnston 
> wrote:
> 
> On Friday, November 24, 2023, Ron Johnson  wrote:
> 
> 
> The second "way" sounds interesting, but what is it filled with?

With "???"

(which isn't very helpful. I would have hoped it's filled in with a
representation of the IP address)



> What does it matter?  It’s an internal detail that apparently gets exposed
> as [unknown] appearing in your log file where the client ip address would
> normally be.
> 
> 
> Because I'd hope that the log file would tell me the errant host name.  

The error message says that the host name could not be determined
because of a DNS error. So obviously it can't tell you the host name.
It could tell you the IP address, though.


> I added "log_hostname = on" to postgresql.conf earlier in the day. 
> When I commented that out, the errors stopped happening.
> 
> 
> This makes sense.
> 
> 
> But is not explanatory.  I want to see host names in the log file when
> possible, not just IP addresses.

So now that you have IP addresses again, are there any for which a
reverse lookup doesn't work?

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: replication primary writing infinite number of WAL files

2023-11-26 Thread Peter J. Holzer
On 2023-11-24 22:05:12 +0100, Les wrote:
> And there is nothing in the logs in that time period besides "checkpoints
> are happening too frequently"?
> 
> 
> 
> Here are all the logs we have from that minute.
> 
> ╰─# journalctl CONTAINER_ID=33d8b96b9062 --since '2023-11-24 10:18:00' --until
> '2023-11-24 10:21:00.000'                                  

[citing only the "checkpoints are occurring too frequently" entries:]
> nov 24 10:18:04 docker02 33d8b96b9062[1200]: 2023-11-24 10:18:04.517 CET [35] LOG:  checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:18:26 docker02 33d8b96b9062[1200]: 2023-11-24 10:18:26.776 CET [35] LOG:  checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:18:49 docker02 33d8b96b9062[1200]: 2023-11-24 10:18:49.002 CET [35] LOG:  checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:19:11 docker02 33d8b96b9062[1200]: 2023-11-24 10:19:11.320 CET [35] LOG:  checkpoints are occurring too frequently (23 seconds apart)
> nov 24 10:19:34 docker02 33d8b96b9062[1200]: 2023-11-24 10:19:34.058 CET [35] LOG:  checkpoints are occurring too frequently (23 seconds apart)
> nov 24 10:19:56 docker02 33d8b96b9062[1200]: 2023-11-24 10:19:56.160 CET [35] LOG:  checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:20:19 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:19.691 CET [35] LOG:  checkpoints are occurring too frequently (23 seconds apart)
> nov 24 10:20:42 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:42.938 CET [35] LOG:  checkpoints are occurring too frequently (23 seconds apart)

Interesting. If the database writes 1.5 GB/s of WALs and max_wal_size is
the default of 1GB, shouldn't there be a checkpoint about every 0.7
seconds instead of just every 22 seconds?
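
The arithmetic, spelled out (assuming max_wal_size at its 1 GB default and
the observed ~1.5 GB/s rate):

```python
# Rough expected time between checkpoints if WAL volume alone triggers them
# (assumptions: max_wal_size = 1 GB default, observed rate ~1.5 GB/s).
max_wal_size_gb = 1.0
wal_rate_gb_per_s = 1.5
interval_s = max_wal_size_gb / wal_rate_gb_per_s
print(f"expected interval: ~{interval_s:.2f} s")  # ~0.67 s
```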

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"

