PostgreSQL: Canceled on conflict out to old pivot
Hi,

We have a PostgreSQL 15 server serving around 30 databases, one schema each with the same layout. Each database is used by one application instance. The application consistently uses transactions with isolation level SERIALIZABLE to access the database, optimizing by using explicit read-only transactions where applicable.

Once the server reaches 100% utilization, we get an increased number of serialization conflict errors. This is expected, due to more concurrent access. But I fail to explain this kind of error:

ERROR: could not serialize access due to read/write dependencies among transactions
Detail: Reason code: Canceled on identification as a pivot, with conflict out to old committed transaction 61866959.

There is a variation of the error:

PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions
Detail: Reason code: Canceled on conflict out to old pivot 61940806.

We're logging the ID, begin, and end of every transaction. Transaction 61940806 was committed without errors. The transaction responsible for the above error was started 40 minutes later (and failed immediately). With 61866959 it is even more extreme: the first conflict error occurred 2.5 hours after 61866959 was committed.

The table access pattern is too complex to lay out here; there are around 20 tables that are read from and written to. Transactions are usually short-lived. The longest transaction that could occur runs about 1 minute.

My understanding of serializable isolation is that only overlapping transactions can conflict. I can be fairly sure that in the above cases there is no single transaction that overlaps both with 61940806 and with the failing transaction 40 minutes later. Such long-running transactions would cause different types of errors in our system ("out of shared memory", "You might need to increase max_pred_locks_per_transaction").

Why does PostgreSQL detect a conflict with a transaction that was committed more than an hour earlier?
Can there be a long dependency chain between many short-running transactions? Does the high load prevent Postgres from doing some cleanup?

Cheers,
Eduard
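(A note on the chain question: PostgreSQL's SSI implementation keeps predicate-lock (SIREAD) information around past commit while concurrent transactions are still active, and when that state is summarized out to disk under memory pressure, later conflicts are reported against an "old committed transaction". Whether a chain can bridge 40 minutes depends on transactions overlapping pairwise. The toy interval model below is a simplified sketch of only that overlap-chaining idea, not of PostgreSQL's actual dependency tracking; the timings are invented for illustration.)

```python
# Toy model: each transaction is a (begin, end) interval in seconds.
# Two transactions "overlap" if their intervals intersect. A chain of
# pairwise-overlapping short transactions can connect two transactions
# that are far apart in time, even though no single transaction
# overlaps both endpoints directly.

def overlaps(a, b):
    """True if intervals a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]

def chained(txs, i, j):
    """True if tx i reaches tx j through a chain of pairwise overlaps."""
    seen, stack = {i}, [i]
    while stack:
        k = stack.pop()
        if k == j:
            return True
        for n in range(len(txs)):
            if n not in seen and overlaps(txs[k], txs[n]):
                seen.add(n)
                stack.append(n)
    return False

# Hypothetical workload: one-minute transactions, a new one starting
# every 30 seconds, over a 40-minute window.
txs = [(t, t + 60) for t in range(0, 2400, 30)]

print(overlaps(txs[0], txs[-1]))      # False: first and last never overlap
print(chained(txs, 0, len(txs) - 1))  # True: linked through the chain
```

So a dependency chain across many short transactions is at least structurally possible; whether SSI actually retains conflict information that long in a given case is a separate question about its summarization behavior.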
Re: replication primary writing infinite number of WAL files
Peter J. Holzer (Sun, Nov 26, 2023, 12:30):

> > nov 24 10:20:19 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:19.691 CET [35] LOG: checkpoints are occurring too frequently (23 seconds apart)
> > nov 24 10:20:42 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:42.938 CET [35] LOG: checkpoints are occurring too frequently (23 seconds apart)
>
> Interesting. If the database writes 1.5 GB/s of WALs and max_wal_size is the default of 1GB, shouldn't there be a checkpoint about every 0.7 seconds instead of just every 22 seconds?

That log is from the beginning of the problem, at 10:20. It started slowly, then ramped up.

[image: image.png]

I'm not sure what happened at 10:32. At 10:38 there was the first server restart, and we shut down all clients. After recovery at 10:40, it was writing again. At 10:44 we dropped the slot, and the write rate went down to 5 MB/s within one minute.

On the second occasion it was writing at 3 GB/s for some time, but then we acted very quickly:

[image: image.png]

Network I/O was always about half of the disk I/O, very consistently (one half of the data was going to one standby; the other half could not be sent because of the slow network). After dropping the slot, writing always went down within about one minute. It was also very consistent: it stopped exactly when the slot was dropped.

After separating the DEV and PROD networks completely, we created a new standby again yesterday (third try). No problems so far; I hope it will remain that way. Still investigating the applications (pg clients), looking for bugs.

Thank you for your help!

Laszlo
Re: pg_getnameinfo_all() failed: Temporary failure in name resolution
On Sun, Nov 26, 2023 at 6:51 AM Peter J. Holzer wrote:

[snip]

> > But is not explanatory. I want to see host names in the log file when possible, not just IP addresses.
>
> So now that you have IP addresses again, are there any for which a reverse lookup doesn't work?

I've gone through /etc/hosts, and there are some mismatches between it and DNS. Maybe that's the cause of the errors... maybe not.
Re: pg_getnameinfo_all() failed: Temporary failure in name resolution
On 2023-11-24 16:39:57 -0500, Ron Johnson wrote:
> On Fri, Nov 24, 2023 at 4:26 PM David G. Johnston wrote:
> > On Friday, November 24, 2023, Ron Johnson wrote:
> > > The second "way" sounds interesting, but what is it filled with?

With "???" (which isn't very helpful; I would have hoped it's filled in with a representation of the IP address).

> > What does it matter? It's an internal detail that apparently gets exposed as [unknown] appearing in your log file where the client IP address would normally be.
>
> Because I'd hope that the log file would tell me the errant host name.

The error message says that the host name could not be determined because of a DNS error. So obviously it can't tell you the host name. It could tell you the IP address, though.

> I added "log_hostname = on" to postgresql.conf earlier in the day. When I commented that out, the errors stopped happening.
>
> > This makes sense.
>
> But is not explanatory. I want to see host names in the log file when possible, not just IP addresses.

So now that you have IP addresses again, are there any for which a reverse lookup doesn't work?

hp

--
Peter J. Holzer | h...@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"
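(For checking which client IPs lack working reverse DNS, the logic `log_hostname` relies on can be sketched as follows. This is a standalone helper written for this thread, not anything from PostgreSQL itself: try a reverse (PTR) lookup of an address, and fall back to the numeric form when resolution fails, which is roughly what ends up in the log when the lookup succeeds vs. not.)

```python
import socket

def display_name(ip: str) -> str:
    """Return the reverse-resolved host name for ip, or the IP itself
    when the lookup fails (a numeric-address fallback)."""
    try:
        # NI_NAMEREQD makes getnameinfo raise instead of silently
        # returning the numeric address when there is no PTR record.
        host, _ = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
        return host
    except socket.gaierror:
        return ip  # resolution failed; keep the numeric address

# NI_NUMERICHOST skips DNS entirely and always yields the numeric form:
numeric, _ = socket.getnameinfo(("127.0.0.1", 0), socket.NI_NUMERICHOST)
print(numeric)  # 127.0.0.1
```

Running `display_name()` over the distinct client addresses from the log would show directly which ones resolve and which fall back, independent of whatever /etc/hosts says locally.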
Re: replication primary writing infinite number of WAL files
On 2023-11-24 22:05:12 +0100, Les wrote:
> > And there is nothing in the logs in that time period besides "checkpoints are happening too frequently"?
>
> Here are all the logs we have from that minute.
>
> ╰─# journalctl CONTAINER_ID=33d8b96b9062 --since '2023-11-24 10:18:00' --until '2023-11-24 10:21:00.000'

[citing only the "checkpoints are occurring too frequently" entries:]

> nov 24 10:18:04 docker02 33d8b96b9062[1200]: 2023-11-24 10:18:04.517 CET [35] LOG: checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:18:26 docker02 33d8b96b9062[1200]: 2023-11-24 10:18:26.776 CET [35] LOG: checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:18:49 docker02 33d8b96b9062[1200]: 2023-11-24 10:18:49.002 CET [35] LOG: checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:19:11 docker02 33d8b96b9062[1200]: 2023-11-24 10:19:11.320 CET [35] LOG: checkpoints are occurring too frequently (23 seconds apart)
> nov 24 10:19:34 docker02 33d8b96b9062[1200]: 2023-11-24 10:19:34.058 CET [35] LOG: checkpoints are occurring too frequently (23 seconds apart)
> nov 24 10:19:56 docker02 33d8b96b9062[1200]: 2023-11-24 10:19:56.160 CET [35] LOG: checkpoints are occurring too frequently (22 seconds apart)
> nov 24 10:20:19 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:19.691 CET [35] LOG: checkpoints are occurring too frequently (23 seconds apart)
> nov 24 10:20:42 docker02 33d8b96b9062[1200]: 2023-11-24 10:20:42.938 CET [35] LOG: checkpoints are occurring too frequently (23 seconds apart)

Interesting. If the database writes 1.5 GB/s of WAL and max_wal_size is the default of 1 GB, shouldn't there be a checkpoint about every 0.7 seconds instead of just every 22 seconds?

hp

--
Peter J. Holzer | h...@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"
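(The arithmetic behind that question, as a back-of-the-envelope check. The 1 GB max_wal_size is the assumed default and 1.5 GB/s is the rate reported in the thread; note also that WAL-triggered checkpoints fire somewhat before max_wal_size is reached, so the naive estimate is an upper bound on the expected spacing, which only widens the discrepancy.)

```python
# Naive expected checkpoint spacing if WAL volume were the trigger:
max_wal_size_gb = 1.0      # assumed default
wal_rate_gb_per_s = 1.5    # write rate reported in the thread

naive_interval_s = max_wal_size_gb / wal_rate_gb_per_s
print(round(naive_interval_s, 2))  # 0.67 -- well under a second

# Conversely, 22-second spacing at 1.5 GB/s corresponds to ~33 GB of
# WAL generated per checkpoint cycle:
wal_per_cycle_gb = 22 * wal_rate_gb_per_s
print(round(wal_per_cycle_gb))  # 33
```

So either the sustained WAL rate was lower than the peak disk figure suggests, or checkpoints were being spaced by something other than the 1 GB default; both are worth verifying against the actual max_wal_size setting.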