Slow plan choice with prepared query
Hi, I've experienced a situation where the planner seems to make a very poor choice with a prepared query after the first five executions. Looking at the documentation, I think this happens because it switches from a custom plan to a generic one, and the generic plan it picks is a poor one.

Postgres version: running in Docker; reports itself as 'Debian 14.1-1.pgdg110+1'.

If I force it to use a custom plan via 'set local plan_cache_mode = force_custom_plan', then I don't notice any slowdown. Without it, the 6th and subsequent calls can take 1 to 15 seconds each, as opposed to about 10ms.

Since I have a workaround, I don't necessarily need assistance, but I'm posting this here in case it's of value as a test case. Here's a test case that reliably reproduces this issue for me:

create table test (
    test_id serial primary key,
    data text
);

insert into test (data)
(select data from (select generate_series(1,1) AS id, md5(random()::text) AS data) x);

prepare foo_test(text, text, int, text, bool) as
SELECT *
FROM (
    SELECT *, count(*) OVER () > $3 AS has_more, row_number() OVER ()
    FROM (
        WITH counted AS (
            SELECT count(*) AS total
            FROM (select test_id::text, data
                  from test
                  where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
                    and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
        ),
        cursor_row AS (
            SELECT base.test_id
            FROM (select test_id::text, data
                  from test
                  where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
                    and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
            WHERE base.test_id = $4
        )
        SELECT counted.*, base.*
        FROM (select test_id::text, data
              from test
              where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
                and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
        LEFT JOIN cursor_row ON true
        LEFT JOIN counted ON true
        WHERE (($4 IS NULL OR cast($5 as bool) IS NULL)
            OR ((base.test_id) > (cursor_row.test_id)))
        ORDER BY base.test_id ASC
        LIMIT $3 + 1
    ) xy
    LIMIT $3
) z
ORDER BY row_number ASC;

\timing

execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
-- This one should be slower:
execute foo_test(null, null, 5, 500, true);
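The custom-vs-generic switch described above can also be observed directly rather than inferred from timing. A minimal sketch, assuming PostgreSQL 14 (where pg_prepared_statements gained the generic_plans and custom_plans counters):

```
-- After the five fast executions and the slow sixth one, confirm the
-- planner has switched to a generic plan:
SELECT name, generic_plans, custom_plans
FROM pg_prepared_statements
WHERE name = 'foo_test';

-- Workaround scoped to one transaction (SET LOCAL reverts at commit/rollback):
BEGIN;
SET LOCAL plan_cache_mode = force_custom_plan;
EXECUTE foo_test(null, null, 5, 500, true);
COMMIT;

-- Or for the whole session:
SET plan_cache_mode = force_custom_plan;
```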
Re: An I/O error occurred while sending to the backend (PG 13.4)
You originally mailed about an error on the client, and now you have found corresponding server logs, which suggests a genuine network issue.

Are the postgres clients and server on the same subnet? If not, what are the intermediate routers? Is there any NAT happening? Do those devices have any interesting logs that correspond with the server/client connection failures?

Have you tried enabling TCP keepalives? This might help convince a NAT device not to forget about your connection.
https://www.postgresql.org/docs/current/runtime-config-connection.html

tcp_keepalives_idle=9
tcp_keepalives_interval=9
tcp_keepalives_count=0
tcp_user_timeout=0 -- You apparently have this set, but it cannot work on Windows, so it just generates noise.

On Linux, you can check the keepalive counters in "netstat -not" to be sure keepalives are enabled. A similar switch hopefully exists for Windows.

-- Justin
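Keepalives can also be requested from the client side rather than in postgresql.conf. A sketch using libpq's keepalive connection parameters (host and database name are placeholders for this environment), plus the netstat check mentioned above:

```
# Ask libpq to arm TCP keepalives on the client socket
# (host/dbname are placeholders, values mirror the settings above):
psql "host=db.example.com dbname=mydb keepalives=1 \
      keepalives_idle=9 keepalives_interval=9 keepalives_count=0"

# On Linux, confirm the keepalive timer is running on the connection;
# the Timer column should read "keepalive (...)":
netstat -not | grep 5432
```

Note that a JDBC client does not go through libpq, so for the setup in this thread the equivalent would be the driver's own keepalive option rather than these parameters.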
Re: An I/O error occurred while sending to the backend (PG 13.4)
On Thu, Feb 24, 2022 at 12:47:42AM +, l...@laurent-hasson.com wrote:
> On Sat, Dec 04, 2021 at 05:32:10PM +, l...@laurent-hasson.com wrote:
> > I have a data warehouse with a fairly complex ETL process that has
> > been running for years now across PG 9.6, 11.2 and now 13.4 for the
> > past couple of months. I have been getting the error "An I/O error
> > occurred while sending to the backend" quite often under load in 13.4,
> > which I never used to get on 11.2. I have applied some tricks, particularly
> > with the socketTimeout JDBC configuration.
>
> It'd be nice to see a network trace for this too. Using tcpdump or
> wireshark. Preferably from the client side.
>
> Hello Justin,
>
> It has been ages! The issue has been happening a bit more often recently, as
> much as once every 10 days or so. As a reminder, the setup is Postgres 13.4
> on Windows Server with 16 cores and 64GB memory. The scenario where this
> occurs is an ETL tool called Pentaho Kettle (V7) connecting to the DB for
> data warehouse workloads. The tool is Java-based and connects via JDBC using
> postgresql-42.2.5.jar. There are no particular settings besides the
> socketTimeout setting mentioned above.
>
> The workload has some steps being lots of quick transactions for dimension
> tables, for example, but some fact table calculations, especially large
> pivots, can make queries run for 40mn up to over an hour (a few of those).
>
> I caught these in the logs at the time of a failure but am unsure what to make
> of them:
>
> 2022-02-21 02:10:43.605 EST [1368] LOG: unexpected EOF on client connection with an open transaction
> 2022-02-21 02:10:43.605 EST [3304] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
> 2022-02-21 02:10:43.605 EST [3304] LOG: unexpected EOF on client connection with an open transaction
> 2022-02-21 02:31:38.808 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported

I suggest enabling CSV logging, which has many more columns of data. Some of them might provide an insight - I'm not sure. Set log_destination=csvlog (in addition to whatever else you have set).

And the aforementioned network trace. You could set a capture filter on TCP SYN|FIN|RST so it's not absurdly large. From my notes, it might look like this:

(tcp[tcpflags] & (tcp-rst|tcp-syn|tcp-fin)) != 0

-- Justin
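For reference, the capture filter above can be turned into a full command line. A sketch, assuming a Linux capture host; the interface name, output file, and port are placeholders to adjust for the actual environment:

```
# Capture only connection setup/teardown packets (SYN/FIN/RST) to and
# from the Postgres port, so the trace stays small even over days:
tcpdump -i eth0 -w pg_trace.pcap \
  'port 5432 and (tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst)) != 0'
```

On the Windows client side, the same capture filter syntax works in Wireshark/dumpcap, since it is standard BPF.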
RE: An I/O error occurred while sending to the backend (PG 13.4)
> -----Original Message-----
> From: l...@laurent-hasson.com
> Sent: Saturday, December 4, 2021 14:18
> To: Justin Pryzby
> Cc: pgsql-performa...@postgresql.org
> Subject: RE: An I/O error occurred while sending to the backend (PG 13.4)
>
> > -----Original Message-----
> > From: Justin Pryzby
> > Sent: Saturday, December 4, 2021 12:59
> > To: l...@laurent-hasson.com
> > Cc: pgsql-performa...@postgresql.org
> > Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
> >
> > On Sat, Dec 04, 2021 at 05:32:10PM +, l...@laurent-hasson.com wrote:
> > > I have a data warehouse with a fairly complex ETL process that has
> > > been running for years now across PG 9.6, 11.2 and now 13.4 for the
> > > past couple of months. I have been getting the error "An I/O error
> > > occurred while sending to the backend" quite often under load in 13.4,
> > > which I never used to get on 11.2. I have applied some tricks, particularly
> > > with the socketTimeout JDBC configuration.
> > >
> > > So my first question is whether anyone has any idea why this is
> > > happening? My hardware and general PG configuration have not
> > > changed between 11.2 and 13.4, and I NEVER experienced this on 11.2 in
> > > about 2y of production.
> > >
> > > Second, I have one stored procedure that takes a very long time to run
> > > (40mn more or less), so obviously, I'd need to set socketTimeout to
> > > something like 1h in order to call it and not time out. That doesn't seem
> > > reasonable?
> >
> > Is the DB server local or remote (TCP/IP) to the client?
> >
> > Could you collect the corresponding postgres query logs when this
> > happens?
> >
> > It'd be nice to see a network trace for this too. Using tcpdump or
> > wireshark. Preferably from the client side.
> >
> > FWIW, I suspect the JDBC socketTimeout is a bad workaround.
> >
> > --
> > Justin
>
> It's a remote server, but all on a local network. Network performance is, I
> am sure, not the issue. Also, the system is on Windows Server. What are you
> expecting to see out of a tcpdump? I'll try to get PG logs on the failing query.
>
> Thank you,
> Laurent.

Hello Justin,

It has been ages! The issue has been happening a bit more often recently, as much as once every 10 days or so. As a reminder, the setup is Postgres 13.4 on Windows Server with 16 cores and 64GB memory. The scenario where this occurs is an ETL tool called Pentaho Kettle (V7) connecting to the DB for data warehouse workloads. The tool is Java-based and connects via JDBC using postgresql-42.2.5.jar. There are no particular settings besides the socketTimeout setting mentioned above.

The workload has some steps being lots of quick transactions for dimension tables, for example, but some fact table calculations, especially large pivots, can make queries run for 40mn up to over an hour (a few of those).

I caught these in the logs at the time of a failure but am unsure what to make of them:

2022-02-21 02:08:16.214 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:29.347 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.371 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.463 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.596 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.687 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.786 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.873 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.976 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.050 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.131 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.240 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.906 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.988 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:33.068 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:34.850 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:10:43.596 EST [836] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-02-21 02:10:43.598 EST [8616] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
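For anyone following along, the JDBC-side knobs discussed in this thread (socketTimeout, and the driver's tcpKeepAlive option as a counterpart to server-side keepalives) are ordinary pgjdbc connection properties. A minimal sketch; the URL, host, and values are illustrative placeholders, not recommendations:

```java
import java.util.Properties;

public class PgConnectProps {
    // Build the pgjdbc connection properties discussed in this thread.
    // Values are placeholders chosen to match the workload described
    // (queries running 40mn to over an hour).
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("tcpKeepAlive", "true");   // arm client-side TCP keepalives
        props.setProperty("socketTimeout", "3600");  // seconds; must exceed the longest query
        return props;
    }

    public static void main(String[] args) {
        Properties p = buildProps();
        // Placeholder URL; pass `p` to DriverManager.getConnection(url, p)
        // in a real client:
        System.out.println("jdbc:postgresql://dbhost:5432/warehouse -> " + p);
    }
}
```

As Justin notes above, a large socketTimeout is really a workaround; it papers over whatever is killing the idle connection rather than fixing it.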