Re: Slow plan choice with prepared query
As per the official PostgreSQL documentation on PREPARE, it is working as expected: a custom plan is used for the first five executions, and from the sixth onward the cost of the custom plan is compared against the cost of the generic plan, and the cheaper one is used. Here that is the generic plan, even though it is not as performant. Look at the explain output to see the differences between the 5th iteration and the 6th one:

    explain (analyze, summary, buffers true) execute foo_test(null, null, 5, 500, true);

It appears the Sort is the problem, along with a mismatch between text and integer for base.test_id:

    WHERE base.test_id = $4

Regards,
Michael Vitale

MichaelDBA wrote on 2/24/2022 1:45 PM:

Dag, if you ain't right! I can duplicate this on the ones I tested with: PG v11 and v14. Gonna start diving into this myself...

Regards,
Michael Vitale

Mark Saward wrote on 2/23/2022 10:37 PM:

Hi,

I've experienced a situation where the planner seems to make a very poor choice with a prepared query after the first five executions. Looking at the documentation, I think this happens because it switches from a custom plan to a generic one, and doesn't make a good choice for the generic one.

Postgres version: running in docker, reports to be 'Debian 14.1-1.pgdg110+1'

If I force it to use a custom plan via 'set local plan_cache_mode = force_custom_plan', then I don't notice any slowdown. Without it, the 6th and onwards calls can take 1 second to 15 seconds each, as opposed to about 10ms.

Since I have a workaround, I don't necessarily need assistance, but posting this here in case it's of value as a test case.
Here's a test case that reliably duplicates this issue for me:

    create table test (
        test_id serial primary key,
        data text
    );

    insert into test (data)
        (select data from (select generate_series(1,1) AS id, md5(random()::text) AS data) x);

    prepare foo_test(text, text, int, text, bool) as
    SELECT * FROM (
      SELECT *, count(*) OVER () > $3 AS has_more, row_number() OVER ()
      FROM (
        WITH counted AS (
          SELECT count(*) AS total
          FROM (select test_id::text, data from test
                where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
                  and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
        ), cursor_row AS (
          SELECT base.test_id
          FROM (select test_id::text, data from test
                where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
                  and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
          WHERE base.test_id = $4
        )
        SELECT counted.*, base.*
        FROM (select test_id::text, data from test
              where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
                and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
        LEFT JOIN cursor_row ON true
        LEFT JOIN counted ON true
        WHERE (( $4 IS NULL OR cast($5 as bool) IS NULL )
               OR ( (base.test_id) > (cursor_row.test_id) ))
        ORDER BY base.test_id ASC
        LIMIT $3 + 1
      ) xy
      LIMIT $3
    ) z ORDER BY row_number ASC;

    \timing
    execute foo_test(null, null, 5, 500, true);
    execute foo_test(null, null, 5, 500, true);
    execute foo_test(null, null, 5, 500, true);
    execute foo_test(null, null, 5, 500, true);
    execute foo_test(null, null, 5, 500, true);
    -- This one should be slower:
    execute foo_test(null, null, 5, 500, true);
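For reference, the workaround mentioned above can be scoped to a single transaction with SET LOCAL, and the two plan shapes can be compared side by side. A minimal sketch (note that plan_cache_mode was introduced in PostgreSQL 12, so it is not available on the v11 instance mentioned earlier):

```sql
-- Force a fresh custom plan on every EXECUTE, for this transaction only:
begin;
set local plan_cache_mode = force_custom_plan;
execute foo_test(null, null, 5, 500, true);
commit;

-- Or compare the generic and custom plan shapes directly in one session:
set plan_cache_mode = force_generic_plan;
explain (analyze, buffers) execute foo_test(null, null, 5, 500, true);
set plan_cache_mode = force_custom_plan;
explain (analyze, buffers) execute foo_test(null, null, 5, 500, true);
```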
Re: Slow plan choice with prepared query
Dag, if you ain't right! I can duplicate this on the ones I tested with: PG v11 and v14. Gonna start diving into this myself...

Regards,
Michael Vitale

Mark Saward wrote on 2/23/2022 10:37 PM:

Hi,

I've experienced a situation where the planner seems to make a very poor choice with a prepared query after the first five executions. Looking at the documentation, I think this happens because it switches from a custom plan to a generic one, and doesn't make a good choice for the generic one.

Postgres version: running in docker, reports to be 'Debian 14.1-1.pgdg110+1'

If I force it to use a custom plan via 'set local plan_cache_mode = force_custom_plan', then I don't notice any slowdown. Without it, the 6th and onwards calls can take 1 second to 15 seconds each, as opposed to about 10ms.

Since I have a workaround, I don't necessarily need assistance, but posting this here in case it's of value as a test case.

Here's a test case that reliably duplicates this issue for me: [test case identical to the one quoted in full above]
Re: Advice needed: query performance deteriorates by 2000% within 1 minute
Peter Adlersburg writes:
> Limit  (cost=0.00..804.97 rows=10 width=22) (actual time=23970.845..25588.432 rows=1 loops=1)
>   ->  Seq Scan on "order"  (cost=0.00..3863.86 rows=48 width=22) (actual time=23970.843..25588.429 rows=1 loops=1)
>         Filter: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
>         Rows Removed by Filter: 9652
> Planning Time: 0.430 ms
> Execution Time: 25588.448 ms

I think the expense here comes from re-executing jsonb_to_tsvector a lot of times. By default that's estimated as 100 times more expensive than a simple function (such as addition), but these results make it seem like that's an understatement. You might try something like

    alter function jsonb_to_tsvector(regconfig, jsonb, jsonb) cost 1000;

to further discourage the planner from picking this plan shape. Possibly the cost estimate for ts_match_tq (the function underlying this variant of @@) needs to be bumped up as well.

(Bear in mind that pg_dump will not propagate such hacks on system-defined objects, so you'll need a note to reapply any such changes after dump/reload or pg_upgrade.)

regards, tom lane
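As a sketch of the suggestion above: the planner's current per-call estimate lives in pg_proc.procost and can be inspected before overriding it (the value 1000 is the example figure from the message, not a tuned recommendation):

```sql
-- Inspect the current per-call cost estimates (the default for
-- jsonb_to_tsvector is 100, per the note above):
select proname, procost
from pg_proc
where proname in ('jsonb_to_tsvector', 'ts_match_tq');

-- Make repeated per-row evaluation in a seqscan filter look expensive:
alter function jsonb_to_tsvector(regconfig, jsonb, jsonb) cost 1000;
```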
Re: Advice needed: query performance deteriorates by 2000% within 1 minute
You are getting row estimate 48 in both cases, so it seems perhaps tied to the free space map: more heap lookups from the index, to the point where the planner thinks that doing a sequential scan is less costly. What is random_page_cost set to? Do you have default autovacuum/analyze settings?

It is probably worth running "explain (analyze, buffers, verbose) select ..." to get a bit more insight. I expect that the buffers increase gradually and then it switches to a sequential scan at some point.

Perhaps not directly related, but might be interesting to look at: with indexes on expressions, you get custom stats. It might be worth taking a look at those and seeing if they give anything approaching proper estimates, e.g.

    select * from pg_class where relname = 'idx_customer_phone_numbers_phone_number_gist';
    select * from pg_statistic where starelid = 'idx_customer_phone_numbers_phone_number_gist'::regclass;
    select * from pg_stats where tablename = 'idx_customer_phone_numbers_phone_number_gist';

JSONB is a bit painful to use from a query-planning perspective. Values in a jsonb column are fine for me in a select clause, but not in ON or WHERE, with very rare exceptions. Though maybe that's not so applicable when you are doing full text search.
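The settings and vacuum history asked about above can be checked with a few catalog queries; a quick sketch (the orderstore."order" table name is taken from this thread):

```sql
show random_page_cost;

-- Current autovacuum/analyze settings:
select name, setting
from pg_settings
where name like 'autovacuum%';

-- When the table was last (auto)vacuumed and analyzed:
select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_user_tables
where schemaname = 'orderstore' and relname = 'order';
```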
Advice needed: query performance deteriorates by 2000% within 1 minute
Dear fellow DBAs,

I am seeking some guidance with the following case that our developers have thrown at me, and I apologize in advance for this lengthy mail ...

    $> postgres --version
    postgres (PostgreSQL) 13.6

We are dealing with the following issue:

    select version, content
    from orderstore.order
    WHERE jsonb_to_tsvector('english', content, '["all"]') @@ websearch_to_tsquery('english', '1.20709841')
    limit 10;

The devs told me that this query normally finishes within a reasonable amount of time (< 1 sec) but every day, all of a sudden, performance worsens to execution times > 20 sec. Furthermore I was told:

"When we change the query to 'limit 100' it runs fast again."

"When we execute a 'vacuum orderstore.order' everything becomes good again - but that only lasts for a few hours."

So I scheduled a little script to be executed every minute which contains 3 explains:

    1 query with limit 10
    1 query with limit 100
    1 query with the limit clause omitted

And here's a quick grep of the result after a few hours:

...
    Execution Time: 1.413 ms     <= limit 10
    Execution Time: 0.389 ms     <= limit 100
    Execution Time: 0.297 ms     <= limit clause omitted
    Execution Time: 1.456 ms
    Execution Time: 0.396 ms
    Execution Time: 0.302 ms
    Execution Time: 1.412 ms
    Execution Time: 0.428 ms
    Execution Time: 0.255 ms
    Execution Time: 1.404 ms
    Execution Time: 0.403 ms
    Execution Time: 0.258 ms
    Execution Time: 25588.448 ms <= limit 10
    Execution Time: 0.919 ms     <= limit 100
    Execution Time: 0.453 ms     <= limit clause omitted
    Execution Time: 25657.524 ms
    Execution Time: 0.965 ms
    Execution Time: 0.452 ms
    Execution Time: 25843.139 ms
    Execution Time: 0.959 ms
    Execution Time: 0.446 ms
    Execution Time: 25631.389 ms
    Execution Time: 0.946 ms
    Execution Time: 0.447 ms
    Execution Time: 25452.764 ms
    Execution Time: 0.937 ms
    Execution Time: 0.444 ms     <= here I manually vacuumed the table
    Execution Time: 0.071 ms
    Execution Time: 0.021 ms
    Execution Time: 0.015 ms
    Execution Time: 0.072 ms
    Execution Time: 0.023 ms
    Execution Time: 0.017 ms
    Execution Time: 0.064 ms
    Execution Time: 0.021 ms
    Execution Time: 0.015 ms
    Execution Time: 0.063 ms
    Execution Time: 0.020 ms
    Execution Time: 0.015 ms
    ...

Turned out the devs were right with their complaints.
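A per-minute monitoring script like the one described above could look roughly like this (a sketch; the file name and cron wiring are assumptions, and "order" is quoted since it is a reserved word):

```sql
-- monitor.sql, run once a minute, e.g. via cron:
--   psql -d mydb -f monitor.sql >> monitor.log

-- LOG line: current timestamp and row count of the table
select now(), count(*) from orderstore."order";

explain (analyze)
select version, content from orderstore."order"
where jsonb_to_tsvector('english', content, '["all"]') @@ websearch_to_tsquery('english', '1.20709841')
limit 10;

explain (analyze)
select version, content from orderstore."order"
where jsonb_to_tsvector('english', content, '["all"]') @@ websearch_to_tsquery('english', '1.20709841')
limit 100;

explain (analyze)
select version, content from orderstore."order"
where jsonb_to_tsvector('english', content, '["all"]') @@ websearch_to_tsquery('english', '1.20709841');
```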
The execution plan changed within one minute from using an index to a sequential scan. Here are the details (in the "LOG:" line I select the current timestamp and the row count of the table):

*** the last 'good' run: ***

LOG: | 2022-02-24 13:47:01.747416+01 | 9653

LIMIT 10:

    Limit  (cost=752.37..789.30 rows=10 width=22) (actual time=1.388..1.390 rows=1 loops=1)
      ->  Bitmap Heap Scan on "order"  (cost=752.37..929.63 rows=48 width=22) (actual time=1.387..1.388 rows=1 loops=1)
            Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
            Heap Blocks: exact=1
            ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..752.36 rows=48 width=0) (actual time=1.374..1.374 rows=1 loops=1)
                  Index Cond: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
    Planning Time: 0.401 ms
    Execution Time: 1.404 ms

LIMIT 100:

    Limit  (cost=752.37..929.63 rows=48 width=22) (actual time=0.391..0.391 rows=1 loops=1)
      ->  Bitmap Heap Scan on "order"  (cost=752.37..929.63 rows=48 width=22) (actual time=0.390..0.391 rows=1 loops=1)
            Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
            Heap Blocks: exact=1
            ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..752.36 rows=48 width=0) (actual time=0.387..0.387 rows=1 loops=1)
                  Index Cond: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
    Planning Time: 0.136 ms
    Execution Time: 0.403 ms

NO LIMIT:

    Bitmap Heap Scan on "order"  (cost=752.37..929.63 rows=48 width=22) (actual time=0.248..0.249 rows=1 loops=1)
      Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
      Heap Blocks: exact=1
      ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..752.36 rows=48 width=0) (actual time=0.245..0.245 rows=1 loops=1)
            Index Cond: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
    Planning Time: 0.107 ms
    Execution Time: 0.258 ms

*** the first 'bad' run (one minute later): ***

LOG: | 2022-02-24 13:48:01.840362+01 | 9653

LIMIT 10:

    Limit  (cost=0.00..804.97 rows=10 width=22) (actual time=23970.845..25588.432 rows=1 loops=1)
      ->  Seq Scan on "order"  (cost=0.00..3863.86 rows=48 width=22) (actual time=23970.843..25588.429 rows=1 loops=1)
            Filter: (jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
            Rows Removed by Filter: 9652
    Planning Time: 0.430 ms
    Execution Time: 25588.448 ms
Re: An I/O error occurred while sending to the backend (PG 13.4)
On Thu, Feb 24, 2022 at 09:59, Justin Pryzby wrote:
> On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
> > I can't understand why you are still using 13.4? [1] There is a long
> > discussion about the issue with 13.4; the project was made to fix a DLL
> > bottleneck.
> >
> > Why do you not use 13.6?
>
> That other problem (and its fix) were in the Windows build environment,
> and not an issue in some Postgres version.

Yeah, correct. But I think it was made very clear in the other thread that version 13.4, on Windows, may have a slowdown because of the DLL problem. So it would be better to use the latest available version, which has this specific fix and many others.

regards, Ranier Vilela
Re: An I/O error occurred while sending to the backend (PG 13.4)
On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
> I can't understand why you are still using 13.4? [1] There is a long
> discussion about the issue with 13.4; the project was made to fix a DLL
> bottleneck.
>
> Why do you not use 13.6?

That other problem (and its fix) were in the Windows build environment, and not an issue in some Postgres version. It's still a good idea to schedule an update.

-- Justin
Re: An I/O error occurred while sending to the backend (PG 13.4)
On Wed, Feb 23, 2022 at 21:47, l...@laurent-hasson.com wrote:
>
> >> -----Original Message-----
> >> From: l...@laurent-hasson.com
> >> Sent: Saturday, December 4, 2021 14:18
> >> To: Justin Pryzby
> >> Cc: pgsql-performa...@postgresql.org
> >> Subject: RE: An I/O error occurred while sending to the backend (PG 13.4)
> >>
> >> > -----Original Message-----
> >> > From: Justin Pryzby
> >> > Sent: Saturday, December 4, 2021 12:59
> >> > To: l...@laurent-hasson.com
> >> > Cc: pgsql-performa...@postgresql.org
> >> > Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
> >> >
> >> > On Sat, Dec 04, 2021 at 05:32:10PM +0000, l...@laurent-hasson.com wrote:
> >> > > I have a data warehouse with a fairly complex ETL process that has
> >> > > been running for years now across PG 9.6, 11.2 and now 13.4 for the
> >> > > past couple of months. I have been getting the error "An I/O error
> >> > > occurred while sending to the backend" quite often under load in 13.4,
> >> > > which I never used to get on 11.2. I have applied some tricks,
> >> > > particularly with the socketTimeout JDBC configuration.
> >> > >
> >> > > So my first question is whether anyone has any idea why this is
> >> > > happening? My hardware and general PG configuration have not changed
> >> > > between 11.2 and 13.4 and I NEVER experienced this on 11.2 in about
> >> > > 2y of production.
> >> > >
> >> > > Second, I have one stored procedure that takes a very long time to
> >> > > run (40mn more or less), so obviously, I'd need to set socketTimeout
> >> > > to something like 1h in order to call it and not timeout. That
> >> > > doesn't seem reasonable?
> >> >
> >> > Is the DB server local or remote (TCP/IP) to the client?
> >> >
> >> > Could you collect the corresponding postgres query logs when this
> >> > happens?
> >> >
> >> > It'd be nice to see a network trace for this too. Using tcpdump or
> >> > wireshark. Preferably from the client side.
> >> >
> >> > FWIW, I suspect the JDBC socketTimeout is a bad workaround.
> >> >
> >> > --
> >> > Justin
> >>
> >> It's a remote server, but all on a local network. Network performance is,
> >> I am sure, not the issue. Also, the system is on Windows Server. What are
> >> you expecting to see out of a tcpdump? I'll try to get PG logs on the
> >> failing query.
> >>
> >> Thank you,
> >> Laurent.
>
> Hello Justin,
>
> It has been ages! The issue has been happening a bit more often recently,
> as much as once every 10 days or so. As a reminder, the setup is Postgres
> 13.4 on Windows Server with 16 cores and 64GB memory.

I can't understand why you are still using 13.4? [1] There is a long discussion about the issue with 13.4; the project was made to fix a DLL bottleneck.

Why do you not use 13.6?

regards, Ranier Vilela

[1] https://www.postgresql.org/message-id/MN2PR15MB2560BBB3EC911D973C2FE3F885A89%40MN2PR15MB2560.namprd15.prod.outlook.com