Re: Slow plan choice with prepared query

2022-02-24 Thread MichaelDBA
As per the official PG documentation on PREPARE, it is working as expected.
A custom plan is used for the first five executions; after the fifth, the cost
of the custom plan is compared with that of the generic plan and the cheaper
one is used, which here is the generic plan even though it is not as
performant. Look at the explain output to see the differences between the 5th
execution and the 6th one:  explain (analyze, summary, buffers true) execute
foo_test(null, null, 5, 500, true);
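
To see both plans side by side, plan_cache_mode (available in v12 and later)
can be forced each way and the prepared statement explained again; a minimal
sketch, assuming foo_test is already prepared in the session:

  -- Compare the custom and the generic plan for the same prepared statement.
  SET plan_cache_mode = force_custom_plan;
  EXPLAIN (ANALYZE, BUFFERS) EXECUTE foo_test(null, null, 5, 500, true);

  SET plan_cache_mode = force_generic_plan;
  EXPLAIN (ANALYZE, BUFFERS) EXECUTE foo_test(null, null, 5, 500, true);

  RESET plan_cache_mode;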


It appears the SORT is the problem, together with a mismatch between text and 
integer for base.test_id? --> WHERE  base.test_id = $4
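
If the text-typed test_id is indeed part of the problem, here is a standalone
toy illustration (not the poster's query) of why it matters for the sort and
the cursor comparison: text ordering is lexicographic, not numeric.

  -- Ordering integers as text yields 1, 10, 11, 12, 2, 3, ... rather than 1..12.
  SELECT x::text AS test_id_as_text
  FROM generate_series(1, 12) AS x
  ORDER BY x::text;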


Regards,
Michael Vitale



MichaelDBA wrote on 2/24/2022 1:45 PM:
Dag, if you ain't right!  I can duplicate this on the ones I tested 
with: PG v11 and v14.  Gonna start diving into this myself...


Regards,
Michael Vitale


Mark Saward wrote on 2/23/2022 10:37 PM:

Hi,


I've experienced a situation where the planner seems to make a very 
poor choice with a prepared query after the first five executions. 
Looking at the documentation, I think this happens because it 
switches from a custom plan to a generic one, and doesn't make a good 
choice for the generic one.


Postgres version: running in docker, reports to be 'Debian 
14.1-1.pgdg110+1'


If I force it to use a custom plan via 'set local plan_cache_mode = 
force_custom_plan', then I don't notice any slowdown.  Without it, 
the 6th and onwards calls can take 1 second to 15 seconds each, as 
opposed to about 10ms.
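
For reference, scoping that workaround to a single transaction looks roughly
like this (SET LOCAL reverts at COMMIT/ROLLBACK; plan_cache_mode needs v12+):

  BEGIN;
  SET LOCAL plan_cache_mode = force_custom_plan;
  EXECUTE foo_test(null, null, 5, 500, true);
  COMMIT;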


Since I have a workaround, I don't necessarily need assistance, but 
posting this here in case it's of value as a test case. Here's a test 
case that reliably duplicates this issue for me:




create table test (
  test_id serial primary key,
  data text
);

insert into test (data)
  (select data from (select generate_series(1,1) AS id, md5(random()::text) AS data) x);


prepare foo_test(text, text, int, text, bool) as
SELECT * FROM (
  SELECT
    *,
    count(*) OVER () > $3 AS has_more,
    row_number() OVER ()
  FROM (
    WITH counted AS (
      SELECT count(*) AS total
      FROM (select test_id::text, data
            from test
            where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
              and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
    ), cursor_row AS (
      SELECT base.test_id
      FROM (select test_id::text, data
            from test
            where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
              and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
      WHERE base.test_id = $4
    )
    SELECT counted.*, base.*
    FROM (select test_id::text, data
          from test
          where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
            and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')) base
    LEFT JOIN cursor_row ON true
    LEFT JOIN counted ON true
    WHERE (($4 IS NULL OR cast($5 as bool) IS NULL)
       OR (base.test_id > cursor_row.test_id))
    ORDER BY base.test_id ASC
    LIMIT $3 + 1
  ) xy
  LIMIT $3
) z
ORDER BY row_number ASC;

\timing

execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);

-- This one should be slower:
execute foo_test(null, null, 5, 500, true);
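
While experimenting, deallocating and re-preparing the statement resets the
execution counter, so the five custom-plan executions start over:

  -- Drop the prepared statement; the next five EXECUTEs use custom plans again.
  DEALLOCATE foo_test;
  -- ... then run the PREPARE foo_test(...) statement above and EXECUTE again.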



Re: Advice needed: query performance deteriorates by 2000% within 1 minute

2022-02-24 Thread Tom Lane
Peter Adlersburg  writes:
>  Limit  (cost=0.00..804.97 rows=10 width=22) (actual
> time=23970.845..25588.432 rows=1 loops=1)
>->  Seq Scan on "order"  (cost=0.00..3863.86 rows=48 width=22) (actual
> time=23970.843..25588.429 rows=1 loops=1)
>  Filter: (jsonb_to_tsvector('english'::regconfig, content,
> '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
>  Rows Removed by Filter: 9652
>  Planning Time: 0.430 ms
>  Execution Time: 25588.448 ms

I think the expense here comes from re-executing jsonb_to_tsvector
a lot of times.  By default that's estimated as 100 times more expensive
than a simple function (such as addition), but these results make it
seem like that's an understatement.  You might try something like

alter function jsonb_to_tsvector(regconfig, jsonb, jsonb) cost 1000;

to further discourage the planner from picking this plan shape.

Possibly the cost estimate for ts_match_tq (the function underlying
this variant of @@) needs to be bumped up as well.

(Bear in mind that pg_dump will not propagate such hacks on
system-defined objects, so you'll need a note to reapply
any such changes after dump/reload or pg_upgrade.)
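
For reference, the cost currently assigned to these functions (and hence
whether such a tweak is still in place after a reload) can be read back
from pg_proc:

  -- Check the planner cost assigned to the relevant functions.
  SELECT proname, procost
  FROM pg_proc
  WHERE proname IN ('jsonb_to_tsvector', 'ts_match_tq');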

regards, tom lane




Re: Advice needed: query performance deteriorates by 2000% within 1 minute

2022-02-24 Thread Michael Lewis
You are getting a row estimate of 48 in both cases, so it seems perhaps tied to
the free space map: more heap lookups from the index, to the point where the
planner thinks that doing a sequential scan is less costly.

What is random_page_cost set to? Do you have default autovacuum/analyze
settings?
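
Both can be checked quickly, e.g.:

  SHOW random_page_cost;

  -- Autovacuum/analyze-related settings, and where each value came from.
  SELECT name, setting, source
  FROM pg_settings
  WHERE name LIKE 'autovacuum%' OR name LIKE '%analyze%';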

It is probably worth running "explain (analyze, buffers, verbose)
select..." to get a bit more insight. I expect that the buffers increase
gradually and then it switches to sequential scan at some point.
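
Applied to the query from the original post, that would be roughly:

  EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
  SELECT version, content
  FROM orderstore."order"
  WHERE jsonb_to_tsvector('english', content, '["all"]')
        @@ websearch_to_tsquery('english', '1.20709841')
  LIMIT 10;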


Perhaps not directly related, but it might be interesting to look at: with
indexes on expressions, you get custom stats. It might be worth taking a look
at those and seeing if they give anything approaching proper estimates.

eg.
select * from pg_class where relname =
'idx_customer_phone_numbers_phone_number_gist';
select * from pg_statistic where starelid =
'idx_customer_phone_numbers_phone_number_gist'::regclass;
select * from pg_stats where tablename =
'idx_customer_phone_numbers_phone_number_gist';

JSONB is a bit painful to use from a query planning perspective. Values in
a jsonb column are fine for me in a select clause, but not in ON or WHERE,
with very rare exceptions. Though maybe that's not so applicable when you are
doing full text search.


Advice needed: query performance deteriorates by 2000% within 1 minute

2022-02-24 Thread Peter Adlersburg
Dear fellow DBAs,

I am seeking some guidance with the following case that our developers
have thrown at me, and I apologize in advance for this lengthy mail ...

$> postgres --version
postgres (PostgreSQL) 13.6


We are dealing with the following issue:


select version, content from orderstore.order
WHERE jsonb_to_tsvector('english', content, '["all"]') @@
websearch_to_tsquery('english', '1.20709841') limit 10 ;
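
For context, the full text search is backed by a GIN expression index (the
idx_fulltext_content seen in the plans below); its definition here is
reconstructed from the Index Cond rather than copied from the actual DDL,
so treat it as approximate:

  CREATE INDEX idx_fulltext_content
      ON orderstore."order"
      USING gin (jsonb_to_tsvector('english', content, '["all"]'));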


The devs told me that this query normally finishes within a reasonable
amount of time (<1 sec), but every day, all of a sudden, performance
worsens to execution times > 20 sec.

Furthermore I was told:

"When we change the query to 'limit 100' it runs fast again"
"When we execute a 'vacuum orderstore.order' everything becomes good again
- but that only lasts for a few hours"

So I scheduled a little script to be executed every minute, containing three
EXPLAINs (a rough sketch of the script follows the list):

1 query with limit 10
1 query with limit 100
1 query with the limit clause omitted
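
A rough sketch of that script; the database name (mydb) and the psql/cron
invocation in the comment are assumptions, and the first SELECT produces the
"LOG:" line shown further down:

  -- run e.g. every minute via cron: psql -d mydb -f check_order_query.sql >> explain.log
  SELECT now(), count(*) FROM orderstore."order";   -- the "LOG:" line

  EXPLAIN (ANALYZE)
  SELECT version, content FROM orderstore."order"
  WHERE jsonb_to_tsvector('english', content, '["all"]')
        @@ websearch_to_tsquery('english', '1.20709841') LIMIT 10;

  EXPLAIN (ANALYZE)
  SELECT version, content FROM orderstore."order"
  WHERE jsonb_to_tsvector('english', content, '["all"]')
        @@ websearch_to_tsquery('english', '1.20709841') LIMIT 100;

  EXPLAIN (ANALYZE)
  SELECT version, content FROM orderstore."order"
  WHERE jsonb_to_tsvector('english', content, '["all"]')
        @@ websearch_to_tsquery('english', '1.20709841');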

And here's a quick grep of the result after a few hours:

 ...

 Execution Time: 1.413 ms   <= limit 10
 Execution Time: 0.389 ms   <= limit 100
 Execution Time: 0.297 ms   <= limit clause omitted
 Execution Time: 1.456 ms
 Execution Time: 0.396 ms
 Execution Time: 0.302 ms
 Execution Time: 1.412 ms
 Execution Time: 0.428 ms
 Execution Time: 0.255 ms
 Execution Time: 1.404 ms
 Execution Time: 0.403 ms
 Execution Time: 0.258 ms
 Execution Time: 25588.448 ms<= limit 10
 Execution Time: 0.919 ms<= limit 100
 Execution Time: 0.453 ms<= limit clause omitted
 Execution Time: 25657.524 ms
 Execution Time: 0.965 ms
 Execution Time: 0.452 ms
 Execution Time: 25843.139 ms
 Execution Time: 0.959 ms
 Execution Time: 0.446 ms
 Execution Time: 25631.389 ms
 Execution Time: 0.946 ms
 Execution Time: 0.447 ms
 Execution Time: 25452.764 ms
 Execution Time: 0.937 ms
 Execution Time: 0.444 ms
  <= here I manually vacuumed the table
 Execution Time: 0.071 ms
 Execution Time: 0.021 ms
 Execution Time: 0.015 ms
 Execution Time: 0.072 ms
 Execution Time: 0.023 ms
 Execution Time: 0.017 ms
 Execution Time: 0.064 ms
 Execution Time: 0.021 ms
 Execution Time: 0.015 ms
 Execution Time: 0.063 ms
 Execution Time: 0.020 ms
 Execution Time: 0.015 ms

 ...


Turned out the devs were right with their complaints.

The execution plan changed within one minute from using an index to a
sequential scan.

Here are the details (In the "LOG:"-line I select the current timestamp and
the row count of the table):


*** the last 'good' run: ***


 LOG: | 2022-02-24 13:47:01.747416+01 |  9653

 LIMIT 10:

 Limit  (cost=752.37..789.30 rows=10 width=22) (actual time=1.388..1.390
rows=1 loops=1)
   ->  Bitmap Heap Scan on "order"  (cost=752.37..929.63 rows=48 width=22)
(actual time=1.387..1.388 rows=1 loops=1)
 Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Heap Blocks: exact=1
 ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..752.36
rows=48 width=0) (actual time=1.374..1.374 rows=1 loops=1)
   Index Cond: (jsonb_to_tsvector('english'::regconfig,
content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Planning Time: 0.401 ms
 Execution Time: 1.404 ms

 LIMIT 100:

 Limit  (cost=752.37..929.63 rows=48 width=22) (actual time=0.391..0.391
rows=1 loops=1)
   ->  Bitmap Heap Scan on "order"  (cost=752.37..929.63 rows=48 width=22)
(actual time=0.390..0.391 rows=1 loops=1)
 Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Heap Blocks: exact=1
 ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..752.36
rows=48 width=0) (actual time=0.387..0.387 rows=1 loops=1)
   Index Cond: (jsonb_to_tsvector('english'::regconfig,
content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Planning Time: 0.136 ms
 Execution Time: 0.403 ms

 NO LIMIT:

 Bitmap Heap Scan on "order"  (cost=752.37..929.63 rows=48 width=22)
(actual time=0.248..0.249 rows=1 loops=1)
   Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..752.36
rows=48 width=0) (actual time=0.245..0.245 rows=1 loops=1)
 Index Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Planning Time: 0.107 ms
 Execution Time: 0.258 ms


 *** the first 'bad' run (one minute later): ***

 LOG: | 2022-02-24 13:48:01.840362+01 |  9653

 LIMIT 10:

 Limit  (cost=0.00..804.97 rows=10 width=22) (actual
time=23970.845..25588.432 rows=1 loops=1)
   ->  Seq Scan on "order"  (cost=0.00..3863.86 rows=48 width=22) (actual
time=23970.843..25588.429 rows

Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-24 Thread Ranier Vilela
On Thu, Feb 24, 2022 at 09:59, Justin Pryzby 
wrote:

> On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
> > I can't understand why you are still using 13.4.
> > [1] There is a long discussion about the issue with 13.4; the fix was for a
> > DLL bottleneck.
> >
> > Why not use 13.6?
>
> That other problem (and its fix) were in the windows build environment,
> and not
> an issue in some postgres version.

Yeah, correct.
But I think it was made very clear in the other thread that version 13.4,
on Windows, may have a slowdown because of the DLL problem.
So it would be better to use the latest available version,
which has this specific fix and many others.

regards,
Ranier Vilela


Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-24 Thread Justin Pryzby
On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
> I can't understand why you are still using 13.4.
> [1] There is a long discussion about the issue with 13.4; the fix was for a
> DLL bottleneck.
> 
> Why not use 13.6?

That other problem (and its fix) were in the windows build environment, and not
an issue in some postgres version.  It's still a good idea to schedule an
update.

-- 
Justin




Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-24 Thread Ranier Vilela
On Wed, Feb 23, 2022 at 21:47, l...@laurent-hasson.com <
l...@laurent-hasson.com> wrote:

>
>
>>  -Original Message-
>>  From: l...@laurent-hasson.com 
>>  Sent: Saturday, December 4, 2021 14:18
>>  To: Justin Pryzby 
>>  Cc: pgsql-performa...@postgresql.org
>>  Subject: RE: An I/O error occurred while sending to the backend (PG
> 13.4)
>>
>>
>> >  -Original Message-
>> >  From: Justin Pryzby 
>> >  Sent: Saturday, December 4, 2021 12:59
>> >  To: l...@laurent-hasson.com
>> >  Cc: pgsql-performa...@postgresql.org
>> >  Subject: Re: An I/O error occurred while sending to the
> backend (PG
>> >  13.4)
>> >
>> >  On Sat, Dec 04, 2021 at 05:32:10PM +,
> l...@laurent-hasson.com
>> >  wrote:
>> >  > I have a data warehouse with a fairly complex ETL process
> that has
>> >  been running for years now across PG 9.6, 11.2 and now 13.4
> for the
>> >  past couple of months. I have been getting the error "An I/O
> error
>> >  occurred while sending to the backend" quite often under load
> in 13.4
>> >  which I never used to get on 11.2. I have applied some tricks,
>>  particularly
>> >  with the socketTimeout JDBC configuration.
>> >  >
>> >  > So my first question is whether anyone has any idea why this
> is
>> >  happening? My hardware and general PG configuration have not
>> >  changed between 11.2 and 13.4 and I NEVER experienced this on
> 11.2
>>  in
>> >  about 2y of production.
>> >  >
>> >  > Second, I have one stored procedure that takes a very long
> time to
>>  run
>> >  (40mn more or less), so obviously, I'd need to set
> socketTimeout to
>> >  something like 1h in order to call it and not timeout. That
> doesn't seem
>> >  reasonable?
>> >
>> >  Is the DB server local or remote (TCP/IP) to the client?
>> >
>> >  Could you collect the corresponding postgres query logs when
> this
>> >  happens ?
>> >
>> >  It'd be nice to see a network trace for this too.  Using
> tcpdump or
>> >  wireshark.
>> >  Preferably from the client side.
>> >
>> >  FWIW, I suspect the JDBC socketTimeout is a bad workaround.
>> >
>> >  --
>> >  Justin
>>
>>  It's a remote server, but all on a local network. Network
> performance is I
>>  am sure not the issue. Also, the system is on Windows Server. What
> are you
>>  expecting to see out of a tcpdump? I'll try to get PG logs on the
> failing query.
>>
>>  Thank you,
>>  Laurent.
>>
>>
>>
>>
>
> Hello Justin,
>
> It has been ages! The issue has been happening a bit more often recently,
> as much as once every 10 days or so. As a reminder, the setup is Postgres
> 13.4 on Windows Server with 16 cores and 64GB memory.

I can't understand why you are still using 13.4.
[1] There is a long discussion about the issue with 13.4; the fix was for a
DLL bottleneck.

Why not use 13.6?

regards,
Ranier Vilela

[1]
https://www.postgresql.org/message-id/MN2PR15MB2560BBB3EC911D973C2FE3F885A89%40MN2PR15MB2560.namprd15.prod.outlook.com