Postgresql equal join on function with columns not use index

2023-06-12 Thread James Pang (chaolpan)
Hi,
   We migrated from Oracle to PostgreSQL 14.8, and one SQL statement has regressed: 
it runs in 5800 milliseconds in PostgreSQL v14.8, but the same SQL completed in 
several hundred milliseconds in Oracle.
   With multiple table joins, if the join condition is 
tablea.column1 = tableb.column1, the optimizer uses the index to filter data in 
nested loops. But if the condition is tablea.column1 = regexp_replace(tableb.column1, ...), 
the optimizer cannot use the index on tablea.column1; it does a table scan, and the 
nested loop produces a lot of rows, then applies 
tablea.column1 = regexp_replace(tableb.column1, ...) as a filter. As a workaround 
we created a view and joined on tablea.column1 = view.column1, which works.
   Is this expected? Details below.

   SELECT DISTINCT a.xxx, b.xxx as TOLLFREE FROM tableA a, tableB b
WHERE a.column1 = regexp_replace(b.column1,'[^0-9]','') 
AND b.column2 = $1 AND b.column3= $2
AND NOT EXISTS (SELECT 1 FROM tableC c
WHERE c.xxid = b.xxid
AND c.xxtype = case when b.col4 = 1 then 
'TollFree' else 'Toll' end
AND c.xxid = $3)

 Unique  (cost=423374.60..423377.87 rows=436 width=21) (actual time=6070.963..6071.054 rows=395 loops=1)
   Buffers: shared hit=148
   ->  Sort  (cost=423374.60..423375.69 rows=436 width=21) (actual time=6070.963..6070.992 rows=397 loops=1)
         Sort Key: a.xx, b.xx
         Sort Method: quicksort  Memory: 56kB
         Buffers: shared hit=148
         ->  Nested Loop  (cost=0.69..423355.48 rows=436 width=21) (actual time=120.338..6070.669 rows=397 loops=1)
               Join Filter: ((a.column1)::text = regexp_replace((b.column1)::text, '[^0-9]'::text, ''::text))
               ->  Seq Scan on tableA a  (cost=0.00..161.12 rows=7712 width=25) (actual time=0.022..1.380 rows=7712 loops=1)
                     Buffers: shared hit=84
               ->  Materialize  (cost=0.69..153.12 rows=207 width=21) (actual time=0.000..0.011 rows=196 loops=7712)
                     Buffers: shared hit=58
                     ->  Nested Loop Anti Join  (cost=0.69..152.09 rows=207 width=21) (actual time=0.069..0.278 rows=196 loops=1)
                           Join Filter: ((c.xxid = b.xxid) AND ((c.xxtype)::text = CASE WHEN (b.column2 = 1) THEN 'aaa'::text ELSE 'bbb'::text END))
                           Buffers: shared hit=58
                           ->  Index Scan using idx_xxx on tableB b  (cost=0.42..146.55 rows=207 width=29) (actual time=0.047..0.207 rows=196 loops=1)
                                 Index Cond: ((column3 = 40957) AND (column2 = 1))
                                 Buffers: shared hit=56
                           ->  Materialize  (cost=0.27..1.40 rows=1 width=15) (actual time=0.000..0.000 rows=0 loops=196)
                                 Buffers: shared hit=2
                                 ->  Index Only Scan using pk_ on tableC c  (cost=0.27..1.39 rows=1 width=15) (actual time=0.020..0.020 rows=0 loops=1)
                                       Index Cond: (xxxid = 12407262)
                                       Heap Fetches: 0
                                       Buffers: shared hit=2

If we create a view, the SQL completes in a few milliseconds:
CREATE VIEW tableBREGXP AS (SELECT 
xx,column2,column3,xxid,regexp_replace(column1,'[^0-9]','') AS column1 FROM 
tableB);
   SELECT DISTINCT a.xxx, b.xxx as TOLLFREE FROM tableA a, tableBREGXP b   <<< replace tableB with the view name
WHERE a.column1 = b.column1 AND b.column2 = $1 AND b.column3 = $2   <<< b.column1 replaces regexp_replace((b.column1)::text, '[^0-9]'::text, ''::text)
AND NOT EXISTS (SELECT 1 FROM tableC c
WHERE c.xxid = b.xxid
AND c.xxtype = case when b.col4 = 1 then 
'TollFree' else 'Toll' end
AND c.xxid = $3)

HashAggregate  (cost=408.19..412.76 rows=457 width=21) (actual time=4.524..4.644 rows=395 loops=1)
   Group Key: a.xxx, b.xx
   Batches: 1  Memory Usage: 61kB
   Buffers: shared hit=693
   ->  Nested Loop  (cost=0.97..405.90 rows=457 width=21) (actual time=0.154..4.205 rows=397 loops=1)
         Buffers: shared hit=693
         ->  Nested Loop Anti Join  (cost=0.69..214.97 rows=217 width=40) (actual time=0.137..2.877 rows=196 loops=1)
               Join Filter: ((c.xxid = b.xxid) AND ((c.xxtype)::text = CASE WHEN (b.column2 = 1) THEN 'TollFree'::text ELSE 'Toll'::text END))
               Buffers: shared hit=55
               ->  Index Scan using idx_xxx on b  (cost=0.42..207.06 rows=217 width=64) (actual time=0.123..2.725 rows=196 loops=1)
                     Index Cond: ((column2 = 40957) AND (column3 = 1))
                     Buffers: shared hi

Re: Postgresql equal join on function with columns not use index

2023-06-12 Thread Tom Lane
"James Pang (chaolpan)"  writes:
>We migrate from Oracle to Postgresql14.8, one SQL has regression in 
> Postgres  run in 5800 milliseconds in Postgresql v14.8,  but the same SQL got 
> done in several hundred milliseconds in Oracle database.
>With multiple table JOINs, if the join condition is  
> tablea.column1=tableb.column1, optimizer will use the index to filter data in 
> nest loops, but if  tablea.column1=regexp_replace(tableb.column1),
> Optimizer will not be able to use the index on tablea.column1, then it do a 
> table scan and nestloop to produce a lot rows then use 
> tablea.column1=regexp_replace(tableb.column1) as a filter.  As a 
> workaround we create a view then use tablea.column1=view.column1 that works.
>  Is it expected ?details as below.

It's impossible to comment on this usefully with such a fragmentary
description of the problem.  Please send a complete, self-contained
test case if you want anybody to look at it carefully.

https://wiki.postgresql.org/wiki/Slow_Query_Questions

regards, tom lane




RE: Postgresql equal join on function with columns not use index

2023-06-12 Thread James Pang (chaolpan)
Hi,
   It looks like the function "regexp_replace" being volatile (and strict = false) 
makes the difference. Our application role has the default 
search_path = oracle,$user,public,pg_catalog.
 =# SELECT oid, proname, pronamespace::regnamespace, prosecdef, proisstrict, provolatile 
    FROM pg_proc WHERE proname = 'regexp_replace' ORDER BY oid;
  oid  |    proname     | pronamespace | prosecdef | proisstrict | provolatile
-------+----------------+--------------+-----------+-------------+-------------
  2284 | regexp_replace | pg_catalog   | f         | t           | i
  2285 | regexp_replace | pg_catalog   | f         | t           | i
 17095 | regexp_replace | oracle       | f         | f           | v
 17096 | regexp_replace | oracle       | f         | f           | v
 17097 | regexp_replace | oracle       | f         | f           | v
 17098 | regexp_replace | oracle       | f         | f           | v

-- With the default search_path it uses orafce's oracle.regexp_replace function:
SELECT a.phonenumber, ... FROM tableA a, tableB b WHERE 
a.phonenumber = oracle.regexp_replace(b.PHONENUMBER,'[^0-9]','')
  -- the index on a.phonenumber is not used.

Switching to pg_catalog.regexp_replace(b.PHONENUMBER,'[^0-9]',''), 
the index on a.phonenumber is used.
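
For what it's worth, two possible fixes, sketched under the assumption that the orafce 
"oracle" schema only shadows pg_catalog via search_path (the function name digits_only 
is invented for illustration):

```sql
-- 1. Schema-qualify the built-in, immutable regexp_replace so the planner
--    can use the comparison as an index condition on a.phonenumber:
SELECT a.phonenumber
  FROM tableA a, tableB b
 WHERE a.phonenumber = pg_catalog.regexp_replace(b.phonenumber, '[^0-9]', '');

-- 2. Or wrap the call in an explicitly IMMUTABLE SQL function, if the
--    oracle schema must stay first in search_path:
CREATE FUNCTION digits_only(text) RETURNS text
LANGUAGE sql IMMUTABLE STRICT
AS $$ SELECT pg_catalog.regexp_replace($1, '[^0-9]', '') $$;
```

The planner can only treat f(b.column1) as a stable comparison value for an index 
probe on a.column1 when f is not volatile, which is why the function's provolatile 
marking matters here.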

Thanks,

James Pang 

-Original Message-
From: Tom Lane  
Sent: Monday, June 12, 2023 9:19 PM
To: James Pang (chaolpan) 
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Postgresql equal join on function with columns not use index





Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread benoit
Hello


I have a database with a few 60 GB tables. Rows are requested with multiple 
ANY or IN operators. I have not found an easy way to make the DB use the 
indexes efficiently: I often hit an index, but still see a spike of ~200 MB of IO or disk read.


I am using version 13, but will move to 14 soon.


I wrote a reproduction script on version 14 with plans included. 
https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d


I also have plans on a snapshot of the DB with real data.

- The current query that I am trying to improve: 
https://explain.dalibo.com/plan/8b8f6e0he9feb551

  - I added the DB schema + indexes in the query view. As you can see, I have many 
indexes for testing purposes, to try what the planner can do.

- The optimized query when I have only one ANY, migrated to a UNION ALL branch for 
each parameter of the ANY operator: 
https://explain.dalibo.com/plan/427gg053d07328ga . The query is as fast as I would 
like, but it means generating sub-queries and merging them to get a fast result.

- The new issue I have when I add another ANY operator to the previous optimized 
query: big IO/read. https://explain.dalibo.com/plan/e7ha9g637b4eh946


It seems quite impractical to generate a query per parameter and then merge the 
results. I sometimes have 3-4 ANY operators with up to 15 elements in 
an array.
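
For illustration, a minimal sketch of the per-parameter UNION ALL rewrite described 
above (the docs/status/sent_at names are the ones that appear later in this thread, 
not the exact contents of the gist):

```sql
-- One branch per value of the ANY/IN list; each branch can walk an index
-- ordered by sent_at and stop after LIMIT rows, and the outer ORDER BY then
-- only has to merge the small partial results:
(SELECT * FROM docs WHERE status = 'draft' ORDER BY sent_at DESC LIMIT 20)
UNION ALL
(SELECT * FROM docs WHERE status = 'sent' ORDER BY sent_at DESC LIMIT 20)
ORDER BY sent_at DESC
LIMIT 20;
```

With two or more IN lists this multiplies into one branch per combination of values, 
which is why it quickly becomes unmanageable to generate by hand.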


Is there a misusage of my indexes?

Is there a limitation when using ANY or IN operators and ordered LIMIT behind?


Thanks a lot



Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread Peter Geoghegan
On Mon, Jun 12, 2023 at 1:17 PM benoit  wrote:
> Is there a misusage of my indexes?
>
> Is there a limitation when using ANY or IN operators and ordered LIMIT behind?

It's complicated. Do you find that you get satisfactory performance if
you force a bitmap index scan? In other words, what is the effect of
"set enable_indexscan = off" on your original query? Does that speed
up execution at all? (I think that this approach ought to produce a
plan that uses a bitmap index scan in place of the index scan, without
changing anything else.)
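
A minimal sketch of that experiment (the table, columns, and IN lists here are taken 
from Chris Hoover's reply in this thread, not from the original gist):

```sql
-- Disable plain index scans for this session only; the planner should then
-- fall back to a bitmap index scan on the same index:
SET enable_indexscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM docs
 WHERE status IN ('draft', 'sent')
   AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
 ORDER BY sent_at DESC
 LIMIT 20;

RESET enable_indexscan;
```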

-- 
Peter Geoghegan




RE: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread benoit
Sadly, disabling index scans doesn't help. The plan: 
https://explain.dalibo.com/plan/3b3gfce5b29c3hh4


From: Peter Geoghegan 
Sent: Monday, June 12, 2023 22:34:50
To: benoit
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Forced to use UNION ALL when having multiple ANY operators and 
ORDER BY LIMIT



Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread Chris Hoover
I normally create my indexes to match the where clause of the query. While 
technically it should not matter, I find that a lot of the time it does.

I would create an index on (status, sender_reference, sent_at) and see if that 
improves your query performance.

SELECT *
  FROM docs
 WHERE status IN ('draft', 'sent') 
   AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280') 
 ORDER BY sent_at DESC
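
A sketch of that suggestion (the index name is invented; the columns follow the 
query above):

```sql
-- Equality-filtered columns first, ORDER BY column last, so that within each
-- (status, sender_reference) pair the index entries are already in sent_at order:
CREATE INDEX docs_status_sender_sent_idx
    ON docs (status, sender_reference, sent_at);
```

Note that with multiple values in the IN lists, a final merge/sort across the 
combinations may still be needed, but each combination is read in sent_at order.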

Thanks,


Chris Hoover
Senior DBA
AWeber.com
Cell: (803) 528-2269
Email: chr...@aweber.com






RE: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread benoit
This new index is used, but the read is still 230 MB.


https://explain.dalibo.com/plan/b0f28a9e8a136afd



From: Chris Hoover 
Sent: Monday, June 12, 2023 22:55
To: benoit
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Forced to use UNION ALL when having multiple ANY operators and 
ORDER BY LIMIT




Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread Chris Hoover
So, I had a bit more time to look into this.

Here is the issue:
Your query is requesting 20 rows. However, you are ordering by sent_at. 
Because of this, the database has to pull all rows that match the status 
and sender_reference, sort them, and then give you 20.

From your example:
1. You have 29744 rows that match your criteria:
Index Scan using docs_sent_at_idx_1 on public.docs  (cost=0.56..39748945.58 
rows=29744 width=38)

2. To get those 29744 rows, the database had to read 45046 database blocks 
(8KB/block); 45046 * 8KB is roughly 350MB of read:
Buffers: shared hit=1421 read=45046

3. Since you are selecting all rows, the database uses the index to find the 
matching rows and then has to go read the heap blocks to retrieve and validate 
the rows. Then it sorts all the returned rows by date and returns 
the first 20 rows.

So, PostgreSQL is performing well. It's just an expensive way to get 20 rows, 
and I don't see an easy way to make it better if that is what is needed.

Thanks,


Chris Hoover
Senior DBA
AWeber.com
Cell: (803) 528-2269
Email: chr...@aweber.com



> On Jun 12, 2023, at 5:34 PM, benoit  wrote:
> 
> This new index is used but still the read is 230mb.
> 
> https://explain.dalibo.com/plan/b0f28a9e8a136afd
> 
> 