Postgresql equal join on function with columns not use index
Hi, We migrate from Oracle to Postgresql14.8, one SQL has regression in Postgres run in 5800 milliseconds in Postgresql v14.8, but the same SQL got done in several hundred milliseconds in Oracle database. With multiple table JOINs, if the join condition is tablea.column1=tableb.column1, optimizer will use the index to filter data in nest loops, but if tablea.column1=regexp_replace(tableb.column1), Optimizer will not be able to use the index on tablea.column1, then it do a table scan and nestloop to produce a lot rows then use tablea.column1=regexp_replace(tableb.column1) as a filter. As a workaround we create a view then use tablea.column1=view.column1 that works. Is it expected ?details as below. SELECT DISTINCT a.xxx, b.xxx as TOLLFREE FROM tableA a, tableB b WHERE a.column1 = regexp_replace(b.column1,'[^0-9]','') AND b.column2 = $1 AND b.column3= $2 AND NOT EXISTS (SELECT 1 FROM tableC c WHERE c.xxid = b.xxid AND c.xxtype = case when b.col4 = 1 then 'TollFree' else 'Toll' end AND c.xxid = $3) Unique (cost=423374.60..423377.87 rows=436 width=21) (actual time=6070.963..6071.054 rows=395 loops=1) Buffers: shared hit=148 -> Sort (cost=423374.60..423375.69 rows=436 width=21) (actual time=6070.963..6070.992 rows=397 loops=1) Sort Key: a.xx, b.xx Sort Method: quicksort Memory: 56kB Buffers: shared hit=148 -> Nested Loop (cost=0.69..423355.48 rows=436 width=21) (actual time=120.338..6070.669 rows=397 loops=1) Join Filter: ((a.column1)::text = regexp_replace((b.column1)::text, '[^0-9]'::text, ''::text)) << Seq Scan on tableA a (cost=0.00..161.12 rows=7712 width=25) (actual time=0.022..1.380 rows=7712 loops=1) Buffers: shared hit=84 -> Materialize (cost=0.69..153.12 rows=207 width=21) (actual time=0.000..0.011 rows=196 loops=7712) Buffers: shared hit=58 -> Nested Loop Anti Join (cost=0.69..152.09 rows=207 width=21) (actual time=0.069..0.278 rows=196 loops=1) Join Filter: ((c.xxid = b.xxid) AND ((c.)::text = CASE WHEN (b.column2 = 1) THEN 'aaa'::text ELSE 'bbb'::t ext END)) Buffers: shared hit=58 -> Index Scan using idx_xxx on tableB b (cost=0.42..146.55 rows=207 width=29) (actual time=0.047..0.207 rows=196 loops=1) Index Cond: ((colum3 = 40957) AND (column2 = 1)) Buffers: shared hit=56 -> Materialize (cost=0.27..1.40 rows=1 width=15) (actual time=0.000..0.000 rows=0 loops=196) Buffers: shared hit=2 -> Index Only Scan using pk_ on tableC c (cost=0.27..1.39 rows=1 width=15 ) (actual time=0.020..0.020 rows=0 loops=1) Index Cond: (xxxid = 12407262) Heap Fetches: 0 Buffers: shared hit=2 If we create a view ,the SQL got done in several million seconds, CREATE VIEW tableBREGXP as (select xx,column2,column3,xxid,regexp_replace(column1,'[^0-9]','') as column1 from tableB); SELECT DISTINCT a.xxx, b.xxx as TOLLFREE FROM tableA a, tableBREGXP b<<< replace the tableB with view name. WHERE a.column1 = b.column1 AND b.column2 = $1 AND b.column3= $2 <<< use b.column1 to replace regexp_replace((b.column1)::text, '[^0-9]'::text, ''::text)) AND NOT EXISTS (SELECT 1 FROM tableC c WHERE c.xxid = b.xxid AND c.xxtype = case when b.col4 = 1 then 'TollFree' else 'Toll' end AND c.xxid = $3) HashAggregate (cost=408.19..412.76 rows=457 width=21) (actual time=4.524..4.644 rows=395 loops=1) Group Key: a.xxx, b.xx Batches: 1 Memory Usage: 61kB Buffers: shared hit=693 -> Nested Loop (cost=0.97..405.90 rows=457 width=21) (actual time=0.154..4.205 rows=397 loops=1) Buffers: shared hit=693 -> Nested Loop Anti Join (cost=0.69..214.97 rows=217 width=40) (actual time=0.137..2.877 rows=196 loops=1) Join Filter: ((c.xxyid = b.xxid) AND ((c.x)::text = CASE WHEN (b.column2 = 1) THEN 'To llFree'::text ELSE 'Toll'::text END)) Buffers: shared hit=55 -> Index Scan using idx_xxx on b (cost=0.42..207.06 rows=217 width=64) (actual time=0.123..2.725 rows=196 loops=1) Index Cond: ((column2 = 40957) AND (column3 = 1)) Buffers: shared hi
Re: Postgresql equal join on function with columns not use index
"James Pang (chaolpan)" writes: >We migrate from Oracle to Postgresql14.8, one SQL has regression in > Postgres run in 5800 milliseconds in Postgresql v14.8, but the same SQL got > done in several hundred milliseconds in Oracle database. >With multiple table JOINs, if the join condition is > tablea.column1=tableb.column1, optimizer will use the index to filter data in > nest loops, but if tablea.column1=regexp_replace(tableb.column1), > Optimizer will not be able to use the index on tablea.column1, then it do a > table scan and nestloop to produce a lot rows then use > tablea.column1=regexp_replace(tableb.column1) as a filter. As a > workaround we create a view then use tablea.column1=view.column1 that works. > Is it expected ?details as below. It's impossible to comment on this usefully with such a fragmentary description of the problem. Please send a complete, self-contained test case if you want anybody to look at it carefully. https://wiki.postgresql.org/wiki/Slow_Query_Questions regards, tom lane
RE: Postgresql equal join on function with columns not use index
Hi, Looks like it's the function "regexp_replace" volatile and restrict=false make the difference, we have our application role with default search_path=oracle,$user,public,pg_catalog. =#select oid,proname,pronamespace::regnamespace,prosecdef,proisstrict,provolatile from pg_proc where proname='regexp_replace' order by oid; oid |proname | pronamespace | prosecdef | proisstrict | provolatile ---++--+---+-+- 2284 | regexp_replace | pg_catalog | f | t | i 2285 | regexp_replace | pg_catalog | f | t | i 17095 | regexp_replace | oracle | f | f | v 17096 | regexp_replace | oracle | f | f | v 17097 | regexp_replace | oracle | f | f | v 17098 | regexp_replace | oracle | f | f | v --with default it use orafce, oracle.regexp_replace function, Select a.phonenumber,... from tableA a, tableB b where a.phonenumber=oracle. regexp_replace(b.PHONENUMBER,'[^0-9]','') , --index on a.phonenumber not used Switch to pg_catalog.regexp_replace(b.PHONENUMBER,'[^0-9]',''), Index on a.phonenumber got used. Thanks, James Pang -Original Message- From: Tom Lane Sent: Monday, June 12, 2023 9:19 PM To: James Pang (chaolpan) Cc: pgsql-performance@lists.postgresql.org Subject: Re: Postgresql equal join on function with columns not use index "James Pang (chaolpan)" writes: >We migrate from Oracle to Postgresql14.8, one SQL has regression in > Postgres run in 5800 milliseconds in Postgresql v14.8, but the same SQL got > done in several hundred milliseconds in Oracle database. >With multiple table JOINs, if the join condition is > tablea.column1=tableb.column1, optimizer will use the index to filter > data in nest loops, but if > tablea.column1=regexp_replace(tableb.column1), > Optimizer will not be able to use the index on tablea.column1, then it do a > table scan and nestloop to produce a lot rows then use > tablea.column1=regexp_replace(tableb.column1) as a filter. As a > workaround we create a view then use tablea.column1=view.column1 that works. > Is it expected ?details as below. It's impossible to comment on this usefully with such a fragmentary description of the problem. Please send a complete, self-contained test case if you want anybody to look at it carefully. https://wiki.postgresql.org/wiki/Slow_Query_Questions regards, tom lane
Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
Hello I have a database with few 60gb tables. Tables rows are requested with multiple ANY or IN operators. I am not able to find an easy way to make DB able to use indexes. I often hit the index, but see a a spike of 200mb of IO or disk read. I am using version 13 but soon 14. I wrote a reproduction script on version 14 with plans included. https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d I also have plans on a snapshot of the DB with real data. - The current query that I try to improve : https://explain.dalibo.com/plan/8b8f6e0he9feb551 - I added the DB schema + index in query view. As you can see I have many indexes for testing purpose and try what the planner can do. - The optimized query when I have only one ANY and migrate to UNION ALL for each parameter of the ANY operator https://explain.dalibo.com/plan/427gg053d07328ga . Query is fast as I would like but it means generate some merge to be able to get a fast result. - The new issue I have when I have a new ANY operator on the previous optimized query. Big IO/read https://explain.dalibo.com/plan/e7ha9g637b4eh946 It seems to me quite undoable to generate for every parameters a query that will then merge. I have sometimes 3-4 ANY operators with up to 15 elements in an array. Is there a misusage of my indexes? Is there a limitation when using ANY or IN operators and ordered LIMIT behind? Thanks a lot
Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
On Mon, Jun 12, 2023 at 1:17 PM benoit wrote: > Is there a misusage of my indexes? > > Is there a limitation when using ANY or IN operators and ordered LIMIT behind? It's complicated. Do you find that you get satisfactory performance if you force a bitmap index scan? In other words, what is the effect of "set enable_indexscan = off" on your original query? Does that speed up execution at all? (I think that this approach ought to produce a plan that uses a bitmap index scan in place of the index scan, without changing anything else.) -- Peter Geoghegan
RE: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
Sadly it doesn't help to disable indexscan. The plan : https://explain.dalibo.com/plan/3b3gfce5b29c3hh4 De : Peter Geoghegan Envoyé : lundi 12 juin 2023 22:34:50 À : benoit Cc : pgsql-performance@lists.postgresql.org Objet : Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT On Mon, Jun 12, 2023 at 1:17 PM benoit wrote: > Is there a misusage of my indexes? > > Is there a limitation when using ANY or IN operators and ordered LIMIT behind? It's complicated. Do you find that you get satisfactory performance if you force a bitmap index scan? In other words, what is the effect of "set enable_indexscan = off" on your original query? Does that speed up execution at all? (I think that this approach ought to produce a plan that uses a bitmap index scan in place of the index scan, without changing anything else.) -- Peter Geoghegan
Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
I normally create my indexes to match the where clause of the query. While technically, it should not matter, I find a lot of time, it does. I would create an index on (status, sender_reference, sent_at) and see if the improves your query performance. SELECT * FROM docs WHERE status IN ('draft', 'sent') AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280') ORDER BY sent_at DESC Thanks, Chris Hoover Senior DBA AWeber.com Cell: (803) 528-2269 Email: chr...@aweber.com > On Jun 12, 2023, at 4:17 PM, benoit wrote: > > Hello > > I have a database with few 60gb tables. Tables rows are requested with > multiple ANY or IN operators. I am not able to find an easy way to make DB > able to use indexes. I often hit the index, but see a a spike of 200mb of IO > or disk read. > > I am using version 13 but soon 14. > > I wrote a reproduction script on version 14 with plans included. > https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d > > I also have plans on a snapshot of the DB with real data. > - The current query that I try to improve : > https://explain.dalibo.com/plan/8b8f6e0he9feb551 > - I added the DB schema + index in query view. As you can see I have many > indexes for testing purpose and try what the planner can do. > - The optimized query when I have only one ANY and migrate to UNION ALL for > each parameter of the ANY operator > https://explain.dalibo.com/plan/427gg053d07328ga . Query is fast as I would > like but it means generate some merge to be able to get a fast result. > - The new issue I have when I have a new ANY operator on the previous > optimized query. Big IO/read https://explain.dalibo.com/plan/e7ha9g637b4eh946 > > It seems to me quite undoable to generate for every parameters a query that > will then merge. I have sometimes 3-4 ANY operators with up to 15 elements in > an array. > > Is there a misusage of my indexes? > Is there a limitation when using ANY or IN operators and ordered LIMIT behind? > > Thanks a lot
RE: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
This new index is used but still the read is 230mb. https://explain.dalibo.com/plan/b0f28a9e8a136afd De : Chris Hoover Envoyé : lundi 12 juin 2023 22:55 À : benoit Cc : pgsql-performance@lists.postgresql.org Objet : Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT I normally create my indexes to match the where clause of the query. While technically, it should not matter, I find a lot of time, it does. I would create an index on (status, sender_reference, sent_at) and see if the improves your query performance. SELECT * FROM docs WHERE status IN ('draft', 'sent') AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280') ORDER BY sent_at DESC Thanks, Chris Hoover Senior DBA AWeber.com Cell: (803) 528-2269 Email: chr...@aweber.com On Jun 12, 2023, at 4:17 PM, benoit wrote: Hello I have a database with few 60gb tables. Tables rows are requested with multiple ANY or IN operators. I am not able to find an easy way to make DB able to use indexes. I often hit the index, but see a a spike of 200mb of IO or disk read. I am using version 13 but soon 14. I wrote a reproduction script on version 14 with plans included. https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d I also have plans on a snapshot of the DB with real data. - The current query that I try to improve : https://explain.dalibo.com/plan/8b8f6e0he9feb551 - I added the DB schema + index in query view. As you can see I have many indexes for testing purpose and try what the planner can do. - The optimized query when I have only one ANY and migrate to UNION ALL for each parameter of the ANY operator https://explain.dalibo.com/plan/427gg053d07328ga . Query is fast as I would like but it means generate some merge to be able to get a fast result. - The new issue I have when I have a new ANY operator on the previous optimized query. Big IO/read https://explain.dalibo.com/plan/e7ha9g637b4eh946 It seems to me quite undoable to generate for every parameters a query that will then merge. I have sometimes 3-4 ANY operators with up to 15 elements in an array. Is there a misusage of my indexes? Is there a limitation when using ANY or IN operators and ordered LIMIT behind? Thanks a lot
Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
So, had a bit more time to look into this. Here is the issue: Your query is requesting 20 rows. However, you are doing a sort on sent_at. Because of this, the database is having to pull all rows that match the status and sender_reference, sort them, and then give you 20. From your example: 1. You have 29744 rows that match your criteria. -> Index Scan using docs_sent_at_idx_1 on public.docs (cost=0.56..39748945.58 rows=29744 width=38) 2. To get those 29744 rows, the database had to read 5046 database blocks (8KB/block). 45046 * 8KB is your ~220MB of read. -> Buffers: shared hit=1421 read=45046 3. Since you are selecting all rows, the database uses the index to find the matching rows and then has to go read the head blocks to retrieve and validate the rows. Then it sorts all the returned rows by date and returns the first 20 rows. So, PostgreSQL is preforming well. It’s just an expensive way to get 20 rows and I don’t see an easy way to make it better if that is what is needed. Thanks, Chris Hoover Senior DBA AWeber.com Cell: (803) 528-2269 Email: chr...@aweber.com > On Jun 12, 2023, at 5:34 PM, benoit wrote: > > This new index is used but still the read is 230mb. > > https://explain.dalibo.com/plan/b0f28a9e8a136afd > > > De : Chris Hoover > Envoyé : lundi 12 juin 2023 22:55 > À : benoit > Cc : pgsql-performance@lists.postgresql.org > Objet : Re: Forced to use UNION ALL when having multiple ANY operators and > ORDER BY LIMIT > > I normally create my indexes to match the where clause of the query. While > technically, it should not matter, I find a lot of time, it does. > > I would create an index on (status, sender_reference, sent_at) and see if the > improves your query performance. > > > SELECT * FROM docs WHERE status > IN ('draft', > 'sent') > AND sender_reference > IN ('Custom/1175', > 'Client/362', > 'Custom/280') > ORDER BY sent_at DESC > Thanks, > > > Chris Hoover > Senior DBA > AWeber.com > Cell: (803) 528-2269 > Email: chr...@aweber.com > > > >> On Jun 12, 2023, at 4:17 PM, benoit wrote: >> >> Hello >> >> I have a database with few 60gb tables. Tables rows are requested with >> multiple ANY or IN operators. I am not able to find an easy way to make DB >> able to use indexes. I often hit the index, but see a a spike of 200mb of IO >> or disk read. >> >> I am using version 13 but soon 14. >> >> I wrote a reproduction script on version 14 with plans included. >> https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d >> >> I also have plans on a snapshot of the DB with real data. >> - The current query that I try to improve : >> https://explain.dalibo.com/plan/8b8f6e0he9feb551 >> - I added the DB schema + index in query view. As you can see I have many >> indexes for testing purpose and try what the planner can do. >> - The optimized query when I have only one ANY and migrate to UNION ALL for >> each parameter of the ANY operator >> https://explain.dalibo.com/plan/427gg053d07328ga . Query is fast as I would >> like but it means generate some merge to be able to get a fast result. >> - The new issue I have when I have a new ANY operator on the previous >> optimized query. Big IO/read https://explain.dalibo.com/plan/e7ha9g637b4eh946 >> >> It seems to me quite undoable to generate for every parameters a query that >> will then merge. I have sometimes 3-4 ANY operators with up to 15 elements >> in an array. >> >> Is there a misusage of my indexes? >> Is there a limitation when using ANY or IN operators and ordered LIMIT >> behind? >> >> Thanks a lot