Using PG-18.0, I'm having problems with trgm and index-usage when using similarity-function.
Given the following schema: CREATE TABLE person ( entity_id BIGSERIAL PRIMARY KEY, firstname text, lastname text, fullname_search text, birthdate date ); INSERT INTO person SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, pers.birthdate FROM onp_crm_person pers; CREATE INDEX test_idx_onp_crm_person_duplicates ON person USING gin (fullname_search gin_trgm_ops) ; This query uses index and is fast, but I have to set pg_trgm.similarity_threshold manually first: SET pg_trgm.similarity_threshold = 0.9; EXPLAIN (ANALYZE, BUFFERS) SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate FROM person p WHERE EXISTS ( SELECT * FROM person d WHERE d.fullname_search % p.fullname_search AND d.entity_id <> p.entity_id LIMIT 1 ) ; ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Nested Loop Semi Join (cost=2.19..319906.05 rows=256 width=45) (actual time=29.527..2626.465 rows=7497.00 loops=1) │ │ Buffers: shared hit=972919 │ │ -> Seq Scan on person p (cost=0.00..488.81 rows=25581 width=45) (actual time=0.024..1.127 rows=25581.00 loops=1) │ │ Buffers: shared hit=233 │ │ -> Bitmap Heap Scan on person d (cost=2.19..10.03 rows=256 width=25) (actual time=0.102..0.102 rows=0.29 loops=25581) │ │ Recheck Cond: (fullname_search % p.fullname_search) │ │ Rows Removed by Index Recheck: 0 │ │ Filter: (entity_id <> p.entity_id) │ │ Rows Removed by Filter: 1 │ │ Heap Blocks: exact=34252 │ │ Buffers: shared hit=972686 │ │ -> Bitmap Index Scan on test_idx_onp_crm_person_duplicates (cost=0.00..2.13 rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │ │ Index Cond: (fullname_search % p.fullname_search) │ │ Index Searches: 25581 │ │ Buffers: shared hit=933853 │ │ Planning: │ │ Buffers: shared hit=2 │ │ Planning Time: 3.620 ms │ │ JIT: │ │ Functions: 7 │ │ Options: Inlining false, Optimization false, Expressions true, Deforming true │ │ Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms │ │ Execution Time: 2628.643 ms │ └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (23 rows) But using similarity-function: EXPLAIN (ANALYZE, BUFFERS) SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate FROM person p WHERE EXISTS ( SELECT * FROM person d WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9 AND d.entity_id <> p.entity_id LIMIT 1 ) ; … this has been running for 30 minutes now and I've cancelled it. Any suggestions on how to fix this query so I can: * specify the similarity-factor as part of the query * Make it fast Thanks. -- Andreas Joseph Krogh
