trgm and index-usage when using similarity function

Andreas Joseph Krogh Wed, 05 Nov 2025 00:18:34 -0800


Using PG-18.0, I'm having problems with trgm and index-usage when using 
similarity-function.




Given the following schema:

CREATE TABLE person
(
entity_id BIGSERIAL PRIMARY KEY,
firstname text,
lastname text,
fullname_search text,
birthdate date
);

INSERT INTO person
SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, 
pers.birthdate
FROM onp_crm_person pers;

CREATE INDEX test_idx_onp_crm_person_duplicates
ON person USING gin (fullname_search gin_trgm_ops)
;


This query uses index and is fast, but I have to set 
pg_trgm.similarity_threshold manually first:

SET pg_trgm.similarity_threshold = 0.9;

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
 WHERE d.fullname_search % p.fullname_search
AND d.entity_id <> p.entity_id
LIMIT 1
)
;

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │

├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Semi Join (cost=2.19..319906.05 rows=256 width=45) (actual 
time=29.527..2626.465 rows=7497.00 loops=1) │
│ Buffers: shared hit=972919 │
│ -> Seq Scan on person p (cost=0.00..488.81 rows=25581 width=45) (actual 
time=0.024..1.127 rows=25581.00 loops=1) │
│ Buffers: shared hit=233 │
│ -> Bitmap Heap Scan on person d (cost=2.19..10.03 rows=256 width=25) (actual 
time=0.102..0.102 rows=0.29 loops=25581) │
│ Recheck Cond: (fullname_search % p.fullname_search) │
│ Rows Removed by Index Recheck: 0 │
│ Filter: (entity_id <> p.entity_id) │
│ Rows Removed by Filter: 1 │
│ Heap Blocks: exact=34252 │
│ Buffers: shared hit=972686 │
│ -> Bitmap Index Scan on test_idx_onp_crm_person_duplicates (cost=0.00..2.13 
rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │
│ Index Cond: (fullname_search % p.fullname_search) │
│ Index Searches: 25581 │
│ Buffers: shared hit=933853 │
│ Planning: │
│ Buffers: shared hit=2 │
│ Planning Time: 3.620 ms │
│ JIT: │
│ Functions: 7 │
│ Options: Inlining false, Optimization false, Expressions true, Deforming 
true │
│ Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, 
Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms │
│ Execution Time: 2628.643 ms │

└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(23 rows)


But using similarity-function:

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
 WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9
AND d.entity_id <> p.entity_id
LIMIT 1
)
;
… this has been running for 30 minutes now and I've cancelled it.



Any suggestions on how to fix this query so I can:

 * 
specify the similarity-factor as part of the query

 * 
Make it fast

Thanks.






--
Andreas Joseph Krogh

trgm and index-usage when using similarity function

Reply via email to