pg_trgm word_similarity query does not use index for input strings longer than 8 characters

2021-12-06 Thread pgsql-performance

Hello,

recently I wrote a query that provides suggestions from a Postgres table.
It should tolerate small typos, so I chose the pg_trgm extension
(https://www.postgresql.org/docs/current/pgtrgm.html).
When measuring performance, I observed large differences in query time
depending on the input string.
Analysis showed that Postgres sometimes used the created indexes and
sometimes didn't, even though using them would have provided a
considerable speedup.


In the included test case the degradation occurs for all input strings
of length 8 or longer; for shorter strings the index is used.


My questions:
Why doesn't the query planner choose to use the index?
Can I make Postgres use the index, and if so, how?
I understand that trying to outsmart the planner is generally a bad 
idea. Maybe the query can be rewritten or there are some parameters that 
could be tweaked.



## Setup Information

Hardware: Intel i5-8250U, 8GB RAM, encrypted SSD, no RAID
$ uname -a
Linux 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


Software:
OS: Ubuntu 20.04
Postgres: PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on 
x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 
64-bit

The Postgres docker image was used.
Docker: Docker version 20.10.5, build 55c4c88
Image used: postgres:14.1

Configuration:
The config file was not changed.
            name            |  current_setting   |        source
----------------------------+--------------------+----------------------
 application_name           | psql               | client
 client_encoding            | UTF8               | client
 DateStyle                  | ISO, MDY           | configuration file
 default_text_search_config | pg_catalog.english | configuration file
 dynamic_shared_memory_type | posix              | configuration file
 enable_seqscan             | off                | session
 lc_messages                | en_US.utf8         | configuration file
 lc_monetary                | en_US.utf8         | configuration file
 lc_numeric                 | en_US.utf8         | configuration file
 lc_time                    | en_US.utf8         | configuration file
 listen_addresses           | *                  | configuration file
 log_timezone               | Etc/UTC            | configuration file
 max_connections            | 100                | configuration file
 max_stack_depth            | 2MB                | environment variable
 max_wal_size               | 1GB                | configuration file
 min_wal_size               | 80MB               | configuration file
 shared_buffers             | 128MB              | configuration file
 TimeZone                   | Etc/UTC            | configuration file


## Test Case
The test case creates a simple table and fills it with identical
entries.
The query is executed twice with an 8-character string, once with
sequential scans enabled and once with sequential scans disabled.
The first query does not use the index, even though the second query
shows that the index would be much faster.


docker run --name postgres -e POSTGRES_PASSWORD=postgres -d postgres:14.1
docker exec -it postgres bash
psql -U postgres

CREATE EXTENSION pg_trgm;

CREATE TABLE song (
artist  varchar(20),
title   varchar(20)
);

INSERT INTO song (artist, title)
SELECT 'artist','title'
FROM generate_series(1,1);

CREATE INDEX artist_trgm ON song USING GIN (artist gin_trgm_ops);
CREATE INDEX title_trgm ON song USING GIN (title gin_trgm_ops);

-- Tips from https://wiki.postgresql.org/wiki/Slow_Query_Questions
ANALYZE;
VACUUM;
REINDEX TABLE song;
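For reference (not part of the original script), the sizes of the two GIN indexes can be checked after the REINDEX; index size is one of the inputs to the planner's index cost estimate:

```sql
SELECT pg_size_pretty(pg_relation_size('artist_trgm')) AS artist_idx,
       pg_size_pretty(pg_relation_size('title_trgm'))  AS title_idx;
```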

\set query '12345678'

-- This query is slow
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;

set enable_seqscan=off;

-- This query is fast
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;
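To see where the 8-character threshold comes from, the trigrams that pg_trgm extracts from the query string can be listed with show_trgm() (shipped with the extension); the planner's index cost estimate grows with the trigram count:

```sql
-- pg_trgm pads each word with two leading and one trailing space
-- before extraction, so an n-character word yields n+1 trigrams.
SELECT show_trgm('1234567');                    -- 7 chars: 8 trigrams
SELECT show_trgm('12345678');                   -- 8 chars: 9 trigrams
SELECT array_length(show_trgm('12345678'), 1);  -- 9
```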


## Additional Test Case Info

Schemata:
                                           Table "public.song"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 artist | character varying(20) |           |          |         | extended |             |              |
 title  | character varying(20) |           |          |         | extended |             |              |

Indexes:
"artist_trgm" gin (artist gin_trgm_ops)
"title_trgm" gin (title gin_trgm_ops)
Access method: heap
     Index "public.artist_trgm"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 artist | integer | yes  | artist     | plain   |
gin, for table "public.song"

     Index "public.title_trgm"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 title  | integer | yes  | title      | plain   |
gin, for table "public.song"

Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters

2021-12-09 Thread pgsql-performance

Thank you both a lot for the insights and your input.

> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries.

The example was based on real data with a more complicated query which
prompted me to investigate the issue. The distinction between slow and
fast queries is not as clear cut as with the generated data, but the
general problem remains.

>> Since you have SSDs, you should tune "random_page_cost = 1.1".

I tested different values of random_page_cost with various queries.
Values that were too small increased execution time again, because the
planner became too eager to use the indexes. I found the optimum for my
use case at 1.4. This solved my problem, thanks.
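For the record, the session-level change, and (sketched with the standard ALTER SYSTEM mechanism) how to persist it:

```sql
-- Session-level, as used while testing:
SET random_page_cost = 1.4;

-- To persist across restarts:
ALTER SYSTEM SET random_page_cost = 1.4;
SELECT pg_reload_conf();
```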

Regards
Jonathan

On 07.12.21 18:08, Tom Lane wrote:

Laurenz Albe  writes:

On Tue, 2021-11-30 at 22:38 +0100, pgsql-performa...@jhacker.de wrote:

INSERT INTO song (artist, title)
SELECT 'artist','title'
FROM generate_series(1,1);

\set query '12345678'

-- This query is slow
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;



The table is quite small; with a bigger table, the test would be more 
meaningful.


Yeah, this test case seems very unrealistic, both as to table size
and as to the lack of variability of the table entries.  I think the
latter is causing the indexscans to take less time than they otherwise
might, because none of the extracted trigrams find any matches.


Since you have SSDs, you should tune "random_page_cost = 1.1".


Right.  Poking at gincostestimate a bit, I see that for this
operator the indexscan cost estimate is basically driven by the
number of trigrams extracted from the query string (nine in this
test case) and the index size; those lead to a predicted number
of index page fetches that's then scaled by random_page_cost.
That's coming out to make it look more expensive than the seqscan.
It's actually not more expensive, but that's partially because
page fetch costs are really zero in this test case (everything
will stay in shared buffers the whole time), and partially because
the unrealistic data pattern is leading to not having to look at
as much of the index as gincostestimate expected.

In general, it appears correct that longer query strings lead to a
higher index cost estimate, because they produce more trigrams so
there's more work for the index match to do.  (At some level, a
longer query means more work in the seqscan case too; but our cost
models are inadequate to predict that.)

regards, tom lane