Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Jeff Janes
On Thu, Jan 9, 2020 at 8:11 PM Marco Colli  wrote:

> Hello!
>
> I have a query on a large table that is very fast (0s):
>
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt
>
> Basically the query matches the rows that have a tag1 OR tag2 OR tag3 OR
> tag4 OR tag5...
>

Each branch of the OR is anticipated to return 400 rows, but it actually
returns 0.  If it actually were to return 400 rows per branch, than
eventually the plan switch actually would make sense.

Why is the estimate off by so much?  If you run a simple select, what the
actual and expected number of rows WHERE project_id = 12345?  WHERE tags @>
'{crt:2018_11}'?  Is one of those estimates way off reality, or is it only
the conjunction which is deranged?


> How can I encourage PostgreSQL to use the Bitmap Index Scan even when
> there are many OR conditions? I have tried with VACUUM ANALYZE
> subscriptions but it doesn't help.
>
> Note: the query is generated dynamically by customers of a SaaS, so I
> don't have full control on it
>

Do you have enough control to change the ORDER BY to:

ORDER BY ("subscriptions"."created_at" + interval '0 days') DESC

Cheers,

Jeff


Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Marco Colli
Sorry, I didn't notice the SELECT * and I said something stupid...
However my reasoning should be still valid: I mean, PG could find the few
relevant rows (there's a LIMIT 30) using ONLY the index. It has all the
information required inside the index! Then it can simply access to that
rows on disk... It cannot take ~1 minute to access a few rows on disk (max
30 rows, actual 0 rows).



On Fri, Jan 10, 2020 at 4:18 PM Tom Lane  wrote:

> Marco Colli  writes:
> > As you can see it is a *index scan* and not an *index only* scan... I
> don't
> > understand why. The index includes all the fields used by the query... so
> > an index only scan should be possible.
>
> Huh?  The query is "select * from ...", so it retrieves *all* columns
> of the table.
>
> regards, tom lane
>


Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Tom Lane
Marco Colli  writes:
> As you can see it is a *index scan* and not an *index only* scan... I don't
> understand why. The index includes all the fields used by the query... so
> an index only scan should be possible.

Huh?  The query is "select * from ...", so it retrieves *all* columns
of the table.

regards, tom lane




Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Marco Colli
Before trying other solutions I would like to make PG use an index-only
scan (it should be fast enough for our purpose).

I have tried to disable the other indexes and forced PG to use this index
(which includes all the fields of the query):
index_subscriptions_on_project_id_and_created_at_and_tags

The problem is that the query plan is this:
https://gist.github.com/collimarco/03f3dde372f001485518b8deca2f3b24#file-index_scan_instead_of_index_only-txt

As you can see it is a *index scan* and not an *index only* scan... I don't
understand why. The index includes all the fields used by the query... so
an index only scan should be possible.


On Fri, Jan 10, 2020 at 2:34 PM Justin Pryzby  wrote:

> On Fri, Jan 10, 2020 at 12:03:39PM +0100, Marco Colli wrote:
> > I have added this index which would allow an index only scan:
> > "index_subscriptions_on_project_id_and_created_at_and_tags" btree
> > (project_id, created_at DESC, tags) WHERE trashed_at IS NULL
>
> Are those the only columns in subscriptions ?
>
> > But Postgresql continues to use this index (which has less information
> and
> > then requires slow access to disk):
> > "index_subscriptions_on_project_id_and_created_at" btree (project_id,
> > created_at DESC)
>
> Did you vacuum the table ?
> Did you try to "explain" the query after dropping the 1st index (like:
> begin;
> DROP INDEX..; explain analyze..; rollback).
>
> Also, is the first (other) index btree_gin (you can \dx to show
> extensions) ?
>
> I think it needs to be a gin index to search tags ?
>
> On Fri, Jan 10, 2020 at 01:42:24PM +0100, Marco Colli wrote:
> > I would like to try your solution but I read that ALTER TABLE... SET
> > STATISTICS  locks the table... Since it is just an experiment and we
> don't
> > know if it actually works it would be greate to avoid locking a large
> table
> > (50M) in production.
>
> I suggest to CREATE TABLE test_subscriptions (LIKE subscriptions INCLUDING
> ALL); INSERT INTO test_subscriptions SELECT * FROM subscriptions; ANALYZE
> test_subscriptions;
>
> Anyway, ALTER..SET STATS requires a strong lock but for only a brief moment
> (assuming it doesn't have to wait).  Possibly you'd be ok doing SET
> statement_timeout='1s'; ALTER TABLE
>
> > Does CREATE  STATISTICS lock the table too?
>
> You can check by SET client_min_messages=debug; SET lock_timeout=333; SET
> log_lock_waits=on;
> Looks like it needs ShareUpdateExclusiveLock.
>
> > Does statistics work on an array field like tags? (I can't find any
> > information)
>
> It think it'd be data type agnostic.  And seems to work with arrays.
>
> On Fri, Jan 10, 2020 at 02:30:27PM +0100, Marco Colli wrote:
> > @Justin Pryzby I have tried this as you suggested:
> >
> > CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
> > project_id, tags FROM subscriptions;
> > VACUUM ANALYZE subscriptions;
> >
> > Unfortunately nothing changes and Postgresql continues to use the wrong
> > plan (maybe stats don't work well on array fields like tags??).
>
> It'd help to see SELECT stxddependencies FROM pg_statistic_ext WHERE
> stxoid='subscriptions'::regclass
>


Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Justin Pryzby
On Fri, Jan 10, 2020 at 12:03:39PM +0100, Marco Colli wrote:
> I have added this index which would allow an index only scan:
> "index_subscriptions_on_project_id_and_created_at_and_tags" btree
> (project_id, created_at DESC, tags) WHERE trashed_at IS NULL

Are those the only columns in subscriptions ?

> But Postgresql continues to use this index (which has less information and
> then requires slow access to disk):
> "index_subscriptions_on_project_id_and_created_at" btree (project_id,
> created_at DESC)

Did you vacuum the table ?
Did you try to "explain" the query after dropping the 1st index (like: begin;
DROP INDEX..; explain analyze..; rollback).

Also, is the first (other) index btree_gin (you can \dx to show extensions) ?

I think it needs to be a gin index to search tags ?

On Fri, Jan 10, 2020 at 01:42:24PM +0100, Marco Colli wrote:
> I would like to try your solution but I read that ALTER TABLE... SET
> STATISTICS  locks the table... Since it is just an experiment and we don't
> know if it actually works it would be greate to avoid locking a large table
> (50M) in production.

I suggest to CREATE TABLE test_subscriptions (LIKE subscriptions INCLUDING
ALL); INSERT INTO test_subscriptions SELECT * FROM subscriptions; ANALYZE 
test_subscriptions;

Anyway, ALTER..SET STATS requires a strong lock but for only a brief moment
(assuming it doesn't have to wait).  Possibly you'd be ok doing SET
statement_timeout='1s'; ALTER TABLE  

> Does CREATE  STATISTICS lock the table too?

You can check by SET client_min_messages=debug; SET lock_timeout=333; SET 
log_lock_waits=on;
Looks like it needs ShareUpdateExclusiveLock.

> Does statistics work on an array field like tags? (I can't find any
> information)

It think it'd be data type agnostic.  And seems to work with arrays.

On Fri, Jan 10, 2020 at 02:30:27PM +0100, Marco Colli wrote:
> @Justin Pryzby I have tried this as you suggested:
> 
> CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
> project_id, tags FROM subscriptions;
> VACUUM ANALYZE subscriptions;
> 
> Unfortunately nothing changes and Postgresql continues to use the wrong
> plan (maybe stats don't work well on array fields like tags??).

It'd help to see SELECT stxddependencies FROM pg_statistic_ext WHERE
stxoid='subscriptions'::regclass




Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Marco Colli
@Justin Pryzby I have tried this as you suggested:

CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
project_id, tags FROM subscriptions;
VACUUM ANALYZE subscriptions;

Unfortunately nothing changes and Postgresql continues to use the wrong
plan (maybe stats don't work well on array fields like tags??).

On Fri, Jan 10, 2020 at 4:06 AM Justin Pryzby  wrote:

> On Fri, Jan 10, 2020 at 02:11:14AM +0100, Marco Colli wrote:
> > I have a query on a large table that is very fast (0s):
> >
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt
>
> ORDER BY + LIMIT is a query which sometimes has issues, you can probably
> find
> more by searching.  The planner thinks it'll hit the LIMIT pretty soon and
> only
> run a fraction of the index scan - but then it turns out to be wrong.
>
> You might have poor statistics on project_id and/or tags.  This *might*
> help:
> ALTER TABLE subscriptions ALTER project_id SET STATISTICS 2000; ANALYZE
> subscriptions;
>
> But I'm guessing there's correlation between the two, which the planner
> doesn't
> know.  If you're running at least v10, I'm guessing it would help to CREATE
> STATISTICS on those columns (and analyze).
>
> See one similar problem here (not involving LIMIT).
>
> https://www.postgresql.org/message-id/flat/CABFxtPedz4zL%2BaPWut4%2B%3Dum4av1aAXr6OVRfRB_6K7mJKMbEcw%40mail.gmail.com
>


Re: Bad query plan when you add many OR conditions

2020-01-10 Thread Marco Colli
I am trying different solutions and what I have found is even more
surprising to me...

The query is always this:
https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-slow_query-txt

I have added this index which would allow an index only scan:
"index_subscriptions_on_project_id_and_created_at_and_tags" btree
(project_id, created_at DESC, tags) WHERE trashed_at IS NULL

But Postgresql continues to use this index (which has less information and
then requires slow access to disk):
"index_subscriptions_on_project_id_and_created_at" btree (project_id,
created_at DESC)



On Fri, Jan 10, 2020 at 4:06 AM Justin Pryzby  wrote:

> On Fri, Jan 10, 2020 at 02:11:14AM +0100, Marco Colli wrote:
> > I have a query on a large table that is very fast (0s):
> >
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt
>
> ORDER BY + LIMIT is a query which sometimes has issues, you can probably
> find
> more by searching.  The planner thinks it'll hit the LIMIT pretty soon and
> only
> run a fraction of the index scan - but then it turns out to be wrong.
>
> You might have poor statistics on project_id and/or tags.  This *might*
> help:
> ALTER TABLE subscriptions ALTER project_id SET STATISTICS 2000; ANALYZE
> subscriptions;
>
> But I'm guessing there's correlation between the two, which the planner
> doesn't
> know.  If you're running at least v10, I'm guessing it would help to CREATE
> STATISTICS on those columns (and analyze).
>
> See one similar problem here (not involving LIMIT).
>
> https://www.postgresql.org/message-id/flat/CABFxtPedz4zL%2BaPWut4%2B%3Dum4av1aAXr6OVRfRB_6K7mJKMbEcw%40mail.gmail.com
>