Re: Duplicate WHERE condition changes performance and plan

2020-04-24 Thread singh...@gmail.com
> If you're using SSD storage, or if the DB is small compared with 
> shared_buffers or RAM, then random_page_cost should be closer to 
> seq_page_cost.

I don't *think* we are using SSDs  but I'll need to confirm that though.

> How large are the indexes? problem_id_idx1 ?

Using the query from here:
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_size.2Fusage_statistics
Output here: https://gist.github.com/indy-singh/e33eabe5cc937043c93b42a8783b3bfb

I've setup a repo here where it is possible to reproduce the weird
behaviour I'm getting:-

https://github.com/indy-singh/postgres-duplicate-where-conditon

That contains the data (amended to remove any private information) as
well as the statements need to recreate tables, indices, and
constraints,

I think after some trial and error this is something to do with the
size of the table and statistics. I've been trying to put together a
Short, Self Contained, Correct example (http://sscce.org/) and the
problem only appears when fill problem_instance.message with junk, but
I have to do it in two steps as outlined in the README in repo.

Indy




Re: PostgreSQL does not choose my indexes well

2020-04-24 Thread Jeff Janes
On Thu, Apr 23, 2020 at 7:36 AM Arcadio Ortega Reinoso <
arcadio.ort...@gmail.com> wrote:

> explain (analyze, buffers, format text) select * from entidad where
> cod_tabla = 4
>
>
> Index Scan using idx_tabla_entidad on entidad (cost=0.56..51121.41
> rows=1405216 width=20) (actual time=0.037..242.609 rows=1409985 loops=1)
>Index Cond: ((cod_tabla)::bigint = 4)
>Buffers: shared hit=12839
> Planning Time: 0.158 ms
> Execution Time: 311.828 ms
>

In order to read 1409985 / 12839 = 109 rows per buffer page, the table must
be extraordinarily well clustered on this index.  That degree of clustering
is going to steal much of the thunder from the index-only scan.  But in my
hands, it does still prefer the partial index with index-only scan by a
cost estimate ratio of 3 to 1 (despite it actually being slightly slower)
so I don't know why you don't get it being used.

This was how I populated the table:

insert into entidad select id, floor(random()*25)::int,
floor(random()*1000)::int from generate_series(1,3400) f(id);
cluster entidad USING idx_tabla_entidad ;

0.3 seconds for 1.4 million rows is pretty good.  How much better are you
hoping to get by micro-managing the planner?

To figure it out, it might help to see the  explain (analyze, buffers,
format text) of the plan you want it to use.  But the only way I see to do
that is to drop the other index.

If you don't want to "really" drop the index, you can drop it in a
transaction, run the "explain (analyze, buffers, format text)" query, and
rollback the transaction.  (Note this will lock the table for the entire
duration of the transaction, so it is not something to do cavalierly in
production)

Cheers,

Jeff


Re: PostgreSQL does not choose my indexes well

2020-04-24 Thread Stephen Frost
Greetings,

* Jeff Janes (jeff.ja...@gmail.com) wrote:
> In order to read 1409985 / 12839 = 109 rows per buffer page, the table must
> be extraordinarily well clustered on this index.  That degree of clustering
> is going to steal much of the thunder from the index-only scan.  But in my
> hands, it does still prefer the partial index with index-only scan by a
> cost estimate ratio of 3 to 1 (despite it actually being slightly slower)
> so I don't know why you don't get it being used.

Turns out to be because what was provided wasn't actually what was being
used- there's a domain in there and that seems to gum up the works and
make it so we don't consider the partial index as being something we can
use (see the discussion at the end of the other sub-thread).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: PostgreSQL does not choose my indexes well

2020-04-24 Thread Tom Lane
Stephen Frost  writes:
> Turns out to be because what was provided wasn't actually what was being
> used- there's a domain in there and that seems to gum up the works and
> make it so we don't consider the partial index as being something we can
> use (see the discussion at the end of the other sub-thread).

Some simple experiments here don't find that a domain-type column prevents
use of the partial index.  So it's still not entirely clear what's
happening for the OP.  I concur with Jeff's suggestion to try forcing
use of the desired index, and see whether it happens at all and what
the cost estimate is.

I'm also wondering exactly which Postgres version this is.

regards, tom lane




Re: PostgreSQL does not choose my indexes well

2020-04-24 Thread Jeff Janes
On Fri, Apr 24, 2020 at 2:33 PM Stephen Frost  wrote:

> Greetings,
>
> * Jeff Janes (jeff.ja...@gmail.com) wrote:
> > In order to read 1409985 / 12839 = 109 rows per buffer page, the table
> must
> > be extraordinarily well clustered on this index.  That degree of
> clustering
> > is going to steal much of the thunder from the index-only scan.  But in
> my
> > hands, it does still prefer the partial index with index-only scan by a
> > cost estimate ratio of 3 to 1 (despite it actually being slightly slower)
> > so I don't know why you don't get it being used.
>
> Turns out to be because what was provided wasn't actually what was being
> used- there's a domain in there and that seems to gum up the works and
> make it so we don't consider the partial index as being something we can
> use (see the discussion at the end of the other sub-thread).
>


Thanks.  I somehow managed to overlook the existence of the entire last 24
hours of discussion.  But if I change the type of entidad.cod_tabla to
match the domain now shown in table.cod_table, I can still get the index
only scan over the partial index.  Now the cost estimate has changed so it
slightly prefers the other index instead (in agreement with the original
report) but usage of the partial index-only can is still possible (e.g. if
I drop the single column full-table index).  I don't understand why the
domain changes the estimate without changing the execution, but it isn't
something that is very important to me.  I'm more interested in the index
only scan is not actually much if any faster.  Even if there is no IO
benefit due to the clustering, I'd still expect there to be some CPU
benefit of not jumping back and forth between index pages and heap pages,
but iI don't know how much effort it is worth to put into that either.

Cheers,

Jeff