Re: SQL performance issue (postgresql chooses a bad plan when a better one is available)

2021-03-23 Thread Hannu Krosing
When I do serious database development I try to use database functions
as much as possible.

You can attach configuration settings to a function, in which case they
are applied while the function is running.

In your case you could probably wrap your query in a set-returning
`LANGUAGE SQL` function [1] and then include

`SET enable_material = false`

as part of the `CREATE FUNCTION` statement [2].

--
[1] https://www.postgresql.org/docs/current/xfunc-sql.html
[2] https://www.postgresql.org/docs/13/sql-createfunction.html
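Concretely, the wrapper could look something like the sketch below. The
table and column names are placeholders taken loosely from the thread,
not the poster's actual schema; the key part is the `SET` clause
attached to the function, which applies only while the function runs:

```sql
-- Hypothetical wrapper: table/column names are assumptions, but the
-- SET clause is the point -- enable_material is off only while the
-- function body's query is planned and executed.
CREATE FUNCTION matching_samples(p_plate_384_id integer)
RETURNS SETOF sample_plate_384
LANGUAGE sql
STABLE
SET enable_material = off
AS $$
  SELECT sp.*
  FROM sample_plate_384 sp
  WHERE EXISTS (SELECT 1
                FROM test_result tr
                WHERE tr.plate_384_id = sp.plate_384_id
                  AND tr.plate_384_id = p_plate_384_id);
$$;

-- Called like a table:
-- SELECT * FROM matching_samples(42);
```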

On Tue, Mar 23, 2021 at 4:22 PM Chris Stephens  wrote:
>
> "set enable_material=false;" produces an efficient plan. good to know there 
> are *some* knobs to turn when the optimizer comes up with a bad plan. would 
> be awesome if you could lock that plan into place w/out altering the variable.
>
> thanks for the help Hannu!
>
> On Mon, Mar 22, 2021 at 4:39 PM Hannu Krosing  wrote:
>>
>> you can play around various `enable_*` flags to see if disabling any
>> of these will *maybe* yield the plan you were expecting, and then
>> check the costs in EXPLAIN to see if the optimiser also thinks this
>> plan is cheaper.
>>
>>
>> On Mon, Mar 22, 2021 at 6:29 PM Chris Stephens  wrote:
>> >
>> > we are but i was hoping to get a better understanding of where the 
>> > optimizer is going wrong and what i can do about it.
>> >
>> > chris
>> >
>> >
>> > On Mon, Mar 22, 2021 at 9:54 AM Laurenz Albe  
>> > wrote:
>> >>
>> >> On Mon, 2021-03-22 at 08:10 -0500, Chris Stephens wrote:
>> >> > The following SQL takes ~25 seconds to run. I'm relatively new to 
>> >> > postgres
>> >> >  but the execution plan (https://explain.depesz.com/s/N4oR) looks like 
>> >> > it's
>> >> >  materializing the entire EXISTS subquery for each row returned by the 
>> >> > rest
>> >> >  of the query before probing for plate_384_id existence. postgres is
>> >> >  choosing sequential scans on sample_plate_384 and test_result when 
>> >> > suitable,
>> >> >  efficient indexes exist. a re-written query produces a much better plan
>> >> >  (https://explain.depesz.com/s/zXJ6). Executing the EXISTS portion of 
>> >> > the
>> >> >  query with an explicit PLATE_384_ID yields the execution plan we want 
>> >> > as
>> >> >  well (https://explain.depesz.com/s/3QAK). unnesting the EXISTS and 
>> >> > adding
>> >> >  a DISTINCT on the result also yields a better plan.
>> >>
>> >> Great!  Then use one of the rewritten queries.
>> >>
>> >> Yours,
>> >> Laurenz Albe
>> >> --
>> >> Cybertec | https://www.cybertec-postgresql.com
>> >>




Re: Odd (slow) plan choice with min/max

2021-03-23 Thread Paul McGarry
On Wed, 24 Mar 2021 at 00:07, Rick Otten  wrote:

>
>> Yes, the columns are highly correlated, but that alone doesn't seem like
>> it should be sufficient criteria to choose this plan.
>> Ie the selection criteria (1 day of data about a year ago) has a year+
>> worth of data after it and probably a decade of data before it, so anything
>> walking a correlated index from top or bottom is going to have to walk past
>> a lot of data before it gets to data that fits the criteria.
>>
>
>
> I assume you have a statistic on the correlated columns, i.e. `CREATE
> STATISTICS`?
>

I didn't, but adding
==
CREATE STATISTICS risk_risk_id_time_correlation_stats
  ON risk_id, time FROM risk;
ANALYZE risk;
==
doesn't seem to help.
I get the same plan before/after. Second run was faster, but just because
data was hot.



> If you can't use partitions on your date column, can you use partial
> indexes instead?   Or a functional index with min() over day and max() over
> day?
>

I don't particularly want to add more weird indexes to solve this one
particular query, as the existing risk_id index should make it efficient
enough if only the planner chose to use it. This is part of an archiving
job, identifying sections of historical data, so it's not a query that
needs to be super optimised, but essentially doing a full table scan
backwards/forwards as it does now incurs a lot of unnecessary I/O that
would be best left free for more time-sensitive queries. My count()
workaround works, so we can use that.
I'm more interested in understanding why the planner makes what seems to be
an obviously bad choice.

Paul


Re: SQL performance issue (postgresql chooses a bad plan when a better one is available)

2021-03-23 Thread Chris Stephens
"set enable_material=false;" produces an efficient plan. good to know there
are *some* knobs to turn when the optimizer comes up with a bad plan. would
be awesome if you could lock that plan into place w/out altering the
variable.

thanks for the help Hannu!
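For interactive experimentation, the setting can also be toggled for
just the current session rather than baked into a function, along these
lines (the query itself is elided):

```sql
-- Try the plan with materialization disabled, then restore the default.
SET enable_material = off;
EXPLAIN (ANALYZE) SELECT ...;  -- the query under test
RESET enable_material;
```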

On Mon, Mar 22, 2021 at 4:39 PM Hannu Krosing  wrote:

> you can play around various `enable_*` flags to see if disabling any
> of these will *maybe* yield the plan you were expecting, and then
> check the costs in EXPLAIN to see if the optimiser also thinks this
> plan is cheaper.
>
>
> On Mon, Mar 22, 2021 at 6:29 PM Chris Stephens 
> wrote:
> >
> > we are but i was hoping to get a better understanding of where the
> optimizer is going wrong and what i can do about it.
> >
> > chris
> >
> >
> > On Mon, Mar 22, 2021 at 9:54 AM Laurenz Albe 
> wrote:
> >>
> >> On Mon, 2021-03-22 at 08:10 -0500, Chris Stephens wrote:
> >> > The following SQL takes ~25 seconds to run. I'm relatively new to
> postgres
> >> >  but the execution plan (https://explain.depesz.com/s/N4oR) looks
> like it's
> >> >  materializing the entire EXISTS subquery for each row returned by
> the rest
> >> >  of the query before probing for plate_384_id existence. postgres is
> >> >  choosing sequential scans on sample_plate_384 and test_result when
> suitable,
> >> >  efficient indexes exist. a re-written query produces a much better
> plan
> >> >  (https://explain.depesz.com/s/zXJ6). Executing the EXISTS portion
> of the
> >> >  query with an explicit PLATE_384_ID yields the execution plan we
> want as
> >> >  well (https://explain.depesz.com/s/3QAK). unnesting the EXISTS and
> adding
> >> >  a DISTINCT on the result also yields a better plan.
> >>
> >> Great!  Then use one of the rewritten queries.
> >>
> >> Yours,
> >> Laurenz Albe
> >> --
> >> Cybertec | https://www.cybertec-postgresql.com
> >>
>


Re: Odd (slow) plan choice with min/max

2021-03-23 Thread Rick Otten
On Tue, Mar 23, 2021 at 2:52 AM Paul McGarry  wrote:

>
>
> On Tue, 23 Mar 2021 at 16:13, Justin Pryzby  wrote:
>
>> On Tue, Mar 23, 2021 at 03:00:38PM +1100, Paul McGarry wrote:
>> > I have a query where Postgresql (11.9 at the moment) is making an odd
>> plan
>> > choice, choosing to use index scans which require filtering out
>> millions of
>> > rows, rather than "just" doing an aggregate over the rows the where
>> clause
>> > targets which is much faster.
>> > AFAICT it isn't a statistics problem, at least increasing the stats
>> target
>> > and analyzing the table doesn't seem to fix the problem.
>>
>> >  explain analyze select min(risk_id),max(risk_id) from risk where
>> > time>='2020-01-20 15:00:07+00' and time < '2020-01-21 15:00:08+00';
>>
>> I'm guessing the time and ID columns are highly correlated...
>>
>> So the planner thinks it can get the smallest ID by scanning the ID
>> index, but
>> then ends up rejecting the first 161e6 rows for which the time is too
>> low, and
>> fails the >= condition.
>>
>> And thinks it can get the greatest ID by backward scanning the ID idx,
>> but ends
>> up rejecting/filtering the first 41e6 rows, for which the time is too
>> high,
>> failing the < condition.
>>
>
> Yes, the columns are highly correlated, but that alone doesn't seem like
> it should be sufficient criteria to choose this plan.
> Ie the selection criteria (1 day of data about a year ago) has a year+
> worth of data after it and probably a decade of data before it, so anything
> walking a correlated index from top or bottom is going to have to walk past
> a lot of data before it gets to data that fits the criteria.
>


I assume you have a statistic on the correlated columns, i.e. `CREATE
STATISTICS`?

If you can't use partitions on your date column, can you use partial
indexes instead?   Or a functional index with min() over day and max() over
day?
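As a sketch of the partial-index idea (index name and date range are
invented for illustration; the planner will only use such an index when
the query's WHERE clause implies the index predicate):

```sql
-- Partial index restricted to the window being archived, so min/max
-- over that window can be answered without walking unrelated rows.
CREATE INDEX risk_2020_archival_idx ON risk (risk_id)
WHERE time >= '2020-01-01' AND time < '2021-01-01';
```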


Re: Odd (slow) plan choice with min/max

2021-03-23 Thread Paul McGarry
On Tue, 23 Mar 2021 at 16:13, Justin Pryzby  wrote:

> On Tue, Mar 23, 2021 at 03:00:38PM +1100, Paul McGarry wrote:
> > I have a query where Postgresql (11.9 at the moment) is making an odd
> plan
> > choice, choosing to use index scans which require filtering out millions
> of
> > rows, rather than "just" doing an aggregate over the rows the where
> clause
> > targets which is much faster.
> > AFAICT it isn't a statistics problem, at least increasing the stats
> target
> > and analyzing the table doesn't seem to fix the problem.
>
> >  explain analyze select min(risk_id),max(risk_id) from risk where
> > time>='2020-01-20 15:00:07+00' and time < '2020-01-21 15:00:08+00';
>
> I'm guessing the time and ID columns are highly correlated...
>
> So the planner thinks it can get the smallest ID by scanning the ID index,
> but
> then ends up rejecting the first 161e6 rows for which the time is too low,
> and
> fails the >= condition.
>
> And thinks it can get the greatest ID by backward scanning the ID idx, but
> ends
> up rejecting/filtering the first 41e6 rows, for which the time is too high,
> failing the < condition.
>

Yes, the columns are highly correlated, but that alone doesn't seem like it
should be sufficient criteria to choose this plan.
Ie the selection criteria (1 day of data about a year ago) has a year+
worth of data after it and probably a decade of data before it, so anything
walking a correlated index from top or bottom is going to have to walk past
a lot of data before it gets to data that fits the criteria.


> One solution seems to be to create an index on (i,j), but I don't know if
> there's a better way.
>
>
Adding the count() stops the planner from considering that option, so
that will work for now.
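The workaround is presumably along these lines (sketched from the query
quoted above, not shown verbatim in the thread): adding an aggregate
such as count(*) that cannot be satisfied by the min/max index-scan
shortcut forces the planner to aggregate only the rows matching the
time range.

```sql
-- With count(*) in the target list the planner can no longer answer
-- the query by walking the risk_id index from each end, so it falls
-- back to scanning just the rows selected by the time predicate.
SELECT min(risk_id), max(risk_id), count(*)
FROM risk
WHERE time >= '2020-01-20 15:00:07+00'
  AND time <  '2020-01-21 15:00:08+00';
```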
My colleague has pointed out that we had the same issue in November and I
came up with the count() workaround then too, but somehow seem to have
forgotten it in the meantime and reinvented it today. I wonder if I posted
to pgsql-performance then too.

Maybe time for me to read the PG12 release notes

Paul