Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-29 Thread Andy Fan
>> The other serious error we could be making here is to change things on
>> the basis of just a few examples.  You really need a pretty wide range
>> of test cases to be sure that you're not making things worse, any time
>> you're twiddling basic parameters like these.
>>
>>
> I will try more things in this direction, thanks for the suggestion.
>

I chose TPC-H for this purpose, with the data and index setup based on [1].
The attached normal.log contains the plans without this patch, and
patched.log contains the plans with it.  In general, the best path doesn't
change due to this patch.  All the plans whose cost changed have the
following pattern, which is expected.

Index Scan ...
   Index Cond: ...
   Filter: ...

If you diff the two files, you may find the cost of the "Index Scan" node
doesn't change; that is mainly because costs are shown with only 2 decimal
digits, which is not precise enough to show the difference.  However, with
a nested loop on top, the overall plan does show the cost difference.

[1] https://ankane.org/tpc-h

-- 
Best Regards
Andy Fan


normal.log
Description: Binary data


patched.log
Description: Binary data


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-29 Thread Andy Fan
On Fri, May 29, 2020 at 9:37 PM Ashutosh Bapat wrote:

> On Fri, May 29, 2020 at 6:40 AM Andy Fan  wrote:
> >
> >
> >>
> >> >so we need to optimize the cost model for such cases, the method is the
> >> >patch I mentioned above.
> >>
> >> Making the planner more robust w.r.t. estimation errors is nice, but
> >> I wouldn't go as far as saying we should optimize for such cases. The stats
> >> can be arbitrarily off, so should we expect the error to be 10%, 100% or
> >> 1000%?
> >
> >
> > I don't think my patch relies on anything like that.  My patch doesn't
> > fix the statistics issue, it just adds extra cost to the quals in the
> > Index Filter part.  Assume the query pattern is "where col1 = X and
> > col2 = Y".  The impacts are:
> > 1). Make the cost of index (col1, other_column) higher than that of
> > (col1, col2).
> > 2). The relationship between seq scan and index scan on index (col1,
> > other_column) is changed (this is something I don't want).  However, the
> > cost difference between index scan & seq scan is usually very large, so
> > the change above should have nearly no impact on that choice.
> > 3). Make the cost of an index scan on index (col1) alone higher.
> > Overall I think nothing will make things worse.
>
> When the statistics are almost correct (or better than what you have in
> your example), the index which does not cover all the columns in all
> the conditions will be more expensive anyway because of the extra cost to
> access the heap for the extra rows not filtered by that index. An index
> covering all the conditions would have a cheaper scan cost, since
> there will be fewer rows and hence fewer heap page accesses because of
> more filtering. So I don't think we need any change in the current
> costing model.
>

Thank you for your reply.  Your comments are based on the statistics being
almost correct (or better than what I have in my example); that is true.
However, my goal is to figure out a way to generate a better plan even when
the statistics are not correct (statistics with such issues are not very
uncommon; I just ran into one such case and spent a week handling some
non-technical matters afterwards).  I think the current question is: even
if my patch makes the worst case better, we need to make sure the average
performance doesn't get worse.

-- 
Best Regards
Andy Fan


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-29 Thread Ashutosh Bapat
On Fri, May 29, 2020 at 6:40 AM Andy Fan  wrote:
>
>
>>
>> >so we need to optimize the cost model for such cases, the method is the
>> >patch I mentioned above.
>>
>> Making the planner more robust w.r.t. estimation errors is nice, but
>> I wouldn't go as far as saying we should optimize for such cases. The stats
>> can be arbitrarily off, so should we expect the error to be 10%, 100% or
>> 1000%?
>
>
> I don't think my patch relies on anything like that.  My patch doesn't fix
> the statistics issue, it just adds extra cost to the quals in the Index
> Filter part.  Assume the query pattern is "where col1 = X and col2 = Y".
> The impacts are:
> 1). Make the cost of index (col1, other_column) higher than that of
> (col1, col2).
> 2). The relationship between seq scan and index scan on index (col1,
> other_column) is changed (this is something I don't want).  However, the
> cost difference between index scan & seq scan is usually very large, so the
> change above should have nearly no impact on that choice.
> 3). Make the cost of an index scan on index (col1) alone higher.
> Overall I think nothing will make things worse.

When the statistics are almost correct (or better than what you have in
your example), the index which does not cover all the columns in all
the conditions will be more expensive anyway because of the extra cost to
access the heap for the extra rows not filtered by that index. An index
covering all the conditions would have a cheaper scan cost, since
there will be fewer rows and hence fewer heap page accesses because of
more filtering. So I don't think we need any change in the current
costing model.

>
>>
>> We'd probably end up with plans that handle worst cases well,
>> but the average performance would end up being way worse :-(
>>
>
> That's possible, that's why I hope to get some feedback on that.  Actually I
> can't think of such a case.  Do you have anything like that in mind?
>
>
> I'm feeling that (qpqual_cost.per_tuple * 1.001) is not good enough, since a
> user may have something like "where expensive_func(col1) = X".  We may change
> it to cpu_tuple_cost + qpqual_cost.per_tuple + (0.0001) * list_length(qpquals).
>
> --
> Best Regards
> Andy Fan



-- 
Best Wishes,
Ashutosh Bapat




Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-28 Thread Andy Fan
Thanks all of you for your feedback.

On Fri, May 29, 2020 at 9:04 AM Tom Lane  wrote:

> Tomas Vondra  writes:
> > On Wed, May 27, 2020 at 09:58:04PM +0800, Andy Fan wrote:
> >> so we need to optimize the cost model for such cases, the method is the
> >> patch I mentioned above.
>
> > Making the planner more robust w.r.t. estimation errors is nice, but
> > I wouldn't go as far as saying we should optimize for such cases.
>
> Yeah, it's a serious mistake to try to "optimize" for cases where we have
> no data or wrong data.  By definition, we don't know what we're doing,
> so who's to say whether we've made it better or worse?


Actually I think it is a more robust way.  The patch can't fix all the
impacts of bad statistics (that is impossible, I think), but it will make
some simple cases better and make others no worse.  By that definition, I
think I do know what we are doing here, as in my reply to Tomas above.
But it is possible that my thinking is wrong.


> The other serious error we could be making here is to change things on
> the basis of just a few examples.  You really need a pretty wide range
> of test cases to be sure that you're not making things worse, any time
> you're twiddling basic parameters like these.
>
>
I will try more things in this direction, thanks for the suggestion.

-- 
Best Regards
Andy Fan


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-28 Thread Andy Fan
> >so we need to optimize the cost model for such cases, the method is the
> >patch I mentioned above.
>
> Making the planner more robust w.r.t. estimation errors is nice, but
> I wouldn't go as far as saying we should optimize for such cases. The stats
> can be arbitrarily off, so should we expect the error to be 10%, 100% or
> 1000%?


I don't think my patch relies on anything like that.  My patch doesn't fix
the statistics issue, it just adds extra cost to the quals in the Index
Filter part.  Assume the query pattern is "where col1 = X and col2 = Y".
The impacts are:
1). Make the cost of index (col1, other_column) higher than that of
(col1, col2).
2). The relationship between seq scan and index scan on index (col1,
other_column) is changed (this is something I don't want).  However, the
cost difference between index scan & seq scan is usually very large, so the
change above should have nearly no impact on that choice.
3). Make the cost of an index scan on index (col1) alone higher.
Overall I think nothing will make things worse.


> We'd probably end up with plans that handle worst cases well,
> but the average performance would end up being way worse :-(
>
>
That's possible, that's why I hope to get some feedback on that.  Actually I
can't think of such a case.  Do you have anything like that in mind?


I'm feeling that (qpqual_cost.per_tuple * 1.001) is not good enough, since a
user may have something like "where expensive_func(col1) = X".  We may change
it to cpu_tuple_cost + qpqual_cost.per_tuple + (0.0001) * list_length(qpquals).
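
A minimal sketch of that alternative, assuming it replaces the same
cpu_per_tuple computation in cost_index() that the 1.001 one-liner touches
(the 0.0001 per-qual surcharge is just the number floated above, not
something tested):

    /*
     * Rather than scaling the whole per-tuple qual cost (which would
     * over-penalize an expensive function left in the Filter), add a
     * tiny flat surcharge for each qual that could not become an
     * Index Cond.  qpquals is the list of such leftover quals, as
     * computed earlier in cost_index().
     */
    cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple +
        0.0001 * list_length(qpquals);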

-- 
Best Regards
Andy Fan


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-28 Thread Tom Lane
Tomas Vondra  writes:
> On Wed, May 27, 2020 at 09:58:04PM +0800, Andy Fan wrote:
>> so we need to optimize the cost model for such cases, the method is the
>> patch I mentioned above.

> Making the planner more robust w.r.t. estimation errors is nice, but
> I wouldn't go as far as saying we should optimize for such cases.

Yeah, it's a serious mistake to try to "optimize" for cases where we have
no data or wrong data.  By definition, we don't know what we're doing,
so who's to say whether we've made it better or worse?  And the possible
side effects on cases where we do have good data are not to be ignored.

> Anyway, I kinda doubt making the conditions 1.001x more expensive is a
> way to make the planning more robust. I'm pretty sure we could construct
> examples in the opposite direction, in which case this change makes it
> more likely we use the wrong index.

The other serious error we could be making here is to change things on
the basis of just a few examples.  You really need a pretty wide range
of test cases to be sure that you're not making things worse, any time
you're twiddling basic parameters like these.

regards, tom lane




Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-28 Thread Tomas Vondra

On Wed, May 27, 2020 at 09:58:04PM +0800, Andy Fan wrote:

> On Wed, May 27, 2020 at 8:01 PM Ashutosh Bapat
> <ashutosh.ba...@2ndquadrant.com> wrote:
>
>> On Wed, 27 May 2020 at 04:43, Andy Fan  wrote:


>>> You can use the attached sql to reproduce this issue, but I'm not sure
>>> you can get the above result the first time; that is because when the
>>> optimizer thinks the 2 index scans have the same cost, it will choose
>>> the first one it finds, and the order depends on RelationGetIndexList.
>>> If so, you may try dropping and re-creating the j1_i_im5 index.
>>>
>>> The idea behind this patch is that we still use the cost-based
>>> optimizer, just when we find that 2 index scans have the same cost, we
>>> prefer to use the index which handles more quals in its Index Cond.
>>> This is implemented by making the qual cost on the index Filter
>>> slightly higher.



>> Thanks for the example and the explanation.
>>
>> The execution time difference in your example is too high to be accounted
>> for by executing the filter on so many rows. My guess is this has to do
>> with the heap access: to apply the filter, the entire row needs to be
>> fetched from the heap. So we should investigate this case from that angle.
>> Another guess I have is that the statistics are not correct and hence the
>> cost is wrong.



> I believe this is a statistics issue and hence the cost is wrong.


I think you're both right. Most of the time probably comes from the
heap accesses, but the database has no chance to account for that,
as there was no ANALYZE after inserting the data that causes this. So it's
very difficult to account for this when computing the cost.


> More characteristics of this issue are:
> 1). If a value is out of range in the old statistics, the optimizer will
> assume 1 row.
> 2). Based on the 1-row assumption, for the query "col1 = out_of_range_val
> AND col2 = any_value", indexes (col1, col2) and (col1, col3) will have
> exactly the same cost under the current cost model.
> 3). If the statistics are wrong, (col1, col3) may be a very bad plan as
> shown above, but index (col1, col2) should always be better than or no
> worse than (col1, col3) in any case.
> 4). To extend the rule: for the query "col1 = out_of_range_val AND col2 =
> any_value AND col3 = any_val", with indexes (col1, col2, col_m) and (col1,
> col_m, col_n), the former index will always be better than or no worse
> than the latter.
> 5). A statistics issue like this is not uncommon; for example, in a
> log-based application, creation_date very easily goes out of the
> statistics' range.



Right. There are many ways to cause issues like this.


> so we need to optimize the cost model for such cases, the method is the
> patch I mentioned above.


Making the planner more robust w.r.t. estimation errors is nice, but
I wouldn't go as far as saying we should optimize for such cases. The stats
can be arbitrarily off, so should we expect the error to be 10%, 100% or
1000%? We'd probably end up with plans that handle worst cases well,
but the average performance would end up being way worse :-(

Anyway, I kinda doubt making the conditions 1.001x more expensive is a
way to make the planning more robust. I'm pretty sure we could construct
examples in the opposite direction, in which case this change makes it
more likely we use the wrong index.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-27 Thread Andy Fan
On Wed, May 27, 2020 at 8:01 PM Ashutosh Bapat
<ashutosh.ba...@2ndquadrant.com> wrote:

>
>
> On Wed, 27 May 2020 at 04:43, Andy Fan  wrote:
>
>> You can use the attached sql to reproduce this issue, but I'm not sure
>> you can get the above result the first time; that is because when the
>> optimizer thinks the 2 index scans have the same cost, it will choose
>> the first one it finds, and the order depends on RelationGetIndexList.
>> If so, you may try dropping and re-creating the j1_i_im5 index.
>>
>> The idea behind this patch is that we still use the cost-based
>> optimizer, just when we find that 2 index scans have the same cost, we
>> prefer to use the index which handles more quals in its Index Cond.
>> This is implemented by making the qual cost on the index Filter
>> slightly higher.
>>
>
> Thanks for the example and the explanation.
>
> The execution time difference in your example is too high to be accounted
> for by executing the filter on so many rows. My guess is this has to do
> with the heap access: to apply the filter, the entire row needs to be
> fetched from the heap. So we should investigate this case from that angle.
> Another guess I have is that the statistics are not correct and hence the
> cost is wrong.
>
>
I believe this is a statistics issue and hence the cost is wrong.  More
characteristics of this issue are:
1). If a value is out of range in the old statistics, the optimizer will
assume 1 row.
2). Based on the 1-row assumption, for the query "col1 = out_of_range_val
AND col2 = any_value", indexes (col1, col2) and (col1, col3) will have
exactly the same cost under the current cost model.
3). If the statistics are wrong, (col1, col3) may be a very bad plan as
shown above, but index (col1, col2) should always be better than or no
worse than (col1, col3) in any case.
4). To extend the rule: for the query "col1 = out_of_range_val AND col2 =
any_value AND col3 = any_val", with indexes (col1, col2, col_m) and (col1,
col_m, col_n), the former index will always be better than or no worse than
the latter.
5). A statistics issue like this is not uncommon; for example, in a
log-based application, creation_date very easily goes out of the
statistics' range.

So we need to optimize the cost model for such cases; the method is the
patch I mentioned above.  I don't have solid data to prove Oracle does
something similar, but based on the talks with my customer, Oracle likely
does something like this.

-- 
Best Regards
Andy Fan


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-27 Thread Ashutosh Bapat
On Wed, 27 May 2020 at 04:43, Andy Fan  wrote:

> You can use the attached sql to reproduce this issue, but I'm not sure
> you can get the above result the first time; that is because when the
> optimizer thinks the 2 index scans have the same cost, it will choose
> the first one it finds, and the order depends on RelationGetIndexList.
> If so, you may try dropping and re-creating the j1_i_im5 index.
>
> The idea behind this patch is that we still use the cost-based
> optimizer, just when we find that 2 index scans have the same cost, we
> prefer to use the index which handles more quals in its Index Cond.
> This is implemented by making the qual cost on the index Filter
> slightly higher.
>

Thanks for the example and the explanation.

The execution time difference in your example is too high to be accounted
for by executing the filter on so many rows. My guess is this has to do
with the heap access: to apply the filter, the entire row needs to be
fetched from the heap. So we should investigate this case from that angle.
Another guess I have is that the statistics are not correct and hence the
cost is wrong.

-- 
Best Wishes,
Ashutosh


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-26 Thread Andy Fan
You can use the attached sql to reproduce this issue, but I'm not sure you
can get the above result the first time; that is because when the optimizer
thinks the 2 index scans have the same cost, it will choose the first one
it finds, and the order depends on RelationGetIndexList.  If so, you may
try dropping and re-creating the j1_i_im5 index.

The idea behind this patch is that we still use the cost-based optimizer,
just when we find that 2 index scans have the same cost, we prefer to use
the index which handles more quals in its Index Cond.  This is implemented
by making the qual cost on the index Filter slightly higher.
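
For reference, a minimal sketch of the change in context; the placement on
the cpu_per_tuple computation in cost_index() in costsize.c follows the
one-line diff quoted later in the thread, and the 1.001 factor is the value
under discussion rather than a settled choice:

    /*
     * Charge quals evaluated as an index Filter slightly more than the
     * same quals evaluated as an Index Cond, so that two otherwise
     * equally-costed index paths are tie-broken in favor of the index
     * that absorbs more quals into its Index Cond.
     */
    cpu_per_tuple = cpu_tuple_cost + (qpqual_cost.per_tuple * 1.001);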

The issue here is not so uncommon in real life.  Consider a log-based
application which has several indexes with create_date as the leading
column: when the first rows for a given day are loaded, but before the new
statistics are gathered, queries will probably run into this issue.

-- 
Best Regards
Andy Fan


index_choose.sql
Description: Binary data


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-26 Thread Andy Fan
On Tue, May 26, 2020 at 9:59 PM Ashutosh Bapat wrote:

> On Tue, May 26, 2020 at 1:52 PM Andy Fan  wrote:
> >
> >
> > Consider the below example:
> >
> > create table j1(i int, im5 int,  im100 int, im1000 int);
> > insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 1000)i;
> > create index j1_i_im5 on j1(i, im5);
> > create index j1_i_im100 on j1(i, im100);
> > analyze j1;
> > explain select * from j1 where i = 100 and im5 = 5;
> >
> > We may get a plan like this:
> >
> > demo=# explain select * from j1 where i = 100 and im5 = 1;
> >                              QUERY PLAN
> > ----------------------------------------------------------------------
> >  Index Scan using j1_i_im100 on j1  (cost=0.43..8.46 rows=1 width=16)
> >    Index Cond: (i = 100)
> >    Filter: (im5 = 1)
> > (3 rows)
> >
> > In this case, the optimizer estimates there is only 1 row to return, so
> > both indexes have the same cost, and which one will be chosen is
> > uncontrollable.  This is fine for the above query if the estimation is
> > accurate.  However, estimation can't always be accurate in real life.
> > Some inaccurate estimations can cause a wrong index choice.  From
> > experience, the j1_i_im5 index should always be chosen for the above
> > query.
>
> I think we need a better example where choosing an index makes a
> difference.
>
> An index can be chosen just because its path was created before some
> other more appropriate index's, but the cost difference was within the
> fuzzy limit; i.e., purely based on the order in which index paths are
> created.
>

Here is a further example based on the above case:

demo=# insert into j1 select 1, 1, 1, 1 from generate_series(1, 10)i;
INSERT 0 10

With the current implementation, we get:

demo=# explain analyze select * from j1 where i = 1 and im5 = 2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using j1_i_im100 on j1  (cost=0.43..8.44 rows=1 width=16) (actual time=63.431..63.431 rows=0 loops=1)
   Index Cond: (i = 1)
   Filter: (im5 = 2)
   Rows Removed by Filter: 11
 Planning Time: 0.183 ms
 Execution Time: 63.484 ms
(6 rows)

With the patch, the planner always chooses the correct index even when the
statistics are inaccurate:

demo=# explain analyze select * from j1 where i = 1 and im5 = 2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using j1_i_im5 on j1  (cost=0.43..8.46 rows=1 width=16) (actual time=0.030..0.030 rows=0 loops=1)
   Index Cond: ((i = 1) AND (im5 = 2))
 Planning Time: 1.087 ms
 Execution Time: 0.077 ms
(4 rows)

-- 
Best Regards
Andy Fan


Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

2020-05-26 Thread Ashutosh Bapat
On Tue, May 26, 2020 at 1:52 PM Andy Fan  wrote:
>
>
> Consider the below example:
>
> create table j1(i int, im5 int,  im100 int, im1000 int);
> insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 1000)i;
> create index j1_i_im5 on j1(i, im5);
> create index j1_i_im100 on j1(i, im100);
> analyze j1;
> explain select * from j1 where i = 100 and im5 = 5;
>
> We may get a plan like this:
>
> demo=# explain select * from j1 where i = 100 and im5 = 1;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Index Scan using j1_i_im100 on j1  (cost=0.43..8.46 rows=1 width=16)
>    Index Cond: (i = 100)
>    Filter: (im5 = 1)
> (3 rows)
>
> In this case, the optimizer estimates there is only 1 row to return, so both
> indexes have the same cost, and which one will be chosen is uncontrollable.
> This is fine for the above query if the estimation is accurate. However,
> estimation can't always be accurate in real life. Some inaccurate
> estimations can cause a wrong index choice. From experience, the j1_i_im5
> index should always be chosen for the above query.

I think we need a better example where choosing an index makes a difference.

An index can be chosen just because its path was created before some
other more appropriate index's, but the cost difference was within the
fuzzy limit; i.e., purely based on the order in which index paths are
created.

>
> This one-line change is the best method I can think of:
>
> -   cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
> +   cpu_per_tuple = cpu_tuple_cost + (qpqual_cost.per_tuple * 1.001);
>
> We make the qual cost on the index Filter slightly higher than the qual
> cost in the Index Cond.  This will also be good for the qual (i=x AND m=y
> AND n=z) with indexes (i, m, other_col1) and (i, other_col1, other_col2).
> But this change also changes the relation between the qual cost on an
> index scan and the qual cost on a seq scan.  However, I think that impact
> is so tiny that we can ignore it (we can choose a better factor between 1
> and 1.001).
>
> Even though the root cause of this issue comes from inaccurate estimation,
> I don't think that is easy/possible to fix; however, I'm open to
> suggestions on that as well.
>
> Any suggestions?
>
> --
> Best Regards
> Andy Fan



-- 
Best Wishes,
Ashutosh Bapat