Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-10-01 Thread Michael Paquier
On Fri, Jul 27, 2018 at 02:19:27PM +0530, Ashutosh Bapat wrote:
> [ removal of latest arguments ]
> +1, if we could do that.

The patch seems to have been stuck for a while, so I am marking it as
returned with feedback due to lack of activity.
--
Michael




Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-07-27 Thread Ashutosh Bapat
On Thu, Jul 26, 2018 at 10:30 PM, Dean Rasheed  wrote:
> On 26 July 2018 at 07:12, Ashutosh Bapat
>  wrote:
>> In the patch clauselist_selectivity() gets called repeatedly for every
>> new qual added to the clause list. Instead, if we try to combine the
>> cost/row estimation with order_qual_clauses() or
>> clauselist_selectivity(), we might be able to do what the current patch
>> does in O(n). clauselist_selectivity() calculates combined selectivity
>> of all the given quals in O(n). We should do something similar to that
>> in this case.
>
> Duplicating the logic of clauselist_selectivity() seems like quite a
> lot of effort to go to just for improved filter cost estimates. Note
> also that clauselist_selectivity() might get a good deal more complex
> with multivariate stats.

I am not suggesting duplicating the code in clauselist_selectivity().
Instead, I am suggesting that we incorporate costing into that function.
I don't know how feasible that is.

>
> Perhaps a reasonable simplification would be to just treat the clauses
> as independent when estimating the filter cost, and so use the
> following as a "good enough" estimate for the filter cost:
>
>   cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...
>
> That would probably be an improvement over what we currently have. It
> would be O(n) to compute, and it would probably use the cached
> selectivity estimates for the clauses.

That looks like a good trade-off. But looking at the following comment
in clause_selectivity(), I doubt whether this will work in all cases:
/*
 * If possible, cache the result of the selectivity calculation for
 * the clause.  We can cache if varRelid is zero or the clause
 * contains only vars of that relid --- otherwise varRelid will affect
 * the result, so mustn't cache.  Outer join quals might be examined
 * with either their join's actual jointype or JOIN_INNER, so we need
 * two cache variables to remember both cases.  Note: we assume the
 * result won't change if we are switching the input relations or
 * considering a unique-ified case, so we only need one cache variable
 * for all non-JOIN_INNER cases.
 */
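
For reference, the two cache variables that comment mentions live on
each RestrictInfo. A trimmed-down sketch (norm_selec and outer_selec
match the real field names; everything else is omitted, and the comments
are just my reading of the code):

/* Simplified view of the selectivity cache described above.  In the
 * real RestrictInfo these fields are of type Selectivity, and -1
 * means "not cached yet". */
typedef struct
{
    /* ... clause expression, required relids, etc. ... */
    double      norm_selec;     /* cached selectivity, JOIN_INNER case */
    double      outer_selec;    /* cached selectivity, all non-JOIN_INNER cases */
} SimplifiedRestrictInfo;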


>
> Note also that with this simplified expression for the filter cost, it
> would be possible to improve order_qual_clauses() to take into account
> the clause selectivities when sorting the clauses, and minimise the
> cost of the filter by evaluating more selective clauses sooner, if
> they're not too expensive. I believe that amounts to sorting by
> cost/(1-sel) rather than just cost, except for clauses with sel==1,
> which it makes sense to move to the end and sort by cost alone.

+1, if we could do that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-07-26 Thread Dean Rasheed
On 26 July 2018 at 07:12, Ashutosh Bapat
 wrote:
> In the patch clauselist_selectivity() gets called repeatedly for every
> new qual added to the clause list. Instead, if we try to combine the
> cost/row estimation with order_qual_clauses() or
> clauselist_selectivity(), we might be able to do what the current patch
> does in O(n). clauselist_selectivity() calculates combined selectivity
> of all the given quals in O(n). We should do something similar to that
> in this case.

Duplicating the logic of clauselist_selectivity() seems like quite a
lot of effort to go to just for improved filter cost estimates. Note
also that clauselist_selectivity() might get a good deal more complex
with multivariate stats.

Perhaps a reasonable simplification would be to just treat the clauses
as independent when estimating the filter cost, and so use the
following as a "good enough" estimate for the filter cost:

  cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...

That would probably be an improvement over what we currently have. It
would be O(n) to compute, and it would probably use the cached
selectivity estimates for the clauses.
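
To make that concrete, here is a minimal self-contained C sketch of the
O(n) computation, treating the clauses as independent. The Clause struct
and the sample numbers are invented for illustration; this is not the
planner's actual code or data structures:

#include <stdio.h>

typedef struct { double cost; double sel; } Clause;

/* cost(A) + sel(A)*cost(B) + sel(A)*sel(B)*cost(C) + ... in one pass */
static double
filter_cost(const Clause *quals, int n)
{
    double  total = 0.0;
    double  weight = 1.0;   /* product of selectivities of earlier quals */

    for (int i = 0; i < n; i++)
    {
        total += weight * quals[i].cost;    /* cost scaled by reach probability */
        weight *= quals[i].sel;             /* later quals are reached less often */
    }
    return total;
}

int
main(void)
{
    Clause  quals[] = {{1.0, 0.5}, {2.0, 0.1}, {10.0, 0.9}};

    /* 1.0 + 0.5*2.0 + 0.5*0.1*10.0 = 2.5 */
    printf("estimated filter cost = %g\n", filter_cost(quals, 3));
    return 0;
}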

Note also that with this simplified expression for the filter cost, it
would be possible to improve order_qual_clauses() to take into account
the clause selectivities when sorting the clauses, and minimise the
cost of the filter by evaluating more selective clauses sooner, if
they're not too expensive. I believe that amounts to sorting by
cost/(1-sel) rather than just cost, except for clauses with sel==1,
which it makes sense to move to the end and sort by cost alone.
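
A rough sketch of that ordering rule, reusing the invented Clause struct
from the sketch above (this illustrates the ranking only; it is not a
drop-in change to order_qual_clauses()):

#include <stdlib.h>

typedef struct { double cost; double sel; } Clause;

static int
cmp_clause(const void *pa, const void *pb)
{
    const Clause *a = pa;
    const Clause *b = pb;
    int     a_last = (a->sel >= 1.0);   /* sel==1 quals never eliminate rows */
    int     b_last = (b->sel >= 1.0);
    double  ka;
    double  kb;

    if (a_last != b_last)
        return a_last - b_last;         /* move sel==1 quals to the end */

    /* sel==1 quals sort by cost alone; the rest by cost/(1-sel) */
    ka = a_last ? a->cost : a->cost / (1.0 - a->sel);
    kb = b_last ? b->cost : b->cost / (1.0 - b->sel);
    return (ka > kb) - (ka < kb);
}

static void
order_quals(Clause *quals, int n)
{
    qsort(quals, n, sizeof(Clause), cmp_clause);
}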

Regards,
Dean



Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-07-26 Thread Ashutosh Bapat
On Mon, Jan 29, 2018 at 10:42 AM, Yuto Hayamizu  wrote:
> On Fri, Jan 19, 2018 at 5:07 PM, Yuto Hayamizu  wrote:
>> My idea for improving this patch is to introduce a threshold N_limit:
>> for q_1 ... q_N_limit, do the same weighted cost estimation as in the
>> current version of this patch. For q_{N_limit+1} onwards, stop calling
>> clauselist_selectivity for calculating the weight and reuse the result
>> of clauselist_selectivity({q_1, q_2, ..., q_N_limit}). For example,
>> with N_limit=100, the additional overhead is only sub-milliseconds per
>> range table entry, and the cost estimation is surely better than the
>> current postgres implementation.
>
> The attached patch implements the improvement idea above.
> With this patch applied, the performance degradation of the test query
> with many quals was <1%.
> An example test query is attached.
>

I looked at the patch and the idea sounds reasonable to me. But I
still have doubts about the usefulness of the patch. What this patch
does is produce more accurate costs for the scan of a single relation.
That's good, but it does that for all the paths created for that
relation, so the more accurate estimate doesn't help us choose one
method of scanning the relation over another. It also does not affect
the costs of different join paths, sort paths, etc. IOW, I cannot find
a query which would be planned differently than on upstream when we
apply the patch, with the new plan being more efficient than the
previous one. I might be missing something, though. Can you please
provide such a query?

In the patch clauselist_selectivity() gets called repeatedly for every
new qual added to the clause list. Instead, if we try to combine the
cost/row estimation with order_qual_clauses() or
clauselist_selectivity(), we might be able to do what the current patch
does in O(n). clauselist_selectivity() calculates combined selectivity
of all the given quals in O(n). We should do something similar to that
in this case.

As suggested upthread by Tom, it would be more useful to have this
feature work not just with scans but with joins as well.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-01-28 Thread Yuto Hayamizu
On Fri, Jan 19, 2018 at 5:07 PM, Yuto Hayamizu  wrote:
> My idea for improving this patch is to introduce a threshold N_limit:
> for q_1 ... q_N_limit, do the same weighted cost estimation as in the
> current version of this patch. For q_{N_limit+1} onwards, stop calling
> clauselist_selectivity for calculating the weight and reuse the result
> of clauselist_selectivity({q_1, q_2, ..., q_N_limit}). For example,
> with N_limit=100, the additional overhead is only sub-milliseconds per
> range table entry, and the cost estimation is surely better than the
> current postgres implementation.

The attached patch implements the improvement idea above.
With this patch applied, the performance degradation of the test query
with many quals was <1%.
An example test query is attached.

regards,


Yuto Hayamizu


Mitigate-filter-cost-overestimation-v3.patch
Description: Binary data


testquery.sql
Description: Binary data


Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-01-19 Thread Yuto Hayamizu
On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
 wrote:
> different sets, within each set the order is the same. FWIW, we can
> order all clauses in the largest set once and use that order every
> time. Albeit we will have to remember the order somewhere OR make the
> separator routine retain the order of the larger set, which I guess is
> true of all separator functions.

For this patch, sorting of the qual list happens only once for each
range table entry, not for each path, so as far as I know there is no
need to cache sorted qual lists.


regards,
Yuto Hayamizu



Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-01-19 Thread Yuto Hayamizu
On Wed, Nov 8, 2017 at 6:48 AM, Tom Lane  wrote:
> Because (1) up to now there's been no need to consider the qual ordering
> till later, and (2) re-doing that sort for each path seemed unduly
> expensive.  If we're to try to estimate whether later quals will be
> reached, then sure the ordering becomes important.  I'm still concerned
> about (2) though.

This patch changes 'set_baserel_size_estimates', which is called only once
for each range table entry, so sorting does not happen for each path and
its negative impact on optimizer performance is negligible.

While sorting the quals does not cost much, I noticed another performance
problem with this patch. The patched 'set_baserel_size_estimates' is
O(N^2) (N being the number of quals) and can take a long time when a
query has a large qual list. After sorting the quals, it loops over
q_1, q_2, ..., q_N to estimate the "weighted" cost of each qual q_k.
In each iteration, in order to calculate the "weight" of q_k, it calls
clauselist_selectivity({q_1, q_2, ..., q_k}), which itself loops k times.

I've measured the actual negative impact on optimization time with an
artificial query:
SELECT count(*) FROM pgbench_accounts WHERE (randomly generated N quals);

N=50:   0.5966 msec (original),   0.938 msec (patched)
N=100:  1.1992 msec (original),   1.5396 msec (patched)
N=200:  1.9167 msec (original),   4.6676 msec (patched)
N=400:  2.7583 msec (original),  12.0786 msec (patched)
N=800:  5.2919 msec (original),  38.0228 msec (patched)
N=1600: 9.3428 msec (original), 167.737 msec (patched)

From these results, the patched version is almost O(N^2).
100 quals in a query seems like an unusually large number, but there
may exist real-world queries with hundreds or thousands of quals, so I
feel this patch needs improvement to handle large N.

My idea for improving this patch is to introduce a threshold N_limit:
for q_1 ... q_N_limit, do the same weighted cost estimation as in the
current version of this patch. For q_{N_limit+1} onwards, stop calling
clauselist_selectivity for calculating the weight and reuse the result
of clauselist_selectivity({q_1, q_2, ..., q_N_limit}). For example,
with N_limit=100, the additional overhead is only sub-milliseconds per
range table entry, and the cost estimation is surely better than the
current postgres implementation.
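
In sketch form, with an invented Clause struct and a toy stand-in for
clauselist_selectivity() (the real function accounts for dependencies
between quals; this one just multiplies selectivities), the capped loop
could look like:

#define N_LIMIT 100

typedef struct { double cost; double sel; } Clause;

/* toy stand-in for clauselist_selectivity({q_1, ..., q_k}) */
static double
prefix_selectivity(const Clause *quals, int k)
{
    double  s = 1.0;

    for (int i = 0; i < k; i++)
        s *= quals[i].sel;
    return s;
}

static double
filter_cost_capped(const Clause *quals, int n)
{
    double  total = 0.0;
    double  weight = 1.0;

    for (int i = 0; i < n; i++)
    {
        total += weight * quals[i].cost;
        if (i + 1 <= N_LIMIT)
            weight = prefix_selectivity(quals, i + 1);
        /* beyond N_LIMIT, reuse the weight of the first N_LIMIT quals */
    }
    return total;
}

This caps the quadratic part at O(N_limit^2) and leaves the remainder O(N).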

Any thoughts?



Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-01-16 Thread Yuto Hayamizu
On Sun, Jan 7, 2018 at 8:25 AM, Stephen Frost  wrote:
>
> Greetings,
>
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
> > On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
> >  wrote:
> > > Looking at order_qual_clauses(), we can say that a set of quals
> > > q1 .. qn are ordered the same irrespective of the set of clauses
> > > they are a subset of. E.g. if {q1 .. qn} is a subset of Q (ordered
> > > as Qo) and also of Q' (ordered as Q'o), the order in which they
> > > appear in Qo and Q'o is the same. So, even if different paths
> > > segregate the clauses into different sets, within each set the
> > > order is the same. FWIW, we can order all clauses in the largest
> > > set once and use that order every time. Albeit we will have to
> > > remember the order somewhere OR make the separator routine retain
> > > the order of the larger set, which I guess is true of all
> > > separator functions.
> >
> > I am not sure what to think about this patch, so I have moved it to
> > the next CF. The patch still applies. Hayamizu-san, it would also be
> > nice if you could review others' patches: one patch reviewed for one
> > patch submitted, of equal difficulty. You should also get a community
> > account so that it is possible to add your name as an author of this
> > patch.
>
> While this patch does still apply, it doesn't pass the 'make check'
> regression tests, and there are concerns raised up-thread about the
> performance impact.
>
> Yuto Hayamizu, it looks like some good questions were raised which need
> to be addressed, in addition to making sure that the patch at least
> passes 'make check', so I'm leaving this as waiting-for-author.  If you
> believe the patch is no longer viable, we can change it to 'returned
> with feedback'; otherwise, an updated patch and some responses to the
> questions raised would be great, as we're already a week into this
> commitfest and this thread has been dormant since near the start of the
> last CF.
>
> Thanks!
>
> Stephen


Thank you for your kind guidance, and sorry for the late reply.

The attached patch fixes the regression tests and now passes 'make check'.
Since this patch changes the cost estimation in set_baserel_size_estimates,
the chosen query plans changed for some tests, so I've updated their
expected EXPLAIN results.

I'm going to answer raised questions by replying to each message soon.

regards,

Yuto Hayamizu


Mitigate-filter-cost-overestimation-v2.patch
Description: Binary data


Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2018-01-06 Thread Stephen Frost
Greetings,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
>  wrote:
> > Looking at order_qual_clauses(), we can say that a set of quals
> > q1 .. qn are ordered the same irrespective of the set of clauses
> > they are a subset of. E.g. if {q1 .. qn} is a subset of Q (ordered
> > as Qo) and also of Q' (ordered as Q'o), the order in which they
> > appear in Qo and Q'o is the same. So, even if different paths
> > segregate the clauses into different sets, within each set the
> > order is the same. FWIW, we can order all clauses in the largest
> > set once and use that order every time. Albeit we will have to
> > remember the order somewhere OR make the separator routine retain
> > the order of the larger set, which I guess is true of all
> > separator functions.
> 
> I am not sure what to think about this patch, so I have moved it to
> the next CF. The patch still applies. Hayamizu-san, it would also be
> nice if you could review others' patches: one patch reviewed for one
> patch submitted, of equal difficulty. You should also get a community
> account so that it is possible to add your name as an author of this
> patch.

While this patch does still apply, it doesn't pass the 'make check'
regression tests, and there are concerns raised up-thread about the
performance impact.

Yuto Hayamizu, it looks like some good questions were raised which need
to be addressed, in addition to making sure that the patch at least
passes 'make check', so I'm leaving this as waiting-for-author.  If you
believe the patch is no longer viable, we can change it to 'returned
with feedback'; otherwise, an updated patch and some responses to the
questions raised would be great, as we're already a week into this
commitfest and this thread has been dormant since near the start of the
last CF.

Thanks!

Stephen




Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2017-11-29 Thread Michael Paquier
On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
 wrote:
> Looking at order_qual_clauses(), we can say that a set of quals
> q1 .. qn are ordered the same irrespective of the set of clauses
> they are a subset of. E.g. if {q1 .. qn} is a subset of Q (ordered
> as Qo) and also of Q' (ordered as Q'o), the order in which they
> appear in Qo and Q'o is the same. So, even if different paths
> segregate the clauses into different sets, within each set the
> order is the same. FWIW, we can order all clauses in the largest
> set once and use that order every time. Albeit we will have to
> remember the order somewhere OR make the separator routine retain
> the order of the larger set, which I guess is true of all
> separator functions.

I am not sure what to think about this patch, so I have moved it to
the next CF. The patch still applies. Hayamizu-san, it would also be
nice if you could review others' patches: one patch reviewed for one
patch submitted, of equal difficulty. You should also get a community
account so that it is possible to add your name as an author of this
patch.
-- 
Michael