Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
On Wed, Nov 8, 2017 at 3:18 AM, Tom Lane wrote:
> Robert Haas writes:
>> I think it would be a good idea, as Thomas says, to order the qual
>> clauses at an earlier stage and then remember our decision. However,
>> we have to think about whether that's going to increase planning time
>> in a noticeable way. I wonder why we currently postpone this until
>> such a late phase of planning.
>
> Because (1) up to now there's been no need to consider the qual ordering
> till later, and (2) re-doing that sort for each path seemed unduly
> expensive. If we're to try to estimate whether later quals will be
> reached, then sure the ordering becomes important. I'm still concerned
> about (2) though. If there were a way to pre-sort the quals once for
> all paths, the problem goes away, but I don't think that works ---
> indexscans may segregate some quals, and in join cases different paths
> will actually have different sets of quals they need to check depending
> on the join order.

Looking at order_qual_clauses(), we can say that a set of quals q1 .. qn
is ordered the same irrespective of the larger set of clauses it is a
subset of. E.g. if {q1 .. qn} is a subset of Q (ordered as Qo) and also
of Q' (ordered as Q'o), the order in which those quals appear in Qo and
Q'o is the same. So, even if different paths segregate the clauses into
different sets, within each set the order is the same. FWIW, we could
order all the clauses in the largest set once and use that order every
time. We would have to remember the order somewhere, or make the
separator routines retain the order of the larger set, which I guess is
already true of all the separator functions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
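The property Ashutosh describes can be seen with a toy standalone
program: sorting any subset with the same deterministic comparator
preserves the relative order the elements have in the sorted superset.
(A sketch with made-up qual names and costs, not planner code;
order_qual_clauses() additionally honors security levels for leakproof
quals, which this ignores.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    const char *name;
    double      cost;           /* stand-in for per-tuple eval cost */
} Qual;

/* compare by cost, breaking ties deterministically by name */
static int
cmp_qual(const void *a, const void *b)
{
    const Qual *qa = (const Qual *) a;
    const Qual *qb = (const Qual *) b;

    if (qa->cost != qb->cost)
        return (qa->cost < qb->cost) ? -1 : 1;
    return strcmp(qa->name, qb->name);
}

static void
sort_and_print(Qual *quals, size_t n)
{
    qsort(quals, n, sizeof(Qual), cmp_qual);
    for (size_t i = 0; i < n; i++)
        printf("%s ", quals[i].name);
    printf("\n");
}

int
main(void)
{
    Qual        superset[] = {{"q3", 5.0}, {"q1", 0.01}, {"q4", 100.0}, {"q2", 2.5}};
    Qual        subset[] = {{"q3", 5.0}, {"q1", 0.01}, {"q4", 100.0}};

    sort_and_print(superset, 4);    /* prints: q1 q2 q3 q4 */
    sort_and_print(subset, 3);      /* prints: q1 q3 q4 -- same relative order */
    return 0;
}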
Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
Robert Haas writes:
> I think it would be a good idea, as Thomas says, to order the qual
> clauses at an earlier stage and then remember our decision. However,
> we have to think about whether that's going to increase planning time
> in a noticeable way. I wonder why we currently postpone this until
> such a late phase of planning.

Because (1) up to now there's been no need to consider the qual ordering
till later, and (2) re-doing that sort for each path seemed unduly
expensive. If we're to try to estimate whether later quals will be
reached, then sure the ordering becomes important. I'm still concerned
about (2) though. If there were a way to pre-sort the quals once for
all paths, the problem goes away, but I don't think that works ---
indexscans may segregate some quals, and in join cases different paths
will actually have different sets of quals they need to check depending
on the join order.

So the bottom line here is: how much is it going to cost us to add this
additional refinement in cost estimation, and is it worth it given our
extremely poor level of accuracy in expression cost estimation? Color
me dubious.

			regards, tom lane
Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
On Mon, Nov 6, 2017 at 5:19 AM, Ashutosh Bapat wrote:
> IIRC, the only thing that changes between plan time quals and execution
> time quals is constant folding of constant parameters. But I don't
> think we change the selectivity estimates when that's done. At the
> same time, I don't think we should make a lot of effort to make sure
> that the order used during the estimation is the same as the order at
> execution; we are anyway estimating. There can always be some
> difference between what's estimated and what actually happens.

I don't think that's a good justification for allowing the order to
vary. It's true that, at execution time, we may find row counts and
selectivities to be different than what we predicted, but that is a
case of the real data being different from our idealized data -- which
is difficult to avoid in general. Allowing our actual decisions to be
different from the decisions we thought we would make seems like
programmer sloppiness. It would also be very confusing if the planner
used one ordering to estimate the cost and then a different ordering
showed up at execution time and in the EXPLAIN ANALYZE output. How
could anybody understand what had happened in such a case?

I think it would be a good idea, as Thomas says, to order the qual
clauses at an earlier stage and then remember our decision. However,
we have to think about whether that's going to increase planning time
in a noticeable way. I wonder why we currently postpone this until
such a late phase of planning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
On Mon, Nov 6, 2017 at 10:01 AM, Thomas Munro wrote:
>
> This idea seems to make intuitive sense. I see that you use
> order_qual_clauses() to know what order they'll run in, so I'm
> wondering if there is any reason we shouldn't do it up front and keep
> it during path building, instead of running it again at plan creation
> time. Is there some way it could ever produce a different result at
> the two times?

IIRC, the only thing that changes between plan time quals and execution
time quals is constant folding of constant parameters. But I don't
think we change the selectivity estimates when that's done. At the
same time, I don't think we should make a lot of effort to make sure
that the order used during the estimation is the same as the order at
execution; we are anyway estimating. There can always be some
difference between what's estimated and what actually happens.

> Why not also apply this logic to qpquals of joins,
> foreign scans, subplans? That is, instead of replacing cost_qual_eval
> with this code for baserels, I wonder if we should teach
> cost_qual_eval how to do this so those other users could also benefit
> (having perhaps already ordered the qual clauses earlier).

+1.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
On Mon, Sep 11, 2017 at 7:43 PM, Yuto Hayamizu wrote:
> Suppose that there are three qual clauses in a scan node. Current
> postgres estimates the per-tuple cost of the filter to be:
>
>    cost(A) + cost(B) + cost(C)
>
> The basic idea of the attached patch is:
>
>    cost(A) + clauselist_selectivity({A}) * cost(B)
>            + clauselist_selectivity({A, B}) * cost(C)

I am no planner expert and I haven't tested or studied the patch in
detail, but here is some feedback for what it's worth.

This idea seems to make intuitive sense. I see that you use
order_qual_clauses() to know what order they'll run in, so I'm
wondering if there is any reason we shouldn't do it up front and keep
it during path building, instead of running it again at plan creation
time. Is there some way it could ever produce a different result at
the two times?

Why not also apply this logic to qpquals of joins, foreign scans,
subplans? That is, instead of replacing cost_qual_eval with this code
for baserels, I wonder if we should teach cost_qual_eval how to do
this so those other users could also benefit (having perhaps already
ordered the qual clauses earlier).

This is one of those patches that risks having an impact on many
query plans. Yikes. Of all the regression test queries, only
updatable_views complained though, and that involves leakproof
functions. I see that those get some kind of special treatment in
order_qual_clauses().

+ ordered_clauses = order_qual_clauses(root, rel->baserestrictinfo);
+ clause_list = ordered_clauses;

Is clause_list necessary? Can't you just use ordered_clauses for the
outer and inner loops?

+ List *clause_list_for_sel = NULL;

The convention is to use NIL for empty lists (a nod to the Lisp
machine they prototyped this project on).

+ /* Make a temporary clause list for selectivity calcuation */

s/calcuation/calculation/

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
On Mon, Sep 11, 2017 at 04:43:46PM +0900, Yuto Hayamizu wrote:
> Hi hackers,
>
> Currently, the cost of a filter with multiple clauses is estimated by
> summing up the estimated cost of each clause. As long as a filter
> consists of simple clauses and its cost is fairly small, this works
> fine. However, when there are some heavy clauses (like a subquery or
> user-defined functions) and most tuples are filtered out by other,
> simple clauses, the total cost is likely to be overestimated. The
> attached patch mitigates this overestimation by using selectivity
> estimates of subsets of the clauses in a filter.

I've taken the liberty of adding this to the upcoming commitfest.

Best,
David.

--
David Fetter http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
[HACKERS] [PATCH] Overestimated filter cost and its mitigation
Hi hackers,

Currently, the cost of a filter with multiple clauses is estimated by
summing up the estimated cost of each clause. As long as a filter
consists of simple clauses and its cost is fairly small, this works
fine. However, when there are some heavy clauses (like a subquery or
user-defined functions) and most tuples are filtered out by other,
simple clauses, the total cost is likely to be overestimated. The
attached patch mitigates this overestimation by using selectivity
estimates of subsets of the clauses in a filter.

Suppose that there are three qual clauses in a scan node. Current
postgres estimates the per-tuple cost of the filter to be:

   cost(A) + cost(B) + cost(C)

The basic idea of the attached patch is:

   cost(A) + clauselist_selectivity({A}) * cost(B)
           + clauselist_selectivity({A, B}) * cost(C)

We first noticed this overestimation during performance analysis of
our real applications. In order to reproduce the situation, we created
a similar query with pgbench data. The database was prepared by the
following commands:

   pgbench --scale=1000 -i pgb5
   pgbench -c 10 -t 100000 pgb5

and the query is:

   select e.tid, e.aid, e.bid, e.mtime
   from pgbench_history e
   where e.tid between 1 and 1000
     and (select count(*) from pgbench_history f
          where f.mtime < e.mtime and f.bid = e.bid
          group by f.bid) > 100
     and e.aid between 1 and 10;

== Query plan with current upstream ==

Seq Scan on pgbench_history e  (cost=0.00..21393523889.00 rows=28 width=20) (actual time=2391.683..21775.191 rows=85 loops=1)
  Filter: ((tid >= 1) AND (tid <= 1000) AND (aid >= 1) AND (aid <= 10) AND ((SubPlan 1) > 100))
  Rows Removed by Filter: 999915
  SubPlan 1
    ->  GroupAggregate  (cost=0.00..21393.50 rows=283 width=12) (actual time=229.050..229.050 rows=1 loops=94)
          Group Key: f.bid
          ->  Seq Scan on pgbench_history f  (cost=0.00..21389.00 rows=333 width=4) (actual time=5.036..228.851 rows=529 loops=94)
                Filter: ((mtime < e.mtime) AND (bid = e.bid))
                Rows Removed by Filter: 999471

Most of the total cost of 21,393,523,889 comes from:

   (cost of SubPlan 1) * (# of rows in pgbench_history)
     = 21,393.50 * 1,000,000 = 21,393,500,000

but in the actual run, SubPlan 1 was executed only 94 times, so the
total cost was overestimated by a factor of more than 10,000.

== Query plan with patched upstream ==

Seq Scan on pgbench_history e  (cost=0.00..1687169.88 rows=28 width=20) (actual time=2388.802..21721.498 rows=85 loops=1)
  Filter: ((tid >= 1) AND (tid <= 1000) AND (aid >= 1) AND (aid <= 10) AND ((SubPlan 1) > 100))
  Rows Removed by Filter: 999915
  SubPlan 1
    ->  GroupAggregate  (cost=0.00..19726.83 rows=283 width=12) (actual time=228.507..228.507 rows=1 loops=94)
          Group Key: f.bid
          ->  Seq Scan on pgbench_history f  (cost=0.00..19722.33 rows=333 width=4) (actual time=5.378..228.316 rows=529 loops=94)
                Filter: ((mtime < e.mtime) AND (bid = e.bid))
                Rows Removed by Filter: 999471

With the patched upstream, the total cost is greatly decreased. In
cost estimation, only 84.4 tuples of pgbench_history are expected to
be evaluated with SubPlan 1. That is reasonably close to the actual
number, 94, and the total cost seems much more reasonable than the
cost with current upstream.

Any thoughts?

Regards,
Yuto Hayamizu

Mitigate-filter-cost-overestimation.patch
Description: Binary data
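To make the proposed computation concrete, here is a minimal sketch of
the costing loop described above, written against the PG 10-era planner
APIs. cost_qual_eval_node() and clauselist_selectivity() are real
planner functions with these signatures; the helper's name and its
placement are assumptions for illustration, not the actual patch.

/*
 * Sketch only: per-tuple filter cost where each clause's cost is
 * discounted by the estimated selectivity of the clauses ordered
 * before it, i.e. cost(A) + sel({A})*cost(B) + sel({A,B})*cost(C).
 * Startup costs and the security-level ordering constraints honored
 * by order_qual_clauses() are ignored here.
 */
#include "postgres.h"
#include "optimizer/cost.h"	/* cost_qual_eval_node, clauselist_selectivity */

static Cost
filter_pertuple_cost_sketch(PlannerInfo *root, List *ordered_clauses)
{
	Cost		per_tuple = 0.0;
	List	   *clauses_so_far = NIL;
	ListCell   *lc;

	foreach(lc, ordered_clauses)
	{
		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
		Selectivity sel = 1.0;
		QualCost	qcost;

		/* fraction of tuples expected to survive the earlier clauses */
		if (clauses_so_far != NIL)
			sel = clauselist_selectivity(root, clauses_so_far, 0,
										 JOIN_INNER, NULL);

		cost_qual_eval_node(&qcost, (Node *) rinfo->clause, root);
		per_tuple += sel * qcost.per_tuple;

		clauses_so_far = lappend(clauses_so_far, rinfo);
	}

	list_free(clauses_so_far);
	return per_tuple;
}

Note that this calls clauselist_selectivity() over a growing prefix for
every clause, so it is quadratic in the number of quals -- which is
exactly the planning-cost concern Tom raises in his reply.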