Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
Robert Haas robertmh...@gmail.com writes: On Tue, Dec 17, 2013 at 12:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: So at this point I'm pretty much talked into it. We could eliminate the dependence on indexes entirely, and replace this code with a step that simply tries to pull single-base-relation quals out of ORs wherever it can find one. You could argue that the produced quals would sometimes not be worth testing for, but we could apply a heuristic that says to forget it unless the estimated selectivity of the extracted qual is less than, I dunno, 0.5 maybe. This is an area where I think things are very different from local and remote tables. For a local table, the cost of transmitting a row from one plan node to another is basically zero. For a remote table, it's potentially far higher, although unfortunately it's hard to know exactly. But if the qual is cheap to evaluate, and we're getting back a lot of rows, I suspect even eliminating 5-10% of them could work out to a win. With a local table, 50% sounds like a reasonable number. I used 90% as a cutoff in the attached draft patch. Not sure if it's worth expending a lot of sweat on the point. Preliminary testing of this makes it look like a good idea. In particular, it will now extract restriction clauses from ORs even if those clauses have nothing to do with indexes, as exhibited in the second new regression test case. (I was unhappy to find that there were no regression tests covering this behavior at all; though I guess that's not so surprising given that orindxpath.c dates from before we could put EXPLAIN output into the regression tests.) A cosmetic issue that has to be resolved is where to put the new code. orindxpath.c is clearly now a misnomer; in fact, I don't think that the code in this form belongs in optimizer/path/ at all. There's some argument for putting it in initsplan.c, but that file is pretty big already. Maybe a new file in optimizer/util/? Any thoughts on that? Also, there's some janitorial cleanup work that remains to be done. I believe that make_restrictinfo_from_bitmapqual is now dead code, and generate_bitmap_or_paths could now be static-ified and probably simplified by dropping its restriction_only option. I've not investigated in detail what we could get rid of once create_or_index_quals is gone. regards, tom lane diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 96fe50f..88a566e 100644 *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *** set_plain_rel_size(PlannerInfo *root, Re *** 362,378 /* Mark rel with estimated output rows, width, etc */ set_baserel_size_estimates(root, rel); - - /* - * Check to see if we can extract any restriction conditions from join - * quals that are OR-of-AND structures. If so, add them to the rel's - * restriction list, and redo the above steps. - */ - if (create_or_index_quals(root, rel)) - { - check_partial_indexes(root, rel); - set_baserel_size_estimates(root, rel); - } } /* --- 362,367 diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 606734a..0e05107 100644 *** a/src/backend/optimizer/path/indxpath.c --- b/src/backend/optimizer/path/indxpath.c *** choose_bitmap_and(PlannerInfo *root, Rel *** 1354,1360 * we can remove this limitation. (But note that this also defends * against flat-out duplicate input paths, which can happen because * match_join_clauses_to_index will find the same OR join clauses that ! * create_or_index_quals has pulled OR restriction clauses out of.) * * For the same reason, we reject AND combinations in which an index * predicate clause duplicates another clause. Here we find it necessary --- 1354,1361 * we can remove this limitation. (But note that this also defends * against flat-out duplicate input paths, which can happen because * match_join_clauses_to_index will find the same OR join clauses that ! * extract_restriction_or_clauses has pulled OR restriction clauses out ! * of.) * * For the same reason, we reject AND combinations in which an index * predicate clause duplicates another clause. Here we find it necessary diff --git a/src/backend/optimizer/path/orindxpath.c b/src/backend/optimizer/path/orindxpath.c index 16f29d3..5a8d73b 100644 *** a/src/backend/optimizer/path/orindxpath.c --- b/src/backend/optimizer/path/orindxpath.c *** *** 15,187 #include postgres.h #include optimizer/cost.h #include optimizer/paths.h #include optimizer/restrictinfo.h ! /*-- ! * create_or_index_quals * Examine join OR-of-AND quals to see if any useful restriction OR * clauses can be extracted. If so, add them to the query. * ! * Although a join clause must reference other relations overall, ! * an OR of ANDs
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
Robert Haas robertmh...@gmail.com writes: On Mon, Dec 16, 2013 at 6:59 PM, Tom Lane t...@sss.pgh.pa.us wrote: The hard part is not extracting the partial qual. The hard part is trying to make sure that adding this entirely-redundant scan qual doesn't catastrophically degrade join size estimates. OK, I had a feeling that's where the problem was likely to be. Do you have any thoughts about a more principled way of solving this problem? I mean, off-hand, it's not clear to me that the comments about this being a MAJOR HACK aren't overstated. Well, the business about injecting the correction by adjusting a cached selectivity is certainly a hack, but it's not one that I think is urgent to get rid of; I don't foresee anything that's likely to break it soon. I might be missing something, but I suspect it works fine if every path for the relation is generating the same rows. I had been thinking it would fall down if there are several OR conditions affecting different collections of rels, but after going through the math again, I'm now thinking I was wrong and it does in fact work out. As you say, we do depend on all paths generating the same rows, but since the extracted single-rel quals are inserted as plain baserestrictinfo quals, that'll be true. A bigger potential objection is that we're opening ourselves to larger problems with estimation failures due to correlated qual conditions, but again I'm finding that the math doesn't bear that out. It's reasonable to assume that our estimate for the extracted qual will be better than our estimate for the OR as a whole, so our adjusted size estimates for the filtered base relations are probably OK. And the adjustment to the OR clause selectivity means that the size estimate for the join comes out exactly the same. We'll actually be better off than with what is likely to happen now, which is that people manually extract the simplified condition and insert it into the query explicitly. PG doesn't realize that that's redundant and so will underestimate the join size. So at this point I'm pretty much talked into it. We could eliminate the dependence on indexes entirely, and replace this code with a step that simply tries to pull single-base-relation quals out of ORs wherever it can find one. You could argue that the produced quals would sometimes not be worth testing for, but we could apply a heuristic that says to forget it unless the estimated selectivity of the extracted qual is less than, I dunno, 0.5 maybe. (I wonder if it'd be worth inserting a check that there's not already a manually-generated equivalent clause, too ...) A very nice thing about this is we could do this step ahead of relation size estimate setting and thus remove the redundant work that currently happens in set_plain_rel_size when the optimization fires. Which is another aspect of the current code that's a hack, so getting rid of it would be a net reduction in hackiness. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
On Tue, Dec 17, 2013 at 12:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: I had been thinking it would fall down if there are several OR conditions affecting different collections of rels, but after going through the math again, I'm now thinking I was wrong and it does in fact work out. As you say, we do depend on all paths generating the same rows, but since the extracted single-rel quals are inserted as plain baserestrictinfo quals, that'll be true. OK. A bigger potential objection is that we're opening ourselves to larger problems with estimation failures due to correlated qual conditions, but again I'm finding that the math doesn't bear that out. It's reasonable to assume that our estimate for the extracted qual will be better than our estimate for the OR as a whole, so our adjusted size estimates for the filtered base relations are probably OK. And the adjustment to the OR clause selectivity means that the size estimate for the join comes out exactly the same. We'll actually be better off than with what is likely to happen now, which is that people manually extract the simplified condition and insert it into the query explicitly. PG doesn't realize that that's redundant and so will underestimate the join size. I had not thought of that, but it seems like a valid point. So at this point I'm pretty much talked into it. We could eliminate the dependence on indexes entirely, and replace this code with a step that simply tries to pull single-base-relation quals out of ORs wherever it can find one. You could argue that the produced quals would sometimes not be worth testing for, but we could apply a heuristic that says to forget it unless the estimated selectivity of the extracted qual is less than, I dunno, 0.5 maybe. This is an area where I think things are very different from local and remote tables. For a local table, the cost of transmitting a row from one plan node to another is basically zero. For a remote table, it's potentially far higher, although unfortunately it's hard to know exactly. But if the qual is cheap to evaluate, and we're getting back a lot of rows, I suspect even eliminating 5-10% of them could work out to a win. With a local table, 50% sounds like a reasonable number. Another point to ponder is that there could be places where this actually changes the plan significantly for the better. Pre-filtering one side of a join might, for example, reduce the amount of data on one side to the point where a hash join is chosen over some other strategy. I don't know that this will actually help all that many people but the best case is pretty dramatic for those it does help: the partial qual might be almost as selective as the join condition itself. (I wonder if it'd be worth inserting a check that there's not already a manually-generated equivalent clause, too ...) Sounds a little too clever IMHO. A very nice thing about this is we could do this step ahead of relation size estimate setting and thus remove the redundant work that currently happens in set_plain_rel_size when the optimization fires. Which is another aspect of the current code that's a hack, so getting rid of it would be a net reduction in hackiness. I'm not sure that would save anything measurable performance-wise, but the hackiness reduction would be nice to have. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
On 17 December 2013 17:28, Tom Lane t...@sss.pgh.pa.us wrote: So at this point I'm pretty much talked into it. We could eliminate the dependence on indexes entirely, and replace this code with a step that simply tries to pull single-base-relation quals out of ORs wherever it can find one. You could argue that the produced quals would sometimes not be worth testing for, but we could apply a heuristic that says to forget it unless the estimated selectivity of the extracted qual is less than, I dunno, 0.5 maybe. (I wonder if it'd be worth inserting a check that there's not already a manually-generated equivalent clause, too ...) Sounds sensible. What surprises me is we don't have an API that allows an FDW to decide what it can accept or not. It seems strange to have a unilateral decision by our planner about what another planner is capable of. Should we extend the API to allow the question to be asked? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
Robert Haas robertmh...@gmail.com writes: On Tue, Dec 17, 2013 at 12:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: (I wonder if it'd be worth inserting a check that there's not already a manually-generated equivalent clause, too ...) Sounds a little too clever IMHO. The argument for doing it is that we might otherwise find ourselves degrading the plans for previously-manually-optimized queries. On the other hand, the existing index-driven code has probably forestalled the need for many people to do that; at least, I don't recall seeing much discussion of doing that sort of thing by hand. I'm happy to leave the issue out of the first version of the patch, anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
Simon Riggs si...@2ndquadrant.com writes: What surprises me is we don't have an API that allows an FDW to decide what it can accept or not. It seems strange to have a unilateral decision by our planner about what another planner is capable of. Uh, what? There's certainly missing features in our FDW APIs --- no ability to push over joins or aggregates for instance --- but none of that has anything to do with assumptions about what the other end is capable of. We're just not done inventing those APIs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] planner missing a trick for foreign tables w/OR conditions
Consider a query such as: SELECT * FROM a, b WHERE (a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45); If a and/or b are regular tables, the query planner will cleverly consider the possibility of using an index on a to filter for rows with a.x = 42 OR a.x = 44, or of using an index on b to filter for rows where b.y = 43 OR b.z = 45. But if they are foreign tables, this optimization isn't considered, because we don't intrinsically know anything about what indexes are present on the foreign side. However, this optimization could potentially be quite valuable. In fact, it's arguably more useful here for regular tables, because even if no index is present on the foreign side, applying the condition on the remote side might eliminate enough data transfer overhead to win. The only situation in which I can really see it losing is if the simplified qual ends up eliminating too few rows to cover the remote side's processing costs; I'm not sure how possible that is, or how to know whether it might be the case. To see how this can torpedo performance, run the attached SQL file on an empty database, and then run these quereis: explain analyze SELECT other.id, other.title, local.id, local.title FROM other INNER JOIN local ON other.id = local.id WHERE local.title = md5(1::text) OR (local.title = md5(3::text) AND other.id = 3); explain analyze SELECT other.id, other.title, frgn.id, frgn.title FROM other INNER JOIN frgn ON other.id = frgn.id WHERE frgn.title = md5(1::text) OR (frgn.title = md5(3::text) AND other.id = 3); Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company rm32176.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
On Mon, Dec 16, 2013 at 12:41:43PM -0500, Robert Haas wrote: Consider a query such as: SELECT * FROM a, b WHERE (a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45); If a and/or b are regular tables, the query planner will cleverly consider the possibility of using an index on a to filter for rows with a.x = 42 OR a.x = 44, or of using an index on b to filter for rows where b.y = 43 OR b.z = 45. But if they are foreign tables, this optimization isn't considered, because we don't intrinsically know anything about what indexes are present on the foreign side. However, this optimization could potentially be quite valuable. In fact, it's arguably more useful here for regular tables, because even if no index is present on the foreign side, applying the condition on the remote side might eliminate enough data transfer overhead to win. The only situation in which I can really see it losing is if the simplified qual ends up eliminating too few rows to cover the remote side's processing costs; I'm not sure how possible that is, or how to know whether it might be the case. To see how this can torpedo performance, run the attached SQL file on an empty database, and then run these quereis: +1 for fixing this bug :) Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
Robert Haas robertmh...@gmail.com writes: Consider a query such as: SELECT * FROM a, b WHERE (a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45); If a and/or b are regular tables, the query planner will cleverly consider the possibility of using an index on a to filter for rows with a.x = 42 OR a.x = 44, or of using an index on b to filter for rows where b.y = 43 OR b.z = 45. But if they are foreign tables, this optimization isn't considered, because we don't intrinsically know anything about what indexes are present on the foreign side. However, this optimization could potentially be quite valuable. In fact, it's arguably more useful here for regular tables, because even if no index is present on the foreign side, applying the condition on the remote side might eliminate enough data transfer overhead to win. The only situation in which I can really see it losing is if the simplified qual ends up eliminating too few rows to cover the remote side's processing costs; I'm not sure how possible that is, or how to know whether it might be the case. Thoughts? The problem is that that optimization is a crock; see the comments for create_or_index_quals(). We can't just turn it loose to CNF-ify every OR it might find. The case that we support at the moment is to CNF-ify whichever single OR condition looks like the best win, and it's hard to see how to do that without any index knowledge. In principle, when we're using remote estimates, we could probably ask the remote server about each possibility ... but that could be expensive. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
On Mon, Dec 16, 2013 at 2:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: The problem is that that optimization is a crock; see the comments for create_or_index_quals(). We can't just turn it loose to CNF-ify every OR it might find. The case that we support at the moment is to CNF-ify whichever single OR condition looks like the best win, and it's hard to see how to do that without any index knowledge. In principle, when we're using remote estimates, we could probably ask the remote server about each possibility ... but that could be expensive. Could we get by without actually converting to CNF? Suppose we define JOINQUAL-STRIP(P, B), where P is a join clause and B is a truth value. Something like this: JOINQUAL-STRIP(NOT Q, B) = NOT (JOINQUAL-STRIP(Q, NOT B)) JOINQUAL-STRIP(Q1 AND Q2 AND ... AND Qn, B) = the conjunction of JOINQUAL-STRIP(Qi, B) for all i JOINQUAL-STRIP(Q1 OR Q2 OR ... OR Qn, B) = the disjunction of JOINQUAL-STRIP(Qi, B) for all i For any join clause not of one of the above forms, JOINQUAL-STRIP(P, B) is equal to P if there are no Vars in P referring to any table other than the one for which we're constructing baserestrictinfo, and to B otherwise. Given this definition, we can take each join clause P and apply JOINQUAL-STRIP(P, true) to it. If the result is true, forget it. If the result is anything else, it's a simplified version of the original AND/OR/NOT tree that we can apply on the remote side (if it's pushdown-safe) to pre-filter the results. In plain English, we walk down through AND, OR, and NOT nodes and inspect what's underneath. Whenever we find something that references only the remote table under consideration, we keep it as is. If we find something that touches any other table, we assume it's true unless we're beneath an odd number of negations, in which case we assume it's false. e.g. if we start with (L1 AND R1) OR NOT(L2 OR R2), where the Li reference local vars and the Ri only remote vars, we get (true AND R1) OR NOT(false OR R2); after some trivial simplification, this reduces to R1 OR NOT R2, which is indeed a suitable pre-filter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
Robert Haas robertmh...@gmail.com writes: On Mon, Dec 16, 2013 at 2:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: The problem is that that optimization is a crock; see the comments for create_or_index_quals(). We can't just turn it loose to CNF-ify every OR it might find. The case that we support at the moment is to CNF-ify whichever single OR condition looks like the best win, and it's hard to see how to do that without any index knowledge. Could we get by without actually converting to CNF? The hard part is not extracting the partial qual. The hard part is trying to make sure that adding this entirely-redundant scan qual doesn't catastrophically degrade join size estimates. The hack of making an inverse adjustment to the original OR clause's selectivity works, more or less, for a single join OR condition. I don't think it works if there's several modified OR conditions (possibly covering different sets of relations). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner missing a trick for foreign tables w/OR conditions
On Mon, Dec 16, 2013 at 6:59 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Dec 16, 2013 at 2:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: The problem is that that optimization is a crock; see the comments for create_or_index_quals(). We can't just turn it loose to CNF-ify every OR it might find. The case that we support at the moment is to CNF-ify whichever single OR condition looks like the best win, and it's hard to see how to do that without any index knowledge. Could we get by without actually converting to CNF? The hard part is not extracting the partial qual. The hard part is trying to make sure that adding this entirely-redundant scan qual doesn't catastrophically degrade join size estimates. OK, I had a feeling that's where the problem was likely to be. Do you have any thoughts about a more principled way of solving this problem? I mean, off-hand, it's not clear to me that the comments about this being a MAJOR HACK aren't overstated. I mean, if we expect a join qual for {X Y} to have a given selectivity, but then we pre-filter X with a clause that is deliberately redundant with that join qual, then the join qual does indeed become less selective when view as applying to the surviving rows. This does not strike me as very much different from the oft-encountered problem of estimating selectivity for a = 1 AND b = 1, where a and b are correlated. Whichever qual we apply first changes the data distribution such that the selectivity of the second qual is not the same as it would have been when applied to the entirety of the data. Estimating it that way would not be a hack; it would be reality. Now, it might be true that frobbing what is intended as a cache based on the knowledge that the cache will never be flushed is a hack. The hack of making an inverse adjustment to the original OR clause's selectivity works, more or less, for a single join OR condition. I don't think it works if there's several modified OR conditions (possibly covering different sets of relations). I might be missing something, but I suspect it works fine if every path for the relation is generating the same rows. The partial qual is definitely going to be applied before the join qual, so the join qual will surely be hitting only data that's been pre-filtered by the partial qual, and so its selectivity will be correspondingly more. Where it seems to me that you'd run in to trouble is if we created one path that doesn't bother enforcing the partial qual (relying on the fact that it will be applied post-join) and another path that does. Now you've got the same selectivity estimate for the join in two cases where the incoming data distribution is significantly different, which is bad. Do you see another danger? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers