Re: [HACKERS] WIP patch for parameterized inner paths
To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. We're not really talking about are we going to accept or reject a specific feature. We're talking about whether we're going to decide that the last two years worth of planner development were headed in the wrong direction and we're now going to reject that and try to think of some entirely new concept. This isn't an isolated patch, it's the necessary next step in a multi-year development plan. The fact that it's a bit slower at the moment just means there's still work to do. Those planner improvements are very promising despite the current extra planning cost in some situations. And it looks like a good route to effective and interesting index-only scan usage. We've hit the planner's limits, and a lower level of the planner needs to be revisited. Please, Tom, keep on. And 9.2 might be a very good release to do that, because if there is a performance impact from this patch in the end, there are so many improvements in other places that it should be easily compensated for users. It might be harder to make the change in 9.3 if 9.3's performance turns out lower than 9.2's. -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: Attached is a WIP patch for parameterized paths, along the lines we have discussed before: ... I've made considerable progress on the TODO items I listed: indxpath.c has been ripped apart and restructured, and I have it considering parameterized paths for hash sub-joins. I made a deliberate policy decision not to work very hard on exploring parameterized mergejoin plans, because that seems to inflate the number of paths considered way more than the relative usefulness of merge over hash joins justifies. I don't fully understand this code, especially not on my first time through it, but it seems to me that the key to getting the performance of this code up to where we'd like it to be is to control the number of useless paths that get generated. Is there a guard in here against joining a parameterized path to an intermediate relation when no SJ is involved? In other words, if we're joining a parameterized path on A to a path on B, then either the join to B should satisfy at least part of the parameterization needed by A, or there should be a special join with A and B on one side and a relation that satisfies at least part of the parameterization of A on the other. In the probably not uncommon case where there are no SJs at all or all such SJs have only a single rel on the nullable side, we ought to be able to avoid creating any more paths than we do right now. Even if there are SJs with multiple rels on the outside, we could try to implement some fast check for whether intermediate paths make any sense: e.g. union the set of rels on the nullable sides of SJs. Then, if the joinrel whose paths we're trying to build isn't a subset of that set, the only thing worth considering at this level is satisfying a parameterization by building a parameter-driven nestloop: a bigger parameterized path will do us no good. 
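The fast check Robert proposes above can be modeled with relid sets (a toy sketch, not planner code; the function name `worth_parameterizing` and the set encoding are invented for illustration):

```python
def worth_parameterizing(joinrel, special_joins):
    """Toy model of the fast check sketched above (all names invented).

    joinrel: frozenset of base-relation ids making up the join relation
    special_joins: list of (outer_rels, nullable_rels) pairs, one per SJ

    Union the rels on the nullable sides of all special joins; if the
    joinrel being built is not confined to that union, a bigger
    parameterized path over it can do us no good -- only a
    parameter-driven nestloop on top is worth considering.
    """
    nullable_union = frozenset()
    for _outer, nullable in special_joins:
        nullable_union |= nullable
    return joinrel <= nullable_union

# a LEFT JOIN (b INNER JOIN c): {b, c} sits on the nullable side, so a
# parameterized path over {b, c} may pay off ...
sjs = [(frozenset({"a"}), frozenset({"b", "c"}))]
print(worth_parameterizing(frozenset({"b", "c"}), sjs))  # True
# ... but with no special joins at all, no intermediate joinrel qualifies.
print(worth_parameterizing(frozenset({"b", "c"}), []))   # False
```

This is only the coarse filter Robert describes: the common no-SJ case is rejected immediately, while queries with multi-rel nullable sides still get the extra paths.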
Maybe there's a heuristic like this already in there; I just didn't see it on first glance. I guess I'm surprised by the amount of increase you're seeing in paths considered, though admittedly I haven't seen your test cases. If we're properly filtering out the paths that don't matter, I wouldn't have expected it to have as much impact as you're describing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
Robert Haas robertmh...@gmail.com writes: Is there a guard in here against joining a parameterized path to an intermediate relation when no SJ is involved? In other words, if we're joining a parameterized path on A to a path on B, then either the join to B should satisfy at least part of the parameterization needed by A, or there should be a special join with A and B on one side and a relation that satisfies at least part of the parameterization of A on the other. There is no such guard. We could probably add one with not an outrageous amount of expense, but I'm not clear on why you think it's appropriate to limit the join ordering that way? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Thu, Jan 26, 2012 at 11:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Is there a guard in here against joining a parameterized path to an intermediate relation when no SJ is involved? In other words, if we're joining a parameterized path on A to a path on B, then either the join to B should satisfy at least part of the parameterization needed by A, or there should be a special join with A and B on one side and a relation that satisfies at least part of the parameterization of A on the other. There is no such guard. We could probably add one with not an outrageous amount of expense, but I'm not clear on why you think it's appropriate to limit the join ordering that way? Because I think that adding parameterized paths that fail that test is bound to be worse - or at least no better - than just rearranging the join order. In other words, suppose I have this:

a some-join (b some-other-join c on b.x = c.x) on a.y = b.y

If some-join is a LEFT join and some-other-join is an INNER join, then it makes sense to think about implementing this as a parameterized nest loop with a on the outside and (b ij c) on the inside. But if the joins commute, then AFAICT it doesn't. The trivial case is when they are both inner joins; I could do:

Nest Loop
  -> Seq Scan A
  -> Nest Loop
       -> Index Scan B
       -> Index Scan C

But that's no better than:

Nest Loop
  -> Nest Loop
       -> Seq Scan A
       -> Index Scan B
  -> Index Scan C

...because those are alternative formulations of the same concept: scan A, use that to drive index probes against B, and then use those results to drive index probes against C. But the same applies if both joins are left joins, or more generally: if the joins commute, then the plans we're already generating are sufficient. When they *don't* commute, then we can potentially benefit from joining a parameterized path against an intermediate table.
I would expect a lot of people might have things like this:

a INNER JOIN b ON a.bid = b.bid
  INNER JOIN c ON a.cid = c.cid
  INNER JOIN d ON a.did = d.did
  INNER JOIN e ON d.eid = e.eid
  INNER JOIN f ON d.fid = f.fid
  LEFT JOIN g ON a.gid = g.gid
  LEFT JOIN h ON a.hid = h.hid
  LEFT JOIN i ON a.iid = i.iid

AFAICS, such queries aren't going to benefit from this optimization, so ideally we'd like them to not have to pay little or nothing for it. Now, if h happens to be a view defined as h1 INNER JOIN h2 ON h1.x = h2.x, then it makes sense to try to join a parameterized path on either h1 or h2 against the other relation before implementing it as a nest loop against some set of relations that includes a, but we still can't get any benefit from parameterization anywhere else - joining h1 or h2 against anything else in there is not legal, and join of a parameterized path over d to, say, f is still no better than what we can get by rearranging the join order: if the path over d is parameterized, then whatever join (d join f) will be no better than (whatever join d) join f. So it seems like we ought to just not build a parameterized d join f path in the first place, unless, of course, I am missing something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
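The claim that commuting inner joins make the parameterized formulation redundant can be illustrated with a toy nested-loop evaluator (purely illustrative; the `nestloop` helper, tables, and keys are invented, not planner code):

```python
# Toy nested-loop "executor": for each outer row, probe the inner side
# for rows matching on a key, merging the matched rows together.
def nestloop(outer, inner, outer_key, inner_key):
    return [{**o, **i} for o in outer for i in inner
            if o[outer_key] == i[inner_key]]

a = [{"y": 1}, {"y": 2}]
b = [{"y": 1, "x": 10}, {"y": 2, "x": 20}]
c = [{"x": 10, "v": "p"}, {"x": 20, "v": "q"}]

# Formulation 1: A drives a parameterized (B join C) on the inside.
plan1 = nestloop(a, nestloop(b, c, "x", "x"), "y", "y")
# Formulation 2: plain left-deep order, (A join B) and then probe C.
plan2 = nestloop(nestloop(a, b, "y", "y"), c, "x", "x")

# With inner joins the orders commute, so both formulations yield the
# same rows -- the parameterized intermediate path buys nothing that a
# join-order change does not already give us.
assert {frozenset(r.items()) for r in plan1} == \
       {frozenset(r.items()) for r in plan2}
```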
Re: [HACKERS] WIP patch for parameterized inner paths
On Thu, Jan 26, 2012 at 11:54 AM, Robert Haas robertmh...@gmail.com wrote: AFAICS, such queries aren't going to benefit from this optimization, so ideally we'd like them to not have to pay little or nothing for it. There are a few too many negations in that sentence. Let me try again: AFAICS, such queries aren't going to benefit from this optimization, so ideally we'd like them to pay little or nothing for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
Robert Haas robertmh...@gmail.com writes: Is there a guard in here against joining a parameterized path to an intermediate relation when no SJ is involved? In other words, if we're joining a parameterized path on A to a path on B, then either the join to B should satisfy at least part of the parameterization needed by A, or there should be a special join with A and B on one side and a relation that satisfies at least part of the parameterization of A on the other. I've implemented this idea, recast a bit to prevent generating a parameterized join path in the first place unless it depends on a parameter from a relation for which there's a join ordering constraint still outstanding. It seems to get us to where the planning time penalty is only about 10%, which frankly is probably less than sampling error considering the small set of test cases I'm looking at. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
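The recast guard Tom describes, generating a parameterized join path only when it depends on a parameter from a relation with a join ordering constraint still outstanding, might be modeled like this (a sketch; all names are invented):

```python
def allow_parameterized_join(required_outer, outstanding_sj_rels):
    """Toy version of the guard: build a parameterized join path only
    if it depends on a parameter from some relation that still has a
    join ordering constraint (an unapplied special join) outstanding.

    required_outer: relids the proposed path needs parameter values from
    outstanding_sj_rels: relids tied up in not-yet-performed special joins
    """
    return bool(required_outer & outstanding_sj_rels)

# A d-join-f path parameterized by a, with no special join left
# involving a: rearranging the join order covers this case already.
print(allow_parameterized_join({"a"}, set()))        # False -> skip it
# The same path while an outer join involving a is still pending:
print(allow_parameterized_join({"a"}, {"a", "g"}))   # True  -> keep it
```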
Re: [HACKERS] WIP patch for parameterized inner paths
On Thu, Jan 26, 2012 at 2:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Is there a guard in here against joining a parameterized path to an intermediate relation when no SJ is involved? In other words, if we're joining a parameterized path on A to a path on B, then either the join to B should satisfy at least part of the parameterization needed by A, or there should be a special join with A and B on one side and a relation that satisfies at least part of the parameterization of A on the other. I've implemented this idea, recast a bit to prevent generating a parameterized join path in the first place unless it depends on a parameter from a relation for which there's a join ordering constraint still outstanding. It seems to get us to where the planning time penalty is only about 10%, which frankly is probably less than sampling error considering the small set of test cases I'm looking at. Awesome. If you can post the updated patch, I'll poke at it a little more and see if anything jumps out at me, but that sounds promising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
Robert Haas robertmh...@gmail.com writes: ... In an ideal world, I'd like the amount of effort we spend planning to be somehow tied to the savings we can expect to get, and deploy optimizations like this only in cases where we have a reasonable expectation of that effort being repaid. BTW, so far as that goes: the only way I know to significantly cut the planner's join-planning resource consumption in any run-time-tunable fashion is to restrict it to planning subproblems separately, as for instance via from_collapse_limit/join_collapse_limit. Now, whatever the merits of those specific heuristics, the killer issue is that maybe you really needed a join order different from what the subproblem division entails. I believe that this patch can help with that. If the issue is that you really don't want to form the entire result of a specific subproblem subquery, that can be dealt with by treating the subquery as a parameterizable path, such that it can be on the inside of a nestloop with whatever other relation is supplying the parameter. Or in other words, the reason from_collapse_limit/join_collapse_limit sometimes lead to bad plans is really that they represent artificial join order restrictions. So I think this patch ought to be able to alleviate the worst cases of that. Or at least it did a few hours ago. What we'll probably want to do is tweak the path-formation heuristic you suggested so that joins to relations outside the current subproblem are treated like outer joins and allowed to form parameterized paths. Anyway this is the sort of thing that I hope to investigate after the basic patch is in. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
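The "artificial join order restriction" Tom refers to can be caricatured with a toy subproblem divider (an illustration only; the real from_collapse_limit/join_collapse_limit control the flattening of sub-join lists, not simple chunking):

```python
def divide_into_subproblems(rels, limit):
    # Toy caricature: if the join list fits under the limit, all join
    # orders are considered together; otherwise it is planned as fixed
    # chunks, and the join order *between* chunks is no longer explored.
    if len(rels) <= limit:
        return [rels]
    return [rels[i:i + limit] for i in range(0, len(rels), limit)]

rels = ["a", "b", "c", "d", "e", "f"]
print(divide_into_subproblems(rels, 8))  # one subproblem: all orders
print(divide_into_subproblems(rels, 2))  # three: cross-chunk order fixed
```

The point of the message above is that a parameterized path over one such subproblem lets it sit on the inside of a nestloop driven by a rel from another subproblem, softening the otherwise fixed inter-chunk order.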
Re: [HACKERS] WIP patch for parameterized inner paths
On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: After that I started doing some performance testing, and the initial news was bad: the planner was about 3x slower than 9.1 on a moderately large join problem. I've spent the past several days hacking away at that, and have got it down to about 35% slower by dint of the following changes:

* micro-optimization of add_path(), in particular avoiding duplicate calculations during cost comparisons.

* introducing a two-step mechanism for testing whether proposed join paths are interesting. The patch first calculates a quick and dirty lower bound on the cost of the join path (mostly, from the costs of its input paths) and looks through the joinrel's pathlist to see if there is already a path that is clearly better. If not, it proceeds with the full cost calculation and add_path insertion. In my testing the precheck is able to eliminate 80% or so of the proposed paths, more than repaying the extra pathlist search.

Now this is of course cheating with both hands, in that either of these optimization techniques could have been applied before ... but they weren't. I think that 35% slower on large join problems is probably acceptable, given that this is investigating a larger number of possible solutions and hopefully often finding a better plan. I think I can knock some more off of that by refactoring the costsize.c APIs, anyway. Right now the first-pass and second-pass cost calculations are independent and so there's some repeated work, which I'd like to eliminate both for speed reasons and to get rid of duplicate code that'd have to be kept in sync if it's left as-is. If there are not objections, I'd like to go ahead and commit this after one more round of polishing. There's still a lot left to do, but what it mainly needs now is some testing on real-world planning problems, and I don't think it's going to get that until it's been merged in. I have to admit that I have a bad feeling about this.
It strikes me that there is no way we're not going to get complaints about a 35% slowdown on planning a large join problem. It is true that some people will have the benefit of finding a faster plan - but I think many people won't. We're really facing the same trade-off here that we do with many other things people have asked for the query planner to do over the years: is it worth slowing down everyone, on every query, for an optimization that will apply rarely but provide huge benefits when it does? Of course, that's fundamentally a judgement call. But I can't help observing that the number of requests we've had for the planner to deduce implied inequalities is far larger than the number of people who have complained about the problem that this fixes. Now, maybe the speed penalty for deducing implied inequalities would be even larger than 35%: I don't know. But we've sweat blood to avoid much smaller regressions on much less important cases. To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
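The two-step precheck quoted at the top of this message can be sketched as follows (the dictionary-based paths and cost numbers are invented, and the real add_path also weighs sort order and parameterization, not just cost):

```python
def precheck_beats_existing(pathlist, cost_lower_bound):
    # Quick-and-dirty test: if some already-kept path is at least as
    # cheap as the best this proposal could possibly be, prune it now.
    return all(p["cost"] > cost_lower_bound for p in pathlist)

def propose_join_path(pathlist, outer, inner, full_cost):
    lower_bound = outer["cost"] + inner["cost"]  # inputs alone, no join cost
    if not precheck_beats_existing(pathlist, lower_bound):
        return None                              # pruned before full costing
    cost = full_cost(outer, inner)               # the expensive part
    pathlist.append({"cost": cost})
    return cost

paths = [{"cost": 50.0}]
# The lower bound 30 + 30 = 60 is already beaten by the existing
# 50-cost path, so the full cost calculation is skipped entirely.
result = propose_join_path(paths, {"cost": 30.0}, {"cost": 30.0},
                           lambda o, i: o["cost"] + i["cost"] + 5.0)
print(result, len(paths))  # None 1
```

The ~80% of proposals pruned this way never pay for the full cost model, which is where the 3x slowdown was whittled down to 35%.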
Re: [HACKERS] WIP patch for parameterized inner paths
Robert Haas robertmh...@gmail.com writes: I have to admit that I have a bad feeling about this. It strikes me that there is no way we're not going to get complaints about a 35% slowdown on planning a large join problem. I have to admit that I'm worried about that too. However, I hope to continue whittling away at that number. It's also worth pointing out that the number is based on just a couple of test cases that are not very real-world, in that I'm testing against empty tables with no statistics. I think that this is a worst case, because it results in a lot of paths with basically the same costs, making them hard to prune; but I can't do much better, because the examples I've got were supplied without test data. Also, you're assuming that the changes have no upside whatsoever, which I fondly hope is not the case. Large join problems tend not to execute instantaneously --- so nobody is going to complain if the planner takes awhile longer but the resulting plan is enough better to buy that back. In my test cases, the planner *is* finding better plans, or at least ones with noticeably lower estimated costs. It's hard to gauge how much that translates to in real-world savings, since I don't have real data loaded up. I also think, though I've not tried to measure, that I've made planning cheaper for very simple queries by eliminating some overhead in those cases. Anyway, I'd be willing to hold off committing if someone were to volunteer to test an unintegrated copy of the patch against some moderately complicated application. But it's a sufficiently large patch that I don't really care to sit on it and try to maintain it outside the tree for a long time. To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. 
We're not really talking about are we going to accept or reject a specific feature. We're talking about whether we're going to decide that the last two years worth of planner development were headed in the wrong direction and we're now going to reject that and try to think of some entirely new concept. This isn't an isolated patch, it's the necessary next step in a multi-year development plan. The fact that it's a bit slower at the moment just means there's still work to do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Jan 25, 2012, at 10:24 AM, Tom Lane wrote: Anyway, I'd be willing to hold off committing if someone were to volunteer to test an unintegrated copy of the patch against some moderately complicated application. But it's a sufficiently large patch that I don't really care to sit on it and try to maintain it outside the tree for a long time. Why not create a branch? IIRC the build farm can be configured to run branches. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
David E. Wheeler da...@justatheory.com writes: On Jan 25, 2012, at 10:24 AM, Tom Lane wrote: Anyway, I'd be willing to hold off committing if someone were to volunteer to test an unintegrated copy of the patch against some moderately complicated application. But it's a sufficiently large patch that I don't really care to sit on it and try to maintain it outside the tree for a long time. Why not create a branch? IIRC the build farm can be configured to run branches. I already know what the patch does against the regression tests. Buildfarm testing is not of interest here. What would be of help is, say, Kevin volunteering to load up his Circuit Courts software and data into a git-head server and see how performance looks with and without the patch. Distribution of the code doesn't strike me as being the bottleneck. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Jan 25, 2012, at 12:19 PM, Tom Lane wrote: Why not create a branch? IIRC the build farm can be configured to run branches. I already know what the patch does against the regression tests. Buildfarm testing is not of interest here. What would be of help is, say, Kevin volunteering to load up his Circuit Courts software and data into a git-head server and see how performance looks with and without the patch. Distribution of the code doesn't strike me as being the bottleneck. Yeah, but it would be easier to keep a branch up-to-date via `git rebase` than to maintain the patch, I would think. And if it’s a remote branch, then simpler distribution is a bonus. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Wed, Jan 25, 2012 at 1:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Also, you're assuming that the changes have no upside whatsoever, which I fondly hope is not the case. Large join problems tend not to execute instantaneously --- so nobody is going to complain if the planner takes awhile longer but the resulting plan is enough better to buy that back. In my test cases, the planner *is* finding better plans, or at least ones with noticeably lower estimated costs. It's hard to gauge how much that translates to in real-world savings, since I don't have real data loaded up. I also think, though I've not tried to measure, that I've made planning cheaper for very simple queries by eliminating some overhead in those cases. I had a 34-table join on one of the last applications I maintained that planned and executed in less than 2 seconds. That was pushing it, but I had many joins in the 10-20 table range that planned and executed in 100-200 ms. I agree that if you are dealing with a terabyte table - or even a gigabyte table - then the growth of planning time will probably not bother anyone even if you fail to find a better plan, and will certainly make the user very happy if you do. But on tables with only a megabyte of data, it's not nearly so clear-cut. In an ideal world, I'd like the amount of effort we spend planning to be somehow tied to the savings we can expect to get, and deploy optimizations like this only in cases where we have a reasonable expectation of that effort being repaid. AIUI, this is mostly going to benefit cases like small LJ (big1 IJ big2) and, of course, those cases aren't going to arise if your query only involves small tables, or even if you have something like big IJ small1 IJ small2 IJ small3 IJ small4 LJ small5 LJ small6 IJ small7, which is a reasonably common pattern for me. 
Now, if you come back and say, ah, well, those cases aren't the ones that are going to be harmed by this, then maybe we should have a more detailed conversation about where the mines are. Or maybe it is helping in more cases than I'm thinking about at the moment. To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. We're not really talking about are we going to accept or reject a specific feature. We're talking about whether we're going to decide that the last two years worth of planner development were headed in the wrong direction and we're now going to reject that and try to think of some entirely new concept. This isn't an isolated patch, it's the necessary next step in a multi-year development plan. The fact that it's a bit slower at the moment just means there's still work to do. I'm not proposing that you should never commit this. I'm proposing that any commit by anyone that introduces a 35% performance regression is unwise, and doubly so at the end of the release cycle. I have every confidence that you can improve the code further over time, but the middle of the last CommitFest is not a great time to commit code that, by your own admission, needs a considerable amount of additional work. Sure, there are some things that we're not going to find out until the code goes into production, but it seems to me that you've already uncovered a fairly major performance problem that is only partially fixed. Once this is committed, it's not coming back out, so we're either going to have to figure out how to fix it before we release, or release with a regression in certain cases. If you got it down to 10% I don't think I'd be worried, but a 35% regression that we don't know how to fix seems like a lot. 
On another note, nobody besides you has looked at the code yet, AFAIK... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On 17/01/12 18:06, Tom Lane wrote: Anyway, I'm hoping to keep hacking at this for a couple more days before the CF gets into full swing, and hopefully arrive at something committable for 9.2. I'd really like to get this fixed in this cycle, since it's been a problem for several releases now. Comments? Very nice - keep hacking! Best wishes Mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
Tom Lane t...@sss.pgh.pa.us writes: Anyway, I'm hoping to keep hacking at this for a couple more days before the CF gets into full swing, and hopefully arrive at something committable for 9.2. I'd really like to get this fixed in this cycle, since it's been a problem for several releases now. Comments? Go Tom go ! :) Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On 01/17/2012 12:06 AM, Tom Lane wrote: Well, since I see other committers sending in patches the day after the nominal commitfest deadline, I don't feel too bad about being a bit late as well. To clarify the fairness standard here: submitting a patch before the CommitFest deadline, then adding it to the app, means that we will try very hard to find a reviewer for the submission during that CF. It's setting a worst-case bound on how long someone who contributes will have to wait for feedback. That delay, how long it would take before someone saw community feedback after they sent in a patch, used to be far less predictable. Something like this, sent just after the deadline, won't be assigned a reviewer by the CommitFest manager until the next CF. That doesn't mean it won't be reviewed anyway. Also, submissions that fix a regression, like this one, can easily end up on a fast track unrelated to the normal schedule. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers