Re: disfavoring unparameterized nested loops

2023-09-20 Thread Lepikhov Andrei



On Wed, Sep 20, 2023, at 4:49 PM, David Rowley wrote:
> On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
>  wrote:
>> What would you say about a different approach: a hybrid join? In MS SQL Server,
>> they have such a feature [1], and, according to their description, it
>> requires low overhead. They start with a HashJoin and switch to a NestLoop
>> if the inner input turns out to contain too few tuples. That solves the issue, doesn't it?
>
> A complexity which you may not be considering here is that Nested Loop
> joins always preserve the tuple order from the outer side of the join,
> whereas hash joins will not do this when multi-batching.

My idea here is the same as what the MS SQL folks did: prefetch some predefined 
number of tuples from the HashJoin's inner side and, if the planner has made a 
mistake and overestimated that side's cardinality, use the already-fetched inner 
rows as the outer side of a NestLoop instead.
The opposite strategy, handling underestimation by starting with a NestLoop and 
switching to a HashJoin, looks more difficult, but the main question is: is it 
worthwhile to research?
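
For illustration, a minimal sketch of that buffer-then-switch logic in plain 
Python (the names are made up: index_lookup stands in for a parameterized index 
probe into the other relation, and the threshold is an arbitrary knob -- this is 
not how SQL Server or any PostgreSQL patch actually implements it):

def adaptive_join(inner_scan, outer_scan, index_lookup, join_key, threshold=64):
    inner_iter = iter(inner_scan)
    buffered = []
    for row in inner_iter:
        buffered.append(row)
        if len(buffered) > threshold:
            break
    if len(buffered) <= threshold:
        # Planner overestimated the inner side: the whole input fits in the
        # buffer, so use it as the outer side of an index nested loop and
        # probe the other relation directly.
        for inner_row in buffered:
            for outer_row in index_lookup(inner_row[join_key]):
                yield inner_row, outer_row
    else:
        # The inner side really is large: finish reading it into a hash
        # table and run an ordinary hash join against the outer input.
        table = {}
        for row in buffered:
            table.setdefault(row[join_key], []).append(row)
        for row in inner_iter:
            table.setdefault(row[join_key], []).append(row)
        for outer_row in outer_scan:
            for inner_row in table.get(outer_row[join_key], []):
                yield inner_row, outer_row

Either way the buffered rows are reused, so the price of guessing wrong is 
bounded by the threshold.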

> I've no idea how the SQL Server engineers solved that.

>> [1]
>> https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

-- 
Regards,
Andrei Lepikhov




Re: disfavoring unparameterized nested loops

2023-09-20 Thread David Rowley
On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
 wrote:
> What would you say about a different approach: a hybrid join? In MS SQL Server,
> they have such a feature [1], and, according to their description, it
> requires low overhead. They start with a HashJoin and switch to a NestLoop
> if the inner input turns out to contain too few tuples. That solves the issue, doesn't it?

A complexity which you may not be considering here is that Nested Loop
joins always preserve the tuple order from the outer side of the join,
whereas hash joins will not do this when multi-batching.

I've no idea how the SQL Server engineers solved that.

David

> [1]
> https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411




Re: disfavoring unparameterized nested loops

2023-09-20 Thread Andrey Lepikhov

On 29/9/2022 21:32, Benjamin Coutu wrote:

> I'd like to revamp this important discussion.
>
> [...]
>
> What are your thoughts on this simple idea - is it perhaps too simple?

In my practice, a parameterized nested loop (PNL) reduces execution time, 
sometimes drastically. If your query touches a lot of tables but extracts only 
a tiny part of the data, and you have good index coverage, PNL works great.
Moreover, I have been pondering extending parameterization through subqueries 
and groupings.


What would you say about a different approach: a hybrid join? In MS SQL Server, 
they have such a feature [1], and, according to their description, it 
requires low overhead. They start with a HashJoin and switch to a NestLoop 
if the inner input turns out to contain too few tuples. That solves the issue, 
doesn't it?


[1] 
https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411


--
regards,
Andrey Lepikhov
Postgres Professional





Re: disfavoring unparameterized nested loops

2022-10-02 Thread Peter Geoghegan
On Fri, Sep 30, 2022 at 3:19 PM Benjamin Coutu  wrote:
> > For all I know you might be onto something. But it really seems
> > independent to me.
>
> Yeah, I'm sorry if I hijacked this thread for something related but
> technically different.

I certainly wouldn't say that you hijacked the thread. I'm glad that
you revived the discussion, in fact.

The way that I've framed the problem is probably at least somewhat
controversial. In fact I'm almost certain that at least one or two
people will flat out disagree with me. But even if everybody else
already thought about unparameterized nested loop joins in the same
terms, it might still be useful to make the arguments that you've
made.

What I'm saying is that the probability of "getting it right" is
virtually irrelevant in the case of these unparameterized nested loop
join plans specifically. Any probability that's less than 1.0 is
already unacceptable, more or less. A probability of 1.0 is never
attainable in the real world, no matter what, so why should the true
probability (whatever that means) matter at all? The appropriate
course of action will still be "just don't do that, ever".

To me this dynamic seems qualitatively different to other cases, where
we might want to give some weight to uncertainty. Understanding where
the boundaries lie between those trickier cases and this simpler case
seems important and relevant to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-10-02 Thread Peter Geoghegan
On Sun, Oct 2, 2022 at 3:43 AM Robert Haas  wrote:
> On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan  wrote:
> > It's not just that the risks are ludicrously high, of course. The
> > potential benefits must *also* be very low. It's both factors,
> > together.
>
> Hmm, maybe. But it also wouldn't surprise me very much if someone can
> come up with a test case where a nested loop with a single row (or
> maybe no rows) on one side or the other is significantly faster
> than any alternative plan.

That's certainly possible, but wouldn't the difference all come from
fixed startup costs? If we're talking about a single row, with a
minimal test case, then the potential downside of this more
conservative strategy might indeed amount to something like a 2x or 3x
slowdown, if we look at it in isolation. But why measure it that way?
I think that absolute differences like milliseconds of execution time
are much more relevant.

Real production databases have many queries with very diverse
characteristics -- there is a lot going on at any given moment. The
proportion of queries that will be affected either way by avoiding
unparameterized nested loop joins is probably going to be very small.
Nobody is going to notice if only a small subset of all queries are
maybe 1 ms or 2 ms slower. As long as it's well within the margin of
noise in 100% of all cases, it really shouldn't matter.

AFAICT the choice is one of "bounded, low upside versus unbounded,
high downside".

> I believe, though, that even if such cases
> exist, they are probably relatively few in number compared to the
> cases where unparameterized nested loops hurt, and the savings are
> probably small compared to the multiple-orders-of-magnitude slowdowns
> that you can get when a nested loop goes bad. But they might still be
> relatively large -- 2x, 3x? -- in absolute terms.

I suspect it won't even matter if disallowing unparameterized nested
loop joins loses on average.

I am reminded of this:
https://en.wikipedia.org/wiki/St._Petersburg_paradox#Expected_utility_theory

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-10-02 Thread Robert Haas
On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan  wrote:
> It's not just that the risks are ludicrously high, of course. The
> potential benefits must *also* be very low. It's both factors,
> together.

Hmm, maybe. But it also wouldn't surprise me very much if someone can
come up with a test case where a nested loop with a single row (or
maybe no rows) on one side or the other is significantly faster
than any alternative plan. I believe, though, that even if such cases
exist, they are probably relatively few in number compared to the
cases where unparameterized nested loops hurt, and the savings are
probably small compared to the multiple-orders-of-magnitude slowdowns
that you can get when a nested loop goes bad. But they might still be
relatively large -- 2x, 3x? -- in absolute terms.

In the prior discussion, the only person who showed a case in which he
thought that an unparameterized nested loop might be a clear winner
was Tom, but it was just sort of a general "this kind of case might be
a problem" thing rather than a fully worked out example with real
timings. Perhaps someone ought to try to characterize the kinds of
cases he mentioned, to help us get a clearer feeling about what, if
anything, we're gaining from the current scheme.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Sure, but the model isn't the problem here, really -- not to me. The
> problem is that the planner can in some cases choose a plan that is
> inherently unreasonable, at least in practical terms. You're talking
> about uncertainties. But I'm actually talking about the opposite thing
> -- certainty (albeit a limited kind of certainty that applies only to
> one narrow set of conditions).

I absolutely agree and support your proposal to simply not generate those paths 
at all unless necessary.

> For all I know you might be onto something. But it really seems
> independent to me.

Yeah, I'm sorry if I hijacked this thread for something related but 
technically different. I just wanted to expand on your proposal by taking into 
account the join depth and also by not just talking about unparameterized nested 
loop joins. The research is very clear that the uncertainty is proportional to 
the join level, and that is the point I am trying to focus the discussion on.

I really encourage everyone to read the VLDB paper. BTW, the unnamed 
proprietary DBMSs in that paper are the 3 big ones from Washington, California 
and NY, in that order.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 11:44 PM Benjamin Coutu  wrote:
> I think these things are orthogonal.

I agree that they're orthogonal. I just meant that execution time
strategies seem underexplored in general.

> No matter how good the cost model ever gets, we will always have degenerate 
> cases.

Sure, but the model isn't the problem here, really -- not to me. The
problem is that the planner can in some cases choose a plan that is
inherently unreasonable, at least in practical terms. You're talking
about uncertainties. But I'm actually talking about the opposite thing
-- certainty (albeit a limited kind of certainty that applies only to
one narrow set of conditions).

It's theoretically possible that bogosort will be faster than
quicksort in some individual cases. After all, bogosort is O(n) in the
best case, which is impossible to beat! But there is no practical
sense in which bogosort could ever be better than quicksort. Having
fewer choices by just not offering inherently bad choices seems quite
unrelated to what you're talking about.

For all I know you might be onto something. But it really seems
independent to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 9:00 PM David Rowley  wrote:
> I understand that what you propose would be a fast way to fix this
> issue. However, if we went and changed the join path creation code to
> not add non-parameterised nested loop paths when other paths exist,
> then how could we ever dare to put that code back again when we come
> up with a better solution?

But why would it matter, even then?

I don't deny that something like that could make sense, but I don't
see why it should be in tension with this proposal. We're talking
about a plan shape that is (in practical terms) inherently
unreasonable, given the availability of an alternative plan shape. Why
wouldn't that continue to be true in every such case, forever?

To put it another way, the proposal seems like taking away something
that we don't want to have, ever. It seems like a subtractive thing to
me. The potential upside of allowing unparameterized nestloop joins
seems infinitesimal; zero for all practical purposes. So even with a
far more sophisticated framework for "plan riskiness" in place, it
would still make sense to treat unparameterized nestloop joins as
inherently undesirable. There is perhaps a theoretical sense in which
that isn't quite true, but it's true for all practical purposes, which
should be enough.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Peter Geoghegan
On Fri, Sep 30, 2022 at 10:43 AM Robert Haas  wrote:
> But it's unnecessary to do this computation at runtime for every separate
> unparameterized nested loop: we can do it right here, in a generic
> way, for every such loop.

It's not just that the risks are ludicrously high, of course. The
potential benefits must *also* be very low. It's both factors,
together.

> Now, it's true that the actual risk depends on how certain we are of
> the estimates for the input rels, but that's difficult to quantify and
> I'm not convinced it really matters at all.

We're talking about a problem that is fairly unlikely to occur in
general, I think -- let's not forget that. These are presumably rare
events that nevertheless cause many real practical problems.

If we're going to add error bars, why wouldn't we also need error bars
for our error bars?

> I think we're kind of just making life complicated for ourselves here
> by pretending that unparameterized nested loops are part of some
> general class of uncertainty problems that we need to worry about. In
> some sense, they are, and David Rowley is right to mention the other
> one that comes up pretty frequently. But like that list of two is
> pretty much the whole list. I think we've talked ourselves into
> believing that this problem is much harder than it really is.

+1

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Robert Haas
On Thu, Sep 29, 2022 at 7:46 PM Tom Lane  wrote:
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

I think that the point of the paper which started the thread that this
discussion branched from was essentially that trying to add such a
policy with no data to go on worked extremely well in practice, and
other database systems are already doing it, and we're hurting
ourselves by not doing it. And specifically what they did was to
disfavor unparameterized nested loops.

And I think that actually makes a lot of sense. If you think through
parameterized nested loops, unparameterized nested loops, hash joins,
and merge joins, which is basically all the strategies, and you
imagine having many more or fewer rows on one side of the join or the
other than you thought, unparameterized nested loops are the standout
case. It's the only kind of join where the expense grows as the
product of the sizes of the two inputs. The other cases tend to be
more like O(n+m) rather than O(nm), and even if there are some lg n or
lg m terms in there too they don't tend to make a whole lot of
difference in practice.

So I think what you would find if you did all of this analysis is
that, basically, every time you did the cost computation for a
parameterized nested loop, a hash join, or a merge join, the error
bars would be whatever they were, and then when you did a cost
computation for an unparameterized nested loop, the error bars would
be way worse. Like, if we assume that the estimates for each side of a
hash join are off by 100x, then the cost will be off by roughly 100x
if it still fits in work_mem and by several hundred x if it now spills
to disk. But if we assume the same thing for an unparameterized nested
loop, the cost is now off by roughly 10,000x. And that is massively, massively
more, so clearly we should avoid the unparameterized nested loop. But
it's unnecessary to do this computation at runtime for every separate
unparameterized nested loop: we can do it right here, in a generic
way, for every such loop.
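
As a back-of-the-envelope check of that arithmetic (illustrative row counts 
only; real cost functions are of course more elaborate than n + m and n * m):

est_n = est_m = 1_000                    # planner's row estimates for both inputs
act_n = act_m = est_n * 100              # reality: each side underestimated 100x

hash_like_error = (act_n + act_m) / (est_n + est_m)   # O(n + m) work: ~100x off
nestloop_error = (act_n * act_m) / (est_n * est_m)    # O(n * m) work: 10,000x off
print(hash_like_error, nestloop_error)                # 100.0 10000.0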

Now, it's true that the actual risk depends on how certain we are of
the estimates for the input rels, but that's difficult to quantify and
I'm not convinced it really matters at all. Like, if the input is a
base relation with no filter condition, then the error should be
small, unless the table size changes a lot between planning and
execution. If it's the output of a user-defined aggregate, the error
could be really, really large. But that has no impact on the
*relative* dangers of the unparameterized nested loop vs. some other
join method. If join A is between two input rels whose sizes are
probably known fairly precisely, and join B is between two input rels
whose sizes we might be drastically wrong about, then a hash join is
riskier for join B than it is for join A. But that really does not
matter. What *does* matter is that an unparameterized nested loop is
riskier for join A than a hash join is for join A; and likewise for
join B.

I think we're kind of just making life complicated for ourselves here
by pretending that unparameterized nested loops are part of some
general class of uncertainty problems that we need to worry about. In
some sense, they are, and David Rowley is right to mention the other
one that comes up pretty frequently. But like that list of two is
pretty much the whole list. I think we've talked ourselves into
believing that this problem is much harder than it really is. Maybe a
blanket ban on unparameterized nested loops is too strong (or maybe
it's exactly the right thing) but it can't possibly be wrong to think
about that case in particular as something we need to solve. It's the
only join method that can go quadratic in easy, realistic scenarios --
and it often does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Actually, if we wanted to improve things in this area, we should have a
> set of queries that don't choose optimal plans that we can test with.  We used
> to see them a lot before we had extended statistics, but I don't
> remember seeing many recently, let alone a collection of them.  I guess
> that is good.

In the VLDB paper they actually created their own "Join Order Benchmark", which 
is publicly available at https://github.com/gregrahn/join-order-benchmark.
It would probably be better suited to this kind of testing than, e.g., the TPC 
benchmarks.

If there is interest, I could also compile a set of relevant cases based on the 
message history of the performance mailing list.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

Error bars would be fantastic, no question. But that would make things very 
complex.
A lot of judgment calls would be necessary for the policy behind upper-bound 
pruning, picking up on Peter's comment about "conviction multiplier of 
conviction multiplier" ;)
Also, the math in deriving those bounds based on the stats and how they 
propagate up the join tree doesn't seem trivial either.

> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

As in my reply to Peter, I think the join level/depth metric is a simple but 
principled way of dealing with it, given the referenced research.
In the first step, we'd use this merely to be more risk-averse towards nested 
loop joins as we climb up the join tree - we are not fiddling with the cost 
model itself, nor the join ordering, just when it comes to considering that 
particular join algorithm. Later this could be expanded to be more broadly 
scoped.

Please don't give up on a simple way to reap most of the fruits just yet.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu


> In general I suspect that we'd be better off focussing on mitigating
> the impact at execution time. There are at least a few things that we
> could do there, at least in theory. Mostly very ambitious, long term
> things.

I think these things are orthogonal. No matter how good the cost model ever 
gets, we will always have degenerate cases.
Having some smarts about that in the executor is surely a good thing, but it 
shouldn't distract us from improving on the planner front.

> 
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I agree with you, that those plans are too risky. But let's maybe find a more 
general way of dealing with this.

> Right. Though I am actually sympathetic to the idea that users might
> gladly pay a cost for performance stability -- even a fairly large
> cost. That part doesn't seem like the problem.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Right. But that seems fraught with difficulty. I suspect that the
> costs that the planner attributes to each plan often aren't very
> reliable in any absolute sense, even when everything is working very
> well by every available metric. Even a very noisy cost model with
> somewhat inaccurate selectivity estimates will often pick the cheapest
> plan, or close enough.

Sure, the absolute cost of a complex plan will always be inaccurate at best.
My point is that we can be very confident in the cardinalities of base tables. 
As the paper states in "3.1. Estimates for Base Tables":

"The median q-error is close to the optimal value of 1 for all systems,
indicating that the majority of all selections are estimated correctly."

Thanks to the statistics, we will practically never be off by an order of magnitude 
when estimating base table cardinalities.

The paper also clearly shows (and that certainly coincides with my experience) 
that those cardinality underestimations grow exponentially as they propagate up 
the join tree.

Given the research, I'd stipulate that at any given level of the join tree, the 
current depth is a reasonable indicator of underestimation. Taking that into 
account (even if only to mitigate nested loops at higher levels) is IMV a 
principled approach, and not necessarily a hack.

Obviously having something like error bars as proposed by Tom would be even 
better and perhaps more general, but that is on a whole different level in 
terms of complexity and I certainly have no idea how we would easily get there.




Re: disfavoring unparameterized nested loops

2022-09-29 Thread David Rowley
On Fri, 30 Sept 2022 at 13:06, Peter Geoghegan  wrote:
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I'm not sure if it's a good idea to assume that performing
non-parameterised Nested Loops when we shouldn't is the only shape of
plan that causes us problems.

We also have the case where we assume early start-up plans are
favourable.  For example:

SELECT * FROM t WHERE a = 1 ORDER BY b LIMIT 10;

where we have two indexes, one on t(a) and another on t(b).

Should we use the t(b) index and filter out the rows that don't match
a = 1 and hope we get 10 a=1 rows soon in the t(b) index? or do we use
t(a) and then perform a sort? Best case for using the t(b) index is
that we find 10 a=1 rows in the first 10 rows of the index scan, the
worst case is that there are no rows with a=1.
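
To put rough numbers on that risk, here is a small sketch which assumes the
a = 1 rows are spread uniformly through the t(b) index order -- which is
exactly the assumption that can go wrong:

def rows_scanned_via_b_index(limit, selectivity_a, table_rows):
    # Expected t(b) index entries read before collecting `limit` rows with
    # a = 1, capped at a full index scan when there aren't enough matches.
    if selectivity_a <= 0:
        return table_rows
    return min(table_rows, limit / selectivity_a)

print(rows_scanned_via_b_index(10, 0.1, 1_000_000))       # ~100 rows: great
print(rows_scanned_via_b_index(10, 0.000001, 1_000_000))  # whole index: terrible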

Having something coded into the cost model is a more generic way of
addressing this issue.  Provided we design the cost model correctly,
we'd be able to address future issues we discover using whichever
cost-model infrastructure we design for this.

I understand that what you propose would be a fast way to fix this
issue. However, if we went and changed the join path creation code to
not add non-parameterised nested loop paths when other paths exist,
then how could we ever dare to put that code back again when we come
up with a better solution?

David




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 07:51:47PM -0400, Bruce Momjian wrote:
> On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:
> > Agreed, but dealing with uncertainty in those numbers is an enormous
> > task if you want to do it right.  "Doing it right", IMV, would start
> > out by extending all the selectivity estimation functions to include
> > error bars; then we could have error bars on rowcount estimates and
> > then costs; then we could start adding policies about avoiding plans
> > with too large a possible upper-bound cost.  Trying to add such
> > policy with no data to go on is not going to work well.
> > 
> > I think Peter's point is that a quick-n-dirty patch is likely to make
> > as many cases worse as it makes better.  That's certainly my opinion
> > about the topic.
> 
> Agreed on all points --- I was thinking error bars too.

Actually, if we wanted to improve things in this area, we should have a
set of queries that don't choose optimal plans that we can test with.  We used
to see them a lot before we had extended statistics, but I don't
remember seeing many recently, let alone a collection of them.  I guess
that is good.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 4:46 PM Tom Lane  wrote:
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

In general I suspect that we'd be better off focussing on mitigating
the impact at execution time. There are at least a few things that we
could do there, at least in theory. Mostly very ambitious, long term
things.

I like the idea of just avoiding unparameterized nested loop joins
altogether when an "equivalent" hash join plan is available because
it's akin to an execution-time mitigation, despite the fact that it
happens during planning. While it doesn't actually change anything in
the executor, it is built on the observation that we have virtually
everything to gain and nothing to lose during execution, no matter
what happens.

It seems like a very small oasis of certainty in a desert of
uncertainty -- which seems nice, as far as it goes.

> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

Right. Though I am actually sympathetic to the idea that users might
gladly pay a cost for performance stability -- even a fairly large
cost. That part doesn't seem like the problem.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I think the point the original poster was making, and I have made in the
> > past, is that even if two optimizer costs are the same, one might be
> > more penalized by misestimation than the other, and we don't have a good
> > way of figuring that into our plan choices.
> 
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.
> 
> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

Agreed on all points --- I was thinking error bars too.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson






Re: disfavoring unparameterized nested loops

2022-09-29 Thread Tom Lane
Bruce Momjian  writes:
> I think the point the original poster was making, and I have made in the
> past, is that even if two optimizer costs are the same, one might be
> more penalized by misestimation than the other, and we don't have a good
> way of figuring that into our plan choices.

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right.  "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost.  Trying to add such
policy with no data to go on is not going to work well.

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better.  That's certainly my opinion
about the topic.

regards, tom lane




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 4:27 PM Bruce Momjian  wrote:
> I think the point the original poster was making, and I have made in the
> past, is that even if two optimizer costs are the same, one might be
> more penalized by misestimation than the other, and we don't have a good
> way of figuring that into our plan choices.

Right. But that seems fraught with difficulty. I suspect that the
costs that the planner attributes to each plan often aren't very
reliable in any absolute sense, even when everything is working very
well by every available metric. Even a very noisy cost model with
somewhat inaccurate selectivity estimates will often pick the cheapest
plan, or close enough.

Having a cost-based optimizer that determines the cheapest plan quite
reliably is one thing. Taking the same underlying information and
adding the dimension of risk to it and expecting a useful result is
quite another -- that seems far harder.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 04:12:06PM -0700, Peter Geoghegan wrote:
> Offhand I'd say it's more likely to be too complicated. Without
> meaning to sound glib, the first question that occurs to me is "will
> we also need a conviction multiplier conviction multiplier?". Anything
> like that is going to have unintended consequences that might very
> well be much worse than the problem that you set out to solve.
> 
> Personally I still like the idea of just avoiding unparameterized
> nested loop joins altogether when an "equivalent" hash join plan is
> available. I think of it as preferring the hash join plan because it
> will have virtually the same performance characteristics when you have
> a good cardinality estimate (probably very often), but far better
> performance characteristics when you don't. We can perhaps be
> approximately 100% sure that something like that will be true in all
> cases, no matter the details. That seems like a very different concept
> to what you've proposed.

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 7:32 AM Benjamin Coutu  wrote:
> Also, we can expand the multiplier whenever we fall back to using the default 
> cardinality constant as surely all bets are off at that point - we should 
> definitely treat nested loop joins as out of favor in this instance and that 
> could easily be incorporated by simply increasing the conviction-multiplier.
>
> What are your thoughts on this simple idea - is it perhaps too simple?

Offhand I'd say it's more likely to be too complicated. Without
meaning to sound glib, the first question that occurs to me is "will
we also need a conviction multiplier conviction multiplier?". Anything
like that is going to have unintended consequences that might very
well be much worse than the problem that you set out to solve.

Personally I still like the idea of just avoiding unparameterized
nested loop joins altogether when an "equivalent" hash join plan is
available. I think of it as preferring the hash join plan because it
will have virtually the same performance characteristics when you have
a good cardinality estimate (probably very often), but far better
performance characteristics when you don't. We can perhaps be
approximately 100% sure that something like that will be true in all
cases, no matter the details. That seems like a very different concept
to what you've proposed.

--
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu
I'd like to revamp this important discussion.

As is well described in this fairly recent paper here 
https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres) 
"estimation errors quickly grow as the number of joins increases, and that 
these errors are usually the reason for bad plans" - I think we can all get 
behind that statement.

While nested loop joins work great when cardinality estimates are correct, they 
are notoriously bad when the optimizer underestimates and they degrade very 
fast in such cases - the good vs. bad here is very asymmetric. On the other 
hand, hash joins degrade much more gracefully - they are considered very robust 
against underestimation. The above-mentioned paper illustrates that all major 
DBMSs (including Postgres) tend to underestimate, and usually that 
underestimation increases drastically with the number of joins (see Figures 3+4 
of the paper).

Now, a simple approach to guarding against bad plans that arise from 
underestimation could be to use what I would call a 
nested-loop-conviction-multiplier based on the current depth of the join tree, 
e.g. for a base table that multiplier would obviously be 1, but would then grow 
(e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew 
the cardinality estimates themselves, but rather be applied to the overall 
nested loop join cost at each particular stage of the plan when comparing it to 
other more robust join strategies like hash or sort-merge joins. That way, when 
we can be sure we have a good estimate at the bottom of the join tree, we treat 
all things equal, but favor nested loops less and less as we move up the join 
tree for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default 
cardinality constant, as surely all bets are off at that point - we should 
definitely treat nested loop joins as out of favor in this instance, and that 
could easily be incorporated by simply increasing the conviction-multiplier.
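
To make the comparison concrete, here is a toy model (the function names, the 
quadratic shape and the extra penalty factor for default estimates are all 
placeholders, not a worked-out policy):

def conviction_multiplier(join_depth, used_default_estimate=False):
    m = (join_depth + 1) ** 2      # 1 for base relations, then 4, 9, 16, ...
    if used_default_estimate:
        m *= 10                    # assumed extra penalty when stats gave up
    return m

def cheapest_join(costs, join_depth, used_default_estimate=False):
    # costs: raw planner costs per strategy, e.g. {"nestloop": 1200, "hash": 1500};
    # only the nested loop cost is scaled, the estimates themselves are untouched.
    adjusted = dict(costs)
    if "nestloop" in adjusted:
        adjusted["nestloop"] *= conviction_multiplier(join_depth,
                                                      used_default_estimate)
    return min(adjusted, key=adjusted.get)

print(cheapest_join({"nestloop": 1200, "hash": 1500}, join_depth=0))  # nestloop
print(cheapest_join({"nestloop": 1200, "hash": 1500}, join_depth=2))  # hash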

What are your thoughts on this simple idea - is it perhaps too simple?

Cheers, Ben




Re: disfavoring unparameterized nested loops

2022-01-13 Thread Robert Haas
On Wed, Jan 12, 2022 at 7:20 PM Peter Geoghegan  wrote:
> Should I take it that you've dropped this project? I was rather hoping
> that you'd get back to it, FWIW.

Honestly, I'd forgotten all about it.  In theory I'd like to do
something about this, but I just have too many other things going on.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2022-01-12 Thread Peter Geoghegan
On Tue, Jun 22, 2021 at 4:13 AM Robert Haas  wrote:
> I think that's a reasonable request. I'm not sure that I believe it's
> 100% necessary, but it's certainly an improvement on a technical
> level, and given that the proposed change could impact quite a lot of
> plans, it's fair to want to see some effort being put into mitigating
> the possible downsides. Now, I'm not sure when I might have time to
> actually try to do the work, which kind of sucks, but that's how it
> goes sometimes.

Should I take it that you've dropped this project? I was rather hoping
that you'd get back to it, FWIW.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-08-02 Thread Mike Klaas
> I think that it is worth paying more than nothing to avoid plans that are
> so far from optimal that they might as well take forever to execute


I recently came across this article from 2016 that expounded upon a bad
plan of the sort discussed in this thread:
https://heap.io/blog/when-to-avoid-jsonb-in-a-postgresql-schema

(The proximate cause in this case was PostgreSQL not collecting statistics
for fields in a JSONB column, estimating a rowcount of 1, and thus creating a
pathological slowdown.)

–Mike


On Tue, Jun 22, 2021 at 7:37 PM, Peter Geoghegan  wrote:

> [...]

Re: disfavoring unparameterized nested loops

2021-06-22 Thread Peter Geoghegan
On Tue, Jun 22, 2021 at 2:53 AM Tomas Vondra
 wrote:
> Yeah, I like the insurance analogy - it gets to the crux of the problem,
> because insurance is pretty much exactly about managing risk.

The user's exposure to harm is what truly matters. I admit that that's
very hard to quantify, but we should at least try to do so.

We sometimes think about a plan that is 10x slower as if it's
infinitely slow, or might as well be. But it's usually not like that
-- it is generally meaningfully much better than the plan being 100x
slower, which is itself sometimes appreciably better than 1000x
slower. And besides, users often don't get anything like the optimal
plan, even on what they would consider to be a good day (which is most
days). So maybe 10x slower is actually the baseline good case already,
without anybody knowing it. Most individual queries are not executed
very often, even on the busiest databases. The extremes really do
matter a lot.

If a web app or OLTP query is ~10x slower than optimal then it might
be the practical equivalent of an outage that affects the query alone
(i.e. "infinitely slow") -- but probably not. I think that it is worth
paying more than nothing to avoid plans that are so far from optimal
that they might as well take forever to execute. This is not
meaningfully different from a database outage affecting one particular
query. It kind of is in a category of its own that surpasses "slow
plan", albeit one that is very difficult or impossible to define
formally.

There may be a huge amount of variation in risk tolerance among
basically reasonable people. For example, if somebody chooses to
engage in some kind of extreme sport, to me it seems understandable.
It's just not my cup of tea. Whereas if somebody chooses to never wear
a seatbelt while driving, then to me they're simply behaving
foolishly. They're not willing to incur the tiniest inconvenience in
order to get a huge reduction in potential harm -- including a very
real risk of approximately the worst thing that can happen to you.
Sure, refusing to wear a seatbelt can theoretically be classified as
just another point on the risk tolerance spectrum, but that seems
utterly contrived to me. Some things (maybe not that many) really are
like that, or can at least be assumed to work that way as a practical
matter.

> But making
> everything slower will be a hard sell, because the vast majority of
> workloads already running on Postgres don't have this issue at all, so
> for them it's not worth the expense.

I think that we're accepting too much risk here. But I bet it's also
true that we're not taking enough risk in other areas. That was really
my point with the insurance analogy -- we can afford to take lots of
individual risks as long as they don't increase our exposure to truly
disastrous outcomes -- by which I mean queries that might as well take
forever to execute as far as the user is concerned. (Easier said than
done, of course.)

A simple trade-off between fast and robust doesn't seem like a
universally helpful thing. Sometimes it's a very unhelpful way of
looking at the situation. If you make something more robust to extreme
bad outcomes, then you may have simultaneously made it *faster* (not
slower) for all practical purposes. This can happen when the increase
in robustness allows the user to tune the system aggressively, and
only take on new risks that they can truly live with (which wouldn't
have been possible without the increase in robustness). For example, I
imagine that the failsafe mechanism added to VACUUM will actually make
it possible to tune VACUUM much more aggressively -- it might actually
end up significantly improving performance for all practical purposes,
even though technically it has nothing to do with performance.

Having your indexes a little more bloated because the failsafe
kicked-in is a survivable event -- the DBA lives to fight another day,
and *learns* to tune vacuum/the app so it doesn't happen again and
again. An anti-wraparound failure is perhaps not a survivable event --
the DBA gets fired. This really does seem like a fundamental
difference to me.

> Following the insurance analogy,
> selling tornado insurance in Europe is mostly pointless.

Principled skepticism of this kind of thing is of course necessary and
welcome. It *could* be taken too far.

> And the lack of data also plays role - the insurance company will ask
> for higher rates when it does not have enough accurate data about the
> phenomenon, or when there's a lot of unknowns. Maybe this would allow
> some basic measure of uncertainty, based on the number and type of
> restrictions, joins, etc.

I don't think that you can really model uncertainty. But you can have
true certainty (or close to it) about a trade-off that makes the
system fundamentally more robust over time. You can largely be certain
about both the cost of the insurance, as well as how it ameliorates
the problem in at least some cases.

> So maybe some fairly rough measure of 

Re: disfavoring unparameterized nested loops

2021-06-22 Thread Robert Haas
On Mon, Jun 21, 2021 at 4:52 PM Tom Lane  wrote:
> I'm willing to take some flak if there's not an easy proof that the outer
> scan is single-row, but I don't think we should just up and change it
> for cases where there is.

I think that's a reasonable request. I'm not sure that I believe it's
100% necessary, but it's certainly an improvement on a technical
level, and given that the proposed change could impact quite a lot of
plans, it's fair to want to see some effort being put into mitigating
the possible downsides. Now, I'm not sure when I might have time to
actually try to do the work, which kind of sucks, but that's how it
goes sometimes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-22 Thread Tomas Vondra



On 6/22/21 2:25 AM, Peter Geoghegan wrote:
> On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian  wrote:
>> There were a lot of interesting ideas in this thread and I want to
>> analyze some of them.  First, there is the common assumption (not
>> stated) that over-estimating by 5% and underestimating by 5% cause the
>> same harm, which is clearly false.  If I go to a restaurant and estimate
>> the bill to be 5% higher or %5 lower, assuming I have sufficient funds,
>> under or over estimating is probably fine.  If I am driving and
>> under-estimate the traction of my tires, I am probably fine, but if I
>> over-estimate their traction by 5%, I might crash.
> 
> My favorite analogy is the home insurance one:
> 
> It might make sense to buy home insurance because losing one's home
> (say through fire) is a loss that usually just cannot be tolerated --
> you are literally ruined. We can debate how likely it is to happen,
> but in the end it's not so unlikely that it can be ruled out. At the
> same time I may be completely unwilling to buy insurance for personal
> electronic devices. I can afford to replace all of them if I truly
> have to. And the chances of all of them breaking or being stolen on
> the same day is remote (unless my home burns down!). If I drop my cell
> phone and crack the screen, I'll be annoyed, but it's certainly not
> the end of the world.
> 
> This behavior will make perfect sense to most people. But it doesn't
> scale up or down. I have quite a few electronic devices, but only a
> single home, so technically I'm taking risks way more often than I am
> playing it safe here. Am I risk tolerant when it comes to insurance?
> Conservative?
> 
> I myself don't think that it is sensible to apply either term here.
> It's easier to just look at the specifics. A home is a pretty
> important thing to almost everybody; we can afford to treat it as a
> special case.
> 
>> If that is accurate, I think the big question is how common are cases
>> where the outer side can't be proven to have zero or one row and nested
>> loops are enough of a win to risk its greater sensitivity to
>> misestimation.  If it is uncommon, seems we could just code the
>> optimizer to use hash joins in those cases without a user-visible knob,
>> beyond the knob that already turns off nested loop joins.
> 
> I think it's possible that Robert's proposal will lead to very
> slightly slower plans in the vast majority of cases that are affected,
> while still being a very good idea. Why should insurance be 100% free,
> though? Maybe it can be in some cases where we get lucky, but why
> should that be the starting point? It just has to be very cheap
> relative to what we do today for us to come out ahead, certainly, but
> that seems quite possible in at least this case.
> 

Yeah, I like the insurance analogy - it gets to the crux of the problem,
because insurance is pretty much exactly about managing risk. But making
everything slower will be a hard sell, because the vast majority of
workloads already running on Postgres don't have this issue at all, so
for them it's not worth the expense. Following the insurance analogy,
selling tornado insurance in Europe is mostly pointless.

Insurance is also about personal preference / risk tolerance. Maybe I'm
fine with accepting the risk that my house burns down, or whatever ...

And the lack of data also plays role - the insurance company will ask
for higher rates when it does not have enough accurate data about the
phenomenon, or when there's a lot of unknowns. Maybe this would allow
some basic measure of uncertainty, based on the number and type of
restrictions, joins, etc. The more restrictions we have, the less
certain the estimates are. Some conditions are estimated less
accurately, and using default estimates makes it much less accurate.

So maybe some fairly rough measure of uncertainty might work, and the
user might specify how much risk it's willing to tolerate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian  wrote:
> There were a lot of interesting ideas in this thread and I want to
> analyze some of them.  First, there is the common assumption (not
> stated) that over-estimating by 5% and underestimating by 5% cause the
> same harm, which is clearly false.  If I go to a restaurant and estimate
> the bill to be 5% higher or 5% lower, assuming I have sufficient funds,
> under or over estimating is probably fine.  If I am driving and
> under-estimate the traction of my tires, I am probably fine, but if I
> over-estimate their traction by 5%, I might crash.

My favorite analogy is the home insurance one:

It might make sense to buy home insurance because losing one's home
(say through fire) is a loss that usually just cannot be tolerated --
you are literally ruined. We can debate how likely it is to happen,
but in the end it's not so unlikely that it can be ruled out. At the
same time I may be completely unwilling to buy insurance for personal
electronic devices. I can afford to replace all of them if I truly
have to. And the chances of all of them breaking or being stolen on
the same day is remote (unless my home burns down!). If I drop my cell
phone and crack the screen, I'll be annoyed, but it's certainly not
the end of the world.

This behavior will make perfect sense to most people. But it doesn't
scale up or down. I have quite a few electronic devices, but only a
single home, so technically I'm taking risks way more often than I am
playing it safe here. Am I risk tolerant when it comes to insurance?
Conservative?

I myself don't think that it is sensible to apply either term here.
It's easier to just look at the specifics. A home is a pretty
important thing to almost everybody; we can afford to treat it as a
special case.

> If that is accurate, I think the big question is how common are cases
> where the outer side can't be proven to have zero or one row and nested
> loops are enough of a win to risk its greater sensitivity to
> misestimation.  If it is uncommon, seems we could just code the
> optimizer to use hash joins in those cases without a user-visible knob,
> beyond the knob that already turns off nested loop joins.

I think it's possible that Robert's proposal will lead to very
slightly slower plans in the vast majority of cases that are affected,
while still being a very good idea. Why should insurance be 100% free,
though? Maybe it can be in some cases where we get lucky, but why
should that be the starting point? It just has to be very cheap
relative to what we do today for us to come out ahead, certainly, but
that seems quite possible in at least this case.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Bruce Momjian
On Mon, Jun 21, 2021 at 12:52:39PM -0400, Tom Lane wrote:
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.
>
> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.
> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.
> 
> Another thought that occurs to me here is that maybe we could get rid of
> the enable_material knob in favor of forcing (or at least encouraging)
> materialization when the outer side isn't provably one row.

There were a lot of interesting ideas in this thread and I want to
analyze some of them.  First, there is the common assumption (not
stated) that over-estimating by 5% and underestimating by 5% cause the
same harm, which is clearly false.  If I go to a restaurant and estimate
the bill to be 5% higher or 5% lower, assuming I have sufficient funds,
under or over estimating is probably fine.  If I am driving and
under-estimate the traction of my tires, I am probably fine, but if I
over-estimate their traction by 5%, I might crash.

Closer to home, Postgres is more tolerant of memory usage
under-estimation than over-estimation:

https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018

What I hear Robert saying is that unparameterized nested loops are much
more sensitive to misestimation than hash joins, and only slightly
faster than hash joins when they use accurate row counts, so we might
want to avoid them by default.  Tom is saying that if we know the outer
side will have zero or one row, we should still do unparameterized
nested loops because those are not more sensitive to misestimation than
hash joins, and slightly faster.

If that is accurate, I think the big question is how common are cases
where the outer side can't be proven to have zero or one row and nested
loops are enough of a win to risk its greater sensitivity to
misestimation.  If it is uncommon, seems we could just code the
optimizer to use hash joins in those cases without a user-visible knob,
beyond the knob that already turns off nested loop joins.

Peter's comment about having nodes in the executor that adjust to the
row counts it finds is interesting, and eventually might be necessary
once we are sure we have gone as far as we can without it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 1:52 PM Tom Lane  wrote:
> You're ignoring the fact that the plan shape we generate now is in fact
> *optimal*, and easily proven to be so, in some very common cases.

As I've said I don't reject the idea that there is room for
disagreement on the specifics. For example perhaps it'll turn out that
only a restricted subset of the cases that Robert originally had in
mind will truly turn out to work as well as hoped. But that just seems
like a case of Robert refining a very preliminary proposal. I
absolutely expect there to be some need to iron out the wrinkles.

> I don't
> think the people whose use-cases get worse are going to be mollified by
> the argument that you reduced their risk, when there is provably no risk.

By definition what we're doing here is throwing away slightly cheaper
plans when the potential downside is much higher than the potential
upside of choosing a reasonable alternative. I don't think that the
downside is particularly likely. In fact I believe that it's fairly
unlikely in general. This is an imperfect trade-off, at least in
theory. I fully own that.

> I'm willing to take some flak if there's not an easy proof that the outer
> scan is single-row, but I don't think we should just up and change it
> for cases where there is.

Seems reasonable to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 21, 2021 at 10:14 AM Robert Haas  wrote:
>> The other thing is - the risk of a particular path doesn't matter in
>> an absolute sense, only a relative one. In the particular case I'm on
>> about here, we *know* there's a less-risky alternative.

> Exactly! This, a thousand times.

This is a striking oversimplification.

You're ignoring the fact that the plan shape we generate now is in fact
*optimal*, and easily proven to be so, in some very common cases.  I don't
think the people whose use-cases get worse are going to be mollified by
the argument that you reduced their risk, when there is provably no risk.
Obviously the people whose use-cases are currently hitting the wrong end
of the risk will be happy with any change whatever, but those may not be
the same people.

I'm willing to take some flak if there's not an easy proof that the outer
scan is single-row, but I don't think we should just up and change it
for cases where there is.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 10:14 AM Robert Haas  wrote:
> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop. And it's also not clear to me
> why a nested loop is actually better in a case like this. If the
> nested loop iterates more than once because there are more rows on the
> outer side, then you don't want to have something on the inner side
> that might be expensive, and either an aggregate or an unparameterized
> search for a unique value are potentially quite expensive. Now if you
> put a materialize node on top of the inner side, then you don't have
> to worry about that problem, but how much are you saving at that point
> vs. just doing a hash join?

I suspected that that was true, but even that doesn't seem like the
really important thing. While it may be true that the simple heuristic
can't be quite as simple as we'd hoped at first, ISTM that this is
ultimately not much of a problem. The basic fact remains: some more or
less simple heuristic makes perfect sense, and should be adopted.

This conclusion is counterintuitive because it's addressing a very
complicated problem with a very simple solution. However, if we lived
in a world where things that sound too good to be true always turned
out to be false, we'd also live in a world where optimizers were
completely impractical and useless. Optimizers have that quality
already, whether or not we officially acknowledge it.

> I don't understand how to generate a risk assessment or what we ought
> to do with it if we had one. I don't even understand what units we
> would use. We measure costs using abstract cost units, but those
> abstract cost units are intended to be a proxy for runtime. If it's
> not the case that a plan that runs for longer has a higher cost, then
> something's wrong with the costing model or the settings. In the case
> of risk, the whole thing seems totally subjective. We're talking about
> the risk that our estimate is bogus, but how do we estimate the risk
> that we don't know how to estimate?

Clearly we need a risk estimate for our risk estimate!

> The other thing is - the risk of a particular path doesn't matter in
> an absolute sense, only a relative one. In the particular case I'm on
> about here, we *know* there's a less-risky alternative.

Exactly! This, a thousand times.

This reminds me of how people behave in the real world. In the real
world people deal with this without too much difficulty. Everything is
situational and based on immediate trade-offs, with plenty of
uncertainty at every step. For example, if you think that there is
even a tiny chance of a piece of fruit being poisonous, you don't eat
the piece of fruit -- better to wait until lunchtime. This is one of
the *easiest* decisions I can think of, despite the uncertainty.
(Except perhaps if you happen to be in danger of dying of starvation,
in which case it might be a very different story. And so on.)

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 1:38 PM Tom Lane  wrote:
>> NestLoop Join
>>   Join Filter: t1.x op t2.y
>>   -> Index Scan on t1_pkey
>>Index Cond: t1.id = constant
>>   -> Seq Scan on t2

> Hmm, yeah, I guess that's possible. How much do you think this loses
> as compared with:

> Hash Join
> Hash Cond: t1.x op t2.y
> -> Seq Scan on t2
> -> Hash
>   -> Index Scan on t1_pkey

Hard to say.  The hash overhead might or might not pay for itself.
If the equality operator proper is expensive and we get to avoid
applying it at most t2 rows, then this might be an OK alternative;
otherwise not so much.

In any case, the former looks like plans that we generate now,
the second not.  Do you really want to field a lot of questions
about why we suddenly changed to a not-clearly-superior plan?

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 1:38 PM Tom Lane  wrote:
> > Hmm, maybe I need to see an example of the sort of plan shape that you
> > have in mind. To me it feels like a comparison on a unique key ought
> > to use a *parameterized* nested loop.
>
> The unique-key comparison would be involved in the outer scan in
> the cases I'm thinking of.  As an example,
>
> select * from t1, t2 where t1.id = constant and t1.x op t2.y;
>
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>
> NestLoop Join
>   Join Filter: t1.x op t2.y
>   -> Index Scan on t1_pkey
>Index Cond: t1.id = constant
>   -> Seq Scan on t2
>
> and if we can detect that the pkey indexscan produces just one row,
> this is very possibly the best available plan.

Hmm, yeah, I guess that's possible. How much do you think this loses
as compared with:

Hash Join
Hash Cond: t1.x op t2.y
-> Seq Scan on t2
-> Hash
  -> Index Scan on t1_pkey

(If the operator is not hashable then this plan is impractical, but in
such a case the question of preferring the hash join over the nested
loop doesn't arise anyway.)

> BTW, it strikes me that there might be an additional consideration
> here: did parameterization actually help anything?  That is, the
> proposed rule wants to reject the above but allow
>
> NestLoop Join
>   -> Index Scan on t1_pkey
>Index Cond: t1.id = constant
>   -> Seq Scan on t2
>Filter: t1.x op t2.y
>
> even though the latter isn't meaningfully better.  It's possible
> this won't arise because we don't consider parameterized paths
> except where the parameter is used in an indexqual or the like,
> but I'm not confident of that.  See in particular reparameterize_path
> and friends before you assert there's no such issue.  So we might
> need to distinguish essential from incidental parameterization,
> or something like that.

Hmm, perhaps. I think it won't happen in the normal cases, but I can't
completely rule out the possibility that there are corner cases where
it does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
I wrote:
> ... As an example,
>   select * from t1, t2 where t1.id = constant and t1.x op t2.y;
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>   NestLoop Join
> Join Filter: t1.x op t2.y
> -> Index Scan on t1_pkey
>  Index Cond: t1.id = constant
> -> Seq Scan on t2

Also, to enlarge on that example: if "op" isn't hashable then the
original argument is moot.  However, it'd still be useful to know
if the outer scan is sure to return no more than one row, as that
could inform the choice whether to plaster a Materialize node on
the inner scan.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 11:55 AM Tom Lane  wrote:
>> There are certainly cases where the optimizer can prove (in principle;
>> it doesn't do so today) that a plan node will produce at most one row.
>> They're hardly uncommon either: an equality comparison on a unique
>> key, or a subquery with a simple aggregate function, come to mind.

> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop.

The unique-key comparison would be involved in the outer scan in
the cases I'm thinking of.  As an example,

select * from t1, t2 where t1.id = constant and t1.x op t2.y;

where I'm not assuming much about the properties of "op".
This could be amenable to a plan like

NestLoop Join
  Join Filter: t1.x op t2.y
  -> Index Scan on t1_pkey
   Index Cond: t1.id = constant
  -> Seq Scan on t2

and if we can detect that the pkey indexscan produces just one row,
this is very possibly the best available plan.  Nor do I think this
is an unusual situation that we can just ignore.

BTW, it strikes me that there might be an additional consideration
here: did parameterization actually help anything?  That is, the
proposed rule wants to reject the above but allow

NestLoop Join
  -> Index Scan on t1_pkey
   Index Cond: t1.id = constant
  -> Seq Scan on t2
   Filter: t1.x op t2.y

even though the latter isn't meaningfully better.  It's possible
this won't arise because we don't consider parameterized paths
except where the parameter is used in an indexqual or the like,
but I'm not confident of that.  See in particular reparameterize_path
and friends before you assert there's no such issue.  So we might
need to distinguish essential from incidental parameterization,
or something like that.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 1:11 PM Peter Geoghegan  wrote:
> On Mon, Jun 21, 2021 at 9:52 AM Tom Lane  wrote:
> > > I'm not so sure that it is. The point isn't the risk, even if it could
> > > be calculated. The point is the downsides of being wrong are huge and
> > > pretty much unbounded, whereas the benefits of being right are tiny
> > > and bounded. It almost doesn't matter what the underlying
> > > probabilities are.
> >
> > You're throwing around a lot of pejorative adjectives there without
> > having bothered to quantify any of them.  This sounds less like a sound
> > argument to me than like a witch trial.
>
> I'm not sure what you mean by pejorative. Isn't what I said about
> downsides and upsides pretty accurate?

Yeah, I don't see why Peter's characterization deserves to be labelled
as pejorative here. A hash join or merge join or parameterized nested
loop can turn out to be slower than some other algorithm, but all of
those approaches have some component that tends to make the asymptotic
cost less than the product of the sizes of the inputs. I don't think
that's true in absolutely every case; for example, if a merge join has
every row duplicated on both sides, it will have to scan every inner
tuple once per outer tuple, just like a nested loop, and the other
algorithms also are going to degrade toward O(NM) performance in the
face of many duplicates. Also, a hash join can be pretty close to that
if it needs a shazillion batches. But in normal cases, any algorithm
other than an unparameterized nested loop tends to read each input
tuple on each side ONCE, so the cost is more like the sum of the input
sizes than the product. And there's nothing pejorative in saying that
N + M can be less than N * M by an unbounded amount. That's just the
facts.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 11:55 AM Tom Lane  wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

Hmm, maybe I need to see an example of the sort of plan shape that you
have in mind. To me it feels like a comparison on a unique key ought
to use a *parameterized* nested loop. And it's also not clear to me
why a nested loop is actually better in a case like this. If the
nested loop iterates more than once because there are more rows on the
outer side, then you don't want to have something on the inner side
that might be expensive, and either an aggregate or an unparameterized
search for a unique value are potentially quite expensive. Now if you
put a materialize node on top of the inner side, then you don't have
to worry about that problem, but how much are you saving at that point
vs. just doing a hash join?

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I don't understand how to generate a risk assessment or what we ought
to do with it if we had one. I don't even understand what units we
would use. We measure costs using abstract cost units, but those
abstract cost units are intended to be a proxy for runtime. If it's
not the case that a plan that runs for longer has a higher cost, then
something's wrong with the costing model or the settings. In the case
of risk, the whole thing seems totally subjective. We're talking about
the risk that our estimate is bogus, but how do we estimate the risk
that we don't know how to estimate? Given quals (x + 0) = x, x = some
MCV, and x = some non-MCV, we can say that we're most likely to be
wrong about the first one and least likely to be wrong about the
second one, but how much more likely? I don't know how you can decide
that, even in principle. We can also say that an unparameterized
nested loop is more risky than some other plan because it could turn
out to be crazy expensive, but is that more or less risky than
scanning the index on b as a way to implement SELECT * FROM foo WHERE
a = 1 ORDER BY b LIMIT 1? How much more risky, and why?

And then, even supposing we had a risk metric for every path, what
exactly would we do with it? Suppose path A is cheaper than path B,
but also more risky. Which should we keep? We could keep both, but
that seems to be just kicking the can down the road. If plan B is
likely to blow up in our face, we should probably just get rid of it,
or not even generate it in the first place. Even if we do keep both,
at some point we're going to have to make a cost-vs-risk tradeoff, and
I don't see how to do that intelligently, because the point is
precisely that if the risk is high, the cost number might be totally
wrong. If we know that plan A is cheaper than plan B, we should choose
plan A. But if all we know is that plan A would be cheaper than plan B
if our estimate of the cost were correct, but also that it probably
isn't, then what we actually know is nothing. We have no principled
basis for deciding anything based on cost unless we're reasonably
confident that the cost estimate is pretty good. So AFAICT the only
principled strategy here is to throw away high risk paths as early as
we possibly can. What am I missing?

The other thing is - the risk of a particular path doesn't matter in
an absolute sense, only a relative one. In the particular case I'm on
about here, we *know* there's a less-risky alternative. We don't need
to quantify the risk to know which of the two options has more. In
many other cases, the risk is irreducible e.g. a default estimate
could be totally bogus, but switching paths is of no help in getting
rid of it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 9:52 AM Tom Lane  wrote:
> > I'm not so sure that it is. The point isn't the risk, even if it could
> > be calculated. The point is the downsides of being wrong are huge and
> > pretty much unbounded, whereas the benefits of being right are tiny
> > and bounded. It almost doesn't matter what the underlying
> > probabilities are.
>
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.

I'm not sure what you mean by pejorative. Isn't what I said about
downsides and upsides pretty accurate?

> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.

It sounds like you're concerned about properly distinguishing between:

1. Cases where the only non-reckless choice is a hash join over an
unparameterized nested loop join

2. Cases that look like that at first, that don't really have that
quality on closer examination.

This seems like a reasonable concern.

> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.

I thought that you were arguing against the premise itself. It's now
clear that you weren't, though.

I don't have an opinion for or against bringing the provably-empty
rels stuff into the picture. At least not right now.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Tomas Vondra  writes:
> I've been thinking about this a bit recently and searching for papers 
> talking about this, but it's not clear to me how to calculate the 
> risk (and propagate it through the plan) without making the whole cost 
> evaluation way more complicated / expensive :-(

Yeah, a truly complete approach using confidence intervals or the
like seems frighteningly complicated.

> But maybe you have thought about some much simpler approach, addressing 
> this sufficiently well?

See my nearby response to Peter.  The main case that's distressing me
is the possibility of forcing a hash join even when the outer side
is obviously only one row.  If we could avoid that, at least for
large values of "obvious", I'd be far more comfortable.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 21, 2021 at 8:55 AM Tom Lane  wrote:
>> I'd be a lot happier if this proposal were couched around some sort
>> of estimate of the risk of the outer side producing more than the
>> expected number of rows.  The arguments so far seem like fairly lame
>> rationalizations for not putting forth the effort to do that.

> I'm not so sure that it is. The point isn't the risk, even if it could
> be calculated. The point is the downsides of being wrong are huge and
> pretty much unbounded, whereas the benefits of being right are tiny
> and bounded. It almost doesn't matter what the underlying
> probabilities are.

You're throwing around a lot of pejorative adjectives there without
having bothered to quantify any of them.  This sounds less like a sound
argument to me than like a witch trial.

Reflecting for a bit on the ancient principle that "the only good numbers
in computer science are 0, 1, and N", it seems to me that we could make
an effort to classify RelOptInfos as provably empty, provably at most one
row, and others.  (This would be a property of relations, not individual
paths, so it needn't bloat struct Path.)  We already understand about
provably-empty rels, so this is just generalizing that idea a little.
Once we know about that, at least for the really-common cases like unique
keys, I'd be okay with a hard rule that we don't consider unparameterized
nestloop joins with an outer side that falls in the "N" category.
Unless there's no alternative, of course.
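
For illustration, a minimal standalone sketch of that classification and
the hard rule, using invented names rather than the real RelOptInfo
machinery:

#include <stdbool.h>
#include <stdio.h>

/* "0, 1, and N" as a property of a relation, not of individual paths. */
typedef enum RelRowClass
{
    REL_PROVABLY_EMPTY,     /* e.g. contradictory quals */
    REL_AT_MOST_ONE_ROW,    /* e.g. equality on a unique key, simple aggregate */
    REL_MANY_ROWS           /* everything else: "N" */
} RelRowClass;

/*
 * The hard rule sketched above: an outer side in the "N" class rules out
 * an unparameterized nestloop, unless it is the only join method we have.
 */
static bool
consider_unparam_nestloop(RelRowClass outer_class, bool have_alternative)
{
    if (outer_class == REL_MANY_ROWS)
        return !have_alternative;
    return true;
}

int
main(void)
{
    printf("unique-key outer, hash join available: %s\n",
           consider_unparam_nestloop(REL_AT_MOST_ONE_ROW, true) ? "consider" : "skip");
    printf("many-row outer, hash join available:   %s\n",
           consider_unparam_nestloop(REL_MANY_ROWS, true) ? "consider" : "skip");
    printf("many-row outer, no other join method:  %s\n",
           consider_unparam_nestloop(REL_MANY_ROWS, false) ? "consider" : "skip");
    return 0;
}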

Another thought that occurs to me here is that maybe we could get rid of
the enable_material knob in favor of forcing (or at least encouraging)
materialization when the outer side isn't provably one row.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tomas Vondra




On 6/21/21 5:55 PM, Tom Lane wrote:
> Peter Geoghegan  writes:
>> The heuristic that has the optimizer flat out avoid an
>> unparameterized nested loop join is justified by the belief that
>> that's fundamentally reckless. Even though we all agree on that much,
>> I don't know when it stops being reckless and starts being "too risky
>> for me, but not fundamentally reckless". I think that that's worth
>> living with, but it isn't very satisfying.
>
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.
>
> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join.  So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Agreed.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I agree having such a measure would be helpful, but do you have an idea
how it could be implemented?


I've been thinking about this a bit recently and searching for papers 
talking about this, but it's not clear to me how to calculate the 
risk (and propagate it through the plan) without making the whole cost 
evaluation way more complicated / expensive :-(


The "traditional approach" to quantifying risk would be confidence 
intervals, i.e. for each cardinality estimate "e" we'd determine a range 
[a,b] so that P(a <= e <= b) = p. So we could pick "p" depending on how 
"robust" the plan choice should be (say 0.9 for "risky" and 0.999 for 
"safe" plans) and calculate a,b. Or maybe we could calculate where the 
plan changes, and then we'd see if those "break points" are within the 
confidence interval. If not, great - we can assume the plan is stable, 
otherwise we'd need to consider the other plans too, somehow.
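
As a deliberately naive standalone sketch of that interval check (a
normal-approximation interval over a sampled selectivity; the sample
size, z-value and break point below are made-up numbers, not anything
the current statistics machinery produces):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double table_rows = 1000000.0;
    double n = 30000.0;     /* sampled rows */
    double k = 3.0;         /* sampled rows matching the qual */
    double z = 3.29;        /* ~99.9% two-sided: the "safe" setting */

    /* p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n) */
    double p_hat = k / n;
    double half = z * sqrt(p_hat * (1.0 - p_hat) / n);
    double lo = fmax(p_hat - half, 0.0) * table_rows;
    double hi = (p_hat + half) * table_rows;
    double est = p_hat * table_rows;

    /* Row count at which, say, the hash join would overtake the nestloop. */
    double break_point = 200.0;
    int    unstable = (break_point >= lo && break_point <= hi);

    printf("estimate %.0f rows, interval [%.0f, %.0f]\n", est, lo, hi);
    printf("break point %.0f is %s the interval -> %s\n",
           break_point, unstable ? "inside" : "outside",
           unstable ? "consider the alternative plan too"
                    : "treat the plan choice as stable");
    return 0;
}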


But what I'm not sure about is:

1) Now we're dealing with three cardinality estimates (the original "e"
and the boundaries "a", "b"). So which one do we use to calculate cost
and pass to upper parts of the plan?


2) The outer relation may be a complex join, so we'd need to combine the 
confidence intervals for the two input relations, somehow.


3) We'd need to know how to calculate the confidence intervals for most 
plan nodes, which I'm not sure we know how to do. So it's not clear to 
me how to do this, which seems rather problematic because we need to 
propagate and combine those confidence intervals through the plan.



But maybe you have thought about some much simpler approach, addressing 
this sufficiently well?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 8:55 AM Tom Lane  wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

That sounds like it might be useful in general.

> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join.  So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Let's assume for the sake of argument that we really have to have that
additional infrastructure to move forward with the idea. (I'm not sure
if it's possible in principle to use infrastructure like that for some
of the cases that Robert has in mind, but for now I'll assume that it
is both possible and a practical necessity.)

Even when I make this working assumption I don't see what it changes
at a fundamental level. You've merely come up with a slightly more
specific definition of the class of plans that are "reckless". You've
only refined the original provisional definition of "reckless" to
exclude specific "clearly not reckless" cases (I think). But the
definition of "reckless" is no less squishy than what we started out
with.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I'm not so sure that it is. The point isn't the risk, even if it could
be calculated. The point is the downsides of being wrong are huge and
pretty much unbounded, whereas the benefits of being right are tiny
and bounded. It almost doesn't matter what the underlying
probabilities are.

To be clear I'm not arguing against modelling risk. I'm just not sure
that it's useful to think of this problem as truly a problem of risk.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> The heuristic that has the optimizer flat out avoid an
> unparameterized nested loop join is justified by the belief that
> that's fundamentally reckless. Even though we all agree on that much,
> I don't know when it stops being reckless and starts being "too risky
> for me, but not fundamentally reckless". I think that that's worth
> living with, but it isn't very satisfying.

There are certainly cases where the optimizer can prove (in principle;
it doesn't do so today) that a plan node will produce at most one row.
They're hardly uncommon either: an equality comparison on a unique
key, or a subquery with a simple aggregate function, come to mind.
 
In such cases, not only is this choice not reckless, but it's provably
superior to a hash join.  So in the end this gets back to the planning
risk factor that we keep circling around but nobody quite wants to
tackle.

I'd be a lot happier if this proposal were couched around some sort
of estimate of the risk of the outer side producing more than the
expected number of rows.  The arguments so far seem like fairly lame
rationalizations for not putting forth the effort to do that.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 7:45 AM Robert Haas  wrote:
> On Mon, Jun 21, 2021 at 6:41 AM David Rowley  wrote:
> > For example, an unparameterized Nested Loop whose outer Path is
> > estimated to return 1 row incurs an entire additional inner scan if
> > there are 2 rows.  With Hash Join the cost is just an additional
> > hashtable lookup, which is dead cheap.   I don't know exactly how
> > add_path() would weigh all that up, but it seems to me that I wouldn't
> > take the risk unless I was 100% certain that the Nested Loop's outer
> > Path would only return 1 row exactly, if there was any chance at all
> > it could return more, I'd be picking some other join method.
>
> It seems like everyone agrees that it would be good to do something
> about this problem, but the question is whether it's best to do
> something that tries to be general, or whether we should instead do
> something about this specific case. I favor the latter approach.

I agree with your conclusion, but FWIW I am sympathetic to David's
view too. I certainly understand why he'd naturally want to define the
class of problems that are like this one, to understand what the
boundaries are.

The heuristic that has the optimizer flat out avoid an
unparameterized nested loop join is justified by the belief that
that's fundamentally reckless. Even though we all agree on that much,
I don't know when it stops being reckless and starts being "too risky
for me, but not fundamentally reckless". I think that that's worth
living with, but it isn't very satisfying.

> Risk
> and uncertainty exist all over the place, but dealing with that in a
> general way seems difficult, and maybe unnecessary. Addressing the
> case of unparameterized nest loops specifically seems simpler, because
> it's easier to reason about what the alternatives are. Your last
> sentence here seems right on point to me.

Right. Part of why this is a good idea is that the user is exposed to
so many individual risks and uncertainties. We cannot see any one risk
as existing in a vacuum. It is not the only risk the user will ever
take in the planner -- if it was then it might actually be okay to
allow unparameterized nested loop joins.

A bad unparameterized nested loop join plan has, in a sense, unknown
and unbounded cost/downside. But it is only very slightly faster than
a hash join, by a fixed well understood amount. With enough "trials"
and on a long enough timeline, it will inevitably blow up and cause
the application to grind to a halt. It seems like no amount of fixed,
bounded benefit from "fast unparameterized nested loop joins" could
possibly make up for that. The life of Postgres users would be a lot
better if bad plans were at least "survivable events".

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 6:41 AM David Rowley  wrote:
> For example, an unparameterized Nested Loop whose outer Path is
> estimated to return 1 row incurs an entire additional inner scan if
> there are 2 rows.  With Hash Join the cost is just an additional
> hashtable lookup, which is dead cheap.   I don't know exactly how
> add_path() would weigh all that up, but it seems to me that I wouldn't
> take the risk unless I was 100% certain that the Nested Loop's outer
> Path would only return 1 row exactly, if there was any chance at all
> it could return more, I'd be picking some other join method.

It seems like everyone agrees that it would be good to do something
about this problem, but the question is whether it's best to do
something that tries to be general, or whether we should instead do
something about this specific case. I favor the latter approach. Risk
and uncertainty exist all over the place, but dealing with that in a
general way seems difficult, and maybe unnecessary. Addressing the
case of unparameterized nest loops specifically seems simpler, because
it's easier to reason about what the alternatives are. Your last
sentence here seems right on point to me.

Basically, what you argue for there is disabling unparameterized
nested loops entirely except when we can prove that the inner side
will never generate more than one row. But, that's almost never going
to be something that we can prove. If the inner side is coming from a
table or sub-join, it can turn out to be big. As far as I can see, the
only way that this doesn't happen is if it's something like a subquery
that aggregates everything down to one row, or has LIMIT 1, but those
are such special cases that I don't even think we should be worrying
about them.

So my counter-proposal is: how about if we split
match_unsorted_outer() into two functions, one of which generates
nested loops only based on parameterized paths and the other of which
generates nested loops based only on unparameterized paths, and then
rejigger add_paths_to_joinrel() so that we do the latter between the
steps that are currently number 5 and 6 and only if we haven't got any
other paths yet? If somebody later comes up with logic for proving
that the inner side can never have more than 1 row, we can let this be
run in those cases as well. In the meantime, if somebody does
something like a JOIN b ON a.x < b.x, we will still generate these
paths because there's no other approach, or similarly if it's a.x =
b.x but for some strange type that doesn't have a hash-joinable or
merge-joinable equality operator. But otherwise we omit those paths
entirely.
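
A standalone sketch of the gating this amounts to, with invented helper
names rather than the real joinpath.c functions: unparameterized nested
loops become a fallback generated only when nothing else produced a path:

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyJoinRel
{
    int npaths;     /* stand-in for list_length(joinrel->pathlist) */
} ToyJoinRel;

static void add_parameterized_nestloop_paths(ToyJoinRel *rel) { rel->npaths++; }
static void add_hash_and_merge_paths(ToyJoinRel *rel)         { rel->npaths++; }
static void add_unparam_nestloop_paths(ToyJoinRel *rel)       { rel->npaths++; }

static void
build_join_paths(ToyJoinRel *rel, bool can_parameterize, bool hash_or_merge_ok)
{
    if (can_parameterize)
        add_parameterized_nestloop_paths(rel);
    if (hash_or_merge_ok)
        add_hash_and_merge_paths(rel);

    /*
     * The proposed new step: only if no other path exists at all, e.g.
     * "a JOIN b ON a.x < b.x", or an equality operator that is neither
     * hash- nor merge-joinable.
     */
    if (rel->npaths == 0)
        add_unparam_nestloop_paths(rel);
}

int
main(void)
{
    ToyJoinRel equi = {0};  /* ordinary hashable equi-join */
    ToyJoinRel ineq = {0};  /* "a.x < b.x": nothing else works */

    build_join_paths(&equi, true, true);
    build_join_paths(&ineq, false, false);
    printf("equi-join: %d path(s), unparameterized nestloop omitted\n", equi.npaths);
    printf("inequality-only join: %d path(s), unparameterized nestloop kept\n", ineq.npaths);
    return 0;
}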

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Kenneth Marshall
> >
> > Most of the time when I see that happen it's down to either the
> > selectivity of some correlated base-quals being multiplied down to a
> > number low enough that we clamp the estimate to be 1 row.   The other
> > case is similar, but with join quals.
> 
> If an estimate is lower than 1, that should be a red flag that Something Is
> Wrong. This is kind of a crazy idea, but what if we threw it back the other
> way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
> 100 or something)? The theory is, the more impossibly low it is, the more
> wrong it is. I'm attracted to the idea of dealing with it as an estimation
> problem and not needing to know about join types. Might have unintended
> consequences, though.
> 
> Long term, it would be great to calculate something about the distribution
> of cardinality estimates, so we can model risk in the estimates.
> 

Hi,

Laurenz suggested clamping to 2 in this thread in 2017:

https://www.postgresql.org/message-id/1509611428.3268.5.camel%40cybertec.at

Having been the victim of this problem in the past, I like the risk
based approach to this. If the cost of being wrong in the estimate is
high, use a merge join instead. In every case that I have encountered,
that heuristic would have given the correct performant plan.

Regards,
Ken




Re: disfavoring unparameterized nested loops

2021-06-21 Thread John Naylor
On Tue, Jun 15, 2021 at 8:00 PM David Rowley  wrote:

> In my experience, the most common reason that the planner chooses
> non-parameterized nested loops wrongly is when there's row
> underestimation that says there's just going to be 1 row returned by
> some set of joins.  The problem often comes when some subsequent join
> is planned and the planner sees the given join rel only produces one
> row.  The cheapest join method we have to join 1 row is Nested Loop.
> So the planner just sticks the 1-row join rel on the outer side
> thinking the executor will only need to scan the inner side of the
> join once.  When the outer row count blows up, then we end up scanning
> that inner side many more times. The problem is compounded when you
> nest it a few joins deep
>
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row.   The other
> case is similar, but with join quals.

If an estimate is lower than 1, that should be a red flag that Something Is
Wrong. This is kind of a crazy idea, but what if we threw it back the other
way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
100 or something)? The theory is, the more impossibly low it is, the more
wrong it is. I'm attracted to the idea of dealing with it as an estimation
problem and not needing to know about join types. Might have unintended
consequences, though.
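
A minimal standalone sketch of that clamping idea (the bounds 2 and 100
are just the numbers floated above, not tested recommendations):

#include <stdio.h>

/*
 * If the raw estimate came out below 1 row, treat "how far below 1 it
 * went" as a hint of how wrong it probably is: bounce it back up to
 * 1/est, clamped into [2, 100].
 */
static double
clamp_suspicious_estimate(double est)
{
    double res;

    if (est >= 1.0)
        return est;             /* nothing suspicious */
    if (est <= 0.0)
        return 100.0;           /* paranoia: avoid dividing by zero */

    res = 1.0 / est;            /* the lower the estimate, the larger this gets */
    if (res < 2.0)
        res = 2.0;
    if (res > 100.0)
        res = 100.0;
    return res;
}

int
main(void)
{
    double samples[] = {0.5, 0.01, 0.0001, 3.0};

    for (int i = 0; i < 4; i++)
        printf("raw estimate %.4f -> clamped %.1f\n",
               samples[i], clamp_suspicious_estimate(samples[i]));
    return 0;
}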

Long term, it would be great to calculate something about the distribution
of cardinality estimates, so we can model risk in the estimates.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: disfavoring unparameterized nested loops

2021-06-21 Thread David Rowley
On Wed, 16 Jun 2021 at 15:08, Peter Geoghegan  wrote:
> It seems important to distinguish between risk and uncertainty --
> they're rather different things. The short version is that you can
> model risk but you cannot model uncertainty. This seems like a problem
> of uncertainty to me.

You might be right there.  "Uncertainty" or "Certainty" seems more
like a value that clauselist_selectivity() would be able to calculate
itself. It would just be up to the planner to determine what to do
with it.

One thing I thought about is that if the costing modal was able to
separate out a cost of additional (unexpected) rows then it would be
easier for add_path() to take into account how bad things might go if
we underestimate.

For example, an unparameterized Nested Loop whose outer Path is
estimated to return 1 row incurs an entire additional inner scan if
there are 2 rows.  With Hash Join the cost is just an additional
hashtable lookup, which is dead cheap.   I don't know exactly how
add_path() would weigh all that up, but it seems to me that I wouldn't
take the risk unless I was 100% certain that the Nested Loop's outer
Path would only return 1 row exactly, if there was any chance at all
it could return more, I'd be picking some other join method.
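
To make that asymmetry concrete, here is a small standalone C sketch of
how the two strategies degrade once the outer side returns more rows
than the single row that was estimated. The cost constants are made up
for illustration; this is not the planner's actual cost model:

#include <stdio.h>

#define INNER_SCAN_COST  1000.0   /* one full scan of the inner relation */
#define HASH_BUILD_COST  1100.0   /* one inner scan plus building the hash table */
#define HASH_PROBE_COST     0.01  /* one hashtable lookup per outer row */

/* Every outer row rescans the whole inner side. */
static double
nestloop_cost(double outer_rows)
{
    return outer_rows * INNER_SCAN_COST;
}

/* Inner side is scanned once; each outer row is a cheap probe. */
static double
hashjoin_cost(double outer_rows)
{
    return HASH_BUILD_COST + outer_rows * HASH_PROBE_COST;
}

int
main(void)
{
    double rows[] = {1.0, 2.0, 10.0, 1000.0};

    for (int i = 0; i < 4; i++)
        printf("outer rows %7.0f: nestloop %10.0f   hashjoin %10.2f\n",
               rows[i], nestloop_cost(rows[i]), hashjoin_cost(rows[i]));
    return 0;
}

Underestimating by even a single row already doubles the nested loop's
cost, while the hash join barely notices; that's the "dead cheap"
hashtable lookup above.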

David




Re: disfavoring unparameterized nested loops

2021-06-18 Thread John Naylor
On Fri, Jun 18, 2021 at 6:20 AM Ashutosh Bapat  wrote:

> I have not come across many papers which leverage this idea. Googling
> "selectivity estimation confidence interval", does not yield many
> papers. Although I found [1] to be using a similar idea. So maybe
> there's no merit in this idea, though theoretically it sounds fine
> to me.
>
>
> [1]
https://pi3.informatik.uni-mannheim.de/~moer/Publications/vldb18_smpl_synop.pdf

Well, that paper's title shows it's a bit too far forward for us, since we
don't use samples during plan time (although that's a separate topic worth
considering). From the references, however, this one gives some
mathematical framing of the problem that led to the thread subject,
although I haven't read enough to see if we can get practical advice from
it:

Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the
size of join results.
https://www.csd.uoc.gr/~hy460/pdf/p268-ioannidis.pdf

--
John Naylor
EDB: http://www.enterprisedb.com


Re: disfavoring unparameterized nested loops

2021-06-18 Thread Ashutosh Bapat
>
> The problem I have with this idea is that I really don't know how to
> properly calculate what the risk_factor should be set to.  It seems
> easy at first to set it to something that has the planner avoid these
> silly 1-row estimate nested loop mistakes, but I think what we'd set
> the risk_factor to would become much more important when more and more
> Path types start using it. So if we did this and just guessed the
> risk_factor, that might be fine when only 1 of the paths being
> compared had a non-zero risk_factor, but as soon as both paths have
> one set, unless they're set to something sensible, then we just end up
> comparing garbage costs to garbage costs.

Risk factor is the inverse of confidence in the estimate: the lower the
confidence, the higher the risk. If we associate confidence with the
selectivity estimate, or compute a confidence interval of the estimate
instead of a single number, we can associate a risk factor with each
estimate. When we combine estimates to calculate new estimates, we
also combine their confidences/confidence intervals. If my memory
serves well, confidence intervals/confidences are calculated based on
the sample size and method used for estimation, so we should be able
to compute those during ANALYZE.

I have not come across many papers which leverage this idea. Googling
"selectivity estimation confidence interval", does not yield many
papers. Although I found [1] to be using a similar idea. So maybe
there's no merit in this idea, though theoretically it sounds fine
to me.


[1] 
https://pi3.informatik.uni-mannheim.de/~moer/Publications/vldb18_smpl_synop.pdf
--
Best Wishes,
Ashutosh Bapat




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley  wrote:
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row.   The other
> case is similar, but with join quals.
>
> It seems to me that each time we multiply 2 selectivities together
> that the risk of the resulting selectivity being out increases.  The
> risk is likely lower when we have some extended statistics which
> allows us to do fewer selectivity multiplications.

It seems important to distinguish between risk and uncertainty --
they're rather different things. The short version is that you can
model risk but you cannot model uncertainty. This seems like a problem
of uncertainty to me.

The example from the paper that Robert cited isn't interesting to me
because it hints at a way of managing the uncertainty, exactly. It's
interesting because it seems to emphasize the user's exposure to the
problem, which is what really matters. Even if it was extremely
unlikely that the user would have a problem, the downside of being
wrong is still absurdly high, and the upside of being right is low
(possibly even indistinguishable from zero). It's just not worth
thinking about. Besides, we all know that selectivity estimates are
very often quite wrong without the user ever noticing. It's amazing
that database optimizers work as well as they do, really.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 6:15 PM David Rowley  wrote:
> On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan  wrote:
> > Whether or not we throw the plan back at the planner or "really change
> > our minds at execution time" seems like a distinction without a
> > difference.
>
> What is "really change our minds at execution time"?  Is that switch
> to another plan without consulting the planner?

I don't know what it means. That was my point -- it all seems like
semantics to me.

The strong separation between plan time and execution time isn't
necessarily a good thing, at least as far as solving some of the
thorniest problems goes. It seems obvious to me that cardinality
estimation is the main problem, and that the most promising solutions
are all fundamentally about using execution time information to change
course. Some problems with planning just can't be solved at plan time
-- no model can ever be smart enough. Better to focus on making query
execution more robust, perhaps by totally changing the plan when it is
clearly wrong. But also by using more techniques that we've
traditionally thought of as execution time techniques (e.g. role
reversal in hash join). The distinction is blurry to me.

There are no doubt practical software engineering issues with this --
separation of concerns and whatnot. But it seems premature to go into
that now.

> The new information might cause the join order to
> completely change. It might not be as simple as swapping a Nested Loop
> for a Hash Join.

I agree that it might not be that simple at all. I think that Robert
is saying that this is one case where it really does appear to be that
simple, and so we really can expect to benefit from a simple plan-time
heuristic that works within the confines of the current model. Why
wouldn't we just take that easy win, once the general idea has been
validated some more? Why let the perfect be the enemy of the good?

I have perhaps muddied the waters by wading into the more general
question of robust execution, the inherent uncertainty with
cardinality estimation, and so on. Robert really didn't seem to be
talking about that at all (though it is clearly related).

> > Either way we're changing our minds about the plan based
> > on information that is fundamentally execution time information, not
> > plan time information. Have I missed something?
>
> I don't really see why you think the number of rows that a given join
> might produce is execution information.

If we're 100% sure a join will produce at least n rows because we
executed it (with the express intention of actually doing real query
processing that returns rows to the client), and it already produced n
rows, then what else could it be called? Why isn't it that simple?

> It's exactly the sort of
> information the planner needs to make a good plan. It's just that
> today we get that information from statistics. Plenty of other DBMSs
> make decisions from sampling.

> FWIW, we do already have minimalist
> sampling in get_actual_variable_range().

I know, but that doesn't seem all that related -- it almost seems like
the opposite idea. It isn't the executor balking when it notices that
the plan is visibly wrong during execution, in some important way.
It's more like the planner using the executor to get information about
an index that is well within the scope of what we think of as plan
time.

To some degree the distinction gets really blurred due to nodes like
hash join, where some important individual decisions are delayed until
execution time already. It's really unclear when precisely it stops
being that, and starts being more of a case of either partially or
wholly replanning. I don't know how to talk about it without it being
confusing.

> I'm just trying to highlight that I don't think overloading nodes is a
> good path to go down.  It's not a sustainable practice. It's a path
> towards just having a single node that does everything. If your
> suggestion was not serious then there's no point in discussing it
> further.

As I said, it was a way of framing one particular issue that Robert is
concerned about.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan  wrote:
> Whether or not we throw the plan back at the planner or "really change
> our minds at execution time" seems like a distinction without a
> difference.

What is "really change our minds at execution time"?  Is that switch
to another plan without consulting the planner?  If so what decides
what that new plan should be? The planner is meant to be the expert in
that department. The new information might cause the join order to
completely change. It might not be as simple as swapping a Nested Loop
for a Hash Join.

> Either way we're changing our minds about the plan based
> on information that is fundamentally execution time information, not
> plan time information. Have I missed something?

I don't really see why you think the number of rows that a given join
might produce is execution information. It's exactly the sort of
information the planner needs to make a good plan. It's just that
today we get that information from statistics. Plenty of other DBMSs
make decisions from sampling. FWIW, we do already have minimalist
sampling in get_actual_variable_range().

I'm just trying to highlight that I don't think overloading nodes is a
good path to go down.  It's not a sustainable practice. It's a path
towards just having a single node that does everything. If your
suggestion was not serious then there's no point in discussing it
further.

David




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley  wrote:
> I don't really think we should solve this by having nodeNestloop.c
> fall back on hashing when the going gets tough.  Overloading our nodes
> that way is not a sustainable thing to do.  I'd rather see the
> executor throw the plan back at the planner along with some hints
> about what was wrong with it.  We could do that providing we've not
> sent anything back to the client yet.

It wasn't a serious suggestion -- it was just a way of framing the
issue at hand that I thought might be useful.

If we did have something like that (which FWIW I think makes sense but
is hard to do right in a general way) then it might be expected to
preemptively refuse to even start down the road of using an
unparameterized nestloop join very early, or even before execution
time. Such an adaptive join algorithm/node might be expected to have a
huge bias against this particular plan shape, that can be reduced to a
simple heuristic. But you can have the simple heuristic without
needing to build everything else.

Whether or not we throw the plan back at the planner or "really change
our minds at execution time" seems like a distinction without a
difference. Either way we're changing our minds about the plan based
on information that is fundamentally execution time information, not
plan time information. Have I missed something?

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 05:09, Robert Haas  wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

It's not great that people are having to use such blunt instruments to
get the planner to behave.  It might not be a terrible idea to provide
them with something a bit sharper as you suggest. The GUC idea is
certainly something that could be done without too much effort.

There was some talk of doing that in [1].

> Another approach
> would be to introduce a large fuzz factor for such nested loops e.g.
> keep them only if the cost estimate is better than the comparable hash
> join plan by at least a factor of N (or if no such plan is being
> generated for some reason).

In my experience, the most common reason that the planner chooses
non-parameterized nested loops wrongly is when there's row
underestimation that says there's just going to be 1 row returned by
some set of joins.  The problem often comes when some subsequent join
is planned and the planner sees the given join rel only produces one
row.  The cheapest join method we have to join 1 row is Nested Loop.
So the planner just sticks the 1-row join rel on the outer side
thinking the executor will only need to scan the inner side of the
join once.  When the outer row count blows up, then we end up scanning
that inner side many more times. The problem is compounded when you
nest it a few joins deep

Most of the time when I see that happen it's down to either the
selectivity of some correlated base-quals being multiplied down to a
number low enough that we clamp the estimate to be 1 row.   The other
case is similar, but with join quals.

It seems to me that each time we multiply 2 selectivities together
that the risk of the resulting selectivity being out increases.  The
risk is likely lower when we have some extended statistics which
allows us to do fewer selectivity multiplications.

For that 1-row case, doing an unparameterized nested loop is only the
cheapest join method by a tiny amount.  It really wouldn't be much
more expensive to just put that single row into a hash table.  If that
1 estimated row turns out to be 10 actual rows then it's likely not
too big a deal for the hash join code to accommodate the 9 additional
rows.

This makes me think that it's insanely risky for the planner to be
picking Nested Loop joins in this case. And that puts me down the path
of thinking that this problem should be solved by having the planner
take into account risk as well as costs when generating plans.

I don't really have any concrete ideas on that, but a basic idea that
I have been considering is that a Path has a risk_factor field that is
decided somewhere like clauselist_selectivity(). Maybe the risk can go
up by 1 each time we multiply an individual selectivity. (As of
master, estimate_num_groups() allows the estimation to pass back some
further information to the caller. I added that for Result Cache so I
could allow the caller to get visibility about when the estimate fell
back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get
similar treatment to allow the risk_factor or number of nstats_used to
be passed back).  We'd then add a GUC, something like
planner_risk_adversion, which could be set to anything from 0.0 to some
positive number. During add_path() we could do the cost comparison
like:  path1.cost * (1.0 + path1.risk_factor * planner_risk_adversion)
< path2.cost * (1.0 + path2.risk_factor * planner_risk_adversion).
That way, if you set planner_risk_adversion to 0.0, then the planner
does as it does today, i.e. takes big risks.
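
Purely as a sketch of the shape of that comparison (the struct, the GUC
default and the numbers below are all invented, and it ignores
everything else add_path() really has to care about, such as startup
cost and pathkeys), it might look like:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the real planner structures and GUC. */
typedef struct Path
{
    double      cost;           /* total_cost in the real Path */
    int         risk_factor;    /* +1 per "blind" selectivity multiplication */
} Path;

static double planner_risk_adversion = 0.5; /* 0.0 keeps today's behaviour */

/*
 * Risk-adjusted "is path1 cheaper" test: each path's raw cost is
 * inflated in proportion to how much guessing went into its estimates.
 * With planner_risk_adversion = 0.0 this degenerates to the plain cost
 * comparison we do today.
 */
static bool
path_cheaper(const Path *p1, const Path *p2)
{
    double      c1 = p1->cost * (1.0 + p1->risk_factor * planner_risk_adversion);
    double      c2 = p2->cost * (1.0 + p2->risk_factor * planner_risk_adversion);

    return c1 < c2;
}

int
main(void)
{
    /* Unparameterized nestloop: marginally cheaper, built on guesses. */
    Path        nestloop = {.cost = 100000.0, .risk_factor = 3};
    /* Hash join: slightly dearer on paper, estimates backed by stats. */
    Path        hashjoin = {.cost = 100200.0, .risk_factor = 0};

    printf("adversion %.1f: keep %s\n", planner_risk_adversion,
           path_cheaper(&nestloop, &hashjoin) ? "nestloop" : "hashjoin");

    planner_risk_adversion = 0.0;   /* today's behaviour */
    printf("adversion %.1f: keep %s\n", planner_risk_adversion,
           path_cheaper(&nestloop, &hashjoin) ? "nestloop" : "hashjoin");
    return 0;
}

With the GUC at 0.0 the risk term vanishes and the marginally cheaper
but guess-laden nested loop still wins; even a small positive setting
lets the better-grounded hash join path survive instead.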

The problem I have with this idea is that I really don't know how to
properly calculate what the risk_factor should be set to.  It seems
easy at first to set it to something that has the planner avoid these
silly 1-row estimate nested loop mistakes, but I think what we'd set
the risk_factor to would become much more important once more and more
Path types start using it. So if we did this and just guessed the
risk_factor, that might be fine while only one of the paths being
compared had a non-zero risk_factor; but as soon as both paths have one
set, then unless the values are actually sensible we just end up
comparing garbage costs to garbage costs.

Another common mistake the planner makes is around WHERE a = <value>
ORDER BY b LIMIT n; where there are separate indexes on (a) and (b).
Scanning the (b) index is pretty risky. All the "a" values you need
might be right at the end of the index. It seems safer to scan the (a)
index, as we'd likely have statistics that tell us how many rows exist
with that value. We don't have any stats that tell us where in the (b)
index all the rows with a = <value> are.

I don't really think we should solve this by having nodeNestloop.c

Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 12:31 PM Robert Haas  wrote:
> Yes, I think it is. Reading the paper really helped me crystallize my
> thoughts about this, because when I've studied the problems myself, I
> came, as you postulate here, to the conclusion that there's a lot of
> stuff the planner does where there is risk and uncertainty, and thus
> that a general framework would be necessary to deal with it.

It is an example (perhaps the only example in the optimizer) of an
oasis of certainty in an ocean of uncertainty. As uncertain as
everything is, we seemingly can make strong robust statements about
the relative merits of each strategy *in general*, just in this
particular instance. It's just not reasonable to make such a reckless
choice, no matter what your general risk tolerance is.

Goetz Graefe is interviewed here, and goes into his philosophy on
robustness -- it seems really interesting to me:

https://sigmodrecord.org/publications/sigmodRecord/2009/pdfs/05_Profiles_Graefe.pdf

> In defense of that approach, note that this is a
> case where we know both that the Nested Loop is risky and that Hash
> Join is a similar alternative with probably similar cost. I am not
> sure there are any other cases where we can say quite so generally
> both that a certain thing is risky and what we could do instead.

I tend to think of a hash join as a nested loop join with an inner
index scan where you build the index yourself, dynamically. That might
be why I find it easy to make this mental leap. In theory you could do
this by giving the nestloop join runtime smarts -- make it turn into a
hash join adaptively, like Graefe's G-Join design. That way you could
do this in a theoretically pure way.
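
To spell out that mental model (this is a throwaway toy, nothing like
nodeHashjoin.c; the arrays, names and bucket count are all invented),
an equi-join written this way is literally a nested loop whose inner
"index" is a hash table built on the fly:

#include <stdio.h>

#define NBUCKETS 64

typedef struct HashEntry
{
    int         value;
    struct HashEntry *next;
} HashEntry;

int
main(void)
{
    int         inner[] = {42, 7, 13, 42};
    int         outer[] = {1, 42, 7, 99, 42};
    HashEntry   pool[4];
    HashEntry  *buckets[NBUCKETS] = {0};

    /* "Build the index yourself, dynamically": hash the inner side once. */
    for (int i = 0; i < 4; i++)
    {
        int         b = (unsigned) inner[i] % NBUCKETS;

        pool[i].value = inner[i];
        pool[i].next = buckets[b];
        buckets[b] = &pool[i];
    }

    /* The "nested loop": one probe of the self-built index per outer row. */
    for (int i = 0; i < 5; i++)
    {
        for (HashEntry *e = buckets[(unsigned) outer[i] % NBUCKETS];
             e != NULL; e = e->next)
        {
            if (e->value == outer[i])
                printf("match: %d\n", outer[i]);
        }
    }
    return 0;
}

The probe loop only becomes different from an inner index scan once you
ask where the index came from, which is why adaptively morphing one
join method into the other is less outlandish than it might sound.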

I don't think that that's actually necessary just to deal with this
case -- it probably really is as simple as it seems. I point this out
because perhaps it's useful to have that theoretical anchoring.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Robert Haas
On Tue, Jun 15, 2021 at 2:04 PM Peter Geoghegan  wrote:
> I guess that there is a hesitation to not introduce heuristics like
> this because it doesn't fit into some larger framework that captures
> risk -- it might be seen as an ugly special case. But isn't this
> already actually kind of special, whether or not we officially think
> so?

Yes, I think it is. Reading the paper really helped me crystallize my
thoughts about this, because when I've studied the problems myself, I
came, as you postulate here, to the conclusion that there's a lot of
stuff the planner does where there is risk and uncertainty, and thus
that a general framework would be necessary to deal with it. But the
fact that an academic researcher called this problem out as the only
one worth treating specially makes me think that perhaps it deserves
special handling. In defense of that approach, note that this is a
case where we know both that the Nested Loop is risky and that Hash
Join is a similar alternative with probably similar cost. I am not
sure there are any other cases where we can say quite so generally
both that a certain thing is risky and what we could do instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 10:09 AM Robert Haas  wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

I think that it's probably a good idea as-is.

Simple heuristics that are very frequently wrong when considered in a
naive way can work very well in practice. This seems to happen when
they capture some kind of extreme naturally occurring cost/benefit
asymmetry -- especially one with fixed, well-understood costs and
unlimited benefits (this business with unparameterized nestloop joins
is about *avoiding* the inverse asymmetry, but that seems very
similar). My go-to example of such an asymmetry is the rightmost page
split heuristic of applying leaf fillfactor regardless of any of the
other specifics; we effectively assume that all indexes are on columns
with ever-increasing values, which is obviously wrong.

We're choosing between two alternatives (unparameterized nested loop
vs hash join) that are really very similar when things go as expected,
but diverge sharply when there is a misestimation -- who wouldn't take
the "conservative" choice here?

I guess that there is a hesitation to not introduce heuristics like
this because it doesn't fit into some larger framework that captures
risk -- it might be seen as an ugly special case. But isn't this
already actually kind of special, whether or not we officially think
so?

-- 
Peter Geoghegan