Re: disfavoring unparameterized nested loops

2023-09-20 Thread Lepikhov Andrei



On Wed, Sep 20, 2023, at 4:49 PM, David Rowley wrote:
> On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
>  wrote:
>> What would you say about a different approach: a hybrid join? In MS SQL Server,
>> they have such a feature [1], and, according to their description, it
>> requires low overhead. They start with a HashJoin and switch to a NestLoop
>> if the inner input turns out to contain too few tuples. That solves the issue, doesn't it?
>
> A complexity which you may not be considering here is that Nested Loop
> joins always preserve the tuple order from the outer side of the join,
> whereas hash joins will not do this when multi-batching.

My idea here is the same as what the MS SQL folks did: prefetch some predefined 
number of tuples from the HashJoin's inner side and, if the planner has made a 
mistake and overestimated that side's cardinality, use the already-fetched inner 
rows as the outer side of a NestLoop instead.
The opposite strategy, handling underestimation by starting with a NestLoop and 
switching to a HashJoin, looks more difficult, but the main question is: is it 
worthwhile to research?
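
For illustration, a minimal sketch of that buffer-then-switch logic in plain 
Python (the names are made up: index_lookup stands in for a parameterized index 
probe into the other relation, and the threshold is an arbitrary knob -- this is 
not how SQL Server or any PostgreSQL patch actually implements it):

def adaptive_join(inner_scan, outer_scan, index_lookup, join_key, threshold=64):
    inner_iter = iter(inner_scan)
    buffered = []
    for row in inner_iter:
        buffered.append(row)
        if len(buffered) > threshold:
            break
    if len(buffered) <= threshold:
        # Planner overestimated the inner side: the whole input fits in the
        # buffer, so use it as the outer side of an index nested loop and
        # probe the other relation directly.
        for inner_row in buffered:
            for outer_row in index_lookup(inner_row[join_key]):
                yield inner_row, outer_row
    else:
        # The inner side really is large: finish reading it into a hash
        # table and run an ordinary hash join against the outer input.
        table = {}
        for row in buffered:
            table.setdefault(row[join_key], []).append(row)
        for row in inner_iter:
            table.setdefault(row[join_key], []).append(row)
        for outer_row in outer_scan:
            for inner_row in table.get(outer_row[join_key], []):
                yield inner_row, outer_row

Either way the buffered rows are reused, so the price of guessing wrong is 
bounded by the threshold.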

> I've no idea how the SQL Server engineers solved that.

>> [1]
>> https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

-- 
Regards,
Andrei Lepikhov




Re: disfavoring unparameterized nested loops

2023-09-20 Thread David Rowley
On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
 wrote:
> What would you say about a different approach: a hybrid join? In MS SQL Server,
> they have such a feature [1], and, according to their description, it
> requires low overhead. They start with a HashJoin and switch to a NestLoop
> if the inner input turns out to contain too few tuples. That solves the issue, doesn't it?

A complexity which you may not be considering here is that Nested Loop
joins always preserve the tuple order from the outer side of the join,
whereas hash joins will not do this when multi-batching.

I've no idea how the SQL Server engineers solved that.

David

> [1]
> https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411




Re: disfavoring unparameterized nested loops

2023-09-20 Thread Andrey Lepikhov

On 29/9/2022 21:32, Benjamin Coutu wrote:

> I'd like to revamp this important discussion.
>
> [...]
>
> What are your thoughts on this simple idea - is it perhaps too simple?

In my practice, a parameterized nested loop (PNL) reduces execution time, 
sometimes drastically. If your query touches a lot of tables but extracts only 
a tiny part of the data, and you have good index coverage, PNL works great.
Moreover, I have been pondering extending parameterization through subqueries 
and groupings.


What would you say about a different approach: a hybrid join? In MS SQL Server, 
they have such a feature [1], and, according to their description, it 
requires low overhead. They start with a HashJoin and switch to a NestLoop 
if the inner input turns out to contain too few tuples. That solves the issue, 
doesn't it?


[1] 
https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411


--
regards,
Andrey Lepikhov
Postgres Professional





Re: disfavoring unparameterized nested loops

2022-10-02 Thread Peter Geoghegan
On Fri, Sep 30, 2022 at 3:19 PM Benjamin Coutu  wrote:
> > For all I know you might be onto something. But it really seems
> > independent to me.
>
> Yeah, I'm sorry if I hijacked this thread for something related but
> technically different.

I certainly wouldn't say that you hijacked the thread. I'm glad that
you revived the discussion, in fact.

The way that I've framed the problem is probably at least somewhat
controversial. In fact I'm almost certain that at least one or two
people will flat out disagree with me. But even if everybody else
already thought about unparameterized nested loop joins in the same
terms, it might still be useful to make the arguments that you've
made.

What I'm saying is that the probability of "getting it right" is
virtually irrelevant in the case of these unparameterized nested loop
join plans specifically. Any probability that's less than 1.0 is
already unacceptable, more or less. A probability of 1.0 is never
attainable in the real world, no matter what, so why should the true
probability (whatever that means) matter at all? The appropriate
course of action will still be "just don't do that, ever".

To me this dynamic seems qualitatively different to other cases, where
we might want to give some weight to uncertainty. Understanding where
the boundaries lie between those trickier cases and this simpler case
seems important and relevant to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-10-02 Thread Peter Geoghegan
On Sun, Oct 2, 2022 at 3:43 AM Robert Haas  wrote:
> On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan  wrote:
> > It's not just that the risks are ludicrously high, of course. The
> > potential benefits must *also* be very low. It's both factors,
> > together.
>
> Hmm, maybe. But it also wouldn't surprise me very much if someone can
> come up with a test case where a nested loop with a single row (or
> maybe no rows) on one side or the other is significantly faster
> than any alternative plan.

That's certainly possible, but wouldn't the difference all come from
fixed startup costs? If we're talking about a single row, with a
minimal test case, then the potential downside of this more
conservative strategy might indeed amount to something like a 2x or 3x
slowdown, if we look at it in isolation. But why measure it that way?
I think that absolute differences like milliseconds of execution time
are much more relevant.

Real production databases have many queries with very diverse
characteristics -- there is a lot going on at any given moment. The
proportion of queries that will be affected either way by avoiding
unparameterized nested loop joins is probably going to be very small.
Nobody is going to notice if only a small subset of all queries are
maybe 1 ms or 2 ms slower. As long as it's well within the margin of
noise in 100% of all cases, it really shouldn't matter.

AFAICT the choice is one of "bounded, low upside versus unbounded,
high downside".

> I believe, though, that even if such cases
> exist, they are probably relatively few in number compared to the
> cases where unparameterized nested loops hurt, and the savings are
> probably small compared to the multiple-orders-of-magnitude slowdowns
> that you can get when a nested loop goes bad. But they might still be
> relatively large -- 2x, 3x? -- in absolute terms.

I suspect it won't even matter if disallowing unparameterized nested
loop joins loses on average.

I am reminded of this:
https://en.wikipedia.org/wiki/St._Petersburg_paradox#Expected_utility_theory

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-10-02 Thread Robert Haas
On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan  wrote:
> It's not just that the risks are ludicrously high, of course. The
> potential benefits must *also* be very low. It's both factors,
> together.

Hmm, maybe. But it also wouldn't surprise me very much if someone can
come up with a test case where a nested loop with a single row (or
maybe no rows) on one side or the other is significantly faster
than any alternative plan. I believe, though, that even if such cases
exist, they are probably relatively few in number compared to the
cases where unparameterized nested loops hurt, and the savings are
probably small compared to the multiple-orders-of-magnitude slowdowns
that you can get when a nested loop goes bad. But they might still be
relatively large -- 2x, 3x? -- in absolute terms.

In the prior discussion, the only person who showed a case in which he
thought that an unparameterized nested loop might be a clear winner
was Tom, but it was just sort of a general "this kind of case might be
a problem" thing rather than a fully worked out example with real
timings. Perhaps someone ought to try to characterize the kinds of
cases he mentioned, to help us get a clearer feeling about what, if
anything, we're gaining from the current scheme.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Sure, but the model isn't the problem here, really -- not to me. The
> problem is that the planner can in some cases choose a plan that is
> inherently unreasonable, at least in practical terms. You're talking
> about uncertainties. But I'm actually talking about the opposite thing
> -- certainty (albeit a limited kind of certainty that applies only to
> one narrow set of conditions).

I absolutely agree and support your proposal to simply not generate those paths 
at all unless necessary.

> For all I know you might be onto something. But it really seems
> independent to me.

Yeah, I'm sorry if I hijacked this thread for something related but 
technically different. I just wanted to expand on your proposal by taking into 
account the join depth and also by not just talking about unparameterized nested 
loop joins. The research is very clear that the uncertainty is proportional to 
the join level, and that is the point I am trying to focus the discussion on.

I really encourage everyone to read the VLDB paper. BTW, the unnamed 
proprietary DBMSs in that paper are the 3 big ones from Washington, California 
and NY, in that order.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 11:44 PM Benjamin Coutu  wrote:
> I think these things are orthogonal.

I agree that they're orthogonal. I just meant that execution time
strategies seem underexplored in general.

> No matter how good the cost model ever gets, we will always have degenerate 
> cases.

Sure, but the model isn't the problem here, really -- not to me. The
problem is that the planner can in some cases choose a plan that is
inherently unreasonable, at least in practical terms. You're talking
about uncertainties. But I'm actually talking about the opposite thing
-- certainty (albeit a limited kind of certainty that applies only to
one narrow set of conditions).

It's theoretically possible that bogosort will be faster than
quicksort in some individual cases. After all, bogosort is O(n) in the
best case, which is impossible to beat! But there is no practical
sense in which bogosort could ever be better than quicksort. Having
fewer choices by just not offering inherently bad choices seems quite
unrelated to what you're talking about.

For all I know you might be onto something. But it really seems
independent to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 9:00 PM David Rowley  wrote:
> I understand that what you propose would be a fast way to fix this
> issue. However, if we went and changed the join path creation code to
> not add non-parameterised nested loop paths when other paths exist,
> then how could we ever dare to put that code back again when we come
> up with a better solution?

But why would it matter, even then?

I don't deny that something like that could make sense, but I don't
see why it should be in tension with this proposal. We're talking
about a plan shape that is (in practical terms) inherently
unreasonable, given the availability of an alternative plan shape. Why
wouldn't that continue to be true in every such case, forever?

To put it another way, the proposal seems like taking away something
that we don't want to have, ever. It seems like a subtractive thing to
me. The potential upside of allowing unparameterized nestloop joins
seems infinitesimal; zero for all practical purposes. So even with a
far more sophisticated framework for "plan riskiness" in place, it
would still make sense to treat unparameterized nestloop joins as
inherently undesirable. There is perhaps a theoretical sense in which
that isn't quite true, but it's true for all practical purposes, which
should be enough.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Peter Geoghegan
On Fri, Sep 30, 2022 at 10:43 AM Robert Haas  wrote:
> But it's unnecessary to do this computation at runtime for every separate
> unparameterized nested loop: we can do it right here, in a generic
> way, for every such loop.

It's not just that the risks are ludicrously high, of course. The
potential benefits must *also* be very low. It's both factors,
together.

> Now, it's true that the actual risk depends on how certain we are of
> the estimates for the input rels, but that's difficult to quantify and
> I'm not convinced it really matters at all.

We're talking about a problem that is fairly unlikely to occur in
general, I think -- let's not forget that. These are presumably rare
events that nevertheless cause many real practical problems.

If we're going to add error bars, why wouldn't we also need error bars
for our error bars?

> I think we're kind of just making life complicated for ourselves here
> by pretending that unparameterized nested loops are part of some
> general class of uncertainty problems that we need to worry about. In
> some sense, they are, and David Rowley is right to mention the other
> one that comes up pretty frequently. But like that list of two is
> pretty much the whole list. I think we've talked ourselves into
> believing that this problem is much harder than it really is.

+1

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Robert Haas
On Thu, Sep 29, 2022 at 7:46 PM Tom Lane  wrote:
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

I think that the point of the paper which started the thread that this
discussion branched from was essentially that trying to add such a
policy with no data to go on worked extremely well in practice, and
other database systems are already doing it, and we're hurting
ourselves by not doing it. And specifically what they did was to
disfavor unparameterized nested loops.

And I think that actually makes a lot of sense. If you think through
parameterized nested loops, unparameterized nested loops, hash joins,
and merge joins, which is basically all the strategies, and you
imagine having many more or fewer rows on one side of the join or the
other than you thought, unparameterized nested loops are the standout
case. It's the only kind of join where the expense grows as the
product of the sizes of the two inputs. The other cases tend to be
more like O(n+m) rather than O(nm), and even if there are some lg n or
lg m terms in there too they don't tend to make a whole lot of
difference in practice.

So I think what you would find if you did all of this analysis is
that, basically, every time you did the cost computation for a
parameterized nested loop, a hash join, or a merge join, the error
bars would be whatever they were, and then when you did a cost
computation for an unparameterized nested loop, the error bars would
be way worse. Like, if we assume that the estimates for each side of a
hash join are off by 100x, then the cost will be off by roughly 100x
if it still fits in work_mem and by several hundred x if it now spills
to disk. But if we assume the same thing for an unparameterized nested
loop, the cost is now off by roughly 10,000x. And that is massively, massively
more, so clearly we should avoid the unparameterized nested loop. But
it's unnecessary to do this computation at runtime for every separate
unparameterized nested loop: we can do it right here, in a generic
way, for every such loop.
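
As a back-of-the-envelope check of that arithmetic (illustrative row counts 
only; real cost functions are of course more elaborate than n + m and n * m):

est_n = est_m = 1_000                    # planner's row estimates for both inputs
act_n = act_m = est_n * 100              # reality: each side underestimated 100x

hash_like_error = (act_n + act_m) / (est_n + est_m)   # O(n + m) work: ~100x off
nestloop_error = (act_n * act_m) / (est_n * est_m)    # O(n * m) work: 10,000x off
print(hash_like_error, nestloop_error)                # 100.0 10000.0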

Now, it's true that the actual risk depends on how certain we are of
the estimates for the input rels, but that's difficult to quantify and
I'm not convinced it really matters at all. Like, if the input is a
base relation with no filter condition, then the error should be
small, unless the table size changes a lot between planning and
execution. If it's the output of a user-defined aggregate, the error
could be really, really large. But that has no impact on the
*relative* dangers of the unparameterized nested loop vs. some other
join method. If join A is between two input rels whose sizes are
probably known fairly precisely, and join B is between two input rels
whose sizes we might be drastically wrong about, then a hash join is
riskier for join B than it is for join A. But that really does not
matter. What *does* matter is that an unparameterized nested loop is
riskier for join A than a hash join is for join A; and likewise for
join B.

I think we're kind of just making life complicated for ourselves here
by pretending that unparameterized nested loops are part of some
general class of uncertainty problems that we need to worry about. In
some sense, they are, and David Rowley is right to mention the other
one that comes up pretty frequently. But like that list of two is
pretty much the whole list. I think we've talked ourselves into
believing that this problem is much harder than it really is. Maybe a
blanket ban on unparameterized nested loops is too strong (or maybe
it's exactly the right thing) but it can't possibly be wrong to think
about that case in particular as something we need to solve. It's the
only join method that can go quadratic in easy, realistic scenarios --
and it often does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Actually, if we wanted to improve things in this area, we should have a
> set of queries that don't choose optimal plans that we can test with.  We used
> to see them a lot before we had extended statistics, but I don't
> remember seeing many recently, let alone a collection of them.  I guess
> that is good.

In the VLDB paper they actually created their own "Join Order Benchmark", which 
is publicly available at https://github.com/gregrahn/join-order-benchmark.
It would probably be better suited to this kind of testing than, e.g., the TPC 
benchmarks.

If there is interest, I could also compile a set of relevant cases based on the 
message history of the performance mailing list.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

Error bars would be fantastic, no question. But that would make things very 
complex.
A lot of judgment calls would be necessary for the policy behind upper-bound 
pruning, picking up on Peter's comment about "conviction multiplier of 
conviction multiplier" ;)
Also, the math in deriving those bounds based on the stats and how they 
propagate up the join tree doesn't seem trivial either.

> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

As in my reply to Peter, I think the join level/depth metric is a simple but 
principled way of dealing with it, given the referenced research.
In the first step, we'd use this merely to be more risk-averse towards nested 
loop joins as we climb up the join tree - we are not fiddling with the cost 
model itself, nor the join ordering, just when it comes to considering that 
particular join algorithm. Later this could be expanded to be more broadly 
scoped.

Please don't give up on a simple way to reap most of the fruits just yet.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu


> In general I suspect that we'd be better off focussing on mitigating
> the impact at execution time. There are at least a few things that we
> could do there, at least in theory. Mostly very ambitious, long term
> things.

I think these things are orthogonal. No matter how good the cost model ever 
gets, we will always have degenerate cases.
Having some smarts about that in the executor is surely a good thing, but it 
shouldn't distract us from improving on the planner front.

> 
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I agree with you, that those plans are too risky. But let's maybe find a more 
general way of dealing with this.

> Right. Though I am actually sympathetic to the idea that users might
> gladly pay a cost for performance stability -- even a fairly large
> cost. That part doesn't seem like the problem.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Right. But that seems fraught with difficulty. I suspect that the
> costs that the planner attributes to each plan often aren't very
> reliable in any absolute sense, even when everything is working very
> well by every available metric. Even a very noisy cost model with
> somewhat inaccurate selectivity estimates will often pick the cheapest
> plan, or close enough.

Sure, the absolute cost of a complex plan will always be inaccurate at best.
My point is that we can be very confident in the cardinalities of base tables. 
As the paper states in "3.1. Estimates for Base Tables":

"The median q-error is close to the optimal value of 1 for all systems,
indicating that the majority of all selections are estimated correctly."

Thanks to the statistics, we will practically never be off by an order of magnitude 
when estimating base table cardinalities.

The paper also clearly shows (and that certainly coincides with my experience) 
that those cardinality underestimations grow exponentially as they propagate up 
the join tree.

Given the research, I'd stipulate that at any given level of the join tree, the 
current depth is a reasonable indicator of underestimation. Taking that into 
account (even if only to mitigate nested loops at higher levels) is IMV a 
principled approach, and not necessarily a hack.

Obviously having something like error bars as proposed by Tom would be even 
better and perhaps more general, but that is on a whole different level in 
terms of complexity and I certainly have no idea how we would easily get there.




Re: disfavoring unparameterized nested loops

2022-09-29 Thread David Rowley
On Fri, 30 Sept 2022 at 13:06, Peter Geoghegan  wrote:
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I'm not sure if it's a good idea to assume that performing
non-parameterised Nested Loops when we shouldn't is the only shape of
plan that causes us problems.

We also have the case where we assume early start-up plans are
favourable.  For example:

SELECT * FROM t WHERE a = 1 ORDER BY b LIMIT 10;

where we have two indexes, one on t(a) and another on t(b).

Should we use the t(b) index and filter out the rows that don't match
a = 1 and hope we get 10 a=1 rows soon in the t(b) index? or do we use
t(a) and then perform a sort? Best case for using the t(b) index is
that we find 10 a=1 rows in the first 10 rows of the index scan, the
worst case is that there are no rows with a=1.
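
To put rough numbers on that risk, here is a small sketch which assumes the
a = 1 rows are spread uniformly through the t(b) index order -- which is
exactly the assumption that can go wrong:

def rows_scanned_via_b_index(limit, selectivity_a, table_rows):
    # Expected t(b) index entries read before collecting `limit` rows with
    # a = 1, capped at a full index scan when there aren't enough matches.
    if selectivity_a <= 0:
        return table_rows
    return min(table_rows, limit / selectivity_a)

print(rows_scanned_via_b_index(10, 0.1, 1_000_000))       # ~100 rows: great
print(rows_scanned_via_b_index(10, 0.000001, 1_000_000))  # whole index: terrible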

Having something coded into the cost model is a more generic way of
addressing this issue.  Provided we design the cost model correctly,
we'd be able to address future issues we discover using whichever
cost-model infrastructure we design for this.

I understand that what you propose would be a fast way to fix this
issue. However, if we went and changed the join path creation code to
not add non-parameterised nested loop paths when other paths exist,
then how could we ever dare to put that code back again when we come
up with a better solution?

David




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 07:51:47PM -0400, Bruce Momjian wrote:
> On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:
> > Agreed, but dealing with uncertainty in those numbers is an enormous
> > task if you want to do it right.  "Doing it right", IMV, would start
> > out by extending all the selectivity estimation functions to include
> > error bars; then we could have error bars on rowcount estimates and
> > then costs; then we could start adding policies about avoiding plans
> > with too large a possible upper-bound cost.  Trying to add such
> > policy with no data to go on is not going to work well.
> > 
> > I think Peter's point is that a quick-n-dirty patch is likely to make
> > as many cases worse as it makes better.  That's certainly my opinion
> > about the topic.
> 
> Agreed on all points --- I was thinking error bars too.

Actually, if we wanted to improve things in this area, we should have a
set of queries that don't choose optimal plans that we can test with.  We used
to see them a lot before we had extended statistics, but I don't
remember seeing many recently, let alone a collection of them.  I guess
that is good.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 4:46 PM Tom Lane  wrote:
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

In general I suspect that we'd be better off focussing on mitigating
the impact at execution time. There are at least a few things that we
could do there, at least in theory. Mostly very ambitious, long term
things.

I like the idea of just avoiding unparameterized nested loop joins
altogether when an "equivalent" hash join plan is available because
it's akin to an execution-time mitigation, despite the fact that it
happens during planning. While it doesn't actually change anything in
the executor, it is built on the observation that we have virtually
everything to gain and nothing to lose during execution, no matter
what happens.

It seems like a very small oasis of certainty in a desert of
uncertainty -- which seems nice, as far as it goes.

> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

Right. Though I am actually sympathetic to the idea that users might
gladly pay a cost for performance stability -- even a fairly large
cost. That part doesn't seem like the problem.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I think the point the original poster was making, and I have made in the
> > past, is that even if two optimizer costs are the same, one might be
> > more penalized by misestimation than the other, and we don't have a good
> > way of figuring that into our plan choices.
> 
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.
> 
> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

Agreed on all points --- I was thinking error bars too.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson






Re: disfavoring unparameterized nested loops

2022-09-29 Thread Tom Lane
Bruce Momjian  writes:
> I think the point the original poster was making, and I have made in the
> past, is that even if two optimizer costs are the same, one might be
> more penalized by misestimation than the other, and we don't have a good
> way of figuring that into our plan choices.

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right.  "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost.  Trying to add such
policy with no data to go on is not going to work well.

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better.  That's certainly my opinion
about the topic.

regards, tom lane




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 4:27 PM Bruce Momjian  wrote:
> I think the point the original poster was making, and I have made in the
> past, is that even if two optimizer costs are the same, one might be
> more penalized by misestimation than the other, and we don't have a good
> way of figuring that into our plan choices.

Right. But that seems fraught with difficulty. I suspect that the
costs that the planner attributes to each plan often aren't very
reliable in any absolute sense, even when everything is working very
well by every available metric. Even a very noisy cost model with
somewhat inaccurate selectivity estimates will often pick the cheapest
plan, or close enough.

Having a cost-based optimizer that determines the cheapest plan quite
reliably is one thing. Taking the same underlying information and
adding the dimension of risk to it and expecting a useful result is
quite another -- that seems far harder.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 04:12:06PM -0700, Peter Geoghegan wrote:
> Offhand I'd say it's more likely to be too complicated. Without
> meaning to sound glib, the first question that occurs to me is "will
> we also need a conviction multiplier conviction multiplier?". Anything
> like that is going to have unintended consequences that might very
> well be much worse than the problem that you set out to solve.
> 
> Personally I still like the idea of just avoiding unparameterized
> nested loop joins altogether when an "equivalent" hash join plan is
> available. I think of it as preferring the hash join plan because it
> will have virtually the same performance characteristics when you have
> a good cardinality estimate (probably very often), but far better
> performance characteristics when you don't. We can perhaps be
> approximately 100% sure that something like that will be true in all
> cases, no matter the details. That seems like a very different concept
> to what you've proposed.

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 7:32 AM Benjamin Coutu  wrote:
> Also, we can expand the multiplier whenever we fall back to using the default 
> cardinality constant as surely all bets are off at that point - we should 
> definitely treat nested loop joins as out of favor in this instance and that 
> could easily be incorporated by simply increasing the conviction-multiplier.
>
> What are your thoughts on this simple idea - is it perhaps too simple?

Offhand I'd say it's more likely to be too complicated. Without
meaning to sound glib, the first question that occurs to me is "will
we also need a conviction multiplier conviction multiplier?". Anything
like that is going to have unintended consequences that might very
well be much worse than the problem that you set out to solve.

Personally I still like the idea of just avoiding unparameterized
nested loop joins altogether when an "equivalent" hash join plan is
available. I think of it as preferring the hash join plan because it
will have virtually the same performance characteristics when you have
a good cardinality estimate (probably very often), but far better
performance characteristics when you don't. We can perhaps be
approximately 100% sure that something like that will be true in all
cases, no matter the details. That seems like a very different concept
to what you've proposed.

--
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu
I'd like to revamp this important discussion.

As is well described in this fairly recent paper here 
https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres) 
"estimation errors quickly grow as the number of joins increases, and that 
these errors are usually the reason for bad plans" - I think we can all get 
behind that statement.

While nested loop joins work great when cardinality estimates are correct, they 
are notoriously bad when the optimizer underestimates and they degrade very 
fast in such cases - the good vs. bad here is very asymmetric. On the other 
hand, hash joins degrade much more gracefully - they are considered very robust 
against underestimation. The above-mentioned paper illustrates that all major 
DBMSs (including Postgres) tend to underestimate, and usually that 
underestimation increases drastically with the number of joins (see Figures 3+4 
of the paper).

Now, a simple approach to guarding against bad plans that arise from 
underestimation could be to use what I would call a 
nested-loop-conviction-multiplier based on the current depth of the join tree, 
e.g. for a base table that multiplier would obviously be 1, but would then grow 
(e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew 
the cardinality estimates themselves, but rather be applied to the overall 
nested loop join cost at each particular stage of the plan when comparing it to 
other more robust join strategies like hash or sort-merge joins. That way, when 
we can be sure we have a good estimate at the bottom of the join tree, we treat 
all things equal, but favor nested loops less and less as we move up the join 
tree for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default 
cardinality constant, as surely all bets are off at that point - we should 
definitely treat nested loop joins as out of favor in this instance, and that 
could easily be incorporated by simply increasing the conviction-multiplier.
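
To make the comparison concrete, here is a toy model (the function names, the 
quadratic shape and the extra penalty factor for default estimates are all 
placeholders, not a worked-out policy):

def conviction_multiplier(join_depth, used_default_estimate=False):
    m = (join_depth + 1) ** 2      # 1 for base relations, then 4, 9, 16, ...
    if used_default_estimate:
        m *= 10                    # assumed extra penalty when stats gave up
    return m

def cheapest_join(costs, join_depth, used_default_estimate=False):
    # costs: raw planner costs per strategy, e.g. {"nestloop": 1200, "hash": 1500};
    # only the nested loop cost is scaled, the estimates themselves are untouched.
    adjusted = dict(costs)
    if "nestloop" in adjusted:
        adjusted["nestloop"] *= conviction_multiplier(join_depth,
                                                      used_default_estimate)
    return min(adjusted, key=adjusted.get)

print(cheapest_join({"nestloop": 1200, "hash": 1500}, join_depth=0))  # nestloop
print(cheapest_join({"nestloop": 1200, "hash": 1500}, join_depth=2))  # hash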

What are your thoughts on this simple idea - is it perhaps too simple?

Cheers, Ben




Re: disfavoring unparameterized nested loops

2022-01-13 Thread Robert Haas
On Wed, Jan 12, 2022 at 7:20 PM Peter Geoghegan  wrote:
> Should I take it that you've dropped this project? I was rather hoping
> that you'd get back to it, FWIW.

Honestly, I'd forgotten all about it.  In theory I'd like to do
something about this, but I just have too many other things going on.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2022-01-12 Thread Peter Geoghegan
On Tue, Jun 22, 2021 at 4:13 AM Robert Haas  wrote:
> I think that's a reasonable request. I'm not sure that I believe it's
> 100% necessary, but it's certainly an improvement on a technical
> level, and given that the proposed change could impact quite a lot of
> plans, it's fair to want to see some effort being put into mitigating
> the possible downsides. Now, I'm not sure when I might have time to
> actually try to do the work, which kind of sucks, but that's how it
> goes sometimes.

Should I take it that you've dropped this project? I was rather hoping
that you'd get back to it, FWIW.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-08-02 Thread Mike Klaas
> I think that it is worth paying more than nothing to avoid plans that are
> so far from optimal that they might as well take forever to execute


I recently came across this article from 2016 that expounded upon a bad
plan of the sort discussed in this thread:
https://heap.io/blog/when-to-avoid-jsonb-in-a-postgresql-schema

(The proximate cause in this case was PostgreSQL not collecting statistics
for fields in a JSONB column, estimating a rowcount of 1, and thus creating a
pathological slowdown.)

–Mike


On Tue, Jun 22, 2021 at 7:37 PM, Peter Geoghegan  wrote:

> [...]

Re: disfavoring unparameterized nested loops

2021-06-22 Thread Peter Geoghegan
On Tue, Jun 22, 2021 at 2:53 AM Tomas Vondra
 wrote:
> Yeah, I like the insurance analogy - it gets to the crux of the problem,
> because insurance is pretty much exactly about managing risk.

The user's exposure to harm is what truly matters. I admit that that's
very hard to quantify, but we should at least try to do so.

We sometimes think about a plan that is 10x slower as if it's
infinitely slow, or might as well be. But it's usually not like that
-- it is generally meaningfully much better than the plan being 100x
slower, which is itself sometimes appreciably better than 1000x
slower. And besides, users often don't get anything like the optimal
plan, even on what they would consider to be a good day (which is most
days). So maybe 10x slower is actually the baseline good case already,
without anybody knowing it. Most individual queries are not executed
very often, even on the busiest databases. The extremes really do
matter a lot.

If a web app or OLTP query is ~10x slower than optimal then it might
be the practical equivalent of an outage that affects the query alone
(i.e. "infinitely slow") -- but probably not. I think that it is worth
paying more than nothing to avoid plans that are so far from optimal
that they might as well take forever to execute. This is not
meaningfully different from a database outage affecting one particular
query. It kind of is in a category of its own that surpasses "slow
plan", albeit one that is very difficult or impossible to define
formally.

There may be a huge amount of variation in risk tolerance among
basically reasonable people. For example, if somebody chooses to
engage in some kind of extreme sport, to me it seems understandable.
It's just not my cup of tea. Whereas if somebody chooses to never wear
a seatbelt while driving, then to me they're simply behaving
foolishly. They're not willing to incur the tiniest inconvenience in
order to get a huge reduction in potential harm -- including a very
real risk of approximately the worst thing that can happen to you.
Sure, refusing to wear a seatbelt can theoretically be classified as
just another point on the risk tolerance spectrum, but that seems
utterly contrived to me. Some things (maybe not that many) really are
like that, or can at least be assumed to work that way as a practical
matter.

> But making
> everything slower will be a hard sell, because the vast majority of
> workloads already running on Postgres don't have this issue at all, so
> for them it's not worth the expense.

I think that we're accepting too much risk here. But I bet it's also
true that we're not taking enough risk in other areas. That was really
my point with the insurance analogy -- we can afford to take lots of
individual risks as long as they don't increase our exposure to truly
disastrous outcomes -- by which I mean queries that might as well take
forever to execute as far as the user is concerned. (Easier said than
done, of course.)

A simple trade-off between fast and robust doesn't seem like a
universally helpful thing. Sometimes it's a very unhelpful way of
looking at the situation. If you make something more robust to extreme
bad outcomes, then you may have simultaneously made it *faster* (not
slower) for all practical purposes. This can happen when the increase
in robustness allows the user to tune the system aggressively, and
only take on new risks that they can truly live with (which wouldn't
have been possible without the increase in robustness). For example, I
imagine that the failsafe mechanism added to VACUUM will actually make
it possible to tune VACUUM much more aggressively -- it might actually
end up significantly improving performance for all practical purposes,
even though technically it has nothing to do with performance.

Having your indexes a little more bloated because the failsafe
kicked-in is a survivable event -- the DBA lives to fight another day,
and *learns* to tune vacuum/the app so it doesn't happen again and
again. An anti-wraparound failure is perhaps not a survivable event --
the DBA gets fired. This really does seem like a fundamental
difference to me.

> Following the insurance analogy,
> selling tornado insurance in Europe is mostly pointless.

Principled skepticism of this kind of thing is of course necessary and
welcome. It *could* be taken too far.

> And the lack of data also plays role - the insurance company will ask
> for higher rates when it does not have enough accurate data about the
> phenomenon, or when there's a lot of unknowns. Maybe this would allow
> some basic measure of uncertainty, based on the number and type of
> restrictions, joins, etc.

I don't think that you can really model uncertainty. But you can have
true certainty (or close to it) about a trade-off that makes the
system fundamentally more robust over time. You can largely be certain
about both the cost of the insurance, as well as how it ameliorates
the problem in at least some cases.

> So maybe some fairly rough measure of 

Re: disfavoring unparameterized nested loops

2021-06-22 Thread Robert Haas
On Mon, Jun 21, 2021 at 4:52 PM Tom Lane  wrote:
> I'm willing to take some flak if there's not an easy proof that the outer
> scan is single-row, but I don't think we should just up and change it
> for cases where there is.

I think that's a reasonable request. I'm not sure that I believe it's
100% necessary, but it's certainly an improvement on a technical
level, and given that the proposed change could impact quite a lot of
plans, it's fair to want to see some effort being put into mitigating
the possible downsides. Now, I'm not sure when I might have time to
actually try to do the work, which kind of sucks, but that's how it
goes sometimes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-22 Thread Tomas Vondra



On 6/22/21 2:25 AM, Peter Geoghegan wrote:
> On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian  wrote:
>> There were a lot of interesting ideas in this thread and I want to
>> analyze some of them.  First, there is the common assumption (not
>> stated) that over-estimating by 5% and underestimating by 5% cause the
>> same harm, which is clearly false.  If I go to a restaurant and estimate
>> the bill to be 5% higher or %5 lower, assuming I have sufficient funds,
>> under or over estimating is probably fine.  If I am driving and
>> under-estimate the traction of my tires, I am probably fine, but if I
>> over-estimate their traction by 5%, I might crash.
> 
> My favorite analogy is the home insurance one:
> 
> It might make sense to buy home insurance because losing one's home
> (say through fire) is a loss that usually just cannot be tolerated --
> you are literally ruined. We can debate how likely it is to happen,
> but in the end it's not so unlikely that it can be ruled out. At the
> same time I may be completely unwilling to buy insurance for personal
> electronic devices. I can afford to replace all of them if I truly
> have to. And the chances of all of them breaking or being stolen on
> the same day is remote (unless my home burns down!). If I drop my cell
> phone and crack the screen, I'll be annoyed, but it's certainly not
> the end of the world.
> 
> This behavior will make perfect sense to most people. But it doesn't
> scale up or down. I have quite a few electronic devices, but only a
> single home, so technically I'm taking risks way more often than I am
> playing it safe here. Am I risk tolerant when it comes to insurance?
> Conservative?
> 
> I myself don't think that it is sensible to apply either term here.
> It's easier to just look at the specifics. A home is a pretty
> important thing to almost everybody; we can afford to treat it as a
> special case.
> 
>> If that is accurate, I think the big question is how common are cases
>> where the outer side can't be proven to have zero or one row and nested
>> loops are enough of a win to risk its greater sensitivity to
>> misestimation.  If it is uncommon, seems we could just code the
>> optimizer to use hash joins in those cases without a user-visible knob,
>> beyond the knob that already turns off nested loop joins.
> 
> I think it's possible that Robert's proposal will lead to very
> slightly slower plans in the vast majority of cases that are affected,
> while still being a very good idea. Why should insurance be 100% free,
> though? Maybe it can be in some cases where we get lucky, but why
> should that be the starting point? It just has to be very cheap
> relative to what we do today for us to come out ahead, certainly, but
> that seems quite possible in at least this case.
> 

Yeah, I like the insurance analogy - it gets to the crux of the problem,
because insurance is pretty much exactly about managing risk. But making
everything slower will be a hard sell, because the vast majority of
workloads already running on Postgres don't have this issue at all, so
for them it's not worth the expense. Following the insurance analogy,
selling tornado insurance in Europe is mostly pointless.

Insurance is also about personal preference / risk tolerance. Maybe I'm
fine with accepting the risk that my house burns down, or whatever ...

And the lack of data also plays role - the insurance company will ask
for higher rates when it does not have enough accurate data about the
phenomenon, or when there's a lot of unknowns. Maybe this would allow
some basic measure of uncertainty, based on the number and type of
restrictions, joins, etc. The more restrictions we have, the less
certain the estimates are. Some conditions are estimated less
accurately, and using default estimates makes it much less accurate.

So maybe some fairly rough measure of uncertainty might work, and the
user might specify how much risk it's willing to tolerate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian  wrote:
> There were a lot of interesting ideas in this thread and I want to
> analyze some of them.  First, there is the common assumption (not
> stated) that over-estimating by 5% and underestimating by 5% cause the
> same harm, which is clearly false.  If I go to a restaurant and estimate
> the bill to be 5% higher or 5% lower, assuming I have sufficient funds,
> under or over estimating is probably fine.  If I am driving and
> under-estimate the traction of my tires, I am probably fine, but if I
> over-estimate their traction by 5%, I might crash.

My favorite analogy is the home insurance one:

It might make sense to buy home insurance because losing one's home
(say through fire) is a loss that usually just cannot be tolerated --
you are literally ruined. We can debate how likely it is to happen,
but in the end it's not so unlikely that it can be ruled out. At the
same time I may be completely unwilling to buy insurance for personal
electronic devices. I can afford to replace all of them if I truly
have to. And the chances of all of them breaking or being stolen on
the same day is remote (unless my home burns down!). If I drop my cell
phone and crack the screen, I'll be annoyed, but it's certainly not
the end of the world.

This behavior will make perfect sense to most people. But it doesn't
scale up or down. I have quite a few electronic devices, but only a
single home, so technically I'm taking risks way more often than I am
playing it safe here. Am I risk tolerant when it comes to insurance?
Conservative?

I myself don't think that it is sensible to apply either term here.
It's easier to just look at the specifics. A home is a pretty
important thing to almost everybody; we can afford to treat it as a
special case.

> If that is accurate, I think the big question is how common are cases
> where the outer side can't be proven to have zero or one row and nested
> loops are enough of a win to risk its greater sensitivity to
> misestimation.  If it is uncommon, seems we could just code the
> optimizer to use hash joins in those cases without a user-visible knob,
> beyond the knob that already turns off nested loop joins.

I think it's possible that Robert's proposal will lead to very
slightly slower plans in the vast majority of cases that are affected,
while still being a very good idea. Why should insurance be 100% free,
though? Maybe it can be in some cases where we get lucky, but why
should that be the starting point? It just has to be very cheap
relative to what we do today for us to come out ahead, certainly, but
that seems quite possible in at least this case.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Bruce Momjian
On Mon, Jun 21, 2021 at 12:52:39PM -0400, Tom Lane wrote:
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.
>
> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.
> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.
> 
> Another thought that occurs to me here is that maybe we could get rid of
> the enable_material knob in favor of forcing (or at least encouraging)
> materialization when the outer side isn't provably one row.

There were a lot of interesting ideas in this thread and I want to
analyze some of them.  First, there is the common assumption (not
stated) that over-estimating by 5% and underestimating by 5% cause the
same harm, which is clearly false.  If I go to a restaurant and estimate
the bill to be 5% higher or 5% lower, assuming I have sufficient funds,
under or over estimating is probably fine.  If I am driving and
under-estimate the traction of my tires, I am probably fine, but if I
over-estimate their traction by 5%, I might crash.

Closer to home, Postgres is more tolerant of memory usage
under-estimation than over-estimation:

https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018

What I hear Robert saying is that unparameterized nested loops are much
more sensitive to misestimation than hash joins, and only slightly
faster than hash joins when they use accurate row counts, so we might
want to avoid them by default.  Tom is saying that if we know the outer
side will have zero or one row, we should still do unparameterized
nested loops because those are not more sensitive to misestimation than
hash joins, and slightly faster.

If that is accurate, I think the big question is how common are cases
where the outer side can't be proven to have zero or one row and nested
loops are enough of a win to risk its greater sensitivity to
misestimation.  If it is uncommon, seems we could just code the
optimizer to use hash joins in those cases without a user-visible knob,
beyond the knob that already turns off nested loop joins.

Peter's comment about having nodes in the executor that adjust to the
row counts it finds is interesting, and eventually might be necessary
once we are sure we have gone as far as we can without it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 1:52 PM Tom Lane  wrote:
> You're ignoring the fact that the plan shape we generate now is in fact
> *optimal*, and easily proven to be so, in some very common cases.

As I've said I don't reject the idea that there is room for
disagreement on the specifics. For example perhaps it'll turn out that
only a restricted subset of the cases that Robert originally had in
mind will truly turn out to work as well as hoped. But that just seems
like a case of Robert refining a very preliminary proposal. I
absolutely expect there to be some need to iron out the wrinkles.

> I don't
> think the people whose use-cases get worse are going to be mollified by
> the argument that you reduced their risk, when there is provably no risk.

By definition what we're doing here is throwing away slightly cheaper
plans when the potential downside is much higher than the potential
upside of choosing a reasonable alternative. I don't think that the
downside is particularly likely. In fact I believe that it's fairly
unlikely in general. This is an imperfect trade-off, at least in
theory. I fully own that.

> I'm willing to take some flak if there's not an easy proof that the outer
> scan is single-row, but I don't think we should just up and change it
> for cases where there is.

Seems reasonable to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 21, 2021 at 10:14 AM Robert Haas  wrote:
>> The other thing is - the risk of a particular path doesn't matter in
>> an absolute sense, only a relative one. In the particular case I'm on
>> about here, we *know* there's a less-risky alternative.

> Exactly! This, a thousand times.

This is a striking oversimplification.

You're ignoring the fact that the plan shape we generate now is in fact
*optimal*, and easily proven to be so, in some very common cases.  I don't
think the people whose use-cases get worse are going to be mollified by
the argument that you reduced their risk, when there is provably no risk.
Obviously the people whose use-cases are currently hitting the wrong end
of the risk will be happy with any change whatever, but those may not be
the same people.

I'm willing to take some flak if there's not an easy proof that the outer
scan is single-row, but I don't think we should just up and change it
for cases where there is.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 10:14 AM Robert Haas  wrote:
> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop. And it's also not clear to me
> why a nested loop is actually better in a case like this. If the
> nested loop iterates more than once because there are more rows on the
> outer side, then you don't want to have something on the inner side
> that might be expensive, and either an aggregate or an unparameterized
> search for a unique value are potentially quite expensive. Now if you
> put a materialize node on top of the inner side, then you don't have
> to worry about that problem, but how much are you saving at that point
> vs. just doing a hash join?

I suspected that that was true, but even that doesn't seem like the
really important thing. While it may be true that the simple heuristic
can't be quite as simple as we'd hoped at first, ISTM that this is
ultimately not much of a problem. The basic fact remains: some more or
less simple heuristic makes perfect sense, and should be adopted.

This conclusion is counterintuitive because it's addressing a very
complicated problem with a very simple solution. However, if we lived
in a world where things that sound too good to be true always turned
out to be false, we'd also live in a world where optimizers were
completely impractical and useless. Optimizers have that quality
already, whether or not we officially acknowledge it.

> I don't understand how to generate a risk assessment or what we ought
> to do with it if we had one. I don't even understand what units we
> would use. We measure costs using abstract cost units, but those
> abstract cost units are intended to be a proxy for runtime. If it's
> not the case that a plan that runs for longer has a higher cost, then
> something's wrong with the costing model or the settings. In the case
> of risk, the whole thing seems totally subjective. We're talking about
> the risk that our estimate is bogus, but how do we estimate the risk
> that we don't know how to estimate?

Clearly we need a risk estimate for our risk estimate!

> The other thing is - the risk of a particular path doesn't matter in
> an absolute sense, only a relative one. In the particular case I'm on
> about here, we *know* there's a less-risky alternative.

Exactly! This, a thousand times.

This reminds me of how people behave in the real world. In the real
world people deal with this without too much difficulty. Everything is
situational and based on immediate trade-offs, with plenty of
uncertainty at every step. For example, if you think that there is
even a tiny chance of a piece of fruit being poisonous, you don't eat
the piece of fruit -- better to wait until lunchtime. This is one of
the *easiest* decisions I can think of, despite the uncertainty.
(Except perhaps if you happen to be in danger of dying of starvation,
in which case it might be a very different story. And so on.)

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 1:38 PM Tom Lane  wrote:
>> NestLoop Join
>>   Join Filter: t1.x op t2.y
>>   -> Index Scan on t1_pkey
>>Index Cond: t1.id = constant
>>   -> Seq Scan on t2

> Hmm, yeah, I guess that's possible. How much do you think this loses
> as compared with:

> Hash Join
> Hash Cond: t1.x op t2.y
> -> Seq Scan on t2
> -> Hash
>   -> Index Scan on t1_pkey

Hard to say.  The hash overhead might or might not pay for itself.
If the equality operator proper is expensive and we get to avoid
applying it at most t2 rows, then this might be an OK alternative;
otherwise not so much.

In any case, the former looks like plans that we generate now,
the second not.  Do you really want to field a lot of questions
about why we suddenly changed to a not-clearly-superior plan?

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 1:38 PM Tom Lane  wrote:
> > Hmm, maybe I need to see an example of the sort of plan shape that you
> > have in mind. To me it feels like a comparison on a unique key ought
> > to use a *parameterized* nested loop.
>
> The unique-key comparison would be involved in the outer scan in
> the cases I'm thinking of.  As an example,
>
> select * from t1, t2 where t1.id = constant and t1.x op t2.y;
>
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>
> NestLoop Join
>   Join Filter: t1.x op t2.y
>   -> Index Scan on t1_pkey
>Index Cond: t1.id = constant
>   -> Seq Scan on t2
>
> and if we can detect that the pkey indexscan produces just one row,
> this is very possibly the best available plan.

Hmm, yeah, I guess that's possible. How much do you think this loses
as compared with:

Hash Join
Hash Cond: t1.x op t2.y
-> Seq Scan on t2
-> Hash
  -> Index Scan on t1_pkey

(If the operator is not hashable then this plan is impractical, but in
such a case the question of preferring the hash join over the nested
loop doesn't arise anyway.)

> BTW, it strikes me that there might be an additional consideration
> here: did parameterization actually help anything?  That is, the
> proposed rule wants to reject the above but allow
>
> NestLoop Join
>   -> Index Scan on t1_pkey
>Index Cond: t1.id = constant
>   -> Seq Scan on t2
>Filter: t1.x op t2.y
>
> even though the latter isn't meaningfully better.  It's possible
> this won't arise because we don't consider parameterized paths
> except where the parameter is used in an indexqual or the like,
> but I'm not confident of that.  See in particular reparameterize_path
> and friends before you assert there's no such issue.  So we might
> need to distinguish essential from incidental parameterization,
> or something like that.

Hmm, perhaps. I think it won't happen in the normal cases, but I can't
completely rule out the possibility that there are corner cases where
it does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
I wrote:
> ... As an example,
>   select * from t1, t2 where t1.id = constant and t1.x op t2.y;
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>   NestLoop Join
> Join Filter: t1.x op t2.y
> -> Index Scan on t1_pkey
>  Index Cond: t1.id = constant
> -> Seq Scan on t2

Also, to enlarge on that example: if "op" isn't hashable then the
original argument is moot.  However, it'd still be useful to know
if the outer scan is sure to return no more than one row, as that
could inform the choice whether to plaster a Materialize node on
the inner scan.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 11:55 AM Tom Lane  wrote:
>> There are certainly cases where the optimizer can prove (in principle;
>> it doesn't do so today) that a plan node will produce at most one row.
>> They're hardly uncommon either: an equality comparison on a unique
>> key, or a subquery with a simple aggregate function, come to mind.

> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop.

The unique-key comparison would be involved in the outer scan in
the cases I'm thinking of.  As an example,

select * from t1, t2 where t1.id = constant and t1.x op t2.y;

where I'm not assuming much about the properties of "op".
This could be amenable to a plan like

NestLoop Join
  Join Filter: t1.x op t2.y
  -> Index Scan on t1_pkey
   Index Cond: t1.id = constant
  -> Seq Scan on t2

and if we can detect that the pkey indexscan produces just one row,
this is very possibly the best available plan.  Nor do I think this
is an unusual situation that we can just ignore.

BTW, it strikes me that there might be an additional consideration
here: did parameterization actually help anything?  That is, the
proposed rule wants to reject the above but allow

NestLoop Join
  -> Index Scan on t1_pkey
   Index Cond: t1.id = constant
  -> Seq Scan on t2
   Filter: t1.x op t2.y

even though the latter isn't meaningfully better.  It's possible
this won't arise because we don't consider parameterized paths
except where the parameter is used in an indexqual or the like,
but I'm not confident of that.  See in particular reparameterize_path
and friends before you assert there's no such issue.  So we might
need to distinguish essential from incidental parameterization,
or something like that.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 1:11 PM Peter Geoghegan  wrote:
> On Mon, Jun 21, 2021 at 9:52 AM Tom Lane  wrote:
> > > I'm not so sure that it is. The point isn't the risk, even if it could
> > > be calculated. The point is the downsides of being wrong are huge and
> > > pretty much unbounded, whereas the benefits of being right are tiny
> > > and bounded. It almost doesn't matter what the underlying
> > > probabilities are.
> >
> > You're throwing around a lot of pejorative adjectives there without
> > having bothered to quantify any of them.  This sounds less like a sound
> > argument to me than like a witch trial.
>
> I'm not sure what you mean by pejorative. Isn't what I said about
> downsides and upsides pretty accurate?

Yeah, I don't see why Peter's characterization deserves to be labelled
as pejorative here. A hash join or merge join or parameterized nested
loop can turn out to be slower than some other algorithm, but all of
those approaches have some component that tends to make the asymptotic
cost less than the product of the sizes of the inputs. I don't think
that's true in absolutely every case; for example, if a merge join has
every row duplicated on both sides, it will have to scan every inner
tuple once per outer tuple, just like a nested loop, and the other
algorithms also are going to degrade toward O(NM) performance in the
face of many duplicates. Also, a hash join can be pretty close to that
if it needs a shazillion batches. But in normal cases, any algorithm
other than an unparameterized nested loop tends to read each input
tuple on each side ONCE, so the cost is more like the sum of the input
sizes than the product. And there's nothing pejorative in saying that
N + M can be less than N * M by an unbounded amount. That's just the
facts.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 11:55 AM Tom Lane  wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

Hmm, maybe I need to see an example of the sort of plan shape that you
have in mind. To me it feels like a comparison on a unique key ought
to use a *parameterized* nested loop. And it's also not clear to me
why a nested loop is actually better in a case like this. If the
nested loop iterates more than once because there are more rows on the
outer side, then you don't want to have something on the inner side
that might be expensive, and either an aggregate or an unparameterized
search for a unique value are potentially quite expensive. Now if you
put a materialize node on top of the inner side, then you don't have
to worry about that problem, but how much are you saving at that point
vs. just doing a hash join?

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I don't understand how to generate a risk assessment or what we ought
to do with it if we had one. I don't even understand what units we
would use. We measure costs using abstract cost units, but those
abstract cost units are intended to be a proxy for runtime. If it's
not the case that a plan that runs for longer has a higher cost, then
something's wrong with the costing model or the settings. In the case
of risk, the whole thing seems totally subjective. We're talking about
the risk that our estimate is bogus, but how do we estimate the risk
that we don't know how to estimate? Given quals (x + 0) = x, x = some
MCV, and x = some non-MCV, we can say that we're most likely to be
wrong about the first one and least likely to be wrong about the
second one, but how much more likely? I don't know how you can decide
that, even in principle. We can also say that an unparameterized
nested loop is more risky than some other plan because it could turn
out to be crazy expensive, but is that more or less risky than
scanning the index on b as a way to implement SELECT * FROM foo WHERE
a = 1 ORDER BY b LIMIT 1? How much more risky, and why?

And then, even supposing we had a risk metric for every path, what
exactly would we do with it? Suppose path A is cheaper than path B,
but also more risky. Which should we keep? We could keep both, but
that seems to be just kicking the can down the road. If plan B is
likely to blow up in our face, we should probably just get rid of it,
or not even generate it in the first place. Even if we do keep both,
at some point we're going to have to make a cost-vs-risk tradeoff, and
I don't see how to do that intelligently, because the point is
precisely that if the risk is high, the cost number might be totally
wrong. If we know that plan A is cheaper than plan B, we should choose
plan A. But if all we know is that plan A would be cheaper than plan B
if our estimate of the cost were correct, but also that it probably
isn't, then what we actually know is nothing. We have no principled
basis for deciding anything based on cost unless we're reasonably
confident that the cost estimate is pretty good. So AFAICT the only
principled strategy here is to throw away high risk paths as early as
we possibly can. What am I missing?

The other thing is - the risk of a particular path doesn't matter in
an absolute sense, only a relative one. In the particular case I'm on
about here, we *know* there's a less-risky alternative. We don't need
to quantify the risk to know which of the two options has more. In
many other cases, the risk is irreducible e.g. a default estimate
could be totally bogus, but switching paths is of no help in getting
rid of it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 9:52 AM Tom Lane  wrote:
> > I'm not so sure that it is. The point isn't the risk, even if it could
> > be calculated. The point is the downsides of being wrong are huge and
> > pretty much unbounded, whereas the benefits of being right are tiny
> > and bounded. It almost doesn't matter what the underlying
> > probabilities are.
>
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.

I'm not sure what you mean by pejorative. Isn't what I said about
downsides and upsides pretty accurate?

> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.

It sounds like you're concerned about properly distinguishing between:

1. Cases where the only non-reckless choice is a hash join over an
unparameterized nested loop join

2. Cases that look like that at first, that don't really have that
quality on closer examination.

This seems like a reasonable concern.

> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.

I thought that you were arguing against the premise itself. It's now
clear that you weren't, though.

I don't have an opinion for or against bringing the provably-empty
rels stuff into the picture. At least not right now.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Tomas Vondra  writes:
> I've been thinking about this a bit recently and searching for papers 
> talking about this, but it's not clear to me how to calculate the 
> risk (and propagate it through the plan) without making the whole cost 
> evaluation way more complicated / expensive :-(

Yeah, a truly complete approach using confidence intervals or the
like seems frighteningly complicated.

> But maybe you have thought about some much simpler approach, addressing 
> this sufficiently well?

See my nearby response to Peter.  The main case that's distressing me
is the possibility of forcing a hash join even when the outer side
is obviously only one row.  If we could avoid that, at least for
large values of "obvious", I'd be far more comfortable.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 21, 2021 at 8:55 AM Tom Lane  wrote:
>> I'd be a lot happier if this proposal were couched around some sort
>> of estimate of the risk of the outer side producing more than the
>> expected number of rows.  The arguments so far seem like fairly lame
>> rationalizations for not putting forth the effort to do that.

> I'm not so sure that it is. The point isn't the risk, even if it could
> be calculated. The point is the downsides of being wrong are huge and
> pretty much unbounded, whereas the benefits of being right are tiny
> and bounded. It almost doesn't matter what the underlying
> probabilities are.

You're throwing around a lot of pejorative adjectives there without
having bothered to quantify any of them.  This sounds less like a sound
argument to me than like a witch trial.

Reflecting for a bit on the ancient principle that "the only good numbers
in computer science are 0, 1, and N", it seems to me that we could make
an effort to classify RelOptInfos as provably empty, provably at most one
row, and others.  (This would be a property of relations, not individual
paths, so it needn't bloat struct Path.)  We already understand about
provably-empty rels, so this is just generalizing that idea a little.
Once we know about that, at least for the really-common cases like unique
keys, I'd be okay with a hard rule that we don't consider unparameterized
nestloop joins with an outer side that falls in the "N" category.
Unless there's no alternative, of course.
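
For illustration, a minimal standalone sketch of that classification and
the hard rule, using invented names rather than the real RelOptInfo
machinery:

#include <stdbool.h>
#include <stdio.h>

/* "0, 1, and N" as a property of a relation, not of individual paths. */
typedef enum RelRowClass
{
    REL_PROVABLY_EMPTY,     /* e.g. contradictory quals */
    REL_AT_MOST_ONE_ROW,    /* e.g. equality on a unique key, simple aggregate */
    REL_MANY_ROWS           /* everything else: "N" */
} RelRowClass;

/*
 * The hard rule sketched above: an outer side in the "N" class rules out
 * an unparameterized nestloop, unless it is the only join method we have.
 */
static bool
consider_unparam_nestloop(RelRowClass outer_class, bool have_alternative)
{
    if (outer_class == REL_MANY_ROWS)
        return !have_alternative;
    return true;
}

int
main(void)
{
    printf("unique-key outer, hash join available: %s\n",
           consider_unparam_nestloop(REL_AT_MOST_ONE_ROW, true) ? "consider" : "skip");
    printf("many-row outer, hash join available:   %s\n",
           consider_unparam_nestloop(REL_MANY_ROWS, true) ? "consider" : "skip");
    printf("many-row outer, no other join method:  %s\n",
           consider_unparam_nestloop(REL_MANY_ROWS, false) ? "consider" : "skip");
    return 0;
}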

Another thought that occurs to me here is that maybe we could get rid of
the enable_material knob in favor of forcing (or at least encouraging)
materialization when the outer side isn't provably one row.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tomas Vondra




On 6/21/21 5:55 PM, Tom Lane wrote:
> Peter Geoghegan  writes:
>> The heuristic that has the optimizer flat out avoid an
>> unparameterized nested loop join is justified by the belief that
>> that's fundamentally reckless. Even though we all agree on that much,
>> I don't know when it stops being reckless and starts being "too risky
>> for me, but not fundamentally reckless". I think that that's worth
>> living with, but it isn't very satisfying.
>
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.
>
> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join.  So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Agreed.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I agree having such a measure would be helpful, but do you have an idea
how it could be implemented?


I've been thinking about this a bit recently and searching for papers 
talking about this, but it's not clear to me how to calculate the 
risk (and propagate it through the plan) without making the whole cost 
evaluation way more complicated / expensive :-(


The "traditional approach" to quantifying risk would be confidence 
intervals, i.e. for each cardinality estimate "e" we'd determine a range 
[a,b] so that P(a <= e <= b) = p. So we could pick "p" depending on how 
"robust" the plan choice should be (say 0.9 for "risky" and 0.999 for 
"safe" plans) and calculate a,b. Or maybe we could calculate where the 
plan changes, and then we'd see if those "break points" are within the 
confidence interval. If not, great - we can assume the plan is stable, 
otherwise we'd need to consider the other plans too, somehow.
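
As a deliberately naive standalone sketch of that interval check (a
normal-approximation interval over a sampled selectivity; the sample
size, z-value and break point below are made-up numbers, not anything
the current statistics machinery produces):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double table_rows = 1000000.0;
    double n = 30000.0;     /* sampled rows */
    double k = 3.0;         /* sampled rows matching the qual */
    double z = 3.29;        /* ~99.9% two-sided: the "safe" setting */

    /* p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n) */
    double p_hat = k / n;
    double half = z * sqrt(p_hat * (1.0 - p_hat) / n);
    double lo = fmax(p_hat - half, 0.0) * table_rows;
    double hi = (p_hat + half) * table_rows;
    double est = p_hat * table_rows;

    /* Row count at which, say, the hash join would overtake the nestloop. */
    double break_point = 200.0;
    int    unstable = (break_point >= lo && break_point <= hi);

    printf("estimate %.0f rows, interval [%.0f, %.0f]\n", est, lo, hi);
    printf("break point %.0f is %s the interval -> %s\n",
           break_point, unstable ? "inside" : "outside",
           unstable ? "consider the alternative plan too"
                    : "treat the plan choice as stable");
    return 0;
}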


But what I'm not sure about is:

1) Now we're dealing with three cardinality estimates (the original "e"
and the boundaries "a", "b"). So which one do we use to calculate cost
and pass to upper parts of the plan?


2) The outer relation may be a complex join, so we'd need to combine the 
confidence intervals for the two input relations, somehow.


3) We'd need to know how to calculate the confidence intervals for most 
plan nodes, which I'm not sure we know how to do. So it's not clear to 
me how to do this, which seems rather problematic because we need to 
propagate and combine those confidence intervals through the plan.



But maybe you have thought about some much simpler approach, addressing 
this sufficiently well?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 8:55 AM Tom Lane  wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

That sounds like it might be useful in general.

> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join.  So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Let's assume for the sake of argument that we really have to have that
additional infrastructure to move forward with the idea. (I'm not sure
if it's possible in principle to use infrastructure like that for some
of the cases that Robert has in mind, but for now I'll assume that it
is both possible and a practical necessity.)

Even when I make this working assumption I don't see what it changes
at a fundamental level. You've merely come up with a slightly more
specific definition of the class of plans that are "reckless". You've
only refined the original provisional definition of "reckless" to
exclude specific "clearly not reckless" cases (I think). But the
definition of "reckless" is no less squishy than what we started out
with.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I'm not so sure that it is. The point isn't the risk, even if it could
be calculated. The point is the downsides of being wrong are huge and
pretty much unbounded, whereas the benefits of being right are tiny
and bounded. It almost doesn't matter what the underlying
probabilities are.

To be clear I'm not arguing against modelling risk. I'm just not sure
that it's useful to think of this problem as truly a problem of risk.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> The heuristic that has the optimizer flat out avoid an
> unparameterized nested loop join is justified by the belief that
> that's fundamentally reckless. Even though we all agree on that much,
> I don't know when it stops being reckless and starts being "too risky
> for me, but not fundamentally reckless". I think that that's worth
> living with, but it isn't very satisfying.

There are certainly cases where the optimizer can prove (in principle;
it doesn't do so today) that a plan node will produce at most one row.
They're hardly uncommon either: an equality comparison on a unique
key, or a subquery with a simple aggregate function, come to mind.
 
In such cases, not only is this choice not reckless, but it's provably
superior to a hash join.  So in the end this gets back to the planning
risk factor that we keep circling around but nobody quite wants to
tackle.

I'd be a lot happier if this proposal were couched around some sort
of estimate of the risk of the outer side producing more than the
expected number of rows.  The arguments so far seem like fairly lame
rationalizations for not putting forth the effort to do that.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 7:45 AM Robert Haas  wrote:
> On Mon, Jun 21, 2021 at 6:41 AM David Rowley  wrote:
> > For example, an unparameterized Nested Loop whose outer Path is
> > estimated to return 1 row incurs an entire additional inner scan if
> > there are 2 rows.  With Hash Join the cost is just an additional
> > hashtable lookup, which is dead cheap.   I don't know exactly how
> > add_path() would weigh all that up, but it seems to me that I wouldn't
> > take the risk unless I was 100% certain that the Nested Loop's outer
> > Path would only return 1 row exactly, if there was any chance at all
> > it could return more, I'd be picking some other join method.
>
> It seems like everyone agrees that it would be good to do something
> about this problem, but the question is whether it's best to do
> something that tries to be general, or whether we should instead do
> something about this specific case. I favor the latter approach.

I agree with your conclusion, but FWIW I am sympathetic to David's
view too. I certainly understand why he'd naturally want to define the
class of problems that are like this one, to understand what the
boundaries are.

The heuristic that has the optimizer flat out avoid an
unparameterized nested loop join is justified by the belief that
that's fundamentally reckless. Even though we all agree on that much,
I don't know when it stops being reckless and starts being "too risky
for me, but not fundamentally reckless". I think that that's worth
living with, but it isn't very satisfying.

> Risk
> and uncertainty exist all over the place, but dealing with that in a
> general way seems difficult, and maybe unnecessary. Addressing the
> case of unparameterized nest loops specifically seems simpler, because
> it's easier to reason about what the alternatives are. Your last
> sentence here seems right on point to me.

Right. Part of why this is a good idea is that the user is exposed to
so many individual risks and uncertainties. We cannot see any one risk
as existing in a vacuum. It is not the only risk the user will ever
take in the planner -- if it was then it might actually be okay to
allow unparameterized nested loop joins.

A bad unparameterized nested loop join plan has, in a sense, unknown
and unbounded cost/downside. But it is only very slightly faster than
a hash join, by a fixed well understood amount. With enough "trials"
and on a long enough timeline, it will inevitably blow up and cause
the application to grind to a halt. It seems like no amount of fixed,
bounded benefit from "fast unparameterized nested loop joins" could
possibly make up for that. The life of Postgres users would be a lot
better if bad plans were at least "survivable events".

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 6:41 AM David Rowley  wrote:
> For example, an unparameterized Nested Loop whose outer Path is
> estimated to return 1 row incurs an entire additional inner scan if
> there are 2 rows.  With Hash Join the cost is just an additional
> hashtable lookup, which is dead cheap.   I don't know exactly how
> add_path() would weigh all that up, but it seems to me that I wouldn't
> take the risk unless I was 100% certain that the Nested Loop's outer
> Path would only return 1 row exactly, if there was any chance at all
> it could return more, I'd be picking some other join method.

It seems like everyone agrees that it would be good to do something
about this problem, but the question is whether it's best to do
something that tries to be general, or whether we should instead do
something about this specific case. I favor the latter approach. Risk
and uncertainty exist all over the place, but dealing with that in a
general way seems difficult, and maybe unnecessary. Addressing the
case of unparameterized nest loops specifically seems simpler, because
it's easier to reason about what the alternatives are. Your last
sentence here seems right on point to me.

Basically, what you argue for there is disabling unparameterized
nested loops entirely except when we can prove that the inner side
will never generate more than one row. But, that's almost never going
to be something that we can prove. If the inner side is coming from a
table or sub-join, it can turn out to be big. As far as I can see, the
only way that this doesn't happen is if it's something like a subquery
that aggregates everything down to one row, or has LIMIT 1, but those
are such special cases that I don't even think we should be worrying
about them.

So my counter-proposal is: how about if we split
match_unsorted_outer() into two functions, one of which generates
nested loops only based on parameterized paths and the other of which
generates nested loops based only on unparameterized paths, and then
rejigger add_paths_to_joinrel() so that we do the latter between the
steps that are currently number 5 and 6 and only if we haven't got any
other paths yet? If somebody later comes up with logic for proving
that the inner side can never have more than 1 row, we can let this be
run in those cases as well. In the meantime, if somebody does
something like a JOIN b ON a.x < b.x, we will still generate these
paths because there's no other approach, or similarly if it's a.x =
b.x but for some strange type that doesn't have a hash-joinable or
merge-joinable equality operator. But otherwise we omit those paths
entirely.
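
A standalone sketch of the gating this amounts to, with invented helper
names rather than the real joinpath.c functions: unparameterized nested
loops become a fallback generated only when nothing else produced a path:

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyJoinRel
{
    int npaths;     /* stand-in for list_length(joinrel->pathlist) */
} ToyJoinRel;

static void add_parameterized_nestloop_paths(ToyJoinRel *rel) { rel->npaths++; }
static void add_hash_and_merge_paths(ToyJoinRel *rel)         { rel->npaths++; }
static void add_unparam_nestloop_paths(ToyJoinRel *rel)       { rel->npaths++; }

static void
build_join_paths(ToyJoinRel *rel, bool can_parameterize, bool hash_or_merge_ok)
{
    if (can_parameterize)
        add_parameterized_nestloop_paths(rel);
    if (hash_or_merge_ok)
        add_hash_and_merge_paths(rel);

    /*
     * The proposed new step: only if no other path exists at all, e.g.
     * "a JOIN b ON a.x < b.x", or an equality operator that is neither
     * hash- nor merge-joinable.
     */
    if (rel->npaths == 0)
        add_unparam_nestloop_paths(rel);
}

int
main(void)
{
    ToyJoinRel equi = {0};  /* ordinary hashable equi-join */
    ToyJoinRel ineq = {0};  /* "a.x < b.x": nothing else works */

    build_join_paths(&equi, true, true);
    build_join_paths(&ineq, false, false);
    printf("equi-join: %d path(s), unparameterized nestloop omitted\n", equi.npaths);
    printf("inequality-only join: %d path(s), unparameterized nestloop kept\n", ineq.npaths);
    return 0;
}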

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Kenneth Marshall
> >
> > Most of the time when I see that happen it's down to either the
> > selectivity of some correlated base-quals being multiplied down to a
> > number low enough that we clamp the estimate to be 1 row.   The other
> > case is similar, but with join quals.
> 
> If an estimate is lower than 1, that should be a red flag that Something Is
> Wrong. This is kind of a crazy idea, but what if we threw it back the other
> way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
> 100 or something)? The theory is, the more impossibly low it is, the more
> wrong it is. I'm attracted to the idea of dealing with it as an estimation
> problem and not needing to know about join types. Might have unintended
> consequences, though.
> 
> Long term, it would be great to calculate something about the distribution
> of cardinality estimates, so we can model risk in the estimates.
> 

Hi,

Laurenz suggested clamping to 2 in this thread in 2017:

https://www.postgresql.org/message-id/1509611428.3268.5.camel%40cybertec.at

Having been the victim of this problem in the past, I like the risk
based approach to this. If the cost of being wrong in the estimate is
high, use a merge join instead. In every case that I have encountered,
that heuristic would have given the correct performant plan.

Regards,
Ken




Re: disfavoring unparameterized nested loops

2021-06-21 Thread John Naylor
On Tue, Jun 15, 2021 at 8:00 PM David Rowley  wrote:

> In my experience, the most common reason that the planner chooses
> non-parameterized nested loops wrongly is when there's row
> underestimation that says there's just going to be 1 row returned by
> some set of joins.  The problem often comes when some subsequent join
> is planned and the planner sees the given join rel only produces one
> row.  The cheapest join method we have to join 1 row is Nested Loop.
> So the planner just sticks the 1-row join rel on the outer side
> thinking the executor will only need to scan the inner side of the
> join once.  When the outer row count blows up, then we end up scanning
> that inner side many more times. The problem is compounded when you
> nest it a few joins deep
>
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row.   The other
> case is similar, but with join quals.

If an estimate is lower than 1, that should be a red flag that Something Is
Wrong. This is kind of a crazy idea, but what if we threw it back the other
way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
100 or something)? The theory is, the more impossibly low it is, the more
wrong it is. I'm attracted to the idea of dealing with it as an estimation
problem and not needing to know about join types. Might have unintended
consequences, though.
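
A minimal standalone sketch of that clamping idea (the bounds 2 and 100
are just the numbers floated above, not tested recommendations):

#include <stdio.h>

/*
 * If the raw estimate came out below 1 row, treat "how far below 1 it
 * went" as a hint of how wrong it probably is: bounce it back up to
 * 1/est, clamped into [2, 100].
 */
static double
clamp_suspicious_estimate(double est)
{
    double res;

    if (est >= 1.0)
        return est;             /* nothing suspicious */
    if (est <= 0.0)
        return 100.0;           /* paranoia: avoid dividing by zero */

    res = 1.0 / est;            /* the lower the estimate, the larger this gets */
    if (res < 2.0)
        res = 2.0;
    if (res > 100.0)
        res = 100.0;
    return res;
}

int
main(void)
{
    double samples[] = {0.5, 0.01, 0.0001, 3.0};

    for (int i = 0; i < 4; i++)
        printf("raw estimate %.4f -> clamped %.1f\n",
               samples[i], clamp_suspicious_estimate(samples[i]));
    return 0;
}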

Long term, it would be great to calculate something about the distribution
of cardinality estimates, so we can model risk in the estimates.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: disfavoring unparameterized nested loops

2021-06-21 Thread David Rowley
On Wed, 16 Jun 2021 at 15:08, Peter Geoghegan  wrote:
> It seems important to distinguish between risk and uncertainty --
> they're rather different things. The short version is that you can
> model risk but you cannot model uncertainty. This seems like a problem
> of uncertainty to me.

You might be right there.  "Uncertainty" or "Certainty" seems more
like a value that clauselist_selectivity() would be able to calculate
itself. It would just be up to the planner to determine what to do
with it.

One thing I thought about is that if the costing modal was able to
separate out a cost of additional (unexpected) rows then it would be
easier for add_path() to take into account how bad things might go if
we underestimate.

For example, an unparameterized Nested Loop whose outer Path is
estimated to return 1 row incurs an entire additional inner scan if
there are 2 rows.  With Hash Join the cost is just an additional
hashtable lookup, which is dead cheap.   I don't know exactly how
add_path() would weigh all that up, but it seems to me that I wouldn't
take the risk unless I was 100% certain that the Nested Loop's outer
Path would only return 1 row exactly, if there was any chance at all
it could return more, I'd be picking some other join method.
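
To make that asymmetry concrete, here is a small standalone C sketch of
how the two strategies degrade once the outer side returns more rows
than the single row that was estimated. The cost constants are made up
for illustration; this is not the planner's actual cost model:

#include <stdio.h>

#define INNER_SCAN_COST  1000.0   /* one full scan of the inner relation */
#define HASH_BUILD_COST  1100.0   /* one inner scan plus building the hash table */
#define HASH_PROBE_COST     0.01  /* one hashtable lookup per outer row */

/* Every outer row rescans the whole inner side. */
static double
nestloop_cost(double outer_rows)
{
    return outer_rows * INNER_SCAN_COST;
}

/* Inner side is scanned once; each outer row is a cheap probe. */
static double
hashjoin_cost(double outer_rows)
{
    return HASH_BUILD_COST + outer_rows * HASH_PROBE_COST;
}

int
main(void)
{
    double rows[] = {1.0, 2.0, 10.0, 1000.0};

    for (int i = 0; i < 4; i++)
        printf("outer rows %7.0f: nestloop %10.0f   hashjoin %10.2f\n",
               rows[i], nestloop_cost(rows[i]), hashjoin_cost(rows[i]));
    return 0;
}

Underestimating by even a single row already doubles the nested loop's
cost, while the hash join barely notices; that's the "dead cheap"
hashtable lookup above.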

David




Re: disfavoring unparameterized nested loops

2021-06-18 Thread John Naylor
On Fri, Jun 18, 2021 at 6:20 AM Ashutosh Bapat  wrote:

> I have not come across many papers which leverage this idea. Googling
> "selectivity estimation confidence interval", does not yield many
> papers. Although I found [1] to be using a similar idea. So maybe
> there's no merit in this idea, though theoretically it sounds fine
> to me.
>
>
> [1]
https://pi3.informatik.uni-mannheim.de/~moer/Publications/vldb18_smpl_synop.pdf

Well, that paper's title shows it's a bit too far forward for us, since we
don't use samples during plan time (although that's a separate topic worth
considering). From the references, however, this one gives some
mathematical framing of the problem that led to the thread subject,
although I haven't read enough to see if we can get practical advice from
it:

Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the
size of join results.
https://www.csd.uoc.gr/~hy460/pdf/p268-ioannidis.pdf

--
John Naylor
EDB: http://www.enterprisedb.com


Re: disfavoring unparameterized nested loops

2021-06-18 Thread Ashutosh Bapat
>
> The problem I have with this idea is that I really don't know how to
> properly calculate what the risk_factor should be set to.  It seems
> easy at first to set it to something that has the planner avoid these
> silly 1-row estimate nested loop mistakes, but I think what we'd set
> the risk_factor to would become much more important when more and more
> Path types start using it. So if we did this and just guessed the
> risk_factor, that might be fine when only 1 of the paths being
> compared had a non-zero risk_factor, but as soon as both paths have
> one set, unless they're set to something sensible, then we just end up
> comparing garbage costs to garbage costs.

Risk factor is the inverse of confidence in the estimate: the lower the
confidence, the higher the risk. If we associate confidence with the
selectivity estimate, or compute a confidence interval of the estimate
instead of a single number, we can associate a risk factor with each
estimate. When we combine estimates to calculate new estimates, we
also combine their confidences/confidence intervals. If my memory
serves well, confidence intervals/confidences are calculated based on
the sample size and method used for estimation, so we should be able
to compute those during ANALYZE.

I have not come across many papers which leverage this idea. Googling
"selectivity estimation confidence interval", does not yield many
papers. Although I found [1] to be using a similar idea. So maybe
there's no merit in this idea, though theoretically it sounds fine
to me.


[1] 
https://pi3.informatik.uni-mannheim.de/~moer/Publications/vldb18_smpl_synop.pdf
--
Best Wishes,
Ashutosh Bapat




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley  wrote:
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row.   The other
> case is similar, but with join quals.
>
> It seems to me that each time we multiply 2 selectivities together
> that the risk of the resulting selectivity being out increases.  The
> risk is likely lower when we have some extended statistics which
> allows us to do fewer selectivity multiplications.

It seems important to distinguish between risk and uncertainty --
they're rather different things. The short version is that you can
model risk but you cannot model uncertainty. This seems like a problem
of uncertainty to me.

The example from the paper that Robert cited isn't interesting to me
because it hints at a way of managing the uncertainty, exactly. It's
interesting because it seems to emphasize the user's exposure to the
problem, which is what really matters. Even if it was extremely
unlikely that the user would have a problem, the downside of being
wrong is still absurdly high, and the upside of being right is low
(possibly even indistinguishable from zero). It's just not worth
thinking about. Besides, we all know that selectivity estimates are
very often quite wrong without the user ever noticing. It's amazing
that database optimizers work as well as they do, really.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 6:15 PM David Rowley  wrote:
> On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan  wrote:
> > Whether or not we throw the plan back at the planner or "really change
> > our minds at execution time" seems like a distinction without a
> > difference.
>
> What is "really change our minds at execution time"?  Is that switch
> to another plan without consulting the planner?

I don't know what it means. That was my point -- it all seems like
semantics to me.

The strong separation between plan time and execution time isn't
necessarily a good thing, at least as far as solving some of the
thorniest problems goes. It seems obvious to me that cardinality
estimation is the main problem, and that the most promising solutions
are all fundamentally about using execution time information to change
course. Some problems with planning just can't be solved at plan time
-- no model can ever be smart enough. Better to focus on making query
execution more robust, perhaps by totally changing the plan when it is
clearly wrong. But also by using more techniques that we've
traditionally thought of as execution time techniques (e.g. role
reversal in hash join). The distinction is blurry to me.

There are no doubt practical software engineering issues with this --
separation of concerns and whatnot. But it seems premature to go into
that now.

> The new information might cause the join order to
> completely change. It might not be as simple as swapping a Nested Loop
> for a Hash Join.

I agree that it might not be that simple at all. I think that Robert
is saying that this is one case where it really does appear to be that
simple, and so we really can expect to benefit from a simple plan-time
heuristic that works within the confines of the current model. Why
wouldn't we just take that easy win, once the general idea has been
validated some more? Why let the perfect be the enemy of the good?

I have perhaps muddied the waters by wading into the more general
question of robust execution, the inherent uncertainty with
cardinality estimation, and so on. Robert really didn't seem to be
talking about that at all (though it is clearly related).

> > Either way we're changing our minds about the plan based
> > on information that is fundamentally execution time information, not
> > plan time information. Have I missed something?
>
> I don't really see why you think the number of rows that a given join
> might produce is execution information.

If we're 100% sure a join will produce at least n rows because we
executed it (with the express intention of actually doing real query
processing that returns rows to the client), and it already produced n
rows, then what else could it be called? Why isn't it that simple?

> It's exactly the sort of
> information the planner needs to make a good plan. It's just that
> today we get that information from statistics. Plenty of other DBMSs
> make decisions from sampling.

> FWIW, we do already have minimalist
> sampling in get_actual_variable_range().

I know, but that doesn't seem all that related -- it almost seems like
the opposite idea. It isn't the executor balking when it notices that
the plan is visibly wrong during execution, in some important way.
It's more like the planner using the executor to get information about
an index that is well within the scope of what we think of as plan
time.

To some degree the distinction gets really blurred due to nodes like
hash join, where some important individual decisions are delayed until
execution time already. It's really unclear when precisely it stops
being that, and starts being more of a case of either partially or
wholly replanning. I don't know how to talk about it without it being
confusing.

> I'm just trying to highlight that I don't think overloading nodes is a
> good path to go down.  It's not a sustainable practice. It's a path
> towards just having a single node that does everything. If your
> suggestion was not serious then there's no point in discussing it
> further.

As I said, it was a way of framing one particular issue that Robert is
concerned about.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan  wrote:
> Whether or not we throw the plan back at the planner or "really change
> our minds at execution time" seems like a distinction without a
> difference.

What is "really change our minds at execution time"?  Is that switch
to another plan without consulting the planner?  If so what decides
what that new plan should be? The planner is meant to be the expert in
that department. The new information might cause the join order to
completely change. It might not be as simple as swapping a Nested Loop
for a Hash Join.

> Either way we're changing our minds about the plan based
> on information that is fundamentally execution time information, not
> plan time information. Have I missed something?

I don't really see why you think the number of rows that a given join
might produce is execution information. It's exactly the sort of
information the planner needs to make a good plan. It's just that
today we get that information from statistics. Plenty of other DBMSs
make decisions from sampling. FWIW, we do already have minimalist
sampling in get_actual_variable_range().

I'm just trying to highlight that I don't think overloading nodes is a
good path to go down.  It's not a sustainable practice. It's a path
towards just having a single node that does everything. If your
suggestion was not serious then there's no point in discussing it
further.

David




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley  wrote:
> I don't really think we should solve this by having nodeNestloop.c
> fall back on hashing when the going gets tough.  Overloading our nodes
> that way is not a sustainable thing to do.  I'd rather see the
> executor throw the plan back at the planner along with some hints
> about what was wrong with it.  We could do that providing we've not
> sent anything back to the client yet.

It wasn't a serious suggestion -- it was just a way of framing the
issue at hand that I thought might be useful.

If we did have something like that (which FWIW I think makes sense but
is hard to do right in a general way) then it might be expected to
preemptively refuse to even start down the road of using an
unparameterized nestloop join very early, or even before execution
time. Such an adaptive join algorithm/node might be expected to have a
huge bias against this particular plan shape, that can be reduced to a
simple heuristic. But you can have the simple heuristic without
needing to build everything else.

Whether or not we throw the plan back at the planner or "really change
our minds at execution time" seems like a distinction without a
difference. Either way we're changing our minds about the plan based
on information that is fundamentally execution time information, not
plan time information. Have I missed something?

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 05:09, Robert Haas  wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

It's not great that people are having to use such blunt instruments to
get the planner to behave.  It might not be a terrible idea to provide
them with something a bit sharper as you suggest. The GUC idea is
certainly something that could be done without too much effort.

There was some talk of doing that in [1].

> Another approach
> would be to introduce a large fuzz factor for such nested loops e.g.
> keep them only if the cost estimate is better than the comparable hash
> join plan by at least a factor of N (or if no such plan is being
> generated for some reason).

In my experience, the most common reason that the planner chooses
non-parameterized nested loops wrongly is when there's row
underestimation that says there's just going to be 1 row returned by
some set of joins.  The problem often comes when some subsequent join
is planned and the planner sees the given join rel only produces one
row.  The cheapest join method we have to join 1 row is Nested Loop.
So the planner just sticks the 1-row join rel on the outer side
thinking the executor will only need to scan the inner side of the
join once.  When the outer row count blows up, then we end up scanning
that inner side many more times. The problem is compounded when you
nest it a few joins deep

Most of the time when I see that happen it's down to either the
selectivity of some correlated base-quals being multiplied down to a
number low enough that we clamp the estimate to be 1 row.   The other
case is similar, but with join quals.

It seems to me that each time we multiply 2 selectivities together
that the risk of the resulting selectivity being out increases.  The
risk is likely lower when we have some extended statistics which
allows us to do fewer selectivity multiplications.

For that 1-row case, doing an unparameterized nested loop is only the
cheapest join method by a tiny amount.  It really wouldn't be much
more expensive to just put that single row into a hash table.  If that
1 estimated row turns out to be 10 actual rows then it's likely not
too big a deal for the hash join code to accommodate the 9 additional
rows.

This makes me think that it's insanely risky for the planner to be
picking Nested Loop joins in this case. And that puts me down the path
of thinking that this problem should be solved by having the planner
take into account risk as well as costs when generating plans.

I don't really have any concrete ideas on that, but a basic idea that
I have been considering is that a Path has a risk_factor field that is
decided somewhere like clauselist_selectivity(). Maybe the risk can go
up by 1 each time we multiply an individual selectivity. (As of
master, estimate_num_groups() allows the estimation to pass back some
further information to the caller. I added that for Result Cache so I
could allow the caller to get visibility about when the estimate fell
back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get
similar treatment to allow the risk_factor or number of nstats_used to
be passed back).  We'd then add a GUC, something like
planner_risk_adversion, which could be set to anything from 0.0 to some
positive number. During add_path() we could do the cost comparison
like:  path1.cost * (1.0 + path1.risk_factor * planner_risk_adversion)
< path2.cost * (1.0 + path2.risk_factor * planner_risk_adversion).
That way, if you set planner_risk_adversion to 0.0, then the planner
does as it does today, i.e. takes big risks.
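
Purely as a sketch of the shape of that comparison (the struct, the GUC
default and the numbers below are all invented, and it ignores
everything else add_path() really has to care about, such as startup
cost and pathkeys), it might look like:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the real planner structures and GUC. */
typedef struct Path
{
    double      cost;           /* total_cost in the real Path */
    int         risk_factor;    /* +1 per "blind" selectivity multiplication */
} Path;

static double planner_risk_adversion = 0.5; /* 0.0 keeps today's behaviour */

/*
 * Risk-adjusted "is path1 cheaper" test: each path's raw cost is
 * inflated in proportion to how much guessing went into its estimates.
 * With planner_risk_adversion = 0.0 this degenerates to the plain cost
 * comparison we do today.
 */
static bool
path_cheaper(const Path *p1, const Path *p2)
{
    double      c1 = p1->cost * (1.0 + p1->risk_factor * planner_risk_adversion);
    double      c2 = p2->cost * (1.0 + p2->risk_factor * planner_risk_adversion);

    return c1 < c2;
}

int
main(void)
{
    /* Unparameterized nestloop: marginally cheaper, built on guesses. */
    Path        nestloop = {.cost = 100000.0, .risk_factor = 3};
    /* Hash join: slightly dearer on paper, estimates backed by stats. */
    Path        hashjoin = {.cost = 100200.0, .risk_factor = 0};

    printf("adversion %.1f: keep %s\n", planner_risk_adversion,
           path_cheaper(&nestloop, &hashjoin) ? "nestloop" : "hashjoin");

    planner_risk_adversion = 0.0;   /* today's behaviour */
    printf("adversion %.1f: keep %s\n", planner_risk_adversion,
           path_cheaper(&nestloop, &hashjoin) ? "nestloop" : "hashjoin");
    return 0;
}

With the GUC at 0.0 the risk term vanishes and the marginally cheaper
but guess-laden nested loop still wins; even a small positive setting
lets the better-grounded hash join path survive instead.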

The problem I have with this idea is that I really don't know how to
properly calculate what the risk_factor should be set to.  It seems
easy at first to set it to something that has the planner avoid these
silly 1-row estimate nested loop mistakes, but I think what we'd set
the risk_factor to would become much more important once more and more
Path types start using it. So if we did this and just guessed the
risk_factor, that might be fine while only one of the paths being
compared had a non-zero risk_factor; but as soon as both paths have one
set, then unless the values are actually sensible we just end up
comparing garbage costs to garbage costs.

Another common mistake the planner makes is around WHERE a = <value>
ORDER BY b LIMIT n; where there are separate indexes on (a) and (b).
Scanning the (b) index is pretty risky. All the "a" values you need
might be right at the end of the index. It seems safer to scan the (a)
index, as we'd likely have statistics that tell us how many rows exist
with that value. We don't have any stats that tell us where in the (b)
index all the rows with a = <value> are.

I don't really think we should solve this by having nodeNestloop.c

Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 12:31 PM Robert Haas  wrote:
> Yes, I think it is. Reading the paper really helped me crystallize my
> thoughts about this, because when I've studied the problems myself, I
> came, as you postulate here, to the conclusion that there's a lot of
> stuff the planner does where there is risk and uncertainty, and thus
> that a general framework would be necessary to deal with it.

It is an example (perhaps the only example in the optimizer) of an
oasis of certainty in an ocean of uncertainty. As uncertain as
everything is, we seemingly can make strong robust statements about
the relative merits of each strategy *in general*, just in this
particular instance. It's just not reasonable to make such a reckless
choice, no matter what your general risk tolerance is.

Goetz Graefe is interviewed here, and goes into his philosophy on
robustness -- it seems really interesting to me:

https://sigmodrecord.org/publications/sigmodRecord/2009/pdfs/05_Profiles_Graefe.pdf

> In defense of that approach, note that this is a
> case where we know both that the Nested Loop is risky and that Hash
> Join is a similar alternative with probably similar cost. I am not
> sure there are any other cases where we can say quite so generally
> both that a certain thing is risky and what we could do instead.

I tend to think of a hash join as a nested loop join with an inner
index scan where you build the index yourself, dynamically. That might
be why I find it easy to make this mental leap. In theory you could do
this by giving the nestloop join runtime smarts -- make it turn into a
hash join adaptively, like Graefe's G-Join design. That way you could
do this in a theoretically pure way.
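
To spell out that mental model (this is a throwaway toy, nothing like
nodeHashjoin.c; the arrays, names and bucket count are all invented),
an equi-join written this way is literally a nested loop whose inner
"index" is a hash table built on the fly:

#include <stdio.h>

#define NBUCKETS 64

typedef struct HashEntry
{
    int         value;
    struct HashEntry *next;
} HashEntry;

int
main(void)
{
    int         inner[] = {42, 7, 13, 42};
    int         outer[] = {1, 42, 7, 99, 42};
    HashEntry   pool[4];
    HashEntry  *buckets[NBUCKETS] = {0};

    /* "Build the index yourself, dynamically": hash the inner side once. */
    for (int i = 0; i < 4; i++)
    {
        int         b = (unsigned) inner[i] % NBUCKETS;

        pool[i].value = inner[i];
        pool[i].next = buckets[b];
        buckets[b] = &pool[i];
    }

    /* The "nested loop": one probe of the self-built index per outer row. */
    for (int i = 0; i < 5; i++)
    {
        for (HashEntry *e = buckets[(unsigned) outer[i] % NBUCKETS];
             e != NULL; e = e->next)
        {
            if (e->value == outer[i])
                printf("match: %d\n", outer[i]);
        }
    }
    return 0;
}

The probe loop only becomes different from an inner index scan once you
ask where the index came from, which is why adaptively morphing one
join method into the other is less outlandish than it might sound.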

I don't think that that's actually necessary just to deal with this
case -- it probably really is as simple as it seems. I point this out
because perhaps it's useful to have that theoretical anchoring.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Robert Haas
On Tue, Jun 15, 2021 at 2:04 PM Peter Geoghegan  wrote:
> I guess that there is a hesitation to not introduce heuristics like
> this because it doesn't fit into some larger framework that captures
> risk -- it might be seen as an ugly special case. But isn't this
> already actually kind of special, whether or not we officially think
> so?

Yes, I think it is. Reading the paper really helped me crystallize my
thoughts about this, because when I've studied the problems myself, I
came, as you postulate here, to the conclusion that there's a lot of
stuff the planner does where there is risk and uncertainty, and thus
that a general framework would be necessary to deal with it. But the
fact that an academic researcher called this problem out as the only
one worth treating specially makes me think that perhaps it deserves
special handling. In defense of that approach, note that this is a
case where we know both that the Nested Loop is risky and that Hash
Join is a similar alternative with probably similar cost. I am not
sure there are any other cases where we can say quite so generally
both that a certain thing is risky and what we could do instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 10:09 AM Robert Haas  wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

I think that it's probably a good idea as-is.

Simple heuristics that are very frequently wrong when considered in a
naive way can work very well in practice. This seems to happen when
they capture some kind of extreme naturally occurring cost/benefit
asymmetry -- especially one with fixed, well-understood costs and
unlimited benefits (this business with unparameterized nestloop joins
is about *avoiding* the inverse asymmetry, but that seems very
similar). My go-to example of such an asymmetry is the rightmost page
split heuristic of applying leaf fillfactor regardless of any of the
other specifics; we effectively assume that all indexes are on columns
with ever-increasing values, which is obviously wrong.

We're choosing between two alternatives (unparameterized nested loop
vs hash join) that are really very similar when things go as expected,
but diverge sharply when there is a misestimation -- who wouldn't take
the "conservative" choice here?

I guess that there is a hesitation to not introduce heuristics like
this because it doesn't fit into some larger framework that captures
risk -- it might be seen as an ugly special case. But isn't this
already actually kind of special, whether or not we officially think
so?

-- 
Peter Geoghegan