Re: [PERFORM] OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices
Gunther wrote: > But there > is one thing that keeps bothering me both with Oracle and PgSQL. And > that is the preference for Nested Loops. [...] > But the issue is bulk searches, reports, and any analytic queries > scenarios. In those queries Nested Loops are almost always a bad choice, > even if there is an index. In over 20 years of working with RDBMs this > has been my unfailing heuristics. A report runs slow? Look at plan, is > there a Nested Loop? Yes? Squash it! And the report runs 10x faster > instantaneously. [...] > If you can set enable_nestloop off and the Hash Join is chosen and the > performance goes from 1 hour of 100% CPU to 10 seconds completion time, > then something is deadly wrong. [...] > The point is that Nested Loops should never be chosen except in index > lookup situations or may be memory constraints. > > How can I prevent it on a query by query scope? I cannot set > enable_nestloop = off because one query will be for a full report, wile > another one might have indexed constraints running in the same session, > and I don't want to manage side effects and remember to set > enable_nestloop parameter on and off. > > There must be a way to tell the optimizer to penalize nested loops to > make them the last resort. In Oracle there are those infamous hints, but > they don't always work either (or it is easy to make mistakes that you > get no feedback about). > > Is there any chance PgSQL can get something like a hint feature? PostgreSQL doesn't have a way to tell if a query is an OLAP query running against a star schema or a regular OLTP query, it will treat both in the same fashion. I also have had to deal with wrongly chosen nested loop joins, and testing a query with "enable_nestloop=off" is one of the first things to try in my experience. However, it is not true that PostgreSQL "perfers nested loops". Sometimes a nested loop join is the only sane and efficient way to process a query, and removing that capability would be just as bad a disaster as you are experiencing with your OLAP queries. Bad choices are almost always caused by bad estimates. Granted, there is no way that estimates can ever be perfect. So what could be done? One pragmatic solution would be to wrap every query that you know to be an OLAP query with BEGIN; SET LOCAL enable_nestloop=off; SELECT ... COMMIT; Looking deeper, I would say that wrongly chosen nested loop joins often come from an underestimate that is close to zero. PostgreSQL already clamps row count estimates to 1, that is, it will choose an estimate of 1 whenever it thinks fewer rows will be returned. Perhaps using a higher clamp like 2 would get rid of many of your problems, but it is a difficult gamble as it will also prevent some nested loop joins that would have been the best solution. Finally, even though the official line of PostgreSQL is to *not* have query hints, and for a number of good reasons, this is far from being an unanimous decision. The scales may tip at some point, though I personally hope that this point is not too close. Yours, Laurenz Albe -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices
Thanks for your support Laurent. I have an idea on one thing you said: Just adding to your voice. I recently experienced the same issue with a complex multi-table view, including pivots, and was surprised to see all the nested loops everywhere and here is the clue for me: in spite of indices being available. I would say that sometimes indexes are detrimental. If you don't need them for other reasons, you might want to not have them. And without the index, the Nested Loop strategy might not be chosen. But that is a side-issue, because it can often not be avoided. Just saying in case it might help. I also found the opposite now. In the query that made me "blow the lid" and "complain" here, my team decided to add an index and that did not get rid of Nested Loops but at least made the inner table access indexed rather than a table scan and the performance ended up OK. But it's not always predictable, and these indexes could trap the planner into sub-optimal solutions still. I think there is an opportunity for a PgSQL query plan extension, especially wen dealing with CTE (WITH-clauses), PgSQL could make them a temporary table and add indexes that it needs for it on the fly, because after it has done one pass over the inner loop sequential scan it knows perfectly well how many rows it has, and knowing how many more iterations are coming from the sub-query that's driving the Nested Loop, it could decide that it's much faster to put an index on the nested relation, temporarily materialized. Or it could even decide to change it's plan mid-way and do the Hash Join. This is why I had always dreamed that the PgSQL optimizer had some easy API where one could plug in experimental strategies. I personally am extremely efficient with XSLT for complex intelligent algorithms, and I dream of a PgSQL query plan structure exposed as XML which an XSLT plugin could then process to edit the plan. People could experiment with awesome intelligent new strategies based on statistics gathered along the way of the execution. regards, -Gunther -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices
Thanks you for your thoughtful reply, Laurenz (funny that the people interested in this topic are named Laurent and Laurenz :) PostgreSQL doesn't have a way to tell if a query is an OLAP query running against a star schema or a regular OLTP query, it will treat both in the same fashion. right, of course, and I would not want to go down that road. There OLAP vs. OLTP are not just two cut and dry options, and neither is "star schema" but one way in which to lay out a simple data model. The real world is always more complex than such cut and dry choices However, it is not true that PostgreSQL "perfers nested loops". Sometimes a nested loop join is the only sane and efficient way to process a query ... of course, it's not preferring NLs deliberately, but it happens awfully often (and not just with PgSQL, same problems I have had with Oracle over the years). Bad choices are almost always caused by bad estimates. Granted, there is no way that estimates can ever be perfect. ... Looking deeper, I would say that wrongly chosen nested loop joins often come from an underestimate that is close to zero. PostgreSQL already clamps row count estimates to 1, that is, it will choose an estimate of 1 whenever it thinks fewer rows will be returned. Perhaps using a higher clamp like 2 would get rid of many of your problems, but it is a difficult gamble as it will also prevent some nested loop joins that would have been the best solution. Wow, that is very interesting! Are you saying that if PgSQL can't know what the cardinality is, it assumes a default of 1? That would be very slanted a guess. I would think a couple of hundred would be more appropriate, or 10% of the average of the base tables for which it does have statistics. I would wonder if changing 1 to 2 would make much difference, as Seq Search over 1 to 10 tuples should generally be better than any other approach, as long as the 1-10 tuples are already readily available. Finally, even though the official line of PostgreSQL is to *not* have query hints, and for a number of good reasons, this is far from being an unanimous decision. The scales may tip at some point, though I personally hope that this point is not too close. I am glad to hear that hints are not completely ruled out by the development team. Definitely Oracle hints are painful and should not be replicated as is. Butmay be I can nudge your (and others') personal tastes with the following. You suggested this: One pragmatic solution would be to wrap every query that you know to be an OLAP query with BEGIN; SET LOCAL enable_nestloop=off; SELECT ... COMMIT; I would also like to put the set enable_nestloop = false statement into a combined statement, but when I do it in a transaction like you showed, it would not work for a normal PreparedStatement just expecting a ResultSet, or at least I haven't been able to make that work. In my Aqua Data Studio, if I put the set statement before the select statement, the combined statement doesn't return any results. May be I am doing something wrong. If there is a way, then I would ave what I need. If not, I think it might be an easy thing to add. We already have different scopes of these optimizer parameters like enable_nestloop 1. the system wide scope 2. a session wide scope and I see no reason why one could not just add a non-disruptive syntax form to change these parameters on a statement-wide scope. By all means in a comment. Why not --! set enable_nestloop = false --! set work_mem = '20 MB' SELECT * FROM ; something like that. It would not be a big deal, no completely new obscure hint syntax. And may be, if that is possible so far, then why not add a CTE scope as well: WITH Foo AS ( --! set enable_nestloop = false SELECT * FROM ... INNER JOIN ... INNER JOIN ... INNER JOIN ... ... ) , Bar AS ( SELECT * FROM Foo INNER JOIN IndexedTable USING(a, b, c) ) SELECT * FROM Bar ... ; this would keep the nestloop off for the CTE Foo with that complex join but allow it to be used for the CTE Bar or the ultimate query. I think these features should be relatively easy to add without causing SQL compatibility issue and also not opening a can of worms with obscure hint features that need a lot of work to implement correctly. But while we are at dreaming up solution, I think materialized indexed sub-plans would also be a nice ting, especially when dealing with CTEs. This could be controlled manually to begin with: WITH Foo AS ( --! set enable_nestloop = false SELECT * FROM ... INNER JOIN ... INNER JOIN ... INNER JOIN ... ... ) MATERIALIZE INDEX ON(a, b, c) , Bar AS ( SELECT * FROM Foo INNER JOIN IndexedTable USING(a, b, c) ) SELECT * FROM Bar ... ; And of course if we don't want to disturb SQL syntax, the "materialize index on ..." clause could be in a --! comment. But then, to dream on, PgSQL could make sub-query plans a temporary table and add indexes that it needs for it on the fly, because a
[PERFORM] Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices
Hello, may I suggest you to look at https://github.com/ossc-db/pg_hint_plan that mimics Oracle hints syntax Regards PAscal -- Sent from: http://www.postgresql-archive.org/PostgreSQL-performance-f2050081.html -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance