[PERFORM] Cost estimate vs. actual - do I care?
I gather that a big part of making queries performant is making sure the planner's estimates reflect reality. Given a random explain analyze line: Limit (cost=0.02..943.11 rows=50 width=16) (actual time=0.603..79.729 rows=50 loops=1) which is the truer statement? 1. As long as costs go up with actual time, you're fine. 2. You should try to ensure that costs go up linearly with actual time. 3. You should try to ensure that costs are as close as possible to actual time. 4. The number 4. Jay Levitt -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Subquery in a JOIN not getting restricted?
Tom Lane wrote: Jay Levittjay.lev...@gmail.com writes: If the query was more like select questions.id from questions join ( select sum(u.id) from users as u group by u.id ) as s on s.id = questions.user_id where questions.id = 1; would you no longer be surprised that it scanned all user rows? I'd suggest rephrasing the query to do the join underneath the GROUP BY. Well, my real goal is to have that inner query in a set-returning function that gives a computed table of other users relative to the current user, and then be able to JOIN that with other things and ORDER BY it: select questions.id from questions join (select * from relevance(current_user)) as r on r.id = questions.user_id where questions.id = 1; I assume there's no way for that function (in SQL or PL/pgSQL) to reach to the upper node and say do that join again here, or force the join order from down below? I can't imagine how there could be, but never hurts to ask. Right now, our workaround is to pass the joined target user as a function parameter and do the JOIN in the function, but that means we have to put the function in the select list, else we hit the lack of LATERAL support: -- This would need LATERAL select questions.id from questions join ( select * from relevance(current_user, questions.user_id)) as r ) on r.id = questions.user_id where questions.id = 1; -- This works but has lots of row-at-a-time overhead select questions.id, ( select * from relevance(current_user, questions.user_id) ) as r from questions where questions.id = 1; Again, just checking if there's a solution I'm missing. I know the optimizer is only asymptotically approaching optimal! Jay -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Subquery in a JOIN not getting restricted?
Kevin Grittner wrote: Merlin Moncuremmonc...@gmail.com wrote: Well, this may not fit the OP's 'real' query Right, if I recall correctly, the OP said it was simplified down as far as it could be and still have the issue show. but the inner subquery is probably better written as a semi-join (WHERE EXISTS). Kevin's right. The real query involves several SQL and PL/pgsql functions (all now inlineable), custom aggregates, a union or two and a small coyote. I could post it, but that feels like Please write my code for me. Still, if you really want to... Meanwhile, it's good for me to learn how the planner sees my queries and how I can best state them. I assume this is me not understanding something about restrictions across group-by nodes. If the query was more like select questions.id from questions join ( select sum(u.id) from users as u group by u.id ) as s on s.id = questions.user_id where questions.id = 1; would you no longer be surprised that it scanned all user rows? I.E. is the group by a red herring, which usually wouldn't be present without an aggregate, and the real problem is that the planner can't restrict aggregates? This comment in planagg.c may be relevant; I'm not doing min/max, but is it still true that GROUP BY always looks at all the rows, period? void preprocess_minmax_aggregates(PlannerInfo *root, List *tlist) ... /* We don't handle GROUP BY or windowing, because our current * implementations of grouping require looking at all the rows anyway, */ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Subquery in a JOIN not getting restricted?
Don't hold your breath waiting for that to change. To do what you're wishing for, we'd have to treat the GROUP BY subquery as if it were an inner indexscan, and push a join condition into it. That's not even possible today. Thanks! Knowing that's not a thing helps; we'll just have to rephrase the query. Jay -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Subquery in a JOIN not getting restricted?
Kevin Grittner wrote: Jay Levittjay.lev...@gmail.com wrote: I don't get why the GROUP BY in this subquery forces it to scan the entire users table (seq scan here, index scan on a larger table) when there's only one row in users that can match: Are you sure there's a plan significantly faster than 1.3 ms? Yep! Watch this: drop schema if exists jaytest cascade; create schema jaytest; set search_path to jaytest; create table questions ( id int not null primary key, user_id int not null ); insert into questions select generate_series(1,1100), (random()*200)::int; create table users ( id int not null primary key ); insert into users select generate_series(1, 200); vacuum freeze analyze; explain analyze select questions.id from questions join ( select u.id from users as u group by u.id ) as s on s.id = questions.user_id where questions.id = 1; --- Merge Join (cost=8.28..90833.02 rows=1818 width=4) (actual time=888.787..888.790 rows=1 loops=1) Merge Cond: (u.id = questions.user_id) - Group (cost=0.00..65797.47 rows=200 width=4) (actual time=0.017..735.509 rows=1747305 loops=1) - Index Scan using users_pkey on users u (cost=0.00..60797.47 rows=200 width=4) (actual time=0.015..331.990 rows=1747305 loops=1) - Materialize (cost=8.28..8.29 rows=1 width=8) (actual time=0.013..0.015 rows=1 loops=1) - Sort (cost=8.28..8.28 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=1) Sort Key: questions.user_id Sort Method: quicksort Memory: 25kB - Index Scan using questions_pkey on questions (cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1) Index Cond: (id = 1) Total runtime: 888.832 ms (11 rows) explain analyze select questions.id from questions join ( select u.id from users as u ) as s on s.id = questions.user_id where questions.id = 1; --- Nested Loop (cost=0.00..16.77 rows=1 width=4) (actual time=0.019..0.021 rows=1 loops=1) - Index Scan using questions_pkey on questions (cost=0.00..8.27 rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1) Index Cond: (id = 1) - Index Scan using users_pkey on users u (cost=0.00..8.49 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1) Index Cond: (u.id = questions.user_id) Total runtime: 0.045 ms (6 rows) That said, there might be some room for an optimization which pushes that test into the query with the group by clause. I don't know if there's a problem with that which I'm missing, the construct was judged to be too rare to be worth the cost of testing for it, or it's just that nobody has yet gotten to it. Anyone have more insights on whether this is hard to optimize or simply not-yet-optimized? And if the latter, where might I start looking? (Not that you -really- want me to submit a patch; my C has regressed to the try an ampersand. OK, try an asterisk. level...) Jay -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Predicates not getting pushed into SQL function?
Tom Lane wrote: Please don't send HTML-only email to these lists. Oops - new mail client, sorry. Anyway, the answer seems to be that inline_set_returning_function needs some work to handle cases with declared OUT parameters. I will see about fixing that going forward, but in existing releases what you need to do is declare the function as returning SETOF some named composite type Yes, that patch works great! Oddly enough, the workaround now does NOT work; functions returning SETOF named composite types don't get inlined, but functions returning the equivalent TABLE do get inlined. Let me know if you need a failcase, but the bug doesn't actually affect me now :) Jay create type matcher_result as (user_id int, match int); create or replace function matcher() returns setof matcher_result as ... -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Predicates not getting pushed into SQL function?
Jay Levitt wrote: Yes, that patch works great! Oddly enough, the workaround now does NOT work; functions returning SETOF named composite types don't get inlined, but functions returning the equivalent TABLE do get inlined. Let me know if you need a failcase, but the bug doesn't actually affect me now :) Never mind... I left a strict in my test. Works great all around. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Subquery in a JOIN not getting restricted?
When I run the following query: select questions.id from questions join ( select u.id as user_id from users as u left join scores as s on s.user_id = u.id ) as subquery on subquery.user_id = questions.user_id; the subquery is scanning my entire user table, even though it's restricted by the outer query. (My real subquery is much more complicated, of course, but this is the minimal fail case.) Is this just not a thing the optimizer can do? Are there ways to rewrite this, still as a subquery, that will be smart enough to only produce the one row of subquery that matches questions.user_id? Jay Levitt -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Subquery in a JOIN not getting restricted?
Kevin Grittner wrote: Kevin Grittnerkevin.gritt...@wicourts.gov wrote: If I had made the scores table wider, it might have gone from the user table to scores on the index. Bah. I just forgot to put an index on scores.user_id. With that index available it did what you were probably expecting -- seq scan on questions, nested loop index scan on users, nested loop index scan on scores. You weren't running you test with just a few rows in each table and expecting the same plan to be generated as for tables with a lot of rows, were you? No, we're a startup - we only have 2,000 users and 17,000 scores! We don't need test databases yet... But I just realized something I'd completely forgot (or blocked) - scores is a view. And views don't have indexes. The underlying tables are ultimately indexed by user_id, but I can believe that Postgres doesn't think that's a cheap way to do it - especially since we're still using stock tuning settings (I know) so its costs are all screwed up. And yep! When I do a CREATE TABLE AS from that view, and add an index on user_id, it works just as I'd like. I've been meaning to persist that view anyway, so that's what I'll do. Thanks for the push in the right direction.. Jay -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Subquery in a JOIN not getting restricted?
Jay Levitt wrote: And yep! When I do a CREATE TABLE AS from that view, and add an index on user_id, it works just as I'd like. Or not. Feel free to kick me back over to pgsql-novice, but I don't get why the GROUP BY in this subquery forces it to scan the entire users table (seq scan here, index scan on a larger table) when there's only one row in users that can match: create table questions ( id int not null primary key, user_id int not null ); insert into questions select generate_series(1,1100), (random()*2000)::int; create table users ( id int not null primary key ); insert into users select generate_series(1, 2000); vacuum freeze analyze; explain analyze select questions.id from questions join ( select u.id from users as u group by u.id ) as s on s.id = questions.user_id where questions.id = 1; Hash Join (cost=42.28..89.80 rows=2 width=4) (actual time=0.857..1.208 rows=1 loops=1) Hash Cond: (u.id = questions.user_id) - HashAggregate (cost=34.00..54.00 rows=2000 width=4) (actual time=0.763..1.005 rows=2000 loops=1) - Seq Scan on users u (cost=0.00..29.00 rows=2000 width=4) (actual time=0.003..0.160 rows=2000 loops=1) - Hash (cost=8.27..8.27 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB - Index Scan using questions_pkey on questions (cost=0.00..8.27 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=1) Index Cond: (id = 1) Total runtime: 1.262 ms This is on patched 9.0.5 built earlier today. The real query has aggregates, so it really does need GROUP BY.. I think.. Jay -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Predicates not getting pushed into SQL function?
I'm confused. I have a now-trivial SQL function that, unrestricted, would scan my whole users table. When I paste the body of the function as a subquery and restrict it to one row, it only produces one row. When I paste the body of the function into a view and restrict it to one row, it produces one row. But when I put it in a SQL function... it scans the whole users table and then throws the other rows away. I thought SQL functions were generally inline-able, push-down-able, etc. As a workaround, I can put my WHERE clause inside the function and pass it parameters, but that feels ugly, and it won't help for things like resticting via JOINs. The real function needs parameters, so I can't use it as a view. Are there better workarounds? I suspect the problem is (something like) the planner doesn't realize the function will produce a variable number of rows; I can specify COST or ROWS, but they're both fixed values. Pretty-printed function and explain analyze results: https://gist.github.com/1336963 In ASCII for web-haters and posterity: -- THE OVERLY SIMPLIFIED FUNCTION create or replace function matcher() returns table(user_id int, match int) as $$ select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my; $$ language sql stable; -- WHEN I CALL IT AS A FUNCTION select * from matcher() where user_id = 2; LOG: duration: 1.242 ms plan: Query Text: select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my; Nested Loop (cost=0.00..118.39 rows=1656 width=4) (actual time=0.022..0.888 rows=1613 loops=1) Output: u.id, 1 - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=0) (actual time=0.013..0.015 rows=1 loops=1) Index Cond: (u.id = 1) - Seq Scan on public.users u (cost=0.00..93.56 rows=1656 width=4) (actual time=0.004..0.479 rows=1613 loops=1) Output: u.id CONTEXT: SQL function matcher statement 1 LOG: duration: 1.951 ms plan: Query Text: select * from matcher() where user_id = 2; Function Scan on public.matcher (cost=0.25..12.75 rows=5 width=8) (actual time=1.687..1.940 rows=1 loops=1) Output: user_id, match Filter: (matcher.user_id = 2) -- WHEN I CALL IT AS A SUBQUERY select * from ( select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my ) as matcher where user_id = 2; LOG: duration: 0.044 ms plan: Query Text: select * from ( select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my ) as matcher where user_id = 2; Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual time=0.028..0.031 rows=1 loops=1) Output: u.id, 1 - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) Output: u.id Index Cond: (u.id = 2) - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=0) (actual time=0.004..0.006 rows=1 loops=1) Index Cond: (u.id = 1) -- WHEN I CALL IT AS A VIEW create view matchview as select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my; select * from matchview where user_id = 2; LOG: duration: 0.044 ms plan: Query Text: select * from matchview where user_id = 2; Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual time=0.028..0.031 rows=1 loops=1) Output: u.id, 1 - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) Output: u.id Index Cond: (u.id = 2) - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=0) (actual time=0.005..0.007 rows=1 loops=1) Index Cond: (u.id = 1) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Predicates not getting pushed into SQL function?
What other info can I provide? id is int, gender is varchar(255), and it's happening on 9.0.4... Tom Lane November 3, 2011 2:41 PM Jay Levitt jay.lev...@gmail.com writes: I'm confused. I have a now-trivial SQL function that, unrestricted, would scan my whole users table. When I paste the body of the function as a subquery and restrict it to one row, it only produces one row. When I paste the body of the function into a view and restrict it to one row, it produces one row. But when I put it in a SQL function... it scans the whole users table and then throws the other rows away. I thought SQL functions were generally inline-able, push-down-able, etc. inline-able, yes, but if they're not inlined you don't get any such thing as pushdown of external conditions into the function body. A non-inlined function is a black box. The interesting question here is why the function doesn't get inlined into the calling query. You got the obvious showstoppers: it has a SETOF result, it's not volatile, nor strict. The only other possibility I can see offhand is that there's some sort of result datatype mismatch, but you've not provided enough info to be sure about that. regards, tom lane Jay Levitt November 3, 2011 1:47 PM I'm confused. I have a now-trivial SQL function that, unrestricted, would scan my whole users table. When I paste the body of the function as a subquery and restrict it to one row, it only produces one row. When I paste the body of the function into a view and restrict it to one row, it produces one row. But when I put it in a SQL function... it scans the whole users table and then throws the other rows away. I thought SQL functions were generally inline-able, push-down-able, etc. As a workaround, I can put my WHERE clause inside the function and pass it parameters, but that feels ugly, and it won't help for things like resticting via JOINs. The real function needs parameters, so I can't use it as a view. Are there better workarounds? I suspect the problem is (something like) the planner doesn't realize the function will produce a variable number of rows; I can specify COST or ROWS, but they're both fixed values. Pretty-printed function and explain analyze results: https://gist.github.com/1336963 In ASCII for web-haters and posterity: -- THE OVERLY SIMPLIFIED FUNCTION create or replace function matcher() returns table(user_id int, match int) as $$ select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my; $$ language sql stable; -- WHEN I CALL IT AS A FUNCTION select * from matcher() where user_id = 2; LOG: duration: 1.242 ms plan: Query Text: select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my; Nested Loop (cost=0.00..118.39 rows=1656 width=4) (actual time=0.022..0.888 rows=1613 loops=1) Output: u.id, 1 - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=0) (actual time=0.013..0.015 rows=1 loops=1) Index Cond: (u.id = 1) - Seq Scan on public.users u (cost=0.00..93.56 rows=1656 width=4) (actual time=0.004..0.479 rows=1613 loops=1) Output: u.id CONTEXT: SQL function "matcher" statement 1 LOG: duration: 1.951 ms plan: Query Text: select * from matcher() where user_id = 2; Function Scan on public.matcher (cost=0.25..12.75 rows=5 width=8) (actual time=1.687..1.940 rows=1 loops=1) Output: user_id, match Filter: (matcher.user_id = 2) -- WHEN I CALL IT AS A SUBQUERY select * from ( select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my ) as matcher where user_id = 2; LOG: duration: 0.044 ms plan: Query Text: select * from ( select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my ) as matcher where user_id = 2; Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual time=0.028..0.031 rows=1 loops=1) Output: u.id, 1 - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) Output: u.id Index Cond: (u.id = 2) - Index Scan using users_pkey on public.users u (cost=0.00..8.27 rows=1 width=0) (actual time=0.004..0.006 rows=1 loops=1) Index Cond: (u.id = 1) -- WHEN I CALL IT AS A VIEW create view matchview as select o.user_id, 1 as match from ( select u.id as user_id, u.gender from users as u ) as o cross join ( select u.id as user_id, u.gender from users as u where u.id = 1 ) as my; select * from matchview where user_id = 2; LOG: duration
[PERFORM] Guide to PG's capabilities for inlining, predicate hoisting, flattening, etc?
As I've come up to speed on SQL and PostgreSQL with some medium-complexity queries, I've asked a few questions about what the optimizer will do in various situations. I'm not talking about the seq-scan-vs-index type of optimizing; I mean transforming within the relational calculus (algebra?) to an equivalent but more performant query. The same topics come up: - Flattening. I think that means Merge the intent of the subquery into the various clauses of the parent query. - Inlining. That's Don't run this function/subquery/view as an atomic unit; instead, push it up into the parent query so the optimizer can see it all at once. Maybe that's the same as flattening. - Predicate pushdown. That's This subquery produces a lot of rows, but the parent query has a WHERE clause that will eliminate half of them, so don't produce the unnecessary rows. Am I right so far? Now, the big question, which I haven't seen documented anywhere: Under what circumstances can the optimizer do each of these things? For instance, I have a complex query that calculates the similarity of one user to every other user. The output is two columns, one row per user: select * from similarity(my_user_id); other_user | similarity% ---|- 123 | 99 Being a novice at SQL, I first wrote it in PL/pgSQL, so I could stay in my imperative, iterative head. The query performed decently well when scanning the whole table, but when I only wanted to compare myself to a single user, I said: select * from similarity(my_user_id) as s where s.other_user = 321; And, of course, similarity() produced the whole table anyway, because predicates don't get pushed down into PL/pgSQL functions. So I went and rewrote similarity as a SQL function, but I still didn't want one big hairy SQL query. Ah ha! CTEs let you write modular subqueries, and you also avoid problems with lack of LATERAL. I'll use those. .. But of course predicates don't get pushed into CTEs, either. (Or maybe it was that they would, but only if they were inline with the predicate.. I forget now.) So you can see where I'm going. I know if I break everything into elegant, composable functions, it'll continue to perform poorly. If I write one big hairy, it'll perform great but it will be difficult to maintain, and it will be inelegant and a kitten will die. My tools are CTEs, subqueries, aliases, SQL functions, PL/pgSQL functions, and views (and other tools?) What optimizations do each of those prevent? We're on 9.0 now but will happily upgrade to 9.1 if that matters. Jay Levitt -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance