Re: [HACKERS] Weirdly pesimistic estimates in optimizer
I wrote:
> I chewed on this for awhile and decided that there'd be no real harm in
> taking identification of the unique expressions out of
> create_unique_path() and doing it earlier, in initsplan.c; we'd need a
> couple more fields in SpecialJoinInfo but that doesn't seem like a
> problem.  However, rel->rows is a *big* problem; we simply have not made
> any join size estimates yet, and can't, because these things are done
> bottom up.

> However ... estimate_num_groups's dependency on its rowcount input is not
> large (it's basically using it as a clamp).  So conceivably we could have
> get_loop_count just multiply together the sizes of the base relations
> included in the semijoin's RHS to get a preliminary estimate of that
> number.  This would be the right thing anyway for a single relation in the
> RHS, which is the most common case.  It would usually be an overestimate
> for a join RHS, but we could hope that the output of estimate_num_groups
> wouldn't be affected too badly.

Attached is a draft patch that does those two things.  I like the first
part (replacing SpecialJoinInfo's rather ad-hoc join_quals field with
something more explicitly attuned to semijoin uniqueness processing).
The second part is still pretty much of a kluge, but then get_loop_count
was a kluge already.  This arguably makes it better.

Now, on the test case you presented, this has the unfortunate effect that
it now reliably chooses the "wrong" plan for both cases :-(.  But I think
that's a reflection of poor cost parameters (ie, the test case fits
handily in RAM, but we've not set the cost parameters to reflect that).
We do get the same rowcount and roughly-same cost estimates for both the
random_fk_dupl and random_fk_uniq queries, so from that standpoint it's
doing the right thing.  If I reduce random_page_cost to 2 or so, it makes
the choices you wanted.
			regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9fe8008..9f65d66 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copySpecialJoinInfo(const SpecialJoinIn
*** 1942,1948
  	COPY_SCALAR_FIELD(jointype);
  	COPY_SCALAR_FIELD(lhs_strict);
  	COPY_SCALAR_FIELD(delay_upper_joins);
! 	COPY_NODE_FIELD(join_quals);

  	return newnode;
  }
--- 1942,1951
  	COPY_SCALAR_FIELD(jointype);
  	COPY_SCALAR_FIELD(lhs_strict);
  	COPY_SCALAR_FIELD(delay_upper_joins);
! 	COPY_SCALAR_FIELD(semi_can_btree);
! 	COPY_SCALAR_FIELD(semi_can_hash);
! 	COPY_NODE_FIELD(semi_operators);
! 	COPY_NODE_FIELD(semi_rhs_exprs);

  	return newnode;
  }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index fe509b0..fd876fb 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*** _equalSpecialJoinInfo(const SpecialJoinI
*** 798,804
  	COMPARE_SCALAR_FIELD(jointype);
  	COMPARE_SCALAR_FIELD(lhs_strict);
  	COMPARE_SCALAR_FIELD(delay_upper_joins);
! 	COMPARE_NODE_FIELD(join_quals);

  	return true;
  }
--- 798,807
  	COMPARE_SCALAR_FIELD(jointype);
  	COMPARE_SCALAR_FIELD(lhs_strict);
  	COMPARE_SCALAR_FIELD(delay_upper_joins);
! 	COMPARE_SCALAR_FIELD(semi_can_btree);
! 	COMPARE_SCALAR_FIELD(semi_can_hash);
! 	COMPARE_NODE_FIELD(semi_operators);
! 	COMPARE_NODE_FIELD(semi_rhs_exprs);

  	return true;
  }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 775f482..75c57a2 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outSpecialJoinInfo(StringInfo str, cons
*** 1945,1951
  	WRITE_ENUM_FIELD(jointype, JoinType);
  	WRITE_BOOL_FIELD(lhs_strict);
  	WRITE_BOOL_FIELD(delay_upper_joins);
! 	WRITE_NODE_FIELD(join_quals);
  }

 static void
--- 1945,1954
  	WRITE_ENUM_FIELD(jointype, JoinType);
  	WRITE_BOOL_FIELD(lhs_strict);
  	WRITE_BOOL_FIELD(delay_upper_joins);
! 	WRITE_BOOL_FIELD(semi_can_btree);
! 	WRITE_BOOL_FIELD(semi_can_hash);
! 	WRITE_NODE_FIELD(semi_operators);
! 	WRITE_NODE_FIELD(semi_rhs_exprs);
 }

 static void
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5a9daf0..1a0d358 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*** compute_semi_anti_join_factors(PlannerIn
*** 3294,3300
  	/* we don't bother trying to make the remaining fields valid */
  	norm_sjinfo.lhs_strict = false;
  	norm_sjinfo.delay_upper_joins = false;
! 	norm_sjinfo.join_quals = NIL;

  	nselec = clauselist_selectivity(root,
  									joinquals,
--- 3294,3303
  	/* we don't bother trying to make the remaining fields valid */
  	norm_sjinfo.lhs_strict = false;
  	norm_sjinfo.delay_upper_joins = false;
! 	norm_sjinfo.semi_can_btree = false;
! 	norm_sjinfo.semi_can_hash = false;
! 	norm_sjinfo.semi_operators = NIL;
! 	norm_sjinfo.semi_rhs_exprs = NIL;

  	nselec = clauselist_selectivity(root,
  									joinquals,
***
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
David Kubečka writes:
> There is division by loop_count because of predicted effect of caching and
> it is exactly this division which makes the run_cost for single index
> lookup so low compared with the query version with random_fk_uniq. So the
> main problem is that the planner calls cost_index with loop_count equal to
> number of rows of inner table, *but it ignores semi-join semantics*, i.e.
> it doesn't account for the number of *unique* rows in the inner table which
> will actually be the number of loops.

Good point.

> Since this is my first time looking into the optimizer (and in fact any
> postgres) code I am not yet able to locate the exact place where this
> should be repaired, but I hope that in few days I will have a patch :-)

Hm, this might not be the best problem to tackle for your first Postgres
patch :-(.  The information about the estimated number of unique rows
isn't readily available at the point where we're creating parameterized
paths.  When we do compute such an estimate, it's done like this:

    pathnode->path.rows = estimate_num_groups(root, uniq_exprs, rel->rows);

where uniq_exprs is a list of right-hand-side expressions we've determined
belong to the semijoin, and rel->rows represents the raw size of the
semijoin's RHS relation (which might itself be a join).  Neither of those
things is available to create_index_path.

I chewed on this for awhile and decided that there'd be no real harm in
taking identification of the unique expressions out of
create_unique_path() and doing it earlier, in initsplan.c; we'd need a
couple more fields in SpecialJoinInfo but that doesn't seem like a
problem.  However, rel->rows is a *big* problem; we simply have not made
any join size estimates yet, and can't, because these things are done
bottom up.

However ... estimate_num_groups's dependency on its rowcount input is not
large (it's basically using it as a clamp).  So conceivably we could have
get_loop_count just multiply together the sizes of the base relations
included in the semijoin's RHS to get a preliminary estimate of that
number.  This would be the right thing anyway for a single relation in the
RHS, which is the most common case.  It would usually be an overestimate
for a join RHS, but we could hope that the output of estimate_num_groups
wouldn't be affected too badly.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
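[Editorial aside: the clamp behavior Tom describes can be sketched with a toy model. This is illustrative Python only, not PostgreSQL code; the real estimate_num_groups (in selfuncs.c) is far more elaborate, and all numbers below are invented.]

```python
# Toy illustration of why an overestimated rowcount input is mostly
# harmless: the rowcount is essentially used to clamp the
# statistics-based distinct estimate downward, never to inflate it.

def toy_estimate_num_groups(ndistinct_from_stats, input_rows):
    # The per-column ndistinct estimate comes from statistics; the
    # rowcount input can only pull the result *down*.
    return min(ndistinct_from_stats, input_rows)

ndistinct = 100             # planner's ndistinct estimate for the RHS exprs

exact_rhs_rows = 9807       # true size of the semijoin's RHS
crude_overestimate = 98070  # product of base-rel sizes for a join RHS

# The crude overestimate leaves the group estimate unchanged:
print(toy_estimate_num_groups(ndistinct, exact_rhs_rows))      # 100
print(toy_estimate_num_groups(ndistinct, crude_overestimate))  # 100
```

Only when the rowcount input drops below the distinct estimate does it change the answer, which is why multiplying base relation sizes together is a tolerable stand-in for the not-yet-computed join size.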
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
On 3/5/15 7:58 PM, Jim Nasby wrote:
> This got answered on one of the other lists, right?

That was supposed to be off-list. I'll answer my own question: yes. Sorry
for the noise. :(
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
On 2/28/15 12:01 PM, David Kubečka wrote:
> With 'random_fk_dupl':
>    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> With 'random_fk_uniq':
>    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
>
> I have read the optimizer README file and also looked briefly at the
> code, but this seems to be something not related to particular
> implementation of algorithm (e.g. nested loop). Perhaps it's the way
> how cost estimates are propagated down (or sideways? that would be
> weird...) the query tree. But I am really not sure, since this is my
> first time looking at the optimizer code base. I should also add that
> I have reproduced this behaviour for all versions of Pg from 9.2 up
> to current devel.

This got answered on one of the other lists, right?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
Hi Tomas and others,

2015-03-02 21:29 GMT+01:00 Tomas Vondra:

> Hi David ;-)
>
> On 2.3.2015 20:19, David Kubečka wrote:
> >
> > The question is why optimizer, or rather the cost estimator,
> > produced so much different estimates upon very small change in input.
> > Moreover it seems that the main culprit of bad estimates isn't
> > actually directly related to outer table, but rather to inner table.
> > Just compare estimates for the two index scans:
> >
> > With 'random_fk_dupl':
> >    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> > rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> > With 'random_fk_uniq':
> >    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> > rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
> >
> > I have read the optimizer README file and also looked briefly at the
> > code, but this seems to be something not related to particular
> > implementation of algorithm (e.g. nested loop). Perhaps it's the way
> > how cost estimates are propagated down (or sideways? that would be
> > weird...) the query tree. But I am really not sure, since this is my
> > first time looking at the optimizer code base. I should also add that
> > I have reproduced this behaviour for all versions of Pg from 9.2 up
> > to current devel.
>
> Interesting. I've only spent a few minutes looking at this, but it
> certainly is a bit strange that using smaller table with unique values
> results in a slower plan than using a table with duplicities (but the
> same number of unique values).

Yep. And I think that I have found the cause of the strangeness.
I have traced the planner code, and this is how cost_index from
costsize.c is called when computing the cost of the index scan on the
outer (facts) table, when the inner table is the one with duplicate
entries (random_fk_dupl):

    cost_index (path=0x27f8fb8, root=0x27670e8, loop_count=9920)

The important thing here is the value of loop_count, which is the
(perfect estimate of the) number of rows of random_fk_dupl (I have
recreated the table, so it differs slightly from the previously posted
version (9893)).  Now the predominant part of the run cost is computed
by this expression:

    run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

Since csquared (indexCorrelation * indexCorrelation) is very small in
this case, the biggest term here is max_IO_cost, which is computed as
follows:

    max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;

There is division by loop_count because of the predicted effect of
caching, and it is exactly this division which makes the run_cost for a
single index lookup so low compared with the query version with
random_fk_uniq.  So the main problem is that the planner calls
cost_index with loop_count equal to the number of rows of the inner
table, *but it ignores semi-join semantics*, i.e. it doesn't account for
the number of *unique* rows in the inner table, which will actually be
the number of loops.
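[Editorial aside: the arithmetic above can be checked with a small sketch. This is not the real cost_index() — which derives pages_fetched from the Mackert-Lohman formula — it only reproduces the two quoted lines, with invented page counts, an invented correlation, and min_IO_cost simplified to sequential cost of 1.0 per page.]

```python
# Sketch of the two cost_index() lines quoted above; all inputs
# are invented for illustration.

def index_run_io_cost(pages_fetched, spc_random_page_cost,
                      index_correlation, loop_count):
    # Worst case: every page fetched is a random read; the division
    # by loop_count models cache reuse across repeated inner scans.
    max_io_cost = (pages_fetched * spc_random_page_cost) / loop_count
    # Best case: perfectly correlated index, sequential-ish I/O.
    min_io_cost = (pages_fetched * 1.0) / loop_count
    csquared = index_correlation * index_correlation
    return max_io_cost + csquared * (min_io_cost - max_io_cost)

# loop_count = raw inner row count (the current behavior) ...
cost_with_dupl_count = index_run_io_cost(100, 4.0, 0.01, 9920)
# ... versus loop_count = number of *unique* inner values.
cost_with_uniq_count = index_run_io_cost(100, 4.0, 0.01, 100)

# Dividing by ~9920 instead of ~100 makes each lookup look ~100x cheaper.
print(cost_with_dupl_count)  # ~0.04
print(cost_with_uniq_count)  # ~4.0
```

This mirrors the cost gap seen in the plans (5.75 vs. 214.26 for the same index scan): only the loop_count fed into the division differs.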
Since this is my first time looking into the optimizer (and in fact any
postgres) code I am not yet able to locate the exact place where this
should be repaired, but I hope that in few days I will have a patch :-)

> Another observation is that the planner does see other plans, because
> tweaking the cost variables like this:
>
>   SET random_page_cost = 2;
>
> produces this plan (which on my machine runs just a tad slower than the
> first query in your post):
>
>                             QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=10564.81..10688.47 rows=9893 width=15)
>    Group Key: facts.fk
>    ->  Nested Loop  (cost=5.42..10515.34 rows=9893 width=15)
>          ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
>                Group Key: random_fk_uniq.fk
>                ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
>          ->  Bitmap Heap Scan on facts  (cost=3.18..105.18 rows=100 ...)
>                Recheck Cond: (fk = random_fk_uniq.fk)
>                ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..3.15 ...)
>                      Index Cond: (fk = random_fk_uniq.fk)

Yeah, this makes now perfect sense to me.  From the run_cost and
max_IO_cost formulas you can see that (in this case!) random_page_cost is
almost exactly the linear factor here.  So eventually the results are
completely opposite to what I wished at first, i.e. instead of tweaking
postgres to produce lower costs for random_fk_uniq it will start
producing higher costs for random_fk_dupl.  But I now believe that this
is correct.

Regards,

-David

> I can get similar results by setting cpu_operator_cost=0.005. And
> further lowering random_page_cost to 1.1 actually gets me this:
>
>                             QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=6185.41..6309.08 rows=9893 width=15)
>    Group Key: facts.fk
>    ->  Nested Loop  (cost=2.66..6135.95 rows=9893 width=15)
>
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
On Mon, Mar 2, 2015 at 2:19 PM, David Kubečka wrote:
> The question is why optimizer, or rather the cost estimator, produced so
> much different estimates upon very small change in input. Moreover it seems
> that the main culprit of bad estimates isn't actually directly related to
> outer table, but rather to inner table. Just compare estimates for the two
> index scans:
>
> With 'random_fk_dupl':
>    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> With 'random_fk_uniq':
>    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)

Whoa.  That's pretty dramatic.  Am I correctly understanding that the
two tables contain *exactly* the same data?  Could the issue be that the
optimizer thinks one of the tables has a higher logical-to-physical
correlation than the other (see pg_stats.correlation)?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] Weirdly pesimistic estimates in optimizer
Hi all,

I have encountered a performance problem with a relatively simple query,
which I think is caused by overly pessimistic estimates in the
optimizer.  I originally ran into this issue on a table a few GBs large,
but it can be reproduced with a much smaller table as follows:

-- Setup main fact table
create table facts (fk int, f numeric);
insert into facts
    select (random()*1)::int, random()*1
    from generate_series(1,100) g(i);
create index on facts (fk);
vacuum analyze facts;

-- Pick 100 values of 'fk' which cover roughly 1/100 rows from the
-- 'facts' table and select all corresponding values
create table random_fk_dupl as
    select fk from facts
    where fk in (select (random()*1)::int from generate_series(1,100) g(i));
vacuum analyze random_fk_dupl;

-- Create a table with unique values from 'random_fk_dupl'
create table random_fk_uniq as
    select distinct fk from random_fk_dupl;
vacuum analyze random_fk_uniq;

(The motivation for doing this is that I have a persistent table/cache
with aggregated values and I want to update this table whenever new data
are loaded to 'facts', by processing only the loaded data.  This is a
very rough description, but more detail isn't needed for the sake of the
argument.)
Now the main query:

david=# explain analyze select fk, sum(f) from facts
        where fk in (select fk from random_fk_dupl) group by 1;

                              QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate  (cost=892.82..1017.34 rows=9962 width=15) (actual time=22.752..22.791 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=167.01..842.63 rows=10038 width=15) (actual time=2.167..16.739 rows=9807 loops=1)
         ->  HashAggregate  (cost=166.59..167.59 rows=100 width=4) (actual time=2.153..2.193 rows=100 loops=1)
               Group Key: random_fk_dupl.fk
               ->  Seq Scan on random_fk_dupl  (cost=0.00..142.07 rows=9807 width=4) (actual time=0.003..0.688 rows=9807 loops=1)
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75 rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
               Index Cond: (fk = random_fk_dupl.fk)
 Planning time: 0.372 ms
 Execution time: 22.842 ms

This is the plan I would expect given the estimates (quite good except
the last aggregation) for this query, and it exactly corresponds to
what's written in the optimizer README:

{quote}
The naive way to join two relations using a clause like WHERE A.X = B.Y
is to generate a nestloop plan like this:

    NestLoop
        Filter: A.X = B.Y
        -> Seq Scan on A
        -> Seq Scan on B

We can make this better by using a merge or hash join, but it still
requires scanning all of both input relations.  If A is very small and B
is very large, but there is an index on B.Y, it can be enormously better
to do something like this:

    NestLoop
        -> Seq Scan on A
        -> Index Scan using B_Y_IDX on B
            Index Condition: B.Y = A.X

Here, we are expecting that for each row scanned from A, the nestloop
plan node will pass down the current value of A.X into the scan of B.
That allows the indexscan to treat A.X as a constant for any one
invocation, and thereby use it as an index key.  This is the only plan
type that can avoid fetching all of B, and for small numbers of rows
coming from A, that will dominate every other consideration.
(As A gets larger, this gets less attractive, and eventually a merge or
hash join will win instead.  So we have to cost out all the alternatives
to decide what to do.)
{quote}

So far so good.  Now let's use 'random_fk_uniq' instead of
'random_fk_dupl'.  I would expect almost the same plan (with just a
cheaper inner HashAggregate), because the optimizer knows that there are
(at most) 100 values in 'random_fk_uniq', so nested loop is clearly the
best choice.  Instead we get this:

david=# explain analyze select fk, sum(f) from facts
        where fk in (select fk from random_fk_uniq) group by 1;

                              QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate  (cost=18198.11..18322.64 rows=9962 width=15) (actual time=163.738..163.773 rows=100 loops=1)
   Group Key: facts.fk
   ->  Hash Semi Join  (cost=3.25..18147.92 rows=10038 width=15) (actual time=0.298..160.271 rows=9807 loops=1)
         Hash Cond: (facts.fk = random_fk_uniq.fk)
         ->  Seq Scan on facts  (cost=0.00..15408.00 rows=100 width=15) (actual time=0.010..66.524 rows=100 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=4) (actual time=0.079..0.079 rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 4kB
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.003..0.028 rows=100 loops=1)
 Planni
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
David Kubečka wrote:

> I have read the optimizer README file and also looked briefly at
> the code, but this seems to be something not related to
> particular implementation of algorithm (e.g. nested loop).
> Perhaps it's the way how cost estimates are propagated down

It could be as simple as not having tuned your cost factors to
accurately reflect the relative costs of different actions in your
environment.  If you are using the default configuration, you might
want to try a few of the adjustments that are most often needed (at
least in my experience):

cpu_tuple_cost = 0.03
random_page_cost = 2
effective_cache_size = <50% to 75% of machine RAM>
work_mem = 

You can SET these on an individual connection, one at a time or in
combination, and EXPLAIN the query to see the effects on plan choice.
Other advice, not all of which matches my personal experience, can be
found here:

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

The best thing to do is experiment with different values with your own
queries and workloads to see what most accurately models your costs
(and thus produces the fastest plans).

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Weirdly pesimistic estimates in optimizer
Hi David ;-)

On 2.3.2015 20:19, David Kubečka wrote:
>
> The question is why optimizer, or rather the cost estimator,
> produced so much different estimates upon very small change in input.
> Moreover it seems that the main culprit of bad estimates isn't
> actually directly related to outer table, but rather to inner table.
> Just compare estimates for the two index scans:
>
> With 'random_fk_dupl':
>    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> With 'random_fk_uniq':
>    ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
>
> I have read the optimizer README file and also looked briefly at the
> code, but this seems to be something not related to particular
> implementation of algorithm (e.g. nested loop). Perhaps it's the way
> how cost estimates are propagated down (or sideways? that would be
> weird...) the query tree. But I am really not sure, since this is my
> first time looking at the optimizer code base. I should also add that
> I have reproduced this behaviour for all versions of Pg from 9.2 up
> to current devel.

Interesting.  I've only spent a few minutes looking at this, but it
certainly is a bit strange that using a smaller table with unique values
results in a slower plan than using a table with duplicities (but the
same number of unique values).

Another observation is that the planner does see other plans, because
tweaking the cost variables like this:

  SET random_page_cost = 2;

produces this plan (which on my machine runs just a tad slower than the
first query in your post):

                            QUERY PLAN
 ----------------------------------------------------------------------
 HashAggregate  (cost=10564.81..10688.47 rows=9893 width=15)
   Group Key: facts.fk
   ->  Nested Loop  (cost=5.42..10515.34 rows=9893 width=15)
         ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
         ->  Bitmap Heap Scan on facts  (cost=3.18..105.18 rows=100 ...)
               Recheck Cond: (fk = random_fk_uniq.fk)
               ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..3.15 ...)
                     Index Cond: (fk = random_fk_uniq.fk)

I can get similar results by setting cpu_operator_cost=0.005.  And
further lowering random_page_cost to 1.1 actually gets me this:

                            QUERY PLAN
 ----------------------------------------------------------------------
 HashAggregate  (cost=6185.41..6309.08 rows=9893 width=15)
   Group Key: facts.fk
   ->  Nested Loop  (cost=2.66..6135.95 rows=9893 width=15)
         ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..60.95 ...)
               Index Cond: (fk = random_fk_uniq.fk)

which is exactly the first plan from your post.

I still don't understand why using the smaller table produces a less
efficient plan, but the estimated costs are very close (20013 vs. 18147
is a very small difference in this context), and the estimator shoots
for minimal average error.  So it may seem 'weirdly pessimistic', but it
might be equally optimistic for other queries.  Also, keep in mind that
the cost model is just a simplification of reality, so it can't be
correct 100% of the time.

The fact that this small difference in costs results in a significant
duration difference suggests that the default optimizer cost values do
not reflect your environment.  For example, you assume that this is very
fast:

   NestLoop
      -> Seq Scan on A
      -> Index Scan using B_Y_IDX on B
           Index Condition: B.Y = A.X

but index scans often produce a lot of random I/O, and that may be quite
expensive if it really hits the storage.  And that's what the default
value (random_page_cost=4) is set for.  But if you're on a system with
lots of RAM, or SSD, ... that simply is not the case, and it may
penalize index scans (and prefer more sequential workloads).
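[Editorial aside: Tomas's point about random_page_cost can be made concrete with back-of-the-envelope arithmetic. The page counts below are invented; only the relative costs matter, and the model deliberately ignores caching and correlation corrections.]

```python
# seq_page_cost defaults to 1.0 in PostgreSQL; random_page_cost
# defaults to 4.0.  Page counts here are invented for illustration.

def seqscan_io_cost(pages, seq_page_cost=1.0):
    return pages * seq_page_cost

def indexscan_io_cost(heap_pages_fetched, random_page_cost):
    # Pessimistic model: every heap page reached via the index
    # is charged as a random read.
    return heap_pages_fetched * random_page_cost

seq_pages = 15000   # pages in the big table (full scan)
idx_pages = 5000    # heap pages actually touched via the index

print(indexscan_io_cost(idx_pages, 4.0))  # 20000.0 -> seq scan looks cheaper
print(indexscan_io_cost(idx_pages, 2.0))  # 10000.0 -> index scan looks cheaper
print(seqscan_io_cost(seq_pages))         # 15000.0
```

With the default 4.0 the index scan loses even though it reads a third of the pages; halving random_page_cost (a cached/SSD setting) flips the decision, which is exactly the plan change observed with SET random_page_cost = 2.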
The other thing you might do is create an index on (fk,f), which on the
new releases produces this:

                            QUERY PLAN
 ----------------------------------------------------------------------
 HashAggregate  (cost=783.77..907.43 rows=9893 width=15)
   Group Key: facts.fk
   ->  Nested Loop  (cost=2.66..734.30 rows=9893 width=15)
         ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
         ->  Index Only Scan using facts_2_idx on facts  (cost=...)
               Index Cond: (fk = random_fk_uniq.fk)

This limits the amount of random I/O by only scanning the index, and is
even faster than the first query.

regards

-- 
Tomas Vondra