Re: partition table optimizer join cost misestimation
Andrei, Yes, from explain output, since optimizer already get the merge_append cost but not take account into total cost, that make a big difference. I shared table DDLs and explain analyze,buffers output , I think the data maybe generated by other way to reproduce this issue. sorry for not sharing the commercial production data here. Andrei Lepikhov 於 2025年4月2日週三 下午7:03寫道: > On 4/2/25 12:18, James Pang wrote: > > Hi, > > Postgresq v14.8, we found optimizer doest not take "merge append" > > cost into sql plan total cost and then make a bad sql plan. attached > > please find details. > I suppose there is a different type of issue. > MegeJoin sometimes doesn't need to scan the whole inner or outer side > (see the MergeScanSelCache structure and how it is used in the cost > estimation routine). > > So, the cost can be less because the optimizer predicted that only a > small part of the Append will be scanned and used some sort of > interpolation between startup cost and total cost. > > But to be sure, could you send the results of EXPLAIN ANALYZE VERBOSE? > If you also send the data to reproduce the case, we may find the source > of the problem more precisely. > > -- > regards, Andrei Lepikhov > \d+ tablea Partitioned table "mct.tablea" Column |Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description ---+-+---+--+---+--+-+--+- mmm_id | bigint | | not null | | plain| | | clu_id| bigint | | | | plain| | | _id | bigint | | | | plain| | | zzz_id | bigint | | | | plain| | | ssr_id | bigint | | | | plain| | | ssr_type | bigint | | | | plain| | | mmm_stat| integer | | | | plain| | | time | timestamp without time zone | | not null | | plain| | | protocol | integer | | | | plain| | | uuun | character varying(20) | | | | extended | | | lastmodifiedtime | timestamp without time zone | | | sysdate() | plain| | | mmm_sss_id | bigint | | | | plain| | | remote_addr | character varying(20) | | | | extended | | | Partition key: RANGE ("time") Indexes: "tablea_pkey" PRIMARY KEY, btree (mmm_id, "time") "tablea_idx1" btree (zzz_id, mmm_sss_id, mmm_stat) "tablea_idx2" btree (mmm_sss_id) Publications: "_24503" Triggers: x AFTER DELETE ON tablea FOR EACH ROW WHEN (USER !~~ 'SPLEX%'::text) EXECUTE FUNCTION trigger_fct_xx() tr_tablea_update BEFORE UPDATE ON tablea FOR EACH ROW WHEN (USER !~~ 'SPLEX%'::text) EXECUTE FUNCTION trigger_fct_tr_x tus_update() Partitions: tablea_p1970 FOR VALUES FROM ('1970-01-01 00:00:00') TO ('2023-09-07 00:00:00'), tablea_p20250319 FOR VALUES FROM ('2025-03-19 00:00:00') TO ('2025-03-20 00:00:00'), tablea_p20250320 FOR VALUES FROM ('2025-03-20 00:00:00') TO ('2025-03-21 00:00:00'), tablea_p20250321 FOR VALUES FROM ('2025-03-21 00:00:00') TO ('2025-03-22 00:00:00'), tablea_p20250322 FOR VALUES FROM ('2025-03-22 00:00:00') TO ('2025-03-23 00:00:00'), tablea_p20250323 FOR VALUES FROM ('2025-03-23 00:00:00') TO ('2025-03-24 00:00:00'), tablea_p20250324 FOR VALUES FROM ('2025-03-24 00:00:00') TO ('2025-03-25 00:00:00'), tablea_p20250325 FOR VALUES FROM ('2025-03-25 00:00:00') TO ('2025-03-26 00:00:00'), tablea_p20250326 FOR VALUES FROM ('2025-03-26 00:00:00') TO ('2025-03-27 00:00:00'), tablea_p20250327 FOR VALUES FROM ('2025-03-27 00:00:00') TO ('2025-03-28 00:00:00'), tablea_p20250328 FOR VALUES FROM ('2025-03-28 00:00:00') TO ('2025-03-29 00:00:00'), tablea_p20250329 FOR VALUES FROM ('2025-03-29 00:00:00') TO ('2025-03-30 00:00:00'), tablea_p20250330 FOR VALUES FROM ('2025-03-30 00:00:00') TO ('2025-03-31 00:00:00'), tablea_p20250331 FOR V
Re: partition table optimizer join cost misestimation
On 4/3/25 02:46, James Pang wrote: Andrei, Yes, from explain output, since optimizer already get the merge_append cost but not take account into total cost, that make a big difference. I shared table DDLs and explain analyze,buffers output , I think the data maybe generated by other way to reproduce this issue. sorry for not sharing the commercial production data here. I think it is almost enough to identify the issue. Let me ask you the following questions: 1. Can you provide min and max values in columns cc.sss_id and aa.mmm_id? 2. How often do you analyze your *parent* tables tablea and tablec? Remember, if you want to build statistics on a partitioned table (not a partition), you have to explicitly call ANALYZE tablea,tablec; mentioning these tables in the analyze list. 3. May you provide a dump of pg_statistic on attributes cc.sss_id and aa.mmm_id? 4. Is there a possibility of changing a single code line and rebuilding your DB instance to check a conjecture? -- regards, Andrei Lepikhov
Re: partition table optimizer join cost misestimation
On 4/2/25 12:18, James Pang wrote: Hi, Postgresq v14.8, we found optimizer doest not take "merge append" cost into sql plan total cost and then make a bad sql plan. attached please find details. I suppose there is a different type of issue. MegeJoin sometimes doesn't need to scan the whole inner or outer side (see the MergeScanSelCache structure and how it is used in the cost estimation routine). So, the cost can be less because the optimizer predicted that only a small part of the Append will be scanned and used some sort of interpolation between startup cost and total cost. But to be sure, could you send the results of EXPLAIN ANALYZE VERBOSE? If you also send the data to reproduce the case, we may find the source of the problem more precisely. -- regards, Andrei Lepikhov
partition table optimizer join cost misestimation
Hi, Postgresq v14.8, we found optimizer doest not take "merge append" cost into sql plan total cost and then make a bad sql plan. attached please find details. Query: masking table and column names : select from tablea aa inner join tableb bb on aa.ind1 = bb.ind1 inner join tablec cc on aa.ind2 = cc.ind2 where aa.ind3 in ($1) order by cc.id3; optimizer, it choose merge join with full index scan with a lot of rows and huge IO needed, looks like optimizer does not take into account the "Merge Append" huge cost into the whole SQL plan cost, so it thought only Sort and Merge join cost and make it to choose mergejoin instead of nestloop. --bad plan, it took hundreds of seconds to complete. Sort (cost=9624.33..9677.60 rows=21306 width=981) Sort Key: cc.id3 -> Nested Loop (cost=99.78..2717.44 rows=21306 width=981) -> Merge Join (cost=99.35..2009.19 rows=21306 width=915) Merge Cond: (cc.ind2 = aa.id2) -> Merge Append (cost=8.15..17046177.32 rows=98194074 width=903)<<< merge append huge cost but looks like optimizer does not take this into total plan cost yet. --good plan, off mergejoin to make it chose nestloop and it took only 20 milliseconds to complete. Sort (cost=19618.71..19671.98 rows=21306 width=981) Sort Key: cc.ind2 -> Nested Loop (cost=0.72..12711.82 rows=21306 width=981) -> Nested Loop (cost=0.29..12003.57 rows=21306 width=915) -> Append (cost=0.29..42.46 rows=30 width=28) Thanks, James Query: masking table and column names : select from tablea aa inner join tableb bb on aa.ind1 = bb.ind1 inner join tablec cc on aa.ind2 = cc.ind2 where aa.ind3 in ($1) order by cc.id3; optimizer, it choose merge join with full index scan with a lot of rows and huge IO needed, looks like optimizer does not take into account the "Merge Append" huge cost , so it thought only Sort and Merge join cost. that make --bad plan, it took hundreds of seconds to complete. Sort (cost=9624.33..9677.60 rows=21306 width=981) Sort Key: cc.id3 -> Nested Loop (cost=99.78..2717.44 rows=21306 width=981) -> Merge Join (cost=99.35..2009.19 rows=21306 width=915) Merge Cond: (cc.ind2 = aa.id2) -> Merge Append (cost=8.15..17046177.32 rows=98194074 width=903)<<< merge append huge cost but looks like optimizer does not take this into total plan cost yet. --good plan, off mergejoin to make it chose nestloop and it took only 20 milliseconds to complete. Sort (cost=19618.71..19671.98 rows=21306 width=981) Sort Key: cc.ind2 -> Nested Loop (cost=0.72..12711.82 rows=21306 width=981) -> Nested Loop (cost=0.29..12003.57 rows=21306 width=915) -> Append (cost=0.29..42.46 rows=30 width=28) explain execute slowsql9(123456789); QUERY PLAN - - Sort (cost=9624.33..9677.60 rows=21306 width=981) Sort Key: cc.id3 -> Nested Loop (cost=99.78..2717.44 rows=21306 width=981) -> Merge Join (cost=99.35..2009.19 rows=21306 width=915) Merge Cond: (cc.ind2 = aa.id2) -> Merge Append (cost=8.15..17046177.32 rows=98194074 width=903)<<< merge append huge cost but looks like optimizer does not take this into total plan cost yet. Sort Key: cc.ind2 -> Index Scan using tablec_p1970_ind2_idx on tablec_p1970 cc_1 (cost=0.12..2.34 rows=1 width=532) -> Index Scan using tablec_p20250318_ind2_idx on tablec_p20250318 cc_2 (cost=0.29..24653.56 rows=158631 width=861) -> Index Scan using tablec_p20250319_ind2_idx on tablec_p20250319 cc_3 (cost=0.29..25078.61 rows=157041 width=861) -> Index Scan using tablec_p20250320_ind2_idx on tablec_p20250320 cc_4 (cost=0.29..26336.20 rows=162959 width=868) -> Index Scan using tablec_p20250321_ind2_idx on tablec_p20250321 cc_5 (cost=0.29..25405.45 rows=158288 width=866) -> Index Scan using tablec_p20250322_ind2_idx on tablec_p20250322 cc_6 (cost=0.42..26346.47 rows=185312 width=837) -> Index Scan using tablec_p20250323_ind2_idx on tablec_p20250323 cc_7 (cost=0.42..60430.87 rows=370927 width=918) -> Index Scan using tablec_p20250324_ind2_idx on tablec_p20250324 cc_8 (cost=0.29..24798.45 rows=153782 width=862) -> Index Scan using tablec_p20250325_ind2_idx on tablec_p20250325 cc_9 (cost=0.29..23069.31 rows=144129 width=854) -> Index Scan using