Extremely inefficient merge-join
Hi,

I am having a rare issue with an extremely inefficient merge join. The query plan indicates that PG is effectively doing some kind of nested loop, even though an index is present.

PostgreSQL 9.6.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

Schema of dir_current (some columns left out for brevity):

```
starfish=# \d sf.dir_current
                          Table "sf.dir_current"
   Column   |    Type     | Collation | Nullable | Default
------------+-------------+-----------+----------+----------------------------------------
 id         | bigint      |           | not null | nextval('sf.object_id_seq'::regclass)
 volume_id  | bigint      |           | not null |
 parent_id  | bigint      |           |          |
 blocks     | sf.blkcnt_t |           |          |
 rec_aggrs  | jsonb       |           | not null |
Indexes:
    "dir_current_pk" PRIMARY KEY, btree (id), tablespace "sf_current"
    "dir_current_parentid_idx" btree (parent_id), tablespace "sf_current"
    "dir_current_volume_id_out_of_sync_time_idx" btree (volume_id, out_of_sync_time) WHERE out_of_sync_time IS NOT NULL, tablespace "sf_current"
    "dir_current_volume_id_path_unq_idx" UNIQUE, btree (volume_id, path text_pattern_ops), tablespace "sf_current"
    "dir_current_volumeid_id_unq" UNIQUE CONSTRAINT, btree (volume_id, id), tablespace "sf_current"
Foreign-key constraints:
    "dir_current_parentid_fk" FOREIGN KEY (parent_id) REFERENCES sf.dir_current(id) DEFERRABLE INITIALLY DEFERRED
```

dir_process is created as a temporary table:

```
CREATE TEMP TABLE dir_process AS (
    SELECT sf.dir_current.id, volume_id, parent_id, depth, size, blocks,
           atime, ctime, mtime, sync_time, local_aggrs
    FROM sf.dir_current
    WHERE ...
);
CREATE INDEX dir_process_indx ON dir_process(volume_id, id);
ANALYZE dir_process;
```

and usually contains a few thousand rows.

Slow query:

```
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM sf.dir_current AS dir
INNER JOIN dir_process
        ON dir.parent_id = dir_process.id
       AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY(volume_ids)
```

dir_current contains around 750M rows altogether, and there are ca. 1.75M rows with volume_id = 5.
Sometimes Postgres will choose a very inefficient plan, which involves looping many times over the same rows, producing hundreds of millions or billions of rows:

```
LOG:  duration: 1125530.496 ms  plan:
Merge Join  (cost=909.42..40484.01 rows=1 width=456) (actual rows=1 loops=1)
  Merge Cond: (dir.volume_id = dir_process.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 13583132483
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..884052.46 rows=601329 width=456) (actual rows=2000756 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))
  ->  Sort  (cost=909.31..912.70 rows=6789 width=16) (actual rows=13581131729 loops=1)
        Sort Key: dir_process.volume_id
        Sort Method: quicksort  Memory: 511kB
        ->  Seq Scan on dir_process  (cost=0.00..822.89 rows=6789 width=16) (actual rows=6789 loops=1)

LOG:  duration: 3923310.224 ms  plan:
Merge Join  (cost=0.17..4324.64 rows=1 width=456) (actual rows=529 loops=1)
  Merge Cond: (dir_process.volume_id = dir.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 831113021
  ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.05..245.00 rows=450 width=16) (actual rows=450 loops=1)
        Heap Fetches: 450
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..884052.46 rows=601329 width=456) (actual rows=831113101 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))

LOG:  duration: 10140968.829 ms  plan:
Merge Join  (cost=0.17..8389.13 rows=1 width=456) (actual rows=819 loops=1)
  Merge Cond: (dir_process.volume_id = dir.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 2153506735
  ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.06..659.76 rows=1166 width=16) (actual rows=1166 loops=1)
        Heap Fetches: 1166
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..885276.20 rows=602172 width=456) (actual rows=2153506389 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))
```

Note that 2153506389 / 1166 = 1 846 918. Similarly, 831113101 / 450 = 1 846 918. In other words, the entire volume_id = 5 slice of dir_current is effectively rescanned once for every dir_process row, because every outer row has the same merge key.

I wonder how I can help the Postgres query planner to choose a faster plan?

--
Marcin Gozdalik
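One way to see whether the planner would do better with a different join strategy is to disable merge joins for a single session and compare the EXPLAIN ANALYZE output; a composite index covering both join columns is another option. The sketch below assumes the query can be run interactively in the session that owns dir_process, replaces the volume_ids parameter with a literal array, and uses an illustrative index name:

```
-- Sketch: compare plans with merge joins disabled for this transaction only.
BEGIN;
SET LOCAL enable_mergejoin = off;   -- planner falls back to hash or nested-loop joins
EXPLAIN (ANALYZE, BUFFERS)
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM sf.dir_current AS dir
INNER JOIN dir_process
        ON dir.parent_id = dir_process.id
       AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY ('{5}'::bigint[]);   -- volume_ids replaced with a literal here
ROLLBACK;

-- A composite index matching both join columns gives the planner a selective
-- nested-loop path into dir_current (the index name is illustrative).
CREATE INDEX CONCURRENTLY dir_current_volumeid_parentid_idx
    ON sf.dir_current (volume_id, parent_id);
```

If the hash or nested-loop plan wins with merge joins disabled, that points at the cost estimate for the duplicate outer keys rather than at a missing access path.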
Re: Extremely inefficient merge-join
dir_current changes often, but it is analyzed after significant changes, so effectively it's analyzed roughly once an hour. The approximate ratio of rows with volume_id = 5 to the total number of rows doesn't change (i.e. volume_id = 5 appears in roughly 1.5M-2M rows out of around 750-800M total). dir_process is created once, analyzed, and doesn't change afterwards.

Assuming dir_process is the outer side, then in the plans shown here it contains only duplicates - i.e. all of its rows have volume_id = 5 in this example.

Do you think there is anything that could be changed in the query itself? Any hints would be appreciated.

On Wed, 17 Mar 2021 at 20:47, Tom Lane wrote:

> Marcin Gozdalik writes:
> > Sometimes Postgres will choose a very inefficient plan, which involves
> > looping many times over the same rows, producing hundreds of millions or
> > billions of rows:
>
> Yeah, this can happen if the outer side of the join has a lot of
> duplicate rows.  The query planner is aware of that effect and will
> charge an increased cost when it applies, so I wonder if your
> statistics for the tables being joined are up-to-date.
>
>         regards, tom lane

--
Marcin Gozdalik
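Since the reply above hinges on whether the statistics are current, a quick check is to see when dir_current was last analyzed and what the planner currently believes about the distribution of volume_id; a minimal sketch using the standard statistics views:

```
-- When was dir_current last analyzed, and how large does it look?
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'sf' AND relname = 'dir_current';

-- The MCV list for volume_id drives the estimate of how many duplicate
-- outer keys the merge join will have to deal with.
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE schemaname = 'sf' AND tablename = 'dir_current' AND attname = 'volume_id';

-- Re-analyze by hand if the numbers look stale.
ANALYZE sf.dir_current;
```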
Very slow "bloat query"
Hi,

I am trying to use `pgmetrics` on a big (10TB+), busy (1GB/s RW) database. It takes around 5 minutes for pgmetrics to run. I traced the problem to the "bloat query" (a version of https://wiki.postgresql.org/wiki/Show_database_bloat) spinning in CPU, doing no I/O.

I have traced the problem to the bloated `pg_class` (the irony: `pgmetrics` does not collect bloat on `pg_catalog`):

`vacuum (full, analyze, verbose) pg_class;`
```
INFO:  vacuuming "pg_catalog.pg_class"
INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions in 158870 pages
DETAIL:  7429943 dead row versions cannot be removed yet.
CPU 1.36s/6.40u sec elapsed 9.85 sec.
INFO:  analyzing "pg_catalog.pg_class"
INFO:  "pg_class": scanned 6 of 158869 pages, containing 295 live rows and 2806547 dead rows; 295 rows in sample, 781 estimated total rows
VACUUM
```

`pg_class` has so many dead rows because the workload is temp-table heavy (creating/destroying 1M+ temporary tables per day) and has long-running analytics queries running for 24h+.

The PG query planner assumes that an index scan on `pg_class` will be very quick and plans a Nested Loop with an Index Scan. However, the index scan has 7M dead tuples to filter out, and the query takes more than 200 seconds (https://explain.depesz.com/s/bw2G).

If I create a temp table from `pg_class` containing only the live tuples:
```
CREATE TEMPORARY TABLE pg_class_alive AS SELECT oid,* FROM pg_class;
CREATE UNIQUE INDEX pg_class_alive_oid_index ON pg_class_alive(oid);
CREATE UNIQUE INDEX pg_class_alive_relname_nsp_index ON pg_class_alive(relname, relnamespace);
CREATE INDEX pg_class_tblspc_relfilenode_index ON pg_class_alive(reltablespace, relfilenode);
ANALYZE pg_class_alive;
```

and run the bloat query on `pg_class_alive` instead of `pg_class`:
```
SELECT
  nn.nspname AS schemaname,
  cc.relname AS tablename,
  COALESCE(cc.reltuples,0) AS reltuples,
  COALESCE(cc.relpages,0) AS relpages,
  COALESCE(CEIL((cc.reltuples*((datahdr+8-
    (CASE WHEN datahdr%8=0 THEN 8 ELSE datahdr%8 END))+nullhdr2+4))/(8192-20::float)),0) AS otta
FROM
  pg_class_alive cc
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname <> 'information_schema'
LEFT JOIN
(
  SELECT
    foo.nspname, foo.relname,
    (datawidth+32)::numeric AS datahdr,
    (maxfracsum*(nullhdr+8-(CASE WHEN nullhdr%8=0 THEN 8 ELSE nullhdr%8 END))) AS nullhdr2
  FROM (
    SELECT
      ns.nspname, tbl.relname,
      SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
      MAX(coalesce(null_frac,0)) AS maxfracsum,
      23+(
        SELECT 1+count(*)/8
        FROM pg_stats s2
        WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
      ) AS nullhdr
    FROM pg_attribute att
    JOIN pg_class_alive tbl ON att.attrelid = tbl.oid
    JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
    LEFT JOIN pg_stats s ON s.schemaname=ns.nspname AND s.tablename = tbl.relname
      AND s.inherited=false AND s.attname=att.attname
    WHERE att.attnum > 0 AND tbl.relkind='r'
    GROUP BY 1,2
  ) AS foo
) AS rs ON cc.relname = rs.relname AND nn.nspname = rs.nspname
LEFT JOIN pg_index i ON indrelid = cc.oid
LEFT JOIN pg_class_alive c2 ON c2.oid = i.indexrelid
```

it runs in 10s, 20x faster (https://explain.depesz.com/s/K4SH).

The rabbit hole probably goes deeper (e.g. I should probably do the same for pg_statistic and pg_attribute and create a new pg_stats view). I am not able (at least not quickly) to reduce the number of temporary tables created or to make the analytics queries finish more quickly.

Apart from the above hack of filtering out live tuples into a separate table, is there anything else I could do?

Thank you,
Marcin Gozdalik

--
Marcin Gozdalik
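For context on how far the catalog bloat extends beyond `pg_class`, the cumulative statistics views also cover `pg_catalog` relations; a minimal sketch, assuming `track_counts` is on (the default):

```
-- Sketch: dead-tuple counts and last (auto)vacuum times for the hot catalogs.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_sys_tables
WHERE schemaname = 'pg_catalog'
  AND relname IN ('pg_class', 'pg_attribute', 'pg_statistic', 'pg_type', 'pg_depend')
ORDER BY n_dead_tup DESC;
```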
Re: Very slow "bloat query"
Unfortunately it's still 9.6. An upgrade to the latest 13 is planned for this year.

On Fri, 14 May 2021 at 12:08, Imre Samu wrote:

> > Apart from the above hack of filtering out live tuples to a separate
> > table is there anything I could do?
>
> Is this the latest PG 13.3 version?
>
> IMHO: if not, it may be worth updating to the latest patch release as
> soon as possible.
>
> https://www.postgresql.org/docs/release/13.3/
> Release date: 2021-05-13
> "Disable the vacuum_cleanup_index_scale_factor parameter and storage
> option (Peter Geoghegan)
> The notion of tracking “stale” index statistics proved to interact badly
> with the autovacuum_vacuum_insert_threshold parameter, resulting in
> unnecessary full-index scans and consequent degradation of autovacuum
> performance. The latter mechanism seems superior, so remove the
> stale-statistics logic. The control parameter for that,
> vacuum_cleanup_index_scale_factor, will be removed entirely in v14. In v13,
> it remains present to avoid breaking existing configuration files, but it
> no longer does anything."
>
> best,
> Imre

--
Marcin Gozdalik
Re: Very slow "bloat query"
There is a long-running analytics query (which usually runs for 30-40 hours). I agree that's not the best position to be in, but right now I can't do anything about it.

On Fri, 14 May 2021 at 15:04, Tom Lane wrote:

> Marcin Gozdalik writes:
> > I have traced the problem to the bloated `pg_class` (the irony:
> > `pgmetrics` does not collect bloat on `pg_catalog`):
> > `vacuum (full, analyze, verbose) pg_class;`
> > INFO:  vacuuming "pg_catalog.pg_class"
> > INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions in 158870 pages
> > DETAIL:  7429943 dead row versions cannot be removed yet.
>
> Ugh.  It's understandable that having a lot of temp-table traffic
> would result in the creation of lots of dead rows in pg_class.
> The question to be asking is why aren't they vacuumable?  You
> must have a longstanding open transaction somewhere (perhaps
> a forgotten prepared transaction?) that is holding back the
> global xmin horizon.  Closing that out and then doing another
> manual VACUUM FULL should help.
>
>         regards, tom lane

--
Marcin Gozdalik
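Following up on the suggestion above, the usual way to find out what is pinning the global xmin horizon is to look for the oldest open transactions and for forgotten prepared transactions; a minimal sketch (column names as of 9.6):

```
-- Oldest open transactions and the xmin they are holding back.
SELECT pid, datname, usename, state, xact_start, backend_xmin,
       now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Forgotten two-phase-commit transactions also pin the horizon.
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;
```

Once the blocker is closed, a manual VACUUM (or VACUUM FULL, as suggested) on pg_class should be able to reclaim the dead rows.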
Re: Very slow "bloat query"
PostgreSQL 9.6.21 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

On Fri, 14 May 2021 at 15:45, Imre Samu wrote:

> > Unfortunately it's still 9.6.
>
> And what is your "version()"?
>
> For example:
>
> postgres=# select version();
>                                                            version
> -----------------------------------------------------------------------------------------------------------------------------
>  PostgreSQL 9.6.22 on x86_64-pc-linux-gnu (Debian 9.6.22-1.pgdg110+1), compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
> (1 row)
>
> Imre

--
Marcin Gozdalik