Extremely inefficient merge-join

2021-03-17 Thread Marcin Gozdalik
Hi

I am having a rare issue with an extremely inefficient merge join. The query
plan shows that PG is effectively doing some kind of nested loop, even though
an index is present.

PostgreSQL 9.6.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit

Schema of dir_current (some columns left out for brevity):

starfish=# \d sf.dir_current
                              Table "sf.dir_current"
   Column   |    Type     | Collation | Nullable |                Default
------------+-------------+-----------+----------+---------------------------------------
 id         | bigint      |           | not null | nextval('sf.object_id_seq'::regclass)
 volume_id  | bigint      |           | not null |
 parent_id  | bigint      |           |          |
 blocks     | sf.blkcnt_t |           |          |
 rec_aggrs  | jsonb       |           | not null |
Indexes:
    "dir_current_pk" PRIMARY KEY, btree (id), tablespace "sf_current"
    "dir_current_parentid_idx" btree (parent_id), tablespace "sf_current"
    "dir_current_volume_id_out_of_sync_time_idx" btree (volume_id, out_of_sync_time) WHERE out_of_sync_time IS NOT NULL, tablespace "sf_current"
    "dir_current_volume_id_path_unq_idx" UNIQUE, btree (volume_id, path text_pattern_ops), tablespace "sf_current"
    "dir_current_volumeid_id_unq" UNIQUE CONSTRAINT, btree (volume_id, id), tablespace "sf_current"
Foreign-key constraints:
    "dir_current_parentid_fk" FOREIGN KEY (parent_id) REFERENCES sf.dir_current(id) DEFERRABLE INITIALLY DEFERRED

dir_process is created as a temporary table:

CREATE TEMP TABLE dir_process AS (
  SELECT sf.dir_current.id, volume_id, parent_id, depth, size, blocks,
         atime, ctime, mtime, sync_time, local_aggrs
  FROM sf.dir_current
  WHERE ...
);

CREATE INDEX dir_process_indx ON dir_process(volume_id, id);
ANALYZE dir_process;

and usually contains a few thousand rows.

Slow query:
  SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
  FROM sf.dir_current AS dir
  INNER JOIN dir_process ON dir.parent_id = dir_process.id
                        AND dir.volume_id = dir_process.volume_id
  WHERE dir.volume_id = ANY(volume_ids)

dir_current contains around 750M rows altogether, and there are ca. 1.75M
rows with volume_id=5.
Sometimes Postgres chooses a very inefficient plan, which involves looping
many times over the same rows and produces hundreds of millions or even
billions of rows:

LOG:  duration: 1125530.496 ms  plan:
Merge Join  (cost=909.42..40484.01 rows=1 width=456) (actual rows=1 loops=1)
  Merge Cond: (dir.volume_id = dir_process.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 13583132483
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..884052.46 rows=601329 width=456) (actual rows=2000756 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))
  ->  Sort  (cost=909.31..912.70 rows=6789 width=16) (actual rows=13581131729 loops=1)
        Sort Key: dir_process.volume_id
        Sort Method: quicksort  Memory: 511kB
        ->  Seq Scan on dir_process  (cost=0.00..822.89 rows=6789 width=16) (actual rows=6789 loops=1)

LOG:  duration: 3923310.224 ms  plan:
Merge Join  (cost=0.17..4324.64 rows=1 width=456) (actual rows=529 loops=1)
  Merge Cond: (dir_process.volume_id = dir.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 831113021
  ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.05..245.00 rows=450 width=16) (actual rows=450 loops=1)
        Heap Fetches: 450
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..884052.46 rows=601329 width=456) (actual rows=831113101 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))

LOG:  duration: 10140968.829 ms  plan:
Merge Join  (cost=0.17..8389.13 rows=1 width=456) (actual rows=819 loops=1)
  Merge Cond: (dir_process.volume_id = dir.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 2153506735
  ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.06..659.76 rows=1166 width=16) (actual rows=1166 loops=1)
        Heap Fetches: 1166
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..885276.20 rows=602172 width=456) (actual rows=2153506389 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))

Note that 2153506389 / 1166 ≈ 1,846,918, and similarly 831113101 / 450 ≈
1,846,918. In other words, the roughly 1.85M dir_current rows with
volume_id=5 are effectively re-scanned once for every dir_process row.
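(As a quick sanity check, and assuming direct access to the table, the quotient
should match the number of dir_current rows sharing the duplicated merge key;
this query is only a sketch:)

SELECT count(*) FROM sf.dir_current WHERE volume_id = 5;
-- expected to return roughly 1846918 at the time the plans above were captured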

How can I help the Postgres query planner to choose a faster plan?
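(One blunt, temporary workaround sometimes used while investigating, shown here
only as a hedged sketch and not as a recommendation, is to disable merge joins
for this one statement so the planner falls back to a hash or nested-loop plan;
the literal volume id below stands in for the volume_ids parameter:)

BEGIN;
SET LOCAL enable_mergejoin = off;   -- only affects this transaction
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM sf.dir_current AS dir
INNER JOIN dir_process ON dir.parent_id = dir_process.id
                      AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY ('{5}'::bigint[]);
COMMIT;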

-- 
Marcin Gozdalik


Re: Extremely inefficient merge-join

2021-03-17 Thread Marcin Gozdalik
dir_current changes often, but it is analyzed after significant changes, so
effectively it's analyzed roughly once an hour.
The approximate ratio of rows with volume_id=5 to the total number of rows
doesn't change (i.e. volume_id=5 appears in roughly 1.5M-2M rows out of a
total of around 750-800M rows).
dir_process is created once, analyzed, and doesn't change afterwards.
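(For completeness, a quick way to confirm when dir_current's statistics were
last refreshed; a sketch against the standard statistics view:)

SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'sf' AND relname = 'dir_current';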

Assuming dir_process is the outer side: in the plans shown here it contains
only duplicates, i.e. all of its rows have volume_id=5 in this example.
Do you think there is anything that could be changed in the query itself?
Any hints would be appreciated.
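(For illustration only, and untested against this schema: one rewrite that is
sometimes tried when a merge join degenerates like this pushes the lookup into
a LATERAL subquery, so that the per-parent index on dir_current can drive the
join; the original ANY(volume_ids) filter can be added inside the subquery if
still needed:)

SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM dir_process
CROSS JOIN LATERAL (
    SELECT d.id, d.volume_id, d.parent_id, d.rec_aggrs, d.blocks
    FROM sf.dir_current AS d
    WHERE d.parent_id = dir_process.id          -- can use dir_current_parentid_idx
      AND d.volume_id = dir_process.volume_id
) AS dir;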

Wed, 17 Mar 2021 at 20:47, Tom Lane wrote:

> Marcin Gozdalik  writes:
> > Sometimes Postgres will choose very inefficient plan, which involves
> > looping many times over same rows, producing hundreds of millions or
> > billions of rows:
>
> Yeah, this can happen if the outer side of the join has a lot of
> duplicate rows.  The query planner is aware of that effect and will
> charge an increased cost when it applies, so I wonder if your
> statistics for the tables being joined are up-to-date.
>
> regards, tom lane
>


-- 
Marcin Gozdalik


Very slow "bloat query"

2021-05-14 Thread Marcin Gozdalik
Hi

I am trying to use `pgmetrics` on a big (10TB+), busy (1GB/s RW) database.
It takes around 5 minutes for pgmetrics to run. I traced the problem to the
"bloat query" (a version of
https://wiki.postgresql.org/wiki/Show_database_bloat) spinning on the CPU,
doing no I/O.

I have traced the problem to the bloated `pg_class` (the irony: `pgmetrics`
does not collect bloat on `pg_catalog`):
`vacuum (full, analyze, verbose) pg_class;`
```
INFO:  vacuuming "pg_catalog.pg_class"
INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions in
158870 pages
DETAIL:  7429943 dead row versions cannot be removed yet.
CPU 1.36s/6.40u sec elapsed 9.85 sec.
INFO:  analyzing "pg_catalog.pg_class"
INFO:  "pg_class": scanned 6 of 158869 pages, containing 295 live rows
and 2806547 dead rows; 295 rows in sample, 781 estimated total rows
VACUUM
```
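(For reference, a similar dead/live picture can be seen without running VACUUM;
a sketch using the standard statistics view for system tables:)
```
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_sys_tables
WHERE relname = 'pg_class';
```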

`pg_class` has so many dead rows because the workload is temp-table heavy
(creating/destroying 1M+ temporary tables per day) and because there are
long-running analytics queries that run for 24h+.

The PG query planner assumes that an index scan on `pg_class` will be very
quick and plans a nested loop with an index scan. However, the index scan has
7M dead tuples to filter out, and the query takes more than 200 seconds (
https://explain.depesz.com/s/bw2G).

If I create a temp table from `pg_class` to contain only the live tuples:
```
CREATE TEMPORARY TABLE pg_class_alive AS SELECT oid, * FROM pg_class;
CREATE UNIQUE INDEX pg_class_alive_oid_index ON pg_class_alive(oid);
CREATE UNIQUE INDEX pg_class_alive_relname_nsp_index ON pg_class_alive(relname, relnamespace);
CREATE INDEX pg_class_tblspc_relfilenode_index ON pg_class_alive(reltablespace, relfilenode);
ANALYZE pg_class_alive;
```

and run the bloat query on `pg_class_alive` instead of `pg_class`:
```
SELECT
  nn.nspname AS schemaname,
  cc.relname AS tablename,
  COALESCE(cc.reltuples,0) AS reltuples,
  COALESCE(cc.relpages,0) AS relpages,
  COALESCE(CEIL((cc.reltuples*((datahdr+8-
    (CASE WHEN datahdr%8=0 THEN 8 ELSE datahdr%8 END))+nullhdr2+4))/(8192-20::float)),0) AS otta
FROM
  pg_class_alive cc
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname <> 'information_schema'
  LEFT JOIN
  (
    SELECT
      foo.nspname, foo.relname,
      (datawidth+32)::numeric AS datahdr,
      (maxfracsum*(nullhdr+8-(CASE WHEN nullhdr%8=0 THEN 8 ELSE nullhdr%8 END))) AS nullhdr2
    FROM (
      SELECT
        ns.nspname, tbl.relname,
        SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
        MAX(coalesce(null_frac,0)) AS maxfracsum,
        23+(
          SELECT 1+count(*)/8
          FROM pg_stats s2
          WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
        ) AS nullhdr
      FROM pg_attribute att
      JOIN pg_class_alive tbl ON att.attrelid = tbl.oid
      JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
      LEFT JOIN pg_stats s ON s.schemaname=ns.nspname
        AND s.tablename = tbl.relname
        AND s.inherited=false
        AND s.attname=att.attname
      WHERE att.attnum > 0 AND tbl.relkind='r'
      GROUP BY 1,2
    ) AS foo
  ) AS rs
  ON cc.relname = rs.relname AND nn.nspname = rs.nspname
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class_alive c2 ON c2.oid = i.indexrelid
```

it runs in 10 seconds, i.e. 20x faster (https://explain.depesz.com/s/K4SH).

The rabbit hole probably goes deeper (e.g. the same would have to be done for
pg_statistic and pg_attribute, plus a new pg_stats view on top of them).
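(A sketch of what the analogous step for pg_attribute might look like; the
index name here is made up and the snippet is untested:)
```
CREATE TEMPORARY TABLE pg_attribute_alive AS SELECT * FROM pg_attribute;
CREATE UNIQUE INDEX pg_attribute_alive_relid_attnum_idx
    ON pg_attribute_alive(attrelid, attnum);
ANALYZE pg_attribute_alive;
-- the bloat query above would then join pg_attribute_alive instead of pg_attribute
```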

I am not able (at least not quickly) to change the number of temporary tables
created or to make the analytics queries finish more quickly. Apart from the
above hack of filtering the live tuples out into a separate table, is there
anything else I could do?

Thank you,
Marcin Gozdalik

-- 
Marcin Gozdalik


Re: Very slow "bloat query"

2021-05-14 Thread Marcin Gozdalik
Unfortunately it's still 9.6. An upgrade to the latest 13 is planned for this year.

Fri, 14 May 2021 at 12:08, Imre Samu wrote:

> >  Apart from the above hack of filtering out live tuples to a separate
> table is there anything I could do?
>
> Is this the latest PG 13.3 version?
>
> IMHO: if not, it may be worth updating to the latest patch release as soon
> as possible.
>
> https://www.postgresql.org/docs/release/13.3/
> Release date: 2021-05-13
> "Disable the vacuum_cleanup_index_scale_factor parameter and storage
> option (Peter Geoghegan)
> The notion of tracking “stale” index statistics proved to interact badly
> with the autovacuum_vacuum_insert_threshold parameter, resulting in
> unnecessary full-index scans and consequent degradation of autovacuum
> performance. The latter mechanism seems superior, so remove the
> stale-statistics logic. The control parameter for that,
> vacuum_cleanup_index_scale_factor, will be removed entirely in v14. In v13,
> it remains present to avoid breaking existing configuration files, but it
> no longer does anything."
>
> best,
>  Imre

Re: Very slow "bloat query"

2021-05-14 Thread Marcin Gozdalik
There is a long-running analytics query (which usually runs for 30-40 hours).
I agree that's not the best position to be in, but right now I can't do
anything about it.
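(For completeness, the kind of checks Tom suggests below; a sketch only, using
standard system views that exist in 9.6:)
```
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts;                      -- forgotten prepared transactions

SELECT pid, datname, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start
LIMIT 10;                                    -- oldest open transactions
```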

Fri, 14 May 2021 at 15:04, Tom Lane wrote:

> Marcin Gozdalik  writes:
> > I have traced the problem to the bloated `pg_class` (the irony:
> `pgmetrics`
> > does not collect bloat on `pg_catalog`):
> > `vacuum (full, analyze, verbose) pg_class;`
> > ```
> > INFO:  vacuuming "pg_catalog.pg_class"
> > INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions
> in
> > 158870 pages
> > DETAIL:  7429943 dead row versions cannot be removed yet.
>
> Ugh.  It's understandable that having a lot of temp-table traffic
> would result in the creation of lots of dead rows in pg_class.
> The question to be asking is why aren't they vacuumable?  You
> must have a longstanding open transaction somewhere (perhaps
> a forgotten prepared transaction?) that is holding back the
> global xmin horizon.  Closing that out and then doing another
> manual VACUUM FULL should help.
>
> regards, tom lane
>


-- 
Marcin Gozdalik


Re: Very slow "bloat query"

2021-05-14 Thread Marcin Gozdalik
 PostgreSQL 9.6.21 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-44), 64-bit

Fri, 14 May 2021 at 15:45, Imre Samu wrote:

> > Unfortunately it's still 9.6.
>
> And what is your "version()"?
>
> for example:
>
> postgres=# select version();
>                                    version
> --------------------------------------------------------------------------------
>  PostgreSQL 9.6.22 on x86_64-pc-linux-gnu (Debian 9.6.22-1.pgdg110+1),
>  compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
> (1 row)
>
> Imre