Re: GIN index on JSONB not used due to lack of nested statistics

2019-10-30 Thread Jeff Janes
On Wed, Oct 30, 2019 at 12:25 PM Alessandro Baretta wrote:


>          ->  Bitmap Index Scan on idx_object  (cost=0.00..75580.00 rows=1 width=0) (actual time=24.094..24.094 rows=77 loops=1)
>                Index Cond: ((... @> ...::jsonb) AND (... @> ...::jsonb))
>  Planning time: 0.301 ms
>  Execution time: 24.723 ms
> (14 rows)
>
> It would seem that this miscalculation of the cost of the index scan
> is due to the query planner lacking detailed statistics about the
> relevant JSONB column.


Since it expected 1 row but actually found 77, I think that if it had
accurate statistics it would have overestimated the costs by even more.

Can you repeat the executions with "EXPLAIN (ANALYZE, BUFFERS)"?
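
For example, with the real query in place of the ellipsis:

    EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;

The Buffers lines will show how many blocks came from shared_buffers
versus being read in from the kernel.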

How does the cost estimate change if you make effective_cache_size much
larger or much smaller? (No need for ANALYZE, just the cost estimate)
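
Something along these lines, with purely illustrative values:

    SET effective_cache_size = '512MB';   -- or try '64GB'
    EXPLAIN SELECT ... ;
    RESET effective_cache_size;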

What kind of performance do you get if you turn off enable_seqscan and
then repeat the query from a cold start (restart PostgreSQL, then run
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches")?  If the performance is
very fast after a cold start, then something is wrong with the planner
estimate.  If it is slow from a cold start, then the planner has at
least a plausible basis for charging as much as it does.
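
In a fresh session after the restart and the cache drop, that is:

    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT ... ;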

If you run the query with just one branch of your AND at a time, what is
the expected and actual number of rows?
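
That is, something like the following, once per containment condition
(placeholders as in your plans):

    EXPLAIN ANALYZE SELECT * FROM _object WHERE ... @> ...::jsonb;

and compare the estimated rows to the actual rows for each one.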


> The workaround I found so far is to set a low value of
> random_page_cost, but this could result in the query planner using
> index scans for other tables and other queries, where a seq scan would
> actually be more appropriate.
>

Based on what you know about your IO system, and the cacheability of your
data, what is the appropriate setting of random_page_cost from first
principles?  Maybe it is those other queries which have the problem, not
this one.
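
If a lower value does turn out to be right for this query but not
globally, it can be scoped to a single transaction instead:

    BEGIN;
    SET LOCAL random_page_cost = 1.1;   -- illustrative value
    SELECT ... ;
    COMMIT;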

If you can come up with a random object generator which creates data
structured similarly to yours, and which shows the same issue when run
with disclosable queries, that would help us look into it.
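
A minimal sketch of the kind of generator I mean; the jsonb shape here
is invented, so substitute a structure resembling your real documents:

    CREATE TABLE _object_test AS
    SELECT i,
           jsonb_build_object(
               'doctype', 'type_' || (i % 56),
               'tags', jsonb_build_array('tag_' || (i % 500),
                                         'tag_' || (i % 37))
           ) AS metadata
    FROM generate_series(1, 200000) AS s(i);

    CREATE INDEX idx_object_test ON _object_test
        USING gin (metadata jsonb_path_ops);
    ANALYZE _object_test;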

Also, what version are you running?
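
For example:

    SELECT version();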

Cheers,

Jeff


GIN index on JSONB not used due to lack of nested statistics

2019-10-30 Thread Alessandro Baretta
I have created a GIN index using jsonb_path_ops over some JSONB
columns. Initially, while the index was small, the query planner would
select a Bitmap Index Scan strategy to execute queries leveraging the
appropriate JSONB operator (@>). Now that the table has grown to
almost 200k rows and the index has grown to 1.6M rows, the query
planner prefers running Seq Scans.
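
The index follows this general pattern ("metadata" standing in for the
actual columns, which I have elided throughout this message):

    CREATE INDEX idx_object ON _object USING gin (metadata jsonb_path_ops);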

db=> SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE
relname IN ('_object', 'idx_object') and relnamespace = 43315;
  relname   | relkind | reltuples  | relpages
------------+---------+------------+----------
 _object    | r       |     185618 |    39030
 idx_object | i       | 1.6583e+06 |      512
(2 rows)

db=> explain analyze ...;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Subquery Scan on x  (cost=41814.28..41814.30 rows=1 width=25) (actual time=3954.742..3954.820 rows=5 loops=1)
   Filter: ((x._data IS NOT NULL) OR (x.metadata IS NOT NULL))
   Rows Removed by Filter: 3
   ->  Unique  (cost=41814.28..41814.29 rows=1 width=1562) (actual time=3954.740..3954.814 rows=8 loops=1)
         ->  Sort  (cost=41814.28..41814.28 rows=1 width=1562) (actual time=3954.738..3954.754 rows=77 loops=1)
               Sort Key: ...
               Sort Method: quicksort  Memory: 63kB
               ->  Seq Scan on _object  (cost=0.00..41814.27 rows=1 width=1562) (actual time=84.980..3954.330 rows=77 loops=1)
                     Filter: ((... @> ...::jsonb) AND (... @> ...::jsonb))
                     Rows Removed by Filter: 185529
 Planning time: 0.261 ms
 Execution time: 3954.860 ms
(12 rows)

Disabling seq scans shows that the execution time of the Bitmap Index
Scan is considerably lower, but the cost estimate is completely off.
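
Seq scans were disabled in the usual session-local way:

    SET enable_seqscan = off;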

db=> explain analyze ...;
                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Subquery Scan on x  (cost=75584.03..75584.05 rows=1 width=25) (actual time=24.591..24.664 rows=5 loops=1)
   Filter: ((x._data IS NOT NULL) OR (x.metadata IS NOT NULL))
   Rows Removed by Filter: 3
   ->  Unique  (cost=75584.03..75584.04 rows=1 width=1562) (actual time=24.589..24.659 rows=8 loops=1)
         ->  Sort  (cost=75584.03..75584.03 rows=1 width=1562) (actual time=24.588..24.603 rows=77 loops=1)
               Sort Key: ...
               Sort Method: quicksort  Memory: 63kB
               ->  Bitmap Heap Scan on _object  (cost=75580.00..75584.02 rows=1 width=1562) (actual time=24.120..24.284 rows=77 loops=1)
                     Recheck Cond: ((... @> ...::jsonb) AND (... @> ...::jsonb))
                     Heap Blocks: exact=73
                     ->  Bitmap Index Scan on idx_object  (cost=0.00..75580.00 rows=1 width=0) (actual time=24.094..24.094 rows=77 loops=1)
                           Index Cond: ((... @> ...::jsonb) AND (... @> ...::jsonb))
 Planning time: 0.301 ms
 Execution time: 24.723 ms
(14 rows)

It would seem that this miscalculation of the cost of the index scan
is due to the query planner lacking detailed statistics about the
relevant JSONB column. Here are the statistics collected by PostgreSQL
on the relevant columns.

db=> \x
db=> select attname, null_frac, n_distinct,
         array_length(most_common_vals, 1) mcv,
         array_length(most_common_freqs, 1) mcf,
         array_length(most_common_elems, 1) mce,
         array_length(most_common_elem_freqs, 1) mcef,
         elem_count_histogram
     from pg_stats
     where schemaname = 'zero2' and tablename = '_object';
-[ RECORD 1 ]--------+----------
attname              | doctype
null_frac            | 0
n_distinct           | 56
mcv                  | 4
mcf                  | 4
mce                  |
mcef                 |
elem_count_histogram |
-[ RECORD 2 ]--------+----------
attname              | app_id
null_frac            | 0
n_distinct           | 2374
mcv                  | 100
mcf                  | 100
mce                  |
mcef                 |
elem_count_histogram |
-[ RECORD 3 ]--------+----------
attname              | status
null_frac            | 0
n_distinct           | 2
mcv                  | 2
mcf                  | 2
mce                  |
mcef                 |
elem_count_histogram |
-[ RECORD 4 ]--------+----------
attname              | metadata
null_frac            | 0.0924333
n_distinct           | 985
mcv                  | 100
mcf                  | 100
mce                  |
mcef                 |
elem_count_histogram |


As you can see, most_common_elems and most_common_elem_freqs are
empty. This prevents the query planner from making any kind of
meaningful estimate of the number of index rows it would need to
access as a function of the query terms.
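
As far as I can tell, raising the per-column statistics target and
re-analyzing, e.g.

    ALTER TABLE _object ALTER COLUMN metadata SET STATISTICS 1000;
    ANALYZE _object;

does not fill them in, because ANALYZE only gathers element statistics
for types with element-aware analyzers, such as arrays and tsvector.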

The workaround I found so far is to set a low value of
random_page_cost, but this could result in the query planner using
index scans for other tables and other queries, where a seq scan would
actually be more appropriate.
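
Concretely (the value is only illustrative):

    SET random_page_cost = 1.1;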