Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread David Rowley
On Wed, 16 Oct 2019 at 11:40, Justin Pryzby  wrote:
> It didn't occur to me at the time, but that would also allow
> creating numerous, partial BRIN indices, each of which would have separate
> correlation computed over just their "restricted range", which *might* also
> handle your case, depending on how packed your data is.

Perhaps I've misunderstood you, but the correlation that's used is per
column, not per index. The only way to have it calculate multiple
correlations would be to partition the table. There'd be a correlation
for the column on each partition that way.
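
For example, with a partitioned table you could then eyeball the
per-partition correlations with something like this (a rough sketch;
assumes the partitions are named log_table_*):

SELECT tablename, correlation
FROM pg_stats
WHERE attname = 'rec_insert_time'
  AND tablename LIKE 'log_table%'
ORDER BY tablename;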

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread David Rowley
On Wed, 16 Oct 2019 at 05:05, Jeremy Finzel  wrote:
> But perhaps it would be worth exploring if there could be more detailed stats 
> on physical vs logical correlation, such as when ANALYZE takes its samples, 
> noting physical locations as well as logical values, and allowing the 
> correlation to take account of that more detailed analysis.  Of course, 
> sounds like a huge amount of work with uncertain benefits.

Yes. I think improving the statistics could be beneficial. It does
appear like the single value for the entire column is not fine-grained
enough for your use case.

> As the docs state, I do believe that the only use case that will work really 
> well for BRIN is either a truly insert-only table which is never pruned (an 
> idea I dislike as a DBA who wants us to keep OLTP data trim and implement 
> data retention policies :), or a table which is routinely CLUSTERed!

Have you considered partitioning the table so that the retention
policy could be implemented by dropping a partition rather than
performing a bulk DELETE?
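
As a minimal sketch (assuming the table were range partitioned by month on
rec_insert_time; partition names hypothetical):

CREATE TABLE log_table_2019_10 PARTITION OF log_table
    FOR VALUES FROM ('2019-10-01') TO ('2019-11-01');

-- retention then becomes a cheap metadata operation instead of a bulk DELETE
DROP TABLE log_table_2018_10;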


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread Justin Pryzby
This reminds me of an issue I reported several years ago where Btree index
scans were chosen over seq scan of a large, INSERT-only table due to very high
correlation, but performed poorly.  I concluded that use of the high "large
scale" correlation on a large 50+GB table caused the planner to fail to account
for a larger number of pages being read nonsequentially (the opposite of your
issue).  I think that's because we were INSERTing data which was at least
approximately sorted on record END time, and the index was on record START
time.  For a large table with a week's data, the correlation of "start time"
was still very high (0.5).  But scanning the index ends up reading pages
nonsequentially, with multiple visits per page.

I eked out a patch which made "correlation" a per-index statistic rather than
a per-column one.  That means the planner could distinguish between a
freshly-built btree index and a fragmented one.  (At the time, there was a
hypothesis that our issue was partially due to repeated values of the index
columns.)  It didn't occur to me at the time, but that would also allow
creating numerous, partial BRIN indices, each of which would have separate
correlation computed over just their "restricted range", which *might* also
handle your case, depending on how packed your data is.

https://www.postgresql.org/message-id/flat/20170707234119.GN17566%40telsasoft.com#fdcbebc342b8fb9ad0ff293913f54d11
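
For illustration, such partial BRIN indices might look something like this
(untested sketch; the range boundaries are made up):

CREATE INDEX log_table_brin_2019q2 ON log_table USING brin (rec_insert_time)
    WHERE rec_insert_time >= '2019-04-01' AND rec_insert_time < '2019-07-01';
CREATE INDEX log_table_brin_2019q3 ON log_table USING brin (rec_insert_time)
    WHERE rec_insert_time >= '2019-07-01' AND rec_insert_time < '2019-10-01';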

On Tue, Oct 15, 2019 at 11:05:13AM -0500, Jeremy Finzel wrote:
> I do believe that the only use case that will work really well for BRIN is
> either a truly insert-only table which is never pruned ...  or a table which
> is routinely CLUSTERed!

Or a partitioned table, which I highly recommend for large data sets instead
of bulk DELETEs.

Justin




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread Michael Lewis
Thanks for closing the loop on the data correlation question. I've been
playing with BRIN indexes on a log table of sorts and this thread helped
clear up some of the behavior I have been seeing.

I am curious, would a partial btree index fit your needs? Perhaps the
maintenance overhead is too significant or this is too off-the-wall, but a
daily job to create a new index and drop the old one concurrently could give the
performance you need while still saving the extra disk space of the full
btree on the timestamp.

CREATE INDEX CONCURRENTLY log_table_rec_insert_time_partial_10_04 ON
log_table USING btree ( rec_insert_time ) WHERE rec_insert_time >
'2019-10-04'::DATE;
DROP INDEX CONCURRENTLY IF EXISTS log_table_rec_insert_time_partial_10_03;

I would consider including the category column as well, but I suspect that
would increase the size of the index significantly. Of course, this depends
on the query planner evaluating that "l.rec_insert_time >= now() - interval
'10 days'" and determining that the index fulfills the need.



Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread Jeremy Finzel
Thank you for the thorough and thoughtful reply!  Please see below.

On Mon, Oct 14, 2019 at 3:48 PM David Rowley wrote:

> Another thing which you might want to look at is the correlation
> column in the pg_stats view for the rec_insert_time column. Previous
> to 7e534adcd, BRIN indexes were costed based on the selectivity
> estimate. There was no accountability towards the fact that the pages
> for those records might have been spread out over the entire table.
> Post 7e534adcd, we use the correlation estimate to attempt to estimate
> how many pages (more specifically "ranges") we're likely to hit based
> on that and the selectivity estimate. This commit intended to fix the
> issue we had with BRIN indexes being selected far too often.  Of
> course, the correlation is based on the entire table; if there are
> subsets of the table that are perhaps perfectly correlated, then the
> planner is not going to know about that. It's possible that some of
> your older rec_insert_times are spread out far more than the newer
> ones.  As a test, you could try creating a new table and copying the
> records over to it in rec_insert_time order and seeing if the BRIN
> index is selected for that table (after having performed an ANALYZE).
>
> It would be interesting if you could show the pg_stats row for the
> column so that we can see if the correlation is low.
>

So what I said originally (and light bulbs now going off) is that the table
is insert-only, but it is **pruned (deletes) to the past year of data**.  I
think this is the key of what I've missed.  Once vacuum runs, we have a
bunch of old physical space being re-used by new inserts, thus botching
that good correlation between physical and logical order.  So it appears
the physical order of the data is indeed well-correlated in recent history,
but not so when you go back a bit further.  Here are pg_stats:

-[ RECORD 1 ]--+---
schemaname | foo
tablename  | log_table
attname| rec_insert_time
inherited  | f
null_frac  | 0
avg_width  | 8
n_distinct | 1.89204e+06
correlation| 0.193951
most_common_elems  |
most_common_elem_freqs |
elem_count_histogram   |

I have not tried creating a fresh table, but if I modify my OP query to
instead take a 10-day window from 100 days ago, the BRIN index actually
performs really badly, essentially identically to the seq scan:

Here is with a seq scan:

   QUERY PLAN
-
 WindowAgg  (cost=26822167.70..26822170.60 rows=129 width=120) (actual
time=200730.856..200730.910 rows=62 loops=1)
   ->  Sort  (cost=26822167.70..26822168.02 rows=129 width=104) (actual
time=200730.834..200730.837 rows=62 loops=1)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 Sort Method: quicksort  Memory: 33kB
 ->  Subquery Scan on unique_cases  (cost=26822160.60..26822163.18
rows=129 width=104) (actual time=200730.672..200730.763 rows=62 loops=1)
   ->  Unique  (cost=26822160.60..26822161.89 rows=129
width=124) (actual time=200730.670..200730.753 rows=62 loops=1)
 ->  Sort  (cost=26822160.60..26822160.92 rows=129
width=124) (actual time=200730.668..200730.686 rows=395 loops=1)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   Sort Method: quicksort  Memory: 80kB
   ->  Nested Loop  (cost=0.00..26822156.08
rows=129 width=124) (actual time=200692.321..200730.121 rows=395 loops=1)
 Join Filter: ((l.category)::text =
filter.category)
 Rows Removed by Join Filter: 552210
 ->  Seq Scan on small_join_table filter
 (cost=0.00..26.99 rows=1399 width=8) (actual time=0.013..0.179 rows=1399
loops=1)
 ->  Materialize  (cost=0.00..26818970.84
rows=129 width=99) (actual time=136.942..143.440 rows=395 loops=1399)
   ->  Seq Scan on log_table l
 (cost=0.00..26818970.20 rows=129 width=99) (actual
time=191581.193..200649.013 rows=395 loops=1)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '100
days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
 Rows Removed by Filter: 315097963
 Planning Time: 1.541 ms
 Execution Time: 200731.273 ms
(19 rows)

Here is with the forced BRIN index scan:


QUERY PLAN

Re: BRIN index which is much faster never chosen by planner

2019-10-14 Thread David Rowley
On Tue, 15 Oct 2019 at 08:43, Jeremy Finzel  wrote:
> I wanted to follow up on this specific issue.  Isn't this the heart of the 
> matter and a fundamental problem?  If I want to rely on BRIN indexes as in a 
> straightforward case as explained in OP, but I don't know if the planner will 
> be nearly reliable enough, how can I depend on them in production?  Is this 
> not considered a planner bug or should this kind of case be documented as 
> problematic for BRIN?  As another way to look at it: is there a configuration 
> parameter that could be added specific to BRIN or bitmapscan to provide help 
> to cases like this?
>
> On freshly analyzed tables, I tried my original query again and found that 
> even with now() - 3 days it does not choose the BRIN index.  In fact, it 
> chose another btree on the table like (id1, id2, rec_insert_time).  With warm 
> cache, the pg-chosen plan takes 40 seconds to execute, whereas when I force a 
> BRIN scan it takes only 4 seconds.

Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Previous
to 7e534adcd, BRIN indexes were costed based on the selectivity
estimate. There was no accountability towards the fact that the pages
for those records might have been spread out over the entire table.
Post 7e534adcd, we use the correlation estimate to attempt to estimate
how many pages (more specifically "ranges") we're likely to hit based
on that and the selectivity estimate. This commit intended to fix the
issue we had with BRIN indexes being selected far too often.  Of
course, the correlation is based on the entire table; if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones.  As a test, you could try creating a new table and copying the
records over to it in rec_insert_time order and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.

You can see from the code below that the final selectivity is strongly
influenced by the correlation value (REF: brincostestimate):

qualSelectivity = clauselist_selectivity(root, indexQuals,
                                         baserel->relid,
                                         JOIN_INNER, NULL);

/* work out the actual number of ranges in the index */
indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange),
                  1.0);

/*
 * Now calculate the minimum possible ranges we could match with if all of
 * the rows were in the perfect order in the table's heap.
 */
minimalRanges = ceil(indexRanges * qualSelectivity);

/*
 * Now estimate the number of ranges that we'll touch by using the
 * indexCorrelation from the stats. Careful not to divide by zero (note
 * we're using the absolute value of the correlation).
 */
if (*indexCorrelation < 1.0e-10)
    estimatedRanges = indexRanges;
else
    estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);

/* we expect to visit this portion of the table */
selec = estimatedRanges / indexRanges;

CLAMP_PROBABILITY(selec);


My overall view on this is that the BRIN index is not that great since
it's not eliminating that many rows by using it.

From above we see:

 Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
   Output: 
   Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
   Rows Removed by Index Recheck: 8187584
   Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
   Rows Removed by Filter: 19857107
   Heap Blocks: lossy=1509000

So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get those and filter out 8.1 million rows
on the recheck then 19.8 million on the filter. You've mentioned that
the table's heap is 139 GB, which is about 18 million pages.  It seems
your query would perform much better if you had a partial btree index such as
(category, rec_insert_time) WHERE field1 IS NOT NULL.
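
Something like this (untested sketch; index name hypothetical):

CREATE INDEX log_table_category_rec_insert_time_idx
    ON log_table (category, rec_insert_time)
    WHERE field1 IS NOT NULL;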

Of course, you've mentioned that you find the plan executes more quickly
when it uses the BRIN index, but I think you're going to find BRIN
unreliable for anything other than INSERT-only tables in which the records
are always inserted with an ever-increasing or decreasing value in the
BRIN-indexed column.  If you start performing UPDATEs then that's going to
create holes that new records will fill and cause the correlation to start
dropping, resulting in the BRIN index scan cost going up.

On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see 

Re: BRIN index which is much faster never chosen by planner

2019-10-14 Thread Jeremy Finzel
>
> The other issue is that the estimation of pages fetched using bitmap
> heap scan is rather crude - but that's simply hard, and I don't think we
> can fundamentally improve this.
>

I wanted to follow up on this specific issue.  Isn't this the heart of the
matter and a fundamental problem?  If I want to rely on BRIN indexes as in
a straightforward case as explained in OP, but I don't know if the planner
will be nearly reliable enough, how can I depend on them in production?  Is
this not considered a planner bug or should this kind of case be documented
as problematic for BRIN?  As another way to look at it: is there a
configuration parameter that could be added specific to BRIN or bitmapscan
to provide help to cases like this?

On freshly analyzed tables, I tried my original query again and found that
even with now() - 3 days it does not choose the BRIN index.  In fact it
chose another btree on the table like (id1, id2, rec_insert_time).  With
warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I
force a BRIN scan it takes only 4 seconds.

I could understand more if the execution times were close, but the actual
BRIN index is orders of magnitude faster than the plan Postgres is
choosing.  I appreciate the feedback on this very much, as I am quite eager
to use BRIN indexes!!!

Thanks,
Jeremy


Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Jeremy Finzel
On Thu, Oct 10, 2019 at 6:13 PM Tomas Vondra wrote:

>
> So this seems like a combination of multiple issues. Firstly, the bitmap
> index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
> might be worth increasing stats target on that column, or something like
> that. Not sure, but it's about the only "fixable" thing here, I think.
>

In the OP I had mentioned that I already increased it to 5000, and it made
no difference.  Ah fine, let's go ahead and try 10000... still no change:

foo_prod=# ALTER TABLE log_table ALTER COLUMN rec_insert_time SET
STATISTICS 10000;
ALTER TABLE
foo_prod=# ANALYZE log_table;
ANALYZE
foo_prod=# EXPLAIN
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

   QUERY PLAN
-
 WindowAgg  (cost=24451299.20..24451313.21 rows=623 width=120)
   ->  Sort  (cost=24451299.20..24451300.75 rows=623 width=104)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 ->  Subquery Scan on unique_cases  (cost=24451257.82..24451270.28
rows=623 width=104)
   ->  Unique  (cost=24451257.82..24451264.05 rows=623
width=124)
 ->  Sort  (cost=24451257.82..24451259.38 rows=623
width=124)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   ->  Nested Loop  (cost=0.00..24451228.90
rows=623 width=124)
 Join Filter: ((l.category)::text =
filter.category)
 ->  Seq Scan on small_join_table filter
 (cost=0.00..26.99 rows=1399 width=8)
 ->  Materialize  (cost=0.00..24435949.31
rows=623 width=99)
   ->  Seq Scan on log_table l
 (cost=0.00..24435946.20 rows=623 width=99)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

Thanks,
Jeremy


Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Jeremy Finzel
Dear Michael,

On Thu, Oct 10, 2019 at 5:20 PM Michael Lewis  wrote:

> Since the optimizer is choosing a seq scan over index scan when it seems
> like it has good row estimates in both cases, to me that may mean costs of
> scanning index are expected to be high. Is this workload on SSD? Has the
> random_page_cost config been decreased from default 4 (compared with cost
> of 1 unit for sequential scan)?
>

It's 1.5


> Your buffer hits aren't great. What is shared_buffers set to? How much ram
> on this cluster?
>

shared_buffers is 4GB.  The machine has 500GB of RAM, but the server hosts
several clusters.


>
> With this table being insert only, one assumes correlation is very high on
> the data in this column as shown in pg_stats, but have you confirmed?
>

Yes, but the issue isn't with the BRIN index performing badly or being
fragmented.  It's that it performs great (7x faster than the seq scan) but
postgres doesn't choose it.  I have seen this same issue in other attempts
I have made to use BRIN.


> To me, distinct ON is often a bad code smell and probably can be
> re-written to be much more efficient with GROUP BY, lateral & order by, or
> some other tool. Same with the window function. It is a powerful tool, but
> sometimes not the right one.
>

I don't really agree, but it's beside the point because the issue is not in
the aggregation; it's pre-aggregation.  Indeed, if I run my query as a simple
select (as I tried), it's the exact same planning issue.  (In my experience,
DISTINCT ON is the fastest for the given example.  Same with window
functions, which prevent inefficient self-joins.)


> Is "source" a function that is called on field1? What is it doing/how is
> it defined?
>

I can't see how that matters either, but the "source" function is a mask
for a built-in pg function that is trivial.  This whole query is masked so
as not to expose our actual prod query, but I hope it's still
understandable enough :).

My question is not how to make this query faster in general.  It's that I
want to use BRIN indexes very much, but I'm not sure I can trust they will
scale with the right query plan like I know BTREE will.

Thanks!
Jeremy


Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Jeremy Finzel
On Thu, Oct 10, 2019 at 7:22 PM David Rowley wrote:

> The planner might be able to get a better estimate on the number of
> matching rows if the now() - interval '10 days' expression was
> replaced with 'now'::timestamptz - interval '10 days'. However, care
> would need to be taken to ensure the plan is never prepared since
> 'now' is evaluated during parse. The same care must be taken when
> creating views, functions, stored procedures and the like.
>

You are on to something here I think with the now() function, even if the
above suggestion is not exactly right, as you said further down.  I am
finding that a hard-coded timestamp gives the right query plan.  I also
tested the same with an even bigger window (last 16 days) and it still
chooses the BRIN index.

foo_prod=# EXPLAIN
foo_prod-# SELECT
foo_prod-#  category, source, MIN(rec_insert_time) OVER (partition by
source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time)
OVER (partition by source order by rec_insert_time) AS last_source_time
foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
foo_prod(# category, source(field1) AS source, rec_insert_time
foo_prod(# FROM log_table l
foo_prod(# INNER JOIN public.small_join_table filter ON filter.category =
l.category
foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC)
unique_cases;

   QUERY PLAN
-
 WindowAgg  (cost=24436329.10..24436343.56 rows=643 width=120)
   ->  Sort  (cost=24436329.10..24436330.70 rows=643 width=104)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 ->  Subquery Scan on unique_cases  (cost=24436286.24..24436299.10
rows=643 width=104)
   ->  Unique  (cost=24436286.24..24436292.67 rows=643
width=124)
 ->  Sort  (cost=24436286.24..24436287.85 rows=643
width=124)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   ->  Nested Loop  (cost=0.00..24436256.25
rows=643 width=124)
 Join Filter: ((l.category)::text =
filter.category)
 ->  Seq Scan on small_join_table filter
 (cost=0.00..26.99 rows=1399 width=8)
 ->  Materialize  (cost=0.00..24420487.02
rows=643 width=99)
   ->  Seq Scan on log_table l
 (cost=0.00..24420483.80 rows=643 width=99)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

foo_prod=# SELECT now() - interval '10 days';
   ?column?
---
 2019-10-01 08:20:38.115471-05
(1 row)

foo_prod=# EXPLAIN
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

  QUERY PLAN
---
 WindowAgg  (cost=19664576.17..19664590.63 rows=643 width=120)
   ->  Sort  (cost=19664576.17..19664577.77 rows=643 width=104)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 ->  Subquery Scan on unique_cases  (cost=19664533.31..19664546.17
rows=643 width=104)
   ->  Unique  (cost=19664533.31..19664539.74 rows=643
width=124)
 ->  Sort  (cost=19664533.31..19664534.92 rows=643
width=124)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   ->  Nested Loop  (cost=3181.19..19664503.32
rows=643 width=124)
 ->  Gather  (cost=3180.91..19662574.92
rows=643 width=99)
   Workers Planned: 3
   ->  Parallel Bitmap Heap Scan on
log_table l  (cost=2180.91..19661510.62 rows=207 width=99)
 Recheck Cond: (rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name))
   

Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread David Rowley
On Fri, 11 Oct 2019 at 17:48, Michael Lewis  wrote:
>
> On Thu, Oct 10, 2019 at 6:22 PM David Rowley wrote:
>> The planner will just estimate the selectivity of now() - interval '10
>> days'  by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
>> thinks it'll get 1/3rd of the table.  Using 'now' will allow the
>> planner to lookup actual statistics on that column which will likely
>> give a much better estimate, which by the looks of it, likely will
>> result in one of those BRIN index being used.
>
>
> This surprised me a bit, and would have significant implications. I tested a 
> few different tables in our system and get the same row count estimate with 
> either WHERE condition. Perhaps I am missing a critical piece of what you 
> said.
>
> explain
> select * from charges where posted_on > now() - interval '10 days';
>
> explain
> select * from charges where posted_on > 'now'::timestamptz  - interval '10 
> days';

You're right. On looking more closely at the code, it uses
estimate_expression_value(), which performs additional constant
folding of expressions for selectivity purposes only. It does end up
calling the now() function and evaluating the now() - interval '10
days' expression into a Const.

The header comment for that function reads:

 * estimate_expression_value
 *
 * This function attempts to estimate the value of an expression for
 * planning purposes.  It is in essence a more aggressive version of
 * eval_const_expressions(): we will perform constant reductions that are
 * not necessarily 100% safe, but are reasonable for estimation purposes.

So I take back what I said about using 'now'::timestamptz instead of now().

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread Michael Lewis
On Thu, Oct 10, 2019 at 6:22 PM David Rowley wrote:

> The planner might be able to get a better estimate on the number of
> matching rows if the now() - interval '10 days' expression was
> replaced with 'now'::timestamptz - interval '10 days'. However, care
> would need to be taken to ensure the plan is never prepared since
> 'now' is evaluated during parse. The same care must be taken when
> creating views, functions, stored procedures and the like.
>
> The planner will just estimate the selectivity of now() - interval '10
> days'  by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
> thinks it'll get 1/3rd of the table.  Using 'now' will allow the
> planner to lookup actual statistics on that column which will likely
> give a much better estimate, which by the looks of it, likely will
> result in one of those BRIN index being used.
>

This surprised me a bit, and would have significant implications. I tested
a few different tables in our system and get the same row count estimate
with either WHERE condition. Perhaps I am missing a critical piece of what
you said.

explain
select * from charges where posted_on > now() - interval '10 days';

explain
select * from charges where posted_on > 'now'::timestamptz  - interval '10
days';


Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread David Rowley
On Fri, 11 Oct 2019 at 12:13, Tomas Vondra  wrote:
> The index scan is estimated to return 157328135 rows, i.e. about 50% of
> the table (apparently it's ~10x more than the actual number).

Don't pay too much attention to the actual row counts from bitmap
index scans of brin indexes. The value is entirely made up, per:

/*
 * XXX We have an approximation of the number of *pages* that our scan
 * returns, but we don't have a precise idea of the number of heap tuples
 * involved.
 */
return totalpages * 10;

in bringetbitmap().

(Ideally EXPLAIN would be written in such a way that it didn't even
show the actual rows for node types that don't return rows. However,
I'm sure that would break many explain parsing tools)

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.
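
To illustrate that hazard (a contrived example):

PREPARE recent AS
    SELECT * FROM log_table
    WHERE rec_insert_time >= 'now'::timestamptz - interval '10 days';
-- 'now' was folded to a constant when the statement was parsed, so
-- every subsequent EXECUTE recent; reuses that original timestamp.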

The planner will just estimate the selectivity of now() - interval '10
days'  by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table.  Using 'now' will allow the
planner to lookup actual statistics on that column which will likely
give a much better estimate, which by the looks of it, likely will
result in one of those BRIN index being used.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread Tomas Vondra

On Thu, Oct 10, 2019 at 04:58:11PM -0500, Jeremy Finzel wrote:


...

Notice it chooses the smallest BRIN index with 1000 pages per range, and
this is far faster than the seq scan.

I do believe the estimate is actually way off.  Just a plain EXPLAIN of the
latter estimates 10x more rows than actual:
WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
  ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78
rows=706 width=104)
  ->  Unique  (cost=24354641.66..24354648.72 rows=706
width=124)
->  Sort  (cost=24354641.66..24354643.42 rows=706
width=124)
  Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
  ->  Nested Loop  (cost=2385.42..24354608.25
rows=706 width=124)
->  Bitmap Heap Scan on log_table l
(cost=2383.82..24352408.26 rows=706 width=99)
  Recheck Cond: (rec_insert_time >=
(now() - '10 days'::interval))
  Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name))
  ->  Bitmap Index Scan on
rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
Index Cond: (rec_insert_time >= (now() - '10 days'::interval))

->  Bitmap Heap Scan on small_join_table
filter  (cost=1.60..3.12 rows=1 width=8)
  Recheck Cond: (category =
(l.category)::text)
  ->  Bitmap Index Scan on
small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
Index Cond: (category =
(l.category)::text)
(17 rows)


Here is EXPLAIN only of the default chosen plan:
WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
  ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78
rows=706 width=104)
  ->  Unique  (cost=24437809.66..24437816.72 rows=706
width=124)
->  Sort  (cost=24437809.66..24437811.42 rows=706
width=124)
  Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
  ->  Nested Loop  (cost=0.00..24437776.25
rows=706 width=124)
Join Filter: ((l.category)::text =
filter.category)
->  Seq Scan on log_table l
(cost=0.00..24420483.80 rows=706 width=99)
  Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
->  Materialize  (cost=0.00..33.98
rows=1399 width=8)
  ->  Seq Scan on small_join_table
filter  (cost=0.00..26.99 rows=1399 width=8)
(13 rows)



Any insight into this is much appreciated.  This is just one example of
many similar issues I have been finding in getting BRIN indexes to scale
predictably with insert-only workloads.



It's quite an interesting planning issue. The cost estimates are:

->  Seq Scan on foo.log_table l
(cost=0.00..24420483.80 rows=707 width=99) (actual

while for the bitmap heap scan it looks like this:

->  Bitmap Heap Scan on foo.log_table l
(cost=2391.71..24360848.29 rows=735 width=99) (actual

So the planner actually thinks the bitmap heap scan is a tad *cheaper*
but picks the seq scan anyway. This is likely because we don't really
compare the exact costs, but we do fuzzy comparison - the plan has to be
at least 1% cheaper to dominate the existing plan. This allows us to
save some work when replacing the paths.

In this case the difference is only about 0.2%, so we keep the seqscan
path. The real question is why the planner came to this cost, when it
got pretty good row estimates etc.

Looking at the cost_bitmap_heap_scan() I think the costing issue comes
mostly from this bit:

/*
 * For small numbers of pages we should charge spc_random_page_cost
 * apiece, while if nearly all the table's pages are being read, it's more
 * appropriate to charge spc_seq_page_cost apiece.  The effect is
 * nonlinear, too. For lack of a better idea, interpolate like this to
 * determine the cost per page.
 */
if (pages_fetched >= 2.0)
    cost_per_page = spc_random_page_cost -
        (spc_random_page_cost - spc_seq_page_cost)
        * sqrt(pages_fetched / T);
else
    cost_per_page = spc_random_page_cost;
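
For instance, with random_page_cost = 1.5 (as reported elsewhere in this
thread) and seq_page_cost = 1.0, a bitmap heap scan expected to fetch a
quarter of the table's pages would be charged
cost_per_page = 1.5 - (1.5 - 1.0) * sqrt(0.25) = 1.25, i.e. only slightly
more than a sequential read.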

The index scan is estimated to return 157328135 rows, i.e. about 50% of
the table (apparently it's ~10x more than the actual number). This is
produced by compute_bitmap_pages() which also computes pages_fetched,
and I guess that's going to be 

Re: BRIN index which is much faster never chosen by planner

2019-10-10 Thread Michael Lewis
Since the optimizer is choosing a seq scan over index scan when it seems
like it has good row estimates in both cases, to me that may mean costs of
scanning index are expected to be high. Is this workload on SSD? Has the
random_page_cost config been decreased from default 4 (compared with cost
of 1 unit for sequential scan)?

Your buffer hits aren't great. What is shared_buffers set to? How much ram
on this cluster?

With this table being insert only, one assumes correlation is very high on
the data in this column as shown in pg_stats, but have you confirmed?

To me, distinct ON is often a bad code smell and probably can be re-written
to be much more efficient with GROUP BY, lateral & order by, or some other
tool. Same with the window function. It is a powerful tool, but sometimes
not the right one.

Is "source" a function that is called on field1? What is it doing/how is it
defined?


BRIN index which is much faster never chosen by planner

2019-10-10 Thread Jeremy Finzel
Good Afternoon,

I posted about this on another thread, but the topic was not precisely
planner issues, so I wanted to post it here.

I am running Postgres 11.5.  I have a table that is insert-only and has 312
million rows.  It is also pruned continuously to keep only the past year.  The size
of the table is 223 GB with indexes, 140GB without.  One of the fields is
rec_insert_time timestamptz.  Here are all potentially relevant table stats:

schemaname  | foo
relname | log_table
n_tup_ins   | 86850506
n_tup_upd   | 0
n_tup_del   | 68916115
n_tup_hot_upd   | 0
n_live_tup  | 312810691
n_dead_tup  | 9405132
n_mod_since_analyze | 11452
last_vacuum | 2019-09-20 09:41:43.78383-05
last_autovacuum | 2019-10-04 13:56:16.810355-05
last_analyze| 2019-10-10 09:34:26.807695-05
last_autoanalyze|
vacuum_count| 2
autovacuum_count| 1
analyze_count   | 13
autoanalyze_count   | 0
total_relation_size | 223 GB
relation_size   | 139 GB
table_size  | 140 GB

I have a simple query looking at past 10 days based on rec_insert_time, and
it will not choose the BRIN index even with several configurations.  Here
are all my relevant indexes (I intentionally do not have a btree on
rec_insert_time because I believe BRIN *should* fit better here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product,
rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH
(pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH
(pages_per_range='1000')

And here is the SQL:
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases
;

This query will choose a seq scan on log_table every time in spite of these
BRIN indexes on rec_insert_time.
Michael Lewis had suggested I check default_statistics_target for this
column.  I raised it to 5000 for this column and still it's choosing a seq
scan.

Here is the default chosen plan (takes 2 minutes 12 seconds):
 WindowAgg  (cost=24437881.80..24437897.70 rows=707 width=120) (actual
time=132173.173..132173.222 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source,
min(unique_cases.rec_insert_time) OVER (?),
max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
   ->  Sort  (cost=24437881.80..24437883.56 rows=707 width=104) (actual
time=132173.146..132173.149 rows=53 loops=1)
 Output: unique_cases.source, unique_cases.rec_insert_time,
unique_cases.category
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 Sort Method: quicksort  Memory: 32kB
 Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
 ->  Subquery Scan on unique_cases  (cost=24437834.20..24437848.34
rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
   Output: unique_cases.source, unique_cases.rec_insert_time,
unique_cases.category
   Buffers: shared hit=391676 read=17772642 dirtied=4679
written=7
   ->  Unique  (cost=24437834.20..24437841.27 rows=707
width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
 Output: l.category, (source(l.field1)),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
 Buffers: shared hit=391676 read=17772642 dirtied=4679
written=7
 ->  Sort  (cost=24437834.20..24437835.96 rows=707
width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
   Output: l.category, (source(l.field1)),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   Sort Method: quicksort  Memory: 90kB
   Buffers: shared hit=391676 read=17772642
dirtied=4679 written=7
   ->  Nested Loop  (cost=0.00..24437800.73
rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
 Output: l.category, source(l.field1),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
 Inner Unique: true
 Join Filter: