Re: BRIN index worse than sequential scan for large search set

2023-02-25 Thread Tomas Vondra
FWIW I don't think the randomness per se is the main issue. The main
problem is that this results in a loop of bitmap index scans, with 20k
loops. This is made worse by the block-range nature of BRIN indexes,
resulting in many more heap block accesses.

The table has ~80k pages, but the bitmapscan plan reads ~2559362 pages. That
can easily happen if each loop matches 100 ranges (out of 700), each
having 128 pages. Making the ranges smaller should help to minimize the
number of pages read unnecessarily, but the loop count remains the issue.

And the same range may be scanned again and again, if the range is
consistent with multiple values.
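A back-of-envelope model of that arithmetic (a toy calculation using the illustrative figures above, not measurements; `heap_pages_touched` is just a made-up helper):

```python
# Toy model of heap page accesses in the nested-loop bitmap plan.
# Each loop runs its own bitmap scan, so a range that is consistent with
# many probe values is re-read on every loop instead of visited once.

def heap_pages_touched(loops, ranges_matched_per_loop, pages_per_range):
    return loops * ranges_matched_per_loop * pages_per_range

# 20k probes, each matching even a single 128-page range, already cost
# ~2.56M page accesses -- vs ~80k pages for one sequential pass:
print(heap_pages_touched(20_000, 1, 128))  # 2560000

# Smaller ranges shrink the per-loop waste linearly, but the 20k loops remain:
print(heap_pages_touched(20_000, 1, 32))   # 640000
```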

Interestingly enough, this is the kind of query / plan I thought
about a couple of weeks ago, which made me look at SK_SEARCHARRAY
support for BRIN indexes.

Imagine you rewrote the query to something like:

SELECT * FROM test_brin WHERE id IN (... literals ...);

That would only scan the table once, reducing the number of heap pages it
has to access. The drawback is that we still have to build the bitmap,
and without SK_SEARCHARRAY support we just scan the index again for each
element. Still, it's an improvement.

With the SK_SEARCHARRAY patch [1] this is further optimized, and for me
the query runs a bit faster than a seqscan (not by much, and it depends on
how the data is cached). At least with pages_per_range=1.

Of course, this won't help unless the query can be rewritten like this.
At least not currently, but I wonder if we could invent some sort of
"pushdown" that'd derive an array of values and push it down into a
parameterized path at once (instead of doing that for each value in a loop).

regards

[1] https://commitfest.postgresql.org/42/4187/


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BRIN index worse than sequential scan for large search set

2023-02-24 Thread Justin Pryzby
On Fri, Feb 24, 2023 at 06:51:00PM +0100, Mickael van der Beek wrote:
> Hello Justin,
> 
> Thanks for the quick response!
> 
> > The table may be dense, but the tuples aren't.  You're asking to return
> > 1/1000th of the tuples, across the entire table.  Suppose there are ~100
> > tuples per page, and you need to read about every 10th page.  It makes
> > sense that it's slow to read a large amount of data nonsequentially.
> 
> Ah, of course, you're right!
> I forgot that the BRIN indexes store ranges that are not fully covered by
> the row values and that PostgreSQL has to double-check (bitmap heap scan)
> ...
> Would you thus advise to only use BRIN indexes for columns whose values are
> (1) monotonically increasing but also (2) close to each other?

It's not important whether they're "rigidly" monotonic (nor "strictly").
What's important is that a query doesn't need to access a large number
of pages.

For example, some of the BRIN indexes that I'm familiar with are created
on a column called "start time", but the table's data tends to be
naturally sorted by "end time" - and that's good enough.  If someone
queries for data between 12pm and 1pm, there's surely no data for the
first 12 hours of the day's table (because it hadn't happened yet) and
there's probably no data for the last 9+ hours of the day, either, so
it's only got to read data for a 1-2h interval in the middle.  This
assumes that the column's data is typically correlated.  If the tuples
aren't clustered/"close to each other" then it probably doesn't work
well.  I haven't played with brin "multi minmax", though.
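A toy illustration of that correlation effect (plain Python, not PostgreSQL internals; the block and interval sizes are arbitrary): per-block min/max summaries prune almost everything when the data is physically ordered, and almost nothing when the same values are shuffled.

```python
import random

def blocks_to_read(values, block_size, lo, hi):
    """Count blocks whose [min, max] summary overlaps the interval [lo, hi]."""
    hits = 0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        if min(block) <= hi and max(block) >= lo:
            hits += 1
    return hits

sorted_data = list(range(100_000))   # physically correlated layout
shuffled = sorted_data[:]
random.seed(0)
random.shuffle(shuffled)             # same values, zero correlation

# A narrow interval touches 2 of the 100 blocks in the sorted layout,
# but virtually every block in the shuffled one.
print(blocks_to_read(sorted_data, 1000, 40_000, 41_000))
print(blocks_to_read(shuffled, 1000, 40_000, 41_000))
```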

> > >2. Since we only select the "idx" column, why does the BRIN index not
> > >simply return the searched value if included in one of its ranges?
> > >Hitting the actual row data stored in the table seems to be unnecessary,
> > >no?
> >
> > Because it's necessary to check if the tuple is visible to the current
> > transaction.  It might be from an uncommitted/aborted transaction.

Actually, a better explanation is that all the BRIN scan returns is pages,
not tuples.

"BRIN indexes can satisfy queries via regular bitmap index scans, and
will return all tuples in all pages within each range if the summary
info stored by the index is CONSISTENT with the query conditions.  The
query executor is in charge of rechecking these tuples and discarding
those that do not match the query conditions — in other words, these
indexes are LOSSY".

The index is returning pages where matching tuples *might* be found,
after excluding those pages where it's certain that no tuples are found.
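That lossy behaviour can be sketched in a few lines (a toy model, not PostgreSQL code; all names here are made up): the summary only nominates candidate pages, and every tuple on those pages must then be rechecked.

```python
# Toy model of a lossy bitmap scan: the "index" is a min/max summary per
# page; consistent pages are nominated, then each tuple is rechecked.

def lossy_scan(pages, value):
    summaries = [(min(p), max(p)) for p in pages]
    matches, rechecked = [], 0
    for page, (lo, hi) in zip(pages, summaries):
        if not (lo <= value <= hi):   # summary inconsistent: skip the page
            continue
        for tup in page:              # recheck every tuple on the page
            rechecked += 1
            if tup == value:
                matches.append(tup)
    return matches, rechecked

pages = [[1, 5, 9], [10, 14, 18], [19, 23, 27]]

# Searching for 14: only the middle page is consistent; all 3 of its
# tuples get rechecked, and one true match survives.
print(lossy_scan(pages, 14))   # ([14], 3)
```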

-- 
Justin




Re: BRIN index worse than sequential scan for large search set

2023-02-24 Thread Mickael van der Beek
Hello Justin,

Thanks for the quick response!

> The table may be dense, but the tuples aren't.  You're asking to return
> 1/1000th of the tuples, across the entire table.  Suppose there are ~100
> tuples per page, and you need to read about every 10th page.  It makes
> sense that it's slow to read a large amount of data nonsequentially.


Ah, of course, you're right!
I forgot that the BRIN indexes store ranges that are not fully covered by
the row values and that PostgreSQL has to double-check (bitmap heap scan)
...
Would you thus advise to only use BRIN indexes for columns whose values are
(1) monotonically increasing but also (2) close to each other?

I guess that in my use case, something like a roaring bitmap would have
been perfect but I do not believe that those exist in PostgreSQL by default.
B-trees work well performance-wise but are simply too large for my use case
(> 400 MB for 20M rows, +/- 380 MB even with an index fill factor of 100%):
I need to index around 6B rows partitioned into roughly 3K buckets, which
would result in ~120 GB of B-tree indexes, which seems a bit much for simple
existence checks.

> Because it's necessary to check if the tuple is visible to the current
> transaction.  It might be from an uncommitted/aborted transaction.


Interesting, that's something I did not consider indeed.
I'm guessing that this is a cost brought by MVCC that you can't get around
no matter the isolation level?

Mickael


On Fri, Feb 24, 2023 at 6:19 PM Justin Pryzby  wrote:

> On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> > Hello everyone,
> >
> > I'm playing around with BRIN indexes so as to get a feel for the feature.
> > During my tests, I was unable to make BRIN indexes perform better than a
> > sequential scan for queries searching for large value sets (20K values in
> > the example down below).
>
> > And now let's query 20K random rows from our 20M total rows:
>
> I didn't try your test, but I think *random* is the problem/explanation.
>
> > By default, this query will not use the BRIN index and simply run a 1.5s
> > long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> > 8.7s query time:
> > https://explain.dalibo.com/plan/46c3191g8a6c1bc7
>
> > If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> > OFF;`) the same query will now take 50s with 2.5s spent on the bitmap
> index
> > scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> > (hitting 20 GB of data!):
> > https://explain.dalibo.com/plan/7f73bg9172a8b226
>
> That means the planner's cost model correctly preferred a seq scan.
>
> > So I had the following two questions:
> >
> >1. Why is the BRIN index systematically worse than a sequential scan,
> >even when the table is x1000 larger than the search set, physically
> >pre-sorted, dense (fillfactor at 100%) and the search rows are
> themselves
> >sorted?
>
> The table may be dense, but the tuples aren't.  You're asking to return
> 1/1000th of the tuples, across the entire table.  Suppose there are ~100
> tuples per page, and you need to read about every 10th page.  It makes
> sense that it's slow to read a large amount of data nonsequentially.
> That's why random_page_cost is several times higher than seq_page_cost.
>
> I would expect brin to win if the pages to be accessed were dense rather
> than distributed across the whole table.
>
> >2. Since we only select the "idx" column, why does the BRIN index not
> >simply return the searched value if included in one of its ranges?
> >Hitting the actual row data stored in the table seems to be unnecessary,
> >no?
>
> Because it's necessary to check if the tuple is visible to the current
> transaction.  It might be from an uncommitted/aborted transaction.
>
> --
> Justin
>


-- 
Mickael van der Beek
Web developer & Security analyst

mickael.van.der.b...@gmail.com


Re: BRIN index worse than sequential scan for large search set

2023-02-24 Thread Justin Pryzby
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> Hello everyone,
> 
> I'm playing around with BRIN indexes so as to get a feel for the feature.
> During my tests, I was unable to make BRIN indexes perform better than a
> sequential scan for queries searching for large value sets (20K values in
> the example down below).

> And now let's query 20K random rows from our 20M total rows:

I didn't try your test, but I think *random* is the problem/explanation.

> By default, this query will not use the BRIN index and simply run a 1.5s
> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> 8.7s query time:
> https://explain.dalibo.com/plan/46c3191g8a6c1bc7

> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> (hitting 20 GB of data!):
> https://explain.dalibo.com/plan/7f73bg9172a8b226

That means the planner's cost model correctly preferred a seq scan.

> So I had the following two questions:
> 
>1. Why is the BRIN index systematically worse than a sequential scan,
>even when the table is x1000 larger than the search set, physically
>pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
>sorted?

The table may be dense, but the tuples aren't.  You're asking to return
1/1000th of the tuples, across the entire table.  Suppose there are ~100
tuples per page, and you need to read about every 10th page.  It makes
sense that it's slow to read a large amount of data nonsequentially.
That's why random_page_cost is several times higher than seq_page_cost.

I would expect brin to win if the pages to be accessed were dense rather
than distributed across the whole table.
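That tradeoff can be sketched with the default cost constants (a deliberately crude model on page costs alone; the real planner accounts for many more factors, such as CPU costs and effective_cache_size):

```python
# Crude break-even sketch using the default planner cost constants.
SEQ_PAGE_COST = 1.0      # postgresql.conf default
RANDOM_PAGE_COST = 4.0   # postgresql.conf default

def seqscan_cheaper(fraction_of_pages_needed):
    """True when reading everything sequentially beats fetching the
    needed fraction of pages nonsequentially (page costs only)."""
    return fraction_of_pages_needed * RANDOM_PAGE_COST > SEQ_PAGE_COST

# Needing every 10th page is below the 25% break-even on page costs alone:
print(seqscan_cheaper(0.10))   # False

# But a 128-page BRIN range almost certainly contains one of those pages,
# so nearly every range matches and the effective fraction approaches 1:
print(seqscan_cheaper(1.0))    # True
```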

>2. Since we only select the "idx" column, why does the BRIN index not
>simply return the searched value if included in one of its ranges?
>Hitting the actual row data stored in the table seems to be unnecessary, no?

Because it's necessary to check if the tuple is visible to the current
transaction.  It might be from an uncommitted/aborted transaction.

-- 
Justin




BRIN index worse than sequential scan for large search set

2023-02-24 Thread Mickael van der Beek
Hello everyone,

I'm playing around with BRIN indexes so as to get a feel for the feature.
During my tests, I was unable to make BRIN indexes perform better than a
sequential scan for queries searching for large value sets (20K values in
the example down below).

Creating the table with one single BIGINT required column:

> CREATE TABLE
>   test_brin
>   (
> idx BIGINT NOT NULL
>   )
> WITH
>   (
> fillfactor = 100
>   )
> ;


Filling the table with 20 million sorted random BIGINT values:

> INSERT INTO
>   test_brin
>   (
> idx
>   )
> SELECT
>   CAST(FLOOR(RANDOM() * 9223372036854775806) AS BIGINT)
> AS idx
> FROM
>   GENERATE_SERIES(1, 20 * 1000 * 1000, 1)
> AS g
> ORDER BY
>   idx ASC
> ;


Now we cluster the table (even though this shouldn't be needed):

> CREATE UNIQUE INDEX test_brin_idx_uniq ON test_brin (idx);
> CLUSTER test_brin USING test_brin_idx_uniq;
> DROP INDEX test_brin_idx_uniq;



Now we create the BRIN index on what should be a perfectly ordered and
dense table:

> CREATE INDEX
>   test_brin_idx
> ON
>   test_brin
> USING
>   BRIN
>   (
> idx
>   )
> ;


Let's VACUUM the table just to be safe:

> VACUUM test_brin;
> VACUUM ANALYSE test_brin;


And now let's query 20K random rows from our 20M total rows:

> EXPLAIN (
>   ANALYZE,
>   VERBOSE,
>   COSTS,
>   BUFFERS,
>   TIMING
> )
> SELECT
>   tb.idx
> FROM
>   test_brin
> AS tb
> WHERE
>   EXISTS (
> SELECT
> FROM
>   (
> SELECT
>   idx
> FROM
>   (
> SELECT
>   -- Just trying to break the optimisation that would recognize
>   -- "idx" as being an indexed column
>   (idx + (CEIL(RANDOM()) - 1))::BIGINT
> AS idx
> FROM
>   test_brin
> ORDER BY
>   RANDOM()
> LIMIT
>   2
>   )
> AS t
> ORDER BY
>   idx ASC
>   )
> AS q
> WHERE
>   tb.idx = q.idx
>   )
> ;


By default, this query will not use the BRIN index and simply run a 1.5s
long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
8.7s query time:

https://explain.dalibo.com/plan/46c3191g8a6c1bc7

If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
(hitting 20 GB of data!):

https://explain.dalibo.com/plan/7f73bg9172a8b226

Since the bitmap heap scan is taking a long time, let's reduce the
`pages_per_range` parameter from its default value of 128 to 32:

> CREATE INDEX
>   test_brin_idx
> ON
>   test_brin
> USING
>   BRIN
>   (
> idx
>   )
> WITH
>   (
> pages_per_range = 32
>   )
> ;


The query now takes 25s, half the time we had before, with 9.7s (up from
2.5s) spent on the bitmap index scan (hitting 2.6GB of data, up from 470
MB) and 10s (down from 42s) on the bitmap heap scan (hitting 4.9GB of data,
down from 20 GB):

https://explain.dalibo.com/plan/64fh5h1daaheeab3

We go a step further and lower the `pages_per_range` parameter to 8 (the
other extreme).

The query now takes 45s, close-ish to the initial time, with 38.5s (up from
2.5s) spent on the bitmap index scan (hitting 9.8GB of data, up from 470
MB) and 2.6s (down from 42s) on the bitmap heap scan (hitting 1.2GB of
data, down from 20 GB):

https://explain.dalibo.com/plan/431fbde7gb19g6g6
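The pattern across these three runs (index scan time rising as heap scan time falls) matches a simple model of the tradeoff. This is a toy calculation with made-up work units, not PostgreSQL's cost model, and the probe/range counts are assumptions:

```python
# Toy model: smaller pages_per_range means fewer wasted heap pages per
# matched range, but more range summaries for every index probe to examine.

TABLE_PAGES = 80_000   # rough table size for 20M bigints

def scan_work(pages_per_range, probes=20_000, ranges_matched_per_probe=1):
    total_ranges = TABLE_PAGES // pages_per_range
    index_work = probes * total_ranges                        # summaries read
    heap_work = probes * ranges_matched_per_probe * pages_per_range
    return index_work, heap_work

for ppr in (128, 32, 8):
    print(ppr, scan_work(ppr))
# 128 -> least index work, most heap work; 8 -> the reverse, mirroring
# the 2.5s/42s vs 38.5s/2.6s split between the two scan nodes above.
```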

So I had the following two questions:

   1. Why is the BRIN index systematically worse than a sequential scan,
   even when the table is x1000 larger than the search set, physically
   pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
   sorted?
   This setup would seem to provide ideal conditions for a BRIN index to
   perform well.

   2. Since we only select the "idx" column, why does the BRIN index not
   simply return the searched value if included in one of its ranges?
   Hitting the actual row data stored in the table seems to be unnecessary, no?

Here's my test environment:

   - PostgreSQL version: v14
   - Memory: 32GB
   - CPUs: 8

Thanks a lot in advance,

Mickael