Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Apr 22, 2024 at 11:13 AM Peter Geoghegan wrote:
> I'm pretty sure that I could fix this by simply removing the
> assertion. But I need to think about it a bit more before I push a
> fix.
>
> The test case you've provided proves that _bt_preprocess_keys's
> new no-op path isn't just used during scans that have array keys (your
> test case doesn't have a SAOP at all). This was never intended. On the
> other hand, I think that it's still correct (or will be once the assertion is
> gone), and it seems like it would be simpler to allow this case (and
> document it) than to not allow it at all.

Pushed a fix like that just now. Thanks for the report, Alexander.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Apr 22, 2024 at 4:00 AM Alexander Lakhin wrote:
> Please look at another case, where a similar Assert (but this time in
> _bt_preprocess_keys()) is triggered:
> CREATE TABLE t (a text, b text);
> INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g;
> CREATE INDEX t_idx ON t USING btree(a);
> BEGIN;
> DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a';
> FETCH FROM c;
> FETCH RELATIVE 0 FROM c;
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962

I'm pretty sure that I could fix this by simply removing the assertion. But I need to think about it a bit more before I push a fix.

The test case you've provided proves that _bt_preprocess_keys's new no-op path isn't just used during scans that have array keys (your test case doesn't have a SAOP at all). This was never intended. On the other hand, I think that it's still correct (or will be once the assertion is gone), and it seems like it would be simpler to allow this case (and document it) than to not allow it at all.

The general idea that we only need one "real" _bt_preprocess_keys call per btrescan (independent of the presence of array keys) still seems sound.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello Peter,

07.04.2024 20:18, Peter Geoghegan wrote:
> On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin wrote:
> > SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
> >
> > TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267
>
> I immediately see what's up here. Will fix this in the next short while.
> There is no bug here in builds without assertions, but it's probably best
> to keep the assertion, and to just make sure not to call
> _bt_preprocess_array_keys_final() unless it has real work to do.

Please look at another case, where a similar Assert (but this time in _bt_preprocess_keys()) is triggered:

CREATE TABLE t (a text, b text);
INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g;
CREATE INDEX t_idx ON t USING btree(a);
BEGIN;
DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a';
FETCH FROM c;
FETCH RELATIVE 0 FROM c;

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Apr 18, 2024 at 2:13 AM Donghang Lin wrote:
> It's triggered when a scankey's strategy is set to invalid. While for a
> descending ordered column, the strategy needs to get fixed to its commute
> strategy. That doesn't work if the strategy is invalid.

The problem is that _bt_fix_scankey_strategy shouldn't have been doing anything with already-eliminated array scan keys in the first place (whether or not they're on a DESC index column). I just pushed a fix along those lines.

Thanks for the report!

--
Peter Geoghegan
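For readers following along, the bug Peter describes can be pictured with a small model. The following is an illustrative Python sketch only, not the actual PostgreSQL C code: the strategy constants mirror nbtree's, but `fix_scankey_strategy` is a hypothetical stand-in for the DESC-column handling in `_bt_fix_scankey_strategy`.

```python
# Simplified model of nbtree strategy numbers (the real definitions live in
# src/include/access/stratnum.h). Illustrative sketch only.
BTLessStrategyNumber = 1
BTLessEqualStrategyNumber = 2
BTEqualStrategyNumber = 3
BTGreaterEqualStrategyNumber = 4
BTGreaterStrategyNumber = 5
InvalidStrategy = 0


def commute_strategy(strategy):
    # Mirror a strategy around equality: < <-> >, <= <-> >=, = stays =.
    return BTGreaterStrategyNumber + 1 - strategy


def fix_scankey_strategy(strategy, desc_column):
    """Hypothetical stand-in for _bt_fix_scankey_strategy's DESC handling."""
    # The fix: an already-eliminated scan key carries InvalidStrategy and
    # must be skipped entirely -- commuting it would produce a nonsense
    # strategy number (6), which is what tripped the assertion.
    if strategy == InvalidStrategy:
        return InvalidStrategy
    if desc_column:
        return commute_strategy(strategy)
    return strategy
```

On an ascending column nothing changes; on a DESC column a `<` key is executed as `>` against the reversed key order, and the invalid-strategy guard models what the pushed fix effectively restores.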
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hi Peter,

There seems to be an assertion failure with this change in HEAD:

TRAP: failed Assert("leftarg->sk_attno == rightarg->sk_attno"), File: "../../src/backend/access/nbtree/nbtutils.c", Line: 3246, PID: 1434532

It can be reproduced by:

create table t(a int);
insert into t select 1 from generate_series(1,10);
create index on t (a desc);
set enable_seqscan = false;
select * from t where a IN (1,2) and a IN (1,2,3);

It's triggered when a scankey's strategy is set to invalid. While for a descending ordered column, the strategy needs to get fixed to its commute strategy. That doesn't work if the strategy is invalid. Attached a demo fix.

Regards,
Donghang Lin (ServiceNow)

demo-fix.patch
Description: Binary data
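For context on why one of the repro's scan keys ends up with an invalid strategy: the two equality arrays in "a IN (1,2) and a IN (1,2,3)" are redundant, and preprocessing reduces them to a single array. The sketch below models that sorted-merge intersection (the idea behind `_bt_merge_arrays` as discussed earlier in this thread); it is an illustration in Python, not the actual implementation.

```python
def merge_equality_arrays(left, right):
    """Intersect two sorted, de-duplicated element arrays with a merge join.

    Models how two equality arrays on the same index column can be reduced
    to one during preprocessing (illustrative sketch only).
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            merged.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return merged
```

Here `merge_equality_arrays([1, 2], [1, 2, 3])` leaves just `[1, 2]`; an empty result would prove the qual unsatisfiable without reading the index at all.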
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 9:57 PM Peter Geoghegan wrote:
> On Sun, Apr 7, 2024 at 9:50 PM Tom Lane wrote:
> > those first two Asserts are redundant with the "if" as well.
>
> I'll get rid of those other two assertions as well, then.

Done that way.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 9:50 PM Tom Lane wrote:
> If you're doing that, then surely
>
>     if (j != (BTEqualStrategyNumber - 1) ||
>         !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
>     {
>         ...
>     }
>     else
>     {
>         Assert(j == (BTEqualStrategyNumber - 1));
>         Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
>         Assert(xform[j].ikey == array->scan_key);
>         Assert(!(cur->sk_flags & SK_SEARCHARRAY));
>     }
>
> those first two Asserts are redundant with the "if" as well.

I'll get rid of those other two assertions as well, then.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Peter Geoghegan writes:
> The assertions in question are arguably redundant. There are very
> similar assertions just a little earlier on, as we initially set up
> the array stuff (right before _bt_compare_scankey_args is called).
> I'll just remove the "Assert(xform[j].ikey == array->scan_key)"
> assertion that Coverity doesn't like, in addition to the
> "Assert(!array || array->scan_key == i)" assertion, on the grounds
> that they're redundant.

If you're doing that, then surely

    if (j != (BTEqualStrategyNumber - 1) ||
        !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
    {
        ...
    }
    else
    {
        Assert(j == (BTEqualStrategyNumber - 1));
        Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
        Assert(xform[j].ikey == array->scan_key);
        Assert(!(cur->sk_flags & SK_SEARCHARRAY));
    }

those first two Asserts are redundant with the "if" as well.

regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 9:25 PM Tom Lane wrote:
> Perhaps this'd help:
>
> -    Assert(xform[j].ikey == array->scan_key);
> +    Assert(array && xform[j].ikey == array->scan_key);
>
> If that doesn't silence it, I'd be prepared to just dismiss the
> warning.

The assertions in question are arguably redundant. There are very similar assertions just a little earlier on, as we initially set up the array stuff (right before _bt_compare_scankey_args is called). I'll just remove the "Assert(xform[j].ikey == array->scan_key)" assertion that Coverity doesn't like, in addition to the "Assert(!array || array->scan_key == i)" assertion, on the grounds that they're redundant.

> Some work in the comment to explain why we must have an array here
> wouldn't be out of place either, perhaps.

There is a comment block about this right above the assertion in question:

    /*
     * Both scan keys might have arrays, in which case we'll
     * arbitrarily pass only one of the arrays.  That won't
     * matter, since _bt_compare_scankey_args is aware that two
     * SEARCHARRAY scan keys mean that _bt_preprocess_array_keys
     * failed to eliminate redundant arrays through array merging.
     * _bt_compare_scankey_args just returns false when it sees
     * this; it won't even try to examine either array.
     */

Do you think it needs more work?

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Peter Geoghegan writes:
> On Sun, Apr 7, 2024 at 8:48 PM Tom Lane wrote:
> > Coverity pointed out something that looks like a potentially live
> > problem in 5bf748b86:
> > ... which certainly makes it look like array->scan_key could be
> > a null-pointer dereference.
>
> But the "Assert(xform[j].ikey == array->scan_key)" assertion is
> located in a block where it's been established that the scan key (the
> one stored in xform[j] at this point in execution) must have an array.
> This is probably very hard for tools like Coverity to understand.

It's not obvious to human readers either ...

> Would Coverity stop complaining if I just removed the assertion? I
> could just do that, I suppose, but that seems backwards to me.

Perhaps this'd help:

-    Assert(xform[j].ikey == array->scan_key);
+    Assert(array && xform[j].ikey == array->scan_key);

If that doesn't silence it, I'd be prepared to just dismiss the warning.

Some work in the comment to explain why we must have an array here wouldn't be out of place either, perhaps.

regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 8:48 PM Tom Lane wrote:
> Coverity pointed out something that looks like a potentially live
> problem in 5bf748b86:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys()
> 2944                 * need to make sure that we don't throw away an array
> 2945                 * scan key.  _bt_compare_scankey_args expects us to
> 2946                 * always keep arrays (and discard non-arrays).
> 2947                 */
> 2948                Assert(j == (BTEqualStrategyNumber - 1));
> 2949                Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
> >>> CID 1596256: Null pointer dereferences (FORWARD_NULL)
> >>> Dereferencing null pointer "array".
> 2950                Assert(xform[j].ikey == array->scan_key);
> 2951                Assert(!(cur->sk_flags & SK_SEARCHARRAY));
> 2952            }
> 2953        }
> 2954        else if (j == (BTEqualStrategyNumber - 1))
>
> Above this there is an assertion
>
>     Assert(!array || array->num_elems > 0);
>
> which certainly makes it look like array->scan_key could be
> a null-pointer dereference.

But the "Assert(xform[j].ikey == array->scan_key)" assertion is located in a block where it's been established that the scan key (the one stored in xform[j] at this point in execution) must have an array. It has been marked SK_SEARCHARRAY, and uses the equality strategy, so it had better have one or we're in big trouble either way. This is probably very hard for tools like Coverity to understand.

We also rely on the fact that only one of the two scan keys (only one of the pair of scan keys that were passed to _bt_compare_scankey_args) can have an array at the point of the assertion that Coverity finds suspicious. It's possible that both of those scan keys actually did have arrays, but _bt_compare_scankey_args just treats that as a case of being unable to prove which scan key was redundant/contradictory due to a lack of suitable cross-type support -- so the assertion won't be reached.

Would Coverity stop complaining if I just removed the assertion? I could just do that, I suppose, but that seems backwards to me.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Coverity pointed out something that looks like a potentially live problem in 5bf748b86:

/srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys()
2944                 * need to make sure that we don't throw away an array
2945                 * scan key.  _bt_compare_scankey_args expects us to
2946                 * always keep arrays (and discard non-arrays).
2947                 */
2948                Assert(j == (BTEqualStrategyNumber - 1));
2949                Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
>>> CID 1596256: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "array".
2950                Assert(xform[j].ikey == array->scan_key);
2951                Assert(!(cur->sk_flags & SK_SEARCHARRAY));
2952            }
2953        }
2954        else if (j == (BTEqualStrategyNumber - 1))

Above this there is an assertion

    Assert(!array || array->num_elems > 0);

which certainly makes it look like array->scan_key could be a null-pointer dereference.

regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin wrote:
> Please look at an assertion failure (reproduced starting from 5bf748b86),
> triggered by the following query:
> CREATE TABLE t (a int, b int);
> CREATE INDEX t_idx ON t (a, b);
> INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
> ANALYZE t;
> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

I immediately see what's up here. Will fix this in the next short while.

There is no bug here in builds without assertions, but it's probably best to keep the assertion, and to just make sure not to call _bt_preprocess_array_keys_final() unless it has real work to do. The assertion failure demonstrates that _bt_preprocess_array_keys_final can currently be called when there can be no real work for it to do.

The problem is that we condition the call to _bt_preprocess_array_keys_final() on whether or not we had to do "real work" back when _bt_preprocess_array_keys() was called, at the start of _bt_preprocess_keys(). That usually leaves us with "so->numArrayKeys > 0", because the "real work" typically includes equality type array keys. But not here -- here you have two SAOP inequalities, which become simple scalar scan keys through early array-specific preprocessing in _bt_preprocess_array_keys(). There are no arrays left at this point, so "so->numArrayKeys == 0".

FWIW I missed this because the tests only cover cases with one SAOP inequality, which will always return early from _bt_preprocess_keys (by taking its generic single scan key fast path). Your test case has 2 scan keys, avoiding the fast path, allowing us to reach _bt_preprocess_array_keys_final().

--
Peter Geoghegan
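The "SAOP inequalities become simple scalar scan keys" step Peter mentions can be pictured as follows. This is a hedged Python sketch of the logical transformation only (a hypothetical helper, not PostgreSQL code): `a < ANY (array)` holds iff `a` is below the array's maximum element, so the whole array collapses to one scalar bound.

```python
def reduce_saop_inequality(op, elements):
    """Collapse "col <op> ANY (elements)" to a single scalar bound (sketch).

    "a < ANY ('{1,2,3}')" is satisfied iff a < 3 (the maximum element);
    "a > ANY (...)" similarly needs only the minimum element. Equality
    arrays do NOT collapse this way, which is why they stay arrays.
    """
    if not elements:
        return None  # empty array: the qual cannot be satisfied
    if op in ('<', '<='):
        return (op, max(elements))
    if op in ('>', '>='):
        return (op, min(elements))
    raise ValueError("only inequality operators collapse this way")
```

In Alexander's query both `a < ANY (ARRAY[1])` and `b < ANY (ARRAY[1])` reduce to plain `< 1` scan keys, which is why no array (and so no numArrayKeys) remains by the time _bt_preprocess_array_keys_final could run.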
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello Peter,

03.04.2024 22:53, Peter Geoghegan wrote:
> On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan wrote:
> > Note: v18 doesn't have any adjustments to the costing, as originally
> > planned. I'll probably need to post a revised patch with improved (or at
> > least polished) costing in the next few days, so that others will have
> > the opportunity to comment before I commit the patch.
>
> Attached is v19, which dealt with remaining concerns I had about the
> costing in selfuncs.c. My current plan is to commit this on Saturday
> morning (US Eastern time).

Please look at an assertion failure (reproduced starting from 5bf748b86), triggered by the following query:

CREATE TABLE t (a int, b int);
CREATE INDEX t_idx ON t (a, b);
INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
ANALYZE t;
SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Mar 18, 2024 at 9:25 AM Matthias van de Meent wrote:
> I was thinking about a more unified processing model, where
> _bt_preprocess_keys would iterate over all keys, including processing
> of array keys (by merging and reduction to normal keys) if and when
> found. This would also reduce the effort expended when there are
> contradictory scan keys, as array preprocessing is relatively more
> expensive than other scankeys and contradictions are then found before
> processing of later keys.

Does it really matter? I just can't get excited about maybe getting one less syscache lookup in queries that involve *obviously* contradictory keys at the SQL level. Especially because we're so much better off with the new design here anyway; calling _bt_preprocess_keys once rather than once per distinct set of array keys is an enormous improvement, all on its own.

My concern with preprocessing overhead is almost entirely limited to pathological performance issues involving extreme (even adversarial) input scan keys/arrays. I feel that if we're going to guarantee that we won't choke in _bt_checkkeys, even given billions of distinct sets of array keys, we should make the same guarantee in _bt_preprocess_keys -- just to avoid POLA violations. But that's the only thing that seems particularly important in the general area of preprocessing performance.

(Preprocessing performance probably does matter quite a bit in more routine cases, where operators like < and >= are mixed together on the same attribute. But there'll be no new _bt_preprocess_keys operator/function lookups for stuff like that.)

The advantage of not completely merging _bt_preprocess_array_keys with _bt_preprocess_keys is that preserving this much of the existing code structure allows us to still decide how many array keys we'll need for the scan up front (if the scan doesn't end up having an unsatisfiable qual).
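As an aside, the "unsatisfiable qual" determination referred to here is cheap relative to array work: a pair of required inequality bounds on one attribute can be proven contradictory before any index page is read. A hedged sketch of that kind of check (illustrative Python only, not the actual _bt_preprocess_keys logic):

```python
def range_qual_satisfiable(low, low_incl, high, high_incl):
    """Can "col > low AND col < high" (bounds optionally inclusive) match?

    Sketch of the contradiction test that preprocessing can perform up
    front, letting the scan end without reading the index at all.
    """
    if low < high:
        return True
    if low == high:
        # only "col >= x AND col <= x" survives; any strict bound kills it
        return low_incl and high_incl
    return False
```

For example, `a > 5 AND a < 1` is rejected immediately, while `a >= 3 AND a <= 3` survives as an equality-like range.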
_bt_preprocess_array_keys will eliminate redundant arrays early, in practically all cases (the exception is when it has to deal with an incomplete opfamily lacking the appropriate cross-type ORDER proc), so we don't have to think about merging arrays after that point. Rather like how we don't have to worry about "WHERE a < any('{1, 2, 3}')" type inequalities after _bt_preprocess_array_keys does an initial round of array-specific preprocessing (_bt_preprocess_keys can deal with those in the same way as it will with standard inequalities).

> > This preprocessing work should all be happening during planning, not
> > during query execution -- that's the design that makes the most sense.
> > This is something we've discussed in the past in the context of skip
> > scan (see my original email to this thread for the reference).
>
> Yes, but IIRC we also agreed that it's impossible to do this fully in
> planning, amongst others due to joins on array fields.

Even with a nested loop join's inner index scan, where the constants used for each btrescan are not really predictable in the planner, we can still do most preprocessing in the planner, at least most of the time. We could still easily do analysis that is capable of ruling out redundant or contradictory scan keys for any possible parameter value -- seen or unseen. I'd expect this to be the common case -- most of the time these inner index scans need only one simple = operator (maybe 2 = operators).

Obviously, that approach still requires that btrescan at least accept a new set of constants for each new inner index scan invocation. But that's still much cheaper than calling _bt_preprocess_keys from scratch every time btrescan is called.

> > It would be especially useful for the very fancy kinds of preprocessing
> > that are described by the MDAM paper, like using an index scan for a
> > NOT IN() list/array (this can actually make sense with a low
> > cardinality index).
>
> Yes, indexes such as those on enums.
> Though, in those cases the NOT IN () could be transformed into
> IN()-lists by the planner, but not the index.

I think that it would probably be built as just another kind of index path, like the ones we build for SAOPs. Anti-SAOPs? Just as with SAOPs, the disjunctive accesses wouldn't really be something that the planner would need too much direct understanding of (except during costing). I'd only expect the plan to use such an index scan when it wasn't too far off needing a full index scan anyway. Just like with skip scan, the distinction between these NOT IN() index scan paths and a simple full index scan path is supposed to be fuzzy (maybe they'll actually turn out to be full scans at runtime, since the number of primitive index scans is determined dynamically, based on the structure of the index at runtime).

> > The structure for preprocessing that I'm working towards (especially
> > in v15) sets the groundwork for making those shifts in the planner,
> > because we'll no longer treat each array constant as its own primitive
> > index scan during preprocessing.
>
> I hope that's going to be a fully separate patch.
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Mar 7, 2024 at 10:42 AM Benoit Tigeot wrote:
> I am not up to date with the last version of patch but I did a regular
> benchmark with version 11 and typical issue we have at the moment and the
> result are still very very good. [1]

Thanks for providing the test case. It was definitely important back when the ideas behind the patch had not yet fully developed. It helped me to realize that my thinking around non-required arrays (meaning arrays that cannot reposition the scan, and just filter out non-matching tuples) was still sloppy.

> In term of performance improvement the last proposals could be a real game
> changer for 2 of our biggest databases. We hope that Postgres 17 will
> contain those improvements.

Current plan is to commit this patch in the next couple of weeks, ahead of Postgres 17 feature freeze.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan wrote:
> On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent wrote:
> > I've attached v14, where 0001 is v13, 0002 is a patch with small
> > changes + some large comment suggestions, and 0003 which contains
> > sorted merge join code for _bt_merge_arrays.
>
> This is part of my next revision, v15, which I've attached (along with
> a test case that you might find useful, explained below).
>
> v15 makes the difference between the non-required scan key trigger and
> required scan key trigger cases clearer within _bt_advance_array_keys.

OK, here's a small additional review, with a suggestion for additional changes to _bt_preprocess:

> @@ -1117,6 +3160,46 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
>     /*
>      * The opfamily we need to worry about is identified by the index column.
>      */
>     Assert(leftarg->sk_attno == rightarg->sk_attno);
>
> +   /*
> +    * If either leftarg or rightarg are equality-type array scankeys, we need
> +    * specialized handling (since by now we know that IS NULL wasn't used)
> +    */
> [...]
> +   }
> +
>     opcintype = rel->rd_opcintype[leftarg->sk_attno - 1];

Here, you insert your code between the comment about which opfamily to choose and the code assigning the opfamily. I think this needs some cleanup.

> +    * Don't call _bt_preprocess_array_keys_final in this fast path
> +    * (we'll miss out on the single value array transformation, but
> +    * that's not nearly as important when there's only one scan key)

Why is it OK to ignore it? Or, why don't we apply it here?

---

Attached 2 patches for further optimization of the _bt_preprocess_keys path (on top of your v15), according to the following idea:

Right now, we do "expensive" processing with xform merging for all keys when we have more than 1 keys in the scan. However, we only do per-attribute merging of these keys, so if there is only one key for any attribute, the many cycles spent in that loop are mostly wasted.
By checking for single-scankey attributes, we can short-track many multi-column index scans because they usually have only a single scan key per attribute. The first implements that idea, the second reduces the scope of various variables so as to improve compiler optimizability.

I'll try to work a bit more on merging the _bt_preprocess steps into a single main iterator, but this is about as far as I got with clean patches. Merging the steps for array preprocessing with per-key processing and post-processing is proving a bit more complex than I'd anticipated, so I don't think I'll be able to finish that before the feature freeze, especially with other things that keep distracting me.

Matthias van de Meent

From 16808e85f5fae7b16fd52d8a2be8437e4cff8640 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent
Date: Tue, 19 Mar 2024 16:39:29 +0100
Subject: [PATCH v1 1/2] nbtree: Optimize preprocessing for single ScanKey per
 column

Before, there was only a single fast path: single-ScanKey index scans.
With this optimization, we can fast-track processing of attributes with
only a single scankey, significantly reducing overhead for common scans
like "a = 10 and b < 8".
---
 src/backend/access/nbtree/nbtutils.c | 529 ++-
 1 file changed, 278 insertions(+), 251 deletions(-)

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b401b31191..8cd6270408 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2511,7 +2511,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	ScanKey		inkeys;
 	ScanKey		outkeys;
 	ScanKey		cur;
-	BTScanKeyPreproc xform[BTMaxStrategyNumber];
 	bool		test_result;
 	int			i,
 				j;
@@ -2619,327 +2618,355 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	 * xform[i] points to the currently best scan key of strategy type i+1; it
 	 * is NULL if we haven't yet found such a key for this attr.
 	 */
-	attno = 1;
-	memset(xform, 0, sizeof(xform));
-
-	/*
-	 * Loop iterates from 0 to numberOfKeys inclusive; we use the last pass to
-	 * handle after-last-key processing.  Actual exit from the loop is at the
-	 * "break" statement below.
-	 */
-	for (i = 0;; cur++, i++)
+	for (i = 0; i < numberOfKeys;)
 	{
-		if (i < numberOfKeys)
+		BTScanKeyPreproc xform[BTMaxStrategyNumber];
+		int			priorNumberOfEqualCols = numberOfEqualCols;
+		bool		fast = true;
+
+		/* initialize for this attno */
+		attno = cur->sk_attno;
+
+		/* We'll try to merge keys if there are multiple for this attribute */
+		if (i + 1 < numberOfKeys && (cur + 1)->sk_attno == attno)
+
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan wrote:
> On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent wrote:
> > I've attached v14, where 0001 is v13, 0002 is a patch with small
> > changes + some large comment suggestions, and 0003 which contains
> > sorted merge join code for _bt_merge_arrays.
>
> I have accepted your changes from 0003. Agree that it's better that
> way. It's at least a little faster, but not meaningfully more
> complicated.

Thanks.

> > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I
> > can make of it.
>
> That's been the big focus of this new v15, which now goes all out with
> teaching _bt_preprocess_keys with how to deal with arrays. We now do
> comprehensive normalization of even very complicated combinations of
> arrays and non-array scan keys in this version.

I was thinking about a more unified processing model, where _bt_preprocess_keys would iterate over all keys, including processing of array keys (by merging and reduction to normal keys) if and when found. This would also reduce the effort expended when there are contradictory scan keys, as array preprocessing is relatively more expensive than other scankeys and contradictions are then found before processing of later keys. As I wasn't very far into the work yet it seems I can reuse a lot of your work here.

> For example, consider this query: [...]
>
> This has a total of 6 input scankeys -- 3 of which are arrays. But by
> the time v15's _bt_preprocess_keys is done with it, it'll have only 1
> scan key -- which doesn't even have an array (not anymore). And so we
> won't even need to advance the array keys one single time -- there'll
> simply be no array left to advance. In other words, it'll be just as
> if the query was written this way from the start:
>
> select *
> from multi_test
> where
>   a = 3::int;
>
> (Though of course the original query will spend more cycles on
> preprocessing, compared to this manually simplified variant.)
That's a good improvement, much closer to an optimal access pattern.

> It turned out to not be terribly difficult to teach
> _bt_preprocess_keys everything it could possibly need to know about
> arrays, so that it can operate on them directly, as a variant of the
> standard equality strategy (we do still need _bt_preprocess_array_keys
> for basic preprocessing of arrays, mostly just merging them). This is
> better overall (in that it gets every little subtlety right), but it
> also simplified a number of related issues. For example, there is no
> longer any need to maintain a mapping between so->keyData[]-wise scan
> keys (output scan keys), and scan->keyData[]-wise scan keys (input
> scan keys). We can just add a step to fix-up the references to the end
> of _bt_preprocess_keys, to make life easier within
> _bt_advance_array_keys.
>
> This preprocessing work should all be happening during planning, not
> during query execution -- that's the design that makes the most sense.
> This is something we've discussed in the past in the context of skip
> scan (see my original email to this thread for the reference).

Yes, but IIRC we also agreed that it's impossible to do this fully in planning, amongst others due to joins on array fields.

> It would be especially useful for the very fancy kinds of preprocessing
> that are described by the MDAM paper, like using an index scan for a
> NOT IN() list/array (this can actually make sense with a low
> cardinality index).

Yes, indexes such as those on enums. Though, in those cases the NOT IN () could be transformed into IN()-lists by the planner, but not the index.

> The structure for preprocessing that I'm working towards (especially
> in v15) sets the groundwork for making those shifts in the planner,
> because we'll no longer treat each array constant as its own primitive
> index scan during preprocessing.

I hope that's going to be a fully separate patch. I don't think I can handle much more complexity in this one.
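The collapse from six input scan keys down to a single `a = 3` relies on reduction steps like the following: once a plain scalar equality meets an equality array on the same column, the array is either proven redundant or the whole qual is proven contradictory. A Python sketch of that single step (illustrative only, not the C implementation in _bt_preprocess_keys):

```python
def combine_scalar_with_array(scalar_value, array_elements):
    """Reduce "col = scalar AND col = ANY (array)" to one condition (sketch).

    Returns the surviving scalar condition, or None when the qual is
    provably unsatisfiable (the scan can end without reading the index).
    """
    if scalar_value in array_elements:
        return ('=', scalar_value)  # the array is redundant; drop it
    return None  # contradiction: the scalar value is outside the array
```

Repeating this step (together with array-vs-array merging) over all keys on a column is what lets a qual like `a = 3 AND a IN (2, 3, 4)` end up as plain `a = 3`, with no array left to advance.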
> Right now, on HEAD, preprocessing
> with arrays kinda treats each array constant like the parameter of an
> imaginary inner index scan, from an imaginary nested loop join. But
> the planner really needs to operate on the whole qual, all at once,
> including any arrays. An actual nestloop join's inner index scan
> naturally cannot do that, and so might still require runtime/dynamic
> preprocessing in a world where that mostly happens in the planner --
> but that's clearly not appropriate for arrays ("WHERE a = 5" and "WHERE
> a in(4, 5)" are almost the same thing, and so should be handled in
> almost the same way by preprocessing).

Yeah, if the planner could handle some of this that'd be great. At the same time, I think that this might need to be gated behind a guc for more expensive planner-time deductions.

> > + * We generally permit primitive index scans to continue onto the next
> > + * sibling page when the page's finaltup satisfies all required scan keys
> > + *
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:51 PM Matthias van de Meent wrote:
> To clarify, what I mean here is that merging the changes of both the
> SAOPs changes and the removal of arrayKeyData seems to increase the
> patch size and increases the maximum complexity of each component
> patch's review.

Removing arrayKeyData probably makes the patch very slightly smaller, actually. But even if it's really the other way around, I'd still like to get rid of it as part of the same commit as everything else. It just makes sense that way.

> Separate patches may make this more reviewable, or not, but no comment
> was given on why it is better to merge the changes into a single patch.

Fair enough. Here's why:

The original SAOP design (commit 9e8da0f7, "Teach btree to handle ScalarArrayOpExpr quals natively") added a layer of indirection between scan->keyData (input scan keys) and so->keyData (output scan keys): it added another scan key array, so->arrayKeyData. There was array-specific preprocessing in _bt_preprocess_array_keys, that happened before the first primitive index scan even began -- that transformed our true input scan keys (scan->keyData) into a copy of the array with limited amounts of array-specific preprocessing already performed (so->arrayKeyData).

This made a certain amount of sense at the time, because _bt_preprocess_keys was intended to be called once per primitive index scan. Kind of like the inner side of a nested loop join's inner index scan, where we call _bt_preprocess_keys once per inner-side scan/btrescan call. (Actually, Tom's design has us call _bt_preprocess once per primitive index scan per btrescan call, which might matter in those rare cases where the inner side of a nestloop join had SAOP quals.)

What I now propose to do is to just call _bt_preprocess_keys once per btrescan (actually, it's still called once per primitive index scan, but all calls after the first are just no-ops after v12 of the patch).
This makes sense because SAOP array constants aren't like nestloop joins with inner index scans, in one important way: we really can see everything up-front. We can see all of the array elements, and operate on whole arrays as necessary during preprocessing (e.g., performing the array merging thing I added to _bt_preprocess_array_keys). It's not like the next array element is only visible to preprocessing after the outer side of a nestloop join runs, and next calls btrescan -- so why treat it like that? Conceptually, "WHERE a = 1" is almost the same thing as "WHERE a = any('{1,2}')", so why not use essentially the same approach to preprocessing in both cases? We don't need to copy the input keys into so->arrayKeyData, because the indirection (which is a bit like a fake nested loop join) doesn't buy us anything. v13 of the patch doesn't quite 100% eliminate so->arrayKeyData. While it removes the arrayKeyData field from the BTScanOpaqueData struct, we still use a temporary array (accessed via a pointer that's just a local variable) that's also called arrayKeyData. And that also stores array-preprocessing-only copies of the input scan keys. That happens within _bt_preprocess_keys. So we do "still need arrayKeyData", but we only need it for a brief period at the start of the index scan. It just doesn't make any sense to keep it around for longer than that, in a world where _bt_preprocess_keys "operates directly on arrays". That only made sense (a bit of sense) back when _bt_preprocess_keys was subordinate to _bt_preprocess_array_keys, but it's kinda the other way around now. We could probably even get rid of this remaining limited form of arrayKeyData, but that doesn't seem like it would add much. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:46 PM Matthias van de Meent wrote: > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > I think that there is supposed to be a closing parenthesis here? So > > "... (such as those described in <xref linkend="functions-comparisons"/>) might > > perform...". > > Correct. > > > FWM, if that's what you meant. > > WFM, yes? Then we're in agreement on this. > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. I agree that the patch is quite complex, especially relative to its size. > > Note that we no longer need to have an independent representation of > > so->qual_okay for array keys (the convention of setting > > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > > required). There is no longer any need for a separate pass to carry > > over the contents of BTScanOpaqueData.arrayKeyData to > > BTScanOpaqueData.keyData, which was confusing. > > I wasn't very confused by it, but sure. > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I didn't just mean that stuff. I was also suggesting that you could join the project directly (not just as a reviewer). If you're interested, you could do general work on the preprocessing of arrays. Fancier array-specific preprocessing. 
For example, something that can transform this: select * from foo where a in (1,2,3) and a < 3; Into this: select * from foo where a in (1,2); Or, something that can just set qual_okay=false, given an unsatisfiable query such as: select * from foo where a in (1,2,3) and a > 5; This is clearly doable by reusing the binary search code during preprocessing. The first example transformation could work via a binary search for the constant "3", followed by a memmove() to shrink the array in-place (plus the inequality itself would need to be fully eliminated). The second example would work a little like the array merging thing that the patch has already, except that there'd only be one array involved (there wouldn't be a pair of arrays). > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. Even if this case doesn't matter (which I now doubt), it's probably easier to just get it right than it would be to figure out if we can live without it. > > void > > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, > > ScanDirection dir) > > BTScanOpaque so = (BTScanOpaque) scan->opaque; > > int i; > > > > +Assert(so->numArrayKeys); > > +Assert(so->qual_ok); > > Has the requirement for a known scan direction been removed, or should > this still have an Assert(dir != NoMovementScanDirection)? I agree that such an assertion is worth having. Added that locally. > I'm not sure that comment is correct, at least it isn't as clear as > can be. Maybe something more in the line of the following? > +pstate.prechecked = false; /* prechecked didn't cover HIKEY */ I agree that that's a little better than what I had. 
> +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > +elemtype = cur->sk_subtype; > > +if (elemtype == InvalidOid) > > +elemtype = rel->rd_opcintype[cur->sk_attno - 1]; > > Should we Assert() that this elemtype matches the array element type > ARR_ELEMTYPE(arrayval) used to deconstruct the array? Yeah, good idea. > This is not "a" previous equality key, but "the" previous equality > operator array scan key. > Do we want to expend some additional cycles for detecting duplicate > equality array key types in interleaved order like =int[] =bigint[] > =int[]? I don't think it would be very expensive considering the > limited number of cross-type equality operators registered in default > PostgreSQL, so a simple loop that checks matching element types > starting at the first array key scankey for this attribute should > suffice. We could even sort the keys by element type if we wanted to > fix any O(n) issues for this type of behaviour (though this is > _extremely_ unlikely to become a performance issue). Yes, I think that we should probably have this (though likely wouldn't bother sorting the scan keys themselves). You'd need to look-up cross-type operators for
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello, I am not up to date with the latest version of the patch, but I did a regular benchmark with version 11 on a typical issue we have at the moment, and the results are still very, very good. [1] In terms of performance improvement, the latest proposals could be a real game changer for two of our biggest databases. We hope that Postgres 17 will contain those improvements. Kind regards, Benoit [1] - https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4972955#gistcomment-4972955 __ Benoit Tigeot On Thu, Mar 7, 2024 at 3:36 PM Peter Geoghegan wrote: > On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan wrote: > > I could include something less verbose, mentioning a theoretical risk > > to out-of-core amcanorder routines that coevolved with nbtree, > > inherited the same SAOP limitations, and then never got the same set > > of fixes. > > Attached is v11, which now says something like that in the commit > message. Other changes: > > * Fixed buggy sorting of arrays using cross-type ORDER procs, by > recognizing that we need to consistently use same-type ORDER procs for > sorting and merging the arrays during array preprocessing. > > Obviously, when we sort, we compare array elements to other array > elements (all of the same type). This is true independent of whether > the query itself happens to use a cross type operator/ORDER proc, so > we will need to do two separate ORDER proc lookups in cross-type > scenarios. > > * No longer subscript the ORDER proc used for array binary searches > using a scankey subscript. Now there is an additional indirection that > works even in the presence of multiple redundant scan keys that cannot > be detected as such due to a lack of appropriate cross-type support > within an opfamily. > > This was subtly buggy before now. Requires a little more coordination > between array preprocessing and standard/primitive index scan > preprocessing, which isn't ideal but seems unavoidable. 
> > * Lots of code polishing, especially within _bt_advance_array_keys(). > > While _bt_advance_array_keys() still works in pretty much exactly the > same way as it did back in v10, there are now better comments. > Including something about why its recursive call to itself is > guaranteed to use a low, fixed amount of stack space, verified using > an assertion. That addresses a concern held by Matthias. > > Outlook > === > > This patch is approaching being committable now. Current plan is to > commit this within the next few weeks. > > All that really remains now is to research how we might integrate this > work with the recently added continuescanPrechecked/haveFirstMatch > stuff from Alexander Korotkov, if at all. I've put that off until now > because it isn't particularly fundamental to what I'm doing here, and > seems optional. > > I would also like to do more performance validation. Things like the > parallel index scan code could stand to be revisited once again. Plus > I should think about the overhead of array preprocessing when > btrescan() is called many times, from a nested loop join -- I should > have something to say about that concern (raised by Heikki at one > point) before too long. > > -- > Peter Geoghegan >
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent wrote: > > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. To clarify, what I mean here is that merging the changes of both the SAOPs changes and the removal of arrayKeyData seems to increase the patch size and increases the maximum complexity of each component patch's review. Separate patches may make this more reviewable, or not, but no comment was given on why it is better to merge the changes into a single patch. - Matthias
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent > wrote: > > > +that searches for multiple values together. Queries that use certain > > > +SQL constructs to search for rows matching any > > > value > > > +out of a list (or an array) of multiple scalar values might perform > > > +multiple primitive index scans (up to one primitive > > > scan > > > +per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > > > +for details. > > > > I don't think the "see for details" is > > correctly worded: The surrounding text implies that it would contain > > details about in which precise situations multiple primitive index > > scans would be consumed, while it only contains documentation about > > IN/NOT IN/ANY/ALL/SOME. > > > > Something like the following would fit better IMO: > > > > +that searches for multiple values together. Queries that use certain > > +SQL constructs to search for rows matching any value > > +out of a list or array of multiple scalar values (such as those > > described in > > +<xref linkend="functions-comparisons"/> might perform multiple primitive > > +index scans (up to one primitive scan per scalar value) at runtime. > > I think that there is supposed to be a closing parenthesis here? So > "... (such as those described in <xref linkend="functions-comparisons"/>) might > perform...". Correct. > FWM, if that's what you meant. WFM, yes? > > Then there is a second issue in the paragraph: Inverted indexes such > > as GIN might well decide to start counting more than one "primitive > > scan" per scalar value, [...] > I've described the issues in this area (in the docs) in a way that is > most consistent with historical conventions. That seems to have the > fewest problems, despite everything I've said about it. Clear enough, thank you for explaining your thoughts on this. 
> > > > All that really remains now is to research how we might integrate this > > > > work with the recently added continuescanPrechecked/haveFirstMatch > > > > stuff from Alexander Korotkov, if at all. > > > > > > The main change in v12 is that I've integrated both the > > > continuescanPrechecked and the haveFirstMatch optimizations. Both of > > > these fields are now page-level state, shared between the _bt_readpage > > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > > > appear next to the new home for _bt_checkkeys' continuescan field, in > > > the new page state struct). > > > > Cool. I'm planning to review the rest of this patch this > > week/tomorrow, could you take some time to review some of my btree > > patches, too? > > Okay, I'll take a look again. Thanks, greatly appreciated. > At one point Heikki suggested that I just get rid of > BTScanOpaqueData.arrayKeyData (which has been there for as long as > nbtree had native support for SAOPs), and use > BTScanOpaqueData.keyData exclusively instead. I've finally got around > to doing that now. I'm not sure if it was worth the reduced churn when the changes for the primary optimization are already 150+kB in size; every "small" addition increases the context needed to review the patch, and it's already quite complex. > These simplifications were enabled by my new approach within > _bt_preprocess_keys, described when I posted v12. v13 goes even > further than v12 did, by demoting _bt_preprocess_array_keys to a > helper function for _bt_preprocess_keys. That means that we do all of > our scan key preprocessing at the same time, at the start of _bt_first > (though only during the first _bt_first, or to be more precise during > the first _bt_first per btrescan). If we need fancier > preprocessing/normalization for arrays, then it ought to be a lot > easier with this structure. Agreed. 
> Note that we no longer need to have an independent representation of > so->qual_okay for array keys (the convention of setting > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > required). There is no longer any need for a separate pass to carry > over the contents of BTScanOpaqueData.arrayKeyData to > BTScanOpaqueData.keyData, which was confusing. I wasn't very confused by it, but sure. > Are you still interested in working directly on the preprocessing > stuff? If you mean my proposed change to merging two equality AOPs, then yes. I'll try to fit it in tomorrow with the rest of the review. > I have a feeling that I was slightly too optimistic about how > likely we were to be able to get away with not having certain kinds of > array preprocessing, back when I posted v12. It's true that the > propensity of the patch to not recognize "partial > redundancies/contradictions" hardly matters with redundant equalities, > but inequalities are another matter. I'm slightly worried about cases > like this one: > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > Maybe we need to do better with that. What do you think? Let me come back to that when I'm done reviewing the final
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan wrote: > > On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan wrote: > > Attached is v11, which now says something like that in the commit > > message. > > Attached is v12. Some initial comments on the documentation: > +that searches for multiple values together. Queries that use certain > +SQL constructs to search for rows matching any value > +out of a list (or an array) of multiple scalar values might perform > +multiple primitive index scans (up to one primitive scan > +per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > +for details. I don't think the "see for details" is correctly worded: The surrounding text implies that it would contain details about in which precise situations multiple primitive index scans would be consumed, while it only contains documentation about IN/NOT IN/ANY/ALL/SOME. Something like the following would fit better IMO: +that searches for multiple values together. Queries that use certain +SQL constructs to search for rows matching any value +out of a list or array of multiple scalar values (such as those described in +<xref linkend="functions-comparisons"/> might perform multiple primitive +index scans (up to one primitive scan per scalar value) at runtime. Then there is a second issue in the paragraph: Inverted indexes such as GIN might well decide to start counting more than one "primitive scan" per scalar value, because they may need to go through their internal structure more than once to produce results for a single scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a trigram index would likely use at least 4 descents here: one for each of "som", "ome", "wor", "ord". > > All that really remains now is to research how we might integrate this > > work with the recently added continuescanPrechecked/haveFirstMatch > > stuff from Alexander Korotkov, if at all. > > The main change in v12 is that I've integrated both the > continuescanPrechecked and the haveFirstMatch optimizations. 
Both of > these fields are now page-level state, shared between the _bt_readpage > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > appear next to the new home for _bt_checkkeys' continuescan field, in > the new page state struct). Cool. I'm planning to review the rest of this patch this week/tomorrow, could you take some time to review some of my btree patches, too? Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent wrote: > Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental > on top of your earlier set, and both don't allocate additional memory > in the merge operation in non-assertion builds. > > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. This patch fails some of my tests on non-assert builds only (assert builds pass all my tests, though). I'm using the first patch on its own here. While I tend to be relatively in favor of complicated assertions (I tend to think the risk of introducing side-effects is worth it), it looks like you're only performing certain steps in release builds. This is evident from just looking at the code (there is an #else block just for the release build in the loop). Note also that "Assert(_bt_compare_array_elements([merged_nelems++], orig, ) == 0)" has side-effects in assert-enabled builds only (it increments merged_nelems). While it's true that you *also* increment merged_nelems *outside* of the assertion (or in an #else block used during non-assert builds), that is conditioned on some other thing (so it's in no way equivalent to the debug #ifdef USE_ASSERT_CHECKING case). It's also just really hard to understand what's going on here. If I was going to do this kind of thing, I'd use two completely separate loops, that were obviously completely separate (maybe even two functions). I'd then memcmp() each array at the end. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent wrote: > On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan wrote: > And this is where I disagree with your (perceived) implicit argument > that this should be and always stay this way. I never said that, and didn't intend to imply it. As I outlined to you back in November, my general philosophy around APIs (such as the index AM API) is that ambiguity about the limits and extent of an abstract interface isn't necessarily a bad thing [1]. It can actually be a good thing! Ever hear of Hyrum's Law? Abstractions are very often quite leaky. APIs like the index AM API inevitably make trade-offs. Trade-offs are almost always political, in one way or another. Litigating every possible question up-front requires knowing ~everything before you really get started. This mostly ends up being a waste of time, since many theoretical contentious trade-offs just won't matter one bit in the long run, for one reason or another (not necessarily because there is only ever one consumer of an API, for all sorts of reasons). You don't have to agree with me. That's just what my experience indicates works best on average, and in the long run. I cannot justify it any further than that. > By keeping that expectation alive, > this becomes a self-fulfilling prophecy, and I really don't like such > restrictive self-fulfilling prophecies. It's nice that we have index > AM feature flags, but if we only effectively allow one implementation > for this set of flags by ignoring potential other users when changing > behavior, then what is the API good for (apart from abstraction > layering, which is nice)? I explicitly made it clear that I don't mean that. > I think that may be right, but I could very well have built a > btree-lite that doesn't have the historical baggage of having to deal > with pages from before v12 (version 4) and some improvements that > haven't made it to core yet. Let me know if you ever do that. 
Let me know what the problems you encounter are. I'm quite happy to revise my position on this in light of new information. I change my mind all the time! > Something like the following, maybe? > > """ > Compatibility: The planner will now generate paths with array scan > keys in any column for ordered indexes, rather than on only a prefix > of the index columns. The planner still expects fully ordered data > from those indexes. > Historically, the btree AM couldn't output correctly ordered data for > suffix array scan keys, which was "fixed" by prohibiting the planner > from generating array scan keys without an equality prefix scan key up > to that attribute. In this commit, the issue has been fixed, and the > planner restriction has thus been removed as the only in-core IndexAM > that reports amcanorder now supports array scan keys on any column > regardless of what prefix scan keys it has. > """ Even if I put something like this in the commit message, I doubt that Bruce would pick it up in anything like this form (I have my doubts about it happening no matter what wording is used, actually). I could include something less verbose, mentioning a theoretical risk to out-of-core amcanorder routines that coevolved with nbtree, inherited the same SAOP limitations, and then never got the same set of fixes. > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. I'll think about this some more. 
> patch-b is an improved version of patch-a with reduced time complexity > in some cases: It applies exponential search to reduce work done where > there are large runs of unmatched values in either array, and does > that while keeping O(n+m) worst-case complexity, now getting a > best-case O(log(n)) complexity. This patch seems massively over-engineered, though. Over 200 new lines of code, all for a case that is only used when queries are written with obviously-contradictory quals? It's just not worth it. > > The recursive call to _bt_advance_array_keys is needed to deal with > > unsatisfied-and-required inequalities that were not detected by an > > initial call to _bt_check_compare, following a second retry call to > > _bt_check_compare. We know that this recursive call to > > _bt_advance_array_keys cannot recurse again because we've > > already identified which specific inequality scan key it is that's a > > still-unsatisfied inequality, following an initial round of array key > > advancement. > > So, it's a case of this? > > scankey: a > 1 AND a = ANY (1 (=current), 2, 3) AND b < 4 AND b = ANY > (2 (=current), 3, 4) > tuple: a=2, b=4 I don't understand why your example starts with "scankey: a > 1" and uses redundant/contradictory scan keys (for both "a" and "b"). For a
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan wrote: > Thank you for your replies so far. > On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent > wrote: > > I would agree with you if this was about new APIs and features, but > > here existing APIs are being repurposed without changing them. A > > maintainer of index AMs would not look at the commit title 'Enhance > > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > > my existing amsearcharray+amcanorder handling actually supports > > non-prefix array keys and return data in index order". > > This is getting ridiculous. > > It is quite likely that there are exactly zero affected out-of-core > index AMs. I don't count pgroonga as a counterexample (I don't think > that it actually fulfills the definition of a ). Basically, > "amcanorder" index AMs more or less promise to be compatible with > nbtree, down to having the same strategy numbers. So the idea that I'm > going to upset amsearcharray+amcanorder index AM authors is a > completely theoretical problem. The planner code evolved with nbtree, > hand-in-glove. And this is where I disagree with your (perceived) implicit argument that this should be and always stay this way. I don't mind changes in the planner for nbtree per se, but as I've mentioned before in other places, I really don't like how we handle amcanorder as if it were amisbtree. But it's not called that, so we shouldn't expect that to remain the case; and definitely not keep building on those expectations that it always is going to be the nbtree when amcanorder is set (or amsearcharray is set, or ..., or any combination of those that is currently used by btree). By keeping that expectation alive, this becomes a self-fulfilling prophecy, and I really don't like such restrictive self-fulfilling prophecies. 
It's nice that we have index AM feature flags, but if we only effectively allow one implementation for this set of flags by ignoring potential other users when changing behavior, then what is the API good for (apart from abstraction layering, which is nice)? /rant I'll see about the > I'm more than happy to add a reminder to the commit message about > adding a proforma listing to the compatibility section of the Postgres > 17 release notes. Can we actually agree on which theoretical index AM > types are affected first, though? Isn't it actually > amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I > have that right? I think that may be right, but I could very well have built a btree-lite that doesn't have the historical baggage of having to deal with pages from before v12 (version 4) and some improvements that haven't made it to core yet. > BTW, how should we phrase this compatibility note, so that it can be > understood? It's rather awkward. Something like the following, maybe? """ Compatibility: The planner will now generate paths with array scan keys in any column for ordered indexes, rather than on only a prefix of the index columns. The planner still expects fully ordered data from those indexes. Historically, the btree AM couldn't output correctly ordered data for suffix array scan keys, which was "fixed" by prohibiting the planner from generating array scan keys without an equality prefix scan key up to that attribute. In this commit, the issue has been fixed, and the planner restriction has thus been removed as the only in-core IndexAM that reports amcanorder now supports array scan keys on any column regardless of what prefix scan keys it has. """ > > > As I said to Heikki, thinking about this some more is on my todo list. > > > I mean the way that this worked had substantial revisions on HEAD > > > right before I posted v9. v9 was only to fix the bit rot that that > > > caused. 
> > > > Then I'll be waiting for that TODO item to be resolved. > > My TODO item is to resolve the question of whether and to what extent > these two optimizations should be combined. It's possible that I'll > decide that it isn't worth it, for whatever reason. That's fine, this decision (and any work related to it) is exactly what I was referring to with that mention of the TODO. > > > Even single digit > > > thousand of array elements should be considered huge. Plus this code > > > path is only hit with poorly written queries. > > > > Would you accept suggestions? > > You mean that you want to have a go at it yourself? Sure, go ahead. > > I cannot promise that I'll accept your suggested revisions, but if > they aren't too invasive/complicated compared to what I have now, then > I will just accept them. I maintain that this isn't really necessary, > but it might be the path of least resistance at this point. Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental on top of your earlier set, and both don't allocate additional memory in the merge operation in non-assertion builds. patch-a is a trivial and clean implementation of mergesort, which tends to reduce the total number of
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent wrote: > > I'm not going to break out the planner changes, because they're *not* > > independent in any way. > > The planner changes depend on the btree changes, that I agree with. > However, I don't think that the btree changes depend on the planner > changes. Yeah, technically it would be possible to break out the indxpath.c changes -- that wouldn't leave the tree in a state with visibly-wrong behavior at any point. But that's far from the only thing that matters. If that was the standard that we applied, then I might have to split the patch into 10 or more patches. What it boils down to is this: it is totally natural for me to think of the planner changes as integral to the nbtree changes, so I'm going to include them together in one commit. That's just how the code was written -- I always thought about it as one single thing. The majority of the queries that I've used to promote the patch directly rely on the planner changes in one way or another (even back in v1, when the planner side of things had lots of kludges). I don't necessarily always follow that standard. Sometimes it is useful to further break things up. Like when it happens to make the high-level division of labor a little bit clearer. The example that comes to mind is commit dd299df8 and commit fab25024. These nbtree commits were essentially one piece of work that was broken into two, but committed within minutes of each other, and left the tree in a momentary state that was not-very-useful (though still correct). That made sense to me at the time because the code in question was mechanically distinct, while at the same time modifying some of the same nbtree files. Drawing a line under it seemed to make sense. I will admit that there is some amount of subjective judgement (gasp!) here. 
Plus I'll acknowledge that it's *slightly* odd that the most compelling cases for the SAOP patch almost all involve savings in heap page accesses, even though it is fundamentally a B-Tree patch. But that's just how it came out. Slightly odd things happen all the time. > I would agree with you if this was about new APIs and features, but > here existing APIs are being repurposed without changing them. A > maintainer of index AMs would not look at the commit title 'Enhance > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > my existing amsearcharray+amcanorder handling actually supports > non-prefix arrays keys and return data in index order". This is getting ridiculous. It is quite likely that there are exactly zero affected out-of-core index AMs. I don't count pgroonga as a counterexample (I don't think that it actually fulfills the definition). Basically, "amcanorder" index AMs more or less promise to be compatible with nbtree, down to having the same strategy numbers. So the idea that I'm going to upset amsearcharray+amcanorder index AM authors is a completely theoretical problem. The planner code evolved with nbtree, hand-in-glove. I'm more than happy to add a reminder to the commit message about adding a proforma listing to the compatibility section of the Postgres 17 release notes. Can we actually agree on which theoretical index AM types are affected first, though? Isn't it actually amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I have that right? BTW, how should we phrase this compatibility note, so that it can be understood? It's rather awkward. > > As I said to Heikki, thinking about this some more is on my todo list. > > I mean the way that this worked had substantial revisions on HEAD > > right before I posted v9. v9 was only to fix the bit rot that that > > caused. > Then I'll be waiting for that TODO item to be resolved.
My TODO item is to resolve the question of whether and to what extent these two optimizations should be combined. It's possible that I'll decide that it isn't worth it, for whatever reason. At this point I'm still totally neutral. > > Even single digit > > thousand of array elements should be considered huge. Plus this code > > path is only hit with poorly written queries. > > Would you accept suggestions? You mean that you want to have a go at it yourself? Sure, go ahead. I cannot promise that I'll accept your suggested revisions, but if they aren't too invasive/complicated compared to what I have now, then I will just accept them. I maintain that this isn't really necessary, but it might be the path of least resistance at this point. > > > Please fix this, as it shows up in profiling of large array merges. > > > Additionally, as it merges two arrays of unique items into one, > > > storing only matching entries, I feel that it is quite wasteful to do > > > this additional allocation here. Why not reuse the original allocation > > > immediately? > > > > Because that's adding more complexity for a code path that will hardly > > ever be used in practice. For something that happens once, during a > >
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan wrote: > > On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent > wrote: > > Can you pull these planner changes into their own commit(s)? > > As mentioned upthread, it's a significant change in behavior that > > should have separate consideration and reference in the commit log. I > > really don't think it should be buried in the 5th paragraph of an > > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > > changes of btree are arguably independent of the planner changes, as > > the btree changes improve performance even if we ignore that it > > implements strict result ordering. > > I'm not going to break out the planner changes, because they're *not* > independent in any way. The planner changes depend on the btree changes, that I agree with. However, I don't think that the btree changes depend on the planner changes. > You could say the same thing about practically > any work that changes the planner. They're "buried" in the 5th > paragraph of the commit message. If an interested party can't even > read that far to gain some understanding of a legitimately complicated > piece of work such as this, I'm okay with that. I would agree with you if this was about new APIs and features, but here existing APIs are being repurposed without changing them. A maintainer of index AMs would not look at the commit title 'Enhance nbtree ScalarArrayOp execution' and think "oh, now I have to make sure my existing amsearcharray+amcanorder handling actually supports non-prefix arrays keys and return data in index order". There are also no changes in amapi.h that would signal any index AM author that expectations have changed. I really don't think you can just ignore all that, and I believe this to also be the basis of Heikki's first comment. > As I said to Heikki, thinking about this some more is on my todo list. > I mean the way that this worked had substantial revisions on HEAD > right before I posted v9. 
v9 was only to fix the bit rot that that > caused. Then I'll be waiting for that TODO item to be resolved. > > > +++ b/src/backend/access/nbtree/nbtutils.c > > > +/* > > > + * _bt_merge_arrays() -- merge together duplicate array keys > > > + * > > > + * Both scan keys have array elements that have already been sorted and > > > + * deduplicated. > > > + */ > > > > As I mentioned upthread, I find this function to be very wasteful, as > > it uses N binary searches to merge join two already sorted arrays, > > resulting in a O(n log(m)) complexity, whereas a normal merge join > > should be O(n + m) once the input datasets are sorted. > > And as I mentioned upthread, I think that you're making a mountain out > of a molehill here. This is not a merge join. We're merging two sorted sets and want to retain only the matching entries, while retaining the order of the data. AFAIK the best algorithm available for this is a sort-merge join. Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue. > Even single digit > thousand of array elements should be considered huge. Plus this code > path is only hit with poorly written queries. Would you accept suggestions? > > Please fix this, as it shows up in profiling of large array merges. > > Additionally, as it merges two arrays of unique items into one, > > storing only matching entries, I feel that it is quite wasteful to do > > this additional allocation here. Why not reuse the original allocation > > immediately? > > Because that's adding more complexity for a code path that will hardly > ever be used in practice. For something that happens once, during a > preprocessing step. Or many times, when we're in a parameterized loop, as was also pointed out by Heikki. 
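To illustrate the point being made here (this is a non-authoritative sketch, not the actual nbtree code): a sort-merge intersection of two already-sorted, deduplicated arrays touches each element at most once, giving O(n + m), and the matches can be written back into the first array so that no additional allocation is needed. The function name and signature here are hypothetical.

```c
#include <stddef.h>

/*
 * Toy sketch: intersect two sorted, deduplicated int arrays in O(n + m),
 * writing the matching elements back into "a" (in order) so that the
 * original allocation is reused.  Returns the number of matches.
 */
static size_t
merge_intersect(int *a, size_t n, const int *b, size_t m)
{
    size_t      i = 0,
                j = 0,
                out = 0;

    while (i < n && j < m)
    {
        if (a[i] < b[j])
            i++;                    /* a[i] has no partner in b */
        else if (a[i] > b[j])
            j++;                    /* b[j] has no partner in a */
        else
        {
            a[out++] = a[i];        /* match: keep it; out <= i, so safe */
            i++;
            j++;
        }
    }
    return out;
}
```

Compare with N binary searches, which is O(n log m): for arrays of similar size the linear merge does strictly less comparison work once both inputs are sorted.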
While I do think it is rare, the existence of this path that merges these arrays implies the need for merging these arrays, which thus > > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > > > + IndexTuple tuple, int sktrig, bool > > > validtrig) > > > > I don't quite understand what the 'validtrig' argument is used for. > > There is an assertion that it is false under some conditions in this > > code, but it's not at all clear to me why that would have to be the > > case - it is called with `true` in one of the three call sites. Could > > the meaning of this be clarified? > > Sure, I'll add a comment. Thanks! > > I also feel that this patch includes several optimizations such as > > this sktrig argument which aren't easy to understand. Could you pull > > that into a separately reviewable patch? > > It probably makes sense to add the extra preprocessing stuff out into > its own commit, since I tend to agree that that's an optimization that > can be treated as unrelated (and isn't essential to the main thrust of > the patch). > > However, the sktrig thing isn't really like that. We need to do things > that way for required inequality scan keys. It doesn't make sense to
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent wrote: > Can you pull these planner changes into their own commit(s)? > As mentioned upthread, it's a significant change in behavior that > should have separate consideration and reference in the commit log. I > really don't think it should be buried in the 5th paragraph of an > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > changes of btree are arguably independent of the planner changes, as > the btree changes improve performance even if we ignore that it > implements strict result ordering. I'm not going to break out the planner changes, because they're *not* independent in any way. You could say the same thing about practically any work that changes the planner. They're "buried" in the 5th paragraph of the commit message. If an interested party can't even read that far to gain some understanding of a legitimately complicated piece of work such as this, I'm okay with that. > An independent thought when reviewing the finaltup / HIKEY scan > optimization parts of this patch: > The 'use highkey to check for next page's data range' optimization is > useful, but can currently only be used for scans that go to the right. > Could we use a similar optimization in the inverse direction if we > marked the right page of a split with "the left page has a HIKEY based > on this page's (un)truncated leftmost value" or "left page's HIKEY was > truncated to N key attributes"? It'd give one bit of information, > specifically that it does (or does not) share some (or all) key > attributes with this page's minimum value, which allows for earlier > scan termination in boundary cases. That would have to be maintained in all sorts of different places in nbtree. And could be broken at any time by somebody inserting a non-pivot tuple before every existing one on the leaf page. Doesn't seem worth it to me. If we were to do something like this then it would be discussed as independent work. 
It's akin to adding a low key, which could be used in several different places. As you say, it's totally independent. > > +++ b/src/include/access/nbtree.h > > +#define SK_BT_RDDNARRAY 0x0004 /* redundant in array preprocessing */ I think "made redundant" is more appropriate than just "redundant"; the array key is not itself redundant in array preprocessing (indeed we actually work hard on that key during preprocessing to allow us to mark it redundant) Meh. I did it that way to fit under 78 chars while staying on the same line. I don't think that it matters. > I think this was mentioned before, but can't we have an argument or > version of _bt_checkkeys that makes it not advance the array keys, so > that we can keep this optimization? Or, isn't that now > _bt_check_compare? As I said to Heikki, thinking about this some more is on my todo list. I mean the way that this worked had substantial revisions on HEAD right before I posted v9. v9 was only to fix the bit rot that that caused. > For an orderlines table with 1000s of lines per order, we would > greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that > handles scan keys more efficiently; the optimization which is being > disabled here. The way that these optimizations might work with the mechanism from the patch isn't some kind of natural extension to what's there already. We'll need new heuristics to not waste cycles. Applying all of the optimizations together just isn't trivial, and it's not yet clear what really makes sense. Combining the two optimizations more or less adds another dimension of complexity. > > +++ b/src/backend/access/nbtree/nbtutils.c > > _bt_preprocess_array_keys(IndexScanDesc scan) > The _bt_preprocess_array_keys code is now broken due to type > confusion, as it assumes there is only one array subtype being > compared per attribute in so.orderProcs.
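To make the type confusion just quoted concrete (a hypothetical stand-in, not PostgreSQL's actual comparator code): a smallint comparator only ever looks at 16 bits, so applying it to int values that are equal mod 2^16 makes them wrongly compare as equal.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the failure mode: an int16 ("smallint")
 * comparator applied to int32 ("int") values.  The comparator only sees
 * the low-order 16 bits, so values that are equal mod 2^16 -- such as
 * 1 and 65537 -- incorrectly compare as equal.
 */
static int
cmp_as_smallint(int32_t lhs, int32_t rhs)
{
    int16_t     a = (int16_t) lhs;  /* truncation: 65537 becomes 1 */
    int16_t     b = (int16_t) rhs;

    return (a > b) - (a < b);
}
```

This is why one compare proc per index attribute isn't enough once the same attribute has array keys with different sk_subtype values.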
I've been aware of this for some time, but didn't think that it was worth bringing up before I had a solution to present... > I'm also concerned about the implications of this in > _bt_binsrch_array_skey, as this too assumes the same compare operator > is always used for all array operations on each attribute. We may need > one so->orderProcs entry for each array key, but at least one per > sk_subtype of each array key. ...since right now it's convenient to make so->orderProcs have a 1:1 correspondence with index key columns > I also notice that the merging of values doesn't seem to be applied > optimally with mixed typed array operations: num = int[] AND num = > bigint[] AND num = int[] doesn't seem to merge the first and last > array ops. I'm also concerned about being (un)able to merge ...which ought to work and be robust, once the cross-type support is in place. That is, it should work once we really can assume that there really must be exactly one so->orderProcs entry per equality-strategy scan key in all cases -- including in highly obscure corner cases involving a mix of cross-type
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 28 Dec 2023 at 18:28, Peter Geoghegan wrote: > > On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan wrote: > > Attached is v8, which pretty much rips all of this stuff out. > > Attached is v9, which I'm posting just to fix bitrot. The patch > stopped cleanly applying against HEAD due to recent bugfix commit > 7e6fb5da. No real changes here compared to v8. I found 2 major issues; one correctness issue in the arraykey processing code of _bt_preprocess_array_keys, and one assertion error in _bt_advance_array_keys; both discussed in the relevant sections below. > Subject: [PATCH v9] Enhance nbtree ScalarArrayOp execution. > [...] > Bugfix commit 807a40c5 taught the planner to avoid generating unsafe > path keys: path keys on a multicolumn index path, with a SAOP clause on > any attribute beyond the first/most significant attribute. These cases > are now all safe, so we go back to generating path keys without regard > for the presence of SAOP clauses (just like with any other clause type). > Also undo changes from follow-up bugfix commit a4523c5a, which taught > the planner to produce alternative index paths without any low-order > ScalarArrayOpExpr quals (making the SAOP quals into filter quals). > We'll no longer generate these alternative paths, which can no longer > offer any advantage over the index qual paths that we do still generate. Can you pull these planner changes into their own commit(s)? As mentioned upthread, it's a significant change in behavior that should have separate consideration and reference in the commit log. I really don't think it should be buried in the 5th paragraph of an "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the changes of btree are arguably independent of the planner changes, as the btree changes improve performance even if we ignore that it implements strict result ordering. 
An independent thought when reviewing the finaltup / HIKEY scan optimization parts of this patch: The 'use highkey to check for next page's data range' optimization is useful, but can currently only be used for scans that go to the right. Could we use a similar optimization in the inverse direction if we marked the right page of a split with "the left page has a HIKEY based on this page's (un)truncated leftmost value" or "left page's HIKEY was truncated to N key attributes"? It'd give one bit of information, specifically that it does (or does not) share some (or all) key attributes with this page's minimum value, which allows for earlier scan termination in boundary cases. > +++ b/src/include/access/nbtree.h > +#define SK_BT_RDDNARRAY 0x0004 /* redundant in array preprocessing */ I think "made redundant" is more appropriate than just "redundant"; the array key is not itself redundant in array preprocessing (indeed we actually work hard on that key during preprocessing to allow us to mark it redundant) > +++ b/src/backend/access/nbtree/nbtsearch.c > * We skip this for the first page in the scan to evade the possible > - * slowdown of the point queries. > + * slowdown of point queries. Never apply the optimization with a scan > + * that uses array keys, either, since that breaks certain assumptions. > + * (Our search-type scan keys change whenever _bt_checkkeys advances the > + * arrays, invalidating any precheck. Tracking all that would be > tricky.) I think this was mentioned before, but can't we have an argument or version of _bt_checkkeys that makes it not advance the array keys, so that we can keep this optimization? Or, isn't that now _bt_check_compare? For an orderlines table with 1000s of lines per order, we would greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that handles scan keys more efficiently; the optimization which is being disabled here.
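For readers without the context: the precheck being discussed amounts to testing a page's last tuple first -- on a forward scan with a required key like "x <= bound" over a sorted page, if the last tuple passes, every earlier tuple must too, so the per-tuple key checks can be skipped for the whole page. A toy model of that idea, with hypothetical names (not the actual _bt_readpage code):

```c
#include <stddef.h>

/*
 * Toy model of the page-level precheck: "page" holds a leaf page's key
 * values in ascending order, and the scan has one required key of the
 * form "x <= bound".  If the page's last (largest) value passes, the
 * whole page matches and no per-tuple checks are needed.  "ncompares"
 * reports how many comparisons were performed.
 */
static int
count_matches(const int *page, size_t ntuples, int bound, size_t *ncompares)
{
    *ncompares = 1;                 /* the precheck itself */
    if (page[ntuples - 1] <= bound)
        return (int) ntuples;       /* whole page matches; skip per-tuple work */

    /* precheck failed: fall back to checking tuples one by one */
    int         matches = 0;

    for (size_t i = 0; i < ntuples; i++)
    {
        (*ncompares)++;
        if (page[i] <= bound)
            matches++;
    }
    return matches;
}
```

The tension described above is that array keys change mid-scan as _bt_checkkeys advances the arrays, which invalidates any such page-level precheck unless the two mechanisms are explicitly coordinated.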
> +++ b/src/backend/access/nbtree/nbtutils.c
> _bt_preprocess_array_keys(IndexScanDesc scan)
The _bt_preprocess_array_keys code is now broken due to type confusion, as it assumes there is only one array subtype being compared per attribute in so.orderProcs. Reproducer:

CREATE TABLE test AS SELECT generate_series(1, 1, 1::bigint) num;
CREATE INDEX ON test (num); /* bigint typed */
SELECT num FROM test
WHERE num = ANY ('{1}'::smallint[])
  AND num = ANY ('{1}'::int[]) /* ensure matching lastEqualityArrayAtt,
                                  lastOrderProc for next qual */
  AND num = ANY ('{65537}'::int[]); /* qual is broken due to application of
                                       smallint compare operator on int values
                                       that do equal mod 2^16, but do not
                                       equal in their own type */
 num
-----
   1

I'm also concerned about the implications of this in _bt_binsrch_array_skey, as this too assumes the same compare operator is always used for all array operations on each attribute. We may need one so->orderProcs entry for each array key, but at least one per sk_subtype of each array key. I also notice that the merging of values doesn't seem to be applied optimally with mixed typed array operations: num = int[] AND num = bigint[] AND num =
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Dec 4, 2023 at 7:25 PM Peter Geoghegan wrote: > 2. The optimization that has _bt_checkkeys skip non-required scan keys > that are *only* required in the direction *opposite* the current scan > direction -- this can work even without any precheck from > _bt_readpage. > > Note that this second optimization relies on various behaviors in > _bt_first() that make it impossible for _bt_checkkeys() to ever see a > tuple that could fail to satisfy such a scan key -- we must always > have passed over non-matching tuples, thanks to _bt_first(). That > presents my patch with a problem: the logic for determining whether or > not we need a new primitive index scan only promises to never require > the scan to grovel through many leaf pages that _bt_first() could and > should just skip over instead. This new logic makes no promises about > skipping over small numbers of tuples. So it's possible that > _bt_checkkeys() will see a handful of tuples "after the end of the > _bt_first-wise primitive index scan", but "before the _bt_first-wise > start of the next would-be primitive index scan". BTW, I have my doubts about this actually being correct without the patch. The following comment block appears above _bt_preprocess_keys: * Note that one reason we need direction-sensitive required-key flags is * precisely that we may not be able to eliminate redundant keys. Suppose * we have "x > 4::int AND x > 10::bigint", and we are unable to determine * which key is more restrictive for lack of a suitable cross-type operator. * _bt_first will arbitrarily pick one of the keys to do the initial * positioning with. If it picks x > 4, then the x > 10 condition will fail * until we reach index entries > 10; but we can't stop the scan just because * x > 10 is failing. On the other hand, if we are scanning backwards, then * failure of either key is indeed enough to stop the scan. 
(In general, when * inequality keys are present, the initial-positioning code only promises to * position before the first possible match, not exactly at the first match, * for a forward scan; or after the last match for a backward scan.) As I understand it, this might still be okay, because the optimization in question from Alexander's commit e0b1ee17 (what I've called optimization 2) is careful about NULLs, which were the one case that definitely had problems. Note that IS NOT NULL works kind of like WHERE foo < NULL here (see old bug fix commit 882368e8, "Fix btree stop-at-nulls logic properly", for more context on this NULLs behavior). In any case, my patch isn't compatible with "optimization 2" (as in my tests break in a rather obvious way) due to a behavior that these old comments claim is normal within any scan (or perhaps normal in any scan with scan keys that couldn't be deemed redundant due to a lack of cross-type support in the opfamily). Something has to be wrong here -- could just be the comment, I suppose. But I find it easy to believe that Alexander's commit e0b1ee17 might not have been properly tested with opfamilies that lack a suitable cross-type operator. That's a pretty niche thing. (My patch doesn't need that niche thing to be present to easily break when combined with "optimization 2", which could hint at an existing and far more subtle problem.) -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Nov 27, 2023 at 5:39 AM Heikki Linnakangas wrote: > - +1 on the general idea. Hard to see any downsides if implemented right. Glad you think so. The "no possible downside" perspective is one that the planner sort of relies on, so this isn't just a nice-to-have -- it might actually be a condition of committing the patch. It's important that the planner can be very aggressive about using SAOP index quals, without us suffering any real downside at execution time. > - This changes the meaning of amsearcharray==true to mean that the > ordering is preserved with ScalarArrayOps, right? You change B-tree to > make that true, but what about any out-of-tree index AM extensions? I > don't know if any such extensions exist, and I don't think we should > jump through any hoops to preserve backwards compatibility here, but > probably deserves a notice in the release notes if nothing else. My interpretation is that the planner changes affect amcanorder + amsearcharray index AMs, but have no impact on mere amsearcharray index AMs. If anything this is a step *away* from knowing about nbtree implementation details in the planner (though the planner's definition of amcanorder is very close to the behavior from nbtree, down to things like knowing all about nbtree strategy numbers). The planner changes from the patch are all subtractive -- I'm removing kludges that were added by bug fix commits. Things that weren't in the original feature commit at all. I used the term "my interpretation" here because it seems hard to think of this in abstract terms, and to write a compatibility note for this imaginary audience. I'm happy to go along with whatever you want, though. Perhaps you can suggest a wording for this? > - You use the term "primitive index scan" a lot, but it's not clear to > me what it means. Does one ScalarArrayOps turn into one "primitive index > scan"? Or each element in the array turns into a separate primitive > index scan? Or something in between? 
Maybe add a new section to the > README to explain how that works. The term primitive index scan refers to the thing that happens each time _bt_first() is called -- with and without the patch. In other words, it's what happens when pg_stat_all_indexes.idx_scan is incremented. You could argue that that's not quite the right thing to be focussing on, with this new design. But it has precedent going for it. As I said, it's the thing that pg_stat_all_indexes.idx_scan counts, which is a pretty exact and tangible thing. So it's consistent with historical practice, but also with what other index AMs do when executing ScalarArrayOps non-natively. > - _bt_preprocess_array_keys() is called for each btrescan(). It performs > a lot of work like cmp function lookups and deconstructing and merging > the arrays, even if none of the SAOP keys change in the rescan. That > could make queries with nested loop joins like this slower than before: > "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 > and tenk1.two IN (1,2);". But that's nothing new. _bt_preprocess_array_keys() isn't a new function, and the way that it's called isn't new in any way. That said, I certainly agree that we should be worried about any added overhead in _bt_first for nested loop joins with an inner index scan. In my experience that can be an important issue. I actually have a TODO item about this already. It needs to be included in my work on performance validation, on general principle. > - nbtutils.c is pretty large now. Perhaps create a new file > nbtpreprocesskeys.c or something? Let me get back to you on this. > - You and Matthias talked about an implicit state machine. I wonder if > this could be refactored to have a more explicit state machine. The > state transitions and interactions between _bt_checkkeys(), > _bt_advance_array_keys() and friends feel complicated. I agree that it's complicated. That's the main problem that the patch has, by far. 
It used to be even more complicated, but it's hard to see a way to make it a lot simpler at this point. If you can think of a way to simplify it then I'll definitely give it a go. Can you elaborate on "more explicit state machine"? That seems like it could have value by adding more invariants, and making things a bit more explicit in one or two areas. It could also help us to verify that they hold from assertions. But that isn't really the same thing as simplification. I wouldn't use that word, at least. > > + > > + > > +Every time an index is searched, the index's > > + > > pg_stat_all_indexes.idx_scan > > +field is incremented. This usually happens once per index scan node > > +execution, but might take place several times during execution of a > > scan > > +that searches for multiple values together. Only queries that use > > certain > > +SQL constructs to search for rows matching any value > > +out of a list (or an array) of multiple scalar values are affected. > > See > > +
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 28, 2023 at 3:52 PM Peter Geoghegan wrote: > While I still think that we need heuristics that apply speculative > criteria to decide whether or not to go to the next page directly > (when we have a choice at all), that doesn't mean that the v7 > heuristics can't be improved on, with a little more thought. It's a > bit tricky, since we're probably also benefiting from the same > heuristics all the time -- probably even for this same test case. Correction: this particular test case happens to be one where the optimal strategy is to do *exactly* what the master branch does currently. The master branch is unbeatable, so the only reasonable goal for the patch is to not lose (or to lose by no more than a completely negligible amount). I'm now prepared to say that this behavior is not okay -- I definitely need to fix this. It's a bug. Because each distinct value never fits on one leaf page (it's more like 1.5 - 2 pages, even though we're deduplicating heavily), and because Postgres 12 optimizations are so effective with low cardinality/posting-list-heavy indexes such as this, we're bound to lose quite often. The only reason it doesn't happen _every single time_ we descend the index is because the test script uses CREATE INDEX, rather than using retail inserts (I tend to prefer the latter for this sort of analysis). Since nbtsort.c isn't as clever/aggressive about suffix truncation as the nbtsplitloc.c split strategies would have been, had we used them (had there been retail inserts), many individual leaf pages are left with high keys that aren't particularly good targets for the high key precheck optimization (see Postgres 12 commit 29b64d1d). If I wanted to produce a truly adversarial case for this issue (which this is already close to), I'd go with the following: 1. 
Retail inserts that leave each leaf page full of one single value, which will allow each high key to still make a "clean break" from the right sibling page -- it'll have the right sibling's value. Maybe insert 1200 - 1300 tuples per distinct index value for this. In other words, bulk loading that results in an index that never has to append a heap TID tiebreaker during suffix truncation, but comes very close to needing to. Bulk loading where nbtsplitloc.c needs to use SPLIT_MANY_DUPLICATES all the time, but never quite gets to the point of needing a SPLIT_SINGLE_VALUE split. 2. A SAOP query with an array with every second value in the index as an element. Something like "WHERE arr in (2, 4, 6, 8, ...)". The patch will read every single leaf page, whereas master will *reliably* only read every second leaf page. I didn't need to "trick" the patch in a contrived sort of way to get this bad outcome -- this scenario is fairly realistic. So this behavior is definitely not something that I'm prepared to defend. As I said, it's a bug. It'll be fixed in the next revision. -- Peter Geoghegan
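The arithmetic behind "every single leaf page vs. every second leaf page" can be sketched with a toy counting model (hypothetical names; internal-page reads and descent costs are deliberately ignored):

```c
/*
 * Toy cost model for the adversarial case described above: npages leaf
 * pages, one distinct value per page, and a SAOP array containing every
 * second value (2, 4, 6, ..., npages).  Re-descending the index for each
 * array key (master's behavior) touches only the matching leaf pages;
 * a scan that instead keeps stepping right to reach the next key's page
 * also reads every non-matching page in between.
 */
static void
leaf_page_reads(int npages, int *redescend_reads, int *step_right_reads)
{
    int         nkeys = npages / 2; /* array keys 2, 4, ..., npages */

    *redescend_reads = nkeys;       /* one matching leaf page per key */
    *step_right_reads = npages - 1; /* pages 2 through npages, inclusive */
}
```

So for any sizeable index, stepping right here does roughly twice the leaf-page work of re-descending per key, which is the "reliably only read every second leaf page" behavior of master that the revised heuristics need to preserve.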
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 28, 2023 at 9:19 AM Peter Geoghegan wrote: > > I'm not convinced this is a problem we have to solve. It's possible it > > only affects cases that are implausible in practice (the script forces a > > particular scan type, and maybe it would not be picked in practice). But > > maybe it's fixable ... > > I would expect the patch to do quite well (relative to what is > actually possible) on cases like the two extremes that I've focussed > on so far. It seems possible that it will do less well on cases that > are somewhere in the middle (that also have lots of distinct values on > each page). Actually, I think that it's more likely that the problems that you saw are related to low cardinality data, which seems like it might not be a great fit for the heuristics that the patch uses to decide whether to continue the ongoing primitive index scan, or start a new one instead. I'm referring to the heuristics I describe here: https://postgr.es/m/CAH2-WzmTHoCsOmSgLg=yyft9loertuckxyg2gzn+28pzonf...@mail.gmail.com The patch itself discusses these heuristics in a large comment block after the point that _bt_checkkeys() calls _bt_advance_array_keys(). I hardly paid any attention to low cardinality data in my performance validation work -- it was almost always indexes that had few or no duplicates (just pgbench_accounts if we're talking pure stress-tests), just because those are more complicated, and so seemed more important. I'm not quite prepared to say that there is definitely a problem here, right this minute, but if there was then it wouldn't be terribly surprising (the problems are usually wherever it is that I didn't look for them). Attached is a sample of my debug instrumentation for one such query, based on running the test script that Tomas posted -- thanks for writing this script, Tomas (I'll use it as the basis for some of my own performance validation work going forward). 
I don't mind sharing the patch that outputs this stuff if anybody is interested (it's kind of a monstrosity, so I'm disinclined to post it with the patch until I have a reason). Even without this instrumentation, you can get some idea of the kinds of issues I'm talking about just by viewing EXPLAIN ANALYZE output for a bitmap index scan -- that breaks out the index page accesses separately, which is a number that we should expect to remain lower than what the master branch shows in approximately all cases. While I still think that we need heuristics that apply speculative criteria to decide whether or not to go to the next page directly (when we have a choice at all), that doesn't mean that the v7 heuristics can't be improved on, with a little more thought. It's a bit tricky, since we're probably also benefiting from the same heuristics all the time -- probably even for this same test case. We do lose against the master branch, on balance, and by enough to concern me, though. (I don't want to promise that it'll never happen at all, but it should be very limited, which this wasn't.) I didn't bother to ascertain how much longer it takes to execute the query, since that question seems rather beside the point. The important thing to me is whether or not this behavior actually makes sense, all things considered, and what exactly can be done about it if it doesn't make sense. I will need to think about this some more. This is just a status update. Thanks -- Peter Geoghegan index_walk_dump.txt.bz2 Description: BZip2 compressed data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 28, 2023 at 4:29 AM Tomas Vondra wrote: > I haven't looked at the code, but I decided to do a bit of blackbox perf > and stress testing, to get some feeling of what to expect in terms of > performance improvements, and see if there happen to be some unexpected > regressions. Attached is a couple simple bash scripts doing a > brute-force test with tables of different size / data distribution, > number of values in the SAOP expression, etc. My own stress-testing has focussed on the two obvious extremes for this patch, using variants of pgbench with SAOP SELECTs on pgbench_accounts: 1. The case where there is almost no chance of finding any two index tuples together on the same page, because the constants are completely random. This workload makes the patch's attempts at "coalescing together" index page reads pure overhead, with no possible benefit. Obviously that's a cost that needs to be kept under control. 2. The case where there are 255 tuples with distinct values that are clustered together (both in the key space and in physical index pages). Usually they'll span two index pages, but they might all fit together on one index page, allowing us to descend to it directly and read it only once. With 32 clients, I typically see a regression of about 1.5% for the first case relative to master, measured in throughput/TPS. The second case typically sees throughput that's ~4.8x master (i.e. a ~380% increase). I consider both of these extremes to be fairly unrealistic. With fewer array constants, the speed-ups you'll see in sympathetic cases are still very significant, but nothing like 4.8x. They're more like the 30% numbers that you saw. As you know, I'm not actually all that excited about cases like 2 -- it's not where users are likely to benefit from the patch.
The truly interesting cases are those cases where we can completely avoid heap accesses in the first place (not just *repeat* accesses to the same index pages), due to the patch's ability to consistently use index quals rather than filter quals. It's not that hard to show cases where there are 100x+ fewer pages accessed -- often with cases that have very few array constants. It's just that these cases aren't that interesting from a performance validation point of view -- it's obvious that filter quals are terrible. Another thing that the patch does particularly well on is cases where the array keys don't have any matches at all, but there is significant clustering (relatively common when multiple SAOPs are used as index quals, which becomes far more likely due to the planner changes). We don't just skip over parts of the index that aren't relevant -- we also skip over parts of the arrays that aren't relevant. Some of my adversarial test cases that take ~1 millisecond for the patch to execute will practically take forever on master (I had to have my test suite not run those tests against master). You just need lots of array keys. What master does on those adversarial cases with billions of distinct combinations of array keys (not that unlikely if there are 4 or 5 SAOPs with mere hundreds or thousands of array keys each) is so inefficient that we might as well call it infinitely slower than the patch. This is interesting to me from a performance robustness/stability point of view. The slowest kind of SAOP index scan with the patch becomes a full index scan -- just as it would be if we were using any type of non-SAOP qual before now. The worst case is a lot easier to reason about. > I'm not convinced this is a problem we have to solve. It's possible it > only affects cases that are implausible in practice (the script forces a > particular scan type, and maybe it would not be picked in practice). But > maybe it's fixable ...
I would expect the patch to do quite well (relative to what is actually possible) on cases like the two extremes that I've focussed on so far. It seems possible that it will do less well on cases that are somewhere in the middle (that also have lots of distinct values on each page). We effectively do a linear search on a page that we know has at least one more match (following a precheck that uses the high key). We hope that the next match (for the next array value) closely follows an initial match. But what if there are only 2 or 3 matches on each leaf page, that are spaced relatively far apart? You're going to have to grovel through the whole page. It's not obvious that that's a problem to be fixed -- we're still only descending the index once and still only locking the leaf page once, so we'll probably still win relative to master. And it's not that easy to imagine beating a linear search -- it's not like there is just one "next value" to search for in these cases. But it's something that deserves further consideration. -- Peter Geoghegan
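The "linear search" behavior described above can be sketched as a merge-style walk over the page's items and the array keys (a hypothetical illustration only, not the patch's actual _bt_checkkeys/_bt_advance_array_keys code; both inputs are assumed to be in ascending order, and the array keys duplicate-free):

```c
#include <assert.h>

/*
 * Hypothetical sketch: scan a leaf page's items against a set of array
 * keys, both in ascending order.  There is no single "next value" to
 * search for -- both sides advance in tandem, whichever is behind, so in
 * the worst case every item on the page is visited even when only a few
 * of them match.
 */
static int
count_page_matches(const int *items, int nitems,
                   const int *arraykeys, int nkeys)
{
    int         i = 0,
                k = 0,
                matches = 0;

    while (i < nitems && k < nkeys)
    {
        if (items[i] < arraykeys[k])
            i++;                /* tuple sorts before current array key */
        else if (items[i] > arraykeys[k])
            k++;                /* advance the array keys past this tuple */
        else
        {
            matches++;          /* tuple satisfies the current array key */
            i++;
        }
    }
    return matches;
}
```

With only 2 or 3 matches spaced far apart, this still walks through most of the page's items, which is exactly the cost being weighed above.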
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On 21/11/2023 04:52, Peter Geoghegan wrote: Attached is v7. First, some high-level reactions before looking at the patch very closely: - +1 on the general idea. Hard to see any downsides if implemented right. - This changes the meaning of amsearcharray==true to mean that the ordering is preserved with ScalarArrayOps, right? You change B-tree to make that true, but what about any out-of-tree index AM extensions? I don't know if any such extensions exist, and I don't think we should jump through any hoops to preserve backwards compatibility here, but probably deserves a notice in the release notes if nothing else. - You use the term "primitive index scan" a lot, but it's not clear to me what it means. Does one ScalarArrayOp turn into one "primitive index scan"? Or does each element in the array turn into a separate primitive index scan? Or something in between? Maybe add a new section to the README explaining how that works. - _bt_preprocess_array_keys() is called for each btrescan(). It performs a lot of work like cmp function lookups and deconstructing and merging the arrays, even if none of the SAOP keys change in the rescan. That could make queries with nested loop joins like this slower than before: "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 and tenk1.two IN (1,2);". - nbtutils.c is pretty large now. Perhaps create a new file nbtpreprocesskeys.c or something? - You and Matthias talked about an implicit state machine. I wonder if this could be refactored to have a more explicit state machine. The state transitions and interactions between _bt_checkkeys(), _bt_advance_array_keys() and friends feel complicated. And then some details: --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage + + +Every time an index is searched, the index's + pg_stat_all_indexes.idx_scan +field is incremented.
This usually happens once per index scan node +execution, but might take place several times during execution of a scan +that searches for multiple values together. Only queries that use certain +SQL constructs to search for rows matching any value +out of a list (or an array) of multiple scalar values are affected. See + for details. + + + Is this true even without this patch? Maybe commit this separately. The "Only queries ..." sentence feels difficult. Maybe something like "For example, queries using IN (...) or = ANY(...) constructs.". * _bt_preprocess_keys treats each primitive scan as an independent piece of * work. That structure pushes the responsibility for preprocessing that must * work "across array keys" onto us. This division of labor makes sense once * you consider that we're typically called no more than once per btrescan, * whereas _bt_preprocess_keys is always called once per primitive index scan. "That structure ..." is a garden-path sentence. I kept parsing "that must work" as one unit, the same way as "that structure", and it didn't make sense. Took me many re-reads to parse it correctly. Now that I get it, it doesn't bother me anymore, but maybe it could be rephrased. Is there _any_ situation where _bt_preprocess_array_keys() is called more than once per btrescan? /* * Look up the appropriate comparison operator in the opfamily. * * Note: it's possible that this would fail, if the opfamily is * incomplete, but it seems quite unlikely that an opfamily would omit * non-cross-type comparison operators for any datatype that it supports * at all. ... */ I agree that's unlikely. I cannot come up with an example where you would have cross-type operators between A and B, but no same-type operators between B and B. For any real-world opfamily, that would be an omission you'd probably want to fix. Still I wonder if we could easily fall back if it doesn't exist? And maybe add a check in the 'opr_sanity' test for that. 
In _bt_readpage(): /* * Prechecking the page with scan keys required for direction scan. We * check these keys with the last item on the page (according to our scan * direction). If these keys are matched, we can skip checking them with * every item on the page. Scan keys for our scan direction would * necessarily match the previous items. Scan keys required for opposite * direction scan are already matched by the _bt_first() call. * * With the forward scan, we do this check for the last item on the page * instead of the high key. It's relatively likely that the most * significant column in the high key will be different from the * corresponding value from the last item on the page. So checking with * the last item on the page would give
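The precheck quoted in that comment can be illustrated with a small sketch (hypothetical code, not the actual _bt_readpage): on a forward scan over an ascending page, if the last item satisfies a required key such as "attr <= bound", every earlier item must satisfy it too, so the per-item check for that key can be skipped:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the precheck idea: page items are in ascending
 * order, and the required qual is "item <= bound".  If the last item on
 * the page passes, every item on the page must pass, so the per-item key
 * check can be skipped entirely.
 */
static int
read_page_forward(const int *items, int nitems, int bound)
{
    bool        precheck_ok = (nitems > 0 && items[nitems - 1] <= bound);
    int         matched = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (precheck_ok || items[i] <= bound)
            matched++;
        else
            break;              /* required key failed; later items can't match */
    }
    return matched;
}
```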
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan wrote: > > On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent > wrote: > > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan wrote: > > > I should be able to post v6 later this week. My current plan is to > > > commit the other nbtree patch first (the backwards scan "boundary > > > cases" one from the ongoing CF) -- since I saw your review earlier > > > today. I think that you should probably wait for this v6 before > > > starting your review. > > > > Okay, thanks for the update, then I'll wait for v6 to be posted. > > On second thought, I'll just post v6 now (there won't be conflicts > against the master branch once the other patch is committed anyway). Thanks. Here's my review of the btree-related code: > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, > OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > -(void) _bt_checkkeys(scan, itup, indnatts, dir, > - , false); > +_bt_checkkeys(scan, , itup, false, false); > +requiredMatchedByPrecheck = pstate.continuescan; > +pstate.continuescan = true; /* reset */ The comment above the updated section needs to be updated. > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, > OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > -(void) _bt_checkkeys(scan, itup, indnatts, dir, > - , false); > +_bt_checkkeys(scan, , itup, false, false); This 'false' finaltup argument is surely wrong for the rightmost page's rightmost tuple, no? 
> +++ b/src/backend/access/nbtree/nbtutils.c > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > +/* We could pfree(elem_values) after, but not worth the cycles */ > +num_elems = _bt_merge_arrays(scan, cur, > + (indoption[cur->sk_attno - 1] & > INDOPTION_DESC) != 0, > + prev->elem_values, prev->num_elems, > + elem_values, num_elems); This code can get hit several times when there are multiple = ANY clauses, which may result in repeated leakage of these arrays during this scan. I think cleaning up may well be worth the cycles when the total size of the arrays is large enough. > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, >_bt_compare_array_elements, ); > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, > + Datum *elems_orig, int nelems_orig, > + Datum *elems_next, int nelems_next) > [...] > +/* > + * Incrementally copy the original array into a temp buffer, skipping > over > + * any items that are missing from the "next" array > + */ Given that we only keep the members that both arrays have in common, the result array will be a strict subset of the original array. So, I don't quite see why we need the temporary buffer here - we can reuse the entries of the elems_orig array that we've already compared against the elems_next array. We may want to optimize this further by iterating over only the smallest array: With the current code, [1, 2] + [1..1000] is faster to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much smaller than 1000*log(2). In practice this may matter very little, though. An even better optimized version would do a merge join on the two arrays, rather than loop + binary search. > @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void > *b, void *arg) > [...]
> +_bt_binsrch_array_skey(FmgrInfo *orderproc, Is there a reason for this complex initialization of high/low_elem, rather than this easier to understand and more compact initialization?: + low_elem = 0; + high_elem = array->num_elems - 1; + if (cur_elem_start) + { + if (ScanDirectionIsForward(dir)) + low_elem = array->cur_elem; + else + high_elem = array->cur_elem; + } > @@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan) > [...] > + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir) > [...] > +if (scan->parallel_scan != NULL) > +_bt_parallel_done(scan); > + > +/* > + * No more primitive index scans. Terminate the top-level scan. > + */ > +return false; I think the conditional _bt_parallel_done(scan) feels misplaced here, as the comment immediately below indicates the scan is to be terminated after that comment. So, either move this _bt_parallel_done call outside the function (which by name would imply it is read-only, without side effects like this) or move it below the comment "terminate the top-level scan". > +_bt_advance_array_keys(IndexScanDesc
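For what it's worth, the compact initialization suggested above drops naturally into a complete bounded binary search. Here is a hedged sketch of the forward-scan case only (hypothetical code, not the patch's actual _bt_binsrch_array_skey, and using plain ints instead of Datums and an order proc): find the smallest array element >= a target value, optionally starting from the array's current element:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: smallest element >= target in a sorted array, for
 * a forward scan.  When cur_elem_start is set, the search is bounded to
 * start at the array's current element, since earlier elements have
 * already been passed over.  Returns num_elems if every candidate is
 * < target (i.e. the array is exhausted in this scan direction).
 */
static int
binsrch_array(const int *elems, int num_elems, int cur_elem,
              bool cur_elem_start, int target)
{
    int         low_elem = 0;
    int         high_elem = num_elems - 1;

    if (cur_elem_start)
        low_elem = cur_elem;

    while (low_elem < high_elem)
    {
        int         mid_elem = low_elem + (high_elem - low_elem) / 2;

        if (elems[mid_elem] < target)
            low_elem = mid_elem + 1;
        else
            high_elem = mid_elem;
    }
    if (low_elem < num_elems && elems[low_elem] < target)
        low_elem++;             /* every candidate was < target */
    return low_elem;
}
```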
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan wrote: > On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan wrote: > > If you end up finding a bug in this v6, it'll most likely be a case > > where nbtree fails to live up to that. This project is as much about > > robust/predictable performance as anything else -- nbtree needs to be > > able to cope with practically anything. I suggest that your review > > start by trying to break the patch along these lines. > > I spent some time on this myself today (which I'd already planned on). > > Attached is an adversarial stress-test, which shows something that > must be approaching the worst case for the patch in terms of time > spent with a buffer lock held, due to spending so much time evaluating > unusually expensive SAOP index quals. The array binary searches that > take place with a buffer lock held aren't quite like anything else > that nbtree can do right now, so it's worthy of special attention. > > I thought of several factors that maximize both the number of binary > searches within any given _bt_readpage, as well as the cost of each > binary search -- the SQL file has full details. My test query is > *extremely* unrealistic, since it combines multiple independent > unrealistic factors, all of which aim to make life hard for the > implementation in one way or another. I hesitate to say that it > couldn't be much worse (I only spent a few hours on this), but I'm > prepared to say that it seems very unlikely that any real world query > could make the patch spend as many cycles in > _bt_readpage/_bt_checkkeys as this one does. > > Perhaps you can think of some other factor that would make this test > case even less sympathetic towards the patch, Matthias? The only thing > I thought of that I've left out was the use of a custom btree opclass, > "unrealistically_slow_ops". Something that calls pg_usleep in its > order proc. (I left it out because it wouldn't prove anything.) 
Have you tried using text index columns that are sorted with non-default locales? I've seen non-default locales use significantly more resources during compare operations than any other ordering operation I know of (which has mostly been in finding the locale), and use it extensively to test improvements for worst index shapes over at my btree patchsets because locales are dynamically loaded in text compare and nondefault locales are not cached at all. I suspect that this would be even worse if a somehow even worse locale path is available than what I'm using for test right now; this could be the case with complex custom ICU locales. > On my machine, custom instrumentation shows that each call to > _bt_readpage made while this query executes (on a patched server) > takes just under 1.4 milliseconds. While that is far longer than it > usually takes, it's basically acceptable IMV. It's not significantly > longer than I'd expect heap_index_delete_tuples() to take on an > average day with EBS (or other network-attached storage). But that's a > process that happens all the time, with an exclusive buffer lock held > on the leaf page throughout -- whereas this is only a shared buffer > lock, and involves a query that's just absurd . > > Another factor that makes this seem acceptable is just how sensitive > the test case is to everything going exactly and perfectly wrong, all > at the same time, again and again. The test case uses a 32 column > index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP > clauses (one per index column). If I reduce the number of SAOP clauses > in the query to (say) 8, I still have a test case that's almost as > silly as my original -- but now we only spend ~225 microseconds in > each _bt_readpage call (i.e. we spend over 6x less time in each > _bt_readpage call). (Admittedly if I also make the CREATE INDEX use > only 8 columns, we can fit more index tuples on one page, leaving us > at ~800 microseconds). 
A quick update of the table definition to use the various installed 'fr-%-x-icu' locales on text hash columns instead of numeric, with a different collation for each column, gets me to EXPLAIN (analyze) showing 2.07ms spent on every buffer hit inside the index scan node, as opposed to 1.76ms when using numeric. But, as you mention, the value of this metric is probably not very high. As for the patch itself, I'm probably about 50% through the patch now. While reviewing, I noticed the following two user-visible items, related to SAOP but not broken by or touched upon in this patch: 1. We don't seem to plan `column opr ALL (...)` as index conditions, while this should be trivial to optimize for at least btree. Example: SET enable_bitmapscan = OFF; WITH a AS (select generate_series(1, 1000) a) SELECT * FROM tenk1 WHERE thousand = ANY (array(table a)) AND thousand < ALL (array(table a)); This will never return any rows, but it does hit 9990 buffers in the new btree code, while I expected that to be 0 buffers based on the query and index (that is, I
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan wrote: > If you end up finding a bug in this v6, it'll most likely be a case > where nbtree fails to live up to that. This project is as much about > robust/predictable performance as anything else -- nbtree needs to be > able to cope with practically anything. I suggest that your review > start by trying to break the patch along these lines. I spent some time on this myself today (which I'd already planned on). Attached is an adversarial stress-test, which shows something that must be approaching the worst case for the patch in terms of time spent with a buffer lock held, due to spending so much time evaluating unusually expensive SAOP index quals. The array binary searches that take place with a buffer lock held aren't quite like anything else that nbtree can do right now, so it's worthy of special attention. I thought of several factors that maximize both the number of binary searches within any given _bt_readpage, as well as the cost of each binary search -- the SQL file has full details. My test query is *extremely* unrealistic, since it combines multiple independent unrealistic factors, all of which aim to make life hard for the implementation in one way or another. I hesitate to say that it couldn't be much worse (I only spent a few hours on this), but I'm prepared to say that it seems very unlikely that any real world query could make the patch spend as many cycles in _bt_readpage/_bt_checkkeys as this one does. Perhaps you can think of some other factor that would make this test case even less sympathetic towards the patch, Matthias? The only thing I thought of that I've left out was the use of a custom btree opclass, "unrealistically_slow_ops". Something that calls pg_usleep in its order proc. (I left it out because it wouldn't prove anything.) On my machine, custom instrumentation shows that each call to _bt_readpage made while this query executes (on a patched server) takes just under 1.4 milliseconds. 
While that is far longer than it usually takes, it's basically acceptable IMV. It's not significantly longer than I'd expect heap_index_delete_tuples() to take on an average day with EBS (or other network-attached storage). But that's a process that happens all the time, with an exclusive buffer lock held on the leaf page throughout -- whereas this is only a shared buffer lock, and involves a query that's just absurd. Another factor that makes this seem acceptable is just how sensitive the test case is to everything going exactly and perfectly wrong, all at the same time, again and again. The test case uses a 32 column index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP clauses (one per index column). If I reduce the number of SAOP clauses in the query to (say) 8, I still have a test case that's almost as silly as my original -- but now we only spend ~225 microseconds in each _bt_readpage call (i.e. we spend over 6x less time in each _bt_readpage call). (Admittedly if I also make the CREATE INDEX use only 8 columns, we can fit more index tuples on one page, leaving us at ~800 microseconds). I'm a little surprised that it isn't a lot worse than this, given how far I went. I was a little concerned that it would prove necessary to lock this kind of thing down at some higher level (e.g., in the planner), but that now looks unnecessary. There are much better ways to DOS the server than this. For example, you could run this same query while forcing a sequential scan! That appears to be quite a lot less responsive to interrupts (in addition to being hopelessly slow), probably because it uses parallel workers, each of which will use wildly expensive filter quals that just do a linear scan of the SAOP. -- Peter Geoghegan nemesis.sql Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan wrote: > > On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent > wrote: > > I'm planning on reviewing this patch tomorrow, but in an initial scan > > through the patch I noticed there's little information about how the > > array keys state machine works in this new design. Do you have a more > > toplevel description of the full state machine used in the new design? > > This is an excellent question. You're entirely right: there isn't > enough information about the design of the state machine. > > I should be able to post v6 later this week. My current plan is to > commit the other nbtree patch first (the backwards scan "boundary > cases" one from the ongoing CF) -- since I saw your review earlier > today. I think that you should probably wait for this v6 before > starting your review. Okay, thanks for the update, then I'll wait for v6 to be posted. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent wrote: > I'm planning on reviewing this patch tomorrow, but in an initial scan > through the patch I noticed there's little information about how the > array keys state machine works in this new design. Do you have a more > toplevel description of the full state machine used in the new design? This is an excellent question. You're entirely right: there isn't enough information about the design of the state machine. In v1 of the patch, from all the way back in July, the "state machine" advanced in the hackiest way possible: via repeated "incremental" advancement (using logic from the function that we call _bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that until the function I'm now calling _bt_tuple_before_array_skeys() eventually reported that the array keys were now sufficiently advanced. v2 greatly improved matters by totally overhauling _bt_advance_array_keys(): it was taught to use binary searches to advance the array keys, with limited remaining use of "incremental" array key advancement. However, version 2 (and all later versions to date) have somewhat wonky state machine transitions, in one important respect: calls to the new _bt_advance_array_keys() won't always advance the array keys to the maximum extent possible (possible while still getting correct behavior, that is). There were still various complicated scenarios involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD scan keys that use BTEqualStrategyNumber), where one single call to _bt_advance_array_keys() would advance the array keys to a point that was still < caller's tuple. AFAICT this didn't cause wrong answers to queries (that would require failing to find a set of exactly matching array keys where a matching set exists), but it was kludgey. It was sloppy in roughly the same way as the approach in my v1 prototype was sloppy (just to a lesser degree). I should be able to post v6 later this week. 
My current plan is to commit the other nbtree patch first (the backwards scan "boundary cases" one from the ongoing CF) -- since I saw your review earlier today. I think that you should probably wait for this v6 before starting your review. The upcoming version will have simple preconditions and postconditions for the function that advances the array key state machine (the new _bt_advance_array_keys). These are enforced by assertions at the start and end of the function. So the rules for the state machine become crystal clear and fairly easy to keep in your head (e.g., tuple must be >= required array keys on entry and <= required array keys on exit, the array keys must always either advance by one increment or be completely exhausted for the top-level scan in the current scan direction). Unsurprisingly, I found that adding and enforcing these invariants led to a simpler and more general design within _bt_advance_array_keys. That code is still the most complicated part of the patch, but it's much less of a bag of tricks. Another reason for you to hold off for a few more days. -- Peter Geoghegan
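The invariant contract described above might be sketched like this (a hypothetical simplification over a single integer array, not the patch's actual _bt_advance_array_keys; the assertions stand in for the pre/postconditions enforced at the function's start and end):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the invariant contract: advance the current
 * array key to the first element >= the caller's tuple value, or report
 * that the arrays are exhausted for the top-level scan.  Precondition:
 * the tuple is >= the current required array key on entry.
 * Postcondition: the tuple is <= the required array key on (successful)
 * exit.
 */
static bool
advance_array_keys(const int *elems, int nelems, int *cur_elem, int tupval)
{
    assert(tupval >= elems[*cur_elem]); /* precondition */

    while (*cur_elem < nelems - 1 && elems[*cur_elem] < tupval)
        (*cur_elem)++;

    if (elems[*cur_elem] < tupval)
        return false;           /* arrays exhausted in this direction */

    assert(tupval <= elems[*cur_elem]); /* postcondition */
    return true;
}
```

(The sketch advances incrementally for brevity; as discussed upthread, the real function uses binary searches for the same state transition.)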
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan wrote: > > On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan wrote: > > Attached is v4, which applies cleanly on top of HEAD. This was needed > > due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan > > keys required for directional scan in B-tree". > > > > Unfortunately I have more or less dealt with the conflicts on HEAD by > > disabling the optimization from that commit, for the time being. > > Attached is v5, which deals with the conflict with the optimization > added by Alexander Korotkov's commit e0b1ee17 sensibly: the > optimization is now only disabled in cases without array scan keys. > (It'd be very hard to make it work with array scan keys, since an > important principle for my patch is that we can change search-type > scan keys right in the middle of any _bt_readpage() call). I'm planning on reviewing this patch tomorrow, but in an initial scan through the patch I noticed there's little information about how the array keys state machine works in this new design. Do you have a more toplevel description of the full state machine used in the new design? If not, I'll probably be able to discover my own understanding of the mechanism used in the patch, but if there is a framework to build that understanding on (rather than having to build it from scratch) that'd be greatly appreciated. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan wrote: > Attached is v4, which applies cleanly on top of HEAD. This was needed > due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan > keys required for directional scan in B-tree". > > Unfortunately I have more or less dealt with the conflicts on HEAD by > disabling the optimization from that commit, for the time being. Attached is v5, which deals with the conflict with the optimization added by Alexander Korotkov's commit e0b1ee17 sensibly: the optimization is now only disabled in cases without array scan keys. (It'd be very hard to make it work with array scan keys, since an important principle for my patch is that we can change search-type scan keys right in the middle of any _bt_readpage() call). v5 also fixes a longstanding open item for the patch: we no longer call _bt_preprocess_keys() with a buffer lock held, which was a bad idea at best, and unsafe (due to the syscache lookups within _bt_preprocess_keys) at worst. A new, minimal version of the function (called _bt_preprocess_keys_leafbuf) is called at the same point instead. That change, combined with the array binary search stuff (which was added back in v2), makes the total amount of work performed with a buffer lock held totally reasonable in all cases. It's even okay in extreme or adversarial cases with many millions of array keys. Making this _bt_preprocess_keys_leafbuf approach work has a downside: it requires that _bt_preprocess_keys be a little less aggressive about removing redundant scan keys, in order to meet certain assumptions held by the new _bt_preprocess_keys_leafbuf function. Essentially, _bt_preprocess_keys must now worry about current and future array key values when determining redundancy among scan keys -- not just the current array key values.
_bt_preprocess_keys knows nothing about SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict 1:1 correspondence between the number of primitive index scans and the number of array keys (actually, the number of distinct combinations of array keys). Obviously that's no longer the case with the patch (that's the whole point of the patch). It's easiest to understand how elimination of redundant quals needs to work in v5 by way of an example. Consider the following query: select count(*), two, four, twenty, hundred from tenk1 where two in (0, 1) and four in (1, 2, 3) and two < 1; Notice that "two" appears in the where clause twice. First it appears as an SAOP, and then as an inequality. Right now, on HEAD, the primitive index scan where the SAOP's scankey is "two = 0" renders "two < 1" redundant. However, the subsequent primitive index scan where "two = 1" does *not* render "two < 1" redundant. This has implications for the mechanism in the patch, since the patch will perform one big primitive index scan for all array constants, with only a single _bt_preprocess_keys call at the start of its one and only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf calls once we reach the leaf level). The compromise that I've settled on in v5 is to teach _bt_preprocess_keys to *never* treat "two < 1" as redundant with such a query -- even though there is some squishy sense in which "two < 1" is indeed still redundant (for the first SAOP key of value 0). My approach is reasonably well targeted in that it mostly doesn't affect queries that don't need it. But it will add cycles to some badly written queries that wouldn't have had them in earlier Postgres versions. I'm not entirely sure how much this matters, but my current sense is that it doesn't matter all that much. This is the kind of thing that is hard to test and poorly tested, so simplicity is even more of a virtue than usual. 
Note that the changes to _bt_preprocess_keys in v5 *don't* affect how we determine if the scan has contradictory quals, which is generally more important. With contradictory quals, _bt_first can avoid reading any data from the index. OTOH eliminating redundant quals (i.e. the thing that v5 *does* change) merely makes evaluating index quals less expensive via preprocessing-away unneeded scan keys. In other words, while it's possible that the approach taken by v5 will add CPU cycles in a small number of cases, it should never result in more page accesses. -- Peter Geoghegan v5-0001-Enhance-nbtree-ScalarArrayOp-execution.patch Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan wrote: > On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan wrote: > > Attached is v2, which makes all array key advancement take place using > > the "next index tuple" approach (using binary searches to find array > > keys using index tuple values). > > Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. Attached is v4, which applies cleanly on top of HEAD. This was needed due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan keys required for directional scan in B-tree". Unfortunately I have more or less dealt with the conflicts on HEAD by disabling the optimization from that commit, for the time being. The commit in question is rather poorly documented, and it's not immediately clear how to integrate it with my work. I just want to make sure that there's a testable patch available. -- Peter Geoghegan v4-0001-Enhance-nbtree-ScalarArrayOp-execution.patch Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan wrote: > Attached is v2, which makes all array key advancement take place using > the "next index tuple" approach (using binary searches to find array > keys using index tuple values). Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. No notable changes here compared to v2. -- Peter Geoghegan From 2cff1dadb7903d49a2338b64b27178fa0bc51456 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 17 Jun 2023 17:03:36 -0700 Subject: [PATCH v3] Enhance nbtree ScalarArrayOp execution. Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing additional context about the arrays down into the nbtree index AM, as index quals. This information enabled nbtree to execute multiple primitive index scans as part of an index scan executor node that was treated as one continuous index scan. The motivation behind this earlier work was enabling index-only scans with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed via BitmapOr nodes, which is largely index-AM-agnostic, but always requires heap access). The general idea of giving the index AM this additional context can be pushed a lot further, though. Teach nbtree SAOP index scans to dynamically advance array scan keys using information about the characteristics of the index, determined at runtime. The array key state machine advances the current array keys using the next index tuple in line to be scanned, at the point where the scan reaches the end of the last set of array keys. This approach is far more flexible, and can be far more efficient. Cases that previously required hundreds (even thousands) of primitive index scans now require as few as one single primitive index scan. Also remove all restrictions on generating path keys for nbtree index scans that happen to have ScalarArrayOpExpr quals. 
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now safe. Now nbtree index scans with an inequality clause on a high order column and a SAOP clause on a lower order column are executed as one single primitive index scan, since that is the most efficient way to do it. Non-required equality type SAOP quals are executed by nbtree using almost the same approach used for required equality type SAOP quals. nbtree is now strictly guaranteed to avoid all repeat accesses to any individual leaf page, even in cases with inequalities on high order columns (except when the scan direction changes, or the scan restarts). We now have strong guarantees about the worst case, which is very useful when costing index scans with SAOP clauses. The cost profile of index paths with multiple SAOP clauses is now a lot closer to other cases; more selective index scans will now generally have lower costs than less selective index scans. The added cost from repeatedly descending the index still matters, but it can never dominate. An important goal of this work is to remove all ScalarArrayOpExpr clause special cases from the planner -- ScalarArrayOpExpr clauses can now be thought of a generalization of simple equality clauses (except when costing index scans, perhaps). The planner no longer needs to generate alternative index paths with filter quals/qpquals. We assume that true SAOP index quals are strictly better than filter/qpquals, since the work in nbtree guarantees that they'll be at least slightly faster. Many of the queries sped up by the work from this commit don't directly benefit from the nbtree/executor enhancements. They benefit indirectly. The planner no longer shows any restraint around making SAOP clauses into true nbtree index quals, which tends to result in significant savings on heap page accesses. 
In general we never need visibility checks to evaluate true index quals, whereas filter quals often need to perform extra heap accesses, just to eliminate non-matching tuples (expression evaluation is only safe with known visible tuples). Author: Peter Geoghegan Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com --- src/include/access/nbtree.h| 39 +- src/backend/access/nbtree/nbtree.c | 65 +- src/backend/access/nbtree/nbtsearch.c | 62 +- src/backend/access/nbtree/nbtutils.c | 1367 ++-- src/backend/optimizer/path/indxpath.c | 64 +- src/backend/utils/adt/selfuncs.c | 123 +- doc/src/sgml/monitoring.sgml | 13 + src/test/regress/expected/create_index.out | 61 +- src/test/regress/expected/join.out |5 +- src/test/regress/sql/create_index.sql | 20 +- 10 files changed, 1506 insertions(+), 313 deletions(-) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 6345e16d7..33db9b648 100644 ---
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan wrote: > > MDAM seems to require exponential storage for "scan key operations" > > for conditions on N columns (to be precise, the product of the number > > of distinct conditions on each column); e.g. an index on mytable > > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... > > AND h IN (1, 2)" would require 2^8 entries. > What you describe is a problem in theory, but I doubt that it's a > problem in practice. You don't actually have to materialize the > predicates up-front, or at all. Plus you can skip over them using the > next index tuple. So skipping works both ways. Attached is v2, which makes all array key advancement take place using the "next index tuple" approach (using binary searches to find array keys using index tuple values). This approach was necessary for fairly mundane reasons (it limits the amount of work required while holding a buffer lock), but it also solves quite a few other problems that I find far more interesting. It's easy to imagine the state machine from v2 of the patch being extended for skip scan. My approach "abstracts away" the arrays. For skip scan, it would more or less behave as if the user had written a query "WHERE a in () AND b = 5 ... " -- without actually knowing what the so-called array keys for the high-order skipped column are (not up front, at least). We'd only need to track the current "array key" for the scan key on the skipped column, "a". The state machine would notice when the scan had reached the next-greatest "a" value in the index (whatever that might be), and then make that the current value. Finally, the state machine would effectively instruct its caller to consider repositioning the scan via a new descent of the index. In other words, almost everything for skip scan would work just like regular SAOPs -- and any differences would be well encapsulated. But it's not just skip scan. 
This approach also enables thinking of SAOP index scans (using nbtree) as just another type of indexable clause, without any special restrictions (compared to true indexable operators such as "=", say). Particularly in the planner. That was always the general thrust of teaching nbtree about SAOPs, from the start. But it's something that should be totally embraced IMV. That's just what the patch proposes to do. In particular, the patch now: 1. Entirely removes the long-standing restriction on generating path keys for index paths with SAOPs, even when there are inequalities on a high order column present. You can mix SAOPs together with other clause types, arbitrarily, and everything still works and works efficiently. For example, the regression test expected output for this query/test (from bugfix commit 807a40c5) is updated by the patch, as shown here: explain (costs off) SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; - QUERY PLAN --- - Sort - Sort Key: thousand - -> Index Scan using tenk1_thous_tenthous on tenk1 - Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[]))) -(4 rows) + QUERY PLAN + + Index Scan using tenk1_thous_tenthous on tenk1 + Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[]))) +(2 rows) We don't need a sort node anymore -- even though the leading column here (thousand) uses an inequality, a particularly tricky case. Now it's an index scan, much like any other, with no particular restrictions caused by using a SAOP. 2. Adds an nbtree strategy for non-required equality array scan keys, which is built on the same state machine, with only minor differences to deal with column values "appearing out of key space order". 3. Simplifies the optimizer side of things by consistently avoiding filter quals (except when it's truly unavoidable). 
The optimizer doesn't even consider alternative index paths with filter quals for lower-order SAOP columns, because they have no possible advantage anymore. On the other hand, as we saw already, upthread, filter quals have huge disadvantages. By always using true index quals, we automatically avoid any question of getting excessive amounts of heap page accesses just to eliminate non-matching rows. AFAICT we don't need to make a trade-off here. The first version of the patch added some crufty code to the optimizer, to account for various restrictions on sort order. This revised version actually removes existing cruft from the same place (indxpath.c) instead. Items 1, 2, and 3 are all closely related. Take the query I've shown for item 1. Bugfix commit 807a40c5 (which added the test query in question) dealt with an oversight in the then-recent original nbtree SAOP patch (commit 9e8da0f7): when nbtree
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina wrote: > I noticed that you are going to add CNF->DNF transformation at the index > construction stage. If I understand correctly, you will rewrite > restrictinfo node, > change boolean "AND" expressions to "OR" expressions, but would it be > possible to apply such a procedure earlier? Sort of. I haven't really added any new CNF->DNF transformations. The code you're talking about is really just checking that every index path has clauses that we know that nbtree can handle. That's a big, ugly modularity violation -- many of these details are quite specific to the nbtree index AM (in theory we could have other index AMs that are amsearcharray). At most, v1 of the patch makes greater use of an existing transformation that takes place in the nbtree index AM, as it preprocesses scan keys for these types of queries (it's not inventing new transformations at all). This is a slightly creative interpretation, too. Tom's commit 9e8da0f7 didn't actually say anything about CNF/DNF. > Otherwise I suppose you > could face the problem of > incorrect selectivity of the calculation and, consequently, the > cardinality calculation? I can't think of any reason why that should happen as a direct result of what I have done here. Multi-column index paths + multiple SAOP clauses are not a new thing. The number of rows returned does not depend on whether we have some columns as filter quals or not. Of course that doesn't mean that the costing has no problems. The costing definitely has several problems right now. It also isn't necessarily okay that it's "just as good as before" if it turns out that it needs to be better now. But I don't see why it would be. (Actually, my hope is that selectivity estimation might be *less* important as a practical matter with the patch.) > I can't clearly understand at what stage it is clear that the such a > transformation needs to be applied? I don't know either. 
I think that most of this work needs to take place in the nbtree code, during preprocessing. But it's not so simple. There is a mutual dependency between the code that generates index paths in the planner and nbtree scan key preprocessing. The planner needs to know what kinds of index paths are possible/safe up-front, so that it can choose the fastest plan (the fastest that the index AM knows how to execute correctly). But, there are lots of small annoying nbtree implementation details that might matter, and can change. I think we need to have nbtree register a callback, so that the planner can initialize some preprocessing early. I think that we require a "two way conversation" between the planner and the index AM. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent wrote: > My idea is not quite block nested loop join. It's more 'restart the > index scan at the location the previous index scan ended, if > heuristics say there's a good chance that might save us time'. I'd say > it is comparable to the fast tree descent optimization that we have > for endpoint queries, and comparable to this patch's scankey > optimization, but across AM-level rescans instead of internal rescans. Yeah, I see what you mean. Seems related, even though what you've shown in your prototype patch doesn't seem like it fits into my taxonomy very neatly. (BTW, I was a little confused by the use of the term "endpoint" at first, since there is a function that uses that term to refer to a descent of the tree that happens without any insertion scan key. This path is used whenever the best we can do in _bt_first is to descend to the rightmost or leftmost page.) > The basic design of that patch is this: We keep track of how many > times we've rescanned, and the end location of the index scan. If a > new index scan hits the same page after _bt_search as the previous > scan ended, we register that. I can see one advantage that block nested loop join would retain here: it does block-based accesses on both sides of the join. Since it "looks ahead" on both sides of the join, more repeat accesses are likely to be avoided. Not too sure how much that matters in practice, though. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hi, all! CNF -> DNF conversion = Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans, that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of each array as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3". In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy). The optimizer still needs to optimize. It can't very well do that without having some kind of advanced notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. 
I suspect that there needs to be a two way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here. Honestly, I'm just reading and delving into this thread and other topics related to it, so excuse me if I ask you a few obvious questions. I noticed that you are going to add CNF->DNF transformation at the index construction stage. If I understand correctly, you will rewrite restrictinfo node, change boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier? Otherwise I suppose you could face the problem of incorrect selectivity of the calculation and, consequently, the cardinality calculation? I can't clearly understand at what stage it is clear that the such a transformation needs to be applied? -- Regards, Alena Rybakina Postgres Professional
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan wrote: > > On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent > wrote: > > > Basically, the patch that added that feature had to revise the index > > > AM API, in order to support a mode of operation where scans return > > > groupings rather than tuples. Whereas this patch requires none of > > > that. It makes affected index scans as similar as possible to > > > conventional index scans. > > > > Hmm, yes. I see now where my confusion started. You called it out in > > your first paragraph of the original mail, too, but that didn't help > > me then: > > > > The wiki does not distinguish "Index Skip Scans" and "Loose Index > > Scans", but these are not the same. > > A lot of people (myself included) were confused on this point for > quite a while. I've taken the liberty to update the "Loose indexscan" wiki page [0], adding detail that Loose indexscans are distinct from Skip scans, and showing some high-level distinguishing properties. I also split the TODO entry for `` "loose" or "skip" scan `` into two, and added links to the relevant recent threads so that it's clear these are different (and that some previous efforts may have had a confusing name). I hope this will reduce the chance of future confusion between the two different approaches to improving index scan performance. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0]: https://wiki.postgresql.org/wiki/Loose_indexscan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent wrote: > > Basically, the patch that added that feature had to revise the index > > AM API, in order to support a mode of operation where scans return > > groupings rather than tuples. Whereas this patch requires none of > > that. It makes affected index scans as similar as possible to > > conventional index scans. > > Hmm, yes. I see now where my confusion started. You called it out in > your first paragraph of the original mail, too, but that didn't help > me then: > > The wiki does not distinguish "Index Skip Scans" and "Loose Index > Scans", but these are not the same. A lot of people (myself included) were confused on this point for quite a while. To make matters even more confusing, one of the really compelling cases for the MDAM design is scans that feed into GroupAggregates -- preserving index sort order for naturally big index scans will tend to enable it. One of my examples from the start of this thread showed just that. (It just so happened that that example was faster because of all the "skipping" that nbtree *wasn't* doing with the patch.) > Yes, that's why I asked: The MDAM paper's examples seem to materialize > the full predicate up-front, which would require a product of all > indexed columns' quals in size, so that materialization has a good > chance to get really, really large. But if we're not doing that > materialization upfront, then there is no issue with resource > consumption (except CPU time, which can likely be improved with other > methods) I get why you asked. I might have asked the same question. As I said, the MDAM paper has *surprisingly* little to say about B-Tree executor stuff -- it's almost all just describing the preprocessing/transformation process. It seems as if optimizations like the one from my patch were considered too obvious to talk about and/or out of scope by the authors. Thinking about the MDAM paper like that was what made everything fall into place for me. 
Remember, "missing key predicates" isn't all that special. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan wrote: > > On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent > wrote: > > We could cache the last accessed leaf page across amrescan operations > > to reduce the number of index traversals needed when the join key of > > the left side is highly (but not necessarily strictly) correllated. > > That sounds like block nested loop join. It's possible that that could > reuse some infrastructure from this patch, but I'm not sure. My idea is not quite block nested loop join. It's more 'restart the index scan at the location the previous index scan ended, if heuristics say there's a good chance that might save us time'. I'd say it is comparable to the fast tree descent optimization that we have for endpoint queries, and comparable to this patch's scankey optimization, but across AM-level rescans instead of internal rescans. See also the attached prototype and loosely coded patch. It passes tests, but it might not be without bugs. The basic design of that patch is this: We keep track of how many times we've rescanned, and the end location of the index scan. If a new index scan hits the same page after _bt_search as the previous scan ended, we register that. Those two values - num_rescans and num_samepage - are used as heuristics for the following: If 50% or more of rescans hit the same page as the end location of the previous scan, we start saving the scan's end location's buffer into the BTScanOpaque, so that the next _bt_first can check whether that page might be the right leaf page, and if so, immediately go to that buffer instead of descending the tree - saving one tree descent in the process. Further optimizations of this mechanism could easily be implemented by e.g. only copying the min/max index tuples instead of the full index page, reducing the overhead at scan end. 
Kind regards, Matthias van de Meent Neon (https://neon.tech) v1-0001-Cache-btree-scan-end-page-across-rescans-in-the-s.patch.cfbot-ignore Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan wrote: > > On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent > > I'm not sure I understand. MDAM seems to work on an index level to > > return full ranges of values, while "skip scan" seems to try to allow > > systems to signal to the index to skip to some other index condition > > based on arbitrary cutoffs. This would usually be those of which the > > information is not stored in the index, such as "SELECT user_id FROM > > orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go > > though the user_id index and skip to the next user_id value when it > > gets enough rows of a matching result (where "enough" is determined > > above the index AM's plan node, or otherwise is impossible to > > determine with only the scan key info in the index AM). I'm not sure > > how this could work without specifically adding skip scan-related > > index AM functionality, and I don't see how it fits in with this > > MDAM/SAOP system. > > I think of that as being quite a different thing. > > Basically, the patch that added that feature had to revise the index > AM API, in order to support a mode of operation where scans return > groupings rather than tuples. Whereas this patch requires none of > that. It makes affected index scans as similar as possible to > conventional index scans. Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then: The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same. In the one page on "Loose indexscan", it refers to MySQL's "loose index scan" documentation, which does handle groupings, and this was targeted with the previous, mislabeled, "Index skipscan" patchset. 
However, crucially, it also refers to other databases' Index Skip Scan documentation, which document and implement this approach of 'skipping to the next potential key range to get efficient non-prefix qual results', giving me a false impression that those two features are one and the same when they are not. It seems like I'll have to wait a bit longer for the functionality of Loose Index Scans. > > > [...] > > > > > > Thoughts? > > > > MDAM seems to require exponential storage for "scan key operations" > > for conditions on N columns (to be precise, the product of the number > > of distinct conditions on each column); e.g. an index on mytable > > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... > > AND h IN (1, 2)" would require 2^8 entries. > > Note that I haven't actually changed anything about the way that the > state machine generates new sets of single value predicates -- it's > still just cycling through each distinct set of array keys in the > patch. > > What you describe is a problem in theory, but I doubt that it's a > problem in practice. You don't actually have to materialize the > predicates up-front, or at all. Yes, that's why I asked: The MDAM paper's examples seem to materialize the full predicate up-front, which would require a product of all indexed columns' quals in size, so that materialization has a good chance to get really, really large. But if we're not doing that materialization upfront, then there is no issue with resource consumption (except CPU time, which can likely be improved with other methods) Kind regards, Matthias van de Meent Neon (https://neon.tech/)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent wrote: > We could cache the last accessed leaf page across amrescan operations > to reduce the number of index traversals needed when the join key of > the left side is highly (but not necessarily strictly) correllated. That sounds like block nested loop join. It's possible that that could reuse some infrastructure from this patch, but I'm not sure. In general, SAOP execution/MDAM performs "duplicate elimination before it reads the data" by sorting and deduplicating the arrays up front. While my patch sometimes elides a primitive index scan, primitive index scans are already disjuncts that are combined to create what can be considered one big index scan (that's how the planner and executor think of them). The patch takes that one step further by recognizing that it could quite literally be one big index scan in some cases (or fewer, larger scans, at least). It's a natural incremental improvement, as opposed to inventing a new kind of index scan. If anything the patch makes SAOP execution more similar to traditional index scans, especially when costing them. Like InnoDB style loose index scan (for DISTINCT and GROUP BY optimization), block nested loop join would require inventing a new type of index scan. Both of these other two optimizations involve the use of semantic information that spans multiple levels of abstraction. Loose scan requires duplicate elimination (that's the whole point), while IIUC block nested loop join needs to "simulate multiple inner index scans" by deliberately returning duplicates for each would-be inner index scan. These are specialized things. To be clear, I think that all of these ideas are reasonable. I just find it useful to classify these sorts of techniques according to whether or not the index AM API would have to change or not, and the general nature of any required changes. 
MDAM can do a lot of cool things without requiring any revisions to the index AM API, which should allow it to play nice with everything else (index path clause safety issues notwithstanding). -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan wrote: > > On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent > wrote: > > Considering that it caches/reuses the page across SAOP operations, can > > (or does) this also improve performance for index scans on the outer > > side of a join if the order of join columns matches the order of the > > index? > > It doesn't really cache leaf pages at all. What it does is advance the > array keys locally, while the original buffer lock is still held on > that same page. Hmm, then I had a mistaken understanding of what we do in _bt_readpage with _bt_saveitem. > > That is, I believe this caches (leaf) pages across scan keys, but can > > (or does) it also reuse these already-cached leaf pages across > > restarts of the index scan/across multiple index lookups in the same > > plan node, so that retrieval of nearby index values does not need to > > do an index traversal? > > I'm not sure what you mean. There is no reason why you need to do more > than one single descent of an index to scan many leaf pages using many > distinct sets of array keys. Obviously, this depends on being able to > observe that we really don't need to redescend the index to advance > the array keys, again and again. Note in particularly that this > usually works across leaf pages. In a NestedLoop(inner=seqscan, outer=indexscan), the index gets repeatedly scanned from the root, right? It seems that right now, we copy matching index entries into a local cache (that is deleted on amrescan), then we drop our locks and pins on the buffer, and then start returning values from our local cache (in _bt_saveitem). We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correllated. 
The worst case overhead of this would be 2 _bt_compares (to check if the value is supposed to be fully located on the cached leaf page) plus one memcpy( , , BLCKSZ) in the previous loop. With some smart heuristics (e.g. page fill factor, number of distinct values, and whether we previously hit this same leaf page in the previous scan of this Node) we can probably also reduce this overhead to a minimum if the joined keys are not correlated, but accelerate the query significantly when we find out they are correlated. Of course, in the cases where we'd expect very few distinct join keys the planner would likely put a Memoize node above the index scan, but for mostly unique join keys I think this could save significant amounts of time, if only on buffer pinning and locking. I guess I'll try to code something up when I have the time, as it sounds like it's not exactly related to your patch, but an interesting improvement nonetheless. Kind regards, Matthias van de Meent
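The cross-rescan caching idea sketched above can be modeled as follows. This is a toy simulation in Python, not PostgreSQL code: the page representation, the names, and the "two comparisons against the cached page's bounds" fast path are all illustrative assumptions standing in for the real nbtree structures.

```python
class LeafPage:
    """Toy stand-in for an nbtree leaf page: a sorted slice of the key space."""
    def __init__(self, low_key, high_key, items):
        self.low_key = low_key    # lower bound of keys covered by this page
        self.high_key = high_key  # upper bound (the "high key" being compared against)
        self.items = items        # sorted keys stored on the page

class ToyIndexScan:
    def __init__(self, all_keys, page_size=4):
        # Split the sorted key list into fixed-size "leaf pages".
        self.pages = []
        for i in range(0, len(all_keys), page_size):
            chunk = all_keys[i:i + page_size]
            self.pages.append(LeafPage(chunk[0], chunk[-1], chunk))
        self.cached_page = None   # survives "amrescan", per the proposal above
        self.descents = 0         # count of full root-to-leaf traversals

    def rescan_lookup(self, key):
        # The proposed fast path: two comparisons against the cached page's
        # bounds decide whether the key must live on that page.
        page = self.cached_page
        if page is not None and page.low_key <= key <= page.high_key:
            return key in page.items
        # Fallback: a full descent from the root (simulated by linear search).
        self.descents += 1
        for page in self.pages:
            if page.low_key <= key <= page.high_key:
                self.cached_page = page
                return key in page.items
        return False

scan = ToyIndexScan(list(range(0, 100, 2)))  # even keys 0..98, 4 per page
# A correlated outer side: successive lookups probe nearby keys, so most
# of them are answered from the cached page without a new descent.
hits = sum(scan.rescan_lookup(k) for k in [10, 12, 14, 11, 16, 18])
```

In this toy run, six correlated lookups cost only two simulated descents; an uncorrelated key stream would instead pay the two bound comparisons on nearly every lookup, which is the overhead the heuristics above are meant to bound.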
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent wrote: > Considering that it caches/reuses the page across SAOP operations, can > (or does) this also improve performance for index scans on the outer > side of a join if the order of join columns matches the order of the > index? It doesn't really cache leaf pages at all. What it does is advance the array keys locally, while the original buffer lock is still held on that same page. > That is, I believe this caches (leaf) pages across scan keys, but can > (or does) it also reuse these already-cached leaf pages across > restarts of the index scan/across multiple index lookups in the same > plan node, so that retrieval of nearby index values does not need to > do an index traversal? I'm not sure what you mean. There is no reason why you need to do more than one single descent of an index to scan many leaf pages using many distinct sets of array keys. Obviously, this depends on being able to observe that we really don't need to redescend the index to advance the array keys, again and again. Note in particular that this usually works across leaf pages. > I'm not sure I understand. MDAM seems to work on an index level to > return full ranges of values, while "skip scan" seems to try to allow > systems to signal to the index to skip to some other index condition > based on arbitrary cutoffs. This would usually be those of which the > information is not stored in the index, such as "SELECT user_id FROM > orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go > through the user_id index and skip to the next user_id value when it > gets enough rows of a matching result (where "enough" is determined > above the index AM's plan node, or otherwise is impossible to > determine with only the scan key info in the index AM). I'm not sure > how this could work without specifically adding skip scan-related > index AM functionality, and I don't see how it fits in with this > MDAM/SAOP system. 
I think of that as being quite a different thing. Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans. > > [...] > > > > Thoughts? > > MDAM seems to require exponential storage for "scan key operations" > for conditions on N columns (to be precise, the product of the number > of distinct conditions on each column); e.g. an index on mytable > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... > AND h IN (1, 2)" would require 2^8 entries. Note that I haven't actually changed anything about the way that the state machine generates new sets of single value predicates -- it's still just cycling through each distinct set of array keys in the patch. What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all. Plus you can skip over them using the next index tuple. So skipping works both ways. -- Peter Geoghegan
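The point that the cross product of array keys never has to be materialized can be illustrated with a toy model (Python; the function name and structure are mine, not the patch's). The scan effectively holds one cursor per array, and advancing to the first combination >= some index tuple is a per-column binary search, so "skipping works both ways" without the 2^8 combinations ever existing in memory.

```python
import bisect

def advance_array_keys(arrays, tup):
    """Return the first combination from the cross product of the sorted
    'arrays' that is >= 'tup' (lexicographically), without materializing
    the product. A toy analogue of advancing array keys using the next
    index tuple; None means the scan is exhausted."""
    combo = []
    for i, arr in enumerate(arrays):
        pos = bisect.bisect_left(arr, tup[i])
        if pos == len(arr):
            # No value in this column can match: roll over a prior column
            # to its next value and reset later columns to their minimums.
            for j in range(i - 1, -1, -1):
                nxt = bisect.bisect_right(arrays[j], combo[j])
                if nxt < len(arrays[j]):
                    return tuple(combo[:j] + [arrays[j][nxt]] +
                                 [a[0] for a in arrays[j + 1:]])
            return None  # scan is exhausted
        combo.append(arr[pos])
        if arr[pos] > tup[i]:
            # Strictly greater: remaining columns reset to their minimums.
            return tuple(combo + [a[0] for a in arrays[i + 1:]])
    return tuple(combo)

# Eight IN (1, 2) clauses: 2^8 = 256 combinations exist in principle, but
# one index tuple lets us jump straight to the next relevant combination.
arrays = [[1, 2]] * 8
next_keys = advance_array_keys(arrays, (1, 1, 1, 1, 1, 1, 1, 3))
```

Here the index tuple (1, 1, 1, 1, 1, 1, 1, 3) rules out everything up to (1, 1, 1, 1, 1, 1, 2, 1) in a handful of binary searches, which is the sense in which the exponential predicate set is a problem only in theory.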
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan wrote: > > I've been working on a variety of improvements to nbtree's native > ScalarArrayOpExpr execution. This builds on Tom's work in commit > 9e8da0f7. Cool. > Attached patch is still at the prototype stage. I'm posting it as v1 a > little earlier than I usually would because there has been much back > and forth about it on a couple of other threads involving Tomas Vondra > and Jeff Davis -- seems like it would be easier to discuss with > working code available. > > The patch adds two closely related enhancements to ScalarArrayOp > execution by nbtree: > > 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree > index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now > "advance the scan's array keys locally", which sometimes avoids > significant amounts of unneeded pinning/locking of the same set of > index pages. > > SAOP index scans become capable of eliding primitive index scans for > the next set of array keys in line in cases where it isn't truly > necessary to descend the B-Tree again. Index scans are now capable of > "sticking with the existing leaf page for now" when it is determined > that the end of the current set of array keys is physically close to > the start of the next set of array keys (the next set in line to be > materialized by the _bt_advance_array_keys state machine). This is > often possible. > > Naturally, we still prefer to advance the array keys in the > traditional way ("globally") much of the time. That means we'll > perform another _bt_first/_bt_search descent of the index, starting a > new primitive index scan. Whether we try to skip pages on the leaf > level or stick with the current primitive index scan (by advancing > array keys locally) is likely to vary a great deal. Even during the > same index scan. Everything is decided dynamically, which is the only > approach that really makes sense. 
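The "advance locally vs. start a new primitive index scan" decision described above can be caricatured like this (an illustrative Python sketch; the real logic in the patch's _bt_advance_array_keys state machine is far more involved and considers more than a single key comparison):

```python
def choose_advancement(page_high_key, next_key_set_start):
    """Toy model of the dynamic decision: if the next set of array keys
    begins at or before the current leaf page's high key, its matches may
    start on this page (or very close by), so keep reading locally rather
    than paying for a fresh root-to-leaf descent."""
    if next_key_set_start <= page_high_key:
        return "advance locally (stay on current leaf page)"
    return "advance globally (new _bt_first/_bt_search descent)"

# Leaf page covers keys up to 500; next array keys start at 350 -> stay put.
local = choose_advancement(500, 350)
# Next array keys start at 90_000 -> far away, so re-descend from the root.
global_ = choose_advancement(500, 90_000)
```

The same scan can take either branch at different points, which is what "everything is decided dynamically" means in the description above.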
> > This optimization can significantly lower the number of buffers pinned > and locked in cases with significant locality, and/or with many array > keys with no matches. The savings (when measured in buffers > pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking > has shown that transaction throughput for variants of "pgbench -S" > designed to stress the implementation (hundreds of array constants) > under concurrent load can have up to 5.5x higher transaction > throughput with the patch. Less extreme cases (10 array constants, > spaced apart) see about a 20% improvement in throughput. There are > similar improvements to latency for the patch, in each case. Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index? That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal? > [...] > Skip Scan > ========= > > MDAM encompasses something that people tend to call "skip scan" -- > terminology with a great deal of baggage. These days I prefer to call > it "filling in missing key predicates", per the paper. That's much > more descriptive, and makes it less likely that people will conflate > the techniques with InnoDB style "loose index scans" -- the latter is > a much more specialized/targeted optimization. (I now believe that > these are very different things, though I was thrown off by the > superficial similarities for a long time. It's pretty confusing.) I'm not sure I understand. 
MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be those of which the information is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system. > [...] > > Thoughts? MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used for each column, that'd be 4^8, etc... With an index column limit of 32, that's
Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7. Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available. The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree: 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages. SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible. Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal. Even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense. This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. 
The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that transaction throughput for variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) under concurrent load can have up to 5.5x higher transaction throughput with the patch. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case. 2. The optimizer now produces index paths with multiple SAOP clauses (or other clauses we can safely treat as "equality constraints") on each of the leading columns from a composite index -- all while preserving index ordering/useful pathkeys in most cases. The nbtree work from item 1 is useful even with the simplest IN() list query involving a scan of a single column index. Obviously, it's very inefficient for the nbtree code to use 100 primitive index scans when 1 is sufficient. But that's not really why I'm pursuing this project. My real goal is to implement (or to enable the implementation of) a whole family of useful techniques for multi-column indexes. I call these "MDAM techniques", after the 1995 paper "Efficient Search of Multidimensional B-Trees" [1][2] -- MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system. The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort order be preserved whenever possible, too. Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed, on the nbtree side (when the optimizations are applicable at all). The new nbtree code now has special requirements surrounding SAOP search type scan keys with composite indexes. 
These requirements make changes in the optimizer all but essential. Index order === As I said, there are cases where preserving index order is immediately and obviously useful, in and of itself. Let's start there. Here's a test case that you can run against the regression test database: pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty); CREATE INDEX pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS) select ctid, thousand from tenk1 where two in (0,1) and four in (1,2) and twenty in (1,2) order by two, four, twenty limit 20; With the patch, this query gets 13 buffer hits. On the master branch, it gets 1377 buffer hits -- which exceeds the number you'll get from a sequential scan by about 4x. No coaxing was required to get the planner to produce this plan on the master branch. Almost all of the savings shown here are related to heap page buffer hits -- the nbtree changes don't directly help in this particular example (strictly speaking, you only need the optimizer changes to get this result). Obviously, the