Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Apr 22, 2024 at 11:13 AM Peter Geoghegan wrote:
> I'm pretty sure that I could fix this by simply removing the
> assertion. But I need to think about it a bit more before I push a
> fix.
>
> The test case you've provided proves that _bt_preprocess_keys's
> new no-op path isn't just used during scans that have array keys (your
> test case doesn't have a SAOP at all). This was never intended. On the
> other hand, I think that it's still correct (or will be once the assertion is
> gone), and it seems like it would be simpler to allow this case (and
> document it) than to not allow it at all.

Pushed a fix like that just now. Thanks for the report, Alexander.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Apr 22, 2024 at 4:00 AM Alexander Lakhin wrote:
> Please look at another case, where a similar Assert (but this time in
> _bt_preprocess_keys()) is triggered:
> CREATE TABLE t (a text, b text);
> INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g;
> CREATE INDEX t_idx ON t USING btree(a);
> BEGIN;
> DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a';
> FETCH FROM c;
> FETCH RELATIVE 0 FROM c;
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962

I'm pretty sure that I could fix this by simply removing the assertion. But I need to think about it a bit more before I push a fix.

The test case you've provided proves that _bt_preprocess_keys's new no-op path isn't just used during scans that have array keys (your test case doesn't have a SAOP at all). This was never intended. On the other hand, I think that it's still correct (or will be once the assertion is gone), and it seems like it would be simpler to allow this case (and document it) than to not allow it at all.

The general idea that we only need one "real" _bt_preprocess_keys call per btrescan (independent of the presence of array keys) still seems sound.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello Peter,

07.04.2024 20:18, Peter Geoghegan wrote:
> On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin wrote:
> > SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
> >
> > TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267
>
> I immediately see what's up here. Will fix this in the next short while.
> There is no bug here in builds without assertions, but it's probably best
> to keep the assertion, and to just make sure not to call
> _bt_preprocess_array_keys_final() unless it has real work to do.

Please look at another case, where a similar Assert (but this time in _bt_preprocess_keys()) is triggered:

CREATE TABLE t (a text, b text);
INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g;
CREATE INDEX t_idx ON t USING btree(a);
BEGIN;
DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a';
FETCH FROM c;
FETCH RELATIVE 0 FROM c;

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Apr 18, 2024 at 2:13 AM Donghang Lin wrote:
> It's triggered when a scankey's strategy is set to invalid. While for a
> descending ordered column, the strategy needs to get fixed to its commute
> strategy. That doesn't work if the strategy is invalid.

The problem is that _bt_fix_scankey_strategy shouldn't have been doing anything with already-eliminated array scan keys in the first place (whether or not they're on a DESC index column). I just pushed a fix along those lines.

Thanks for the report!

--
Peter Geoghegan
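For readers following along, the bug Peter describes can be pictured with a small model. The following is an illustrative Python sketch only, not the actual PostgreSQL C code: the strategy constants mirror nbtree's, but `fix_scankey_strategy` is a hypothetical stand-in for the DESC-column handling in `_bt_fix_scankey_strategy`.

```python
# Simplified model of nbtree strategy numbers (the real definitions live in
# src/include/access/stratnum.h). Illustrative sketch only.
BTLessStrategyNumber = 1
BTLessEqualStrategyNumber = 2
BTEqualStrategyNumber = 3
BTGreaterEqualStrategyNumber = 4
BTGreaterStrategyNumber = 5
InvalidStrategy = 0


def commute_strategy(strategy):
    # Mirror a strategy around equality: < <-> >, <= <-> >=, = stays =.
    return BTGreaterStrategyNumber + 1 - strategy


def fix_scankey_strategy(strategy, desc_column):
    """Hypothetical stand-in for _bt_fix_scankey_strategy's DESC handling."""
    # The fix: an already-eliminated scan key carries InvalidStrategy and
    # must be skipped entirely -- commuting it would produce a nonsense
    # strategy number (6), which is what tripped the assertion.
    if strategy == InvalidStrategy:
        return InvalidStrategy
    if desc_column:
        return commute_strategy(strategy)
    return strategy
```

On an ascending column nothing changes; on a DESC column a `<` key is executed as `>` against the reversed key order, and the invalid-strategy guard models what the pushed fix effectively restores.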
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hi Peter,

There seems to be an assertion failure with this change in HEAD:

TRAP: failed Assert("leftarg->sk_attno == rightarg->sk_attno"), File: "../../src/backend/access/nbtree/nbtutils.c", Line: 3246, PID: 1434532

It can be reproduced by:

create table t(a int);
insert into t select 1 from generate_series(1,10);
create index on t (a desc);
set enable_seqscan = false;
select * from t where a IN (1,2) and a IN (1,2,3);

It's triggered when a scankey's strategy is set to invalid. While for a descending ordered column, the strategy needs to get fixed to its commute strategy. That doesn't work if the strategy is invalid. Attached a demo fix.

Regards,
Donghang Lin (ServiceNow)

demo-fix.patch
Description: Binary data
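For context on why one of the repro's scan keys ends up with an invalid strategy: the two equality arrays in "a IN (1,2) and a IN (1,2,3)" are redundant, and preprocessing reduces them to a single array. The sketch below models that sorted-merge intersection (the idea behind `_bt_merge_arrays` as discussed earlier in this thread); it is an illustration in Python, not the actual implementation.

```python
def merge_equality_arrays(left, right):
    """Intersect two sorted, de-duplicated element arrays with a merge join.

    Models how two equality arrays on the same index column can be reduced
    to one during preprocessing (illustrative sketch only).
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            merged.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return merged
```

Here `merge_equality_arrays([1, 2], [1, 2, 3])` leaves just `[1, 2]`; an empty result would prove the qual unsatisfiable without reading the index at all.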
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 9:57 PM Peter Geoghegan wrote:
> On Sun, Apr 7, 2024 at 9:50 PM Tom Lane wrote:
> > those first two Asserts are redundant with the "if" as well.
>
> I'll get rid of those other two assertions as well, then.

Done that way.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 9:50 PM Tom Lane wrote:
> If you're doing that, then surely
>
>     if (j != (BTEqualStrategyNumber - 1) ||
>         !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
>     {
>         ...
>     }
>     else
>     {
>         Assert(j == (BTEqualStrategyNumber - 1));
>         Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
>         Assert(xform[j].ikey == array->scan_key);
>         Assert(!(cur->sk_flags & SK_SEARCHARRAY));
>     }
>
> those first two Asserts are redundant with the "if" as well.

I'll get rid of those other two assertions as well, then.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Peter Geoghegan writes:
> The assertions in question are arguably redundant. There are very
> similar assertions just a little earlier on, as we initially set up
> the array stuff (right before _bt_compare_scankey_args is called).
> I'll just remove the "Assert(xform[j].ikey == array->scan_key)"
> assertion that Coverity doesn't like, in addition to the
> "Assert(!array || array->scan_key == i)" assertion, on the grounds
> that they're redundant.

If you're doing that, then surely

    if (j != (BTEqualStrategyNumber - 1) ||
        !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
    {
        ...
    }
    else
    {
        Assert(j == (BTEqualStrategyNumber - 1));
        Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
        Assert(xform[j].ikey == array->scan_key);
        Assert(!(cur->sk_flags & SK_SEARCHARRAY));
    }

those first two Asserts are redundant with the "if" as well.

regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 9:25 PM Tom Lane wrote:
> Perhaps this'd help:
>
> -    Assert(xform[j].ikey == array->scan_key);
> +    Assert(array && xform[j].ikey == array->scan_key);
>
> If that doesn't silence it, I'd be prepared to just dismiss the
> warning.

The assertions in question are arguably redundant. There are very similar assertions just a little earlier on, as we initially set up the array stuff (right before _bt_compare_scankey_args is called). I'll just remove the "Assert(xform[j].ikey == array->scan_key)" assertion that Coverity doesn't like, in addition to the "Assert(!array || array->scan_key == i)" assertion, on the grounds that they're redundant.

> Some work in the comment to explain why we must have an array here
> wouldn't be out of place either, perhaps.

There is a comment block about this right above the assertion in question:

    /*
     * Both scan keys might have arrays, in which case we'll
     * arbitrarily pass only one of the arrays.  That won't
     * matter, since _bt_compare_scankey_args is aware that two
     * SEARCHARRAY scan keys mean that _bt_preprocess_array_keys
     * failed to eliminate redundant arrays through array merging.
     * _bt_compare_scankey_args just returns false when it sees
     * this; it won't even try to examine either array.
     */

Do you think it needs more work?

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Peter Geoghegan writes:
> On Sun, Apr 7, 2024 at 8:48 PM Tom Lane wrote:
> > Coverity pointed out something that looks like a potentially live
> > problem in 5bf748b86:
> > ... which certainly makes it look like array->scan_key could be
> > a null-pointer dereference.
>
> But the "Assert(xform[j].ikey == array->scan_key)" assertion is
> located in a block where it's been established that the scan key (the
> one stored in xform[j] at this point in execution) must have an array.
> This is probably very hard for tools like Coverity to understand.

It's not obvious to human readers either ...

> Would Coverity stop complaining if I just removed the assertion? I
> could just do that, I suppose, but that seems backwards to me.

Perhaps this'd help:

-    Assert(xform[j].ikey == array->scan_key);
+    Assert(array && xform[j].ikey == array->scan_key);

If that doesn't silence it, I'd be prepared to just dismiss the warning.

Some work in the comment to explain why we must have an array here wouldn't be out of place either, perhaps.

regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 8:48 PM Tom Lane wrote:
> Coverity pointed out something that looks like a potentially live
> problem in 5bf748b86:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys()
> 2944                 * need to make sure that we don't throw away an array
> 2945                 * scan key.  _bt_compare_scankey_args expects us to
> 2946                 * always keep arrays (and discard non-arrays).
> 2947                 */
> 2948                Assert(j == (BTEqualStrategyNumber - 1));
> 2949                Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
> >>> CID 1596256: Null pointer dereferences (FORWARD_NULL)
> >>> Dereferencing null pointer "array".
> 2950                Assert(xform[j].ikey == array->scan_key);
> 2951                Assert(!(cur->sk_flags & SK_SEARCHARRAY));
> 2952            }
> 2953        }
> 2954        else if (j == (BTEqualStrategyNumber - 1))
>
> Above this there is an assertion
>
>     Assert(!array || array->num_elems > 0);
>
> which certainly makes it look like array->scan_key could be
> a null-pointer dereference.

But the "Assert(xform[j].ikey == array->scan_key)" assertion is located in a block where it's been established that the scan key (the one stored in xform[j] at this point in execution) must have an array. It has been marked SK_SEARCHARRAY, and uses the equality strategy, so it had better have one or we're in big trouble either way. This is probably very hard for tools like Coverity to understand.

We also rely on the fact that only one of the two scan keys (only one of the pair of scan keys that were passed to _bt_compare_scankey_args) can have an array at the point of the assertion that Coverity finds suspicious. It's possible that both of those scan keys actually did have arrays, but _bt_compare_scankey_args just treats that as a case of being unable to prove which scan key was redundant/contradictory due to a lack of suitable cross-type support -- so the assertion won't be reached.

Would Coverity stop complaining if I just removed the assertion? I could just do that, I suppose, but that seems backwards to me.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Coverity pointed out something that looks like a potentially live problem in 5bf748b86:

/srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys()
2944                 * need to make sure that we don't throw away an array
2945                 * scan key.  _bt_compare_scankey_args expects us to
2946                 * always keep arrays (and discard non-arrays).
2947                 */
2948                Assert(j == (BTEqualStrategyNumber - 1));
2949                Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
>>> CID 1596256: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "array".
2950                Assert(xform[j].ikey == array->scan_key);
2951                Assert(!(cur->sk_flags & SK_SEARCHARRAY));
2952            }
2953        }
2954        else if (j == (BTEqualStrategyNumber - 1))

Above this there is an assertion

    Assert(!array || array->num_elems > 0);

which certainly makes it look like array->scan_key could be a null-pointer dereference.

regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin wrote:
> Please look at an assertion failure (reproduced starting from 5bf748b86),
> triggered by the following query:
> CREATE TABLE t (a int, b int);
> CREATE INDEX t_idx ON t (a, b);
> INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
> ANALYZE t;
> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

I immediately see what's up here. Will fix this in the next short while.

There is no bug here in builds without assertions, but it's probably best to keep the assertion, and to just make sure not to call _bt_preprocess_array_keys_final() unless it has real work to do. The assertion failure demonstrates that _bt_preprocess_array_keys_final can currently be called when there can be no real work for it to do.

The problem is that we condition the call to _bt_preprocess_array_keys_final() on whether or not we had to do "real work" back when _bt_preprocess_array_keys() was called, at the start of _bt_preprocess_keys(). That usually leaves us with "so->numArrayKeys > 0", because the "real work" typically includes equality type array keys. But not here -- here you have two SAOP inequalities, which become simple scalar scan keys through early array-specific preprocessing in _bt_preprocess_array_keys(). There are no arrays left at this point, so "so->numArrayKeys == 0".

FWIW I missed this because the tests only cover cases with one SAOP inequality, which will always return early from _bt_preprocess_keys (by taking its generic single scan key fast path). Your test case has 2 scan keys, avoiding the fast path, allowing us to reach _bt_preprocess_array_keys_final().

--
Peter Geoghegan
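The "SAOP inequalities become simple scalar scan keys" step Peter mentions can be pictured as follows. This is a hedged Python sketch of the logical transformation only (a hypothetical helper, not PostgreSQL code): `a < ANY (array)` holds iff `a` is below the array's maximum element, so the whole array collapses to one scalar bound.

```python
def reduce_saop_inequality(op, elements):
    """Collapse "col <op> ANY (elements)" to a single scalar bound (sketch).

    "a < ANY ('{1,2,3}')" is satisfied iff a < 3 (the maximum element);
    "a > ANY (...)" similarly needs only the minimum element. Equality
    arrays do NOT collapse this way, which is why they stay arrays.
    """
    if not elements:
        return None  # empty array: the qual cannot be satisfied
    if op in ('<', '<='):
        return (op, max(elements))
    if op in ('>', '>='):
        return (op, min(elements))
    raise ValueError("only inequality operators collapse this way")
```

In Alexander's query both `a < ANY (ARRAY[1])` and `b < ANY (ARRAY[1])` reduce to plain `< 1` scan keys, which is why no array (and so no numArrayKeys) remains by the time _bt_preprocess_array_keys_final could run.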
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello Peter,

03.04.2024 22:53, Peter Geoghegan wrote:
> On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan wrote:
> > Note: v18 doesn't have any adjustments to the costing, as originally
> > planned. I'll probably need to post a revised patch with improved (or at
> > least polished) costing in the next few days, so that others will have
> > the opportunity to comment before I commit the patch.
>
> Attached is v19, which dealt with remaining concerns I had about the
> costing in selfuncs.c. My current plan is to commit this on Saturday
> morning (US Eastern time).

Please look at an assertion failure (reproduced starting from 5bf748b86), triggered by the following query:

CREATE TABLE t (a int, b int);
CREATE INDEX t_idx ON t (a, b);
INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
ANALYZE t;
SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Mar 18, 2024 at 9:25 AM Matthias van de Meent wrote:
> I was thinking about a more unified processing model, where
> _bt_preprocess_keys would iterate over all keys, including processing
> of array keys (by merging and reduction to normal keys) if and when
> found. This would also reduce the effort expended when there are
> contradictory scan keys, as array preprocessing is relatively more
> expensive than other scankeys and contradictions are then found before
> processing of later keys.

Does it really matter? I just can't get excited about maybe getting one less syscache lookup in queries that involve *obviously* contradictory keys at the SQL level. Especially because we're so much better off with the new design here anyway; calling _bt_preprocess_keys once rather than once per distinct set of array keys is an enormous improvement, all on its own.

My concern with preprocessing overhead is almost entirely limited to pathological performance issues involving extreme (even adversarial) input scan keys/arrays. I feel that if we're going to guarantee that we won't choke in _bt_checkkeys, even given billions of distinct sets of array keys, we should make the same guarantee in _bt_preprocess_keys -- just to avoid POLA violations. But that's the only thing that seems particularly important in the general area of preprocessing performance.

(Preprocessing performance probably does matter quite a bit in more routine cases, where operators like < and >= are mixed together on the same attribute. But there'll be no new _bt_preprocess_keys operator/function lookups for stuff like that.)

The advantage of not completely merging _bt_preprocess_array_keys with _bt_preprocess_keys is that preserving this much of the existing code structure allows us to still decide how many array keys we'll need for the scan up front (if the scan doesn't end up having an unsatisfiable qual).
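As an aside, the "unsatisfiable qual" determination referred to here is cheap relative to array work: a pair of required inequality bounds on one attribute can be proven contradictory before any index page is read. A hedged sketch of that kind of check (illustrative Python only, not the actual _bt_preprocess_keys logic):

```python
def range_qual_satisfiable(low, low_incl, high, high_incl):
    """Can "col > low AND col < high" (bounds optionally inclusive) match?

    Sketch of the contradiction test that preprocessing can perform up
    front, letting the scan end without reading the index at all.
    """
    if low < high:
        return True
    if low == high:
        # only "col >= x AND col <= x" survives; any strict bound kills it
        return low_incl and high_incl
    return False
```

For example, `a > 5 AND a < 1` is rejected immediately, while `a >= 3 AND a <= 3` survives as an equality-like range.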
_bt_preprocess_array_keys will eliminate redundant arrays early, in practically all cases (the exception is when it has to deal with an incomplete opfamily lacking the appropriate cross-type ORDER proc), so we don't have to think about merging arrays after that point. Rather like how we don't have to worry about "WHERE a < any('{1, 2, 3}')" type inequalities after _bt_preprocess_array_keys does an initial round of array-specific preprocessing (_bt_preprocess_keys can deal with those in the same way as it will with standard inequalities).

> > This preprocessing work should all be happening during planning, not
> > during query execution -- that's the design that makes the most sense.
> > This is something we've discussed in the past in the context of skip
> > scan (see my original email to this thread for the reference).
>
> Yes, but IIRC we also agreed that it's impossible to do this fully in
> planning, amongst others due to joins on array fields.

Even with a nested loop join's inner index scan, where the constants used for each btrescan are not really predictable in the planner, we can still do most preprocessing in the planner, at least most of the time. We could still easily do analysis that is capable of ruling out redundant or contradictory scan keys for any possible parameter value -- seen or unseen. I'd expect this to be the common case -- most of the time these inner index scans need only one simple = operator (maybe 2 = operators).

Obviously, that approach still requires that btrescan at least accept a new set of constants for each new inner index scan invocation. But that's still much cheaper than calling _bt_preprocess_keys from scratch every time btrescan is called.

> > It would be especially useful for the very fancy kinds of preprocessing
> > that are described by the MDAM paper, like using an index scan for a
> > NOT IN() list/array (this can actually make sense with a low
> > cardinality index).
>
> Yes, indexes such as those on enums.
> Though, in those cases the NOT IN () could be transformed into
> IN()-lists by the planner, but not the index.

I think that it would probably be built as just another kind of index path, like the ones we build for SAOPs. Anti-SAOPs? Just as with SAOPs, the disjunctive accesses wouldn't really be something that the planner would need too much direct understanding of (except during costing). I'd only expect the plan to use such an index scan when it wasn't too far off needing a full index scan anyway. Just like with skip scan, the distinction between these NOT IN() index scan paths and a simple full index scan path is supposed to be fuzzy (maybe they'll actually turn out to be full scans at runtime, since the number of primitive index scans is determined dynamically, based on the structure of the index at runtime).

> > The structure for preprocessing that I'm working towards (especially
> > in v15) sets the groundwork for making those shifts in the planner,
> > because we'll no longer treat each array constant as its own primitive
> > index scan during preprocessing.
>
> I hope that's going to be a fully separate patch.
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Mar 7, 2024 at 10:42 AM Benoit Tigeot wrote:
> I am not up to date with the last version of patch but I did a regular
> benchmark with version 11 and typical issue we have at the moment and the
> result are still very very good. [1]

Thanks for providing the test case. It was definitely important back when the ideas behind the patch had not yet fully developed. It helped me to realize that my thinking around non-required arrays (meaning arrays that cannot reposition the scan, and just filter out non-matching tuples) was still sloppy.

> In term of performance improvement the last proposals could be a real game
> changer for 2 of our biggest databases. We hope that Postgres 17 will
> contain those improvements.

Current plan is to commit this patch in the next couple of weeks, ahead of Postgres 17 feature freeze.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan wrote:
> On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent wrote:
> > I've attached v14, where 0001 is v13, 0002 is a patch with small
> > changes + some large comment suggestions, and 0003 which contains
> > sorted merge join code for _bt_merge_arrays.
>
> This is part of my next revision, v15, which I've attached (along with
> a test case that you might find useful, explained below).
>
> v15 makes the difference between the non-required scan key trigger and
> required scan key trigger cases clearer within _bt_advance_array_keys.

OK, here's a small additional review, with a suggestion for additional changes to _bt_preprocess:

> @@ -1117,6 +3160,46 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
>     /*
>      * The opfamily we need to worry about is identified by the index column.
>      */
>     Assert(leftarg->sk_attno == rightarg->sk_attno);
>
> +   /*
> +    * If either leftarg or rightarg are equality-type array scankeys, we need
> +    * specialized handling (since by now we know that IS NULL wasn't used)
> +    */
> [...]
> +   }
> +
>     opcintype = rel->rd_opcintype[leftarg->sk_attno - 1];

Here, you insert your code between the comment about which opfamily to choose and the code assigning the opfamily. I think this needs some cleanup.

> +    * Don't call _bt_preprocess_array_keys_final in this fast path
> +    * (we'll miss out on the single value array transformation, but
> +    * that's not nearly as important when there's only one scan key)

Why is it OK to ignore it? Or, why don't we apply it here?

---

Attached 2 patches for further optimization of the _bt_preprocess_keys path (on top of your v15), according to the following idea:

Right now, we do "expensive" processing with xform merging for all keys when we have more than 1 keys in the scan. However, we only do per-attribute merging of these keys, so if there is only one key for any attribute, the many cycles spent in that loop are mostly wasted.
By checking for single-scankey attributes, we can short-track many multi-column index scans because they usually have only a single scan key per attribute. The first implements that idea, the second reduces the scope of various variables so as to improve compiler optimizability.

I'll try to work a bit more on merging the _bt_preprocess steps into a single main iterator, but this is about as far as I got with clean patches. Merging the steps for array preprocessing with per-key processing and post-processing is proving a bit more complex than I'd anticipated, so I don't think I'll be able to finish that before the feature freeze, especially with other things that keep distracting me.

Matthias van de Meent

From 16808e85f5fae7b16fd52d8a2be8437e4cff8640 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent
Date: Tue, 19 Mar 2024 16:39:29 +0100
Subject: [PATCH v1 1/2] nbtree: Optimize preprocessing for single ScanKey per
 column

Before, there was only a single fast path: single-ScanKey index scans.
With this optimization, we can fast-track processing of attributes with
only a single scankey, significantly reducing overhead for common scans
like "a = 10 and b < 8".
---
 src/backend/access/nbtree/nbtutils.c | 529 ++-
 1 file changed, 278 insertions(+), 251 deletions(-)

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b401b31191..8cd6270408 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2511,7 +2511,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	ScanKey		inkeys;
 	ScanKey		outkeys;
 	ScanKey		cur;
-	BTScanKeyPreproc xform[BTMaxStrategyNumber];
 	bool		test_result;
 	int			i,
 				j;
@@ -2619,327 +2618,355 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	 * xform[i] points to the currently best scan key of strategy type i+1; it
 	 * is NULL if we haven't yet found such a key for this attr.
 	 */
-	attno = 1;
-	memset(xform, 0, sizeof(xform));
-
-	/*
-	 * Loop iterates from 0 to numberOfKeys inclusive; we use the last pass to
-	 * handle after-last-key processing.  Actual exit from the loop is at the
-	 * "break" statement below.
-	 */
-	for (i = 0;; cur++, i++)
+	for (i = 0; i < numberOfKeys;)
 	{
-		if (i < numberOfKeys)
+		BTScanKeyPreproc xform[BTMaxStrategyNumber];
+		int			priorNumberOfEqualCols = numberOfEqualCols;
+		bool		fast = true;
+
+		/* initialize for this attno */
+		attno = cur->sk_attno;
+
+		/* We'll try to merge keys if there are multiple for this attribute */
+		if (i + 1 < numberOfKeys && (cur + 1)->sk_attno == attno)
+
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan wrote:
> On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent wrote:
> > I've attached v14, where 0001 is v13, 0002 is a patch with small
> > changes + some large comment suggestions, and 0003 which contains
> > sorted merge join code for _bt_merge_arrays.
>
> I have accepted your changes from 0003. Agree that it's better that
> way. It's at least a little faster, but not meaningfully more
> complicated.

Thanks.

> > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I
> > can make of it.
>
> That's been the big focus of this new v15, which now goes all out with
> teaching _bt_preprocess_keys with how to deal with arrays. We now do
> comprehensive normalization of even very complicated combinations of
> arrays and non-array scan keys in this version.

I was thinking about a more unified processing model, where _bt_preprocess_keys would iterate over all keys, including processing of array keys (by merging and reduction to normal keys) if and when found. This would also reduce the effort expended when there are contradictory scan keys, as array preprocessing is relatively more expensive than other scankeys and contradictions are then found before processing of later keys. As I wasn't very far into the work yet it seems I can reuse a lot of your work here.

> For example, consider this query: [...]
>
> This has a total of 6 input scankeys -- 3 of which are arrays. But by
> the time v15's _bt_preprocess_keys is done with it, it'll have only 1
> scan key -- which doesn't even have an array (not anymore). And so we
> won't even need to advance the array keys one single time -- there'll
> simply be no array left to advance. In other words, it'll be just as
> if the query was written this way from the start:
>
> select *
> from multi_test
> where
>   a = 3::int;
>
> (Though of course the original query will spend more cycles on
> preprocessing, compared to this manually simplified variant.)
That's a good improvement, much closer to an optimal access pattern.

> It turned out to not be terribly difficult to teach
> _bt_preprocess_keys everything it could possibly need to know about
> arrays, so that it can operate on them directly, as a variant of the
> standard equality strategy (we do still need _bt_preprocess_array_keys
> for basic preprocessing of arrays, mostly just merging them). This is
> better overall (in that it gets every little subtlety right), but it
> also simplified a number of related issues. For example, there is no
> longer any need to maintain a mapping between so->keyData[]-wise scan
> keys (output scan keys), and scan->keyData[]-wise scan keys (input
> scan keys). We can just add a step to fix-up the references to the end
> of _bt_preprocess_keys, to make life easier within
> _bt_advance_array_keys.
>
> This preprocessing work should all be happening during planning, not
> during query execution -- that's the design that makes the most sense.
> This is something we've discussed in the past in the context of skip
> scan (see my original email to this thread for the reference).

Yes, but IIRC we also agreed that it's impossible to do this fully in planning, amongst others due to joins on array fields.

> It would be especially useful for the very fancy kinds of preprocessing
> that are described by the MDAM paper, like using an index scan for a
> NOT IN() list/array (this can actually make sense with a low
> cardinality index).

Yes, indexes such as those on enums. Though, in those cases the NOT IN () could be transformed into IN()-lists by the planner, but not the index.

> The structure for preprocessing that I'm working towards (especially
> in v15) sets the groundwork for making those shifts in the planner,
> because we'll no longer treat each array constant as its own primitive
> index scan during preprocessing.

I hope that's going to be a fully separate patch. I don't think I can handle much more complexity in this one.
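The collapse from six input scan keys down to a single `a = 3` relies on reduction steps like the following: once a plain scalar equality meets an equality array on the same column, the array is either proven redundant or the whole qual is proven contradictory. A Python sketch of that single step (illustrative only, not the C implementation in _bt_preprocess_keys):

```python
def combine_scalar_with_array(scalar_value, array_elements):
    """Reduce "col = scalar AND col = ANY (array)" to one condition (sketch).

    Returns the surviving scalar condition, or None when the qual is
    provably unsatisfiable (the scan can end without reading the index).
    """
    if scalar_value in array_elements:
        return ('=', scalar_value)  # the array is redundant; drop it
    return None  # contradiction: the scalar value is outside the array
```

Repeating this step (together with array-vs-array merging) over all keys on a column is what lets a qual like `a = 3 AND a IN (2, 3, 4)` end up as plain `a = 3`, with no array left to advance.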
> Right now, on HEAD, preprocessing
> with arrays kinda treats each array constant like the parameter of an
> imaginary inner index scan, from an imaginary nested loop join. But
> the planner really needs to operate on the whole qual, all at once,
> including any arrays. An actual nestloop join's inner index scan
> naturally cannot do that, and so might still require runtime/dynamic
> preprocessing in a world where that mostly happens in the planner --
> but that's clearly not appropriate for arrays ("WHERE a = 5" and "WHERE
> a in(4, 5)" are almost the same thing, and so should be handled in
> almost the same way by preprocessing).

Yeah, if the planner could handle some of this that'd be great. At the same time, I think that this might need to be gated behind a guc for more expensive planner-time deductions.

> > + * We generally permit primitive index scans to continue onto the next
> > + * sibling page when the page's finaltup satisfies all required scan keys
> > + *
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:51 PM Matthias van de Meent wrote:
> To clarify, what I mean here is that merging the changes of both the
> SAOPs changes and the removal of arrayKeyData seems to increase the
> patch size and increases the maximum complexity of each component
> patch's review.

Removing arrayKeyData probably makes the patch very slightly smaller, actually. But even if it's really the other way around, I'd still like to get rid of it as part of the same commit as everything else. It just makes sense that way.

> Separate patches may make this more reviewable, or not, but no comment
> was given on why it is better to merge the changes into a single patch.

Fair enough. Here's why:

The original SAOP design (commit 9e8da0f7, "Teach btree to handle ScalarArrayOpExpr quals natively") added a layer of indirection between scan->keyData (input scan keys) and so->keyData (output scan keys): it added another scan key array, so->arrayKeyData. There was array-specific preprocessing in _bt_preprocess_array_keys, that happened before the first primitive index scan even began -- that transformed our true input scan keys (scan->keyData) into a copy of the array with limited amounts of array-specific preprocessing already performed (so->arrayKeyData).

This made a certain amount of sense at the time, because _bt_preprocess_keys was intended to be called once per primitive index scan. Kind of like the inner side of a nested loop join's inner index scan, where we call _bt_preprocess_keys once per inner-side scan/btrescan call. (Actually, Tom's design has us call _bt_preprocess once per primitive index scan per btrescan call, which might matter in those rare cases where the inner side of a nestloop join had SAOP quals.)

What I now propose to do is to just call _bt_preprocess_keys once per btrescan (actually, it's still called once per primitive index scan, but all calls after the first are just no-ops after v12 of the patch).
This makes sense because SAOP array constants aren't like nestloop joins with inner index scans, in one important way: we really can see everything up-front. We can see all of the array elements, and operate on whole arrays as necessary during preprocessing (e.g., performing the array merging thing I added to _bt_preprocess_array_keys). It's not like the next array element is only visible to preprocessing after the outer side of a nestloop join runs, and next calls btrescan -- so why treat it like that? Conceptually, "WHERE a = 1" is almost the same thing as "WHERE a = any('{1,2}')", so why not use essentially the same approach to preprocessing in both cases? We don't need to copy the input keys into so->arrayKeyData, because the indirection (which is a bit like a fake nested loop join) doesn't buy us anything. v13 of the patch doesn't quite 100% eliminate so->arrayKeyData. While it removes the arrayKeyData field from the BTScanOpaqueData struct, we still use a temporary array (accessed via a pointer that's just a local variable) that's also called arrayKeyData. And that also stores array-preprocessing-only copies of the input scan keys. That happens within _bt_preprocess_keys. So we do "still need arrayKeyData", but we only need it for a brief period at the start of the index scan. It just doesn't make any sense to keep it around for longer than that, in a world where _bt_preprocess_keys "operates directly on arrays". That only made sense (a bit of sense) back when _bt_preprocess_keys was subordinate to _bt_preprocess_array_keys, but it's kinda the other way around now. We could probably even get rid of this remaining limited form of arrayKeyData, but that doesn't seem like it would add much. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:46 PM Matthias van de Meent wrote: > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > I think that there is supposed to be a closing parenthesis here? So > > "... (such as those described in <xref linkend="functions-comparisons"/>) might > > perform...". > > Correct. > > > FWM, if that's what you meant. > > WFM, yes? Then we're in agreement on this. > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. I agree that the patch is quite complex, especially relative to its size. > > Note that we no longer need to have an independent representation of > > so->qual_okay for array keys (the convention of setting > > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > > required). There is no longer any need for a separate pass to carry > > over the contents of BTScanOpaqueData.arrayKeyData to > > BTScanOpaqueData.keyData, which was confusing. > > I wasn't very confused by it, but sure. > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I didn't just mean that stuff. I was also suggesting that you could join the project directly (not just as a reviewer). If you're interested, you could do general work on the preprocessing of arrays. Fancier array-specific preprocessing. 
For example, something that can transform this: select * from foo where a in (1,2,3) and a < 3; Into this: select * from foo where a in (1,2); Or, something that can just set qual_okay=false, given an unsatisfiable query such as: select * from foo where a in (1,2,3) and a > 5; This is clearly doable by reusing the binary search code during preprocessing. The first example transformation could work via a binary search for the constant "3", followed by a memmove() to shrink the array in-place (plus the inequality itself would need to be fully eliminated). The second example would work a little like the array merging thing that the patch has already, except that there'd only be one array involved (there wouldn't be a pair of arrays). > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. Even if this case doesn't matter (which I now doubt), it's probably easier to just get it right than it would be to figure out if we can live without it. > > void > > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, > > ScanDirection dir) > > BTScanOpaque so = (BTScanOpaque) scan->opaque; > > int i; > > > > +Assert(so->numArrayKeys); > > +Assert(so->qual_ok); > > Has the requirement for a known scan direction been removed, or should > this still have an Assert(dir != NoMovementScanDirection)? I agree that such an assertion is worth having. Added that locally. > I'm not sure that comment is correct, at least it isn't as clear as > can be. Maybe something more in the line of the following? > +pstate.prechecked = false; /* prechecked didn't cover HIKEY */ I agree that that's a little better than what I had. 
> +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > +elemtype = cur->sk_subtype; > > +if (elemtype == InvalidOid) > > +elemtype = rel->rd_opcintype[cur->sk_attno - 1]; > > Should we Assert() that this elemtype matches the array element type > ARR_ELEMTYPE(arrayval) used to deconstruct the array? Yeah, good idea. > This is not "a" previous equality key, but "the" previous equality > operator array scan key. > Do we want to expend some additional cycles for detecting duplicate > equality array key types in interleaved order like =int[] =bigint[] > =int[]? I don't think it would be very expensive considering the > limited number of cross-type equality operators registered in default > PostgreSQL, so a simple loop that checks matching element types > starting at the first array key scankey for this attribute should > suffice. We could even sort the keys by element type if we wanted to > fix any O(n) issues for this type of behaviour (though this is > _extremely_ unlikely to become a performance issue). Yes, I think that we should probably have this (though likely wouldn't bother sorting the scan keys themselves). You'd need to look-up cross-type operators for
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello, I am not up to date with the latest version of the patch, but I did a regular benchmark with version 11 on a typical issue we have at the moment, and the results are still very, very good. [1] In terms of performance improvement, the latest proposals could be a real game changer for two of our biggest databases. We hope that Postgres 17 will contain those improvements. Kind regards, Benoit [1] - https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4972955#gistcomment-4972955 __ Benoit Tigeot On Thu, Mar 7, 2024 at 3:36 PM Peter Geoghegan wrote: > On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan wrote: > > I could include something less verbose, mentioning a theoretical risk > > to out-of-core amcanorder routines that coevolved with nbtree, > > inherited the same SAOP limitations, and then never got the same set > > of fixes. > > Attached is v11, which now says something like that in the commit > message. Other changes: > > * Fixed buggy sorting of arrays using cross-type ORDER procs, by > recognizing that we need to consistently use same-type ORDER procs for > sorting and merging the arrays during array preprocessing. > > Obviously, when we sort, we compare array elements to other array > elements (all of the same type). This is true independent of whether > the query itself happens to use a cross type operator/ORDER proc, so > we will need to do two separate ORDER proc lookups in cross-type > scenarios. > > * No longer subscript the ORDER proc used for array binary searches > using a scankey subscript. Now there is an additional indirection that > works even in the presence of multiple redundant scan keys that cannot > be detected as such due to a lack of appropriate cross-type support > within an opfamily. > > This was subtly buggy before now. Requires a little more coordination > between array preprocessing and standard/primitive index scan > preprocessing, which isn't ideal but seems unavoidable. 
> > * Lots of code polishing, especially within _bt_advance_array_keys(). > > While _bt_advance_array_keys() still works in pretty much exactly the > same way as it did back in v10, there are now better comments. > Including something about why its recursive call to itself is > guaranteed to use a low, fixed amount of stack space, verified using > an assertion. That addresses a concern held by Matthias. > > Outlook > === > > This patch is approaching being committable now. Current plan is to > commit this within the next few weeks. > > All that really remains now is to research how we might integrate this > work with the recently added continuescanPrechecked/haveFirstMatch > stuff from Alexander Korotkov, if at all. I've put that off until now > because it isn't particularly fundamental to what I'm doing here, and > seems optional. > > I would also like to do more performance validation. Things like the > parallel index scan code could stand to be revisited once again. Plus > I should think about the overhead of array preprocessing when > btrescan() is called many times, from a nested loop join -- I should > have something to say about that concern (raised by Heikki at one > point) before too long. > > -- > Peter Geoghegan >
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent wrote: > > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. To clarify, what I mean here is that merging the changes of both the SAOPs changes and the removal of arrayKeyData seems to increase the patch size and increases the maximum complexity of each component patch's review. Separate patches may make this more reviewable, or not, but no comment was given on why it is better to merge the changes into a single patch. - Matthias
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent > wrote: > > > +that searches for multiple values together. Queries that use certain > > > +SQL constructs to search for rows matching any > > > value > > > +out of a list (or an array) of multiple scalar values might perform > > > +multiple primitive index scans (up to one primitive > > > scan > > > +per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > > > +for details. > > > > I don't think the "see for details" is > > correctly worded: The surrounding text implies that it would contain > > details about in which precise situations multiple primitive index > > scans would be consumed, while it only contains documentation about > > IN/NOT IN/ANY/ALL/SOME. > > > > Something like the following would fit better IMO: > > > > +that searches for multiple values together. Queries that use certain > > +SQL constructs to search for rows matching any value > > +out of a list or array of multiple scalar values (such as those > > described in > > +<xref linkend="functions-comparisons"/> might perform multiple primitive > > +index scans (up to one primitive scan per scalar value) at runtime. > > I think that there is supposed to be a closing parenthesis here? So > "... (such as those described in <xref linkend="functions-comparisons"/>) might > perform...". Correct. > FWM, if that's what you meant. WFM, yes? > > Then there is a second issue in the paragraph: Inverted indexes such > > as GIN might well decide to start counting more than one "primitive > > scan" per scalar value, [...] > I've described the issues in this area (in the docs) in a way that is > most consistent with historical conventions. That seems to have the > fewest problems, despite everything I've said about it. Clear enough, thank you for explaining your thoughts on this. 
> > > > All that really remains now is to research how we might integrate this > > > > work with the recently added continuescanPrechecked/haveFirstMatch > > > > stuff from Alexander Korotkov, if at all. > > > > > > The main change in v12 is that I've integrated both the > > > continuescanPrechecked and the haveFirstMatch optimizations. Both of > > > these fields are now page-level state, shared between the _bt_readpage > > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > > > appear next to the new home for _bt_checkkeys' continuescan field, in > > > the new page state struct). > > > > Cool. I'm planning to review the rest of this patch this > > week/tomorrow, could you take some time to review some of my btree > > patches, too? > > Okay, I'll take a look again. Thanks, greatly appreciated. > At one point Heikki suggested that I just get rid of > BTScanOpaqueData.arrayKeyData (which has been there for as long as > nbtree had native support for SAOPs), and use > BTScanOpaqueData.keyData exclusively instead. I've finally got around > to doing that now. I'm not sure if it was worth the reduced churn when the changes for the primary optimization are already 150+kB in size; every "small" addition increases the context needed to review the patch, and it's already quite complex. > These simplifications were enabled by my new approach within > _bt_preprocess_keys, described when I posted v12. v13 goes even > further than v12 did, by demoting _bt_preprocess_array_keys to a > helper function for _bt_preprocess_keys. That means that we do all of > our scan key preprocessing at the same time, at the start of _bt_first > (though only during the first _bt_first, or to be more precise during > the first _bt_first per btrescan). If we need fancier > preprocessing/normalization for arrays, then it ought to be a lot > easier with this structure. Agreed. 
> Note that we no longer need to have an independent representation of > so->qual_okay for array keys (the convention of setting > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > required). There is no longer any need for a separate pass to carry > over the contents of BTScanOpaqueData.arrayKeyData to > BTScanOpaqueData.keyData, which was confusing. I wasn't very confused by it, but sure. > Are you still interested in working directly on the preprocessing > stuff? If you mean my proposed change to merging two equality AOPs, then yes. I'll try to fit it in tomorrow with the rest of the review. > I have a feeling that I was slightly too optimistic about how > likely we were to be able to get away with not having certain kinds of > array preprocessing, back when I posted v12. It's true that the > propensity of the patch to not recognize "partial > redundancies/contradictions" hardly matters with redundant equalities, > but inequalities are another matter. I'm slightly worried about cases > like this one: > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > Maybe we need to do better with that. What do you think? Let me come back to that when I'm done reviewing the final
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan wrote: > > On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan wrote: > > Attached is v11, which now says something like that in the commit > > message. > > Attached is v12. Some initial comments on the documentation: > +that searches for multiple values together. Queries that use certain > +SQL constructs to search for rows matching any value > +out of a list (or an array) of multiple scalar values might perform > +multiple primitive index scans (up to one primitive scan > +per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > +for details. I don't think the "see for details" is correctly worded: The surrounding text implies that it would contain details about in which precise situations multiple primitive index scans would be consumed, while it only contains documentation about IN/NOT IN/ANY/ALL/SOME. Something like the following would fit better IMO: +that searches for multiple values together. Queries that use certain +SQL constructs to search for rows matching any value +out of a list or array of multiple scalar values (such as those described in +<xref linkend="functions-comparisons"/> might perform multiple primitive +index scans (up to one primitive scan per scalar value) at runtime. Then there is a second issue in the paragraph: Inverted indexes such as GIN might well decide to start counting more than one "primitive scan" per scalar value, because they may need to go through their internal structure more than once to produce results for a single scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a trigram index would likely use at least 4 descents here: one for each of "som", "ome", "wor", "ord". > > All that really remains now is to research how we might integrate this > > work with the recently added continuescanPrechecked/haveFirstMatch > > stuff from Alexander Korotkov, if at all. > > The main change in v12 is that I've integrated both the > continuescanPrechecked and the haveFirstMatch optimizations. 
Both of > these fields are now page-level state, shared between the _bt_readpage > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > appear next to the new home for _bt_checkkeys' continuescan field, in > the new page state struct). Cool. I'm planning to review the rest of this patch this week/tomorrow, could you take some time to review some of my btree patches, too? Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent wrote: > Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental > on top of your earlier set, and both don't allocate additional memory > in the merge operation in non-assertion builds. > > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. This patch fails some of my tests on non-assert builds only (assert builds pass all my tests, though). I'm using the first patch on its own here. While I tend to be relatively in favor of complicated assertions (I tend to think the risk of introducing side-effects is worth it), it looks like you're only performing certain steps in release builds. This is evident from just looking at the code (there is an #else block just for the release build in the loop). Note also that "Assert(_bt_compare_array_elements([merged_nelems++], orig, ) == 0)" has side-effects in assert-enabled builds only (it increments merged_nelems). While it's true that you *also* increment merged_nelems *outside* of the assertion (or in an #else block used during non-assert builds), that is conditioned on some other thing (so it's in no way equivalent to the debug #ifdef USE_ASSERT_CHECKING case). It's also just really hard to understand what's going on here. If I was going to do this kind of thing, I'd use two completely separate loops, that were obviously completely separate (maybe even two functions). I'd then memcmp() each array at the end. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent wrote: > On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan wrote: > And this is where I disagree with your (perceived) implicit argument > that this should be and always stay this way. I never said that, and didn't intend to imply it. As I outlined to you back in November, my general philosophy around APIs (such as the index AM API) is that ambiguity about the limits and extent of an abstract interface isn't necessarily a bad thing [1]. It can actually be a good thing! Ever hear of Hyrum's Law? Abstractions are very often quite leaky. APIs like the index AM API inevitably make trade-offs. Trade-offs are almost always political, in one way or another. Litigating every possible question up-front requires knowing ~everything before you really get started. This mostly ends up being a waste of time, since many theoretical contentious trade-offs just won't matter one bit in the long run, for one reason or another (not necessarily because there is only ever one consumer of an API, for all sorts of reasons). You don't have to agree with me. That's just what my experience indicates works best on average, and in the long run. I cannot justify it any further than that. > By keeping that expectation alive, > this becomes a self-fulfilling prophecy, and I really don't like such > restrictive self-fulfilling prophecies. It's nice that we have index > AM feature flags, but if we only effectively allow one implementation > for this set of flags by ignoring potential other users when changing > behavior, then what is the API good for (apart from abstraction > layering, which is nice)? I explicitly made it clear that I don't mean that. > I think that may be right, but I could very well have built a > btree-lite that doesn't have the historical baggage of having to deal > with pages from before v12 (version 4) and some improvements that > haven't made it to core yet. Let me know if you ever do that. 
Let me know what the problems you encounter are. I'm quite happy to revise my position on this in light of new information. I change my mind all the time! > Something like the following, maybe? > > """ > Compatibility: The planner will now generate paths with array scan > keys in any column for ordered indexes, rather than on only a prefix > of the index columns. The planner still expects fully ordered data > from those indexes. > Historically, the btree AM couldn't output correctly ordered data for > suffix array scan keys, which was "fixed" by prohibiting the planner > from generating array scan keys without an equality prefix scan key up > to that attribute. In this commit, the issue has been fixed, and the > planner restriction has thus been removed as the only in-core IndexAM > that reports amcanorder now supports array scan keys on any column > regardless of what prefix scan keys it has. > """ Even if I put something like this in the commit message, I doubt that Bruce would pick it up in anything like this form (I have my doubts about it happening no matter what wording is used, actually). I could include something less verbose, mentioning a theoretical risk to out-of-core amcanorder routines that coevolved with nbtree, inherited the same SAOP limitations, and then never got the same set of fixes. > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. I'll think about this some more. 
> patch-b is an improved version of patch-a with reduced time complexity > in some cases: It applies exponential search to reduce work done where > there are large runs of unmatched values in either array, and does > that while keeping O(n+m) worst-case complexity, now getting a > best-case O(log(n)) complexity. This patch seems massively over-engineered, though. Over 200 new lines of code, all for a case that is only used when queries are written with obviously-contradictory quals? It's just not worth it. > > The recursive call to _bt_advance_array_keys is needed to deal with > > unsatisfied-and-required inequalities that were not detected by an > > initial call to _bt_check_compare, following a second retry call to > > _bt_check_compare. We know that this recursive call to > > _bt_advance_array_keys cannot recurse again because we've > > already identified which specific inequality scan key it is that's a > > still-unsatisfied inequality, following an initial round of array key > > advancement. > > So, it's a case of this? > > scankey: a > 1 AND a = ANY (1 (=current), 2, 3) AND b < 4 AND b = ANY > (2 (=current), 3, 4) > tuple: a=2, b=4 I don't understand why your example starts with "scankey: a > 1" and uses redundant/contradictory scan keys (for both "a" and "b"). For a
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan wrote: > Thank you for your replies so far. > On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent > wrote: > > I would agree with you if this was about new APIs and features, but > > here existing APIs are being repurposed without changing them. A > > maintainer of index AMs would not look at the commit title 'Enhance > > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > > my existing amsearcharray+amcanorder handling actually supports > > non-prefix array keys and return data in index order". > > This is getting ridiculous. > > It is quite likely that there are exactly zero affected out-of-core > index AMs. I don't count pgroonga as a counterexample (I don't think > that it actually fulfills the definition of a ). Basically, > "amcanorder" index AMs more or less promise to be compatible with > nbtree, down to having the same strategy numbers. So the idea that I'm > going to upset amsearcharray+amcanorder index AM authors is a > completely theoretical problem. The planner code evolved with nbtree, > hand-in-glove. And this is where I disagree with your (perceived) implicit argument that this should be and always stay this way. I don't mind changes in the planner for nbtree per se, but as I've mentioned before in other places, I really don't like how we handle amcanorder as if it were amisbtree. But it's not called that, so we shouldn't expect that to remain the case; and definitely not keep building on those expectations that it always is going to be the nbtree when amcanorder is set (or amsearcharray is set, or ..., or any combination of those that is currently used by btree). By keeping that expectation alive, this becomes a self-fulfilling prophecy, and I really don't like such restrictive self-fulfilling prophecies. 
It's nice that we have index AM feature flags, but if we only effectively allow one implementation for this set of flags by ignoring potential other users when changing behavior, then what is the API good for (apart from abstraction layering, which is nice)? /rant I'll see about the > I'm more than happy to add a reminder to the commit message about > adding a proforma listing to the compatibility section of the Postgres > 17 release notes. Can we actually agree on which theoretical index AM > types are affected first, though? Isn't it actually > amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I > have that right? I think that may be right, but I could very well have built a btree-lite that doesn't have the historical baggage of having to deal with pages from before v12 (version 4) and some improvements that haven't made it to core yet. > BTW, how should we phrase this compatibility note, so that it can be > understood? It's rather awkward. Something like the following, maybe? """ Compatibility: The planner will now generate paths with array scan keys in any column for ordered indexes, rather than on only a prefix of the index columns. The planner still expects fully ordered data from those indexes. Historically, the btree AM couldn't output correctly ordered data for suffix array scan keys, which was "fixed" by prohibiting the planner from generating array scan keys without an equality prefix scan key up to that attribute. In this commit, the issue has been fixed, and the planner restriction has thus been removed as the only in-core IndexAM that reports amcanorder now supports array scan keys on any column regardless of what prefix scan keys it has. """ > > > As I said to Heikki, thinking about this some more is on my todo list. > > > I mean the way that this worked had substantial revisions on HEAD > > > right before I posted v9. v9 was only to fix the bit rot that that > > > caused. 
> > > > Then I'll be waiting for that TODO item to be resolved. > > My TODO item is to resolve the question of whether and to what extent > these two optimizations should be combined. It's possible that I'll > decide that it isn't worth it, for whatever reason. That's fine, this decision (and any work related to it) is exactly what I was referring to with that mention of the TODO. > > > Even single digit > > > thousand of array elements should be considered huge. Plus this code > > > path is only hit with poorly written queries. > > > > Would you accept suggestions? > > You mean that you want to have a go at it yourself? Sure, go ahead. > > I cannot promise that I'll accept your suggested revisions, but if > they aren't too invasive/complicated compared to what I have now, then > I will just accept them. I maintain that this isn't really necessary, > but it might be the path of least resistance at this point. Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental on top of your earlier set, and both don't allocate additional memory in the merge operation in non-assertion builds. patch-a is a trivial and clean implementation of mergesort, which tends to reduce the total number of
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent wrote: > > I'm not going to break out the planner changes, because they're *not* > > independent in any way. > > The planner changes depend on the btree changes, that I agree with. > However, I don't think that the btree changes depend on the planner > changes. Yeah, technically it would be possible to break out the indxpath.c changes -- that wouldn't leave the tree in a state with visibly-wrong behavior at any point. But that's far from the only thing that matters. If that was the standard that we applied, then I might have to split the patch into 10 or more patches. What it boils down to is this: it is totally natural for me to think of the planner changes as integral to the nbtree changes, so I'm going to include them together in one commit. That's just how the code was written -- I always thought about it as one single thing. The majority of the queries that I've used to promote the patch directly rely on the planner changes in one way or another (even back in v1, when the planner side of things had lots of kludges). I don't necessarily always follow that standard. Sometimes it is useful to further break things up. Like when it happens to make the high-level division of labor a little bit clearer. The example that comes to mind is commit dd299df8 and commit fab25024. These nbtree commits were essentially one piece of work that was broken into two, but committed within minutes of each other, and left the tree in a momentary state that was not-very-useful (though still correct). That made sense to me at the time because the code in question was mechanically distinct, while at the same time modifying some of the same nbtree files. Drawing a line under it seemed to make sense. I will admit that there is some amount of subjective judgement (gasp!) here. 
Plus I'll acknowledge that it's *slightly* odd that the most compelling cases for the SAOP patch almost all involve savings in heap page accesses, even though it is fundamentally a B-Tree patch. But that's just how it came out. Slightly odd things happen all the time. > I would agree with you if this was about new APIs and features, but > here existing APIs are being repurposed without changing them. A > maintainer of index AMs would not look at the commit title 'Enhance > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > my existing amsearcharray+amcanorder handling actually supports > non-prefix arrays keys and return data in index order". This is getting ridiculous. It is quite likely that there are exactly zero affected out-of-core index AMs. I don't count pgroonga as a counterexample (I don't think that it actually fulfills the definition). Basically, "amcanorder" index AMs more or less promise to be compatible with nbtree, down to having the same strategy numbers. So the idea that I'm going to upset amsearcharray+amcanorder index AM authors is a completely theoretical problem. The planner code evolved with nbtree, hand-in-glove. I'm more than happy to add a reminder to the commit message about adding a proforma listing to the compatibility section of the Postgres 17 release notes. Can we actually agree on which theoretical index AM types are affected first, though? Isn't it actually amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I have that right? BTW, how should we phrase this compatibility note, so that it can be understood? It's rather awkward. > > As I said to Heikki, thinking about this some more is on my todo list. > > I mean the way that this worked had substantial revisions on HEAD > > right before I posted v9. v9 was only to fix the bit rot that that > > caused. > Then I'll be waiting for that TODO item to be resolved.
My TODO item is to resolve the question of whether and to what extent these two optimizations should be combined. It's possible that I'll decide that it isn't worth it, for whatever reason. At this point I'm still totally neutral. > > Even single digit > > thousand of array elements should be considered huge. Plus this code > > path is only hit with poorly written queries. > > Would you accept suggestions? You mean that you want to have a go at it yourself? Sure, go ahead. I cannot promise that I'll accept your suggested revisions, but if they aren't too invasive/complicated compared to what I have now, then I will just accept them. I maintain that this isn't really necessary, but it might be the path of least resistance at this point. > > > Please fix this, as it shows up in profiling of large array merges. > > > Additionally, as it merges two arrays of unique items into one, > > > storing only matching entries, I feel that it is quite wasteful to do > > > this additional allocation here. Why not reuse the original allocation > > > immediately? > > > > Because that's adding more complexity for a code path that will hardly > > ever be used in practice. For something that happens once, during a > >
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan wrote: > > On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent > wrote: > > Can you pull these planner changes into their own commit(s)? > > As mentioned upthread, it's a significant change in behavior that > > should have separate consideration and reference in the commit log. I > > really don't think it should be buried in the 5th paragraph of an > > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > > changes of btree are arguably independent of the planner changes, as > > the btree changes improve performance even if we ignore that it > > implements strict result ordering. > > I'm not going to break out the planner changes, because they're *not* > independent in any way. The planner changes depend on the btree changes, that I agree with. However, I don't think that the btree changes depend on the planner changes. > You could say the same thing about practically > any work that changes the planner. They're "buried" in the 5th > paragraph of the commit message. If an interested party can't even > read that far to gain some understanding of a legitimately complicated > piece of work such as this, I'm okay with that. I would agree with you if this was about new APIs and features, but here existing APIs are being repurposed without changing them. A maintainer of index AMs would not look at the commit title 'Enhance nbtree ScalarArrayOp execution' and think "oh, now I have to make sure my existing amsearcharray+amcanorder handling actually supports non-prefix arrays keys and return data in index order". There are also no changes in amapi.h that would signal any index AM author that expectations have changed. I really don't think you can just ignore all that, and I believe this to also be the basis of Heikki's first comment. > As I said to Heikki, thinking about this some more is on my todo list. > I mean the way that this worked had substantial revisions on HEAD > right before I posted v9. 
v9 was only to fix the bit rot that that > caused. Then I'll be waiting for that TODO item to be resolved. > > > +++ b/src/backend/access/nbtree/nbtutils.c > > > +/* > > > + * _bt_merge_arrays() -- merge together duplicate array keys > > > + * > > > + * Both scan keys have array elements that have already been sorted and > > > + * deduplicated. > > > + */ > > > > As I mentioned upthread, I find this function to be very wasteful, as > > it uses N binary searches to merge join two already sorted arrays, > > resulting in a O(n log(m)) complexity, whereas a normal merge join > > should be O(n + m) once the input datasets are sorted. > > And as I mentioned upthread, I think that you're making a mountain out > of a molehill here. This is not a merge join. We're merging two sorted sets and want to retain only the matching entries, while retaining the order of the data. AFAIK the best algorithm available for this is a sort-merge join. Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue. > Even single digit > thousand of array elements should be considered huge. Plus this code > path is only hit with poorly written queries. Would you accept suggestions? > > Please fix this, as it shows up in profiling of large array merges. > > Additionally, as it merges two arrays of unique items into one, > > storing only matching entries, I feel that it is quite wasteful to do > > this additional allocation here. Why not reuse the original allocation > > immediately? > > Because that's adding more complexity for a code path that will hardly > ever be used in practice. For something that happens once, during a > preprocessing step. Or many times, when we're in a parameterized loop, as was also pointed out by Heikki. 
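To illustrate the point being made here (this is a non-authoritative sketch, not the actual nbtree code): a sort-merge intersection of two already-sorted, deduplicated arrays touches each element at most once, giving O(n + m), and the matches can be written back into the first array so that no additional allocation is needed. The function name and signature here are hypothetical.

```c
#include <stddef.h>

/*
 * Toy sketch: intersect two sorted, deduplicated int arrays in O(n + m),
 * writing the matching elements back into "a" (in order) so that the
 * original allocation is reused.  Returns the number of matches.
 */
static size_t
merge_intersect(int *a, size_t n, const int *b, size_t m)
{
    size_t      i = 0,
                j = 0,
                out = 0;

    while (i < n && j < m)
    {
        if (a[i] < b[j])
            i++;                    /* a[i] has no partner in b */
        else if (a[i] > b[j])
            j++;                    /* b[j] has no partner in a */
        else
        {
            a[out++] = a[i];        /* match: keep it; out <= i, so safe */
            i++;
            j++;
        }
    }
    return out;
}
```

Compare with N binary searches, which is O(n log m): for arrays of similar size the linear merge does strictly less comparison work once both inputs are sorted.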
While I do think it is rare, the existence of this path that merges these arrays implies the need for merging these arrays, which thus > > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > > > + IndexTuple tuple, int sktrig, bool > > > validtrig) > > > > I don't quite understand what the 'validtrig' argument is used for. > > There is an assertion that it is false under some conditions in this > > code, but it's not at all clear to me why that would have to be the > > case - it is called with `true` in one of the three call sites. Could > > the meaning of this be clarified? > > Sure, I'll add a comment. Thanks! > > I also feel that this patch includes several optimizations such as > > this sktrig argument which aren't easy to understand. Could you pull > > that into a separately reviewable patch? > > It probably makes sense to add the extra preprocessing stuff out into > its own commit, since I tend to agree that that's an optimization that > can be treated as unrelated (and isn't essential to the main thrust of > the patch). > > However, the sktrig thing isn't really like that. We need to do things > that way for required inequality scan keys. It doesn't make sense to
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent wrote: > Can you pull these planner changes into their own commit(s)? > As mentioned upthread, it's a significant change in behavior that > should have separate consideration and reference in the commit log. I > really don't think it should be buried in the 5th paragraph of an > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > changes of btree are arguably independent of the planner changes, as > the btree changes improve performance even if we ignore that it > implements strict result ordering. I'm not going to break out the planner changes, because they're *not* independent in any way. You could say the same thing about practically any work that changes the planner. They're "buried" in the 5th paragraph of the commit message. If an interested party can't even read that far to gain some understanding of a legitimately complicated piece of work such as this, I'm okay with that. > An independent thought when reviewing the finaltup / HIKEY scan > optimization parts of this patch: > The 'use highkey to check for next page's data range' optimization is > useful, but can currently only be used for scans that go to the right. > Could we use a similar optimization in the inverse direction if we > marked the right page of a split with "the left page has a HIKEY based > on this page's (un)truncated leftmost value" or "left page's HIKEY was > truncated to N key attributes"? It'd give one bit of information, > specifically that it does (or does not) share some (or all) key > attributes with this page's minimum value, which allows for earlier > scan termination in boundary cases. That would have to be maintained in all sorts of different places in nbtree. And could be broken at any time by somebody inserting a non-pivot tuple before every existing one on the leaf page. Doesn't seem worth it to me. If we were to do something like this then it would be discussed as independent work. 
It's akin to adding a low key, which could be used in several different places. As you say, it's totally independent. > > +++ b/src/include/access/nbtree.h > > +#define SK_BT_RDDNARRAY 0x0004 /* redundant in array preprocessing */ I think "made redundant" is more appropriate than just "redundant"; the array key is not itself redundant in array preprocessing (indeed we actually work hard on that key during preprocessing to allow us to mark it redundant) Meh. I did it that way to fit under 78 chars while staying on the same line. I don't think that it matters. > I think this was mentioned before, but can't we have an argument or > version of _bt_checkkeys that makes it not advance the array keys, so > that we can keep this optimization? Or, isn't that now > _bt_check_compare? As I said to Heikki, thinking about this some more is on my todo list. I mean the way that this worked had substantial revisions on HEAD right before I posted v9. v9 was only to fix the bit rot that that caused. > For an orderlines table with 1000s of lines per order, we would > greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that > handles scan keys more efficiently; the optimization which is being > disabled here. The way that these optimizations might work with the mechanism from the patch isn't some kind of natural extension to what's there already. We'll need new heuristics to not waste cycles. Applying all of the optimizations together just isn't trivial, and it's not yet clear what really makes sense. Combining the two optimizations more or less adds another dimension of complexity. > > +++ b/src/backend/access/nbtree/nbtutils.c > > _bt_preprocess_array_keys(IndexScanDesc scan) > The _bt_preprocess_array_keys code is now broken due to type > confusion, as it assumes there is only one array subtype being > compared per attribute in so.orderProcs.
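To make the type confusion just quoted concrete (a hypothetical stand-in, not PostgreSQL's actual comparator code): a smallint comparator only ever looks at 16 bits, so applying it to int values that are equal mod 2^16 makes them wrongly compare as equal.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the failure mode: an int16 ("smallint")
 * comparator applied to int32 ("int") values.  The comparator only sees
 * the low-order 16 bits, so values that are equal mod 2^16 -- such as
 * 1 and 65537 -- incorrectly compare as equal.
 */
static int
cmp_as_smallint(int32_t lhs, int32_t rhs)
{
    int16_t     a = (int16_t) lhs;  /* truncation: 65537 becomes 1 */
    int16_t     b = (int16_t) rhs;

    return (a > b) - (a < b);
}
```

This is why one compare proc per index attribute isn't enough once the same attribute has array keys with different sk_subtype values.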
I've been aware of this for some time, but didn't think that it was worth bringing up before I had a solution to present... > I'm also concerned about the implications of this in > _bt_binsrch_array_skey, as this too assumes the same compare operator > is always used for all array operations on each attribute. We may need > one so->orderProcs entry for each array key, but at least one per > sk_subtype of each array key. ...since right now it's convenient to make so->orderProcs have a 1:1 correspondence with index key columns > I also notice that the merging of values doesn't seem to be applied > optimally with mixed typed array operations: num = int[] AND num = > bigint[] AND num = int[] doesn't seem to merge the first and last > array ops. I'm also concerned about being (un)able to merge ...which ought to work and be robust, once the cross-type support is in place. That is, it should work once we really can assume that there really must be exactly one so->orderProcs entry per equality-strategy scan key in all cases -- including in highly obscure corner cases involving a mix of cross-type
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 28 Dec 2023 at 18:28, Peter Geoghegan wrote: > > On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan wrote: > > Attached is v8, which pretty much rips all of this stuff out. > > Attached is v9, which I'm posting just to fix bitrot. The patch > stopped cleanly applying against HEAD due to recent bugfix commit > 7e6fb5da. No real changes here compared to v8. I found 2 major issues; one correctness issue in the arraykey processing code of _bt_preprocess_array_keys, and one assertion error in _bt_advance_array_keys; both discussed in the relevant sections below. > Subject: [PATCH v9] Enhance nbtree ScalarArrayOp execution. > [...] > Bugfix commit 807a40c5 taught the planner to avoid generating unsafe > path keys: path keys on a multicolumn index path, with a SAOP clause on > any attribute beyond the first/most significant attribute. These cases > are now all safe, so we go back to generating path keys without regard > for the presence of SAOP clauses (just like with any other clause type). > Also undo changes from follow-up bugfix commit a4523c5a, which taught > the planner to produce alternative index paths without any low-order > ScalarArrayOpExpr quals (making the SAOP quals into filter quals). > We'll no longer generate these alternative paths, which can no longer > offer any advantage over the index qual paths that we do still generate. Can you pull these planner changes into their own commit(s)? As mentioned upthread, it's a significant change in behavior that should have separate consideration and reference in the commit log. I really don't think it should be buried in the 5th paragraph of an "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the changes of btree are arguably independent of the planner changes, as the btree changes improve performance even if we ignore that it implements strict result ordering. 
An independent thought when reviewing the finaltup / HIKEY scan optimization parts of this patch: The 'use highkey to check for next page's data range' optimization is useful, but can currently only be used for scans that go to the right. Could we use a similar optimization in the inverse direction if we marked the right page of a split with "the left page has a HIKEY based on this page's (un)truncated leftmost value" or "left page's HIKEY was truncated to N key attributes"? It'd give one bit of information, specifically that it does (or does not) share some (or all) key attributes with this page's minimum value, which allows for earlier scan termination in boundary cases. > +++ b/src/include/access/nbtree.h > +#define SK_BT_RDDNARRAY 0x0004 /* redundant in array preprocessing */ I think "made redundant" is more appropriate than just "redundant"; the array key is not itself redundant in array preprocessing (indeed we actually work hard on that key during preprocessing to allow us to mark it redundant) > +++ b/src/backend/access/nbtree/nbtsearch.c > * We skip this for the first page in the scan to evade the possible > - * slowdown of the point queries. > + * slowdown of point queries. Never apply the optimization with a scan > + * that uses array keys, either, since that breaks certain assumptions. > + * (Our search-type scan keys change whenever _bt_checkkeys advances the > + * arrays, invalidating any precheck. Tracking all that would be > tricky.) I think this was mentioned before, but can't we have an argument or version of _bt_checkkeys that makes it not advance the array keys, so that we can keep this optimization? Or, isn't that now _bt_check_compare? For an orderlines table with 1000s of lines per order, we would greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that handles scan keys more efficiently; the optimization which is being disabled here.
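For readers without the context: the precheck being discussed amounts to testing a page's last tuple first -- on a forward scan with a required key like "x <= bound" over a sorted page, if the last tuple passes, every earlier tuple must too, so the per-tuple key checks can be skipped for the whole page. A toy model of that idea, with hypothetical names (not the actual _bt_readpage code):

```c
#include <stddef.h>

/*
 * Toy model of the page-level precheck: "page" holds a leaf page's key
 * values in ascending order, and the scan has one required key of the
 * form "x <= bound".  If the page's last (largest) value passes, the
 * whole page matches and no per-tuple checks are needed.  "ncompares"
 * reports how many comparisons were performed.
 */
static int
count_matches(const int *page, size_t ntuples, int bound, size_t *ncompares)
{
    *ncompares = 1;                 /* the precheck itself */
    if (page[ntuples - 1] <= bound)
        return (int) ntuples;       /* whole page matches; skip per-tuple work */

    /* precheck failed: fall back to checking tuples one by one */
    int         matches = 0;

    for (size_t i = 0; i < ntuples; i++)
    {
        (*ncompares)++;
        if (page[i] <= bound)
            matches++;
    }
    return matches;
}
```

The tension described above is that array keys change mid-scan as _bt_checkkeys advances the arrays, which invalidates any such page-level precheck unless the two mechanisms are explicitly coordinated.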
> +++ b/src/backend/access/nbtree/nbtutils.c
> _bt_preprocess_array_keys(IndexScanDesc scan)
The _bt_preprocess_array_keys code is now broken due to type confusion, as it assumes there is only one array subtype being compared per attribute in so.orderProcs. Reproducer:

CREATE TABLE test AS SELECT generate_series(1, 1, 1::bigint) num;
CREATE INDEX ON test (num); /* bigint typed */
SELECT num FROM test
WHERE num = ANY ('{1}'::smallint[])
  AND num = ANY ('{1}'::int[]) /* ensure matching lastEqualityArrayAtt,
                                  lastOrderProc for next qual */
  AND num = ANY ('{65537}'::int[]); /* qual is broken due to application of
                                       smallint compare operator on int values
                                       that do equal mod 2^16, but do not
                                       equal in their own type */
 num
-----
   1

I'm also concerned about the implications of this in _bt_binsrch_array_skey, as this too assumes the same compare operator is always used for all array operations on each attribute. We may need one so->orderProcs entry for each array key, but at least one per sk_subtype of each array key. I also notice that the merging of values doesn't seem to be applied optimally with mixed typed array operations: num = int[] AND num = bigint[] AND num =
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Dec 4, 2023 at 7:25 PM Peter Geoghegan wrote: > 2. The optimization that has _bt_checkkeys skip non-required scan keys > that are *only* required in the direction *opposite* the current scan > direction -- this can work even without any precheck from > _bt_readpage. > > Note that this second optimization relies on various behaviors in > _bt_first() that make it impossible for _bt_checkkeys() to ever see a > tuple that could fail to satisfy such a scan key -- we must always > have passed over non-matching tuples, thanks to _bt_first(). That > presents my patch with a problem: the logic for determining whether or > not we need a new primitive index scan only promises to never require > the scan to grovel through many leaf pages that _bt_first() could and > should just skip over instead. This new logic makes no promises about > skipping over small numbers of tuples. So it's possible that > _bt_checkkeys() will see a handful of tuples "after the end of the > _bt_first-wise primitive index scan", but "before the _bt_first-wise > start of the next would-be primitive index scan". BTW, I have my doubts about this actually being correct without the patch. The following comment block appears above _bt_preprocess_keys: * Note that one reason we need direction-sensitive required-key flags is * precisely that we may not be able to eliminate redundant keys. Suppose * we have "x > 4::int AND x > 10::bigint", and we are unable to determine * which key is more restrictive for lack of a suitable cross-type operator. * _bt_first will arbitrarily pick one of the keys to do the initial * positioning with. If it picks x > 4, then the x > 10 condition will fail * until we reach index entries > 10; but we can't stop the scan just because * x > 10 is failing. On the other hand, if we are scanning backwards, then * failure of either key is indeed enough to stop the scan. 
(In general, when * inequality keys are present, the initial-positioning code only promises to * position before the first possible match, not exactly at the first match, * for a forward scan; or after the last match for a backward scan.) As I understand it, this might still be okay, because the optimization in question from Alexander's commit e0b1ee17 (what I've called optimization 2) is careful about NULLs, which were the one case that definitely had problems. Note that IS NOT NULL works kind of like WHERE foo < NULL here (see old bug fix commit 882368e8, "Fix btree stop-at-nulls logic properly", for more context on this NULLs behavior). In any case, my patch isn't compatible with "optimization 2" (as in my tests break in a rather obvious way) due to a behavior that these old comments claim is normal within any scan (or perhaps normal in any scan with scan keys that couldn't be deemed redundant due to a lack of cross-type support in the opfamily). Something has to be wrong here -- could just be the comment, I suppose. But I find it easy to believe that Alexander's commit e0b1ee17 might not have been properly tested with opfamilies that lack a suitable cross-type operator. That's a pretty niche thing. (My patch doesn't need that niche thing to be present to easily break when combined with "optimization 2", which could hint at an existing and far more subtle problem.) -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Nov 27, 2023 at 5:39 AM Heikki Linnakangas wrote: > - +1 on the general idea. Hard to see any downsides if implemented right. Glad you think so. The "no possible downside" perspective is one that the planner sort of relies on, so this isn't just a nice-to-have -- it might actually be a condition of committing the patch. It's important that the planner can be very aggressive about using SAOP index quals, without us suffering any real downside at execution time. > - This changes the meaning of amsearcharray==true to mean that the > ordering is preserved with ScalarArrayOps, right? You change B-tree to > make that true, but what about any out-of-tree index AM extensions? I > don't know if any such extensions exist, and I don't think we should > jump through any hoops to preserve backwards compatibility here, but > probably deserves a notice in the release notes if nothing else. My interpretation is that the planner changes affect amcanorder + amsearcharray index AMs, but have no impact on mere amsearcharray index AMs. If anything this is a step *away* from knowing about nbtree implementation details in the planner (though the planner's definition of amcanorder is very close to the behavior from nbtree, down to things like knowing all about nbtree strategy numbers). The planner changes from the patch are all subtractive -- I'm removing kludges that were added by bug fix commits. Things that weren't in the original feature commit at all. I used the term "my interpretation" here because it seems hard to think of this in abstract terms, and to write a compatibility note for this imaginary audience. I'm happy to go along with whatever you want, though. Perhaps you can suggest a wording for this? > - You use the term "primitive index scan" a lot, but it's not clear to > me what it means. Does one ScalarArrayOps turn into one "primitive index > scan"? Or each element in the array turns into a separate primitive > index scan? Or something in between? 
Maybe add a new section to the > README to explain how that works. The term primitive index scan refers to the thing that happens each time _bt_first() is called -- with and without the patch. In other words, it's what happens when pg_stat_all_indexes.idx_scan is incremented. You could argue that that's not quite the right thing to be focussing on, with this new design. But it has precedent going for it. As I said, it's the thing that pg_stat_all_indexes.idx_scan counts, which is a pretty exact and tangible thing. So it's consistent with historical practice, but also with what other index AMs do when executing ScalarArrayOps non-natively. > - _bt_preprocess_array_keys() is called for each btrescan(). It performs > a lot of work like cmp function lookups and deconstructing and merging > the arrays, even if none of the SAOP keys change in the rescan. That > could make queries with nested loop joins like this slower than before: > "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 > and tenk1.two IN (1,2);". But that's nothing new. _bt_preprocess_array_keys() isn't a new function, and the way that it's called isn't new in any way. That said, I certainly agree that we should be worried about any added overhead in _bt_first for nested loop joins with an inner index scan. In my experience that can be an important issue. I actually have a TODO item about this already. It needs to be included in my work on performance validation, on general principle. > - nbtutils.c is pretty large now. Perhaps create a new file > nbtpreprocesskeys.c or something? Let me get back to you on this. > - You and Matthias talked about an implicit state machine. I wonder if > this could be refactored to have a more explicit state machine. The > state transitions and interactions between _bt_checkkeys(), > _bt_advance_array_keys() and friends feel complicated. I agree that it's complicated. That's the main problem that the patch has, by far. 
It used to be even more complicated, but it's hard to see a way to make it a lot simpler at this point. If you can think of a way to simplify it then I'll definitely give it a go. Can you elaborate on "more explicit state machine"? That seems like it could have value by adding more invariants, and making things a bit more explicit in one or two areas. It could also help us to verify that they hold from assertions. But that isn't really the same thing as simplification. I wouldn't use that word, at least. > > + > > + > > +Every time an index is searched, the index's > > + > > pg_stat_all_indexes.idx_scan > > +field is incremented. This usually happens once per index scan node > > +execution, but might take place several times during execution of a > > scan > > +that searches for multiple values together. Only queries that use > > certain > > +SQL constructs to search for rows matching any value > > +out of a list (or an array) of multiple scalar values are affected. > > See > > +
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 28, 2023 at 3:52 PM Peter Geoghegan wrote: > While I still think that we need heuristics that apply speculative > criteria to decide whether or not to go to the next page directly > (when we have a choice at all), that doesn't mean that the v7 > heuristics can't be improved on, with a little more thought. It's a > bit tricky, since we're probably also benefiting from the same > heuristics all the time -- probably even for this same test case. Correction: this particular test case happens to be one where the optimal strategy is to do *exactly* what the master branch does currently. The master branch is unbeatable, so the only reasonable goal for the patch is to not lose (or to lose by no more than a completely negligible amount). I'm now prepared to say that this behavior is not okay -- I definitely need to fix this. It's a bug. Because each distinct value never fits on one leaf page (it's more like 1.5 - 2 pages, even though we're deduplicating heavily), and because Postgres 12 optimizations are so effective with low cardinality/posting-list-heavy indexes such as this, we're bound to lose quite often. The only reason it doesn't happen _every single time_ we descend the index is because the test script uses CREATE INDEX, rather than using retail inserts (I tend to prefer the latter for this sort of analysis). Since nbtsort.c isn't as clever/aggressive about suffix truncation as the nbtsplitloc.c split strategies would have been, had we used them (had there been retail inserts), many individual leaf pages are left with high keys that aren't particularly good targets for the high key precheck optimization (see Postgres 12 commit 29b64d1d). If I wanted to produce a truly adversarial case for this issue (which this is already close to), I'd go with the following: 1. 
Retail inserts that leave each leaf page full of one single value, which will allow each high key to still make a "clean break" from the right sibling page -- it'll have the right sibling's value. Maybe insert 1200 - 1300 tuples per distinct index value for this. In other words, bulk loading that results in an index that never has to append a heap TID tiebreaker during suffix truncation, but comes very close to needing to. Bulk loading where nbtsplitloc.c needs to use SPLIT_MANY_DUPLICATES all the time, but never quite gets to the point of needing a SPLIT_SINGLE_VALUE split. 2. A SAOP query with an array with every second value in the index as an element. Something like "WHERE arr in (2, 4, 6, 8, ...)". The patch will read every single leaf page, whereas master will *reliably* only read every second leaf page. I didn't need to "trick" the patch in a contrived sort of way to get this bad outcome -- this scenario is fairly realistic. So this behavior is definitely not something that I'm prepared to defend. As I said, it's a bug. It'll be fixed in the next revision. -- Peter Geoghegan
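The arithmetic behind "every single leaf page vs. every second leaf page" can be sketched with a toy counting model (hypothetical names; internal-page reads and descent costs are deliberately ignored):

```c
/*
 * Toy cost model for the adversarial case described above: npages leaf
 * pages, one distinct value per page, and a SAOP array containing every
 * second value (2, 4, 6, ..., npages).  Re-descending the index for each
 * array key (master's behavior) touches only the matching leaf pages;
 * a scan that instead keeps stepping right to reach the next key's page
 * also reads every non-matching page in between.
 */
static void
leaf_page_reads(int npages, int *redescend_reads, int *step_right_reads)
{
    int         nkeys = npages / 2; /* array keys 2, 4, ..., npages */

    *redescend_reads = nkeys;       /* one matching leaf page per key */
    *step_right_reads = npages - 1; /* pages 2 through npages, inclusive */
}
```

So for any sizeable index, stepping right here does roughly twice the leaf-page work of re-descending per key, which is the "reliably only read every second leaf page" behavior of master that the revised heuristics need to preserve.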
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 28, 2023 at 9:19 AM Peter Geoghegan wrote: > > I'm not convinced this is a problem we have to solve. It's possible it > > only affects cases that are implausible in practice (the script forces a > > particular scan type, and maybe it would not be picked in practice). But > > maybe it's fixable ... > > I would expect the patch to do quite well (relative to what is > actually possible) on cases like the two extremes that I've focussed > on so far. It seems possible that it will do less well on cases that > are somewhere in the middle (that also have lots of distinct values on > each page). Actually, I think that it's more likely that the problems that you saw are related to low cardinality data, which seems like it might not be a great fit for the heuristics that the patch uses to decide whether to continue the ongoing primitive index scan, or start a new one instead. I'm referring to the heuristics I describe here: https://postgr.es/m/CAH2-WzmTHoCsOmSgLg=yyft9loertuckxyg2gzn+28pzonf...@mail.gmail.com The patch itself discusses these heuristics in a large comment block after the point that _bt_checkkeys() calls _bt_advance_array_keys(). I hardly paid any attention to low cardinality data in my performance validation work -- it was almost always indexes that had few or no duplicates (just pgbench_accounts if we're talking pure stress-tests), just because those are more complicated, and so seemed more important. I'm not quite prepared to say that there is definitely a problem here, right this minute, but if there was then it wouldn't be terribly surprising (the problems are usually wherever it is that I didn't look for them). Attached is a sample of my debug instrumentation for one such query, based on running the test script that Tomas posted -- thanks for writing this script, Tomas (I'll use it as the basis for some of my own performance validation work going forward). 
I don't mind sharing the patch that outputs this stuff if anybody is interested (it's kind of a monstrosity, so I'm disinclined to post it with the patch until I have a reason). Even without this instrumentation, you can get some idea of the kinds of issues I'm talking about just by viewing EXPLAIN ANALYZE output for a bitmap index scan -- that breaks out the index page accesses separately, which is a number that we should expect to remain lower than what the master branch shows in approximately all cases. While I still think that we need heuristics that apply speculative criteria to decide whether or not to go to the next page directly (when we have a choice at all), that doesn't mean that the v7 heuristics can't be improved on, with a little more thought. It's a bit tricky, since we're probably also benefiting from the same heuristics all the time -- probably even for this same test case. We do lose against the master branch, on balance, and by enough to concern me, though. (I don't want to promise that it'll never happen at all, but it should be very limited, which this wasn't.) I didn't bother to ascertain how much longer it takes to execute the query, since that question seems rather beside the point. The important thing to me is whether or not this behavior actually makes sense, all things considered, and what exactly can be done about it if it doesn't make sense. I will need to think about this some more. This is just a status update. Thanks -- Peter Geoghegan index_walk_dump.txt.bz2 Description: BZip2 compressed data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 28, 2023 at 4:29 AM Tomas Vondra wrote: > I haven't looked at the code, but I decided to do a bit of blackbox perf > and stress testing, to get some feeling of what to expect in terms of > performance improvements, and see if there happen to be some unexpected > regressions. Attached is a couple simple bash scripts doing a > brute-force test with tables of different size / data distribution, > number of values in the SAOP expression, etc. My own stress-testing has focussed on the two obvious extremes for this patch, using variants of pgbench with SAOP SELECTs on pgbench_accounts: 1. The case where there is almost no chance of finding any two index tuples together on the same page, because the constants are completely random. This workload makes the patch's attempts at "coalescing together" index page reads pure overhead, with no possible benefit. Obviously that's a cost that needs to be kept under control. 2. The case where there are 255 tuples with distinct values that are clustered together (both in the key space and in physical index pages). Usually they'll span two index pages, but they might all fit together on one index page, allowing us to descend to it directly and read it only once. With 32 clients, I typically see a regression of about 1.5% for the first case relative to master, measured in throughput/TPS. The second case typically sees throughput that's ~4.8x master (i.e. a ~380% increase). I consider both of these extremes to be fairly unrealistic. With fewer array constants, the speed-ups you'll see in sympathetic cases are still very significant, but nothing like 4.8x. They're more like the 30% numbers that you saw. As you know, I'm not actually all that excited about cases like 2 -- it's not where users are likely to benefit from the patch.
The truly interesting cases are those cases where we can completely avoid heap accesses in the first place (not just *repeat* accesses to the same index pages), due to the patch's ability to consistently use index quals rather than filter quals. It's not that hard to show cases where there are 100x+ fewer pages accessed -- often with cases that have very few array constants. It's just that these cases aren't that interesting from a performance validation point of view -- it's obvious that filter quals are terrible. Another thing that the patch does particularly well on is cases where the array keys don't have any matches at all, but there is significant clustering (relatively common when multiple SAOPs are used as index quals, which becomes far more likely due to the planner changes). We don't just skip over parts of the index that aren't relevant -- we also skip over parts of the arrays that aren't relevant. Some of my adversarial test cases that take ~1 millisecond for the patch to execute will practically take forever on master (I had to have my test suite not run those tests against master). You just need lots of array keys. What master does on those adversarial cases with billions of distinct combinations of array keys (not that unlikely if there are 4 or 5 SAOPs with mere hundreds or thousands of array keys each) is so inefficient that we might as well call it infinitely slower than the patch. This is interesting to me from a performance robustness/stability point of view. The slowest kind of SAOP index scan with the patch becomes a full index scan -- just as it would be if we were using any type of non-SAOP qual before now. The worst case is a lot easier to reason about. > I'm not convinced this is a problem we have to solve. It's possible it > only affects cases that are implausible in practice (the script forces a > particular scan type, and maybe it would not be picked in practice). But > maybe it's fixable ...
I would expect the patch to do quite well (relative to what is actually possible) on cases like the two extremes that I've focussed on so far. It seems possible that it will do less well on cases that are somewhere in the middle (that also have lots of distinct values on each page). We effectively do a linear search on a page that we know has at least one more match (following a precheck that uses the high key). We hope that the next match (for the next array value) closely follows an initial match. But what if there are only 2 or 3 matches on each leaf page, that are spaced relatively far apart? You're going to have to grovel through the whole page. It's not obvious that that's a problem to be fixed -- we're still only descending the index once and still only locking the leaf page once, so we'll probably still win relative to master. And it's not that easy to imagine beating a linear search -- it's not like there is just one "next value" to search for in these cases. But it's something that deserves further consideration. -- Peter Geoghegan
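The "linear search" behavior described above can be sketched as a merge-style walk over the page's items and the array keys (a hypothetical illustration only, not the patch's actual _bt_checkkeys/_bt_advance_array_keys code; both inputs are assumed to be in ascending order, and the array keys duplicate-free):

```c
#include <assert.h>

/*
 * Hypothetical sketch: scan a leaf page's items against a set of array
 * keys, both in ascending order.  There is no single "next value" to
 * search for -- both sides advance in tandem, whichever is behind, so in
 * the worst case every item on the page is visited even when only a few
 * of them match.
 */
static int
count_page_matches(const int *items, int nitems,
                   const int *arraykeys, int nkeys)
{
    int         i = 0,
                k = 0,
                matches = 0;

    while (i < nitems && k < nkeys)
    {
        if (items[i] < arraykeys[k])
            i++;                /* tuple sorts before current array key */
        else if (items[i] > arraykeys[k])
            k++;                /* advance the array keys past this tuple */
        else
        {
            matches++;          /* tuple satisfies the current array key */
            i++;
        }
    }
    return matches;
}
```

With only 2 or 3 matches spaced far apart, this still walks through most of the page's items, which is exactly the cost being weighed above.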
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On 21/11/2023 04:52, Peter Geoghegan wrote: Attached is v7. First, some high-level reactions before looking at the patch very closely: - +1 on the general idea. Hard to see any downsides if implemented right. - This changes the meaning of amsearcharray==true to mean that the ordering is preserved with ScalarArrayOps, right? You change B-tree to make that true, but what about any out-of-tree index AM extensions? I don't know if any such extensions exist, and I don't think we should jump through any hoops to preserve backwards compatibility here, but probably deserves a notice in the release notes if nothing else. - You use the term "primitive index scan" a lot, but it's not clear to me what it means. Does one ScalarArrayOp turn into one "primitive index scan"? Or does each element in the array turn into a separate primitive index scan? Or something in between? Maybe add a new section to the README explaining how that works. - _bt_preprocess_array_keys() is called for each btrescan(). It performs a lot of work like cmp function lookups and deconstructing and merging the arrays, even if none of the SAOP keys change in the rescan. That could make queries with nested loop joins like this slower than before: "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 and tenk1.two IN (1,2);". - nbtutils.c is pretty large now. Perhaps create a new file nbtpreprocesskeys.c or something? - You and Matthias talked about an implicit state machine. I wonder if this could be refactored to have a more explicit state machine. The state transitions and interactions between _bt_checkkeys(), _bt_advance_array_keys() and friends feel complicated. And then some details: --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage + + +Every time an index is searched, the index's + pg_stat_all_indexes.idx_scan +field is incremented.
This usually happens once per index scan node +execution, but might take place several times during execution of a scan +that searches for multiple values together. Only queries that use certain +SQL constructs to search for rows matching any value +out of a list (or an array) of multiple scalar values are affected. See + for details. + + + Is this true even without this patch? Maybe commit this separately. The "Only queries ..." sentence feels difficult. Maybe something like "For example, queries using IN (...) or = ANY(...) constructs.". * _bt_preprocess_keys treats each primitive scan as an independent piece of * work. That structure pushes the responsibility for preprocessing that must * work "across array keys" onto us. This division of labor makes sense once * you consider that we're typically called no more than once per btrescan, * whereas _bt_preprocess_keys is always called once per primitive index scan. "That structure ..." is a garden-path sentence. I kept parsing "that must work" as one unit, the same way as "that structure", and it didn't make sense. Took me many re-reads to parse it correctly. Now that I get it, it doesn't bother me anymore, but maybe it could be rephrased. Is there _any_ situation where _bt_preprocess_array_keys() is called more than once per btrescan? /* * Look up the appropriate comparison operator in the opfamily. * * Note: it's possible that this would fail, if the opfamily is * incomplete, but it seems quite unlikely that an opfamily would omit * non-cross-type comparison operators for any datatype that it supports * at all. ... */ I agree that's unlikely. I cannot come up with an example where you would have cross-type operators between A and B, but no same-type operators between B and B. For any real-world opfamily, that would be an omission you'd probably want to fix. Still I wonder if we could easily fall back if it doesn't exist? And maybe add a check in the 'opr_sanity' test for that. 
In _bt_readpage(): /* * Prechecking the page with scan keys required for direction scan. We * check these keys with the last item on the page (according to our scan * direction). If these keys are matched, we can skip checking them with * every item on the page. Scan keys for our scan direction would * necessarily match the previous items. Scan keys required for opposite * direction scan are already matched by the _bt_first() call. * * With the forward scan, we do this check for the last item on the page * instead of the high key. It's relatively likely that the most * significant column in the high key will be different from the * corresponding value from the last item on the page. So checking with * the last item on the page would give
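The precheck quoted in that comment can be illustrated with a small sketch (hypothetical code, not the actual _bt_readpage): on a forward scan over an ascending page, if the last item satisfies a required key such as "attr <= bound", every earlier item must satisfy it too, so the per-item check for that key can be skipped:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the precheck idea: page items are in ascending
 * order, and the required qual is "item <= bound".  If the last item on
 * the page passes, every item on the page must pass, so the per-item key
 * check can be skipped entirely.
 */
static int
read_page_forward(const int *items, int nitems, int bound)
{
    bool        precheck_ok = (nitems > 0 && items[nitems - 1] <= bound);
    int         matched = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (precheck_ok || items[i] <= bound)
            matched++;
        else
            break;              /* required key failed; later items can't match */
    }
    return matched;
}
```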
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan wrote: > > On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent > wrote: > > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan wrote: > > > I should be able to post v6 later this week. My current plan is to > > > commit the other nbtree patch first (the backwards scan "boundary > > > cases" one from the ongoing CF) -- since I saw your review earlier > > > today. I think that you should probably wait for this v6 before > > > starting your review. > > > > Okay, thanks for the update, then I'll wait for v6 to be posted. > > On second thought, I'll just post v6 now (there won't be conflicts > against the master branch once the other patch is committed anyway). Thanks. Here's my review of the btree-related code: > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, > OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > -(void) _bt_checkkeys(scan, itup, indnatts, dir, > - , false); > +_bt_checkkeys(scan, , itup, false, false); > +requiredMatchedByPrecheck = pstate.continuescan; > +pstate.continuescan = true; /* reset */ The comment above the updated section needs to be updated. > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, > OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > -(void) _bt_checkkeys(scan, itup, indnatts, dir, > - , false); > +_bt_checkkeys(scan, , itup, false, false); This 'false' finaltup argument is surely wrong for the rightmost page's rightmost tuple, no? 
> +++ b/src/backend/access/nbtree/nbtutils.c > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > +/* We could pfree(elem_values) after, but not worth the cycles */ > +num_elems = _bt_merge_arrays(scan, cur, > + (indoption[cur->sk_attno - 1] & > INDOPTION_DESC) != 0, > + prev->elem_values, prev->num_elems, > + elem_values, num_elems); This code can get hit several times when there are multiple = ANY clauses, which may result in repeated leakage of these arrays during this scan. I think cleaning up may well be worth the cycles when the total size of the arrays is large enough. > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, >_bt_compare_array_elements, ); > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, > + Datum *elems_orig, int nelems_orig, > + Datum *elems_next, int nelems_next) > [...] > +/* > + * Incrementally copy the original array into a temp buffer, skipping > over > + * any items that are missing from the "next" array > + */ Given that we only keep the members that both arrays have in common, the result array will be a strict subset of the original array. So, I don't quite see why we need the temporary buffer here - we can reuse the entries of the elems_orig array that we've already compared against the elems_next array. We may want to optimize this further by iterating over only the smallest array: With the current code, [1, 2] + [1..1000] is faster to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much smaller than 1000*log(2). In practice this may matter very little, though. An even better optimized version would do a merge join on the two arrays, rather than loop + binary search. > @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void > *b, void *arg) > [...]
> +_bt_binsrch_array_skey(FmgrInfo *orderproc, Is there a reason for this complex initialization of high/low_elem, rather than this easier to understand and more compact initialization?: + low_elem = 0; + high_elem = array->num_elems - 1; + if (cur_elem_start) + { + if (ScanDirectionIsForward(dir)) + low_elem = array->cur_elem; + else + high_elem = array->cur_elem; + } > @@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan) > [...] > + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir) > [...] > +if (scan->parallel_scan != NULL) > +_bt_parallel_done(scan); > + > +/* > + * No more primitive index scans. Terminate the top-level scan. > + */ > +return false; I think the conditional _bt_parallel_done(scan) feels misplaced here, as the comment immediately below indicates the scan is to be terminated after that comment. So, either move this _bt_parallel_done call outside the function (which by name would imply it is read-only, without side effects like this) or move it below the comment "terminate the top-level scan". > +_bt_advance_array_keys(IndexScanDesc
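For what it's worth, the compact initialization suggested above drops naturally into a complete bounded binary search. Here is a hedged sketch of the forward-scan case only (hypothetical code, not the patch's actual _bt_binsrch_array_skey, and using plain ints instead of Datums and an order proc): find the smallest array element >= a target value, optionally starting from the array's current element:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: smallest element >= target in a sorted array, for
 * a forward scan.  When cur_elem_start is set, the search is bounded to
 * start at the array's current element, since earlier elements have
 * already been passed over.  Returns num_elems if every candidate is
 * < target (i.e. the array is exhausted in this scan direction).
 */
static int
binsrch_array(const int *elems, int num_elems, int cur_elem,
              bool cur_elem_start, int target)
{
    int         low_elem = 0;
    int         high_elem = num_elems - 1;

    if (cur_elem_start)
        low_elem = cur_elem;

    while (low_elem < high_elem)
    {
        int         mid_elem = low_elem + (high_elem - low_elem) / 2;

        if (elems[mid_elem] < target)
            low_elem = mid_elem + 1;
        else
            high_elem = mid_elem;
    }
    if (low_elem < num_elems && elems[low_elem] < target)
        low_elem++;             /* every candidate was < target */
    return low_elem;
}
```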
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan wrote: > On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan wrote: > > If you end up finding a bug in this v6, it'll most likely be a case > > where nbtree fails to live up to that. This project is as much about > > robust/predictable performance as anything else -- nbtree needs to be > > able to cope with practically anything. I suggest that your review > > start by trying to break the patch along these lines. > > I spent some time on this myself today (which I'd already planned on). > > Attached is an adversarial stress-test, which shows something that > must be approaching the worst case for the patch in terms of time > spent with a buffer lock held, due to spending so much time evaluating > unusually expensive SAOP index quals. The array binary searches that > take place with a buffer lock held aren't quite like anything else > that nbtree can do right now, so it's worthy of special attention. > > I thought of several factors that maximize both the number of binary > searches within any given _bt_readpage, as well as the cost of each > binary search -- the SQL file has full details. My test query is > *extremely* unrealistic, since it combines multiple independent > unrealistic factors, all of which aim to make life hard for the > implementation in one way or another. I hesitate to say that it > couldn't be much worse (I only spent a few hours on this), but I'm > prepared to say that it seems very unlikely that any real world query > could make the patch spend as many cycles in > _bt_readpage/_bt_checkkeys as this one does. > > Perhaps you can think of some other factor that would make this test > case even less sympathetic towards the patch, Matthias? The only thing > I thought of that I've left out was the use of a custom btree opclass, > "unrealistically_slow_ops". Something that calls pg_usleep in its > order proc. (I left it out because it wouldn't prove anything.) 
Have you tried using text index columns that are sorted with non-default locales? I've seen non-default locales use significantly more resources during compare operations than any other ordering operation I know of (which has mostly been in finding the locale), and use it extensively to test improvements for worst index shapes over at my btree patchsets because locales are dynamically loaded in text compare and nondefault locales are not cached at all. I suspect that this would be even worse if a somehow even worse locale path is available than what I'm using for test right now; this could be the case with complex custom ICU locales. > On my machine, custom instrumentation shows that each call to > _bt_readpage made while this query executes (on a patched server) > takes just under 1.4 milliseconds. While that is far longer than it > usually takes, it's basically acceptable IMV. It's not significantly > longer than I'd expect heap_index_delete_tuples() to take on an > average day with EBS (or other network-attached storage). But that's a > process that happens all the time, with an exclusive buffer lock held > on the leaf page throughout -- whereas this is only a shared buffer > lock, and involves a query that's just absurd . > > Another factor that makes this seem acceptable is just how sensitive > the test case is to everything going exactly and perfectly wrong, all > at the same time, again and again. The test case uses a 32 column > index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP > clauses (one per index column). If I reduce the number of SAOP clauses > in the query to (say) 8, I still have a test case that's almost as > silly as my original -- but now we only spend ~225 microseconds in > each _bt_readpage call (i.e. we spend over 6x less time in each > _bt_readpage call). (Admittedly if I also make the CREATE INDEX use > only 8 columns, we can fit more index tuples on one page, leaving us > at ~800 microseconds). 
A quick update of the table definition to use the various installed 'fr-%-x-icu' locales on text hash columns instead of numeric, with a different collation for each column, gets me to EXPLAIN (analyze) showing 2.07ms spent on every buffer hit inside the index scan node, as opposed to 1.76ms when using numeric. But, as you mention, the value of this metric is probably not very high. As for the patch itself, I'm probably about 50% through the patch now. While reviewing, I noticed the following two user-visible items, related to SAOP but not broken by or touched upon in this patch: 1. We don't seem to plan `column opr ALL (...)` as index conditions, while this should be trivial to optimize for at least btree. Example: SET enable_bitmapscan = OFF; WITH a AS (select generate_series(1, 1000) a) SELECT * FROM tenk1 WHERE thousand = ANY (array(table a)) AND thousand < ALL (array(table a)); This will never return any rows, but it does hit 9990 buffers in the new btree code, while I expected that to be 0 buffers based on the query and index (that is, I
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan wrote: > If you end up finding a bug in this v6, it'll most likely be a case > where nbtree fails to live up to that. This project is as much about > robust/predictable performance as anything else -- nbtree needs to be > able to cope with practically anything. I suggest that your review > start by trying to break the patch along these lines. I spent some time on this myself today (which I'd already planned on). Attached is an adversarial stress-test, which shows something that must be approaching the worst case for the patch in terms of time spent with a buffer lock held, due to spending so much time evaluating unusually expensive SAOP index quals. The array binary searches that take place with a buffer lock held aren't quite like anything else that nbtree can do right now, so it's worthy of special attention. I thought of several factors that maximize both the number of binary searches within any given _bt_readpage, as well as the cost of each binary search -- the SQL file has full details. My test query is *extremely* unrealistic, since it combines multiple independent unrealistic factors, all of which aim to make life hard for the implementation in one way or another. I hesitate to say that it couldn't be much worse (I only spent a few hours on this), but I'm prepared to say that it seems very unlikely that any real world query could make the patch spend as many cycles in _bt_readpage/_bt_checkkeys as this one does. Perhaps you can think of some other factor that would make this test case even less sympathetic towards the patch, Matthias? The only thing I thought of that I've left out was the use of a custom btree opclass, "unrealistically_slow_ops". Something that calls pg_usleep in its order proc. (I left it out because it wouldn't prove anything.) On my machine, custom instrumentation shows that each call to _bt_readpage made while this query executes (on a patched server) takes just under 1.4 milliseconds. 
While that is far longer than it usually takes, it's basically acceptable IMV. It's not significantly longer than I'd expect heap_index_delete_tuples() to take on an average day with EBS (or other network-attached storage). But that's a process that happens all the time, with an exclusive buffer lock held on the leaf page throughout -- whereas this is only a shared buffer lock, and involves a query that's just absurd. Another factor that makes this seem acceptable is just how sensitive the test case is to everything going exactly and perfectly wrong, all at the same time, again and again. The test case uses a 32 column index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP clauses (one per index column). If I reduce the number of SAOP clauses in the query to (say) 8, I still have a test case that's almost as silly as my original -- but now we only spend ~225 microseconds in each _bt_readpage call (i.e. we spend over 6x less time in each _bt_readpage call). (Admittedly if I also make the CREATE INDEX use only 8 columns, we can fit more index tuples on one page, leaving us at ~800 microseconds). I'm a little surprised that it isn't a lot worse than this, given how far I went. I was a little concerned that it would prove necessary to lock this kind of thing down at some higher level (e.g., in the planner), but that now looks unnecessary. There are much better ways to DOS the server than this. For example, you could run this same query while forcing a sequential scan! That appears to be quite a lot less responsive to interrupts (in addition to being hopelessly slow), probably because it uses parallel workers, each of which will use wildly expensive filter quals that just do a linear scan of the SAOP. -- Peter Geoghegan nemesis.sql Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan wrote: > > On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent > wrote: > > I'm planning on reviewing this patch tomorrow, but in an initial scan > > through the patch I noticed there's little information about how the > > array keys state machine works in this new design. Do you have a more > > toplevel description of the full state machine used in the new design? > > This is an excellent question. You're entirely right: there isn't > enough information about the design of the state machine. > > I should be able to post v6 later this week. My current plan is to > commit the other nbtree patch first (the backwards scan "boundary > cases" one from the ongoing CF) -- since I saw your review earlier > today. I think that you should probably wait for this v6 before > starting your review. Okay, thanks for the update, then I'll wait for v6 to be posted. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent wrote: > I'm planning on reviewing this patch tomorrow, but in an initial scan > through the patch I noticed there's little information about how the > array keys state machine works in this new design. Do you have a more > toplevel description of the full state machine used in the new design? This is an excellent question. You're entirely right: there isn't enough information about the design of the state machine. In v1 of the patch, from all the way back in July, the "state machine" advanced in the hackiest way possible: via repeated "incremental" advancement (using logic from the function that we call _bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that until the function I'm now calling _bt_tuple_before_array_skeys() eventually reported that the array keys were now sufficiently advanced. v2 greatly improved matters by totally overhauling _bt_advance_array_keys(): it was taught to use binary searches to advance the array keys, with limited remaining use of "incremental" array key advancement. However, version 2 (and all later versions to date) have somewhat wonky state machine transitions, in one important respect: calls to the new _bt_advance_array_keys() won't always advance the array keys to the maximum extent possible (possible while still getting correct behavior, that is). There were still various complicated scenarios involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD scan keys that use BTEqualStrategyNumber), where one single call to _bt_advance_array_keys() would advance the array keys to a point that was still < caller's tuple. AFAICT this didn't cause wrong answers to queries (that would require failing to find a set of exactly matching array keys where a matching set exists), but it was kludgey. It was sloppy in roughly the same way as the approach in my v1 prototype was sloppy (just to a lesser degree). I should be able to post v6 later this week. 
My current plan is to commit the other nbtree patch first (the backwards scan "boundary cases" one from the ongoing CF) -- since I saw your review earlier today. I think that you should probably wait for this v6 before starting your review. The upcoming version will have simple preconditions and postconditions for the function that advances the array key state machine (the new _bt_advance_array_keys). These are enforced by assertions at the start and end of the function. So the rules for the state machine become crystal clear and fairly easy to keep in your head (e.g., tuple must be >= required array keys on entry and <= required array keys on exit, the array keys must always either advance by one increment or be completely exhausted for the top-level scan in the current scan direction). Unsurprisingly, I found that adding and enforcing these invariants led to a simpler and more general design within _bt_advance_array_keys. That code is still the most complicated part of the patch, but it's much less of a bag of tricks. Another reason for you to hold off for a few more days. -- Peter Geoghegan
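The invariant contract described above might be sketched like this (a hypothetical simplification over a single integer array, not the patch's actual _bt_advance_array_keys; the assertions stand in for the pre/postconditions enforced at the function's start and end):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the invariant contract: advance the current
 * array key to the first element >= the caller's tuple value, or report
 * that the arrays are exhausted for the top-level scan.  Precondition:
 * the tuple is >= the current required array key on entry.
 * Postcondition: the tuple is <= the required array key on (successful)
 * exit.
 */
static bool
advance_array_keys(const int *elems, int nelems, int *cur_elem, int tupval)
{
    assert(tupval >= elems[*cur_elem]); /* precondition */

    while (*cur_elem < nelems - 1 && elems[*cur_elem] < tupval)
        (*cur_elem)++;

    if (elems[*cur_elem] < tupval)
        return false;           /* arrays exhausted in this direction */

    assert(tupval <= elems[*cur_elem]); /* postcondition */
    return true;
}
```

(The sketch advances incrementally for brevity; as discussed upthread, the real function uses binary searches for the same state transition.)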
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan wrote: > > On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan wrote: > > Attached is v4, which applies cleanly on top of HEAD. This was needed > > due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan > > keys required for directional scan in B-tree". > > > > Unfortunately I have more or less dealt with the conflicts on HEAD by > > disabling the optimization from that commit, for the time being. > > Attached is v5, which deals with the conflict with the optimization > added by Alexander Korotkov's commit e0b1ee17 sensibly: the > optimization is now only disabled in cases without array scan keys. > (It'd be very hard to make it work with array scan keys, since an > important principle for my patch is that we can change search-type > scan keys right in the middle of any _bt_readpage() call). I'm planning on reviewing this patch tomorrow, but in an initial scan through the patch I noticed there's little information about how the array keys state machine works in this new design. Do you have a more toplevel description of the full state machine used in the new design? If not, I'll probably be able to discover my own understanding of the mechanism used in the patch, but if there is a framework to build that understanding on (rather than having to build it from scratch) that'd be greatly appreciated. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan wrote: > Attached is v4, which applies cleanly on top of HEAD. This was needed > due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan > keys required for directional scan in B-tree". > > Unfortunately I have more or less dealt with the conflicts on HEAD by > disabling the optimization from that commit, for the time being. Attached is v5, which deals with the conflict with the optimization added by Alexander Korotkov's commit e0b1ee17 sensibly: the optimization is now only disabled in cases without array scan keys. (It'd be very hard to make it work with array scan keys, since an important principle for my patch is that we can change search-type scan keys right in the middle of any _bt_readpage() call). v5 also fixes a longstanding open item for the patch: we no longer call _bt_preprocess_keys() with a buffer lock held, which was a bad idea at best, and unsafe (due to the syscache lookups within _bt_preprocess_keys) at worst. A new, minimal version of the function (called _bt_preprocess_keys_leafbuf) is called at the same point instead. That change, combined with the array binary search stuff (which was added back in v2), makes the total amount of work performed with a buffer lock held totally reasonable in all cases. It's even okay in extreme or adversarial cases with many millions of array keys. Making this _bt_preprocess_keys_leafbuf approach work has a downside: it requires that _bt_preprocess_keys be a little less aggressive about removing redundant scan keys, in order to meet certain assumptions held by the new _bt_preprocess_keys_leafbuf function. Essentially, _bt_preprocess_keys must now worry about current and future array key values when determining redundancy among scan keys -- not just the current array key values.
_bt_preprocess_keys knows nothing about SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict 1:1 correspondence between the number of primitive index scans and the number of array keys (actually, the number of distinct combinations of array keys). Obviously that's no longer the case with the patch (that's the whole point of the patch). It's easiest to understand how elimination of redundant quals needs to work in v5 by way of an example. Consider the following query: select count(*), two, four, twenty, hundred from tenk1 where two in (0, 1) and four in (1, 2, 3) and two < 1; Notice that "two" appears in the where clause twice. First it appears as an SAOP, and then as an inequality. Right now, on HEAD, the primitive index scan where the SAOP's scankey is "two = 0" renders "two < 1" redundant. However, the subsequent primitive index scan where "two = 1" does *not* render "two < 1" redundant. This has implications for the mechanism in the patch, since the patch will perform one big primitive index scan for all array constants, with only a single _bt_preprocess_keys call at the start of its one and only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf calls once we reach the leaf level). The compromise that I've settled on in v5 is to teach _bt_preprocess_keys to *never* treat "two < 1" as redundant with such a query -- even though there is some squishy sense in which "two < 1" is indeed still redundant (for the first SAOP key of value 0). My approach is reasonably well targeted in that it mostly doesn't affect queries that don't need it. But it will add cycles to some badly written queries that wouldn't have had them in earlier Postgres versions. I'm not entirely sure how much this matters, but my current sense is that it doesn't matter all that much. This is the kind of thing that is hard to test and poorly tested, so simplicity is even more of a virtue than usual. 
Note that the changes to _bt_preprocess_keys in v5 *don't* affect how we determine if the scan has contradictory quals, which is generally more important. With contradictory quals, _bt_first can avoid reading any data from the index. OTOH eliminating redundant quals (i.e. the thing that v5 *does* change) merely makes evaluating index quals less expensive via preprocessing-away unneeded scan keys. In other words, while it's possible that the approach taken by v5 will add CPU cycles in a small number of cases, it should never result in more page accesses. -- Peter Geoghegan v5-0001-Enhance-nbtree-ScalarArrayOp-execution.patch Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan wrote: > On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan wrote: > > Attached is v2, which makes all array key advancement take place using > > the "next index tuple" approach (using binary searches to find array > > keys using index tuple values). > > Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. Attached is v4, which applies cleanly on top of HEAD. This was needed due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan keys required for directional scan in B-tree". Unfortunately I have more or less dealt with the conflicts on HEAD by disabling the optimization from that commit, for the time being. The commit in question is rather poorly documented, and it's not immediately clear how to integrate it with my work. I just want to make sure that there's a testable patch available. -- Peter Geoghegan v4-0001-Enhance-nbtree-ScalarArrayOp-execution.patch Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan wrote: > Attached is v2, which makes all array key advancement take place using > the "next index tuple" approach (using binary searches to find array > keys using index tuple values). Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. No notable changes here compared to v2. -- Peter Geoghegan From 2cff1dadb7903d49a2338b64b27178fa0bc51456 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 17 Jun 2023 17:03:36 -0700 Subject: [PATCH v3] Enhance nbtree ScalarArrayOp execution. Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing additional context about the arrays down into the nbtree index AM, as index quals. This information enabled nbtree to execute multiple primitive index scans as part of an index scan executor node that was treated as one continuous index scan. The motivation behind this earlier work was enabling index-only scans with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed via BitmapOr nodes, which is largely index-AM-agnostic, but always requires heap access). The general idea of giving the index AM this additional context can be pushed a lot further, though. Teach nbtree SAOP index scans to dynamically advance array scan keys using information about the characteristics of the index, determined at runtime. The array key state machine advances the current array keys using the next index tuple in line to be scanned, at the point where the scan reaches the end of the last set of array keys. This approach is far more flexible, and can be far more efficient. Cases that previously required hundreds (even thousands) of primitive index scans now require as few as one single primitive index scan. Also remove all restrictions on generating path keys for nbtree index scans that happen to have ScalarArrayOpExpr quals. 
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now safe. Now nbtree index scans with an inequality clause on a high order column and a SAOP clause on a lower order column are executed as one single primitive index scan, since that is the most efficient way to do it. Non-required equality type SAOP quals are executed by nbtree using almost the same approach used for required equality type SAOP quals. nbtree is now strictly guaranteed to avoid all repeat accesses to any individual leaf page, even in cases with inequalities on high order columns (except when the scan direction changes, or the scan restarts). We now have strong guarantees about the worst case, which is very useful when costing index scans with SAOP clauses. The cost profile of index paths with multiple SAOP clauses is now a lot closer to other cases; more selective index scans will now generally have lower costs than less selective index scans. The added cost from repeatedly descending the index still matters, but it can never dominate. An important goal of this work is to remove all ScalarArrayOpExpr clause special cases from the planner -- ScalarArrayOpExpr clauses can now be thought of a generalization of simple equality clauses (except when costing index scans, perhaps). The planner no longer needs to generate alternative index paths with filter quals/qpquals. We assume that true SAOP index quals are strictly better than filter/qpquals, since the work in nbtree guarantees that they'll be at least slightly faster. Many of the queries sped up by the work from this commit don't directly benefit from the nbtree/executor enhancements. They benefit indirectly. The planner no longer shows any restraint around making SAOP clauses into true nbtree index quals, which tends to result in significant savings on heap page accesses. 
In general we never need visibility checks to evaluate true index quals, whereas filter quals often need to perform extra heap accesses, just to eliminate non-matching tuples (expression evaluation is only safe with known visible tuples). Author: Peter Geoghegan Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com --- src/include/access/nbtree.h| 39 +- src/backend/access/nbtree/nbtree.c | 65 +- src/backend/access/nbtree/nbtsearch.c | 62 +- src/backend/access/nbtree/nbtutils.c | 1367 ++-- src/backend/optimizer/path/indxpath.c | 64 +- src/backend/utils/adt/selfuncs.c | 123 +- doc/src/sgml/monitoring.sgml | 13 + src/test/regress/expected/create_index.out | 61 +- src/test/regress/expected/join.out |5 +- src/test/regress/sql/create_index.sql | 20 +- 10 files changed, 1506 insertions(+), 313 deletions(-) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 6345e16d7..33db9b648 100644 ---
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan wrote: > > MDAM seems to require exponential storage for "scan key operations" > > for conditions on N columns (to be precise, the product of the number > > of distinct conditions on each column); e.g. an index on mytable > > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... > > AND h IN (1, 2)" would require 2^8 entries. > What you describe is a problem in theory, but I doubt that it's a > problem in practice. You don't actually have to materialize the > predicates up-front, or at all. Plus you can skip over them using the > next index tuple. So skipping works both ways. Attached is v2, which makes all array key advancement take place using the "next index tuple" approach (using binary searches to find array keys using index tuple values). This approach was necessary for fairly mundane reasons (it limits the amount of work required while holding a buffer lock), but it also solves quite a few other problems that I find far more interesting. It's easy to imagine the state machine from v2 of the patch being extended for skip scan. My approach "abstracts away" the arrays. For skip scan, it would more or less behave as if the user had written a query "WHERE a in () AND b = 5 ... " -- without actually knowing what the so-called array keys for the high-order skipped column are (not up front, at least). We'd only need to track the current "array key" for the scan key on the skipped column, "a". The state machine would notice when the scan had reached the next-greatest "a" value in the index (whatever that might be), and then make that the current value. Finally, the state machine would effectively instruct its caller to consider repositioning the scan via a new descent of the index. In other words, almost everything for skip scan would work just like regular SAOPs -- and any differences would be well encapsulated. But it's not just skip scan. 
This approach also enables thinking of SAOP index scans (using nbtree) as just another type of indexable clause, without any special restrictions (compared to true indexable operators such as "=", say). Particularly in the planner. That was always the general thrust of teaching nbtree about SAOPs, from the start. But it's something that should be totally embraced IMV. That's just what the patch proposes to do. In particular, the patch now: 1. Entirely removes the long-standing restriction on generating path keys for index paths with SAOPs, even when there are inequalities on a high order column present. You can mix SAOPs together with other clause types, arbitrarily, and everything still works and works efficiently. For example, the regression test expected output for this query/test (from bugfix commit 807a40c5) is updated by the patch, as shown here: explain (costs off) SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; - QUERY PLAN --- - Sort - Sort Key: thousand - -> Index Scan using tenk1_thous_tenthous on tenk1 - Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[]))) -(4 rows) + QUERY PLAN + + Index Scan using tenk1_thous_tenthous on tenk1 + Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[]))) +(2 rows) We don't need a sort node anymore -- even though the leading column here (thousand) uses an inequality, a particularly tricky case. Now it's an index scan, much like any other, with no particular restrictions caused by using a SAOP. 2. Adds an nbtree strategy for non-required equality array scan keys, which is built on the same state machine, with only minor differences to deal with column values "appearing out of key space order". 3. Simplifies the optimizer side of things by consistently avoiding filter quals (except when it's truly unavoidable). 
The optimizer doesn't even consider alternative index paths with filter quals for lower-order SAOP columns, because they have no possible advantage anymore. On the other hand, as we saw already, upthread, filter quals have huge disadvantages. By always using true index quals, we automatically avoid any question of getting excessive amounts of heap page accesses just to eliminate non-matching rows. AFAICT we don't need to make a trade-off here. The first version of the patch added some crufty code to the optimizer, to account for various restrictions on sort order. This revised version actually removes existing cruft from the same place (indxpath.c) instead. Items 1, 2, and 3 are all closely related. Take the query I've shown for item 1. Bugfix commit 807a40c5 (which added the test query in question) dealt with an oversight in the then-recent original nbtree SAOP patch (commit 9e8da0f7): when nbtree
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina wrote: > I noticed that you are going to add CNF->DNF transformation at the index > construction stage. If I understand correctly, you will rewrite > restrictinfo node, > change boolean "AND" expressions to "OR" expressions, but would it be > possible to apply such a procedure earlier? Sort of. I haven't really added any new CNF->DNF transformations. The code you're talking about is really just checking that every index path has clauses that we know that nbtree can handle. That's a big, ugly modularity violation -- many of these details are quite specific to the nbtree index AM (in theory we could have other index AMs that are amsearcharray). At most, v1 of the patch makes greater use of an existing transformation that takes place in the nbtree index AM, as it preprocesses scan keys for these types of queries (it's not inventing new transformations at all). This is a slightly creative interpretation, too. Tom's commit 9e8da0f7 didn't actually say anything about CNF/DNF. > Otherwise I suppose you > could face the problem of > incorrect selectivity of the calculation and, consequently, the > cardinality calculation? I can't think of any reason why that should happen as a direct result of what I have done here. Multi-column index paths + multiple SAOP clauses are not a new thing. The number of rows returned does not depend on whether we have some columns as filter quals or not. Of course that doesn't mean that the costing has no problems. The costing definitely has several problems right now. It also isn't necessarily okay that it's "just as good as before" if it turns out that it needs to be better now. But I don't see why it would be. (Actually, my hope is that selectivity estimation might be *less* important as a practical matter with the patch.) > I can't clearly understand at what stage it is clear that the such a > transformation needs to be applied? I don't know either. 
I think that most of this work needs to take place in the nbtree code, during preprocessing. But it's not so simple. There is a mutual dependency between the code that generates index paths in the planner and nbtree scan key preprocessing. The planner needs to know what kinds of index paths are possible/safe up-front, so that it can choose the fastest plan (the fastest that the index AM knows how to execute correctly). But, there are lots of small annoying nbtree implementation details that might matter, and can change. I think we need to have nbtree register a callback, so that the planner can initialize some preprocessing early. I think that we require a "two way conversation" between the planner and the index AM. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent wrote: > My idea is not quite block nested loop join. It's more 'restart the > index scan at the location the previous index scan ended, if > heuristics say there's a good chance that might save us time'. I'd say > it is comparable to the fast tree descent optimization that we have > for endpoint queries, and comparable to this patch's scankey > optimization, but across AM-level rescans instead of internal rescans. Yeah, I see what you mean. Seems related, even though what you've shown in your prototype patch doesn't seem like it fits into my taxonomy very neatly. (BTW, I was a little confused by the use of the term "endpoint" at first, since there is a function that uses that term to refer to a descent of the tree that happens without any insertion scan key. This path is used whenever the best we can do in _bt_first is to descend to the rightmost or leftmost page.) > The basic design of that patch is this: We keep track of how many > times we've rescanned, and the end location of the index scan. If a > new index scan hits the same page after _bt_search as the previous > scan ended, we register that. I can see one advantage that block nested loop join would retain here: it does block-based accesses on both sides of the join. Since it "looks ahead" on both sides of the join, more repeat accesses are likely to be avoided. Not too sure how much that matters in practice, though. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hi, all! CNF -> DNF conversion = Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans, that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of each array as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3". In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy). The optimizer still needs to optimize. It can't very well do that without having some kind of advanced notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. 
I suspect that there needs to be a two way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here. Honestly, I'm just reading and delving into this thread and other topics related to it, so excuse me if I ask you a few obvious questions. I noticed that you are going to add CNF->DNF transformation at the index construction stage. If I understand correctly, you will rewrite restrictinfo node, change boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier? Otherwise I suppose you could face the problem of incorrect selectivity of the calculation and, consequently, the cardinality calculation? I can't clearly understand at what stage it is clear that the such a transformation needs to be applied? -- Regards, Alena Rybakina Postgres Professional
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan wrote: > > On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent > wrote: > > > Basically, the patch that added that feature had to revise the index > > > AM API, in order to support a mode of operation where scans return > > > groupings rather than tuples. Whereas this patch requires none of > > > that. It makes affected index scans as similar as possible to > > > conventional index scans. > > > > Hmm, yes. I see now where my confusion started. You called it out in > > your first paragraph of the original mail, too, but that didn't help > > me then: > > > > The wiki does not distinguish "Index Skip Scans" and "Loose Index > > Scans", but these are not the same. > > A lot of people (myself included) were confused on this point for > quite a while. I've taken the liberty to update the "Loose indexscan" wiki page [0], adding detail that Loose indexscans are distinct from Skip scans, and showing some high-level distinguishing properties. I also split the TODO entry for `` "loose" or "skip" scan `` into two, and added links to the relevant recent threads so that it's clear these are different (and that some previous efforts may have had a confusing name). I hope this will reduce the chance of future confusion between the two different approaches to improving index scan performance. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0]: https://wiki.postgresql.org/wiki/Loose_indexscan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent wrote: > > Basically, the patch that added that feature had to revise the index > > AM API, in order to support a mode of operation where scans return > > groupings rather than tuples. Whereas this patch requires none of > > that. It makes affected index scans as similar as possible to > > conventional index scans. > > Hmm, yes. I see now where my confusion started. You called it out in > your first paragraph of the original mail, too, but that didn't help > me then: > > The wiki does not distinguish "Index Skip Scans" and "Loose Index > Scans", but these are not the same. A lot of people (myself included) were confused on this point for quite a while. To make matters even more confusing, one of the really compelling cases for the MDAM design is scans that feed into GroupAggregates -- preserving index sort order for naturally big index scans will tend to enable it. One of my examples from the start of this thread showed just that. (It just so happened that that example was faster because of all the "skipping" that nbtree *wasn't* doing with the patch.) > Yes, that's why I asked: The MDAM paper's examples seem to materialize > the full predicate up-front, which would require a product of all > indexed columns' quals in size, so that materialization has a good > chance to get really, really large. But if we're not doing that > materialization upfront, then there is no issue with resource > consumption (except CPU time, which can likely be improved with other > methods) I get why you asked. I might have asked the same question. As I said, the MDAM paper has *surprisingly* little to say about B-Tree executor stuff -- it's almost all just describing the preprocessing/transformation process. It seems as if optimizations like the one from my patch were considered too obvious to talk about and/or out of scope by the authors. Thinking about the MDAM paper like that was what made everything fall into place for me. 
Remember, "missing key predicates" isn't all that special. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan wrote: > > On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent > wrote: > > We could cache the last accessed leaf page across amrescan operations > > to reduce the number of index traversals needed when the join key of > > the left side is highly (but not necessarily strictly) correllated. > > That sounds like block nested loop join. It's possible that that could > reuse some infrastructure from this patch, but I'm not sure. My idea is not quite block nested loop join. It's more 'restart the index scan at the location the previous index scan ended, if heuristics say there's a good chance that might save us time'. I'd say it is comparable to the fast tree descent optimization that we have for endpoint queries, and comparable to this patch's scankey optimization, but across AM-level rescans instead of internal rescans. See also the attached prototype and loosely coded patch. It passes tests, but it might not be without bugs. The basic design of that patch is this: We keep track of how many times we've rescanned, and the end location of the index scan. If a new index scan hits the same page after _bt_search as the previous scan ended, we register that. Those two values - num_rescans and num_samepage - are used as heuristics for the following: If 50% or more of rescans hit the same page as the end location of the previous scan, we start saving the scan's end location's buffer into the BTScanOpaque, so that the next _bt_first can check whether that page might be the right leaf page, and if so, immediately go to that buffer instead of descending the tree - saving one tree descent in the process. Further optimizations of this mechanism could easily be implemented by e.g. only copying the min/max index tuples instead of the full index page, reducing the overhead at scan end. 
Kind regards, Matthias van de Meent Neon (https://neon.tech) v1-0001-Cache-btree-scan-end-page-across-rescans-in-the-s.patch.cfbot-ignore Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan wrote: > > On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent > > I'm not sure I understand. MDAM seems to work on an index level to > > return full ranges of values, while "skip scan" seems to try to allow > > systems to signal to the index to skip to some other index condition > > based on arbitrary cutoffs. This would usually be those of which the > > information is not stored in the index, such as "SELECT user_id FROM > > orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go > > though the user_id index and skip to the next user_id value when it > > gets enough rows of a matching result (where "enough" is determined > > above the index AM's plan node, or otherwise is impossible to > > determine with only the scan key info in the index AM). I'm not sure > > how this could work without specifically adding skip scan-related > > index AM functionality, and I don't see how it fits in with this > > MDAM/SAOP system. > > I think of that as being quite a different thing. > > Basically, the patch that added that feature had to revise the index > AM API, in order to support a mode of operation where scans return > groupings rather than tuples. Whereas this patch requires none of > that. It makes affected index scans as similar as possible to > conventional index scans. Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then: The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same. In the one page on "Loose indexscan", it refers to MySQL's "loose index scan" documentation, which does handle groupings, and this was targeted with the previous, mislabeled, "Index skipscan" patchset. 
However, crucially, it also refers to other databases' Index Skip Scan documentation, which document and implement this approach of 'skipping to the next potential key range to get efficient non-prefix qual results', giving me a false impression that those two features are one and the same when they are not. It seems like I'll have to wait a bit longer for the functionality of Loose Index Scans. > > > [...] > > > > > > Thoughts? > > > > MDAM seems to require exponential storage for "scan key operations" > > for conditions on N columns (to be precise, the product of the number > > of distinct conditions on each column); e.g. an index on mytable > > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... > > AND h IN (1, 2)" would require 2^8 entries. > > Note that I haven't actually changed anything about the way that the > state machine generates new sets of single value predicates -- it's > still just cycling through each distinct set of array keys in the > patch. > > What you describe is a problem in theory, but I doubt that it's a > problem in practice. You don't actually have to materialize the > predicates up-front, or at all. Yes, that's why I asked: The MDAM paper's examples seem to materialize the full predicate up-front, which would require a product of all indexed columns' quals in size, so that materialization has a good chance to get really, really large. But if we're not doing that materialization upfront, then there is no issue with resource consumption (except CPU time, which can likely be improved with other methods) Kind regards, Matthias van de Meent Neon (https://neon.tech/)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent wrote: > We could cache the last accessed leaf page across amrescan operations > to reduce the number of index traversals needed when the join key of > the left side is highly (but not necessarily strictly) correllated. That sounds like block nested loop join. It's possible that that could reuse some infrastructure from this patch, but I'm not sure. In general, SAOP execution/MDAM performs "duplicate elimination before it reads the data" by sorting and deduplicating the arrays up front. While my patch sometimes elides a primitive index scan, primitive index scans are already disjuncts that are combined to create what can be considered one big index scan (that's how the planner and executor think of them). The patch takes that one step further by recognizing that it could quite literally be one big index scan in some cases (or fewer, larger scans, at least). It's a natural incremental improvement, as opposed to inventing a new kind of index scan. If anything the patch makes SAOP execution more similar to traditional index scans, especially when costing them. Like InnoDB style loose index scan (for DISTINCT and GROUP BY optimization), block nested loop join would require inventing a new type of index scan. Both of these other two optimizations involve the use of semantic information that spans multiple levels of abstraction. Loose scan requires duplicate elimination (that's the whole point), while IIUC block nested loop join needs to "simulate multiple inner index scans" by deliberately returning duplicates for each would-be inner index scan. These are specialized things. To be clear, I think that all of these ideas are reasonable. I just find it useful to classify these sorts of techniques according to whether or not the index AM API would have to change or not, and the general nature of any required changes. 
MDAM can do a lot of cool things without requiring any revisions to the index AM API, which should allow it to play nice with everything else (index path clause safety issues notwithstanding). -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan wrote: > > On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent > wrote: > > Considering that it caches/reuses the page across SAOP operations, can > > (or does) this also improve performance for index scans on the outer > > side of a join if the order of join columns matches the order of the > > index? > > It doesn't really cache leaf pages at all. What it does is advance the > array keys locally, while the original buffer lock is still held on > that same page. Hmm, then I had a mistaken understanding of what we do in _bt_readpage with _bt_saveitem. > > That is, I believe this caches (leaf) pages across scan keys, but can > > (or does) it also reuse these already-cached leaf pages across > > restarts of the index scan/across multiple index lookups in the same > > plan node, so that retrieval of nearby index values does not need to > > do an index traversal? > > I'm not sure what you mean. There is no reason why you need to do more > than one single descent of an index to scan many leaf pages using many > distinct sets of array keys. Obviously, this depends on being able to > observe that we really don't need to redescend the index to advance > the array keys, again and again. Note in particularly that this > usually works across leaf pages. In a NestedLoop(inner=seqscan, outer=indexscan), the index gets repeatedly scanned from the root, right? It seems that right now, we copy matching index entries into a local cache (that is deleted on amrescan), then we drop our locks and pins on the buffer, and then start returning values from our local cache (in _bt_saveitem). We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correllated. 
The worst case overhead of this would be 2 _bt_compares (to check if the value is supposed to be fully located on the cached leaf page) plus one memcpy( , , BLCKSZ) in the previous loop. With some smart heuristics (e.g. page fill factor, number of distinct values, and whether we previously hit this same leaf page in the previous scan of this Node) we can probably also reduce this overhead to a minimum if the joined keys are not correlated, but accelerate the query significantly when we find out they are correlated. Of course, in the cases where we'd expect very few distinct join keys the planner would likely put a Memoize node above the index scan, but for mostly unique join keys I think this could save significant amounts of time, if only on buffer pinning and locking. I guess I'll try to code something up when I have the time, as it sounds like it's not exactly related to your patch, but an interesting improvement nonetheless. Kind regards, Matthias van de Meent
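The cross-rescan caching idea sketched above can be modeled as follows. This is a toy simulation in Python, not PostgreSQL code: the page representation, the names, and the "two comparisons against the cached page's bounds" fast path are all illustrative assumptions standing in for the real nbtree structures.

```python
class LeafPage:
    """Toy stand-in for an nbtree leaf page: a sorted slice of the key space."""
    def __init__(self, low_key, high_key, items):
        self.low_key = low_key    # lower bound of keys covered by this page
        self.high_key = high_key  # upper bound (the "high key" being compared against)
        self.items = items        # sorted keys stored on the page

class ToyIndexScan:
    def __init__(self, all_keys, page_size=4):
        # Split the sorted key list into fixed-size "leaf pages".
        self.pages = []
        for i in range(0, len(all_keys), page_size):
            chunk = all_keys[i:i + page_size]
            self.pages.append(LeafPage(chunk[0], chunk[-1], chunk))
        self.cached_page = None   # survives "amrescan", per the proposal above
        self.descents = 0         # count of full root-to-leaf traversals

    def rescan_lookup(self, key):
        # The proposed fast path: two comparisons against the cached page's
        # bounds decide whether the key must live on that page.
        page = self.cached_page
        if page is not None and page.low_key <= key <= page.high_key:
            return key in page.items
        # Fallback: a full descent from the root (simulated by linear search).
        self.descents += 1
        for page in self.pages:
            if page.low_key <= key <= page.high_key:
                self.cached_page = page
                return key in page.items
        return False

scan = ToyIndexScan(list(range(0, 100, 2)))  # even keys 0..98, 4 per page
# A correlated outer side: successive lookups probe nearby keys, so most
# of them are answered from the cached page without a new descent.
hits = sum(scan.rescan_lookup(k) for k in [10, 12, 14, 11, 16, 18])
```

In this toy run, six correlated lookups cost only two simulated descents; an uncorrelated key stream would instead pay the two bound comparisons on nearly every lookup, which is the overhead the heuristics above are meant to bound.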
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent wrote: > Considering that it caches/reuses the page across SAOP operations, can > (or does) this also improve performance for index scans on the outer > side of a join if the order of join columns matches the order of the > index? It doesn't really cache leaf pages at all. What it does is advance the array keys locally, while the original buffer lock is still held on that same page. > That is, I believe this caches (leaf) pages across scan keys, but can > (or does) it also reuse these already-cached leaf pages across > restarts of the index scan/across multiple index lookups in the same > plan node, so that retrieval of nearby index values does not need to > do an index traversal? I'm not sure what you mean. There is no reason why you need to do more than one single descent of an index to scan many leaf pages using many distinct sets of array keys. Obviously, this depends on being able to observe that we really don't need to redescend the index to advance the array keys, again and again. Note in particular that this usually works across leaf pages. > I'm not sure I understand. MDAM seems to work on an index level to > return full ranges of values, while "skip scan" seems to try to allow > systems to signal to the index to skip to some other index condition > based on arbitrary cutoffs. This would usually be those of which the > information is not stored in the index, such as "SELECT user_id FROM > orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go > through the user_id index and skip to the next user_id value when it > gets enough rows of a matching result (where "enough" is determined > above the index AM's plan node, or otherwise is impossible to > determine with only the scan key info in the index AM). I'm not sure > how this could work without specifically adding skip scan-related > index AM functionality, and I don't see how it fits in with this > MDAM/SAOP system. 
I think of that as being quite a different thing. Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans. > > [...] > > > > Thoughts? > > MDAM seems to require exponential storage for "scan key operations" > for conditions on N columns (to be precise, the product of the number > of distinct conditions on each column); e.g. an index on mytable > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... > AND h IN (1, 2)" would require 2^8 entries. Note that I haven't actually changed anything about the way that the state machine generates new sets of single value predicates -- it's still just cycling through each distinct set of array keys in the patch. What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all. Plus you can skip over them using the next index tuple. So skipping works both ways. -- Peter Geoghegan
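The point that the cross product of array keys never has to be materialized can be illustrated with a toy model (Python; the function name and structure are mine, not the patch's). The scan effectively holds one cursor per array, and advancing to the first combination >= some index tuple is a per-column binary search, so "skipping works both ways" without the 2^8 combinations ever existing in memory.

```python
import bisect

def advance_array_keys(arrays, tup):
    """Return the first combination from the cross product of the sorted
    'arrays' that is >= 'tup' (lexicographically), without materializing
    the product. A toy analogue of advancing array keys using the next
    index tuple; None means the scan is exhausted."""
    combo = []
    for i, arr in enumerate(arrays):
        pos = bisect.bisect_left(arr, tup[i])
        if pos == len(arr):
            # No value in this column can match: roll over a prior column
            # to its next value and reset later columns to their minimums.
            for j in range(i - 1, -1, -1):
                nxt = bisect.bisect_right(arrays[j], combo[j])
                if nxt < len(arrays[j]):
                    return tuple(combo[:j] + [arrays[j][nxt]] +
                                 [a[0] for a in arrays[j + 1:]])
            return None  # scan is exhausted
        combo.append(arr[pos])
        if arr[pos] > tup[i]:
            # Strictly greater: remaining columns reset to their minimums.
            return tuple(combo + [a[0] for a in arrays[i + 1:]])
    return tuple(combo)

# Eight IN (1, 2) clauses: 2^8 = 256 combinations exist in principle, but
# one index tuple lets us jump straight to the next relevant combination.
arrays = [[1, 2]] * 8
next_keys = advance_array_keys(arrays, (1, 1, 1, 1, 1, 1, 1, 3))
```

Here the index tuple (1, 1, 1, 1, 1, 1, 1, 3) rules out everything up to (1, 1, 1, 1, 1, 1, 2, 1) in a handful of binary searches, which is the sense in which the exponential predicate set is a problem only in theory.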
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan wrote: > > I've been working on a variety of improvements to nbtree's native > ScalarArrayOpExpr execution. This builds on Tom's work in commit > 9e8da0f7. Cool. > Attached patch is still at the prototype stage. I'm posting it as v1 a > little earlier than I usually would because there has been much back > and forth about it on a couple of other threads involving Tomas Vondra > and Jeff Davis -- seems like it would be easier to discuss with > working code available. > > The patch adds two closely related enhancements to ScalarArrayOp > execution by nbtree: > > 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree > index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now > "advance the scan's array keys locally", which sometimes avoids > significant amounts of unneeded pinning/locking of the same set of > index pages. > > SAOP index scans become capable of eliding primitive index scans for > the next set of array keys in line in cases where it isn't truly > necessary to descend the B-Tree again. Index scans are now capable of > "sticking with the existing leaf page for now" when it is determined > that the end of the current set of array keys is physically close to > the start of the next set of array keys (the next set in line to be > materialized by the _bt_advance_array_keys state machine). This is > often possible. > > Naturally, we still prefer to advance the array keys in the > traditional way ("globally") much of the time. That means we'll > perform another _bt_first/_bt_search descent of the index, starting a > new primitive index scan. Whether we try to skip pages on the leaf > level or stick with the current primitive index scan (by advancing > array keys locally) is likely to vary a great deal. Even during the > same index scan. Everything is decided dynamically, which is the only > approach that really makes sense. 
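The "advance locally vs. start a new primitive index scan" decision described above can be caricatured like this (an illustrative Python sketch; the real logic in the patch's _bt_advance_array_keys state machine is far more involved and considers more than a single key comparison):

```python
def choose_advancement(page_high_key, next_key_set_start):
    """Toy model of the dynamic decision: if the next set of array keys
    begins at or before the current leaf page's high key, its matches may
    start on this page (or very close by), so keep reading locally rather
    than paying for a fresh root-to-leaf descent."""
    if next_key_set_start <= page_high_key:
        return "advance locally (stay on current leaf page)"
    return "advance globally (new _bt_first/_bt_search descent)"

# Leaf page covers keys up to 500; next array keys start at 350 -> stay put.
local = choose_advancement(500, 350)
# Next array keys start at 90_000 -> far away, so re-descend from the root.
global_ = choose_advancement(500, 90_000)
```

The same scan can take either branch at different points, which is what "everything is decided dynamically" means in the description above.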
> > This optimization can significantly lower the number of buffers pinned > and locked in cases with significant locality, and/or with many array > keys with no matches. The savings (when measured in buffers > pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking > has shown that transaction throughput for variants of "pgbench -S" > designed to stress the implementation (hundreds of array constants) > under concurrent load can have up to 5.5x higher transaction > throughput with the patch. Less extreme cases (10 array constants, > spaced apart) see about a 20% improvement in throughput. There are > similar improvements to latency for the patch, in each case. Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index? That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal? > [...] > Skip Scan > ========= > > MDAM encompasses something that people tend to call "skip scan" -- > terminology with a great deal of baggage. These days I prefer to call > it "filling in missing key predicates", per the paper. That's much > more descriptive, and makes it less likely that people will conflate > the techniques with InnoDB style "loose index scans" -- the latter is > a much more specialized/targeted optimization. (I now believe that > these are very different things, though I was thrown off by the > superficial similarities for a long time. It's pretty confusing.) I'm not sure I understand. 
MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be those of which the information is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system. > [...] > > Thoughts? MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used for each column, that'd be 4^8, etc... With an index column limit of 32, that's
Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7. Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available. The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree: 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages. SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible. Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal. Even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense. This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. 
The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that transaction throughput for variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) under concurrent load can have up to 5.5x higher transaction throughput with the patch. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case. 2. The optimizer now produces index paths with multiple SAOP clauses (or other clauses we can safely treat as "equality constraints") on each of the leading columns from a composite index -- all while preserving index ordering/useful pathkeys in most cases. The nbtree work from item 1 is useful even with the simplest IN() list query involving a scan of a single column index. Obviously, it's very inefficient for the nbtree code to use 100 primitive index scans when 1 is sufficient. But that's not really why I'm pursuing this project. My real goal is to implement (or to enable the implementation of) a whole family of useful techniques for multi-column indexes. I call these "MDAM techniques", after the 1995 paper "Efficient Search of Multidimensional B-Trees" [1][2] -- MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system. The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort order be preserved whenever possible, too. Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed, on the nbtree side (when the optimizations are applicable at all). The new nbtree code now has special requirements surrounding SAOP search type scan keys with composite indexes. 
These requirements make changes in the optimizer all but essential. Index order === As I said, there are cases where preserving index order is immediately and obviously useful, in and of itself. Let's start there. Here's a test case that you can run against the regression test database: pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty); CREATE INDEX pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS) select ctid, thousand from tenk1 where two in (0,1) and four in (1,2) and twenty in (1,2) order by two, four, twenty limit 20; With the patch, this query gets 13 buffer hits. On the master branch, it gets 1377 buffer hits -- which exceeds the number you'll get from a sequential scan by about 4x. No coaxing was required to get the planner to produce this plan on the master branch. Almost all of the savings shown here are related to heap page buffer hits -- the nbtree changes don't directly help in this particular example (strictly speaking, you only need the optimizer changes to get this result). Obviously, the