Re: Index Skip Scan (new UniqueKeys)
On 6/9/20 06:22, Dmitry Dolgov wrote:
> Here is a new version of index skip scan patch, based on v8 patch for
> UniqueKeys implementation from [1]. I want to start a new thread to
> simplify navigation, hopefully I didn't forget anyone who actively
> participated in the discussion.

This CommitFest entry has been closed with RwF at [1]. Thanks for all the feedback given!

[1] https://www.postgresql.org/message-id/ab8636e7-182f-886a-3a39-f3fc279ca45d%40redhat.com

Best regards,
Dmitry & Jesper
Re: Index Skip Scan (new UniqueKeys)
On Mon, Jan 24, 2022 at 8:51 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote: > > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > > Besides that the new patch version contains some cleaning up and > > > addresses commentaries around leaf page pinning from [1]. The idea > > > behind the series structure is still the same: the first three patches > > > contains the essence of the implementation (hoping to help concentrate > > > review), the rest are more "experimental". > > > > > > [1]: > > > > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com > > > > Hi, > > > > + /* If same EC already is already in the list, then not unique */ > > > > The word already is duplicated. > > > > + * make_pathkeys_for_uniquekeyclauses > > > > The func name in the comment is different from the actual func name. > > Thanks for the review! Right, both above make sense. I'll wait a bit if > there will be more commentaries, and then post a new version with all > changes at once. > > > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > > > The year should be 2022 :-) > > Now you see how old is this patch :) > > > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should > > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? > > It's actually placed there by analogy with make_pathkeys_for_sortclauses > (immediately preceding function), so I think moving it into uniquekeys > will only make more confusion. > > > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, > > +bool allow_multinulls) > > > > It seems allow_multinulls is not checked in the func. Can the parameter > be > > removed ? > > Right, it could be removed. I believe it was somewhat important when the > patch was tightly coupled with the UniqueKeys patch, where it was put in > use. 
Hi,

It would be nice to take this unused parameter out of this patch; it should be added in the patch series where it is actually used.

Cheers
Re: Index Skip Scan (new UniqueKeys)
> On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote: > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Besides that the new patch version contains some cleaning up and > > addresses commentaries around leaf page pinning from [1]. The idea > > behind the series structure is still the same: the first three patches > > contains the essence of the implementation (hoping to help concentrate > > review), the rest are more "experimental". > > > > [1]: > > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com > > Hi, > > + /* If same EC already is already in the list, then not unique */ > > The word already is duplicated. > > + * make_pathkeys_for_uniquekeyclauses > > The func name in the comment is different from the actual func name. Thanks for the review! Right, both above make sense. I'll wait a bit if there will be more commentaries, and then post a new version with all changes at once. > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > The year should be 2022 :-) Now you see how old is this patch :) > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? It's actually placed there by analogy with make_pathkeys_for_sortclauses (immediately preceding function), so I think moving it into uniquekeys will only make more confusion. > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, > +bool allow_multinulls) > > It seems allow_multinulls is not checked in the func. Can the parameter be > removed ? Right, it could be removed. I believe it was somewhat important when the patch was tightly coupled with the UniqueKeys patch, where it was put in use. > + Path *newpath; > + > + newpath = (Path *) create_projection_path(root, rel, subpath, > + scanjoin_target); > > You can remove variable newpath and assign to lfirst(lc) directly. 
Yes, but I've followed the same style for create_projection_path as in many other invocations of this function in planner.c -- I would prefer to keep it uniform.

> +add_path(RelOptInfo *parent_rel, Path *new_path)
> +add_unique_path(RelOptInfo *parent_rel, Path *new_path)
>
> It seems the above two func's can be combined into one func which
> takes parent_rel->pathlist / parent_rel->unique_pathlist as third parameter.

Sure, but here I've intentionally split it into separate functions; otherwise a lot of unrelated call sites would have to be updated to provide the third parameter.
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > > Hi, > > > > Here is another take on the patch with a couple of changes: > > > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > > unique keys patch was actually not that big, so the main difference is > > that now the structure itself went away, a list of unique expressions > > is used instead. All the suggestions about how this feature should > > look like from the planning perspective are still there. On the one > > hand it will allow to develop both patches independently and avoid > > confusion for reviewers, on the other UniqueKeys could be easily > > incorporated back when needed. > > > > * Support for skipping in case of moving backward on demand (scroll > > cursor) is moved into a separate patch. This is implemented via > > returning false from IndexSupportsBackwardScan in case if it's a skip > > scan node, which in turn adds Materialize node on top when needed. The > > name SupportsBackwardScan was a bit confusing for me, but it seems > > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > > Eventually those cases when BackwardScanDirection is used are still > > handled by amskip. This change didn't affect the test coverage, all > > the test cases supported in previous patch versions are still there. > > > > About Materialize node, I guess it could be less performant than a > > "native" support, but it simplifies the implementation significantly > > to the point that most parts, which were causing questions before, are > > now located in the isolated patch. My idea here is to concentrate > > efforts on the first three patches in this series, and consider the > > rest of them as an experiment field. > > > > * IndexScan support was also relocated into a separate patch, the first > > three patches are now only about IndexOnlyScan. 
> > > > * Last bits of reviews were incorporated and rebased. > > While the patch is still waiting for a review, I was motivated by the > thread [1] to think about it from the interface point of view. Consider > an index skip scan being just like a normal index scan with a set of > underspecified leading search keys. It makes sense to have the same > structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm > not sure if amskip is a good name to represent that, don't have anything > better yet). But the "underspecified" part is currently indeed > interpreted in a limited way -- as "missing" keys -- and is expressed > only via the prefix size. Another option would be e.g. leading keys > constrained by a range of values, so generally speaking it makes sense > to extend amount of the information provided for skipping. > > As a naive approach I've added a new patch into the series, containing > the extra data structure (ScanLooseKeys, doesn't have much meaning yet > except somehow representing keys for skipping) used for index skip scan. > Any thoughts about it? > > Besides that the new patch version contains some cleaning up and > addresses commentaries around leaf page pinning from [1]. The idea > behind the series structure is still the same: the first three patches > contains the essence of the implementation (hoping to help concentrate > review), the rest are more "experimental". > > [1]: > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com Hi, + /* If same EC already is already in the list, then not unique */ The word already is duplicated. + * make_pathkeys_for_uniquekeyclauses The func name in the comment is different from the actual func name. + * Portions Copyright (c) 2020, PostgreSQL Global Development Group The year should be 2022 :-) make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? 
+query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
+                         bool allow_multinulls)

It seems allow_multinulls is not checked in the func. Can the parameter be removed?

+	Path	   *newpath;
+
+	newpath = (Path *) create_projection_path(root, rel, subpath,
+	                                          scanjoin_target);

You can remove the variable newpath and assign to lfirst(lc) directly.

+add_path(RelOptInfo *parent_rel, Path *new_path)
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)

It seems the above two funcs can be combined into one func which takes
parent_rel->pathlist / parent_rel->unique_pathlist as a third parameter.

Cheers
Re: Index Skip Scan (new UniqueKeys)
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > Hi, > > Here is another take on the patch with a couple of changes: > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > unique keys patch was actually not that big, so the main difference is > that now the structure itself went away, a list of unique expressions > is used instead. All the suggestions about how this feature should > look like from the planning perspective are still there. On the one > hand it will allow to develop both patches independently and avoid > confusion for reviewers, on the other UniqueKeys could be easily > incorporated back when needed. > > * Support for skipping in case of moving backward on demand (scroll > cursor) is moved into a separate patch. This is implemented via > returning false from IndexSupportsBackwardScan in case if it's a skip > scan node, which in turn adds Materialize node on top when needed. The > name SupportsBackwardScan was a bit confusing for me, but it seems > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > Eventually those cases when BackwardScanDirection is used are still > handled by amskip. This change didn't affect the test coverage, all > the test cases supported in previous patch versions are still there. > > About Materialize node, I guess it could be less performant than a > "native" support, but it simplifies the implementation significantly > to the point that most parts, which were causing questions before, are > now located in the isolated patch. My idea here is to concentrate > efforts on the first three patches in this series, and consider the > rest of them as an experiment field. > > * IndexScan support was also relocated into a separate patch, the first > three patches are now only about IndexOnlyScan. > > * Last bits of reviews were incorporated and rebased. 
While the patch is still waiting for a review, I was motivated by the thread [1] to think about it from the interface point of view. Consider an index skip scan as being just like a normal index scan with a set of underspecified leading search keys. It makes sense to have the same structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm not sure if amskip is a good name to represent that, don't have anything better yet). But the "underspecified" part is currently interpreted in a limited way -- as "missing" keys -- and is expressed only via the prefix size. Another option would be e.g. leading keys constrained by a range of values, so generally speaking it makes sense to extend the amount of information provided for skipping.

As a naive approach I've added a new patch into the series, containing the extra data structure (ScanLooseKeys, doesn't have much meaning yet except somehow representing keys for skipping) used for index skip scan. Any thoughts about it?

Besides that, the new patch version contains some cleanup and addresses commentaries around leaf page pinning from [1]. The idea behind the series structure is still the same: the first three patches contain the essence of the implementation (hoping to help concentrate review), the rest are more "experimental".

[1]: https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com

From 8b61823e523ceecbb2fd6faa1a229038bb981594 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v40 1/6] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query level,
and allows leveraging this on the path level. At the moment only
distinctClause makes use of such a mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan; contains a few bits out of previous
versions from Jesper Pedersen and Floris Van Nee.
---
 src/backend/nodes/list.c                | 31 +
 src/backend/optimizer/path/Makefile     |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c    | 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h           |  5 ++
 src/include/nodes/pg_list.h             |  1 +
 src/include/optimizer/pathnode.h        |  1 +
 src/include/optimizer/paths.h           |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..a53a50f372 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1653,3 +1653,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every
Re: Index Skip Scan (new UniqueKeys)
Hi,

Here is another take on the patch with a couple of changes:

* I've removed the UniqueKeys parts for now. The interaction between the
  skip scan & unique keys patches was actually not that big, so the main
  difference is that the structure itself went away and a list of unique
  expressions is used instead. All the suggestions about how this feature
  should look from the planning perspective are still there. On the one
  hand this allows developing both patches independently and avoids
  confusion for reviewers; on the other, UniqueKeys could easily be
  incorporated back when needed.

* Support for skipping when moving backward on demand (scroll cursor) is
  moved into a separate patch. This is implemented via returning false
  from IndexSupportsBackwardScan in case it's a skip scan node, which in
  turn adds a Materialize node on top when needed. The name
  SupportsBackwardScan was a bit confusing for me, but it seems it's only
  used together with a cursorOptions check for CURSOR_OPT_SCROLL.
  Eventually those cases where BackwardScanDirection is used are still
  handled by amskip. This change didn't affect the test coverage; all the
  test cases supported in previous patch versions are still there.

  About the Materialize node: I guess it could be less performant than
  "native" support, but it simplifies the implementation significantly,
  to the point that most parts which were causing questions before are
  now located in the isolated patch. My idea here is to concentrate
  efforts on the first three patches in this series, and consider the
  rest of them as an experimental field.

* IndexScan support was also relocated into a separate patch; the first
  three patches are now only about IndexOnlyScan.

* The last bits of reviews were incorporated and the patch rebased.
From 074fc8a41b43dd8b10ab29eb880c9d161a1638d5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v39 1/5] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query level,
and allows leveraging this on the path level. At the moment only
distinctClause makes use of such a mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan; contains a few bits out of previous
versions from Jesper Pedersen and Floris Van Nee.
---
 src/backend/nodes/list.c                | 31 +
 src/backend/optimizer/path/Makefile     |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c    | 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h           |  5 ++
 src/include/nodes/pg_list.h             |  1 +
 src/include/optimizer/pathnode.h        |  1 +
 src/include/optimizer/paths.h           |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 94fb236daf..9f2c408a4e 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1537,3 +1537,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+	const ListCell *lc1, *lc2;
+
+	Assert(IsPointerList(members));
+	Assert(IsPointerList(target));
+	check_list_invariants(members);
+	check_list_invariants(target);
+
+	foreach(lc1, members)
+	{
+		bool		found = false;
+
+		foreach(lc2, target)
+		{
+			if (equal(lfirst(lc1), lfirst(lc2)))
+			{
+				found = true;
+				break;
+			}
+		}
+		if (!found)
+			return false;
+	}
+	return true;
+}
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekeys.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index bd9a176d7d..f28547148d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
 #include "utils/lsyscache.h"

+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 											 RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
 	return pk;
 }

+/*
+ * pathkey_is_unique
+ *		Checks if the new pathkey's
Re: Index Skip Scan (new UniqueKeys)
On 3/17/21 6:02 PM, Dmitry Dolgov wrote: >> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: >> Hi, >> >> I took a look at the new patch series, focusing mostly on the uniquekeys >> part. It'd be a bit tedious to explain all the review comments here, so >> attached is a patch series with a "review" patch for some of the parts. > > Great, thanks. > >> Most of it is fairly small (corrections to comments etc.), I'll go over >> the more serious part so that we can discuss it here. I'll keep it split >> per parts of the original patch series. >> I suggest looking for XXX and FIXME comments in all the review patches. >> >> >> 0001 >> >> >> >> >> 0002 >> >> > > In fact both 0001 & 0002 belong to another thread, which these days > span [1], [2]. I've included them only because they happened to be a > dependency for index skip scan following David suggestions, sorry if > it's confusing. > > At the same time the author behind 0001 & 0002 is present in this thread > as well, maybe Andy can answer these comments right here and better than me. > Ah, sorry for the confusion. In that case the review comments probably belong to the other threads, so we should move the discussion there. It's not clear to me which of the threads is the right one. >> 0003 >> >> >> Just some comments/whitespace. >> >> >> 0004 >> >> >> I wonder why we don't include this in explain TEXT format? Seems it >> might make it harder to write regression tests for this? It's easier to >> just check that we deduced the right unique key(s) than having to >> construct an example where it actually changes the plan. > > Yeah, good point. I believe originally it was like that to not make > explain too verbose for skip scans, but displaying prefix definitely > could be helpful for testing, so will do this (and address other > comments as well). > Cool. Thanks. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: > Hi, > > I took a look at the new patch series, focusing mostly on the uniquekeys > part. It'd be a bit tedious to explain all the review comments here, so > attached is a patch series with a "review" patch for some of the parts. Great, thanks. > Most of it is fairly small (corrections to comments etc.), I'll go over > the more serious part so that we can discuss it here. I'll keep it split > per parts of the original patch series. > I suggest looking for XXX and FIXME comments in all the review patches. > > > 0001 > > > > > 0002 > > In fact both 0001 & 0002 belong to another thread, which these days span [1], [2]. I've included them only because they happened to be a dependency for index skip scan following David suggestions, sorry if it's confusing. At the same time the author behind 0001 & 0002 is present in this thread as well, maybe Andy can answer these comments right here and better than me. > 0003 > > > Just some comments/whitespace. > > > 0004 > > > I wonder why we don't include this in explain TEXT format? Seems it > might make it harder to write regression tests for this? It's easier to > just check that we deduced the right unique key(s) than having to > construct an example where it actually changes the plan. Yeah, good point. I believe originally it was like that to not make explain too verbose for skip scans, but displaying prefix definitely could be helpful for testing, so will do this (and address other comments as well). [1]: https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com
Re: Index Skip Scan (new UniqueKeys)
> On Thu, Jan 28, 2021 at 09:49:26PM +0900, Masahiko Sawada wrote:
> Hi Dmitry,
>
> Status update for a commitfest entry.
>
> This patch entry has been "Waiting on Author" on the CF app and the
> discussion seems inactive since the last CF. Could you share the
> current status of this patch? Heikki already sent review comments and
> there was a discussion, but is the WoA status correct? If it needs
> review, please rebase the patches and set it to "Needs Review" on the
> CF app. If you're not working on this, I'm going to set it to "Returned
> with Feedback", barring objections.

Yes, I'm still on it. In fact, I almost immediately sketched up a couple of changes to address Heikki's feedback, but was distracted by the subscripting stuff. Will try to send a new version of the patch soon.
Re: Index Skip Scan (new UniqueKeys)
Hi Dmitry,

On Sun, Oct 25, 2020 at 1:45 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Oct 06, 2020 at 05:20:39PM +0200, Dmitry Dolgov wrote:
> > > On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
> > >
> > > * I see the following compiler warning:
> > >
> > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> > > In function ‘populate_baserel_uniquekeys’:
> > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> > > warning: ‘expr’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> > >   797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
> > >       |           ^~
> >
> > This is mostly for the UniqueKeys patch, which is attached here only as a
> > dependency, but I'll prepare changes for that. Interestingly enough I
> > can't reproduce this warning, but if I understand correctly gcc has some
> > history of spurious uninitialized warnings, so I guess it could be
> > version dependent.
> >
> > > * Perhaps the warning is related to this nearby code that I noticed
> > > Valgrind complains about:
> > >
> > > ==1083468== VALGRINDERROR-BEGIN
> > > ==1083468== Invalid read of size 4
> > > ==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> > > ==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
> >
> > This also belongs to the UniqueKeys patch, but at least I can reproduce
> > this one. My guess is that nkeycolumns should be used there, not ncolumns,
> > which is visible in the index_including tests. The same as the previous
> > one, I will prepare corresponding changes.
> >
> > > * Do we really need the AM-level boolean flag/argument named
> > > "scanstart"? Why not just follow the example of btgettuple(), which
> > > determines whether or not the scan has been initialized based on the
> > > current scan position?
> > > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> > > cannot or should not take the same approach as btgettuple(). And even
> > > if you can't take exactly the same approach, I would still think that
> > > the scan's opaque B-Tree state should remember if it's the first call
> > > to _bt_skip() (rather than some subsequent call) in some other way
> > > (e.g. carrying a "scanstart" bool flag directly).
> >
> > Yes, agree, carrying this flag inside the opaque state would be better.
>
> Here is a new version which doesn't require the "scanstart" argument and
> contains a few other changes to address the issues mentioned earlier.
> It's also based on the latest UniqueKeys patches with the valgrind issue
> fixed (as before they're attached just for reference, you can find more
> in the original thread). I didn't rework the commentaries yet, will post
> that soon (need to get some inspiration first, probably via reading
> Shakespeare unless someone has better suggestions).
>
> > > * Why is it okay to do anything important based on the
> > > _bt_scankey_within_page() return value?
> > >
> > > If the page is empty, then how can we know that it's okay to go to the
> > > next value? I'm concerned that there could be subtle bugs in this
> > > area. VACUUM will usually just delete the empty page. But it won't
> > > always do so, for a variety of reasons that aren't worth going into
> > > now. This could mask bugs in this area. I'm concerned about patterns
> > > like this one from _bt_skip():
> > >
> > > while (!nextFound)
> > > {
> > >     ...
> > >     if (_bt_scankey_within_page(scan, so->skipScanKey,
> > >                                 so->currPos.buf, dir))
> > >     {
> > >         ...
> > >     }
> > >     else
> > >         /*
> > >          * If startItup could be not found within the current page,
> > >          * assume we found something new
> > >          */
> > >         nextFound = true;
> > >     ...
> > > }
> > >
> > > Why would you assume that "we found something new" here? In general I
> > > just don't understand the design of _bt_skip(). I get the basic idea
> > > of what you're trying to do, but it could really use better comments.
> >
> > Yeah, I'll put more effort into clear comments. There are two different
> > ways in which _bt_scankey_within_page is being used.
> >
> > The first one is to check if it's possible to skip traversal of the
> > tree from the root in case what we're looking for could be on the
> > current page. In this case an empty page would mean we need to search
> > from the root, so I'm not sure what the issue could be here?
> >
> > The second one (that you've highlighted above) I admit is probably the
> > most questionable part of the patch and open for suggestions on how to
> > improve it. It's required for one
Re: Index Skip Scan (new UniqueKeys)
On Wed, Dec 2, 2020 at 9:59 AM Heikki Linnakangas wrote: > On 01/12/2020 22:21, Dmitry Dolgov wrote: > >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > >> - I'm surprised you need a new index AM function (amskip) for this. Can't > >> you just restart the scan with index_rescan()? The btree AM can check if > >> the > >> new keys are on the same page, and optimize the rescan accordingly, like > >> amskip does. That would speed up e.g. nested loop scans too, where the keys > >> just happen to be clustered. > > > > An interesting point. At the moment I'm not sure whether it's possible > > to implement skipping via index_rescan or not, need to take a look. But > > checking if the new keys are on the same page would introduce some > > overhead I guess, wouldn't it be too invasive to add it into already > > existing btree AM? > > I think it'll be OK. But if it's not, you could add a hint argument to > index_rescan() to hint the index AM that the new key is known to be > greater than the previous key. FWIW here's what I wrote about that years ago[1]: > It works by adding a new index operation 'skip' which the executor > code can use during a scan to advance to the next value (for some > prefix of the index's columns). That may be a terrible idea and > totally unnecessary... but let me explain my > reasoning: > > 1. Perhaps some index implementations can do something better than a > search for the next key value from the root. Is it possible or > desirable to use the current position as a starting point for a btree > traversal? I don't know. > > 2. It seemed that I'd need to create a new search ScanKey to use the > 'rescan' interface for skipping to the next value, but I already had > an insertion ScanKey so I wanted a way to just reuse that. But maybe > there is some other way to reuse existing index interfaces, or maybe > there is an easy way to make a new search ScanKey from the existing >insertion ScanKey? 
[1] https://www.postgresql.org/message-id/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Dec 01, 2020 at 10:59:22PM +0200, Heikki Linnakangas wrote:
> > > - Does this optimization apply to bitmap index scans?
> >
> > No, from what I understand it doesn't.
>
> Would it be hard to add? Don't need to solve everything in the first
> version of this, but I think in principle you could do the same
> optimization for bitmap index scans, so if the current API can't do it,
> that's maybe an indication that the API isn't quite right.

I would expect it should not be hard, as at the moment all the parts seem relatively generic. But of course I need to check; it seems no one had bitmap index scans in mind while developing this patch.

> > > I'm actually a bit confused why we need this condition. The IndexScan
> > > executor node should call amskip() only after checking the additional
> > > quals, no?
> >
> > This part I don't quite get, what exactly do you mean by checking the
> > additional quals in the executor node? But at the end of the day this
> > condition was implemented exactly to address the described issue, which
> > was found later and added to the tests.
>
> As I understand this, the executor logic goes like this:
>
> query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a like 'a%' and b = 'b';
>
> 1. Call index_beginscan, keys: a >= 'a', b = 'b'
> 2. Call index_getnext, which returns the first row to the Index Scan node
> 3. Evaluate the qual "c like '%y%'" on the tuple. If it's false, goto step 2 to get the next tuple.
> 4. Return tuple to parent node
> 5. index_amskip(), to the next tuple with a > 'a'. Goto 2.
>
> The logic should work fine, even if there are quals that are not
> indexable, like "c like '%y'" in the above example. So why doesn't it
> work? What am I missing?
To remind myself how it works I went through this sequence, and from what I understand the qual "c like '%y%'" is evaluated in this case in ExecQual, not after index_getnext_tid (and values returned after skipping are reported as filtered out). So when it comes to index_skip, only the quals on a & b have been evaluated. Or did you mean something else?

Another small detail is that in the current implementation there is no "goto 2" in the last step. Originally it was like that, but since skipping returns the exact position that we need, there was something like "find a value, then do one step back so that index_getnext will find it". Unfortunately this stepping-back part turned out to be a source of trouble, and getting rid of it even allowed making the code somewhat more concise. But of course I'm open to suggestions about improvements.
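The ordering question being debated here — does the non-indexable qual run before or after the skip? — can be made concrete with a toy model. This is only an illustration of the failure mode, with invented helper names, not the executor code: if the scan skips to the next distinct prefix before the filter qual has accepted a tuple, a group whose first tuple fails the filter is lost entirely, whereas the sequence that only skips after a tuple has actually been returned keeps it.

```python
def distinct_on_skip_before_qual(rows, qual):
    """Problematic ordering: skip to the next distinct value of column 0
    right after fetching the first tuple of each group, regardless of
    whether the filter qual accepted it."""
    out, i = [], 0
    while i < len(rows):
        first = rows[i]
        if qual(first):
            out.append(first)
        # skip the whole group even when `first` failed the qual
        while i < len(rows) and rows[i][0] == first[0]:
            i += 1
    return out

def distinct_on_qual_before_skip(rows, qual):
    """Safe ordering: evaluate the qual first, and skip only once a
    qualifying tuple has been returned for the current group."""
    out, i = [], 0
    while i < len(rows):
        if qual(rows[i]):
            out.append(rows[i])
            cur = rows[i][0]
            while i < len(rows) and rows[i][0] == cur:
                i += 1
        else:
            i += 1
    return out
```

With rows [(1, 'x'), (1, 'y'), (2, 'y')] and the filter "second column contains 'y'", the first ordering drops the a=1 group entirely (its first tuple fails the filter), while the second correctly returns (1, 'y') and (2, 'y').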
Re: Index Skip Scan (new UniqueKeys)
On 01/12/2020 22:21, Dmitry Dolgov wrote:
>> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
>>
>> I had a quick look at this patch. I haven't been following this thread,
>> so sorry if I'm repeating old arguments, but here we go:
>
> Thanks!
>
>> - I'm surprised you need a new index AM function (amskip) for this.
>> Can't you just restart the scan with index_rescan()? The btree AM can
>> check if the new keys are on the same page, and optimize the rescan
>> accordingly, like amskip does. That would speed up e.g. nested loop
>> scans too, where the keys just happen to be clustered.
>
> An interesting point. At the moment I'm not sure whether it's possible
> to implement skipping via index_rescan or not, need to take a look. But
> checking if the new keys are on the same page would introduce some
> overhead I guess, wouldn't it be too invasive to add it into already
> existing btree AM?

I think it'll be OK. But if it's not, you could add a hint argument to
index_rescan() to hint the index AM that the new key is known to be
greater than the previous key.

>> - Does this optimization apply to bitmap index scans?
>
> No, from what I understand it doesn't.

Would it be hard to add? Don't need to solve everything in the first
version of this, but I think in principle you could do the same
optimization for bitmap index scans, so if the current API can't do it,
that's maybe an indication that the API isn't quite right.

>> - This logic in build_index_paths() is not correct:
>>
>> + /*
>> +  * Skip scan is not supported when there are qual conditions, which are not
>> +  * covered by index. The reason for that is that those conditions are
>> +  * evaluated later, already after skipping was applied.
>> +  *
>> +  * TODO: This implementation is too restrictive, and doesn't allow e.g.
>> +  * index expressions. For that we need to examine index_clauses too.
>> +  */
>> + if (root->parse->jointree != NULL)
>> + {
>> +     ListCell *lc;
>> +
>> +     foreach(lc, (List *) root->parse->jointree->quals)
>> +     {
>> +         Node *expr, *qual = (Node *) lfirst(lc);
>> +         Var *var;
>> +         bool found = false;
>> +
>> +         if (!is_opclause(qual))
>> +         {
>> +             not_empty_qual = true;
>> +             break;
>> +         }
>> +
>> +         expr = get_leftop(qual);
>> +
>> +         if (!IsA(expr, Var))
>> +         {
>> +             not_empty_qual = true;
>> +             break;
>> +         }
>> +
>> +         var = (Var *) expr;
>> +
>> +         for (int i = 0; i < index->ncolumns; i++)
>> +         {
>> +             if (index->indexkeys[i] == var->varattno)
>> +             {
>> +                 found = true;
>> +                 break;
>> +             }
>> +         }
>> +
>> +         if (!found)
>> +         {
>> +             not_empty_qual = true;
>> +             break;
>> +         }
>> +     }
>> + }
>>
>> If you care whether the qual is evaluated by the index AM or not, you
>> need to also check that the operator is indexable. Attached is a query
>> that demonstrates that problem.
>> ...
>> Also, you should probably check that the index quals are in the operator
>> family as that used for the DISTINCT.
>
> Yes, good point, will change this in the next version.
>
>> I'm actually a bit confused why we need this condition. The IndexScan
>> executor node should call amskip() only after checking the additional
>> quals, no?
>
> This part I don't quite get, what exactly you mean by checking the
> additional quals in the executor node? But at the end of the day this
> condition was implemented exactly to address the described issue, which
> was found later and added to the tests.

As I understand this, the executor logic goes like this:

query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a like 'a%' and b = 'b';

1. Call index_beginscan, keys: a >= 'a', b = 'b'
2. Call index_getnext, which returns first row to the Index Scan node
3. Evaluate the qual "c like '%y%'" on the tuple. If it's false, goto step 2 to get the next tuple.
4. Return tuple to parent node
5. index_amskip(), to the next tuple with a > 'a'. Goto 2.

The logic should work fine, even if there are quals that are not indexable,
like "c like '%y'" in the above example. So why doesn't it work? What am I
missing?
Re: Index Skip Scan (new UniqueKeys)
> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
>
> I had a quick look at this patch. I haven't been following this thread,
> so sorry if I'm repeating old arguments, but here we go:

Thanks!

> - I'm surprised you need a new index AM function (amskip) for this.
> Can't you just restart the scan with index_rescan()? The btree AM can
> check if the new keys are on the same page, and optimize the rescan
> accordingly, like amskip does. That would speed up e.g. nested loop
> scans too, where the keys just happen to be clustered.

An interesting point. At the moment I'm not sure whether it's possible
to implement skipping via index_rescan or not, need to take a look. But
checking if the new keys are on the same page would introduce some
overhead I guess, wouldn't it be too invasive to add it into already
existing btree AM?

> - Does this optimization apply to bitmap index scans?

No, from what I understand it doesn't.

> - This logic in build_index_paths() is not correct:
>
> > + /*
> > +  * Skip scan is not supported when there are qual conditions, which are not
> > +  * covered by index. The reason for that is that those conditions are
> > +  * evaluated later, already after skipping was applied.
> > +  *
> > +  * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > +  * index expressions. For that we need to examine index_clauses too.
> > +  */
> > + if (root->parse->jointree != NULL)
> > + {
> > +     ListCell *lc;
> > +
> > +     foreach(lc, (List *) root->parse->jointree->quals)
> > +     {
> > +         Node *expr, *qual = (Node *) lfirst(lc);
> > +         Var *var;
> > +         bool found = false;
> > +
> > +         if (!is_opclause(qual))
> > +         {
> > +             not_empty_qual = true;
> > +             break;
> > +         }
> > +
> > +         expr = get_leftop(qual);
> > +
> > +         if (!IsA(expr, Var))
> > +         {
> > +             not_empty_qual = true;
> > +             break;
> > +         }
> > +
> > +         var = (Var *) expr;
> > +
> > +         for (int i = 0; i < index->ncolumns; i++)
> > +         {
> > +             if (index->indexkeys[i] == var->varattno)
> > +             {
> > +                 found = true;
> > +                 break;
> > +             }
> > +         }
> > +
> > +         if (!found)
> > +         {
> > +             not_empty_qual = true;
> > +             break;
> > +         }
> > +     }
> > + }
>
> If you care whether the qual is evaluated by the index AM or not, you need
> to also check that the operator is indexable. Attached is a query that
> demonstrates that problem.
> ...
> Also, you should probably check that the index quals are in the operator
> family as that used for the DISTINCT.

Yes, good point, will change this in the next version.

> I'm actually a bit confused why we need this condition. The IndexScan
> executor node should call amskip() only after checking the additional quals,
> no?

This part I don't quite get, what exactly you mean by checking the
additional quals in the executor node? But at the end of the day this
condition was implemented exactly to address the described issue, which
was found later and added to the tests.
Re: Index Skip Scan (new UniqueKeys)
On 24/10/2020 19:45, Dmitry Dolgov wrote:
> Here is a new version which doesn't require "scanstart" argument and
> contains few other changes to address the issues mentioned earlier. It's
> also based on the latest UniqueKeys patches with the valgrind issue fixed
> (as before they're attached also just for the references, you can find
> more in the original thread). I didn't rework commentaries yet, will post
> it soon (need to get an inspiration first, probably via reading
> Shakespeare unless someone has better suggestions).

I had a quick look at this patch. I haven't been following this thread,
so sorry if I'm repeating old arguments, but here we go:

- I'm surprised you need a new index AM function (amskip) for this.
  Can't you just restart the scan with index_rescan()? The btree AM can
  check if the new keys are on the same page, and optimize the rescan
  accordingly, like amskip does. That would speed up e.g. nested loop
  scans too, where the keys just happen to be clustered.

- Does this optimization apply to bitmap index scans?

- This logic in build_index_paths() is not correct:

+ /*
+  * Skip scan is not supported when there are qual conditions, which are not
+  * covered by index. The reason for that is that those conditions are
+  * evaluated later, already after skipping was applied.
+  *
+  * TODO: This implementation is too restrictive, and doesn't allow e.g.
+  * index expressions. For that we need to examine index_clauses too.
+  */
+ if (root->parse->jointree != NULL)
+ {
+     ListCell *lc;
+
+     foreach(lc, (List *) root->parse->jointree->quals)
+     {
+         Node *expr, *qual = (Node *) lfirst(lc);
+         Var *var;
+         bool found = false;
+
+         if (!is_opclause(qual))
+         {
+             not_empty_qual = true;
+             break;
+         }
+
+         expr = get_leftop(qual);
+
+         if (!IsA(expr, Var))
+         {
+             not_empty_qual = true;
+             break;
+         }
+
+         var = (Var *) expr;
+
+         for (int i = 0; i < index->ncolumns; i++)
+         {
+             if (index->indexkeys[i] == var->varattno)
+             {
+                 found = true;
+                 break;
+             }
+         }
+
+         if (!found)
+         {
+             not_empty_qual = true;
+             break;
+         }
+     }
+ }

If you care whether the qual is evaluated by the index AM or not, you
need to also check that the operator is indexable. Attached is a query
that demonstrates that problem.

I'm actually a bit confused why we need this condition. The IndexScan
executor node should call amskip() only after checking the additional
quals, no?

Also, you should probably check that the index quals are in the operator
family as that used for the DISTINCT.

- Heikki

index-skipscan-bug.sql
Description: application/sql
Re: Index Skip Scan (new UniqueKeys)
> On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
>
> * I see the following compiler warning:
>
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> In function ‘populate_baserel_uniquekeys’:
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> warning: ‘expr’ may be used uninitialized in this function
> [-Wmaybe-uninitialized]
>   797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
>       |       ^~

This is mostly for the UniqueKeys patch, which is attached here only as a
dependency, but I'll prepare changes for that. Interestingly enough, I
can't reproduce this warning, but if I understand correctly gcc has some
history of spurious uninitialized warnings, so I guess it could be
version dependent.

> * Perhaps the warning is related to this nearby code that I noticed
> Valgrind complains about:
>
> ==1083468== VALGRINDERROR-BEGIN
> ==1083468== Invalid read of size 4
> ==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> ==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)

This also belongs to the UniqueKeys patch, but at least I can reproduce
this one. My guess is that nkeycolumns should be used there, not ncolumns,
which is visible in the index_including tests. Same as the previous one, I
will prepare corresponding changes.

> * Do we really need the AM-level boolean flag/argument named
> "scanstart"? Why not just follow the example of btgettuple(), which
> determines whether or not the scan has been initialized based on the
> current scan position?
>
> Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> cannot or should not take the same approach as btgettuple(). And even
> if you can't take exactly the same approach, I would still think that
> the scan's opaque B-Tree state should remember if it's the first call
> to _bt_skip() (rather than some subsequent call) in some other way
> (e.g. carrying a "scanstart" bool flag directly).
Yes, agree, carrying this flag inside the opaque state would be better.

> * Why is it okay to do anything important based on the
> _bt_scankey_within_page() return value?
>
> If the page is empty, then how can we know that it's okay to go to the
> next value? I'm concerned that there could be subtle bugs in this
> area. VACUUM will usually just delete the empty page. But it won't
> always do so, for a variety of reasons that aren't worth going into
> now. This could mask bugs in this area. I'm concerned about patterns
> like this one from _bt_skip():
>
>     while (!nextFound)
>     {
>         ...
>         if (_bt_scankey_within_page(scan, so->skipScanKey,
>                                     so->currPos.buf, dir))
>         {
>             ...
>         }
>         else
>             /*
>              * If startItup could be not found within the current page,
>              * assume we found something new
>              */
>             nextFound = true;
>         ...
>     }
>
> Why would you assume that "we found something new" here? In general I
> just don't understand the design of _bt_skip(). I get the basic idea
> of what you're trying to do, but it could really use better comments.

Yeah, I'll put more effort into clear comments. There are two different
ways in which _bt_scankey_within_page is being used. The first one is to
check if it's possible to skip traversal of the tree from the root, in
case what we're looking for could be on the current page. Here an empty
page would mean we need to search from the root, so I'm not sure what the
issue could be?

The second one (that you've highlighted above) I admit is probably the
most questionable part of the patch, and open for suggestions on how to
improve it. It's required for one particular case with a cursor, when the
scan advances forward but reads backward. What could happen here is that
we found one valid item, but the next one e.g. does not pass the scan key
conditions, and we end up with the previous item again. I'm not entirely
sure how the presence of an empty page could change this scenario, could
you please show an example?
> * The "jump one more time if it's the same as at the beginning" thing
> seems scary to me. Maybe you should be doing something with the actual
> high key here.

Same as for the previous question: can you give a hint about what you
mean by "doing something with the actual high key"?
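The first use of _bt_scankey_within_page described above — deciding whether a descent from the root can be skipped because the wanted key may still fall on the current leaf page — amounts to a bounds check against the page's first item and its high key. A rough sketch over plain comparable values (a toy model with invented names, not the btree code):

```python
def key_within_page(page_items, high_key, key):
    """Model of the 'can we stay on this leaf page?' test: the key must
    sort at or after the page's first item and strictly below the page's
    high key (the exclusive upper bound for keys stored on the page).
    An empty page can answer nothing, so the caller must fall back to a
    full descent from the root."""
    if not page_items:
        return False
    # The rightmost leaf page has no high key; only the lower bound applies.
    if high_key is not None and key >= high_key:
        return False
    return key >= page_items[0]
```

This also hints at why Peter's empty-page question matters: the function can only say "not answerable here" for an empty page, and any caller treating that as "the key is elsewhere, keep going" would be reading more into the return value than it carries.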
Re: Index Skip Scan (new UniqueKeys)
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan wrote:
> That's all I have for now.

One more thing. I don't think that this should be a bitwise AND:

    if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE))
    {
    }

-- 
Peter Geoghegan
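The practical difference is easy to show in miniature: C's `&` evaluates both operands unconditionally, while `&&` short-circuits. Python's `and` behaves like `&&` and its `&` like the bitwise operator, so it can stand in for the C semantics here (an illustration only, with invented helper names):

```python
calls = []

def offnum_exceeds_maxoff():
    # stand-in for "offnum > maxoff", deliberately False
    calls.append('left')
    return False

def next_page_is_none():
    # stand-in for "so->currPos.nextPage == P_NONE"
    calls.append('right')
    return True

# Bitwise &: both operands are always evaluated, even though the
# left one is already False.
_ = offnum_exceeds_maxoff() & next_page_is_none()
assert calls == ['left', 'right']

calls.clear()
# Logical and (C's &&): the right side is skipped when the left is False.
_ = offnum_exceeds_maxoff() and next_page_is_none()
assert calls == ['left']
```

In the quoted C snippet the result happens to be the same either way, but `&` both defeats short-circuit evaluation (a problem if the right operand is only valid to evaluate when the left holds) and signals the wrong intent to readers.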
Re: Index Skip Scan (new UniqueKeys)
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Here is a new version that hopefully address most of the concerns
> mentioned in this thread so far. As before, first two patches are taken
> from UniqueKeys thread and attached only for the reference. List of
> changes includes:

Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):

* I see the following compiler warning:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
      |       ^~

* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:

==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468==    by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468==    by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468==    by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468==    by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468==    by 0x5AB426: query_planner (planmain.c:269)
==1083468==    by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468==    by 0x5AD202: subquery_planner (planner.c:1015)
==1083468==    by 0x5ABABF: standard_planner (planner.c:405)
==1083468==    by 0x5AB7F8: planner (planner.c:275)
==1083468==    by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468==    by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468==    by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468==    by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==    by 0x624284: BackendRun (postmaster.c:4541)
==1083468==    by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==    by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==    by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==    by 0x514AF2: main (main.c:209)
==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size 8,192 alloc'd
==1083468==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468==    by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468==    by 0x8CEA52: palloc (mcxt.c:964)
==1083468==    by 0x267F25: systable_beginscan (genam.c:373)
==1083468==    by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468==    by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468==    by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468==    by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468==    by 0x8BD482: check_enable_rls (rls.c:66)
==1083468==    by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468==    by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468==    by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468==    by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468==    by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468==    by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468==    by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==    by 0x624284: BackendRun (postmaster.c:4541)
==1083468==    by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==    by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==    by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END

(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)

* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only scans.

* Do we really need the AM-level boolean flag/argument named "scanstart"?
Why not just follow the example of btgettuple(), which determines whether
or not the scan has been initialized based on the current scan position?

Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even if
you can't take exactly the same approach, I would still think that the
scan's opaque B-Tree state should remember if it's the first call to
_bt_skip() (rather than some subsequent call) in some other way (e.g.
carrying a "scanstart" bool flag directly).

A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of indentation...
which makes it even more difficult to follow.

* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data item
Re: Index Skip Scan (new UniqueKeys)
Hi David:

Thanks for looking into this.

On Fri, Jul 31, 2020 at 11:07 AM David Rowley wrote:
> On Mon, 13 Jul 2020 at 10:18, Floris Van Nee wrote:
> > One question about the unique keys - probably for Andy or David: I've
> > looked in the archives to find arguments for/against using Expr nodes
> > or EquivalenceClasses in the Unique Keys patch. However, I couldn't
> > really find a clear answer about why the current patch uses Expr
> > rather than EquivalenceClasses. At some point David mentioned "that
> > probably Expr nodes were needed rather than EquivalenceClasses", but
> > it's not really clear to me why. What were the thoughts behind this?
>
> I'm still not quite sure on this either way. I did think
> EquivalenceClasses were more suitable before I wrote the POC patch for
> unique keys. But after that, I had in mind that Exprs might be better.
> The reason I thought this was due to the fact that the DISTINCT clause
> list is a bunch of Exprs, and if the UniqueKeys were EquivalenceClasses
> then checking to see if the DISTINCT can be skipped turned into
> something more complex that required looking through lists of
> ec_members rather than just checking if the uniquekey exprs were a
> subset of the DISTINCT clause.
>
> Thinking about it a bit harder, if we did use Exprs then it would mean
> a case like the following wouldn't work for Andy's DISTINCT no-op stuff:
>
> CREATE TABLE xy (x int primary key, y int not null);
>
> SELECT DISTINCT y FROM xy WHERE x=y;
>
> whereas if we use EquivalenceClasses then we'll find that we have an
> EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
> containing that EquivalenceClass.
>
> Also, looking at what Andy wrote to make a case like the following
> work in his populate_baserel_uniquekeys() function in the 0002 patch:
>
> CREATE TABLE ab (a int, b int, primary key(a,b));
> SELECT DISTINCT a FROM ab WHERE b = 1;
>
> it's a bit uninspiring.
> Really what we want here when checking if we can skip doing the
> DISTINCT is a UniqueKey set using EquivalenceClasses, as we can just
> insist that any unmatched UniqueKey items have ec_is_const == true.
> However, that means we have to loop through the ec_members of the
> EquivalenceClasses in the uniquekeys during the DISTINCT check. That's
> particularly bad when you consider that in a partitioned table case
> there might be an ec_member for each child partition and there could be
> 1000s of child partitions, and following those ec_members chains is
> going to be too slow.
>
> My current thoughts are that we should be using EquivalenceClasses but
> we should first add some infrastructure to make them perform better.
> My current thoughts are that we do something like what I mentioned in
> [1] or something more like what Andres mentions in [2]. After that, we
> could either make EquivalenceClass.ec_members a hash table or binary
> search tree. Or even perhaps just have a single hash table/BST for all
> EquivalenceClasses that allows very fast lookups from {Expr} ->
> {EquivalenceClass}. I think an Expr can only belong in a single
> non-merged EquivalenceClass. So when we do merging of
> EquivalenceClasses we could just repoint that data structure to point
> to the new EquivalenceClass. We'd never point to ones that have
> ec_merged != NULL. This would also allow us to fix the poor performance
> in regards to get_eclass_for_sort_expr() for partitioned tables.
>
> So, it seems the patch dependency chain for skip scans just got a bit
> longer :-(

I admit that EquivalenceClasses have better expressive power. There are
two more cases we can handle better with EquivalenceClasses:

SELECT DISTINCT a, b, c FROM t WHERE a = b;

Currently the UniqueKey is (a, b, c), but it would better be (a, c) and
(b, c). A similar case happens for GROUP BY. After realizing this, I am
still hesitant to do that, due to the complexity.

If we do that, we may have to maintain EquivalenceClasses in one more
place, or make the existing EquivalenceClasses list longer, for example:

SELECT pk FROM t;

The current infrastructure doesn't create any EquivalenceClass for pk, so
we would have to create a new one in this case and reuse some existing
ones in other cases. Finally, since EquivalenceClasses are not directly
visible to the upper layers, we would depend on the infrastructure change
to look up an EquivalenceClass quickly from an Expr.

I thought more about the case you provided above. IIUC, there is such an
issue for joinrels, so we could just add an EC check to
populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we should
build the UniqueKeys from root->distinct_pathkeys and
root->group_pathkeys, where the EquivalenceClasses are already there.

I am still not insisting on either Expr or EquivalenceClasses right now;
if we need to change it to EquivalenceClasses, I'd check whether more
places need to be taken care of before doing that.

-- 
Best Regards
Andy Fan
Re: Index Skip Scan (new UniqueKeys)
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee wrote:
> One question about the unique keys - probably for Andy or David: I've
> looked in the archives to find arguments for/against using Expr nodes or
> EquivalenceClasses in the Unique Keys patch. However, I couldn't really
> find a clear answer about why the current patch uses Expr rather than
> EquivalenceClasses. At some point David mentioned "that probably Expr
> nodes were needed rather than EquivalenceClasses", but it's not really
> clear to me why. What were the thoughts behind this?

I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be better. The
reason I thought this was due to the fact that the DISTINCT clause list is
a bunch of Exprs, and if the UniqueKeys were EquivalenceClasses then
checking to see if the DISTINCT can be skipped turned into something more
complex that required looking through lists of ec_members rather than just
checking if the uniquekey exprs were a subset of the DISTINCT clause.

Thinking about it a bit harder, if we did use Exprs then it would mean a
case like the following wouldn't work for Andy's DISTINCT no-op stuff:

CREATE TABLE xy (x int primary key, y int not null);

SELECT DISTINCT y FROM xy WHERE x=y;

whereas if we use EquivalenceClasses then we'll find that we have an EC
with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.

Also, looking at what Andy wrote to make a case like the following work
in his populate_baserel_uniquekeys() function in the 0002 patch:

CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;

it's a bit uninspiring. Really what we want here when checking if we can
skip doing the DISTINCT is a UniqueKey set using EquivalenceClasses, as we
can just insist that any unmatched UniqueKey items have ec_is_const ==
true. However, that means we have to loop through the ec_members of the
EquivalenceClasses in the uniquekeys during the DISTINCT check. That's
particularly bad when you consider that in a partitioned table case there
might be an ec_member for each child partition and there could be 1000s of
child partitions, and following those ec_members chains is going to be too
slow.

My current thoughts are that we should be using EquivalenceClasses but we
should first add some infrastructure to make them perform better. My
current thoughts are that we do something like what I mentioned in [1] or
something more like what Andres mentions in [2]. After that, we could
either make EquivalenceClass.ec_members a hash table or binary search
tree. Or even perhaps just have a single hash table/BST for all
EquivalenceClasses that allows very fast lookups from {Expr} ->
{EquivalenceClass}. I think an Expr can only belong in a single non-merged
EquivalenceClass. So when we do merging of EquivalenceClasses we could
just repoint that data structure to point to the new EquivalenceClass.
We'd never point to ones that have ec_merged != NULL. This would also
allow us to fix the poor performance in regards to
get_eclass_for_sort_expr() for partitioned tables.

So, it seems the patch dependency chain for skip scans just got a bit
longer :-(

David

[1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c
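The trade-off David describes — a bare-Expr subset check is cheap but blind to x = y equivalences, while EquivalenceClasses catch them at the cost of walking ec_members — can be sketched with a minimal union-find standing in for the planner's EquivalenceClass machinery. This is a hypothetical toy over expression names, not the PostgreSQL data structures:

```python
class EquivalenceClasses:
    """Union-find over expression names: merge() records a qual like
    "x = y", find() returns a canonical representative per class."""
    def __init__(self):
        self.parent = {}

    def find(self, e):
        self.parent.setdefault(e, e)
        while self.parent[e] != e:
            # path halving keeps the trees shallow
            self.parent[e] = self.parent[self.parent[e]]
            e = self.parent[e]
        return e

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def distinct_is_noop(unique_keys, distinct_exprs, ecs):
    """DISTINCT is redundant if some unique key, mapped through the
    equivalence classes, is a subset of the DISTINCT expressions."""
    d = {ecs.find(e) for e in distinct_exprs}
    return any({ecs.find(e) for e in key} <= d for key in unique_keys)
```

With the qual x = y merged, the unique key {x} from the primary key covers SELECT DISTINCT y (David's xy example); without the merge — which is what a plain Expr comparison amounts to — the same subset test fails.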
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote: > > > Well, it's obviously wrong, thanks for noticing. What is necessary is to > > compare two index tuples, the start and the next one, to test if they're > > the same (in which case if I'm not mistaken probably we can compare item > > pointers). I've got this question when I was about to post a new version > > with changes to address feedback from Andy, now I'll combine them and > > send a cumulative patch. > > This sounds like approximately the same problem as the one that > _bt_killitems() has to deal with as of Postgres 13. This is handled in > a way that is admittedly pretty tricky, even though the code does not > need to be 100% certain that it's "the same" tuple. Deduplication kind > of makes that a fuzzy concept. In principle there could be one big > index tuple instead of 5 tuples, even though the logical contents of > the page have not been changed between the time we recording heap TIDs > in local and the time _bt_killitems() tried to match on those heap > TIDs to kill_prior_tuple-kill some index tuples -- a concurrent > deduplication pass could do that. Your code needs to be prepared for > stuff like that. > > Ultimately posting list tuples are just a matter of understanding the > on-disk representation -- a "Small Matter of Programming". Even > without deduplication there are potential hazards from the physical > deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is > not code that runs in VACUUM, despite the name). Make sure that you > hold a buffer pin on the leaf page throughout, because you need to do > that to make sure that VACUUM cannot concurrently recycle heap TIDs. > If VACUUM *is* able to concurrently recycle heap TIDs then it'll be > subtly broken. _bt_killitems() is safe because it either holds on to a > pin or gives up when the LSN changes at all. 
(ISTM that your only > choice is to hold on to a leaf page pin, since you cannot just decide > to give up in the way that _bt_killitems() sometimes can.) I see, thanks for clarification. You're right, in this part of implementation there is no way to give up if LSN changes like _bt_killitems does. As far as I can see the leaf page is already pinned all the time between reading relevant tuples and comparing them, I only need to handle posting list tuples.
RE: Index Skip Scan (new UniqueKeys)
> One UniqueKey can have multiple corresponding expressions, which gives
> us also the possibility of having one unique key with (t1.a, t2.a), and
> it looks now similar to EquivalenceClass.

I believe the current definition of a unique key with two expressions
(t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives
weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

> The idea behind this query sounds questionable to me, more transparent
> would be to do this without distinct, skipping will actually do exactly
> the same stuff just under another name. But if allowing skipping on
> constants does not bring significant changes in the code probably it's
> fine.

Yeah indeed, I didn't say it's a query that people should generally
write. :-) It's better to write as a regular SELECT with LIMIT 1 of
course. However, it's more to be consistent and predictable to the user:
if a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make
sense to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs
slow. And to support it also makes the implementation more consistent
with little code changes.

> Yeah, there's definitely some double work there, but the actual impact
> may be limited - it doesn't actually allocate a new path key, but it
> looks it up in root->canon_pathkeys and returns that path key.
>
> I wrote it like this, because I couldn't find a way to identify from a
> certain PathKey the actual location in the index of that column. The
> constructed path keys list filters out all redundant path keys. An index
> on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we
> actually need to use prefix=3. But how to get from PathKey=b to that
> number 3, I didn't find a solid way except doing this. Maybe there is
> though?
> > I don't think there is a direct way, but why not modify build_index_paths to > also provide this information, or compare index_pathkeys expressions with > indextlist without actually create those pathkeys again? > I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later. > And couple of words about this thread [1]. It looks to me like a strange way > of interacting with the community. Are you going to duplicate there > everything, or what are your plans? At the very least you could try to include > everyone involved in the recipients list, not exclude some of the authors. > When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone. In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches. [1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 14, 2020 at 06:18:50PM +, Floris Van Nee wrote: > > Due to the other changes I made in > create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan > now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to > give an actual failing test case now. However, since all code filters out > constant keys, I think uniqueness should do it too - otherwise you could get > into problems later on. It's also more consistent. If you already know > something is unique by just (b), it doesn't make sense to store that it's > unique by (a,b). Now that I think of it, the best place to do this > EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, > rather than build_uniquekeys though. It's probably good to move it there. That would be my suggestion as well. > > Along the lines I'm also curious about this part: > > > > - ListCell *k; > > - List *exprs = NIL; > > - > > - foreach(k, ec->ec_members) > > - { > > - EquivalenceMember *mem = (EquivalenceMember *) > > lfirst(k); > > - exprs = lappend(exprs, mem->em_expr); > > - } > > - > > - result = lappend(result, makeUniqueKey(exprs, false, false)); > > + EquivalenceMember *mem = (EquivalenceMember*) > > +lfirst(list_head(ec->ec_members)); > > > > I'm curious about this myself, maybe someone can clarify. It looks like > > generaly speaking there could be more than one member (if not > > ec_has_volatile), which "representing knowledge that multiple items are > > effectively equal". Is this information is not interesting enough to > > preserve it > > in unique keys? > > Yeah, that's a good question. Hence my question about the choice for Expr > rather than EquivalenceClass for the Unique Keys patch to Andy/David. When > storing just Expr, it is rather difficult to check equivalence in joins for > example. Suppose, later on we decide to add join support to the distinct skip > scan. 
Consider a query like this: > SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a > As far as my understanding goes (I didn't look into it in detail though), I > think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That > results in a UniqueKey with expr (t1.a) (because currently we only take the > first Expr in the list to construct the UniqueKey). We could also construct > *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't > think that's the correct approach either, as it will explode when you have > multiple pathkeys, each having multiple Expr inside their EqClasses. One UniqueKey can have multiple corresponding expressions, which gives us also possibility of having one unique key with (t1.a, t2.a) and it looks now similar to EquivalenceClass. > > > - the distinct_pathkeys may be NULL, even though there's a possibility for > > skipping. But it wouldn't create the uniquekeys in this case. This makes the > > planner not choose skip scans even though it could. For example in queries > > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a > > is constant, it's eliminated from regular pathkeys. > > > > What would be the point of skipping if it's a constant? > > For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; > There may be 1000s of records with a=1. We're only interested in the first > one though. The traditional non-skip approach would still scan all records > with a=1. Skip would just fetch the first one with a=1 and then skip to the > next prefix and stop the scan. The idea behind this query sounds questionable to me, more transparent would be to do this without distinct, skipping will actually do exactly the same stuff just under another name. But if allowing skipping on constants do not bring significant changes in the code probably it's fine. 
> > > - to combat the issues mentioned earlier, there's now a check in > > build_index_paths that checks if the query_pathkeys matches the > > useful_pathkeys. Note that we have to use the path keys here rather than > > any of the unique keys. The unique keys are only Expr nodes - they do not > > contain the necessary information about ordering. Due to elimination of > > some constant path keys, we have to search the attributes of the index to > > find the correct prefix to use in skipping. > > > > IIUC here you mean this function, right? > > > > + prefix = find_index_prefix_for_pathkey(root, > > + > > index, > > + > > BackwardScanDirection, > > + > > llast_node(PathKey, > > + > > root->distinct_pathkeys)); > > > > Doesn't it duplicate the job already done in build_index_pathkeys by > > building > > those pathkeys again? If yes, probably it's possible to reuse > > useful_pathkeys. > > Not sure about unordered indexes, but looks like query_pathkeys should > > also match in this case. > > > > Yeah, there's definitely some double work there, but the actual impact may be > limited - it doesn't actually allocate a new path key, but it looks
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > + currItem = &so->currPos.items[so->currPos.lastItem]; > > > + itup = (IndexTuple) (so->currTuples + > > > currItem->tupleOffset); > > > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid); > > Do you mean this last part with t_tid, which could also have a tid array > in case of posting tuple format? Yeah. There is a TID array at the end of the tuple when the tuple is a posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't safe to assume that t_tid is a heap TID for this reason, even in code that only ever considers data items (that is, non high-key tuples AKA non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot be posting list tuples either, so you'll see the assumption that t_tid is just a heap TID in parts of nbtinsert.c -- though only for the new/incoming item.) > Well, it's obviously wrong, thanks for noticing. What is necessary is to > compare two index tuples, the start and the next one, to test if they're > the same (in which case if I'm not mistaken probably we can compare item > pointers). I've got this question when I was about to post a new version > with changes to address feedback from Andy, now I'll combine them and > send a cumulative patch. This sounds like approximately the same problem as the one that _bt_killitems() has to deal with as of Postgres 13. This is handled in a way that is admittedly pretty tricky, even though the code does not need to be 100% certain that it's "the same" tuple. Deduplication kind of makes that a fuzzy concept. In principle there could be one big index tuple instead of 5 tuples, even though the logical contents of the page have not been changed between the time we recorded heap TIDs in local memory and the time _bt_killitems() tried to match on those heap TIDs to kill_prior_tuple-kill some index tuples -- a concurrent deduplication pass could do that. Your code needs to be prepared for stuff like that.
Ultimately posting list tuples are just a matter of understanding the on-disk representation -- a "Small Matter of Programming". Even without deduplication there are potential hazards from the physical deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not code that runs in VACUUM, despite the name). Make sure that you hold a buffer pin on the leaf page throughout, because you need to do that to make sure that VACUUM cannot concurrently recycle heap TIDs. If VACUUM *is* able to concurrently recycle heap TIDs then it'll be subtly broken. _bt_killitems() is safe because it either holds on to a pin or gives up when the LSN changes at all. (ISTM that your only choice is to hold on to a leaf page pin, since you cannot just decide to give up in the way that _bt_killitems() sometimes can.) Note that the rules surrounding buffer locks/pins for nbtree were tightened up a tiny bit today -- see commit 4a70f829. Also, it's no longer okay to use raw LockBuffer() calls in nbtree, so you're going to have to fix that up when you next rebase -- you must use the new _bt_lockbuf() wrapper function instead, so that the new Valgrind instrumentation is used. This shouldn't be hard. Perhaps you can use Valgrind to verify that this patch doesn't have any unsafe buffer accesses. I recall problems like that in earlier versions of the patch series. Valgrind should be able to detect most bugs like that (though see comments within _bt_lockbuf() for details of a bug in this area that Valgrind cannot detect even now). -- Peter Geoghegan
RE: Index Skip Scan (new UniqueKeys)
> > > - the uniquekeys that is built, still contains some redundant keys, that are > normally eliminated from the path keys lists. > > I guess you're talking about: > > + if (EC_MUST_BE_REDUNDANT(ec)) > + continue; > > Can you add some test cases to your changes to show the effect of it? It > seem to me redundant keys are already eliminated at this point by either > make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be > I've missed something. > The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like: SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c All the regular path keys filter out 'a' for constantness here - so they would end up with a distinct pathkeys of (b) and sort path keys of (b,c). The unique keys would end up with (a,b) though. In the previous patch, it'd compared in create_distinct_paths, the pathkeys to the unique keys, so it wouldn't consider the skip scan. Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there. 
> Along the lines I'm also curious about this part: > > - ListCell *k; > - List *exprs = NIL; > - > - foreach(k, ec->ec_members) > - { > - EquivalenceMember *mem = (EquivalenceMember *) > lfirst(k); > - exprs = lappend(exprs, mem->em_expr); > - } > - > - result = lappend(result, makeUniqueKey(exprs, false, false)); > + EquivalenceMember *mem = (EquivalenceMember*) > +lfirst(list_head(ec->ec_members)); > > I'm curious about this myself, maybe someone can clarify. It looks like > generaly speaking there could be more than one member (if not > ec_has_volatile), which "representing knowledge that multiple items are > effectively equal". Is this information is not interesting enough to preserve > it > in unique keys? Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this: SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses. That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions). > > > - the distinct_pathkeys may be NULL, even though there's a possibility for > skipping. But it wouldn't create the uniquekeys in this case. 
This makes the > planner not choose skip scans even though it could. For example in queries > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a > is constant, it's eliminated from regular pathkeys. > > What would be the point of skipping if it's a constant? For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan. > > > - to combat the issues mentioned earlier, there's now a check in > build_index_paths that checks if the query_pathkeys matches the > useful_pathkeys. Note that we have to use the path keys here rather than > any of the unique keys. The unique keys are only Expr nodes - they do not > contain the necessary information about ordering. Due to elimination of > some constant path keys, we have to search the attributes of the index to > find the correct prefix to use in skipping. > > IIUC here you mean this function, right? > > + prefix = find_index_prefix_for_pathkey(root,
Re: Index Skip Scan (new UniqueKeys)
> On Sun, Jul 12, 2020 at 12:48:47PM +, Floris Van Nee wrote: > > > > Good point, thanks for looking at this. With the latest planner version > > there > > are indeed more possibilities to use skipping. It never occured to me that > > some of those paths will still rely on index scan returning full data set. > > I'll look > > in details and add verification to prevent putting something like this on > > top of > > skip scan in the next version. > > I believe the required changes are something like in attached patch. There > were a few things I've changed: > - build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, > it would create two unique keys, one with a and one with b. However, it > should be one unique key with (a,b). Yes, I've also noticed that while preparing fix for index scan not covered by index and included it. > - the uniquekeys that is built, still contains some redundant keys, that are > normally eliminated from the path keys lists. I guess you're talking about: + if (EC_MUST_BE_REDUNDANT(ec)) + continue; Can you add some test cases to your changes to show the effect of it? It seem to me redundant keys are already eliminated at this point by either make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be I've missed something. Along the lines I'm also curious about this part: - ListCell *k; - List *exprs = NIL; - - foreach(k, ec->ec_members) - { - EquivalenceMember *mem = (EquivalenceMember *) lfirst(k); - exprs = lappend(exprs, mem->em_expr); - } - - result = lappend(result, makeUniqueKey(exprs, false, false)); + EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members)); I'm curious about this myself, maybe someone can clarify. It looks like generaly speaking there could be more than one member (if not ec_has_volatile), which "representing knowledge that multiple items are effectively equal". Is this information is not interesting enough to preserve it in unique keys? 
> - the distinct_pathkeys may be NULL, even though there's a possibility for > skipping. But it wouldn't create the uniquekeys in this case. This makes the > planner not choose skip scans even though it could. For example in queries > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is > constant, it's eliminated from regular pathkeys. What would be the point of skipping if it's a constant? > - to combat the issues mentioned earlier, there's now a check in > build_index_paths that checks if the query_pathkeys matches the > useful_pathkeys. Note that we have to use the path keys here rather than any > of the unique keys. The unique keys are only Expr nodes - they do not contain > the necessary information about ordering. Due to elimination of some constant > path keys, we have to search the attributes of the index to find the correct > prefix to use in skipping. IIUC here you mean this function, right? + prefix = find_index_prefix_for_pathkey(root, + index, + BackwardScanDirection, + llast_node(PathKey, + root->distinct_pathkeys)); Doesn't it duplicate the job already done in build_index_pathkeys by building those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys. Not sure about unordered indexes, but looks like query_pathkeys should also match in this case. Will also look at the follow up questions in the next email.
Re: Index Skip Scan (new UniqueKeys)
> On Fri, Jul 10, 2020 at 05:03:37PM +, Floris Van Nee wrote: > > Also took another look at the patch now, and found a case of incorrect > data. It looks related to the new way of creating the paths, as I > can't recall seeing this in earlier versions. > > create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, > 10) a, generate_series(1,100) b; > create index on t1 (a,b,c); > > postgres=# explain select distinct on (a) * from t1 order by a,b desc,c; > QUERY PLAN > --- > Sort (cost=2.92..2.94 rows=10 width=20) >Sort Key: a, b DESC, c >-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20) > Skip scan: true > (4 rows) Good point, thanks for looking at this. With the latest planner version there are indeed more possibilities to use skipping. It never occurred to me that some of those paths will still rely on index scan returning the full data set. I'll look in detail and add verification to prevent putting something like this on top of skip scan in the next version.
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote: > > On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > * Btree-implementation contains btree specific code to implement amskip, > > introduced in the previous patch. > > The way that you're dealing with B-Tree tuples here needs to account > for posting list tuples: > > > + currItem = &so->currPos.items[so->currPos.lastItem]; > > + itup = (IndexTuple) (so->currTuples + > > currItem->tupleOffset); > > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid); Do you mean this last part with t_tid, which could also have a tid array in case of posting tuple format? > > + /* > > +* To check if we returned the same tuple, try to find a > > +* startItup on the current page. For that we need to update > > +* scankey to match the whole tuple and set nextkey to > > return > > +* an exact tuple, not the next one. If the nextOffset is > > the > > +* same as before, it means we are in the loop, return > > offnum > > +* to the original position and jump further > > +*/ > > Why does it make sense to use the offset number like this? It isn't > stable or reliable. The patch goes on to do this: > > > + startOffset = _bt_binsrch(scan->indexRelation, > > + so->skipScanKey, > > + so->currPos.buf); > > + > > + page = BufferGetPage(so->currPos.buf); > > + maxoff = PageGetMaxOffsetNumber(page); > > + > > + if (nextOffset <= startOffset) > > + { > > Why compare a heap TID's offset number (an offset number for a heap > page) to another offset number for a B-Tree leaf page? They're > fundamentally different things. Well, it's obviously wrong, thanks for noticing. What is necessary is to compare two index tuples, the start and the next one, to test if they're the same (in which case if I'm not mistaken probably we can compare item pointers). I've got this question when I was about to post a new version with changes to address feedback from Andy, now I'll combine them and send a cumulative patch.
RE: Index Skip Scan (new UniqueKeys)
Hi Dmitry, Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions. create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b; create index on t1 (a,b,c); postgres=# explain select distinct on (a) * from t1 order by a,b desc,c; QUERY PLAN --- Sort (cost=2.92..2.94 rows=10 width=20) Sort Key: a, b DESC, c -> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20) Skip scan: true (4 rows) With the 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With index_skipscan enabled, it always gives 1 though. It's not correct that the planner chooses a skip scan followed by sort in this case. -Floris
Re: Index Skip Scan (new UniqueKeys)
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > * Btree-implementation contains btree specific code to implement amskip, > introduced in the previous patch. The way that you're dealing with B-Tree tuples here needs to account for posting list tuples: > + currItem = &so->currPos.items[so->currPos.lastItem]; > + itup = (IndexTuple) (so->currTuples + currItem->tupleOffset); > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid); But I wonder more generally what the idea here is. The following comments that immediately follow provide some hints: > + /* > +* To check if we returned the same tuple, try to find a > +* startItup on the current page. For that we need to update > +* scankey to match the whole tuple and set nextkey to return > +* an exact tuple, not the next one. If the nextOffset is the > +* same as before, it means we are in the loop, return offnum > +* to the original position and jump further > +*/ Why does it make sense to use the offset number like this? It isn't stable or reliable. The patch goes on to do this: > + startOffset = _bt_binsrch(scan->indexRelation, > + so->skipScanKey, > + so->currPos.buf); > + > + page = BufferGetPage(so->currPos.buf); > + maxoff = PageGetMaxOffsetNumber(page); > + > + if (nextOffset <= startOffset) > + { Why compare a heap TID's offset number (an offset number for a heap page) to another offset number for a B-Tree leaf page? They're fundamentally different things. -- Peter Geoghegan
Re: Index Skip Scan (new UniqueKeys)
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote: > > I just get the rough idea of patch, looks we have to narrow down the > user cases where we can use this method. Consider the below example: Hi. Not exactly narrow down, but rather get rid of wrong usage of skipping for index scan. Since skipping for it was added later than for index only scan I can imagine there are still blind spots, so good that you've looked. In this particular case, when the index expressions do not fully cover the expressions the result needs to be distinct on, skipping just doesn't have enough information and should not be used. I'll add it to the next version, thanks!
Re: Index Skip Scan (new UniqueKeys)
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Hi, > > Here is a new version of index skip scan patch, based on v8 patch for > UniqueKeys implementation from [1]. I want to start a new thread to > simplify navigation, hopefully I didn't forget anyone who actively > participated in the discussion. > > To simplify reviewing I've split it into several parts: > > * First two are taken from [1] just for the reference and to make cfbot > happy. > > * Extend-UniqueKeys consists of changes that need to be done for > UniqueKeys to be used in skip scan. Essentially this is a reduced > version of the previous implementation from Jesper & David, based on the > new UniqueKeys infrastructure, as it does the very same thing. > > * Index-Skip-Scan contains non-AM-specific code and the overall > infrastructure, including configuration elements and so on. > > * Btree-implementation contains btree specific code to implement amskip, > introduced in the previous patch. > > * The last one contains just documentation bits. > > Interestingly enough, with the new UniqueKey implementation skipping is > applied in some tests unexpectedly. For those I've disabled > indexskipscan to avoid confusion. > > [1]: > https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com > Thanks for the patch. I just get the rough idea of patch, looks we have to narrow down the user cases where we can use this method. 
Consider the below example:

create table j1(i int, im5 int, im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 1000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;

demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     1 |      1
(1 row)

demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     0 |      0
 1 |   1 |     1 |      1
(2 rows)

drop index j1_im5_i_idx;
create index on j1(im5, i, im100);

demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
   1 | 1 |     1
(2 rows)

demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
(1 row)

-- Best Regards Andy Fan
Re: Index Skip Scan
On Tue, Jun 2, 2020 at 9:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote: > > > > > Other than that to summarize current open points for future readers > > > (this thread somehow became quite big): > > > > > > * Making UniqueKeys usage more generic to allow using skip scan for > more > > > use cases (hopefully it was covered by the v33, but I still need a > > > confirmation from David, like blinking twice or something). > > > > > > * Suspicious performance difference between different type of workload, > > > mentioned by Tomas (unfortunately I had no chance yet to > investigate). > > > > > > * Thinking about supporting conditions, that are not covered by the > index, > > > to make skipping more flexible (one of the potential next steps in > the > > > future, as suggested by Floris). > > > > > > > Looks this is the latest patch, which commit it is based on? Thanks > > I have a rebased version, if you're about it. Didn't posted it yet > mostly since I'm in the middle of adapting it to the UniqueKeys from > other thread. Would it be ok for you to wait a bit until I'll post > finished version? > Sure, that's OK. The discussion on UniqueKey thread looks more complex than what I expected, that's why I want to check the code here, but that's fine, you can work on your schedule. -- Best Regards Andy Fan
Re: Index Skip Scan
> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote: > > > Other than that to summarize current open points for future readers > > (this thread somehow became quite big): > > > > * Making UniqueKeys usage more generic to allow using skip scan for more > > use cases (hopefully it was covered by the v33, but I still need a > > confirmation from David, like blinking twice or something). > > > > * Suspicious performance difference between different type of workload, > > mentioned by Tomas (unfortunately I had no chance yet to investigate). > > > > * Thinking about supporting conditions, that are not covered by the index, > > to make skipping more flexible (one of the potential next steps in the > > future, as suggested by Floris). > > > > Looks this is the latest patch, which commit it is based on? Thanks I have a rebased version, if you're about it. Didn't posted it yet mostly since I'm in the middle of adapting it to the UniqueKeys from other thread. Would it be ok for you to wait a bit until I'll post finished version?
Re: Index Skip Scan
On Wed, Apr 8, 2020 at 3:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Mon, Apr 06, 2020 at 06:31:08PM +, Floris Van Nee wrote: > > > > > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, > Floris, I > > > would appreciate if in the future you can make it more visible that > changes you > > > suggest contain some fixes. E.g. it wasn't clear for me from your > previous email > > > that that's the case, and it doesn't make sense to pull into different > direction > > > when we're trying to achieve the same goal :) > > > > I wasn't aware that this particular case could be triggered before I saw > Dilip's email, otherwise I'd have mentioned it here of course. It's just > that because my patch handles filter conditions in general, it works for > this case too. > > Oh, then fortunately I've got a wrong impression, sorry and thanks for > clarification :) > > > > > > In the patch I posted a week ago these cases are all handled > > > > > correctly, as it introduces this extra logic in the Executor. > > > > > > > > Okay, So I think we can merge those fixes in Dmitry's patch set. > > > > > > I'll definitely take a look at suggested changes in filtering part. > > > > It may be possible to just merge the filtering part into your patch, but > I'm not entirely sure. Basically you have to pull the information about > skipping one level up, out of the node, into the generic IndexNext code. > > I was actually thinking more about just preventing skip scan in this > situation, which is if I'm not mistaken could be solved by inspecting > qual conditions to figure out if they're covered in the index - > something like in attachments (this implementation is actually too > restrictive in this sense and will not allow e.g. expressions, that's > why I haven't bumped patch set version for it - soon I'll post an > extended version). 
> > Other than that to summarize current open points for future readers > (this thread somehow became quite big): > > * Making UniqueKeys usage more generic to allow using skip scan for more > use cases (hopefully it was covered by the v33, but I still need a > confirmation from David, like blinking twice or something). > > * Suspicious performance difference between different type of workload, > mentioned by Tomas (unfortunately I had no chance yet to investigate). > > * Thinking about supporting conditions, that are not covered by the index, > to make skipping more flexible (one of the potential next steps in the > future, as suggested by Floris). > Looks this is the latest patch, which commit it is based on? Thanks -- Best Regards Andy Fan
Re: Index Skip Scan
On Fri, 15 May 2020 at 6:06 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote: > > > > + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, > dir)) > > + { > > > > Here we expect whether the "next" unique key can be found on this page > > or not, but we are using the function which suggested whether the > > "current" key can be found on this page or not. I think in boundary > > cases where the high key is equal to the current key, this function > > will return true (which is expected from this function), and based on > > that we will simply scan the current page and IMHO that cost could be > > avoided no? > > Yes, looks like you're right, there is indeed an unecessary extra scan > happening. To avoid that we can see the key->nextkey and adjust higher > boundary correspondingly. Will also add this into the next rebased > patch, thanks! Great thanks! > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote:
>
> + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> + {
>
> Here we expect whether the "next" unique key can be found on this page
> or not, but we are using the function which suggested whether the
> "current" key can be found on this page or not. I think in boundary
> cases where the high key is equal to the current key, this function
> will return true (which is expected from this function), and based on
> that we will simply scan the current page and IMHO that cost could be
> avoided no?

Yes, it looks like you're right: there is indeed an unnecessary extra scan happening. To avoid that we can look at key->nextkey and adjust the upper boundary accordingly. I will also add this to the next rebased patch, thanks!
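For intuition, the boundary case above can be sketched in a self-contained toy model (plain C, with a sorted int array standing in for a leaf page; the function names and the nextkey-style variant are illustrative assumptions, not the patch's actual code). The inclusive upper-bound check, mirroring `_bt_compare(..., high) < 1`, reports "within page" even when the skip key equals the page's last item and so triggers the rescan Dilip describes, while a strict bound models the nextkey adjustment:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for a leaf page: a sorted array of int keys.
 * within_page_inclusive mirrors the patch's check
 *     _bt_compare(key, low) > 0 && _bt_compare(key, high) < 1
 * while within_page_nextkey models the proposed adjustment: when we look
 * for the first key strictly greater than the current one, a page whose
 * last item merely equals the key can be skipped.
 */
static bool
within_page_inclusive(const int *page, int nitems, int key)
{
	if (nitems <= 0)
		return false;
	return key > page[0] && key <= page[nitems - 1];
}

static bool
within_page_nextkey(const int *page, int nitems, int key)
{
	if (nitems <= 0)
		return false;
	return key > page[0] && key < page[nitems - 1];
}

/* demo page used in the checks below (hypothetical data) */
static const int demo_page[] = {10, 20, 30, 40, 50};

static bool
demo_inclusive(int key)
{
	return within_page_inclusive(demo_page, 5, key);
}

static bool
demo_nextkey(int key)
{
	return within_page_nextkey(demo_page, 5, key);
}
```

A key strictly inside the page is accepted by both variants; only the nextkey variant rejects a key equal to the page's last item, which is exactly the case where the extra scan buys nothing.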
Re: Index Skip Scan
On Mon, Jan 20, 2020 at 5:05 PM Peter Geoghegan wrote:
> You can add another assertion that calls a new utility function in
> bufmgr.c. That can use the same logic as this existing assertion in
> FlushOneBuffer():
>
> Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
>
> We haven't needed assertions like this so far because it is usually
> clear whether or not a buffer lock is held (plus the bufmgr.c
> assertions help on their own).

Just in case anybody missed it, I am working on a patch that makes nbtree use Valgrind instrumentation to detect pages accessed without a buffer content lock held:

https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpwrub_45eexlh1+k...@mail.gmail.com

There is also one component that detects when any buffer is accessed without a buffer pin.

-- 
Peter Geoghegan
Re: Index Skip Scan
On Mon, May 11, 2020 at 4:55 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote: > > > > > > +static inline bool > > > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > > > + Buffer buf, ScanDirection dir) > > > > +{ > > > > + OffsetNumber low, high; > > > > + Page page = BufferGetPage(buf); > > > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > > > + > > > > + low = P_FIRSTDATAKEY(opaque); > > > > + high = PageGetMaxOffsetNumber(page); > > > > + > > > > + if (unlikely(high < low)) > > > > + return false; > > > > + > > > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > > > +} > > > > > > > > I think the high key condition should be changed to > > > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > > > prefix qual is equal to the high key then also there is no point in > > > > searching on the current page so we can directly skip. > > > > > > From nbtree/README and comments to functions like _bt_split I've got an > > > impression that the high key could be equal to the last item on the leaf > > > page, so there is a point in searching. Is that incorrect? > > > > But IIUC, here we want to decide whether we will get the next key in > > the current page or not? > > In general this function does what it says, it checks wether or not the > provided scankey could be found within the page. All the logic about > finding a proper next key to fetch is implemented on the call site, and > within this function we want to test whatever was passed in. Does it > answer the question? Ok, I agree that the function is doing what it is expected to do. But, then I have a problem with this call site. + /* Check if the next unique key can be found within the current page. 
+ * Since we do not lock the current page between jumps, it's possible
+ * that it was split since the last time we saw it. This is fine when
+ * scanning forward, since pages split to the right and we are still on
+ * the leftmost page. When scanning backwards it's possible to lose some
+ * pages, and we would need to remember the previous page and then follow
+ * the right link from the current page until we find the original one.
+ *
+ * Since the whole idea of checking the current page is to protect
+ * ourselves and to make the statistics-mismatch case (when there are too
+ * many distinct values for jumping to pay off) more performant, it's not
+ * clear whether the complexity of this solution is justified for
+ * backward scans, so for now just avoid it.
+ */
+ if (BufferIsValid(so->currPos.buf) && ScanDirectionIsForward(dir))
+ {
+ LockBuffer(so->currPos.buf, BT_READ);
+
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {

Here we want to know whether the "next" unique key can be found on this page, but we are using a function that tells us whether the "current" key can be found on it. I think in boundary cases where the high key is equal to the current key, this function will return true (which is expected from this function), and based on that we will simply scan the current page, and IMHO that cost could be avoided, no?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
> >
> > > +static inline bool
> > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > + Buffer buf, ScanDirection dir)
> > > +{
> > > + OffsetNumber low, high;
> > > + Page page = BufferGetPage(buf);
> > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > +
> > > + low = P_FIRSTDATAKEY(opaque);
> > > + high = PageGetMaxOffsetNumber(page);
> > > +
> > > + if (unlikely(high < low))
> > > + return false;
> > > +
> > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > +}
> > >
> > > I think the high key condition should be changed to
> > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> > > prefix qual is equal to the high key then also there is no point in
> > > searching on the current page so we can directly skip.
> >
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?
>
> But IIUC, here we want to decide whether we will get the next key in
> the current page or not?

In general this function does what it says: it checks whether or not the provided scankey could be found within the page. All the logic about finding a proper next key to fetch is implemented at the call site, and within this function we want to test whatever was passed in. Does that answer the question?
Re: Index Skip Scan
On Sun, May 10, 2020 at 11:17 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote: > > > > Some more comments... > > Thanks for reviewing. Since this patch took much longer than I expected, > it's useful to have someone to look at it with a "fresh eyes". > > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + _bt_update_skip_scankeys(scan, indexRel); > > + > > ... > > + /* > > + * We haven't found scan key within the current page, so let's scan from > > + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset > > + * number > > + */ > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + stack = _bt_search(scan->indexRelation, so->skipScanKey, > > +, BT_READ, scan->xs_snapshot); > > > > Why do we need to set so->skipScanKey->nextkey = > > ScanDirectionIsForward(dir); multiple times? I think it is fine to > > just set it once? > > I believe it was necessary for previous implementations, but in the > current version we can avoid this, you're right. > > > +static inline bool > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > + Buffer buf, ScanDirection dir) > > +{ > > + OffsetNumber low, high; > > + Page page = BufferGetPage(buf); > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > + > > + low = P_FIRSTDATAKEY(opaque); > > + high = PageGetMaxOffsetNumber(page); > > + > > + if (unlikely(high < low)) > > + return false; > > + > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > +} > > > > I think the high key condition should be changed to > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > prefix qual is equal to the high key then also there is no point in > > searching on the current page so we can directly skip. 
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?

But IIUC, here we want to decide whether we will get the next key on the current page or not. Is my understanding correct? If our key (the last tuple's key) is equal to the high key, then the max key on this page is the same as what we already got in the last tuple, so why would we want to go to this page? It will not give us a new key. So ideally, we should only be looking into this page if our last tuple's key is smaller than the high key. Am I missing something?

> > + /* Check if an index skip scan is possible. */
> > + can_skip = enable_indexskipscan & index->amcanskip;
> > +
> > + /*
> > + * Skip scan is not supported when there are qual conditions, which are not
> > + * covered by index. The reason for that is that those conditions are
> > + * evaluated later, already after skipping was applied.
> > + *
> > + * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > + * index expressions. For that we need to examine index_clauses too.
> > + */
> > + if (root->parse->jointree != NULL)
> > + {
> > + ListCell *lc;
> > +
> > + foreach(lc, (List *)root->parse->jointree->quals)
> > + {
> > + Node *expr, *qual = (Node *) lfirst(lc);
> > + Var *var;
> >
> > I think we can avoid checking for expression if can_skip is already false.
>
> Yes, that makes sense. I'll include your suggestions into the next
> rebased version I'm preparing.

Ok.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote:
>
> Some more comments...

Thanks for reviewing. Since this patch took much longer than I expected, it's useful to have someone look at it with fresh eyes.

> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + _bt_update_skip_scankeys(scan, indexRel);
> +
> ...
> + /*
> + * We haven't found scan key within the current page, so let's scan from
> + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
> + * number
> + */
> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + stack = _bt_search(scan->indexRelation, so->skipScanKey,
> +, BT_READ, scan->xs_snapshot);
>
> Why do we need to set so->skipScanKey->nextkey =
> ScanDirectionIsForward(dir); multiple times? I think it is fine to
> just set it once?

I believe it was necessary for previous implementations, but in the current version we can avoid this, you're right.

> +static inline bool
> +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> + Buffer buf, ScanDirection dir)
> +{
> + OffsetNumber low, high;
> + Page page = BufferGetPage(buf);
> + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +
> + low = P_FIRSTDATAKEY(opaque);
> + high = PageGetMaxOffsetNumber(page);
> +
> + if (unlikely(high < low))
> + return false;
> +
> + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> + _bt_compare(scan->indexRelation, key, page, high) < 1);
> +}
>
> I think the high key condition should be changed to
> _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> prefix qual is equal to the high key then also there is no point in
> searching on the current page so we can directly skip.

From nbtree/README and comments to functions like _bt_split I got the impression that the high key could be equal to the last item on the leaf page, so there is a point in searching. Is that incorrect?

> + /* Check if an index skip scan is possible.
*/ > + can_skip = enable_indexskipscan & index->amcanskip; > + > + /* > + * Skip scan is not supported when there are qual conditions, which are not > + * covered by index. The reason for that is that those conditions are > + * evaluated later, already after skipping was applied. > + * > + * TODO: This implementation is too restrictive, and doesn't allow e.g. > + * index expressions. For that we need to examine index_clauses too. > + */ > + if (root->parse->jointree != NULL) > + { > + ListCell *lc; > + > + foreach(lc, (List *)root->parse->jointree->quals) > + { > + Node *expr, *qual = (Node *) lfirst(lc); > + Var *var; > > I think we can avoid checking for expression if can_skip is already false. Yes, that makes sense. I'll include your suggestions into the next rebased version I'm preparing.
Re: Index Skip Scan
Sorry for the late reply.

> On Tue, Apr 14, 2020 at 09:19:22PM +1200, David Rowley wrote:
>
> I've not yet looked at the latest patch, but I did put some thoughts
> into an email on the other thread that's been discussing UniqueKeys [1].
>
> I'm keen to hear thoughts on the plan I mentioned over there. Likely
> it would be best to discuss the specifics of what additional features
> we need to add to UniqueKeys for skip scans over here, but discuss any
> changes which affect both patches over there. We certainly can't have
> two separate implementations of UniqueKeys, so I believe the skip
> scans UniqueKeys patch should most likely be based on the one in [1]
> or some descendant of it.
>
> [1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com

Yes, I've come to the same conclusion, although I have my concerns about having such a dependency between patches. I will look at the suggested patches soon.
Re: Index Skip Scan
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Other than that to summarize current open points for future readers
> (this thread somehow became quite big):
>
> * Making UniqueKeys usage more generic to allow using skip scan for more
> use cases (hopefully it was covered by the v33, but I still need a
> confirmation from David, like blinking twice or something).

I've not yet looked at the latest patch, but I did put some thoughts into an email on the other thread that's been discussing UniqueKeys [1].

I'm keen to hear thoughts on the plan I mentioned over there. Likely it would be best to discuss the specifics of what additional features we need to add to UniqueKeys for skip scans over here, but discuss any changes which affect both patches over there. We certainly can't have two separate implementations of UniqueKeys, so I believe the skip scans UniqueKeys patch should most likely be based on the one in [1] or some descendant of it.

[1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com
RE: Index Skip Scan
> > * Suspicious performance difference between different type of workload, > mentioned by Tomas (unfortunately I had no chance yet to investigate). > His benchmark results indeed most likely point to multiple comparisons being done. Since the most likely place where these occur is _bt_readpage, I suspect this is called multiple times. Looking at your patch, I think that's indeed the case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a complete misestimation and chooses a skip scan here. First call to _bt_readpage will compare every tuple on the page already and store everything in the workspace, which will now contain [1,2,3,4,5]. However, when a skip is done the elements on the page (not the workspace) are compared to find the next one. Then, another _bt_readpage is done, starting at the new offnum. So we'll compare every tuple (except 1) on the page again. Workspace now contains [2,3,4,5]. Next tuple we'll end up with [3,4,5] etc. So tuple 5 actually gets compared 5 times in _bt_readpage alone. -Floris
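To make the repeated work concrete, here is a small self-contained simulation (plain C; the names and the per-offset restart are simplifying assumptions, not nbtree code) of the worst case described above: a 5-item page where each skip restarts the page read at the next offset, so the last tuple ends up being compared once per restart.

```c
#include <assert.h>

#define PAGE_ITEMS 5

/* comparisons[i] counts how often tuple i is examined by readpage_from */
static int comparisons[PAGE_ITEMS];

/* toy _bt_readpage: "compare" every tuple from offnum to the end of page */
static void
readpage_from(int offnum)
{
	for (int i = offnum; i < PAGE_ITEMS; i++)
		comparisons[i]++;
}

/*
 * Worst case sketched in the mail: the planner misestimates, the scan
 * skips one tuple at a time, and the page is re-read starting at each
 * successive offset.  Returns how often the last tuple was compared.
 */
static int
simulate_skips(void)
{
	for (int i = 0; i < PAGE_ITEMS; i++)
		comparisons[i] = 0;
	for (int off = 0; off < PAGE_ITEMS; off++)
		readpage_from(off);
	return comparisons[PAGE_ITEMS - 1];
}
```

Tuple i is compared i+1 times, so the total work over the page is 5+4+3+2+1 = 15 comparisons instead of 5, which is the quadratic blow-up the benchmark numbers point to.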
Re: Index Skip Scan
> On Mon, Apr 06, 2020 at 06:31:08PM +, Floris Van Nee wrote:
> >
> > > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> > > would appreciate if in the future you can make it more visible that changes you
> > > suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> > > that that's the case, and it doesn't make sense to pull into different direction
> > > when we're trying to achieve the same goal :)
> >
> > I wasn't aware that this particular case could be triggered before I saw
> > Dilip's email, otherwise I'd have mentioned it here of course. It's just that
> > because my patch handles filter conditions in general, it works for this case too.

Oh, then fortunately I got a wrong impression; sorry, and thanks for the clarification :)

> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> >
> > I'll definitely take a look at suggested changes in filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm
> not entirely sure. Basically you have to pull the information about skipping
> one level up, out of the node, into the generic IndexNext code.

I was actually thinking more about just preventing skip scan in this situation, which, if I'm not mistaken, could be solved by inspecting the qual conditions to figure out whether they're covered by the index - something like in the attachments (this implementation is actually too restrictive in this sense and will not allow e.g. expressions; that's why I haven't bumped the patch set version for it - soon I'll post an extended version).
Other than that to summarize current open points for future readers (this thread somehow became quite big):

* Making UniqueKeys usage more generic to allow using skip scan for more use cases (hopefully it was covered by the v33, but I still need a confirmation from David, like blinking twice or something).

* Suspicious performance difference between different type of workload, mentioned by Tomas (unfortunately I had no chance yet to investigate).

* Thinking about supporting conditions, that are not covered by the index, to make skipping more flexible (one of the potential next steps in the future, as suggested by Floris).

From 15989c5250214fea8606a56afd1eeaf760b8723e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c           |  14 +++
 src/backend/nodes/print.c              |  39 +++
 src/backend/optimizer/path/Makefile    |   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 
 src/backend/optimizer/path/pathkeys.c  |  71 +++--
 src/backend/optimizer/path/uniquekey.c | 136 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  37 ++-
 src/backend/optimizer/util/pathnode.c  |  46 +++--
 src/include/nodes/nodes.h              |   1 +
 src/include/nodes/pathnodes.h          |  19 
 src/include/nodes/print.h              |   1 +
 src/include/optimizer/pathnode.h       |   2 +
 src/include/optimizer/paths.h          |  11 ++
 16 files changed, 408 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
RE: Index Skip Scan
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I > would appreciate if in the future you can make it more visible that changes > you > suggest contain some fixes. E.g. it wasn't clear for me from your previous > email > that that's the case, and it doesn't make sense to pull into different > direction > when we're trying to achieve the same goal :) I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too. > > > > In the patch I posted a week ago these cases are all handled > > > correctly, as it introduces this extra logic in the Executor. > > > > Okay, So I think we can merge those fixes in Dmitry's patch set. > > I'll definitely take a look at suggested changes in filtering part. It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code. I'm eager to get some form of skip scans into master - any kind of patch that makes this possible is fine by me. Long term I think my version provides a more generic approach, with which we can optimize a much broader range of queries. However, since many more eyes have seen your patch so far, I hope yours can be committed much sooner. My knowledge on this committer process is limited though. That's why I've just posted mine so far in the hope of collecting some feedback, also on how we should continue with the process. -Floris
Re: Index Skip Scan
> On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee wrote:
> >
> > There's a number of ways we can force a 'filter' rather than an
> > 'index condition'.

Hm, I wasn't aware of this one, thanks for bringing it up. Btw, Floris, I would appreciate it if, in the future, you could make it more visible when the changes you suggest contain some fixes. E.g. it wasn't clear to me from your previous email that that's the case, and it doesn't make sense to pull in different directions when we're trying to achieve the same goal :)

> > In the patch I posted a week ago these cases are all handled
> > correctly, as it introduces this extra logic in the Executor.
>
> Okay, So I think we can merge those fixes in Dmitry's patch set.

I'll definitely take a look at the suggested changes in the filtering part.
Re: Index Skip Scan
On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee wrote: > > > > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > > > > > I was just wondering how the distinct will work with the "skip scan" > > > > if we have some filter? I mean every time we select the unique row > > > > based on the prefix key and that might get rejected by an external > > > > filter right? > > > > > Yeah, you're correct. This patch only handles the index conditions and > doesn't handle any filters correctly. There's a check in the planner for the > IndexScan for example that only columns that exist in the index are used. > However, this check is not sufficient as your example shows. There's a number > of ways we can force a 'filter' rather than an 'index condition' and still > choose a skip scan (WHERE b!=0 is another one I think). This leads to > incorrect query results. Right > This patch would need some logic in the planner to never choose the skip scan > in these cases. Better long-term solution is to adapt the rest of the > executor to work correctly in the cases of external filters (this ties in > with the previous visibility discussion as well, as that's basically also an > external filter, although a special case). I agree > In the patch I posted a week ago these cases are all handled correctly, as it > introduces this extra logic in the Executor. Okay, So I think we can merge those fixes in Dmitry's patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: Index Skip Scan
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > > > I was just wondering how the distinct will work with the "skip scan" > > > if we have some filter? I mean every time we select the unique row > > > based on the prefix key and that might get rejected by an external > > > filter right? > > Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan for example that only columns that exist in the index are used. However, this check is not sufficient as your example shows. There's a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b!=0 is another one I think). This leads to incorrect query results. This patch would need some logic in the planner to never choose the skip scan in these cases. Better long-term solution is to adapt the rest of the executor to work correctly in the cases of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case). In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor. -Floris
Re: Index Skip Scan
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> >
> > I was just wondering how the distinct will work with the "skip scan"
> > if we have some filter? I mean every time we select the unique row
> > based on the prefix key and that might get rejected by an external
> > filter right?
>
> Not exactly. In the case of index-only scan, we skipping to the first
> unique position, and then use already existing functionality
> (_bt_readpage with stepping to the next pages) to filter out those
> records that do not pass the condition.

I agree, but that will work only if we have a valid index clause; the "b%100=0" condition will not create an index clause, right? However, if we change the query to "select distinct(a) from t where b=100" then it works fine, because this condition will create an index clause.

> There are even couple of tests
> in the patch for this. In case of index scan, when there are some
> conditions, current implementation do not consider skipping.
>
> > So I tried an example to check this.
>
> Can you tell on which version of the patch you were testing?

I have tested on v33.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
>
> I was just wondering how the distinct will work with the "skip scan"
> if we have some filter? I mean every time we select the unique row
> based on the prefix key and that might get rejected by an external
> filter right?

Not exactly. In the case of an index-only scan, we skip to the first unique position, and then use already existing functionality (_bt_readpage with stepping to the next pages) to filter out those records that do not pass the condition. There are even a couple of tests in the patch for this. In the case of an index scan, when there are some conditions, the current implementation does not consider skipping.

> So I tried an example to check this.

Can you tell on which version of the patch you were testing?
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
> >
> > Seems like you forgot to add the uniquekey.c file in the
> > v33-0001-Unique-key.patch.
>
> Oh, you're right, thanks. Here is the corrected patch.

I was just wondering how the distinct will work with the "skip scan" if we have some filter? I mean every time we select the unique row based on the prefix key and that might get rejected by an external filter right? So I tried an example to check this.

postgres[50006]=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Indexes:
    "idx" btree (a, b)

postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# set enable_indexskipscan =off;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
 1
 2
(2 rows)

postgres[50006]=# set enable_indexskipscan =on;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
(0 rows)

postgres[50006]=# explain select distinct(a) from t where b%100=0;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Only Scan using idx on t  (cost=0.15..1.55 rows=10 width=4)
   Skip scan: true
   Filter: ((b % 100) = 0)
(3 rows)

I think in such cases we should not select the skip scan. This should behave like we have a filter on the non-index field.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
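The planner-side guard being proposed can be sketched as a self-contained decision function (plain C; the function name, the qual/index-clause counting, and the coverage rule are illustrative assumptions, not the patch's actual API). The idea: a skip scan is only safe when every restriction clause became an index condition, because a residual filter such as b%100=0 is evaluated only after skipping has already discarded rows, which is what produces the wrong (0 rows) result above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy decision function: nquals is the number of restriction clauses in
 * the query, nindex_clauses is how many of them the planner matched as
 * index conditions.  Any remainder is applied as a filter after the
 * skip, which can wrongly eliminate entire distinct groups, so in that
 * case the skip scan must be disallowed.
 */
static bool
skip_scan_is_safe(bool amcanskip, int nquals, int nindex_clauses)
{
	/* the access method must support skipping at all */
	if (!amcanskip)
		return false;

	/* every qual must have been absorbed into the index conditions */
	return nindex_clauses == nquals;
}
```

With this rule, "WHERE b = 100" (one qual, one index clause) still permits the skip scan, while "WHERE b%100 = 0" (one qual, zero index clauses, so it becomes a filter) falls back to a regular scan, matching Dilip's expectation above.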
Re: Index Skip Scan
> On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
>
> Seems like you forgot to add the uniquekey.c file in the
> v33-0001-Unique-key.patch.

Oh, you're right, thanks. Here is the corrected patch.

From 15989c5250214fea8606a56afd1eeaf760b8723e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c           |  14 +++
 src/backend/nodes/print.c              |  39 +++
 src/backend/optimizer/path/Makefile    |   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 
 src/backend/optimizer/path/pathkeys.c  |  71 +++--
 src/backend/optimizer/path/uniquekey.c | 136 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  37 ++-
 src/backend/optimizer/util/pathnode.c  |  46 +++--
 src/include/nodes/nodes.h              |   1 +
 src/include/nodes/pathnodes.h          |  19 
 src/include/nodes/print.h              |   1 +
 src/include/optimizer/pathnode.h       |   2 +
 src/include/optimizer/paths.h          |  11 ++
 16 files changed, 408 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 		case T_PathKey:
 			_outPathKey(str, obj);
 			break;
+		case T_UniqueKey:
+			_outUniqueKey(str, obj);
+			break;
 		case T_PathTarget:
 			_outPathTarget(str, obj);
 			break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey  *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+				first = false;
+			else
+				printf(", ");
+			print_expr((Node *) mem->em_expr, rtable);
+		}
+		printf(")");
+		if (lnext(uniquekeys, l))
+			printf(", ");
+	}
+	printf(")\n");
+}
+
 /*
  * print_tl
  *	  print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8286d9cf34..bbc13e6141 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
 		print_pathkeys(path->pathkeys, root->parse->rtable);
 	}
 
+	if (path->uniquekeys)
+	{
+		for (i = 0; i < indent; i++)
+			printf("\t");
+		printf("	  uniquekeys: ");
+
Re: Index Skip Scan
On Tue, Mar 24, 2020 at 10:08 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote: > > > > Yes, I was complaining that a ProjectionPath breaks the optimisation > > and I don't believe there's any reason that it should. > > > > I believe the way to make that work correctly requires paying > > attention to the Path's uniquekeys rather than what type of path it > > is. > > Thanks for the suggestion. As a result of the discussion I've modified > the patch, does it look similar to what you had in mind? > > In this version if all conditions are met and there are corresponding > unique keys, a new index skip scan path will be added to > unique_pathlists. In case the requested distinct clauses match the > unique keys, create_distinct_paths can choose this path without needing > to know what kind of path it is. Also unique_keys are passed through > ProjectionPath, so the optimization for the example mentioned earlier in this thread > should now work (I've added one test for that). > > I haven't changed anything about the UniqueKey structure itself (one of the > suggestions was about Expr instead of EquivalenceClass), but I believe > we still need to figure out how the two existing implementations (in this > patch and from [1]) of this idea can be connected.
> > [1]:
> https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

---
 src/backend/nodes/outfuncs.c          | 14 ++
 src/backend/nodes/print.c             | 39 +++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41
 src/backend/optimizer/path/pathkeys.c | 71 ++-
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +-
 src/backend/optimizer/util/pathnode.c | 46 +
 src/include/nodes/nodes.h             |  1 +
 src/include/nodes/pathnodes.h         | 19 +++
 src/include/nodes/print.h             |  1 +
 src/include/optimizer/pathnode.h      |  2 +
 src/include/optimizer/paths.h         | 11 +
 15 files changed, 272 insertions(+), 23 deletions(-)

Seems like you forgot to add the uniquekey.c file in the
v33-0001-Unique-key.patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 12:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote: > > > > There was a dedicated thread [1] where David explains his idea in > > detail, and you can also check the messages around that message for > > the context. Hope it helps. > > Thanks for pointing out this thread! Somehow I've missed it, and now it > looks like we need to make some efforts to align the patches for index > skip scan and distinctClause elimination. Yes :). It looks like index skip scan is a way of producing a distinct result without a real Distinct node; it happens after the UniqueKeys check, where I try to see if the result is unique already, and before the place where a Unique node is created for the distinct clause (with index skip scan we don't need it at all). Currently in my patch the logic is: 1) check the UniqueKeys to see if the result is unique already; if not, go to the next step; 2) after the distinct paths are created, add the result of the distinct path as a unique key. Will you add the index skip scan path during create_distinct_paths and add the UniqueKey to the RelOptInfo? If so, I guess my current patch can handle it, since it cares about the result of the distinct path but is not worried about how we achieve that. Best Regards Andy Fan
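To make step 1 above concrete, here is a minimal standalone sketch (plain C, not PostgreSQL planner code; all names are hypothetical) of the check that a DISTINCT is already satisfied: the result is known to be distinct when the columns of some unique key form a subset of the distinct columns.

```c
#include <stdbool.h>
#include <stddef.h>

/* One unique key: a set of column numbers known to be unique together. */
typedef struct
{
	const int  *cols;
	size_t		ncols;
} UniqueKeySet;

static bool
col_in(const int *set, size_t n, int col)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == col)
			return true;
	return false;
}

/*
 * The DISTINCT is a no-op when every column of some known unique key
 * appears among the distinct columns: grouping on a superset of a
 * unique key cannot merge any rows.
 */
bool
distinct_is_noop(const int *distinct_cols, size_t ndistinct,
				 const UniqueKeySet *keys, size_t nkeys)
{
	for (size_t k = 0; k < nkeys; k++)
	{
		size_t		matched = 0;

		for (size_t i = 0; i < keys[k].ncols; i++)
			if (col_in(distinct_cols, ndistinct, keys[k].cols[i]))
				matched++;
		if (matched == keys[k].ncols)
			return true;
	}
	return false;
}
```

When this returns true no Unique/HashAggregate node (and no skip scan) is needed at all, which is exactly the case the UniqueKeys check handles before any distinct paths are built.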
Re: Index Skip Scan
> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote: > > There was a dedicated thread [1] where David explains his idea in > detail, and you can also check the messages around that message for > the context. Hope it helps. Thanks for pointing out this thread! Somehow I've missed it, and now it looks like we need to make some efforts to align the patches for index skip scan and distinctClause elimination.
Re: Index Skip Scan
> On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote: > > Yes, I was complaining that a ProjectionPath breaks the optimisation > and I don't believe there's any reason that it should. > > I believe the way to make that work correctly requires paying > attention to the Path's uniquekeys rather than what type of path it > is. Thanks for the suggestion. As a result of the discussion I've modified the patch; does it look similar to what you had in mind? In this version, if all conditions are met and there are corresponding unique keys, a new index skip scan path is added to unique_pathlists. If the requested distinct clauses match the unique keys, create_distinct_paths can choose this path without needing to know what kind of path it is. Also, unique_keys are passed through ProjectionPath, so the optimization for the example mentioned earlier in this thread should now work (I've added one test for that). I haven't changed anything about the UniqueKey structure itself (one of the suggestions was about Expr instead of EquivalenceClass), but I believe we still need to figure out how the two existing implementations of this idea (in this patch and in [1]) can be connected. [1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

>From c7af8157da82db1cedf02e6ec0de355b56275680 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c          | 14 ++
 src/backend/nodes/print.c             | 39 +++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41
 src/backend/optimizer/path/pathkeys.c | 71 ++-
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +-
 src/backend/optimizer/util/pathnode.c | 46 +
 src/include/nodes/nodes.h             |  1 +
 src/include/nodes/pathnodes.h         | 19 +++
 src/include/nodes/print.h             |  1 +
 src/include/optimizer/pathnode.h      |  2 +
 src/include/optimizer/paths.h         | 11 +
 15 files changed, 272 insertions(+), 23 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }

 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }

+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 		case T_PathKey:
 			_outPathKey(str, obj);
 			break;
+		case T_UniqueKey:
+			_outUniqueKey(str, obj);
+			break;
 		case T_PathTarget:
 			_outPathTarget(str, obj);
 			break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }

+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey  *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+				first = false;
+			else
+				printf(", ");
+			print_expr((Node *) mem->em_expr,
Re: Index Skip Scan
> > > On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee > wrote: > > I'm unsure which version number to give this patch (to continue with > numbers from previous skip scan patches, or to start numbering from scratch > again). It's a rather big change, so one could argue it's mostly a separate > patch. I guess it mostly depends on how close the original versions were to > be committable. Thoughts? > > I don't know, but from the sidelines, it'd be nice to see the unique > path part go into PG13, where IIUC it can power the "useless unique > removal" patch. Actually I posted a patch to remove the distinct clause quite some time ago [1], and it later evolved into the UniqueKey work as well; you can see [2] for the current status. [1] https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107 [2] https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL=uafudmf4wgovkel3onbatqju9nsxtuucpp...@mail.gmail.com
Re: Index Skip Scan
Hi Floris, On Sun, Mar 22, 2020 at 11:00 AM Floris Van Nee wrote: > create index on t1 (a,b,c); > select * from t1 where b in (100, 200); > Execution Time: 2.464 ms > Execution Time: 252.224 ms > Execution Time: 244.872 ms Wow. This is very cool work and I'm sure it will become a major headline feature of PG14 if the requisite planner brains can be sorted out. On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee wrote: > I'm unsure which version number to give this patch (to continue with numbers > from previous skip scan patches, or to start numbering from scratch again). > It's a rather big change, so one could argue it's mostly a separate patch. I > guess it mostly depends on how close the original versions were to be > committable. Thoughts? I don't know, but from the sidelines, it'd be nice to see the unique path part go into PG13, where IIUC it can power the "useless unique removal" patch.
Re: Index Skip Scan
On Wed, 11 Mar 2020 at 16:44, Andy Fan wrote: >> >> >> I think the UniqueKeys may need to be changed from using >> EquivalenceClasses to use Exprs instead. > > > When I try to understand why UniqueKeys needs EquivalenceClasses, > see your comments here. I feel that a FuncExpr can't be > used as a unique key even if we can create a unique index on f(a) > and f->strict == true. The reason is that even if we know a is not null and > f->strict == true, it is still possible that f(a) == null, and a unique index > allows more than one null value. So shall we go further and use varattno > instead of Expr? If so, we can also use a list of Bitmapsets to represent > multiple unique keys of a single RelOptInfo. We do need some method to determine if NULL values are possible. At the base relation level that can probably be done by checking NOT NULL constraints and strict base quals. At higher levels, we can use strict join quals as proofs. As for a Bitmapset of varattnos, that would certainly work well at the base relation level when there are no unique expression indexes, but it's not so simple with join relations, since varattnos only mean something when you know which base relation they come from. I'm not saying that Lists of Exprs is ideal, but I think trying to optimise some code that does not yet exist is premature. There was some other talk in [1] on how we might make checking whether a List contains a given Node more efficient. That could be advantageous in a few places in the query planner, and it might be useful for this too. [1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com
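As an illustration of the Bitmapset-of-varattnos idea at the base relation level only, here is a toy sketch (plain C, not PostgreSQL's Bitmapset API; it assumes attribute numbers fit in one 64-bit word, which the real Bitmapset does not require). The "is this unique key usable for the DISTINCT" test collapses to a couple of bit operations:

```c
#include <stdbool.h>
#include <stdint.h>

/* Attribute numbers of ONE base relation packed into a bitmask.
 * Assumption: attnos < 64. This breaks down for join relations,
 * where an attno is meaningless without knowing its base relation. */
typedef uint64_t AttnoSet;

AttnoSet
attno_set(const int *attnos, int n)
{
	AttnoSet	s = 0;

	for (int i = 0; i < n; i++)
		s |= UINT64_C(1) << attnos[i];
	return s;
}

/* A unique key covers the DISTINCT when all of its attnos appear in it. */
bool
key_covered_by(AttnoSet key, AttnoSet distinct_attnos)
{
	return (key & ~distinct_attnos) == 0;
}
```

The single-relation assumption baked into the type is precisely the objection above: a flat attno set cannot describe uniqueness of a joinrel, nor of an expression index like f(a).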
Re: Index Skip Scan
On Tue, Mar 10, 2020 at 4:32 AM James Coleman wrote: > On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > > > Assuming we'll implement it in a way that we do not know about what kind > > of path type is that in create_distinct_path, then it can also work for > > ProjectionPath or anything else (if UniqueKeys are present). But then > > still EquivalenceMember are used only to figure out correct > > distinctPrefixKeys and do not affect whether or not skipping is applied. > > What do I miss? > > > Part of the puzzle seems to me to this part of the response: > > > I think the UniqueKeys may need to be changed from using > > EquivalenceClasses to use Exprs instead. > > But I can't say I'm being overly helpful by pointing that out, since I > don't have my head in the code enough to understand how you'd > accomplish that :) > > There was a dedicated thread [1] where David explain his idea very detailed, and you can also check that messages around that message for the context. hope it helps. [1] https://www.postgresql.org/message-id/CAApHDvq7i0%3DO97r4Y1pv68%2BtprVczKsXRsV28rM9H-rVPOfeNQ%40mail.gmail.com
Re: Index Skip Scan
> > > I think the UniqueKeys may need to be changed from using > EquivalenceClasses to use Exprs instead. > While trying to understand why UniqueKeys needs EquivalenceClasses, I saw your comments here. I feel that a FuncExpr can't be used as a unique key even if we can create a unique index on f(a) and f->strict == true. The reason is that even if we know a is not null and f->strict == true, it is still possible that f(a) == null, and a unique index allows more than one null value. So shall we go further and use varattno instead of Expr? If so, we can also use a list of Bitmapsets to represent multiple unique keys of a single RelOptInfo.
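The argument above can be shown with a toy model of SQL's strict-function semantics (hypothetical names, not PostgreSQL code): a strict f maps NULL input to NULL, yet distinct non-null inputs may still both map to NULL, and since a unique index permits multiple NULLs, a unique index on f(a) plus "a is not null" does not prove f(a) unique.

```c
#include <stdbool.h>

/* A nullable integer, mimicking an SQL value. */
typedef struct
{
	bool		isnull;
	int			value;
} NInt;

/*
 * A strict function: NULL input yields NULL output.  Strictness says
 * nothing about non-null inputs, though -- here every negative input
 * also yields NULL, so distinct inputs collapse onto NULL.
 */
NInt
f_strict(NInt a)
{
	NInt		null_result = {true, 0};

	if (a.isnull)
		return null_result;		/* strictness */
	if (a.value < 0)
		return null_result;		/* non-null input can still yield NULL */
	return (NInt) {false, a.value + 1};
}
```

Rows with a = -1 and a = -2 both index as NULL under f, and a unique index on f(a) accepts both of them.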
Re: Index Skip Scan
On Wed, 11 Mar 2020 at 01:38, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. This results in the optimisation not > > working in cases like: > > I've read it as "an assumption that an EquivalenceMember can only be a > Var" results in "the optimisation not working in cases like this". But > you've meant that ignoring a ProjectionPath with an IndexPath inside > results in this optimisation not working, right? If so, then everything > is clear, and my apologies, maybe I need to finally fix my sleep > schedule :) Yes, I was complaining that a ProjectionPath breaks the optimisation and I don't believe there's any reason that it should. I believe the way to make that work correctly requires paying attention to the Path's uniquekeys rather than what type of path it is.
Re: Index Skip Scan
> >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Assuming we'll implement it in a way that we do not know about what kind > > of path type is that in create_distinct_path, then it can also work for > > ProjectionPath or anything else (if UniqueKeys are present). But then > > still EquivalenceMember are used only to figure out correct > > distinctPrefixKeys and do not affect whether or not skipping is applied. > > What do I miss? > > I'm not sure I fully understand the question correctly, but let me > explain further. > > In the 0001 patch, standard_qp_callback sets the query_uniquekeys > depending on the DISTINCT / GROUP BY clause. When building index > paths in build_index_paths(), the 0002 patch should be looking at the > root->query_uniquekeys to see if it can build any index paths that > suit those keys. Such paths should be tagged with the uniquekeys they > satisfy, basically, exactly the same as how pathkeys work. Many > create_*_path functions will need to be modified to carry forward > their uniquekeys. For example, create_projection_path(), > create_limit_path() don't do anything which would cause the created > path to violate the unique keys. This way when you get down to > create_distinct_paths(), paths other than IndexPath may have > uniquekeys. You'll be able to check which existing paths satisfy the > unique keys required by the DISTINCT / GROUP BY and select those paths > instead of having to create any HashAggregate / Unique paths. > > Does that answer the question? Hmm... I'm afraid no, this was already clear. But looks like now I see that I've misinterpreted one part. > There's also some weird looking assumptions that an EquivalenceMember > can only be a Var in create_distinct_paths(). 
I think you're only > saved from crashing there because a ProjectionPath will be created > atop of the IndexPath to evaluate expressions, in which case you're > not seeing the IndexPath. This results in the optimisation not > working in cases like: I've read it as "an assumption that an EquivalenceMember can only be a Var" results in "the optimisation not working in cases like this". But you've meant that ignoring a ProjectionPath with an IndexPath inside results in this optimisation not working, right? If so, then everything is clear, and my apologies, maybe I need to finally fix my sleep schedule :)
Re: Index Skip Scan
On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? Part of the puzzle seems to me to be this part of the response: > I think the UniqueKeys may need to be changed from using > EquivalenceClasses to use Exprs instead. But I can't say I'm being overly helpful by pointing that out, since I don't have my head in the code enough to understand how you'd accomplish that :) James
Re: Index Skip Scan
On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? I'm not sure I fully understand the question correctly, but let me explain further. In the 0001 patch, standard_qp_callback sets the query_uniquekeys depending on the DISTINCT / GROUP BY clause. When building index paths in build_index_paths(), the 0002 patch should be looking at the root->query_uniquekeys to see if it can build any index paths that suit those keys. Such paths should be tagged with the uniquekeys they satisfy, basically, exactly the same as how pathkeys work. Many create_*_path functions will need to be modified to carry forward their uniquekeys. For example, create_projection_path(), create_limit_path() don't do anything which would cause the created path to violate the unique keys. This way when you get down to create_distinct_paths(), paths other than IndexPath may have uniquekeys. You'll be able to check which existing paths satisfy the unique keys required by the DISTINCT / GROUP BY and select those paths instead of having to create any HashAggregate / Unique paths. Does that answer the question?
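The "carry forward their uniquekeys" idea can be sketched with a toy Path structure (hypothetical names, not the real pathnode.c API): constructors for paths that neither add nor duplicate rows simply propagate the child's uniquekeys, exactly the way pathkeys already propagate.

```c
#include <stddef.h>

/* Toy Path node: only the fields relevant to the discussion. */
typedef struct Path
{
	const char *pathtype;		/* "IndexSkipScan", "Projection", ... */
	const void *uniquekeys;		/* opaque: keys this path is unique on */
	struct Path *subpath;
} Path;

/* Projection neither adds nor removes rows, so uniquekeys survive. */
Path
make_projection_path(Path *subpath)
{
	Path		p = {"Projection", subpath->uniquekeys, subpath};

	return p;
}

/* Limit can only drop rows, which also cannot break uniqueness. */
Path
make_limit_path(Path *subpath)
{
	Path		p = {"Limit", subpath->uniquekeys, subpath};

	return p;
}
```

With this in place, create_distinct_paths() need only inspect a path's uniquekeys field, never its pathtype, which is the whole point of the design described above.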
Re: Index Skip Scan
> On Mon, Mar 09, 2020 at 10:27:26AM +1300, David Rowley wrote: > > > > I think the changes in create_distinct_paths() need more work. The > > > way I think this should work is that create_distinct_paths() gets to > > > know exactly nothing about what path types support the elimination of > > > duplicate values. The Path should carry the UniqueKeys so that can be > > > determined. In create_distinct_paths() you should just be able to make > > > use of those paths, which should already have been created when > > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > > having been set. > > > > Just for me to clarify. The idea is to "move" information about what > > path types support skipping into UniqueKeys (derived from PlannerInfo's > > query_uniquekeys), but other checks (e.g. if index am supports that) > > still perform in create_distinct_paths? > > create_distinct_paths() shouldn't know any details specific to the > pathtype that it's using or considering using. All the details should > just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be > no IsA(path, ...). Please have a look over the details in my reply to > Tomas. I hope that reply has enough information in it, but please > reply there if I've missed something. Yes, I've read this reply, just wanted to ask here, since I had other questions as well. Speaking of which: > > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > > There's also some weird looking assumptions that an EquivalenceMember > > > can only be a Var in create_distinct_paths(). I think you're only > > > saved from crashing there because a ProjectionPath will be created > > > atop of the IndexPath to evaluate expressions, in which case you're > > > not seeing the IndexPath. 
I'm probably missing something, so to eliminate any misunderstanding from my side: > > > This results in the optimisation not working in cases like: > > > > > > postgres=# create table t (a int); create index on t ((a+1)); explain > > > select distinct a+1 from t; > > > CREATE TABLE > > > CREATE INDEX

> > >                         QUERY PLAN
> > > ---------------------------------------------------------
> > >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> > >    Group Key: (a + 1)
> > >    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

In this particular example skipping is not applied because, as you've mentioned, we're dealing with a ProjectionPath (not an IndexScan / IndexOnlyScan). Which means we're not even reaching the code with EquivalenceMember, so I'm still not sure how they are connected. Assuming we'll implement it in a way that we do not know what kind of path type it is in create_distinct_paths, then it can also work for ProjectionPath or anything else (if UniqueKeys are present). But then EquivalenceMembers are still used only to figure out the correct distinctPrefixKeys and do not affect whether or not skipping is applied. What do I miss?
Re: Index Skip Scan
On Mon, 9 Mar 2020 at 03:21, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > I've been looking over v32 of the patch and have a few comments > > regarding the planner changes. > > Thanks for the commentaries! > > > I think the changes in create_distinct_paths() need more work. The > > way I think this should work is that create_distinct_paths() gets to > > know exactly nothing about what path types support the elimination of > > duplicate values. The Path should carry the UniqueKeys so that can be > > determined. In create_distinct_paths() you should just be able to make > > use of those paths, which should already have been created when > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > having been set. > > Just for me to clarify. The idea is to "move" information about what > path types support skipping into UniqueKeys (derived from PlannerInfo's > query_uniquekeys), but other checks (e.g. if index am supports that) > still perform in create_distinct_paths? create_distinct_paths() shouldn't know any details specific to the pathtype that it's using or considering using. All the details should just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be no IsA(path, ...). Please have a look over the details in my reply to Tomas. I hope that reply has enough information in it, but please reply there if I've missed something. > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. 
This results in the optimisation not > > working in cases like: > > > > postgres=# create table t (a int); create index on t ((a+1)); explain > > select distinct a+1 from t; > > CREATE TABLE > > CREATE INDEX

> >                         QUERY PLAN
> > ---------------------------------------------------------
> >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> >    Group Key: (a + 1)
> >    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

> Yes, I need to fix it. > > Using unique paths as I mentioned above should see that fixed. > I'm a bit confused about this statement: how exactly should unique paths fix this? The path's uniquekeys would mention that it's unique on (a+1). You'd compare the uniquekeys of the path to the DISTINCT clause and see that the uniquekeys are a subset of the DISTINCT clause, and therefore the DISTINCT is a no-op. If that uniquekey path is cheaper than the cheapest_total_path + , then you should pick the unique path, otherwise use the cheapest_total_path and uniquify that. I think the UniqueKeys may need to be changed from using EquivalenceClasses to use Exprs instead.
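The cost comparison described above might look like this as a standalone sketch (hypothetical names; the real decision in create_distinct_paths() would of course use PostgreSQL's Path and cost machinery):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
	double		total_cost;
	bool		has_uniquekeys; /* already unique on the DISTINCT columns? */
} DPath;

/*
 * Pick between a path that is already unique on the distinct clause and
 * the overall-cheapest path plus an explicit uniquification step
 * (HashAggregate / Unique), purely on cost.
 */
const DPath *
choose_distinct_path(const DPath *unique_path, const DPath *cheapest,
					 double uniquify_cost)
{
	if (unique_path != NULL &&
		unique_path->total_cost < cheapest->total_cost + uniquify_cost)
		return unique_path;
	return cheapest;
}
```

A skip scan costing 120 beats a 100-cost seqscan that still needs a 50-cost HashAggregate on top, but loses once the uniquification step only costs 10; the choice stays purely cost-based, with no pathtype checks.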
Re: Index Skip Scan
> On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > I've been looking over v32 of the patch and have a few comments > regarding the planner changes. Thanks for the commentaries! > I think the changes in create_distinct_paths() need more work. The > way I think this should work is that create_distinct_paths() gets to > know exactly nothing about what path types support the elimination of > duplicate values. The Path should carry the UniqueKeys so that can be > determined. In create_distinct_paths() you should just be able to make > use of those paths, which should already have been created when > creating index paths for the rel due to PlannerInfo's query_uniquekeys > having been set. Just for me to clarify. The idea is to "move" information about what path types support skipping into UniqueKeys (derived from PlannerInfo's query_uniquekeys), but other checks (e.g. if index am supports that) still perform in create_distinct_paths? > Also, should create_grouping_paths() be getting the same code? > Jesper's UniqueKey patch seems to set query_uniquekeys when there's a > GROUP BY with no aggregates. So it looks like he has intended that > something like: > > SELECT x FROM t GROUP BY x; > > should work the same way as > > SELECT DISTINCT x FROM t; > > but the 0002 patch does not make this work. Has that just been overlooked? I believe it wasn't overlooked in 0002 patch, but rather added just in case in 0001. I guess there are no theoretical problems in implementing it, but since we wanted to keep scope of the patch under control and concentrate on the existing functionality it probably makes sense to plan it as one of the next steps? > There's also some weird looking assumptions that an EquivalenceMember > can only be a Var in create_distinct_paths(). I think you're only > saved from crashing there because a ProjectionPath will be created > atop of the IndexPath to evaluate expressions, in which case you're > not seeing the IndexPath. 
This results in the optimisation not > working in cases like: > > postgres=# create table t (a int); create index on t ((a+1)); explain > select distinct a+1 from t; > CREATE TABLE > CREATE INDEX

>                         QUERY PLAN
> ---------------------------------------------------------
>  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
>    Group Key: (a + 1)
>    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Yes, I need to fix it. > Using unique paths as I mentioned above should see that fixed. I'm a bit confused about this statement: how exactly should unique paths fix this?
Re: Index Skip Scan
On Sat, 7 Mar 2020 at 03:49, Tomas Vondra wrote: > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > >The reason it must be done this way is that when the RelOptInfo that > >we're performing the DISTINCT on is a joinrel, then we're not going to > >see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort > >of Join path instead. I understand you're not yet supporting doing > >this optimisation when joins are involved, but it should be coded in > >such a way that it'll work when we do. (It's probably also a separate > >question as to whether we should have this only work when there are no > >joins. I don't think I personally object to it for stage 1, but > >perhaps someone else might think differently.) > > > > I don't follow. Can you elaborate more? > > AFAICS skip-scan is essentially a capability of an (index) AM. I don't > see how we could ever do that for joinrels? We can do that at the scan > level, below a join, but that's what this patch already supports, I > think. When you say "supporting this optimisation" with joins, do you > mean doing skip-scan for join inputs, or on top of the join? The skip index scan Path would still be created at the base rel level, but the join path on the join relation would have one of the sub-paths of the join as an index skip scan. An example query that could make use of this is: SELECT * FROM some_table WHERE a IN(SELECT indexed_col_with_few_distinct_values FROM big_table); In this case, we might want to create a Skip Scan path on big_table using the index on the "indexed_col_with_few_distinct_values", then Hash Join to "some_table". That class of query is likely stage 2 or 3 of this work, but we need to lay foundations that'll support it. As for not having IndexScan paths in joinrels. Yes, of course, but that's exactly why create_distinct_paths() cannot work the way the patch currently codes it. 
The patch does:

+	/*
+	 * XXX: In case of index scan quals evaluation happens
+	 * after ExecScanFetch, which means skip results could be
+	 * filtered out. Consider the following query:
+	 *
+	 * select distinct (a, b) a, b, c from t where c < 100;
+	 *
+	 * Skip scan returns one tuple for one distinct set of (a,
+	 * b) with an arbitrary one of c, so if the chosen c does
+	 * not match the qual and there is any c that matches the
+	 * qual, we miss that tuple.
+	 */
+	if (path->pathtype == T_IndexScan &&

which will never work for join relations since they'll only have paths for Loop/Merge/Hash type joins. The key here is to determine which skip scan paths we should create when we're building the normal index paths, then see if we can make use of those when planning joins. Subsequently, we'll then see if we can make use of the resulting join paths during create_distinct_paths(). Doing it this way will allow us to use skip scans in queries such as: SELECT DISTINCT t1.z FROM t1 INNER JOIN t2 ON t1.a = t2.unique_col; We'll first create the skip scan paths on t1, then when creating the join paths we'll create additional join paths which use the skipscan path. Because t2.unique_col will have at most 1 join partner for each t1 row, the join path will have the same unique_keys as the skipscan path. That'll allow us to use the join path which has the skip scan on whichever side of the join the t1 relation ends up. All create_distinct_paths() should be doing is looking for paths that are already implicitly unique on the distinct clause and considering using those in a cost-based way. It shouldn't be making such paths itself. > >For storing these new paths with UniqueKeys, I'm not sure exactly if > >we can just add_path() such paths into the RelOptInfo's pathlist. > >What we don't want to do is accidentally make use of paths which > >eliminate duplicate values when we don't want that behaviour.
If we > >did store these paths in RelOptInfo->pathlist then we'd need to go and > >modify a bunch of places to ignore such paths. set_cheapest() would > >have to do something special for them too, which makes me think > >pathlist is the incorrect place. Parallel query added > >partial_pathlist, so perhaps we need unique_pathlist to make this > >work. > > > > Hmmm, good point. Do we actually produce incorrect plans with the > current patch, using skip-scan path when we should not?

I don't think so. The patch is only creating skip scan paths on the base rel when we discover it's valid to do so. That's not the way it should work, though. How the patch currently works would be similar to initially only creating a SeqScan path for a query such as SELECT * FROM tab ORDER BY a; but then, during create_ordered_paths(), going and creating some IndexPath to scan the btree index on tab.a because we suddenly realise that it'll be good to use that for the ORDER BY. The planner does not work that way. We always create all the paths that we think will be useful during set_base_rel_pathlists(). We then make use of only existing paths in the upper planner. See what
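To see concretely why qual evaluation after the skip step is unsafe (the hazard the patch's XXX comment describes), here is a toy model in Python - purely illustrative, not the patch's code - of a skip scan that keeps one arbitrary row per distinct (a, b) before a c < 100 qual is applied:

```python
def skip_scan_first_per_group(rows):
    """Toy skip scan: emit one arbitrary row per distinct (a, b)."""
    seen, out = set(), []
    for a, b, c in rows:              # rows come back in index order
        if (a, b) not in seen:
            seen.add((a, b))
            out.append((a, b, c))
    return out

rows = [(1, 1, 500), (1, 1, 50), (2, 2, 10)]
skipped = skip_scan_first_per_group(rows)

# Applying the qual c < 100 after the skip loses the (1, 1) group
# entirely, even though (1, 1, 50) satisfies it:
print([r for r in skipped if r[2] < 100])               # [(2, 2, 10)]

# The correct distinct-on-(a, b) answer over qualifying rows:
print(sorted({(a, b) for a, b, c in rows if c < 100}))  # [(1, 1), (2, 2)]
```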
Re: Index Skip Scan
On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote: Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. +1 to code this in a generic way, using query_uniquekeys (if possible) The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) I don't follow. Can you elaborate more? AFAICS skip-scan is essentially a capability of an (index) AM. I don't see how we could ever do that for joinrels? We can do that at the scan level, below a join, but that's what this patch already supports, I think. When you say "supporting this optimisation" with joins, do you mean doing skip-scan for join inputs, or on top of the join? 
For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. Hmmm, good point. Do we actually produce incorrect plans with the current patch, using skip-scan path when we should not? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. 
Also, should create_grouping_paths() be getting the same code? Jesper's UniqueKey patch seems to set query_uniquekeys when there's a GROUP BY with no aggregates. So it looks like he has intended that something like: SELECT x FROM t GROUP BY x; should work the same way as SELECT DISTINCT x FROM t; but the 0002 patch does not make this work. Has that just been overlooked?

There are also some weird-looking assumptions that an EquivalenceMember can only be a Var in create_distinct_paths(). I think you're only saved from crashing there because a ProjectionPath will be created atop the IndexPath to evaluate expressions, in which case you're not seeing the IndexPath. This results in the optimisation not working in cases like:

postgres=# create table t (a int);
CREATE TABLE
postgres=# create index on t ((a+1));
CREATE INDEX
postgres=# explain select distinct a+1 from t;
                        QUERY PLAN
-----------------------------------------------------------
 HashAggregate  (cost=48.25..50.75 rows=200 width=4)
   Group Key: (a + 1)
   ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Using unique paths as I mentioned above should see that fixed.

David
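The shape David is advocating - Paths carrying UniqueKeys, with create_distinct_paths() merely checking coverage - reduces to a set-containment test. A minimal Python sketch of that check (function and argument names are made up for illustration; the patch's actual API differs):

```python
def path_is_unique_for(path_uniquekeys, distinct_exprs):
    """True if some unique-key set the path carries is covered by the
    DISTINCT expressions, i.e. the path already emits distinct rows."""
    needed = set(distinct_exprs)
    return any(set(uk) <= needed for uk in path_uniquekeys)

# A skip scan over an index on (a, b) proves uniqueness on {a, b}:
skip_path_keys = [["a", "b"]]
print(path_is_unique_for(skip_path_keys, ["a", "b", "c"]))  # True
print(path_is_unique_for(skip_path_keys, ["a"]))            # False
```

Note the direction of the test: rows unique on (a, b) are necessarily unique on any superset such as (a, b, c), but not on a subset like (a) alone.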
Re: Index Skip Scan
> On Fri, Feb 14, 2020 at 01:18:20PM +0100, Dmitry Dolgov wrote: > > On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote: > > The first attached (renamed to .txt not to confuse the cfbots) is a > > small patch that makes sure _bt_readpage is called under the proper > > condition as written in its comment, that is, the caller must have pinned > > and read-locked so->currPos.buf. This patch reveals many instances of > > breakage of the contract. > > Thanks! On top of which patch version can one apply it? I'm asking > because I believe I've addressed similar issues in the last version, and > the last proposed diff (after resolving some conflicts) breaks tests for > me, so I'm not sure if I'm missing something. > > At the same time, if you and Tomas strongly agree that it actually makes > sense to make the moving forward/reading backward case work with dead tuples > correctly, I'll take a shot and try to teach the code around _bt_skip to > do what is required for that. I can merge your changes there and we can > see what would be the result.

Here is something similar to what I had in mind. In this version of the patch IndexOnlyNext now verifies whether, due to visibility checks, we have returned to the same position as before while reading opposite to the advancing direction (similar to what is implemented inside _bt_skip for the situation when some distinct keys are eliminated due to scankey conditions). It's actually not as invasive as I feared, but still pretty hacky. I'm not sure if it's ok to compare the resulting heaptid in this situation, but all the mentioned tests are passing. Also, this version doesn't incorporate any planner feedback from Tomas yet; my intention is just to check whether it could be the right direction.

[Attachment: v32 patch 0001, "Unique key" (design by David Rowley, author Jesper Pedersen). It adds a UniqueKey node wrapping an equivalence clause, canon_uniquekeys and query_uniquekeys fields to PlannerInfo, a uniquekeys field to Path, outfuncs/print support, and a new file src/backend/optimizer/path/uniquekey.c; the flattened diff is truncated in this archive.]
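The revisit check described above - comparing heaptid after reading opposite to the advancing direction - can be caricatured with a toy index of (key, heaptid, visible) entries; all names here are invented for illustration, this is not the patch's code:

```python
def step_back_visible(entries, pos, last_returned_tid):
    """Step backward from pos over invisible entries; report whether we
    landed back on the tuple we already returned (same heap TID)."""
    i = pos - 1
    while i >= 0 and not entries[i][2]:   # skip tuples our snapshot can't see
        i -= 1
    if i < 0:
        return None, False
    key, tid, _ = entries[i]
    return (key, tid), tid == last_returned_tid

# The two entries between positions 0 and 3 are dead, so stepping back
# from position 3 lands on the tuple with heaptid 1 - the one we had
# just returned - and the TID comparison detects the wrap-around:
idx = [("a", 1, True), ("a", 2, False), ("b", 3, False), ("b", 4, True)]
print(step_back_visible(idx, 3, 1))  # (('a', 1), True)
```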
Re: Index Skip Scan
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote: > The first attached (renamed to .txt not to confuse the cfbots) is a > small patch that makes sure _bt_readpage is called under the proper > condition as written in its comment, that is, the caller must have pinned > and read-locked so->currPos.buf. This patch reveals many instances of > breakage of the contract.

Thanks! On top of which patch version can one apply it? I'm asking because I believe I've addressed similar issues in the last version, and the last proposed diff (after resolving some conflicts) breaks tests for me, so I'm not sure if I'm missing something.

At the same time, if you and Tomas strongly agree that it actually makes sense to make the moving forward/reading backward case work with dead tuples correctly, I'll take a shot and try to teach the code around _bt_skip to do what is required for that. I can merge your changes there and we can see what would be the result.
Re: Index Skip Scan
Thank you very much for the benchmarking! A slightly different topic from the latest branch.. At Sat, 8 Feb 2020 14:11:59 +0100, Tomas Vondra wrote in > >Yes, I've mentioned that already in one of the previous emails :) The > >simplest way I see to achieve what we want is to do something like in > >the attached modified version with a new hasDeclaredCursor field. It's not a > >final version though, but posted just for discussion, so feel free to > >suggest any improvements or alternatives. > > IMO the proper fix for this case (moving forward, reading backwards) is > simply making it work by properly checking deleted tuples etc. Not sure > why that would be so complex (haven't tried implementing it)?

I don't think it's that complex. But I suspect that it might be a bit harder starting from the current shape.

The first attached (renamed to .txt not to confuse the cfbots) is a small patch that makes sure _bt_readpage is called under the proper condition as written in its comment, that is, the caller must have pinned and read-locked so->currPos.buf. This patch reveals many instances of breakage of the contract. The second is a crude fix for the breakages, but the result seems far from neat. I think we need to rethink this, taking modification of the support functions into consideration.

> I think making this depend on things like declared cursor etc. is going > to be tricky, may easily be more complex than checking deleted tuples, > and the behavior may be quite surprising.

Sure.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center

[Attachments: two patches, flattened and truncated in this archive. Patch 0001 "debug aid" adds a temporary BufferLockAndPinHeldByMe() helper to bufmgr.c/bufmgr.h (true when the buffer is pinned and its content lock is held by the current backend) and asserts it at the top of _bt_readpage in nbtsearch.c. Patch 0002 "crude fix" reworks the buffer lock/pin handling inside _bt_skip in nbtsearch.c; its diff breaks off mid-hunk.]
Re: Index Skip Scan
> On Sat, Feb 08, 2020 at 01:31:02PM -0500, James Coleman wrote: > On Sat, Feb 8, 2020 at 10:24 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: > > > So in this case the skip scan is ~15x slower than the usual plan (index > > > only scan + unique). The reason why this happens is pretty simple - to > > > estimate the number of groups we multiply the ndistinct estimates for > > > the two columns (which both have n_distinct = 1), but then we cap > > > the estimate to 10% of the table. But when the columns are independent > > > with high cardinalities that under-estimates the actual value, making > > > the cost for skip scan much lower than it should be. > > > > The current implementation checks if we can find the next value on the > > same page to do a shortcut instead of tree traversal and improve such > > situations, but I can easily imagine that it's still not enough > > in some extreme situations. > > This is almost certainly rehashing already covered ground, but since I > doubt it's been discussed recently, would you be able to summarize > that choice (not to always get the next tuple by scanning from the top > of the tree again) and the performance/complexity tradeoffs?

Yeah, this part of the discussion happened some time ago already. The idea [1] is to protect ourselves at least partially from incorrect ndistinct estimations. Simply jumping around an index means that even if the next key we're searching for is on the same page as the previous one, we still end up doing a search from the root of the tree, which is of course less efficient than just checking right on the page before jumping further. The performance tradeoff in this case is simple: we make the regular use case slightly slower, but can perform better in the worst case scenarios.
The complexity tradeoff was never discussed, but I guess everyone assumed it's relatively straightforward to check the current page and return early if something was found before jumping. [1]: https://www.postgresql.org/message-id/CA%2BTgmoY7QTHhzLWZupNSyyqFRBfMgYocg3R-6g%3DDRgT4-KBGqg%40mail.gmail.com
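The tradeoff can be sketched with a toy paged "index": look for the next distinct key on the current page first, and fall back to a (simulated) descent from the root only when the page is exhausted. This is an illustration of the idea, not the nbtree implementation:

```python
import bisect

def next_distinct_key(pages, page_no, cur_key):
    """Return (next_key, used_same_page_shortcut), or (None, False)."""
    page = pages[page_no]
    i = bisect.bisect_right(page, cur_key)
    if i < len(page):                 # cheap: the next key is on this page
        return page[i], True
    for p in pages[page_no + 1:]:     # stand-in for a new descent from the root
        j = bisect.bisect_right(p, cur_key)
        if j < len(p):
            return p[j], False
    return None, False

pages = [[1, 1, 1, 2], [2, 2, 3, 3], [4, 4, 4, 4]]
print(next_distinct_key(pages, 0, 1))  # (2, True): found without descending
print(next_distinct_key(pages, 1, 3))  # (4, False): page exhausted, "descend"
```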
Re: Index Skip Scan
On Sat, Feb 8, 2020 at 10:24 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: > > So in this case the skip scan is ~15x slower than the usual plan (index > > only scan + unique). The reason why this happens is pretty simple - to > > estimate the number of groups we multiply the ndistinct estimates for > > the two columns (which both have n_distinct = 1), but then we cap > > the estimate to 10% of the table. But when the columns are independent > > with high cardinalities that under-estimates the actual value, making > > the cost for skip scan much lower than it should be. > > The current implementation checks if we can find the next value on the > same page to do a shortcut instead of tree traversal and improve such > kind of situations, but I can easily imagine that it's still not enough > in some extreme situations. This is almost certainly rehashing already covered ground, but since I doubt it's been discussed recently, would you be able to summarize that choice (not to always get the next tuple by scanning from the top of the tree again) and the performance/complexity tradeoffs? Thanks, James
Re: Index Skip Scan
On Sat, Feb 08, 2020 at 04:24:40PM +0100, Dmitry Dolgov wrote: On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: I've done some testing and benchmarking of the v31 patch, looking for regressions, costing issues etc. Essentially, I've run a bunch of SELECT DISTINCT queries on data sets of various size, number of distinct values etc. The results are fairly large, so I've uploaded them to github https://github.com/tvondra/skip-scan-test

Thanks a lot for testing!

There are a couple of regressions, where the plan with skip scan enabled is ~10x slower. But this seems to happen only in high-cardinality cases where we misestimate the number of groups. Consider a table with two independent columns

CREATE TABLE t (a text, b text);
INSERT INTO t SELECT
  md5((1*random())::int::text),
  md5((1*random())::int::text)
FROM generate_series(1,100) s(i);
CREATE INDEX ON t(a,b);
ANALYZE;

which then behaves like this:

test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 3138.222 ms (00:03.138)
test=# set enable_indexskipscan = off;
Time: 0.312 ms
test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 199.749 ms

So in this case the skip scan is ~15x slower than the usual plan (index only scan + unique). The reason why this happens is pretty simple - to estimate the number of groups we multiply the ndistinct estimates for the two columns (which both have n_distinct = 1), but then we cap the estimate to 10% of the table. But when the columns are independent with high cardinalities that under-estimates the actual value, making the cost for skip scan much lower than it should be.

The current implementation checks if we can find the next value on the same page to do a shortcut instead of tree traversal and improve such situations, but I can easily imagine that it's still not enough in some extreme situations.

Yeah. I'm not sure there's room for further improvements.
The regressed cases were subject to the 10% cap, and with ndistinct being more than 10% of the table, we probably can find many distinct keys on each index page - we know that every ~10 rows the values change. I don't think this is an issue the skipscan patch needs to fix, though. Firstly, the regressed cases are a tiny minority. Secondly, we already have a way to improve the root cause - creating extended stats with ndistinct coefficients generally makes the problem go away.

Yes, I agree.

One interesting observation, however, is that this regression only happened with text columns but not with int or bigint. My assumption is that this is due to text comparisons being much more expensive. Not sure if there is something we could do to deal with this - reduce the number of comparisons or something?

Hm, interesting. I need to check that we do not do any unnecessary comparisons.

On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote: > Yes, I've mentioned that already in one of the previous emails :) The > simplest way I see to achieve what we want is to do something like in > the attached modified version with a new hasDeclaredCursor field. It's not a > final version though, but posted just for discussion, so feel free to > suggest any improvements or alternatives.

IMO the proper fix for this case (moving forward, reading backwards) is simply making it work by properly checking deleted tuples etc. Not sure why that would be so complex (haven't tried implementing it)?

It's probably not that complex by itself, but requires changing the isolation of responsibilities. At the moment the current implementation leaves jumping over a tree fully to _bt_skip, and heap visibility checks only to IndexOnlyNext.
To check deleted tuples properly we need to either verify the corresponding heap tuple's visibility inside _bt_skip (as I've mentioned in one of the previous emails, checking if an index tuple is dead is not enough), or teach the code in IndexOnlyNext to understand that _bt_skip can lead to returning the same tuple while moving forward & reading backward. Do you think it still makes sense to go this way?

Not sure. I have to think about this first.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
> On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: > > I've done some testing and benchmarking of the v31 patch, looking for > regressions, costing issues etc. Essentially, I've run a bunch of SELECT > DISTINCT queries on data sets of various size, number of distinct values > etc. The results are fairly large, so I've uploaded them to github > > https://github.com/tvondra/skip-scan-test

Thanks a lot for testing!

> There are a couple of regressions, where the plan with skip scan enabled
> is ~10x slower. But this seems to happen only in high-cardinality cases
> where we misestimate the number of groups. Consider a table with two
> independent columns
>
> CREATE TABLE t (a text, b text);
> INSERT INTO t SELECT
> md5((1*random())::int::text),
> md5((1*random())::int::text)
> FROM generate_series(1,100) s(i);
>
> CREATE INDEX ON t(a,b);
>
> ANALYZE;
>
> which then behaves like this:
>
> test=# select * from (select distinct a,b from t) foo offset 1000;
> Time: 3138.222 ms (00:03.138)
> test=# set enable_indexskipscan = off;
> Time: 0.312 ms
> test=# select * from (select distinct a,b from t) foo offset 1000;
> Time: 199.749 ms
>
> So in this case the skip scan is ~15x slower than the usual plan (index
> only scan + unique). The reason why this happens is pretty simple - to
> estimate the number of groups we multiply the ndistinct estimates for
> the two columns (which both have n_distinct = 1), but then we cap
> the estimate to 10% of the table. But when the columns are independent
> with high cardinalities that under-estimates the actual value, making
> the cost for skip scan much lower than it should be.

The current implementation checks if we can find the next value on the same page to do a shortcut instead of tree traversal and improve such situations, but I can easily imagine that it's still not enough in some extreme situations.

> I don't think this is an issue the skipscan patch needs to fix, though.
> Firstly, the regressed cases are a tiny minority. Secondly, we already > have a way to improve the root cause - creating extended stats with > ndistinct coefficients generally makes the problem go away.

Yes, I agree.

> One interesting observation however is that this regression only > happened with text columns but not with int or bigint. My assumption is > that this is due to text comparisons being much more expensive. Not sure > if there is something we could do to deal with this - reduce the number > of comparisons or something?

Hm, interesting. I need to check that we do not do any unnecessary comparisons.

> On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote: > > Yes, I've mentioned that already in one of the previous emails :) The > > simplest way I see to achieve what we want is to do something like in > > the attached modified version with a new hasDeclaredCursor field. It's not a > > final version though, but posted just for discussion, so feel free to > > suggest any improvements or alternatives. > > IMO the proper fix for this case (moving forward, reading backwards) is > simply making it work by properly checking deleted tuples etc. Not sure > why that would be so complex (haven't tried implementing it)?

It's probably not that complex by itself, but requires changing the isolation of responsibilities. At the moment the current implementation leaves jumping over a tree fully to _bt_skip, and heap visibility checks only to IndexOnlyNext. To check deleted tuples properly we need to either verify the corresponding heap tuple's visibility inside _bt_skip (as I've mentioned in one of the previous emails, checking if an index tuple is dead is not enough), or teach the code in IndexOnlyNext to understand that _bt_skip can lead to returning the same tuple while moving forward & reading backward. Do you think it still makes sense to go this way?
Re: Index Skip Scan
OK, a couple more comments based on a quick review of the patch, particularly the part related to planning:

1) create_skipscan_unique_path has one assert commented out. Either it's something we want to enforce, or we should remove it.

/*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/

2) I wonder if the current costing model is overly optimistic. We simply copy the startup cost from the IndexPath, which seems fine. But for total cost we do this:

pathnode->path.total_cost = basepath->startup_cost * numGroups;

which seems a bit too simplistic. The startup cost is pretty much just the cost to find the first item in the index, but surely we need to do more to find the next group - we need to do comparisons to skip some of the items, etc. If we think that's unnecessary, we need to explain it in a comment or something.

3) I don't think we should make planning dependent on hasDeclaredCursor.

4) I'm not quite sure how sensible it is to create a new IndexPath in create_skipscan_unique_path. On the one hand it works, but I don't think any other path is constructed like this, so I wonder if we're missing something. Perhaps it'd be better to just add a new path node on top of the IndexPath, and then handle this in create_plan. We already do something similar for Bitmap Index Scans, where we create a different executor node from IndexPath depending on the parent node.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
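To make point 2 concrete, here is the patch's formula next to a variant that also charges for the comparisons made while skipping; the per-group comparison model is entirely hypothetical, just to show where the optimism creeps in:

```python
def skipscan_total_cost_patch(startup_cost, num_groups):
    # The patch's model: one index descent ("startup") per group.
    return startup_cost * num_groups

def skipscan_total_cost_refined(startup_cost, num_groups,
                                skipped_per_group, cmp_cost):
    # Hypothetical refinement: each group also pays for comparisons
    # performed while skipping over non-qualifying items.
    return num_groups * (startup_cost + skipped_per_group * cmp_cost)

print(skipscan_total_cost_patch(4.0, 100))  # 400.0
print(skipscan_total_cost_refined(4.0, 100, 10, 0.0025))
```

The gap between the two grows with the number of items skipped per group, which is exactly the quantity the patch's formula ignores.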
Re: Index Skip Scan
Hi, I've done some testing and benchmarking of the v31 patch, looking for regressions, costing issues etc. Essentially, I've run a bunch of SELECT DISTINCT queries on data sets of various size, number of distinct values etc. The results are fairly large, so I've uploaded them to github

https://github.com/tvondra/skip-scan-test

There are four benchmark groups, depending on how the data are generated, the availability of extended stats, and whether the columns are independent:

1) skipscan - just indexes, columns are independent
2) skipscan-with-stats - indexes and extended stats, independent columns
3) skipscan-correlated - just indexes, correlated columns
4) skipscan-correlated-with-stats - indexes and extended stats, correlated columns

The github repository contains a *.ods spreadsheet comparing duration with the regular query plan (no skip scan) and skip scan. In general, there are pretty massive speedups, often by about two orders of magnitude.

There are a couple of regressions, where the plan with skip scan enabled is ~10x slower. But this seems to happen only in high-cardinality cases where we misestimate the number of groups. Consider a table with two independent columns

CREATE TABLE t (a text, b text);
INSERT INTO t SELECT
  md5((1*random())::int::text),
  md5((1*random())::int::text)
FROM generate_series(1,100) s(i);
CREATE INDEX ON t(a,b);
ANALYZE;

which then behaves like this:

test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 3138.222 ms (00:03.138)
test=# set enable_indexskipscan = off;
Time: 0.312 ms
test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 199.749 ms

So in this case the skip scan is ~15x slower than the usual plan (index only scan + unique). The reason why this happens is pretty simple - to estimate the number of groups we multiply the ndistinct estimates for the two columns (which both have n_distinct = 1), but then we cap the estimate to 10% of the table.
But when the columns are independent with high cardinalities, this under-estimates the actual value, making the cost for the skip scan much lower than it should be.

I don't think this is an issue the skipscan patch needs to fix, though. Firstly, the regressed cases are a tiny minority. Secondly, we already have a way to improve the root cause - creating extended stats with ndistinct coefficients generally makes the problem go away.

One interesting observation, however, is that this regression only happened with text columns, not with int or bigint. My assumption is that this is due to text comparisons being much more expensive. Not sure if there is something we could do to deal with this - reduce the number of comparisons or something?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
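The estimation step described above can be sketched in a few lines (a simplification of my own, assuming only the "multiply per-column ndistinct, cap at 10% of the rows" behaviour described in this thread, not the actual estimate_num_groups() code):

```python
# Sketch of the group-count estimate described above: multiply the
# per-column ndistinct values (assuming independence) and cap the
# product at 10% of the row count. This is a simplification of the
# planner's behaviour as described in the message, not the actual
# estimate_num_groups() code.

def estimate_groups(ndistincts, nrows):
    est = 1.0
    for nd in ndistincts:
        est *= nd
    return min(est, 0.1 * nrows)

# Two independent columns with 50,000 distinct values each in a
# 1,000,000-row table: nearly every row is its own (a,b) group, but
# the capped estimate claims 100,000 groups - roughly 10x too low,
# which makes the skip scan look far cheaper than it really is.
assert estimate_groups([50_000, 50_000], 1_000_000) == 100_000.0
```

This also illustrates why ndistinct extended statistics help: they replace the independence product with a measured multi-column estimate, so the cap rarely has to kick in.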
Re: Index Skip Scan
On Fri, Feb 07, 2020 at 05:25:43PM +0100, Dmitry Dolgov wrote:
> > On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in
> > > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
> > > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in
> > > > We could add an additional parameter "in_cursor" to
> > > > ExecSupportBackwardScan and let skip scan return false if in_cursor is
> > > > true, but I'm not sure it's acceptable.
> > >
> > > I also was thinking about whether it's possible to use
> > > ExecSupportBackwardScan here, but skip scan is just a mode of an
> > > index/indexonly scan. Which means that ExecSupportBackwardScan also need
> > > to know somehow if this mode is being used, and then, since this
> > > function is called after it's already decided to use skip scan in the
> > > resulting plan, somehow correct the plan (exclude skipping and try to
> > > find next best path?) - do I understand your suggestion correct?
> >
> > I didn't thought so hardly, but a bit of confirmation told me that
> > IndexSupportsBackwardScan returns fixed flag for AM. It seems that
> > things are not that simple.
>
> Yes, I've mentioned that already in one of the previous emails :) The
> simplest way I see to achieve what we want is to do something like in
> attached modified version with a new hasDeclaredCursor field. It's not a
> final version though, but posted just for discussion, so feel free to
> suggest any improvements or alternatives.

IMO the proper fix for this case (moving forward, reading backwards) is simply making it work by properly checking deleted tuples etc. Not sure why that would be so much more complex (I haven't tried implementing it)?

I think making this depend on things like a declared cursor etc. is going to be tricky, may easily be more complex than checking deleted tuples, and the behavior may be quite surprising.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
> On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote: > At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> > wrote in > > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > > > wrote in > > > We could add an additional parameter "in_cursor" to > > > ExecSupportBackwardScan and let skip scan return false if in_cursor is > > > true, but I'm not sure it's acceptable. > > > > I also was thinking about whether it's possible to use > > ExecSupportBackwardScan here, but skip scan is just a mode of an > > index/indexonly scan. Which means that ExecSupportBackwardScan also need > > to know somehow if this mode is being used, and then, since this > > function is called after it's already decided to use skip scan in the > > resulting plan, somehow correct the plan (exclude skipping and try to > > find next best path?) - do I understand your suggestion correct? > > I didn't thought so hardly, but a bit of confirmation told me that > IndexSupportsBackwardScan returns fixed flag for AM. It seems that > things are not that simple. Yes, I've mentioned that already in one of the previous emails :) The simplest way I see to achieve what we want is to do something like in attached modified version with a new hasDeclaredCursor field. It's not a final version though, but posted just for discussion, so feel free to suggest any improvements or alternatives. >From 22e6b4ccd5f79ca069bd5cd90ba3696dd97f76ea Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 9 Jul 2019 06:44:57 -0400 Subject: [PATCH v33 1/2] Unique key Design by David Rowley. 
Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 3 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 71 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 17 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 373 insertions(+), 13 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..16083e7a7e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list); WRITE_NODE_FIELD(query_pathkeys); + WRITE_NODE_FIELD(query_uniquekeys); WRITE_NODE_FIELD(group_pathkeys); WRITE_NODE_FIELD(window_pathkeys); WRITE_NODE_FIELD(distinct_pathkeys); @@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node) WRITE_BOOL_FIELD(pk_nulls_first); } +static void +_outUniqueKey(StringInfo str, 
const UniqueKey *node) +{ + WRITE_NODE_TYPE("UNIQUEKEY"); + + WRITE_NODE_FIELD(eq_clause); +} + static void _outPathTarget(StringInfo str, const PathTarget *node) { @@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj) case T_PathKey: _outPathKey(str, obj); break; + case T_UniqueKey: +_outUniqueKey(str, obj); +break; case T_PathTarget: _outPathTarget(str, obj); break; diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c index 42476724d8..d286b34544 100644 --- a/src/backend/nodes/print.c +++ b/src/backend/nodes/print.c @@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable) printf(")\n"); } +/* + * print_uniquekeys - + * uniquekeys list of UniqueKeys + */ +void +print_uniquekeys(const List *uniquekeys, const List *rtable) +{ + ListCell *l; + + printf("("); + foreach(l, uniquekeys) + { + UniqueKey *unique_key = (UniqueKey *) lfirst(l); + EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause; + ListCell *k; + bool first = true; + + /* chase up */ + while
Re: Index Skip Scan
Sorry, I forgot to write more significant thing. On 2020/02/06 21:22, Kyotaro Horiguchi wrote: At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in We could add an additional parameter "in_cursor" to ExecSupportBackwardScan and let skip scan return false if in_cursor is true, but I'm not sure it's acceptable. I also was thinking about whether it's possible to use ExecSupportBackwardScan here, but skip scan is just a mode of an index/indexonly scan. Which means that ExecSupportBackwardScan also need to know somehow if this mode is being used, and then, since this function is called after it's already decided to use skip scan in the resulting plan, somehow correct the plan (exclude skipping and try to find next best path?) - do I understand your suggestion correct? No. I thought of the opposite thing. I meant that IndexSupportsBackwardScan returns false if Index(Only)Scan is going to do skip scan. But I found that the function doesn't have access to plan node nor executor node. So I wrote as the follows. I didn't thought so hardly, but a bit of confirmation told me that IndexSupportsBackwardScan returns fixed flag for AM. It seems that things are not that simple. regards. -- Kyotaro Horiguchi
Re: Index Skip Scan
At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > > wrote in > > We could add an additional parameter "in_cursor" to > > ExecSupportBackwardScan and let skip scan return false if in_cursor is > > true, but I'm not sure it's acceptable. > > I also was thinking about whether it's possible to use > ExecSupportBackwardScan here, but skip scan is just a mode of an > index/indexonly scan. Which means that ExecSupportBackwardScan also need > to know somehow if this mode is being used, and then, since this > function is called after it's already decided to use skip scan in the > resulting plan, somehow correct the plan (exclude skipping and try to > find next best path?) - do I understand your suggestion correct? I didn't thought so hardly, but a bit of confirmation told me that IndexSupportsBackwardScan returns fixed flag for AM. It seems that things are not that simple. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Index Skip Scan
> On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > wrote in > > > On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote: > > > > > > > this point me and Jesper inclined to go with the second option. But > > > > maybe > > > > I'm missing something, are there any other suggestions? > > > > > > Unfortunately I figured this would need a more invasive fix. I tend to > > > agree that it'd be better to not skip in situations like this. I think > > > it'd make most sense to make any plan for these 'prepare/fetch' queries > > > would not use skip, but rather a materialize node, right? > > > > Yes, sort of, without a skip scan it would be just an index only scan > > with unique on top. Actually it's not immediately clean how to achieve > > this, since at the moment, when planner is deciding to consider index > > skip scan, there is no information about neither direction nor whether > > we're dealing with a cursor. Maybe we can somehow signal to the decision > > logic that the root was a DeclareCursorStmt by e.g. introducing a new > > field to the query structure (or abusing an existing one, since > > DeclareCursorStmt is being processed by standard_ProcessUtility, just > > for a test I've tried to use utilityStmt of a nested statement hoping > > that it's unused and it didn't break tests yet). > > Umm. I think it's a wrong direction. While defining a cursor, > default scrollability is decided based on the query allows backward > scan or not. That is, the definition of backward-scan'ability is not > just whether it can scan from the end toward the beginning, but > whether it can go back and forth freely or not. In that definition, > the *current* skip scan does not supporting backward scan. If we want > to allow descending order-by in a query, we should support scrollable > cursor, too. 
> > We could add an additional parameter "in_cursor" to > ExecSupportBackwardScan and let skip scan return false if in_cursor is > true, but I'm not sure it's acceptable. I also was thinking about whether it's possible to use ExecSupportBackwardScan here, but skip scan is just a mode of an index/indexonly scan. Which means that ExecSupportBackwardScan also need to know somehow if this mode is being used, and then, since this function is called after it's already decided to use skip scan in the resulting plan, somehow correct the plan (exclude skipping and try to find next best path?) - do I understand your suggestion correct?
Re: Index Skip Scan
At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in > > On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote: > > > > > this point me and Jesper inclined to go with the second option. But maybe > > > I'm missing something, are there any other suggestions? > > > > Unfortunately I figured this would need a more invasive fix. I tend to > > agree that it'd be better to not skip in situations like this. I think it'd > > make most sense to make any plan for these 'prepare/fetch' queries would > > not use skip, but rather a materialize node, right? > > Yes, sort of, without a skip scan it would be just an index only scan > with unique on top. Actually it's not immediately clean how to achieve > this, since at the moment, when planner is deciding to consider index > skip scan, there is no information about neither direction nor whether > we're dealing with a cursor. Maybe we can somehow signal to the decision > logic that the root was a DeclareCursorStmt by e.g. introducing a new > field to the query structure (or abusing an existing one, since > DeclareCursorStmt is being processed by standard_ProcessUtility, just > for a test I've tried to use utilityStmt of a nested statement hoping > that it's unused and it didn't break tests yet). Umm. I think it's a wrong direction. While defining a cursor, default scrollability is decided based on the query allows backward scan or not. That is, the definition of backward-scan'ability is not just whether it can scan from the end toward the beginning, but whether it can go back and forth freely or not. In that definition, the *current* skip scan does not supporting backward scan. If we want to allow descending order-by in a query, we should support scrollable cursor, too. We could add an additional parameter "in_cursor" to ExecSupportBackwardScan and let skip scan return false if in_cursor is true, but I'm not sure it's acceptable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Index Skip Scan
> On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote: > > > this point me and Jesper inclined to go with the second option. But maybe > > I'm missing something, are there any other suggestions? > > Unfortunately I figured this would need a more invasive fix. I tend to agree > that it'd be better to not skip in situations like this. I think it'd make > most sense to make any plan for these 'prepare/fetch' queries would not use > skip, but rather a materialize node, right? Yes, sort of, without a skip scan it would be just an index only scan with unique on top. Actually it's not immediately clean how to achieve this, since at the moment, when planner is deciding to consider index skip scan, there is no information about neither direction nor whether we're dealing with a cursor. Maybe we can somehow signal to the decision logic that the root was a DeclareCursorStmt by e.g. introducing a new field to the query structure (or abusing an existing one, since DeclareCursorStmt is being processed by standard_ProcessUtility, just for a test I've tried to use utilityStmt of a nested statement hoping that it's unused and it didn't break tests yet).
RE: Index Skip Scan
> this point me and Jesper inclined to go with the second option. But maybe > I'm missing something, are there any other suggestions? Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better to not skip in situations like this. I think it'd make most sense to make any plan for these 'prepare/fetch' queries would not use skip, but rather a materialize node, right? -Floris
Re: Index Skip Scan
> On Mon, Jan 27, 2020 at 02:00:39PM +, Floris Van Nee wrote:
>
> Thanks for the new patch! I tested it and managed to find a case that
> causes some issues. Here's how to reproduce:

So, after a bit of investigation I found the issue (it was actually there even in the previous version). In this one case of moving forward while reading backward - exactly the scenario you've described above - the current implementation was not ignoring deleted tuples. My first idea to fix this was to use _bt_readpage when necessary and put a couple of _bt_killitems calls before we leave a page while jumping, so that deleted tuples would be ignored.

To demonstrate it visually, let's say we want to go backward on a cursor over an ORDER BY a DESC, b DESC query, i.e. return: (1,100), (2,100), (3,100) etc. To achieve that we jump from (1,1) to (1,100), from (2,1) to (2,100) and so on. If some values are deleted, we need to read backward. E.g. if (3,100) is deleted, we need to return (3,99).

+---------------+---------------+---------------+---------------+
|               |               |               |               |
| 1,1 ... 1,100 | 2,1 ... 2,100 | 3,1 ... 3,100 | 4,1 ... 4,100 |
|               |               |               |               |
+---------------+---------------+---------------+---------------+
  |         ^     |         ^     |         ^     |         ^
  |         |     |         |     |         |     |         |
  +---------+     +---------+     +---------+     +---------+

If it happens that a whole value series is deleted, we return to the previous value and need to detect that situation. E.g. if all the values from (3,1) to (3,100) were deleted, we will return to (2,100).

+---------------+---------------+     +---------------+
|               |               |     |               |
| 1,1 ... 1,100 | 2,1 ... 2,100 |<--+ | 4,1 ... 4,100 |
|               |               |   | |               |
+---------------+---------------+   | +---------------+
  |         ^     |         ^       |   |         ^
  |         |     |         |       +---+         |
  +---------+     +---------+                     +----

This is all implemented inside _bt_skip. Unfortunately, as I see it now, the idea of relying on the ability to skip dead index tuples without checking the heap tuple is not reliable, since an index tuple will be added to killedItems and can be marked as dead only when not a single transaction can see it anymore. Potentially there are two possible solutions:

* Adjust the code in nodeIndexOnlyscan to perform a proper visibility check and understand if we returned back. Obviously this will make the patch more invasive.

* Reduce the scope of the patch, and simply not apply jumping in this case. This means less functionality but hopefully still brings some value.

At this point Jesper and I are inclined to go with the second option. But maybe I'm missing something - are there any other suggestions?
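The jump-and-read-backward behaviour described above can be modelled in a few lines of Python (a toy model of the intended result, not the _bt_skip implementation: a plain backward walk with prefix deduplication stands in for the real jump machinery, and "dead" tuples are just a set):

```python
# Toy model of the jump-and-read-backward behaviour: for each distinct
# value of a (scanning in descending order) return the tuple with the
# highest b, skipping dead tuples. A plain backward scan with prefix
# deduplication stands in for the real jump machinery in _bt_skip.

def distinct_on_desc(tuples, dead):
    """tuples: list of (a, b); dead: set of deleted tuples."""
    result, seen = [], set()
    for t in sorted(tuples, reverse=True):  # walk the "index" backward
        if t in dead or t[0] in seen:
            continue  # ignore dead tuples and already-emitted prefixes
        seen.add(t[0])
        result.append(t)
    return result

tuples = [(a, b) for a in (1, 2, 3, 4) for b in range(1, 101)]

# if (3,100) is deleted, (3,99) must be returned instead
assert distinct_on_desc(tuples, {(3, 100)})[1] == (3, 99)

# if the whole a=3 series is deleted, the scan lands on (2,100)
dead = {(3, b) for b in range(1, 101)}
assert distinct_on_desc(tuples, dead) == [(4, 100), (2, 100), (1, 100)]
```

The hard part the toy ignores is precisely the visibility question discussed above: here "dead" is a ground-truth set, whereas the real scan can only trust killedItems once no transaction can see the tuple anymore.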
Re: Index Skip Scan
Oh, interesting, thank you. I believe I know what happened: there is one unnecessary locking part that ultimately only causes problems, plus one direct access to page items without _bt_readpage. Will post a new version soon.

On Mon, Jan 27, 2020 at 3:00 PM Floris Van Nee wrote:
>
> Hi Dmitry,
>
> Thanks for the new patch! I tested it and managed to find a case that
> causes some issues. Here's how to reproduce:
>
> drop table if exists t;
> create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a,
>   generate_series(1,1000) b;
> create index on t (a,b,c,d);
>
> -- correct
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
>   from t order by a desc, b desc; fetch forward all from c; fetch backward
>   all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a |  b   | c | d
> ---+------+---+----
>  5 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  3 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  1 | 1000 | 0 | 10
> (5 rows)
>
>  a |  b   | c | d
> ---+------+---+----
>  1 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  3 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  5 | 1000 | 0 | 10
> (5 rows)
>
> -- now delete some rows
> postgres=# delete from t where a=3;
> DELETE 1000
>
> -- and rerun: error is thrown
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
>   from t order by a desc, b desc; fetch forward all from c; fetch backward
>   all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a |  b   | c | d
> ---+------+---+----
>  5 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  1 | 1000 | 0 | 10
> (4 rows)
>
> ERROR:  lock buffer_content is not held
> ROLLBACK
>
> A slightly different situation arises when executing the cursor with an
> ORDER BY a, b instead of the ORDER BY a DESC, b DESC:
>
> -- recreate table again and execute the delete as above
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
>   from t order by a, b; fetch forward all from c; fetch backward all from c;
>   commit;
> BEGIN
> DECLARE CURSOR
>  a |  b  | c | d
> ---+-----+---+----
>  1 |   1 | 1 | 10
>  2 |   1 | 1 | 10
>  4 |   1 | 1 | 10
>  5 |   1 | 1 | 10
> (4 rows)
>
>  a |  b  | c | d
> ---+-----+---+----
>  5 |   1 | 1 | 10
>  4 |   1 | 1 | 10
>  2 | 827 | 1 | 10
>  1 |   1 | 1 | 10
> (4 rows)
>
> COMMIT
>
> And lastly, you'll also get incorrect results if you do the delete
> slightly differently:
>
> -- leave one row where a=3 and b=1000
> postgres=# delete from t where a=3 and b<=999;
> -- the cursor query above won't show any of the a=3 rows even though they
>    should
>
> -Floris
RE: Index Skip Scan
Hi Dmitry,

Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:

drop table if exists t;
create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a,
  generate_series(1,1000) b;
create index on t (a,b,c,d);

-- correct
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
  from t order by a desc, b desc; fetch forward all from c; fetch backward
  all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(5 rows)

 a |  b   | c | d
---+------+---+----
 1 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 5 | 1000 | 0 | 10
(5 rows)

-- now delete some rows
postgres=# delete from t where a=3;
DELETE 1000

-- and rerun: error is thrown
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
  from t order by a desc, b desc; fetch forward all from c; fetch backward
  all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(4 rows)

ERROR:  lock buffer_content is not held
ROLLBACK

A slightly different situation arises when executing the cursor with an ORDER BY a, b instead of the ORDER BY a DESC, b DESC:

-- recreate table again and execute the delete as above
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
  from t order by a, b; fetch forward all from c; fetch backward all from c;
  commit;
BEGIN
DECLARE CURSOR
 a |  b  | c | d
---+-----+---+----
 1 |   1 | 1 | 10
 2 |   1 | 1 | 10
 4 |   1 | 1 | 10
 5 |   1 | 1 | 10
(4 rows)

 a |  b  | c | d
---+-----+---+----
 5 |   1 | 1 | 10
 4 |   1 | 1 | 10
 2 | 827 | 1 | 10
 1 |   1 | 1 | 10
(4 rows)

COMMIT

And lastly, you'll also get incorrect results if you do the delete slightly differently:

-- leave one row where a=3 and b=1000
postgres=# delete from t where a=3 and b<=999;
-- the cursor query above won't show any of the a=3 rows even though they
   should
-Floris
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote: > > > Let me try to clarify. After we return the first tuple, so->currPos.buf is > > pointing to page=1 in my example (it's the only page after all). We've > > returned item=8. Then the split happens and the items get rearranged as in > > my example. We're still pointing with so->currPos.buf to page=1, but the > > page now contains [2,4]. The split happened to the right, so there's a > > page=2 with [5,6,8], however the ongoing index scan is unaware of that. > > Now _bt_skip gets called to fetch the next tuple. It starts by checking > > _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the > > result of which will be 'true': we're comparing the skip key to the low key > > of the page. So it thinks the next key can be found on the current page. It > > locks the page and does a _binsrch, finding item=4 to be returned. > > > > The problem here is that _bt_scankey_within_page mistakenly returns true, > > thereby limiting the search to just the page that it's pointing to already. > > It may be fine to just fix this function to return the proper value (I > > guess it'd also need to look at the high key in this example). It could > > also be fixed by not looking at the lo/hi key of the page, but to use the > > local tuple buffer instead. We already did a _read_page once, so if we have > > any matching tuples on that specific page, we have them locally in the > > buffer already. That way we never need to lock the same page twice. > > Oh, that's what you mean. Yes, I was somehow tricked by the name of this > function and didn't notice that it checks only one boundary, so in case > of backward scan it returns wrong result. I think in the situation > you've describe it would actually not find any item on the current page > and restart from the root, but nevertheless we need to check for both > keys in _bt_scankey_within_page. Thanks again everyone for commentaries and clarification. 
Here is a version where hopefully I've addressed all the mentioned issues.

As mentioned in the _bt_skip comments, previously we were moving left to check the next page, to avoid significant issues in case ndistinct was underestimated and we need to skip too often. To make this safe in the presence of splits, we would need to remember the original page and move right again until we find a page whose right link points to it. It's not clear whether it's worth increasing complexity for this sort of "edge case" with ndistinct estimation while moving left, so at least for now we ignore this in the implementation and just start from the root immediately.

The offset-based code for moving forward/reading backward was replaced with remembering a start index tuple and attempting to find it on the new page. Also, a missing page lock before _bt_scankey_within_page was added, and _bt_scankey_within_page now checks both page boundaries.

>From c363e1f5dc7e2f0288fb04ca80bd073229c458a1 Mon Sep 17 00:00:00 2001
From: jesperpedersen
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH v31 1/2] Unique key

Design by David Rowley.
Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 3 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 71 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 17 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 373 insertions(+), 13 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..16083e7a7e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list);
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan wrote: > On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee > wrote: > > As far as I know, a page split could happen at any random element in the > > page. One of the situations we could be left with is: > > Leaf page 1 = [2,4] > > Leaf page 2 = [5,6,8] > > However, our scan is still pointing to leaf page 1. For non-skip scans this > > is not a problem, as we already read all matching elements in our local > > buffer and we'll return those. But the skip scan currently: > > a) checks the lo-key of the page to see if the next prefix can be found on > > the leaf page 1 > > b) finds out that this is actually true > > c) does a search on the page and returns value=4 (while it should have > > returned value=6) > > > > Peter, is my understanding about the btree internals correct so far? > > This is a good summary. This is the kind of scenario I had in mind > when I expressed a general concern about "stopping between pages". > Processing a whole page at a time is a crucial part of how > _bt_readpage() currently deals with concurrent page splits. I want to be clear about what it means that the page doesn't have a "low key". Let us once again start with a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] -- just like in Floris' original page split scenario. Let us also say that page 1 has a left sibling page -- page 0. Page 0 happens to have a high key with the integer value 0. So you could *kind of* claim that the "low key" of page 1 is the integer value 0 (page 1 values must be > 0) -- *not* the integer value 2 (the so-called "low key" here is neither > 2, nor >= 2). More formally, an invariant exists that says that all values on page 1 must be *strictly* greater than the integer value 0. 
However, this formal invariant thing is hard or impossible to rely on when we actually reach page 1 and want to know about its lower bound -- since there is no "low key" pivot tuple on page 1 (we can only speak of a "low key" as an abstract concept, or something that works transitively from the parent -- there is only a physical high key pivot tuple on page 1 itself).

Suppose further that page 0 is now empty, apart from its "all values on page are <= 0" high key (page 0 must have had a few negative integer values in its tuples at some point, but not anymore). VACUUM will delete the page, *changing the effective low key* of page 1 in the process. The lower bound from the shared parent page will move lower/left as a consequence of the deletion of page 0. nbtree page deletion makes the "keyspace move right, not left". So the "conceptual low key" of page 1 just went down from 0 to -5 (say), without there being any practical way for a skip scan reading page 1 to notice the change (the left sibling of page 0, page -1, has a high key of <= -5, say). Not only is it possible for somebody to insert the value 1 in page 1 -- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed to page 0 is still there -- all that page deletion changed about this tuple is its downlink, which now points to page 1 instead of page 0. Confusingly, page deletion removes the pivot tuple of the right sibling page from the parent -- *not* the pivot tuple of the empty page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in truncated pivot tuples, and the heap TID tiebreaker column -- in reality this would look a bit different because of those factors. See also: amcheck's bt_right_page_check_scankey() function, which has a huge comment that reasons about a race involving page deletion.
In general, page deletion is by far the biggest source of complexity when reasoning about the key space. -- Peter Geoghegan
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Oh, that's what you mean. Yes, I was somehow tricked by the name of this > function and didn't notice that it checks only one boundary, so in case > of backward scan it returns wrong result. I think in the situation > you've describe it would actually not find any item on the current page > and restart from the root, but nevertheless we need to check for both > keys in _bt_scankey_within_page. I suggest reading the nbtree README file's description of backwards scans. Read the paragraph that begins with 'We support the notion of an ordered "scan" of an index...'. I also suggest that you read a bit of the stuff in the large section on page deletion. Certainly read the paragraph that begins with 'Moving left in a backward scan is complicated because...'. It's important to grok why it's okay that we don't "couple" or "crab" buffer locks as we descend the tree with Lehman & Yao's design -- we can get away with having *no* interlock against page splits (e.g., pin, buffer lock) when we are "between" levels of the tree. This is safe, since the page that we land on must still be "substantively the same page", no matter how much time passes. That is, it must at least cover the leftmost portion of the keyspace covered by the original version of the page that we saw that we needed to descend to within the parent page. The worst that can happen is that we have to recover from a concurrent page split by moving right one or more times. (Actually, page deletion can change the contents of a page entirely, but that's not really an exception to the general rule -- page deletion is careful about recycling pages that an in flight index scan might land on.) Lehman & Yao don't have backwards scans (or left links, or page deletion). Unlike nbtree. This is why the usual Lehman & Yao guarantees don't quite work with backward scans. 
We must therefore compensate as described by the README file (basically, we check and re-check for races, possibly returning to the original page when we think that we might have overlooked something and need to make sure). It's an exception to the general rule, you could say. -- Peter Geoghegan
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee wrote: > The loose index scan shouldn't return a tuple twice. It should only be able > to skip 'further', so that shouldn't be a problem. Out of curiosity, why > can't index scans return the same tuple twice? Is there something in the > executor that isn't able to handle this? I have no reason to believe that the executor has a problem with index scans that return a tuple more than once, aside from the very obvious: in general, that will often be wrong. It might not be wrong when the scan happens to be input to a unique node anyway, or something like that. I'm not particularly concerned about it. I just wanted to be clear on our assumptions for loose index scans -- if loose index scans were allowed to return a tuple more than once, then that would at least have to be considered in the wider context of the executor (but apparently they're not, so no need to worry about it). This may have been mentioned somewhere already; if it has, then I must have missed it. -- Peter Geoghegan
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 09:08:59PM +, Floris Van Nee wrote: > > Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? > > > > The loose index scan shouldn't return a tuple twice. It should only be able > to skip 'further', so that shouldn't be a problem. Yes, it shouldn't happen.
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 05:24:43PM +, Floris Van Nee wrote: > > > Anyone please correct me if I'm wrong, but I think one case where the > > > current patch relies on some data from the page it has locked before is > > > in checking this hi/lo key. I think it's possible for the following > > > sequence to happen. Suppose we have a very simple one-leaf-page btree > > > containing four elements: leaf page 1 = [2,4,6,8] > > > We do a backwards index skip scan on this and have just returned our > > > first tuple (8). The buffer is left pinned but unlocked. Now, someone > > > else comes in and inserts a tuple (value 5) into this page, but suppose > > > the page happens to be full. So a page split occurs. As far as I know, a > > > page split could happen at any random element in the page. One of the > > > situations we could be left with is: > > > Leaf page 1 = [2,4] > > > Leaf page 2 = [5,6,8] > > > However, our scan is still pointing to leaf page 1. > > > If we just returned a tuple, the next action would be either to > > > check the next page for another key or to search down the tree. > But it won't look at the 'next page for another key', but rather at the 'same > page for another key', right? In the _bt_scankey_within_page shortcut we're > taking, there's no stepping to a next page involved. It just locks the page > again that it previously also locked. Yep, it would look only at the same page. Not sure what you mean by "another key"; if the current key is not found within the current page at the first stage, we restart from the root. > > Maybe I'm missing something in your scenario, but the latter will land us on a > > required page (we do not point to any leaf here), and before the former > > there is a check for the high/low key. Is there anything else missing? > Let me try to clarify. After we return the first tuple, so->currPos.buf is > pointing to page=1 in my example (it's the only page after all). We've > returned item=8.
Then the split happens and the items get rearranged as in my > example. We're still pointing with so->currPos.buf to page=1, but the page > now contains [2,4]. The split happened to the right, so there's a page=2 with > [5,6,8], however the ongoing index scan is unaware of that. > Now _bt_skip gets called to fetch the next tuple. It starts by checking > _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the > result of which will be 'true': we're comparing the skip key to the low key > of the page. So it thinks the next key can be found on the current page. It > locks the page and does a _binsrch, finding item=4 to be returned. > > The problem here is that _bt_scankey_within_page mistakenly returns true, > thereby limiting the search to just the page that it's pointing to already. > It may be fine to just fix this function to return the proper value (I guess > it'd also need to look at the high key in this example). It could also be > fixed by not looking at the lo/hi key of the page, but by using the local tuple > buffer instead. We already did a _read_page once, so if we have any matching > tuples on that specific page, we have them locally in the buffer already. > That way we never need to lock the same page twice. Oh, that's what you mean. Yes, I was somehow tricked by the name of this function and didn't notice that it checks only one boundary, so in case of a backward scan it returns a wrong result. I think in the situation you've described it would actually not find any item on the current page and restart from the root, but nevertheless we need to check for both keys in _bt_scankey_within_page.
RE: Index Skip Scan
> Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? > The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this? -Floris