Re: Index Skip Scan (new UniqueKeys)
On 6/9/20 06:22, Dmitry Dolgov wrote:
> Here is a new version of index skip scan patch, based on v8 patch for
> UniqueKeys implementation from [1]. I want to start a new thread to
> simplify navigation, hopefully I didn't forget anyone who actively
> participated in the discussion.

This CommitFest entry has been closed with RwF at [1]. Thanks for all the feedback given!

[1] https://www.postgresql.org/message-id/ab8636e7-182f-886a-3a39-f3fc279ca45d%40redhat.com

Best regards,
Dmitry & Jesper
Re: Index Skip Scan (new UniqueKeys)
On Mon, Jan 24, 2022 at 8:51 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote: > > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > > Besides that the new patch version contains some cleaning up and > > > addresses commentaries around leaf page pinning from [1]. The idea > > > behind the series structure is still the same: the first three patches > > > contains the essence of the implementation (hoping to help concentrate > > > review), the rest are more "experimental". > > > > > > [1]: > > > > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com > > > > Hi, > > > > + /* If same EC already is already in the list, then not unique */ > > > > The word already is duplicated. > > > > + * make_pathkeys_for_uniquekeyclauses > > > > The func name in the comment is different from the actual func name. > > Thanks for the review! Right, both above make sense. I'll wait a bit if > there will be more commentaries, and then post a new version with all > changes at once. > > > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > > > The year should be 2022 :-) > > Now you see how old is this patch :) > > > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should > > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? > > It's actually placed there by analogy with make_pathkeys_for_sortclauses > (immediately preceding function), so I think moving it into uniquekeys > will only make more confusion. > > > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, > > +bool allow_multinulls) > > > > It seems allow_multinulls is not checked in the func. Can the parameter > be > > removed ? > > Right, it could be removed. I believe it was somewhat important when the > patch was tightly coupled with the UniqueKeys patch, where it was put in > use. 
Hi,

It would be nice to take this unused parameter out of this patch; it should be added in the patch series where it is actually used.

Cheers
Re: Index Skip Scan (new UniqueKeys)
> On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote: > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Besides that the new patch version contains some cleaning up and > > addresses commentaries around leaf page pinning from [1]. The idea > > behind the series structure is still the same: the first three patches > > contains the essence of the implementation (hoping to help concentrate > > review), the rest are more "experimental". > > > > [1]: > > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com > > Hi, > > + /* If same EC already is already in the list, then not unique */ > > The word already is duplicated. > > + * make_pathkeys_for_uniquekeyclauses > > The func name in the comment is different from the actual func name. Thanks for the review! Right, both above make sense. I'll wait a bit if there will be more commentaries, and then post a new version with all changes at once. > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > The year should be 2022 :-) Now you see how old is this patch :) > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? It's actually placed there by analogy with make_pathkeys_for_sortclauses (immediately preceding function), so I think moving it into uniquekeys will only make more confusion. > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, > +bool allow_multinulls) > > It seems allow_multinulls is not checked in the func. Can the parameter be > removed ? Right, it could be removed. I believe it was somewhat important when the patch was tightly coupled with the UniqueKeys patch, where it was put in use. > + Path *newpath; > + > + newpath = (Path *) create_projection_path(root, rel, subpath, > + scanjoin_target); > > You can remove variable newpath and assign to lfirst(lc) directly. 
Yes, but I've followed the same style for create_projection_path as in many other invocations of this function in planner.c -- I would prefer to keep it uniform.

> +add_path(RelOptInfo *parent_rel, Path *new_path)
> +add_unique_path(RelOptInfo *parent_rel, Path *new_path)
>
> It seems the above two func's can be combined into one func which
> takes parent_rel->pathlist / parent_rel->unique_pathlist as third parameter.

Sure, but here I've intentionally split it into separate functions; otherwise a lot of unrelated call sites would have to be updated to provide the third parameter.
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > > Hi, > > > > Here is another take on the patch with a couple of changes: > > > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > > unique keys patch was actually not that big, so the main difference is > > that now the structure itself went away, a list of unique expressions > > is used instead. All the suggestions about how this feature should > > look like from the planning perspective are still there. On the one > > hand it will allow to develop both patches independently and avoid > > confusion for reviewers, on the other UniqueKeys could be easily > > incorporated back when needed. > > > > * Support for skipping in case of moving backward on demand (scroll > > cursor) is moved into a separate patch. This is implemented via > > returning false from IndexSupportsBackwardScan in case if it's a skip > > scan node, which in turn adds Materialize node on top when needed. The > > name SupportsBackwardScan was a bit confusing for me, but it seems > > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > > Eventually those cases when BackwardScanDirection is used are still > > handled by amskip. This change didn't affect the test coverage, all > > the test cases supported in previous patch versions are still there. > > > > About Materialize node, I guess it could be less performant than a > > "native" support, but it simplifies the implementation significantly > > to the point that most parts, which were causing questions before, are > > now located in the isolated patch. My idea here is to concentrate > > efforts on the first three patches in this series, and consider the > > rest of them as an experiment field. > > > > * IndexScan support was also relocated into a separate patch, the first > > three patches are now only about IndexOnlyScan. 
> > > > * Last bits of reviews were incorporated and rebased. > > While the patch is still waiting for a review, I was motivated by the > thread [1] to think about it from the interface point of view. Consider > an index skip scan being just like a normal index scan with a set of > underspecified leading search keys. It makes sense to have the same > structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm > not sure if amskip is a good name to represent that, don't have anything > better yet). But the "underspecified" part is currently indeed > interpreted in a limited way -- as "missing" keys -- and is expressed > only via the prefix size. Another option would be e.g. leading keys > constrained by a range of values, so generally speaking it makes sense > to extend amount of the information provided for skipping. > > As a naive approach I've added a new patch into the series, containing > the extra data structure (ScanLooseKeys, doesn't have much meaning yet > except somehow representing keys for skipping) used for index skip scan. > Any thoughts about it? > > Besides that the new patch version contains some cleaning up and > addresses commentaries around leaf page pinning from [1]. The idea > behind the series structure is still the same: the first three patches > contains the essence of the implementation (hoping to help concentrate > review), the rest are more "experimental". > > [1]: > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com Hi, + /* If same EC already is already in the list, then not unique */ The word already is duplicated. + * make_pathkeys_for_uniquekeyclauses The func name in the comment is different from the actual func name. + * Portions Copyright (c) 2020, PostgreSQL Global Development Group The year should be 2022 :-) make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? 
+query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
+                         bool allow_multinulls)

It seems allow_multinulls is not checked in the func. Can the parameter be removed?

+	Path	   *newpath;
+
+	newpath = (Path *) create_projection_path(root, rel, subpath,
+	                                          scanjoin_target);

You can remove the variable newpath and assign to lfirst(lc) directly.

+add_path(RelOptInfo *parent_rel, Path *new_path)
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)

It seems the above two funcs can be combined into one func which takes
parent_rel->pathlist / parent_rel->unique_pathlist as a third parameter.

Cheers
Re: Index Skip Scan (new UniqueKeys)
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > Hi, > > Here is another take on the patch with a couple of changes: > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > unique keys patch was actually not that big, so the main difference is > that now the structure itself went away, a list of unique expressions > is used instead. All the suggestions about how this feature should > look like from the planning perspective are still there. On the one > hand it will allow to develop both patches independently and avoid > confusion for reviewers, on the other UniqueKeys could be easily > incorporated back when needed. > > * Support for skipping in case of moving backward on demand (scroll > cursor) is moved into a separate patch. This is implemented via > returning false from IndexSupportsBackwardScan in case if it's a skip > scan node, which in turn adds Materialize node on top when needed. The > name SupportsBackwardScan was a bit confusing for me, but it seems > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > Eventually those cases when BackwardScanDirection is used are still > handled by amskip. This change didn't affect the test coverage, all > the test cases supported in previous patch versions are still there. > > About Materialize node, I guess it could be less performant than a > "native" support, but it simplifies the implementation significantly > to the point that most parts, which were causing questions before, are > now located in the isolated patch. My idea here is to concentrate > efforts on the first three patches in this series, and consider the > rest of them as an experiment field. > > * IndexScan support was also relocated into a separate patch, the first > three patches are now only about IndexOnlyScan. > > * Last bits of reviews were incorporated and rebased. 
While the patch is still waiting for a review, I was motivated by the thread [1] to think about it from the interface point of view. Consider an index skip scan as being just like a normal index scan with a set of underspecified leading search keys. It makes sense to have the same structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm not sure if amskip is a good name to represent that, don't have anything better yet). But the "underspecified" part is currently interpreted in a limited way -- as "missing" keys -- and is expressed only via the prefix size. Another option would be e.g. leading keys constrained by a range of values, so generally speaking it makes sense to extend the amount of information provided for skipping.

As a naive approach I've added a new patch into the series, containing the extra data structure (ScanLooseKeys, doesn't have much meaning yet except somehow representing keys for skipping) used for index skip scan. Any thoughts about it?

Besides that, the new patch version contains some cleanup and addresses commentaries around leaf page pinning from [1]. The idea behind the series structure is still the same: the first three patches contain the essence of the implementation (hoping to help concentrate review), the rest are more "experimental".

[1]: https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com

From 8b61823e523ceecbb2fd6faa1a229038bb981594 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v40 1/6] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query level,
and allows leveraging this on the path level. At the moment only
distinctClause makes use of such a mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan; contains a few bits out of previous
versions from Jesper Pedersen and Floris Van Nee.
---
 src/backend/nodes/list.c                | 31 +
 src/backend/optimizer/path/Makefile     |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c    | 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h           |  5 ++
 src/include/nodes/pg_list.h             |  1 +
 src/include/optimizer/pathnode.h        |  1 +
 src/include/optimizer/paths.h           |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..a53a50f372 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1653,3 +1653,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every
Re: Index Skip Scan (new UniqueKeys)
Hi,

Here is another take on the patch with a couple of changes:

* I've removed the UniqueKeys parts for now. The interaction between the
  skip scan & unique keys patches was actually not that big, so the main
  difference is that the structure itself went away and a list of unique
  expressions is used instead. All the suggestions about how this feature
  should look from the planning perspective are still there. On the one
  hand this allows developing both patches independently and avoids
  confusion for reviewers; on the other, UniqueKeys could easily be
  incorporated back when needed.

* Support for skipping when moving backward on demand (scroll cursor) is
  moved into a separate patch. This is implemented via returning false
  from IndexSupportsBackwardScan in case it's a skip scan node, which in
  turn adds a Materialize node on top when needed. The name
  SupportsBackwardScan was a bit confusing for me, but it seems it's only
  used together with a cursorOptions check for CURSOR_OPT_SCROLL.
  Eventually those cases where BackwardScanDirection is used are still
  handled by amskip. This change didn't affect the test coverage; all the
  test cases supported in previous patch versions are still there.

  About the Materialize node: I guess it could be less performant than
  "native" support, but it simplifies the implementation significantly,
  to the point that most parts which were causing questions before are
  now located in the isolated patch. My idea here is to concentrate
  efforts on the first three patches in this series, and consider the
  rest of them as an experimental field.

* IndexScan support was also relocated into a separate patch; the first
  three patches are now only about IndexOnlyScan.

* The last bits of reviews were incorporated and the patch rebased.
From 074fc8a41b43dd8b10ab29eb880c9d161a1638d5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v39 1/5] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query level,
and allows leveraging this on the path level. At the moment only
distinctClause makes use of such a mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan; contains a few bits out of previous
versions from Jesper Pedersen and Floris Van Nee.
---
 src/backend/nodes/list.c                | 31 +
 src/backend/optimizer/path/Makefile     |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c    | 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h           |  5 ++
 src/include/nodes/pg_list.h             |  1 +
 src/include/optimizer/pathnode.h        |  1 +
 src/include/optimizer/paths.h           |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 94fb236daf..9f2c408a4e 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1537,3 +1537,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+	const ListCell *lc1, *lc2;
+
+	Assert(IsPointerList(members));
+	Assert(IsPointerList(target));
+	check_list_invariants(members);
+	check_list_invariants(target);
+
+	foreach(lc1, members)
+	{
+		bool		found = false;
+
+		foreach(lc2, target)
+		{
+			if (equal(lfirst(lc1), lfirst(lc2)))
+			{
+				found = true;
+				break;
+			}
+		}
+		if (!found)
+			return false;
+	}
+	return true;
+}
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekeys.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index bd9a176d7d..f28547148d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
 #include "utils/lsyscache.h"

+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 											 RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
 	return pk;
 }

+/*
+ * pathkey_is_unique
+ *		Checks if the new pathkey's
Re: Index Skip Scan (new UniqueKeys)
On 3/17/21 6:02 PM, Dmitry Dolgov wrote: >> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: >> Hi, >> >> I took a look at the new patch series, focusing mostly on the uniquekeys >> part. It'd be a bit tedious to explain all the review comments here, so >> attached is a patch series with a "review" patch for some of the parts. > > Great, thanks. > >> Most of it is fairly small (corrections to comments etc.), I'll go over >> the more serious part so that we can discuss it here. I'll keep it split >> per parts of the original patch series. >> I suggest looking for XXX and FIXME comments in all the review patches. >> >> >> 0001 >> >> >> >> >> 0002 >> >> > > In fact both 0001 & 0002 belong to another thread, which these days > span [1], [2]. I've included them only because they happened to be a > dependency for index skip scan following David suggestions, sorry if > it's confusing. > > At the same time the author behind 0001 & 0002 is present in this thread > as well, maybe Andy can answer these comments right here and better than me. > Ah, sorry for the confusion. In that case the review comments probably belong to the other threads, so we should move the discussion there. It's not clear to me which of the threads is the right one. >> 0003 >> >> >> Just some comments/whitespace. >> >> >> 0004 >> >> >> I wonder why we don't include this in explain TEXT format? Seems it >> might make it harder to write regression tests for this? It's easier to >> just check that we deduced the right unique key(s) than having to >> construct an example where it actually changes the plan. > > Yeah, good point. I believe originally it was like that to not make > explain too verbose for skip scans, but displaying prefix definitely > could be helpful for testing, so will do this (and address other > comments as well). > Cool. Thanks. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: > Hi, > > I took a look at the new patch series, focusing mostly on the uniquekeys > part. It'd be a bit tedious to explain all the review comments here, so > attached is a patch series with a "review" patch for some of the parts. Great, thanks. > Most of it is fairly small (corrections to comments etc.), I'll go over > the more serious part so that we can discuss it here. I'll keep it split > per parts of the original patch series. > I suggest looking for XXX and FIXME comments in all the review patches. > > > 0001 > > > > > 0002 > > In fact both 0001 & 0002 belong to another thread, which these days span [1], [2]. I've included them only because they happened to be a dependency for index skip scan following David suggestions, sorry if it's confusing. At the same time the author behind 0001 & 0002 is present in this thread as well, maybe Andy can answer these comments right here and better than me. > 0003 > > > Just some comments/whitespace. > > > 0004 > > > I wonder why we don't include this in explain TEXT format? Seems it > might make it harder to write regression tests for this? It's easier to > just check that we deduced the right unique key(s) than having to > construct an example where it actually changes the plan. Yeah, good point. I believe originally it was like that to not make explain too verbose for skip scans, but displaying prefix definitely could be helpful for testing, so will do this (and address other comments as well). [1]: https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com
Re: Index Skip Scan (new UniqueKeys)
> On Thu, Jan 28, 2021 at 09:49:26PM +0900, Masahiko Sawada wrote:
> Hi Dmitry,
>
> Status update for a commitfest entry.
>
> This patch entry has been "Waiting on Author" on the CF app and the
> discussion seems inactive since the last CF. Could you share the
> current status of this patch? Heikki already sent review comments and
> there was a discussion, but is the WoA status correct? If it needs
> review, please rebase the patches and set it to "Needs Review" on the
> CF app. If you're not working on this, I'm going to set it to "Returned
> with Feedback", barring objections.

Yes, I'm still on it. In fact, I almost immediately sketched up a couple of changes to address Heikki's feedback, but was distracted by the subscripting stuff. Will try to send a new version of the patch soon.
Re: Index Skip Scan (new UniqueKeys)
Hi Dmitry,

On Sun, Oct 25, 2020 at 1:45 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Oct 06, 2020 at 05:20:39PM +0200, Dmitry Dolgov wrote:
> > > On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
> > >
> > > * I see the following compiler warning:
> > >
> > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> > > In function ‘populate_baserel_uniquekeys’:
> > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> > > warning: ‘expr’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> > >   797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
> > >       |           ^~
> >
> > This is mostly for the UniqueKeys patch, which is attached here only as a
> > dependency, but I'll prepare changes for that. Interestingly enough I
> > can't reproduce this warning, but if I understand correctly gcc has some
> > history of spurious uninitialized warnings, so I guess it could be
> > version dependent.
> >
> > > * Perhaps the warning is related to this nearby code that I noticed
> > > Valgrind complains about:
> > >
> > > ==1083468== VALGRINDERROR-BEGIN
> > > ==1083468== Invalid read of size 4
> > > ==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> > > ==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
> >
> > This also belongs to the UniqueKeys patch, but at least I can reproduce
> > this one. My guess is that nkeycolumns should be used there, not ncolumns,
> > which is visible in the index_including tests. The same as the previous
> > one, I will prepare corresponding changes.
> >
> > > * Do we really need the AM-level boolean flag/argument named
> > > "scanstart"? Why not just follow the example of btgettuple(), which
> > > determines whether or not the scan has been initialized based on the
> > > current scan position?
> > > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> > > cannot or should not take the same approach as btgettuple(). And even
> > > if you can't take exactly the same approach, I would still think that
> > > the scan's opaque B-Tree state should remember if it's the first call
> > > to _bt_skip() (rather than some subsequent call) in some other way
> > > (e.g. carrying a "scanstart" bool flag directly).
> >
> > Yes, agree, carrying this flag inside the opaque state would be better.
>
> Here is a new version which doesn't require the "scanstart" argument and
> contains a few other changes to address the issues mentioned earlier.
> It's also based on the latest UniqueKeys patches with the valgrind issue
> fixed (as before they're attached just for reference, you can find more
> in the original thread). I didn't rework the commentaries yet, will post
> that soon (need to get some inspiration first, probably via reading
> Shakespeare unless someone has better suggestions).
>
> > > * Why is it okay to do anything important based on the
> > > _bt_scankey_within_page() return value?
> > >
> > > If the page is empty, then how can we know that it's okay to go to the
> > > next value? I'm concerned that there could be subtle bugs in this
> > > area. VACUUM will usually just delete the empty page. But it won't
> > > always do so, for a variety of reasons that aren't worth going into
> > > now. This could mask bugs in this area. I'm concerned about patterns
> > > like this one from _bt_skip():
> > >
> > > while (!nextFound)
> > > {
> > >     ...
> > >     if (_bt_scankey_within_page(scan, so->skipScanKey,
> > >                                 so->currPos.buf, dir))
> > >     {
> > >         ...
> > >     }
> > >     else
> > >         /*
> > >          * If startItup could be not found within the current page,
> > >          * assume we found something new
> > >          */
> > >         nextFound = true;
> > >     ...
> > > }
> > >
> > > Why would you assume that "we found something new" here? In general I
> > > just don't understand the design of _bt_skip(). I get the basic idea
> > > of what you're trying to do, but it could really use better comments.
> >
> > Yeah, I'll put more effort into clear comments. There are two different
> > ways in which _bt_scankey_within_page is being used.
> >
> > The first one is to check if it's possible to skip traversal of the
> > tree from the root in case what we're looking for could be on the
> > current page. In this case an empty page would mean we need to search
> > from the root, so I'm not sure what the issue could be here?
> >
> > The second one (that you've highlighted above) I admit is probably the
> > most questionable part of the patch and open for suggestions on how to
> > improve it. It's required for one
Re: Index Skip Scan (new UniqueKeys)
On Wed, Dec 2, 2020 at 9:59 AM Heikki Linnakangas wrote: > On 01/12/2020 22:21, Dmitry Dolgov wrote: > >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > >> - I'm surprised you need a new index AM function (amskip) for this. Can't > >> you just restart the scan with index_rescan()? The btree AM can check if > >> the > >> new keys are on the same page, and optimize the rescan accordingly, like > >> amskip does. That would speed up e.g. nested loop scans too, where the keys > >> just happen to be clustered. > > > > An interesting point. At the moment I'm not sure whether it's possible > > to implement skipping via index_rescan or not, need to take a look. But > > checking if the new keys are on the same page would introduce some > > overhead I guess, wouldn't it be too invasive to add it into already > > existing btree AM? > > I think it'll be OK. But if it's not, you could add a hint argument to > index_rescan() to hint the index AM that the new key is known to be > greater than the previous key. FWIW here's what I wrote about that years ago[1]: > It works by adding a new index operation 'skip' which the executor > code can use during a scan to advance to the next value (for some > prefix of the index's columns). That may be a terrible idea and > totally unnecessary... but let me explain my > reasoning: > > 1. Perhaps some index implementations can do something better than a > search for the next key value from the root. Is it possible or > desirable to use the current position as a starting point for a btree > traversal? I don't know. > > 2. It seemed that I'd need to create a new search ScanKey to use the > 'rescan' interface for skipping to the next value, but I already had > an insertion ScanKey so I wanted a way to just reuse that. But maybe > there is some other way to reuse existing index interfaces, or maybe > there is an easy way to make a new search ScanKey from the existing >insertion ScanKey? 
[1] https://www.postgresql.org/message-id/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Dec 01, 2020 at 10:59:22PM +0200, Heikki Linnakangas wrote:
> > > - Does this optimization apply to bitmap index scans?
> >
> > No, from what I understand it doesn't.
>
> Would it be hard to add? Don't need to solve everything in the first
> version of this, but I think in principle you could do the same
> optimization for bitmap index scans, so if the current API can't do it,
> that's maybe an indication that the API isn't quite right.

I would expect it should not be hard, as at the moment all the parts seem relatively generic. But of course I need to check; it seems no one had bitmap index scans in mind while developing this patch.

> > > I'm actually a bit confused why we need this condition. The IndexScan
> > > executor node should call amskip() only after checking the additional
> > > quals, no?
> >
> > This part I don't quite get, what exactly do you mean by checking the
> > additional quals in the executor node? But at the end of the day this
> > condition was implemented exactly to address the described issue, which
> > was found later and added to the tests.
>
> As I understand this, the executor logic goes like this:
>
> query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a like 'a%' and b = 'b';
>
> 1. Call index_beginscan, keys: a >= 'a', b = 'b'
> 2. Call index_getnext, which returns the first row to the Index Scan node
> 3. Evaluate the qual "c like '%y%'" on the tuple. If it's false, goto step 2 to get the next tuple.
> 4. Return tuple to parent node
> 5. index_amskip(), to the next tuple with a > 'a'. Goto 2.
>
> The logic should work fine, even if there are quals that are not
> indexable, like "c like '%y'" in the above example. So why doesn't it
> work? What am I missing?
To remind myself how it works I went through this sequence, and from what I understand the qual "c like '%y%'" is evaluated in this case in ExecQual, not after index_getnext_tid (and values returned after skipping are reported as filtered out). So when it comes to index_skip, only the quals on a & b have been evaluated. Or did you mean something else?

Another small detail is that in the current implementation there is no "goto 2" in the last step. Originally it was like that, but since skipping returns the exact position that we need, there was something like "find a value, then do one step back so that index_getnext will find it". Unfortunately this stepping-back part turned out to be a source of trouble, and getting rid of it even allowed making the code somewhat more concise. But of course I'm open to suggestions about improvements.
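The ordering question being debated here — does the non-indexable qual run before or after the skip? — can be made concrete with a toy model. This is only an illustration of the failure mode, with invented helper names, not the executor code: if the scan skips to the next distinct prefix before the filter qual has accepted a tuple, a group whose first tuple fails the filter is lost entirely, whereas the sequence that only skips after a tuple has actually been returned keeps it.

```python
def distinct_on_skip_before_qual(rows, qual):
    """Problematic ordering: skip to the next distinct value of column 0
    right after fetching the first tuple of each group, regardless of
    whether the filter qual accepted it."""
    out, i = [], 0
    while i < len(rows):
        first = rows[i]
        if qual(first):
            out.append(first)
        # skip the whole group even when `first` failed the qual
        while i < len(rows) and rows[i][0] == first[0]:
            i += 1
    return out

def distinct_on_qual_before_skip(rows, qual):
    """Safe ordering: evaluate the qual first, and skip only once a
    qualifying tuple has been returned for the current group."""
    out, i = [], 0
    while i < len(rows):
        if qual(rows[i]):
            out.append(rows[i])
            cur = rows[i][0]
            while i < len(rows) and rows[i][0] == cur:
                i += 1
        else:
            i += 1
    return out
```

With rows [(1, 'x'), (1, 'y'), (2, 'y')] and the filter "second column contains 'y'", the first ordering drops the a=1 group entirely (its first tuple fails the filter), while the second correctly returns (1, 'y') and (2, 'y').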
Re: Index Skip Scan (new UniqueKeys)
On 01/12/2020 22:21, Dmitry Dolgov wrote:
>> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
>>
>> I had a quick look at this patch. I haven't been following this thread,
>> so sorry if I'm repeating old arguments, but here we go:
>
> Thanks!
>
>> - I'm surprised you need a new index AM function (amskip) for this.
>> Can't you just restart the scan with index_rescan()? The btree AM can
>> check if the new keys are on the same page, and optimize the rescan
>> accordingly, like amskip does. That would speed up e.g. nested loop
>> scans too, where the keys just happen to be clustered.
>
> An interesting point. At the moment I'm not sure whether it's possible
> to implement skipping via index_rescan or not, need to take a look. But
> checking if the new keys are on the same page would introduce some
> overhead I guess, wouldn't it be too invasive to add it into already
> existing btree AM?

I think it'll be OK. But if it's not, you could add a hint argument to
index_rescan() to hint the index AM that the new key is known to be
greater than the previous key.

>> - Does this optimization apply to bitmap index scans?
>
> No, from what I understand it doesn't.

Would it be hard to add? Don't need to solve everything in the first
version of this, but I think in principle you could do the same
optimization for bitmap index scans, so if the current API can't do it,
that's maybe an indication that the API isn't quite right.

>> - This logic in build_index_paths() is not correct:
>>
>> + /*
>> +  * Skip scan is not supported when there are qual conditions, which are not
>> +  * covered by index. The reason for that is that those conditions are
>> +  * evaluated later, already after skipping was applied.
>> +  *
>> +  * TODO: This implementation is too restrictive, and doesn't allow e.g.
>> +  * index expressions. For that we need to examine index_clauses too.
>> +  */
>> + if (root->parse->jointree != NULL)
>> + {
>> +     ListCell *lc;
>> +
>> +     foreach(lc, (List *) root->parse->jointree->quals)
>> +     {
>> +         Node *expr, *qual = (Node *) lfirst(lc);
>> +         Var *var;
>> +         bool found = false;
>> +
>> +         if (!is_opclause(qual))
>> +         {
>> +             not_empty_qual = true;
>> +             break;
>> +         }
>> +
>> +         expr = get_leftop(qual);
>> +
>> +         if (!IsA(expr, Var))
>> +         {
>> +             not_empty_qual = true;
>> +             break;
>> +         }
>> +
>> +         var = (Var *) expr;
>> +
>> +         for (int i = 0; i < index->ncolumns; i++)
>> +         {
>> +             if (index->indexkeys[i] == var->varattno)
>> +             {
>> +                 found = true;
>> +                 break;
>> +             }
>> +         }
>> +
>> +         if (!found)
>> +         {
>> +             not_empty_qual = true;
>> +             break;
>> +         }
>> +     }
>> + }
>>
>> If you care whether the qual is evaluated by the index AM or not, you
>> need to also check that the operator is indexable. Attached is a query
>> that demonstrates that problem.
>> ...
>> Also, you should probably check that the index quals are in the operator
>> family as that used for the DISTINCT.
>
> Yes, good point, will change this in the next version.
>
>> I'm actually a bit confused why we need this condition. The IndexScan
>> executor node should call amskip() only after checking the additional
>> quals, no?
>
> This part I don't quite get, what exactly you mean by checking the
> additional quals in the executor node? But at the end of the day this
> condition was implemented exactly to address the described issue, which
> was found later and added to the tests.

As I understand this, the executor logic goes like this:

query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a like 'a%' and b = 'b';

1. Call index_beginscan, keys: a >= 'a', b = 'b'
2. Call index_getnext, which returns first row to the Index Scan node
3. Evaluate the qual "c like '%y%'" on the tuple. If it's false, goto step 2 to get the next tuple.
4. Return tuple to parent node
5. index_amskip(), to the next tuple with a > 'a'. Goto 2.

The logic should work fine, even if there are quals that are not indexable,
like "c like '%y'" in the above example. So why doesn't it work? What am I
missing?
Re: Index Skip Scan (new UniqueKeys)
> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
>
> I had a quick look at this patch. I haven't been following this thread,
> so sorry if I'm repeating old arguments, but here we go:

Thanks!

> - I'm surprised you need a new index AM function (amskip) for this.
> Can't you just restart the scan with index_rescan()? The btree AM can
> check if the new keys are on the same page, and optimize the rescan
> accordingly, like amskip does. That would speed up e.g. nested loop
> scans too, where the keys just happen to be clustered.

An interesting point. At the moment I'm not sure whether it's possible
to implement skipping via index_rescan or not, need to take a look. But
checking if the new keys are on the same page would introduce some
overhead I guess, wouldn't it be too invasive to add it into already
existing btree AM?

> - Does this optimization apply to bitmap index scans?

No, from what I understand it doesn't.

> - This logic in build_index_paths() is not correct:
>
> > + /*
> > +  * Skip scan is not supported when there are qual conditions, which are not
> > +  * covered by index. The reason for that is that those conditions are
> > +  * evaluated later, already after skipping was applied.
> > +  *
> > +  * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > +  * index expressions. For that we need to examine index_clauses too.
> > +  */
> > + if (root->parse->jointree != NULL)
> > + {
> > +     ListCell *lc;
> > +
> > +     foreach(lc, (List *) root->parse->jointree->quals)
> > +     {
> > +         Node *expr, *qual = (Node *) lfirst(lc);
> > +         Var *var;
> > +         bool found = false;
> > +
> > +         if (!is_opclause(qual))
> > +         {
> > +             not_empty_qual = true;
> > +             break;
> > +         }
> > +
> > +         expr = get_leftop(qual);
> > +
> > +         if (!IsA(expr, Var))
> > +         {
> > +             not_empty_qual = true;
> > +             break;
> > +         }
> > +
> > +         var = (Var *) expr;
> > +
> > +         for (int i = 0; i < index->ncolumns; i++)
> > +         {
> > +             if (index->indexkeys[i] == var->varattno)
> > +             {
> > +                 found = true;
> > +                 break;
> > +             }
> > +         }
> > +
> > +         if (!found)
> > +         {
> > +             not_empty_qual = true;
> > +             break;
> > +         }
> > +     }
> > + }
>
> If you care whether the qual is evaluated by the index AM or not, you need
> to also check that the operator is indexable. Attached is a query that
> demonstrates that problem.
> ...
> Also, you should probably check that the index quals are in the operator
> family as that used for the DISTINCT.

Yes, good point, will change this in the next version.

> I'm actually a bit confused why we need this condition. The IndexScan
> executor node should call amskip() only after checking the additional quals,
> no?

This part I don't quite get, what exactly you mean by checking the
additional quals in the executor node? But at the end of the day this
condition was implemented exactly to address the described issue, which
was found later and added to the tests.
Re: Index Skip Scan (new UniqueKeys)
On 24/10/2020 19:45, Dmitry Dolgov wrote:
> Here is a new version which doesn't require "scanstart" argument and
> contains few other changes to address the issues mentioned earlier. It's
> also based on the latest UniqueKeys patches with the valgrind issue fixed
> (as before they're attached also just for the references, you can find
> more in the original thread). I didn't rework commentaries yet, will post
> it soon (need to get an inspiration first, probably via reading
> Shakespeare unless someone has better suggestions).

I had a quick look at this patch. I haven't been following this thread,
so sorry if I'm repeating old arguments, but here we go:

- I'm surprised you need a new index AM function (amskip) for this.
  Can't you just restart the scan with index_rescan()? The btree AM can
  check if the new keys are on the same page, and optimize the rescan
  accordingly, like amskip does. That would speed up e.g. nested loop
  scans too, where the keys just happen to be clustered.

- Does this optimization apply to bitmap index scans?

- This logic in build_index_paths() is not correct:

+ /*
+  * Skip scan is not supported when there are qual conditions, which are not
+  * covered by index. The reason for that is that those conditions are
+  * evaluated later, already after skipping was applied.
+  *
+  * TODO: This implementation is too restrictive, and doesn't allow e.g.
+  * index expressions. For that we need to examine index_clauses too.
+  */
+ if (root->parse->jointree != NULL)
+ {
+     ListCell *lc;
+
+     foreach(lc, (List *) root->parse->jointree->quals)
+     {
+         Node *expr, *qual = (Node *) lfirst(lc);
+         Var *var;
+         bool found = false;
+
+         if (!is_opclause(qual))
+         {
+             not_empty_qual = true;
+             break;
+         }
+
+         expr = get_leftop(qual);
+
+         if (!IsA(expr, Var))
+         {
+             not_empty_qual = true;
+             break;
+         }
+
+         var = (Var *) expr;
+
+         for (int i = 0; i < index->ncolumns; i++)
+         {
+             if (index->indexkeys[i] == var->varattno)
+             {
+                 found = true;
+                 break;
+             }
+         }
+
+         if (!found)
+         {
+             not_empty_qual = true;
+             break;
+         }
+     }
+ }

If you care whether the qual is evaluated by the index AM or not, you
need to also check that the operator is indexable. Attached is a query
that demonstrates that problem.

I'm actually a bit confused why we need this condition. The IndexScan
executor node should call amskip() only after checking the additional
quals, no?

Also, you should probably check that the index quals are in the operator
family as that used for the DISTINCT.

- Heikki

index-skipscan-bug.sql
Description: application/sql
Re: Index Skip Scan (new UniqueKeys)
> On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
>
> * I see the following compiler warning:
>
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> In function ‘populate_baserel_uniquekeys’:
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> warning: ‘expr’ may be used uninitialized in this function
> [-Wmaybe-uninitialized]
>   797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
>       |       ^~

This is mostly for the UniqueKeys patch, which is attached here only as a
dependency, but I'll prepare changes for that. Interestingly enough, I
can't reproduce this warning, but if I understand correctly gcc has some
history of spurious uninitialized warnings, so I guess it could be
version dependent.

> * Perhaps the warning is related to this nearby code that I noticed
> Valgrind complains about:
>
> ==1083468== VALGRINDERROR-BEGIN
> ==1083468== Invalid read of size 4
> ==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> ==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)

This also belongs to the UniqueKeys patch, but at least I can reproduce
this one. My guess is that nkeycolumns should be used there, not ncolumns,
which is visible in the index_including tests. Same as the previous one, I
will prepare corresponding changes.

> * Do we really need the AM-level boolean flag/argument named
> "scanstart"? Why not just follow the example of btgettuple(), which
> determines whether or not the scan has been initialized based on the
> current scan position?
>
> Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> cannot or should not take the same approach as btgettuple(). And even
> if you can't take exactly the same approach, I would still think that
> the scan's opaque B-Tree state should remember if it's the first call
> to _bt_skip() (rather than some subsequent call) in some other way
> (e.g. carrying a "scanstart" bool flag directly).
Yes, agree, carrying this flag inside the opaque state would be better.

> * Why is it okay to do anything important based on the
> _bt_scankey_within_page() return value?
>
> If the page is empty, then how can we know that it's okay to go to the
> next value? I'm concerned that there could be subtle bugs in this
> area. VACUUM will usually just delete the empty page. But it won't
> always do so, for a variety of reasons that aren't worth going into
> now. This could mask bugs in this area. I'm concerned about patterns
> like this one from _bt_skip():
>
>     while (!nextFound)
>     {
>         ...
>         if (_bt_scankey_within_page(scan, so->skipScanKey,
>                                     so->currPos.buf, dir))
>         {
>             ...
>         }
>         else
>             /*
>              * If startItup could be not found within the current page,
>              * assume we found something new
>              */
>             nextFound = true;
>         ...
>     }
>
> Why would you assume that "we found something new" here? In general I
> just don't understand the design of _bt_skip(). I get the basic idea
> of what you're trying to do, but it could really use better comments.

Yeah, I'll put more effort into clear comments. There are two different
ways in which _bt_scankey_within_page is being used. The first one is to
check if it's possible to skip traversal of the tree from the root, in
case what we're looking for could be on the current page. Here an empty
page would mean we need to search from the root, so I'm not sure what the
issue could be?

The second one (that you've highlighted above) I admit is probably the
most questionable part of the patch, and open for suggestions on how to
improve it. It's required for one particular case with a cursor, when the
scan advances forward but reads backward. What could happen here is that
we found one valid item, but the next one e.g. does not pass the scan key
conditions, and we end up with the previous item again. I'm not entirely
sure how the presence of an empty page could change this scenario, could
you please show an example?
> * The "jump one more time if it's the same as at the beginning" thing
> seems scary to me. Maybe you should be doing something with the actual
> high key here.

Same as for the previous question: can you give a hint about what you
mean by "doing something with the actual high key"?
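The first use of _bt_scankey_within_page described above — deciding whether a descent from the root can be skipped because the wanted key may still fall on the current leaf page — amounts to a bounds check against the page's first item and its high key. A rough sketch over plain comparable values (a toy model with invented names, not the btree code):

```python
def key_within_page(page_items, high_key, key):
    """Model of the 'can we stay on this leaf page?' test: the key must
    sort at or after the page's first item and strictly below the page's
    high key (the exclusive upper bound for keys stored on the page).
    An empty page can answer nothing, so the caller must fall back to a
    full descent from the root."""
    if not page_items:
        return False
    # The rightmost leaf page has no high key; only the lower bound applies.
    if high_key is not None and key >= high_key:
        return False
    return key >= page_items[0]
```

This also hints at why Peter's empty-page question matters: the function can only say "not answerable here" for an empty page, and any caller treating that as "the key is elsewhere, keep going" would be reading more into the return value than it carries.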
Re: Index Skip Scan (new UniqueKeys)
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan wrote:
> That's all I have for now.

One more thing. I don't think that this should be a bitwise AND:

    if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE))
    {
    }

-- 
Peter Geoghegan
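The practical difference is easy to show in miniature: C's `&` evaluates both operands unconditionally, while `&&` short-circuits. Python's `and` behaves like `&&` and its `&` like the bitwise operator, so it can stand in for the C semantics here (an illustration only, with invented helper names):

```python
calls = []

def offnum_exceeds_maxoff():
    # stand-in for "offnum > maxoff", deliberately False
    calls.append('left')
    return False

def next_page_is_none():
    # stand-in for "so->currPos.nextPage == P_NONE"
    calls.append('right')
    return True

# Bitwise &: both operands are always evaluated, even though the
# left one is already False.
_ = offnum_exceeds_maxoff() & next_page_is_none()
assert calls == ['left', 'right']

calls.clear()
# Logical and (C's &&): the right side is skipped when the left is False.
_ = offnum_exceeds_maxoff() and next_page_is_none()
assert calls == ['left']
```

In the quoted C snippet the result happens to be the same either way, but `&` both defeats short-circuit evaluation (a problem if the right operand is only valid to evaluate when the left holds) and signals the wrong intent to readers.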
Re: Index Skip Scan (new UniqueKeys)
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Here is a new version that hopefully address most of the concerns
> mentioned in this thread so far. As before, first two patches are taken
> from UniqueKeys thread and attached only for the reference. List of
> changes includes:

Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):

* I see the following compiler warning:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
      |       ^~

* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:

==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468==    by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468==    by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468==    by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468==    by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468==    by 0x5AB426: query_planner (planmain.c:269)
==1083468==    by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468==    by 0x5AD202: subquery_planner (planner.c:1015)
==1083468==    by 0x5ABABF: standard_planner (planner.c:405)
==1083468==    by 0x5AB7F8: planner (planner.c:275)
==1083468==    by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468==    by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468==    by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468==    by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==    by 0x624284: BackendRun (postmaster.c:4541)
==1083468==    by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==    by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==    by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==    by 0x514AF2: main (main.c:209)
==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size 8,192 alloc'd
==1083468==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468==    by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468==    by 0x8CEA52: palloc (mcxt.c:964)
==1083468==    by 0x267F25: systable_beginscan (genam.c:373)
==1083468==    by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468==    by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468==    by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468==    by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468==    by 0x8BD482: check_enable_rls (rls.c:66)
==1083468==    by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468==    by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468==    by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468==    by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468==    by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468==    by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468==    by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==    by 0x624284: BackendRun (postmaster.c:4541)
==1083468==    by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==    by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==    by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END

(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)

* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only scans.

* Do we really need the AM-level boolean flag/argument named "scanstart"?
Why not just follow the example of btgettuple(), which determines whether
or not the scan has been initialized based on the current scan position?

Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even if
you can't take exactly the same approach, I would still think that the
scan's opaque B-Tree state should remember if it's the first call to
_bt_skip() (rather than some subsequent call) in some other way (e.g.
carrying a "scanstart" bool flag directly).

A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of indentation...
which makes it even more difficult to follow.

* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data item
Re: Index Skip Scan (new UniqueKeys)
Hi David:

Thanks for looking into this.

On Fri, Jul 31, 2020 at 11:07 AM David Rowley wrote:
> On Mon, 13 Jul 2020 at 10:18, Floris Van Nee wrote:
> > One question about the unique keys - probably for Andy or David: I've
> > looked in the archives to find arguments for/against using Expr nodes
> > or EquivalenceClasses in the Unique Keys patch. However, I couldn't
> > really find a clear answer about why the current patch uses Expr
> > rather than EquivalenceClasses. At some point David mentioned "that
> > probably Expr nodes were needed rather than EquivalenceClasses", but
> > it's not really clear to me why. What were the thoughts behind this?
>
> I'm still not quite sure on this either way. I did think
> EquivalenceClasses were more suitable before I wrote the POC patch for
> unique keys. But after that, I had in mind that Exprs might be better.
> The reason I thought this was due to the fact that the DISTINCT clause
> list is a bunch of Exprs, and if the UniqueKeys were EquivalenceClasses
> then checking to see if the DISTINCT can be skipped turned into
> something more complex that required looking through lists of
> ec_members rather than just checking if the uniquekey exprs were a
> subset of the DISTINCT clause.
>
> Thinking about it a bit harder, if we did use Exprs then it would mean
> a case like the following wouldn't work for Andy's DISTINCT no-op stuff:
>
> CREATE TABLE xy (x int primary key, y int not null);
>
> SELECT DISTINCT y FROM xy WHERE x=y;
>
> whereas if we use EquivalenceClasses then we'll find that we have an
> EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
> containing that EquivalenceClass.
>
> Also, looking at what Andy wrote to make a case like the following
> work in his populate_baserel_uniquekeys() function in the 0002 patch:
>
> CREATE TABLE ab (a int, b int, primary key(a,b));
> SELECT DISTINCT a FROM ab WHERE b = 1;
>
> it's a bit uninspiring.
> Really what we want here when checking if we can skip doing the
> DISTINCT is a UniqueKey set using EquivalenceClasses, as we can just
> insist that any unmatched UniqueKey items have ec_is_const == true.
> However, that means we have to loop through the ec_members of the
> EquivalenceClasses in the uniquekeys during the DISTINCT check. That's
> particularly bad when you consider that in a partitioned table case
> there might be an ec_member for each child partition and there could be
> 1000s of child partitions, and following those ec_members chains is
> going to be too slow.
>
> My current thoughts are that we should be using EquivalenceClasses but
> we should first add some infrastructure to make them perform better.
> My current thoughts are that we do something like what I mentioned in
> [1] or something more like what Andres mentions in [2]. After that, we
> could either make EquivalenceClass.ec_members a hash table or binary
> search tree. Or even perhaps just have a single hash table/BST for all
> EquivalenceClasses that allows very fast lookups from {Expr} ->
> {EquivalenceClass}. I think an Expr can only belong in a single
> non-merged EquivalenceClass. So when we do merging of
> EquivalenceClasses we could just repoint that data structure to point
> to the new EquivalenceClass. We'd never point to ones that have
> ec_merged != NULL. This would also allow us to fix the poor performance
> in regards to get_eclass_for_sort_expr() for partitioned tables.
>
> So, it seems the patch dependency chain for skip scans just got a bit
> longer :-(

I admit that EquivalenceClasses have better expressive power. There are
two more cases we can handle better with EquivalenceClasses:

SELECT DISTINCT a, b, c FROM t WHERE a = b;

Currently the UniqueKey is (a, b, c), but it would better be (a, c) and
(b, c). A similar case happens for GROUP BY. After realizing this, I am
still hesitant to do that, due to the complexity.

If we do that, we may have to maintain EquivalenceClasses in one more
place, or make the existing EquivalenceClasses list longer, for example:

SELECT pk FROM t;

The current infrastructure doesn't create any EquivalenceClass for pk, so
we would have to create a new one in this case and reuse some existing
ones in other cases. Finally, since EquivalenceClasses are not directly
visible to the upper layers, we would depend on the infrastructure change
to look up an EquivalenceClass quickly from an Expr.

I thought more about the case you provided above. IIUC, there is such an
issue for joinrels, so we could just add an EC check to
populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we should
build the UniqueKeys from root->distinct_pathkeys and
root->group_pathkeys, where the EquivalenceClasses are already there.

I am still not insisting on either Expr or EquivalenceClasses right now;
if we need to change it to EquivalenceClasses, I'd check whether more
places need to be taken care of before doing that.

-- 
Best Regards
Andy Fan
Re: Index Skip Scan (new UniqueKeys)
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee wrote:
> One question about the unique keys - probably for Andy or David: I've
> looked in the archives to find arguments for/against using Expr nodes or
> EquivalenceClasses in the Unique Keys patch. However, I couldn't really
> find a clear answer about why the current patch uses Expr rather than
> EquivalenceClasses. At some point David mentioned "that probably Expr
> nodes were needed rather than EquivalenceClasses", but it's not really
> clear to me why. What were the thoughts behind this?

I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be better. The
reason I thought this was due to the fact that the DISTINCT clause list is
a bunch of Exprs, and if the UniqueKeys were EquivalenceClasses then
checking to see if the DISTINCT can be skipped turned into something more
complex that required looking through lists of ec_members rather than just
checking if the uniquekey exprs were a subset of the DISTINCT clause.

Thinking about it a bit harder, if we did use Exprs then it would mean a
case like the following wouldn't work for Andy's DISTINCT no-op stuff:

CREATE TABLE xy (x int primary key, y int not null);

SELECT DISTINCT y FROM xy WHERE x=y;

whereas if we use EquivalenceClasses then we'll find that we have an EC
with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.

Also, looking at what Andy wrote to make a case like the following work
in his populate_baserel_uniquekeys() function in the 0002 patch:

CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;

it's a bit uninspiring. Really what we want here when checking if we can
skip doing the DISTINCT is a UniqueKey set using EquivalenceClasses, as we
can just insist that any unmatched UniqueKey items have ec_is_const ==
true. However, that means we have to loop through the ec_members of the
EquivalenceClasses in the uniquekeys during the DISTINCT check. That's
particularly bad when you consider that in a partitioned table case there
might be an ec_member for each child partition and there could be 1000s of
child partitions, and following those ec_members chains is going to be too
slow.

My current thoughts are that we should be using EquivalenceClasses but we
should first add some infrastructure to make them perform better. My
current thoughts are that we do something like what I mentioned in [1] or
something more like what Andres mentions in [2]. After that, we could
either make EquivalenceClass.ec_members a hash table or binary search
tree. Or even perhaps just have a single hash table/BST for all
EquivalenceClasses that allows very fast lookups from {Expr} ->
{EquivalenceClass}. I think an Expr can only belong in a single non-merged
EquivalenceClass. So when we do merging of EquivalenceClasses we could
just repoint that data structure to point to the new EquivalenceClass.
We'd never point to ones that have ec_merged != NULL. This would also
allow us to fix the poor performance in regards to
get_eclass_for_sort_expr() for partitioned tables.

So, it seems the patch dependency chain for skip scans just got a bit
longer :-(

David

[1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c
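The trade-off David describes — a bare-Expr subset check is cheap but blind to x = y equivalences, while EquivalenceClasses catch them at the cost of walking ec_members — can be sketched with a minimal union-find standing in for the planner's EquivalenceClass machinery. This is a hypothetical toy over expression names, not the PostgreSQL data structures:

```python
class EquivalenceClasses:
    """Union-find over expression names: merge() records a qual like
    "x = y", find() returns a canonical representative per class."""
    def __init__(self):
        self.parent = {}

    def find(self, e):
        self.parent.setdefault(e, e)
        while self.parent[e] != e:
            # path halving keeps the trees shallow
            self.parent[e] = self.parent[self.parent[e]]
            e = self.parent[e]
        return e

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def distinct_is_noop(unique_keys, distinct_exprs, ecs):
    """DISTINCT is redundant if some unique key, mapped through the
    equivalence classes, is a subset of the DISTINCT expressions."""
    d = {ecs.find(e) for e in distinct_exprs}
    return any({ecs.find(e) for e in key} <= d for key in unique_keys)
```

With the qual x = y merged, the unique key {x} from the primary key covers SELECT DISTINCT y (David's xy example); without the merge — which is what a plain Expr comparison amounts to — the same subset test fails.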
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote: > > > Well, it's obviously wrong, thanks for noticing. What is necessary is to > > compare two index tuples, the start and the next one, to test if they're > > the same (in which case if I'm not mistaken probably we can compare item > > pointers). I've got this question when I was about to post a new version > > with changes to address feedback from Andy, now I'll combine them and > > send a cumulative patch. > > This sounds like approximately the same problem as the one that > _bt_killitems() has to deal with as of Postgres 13. This is handled in > a way that is admittedly pretty tricky, even though the code does not > need to be 100% certain that it's "the same" tuple. Deduplication kind > of makes that a fuzzy concept. In principle there could be one big > index tuple instead of 5 tuples, even though the logical contents of > the page have not been changed between the time we recording heap TIDs > in local and the time _bt_killitems() tried to match on those heap > TIDs to kill_prior_tuple-kill some index tuples -- a concurrent > deduplication pass could do that. Your code needs to be prepared for > stuff like that. > > Ultimately posting list tuples are just a matter of understanding the > on-disk representation -- a "Small Matter of Programming". Even > without deduplication there are potential hazards from the physical > deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is > not code that runs in VACUUM, despite the name). Make sure that you > hold a buffer pin on the leaf page throughout, because you need to do > that to make sure that VACUUM cannot concurrently recycle heap TIDs. > If VACUUM *is* able to concurrently recycle heap TIDs then it'll be > subtly broken. _bt_killitems() is safe because it either holds on to a > pin or gives up when the LSN changes at all. 
(ISTM that your only > choice is to hold on to a leaf page pin, since you cannot just decide > to give up in the way that _bt_killitems() sometimes can.) I see, thanks for clarification. You're right, in this part of implementation there is no way to give up if LSN changes like _bt_killitems does. As far as I can see the leaf page is already pinned all the time between reading relevant tuples and comparing them, I only need to handle posting list tuples.
RE: Index Skip Scan (new UniqueKeys)
> One UniqueKey can have multiple corresponding expressions, which gives
> us also the possibility of having one unique key with (t1.a, t2.a), and
> it looks now similar to EquivalenceClass.

I believe the current definition of a unique key with two expressions
(t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives
weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

> The idea behind this query sounds questionable to me, more transparent
> would be to do this without distinct, skipping will actually do exactly
> the same stuff just under another name. But if allowing skipping on
> constants does not bring significant changes in the code probably it's
> fine.

Yeah indeed, I didn't say it's a query that people should generally
write. :-) It's better to write as a regular SELECT with LIMIT 1 of
course. However, it's more to be consistent and predictable to the user:
if a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make
sense to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs
slow. And to support it also makes the implementation more consistent
with little code changes.

> Yeah, there's definitely some double work there, but the actual impact
> may be limited - it doesn't actually allocate a new path key, but it
> looks it up in root->canon_pathkeys and returns that path key.
>
> I wrote it like this, because I couldn't find a way to identify from a
> certain PathKey the actual location in the index of that column. The
> constructed path keys list filters out all redundant path keys. An index
> on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we
> actually need to use prefix=3. But how to get from PathKey=b to that
> number 3, I didn't find a solid way except doing this. Maybe there is
> though?
> > I don't think there is a direct way, but why not modify build_index_paths to > also provide this information, or compare index_pathkeys expressions with > indextlist without actually create those pathkeys again? > I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later. > And couple of words about this thread [1]. It looks to me like a strange way > of interacting with the community. Are you going to duplicate there > everything, or what are your plans? At the very least you could try to include > everyone involved in the recipients list, not exclude some of the authors. > When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone. In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches. [1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 14, 2020 at 06:18:50PM +, Floris Van Nee wrote: > > Due to the other changes I made in > create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan > now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to > give an actual failing test case now. However, since all code filters out > constant keys, I think uniqueness should do it too - otherwise you could get > into problems later on. It's also more consistent. If you already know > something is unique by just (b), it doesn't make sense to store that it's > unique by (a,b). Now that I think of it, the best place to do this > EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, > rather than build_uniquekeys though. It's probably good to move it there. That would be my suggestion as well. > > Along the lines I'm also curious about this part: > > > > - ListCell *k; > > - List *exprs = NIL; > > - > > - foreach(k, ec->ec_members) > > - { > > - EquivalenceMember *mem = (EquivalenceMember *) > > lfirst(k); > > - exprs = lappend(exprs, mem->em_expr); > > - } > > - > > - result = lappend(result, makeUniqueKey(exprs, false, false)); > > + EquivalenceMember *mem = (EquivalenceMember*) > > +lfirst(list_head(ec->ec_members)); > > > > I'm curious about this myself, maybe someone can clarify. It looks like > > generaly speaking there could be more than one member (if not > > ec_has_volatile), which "representing knowledge that multiple items are > > effectively equal". Is this information is not interesting enough to > > preserve it > > in unique keys? > > Yeah, that's a good question. Hence my question about the choice for Expr > rather than EquivalenceClass for the Unique Keys patch to Andy/David. When > storing just Expr, it is rather difficult to check equivalence in joins for > example. Suppose, later on we decide to add join support to the distinct skip > scan. 
Consider a query like this: > SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a > As far as my understanding goes (I didn't look into it in detail though), I > think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That > results in a UniqueKey with expr (t1.a) (because currently we only take the > first Expr in the list to construct the UniqueKey). We could also construct > *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't > think that's the correct approach either, as it will explode when you have > multiple pathkeys, each having multiple Expr inside their EqClasses. One UniqueKey can have multiple corresponding expressions, which gives us also possibility of having one unique key with (t1.a, t2.a) and it looks now similar to EquivalenceClass. > > > - the distinct_pathkeys may be NULL, even though there's a possibility for > > skipping. But it wouldn't create the uniquekeys in this case. This makes the > > planner not choose skip scans even though it could. For example in queries > > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a > > is constant, it's eliminated from regular pathkeys. > > > > What would be the point of skipping if it's a constant? > > For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; > There may be 1000s of records with a=1. We're only interested in the first > one though. The traditional non-skip approach would still scan all records > with a=1. Skip would just fetch the first one with a=1 and then skip to the > next prefix and stop the scan. The idea behind this query sounds questionable to me, more transparent would be to do this without distinct, skipping will actually do exactly the same stuff just under another name. But if allowing skipping on constants do not bring significant changes in the code probably it's fine. 
> > > - to combat the issues mentioned earlier, there's now a check in > > build_index_paths that checks if the query_pathkeys matches the > > useful_pathkeys. Note that we have to use the path keys here rather than > > any of the unique keys. The unique keys are only Expr nodes - they do not > > contain the necessary information about ordering. Due to elimination of > > some constant path keys, we have to search the attributes of the index to > > find the correct prefix to use in skipping. > > > > IIUC here you mean this function, right? > > > > + prefix = find_index_prefix_for_pathkey(root, > > + > > index, > > + > > BackwardScanDirection, > > + > > llast_node(PathKey, > > + > > root->distinct_pathkeys)); > > > > Doesn't it duplicate the job already done in build_index_pathkeys by > > building > > those pathkeys again? If yes, probably it's possible to reuse > > useful_pathkeys. > > Not sure about unordered indexes, but looks like query_pathkeys should > > also match in this case. > > > > Yeah, there's definitely some double work there, but the actual impact may be > limited - it doesn't actually allocate a new path key, but it looks
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > + currItem = &so->currPos.items[so->currPos.lastItem]; > > > + itup = (IndexTuple) (so->currTuples + > > > currItem->tupleOffset); > > > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid); > > Do you mean this last part with t_tid, which could also have a tid array > in case of posting tuple format? Yeah. There is a TID array at the end of the tuple when the tuple is a posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't safe to assume that t_tid is a heap TID for this reason, even in code that only ever considers data items (that is, non high-key tuples AKA non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot be posting list tuples either, so you'll see the assumption that t_tid is just a heap TID in parts of nbtinsert.c -- though only for the new/incoming item.) > Well, it's obviously wrong, thanks for noticing. What is necessary is to > compare two index tuples, the start and the next one, to test if they're > the same (in which case if I'm not mistaken probably we can compare item > pointers). I've got this question when I was about to post a new version > with changes to address feedback from Andy, now I'll combine them and > send a cumulative patch. This sounds like approximately the same problem as the one that _bt_killitems() has to deal with as of Postgres 13. This is handled in a way that is admittedly pretty tricky, even though the code does not need to be 100% certain that it's "the same" tuple. Deduplication kind of makes that a fuzzy concept. In principle there could be one big index tuple instead of 5 tuples, even though the logical contents of the page have not been changed between the time we recorded heap TIDs in local memory and the time _bt_killitems() tried to match on those heap TIDs to kill_prior_tuple-kill some index tuples -- a concurrent deduplication pass could do that. Your code needs to be prepared for stuff like that.
Ultimately posting list tuples are just a matter of understanding the on-disk representation -- a "Small Matter of Programming". Even without deduplication there are potential hazards from the physical deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not code that runs in VACUUM, despite the name). Make sure that you hold a buffer pin on the leaf page throughout, because you need to do that to make sure that VACUUM cannot concurrently recycle heap TIDs. If VACUUM *is* able to concurrently recycle heap TIDs then it'll be subtly broken. _bt_killitems() is safe because it either holds on to a pin or gives up when the LSN changes at all. (ISTM that your only choice is to hold on to a leaf page pin, since you cannot just decide to give up in the way that _bt_killitems() sometimes can.) Note that the rules surrounding buffer locks/pins for nbtree were tightened up a tiny bit today -- see commit 4a70f829. Also, it's no longer okay to use raw LockBuffer() calls in nbtree, so you're going to have to fix that up when you next rebase -- you must use the new _bt_lockbuf() wrapper function instead, so that the new Valgrind instrumentation is used. This shouldn't be hard. Perhaps you can use Valgrind to verify that this patch doesn't have any unsafe buffer accesses. I recall problems like that in earlier versions of the patch series. Valgrind should be able to detect most bugs like that (though see comments within _bt_lockbuf() for details of a bug in this area that Valgrind cannot detect even now). -- Peter Geoghegan
RE: Index Skip Scan (new UniqueKeys)
> > > - the uniquekeys that is built, still contains some redundant keys, that are > normally eliminated from the path keys lists. > > I guess you're talking about: > > + if (EC_MUST_BE_REDUNDANT(ec)) > + continue; > > Can you add some test cases to your changes to show the effect of it? It > seem to me redundant keys are already eliminated at this point by either > make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be > I've missed something. > The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like: SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c All the regular path keys filter out 'a' for constantness here - so they would end up with a distinct pathkeys of (b) and sort path keys of (b,c). The unique keys would end up with (a,b) though. In the previous patch, it'd compared in create_distinct_paths, the pathkeys to the unique keys, so it wouldn't consider the skip scan. Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there. 
> Along the lines I'm also curious about this part: > > - ListCell *k; > - List *exprs = NIL; > - > - foreach(k, ec->ec_members) > - { > - EquivalenceMember *mem = (EquivalenceMember *) > lfirst(k); > - exprs = lappend(exprs, mem->em_expr); > - } > - > - result = lappend(result, makeUniqueKey(exprs, false, false)); > + EquivalenceMember *mem = (EquivalenceMember*) > +lfirst(list_head(ec->ec_members)); > > I'm curious about this myself, maybe someone can clarify. It looks like > generaly speaking there could be more than one member (if not > ec_has_volatile), which "representing knowledge that multiple items are > effectively equal". Is this information is not interesting enough to preserve > it > in unique keys? Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this: SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses. That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions). > > > - the distinct_pathkeys may be NULL, even though there's a possibility for > skipping. But it wouldn't create the uniquekeys in this case. 
This makes the > planner not choose skip scans even though it could. For example in queries > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a > is constant, it's eliminated from regular pathkeys. > > What would be the point of skipping if it's a constant? For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan. > > > - to combat the issues mentioned earlier, there's now a check in > build_index_paths that checks if the query_pathkeys matches the > useful_pathkeys. Note that we have to use the path keys here rather than > any of the unique keys. The unique keys are only Expr nodes - they do not > contain the necessary information about ordering. Due to elimination of > some constant path keys, we have to search the attributes of the index to > find the correct prefix to use in skipping. > > IIUC here you mean this function, right? > > + prefix = find_index_prefix_for_pathkey(root,
Re: Index Skip Scan (new UniqueKeys)
> On Sun, Jul 12, 2020 at 12:48:47PM +, Floris Van Nee wrote: > > > > Good point, thanks for looking at this. With the latest planner version > > there > > are indeed more possibilities to use skipping. It never occured to me that > > some of those paths will still rely on index scan returning full data set. > > I'll look > > in details and add verification to prevent putting something like this on > > top of > > skip scan in the next version. > > I believe the required changes are something like in attached patch. There > were a few things I've changed: > - build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, > it would create two unique keys, one with a and one with b. However, it > should be one unique key with (a,b). Yes, I've also noticed that while preparing fix for index scan not covered by index and included it. > - the uniquekeys that is built, still contains some redundant keys, that are > normally eliminated from the path keys lists. I guess you're talking about: + if (EC_MUST_BE_REDUNDANT(ec)) + continue; Can you add some test cases to your changes to show the effect of it? It seem to me redundant keys are already eliminated at this point by either make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be I've missed something. Along the lines I'm also curious about this part: - ListCell *k; - List *exprs = NIL; - - foreach(k, ec->ec_members) - { - EquivalenceMember *mem = (EquivalenceMember *) lfirst(k); - exprs = lappend(exprs, mem->em_expr); - } - - result = lappend(result, makeUniqueKey(exprs, false, false)); + EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members)); I'm curious about this myself, maybe someone can clarify. It looks like generaly speaking there could be more than one member (if not ec_has_volatile), which "representing knowledge that multiple items are effectively equal". Is this information is not interesting enough to preserve it in unique keys? 
> - the distinct_pathkeys may be NULL, even though there's a possibility for > skipping. But it wouldn't create the uniquekeys in this case. This makes the > planner not choose skip scans even though it could. For example in queries > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is > constant, it's eliminated from regular pathkeys. What would be the point of skipping if it's a constant? > - to combat the issues mentioned earlier, there's now a check in > build_index_paths that checks if the query_pathkeys matches the > useful_pathkeys. Note that we have to use the path keys here rather than any > of the unique keys. The unique keys are only Expr nodes - they do not contain > the necessary information about ordering. Due to elimination of some constant > path keys, we have to search the attributes of the index to find the correct > prefix to use in skipping. IIUC here you mean this function, right? + prefix = find_index_prefix_for_pathkey(root, + index, + BackwardScanDirection, + llast_node(PathKey, + root->distinct_pathkeys)); Doesn't it duplicate the job already done in build_index_pathkeys by building those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys. Not sure about unordered indexes, but looks like query_pathkeys should also match in this case. Will also look at the follow up questions in the next email.
Re: Index Skip Scan (new UniqueKeys)
> On Fri, Jul 10, 2020 at 05:03:37PM +, Floris Van Nee wrote: > > Also took another look at the patch now, and found a case of incorrect > data. It looks related to the new way of creating the paths, as I > can't recall seeing this in earlier versions. > > create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, > 10) a, generate_series(1,100) b; > create index on t1 (a,b,c); > > postgres=# explain select distinct on (a) * from t1 order by a,b desc,c; > QUERY PLAN > --- > Sort (cost=2.92..2.94 rows=10 width=20) >Sort Key: a, b DESC, c >-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20) > Skip scan: true > (4 rows) Good point, thanks for looking at this. With the latest planner version there are indeed more possibilities to use skipping. It never occurred to me that some of those paths will still rely on index scan returning the full data set. I'll look in detail and add verification to prevent putting something like this on top of skip scan in the next version.
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote: > > On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > * Btree-implementation contains btree specific code to implement amskip, > > introduced in the previous patch. > > The way that you're dealing with B-Tree tuples here needs to account > for posting list tuples: > > > + currItem = &so->currPos.items[so->currPos.lastItem]; > > + itup = (IndexTuple) (so->currTuples + > > currItem->tupleOffset); > > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid); Do you mean this last part with t_tid, which could also have a tid array in case of posting tuple format? > > + /* > > +* To check if we returned the same tuple, try to find a > > +* startItup on the current page. For that we need to update > > +* scankey to match the whole tuple and set nextkey to > > return > > +* an exact tuple, not the next one. If the nextOffset is > > the > > +* same as before, it means we are in the loop, return > > offnum > > +* to the original position and jump further > > +*/ > > Why does it make sense to use the offset number like this? It isn't > stable or reliable. The patch goes on to do this: > > > + startOffset = _bt_binsrch(scan->indexRelation, > > + so->skipScanKey, > > + so->currPos.buf); > > + > > + page = BufferGetPage(so->currPos.buf); > > + maxoff = PageGetMaxOffsetNumber(page); > > + > > + if (nextOffset <= startOffset) > > + { > > Why compare a heap TID's offset number (an offset number for a heap > page) to another offset number for a B-Tree leaf page? They're > fundamentally different things. Well, it's obviously wrong, thanks for noticing. What is necessary is to compare two index tuples, the start and the next one, to test if they're the same (in which case if I'm not mistaken probably we can compare item pointers). I've got this question when I was about to post a new version with changes to address feedback from Andy, now I'll combine them and send a cumulative patch.
RE: Index Skip Scan (new UniqueKeys)
Hi Dmitry, Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions. create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b; create index on t1 (a,b,c); postgres=# explain select distinct on (a) * from t1 order by a,b desc,c; QUERY PLAN --- Sort (cost=2.92..2.94 rows=10 width=20) Sort Key: a, b DESC, c -> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20) Skip scan: true (4 rows) With the 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With index_skipscan enabled, it always gives 1 though. It's not correct that the planner chooses a skip scan followed by sort in this case. -Floris
Re: Index Skip Scan (new UniqueKeys)
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > * Btree-implementation contains btree specific code to implement amskip, > introduced in the previous patch. The way that you're dealing with B-Tree tuples here needs to account for posting list tuples: > + currItem = &so->currPos.items[so->currPos.lastItem]; > + itup = (IndexTuple) (so->currTuples + currItem->tupleOffset); > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid); But I wonder more generally what the idea here is. The following comments that immediately follow provide some hints: > + /* > +* To check if we returned the same tuple, try to find a > +* startItup on the current page. For that we need to update > +* scankey to match the whole tuple and set nextkey to return > +* an exact tuple, not the next one. If the nextOffset is the > +* same as before, it means we are in the loop, return offnum > +* to the original position and jump further > +*/ Why does it make sense to use the offset number like this? It isn't stable or reliable. The patch goes on to do this: > + startOffset = _bt_binsrch(scan->indexRelation, > + so->skipScanKey, > + so->currPos.buf); > + > + page = BufferGetPage(so->currPos.buf); > + maxoff = PageGetMaxOffsetNumber(page); > + > + if (nextOffset <= startOffset) > + { Why compare a heap TID's offset number (an offset number for a heap page) to another offset number for a B-Tree leaf page? They're fundamentally different things. -- Peter Geoghegan
Re: Index Skip Scan (new UniqueKeys)
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote: > > I just get the rough idea of patch, looks we have to narrow down the > user cases where we can use this method. Consider the below example: Hi. Not exactly narrow down, but rather get rid of wrong usage of skipping for index scan. Since skipping for it was added later than for index only scan I can imagine there are still blind spots, so good that you've looked. In this particular case, when the index expressions do not fully cover the expressions the result needs to be distinct on, skipping just doesn't have enough information and should not be used. I'll add it to the next version, thanks!
Re: Index Skip Scan (new UniqueKeys)
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Hi, > > Here is a new version of index skip scan patch, based on v8 patch for > UniqueKeys implementation from [1]. I want to start a new thread to > simplify navigation, hopefully I didn't forget anyone who actively > participated in the discussion. > > To simplify reviewing I've split it into several parts: > > * First two are taken from [1] just for the reference and to make cfbot > happy. > > * Extend-UniqueKeys consists of changes that need to be done for > UniqueKeys to be used in skip scan. Essentially this is a reduced > version of the previous implementation from Jesper & David, based on the > new UniqueKeys infrastructure, as it does the very same thing. > > * Index-Skip-Scan contains non-AM-specific code and the overall > infrastructure, including configuration elements and so on. > > * Btree-implementation contains btree specific code to implement amskip, > introduced in the previous patch. > > * The last one contains just documentation bits. > > Interestingly enough, with the new UniqueKey implementation skipping is > applied in some tests unexpectedly. For those I've disabled > indexskipscan to avoid confusion. > > [1]: > https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com > Thanks for the patch. I just get the rough idea of patch, looks we have to narrow down the user cases where we can use this method. 
Consider the below example:

create table j1(i int, im5 int, im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 1000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;

demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     1 |      1
(1 row)

demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     0 |      0
 1 |   1 |     1 |      1
(2 rows)

drop index j1_im5_i_idx;
create index on j1(im5, i, im100);

demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
   1 | 1 |     1
(2 rows)

demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
(1 row)

-- Best Regards Andy Fan
Re: Index Skip Scan
On Tue, Jun 2, 2020 at 9:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote: > > > > > Other than that to summarize current open points for future readers > > > (this thread somehow became quite big): > > > > > > * Making UniqueKeys usage more generic to allow using skip scan for > more > > > use cases (hopefully it was covered by the v33, but I still need a > > > confirmation from David, like blinking twice or something). > > > > > > * Suspicious performance difference between different type of workload, > > > mentioned by Tomas (unfortunately I had no chance yet to > investigate). > > > > > > * Thinking about supporting conditions, that are not covered by the > index, > > > to make skipping more flexible (one of the potential next steps in > the > > > future, as suggested by Floris). > > > > > > > Looks this is the latest patch, which commit it is based on? Thanks > > I have a rebased version, if you're about it. Didn't posted it yet > mostly since I'm in the middle of adapting it to the UniqueKeys from > other thread. Would it be ok for you to wait a bit until I'll post > finished version? > Sure, that's OK. The discussion on UniqueKey thread looks more complex than what I expected, that's why I want to check the code here, but that's fine, you can work on your schedule. -- Best Regards Andy Fan
Re: Index Skip Scan
> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote: > > > Other than that to summarize current open points for future readers > > (this thread somehow became quite big): > > > > * Making UniqueKeys usage more generic to allow using skip scan for more > > use cases (hopefully it was covered by the v33, but I still need a > > confirmation from David, like blinking twice or something). > > > > * Suspicious performance difference between different type of workload, > > mentioned by Tomas (unfortunately I had no chance yet to investigate). > > > > * Thinking about supporting conditions, that are not covered by the index, > > to make skipping more flexible (one of the potential next steps in the > > future, as suggested by Floris). > > > > Looks this is the latest patch, which commit it is based on? Thanks I have a rebased version, if you're about it. Didn't posted it yet mostly since I'm in the middle of adapting it to the UniqueKeys from other thread. Would it be ok for you to wait a bit until I'll post finished version?
Re: Index Skip Scan
On Wed, Apr 8, 2020 at 3:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Mon, Apr 06, 2020 at 06:31:08PM +, Floris Van Nee wrote: > > > > > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, > Floris, I > > > would appreciate if in the future you can make it more visible that > changes you > > > suggest contain some fixes. E.g. it wasn't clear for me from your > previous email > > > that that's the case, and it doesn't make sense to pull into different > direction > > > when we're trying to achieve the same goal :) > > > > I wasn't aware that this particular case could be triggered before I saw > Dilip's email, otherwise I'd have mentioned it here of course. It's just > that because my patch handles filter conditions in general, it works for > this case too. > > Oh, then fortunately I've got a wrong impression, sorry and thanks for > clarification :) > > > > > > In the patch I posted a week ago these cases are all handled > > > > > correctly, as it introduces this extra logic in the Executor. > > > > > > > > Okay, So I think we can merge those fixes in Dmitry's patch set. > > > > > > I'll definitely take a look at suggested changes in filtering part. > > > > It may be possible to just merge the filtering part into your patch, but > I'm not entirely sure. Basically you have to pull the information about > skipping one level up, out of the node, into the generic IndexNext code. > > I was actually thinking more about just preventing skip scan in this > situation, which is if I'm not mistaken could be solved by inspecting > qual conditions to figure out if they're covered in the index - > something like in attachments (this implementation is actually too > restrictive in this sense and will not allow e.g. expressions, that's > why I haven't bumped patch set version for it - soon I'll post an > extended version). 
> > Other than that to summarize current open points for future readers > (this thread somehow became quite big): > > * Making UniqueKeys usage more generic to allow using skip scan for more > use cases (hopefully it was covered by the v33, but I still need a > confirmation from David, like blinking twice or something). > > * Suspicious performance difference between different type of workload, > mentioned by Tomas (unfortunately I had no chance yet to investigate). > > * Thinking about supporting conditions, that are not covered by the index, > to make skipping more flexible (one of the potential next steps in the > future, as suggested by Floris). > Looks this is the latest patch, which commit it is based on? Thanks -- Best Regards Andy Fan
Re: Index Skip Scan
On Fri, 15 May 2020 at 6:06 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote: > > > > + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, > dir)) > > + { > > > > Here we expect whether the "next" unique key can be found on this page > > or not, but we are using the function which suggested whether the > > "current" key can be found on this page or not. I think in boundary > > cases where the high key is equal to the current key, this function > > will return true (which is expected from this function), and based on > > that we will simply scan the current page and IMHO that cost could be > > avoided no? > > Yes, looks like you're right, there is indeed an unecessary extra scan > happening. To avoid that we can see the key->nextkey and adjust higher > boundary correspondingly. Will also add this into the next rebased > patch, thanks! Great thanks! > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote:
>
> + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> + {
>
> Here we expect whether the "next" unique key can be found on this page
> or not, but we are using the function which suggested whether the
> "current" key can be found on this page or not. I think in boundary
> cases where the high key is equal to the current key, this function
> will return true (which is expected from this function), and based on
> that we will simply scan the current page and IMHO that cost could be
> avoided no?

Yes, it looks like you're right: there is indeed an unnecessary extra scan happening. To avoid that we can look at key->nextkey and adjust the upper boundary accordingly. I will also add this to the next rebased patch, thanks!
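For intuition, the boundary case above can be sketched in a self-contained toy model (plain C, with a sorted int array standing in for a leaf page; the function names and the nextkey-style variant are illustrative assumptions, not the patch's actual code). The inclusive upper-bound check, mirroring `_bt_compare(..., high) < 1`, reports "within page" even when the skip key equals the page's last item and so triggers the rescan Dilip describes, while a strict bound models the nextkey adjustment:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for a leaf page: a sorted array of int keys.
 * within_page_inclusive mirrors the patch's check
 *     _bt_compare(key, low) > 0 && _bt_compare(key, high) < 1
 * while within_page_nextkey models the proposed adjustment: when we look
 * for the first key strictly greater than the current one, a page whose
 * last item merely equals the key can be skipped.
 */
static bool
within_page_inclusive(const int *page, int nitems, int key)
{
	if (nitems <= 0)
		return false;
	return key > page[0] && key <= page[nitems - 1];
}

static bool
within_page_nextkey(const int *page, int nitems, int key)
{
	if (nitems <= 0)
		return false;
	return key > page[0] && key < page[nitems - 1];
}

/* demo page used in the checks below (hypothetical data) */
static const int demo_page[] = {10, 20, 30, 40, 50};

static bool
demo_inclusive(int key)
{
	return within_page_inclusive(demo_page, 5, key);
}

static bool
demo_nextkey(int key)
{
	return within_page_nextkey(demo_page, 5, key);
}
```

A key strictly inside the page is accepted by both variants; only the nextkey variant rejects a key equal to the page's last item, which is exactly the case where the extra scan buys nothing.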
Re: Index Skip Scan
On Mon, Jan 20, 2020 at 5:05 PM Peter Geoghegan wrote:
> You can add another assertion that calls a new utility function in
> bufmgr.c. That can use the same logic as this existing assertion in
> FlushOneBuffer():
>
> Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
>
> We haven't needed assertions like this so far because it is usually
> clear whether or not a buffer lock is held (plus the bufmgr.c
> assertions help on their own).

Just in case anybody missed it, I am working on a patch that makes nbtree use Valgrind instrumentation to detect pages accessed without a buffer content lock held:

https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpwrub_45eexlh1+k...@mail.gmail.com

There is also one component that detects when any buffer is accessed without a buffer pin.

-- 
Peter Geoghegan
Re: Index Skip Scan
On Mon, May 11, 2020 at 4:55 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote: > > > > > > +static inline bool > > > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > > > + Buffer buf, ScanDirection dir) > > > > +{ > > > > + OffsetNumber low, high; > > > > + Page page = BufferGetPage(buf); > > > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > > > + > > > > + low = P_FIRSTDATAKEY(opaque); > > > > + high = PageGetMaxOffsetNumber(page); > > > > + > > > > + if (unlikely(high < low)) > > > > + return false; > > > > + > > > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > > > +} > > > > > > > > I think the high key condition should be changed to > > > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > > > prefix qual is equal to the high key then also there is no point in > > > > searching on the current page so we can directly skip. > > > > > > From nbtree/README and comments to functions like _bt_split I've got an > > > impression that the high key could be equal to the last item on the leaf > > > page, so there is a point in searching. Is that incorrect? > > > > But IIUC, here we want to decide whether we will get the next key in > > the current page or not? > > In general this function does what it says, it checks wether or not the > provided scankey could be found within the page. All the logic about > finding a proper next key to fetch is implemented on the call site, and > within this function we want to test whatever was passed in. Does it > answer the question? Ok, I agree that the function is doing what it is expected to do. But, then I have a problem with this call site. + /* Check if the next unique key can be found within the current page. 
+ * Since we do not lock the current page between jumps, it's possible
+ * that it was split since the last time we saw it. This is fine when
+ * scanning forward, since pages split to the right and we are still on
+ * the leftmost page. When scanning backwards it's possible to lose some
+ * pages, and we would need to remember the previous page and then follow
+ * the right link from the current page until we find the original one.
+ *
+ * Since the whole idea of checking the current page is to protect
+ * ourselves and to make the statistics-mismatch case (when there are too
+ * many distinct values for jumping to pay off) more performant, it's not
+ * clear whether the complexity of this solution is justified for
+ * backward scans, so for now just avoid it.
+ */
+ if (BufferIsValid(so->currPos.buf) && ScanDirectionIsForward(dir))
+ {
+ LockBuffer(so->currPos.buf, BT_READ);
+
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {

Here we want to know whether the "next" unique key can be found on this page, but we are using a function that tells us whether the "current" key can be found on it. I think in boundary cases where the high key is equal to the current key, this function will return true (which is expected from this function), and based on that we will simply scan the current page, and IMHO that cost could be avoided, no?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
> >
> > > +static inline bool
> > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > + Buffer buf, ScanDirection dir)
> > > +{
> > > + OffsetNumber low, high;
> > > + Page page = BufferGetPage(buf);
> > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > +
> > > + low = P_FIRSTDATAKEY(opaque);
> > > + high = PageGetMaxOffsetNumber(page);
> > > +
> > > + if (unlikely(high < low))
> > > + return false;
> > > +
> > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > +}
> > >
> > > I think the high key condition should be changed to
> > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> > > prefix qual is equal to the high key then also there is no point in
> > > searching on the current page so we can directly skip.
> >
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?
>
> But IIUC, here we want to decide whether we will get the next key in
> the current page or not?

In general this function does what it says: it checks whether or not the provided scankey could be found within the page. All the logic about finding a proper next key to fetch is implemented at the call site, and within this function we want to test whatever was passed in. Does that answer the question?
Re: Index Skip Scan
On Sun, May 10, 2020 at 11:17 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote: > > > > Some more comments... > > Thanks for reviewing. Since this patch took much longer than I expected, > it's useful to have someone to look at it with a "fresh eyes". > > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + _bt_update_skip_scankeys(scan, indexRel); > > + > > ... > > + /* > > + * We haven't found scan key within the current page, so let's scan from > > + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset > > + * number > > + */ > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + stack = _bt_search(scan->indexRelation, so->skipScanKey, > > +, BT_READ, scan->xs_snapshot); > > > > Why do we need to set so->skipScanKey->nextkey = > > ScanDirectionIsForward(dir); multiple times? I think it is fine to > > just set it once? > > I believe it was necessary for previous implementations, but in the > current version we can avoid this, you're right. > > > +static inline bool > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > + Buffer buf, ScanDirection dir) > > +{ > > + OffsetNumber low, high; > > + Page page = BufferGetPage(buf); > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > + > > + low = P_FIRSTDATAKEY(opaque); > > + high = PageGetMaxOffsetNumber(page); > > + > > + if (unlikely(high < low)) > > + return false; > > + > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > +} > > > > I think the high key condition should be changed to > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > prefix qual is equal to the high key then also there is no point in > > searching on the current page so we can directly skip. 
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?

But IIUC, here we want to decide whether we will get the next key on the current page or not. Is my understanding correct? If our key (the last tuple's key) is equal to the high key, then the max key on this page is the same as what we already got in the last tuple, so why would we want to go to this page? It will not give us a new key. So ideally, we should only be looking into this page if our last tuple's key is smaller than the high key. Am I missing something?

> > + /* Check if an index skip scan is possible. */
> > + can_skip = enable_indexskipscan & index->amcanskip;
> > +
> > + /*
> > + * Skip scan is not supported when there are qual conditions, which are not
> > + * covered by index. The reason for that is that those conditions are
> > + * evaluated later, already after skipping was applied.
> > + *
> > + * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > + * index expressions. For that we need to examine index_clauses too.
> > + */
> > + if (root->parse->jointree != NULL)
> > + {
> > + ListCell *lc;
> > +
> > + foreach(lc, (List *)root->parse->jointree->quals)
> > + {
> > + Node *expr, *qual = (Node *) lfirst(lc);
> > + Var *var;
> >
> > I think we can avoid checking for expression if can_skip is already false.
>
> Yes, that makes sense. I'll include your suggestions into the next
> rebased version I'm preparing.

Ok.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote:
>
> Some more comments...

Thanks for reviewing. Since this patch took much longer than I expected, it's useful to have someone look at it with fresh eyes.

> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + _bt_update_skip_scankeys(scan, indexRel);
> +
> ...
> + /*
> + * We haven't found scan key within the current page, so let's scan from
> + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
> + * number
> + */
> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + stack = _bt_search(scan->indexRelation, so->skipScanKey,
> +, BT_READ, scan->xs_snapshot);
>
> Why do we need to set so->skipScanKey->nextkey =
> ScanDirectionIsForward(dir); multiple times? I think it is fine to
> just set it once?

I believe it was necessary for previous implementations, but in the current version we can avoid this, you're right.

> +static inline bool
> +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> + Buffer buf, ScanDirection dir)
> +{
> + OffsetNumber low, high;
> + Page page = BufferGetPage(buf);
> + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +
> + low = P_FIRSTDATAKEY(opaque);
> + high = PageGetMaxOffsetNumber(page);
> +
> + if (unlikely(high < low))
> + return false;
> +
> + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> + _bt_compare(scan->indexRelation, key, page, high) < 1);
> +}
>
> I think the high key condition should be changed to
> _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> prefix qual is equal to the high key then also there is no point in
> searching on the current page so we can directly skip.

From nbtree/README and comments to functions like _bt_split I got the impression that the high key could be equal to the last item on the leaf page, so there is a point in searching. Is that incorrect?

> + /* Check if an index skip scan is possible.
*/ > + can_skip = enable_indexskipscan & index->amcanskip; > + > + /* > + * Skip scan is not supported when there are qual conditions, which are not > + * covered by index. The reason for that is that those conditions are > + * evaluated later, already after skipping was applied. > + * > + * TODO: This implementation is too restrictive, and doesn't allow e.g. > + * index expressions. For that we need to examine index_clauses too. > + */ > + if (root->parse->jointree != NULL) > + { > + ListCell *lc; > + > + foreach(lc, (List *)root->parse->jointree->quals) > + { > + Node *expr, *qual = (Node *) lfirst(lc); > + Var *var; > > I think we can avoid checking for expression if can_skip is already false. Yes, that makes sense. I'll include your suggestions into the next rebased version I'm preparing.
Re: Index Skip Scan
Sorry for the late reply.

> On Tue, Apr 14, 2020 at 09:19:22PM +1200, David Rowley wrote:
>
> I've not yet looked at the latest patch, but I did put some thoughts
> into an email on the other thread that's been discussing UniqueKeys [1].
>
> I'm keen to hear thoughts on the plan I mentioned over there. Likely
> it would be best to discuss the specifics of what additional features
> we need to add to UniqueKeys for skip scans over here, but discuss any
> changes which affect both patches over there. We certainly can't have
> two separate implementations of UniqueKeys, so I believe the skip
> scans UniqueKeys patch should most likely be based on the one in [1]
> or some descendant of it.
>
> [1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com

Yes, I've come to the same conclusion, although I have my concerns about having such a dependency between patches. I will look at the suggested patches soon.
Re: Index Skip Scan
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Other than that to summarize current open points for future readers
> (this thread somehow became quite big):
>
> * Making UniqueKeys usage more generic to allow using skip scan for more
> use cases (hopefully it was covered by the v33, but I still need a
> confirmation from David, like blinking twice or something).

I've not yet looked at the latest patch, but I did put some thoughts into an email on the other thread that's been discussing UniqueKeys [1].

I'm keen to hear thoughts on the plan I mentioned over there. Likely it would be best to discuss the specifics of what additional features we need to add to UniqueKeys for skip scans over here, but discuss any changes which affect both patches over there. We certainly can't have two separate implementations of UniqueKeys, so I believe the skip scans UniqueKeys patch should most likely be based on the one in [1] or some descendant of it.

[1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com
RE: Index Skip Scan
> > * Suspicious performance difference between different type of workload, > mentioned by Tomas (unfortunately I had no chance yet to investigate). > His benchmark results indeed most likely point to multiple comparisons being done. Since the most likely place where these occur is _bt_readpage, I suspect this is called multiple times. Looking at your patch, I think that's indeed the case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a complete misestimation and chooses a skip scan here. First call to _bt_readpage will compare every tuple on the page already and store everything in the workspace, which will now contain [1,2,3,4,5]. However, when a skip is done the elements on the page (not the workspace) are compared to find the next one. Then, another _bt_readpage is done, starting at the new offnum. So we'll compare every tuple (except 1) on the page again. Workspace now contains [2,3,4,5]. Next tuple we'll end up with [3,4,5] etc. So tuple 5 actually gets compared 5 times in _bt_readpage alone. -Floris
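To make the repeated work concrete, here is a small self-contained simulation (plain C; the names and the per-offset restart are simplifying assumptions, not nbtree code) of the worst case described above: a 5-item page where each skip restarts the page read at the next offset, so the last tuple ends up being compared once per restart.

```c
#include <assert.h>

#define PAGE_ITEMS 5

/* comparisons[i] counts how often tuple i is examined by readpage_from */
static int comparisons[PAGE_ITEMS];

/* toy _bt_readpage: "compare" every tuple from offnum to the end of page */
static void
readpage_from(int offnum)
{
	for (int i = offnum; i < PAGE_ITEMS; i++)
		comparisons[i]++;
}

/*
 * Worst case sketched in the mail: the planner misestimates, the scan
 * skips one tuple at a time, and the page is re-read starting at each
 * successive offset.  Returns how often the last tuple was compared.
 */
static int
simulate_skips(void)
{
	for (int i = 0; i < PAGE_ITEMS; i++)
		comparisons[i] = 0;
	for (int off = 0; off < PAGE_ITEMS; off++)
		readpage_from(off);
	return comparisons[PAGE_ITEMS - 1];
}
```

Tuple i is compared i+1 times, so the total work over the page is 5+4+3+2+1 = 15 comparisons instead of 5, which is the quadratic blow-up the benchmark numbers point to.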
Re: Index Skip Scan
> On Mon, Apr 06, 2020 at 06:31:08PM +, Floris Van Nee wrote:
> >
> > > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> > > would appreciate if in the future you can make it more visible that changes you
> > > suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> > > that that's the case, and it doesn't make sense to pull into different direction
> > > when we're trying to achieve the same goal :)
> >
> > I wasn't aware that this particular case could be triggered before I saw
> > Dilip's email, otherwise I'd have mentioned it here of course. It's just that
> > because my patch handles filter conditions in general, it works for this case too.

Oh, then fortunately I got a wrong impression; sorry, and thanks for the clarification :)

> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> >
> > I'll definitely take a look at suggested changes in filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm
> not entirely sure. Basically you have to pull the information about skipping
> one level up, out of the node, into the generic IndexNext code.

I was actually thinking more about just preventing skip scan in this situation, which, if I'm not mistaken, could be solved by inspecting the qual conditions to figure out whether they're covered by the index - something like in the attachments (this implementation is actually too restrictive in this sense and will not allow e.g. expressions; that's why I haven't bumped the patch set version for it - soon I'll post an extended version).
Other than that to summarize current open points for future readers (this thread somehow became quite big):

* Making UniqueKeys usage more generic to allow using skip scan for more use cases (hopefully it was covered by the v33, but I still need a confirmation from David, like blinking twice or something).

* Suspicious performance difference between different type of workload, mentioned by Tomas (unfortunately I had no chance yet to investigate).

* Thinking about supporting conditions, that are not covered by the index, to make skipping more flexible (one of the potential next steps in the future, as suggested by Floris).

From 15989c5250214fea8606a56afd1eeaf760b8723e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c           |  14 +++
 src/backend/nodes/print.c              |  39 +++
 src/backend/optimizer/path/Makefile    |   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 
 src/backend/optimizer/path/pathkeys.c  |  71 +++--
 src/backend/optimizer/path/uniquekey.c | 136 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  37 ++-
 src/backend/optimizer/util/pathnode.c  |  46 +++--
 src/include/nodes/nodes.h              |   1 +
 src/include/nodes/pathnodes.h          |  19 
 src/include/nodes/print.h              |   1 +
 src/include/optimizer/pathnode.h       |   2 +
 src/include/optimizer/paths.h          |  11 ++
 16 files changed, 408 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
RE: Index Skip Scan
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I > would appreciate if in the future you can make it more visible that changes > you > suggest contain some fixes. E.g. it wasn't clear for me from your previous > email > that that's the case, and it doesn't make sense to pull into different > direction > when we're trying to achieve the same goal :) I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too. > > > > In the patch I posted a week ago these cases are all handled > > > correctly, as it introduces this extra logic in the Executor. > > > > Okay, So I think we can merge those fixes in Dmitry's patch set. > > I'll definitely take a look at suggested changes in filtering part. It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code. I'm eager to get some form of skip scans into master - any kind of patch that makes this possible is fine by me. Long term I think my version provides a more generic approach, with which we can optimize a much broader range of queries. However, since many more eyes have seen your patch so far, I hope yours can be committed much sooner. My knowledge on this committer process is limited though. That's why I've just posted mine so far in the hope of collecting some feedback, also on how we should continue with the process. -Floris
Re: Index Skip Scan
> On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee wrote:
> >
> > There's a number of ways we can force a 'filter' rather than an
> > 'index condition'.

Hm, I wasn't aware of this one, thanks for bringing it up. Btw, Floris, I would appreciate it if, in the future, you could make it more visible when the changes you suggest contain some fixes. E.g. it wasn't clear to me from your previous email that that's the case, and it doesn't make sense to pull in different directions when we're trying to achieve the same goal :)

> > In the patch I posted a week ago these cases are all handled
> > correctly, as it introduces this extra logic in the Executor.
>
> Okay, So I think we can merge those fixes in Dmitry's patch set.

I'll definitely take a look at the suggested changes in the filtering part.
Re: Index Skip Scan
On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee wrote: > > > > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > > > > > I was just wondering how the distinct will work with the "skip scan" > > > > if we have some filter? I mean every time we select the unique row > > > > based on the prefix key and that might get rejected by an external > > > > filter right? > > > > > Yeah, you're correct. This patch only handles the index conditions and > doesn't handle any filters correctly. There's a check in the planner for the > IndexScan for example that only columns that exist in the index are used. > However, this check is not sufficient as your example shows. There's a number > of ways we can force a 'filter' rather than an 'index condition' and still > choose a skip scan (WHERE b!=0 is another one I think). This leads to > incorrect query results. Right > This patch would need some logic in the planner to never choose the skip scan > in these cases. Better long-term solution is to adapt the rest of the > executor to work correctly in the cases of external filters (this ties in > with the previous visibility discussion as well, as that's basically also an > external filter, although a special case). I agree > In the patch I posted a week ago these cases are all handled correctly, as it > introduces this extra logic in the Executor. Okay, So I think we can merge those fixes in Dmitry's patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: Index Skip Scan
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > > > I was just wondering how the distinct will work with the "skip scan" > > > if we have some filter? I mean every time we select the unique row > > > based on the prefix key and that might get rejected by an external > > > filter right? > > Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan for example that only columns that exist in the index are used. However, this check is not sufficient as your example shows. There's a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b!=0 is another one I think). This leads to incorrect query results. This patch would need some logic in the planner to never choose the skip scan in these cases. Better long-term solution is to adapt the rest of the executor to work correctly in the cases of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case). In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor. -Floris
Re: Index Skip Scan
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> >
> > I was just wondering how the distinct will work with the "skip scan"
> > if we have some filter? I mean every time we select the unique row
> > based on the prefix key and that might get rejected by an external
> > filter right?
>
> Not exactly. In the case of index-only scan, we skipping to the first
> unique position, and then use already existing functionality
> (_bt_readpage with stepping to the next pages) to filter out those
> records that do not pass the condition.

I agree, but that will work only if we have a valid index clause; the "b%100=0" condition will not create an index clause, right? However, if we change the query to "select distinct(a) from t where b=100" then it works fine, because this condition will create an index clause.

> There are even couple of tests
> in the patch for this. In case of index scan, when there are some
> conditions, current implementation do not consider skipping.
>
> > So I tried an example to check this.
>
> Can you tell on which version of the patch you were testing?

I have tested on v33.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
>
> I was just wondering how the distinct will work with the "skip scan"
> if we have some filter? I mean every time we select the unique row
> based on the prefix key and that might get rejected by an external
> filter right?

Not exactly. In the case of an index-only scan, we skip to the first unique position, and then use already existing functionality (_bt_readpage with stepping to the next pages) to filter out those records that do not pass the condition. There are even a couple of tests in the patch for this. In the case of an index scan, when there are some conditions, the current implementation does not consider skipping.

> So I tried an example to check this.

Can you tell on which version of the patch you were testing?
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
> >
> > Seems like you forgot to add the uniquekey.c file in the
> > v33-0001-Unique-key.patch.
>
> Oh, you're right, thanks. Here is the corrected patch.

I was just wondering how the distinct will work with the "skip scan" if we have some filter? I mean every time we select the unique row based on the prefix key and that might get rejected by an external filter right? So I tried an example to check this.

postgres[50006]=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Indexes:
    "idx" btree (a, b)

postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# set enable_indexskipscan =off;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
 1
 2
(2 rows)

postgres[50006]=# set enable_indexskipscan =on;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
(0 rows)

postgres[50006]=# explain select distinct(a) from t where b%100=0;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Only Scan using idx on t  (cost=0.15..1.55 rows=10 width=4)
   Skip scan: true
   Filter: ((b % 100) = 0)
(3 rows)

I think in such cases we should not select the skip scan. This should behave like we have a filter on the non-index field.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
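The planner-side guard being proposed can be sketched as a self-contained decision function (plain C; the function name, the qual/index-clause counting, and the coverage rule are illustrative assumptions, not the patch's actual API). The idea: a skip scan is only safe when every restriction clause became an index condition, because a residual filter such as b%100=0 is evaluated only after skipping has already discarded rows, which is what produces the wrong (0 rows) result above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy decision function: nquals is the number of restriction clauses in
 * the query, nindex_clauses is how many of them the planner matched as
 * index conditions.  Any remainder is applied as a filter after the
 * skip, which can wrongly eliminate entire distinct groups, so in that
 * case the skip scan must be disallowed.
 */
static bool
skip_scan_is_safe(bool amcanskip, int nquals, int nindex_clauses)
{
	/* the access method must support skipping at all */
	if (!amcanskip)
		return false;

	/* every qual must have been absorbed into the index conditions */
	return nindex_clauses == nquals;
}
```

With this rule, "WHERE b = 100" (one qual, one index clause) still permits the skip scan, while "WHERE b%100 = 0" (one qual, zero index clauses, so it becomes a filter) falls back to a regular scan, matching Dilip's expectation above.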
Re: Index Skip Scan
> On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
>
> Seems like you forgot to add the uniquekey.c file in the
> v33-0001-Unique-key.patch.

Oh, you're right, thanks. Here is the corrected patch.

From 15989c5250214fea8606a56afd1eeaf760b8723e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c           |  14 +++
 src/backend/nodes/print.c              |  39 +++
 src/backend/optimizer/path/Makefile    |   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 
 src/backend/optimizer/path/pathkeys.c  |  71 +++--
 src/backend/optimizer/path/uniquekey.c | 136 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  37 ++-
 src/backend/optimizer/util/pathnode.c  |  46 +++--
 src/include/nodes/nodes.h              |   1 +
 src/include/nodes/pathnodes.h          |  19 
 src/include/nodes/print.h              |   1 +
 src/include/optimizer/pathnode.h       |   2 +
 src/include/optimizer/paths.h          |  11 ++
 16 files changed, 408 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 		case T_PathKey:
 			_outPathKey(str, obj);
 			break;
+		case T_UniqueKey:
+			_outUniqueKey(str, obj);
+			break;
 		case T_PathTarget:
 			_outPathTarget(str, obj);
 			break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey  *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+				first = false;
+			else
+				printf(", ");
+			print_expr((Node *) mem->em_expr, rtable);
+		}
+		printf(")");
+		if (lnext(uniquekeys, l))
+			printf(", ");
+	}
+	printf(")\n");
+}
+
 /*
  * print_tl
  *	  print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8286d9cf34..bbc13e6141 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
 		print_pathkeys(path->pathkeys, root->parse->rtable);
 	}
 
+	if (path->uniquekeys)
+	{
+		for (i = 0; i < indent; i++)
+			printf("\t");
+		printf("	  uniquekeys: ");
+
Re: Index Skip Scan
On Tue, Mar 24, 2020 at 10:08 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote: > > > > Yes, I was complaining that a ProjectionPath breaks the optimisation > > and I don't believe there's any reason that it should. > > > > I believe the way to make that work correctly requires paying > > attention to the Path's uniquekeys rather than what type of path it > > is. > > Thanks for the suggestion. As a result of the discussion I've modified > the patch, does it look similar to what you had in mind? > > In this version if all conditions are met and there are corresponding > unique keys, a new index skip scan path will be added to > unique_pathlists. In case the requested distinct clauses match the > unique keys, create_distinct_paths can choose this path without needing > to know what kind of path it is. Also unique_keys are passed through > ProjectionPath, so the optimization for the example mentioned earlier in this thread > should now work (I've added one test for that). > > I haven't changed anything about the UniqueKey structure itself (one of the > suggestions was about Expr instead of EquivalenceClass), but I believe > we still need to figure out how the two existing implementations (in this > patch and from [1]) of this idea can be connected.
> > [1]:
> https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

---
 src/backend/nodes/outfuncs.c          | 14 ++
 src/backend/nodes/print.c             | 39 +++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41
 src/backend/optimizer/path/pathkeys.c | 71 ++-
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +-
 src/backend/optimizer/util/pathnode.c | 46 +
 src/include/nodes/nodes.h             |  1 +
 src/include/nodes/pathnodes.h         | 19 +++
 src/include/nodes/print.h             |  1 +
 src/include/optimizer/pathnode.h      |  2 +
 src/include/optimizer/paths.h         | 11 +
 15 files changed, 272 insertions(+), 23 deletions(-)

Seems like you forgot to add the uniquekey.c file in the
v33-0001-Unique-key.patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 12:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote: > > > > There was a dedicated thread [1] where David explains his idea in > > detail, and you can also check the messages around that message for > > the context. Hope it helps. > > Thanks for pointing out this thread! Somehow I've missed it, and now it > looks like we need to make some efforts to align the patches for index > skip scan and distinctClause elimination. Yes :). It looks like index skip scan is a way of producing a distinct result without a real Distinct node; it happens after the UniqueKeys check, where I try to see if the result is unique already, and before the place where a Unique node is created for the distinct clause (with index skip scan we don't need it at all). Currently in my patch the logic is: 1) check the UniqueKeys to see if the result is unique already; if not, go to the next step; 2) after the distinct paths are created, add the result of the distinct path as a unique key. Will you add the index skip scan path during create_distinct_paths and add the UniqueKey to the RelOptInfo? If so, I guess my current patch can handle it, since it cares about the result of the distinct path but is not worried about how we achieve that. Best Regards Andy Fan
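To make step 1 above concrete, here is a minimal standalone sketch (plain C, not PostgreSQL planner code; all names are hypothetical) of the check that a DISTINCT is already satisfied: the result is known to be distinct when the columns of some unique key form a subset of the distinct columns.

```c
#include <stdbool.h>
#include <stddef.h>

/* One unique key: a set of column numbers known to be unique together. */
typedef struct
{
	const int  *cols;
	size_t		ncols;
} UniqueKeySet;

static bool
col_in(const int *set, size_t n, int col)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == col)
			return true;
	return false;
}

/*
 * The DISTINCT is a no-op when every column of some known unique key
 * appears among the distinct columns: grouping on a superset of a
 * unique key cannot merge any rows.
 */
bool
distinct_is_noop(const int *distinct_cols, size_t ndistinct,
				 const UniqueKeySet *keys, size_t nkeys)
{
	for (size_t k = 0; k < nkeys; k++)
	{
		size_t		matched = 0;

		for (size_t i = 0; i < keys[k].ncols; i++)
			if (col_in(distinct_cols, ndistinct, keys[k].cols[i]))
				matched++;
		if (matched == keys[k].ncols)
			return true;
	}
	return false;
}
```

When this returns true no Unique/HashAggregate node (and no skip scan) is needed at all, which is exactly the case the UniqueKeys check handles before any distinct paths are built.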
Re: Index Skip Scan
> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote: > > There was a dedicated thread [1] where David explains his idea in > detail, and you can also check the messages around that message for > the context. Hope it helps. Thanks for pointing out this thread! Somehow I've missed it, and now it looks like we need to make some efforts to align the patches for index skip scan and distinctClause elimination.
Re: Index Skip Scan
> On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote: > > Yes, I was complaining that a ProjectionPath breaks the optimisation > and I don't believe there's any reason that it should. > > I believe the way to make that work correctly requires paying > attention to the Path's uniquekeys rather than what type of path it > is. Thanks for the suggestion. As a result of the discussion I've modified the patch; does it look similar to what you had in mind? In this version, if all conditions are met and there are corresponding unique keys, a new index skip scan path is added to unique_pathlists. If the requested distinct clauses match the unique keys, create_distinct_paths can choose this path without needing to know what kind of path it is. Also, unique_keys are passed through ProjectionPath, so the optimization for the example mentioned earlier in this thread should now work (I've added one test for that). I haven't changed anything about the UniqueKey structure itself (one of the suggestions was about Expr instead of EquivalenceClass), but I believe we still need to figure out how the two existing implementations of this idea (in this patch and in [1]) can be connected. [1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

>From c7af8157da82db1cedf02e6ec0de355b56275680 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c          | 14 ++
 src/backend/nodes/print.c             | 39 +++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41
 src/backend/optimizer/path/pathkeys.c | 71 ++-
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +-
 src/backend/optimizer/util/pathnode.c | 46 +
 src/include/nodes/nodes.h             |  1 +
 src/include/nodes/pathnodes.h         | 19 +++
 src/include/nodes/print.h             |  1 +
 src/include/optimizer/pathnode.h      |  2 +
 src/include/optimizer/paths.h         | 11 +
 15 files changed, 272 insertions(+), 23 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }

 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }

+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 		case T_PathKey:
 			_outPathKey(str, obj);
 			break;
+		case T_UniqueKey:
+			_outUniqueKey(str, obj);
+			break;
 		case T_PathTarget:
 			_outPathTarget(str, obj);
 			break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }

+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey  *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+				first = false;
+			else
+				printf(", ");
+			print_expr((Node *) mem->em_expr,
Re: Index Skip Scan
> > > On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee > wrote: > > I'm unsure which version number to give this patch (to continue with > numbers from previous skip scan patches, or to start numbering from scratch > again). It's a rather big change, so one could argue it's mostly a separate > patch. I guess it mostly depends on how close the original versions were to > be committable. Thoughts? > > I don't know, but from the sidelines, it'd be nice to see the unique > path part go into PG13, where IIUC it can power the "useless unique > removal" patch. Actually I posted a patch to remove the distinct clause quite some time ago [1], and it later evolved into the UniqueKey work as well; you can see [2] for the current status. [1] https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107 [2] https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL=uafudmf4wgovkel3onbatqju9nsxtuucpp...@mail.gmail.com
Re: Index Skip Scan
Hi Floris, On Sun, Mar 22, 2020 at 11:00 AM Floris Van Nee wrote: > create index on t1 (a,b,c); > select * from t1 where b in (100, 200); > Execution Time: 2.464 ms > Execution Time: 252.224 ms > Execution Time: 244.872 ms Wow. This is very cool work and I'm sure it will become a major headline feature of PG14 if the requisite planner brains can be sorted out. On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee wrote: > I'm unsure which version number to give this patch (to continue with numbers > from previous skip scan patches, or to start numbering from scratch again). > It's a rather big change, so one could argue it's mostly a separate patch. I > guess it mostly depends on how close the original versions were to be > committable. Thoughts? I don't know, but from the sidelines, it'd be nice to see the unique path part go into PG13, where IIUC it can power the "useless unique removal" patch.
Re: Index Skip Scan
On Wed, 11 Mar 2020 at 16:44, Andy Fan wrote: >> >> >> I think the UniqueKeys may need to be changed from using >> EquivalenceClasses to use Exprs instead. > > > When I try to understand why UniqueKeys needs EquivalenceClasses, > see your comments here. I feel that a FuncExpr can't be > used as a unique key even if we can create a unique index on f(a) > and f->strict == true. The reason is that even if we know a is not null and > f->strict == true, it is still possible that f(a) == null, and a unique index > allows more than one null value. So shall we go further and use varattno > instead of Expr? If so, we can also use a list of Bitmapsets to represent > multiple unique keys of a single RelOptInfo. We do need some method to determine if NULL values are possible. At the base relation level that can probably be done by checking NOT NULL constraints and strict base quals. At higher levels, we can use strict join quals as proofs. As for a Bitmapset of varattnos, that would certainly work well at the base relation level when there are no unique expression indexes, but it's not so simple with join relations, since varattnos only mean something when you know which base relation they come from. I'm not saying that Lists of Exprs is ideal, but I think trying to optimise some code that does not yet exist is premature. There was some other talk in [1] on how we might make checking whether a List contains a given Node more efficient. That could be advantageous in a few places in the query planner, and it might be useful for this too. [1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com
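As an illustration of the Bitmapset-of-varattnos idea at the base relation level only, here is a toy sketch (plain C, not PostgreSQL's Bitmapset API; it assumes attribute numbers fit in one 64-bit word, which the real Bitmapset does not require). The "is this unique key usable for the DISTINCT" test collapses to a couple of bit operations:

```c
#include <stdbool.h>
#include <stdint.h>

/* Attribute numbers of ONE base relation packed into a bitmask.
 * Assumption: attnos < 64. This breaks down for join relations,
 * where an attno is meaningless without knowing its base relation. */
typedef uint64_t AttnoSet;

AttnoSet
attno_set(const int *attnos, int n)
{
	AttnoSet	s = 0;

	for (int i = 0; i < n; i++)
		s |= UINT64_C(1) << attnos[i];
	return s;
}

/* A unique key covers the DISTINCT when all of its attnos appear in it. */
bool
key_covered_by(AttnoSet key, AttnoSet distinct_attnos)
{
	return (key & ~distinct_attnos) == 0;
}
```

The single-relation assumption baked into the type is precisely the objection above: a flat attno set cannot describe uniqueness of a joinrel, nor of an expression index like f(a).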
Re: Index Skip Scan
On Tue, Mar 10, 2020 at 4:32 AM James Coleman wrote: > On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > > > Assuming we'll implement it in a way that we do not know about what kind > > of path type is that in create_distinct_path, then it can also work for > > ProjectionPath or anything else (if UniqueKeys are present). But then > > still EquivalenceMember are used only to figure out correct > > distinctPrefixKeys and do not affect whether or not skipping is applied. > > What do I miss? > > > Part of the puzzle seems to me to this part of the response: > > > I think the UniqueKeys may need to be changed from using > > EquivalenceClasses to use Exprs instead. > > But I can't say I'm being overly helpful by pointing that out, since I > don't have my head in the code enough to understand how you'd > accomplish that :) > > There was a dedicated thread [1] where David explain his idea very detailed, and you can also check that messages around that message for the context. hope it helps. [1] https://www.postgresql.org/message-id/CAApHDvq7i0%3DO97r4Y1pv68%2BtprVczKsXRsV28rM9H-rVPOfeNQ%40mail.gmail.com
Re: Index Skip Scan
> > > I think the UniqueKeys may need to be changed from using > EquivalenceClasses to use Exprs instead. > While trying to understand why UniqueKeys needs EquivalenceClasses, I saw your comments here. I feel that a FuncExpr can't be used as a unique key even if we can create a unique index on f(a) and f->strict == true. The reason is that even if we know a is not null and f->strict == true, it is still possible that f(a) == null, and a unique index allows more than one null value. So shall we go further and use varattno instead of Expr? If so, we can also use a list of Bitmapsets to represent multiple unique keys of a single RelOptInfo.
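The argument above can be shown with a toy model of SQL's strict-function semantics (hypothetical names, not PostgreSQL code): a strict f maps NULL input to NULL, yet distinct non-null inputs may still both map to NULL, and since a unique index permits multiple NULLs, a unique index on f(a) plus "a is not null" does not prove f(a) unique.

```c
#include <stdbool.h>

/* A nullable integer, mimicking an SQL value. */
typedef struct
{
	bool		isnull;
	int			value;
} NInt;

/*
 * A strict function: NULL input yields NULL output.  Strictness says
 * nothing about non-null inputs, though -- here every negative input
 * also yields NULL, so distinct inputs collapse onto NULL.
 */
NInt
f_strict(NInt a)
{
	NInt		null_result = {true, 0};

	if (a.isnull)
		return null_result;		/* strictness */
	if (a.value < 0)
		return null_result;		/* non-null input can still yield NULL */
	return (NInt) {false, a.value + 1};
}
```

Rows with a = -1 and a = -2 both index as NULL under f, and a unique index on f(a) accepts both of them.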
Re: Index Skip Scan
On Wed, 11 Mar 2020 at 01:38, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. This results in the optimisation not > > working in cases like: > > I've read it as "an assumption that an EquivalenceMember can only be a > Var" results in "the optimisation not working in cases like this". But > you've meant that ignoring a ProjectionPath with an IndexPath inside > results in this optimisation not working, right? If so, then everything > is clear, and my apologies, maybe I need to finally fix my sleep > schedule :) Yes, I was complaining that a ProjectionPath breaks the optimisation and I don't believe there's any reason that it should. I believe the way to make that work correctly requires paying attention to the Path's uniquekeys rather than what type of path it is.
Re: Index Skip Scan
> >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Assuming we'll implement it in a way that we do not know about what kind > > of path type is that in create_distinct_path, then it can also work for > > ProjectionPath or anything else (if UniqueKeys are present). But then > > still EquivalenceMember are used only to figure out correct > > distinctPrefixKeys and do not affect whether or not skipping is applied. > > What do I miss? > > I'm not sure I fully understand the question correctly, but let me > explain further. > > In the 0001 patch, standard_qp_callback sets the query_uniquekeys > depending on the DISTINCT / GROUP BY clause. When building index > paths in build_index_paths(), the 0002 patch should be looking at the > root->query_uniquekeys to see if it can build any index paths that > suit those keys. Such paths should be tagged with the uniquekeys they > satisfy, basically, exactly the same as how pathkeys work. Many > create_*_path functions will need to be modified to carry forward > their uniquekeys. For example, create_projection_path(), > create_limit_path() don't do anything which would cause the created > path to violate the unique keys. This way when you get down to > create_distinct_paths(), paths other than IndexPath may have > uniquekeys. You'll be able to check which existing paths satisfy the > unique keys required by the DISTINCT / GROUP BY and select those paths > instead of having to create any HashAggregate / Unique paths. > > Does that answer the question? Hmm... I'm afraid no, this was already clear. But looks like now I see that I've misinterpreted one part. > There's also some weird looking assumptions that an EquivalenceMember > can only be a Var in create_distinct_paths(). 
I think you're only > saved from crashing there because a ProjectionPath will be created > atop of the IndexPath to evaluate expressions, in which case you're > not seeing the IndexPath. This results in the optimisation not > working in cases like: I've read it as "an assumption that an EquivalenceMember can only be a Var" results in "the optimisation not working in cases like this". But you've meant that ignoring a ProjectionPath with an IndexPath inside results in this optimisation not working, right? If so, then everything is clear, and my apologies, maybe I need to finally fix my sleep schedule :)
Re: Index Skip Scan
On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? Part of the puzzle seems to me to be this part of the response: > I think the UniqueKeys may need to be changed from using > EquivalenceClasses to use Exprs instead. But I can't say I'm being overly helpful by pointing that out, since I don't have my head in the code enough to understand how you'd accomplish that :) James
Re: Index Skip Scan
On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? I'm not sure I fully understand the question correctly, but let me explain further. In the 0001 patch, standard_qp_callback sets the query_uniquekeys depending on the DISTINCT / GROUP BY clause. When building index paths in build_index_paths(), the 0002 patch should be looking at the root->query_uniquekeys to see if it can build any index paths that suit those keys. Such paths should be tagged with the uniquekeys they satisfy, basically, exactly the same as how pathkeys work. Many create_*_path functions will need to be modified to carry forward their uniquekeys. For example, create_projection_path(), create_limit_path() don't do anything which would cause the created path to violate the unique keys. This way when you get down to create_distinct_paths(), paths other than IndexPath may have uniquekeys. You'll be able to check which existing paths satisfy the unique keys required by the DISTINCT / GROUP BY and select those paths instead of having to create any HashAggregate / Unique paths. Does that answer the question?
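The "carry forward their uniquekeys" idea can be sketched with a toy Path structure (hypothetical names, not the real pathnode.c API): constructors for paths that neither add nor duplicate rows simply propagate the child's uniquekeys, exactly the way pathkeys already propagate.

```c
#include <stddef.h>

/* Toy Path node: only the fields relevant to the discussion. */
typedef struct Path
{
	const char *pathtype;		/* "IndexSkipScan", "Projection", ... */
	const void *uniquekeys;		/* opaque: keys this path is unique on */
	struct Path *subpath;
} Path;

/* Projection neither adds nor removes rows, so uniquekeys survive. */
Path
make_projection_path(Path *subpath)
{
	Path		p = {"Projection", subpath->uniquekeys, subpath};

	return p;
}

/* Limit can only drop rows, which also cannot break uniqueness. */
Path
make_limit_path(Path *subpath)
{
	Path		p = {"Limit", subpath->uniquekeys, subpath};

	return p;
}
```

With this in place, create_distinct_paths() need only inspect a path's uniquekeys field, never its pathtype, which is the whole point of the design described above.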
Re: Index Skip Scan
> On Mon, Mar 09, 2020 at 10:27:26AM +1300, David Rowley wrote: > > > > I think the changes in create_distinct_paths() need more work. The > > > way I think this should work is that create_distinct_paths() gets to > > > know exactly nothing about what path types support the elimination of > > > duplicate values. The Path should carry the UniqueKeys so that can be > > > determined. In create_distinct_paths() you should just be able to make > > > use of those paths, which should already have been created when > > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > > having been set. > > > > Just for me to clarify. The idea is to "move" information about what > > path types support skipping into UniqueKeys (derived from PlannerInfo's > > query_uniquekeys), but other checks (e.g. if index am supports that) > > still perform in create_distinct_paths? > > create_distinct_paths() shouldn't know any details specific to the > pathtype that it's using or considering using. All the details should > just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be > no IsA(path, ...). Please have a look over the details in my reply to > Tomas. I hope that reply has enough information in it, but please > reply there if I've missed something. Yes, I've read this reply, just wanted to ask here, since I had other questions as well. Speaking of which: > > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > > There's also some weird looking assumptions that an EquivalenceMember > > > can only be a Var in create_distinct_paths(). I think you're only > > > saved from crashing there because a ProjectionPath will be created > > > atop of the IndexPath to evaluate expressions, in which case you're > > > not seeing the IndexPath. 
I'm probably missing something, so to eliminate any misunderstanding from my side: > > > This results in the optimisation not working in cases like: > > > > > > postgres=# create table t (a int); create index on t ((a+1)); explain > > > select distinct a+1 from t; > > > CREATE TABLE > > > CREATE INDEX

> > >                         QUERY PLAN
> > > ---------------------------------------------------------
> > >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> > >    Group Key: (a + 1)
> > >    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

In this particular example skipping is not applied because, as you've mentioned, we're dealing with a ProjectionPath (not an IndexScan / IndexOnlyScan). Which means we're not even reaching the code with EquivalenceMember, so I'm still not sure how they are connected. Assuming we'll implement it in a way that we do not know what kind of path type it is in create_distinct_paths, then it can also work for ProjectionPath or anything else (if UniqueKeys are present). But then EquivalenceMembers are still used only to figure out the correct distinctPrefixKeys and do not affect whether or not skipping is applied. What do I miss?
Re: Index Skip Scan
On Mon, 9 Mar 2020 at 03:21, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > I've been looking over v32 of the patch and have a few comments > > regarding the planner changes. > > Thanks for the commentaries! > > > I think the changes in create_distinct_paths() need more work. The > > way I think this should work is that create_distinct_paths() gets to > > know exactly nothing about what path types support the elimination of > > duplicate values. The Path should carry the UniqueKeys so that can be > > determined. In create_distinct_paths() you should just be able to make > > use of those paths, which should already have been created when > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > having been set. > > Just for me to clarify. The idea is to "move" information about what > path types support skipping into UniqueKeys (derived from PlannerInfo's > query_uniquekeys), but other checks (e.g. if index am supports that) > still perform in create_distinct_paths? create_distinct_paths() shouldn't know any details specific to the pathtype that it's using or considering using. All the details should just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be no IsA(path, ...). Please have a look over the details in my reply to Tomas. I hope that reply has enough information in it, but please reply there if I've missed something. > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. 
This results in the optimisation not > > working in cases like: > > > > postgres=# create table t (a int); create index on t ((a+1)); explain > > select distinct a+1 from t; > > CREATE TABLE > > CREATE INDEX

> >                         QUERY PLAN
> > ---------------------------------------------------------
> >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> >    Group Key: (a + 1)
> >    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

> Yes, I need to fix it. > > Using unique paths as I mentioned above should see that fixed. > I'm a bit confused about this statement: how exactly should unique paths fix this? The path's uniquekeys would mention that it's unique on (a+1). You'd compare the uniquekeys of the path to the DISTINCT clause and see that the uniquekeys are a subset of the DISTINCT clause, and therefore the DISTINCT is a no-op. If that uniquekey path is cheaper than the cheapest_total_path + , then you should pick the unique path, otherwise use the cheapest_total_path and uniquify that. I think the UniqueKeys may need to be changed from using EquivalenceClasses to use Exprs instead.
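The cost comparison described above might look like this as a standalone sketch (hypothetical names; the real decision in create_distinct_paths() would of course use PostgreSQL's Path and cost machinery):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
	double		total_cost;
	bool		has_uniquekeys; /* already unique on the DISTINCT columns? */
} DPath;

/*
 * Pick between a path that is already unique on the distinct clause and
 * the overall-cheapest path plus an explicit uniquification step
 * (HashAggregate / Unique), purely on cost.
 */
const DPath *
choose_distinct_path(const DPath *unique_path, const DPath *cheapest,
					 double uniquify_cost)
{
	if (unique_path != NULL &&
		unique_path->total_cost < cheapest->total_cost + uniquify_cost)
		return unique_path;
	return cheapest;
}
```

A skip scan costing 120 beats a 100-cost seqscan that still needs a 50-cost HashAggregate on top, but loses once the uniquification step only costs 10; the choice stays purely cost-based, with no pathtype checks.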
Re: Index Skip Scan
> On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > I've been looking over v32 of the patch and have a few comments > regarding the planner changes. Thanks for the commentaries! > I think the changes in create_distinct_paths() need more work. The > way I think this should work is that create_distinct_paths() gets to > know exactly nothing about what path types support the elimination of > duplicate values. The Path should carry the UniqueKeys so that can be > determined. In create_distinct_paths() you should just be able to make > use of those paths, which should already have been created when > creating index paths for the rel due to PlannerInfo's query_uniquekeys > having been set. Just for me to clarify. The idea is to "move" information about what path types support skipping into UniqueKeys (derived from PlannerInfo's query_uniquekeys), but other checks (e.g. if index am supports that) still perform in create_distinct_paths? > Also, should create_grouping_paths() be getting the same code? > Jesper's UniqueKey patch seems to set query_uniquekeys when there's a > GROUP BY with no aggregates. So it looks like he has intended that > something like: > > SELECT x FROM t GROUP BY x; > > should work the same way as > > SELECT DISTINCT x FROM t; > > but the 0002 patch does not make this work. Has that just been overlooked? I believe it wasn't overlooked in 0002 patch, but rather added just in case in 0001. I guess there are no theoretical problems in implementing it, but since we wanted to keep scope of the patch under control and concentrate on the existing functionality it probably makes sense to plan it as one of the next steps? > There's also some weird looking assumptions that an EquivalenceMember > can only be a Var in create_distinct_paths(). I think you're only > saved from crashing there because a ProjectionPath will be created > atop of the IndexPath to evaluate expressions, in which case you're > not seeing the IndexPath. 
This results in the optimisation not > working in cases like: > > postgres=# create table t (a int); create index on t ((a+1)); explain > select distinct a+1 from t; > CREATE TABLE > CREATE INDEX

>                         QUERY PLAN
> ---------------------------------------------------------
>  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
>    Group Key: (a + 1)
>    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Yes, I need to fix it. > Using unique paths as I mentioned above should see that fixed. I'm a bit confused about this statement: how exactly should unique paths fix this?
Re: Index Skip Scan
On Sat, 7 Mar 2020 at 03:49, Tomas Vondra wrote: > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > >The reason it must be done this way is that when the RelOptInfo that > >we're performing the DISTINCT on is a joinrel, then we're not going to > >see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort > >of Join path instead. I understand you're not yet supporting doing > >this optimisation when joins are involved, but it should be coded in > >such a way that it'll work when we do. (It's probably also a separate > >question as to whether we should have this only work when there are no > >joins. I don't think I personally object to it for stage 1, but > >perhaps someone else might think differently.) > > > > I don't follow. Can you elaborate more? > > AFAICS skip-scan is essentially a capability of an (index) AM. I don't > see how we could ever do that for joinrels? We can do that at the scan > level, below a join, but that's what this patch already supports, I > think. When you say "supporting this optimisation" with joins, do you > mean doing skip-scan for join inputs, or on top of the join? The skip index scan Path would still be created at the base rel level, but the join path on the join relation would have one of the sub-paths of the join as an index skip scan. An example query that could make use of this is: SELECT * FROM some_table WHERE a IN(SELECT indexed_col_with_few_distinct_values FROM big_table); In this case, we might want to create a Skip Scan path on big_table using the index on the "indexed_col_with_few_distinct_values", then Hash Join to "some_table". That class of query is likely stage 2 or 3 of this work, but we need to lay foundations that'll support it. As for not having IndexScan paths in joinrels. Yes, of course, but that's exactly why create_distinct_paths() cannot work the way the patch currently codes it. 
The patch does:

+	/*
+	 * XXX: In case of index scan quals evaluation happens
+	 * after ExecScanFetch, which means skip results could be
+	 * filtered out. Consider the following query:
+	 *
+	 * select distinct (a, b) a, b, c from t where c < 100;
+	 *
+	 * Skip scan returns one tuple for one distinct set of (a,
+	 * b) with an arbitrary one of c, so if the chosen c does
+	 * not match the qual and there is any c that matches the
+	 * qual, we miss that tuple.
+	 */
+	if (path->pathtype == T_IndexScan &&

which will never work for join relations since they'll only have paths for Loop/Merge/Hash type joins. The key here is to determine which skip scan paths we should create when we're building the normal index paths, then see if we can make use of those when planning joins. Subsequently, we'll then see if we can make use of the resulting join paths during create_distinct_paths(). Doing it this way will allow us to use skip scans in queries such as: SELECT DISTINCT t1.z FROM t1 INNER JOIN t2 ON t1.a = t2.unique_col; We'll first create the skip scan paths on t1, then when creating the join paths we'll create additional join paths which use the skipscan path. Because t2.unique_col will have at most 1 join partner for each t1 row, the join path will have the same unique_keys as the skipscan path. That'll allow us to use the join path which has the skip scan on whichever side of the join the t1 relation ends up. All create_distinct_paths() should be doing is looking for paths that are already implicitly unique on the distinct clause and considering using those in a cost-based way. It shouldn't be making such paths itself. > >For storing these new paths with UniqueKeys, I'm not sure exactly if > >we can just add_path() such paths into the RelOptInfo's pathlist. > >What we don't want to do is accidentally make use of paths which > >eliminate duplicate values when we don't want that behaviour.
If we > >did store these paths in RelOptInfo->pathlist then we'd need to go and > >modify a bunch of places to ignore such paths. set_cheapest() would > >have to do something special for them too, which makes me think > >pathlist is the incorrect place. Parallel query added > >partial_pathlist, so perhaps we need unique_pathlist to make this > >work. > > > > Hmmm, good point. Do we actually produce incorrect plans with the > current patch, using skip-scan path when we should not?

I don't think so. The patch is only creating skip scan paths on the base rel when we discover it's valid to do so. That's not the way it should work, though. How the patch currently works would be similar to initially only creating a SeqScan path for a query such as SELECT * FROM tab ORDER BY a; but then, during create_ordered_paths(), going and creating some IndexPath to scan the btree index on tab.a because we suddenly realise that it'll be good to use that for the ORDER BY. The planner does not work that way. We always create all the paths that we think will be useful during set_base_rel_pathlists(). We then make use of only existing paths in the upper planner. See what
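To see concretely why qual evaluation after the skip step is unsafe (the hazard the patch's XXX comment describes), here is a toy model in Python - purely illustrative, not the patch's code - of a skip scan that keeps one arbitrary row per distinct (a, b) before a c < 100 qual is applied:

```python
def skip_scan_first_per_group(rows):
    """Toy skip scan: emit one arbitrary row per distinct (a, b)."""
    seen, out = set(), []
    for a, b, c in rows:              # rows come back in index order
        if (a, b) not in seen:
            seen.add((a, b))
            out.append((a, b, c))
    return out

rows = [(1, 1, 500), (1, 1, 50), (2, 2, 10)]
skipped = skip_scan_first_per_group(rows)

# Applying the qual c < 100 after the skip loses the (1, 1) group
# entirely, even though (1, 1, 50) satisfies it:
print([r for r in skipped if r[2] < 100])               # [(2, 2, 10)]

# The correct distinct-on-(a, b) answer over qualifying rows:
print(sorted({(a, b) for a, b, c in rows if c < 100}))  # [(1, 1), (2, 2)]
```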
Re: Index Skip Scan
On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote: Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. +1 to code this in a generic way, using query_uniquekeys (if possible) The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) I don't follow. Can you elaborate more? AFAICS skip-scan is essentially a capability of an (index) AM. I don't see how we could ever do that for joinrels? We can do that at the scan level, below a join, but that's what this patch already supports, I think. When you say "supporting this optimisation" with joins, do you mean doing skip-scan for join inputs, or on top of the join? 
For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. Hmmm, good point. Do we actually produce incorrect plans with the current patch, using skip-scan path when we should not? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. 
Also, should create_grouping_paths() be getting the same code? Jesper's UniqueKey patch seems to set query_uniquekeys when there's a GROUP BY with no aggregates. So it looks like he has intended that something like: SELECT x FROM t GROUP BY x; should work the same way as SELECT DISTINCT x FROM t; but the 0002 patch does not make this work. Has that just been overlooked?

There are also some weird-looking assumptions that an EquivalenceMember can only be a Var in create_distinct_paths(). I think you're only saved from crashing there because a ProjectionPath will be created atop the IndexPath to evaluate expressions, in which case you're not seeing the IndexPath. This results in the optimisation not working in cases like:

postgres=# create table t (a int);
CREATE TABLE
postgres=# create index on t ((a+1));
CREATE INDEX
postgres=# explain select distinct a+1 from t;
                        QUERY PLAN
-----------------------------------------------------------
 HashAggregate  (cost=48.25..50.75 rows=200 width=4)
   Group Key: (a + 1)
   ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Using unique paths as I mentioned above should see that fixed.

David
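The shape David is advocating - Paths carrying UniqueKeys, with create_distinct_paths() merely checking coverage - reduces to a set-containment test. A minimal Python sketch of that check (function and argument names are made up for illustration; the patch's actual API differs):

```python
def path_is_unique_for(path_uniquekeys, distinct_exprs):
    """True if some unique-key set the path carries is covered by the
    DISTINCT expressions, i.e. the path already emits distinct rows."""
    needed = set(distinct_exprs)
    return any(set(uk) <= needed for uk in path_uniquekeys)

# A skip scan over an index on (a, b) proves uniqueness on {a, b}:
skip_path_keys = [["a", "b"]]
print(path_is_unique_for(skip_path_keys, ["a", "b", "c"]))  # True
print(path_is_unique_for(skip_path_keys, ["a"]))            # False
```

Note the direction of the test: rows unique on (a, b) are necessarily unique on any superset such as (a, b, c), but not on a subset like (a) alone.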
Re: Index Skip Scan
> On Fri, Feb 14, 2020 at 01:18:20PM +0100, Dmitry Dolgov wrote: > > On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote: > > The first attached (renamed to .txt not to confuse the cfbots) is a > > small patch that makes sure _bt_readpage is called under the proper > > condition as written in its comment, that is, the caller must have pinned > > and read-locked so->currPos.buf. This patch reveals many instances of > > breakage of the contract. > > Thanks! On top of which patch version can one apply it? I'm asking > because I believe I've addressed similar issues in the last version, and > the last proposed diff (after resolving some conflicts) breaks tests for > me, so I'm not sure if I'm missing something. > > At the same time, if you and Tomas strongly agree that it actually makes > sense to make the moving forward/reading backward case work with dead tuples > correctly, I'll take a shot and try to teach the code around _bt_skip to > do what is required for that. I can merge your changes there and we can > see what would be the result.

Here is something similar to what I had in mind. In this version of the patch IndexOnlyNext now verifies whether, due to visibility checks, we have returned to the same position as before while reading opposite to the advancing direction (similar to what is implemented inside _bt_skip for the situation when some distinct keys are eliminated due to scankey conditions). It's actually not as invasive as I feared, but still pretty hacky. I'm not sure if it's ok to compare the resulting heaptid in this situation, but all the mentioned tests are passing. Also, this version doesn't incorporate any planner feedback from Tomas yet; my intention is just to check whether it could be the right direction.

[Attachment: v32 patch 0001, "Unique key" (design by David Rowley, author Jesper Pedersen). It adds a UniqueKey node wrapping an equivalence clause, canon_uniquekeys and query_uniquekeys fields to PlannerInfo, a uniquekeys field to Path, outfuncs/print support, and a new file src/backend/optimizer/path/uniquekey.c; the flattened diff is truncated in this archive.]
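The revisit check described above - comparing heaptid after reading opposite to the advancing direction - can be caricatured with a toy index of (key, heaptid, visible) entries; all names here are invented for illustration, this is not the patch's code:

```python
def step_back_visible(entries, pos, last_returned_tid):
    """Step backward from pos over invisible entries; report whether we
    landed back on the tuple we already returned (same heap TID)."""
    i = pos - 1
    while i >= 0 and not entries[i][2]:   # skip tuples our snapshot can't see
        i -= 1
    if i < 0:
        return None, False
    key, tid, _ = entries[i]
    return (key, tid), tid == last_returned_tid

# The two entries between positions 0 and 3 are dead, so stepping back
# from position 3 lands on the tuple with heaptid 1 - the one we had
# just returned - and the TID comparison detects the wrap-around:
idx = [("a", 1, True), ("a", 2, False), ("b", 3, False), ("b", 4, True)]
print(step_back_visible(idx, 3, 1))  # (('a', 1), True)
```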
Re: Index Skip Scan
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote: > The first attached (renamed to .txt not to confuse the cfbots) is a > small patch that makes sure _bt_readpage is called under the proper > condition as written in its comment, that is, the caller must have pinned > and read-locked so->currPos.buf. This patch reveals many instances of > breakage of the contract.

Thanks! On top of which patch version can one apply it? I'm asking because I believe I've addressed similar issues in the last version, and the last proposed diff (after resolving some conflicts) breaks tests for me, so I'm not sure if I'm missing something.

At the same time, if you and Tomas strongly agree that it actually makes sense to make the moving forward/reading backward case work with dead tuples correctly, I'll take a shot and try to teach the code around _bt_skip to do what is required for that. I can merge your changes there and we can see what would be the result.
Re: Index Skip Scan
Thank you very much for the benchmarking! A slightly different topic from the latest branch.. At Sat, 8 Feb 2020 14:11:59 +0100, Tomas Vondra wrote in > >Yes, I've mentioned that already in one of the previous emails :) The > >simplest way I see to achieve what we want is to do something like in > >the attached modified version with a new hasDeclaredCursor field. It's not a > >final version though, but posted just for discussion, so feel free to > >suggest any improvements or alternatives. > > IMO the proper fix for this case (moving forward, reading backwards) is > simply making it work by properly checking deleted tuples etc. Not sure > why that would be so complex (haven't tried implementing it)?

I don't think it's that complex. But I suspect that it might be a bit harder starting from the current shape.

The first attached (renamed to .txt not to confuse the cfbots) is a small patch that makes sure _bt_readpage is called under the proper condition as written in its comment, that is, the caller must have pinned and read-locked so->currPos.buf. This patch reveals many instances of breakage of the contract. The second is a crude fix for the breakages, but the result seems far from neat. I think we need to rethink this, taking modification of the support functions into consideration.

> I think making this depend on things like declared cursor etc. is going > to be tricky, may easily be more complex than checking deleted tuples, > and the behavior may be quite surprising.

Sure.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center

[Attachments: two patches, flattened and truncated in this archive. Patch 0001 "debug aid" adds a temporary BufferLockAndPinHeldByMe() helper to bufmgr.c/bufmgr.h (true when the buffer is pinned and its content lock is held by the current backend) and asserts it at the top of _bt_readpage in nbtsearch.c. Patch 0002 "crude fix" reworks the buffer lock/pin handling inside _bt_skip in nbtsearch.c; its diff breaks off mid-hunk.]
Re: Index Skip Scan
> On Sat, Feb 08, 2020 at 01:31:02PM -0500, James Coleman wrote: > On Sat, Feb 8, 2020 at 10:24 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: > > > So in this case the skip scan is ~15x slower than the usual plan (index > > > only scan + unique). The reason why this happens is pretty simple - to > > > estimate the number of groups we multiply the ndistinct estimates for > > > the two columns (which both have n_distinct = 1), but then we cap > > > the estimate to 10% of the table. But when the columns are independent > > > with high cardinalities that under-estimates the actual value, making > > > the cost for skip scan much lower than it should be. > > > > The current implementation checks if we can find the next value on the > > same page to do a shortcut instead of tree traversal and improve such > > situations, but I can easily imagine that it's still not enough > > in some extreme situations. > > This is almost certainly rehashing already covered ground, but since I > doubt it's been discussed recently, would you be able to summarize > that choice (not to always get the next tuple by scanning from the top > of the tree again) and the performance/complexity tradeoffs?

Yeah, this part of the discussion happened some time ago already. The idea [1] is to protect ourselves at least partially from incorrect ndistinct estimations. Simply jumping around an index means that even if the next key we're searching for is on the same page as the previous one, we still end up doing a search from the root of the tree, which is of course less efficient than just checking right on the page before jumping further. The performance tradeoff in this case is simple: we make the regular use case slightly slower, but can perform better in the worst case scenarios.
The complexity tradeoff was never discussed, but I guess everyone assumed it's relatively straightforward to check the current page and return early if something was found before jumping. [1]: https://www.postgresql.org/message-id/CA%2BTgmoY7QTHhzLWZupNSyyqFRBfMgYocg3R-6g%3DDRgT4-KBGqg%40mail.gmail.com
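The tradeoff can be sketched with a toy paged "index": look for the next distinct key on the current page first, and fall back to a (simulated) descent from the root only when the page is exhausted. This is an illustration of the idea, not the nbtree implementation:

```python
import bisect

def next_distinct_key(pages, page_no, cur_key):
    """Return (next_key, used_same_page_shortcut), or (None, False)."""
    page = pages[page_no]
    i = bisect.bisect_right(page, cur_key)
    if i < len(page):                 # cheap: the next key is on this page
        return page[i], True
    for p in pages[page_no + 1:]:     # stand-in for a new descent from the root
        j = bisect.bisect_right(p, cur_key)
        if j < len(p):
            return p[j], False
    return None, False

pages = [[1, 1, 1, 2], [2, 2, 3, 3], [4, 4, 4, 4]]
print(next_distinct_key(pages, 0, 1))  # (2, True): found without descending
print(next_distinct_key(pages, 1, 3))  # (4, False): page exhausted, "descend"
```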
Re: Index Skip Scan
On Sat, Feb 8, 2020 at 10:24 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: > > So in this case the skip scan is ~15x slower than the usual plan (index > > only scan + unique). The reason why this happens is pretty simple - to > > estimate the number of groups we multiply the ndistinct estimates for > > the two columns (which both have n_distinct = 1), but then we cap > > the estimate to 10% of the table. But when the columns are independent > > with high cardinalities that under-estimates the actual value, making > > the cost for skip scan much lower than it should be. > > The current implementation checks if we can find the next value on the > same page to do a shortcut instead of tree traversal and improve such > kind of situations, but I can easily imagine that it's still not enough > in some extreme situations. This is almost certainly rehashing already covered ground, but since I doubt it's been discussed recently, would you be able to summarize that choice (not to always get the next tuple by scanning from the top of the tree again) and the performance/complexity tradeoffs? Thanks, James
Re: Index Skip Scan
On Sat, Feb 08, 2020 at 04:24:40PM +0100, Dmitry Dolgov wrote: On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: I've done some testing and benchmarking of the v31 patch, looking for regressions, costing issues etc. Essentially, I've run a bunch of SELECT DISTINCT queries on data sets of various size, number of distinct values etc. The results are fairly large, so I've uploaded them to github https://github.com/tvondra/skip-scan-test

Thanks a lot for testing!

There are a couple of regressions, where the plan with skip scan enabled is ~10x slower. But this seems to happen only in high-cardinality cases where we misestimate the number of groups. Consider a table with two independent columns

CREATE TABLE t (a text, b text);
INSERT INTO t SELECT
  md5((1*random())::int::text),
  md5((1*random())::int::text)
FROM generate_series(1,100) s(i);
CREATE INDEX ON t(a,b);
ANALYZE;

which then behaves like this:

test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 3138.222 ms (00:03.138)
test=# set enable_indexskipscan = off;
Time: 0.312 ms
test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 199.749 ms

So in this case the skip scan is ~15x slower than the usual plan (index only scan + unique). The reason why this happens is pretty simple - to estimate the number of groups we multiply the ndistinct estimates for the two columns (which both have n_distinct = 1), but then we cap the estimate to 10% of the table. But when the columns are independent with high cardinalities that under-estimates the actual value, making the cost for skip scan much lower than it should be.

The current implementation checks if we can find the next value on the same page to do a shortcut instead of tree traversal and improve such situations, but I can easily imagine that it's still not enough in some extreme situations.

Yeah. I'm not sure there's room for further improvements.
The regressed cases were subject to the 10% cap, and with ndistinct being more than 10% of the table, we probably can find many distinct keys on each index page - we know that every ~10 rows the values change. I don't think this is an issue the skipscan patch needs to fix, though. Firstly, the regressed cases are a tiny minority. Secondly, we already have a way to improve the root cause - creating extended stats with ndistinct coefficients generally makes the problem go away.

Yes, I agree.

One interesting observation, however, is that this regression only happened with text columns but not with int or bigint. My assumption is that this is due to text comparisons being much more expensive. Not sure if there is something we could do to deal with this - reduce the number of comparisons or something?

Hm, interesting. I need to check that we do not do any unnecessary comparisons.

On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote: > Yes, I've mentioned that already in one of the previous emails :) The > simplest way I see to achieve what we want is to do something like in > the attached modified version with a new hasDeclaredCursor field. It's not a > final version though, but posted just for discussion, so feel free to > suggest any improvements or alternatives.

IMO the proper fix for this case (moving forward, reading backwards) is simply making it work by properly checking deleted tuples etc. Not sure why that would be so complex (haven't tried implementing it)?

It's probably not that complex by itself, but requires changing the isolation of responsibilities. At the moment the current implementation leaves jumping over a tree fully to _bt_skip, and heap visibility checks only to IndexOnlyNext.
To check deleted tuples properly we need to either verify the corresponding heap tuple's visibility inside _bt_skip (as I've mentioned in one of the previous emails, checking if an index tuple is dead is not enough), or teach the code in IndexOnlyNext to understand that _bt_skip can lead to returning the same tuple while moving forward & reading backward. Do you think it still makes sense to go this way?

Not sure. I have to think about this first.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
> On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote: > > I've done some testing and benchmarking of the v31 patch, looking for > regressions, costing issues etc. Essentially, I've run a bunch of SELECT > DISTINCT queries on data sets of various size, number of distinct values > etc. The results are fairly large, so I've uploaded them to github > > https://github.com/tvondra/skip-scan-test

Thanks a lot for testing!

> There are a couple of regressions, where the plan with skip scan enabled
> is ~10x slower. But this seems to happen only in high-cardinality cases
> where we misestimate the number of groups. Consider a table with two
> independent columns
>
> CREATE TABLE t (a text, b text);
> INSERT INTO t SELECT
> md5((1*random())::int::text),
> md5((1*random())::int::text)
> FROM generate_series(1,100) s(i);
>
> CREATE INDEX ON t(a,b);
>
> ANALYZE;
>
> which then behaves like this:
>
> test=# select * from (select distinct a,b from t) foo offset 1000;
> Time: 3138.222 ms (00:03.138)
> test=# set enable_indexskipscan = off;
> Time: 0.312 ms
> test=# select * from (select distinct a,b from t) foo offset 1000;
> Time: 199.749 ms
>
> So in this case the skip scan is ~15x slower than the usual plan (index
> only scan + unique). The reason why this happens is pretty simple - to
> estimate the number of groups we multiply the ndistinct estimates for
> the two columns (which both have n_distinct = 1), but then we cap
> the estimate to 10% of the table. But when the columns are independent
> with high cardinalities that under-estimates the actual value, making
> the cost for skip scan much lower than it should be.

The current implementation checks if we can find the next value on the same page to do a shortcut instead of tree traversal and improve such situations, but I can easily imagine that it's still not enough in some extreme situations.

> I don't think this is an issue the skipscan patch needs to fix, though.
> Firstly, the regressed cases are a tiny minority. Secondly, we already > have a way to improve the root cause - creating extended stats with > ndistinct coefficients generally makes the problem go away.

Yes, I agree.

> One interesting observation however is that this regression only > happened with text columns but not with int or bigint. My assumption is > that this is due to text comparisons being much more expensive. Not sure > if there is something we could do to deal with this - reduce the number > of comparisons or something?

Hm, interesting. I need to check that we do not do any unnecessary comparisons.

> On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote: > > Yes, I've mentioned that already in one of the previous emails :) The > > simplest way I see to achieve what we want is to do something like in > > the attached modified version with a new hasDeclaredCursor field. It's not a > > final version though, but posted just for discussion, so feel free to > > suggest any improvements or alternatives. > > IMO the proper fix for this case (moving forward, reading backwards) is > simply making it work by properly checking deleted tuples etc. Not sure > why that would be so complex (haven't tried implementing it)?

It's probably not that complex by itself, but requires changing the isolation of responsibilities. At the moment the current implementation leaves jumping over a tree fully to _bt_skip, and heap visibility checks only to IndexOnlyNext. To check deleted tuples properly we need to either verify the corresponding heap tuple's visibility inside _bt_skip (as I've mentioned in one of the previous emails, checking if an index tuple is dead is not enough), or teach the code in IndexOnlyNext to understand that _bt_skip can lead to returning the same tuple while moving forward & reading backward. Do you think it still makes sense to go this way?
Re: Index Skip Scan
OK, a couple more comments based on a quick review of the patch, particularly the part related to planning:

1) create_skipscan_unique_path has one assert commented out. Either it's something we want to enforce, or we should remove it.

/*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/

2) I wonder if the current costing model is overly optimistic. We simply copy the startup cost from the IndexPath, which seems fine. But for total cost we do this:

pathnode->path.total_cost = basepath->startup_cost * numGroups;

which seems a bit too simplistic. The startup cost is pretty much just the cost to find the first item in the index, but surely we need to do more to find the next group - we need to do comparisons to skip some of the items, etc. If we think that's unnecessary, we need to explain it in a comment or something.

3) I don't think we should make planning dependent on hasDeclaredCursor.

4) I'm not quite sure how sensible it is to create a new IndexPath in create_skipscan_unique_path. On the one hand it works, but I don't think any other path is constructed like this, so I wonder if we're missing something. Perhaps it'd be better to just add a new path node on top of the IndexPath, and then handle this in create_plan. We already do something similar for Bitmap Index Scans, where we create a different executor node from IndexPath depending on the parent node.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
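To make point 2 concrete, here is the patch's formula next to a variant that also charges for the comparisons made while skipping; the per-group comparison model is entirely hypothetical, just to show where the optimism creeps in:

```python
def skipscan_total_cost_patch(startup_cost, num_groups):
    # The patch's model: one index descent ("startup") per group.
    return startup_cost * num_groups

def skipscan_total_cost_refined(startup_cost, num_groups,
                                skipped_per_group, cmp_cost):
    # Hypothetical refinement: each group also pays for comparisons
    # performed while skipping over non-qualifying items.
    return num_groups * (startup_cost + skipped_per_group * cmp_cost)

print(skipscan_total_cost_patch(4.0, 100))  # 400.0
print(skipscan_total_cost_refined(4.0, 100, 10, 0.0025))
```

The gap between the two grows with the number of items skipped per group, which is exactly the quantity the patch's formula ignores.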
Re: Index Skip Scan
Hi, I've done some testing and benchmarking of the v31 patch, looking for regressions, costing issues etc. Essentially, I've run a bunch of SELECT DISTINCT queries on data sets of various size, number of distinct values etc. The results are fairly large, so I've uploaded them to github

https://github.com/tvondra/skip-scan-test

There are four benchmark groups, depending on how the data are generated, the availability of extended stats, and whether the columns are independent:

1) skipscan - just indexes, columns are independent
2) skipscan-with-stats - indexes and extended stats, independent columns
3) skipscan-correlated - just indexes, correlated columns
4) skipscan-correlated-with-stats - indexes and extended stats, correlated columns

The github repository contains a *.ods spreadsheet comparing duration with the regular query plan (no skip scan) and skip scan. In general, there are pretty massive speedups, often by about two orders of magnitude.

There are a couple of regressions, where the plan with skip scan enabled is ~10x slower. But this seems to happen only in high-cardinality cases where we misestimate the number of groups. Consider a table with two independent columns

CREATE TABLE t (a text, b text);
INSERT INTO t SELECT
  md5((1*random())::int::text),
  md5((1*random())::int::text)
FROM generate_series(1,100) s(i);
CREATE INDEX ON t(a,b);
ANALYZE;

which then behaves like this:

test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 3138.222 ms (00:03.138)
test=# set enable_indexskipscan = off;
Time: 0.312 ms
test=# select * from (select distinct a,b from t) foo offset 1000;
Time: 199.749 ms

So in this case the skip scan is ~15x slower than the usual plan (index only scan + unique). The reason why this happens is pretty simple - to estimate the number of groups we multiply the ndistinct estimates for the two columns (which both have n_distinct = 1), but then we cap the estimate to 10% of the table.
But when the columns are independent with high cardinalities, this under-estimates the actual value, making the cost for the skip scan much lower than it should be.

I don't think this is an issue the skipscan patch needs to fix, though. Firstly, the regressed cases are a tiny minority. Secondly, we already have a way to improve the root cause - creating extended stats with ndistinct coefficients generally makes the problem go away.

One interesting observation, however, is that this regression only happened with text columns, not with int or bigint. My assumption is that this is due to text comparisons being much more expensive. Not sure if there is something we could do to deal with this - reduce the number of comparisons or something?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
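The estimation step described above can be sketched in a few lines (a simplification of my own, assuming only the "multiply per-column ndistinct, cap at 10% of the rows" behaviour described in this thread, not the actual estimate_num_groups() code):

```python
# Sketch of the group-count estimate described above: multiply the
# per-column ndistinct values (assuming independence) and cap the
# product at 10% of the row count. This is a simplification of the
# planner's behaviour as described in the message, not the actual
# estimate_num_groups() code.

def estimate_groups(ndistincts, nrows):
    est = 1.0
    for nd in ndistincts:
        est *= nd
    return min(est, 0.1 * nrows)

# Two independent columns with 50,000 distinct values each in a
# 1,000,000-row table: nearly every row is its own (a,b) group, but
# the capped estimate claims 100,000 groups - roughly 10x too low,
# which makes the skip scan look far cheaper than it really is.
assert estimate_groups([50_000, 50_000], 1_000_000) == 100_000.0
```

This also illustrates why ndistinct extended statistics help: they replace the independence product with a measured multi-column estimate, so the cap rarely has to kick in.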
Re: Index Skip Scan
On Fri, Feb 07, 2020 at 05:25:43PM +0100, Dmitry Dolgov wrote:
> > On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in
> > > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
> > > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in
> > > > We could add an additional parameter "in_cursor" to
> > > > ExecSupportBackwardScan and let skip scan return false if in_cursor is
> > > > true, but I'm not sure it's acceptable.
> > >
> > > I also was thinking about whether it's possible to use
> > > ExecSupportBackwardScan here, but skip scan is just a mode of an
> > > index/indexonly scan. Which means that ExecSupportBackwardScan also need
> > > to know somehow if this mode is being used, and then, since this
> > > function is called after it's already decided to use skip scan in the
> > > resulting plan, somehow correct the plan (exclude skipping and try to
> > > find next best path?) - do I understand your suggestion correct?
> >
> > I didn't thought so hardly, but a bit of confirmation told me that
> > IndexSupportsBackwardScan returns fixed flag for AM. It seems that
> > things are not that simple.
>
> Yes, I've mentioned that already in one of the previous emails :) The
> simplest way I see to achieve what we want is to do something like in
> attached modified version with a new hasDeclaredCursor field. It's not a
> final version though, but posted just for discussion, so feel free to
> suggest any improvements or alternatives.

IMO the proper fix for this case (moving forward, reading backwards) is simply making it work by properly checking deleted tuples etc. Not sure why that would be so much more complex (I haven't tried implementing it)?

I think making this depend on things like a declared cursor etc. is going to be tricky, may easily be more complex than checking deleted tuples, and the behavior may be quite surprising.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
> On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote: > At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> > wrote in > > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > > > wrote in > > > We could add an additional parameter "in_cursor" to > > > ExecSupportBackwardScan and let skip scan return false if in_cursor is > > > true, but I'm not sure it's acceptable. > > > > I also was thinking about whether it's possible to use > > ExecSupportBackwardScan here, but skip scan is just a mode of an > > index/indexonly scan. Which means that ExecSupportBackwardScan also need > > to know somehow if this mode is being used, and then, since this > > function is called after it's already decided to use skip scan in the > > resulting plan, somehow correct the plan (exclude skipping and try to > > find next best path?) - do I understand your suggestion correct? > > I didn't thought so hardly, but a bit of confirmation told me that > IndexSupportsBackwardScan returns fixed flag for AM. It seems that > things are not that simple. Yes, I've mentioned that already in one of the previous emails :) The simplest way I see to achieve what we want is to do something like in attached modified version with a new hasDeclaredCursor field. It's not a final version though, but posted just for discussion, so feel free to suggest any improvements or alternatives. >From 22e6b4ccd5f79ca069bd5cd90ba3696dd97f76ea Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 9 Jul 2019 06:44:57 -0400 Subject: [PATCH v33 1/2] Unique key Design by David Rowley. 
Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 3 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 71 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 17 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 373 insertions(+), 13 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..16083e7a7e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list); WRITE_NODE_FIELD(query_pathkeys); + WRITE_NODE_FIELD(query_uniquekeys); WRITE_NODE_FIELD(group_pathkeys); WRITE_NODE_FIELD(window_pathkeys); WRITE_NODE_FIELD(distinct_pathkeys); @@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node) WRITE_BOOL_FIELD(pk_nulls_first); } +static void +_outUniqueKey(StringInfo str, 
const UniqueKey *node) +{ + WRITE_NODE_TYPE("UNIQUEKEY"); + + WRITE_NODE_FIELD(eq_clause); +} + static void _outPathTarget(StringInfo str, const PathTarget *node) { @@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj) case T_PathKey: _outPathKey(str, obj); break; + case T_UniqueKey: +_outUniqueKey(str, obj); +break; case T_PathTarget: _outPathTarget(str, obj); break; diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c index 42476724d8..d286b34544 100644 --- a/src/backend/nodes/print.c +++ b/src/backend/nodes/print.c @@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable) printf(")\n"); } +/* + * print_uniquekeys - + * uniquekeys list of UniqueKeys + */ +void +print_uniquekeys(const List *uniquekeys, const List *rtable) +{ + ListCell *l; + + printf("("); + foreach(l, uniquekeys) + { + UniqueKey *unique_key = (UniqueKey *) lfirst(l); + EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause; + ListCell *k; + bool first = true; + + /* chase up */ + while
Re: Index Skip Scan
Sorry, I forgot to write more significant thing. On 2020/02/06 21:22, Kyotaro Horiguchi wrote: At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in We could add an additional parameter "in_cursor" to ExecSupportBackwardScan and let skip scan return false if in_cursor is true, but I'm not sure it's acceptable. I also was thinking about whether it's possible to use ExecSupportBackwardScan here, but skip scan is just a mode of an index/indexonly scan. Which means that ExecSupportBackwardScan also need to know somehow if this mode is being used, and then, since this function is called after it's already decided to use skip scan in the resulting plan, somehow correct the plan (exclude skipping and try to find next best path?) - do I understand your suggestion correct? No. I thought of the opposite thing. I meant that IndexSupportsBackwardScan returns false if Index(Only)Scan is going to do skip scan. But I found that the function doesn't have access to plan node nor executor node. So I wrote as the follows. I didn't thought so hardly, but a bit of confirmation told me that IndexSupportsBackwardScan returns fixed flag for AM. It seems that things are not that simple. regards. -- Kyotaro Horiguchi
Re: Index Skip Scan
At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > > wrote in > > We could add an additional parameter "in_cursor" to > > ExecSupportBackwardScan and let skip scan return false if in_cursor is > > true, but I'm not sure it's acceptable. > > I also was thinking about whether it's possible to use > ExecSupportBackwardScan here, but skip scan is just a mode of an > index/indexonly scan. Which means that ExecSupportBackwardScan also need > to know somehow if this mode is being used, and then, since this > function is called after it's already decided to use skip scan in the > resulting plan, somehow correct the plan (exclude skipping and try to > find next best path?) - do I understand your suggestion correct? I didn't thought so hardly, but a bit of confirmation told me that IndexSupportsBackwardScan returns fixed flag for AM. It seems that things are not that simple. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Index Skip Scan
> On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > wrote in > > > On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote: > > > > > > > this point me and Jesper inclined to go with the second option. But > > > > maybe > > > > I'm missing something, are there any other suggestions? > > > > > > Unfortunately I figured this would need a more invasive fix. I tend to > > > agree that it'd be better to not skip in situations like this. I think > > > it'd make most sense to make any plan for these 'prepare/fetch' queries > > > would not use skip, but rather a materialize node, right? > > > > Yes, sort of, without a skip scan it would be just an index only scan > > with unique on top. Actually it's not immediately clean how to achieve > > this, since at the moment, when planner is deciding to consider index > > skip scan, there is no information about neither direction nor whether > > we're dealing with a cursor. Maybe we can somehow signal to the decision > > logic that the root was a DeclareCursorStmt by e.g. introducing a new > > field to the query structure (or abusing an existing one, since > > DeclareCursorStmt is being processed by standard_ProcessUtility, just > > for a test I've tried to use utilityStmt of a nested statement hoping > > that it's unused and it didn't break tests yet). > > Umm. I think it's a wrong direction. While defining a cursor, > default scrollability is decided based on the query allows backward > scan or not. That is, the definition of backward-scan'ability is not > just whether it can scan from the end toward the beginning, but > whether it can go back and forth freely or not. In that definition, > the *current* skip scan does not supporting backward scan. If we want > to allow descending order-by in a query, we should support scrollable > cursor, too. 
> > We could add an additional parameter "in_cursor" to > ExecSupportBackwardScan and let skip scan return false if in_cursor is > true, but I'm not sure it's acceptable. I also was thinking about whether it's possible to use ExecSupportBackwardScan here, but skip scan is just a mode of an index/indexonly scan. Which means that ExecSupportBackwardScan also need to know somehow if this mode is being used, and then, since this function is called after it's already decided to use skip scan in the resulting plan, somehow correct the plan (exclude skipping and try to find next best path?) - do I understand your suggestion correct?
Re: Index Skip Scan
At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in > > On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote: > > > > > this point me and Jesper inclined to go with the second option. But maybe > > > I'm missing something, are there any other suggestions? > > > > Unfortunately I figured this would need a more invasive fix. I tend to > > agree that it'd be better to not skip in situations like this. I think it'd > > make most sense to make any plan for these 'prepare/fetch' queries would > > not use skip, but rather a materialize node, right? > > Yes, sort of, without a skip scan it would be just an index only scan > with unique on top. Actually it's not immediately clean how to achieve > this, since at the moment, when planner is deciding to consider index > skip scan, there is no information about neither direction nor whether > we're dealing with a cursor. Maybe we can somehow signal to the decision > logic that the root was a DeclareCursorStmt by e.g. introducing a new > field to the query structure (or abusing an existing one, since > DeclareCursorStmt is being processed by standard_ProcessUtility, just > for a test I've tried to use utilityStmt of a nested statement hoping > that it's unused and it didn't break tests yet). Umm. I think it's a wrong direction. While defining a cursor, default scrollability is decided based on the query allows backward scan or not. That is, the definition of backward-scan'ability is not just whether it can scan from the end toward the beginning, but whether it can go back and forth freely or not. In that definition, the *current* skip scan does not supporting backward scan. If we want to allow descending order-by in a query, we should support scrollable cursor, too. We could add an additional parameter "in_cursor" to ExecSupportBackwardScan and let skip scan return false if in_cursor is true, but I'm not sure it's acceptable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Index Skip Scan
> On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote: > > > this point me and Jesper inclined to go with the second option. But maybe > > I'm missing something, are there any other suggestions? > > Unfortunately I figured this would need a more invasive fix. I tend to agree > that it'd be better to not skip in situations like this. I think it'd make > most sense to make any plan for these 'prepare/fetch' queries would not use > skip, but rather a materialize node, right? Yes, sort of, without a skip scan it would be just an index only scan with unique on top. Actually it's not immediately clean how to achieve this, since at the moment, when planner is deciding to consider index skip scan, there is no information about neither direction nor whether we're dealing with a cursor. Maybe we can somehow signal to the decision logic that the root was a DeclareCursorStmt by e.g. introducing a new field to the query structure (or abusing an existing one, since DeclareCursorStmt is being processed by standard_ProcessUtility, just for a test I've tried to use utilityStmt of a nested statement hoping that it's unused and it didn't break tests yet).
RE: Index Skip Scan
> this point me and Jesper inclined to go with the second option. But maybe > I'm missing something, are there any other suggestions? Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better to not skip in situations like this. I think it'd make most sense to make any plan for these 'prepare/fetch' queries would not use skip, but rather a materialize node, right? -Floris
Re: Index Skip Scan
> On Mon, Jan 27, 2020 at 02:00:39PM +, Floris Van Nee wrote:
>
> Thanks for the new patch! I tested it and managed to find a case that
> causes some issues. Here's how to reproduce:

So, after a bit of investigation I found the issue (it was actually there even in the previous version). In this one case of moving forward while reading backward - exactly the scenario you've described above - the current implementation was not ignoring deleted tuples. My first idea to fix this was to use _bt_readpage when necessary and put a couple of _bt_killitems calls before we leave a page while jumping, so that deleted tuples would be ignored.

To demonstrate it visually, let's say we want to go backward on a cursor over an ORDER BY a DESC, b DESC query, i.e. return: (1,100), (2,100), (3,100) etc. To achieve that we jump from (1,1) to (1,100), from (2,1) to (2,100) and so on. If some values are deleted, we need to read backward. E.g. if (3,100) is deleted, we need to return (3,99).

+---------------+---------------+---------------+---------------+
|               |               |               |               |
| 1,1 ... 1,100 | 2,1 ... 2,100 | 3,1 ... 3,100 | 4,1 ... 4,100 |
|               |               |               |               |
+---------------+---------------+---------------+---------------+
  |         ^     |         ^     |         ^     |         ^
  |         |     |         |     |         |     |         |
  +---------+     +---------+     +---------+     +---------+

If it happens that a whole value series is deleted, we return to the previous value and need to detect that situation. E.g. if all the values from (3,1) to (3,100) were deleted, we will return to (2,100).

+---------------+---------------+     +---------------+
|               |               |     |               |
| 1,1 ... 1,100 | 2,1 ... 2,100 |<--+ | 4,1 ... 4,100 |
|               |               |   | |               |
+---------------+---------------+   | +---------------+
  |         ^     |         ^       |   |         ^
  |         |     |         |       +---+         |
  +---------+     +---------+                     +----

This is all implemented inside _bt_skip. Unfortunately, as I see it now, the idea of relying on the ability to skip dead index tuples without checking the heap tuple is not reliable, since an index tuple will be added to killedItems and can be marked as dead only when not a single transaction can see it anymore. Potentially there are two possible solutions:

* Adjust the code in nodeIndexOnlyscan to perform a proper visibility check and understand if we returned back. Obviously this will make the patch more invasive.

* Reduce the scope of the patch, and simply not apply jumping in this case. This means less functionality but hopefully still brings some value.

At this point Jesper and I are inclined to go with the second option. But maybe I'm missing something - are there any other suggestions?
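The jump-and-read-backward behaviour described above can be modelled in a few lines of Python (a toy model of the intended result, not the _bt_skip implementation: a plain backward walk with prefix deduplication stands in for the real jump machinery, and "dead" tuples are just a set):

```python
# Toy model of the jump-and-read-backward behaviour: for each distinct
# value of a (scanning in descending order) return the tuple with the
# highest b, skipping dead tuples. A plain backward scan with prefix
# deduplication stands in for the real jump machinery in _bt_skip.

def distinct_on_desc(tuples, dead):
    """tuples: list of (a, b); dead: set of deleted tuples."""
    result, seen = [], set()
    for t in sorted(tuples, reverse=True):  # walk the "index" backward
        if t in dead or t[0] in seen:
            continue  # ignore dead tuples and already-emitted prefixes
        seen.add(t[0])
        result.append(t)
    return result

tuples = [(a, b) for a in (1, 2, 3, 4) for b in range(1, 101)]

# if (3,100) is deleted, (3,99) must be returned instead
assert distinct_on_desc(tuples, {(3, 100)})[1] == (3, 99)

# if the whole a=3 series is deleted, the scan lands on (2,100)
dead = {(3, b) for b in range(1, 101)}
assert distinct_on_desc(tuples, dead) == [(4, 100), (2, 100), (1, 100)]
```

The hard part the toy ignores is precisely the visibility question discussed above: here "dead" is a ground-truth set, whereas the real scan can only trust killedItems once no transaction can see the tuple anymore.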
Re: Index Skip Scan
Oh, interesting, thank you. I believe I know what happened: there is one unnecessary locking part that ultimately only causes problems, plus one direct access to page items without _bt_readpage. Will post a new version soon.

On Mon, Jan 27, 2020 at 3:00 PM Floris Van Nee wrote:
>
> Hi Dmitry,
>
> Thanks for the new patch! I tested it and managed to find a case that
> causes some issues. Here's how to reproduce:
>
> drop table if exists t;
> create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a,
>   generate_series(1,1000) b;
> create index on t (a,b,c,d);
>
> -- correct
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
>   from t order by a desc, b desc; fetch forward all from c; fetch backward
>   all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a |  b   | c | d
> ---+------+---+----
>  5 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  3 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  1 | 1000 | 0 | 10
> (5 rows)
>
>  a |  b   | c | d
> ---+------+---+----
>  1 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  3 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  5 | 1000 | 0 | 10
> (5 rows)
>
> -- now delete some rows
> postgres=# delete from t where a=3;
> DELETE 1000
>
> -- and rerun: error is thrown
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
>   from t order by a desc, b desc; fetch forward all from c; fetch backward
>   all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a |  b   | c | d
> ---+------+---+----
>  5 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  1 | 1000 | 0 | 10
> (4 rows)
>
> ERROR:  lock buffer_content is not held
> ROLLBACK
>
> A slightly different situation arises when executing the cursor with an
> ORDER BY a, b instead of the ORDER BY a DESC, b DESC:
>
> -- recreate table again and execute the delete as above
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
>   from t order by a, b; fetch forward all from c; fetch backward all from c;
>   commit;
> BEGIN
> DECLARE CURSOR
>  a |  b  | c | d
> ---+-----+---+----
>  1 |   1 | 1 | 10
>  2 |   1 | 1 | 10
>  4 |   1 | 1 | 10
>  5 |   1 | 1 | 10
> (4 rows)
>
>  a |  b  | c | d
> ---+-----+---+----
>  5 |   1 | 1 | 10
>  4 |   1 | 1 | 10
>  2 | 827 | 1 | 10
>  1 |   1 | 1 | 10
> (4 rows)
>
> COMMIT
>
> And lastly, you'll also get incorrect results if you do the delete
> slightly differently:
>
> -- leave one row where a=3 and b=1000
> postgres=# delete from t where a=3 and b<=999;
> -- the cursor query above won't show any of the a=3 rows even though they
>    should
>
> -Floris
RE: Index Skip Scan
Hi Dmitry,

Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:

drop table if exists t;
create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a,
  generate_series(1,1000) b;
create index on t (a,b,c,d);

-- correct
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
  from t order by a desc, b desc; fetch forward all from c; fetch backward
  all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(5 rows)

 a |  b   | c | d
---+------+---+----
 1 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 5 | 1000 | 0 | 10
(5 rows)

-- now delete some rows
postgres=# delete from t where a=3;
DELETE 1000

-- and rerun: error is thrown
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
  from t order by a desc, b desc; fetch forward all from c; fetch backward
  all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(4 rows)

ERROR:  lock buffer_content is not held
ROLLBACK

A slightly different situation arises when executing the cursor with an ORDER BY a, b instead of the ORDER BY a DESC, b DESC:

-- recreate table again and execute the delete as above
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d
  from t order by a, b; fetch forward all from c; fetch backward all from c;
  commit;
BEGIN
DECLARE CURSOR
 a |  b  | c | d
---+-----+---+----
 1 |   1 | 1 | 10
 2 |   1 | 1 | 10
 4 |   1 | 1 | 10
 5 |   1 | 1 | 10
(4 rows)

 a |  b  | c | d
---+-----+---+----
 5 |   1 | 1 | 10
 4 |   1 | 1 | 10
 2 | 827 | 1 | 10
 1 |   1 | 1 | 10
(4 rows)

COMMIT

And lastly, you'll also get incorrect results if you do the delete slightly differently:

-- leave one row where a=3 and b=1000
postgres=# delete from t where a=3 and b<=999;
-- the cursor query above won't show any of the a=3 rows even though they
   should
-Floris
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote: > > > Let me try to clarify. After we return the first tuple, so->currPos.buf is > > pointing to page=1 in my example (it's the only page after all). We've > > returned item=8. Then the split happens and the items get rearranged as in > > my example. We're still pointing with so->currPos.buf to page=1, but the > > page now contains [2,4]. The split happened to the right, so there's a > > page=2 with [5,6,8], however the ongoing index scan is unaware of that. > > Now _bt_skip gets called to fetch the next tuple. It starts by checking > > _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the > > result of which will be 'true': we're comparing the skip key to the low key > > of the page. So it thinks the next key can be found on the current page. It > > locks the page and does a _binsrch, finding item=4 to be returned. > > > > The problem here is that _bt_scankey_within_page mistakenly returns true, > > thereby limiting the search to just the page that it's pointing to already. > > It may be fine to just fix this function to return the proper value (I > > guess it'd also need to look at the high key in this example). It could > > also be fixed by not looking at the lo/hi key of the page, but to use the > > local tuple buffer instead. We already did a _read_page once, so if we have > > any matching tuples on that specific page, we have them locally in the > > buffer already. That way we never need to lock the same page twice. > > Oh, that's what you mean. Yes, I was somehow tricked by the name of this > function and didn't notice that it checks only one boundary, so in case > of backward scan it returns wrong result. I think in the situation > you've describe it would actually not find any item on the current page > and restart from the root, but nevertheless we need to check for both > keys in _bt_scankey_within_page. Thanks again everyone for commentaries and clarification. 
Here is a version where hopefully I've addressed all the mentioned issues.

As mentioned in the _bt_skip comments, previously we were moving left to check the next page, to avoid significant issues in case ndistinct was underestimated and we need to skip too often. To make this safe in the presence of splits, we would need to remember the original page and move right again until we find a page whose right link points to it. It's not clear whether it's worth increasing complexity for this sort of "edge case" with ndistinct estimation while moving left, so at least for now we ignore this in the implementation and just start from the root immediately.

The offset-based code for moving forward/reading backward was replaced with remembering a start index tuple and attempting to find it on the new page. Also, a missing page lock before _bt_scankey_within_page was added, and _bt_scankey_within_page now checks both page boundaries.

>From c363e1f5dc7e2f0288fb04ca80bd073229c458a1 Mon Sep 17 00:00:00 2001
From: jesperpedersen
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH v31 1/2] Unique key

Design by David Rowley.
Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 3 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 71 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 17 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 373 insertions(+), 13 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..16083e7a7e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list);
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan wrote: > On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee > wrote: > > As far as I know, a page split could happen at any random element in the > > page. One of the situations we could be left with is: > > Leaf page 1 = [2,4] > > Leaf page 2 = [5,6,8] > > However, our scan is still pointing to leaf page 1. For non-skip scans this > > is not a problem, as we already read all matching elements in our local > > buffer and we'll return those. But the skip scan currently: > > a) checks the lo-key of the page to see if the next prefix can be found on > > the leaf page 1 > > b) finds out that this is actually true > > c) does a search on the page and returns value=4 (while it should have > > returned value=6) > > > > Peter, is my understanding about the btree internals correct so far? > > This is a good summary. This is the kind of scenario I had in mind > when I expressed a general concern about "stopping between pages". > Processing a whole page at a time is a crucial part of how > _bt_readpage() currently deals with concurrent page splits. I want to be clear about what it means that the page doesn't have a "low key". Let us once again start with a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] -- just like in Floris' original page split scenario. Let us also say that page 1 has a left sibling page -- page 0. Page 0 happens to have a high key with the integer value 0. So you could *kind of* claim that the "low key" of page 1 is the integer value 0 (page 1 values must be > 0) -- *not* the integer value 2 (the so-called "low key" here is neither > 2, nor >= 2). More formally, an invariant exists that says that all values on page 1 must be *strictly* greater than the integer value 0. 
However, this formal invariant thing is hard or impossible to rely on when we actually reach page 1 and want to know about its lower bound -- since there is no "low key" pivot tuple on page 1 (we can only speak of a "low key" as an abstract concept, or something that works transitively from the parent -- there is only a physical high key pivot tuple on page 1 itself).

Suppose further that page 0 is now empty, apart from its "all values on page are <= 0" high key (page 0 must have had a few negative integer values in its tuples at some point, but not anymore). VACUUM will delete the page, *changing the effective low key* of page 1 in the process. The lower bound from the shared parent page will move lower/left as a consequence of the deletion of page 0. nbtree page deletion makes the "keyspace move right, not left". So the "conceptual low key" of page 1 just went down from 0 to -5 (say), without there being any practical way for a skip scan reading page 1 to notice the change (the left sibling of page 0, page -1, has a high key of <= -5, say). Not only is it possible for somebody to insert the value 1 in page 1 -- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed to page 0 is still there -- all that page deletion changed about this tuple is its downlink, which now points to page 1 instead of page 0. Confusingly, page deletion removes the pivot tuple of the right sibling page from the parent -- *not* the pivot tuple of the empty page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in truncated pivot tuples, and the heap TID tiebreaker column -- in reality this would look a bit different because of those factors. See also: amcheck's bt_right_page_check_scankey() function, which has a huge comment that reasons about a race involving page deletion.
In general, page deletion is by far the biggest source of complexity when reasoning about the key space. -- Peter Geoghegan
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Oh, that's what you mean. Yes, I was somehow tricked by the name of this > function and didn't notice that it checks only one boundary, so in case > of backward scan it returns wrong result. I think in the situation > you've describe it would actually not find any item on the current page > and restart from the root, but nevertheless we need to check for both > keys in _bt_scankey_within_page. I suggest reading the nbtree README file's description of backwards scans. Read the paragraph that begins with 'We support the notion of an ordered "scan" of an index...'. I also suggest that you read a bit of the stuff in the large section on page deletion. Certainly read the paragraph that begins with 'Moving left in a backward scan is complicated because...'. It's important to grok why it's okay that we don't "couple" or "crab" buffer locks as we descend the tree with Lehman & Yao's design -- we can get away with having *no* interlock against page splits (e.g., pin, buffer lock) when we are "between" levels of the tree. This is safe, since the page that we land on must still be "substantively the same page", no matter how much time passes. That is, it must at least cover the leftmost portion of the keyspace covered by the original version of the page that we saw that we needed to descend to within the parent page. The worst that can happen is that we have to recover from a concurrent page split by moving right one or more times. (Actually, page deletion can change the contents of a page entirely, but that's not really an exception to the general rule -- page deletion is careful about recycling pages that an in flight index scan might land on.) Lehman & Yao don't have backwards scans (or left links, or page deletion). Unlike nbtree. This is why the usual Lehman & Yao guarantees don't quite work with backward scans. 
We must therefore compensate as described by the README file (basically, we check and re-check for races, possibly returning to the original page when we think that we might have overlooked something and need to make sure). It's an exception to the general rule, you could say. -- Peter Geoghegan
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee wrote: > The loose index scan shouldn't return a tuple twice. It should only be able > to skip 'further', so that shouldn't be a problem. Out of curiosity, why > can't index scans return the same tuple twice? Is there something in the > executor that isn't able to handle this? I have no reason to believe that the executor has a problem with index scans that return a tuple more than once, aside from the very obvious: in general, that will often be wrong. It might not be wrong when the scan happens to be input to a unique node anyway, or something like that. I'm not particularly concerned about it. I just wanted to be clear on our assumptions for loose index scans -- if loose index scans were allowed to return a tuple more than once, then that would at least have to be considered in the wider context of the executor (but apparently they're not, so no need to worry about it). This may have been mentioned somewhere already; if it has, then I must have missed it. -- Peter Geoghegan
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 09:08:59PM +, Floris Van Nee wrote: > > Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? > > > > The loose index scan shouldn't return a tuple twice. It should only be able > to skip 'further', so that shouldn't be a problem. Yes, it shouldn't happen.
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 05:24:43PM +, Floris Van Nee wrote: > > > Anyone please correct me if I'm wrong, but I think one case where the > > > current patch relies on some data from the page it has locked before is > > > in checking this hi/lo key. I think it's possible for the following > > > sequence to happen. Suppose we have a very simple one-leaf-page btree > > > containing four elements: leaf page 1 = [2,4,6,8] > > > We do a backwards index skip scan on this and have just returned our > > > first tuple (8). The buffer is left pinned but unlocked. Now, someone > > > else comes in and inserts a tuple (value 5) into this page, but suppose > > > the page happens to be full. So a page split occurs. As far as I know, a > > > page split could happen at any random element in the page. One of the > > > situations we could be left with is: > > > Leaf page 1 = [2,4] > > > Leaf page 2 = [5,6,8] > > > However, our scan is still pointing to leaf page 1. > > > If we just returned a tuple, the next action would be either to > > > check the next page for another key or to search down the tree. > But it won't look at the 'next page for another key', but rather at the 'same > page for another key', right? In the _bt_scankey_within_page shortcut we're > taking, there's no stepping to a next page involved. It just locks the page > again that it previously also locked. Yep, it would look only at the same page. Not sure what you mean by "another key"; if the current key is not found within the current page at the first stage, we restart from the root. > > Maybe I'm missing something in your scenario, but the latter will land us on a > > required page (we do not point to any leaf here), and before the former > > there is a check for the high/low key. Is there anything else missing? > Let me try to clarify. After we return the first tuple, so->currPos.buf is > pointing to page=1 in my example (it's the only page after all). We've > returned item=8.
Then the split happens and the items get rearranged as in my > example. We're still pointing with so->currPos.buf to page=1, but the page > now contains [2,4]. The split happened to the right, so there's a page=2 with > [5,6,8], however the ongoing index scan is unaware of that. > Now _bt_skip gets called to fetch the next tuple. It starts by checking > _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the > result of which will be 'true': we're comparing the skip key to the low key > of the page. So it thinks the next key can be found on the current page. It > locks the page and does a _binsrch, finding item=4 to be returned. > > The problem here is that _bt_scankey_within_page mistakenly returns true, > thereby limiting the search to just the page that it's pointing to already. > It may be fine to just fix this function to return the proper value (I guess > it'd also need to look at the high key in this example). It could also be > fixed by not looking at the lo/hi key of the page, but by using the local tuple > buffer instead. We already did a _read_page once, so if we have any matching > tuples on that specific page, we have them locally in the buffer already. > That way we never need to lock the same page twice. Oh, that's what you mean. Yes, I was somehow tricked by the name of this function and didn't notice that it checks only one boundary, so in case of a backward scan it returns a wrong result. I think in the situation you've described it would actually not find any item on the current page and restart from the root, but nevertheless we need to check for both keys in _bt_scankey_within_page.
RE: Index Skip Scan
> Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? > The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this? -Floris