Re: WIP: Covering + unique indexes.

2018-08-31 Thread Alvaro Herrera
I'm wondering what's the genesis of this coninclude column actually.
As far as I can tell, the only reason this column is there, is to be
able to print the INCLUDE clause in a UNIQUE/PK constraint in ruleutils
... but surely the same list can be obtained from the pg_index.indkey
instead?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: Covering + unique indexes.

2018-04-19 Thread Teodor Sigaev

Thank you, pushed

Peter Geoghegan wrote:

On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev  wrote:

Thank you, pushed.


Thanks.

I saw another preexisting issue, this time one that has been around
since 2007. Commit bc292937 forgot to remove a comment above
_bt_insertonpg() (the 'afteritem' stuff ended up being moved to the
bottom of _bt_findinsertloc(), where it remains today). The attached
patch fixes this, and in passing mentions the fact that
_bt_insertonpg() only performs retail insertions, and specifically
never inserts high key items.

I don't think it's necessary to add something about negative infinity
items to the same comment block. While it's true that _bt_insertonpg()
cannot truncate downlinks to make new minus infinity items, I see that
as a narrower issue.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-19 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev  wrote:
> Thank you, pushed.

Thanks.

I saw another preexisting issue, this time one that has been around
since 2007. Commit bc292937 forgot to remove a comment above
_bt_insertonpg() (the 'afteritem' stuff ended up being moved to the
bottom of _bt_findinsertloc(), where it remains today). The attached
patch fixes this, and in passing mentions the fact that
_bt_insertonpg() only performs retail insertions, and specifically
never inserts high key items.

I don't think it's necessary to add something about negative infinity
items to the same comment block. While it's true that _bt_insertonpg()
cannot truncate downlinks to make new minus infinity items, I see that
as a narrower issue.

-- 
Peter Geoghegan
From dced00be29775965a45c7889ca99e19b96d9e4d0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Wed, 18 Apr 2018 22:38:32 -0700
Subject: [PATCH] Adjust _bt_insertonpg() comments.

Remove an obsolete reference to the 'afteritem' argument, which was
removed by commit bc292937.  Add a comment that clarifies how
_bt_insertonpg() indirectly handles the insertion of high key items.

Author: Peter Geoghegan
---
 src/backend/access/nbtree/nbtinsert.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index dbd5c92..ecf4e53 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -817,18 +817,18 @@ _bt_findinsertloc(Relation rel,
  *		insertion, and the buffer must be pinned and write-locked.  On return,
  *		we will have dropped both the pin and the lock on the buffer.
  *
- *		When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
+ *		This routine only performs retail tuple insertions.  'itup' should
+ *		always be either a non-highkey leaf item, or a downlink (new high
+ *		key items are created indirectly, when a page is split).  When
+ *		inserting to a non-leaf page, 'cbuf' is the left-sibling of the page
+ *		we're inserting the downlink for.  This function will clear the
  *		INCOMPLETE_SPLIT flag on it, and release the buffer.
  *
  *		The locking interactions in this code are critical.  You should
  *		grok Lehman and Yao's paper before making any changes.  In addition,
  *		you need to understand how we disambiguate duplicate keys in this
  *		implementation, in order to be able to find our location using
- *		L "move right" operations.  Since we may insert duplicate user
- *		keys, and since these dups may propagate up the tree, we use the
- *		'afteritem' parameter to position ourselves correctly for the
- *		insertion on internal pages.
+ *		L "move right" operations.
  *--
  */
 static void
-- 
2.7.4



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Teodor Sigaev

Thank you, pushed.


Actually, I see one tiny issue with extra '*' characters here:


+* The number of attributes won't be explicitly represented if the
+* negative infinity tuple was generated during a page split that
+* occurred with a version of Postgres before v11.  There must be a
+* problem when there is an explicit representation that is
+* non-zero, * or when there is no explicit representation and the
+* tuple is * evidently not a pre-pg_upgrade tuple.


I also suggest fixing this indentation before commit:


+   /*
+*Cannot leak memory here, TupleDescCopy() doesn't allocate any
+* inner structure, so, plain pfree() should clean all allocated memory
+*/


fixed

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 1:45 PM, Peter Geoghegan  wrote:
> I suggest committing this patch as-is.

Actually, I see one tiny issue with extra '*' characters here:

> +* The number of attributes won't be explicitly represented if the
> +* negative infinity tuple was generated during a page split that
> +* occurred with a version of Postgres before v11.  There must be 
> a
> +* problem when there is an explicit representation that is
> +* non-zero, * or when there is no explicit representation and the
> +* tuple is * evidently not a pre-pg_upgrade tuple.

I also suggest fixing this indentation before commit:

> +   /*
> +*Cannot leak memory here, TupleDescCopy() doesn't allocate any
> +* inner structure, so, plain pfree() should clean all allocated memory
> +*/

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 1:32 PM, Teodor Sigaev  wrote:
>> I don't understand. We do check the number of attributes on rightmost
>> pages, but we do so separately, in the main loop. For every item that
>> isn't the high key.
>
> Comment added, please verify. I also refactored _bt_check_natts(); I hope
> it's now a bit more readable.

The new comment looks good.

Now I understand what you meant about _bt_check_natts(). And, I agree
that this is an improvement -- the extra verbosity is worth it.

> I didn't do that in v1, sorry, I was unclear. Attached patch contains all
> changes suggested in my previous email.

The new BTreeTupSetNAtts() assertion looks good to me.

I suggest committing this patch as-is.

Thank you
-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Teodor Sigaev

I don't understand. We do check the number of attributes on rightmost
pages, but we do so separately, in the main loop. For every item that
isn't the high key.
Comment added, please verify. I also refactored _bt_check_natts(); I hope
it's now a bit more readable.



4) BTreeTupSetNAtts - seems it's better to add a check that nattrs fits into the
BT_N_KEYS_OFFSET_MASK mask, and it should not touch the BT_RESERVED_OFFSET_MASK
bits; right now it overwrites those bits.


An assertion sounds like it would be an improvement, though I don't
see that in the patch you posted.
I didn't do that in v1, sorry, I was unclear. Attached patch contains 
all changes suggested in my previous email.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index be0206d58e..1a605f9944 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -698,6 +698,9 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
+ * - That tuples report that they have the expected number of attributes.
+ *	 INCLUDE index pivot tuples should not contain non-key attributes.
+ *
  * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
@@ -722,43 +725,35 @@ bt_target_page_check(BtreeCheckState *state)
 	elog(DEBUG2, "verifying %u items on %s block %u", max,
 		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
 
-
-	/* Check the number of attributes in high key if any */
-	if (!P_RIGHTMOST(topaque))
+	/*
+	 * Check the number of attributes in high key. Note, rightmost page doesn't
+	 * contain a high key, so nothing to check
+	 */
+	if (!P_RIGHTMOST(topaque) &&
+		!_bt_check_natts(state->rel, state->target, P_HIKEY))
 	{
-		if (!_bt_check_natts(state->rel, state->target, P_HIKEY))
-		{
-			ItemId		itemid;
-			IndexTuple	itup;
-			char	   *itid,
-	   *htid;
+		ItemId		itemid;
+		IndexTuple	itup;
 
-			itemid = PageGetItemId(state->target, P_HIKEY);
-			itup = (IndexTuple) PageGetItem(state->target, itemid);
-			itid = psprintf("(%u,%u)", state->targetblock, P_HIKEY);
-			htid = psprintf("(%u,%u)",
-			ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-			ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+		itemid = PageGetItemId(state->target, P_HIKEY);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
 
-			ereport(ERROR,
-	(errcode(ERRCODE_INDEX_CORRUPTED),
-	 errmsg("wrong number of index tuple attributes for index \"%s\"",
-			RelationGetRelationName(state->rel)),
-	 errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.",
-		itid,
-		BTreeTupGetNAtts(itup, state->rel),
-		P_ISLEAF(topaque) ? "heap" : "index",
-		htid,
-		(uint32) (state->targetlsn >> 32),
-		(uint32) state->targetlsn)));
-		}
+		ereport(ERROR,
+(errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("wrong number of high key index tuple attributes in index \"%s\"",
+		RelationGetRelationName(state->rel)),
+ errdetail_internal("Index block=%u natts=%u block type=%s page lsn=%X/%X.",
+	state->targetblock,
+	BTreeTupleGetNAtts(itup, state->rel),
+	P_ISLEAF(topaque) ? "heap" : "index",
+	(uint32) (state->targetlsn >> 32),
+	(uint32) state->targetlsn)));
 	}
 
-
 	/*
 	 * Loop over page items, starting from first non-highkey item, not high
-	 * key (if any).  Also, immediately skip "negative infinity" real item (if
-	 * any).
+	 * key (if any).  Most tests are not performed for the "negative infinity"
+	 * real item (if any).
 	 */
 	for (offset = P_FIRSTDATAKEY(topaque);
 		 offset <= max;
@@ -791,7 +786,7 @@ bt_target_page_check(BtreeCheckState *state)
 		tupsize, ItemIdGetLength(itemid),
 		(uint32) (state->targetlsn >> 32),
 		(uint32) state->targetlsn),
-	 errhint("This could be a torn page problem")));
+	 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
 		if (!_bt_check_natts(state->rel, state->target, offset))
@@ -806,11 +801,11 @@ bt_target_page_check(BtreeCheckState *state)
 
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
-	 errmsg("wrong number of index tuple attributes for index \"%s\"",
+	 errmsg("wrong number of index tuple attributes in index \"%s\"",
 			RelationGetRelationName(state->rel)),
 	 errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.",
 		itid,
-		BTreeTupGetNAtts(itup, state->rel),
+		BTreeTupleGetNAtts(itup, state->rel),
 		P_ISLEAF(topaque) ? "heap" : "index",
 		htid,
 		(uint32) (state->targetlsn >> 32),
@@ -818,8 +813,8 @@ bt_target_page_check(BtreeCheckState *state)
 		}
 
 		/*
-		 * Don't try to generate scankey using 

Re: WIP: Covering + unique indexes.

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 10:10 AM, Teodor Sigaev  wrote:
> I mostly agree with your patch, nice work, but I have some notices for your
> patch:

Thanks.

> 1)
> bt_target_page_check():
> if (!P_RIGHTMOST(topaque) &&
> !_bt_check_natts(state->rel, state->target, P_HIKEY))
>
> Seems not very obvious: it looks like we don't need to check nattrs on the
> rightmost page. Okay, I remember that the rightmost page has no high key at
> all, but at least a comment should be added. Implicitly,
> bt_target_page_check() already takes 'is the page rightmost or not?' into
> account by using P_FIRSTDATAKEY, so maybe it's better to move the rightmost
> check into bt_target_page_check() with some refactoring of the if-logic:

I don't understand. We do check the number of attributes on rightmost
pages, but we do so separately, in the main loop. For every item that
isn't the high key.

This code appears before the main bt_target_page_check() loop because
we're checking the high key itself, on its own, which is a new thing.
The high key is also involved in the loop (on non-rightmost pages),
but that's only because we check real items *against* the high key (we
assume the high key is good and that the item might be bad). The high
key is involved in every iteration of the main loop (on non-rightmost
pages), rather than getting its own loop.

That said, I am quite happy if you want to put a comment about this
being the rightmost page at the beginning of the check.

> 2)
> Style notice:
> ItemPointerSetInvalid(&trunctuple.t_tid);
> +   BTreeTupSetNAtts(&trunctuple, 0);
> if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData),
> P_HIKEY,
> It's better to have a blank line between BTreeTupSetNAtts() and the if clause.

Sure.

> 3) Naming BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use
> the full Tuple word in the downlink macros, here we use just Tup. Seems
> better to have Tuple in both cases. Or Tup, but still in both cases.

+1

> 4) BTreeTupSetNAtts - seems it's better to add a check that nattrs fits
> into the BT_N_KEYS_OFFSET_MASK mask, and it should not touch the
> BT_RESERVED_OFFSET_MASK bits; right now it overwrites those bits.

An assertion sounds like it would be an improvement, though I don't
see that in the patch you posted.

> Attached patch is rebased onto current HEAD and contains a comment
> improvement in index_truncate_tuple() - you save some memory with the
> TupleDescCopy() call but didn't explain why pfree() is enough to free all
> allocated memory.

Makes sense.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Teodor Sigaev

I mostly agree with your patch, nice work, but I have some notices for your 
patch:

1)
bt_target_page_check():
if (!P_RIGHTMOST(topaque) &&
!_bt_check_natts(state->rel, state->target, P_HIKEY))

Seems not very obvious: it looks like we don't need to check nattrs on the rightmost
page. Okay, I remember that the rightmost page has no high key at all, but at least
a comment should be added. Implicitly, bt_target_page_check() already takes
'is the page rightmost or not?' into account by using P_FIRSTDATAKEY, so maybe it's
better to move the rightmost check into bt_target_page_check() with some refactoring
of the if-logic:


if (offset > maxoff)
return true; //nothing to check, also covers empty rightmost page

if (P_ISLEAF) {
if (offnum >= P_FIRSTDATAKEY)
...
else /* if (offnum == P_HIKEY) */
...
}
else // !P_ISLEAF
{
if (offnum == P_FIRSTDATAKEY)
...
else if (offnum > P_FIRSTDATAKEY)
...
else /* if (offnum == P_HIKEY) */
...
}

I see only 3 possible nattrs values: 0, nkey, and nkey+nincluded, but collapsing
the if-clause into three branches would make the code harder to read. Let the
compiler optimize that. Sorry for the late notice; it only caught my attention
when I noticed the (!P_RIGHTMOST && !_bt_check_natts) condition.


2)
Style notice:
ItemPointerSetInvalid(&trunctuple.t_tid);
+   BTreeTupSetNAtts(&trunctuple, 0);
if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
It's better to have a blank line between BTreeTupSetNAtts() and the if clause.

3) Naming BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the full
Tuple word in the downlink macros, here we use just Tup. Seems better to have
Tuple in both cases. Or Tup, but still in both cases.


4) BTreeTupSetNAtts - seems it's better to add a check that nattrs fits into the
BT_N_KEYS_OFFSET_MASK mask, and it should not touch the BT_RESERVED_OFFSET_MASK
bits; right now it overwrites those bits.


Attached patch is rebased onto current HEAD and contains a comment improvement
in index_truncate_tuple() - you save some memory with the TupleDescCopy() call
but didn't explain why pfree() is enough to free all allocated memory.





Peter Geoghegan wrote:

On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov
 wrote:

Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
argument, not a relation?  It doesn't need the number of key attributes anyway,
only the total number of attributes.  Then _bt_isequal() would be able to use
BTreeTupGetNAtts().


That would make the BTreeTupGetNAtts() assertions quite a bit more
verbose, since there is usually no existing tuple descriptor variable,
but there is almost always a "rel" variable. The coverage within
_bt_isequal() does not seem important, because we only use it with the
page high key in rare cases, where _bt_moveright() will already have
tested the highkey.


I think it's completely OK to fix broken things when you have to touch
them.  Probably, Teodor would decide to make that a separate commit.
So, it's up to him.


You're right to say that this old negative infinity tuple code within
_bt_insertonpg() is broken code, and not just dead code. The code
doesn't just confuse things (e.g. see recent commit 2a67d644). It also
seems like it could actually be harmful. This is code that could only
ever corrupt your database.

I'm fine if Teodor wants to commit that change separately, of course.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index be0206d58e..7efd7ac47b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -698,6 +698,9 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
+ * - That tuples report that they have the expected number of attributes.
+ *	 INCLUDE index pivot tuples should not contain non-key attributes.
+ *
  * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
@@ -722,43 +725,32 @@ bt_target_page_check(BtreeCheckState *state)
 	elog(DEBUG2, "verifying %u items on %s block %u", max,
 		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
 
-
-	/* Check the number of attributes in high key if any */
-	if (!P_RIGHTMOST(topaque))
+	/* Check the number of attributes in high key */
+	if (!P_RIGHTMOST(topaque) &&
+		!_bt_check_natts(state->rel, state->target, P_HIKEY))
 	{
-		if (!_bt_check_natts(state->rel, state->target, P_HIKEY))
-		{
-			ItemId		itemid;
-			IndexTuple	itup;
-			char	   *itid,
-	   *htid;
+		ItemId		itemid;
+		IndexTuple	itup;
 
-			itemid = PageGetItemId(state->target, P_HIKEY);
-			itup = (IndexTuple) PageGetItem(state->target, itemid);
-			itid = psprintf("(%u,%u)", 

Re: WIP: Covering + unique indexes.

2018-04-17 Thread Peter Geoghegan
On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov
 wrote:
> Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
> argument, not a relation?  It doesn't need the number of key attributes anyway,
> only the total number of attributes.  Then _bt_isequal() would be able to use
> BTreeTupGetNAtts().

That would make the BTreeTupGetNAtts() assertions quite a bit more
verbose, since there is usually no existing tuple descriptor variable,
but there is almost always a "rel" variable. The coverage within
_bt_isequal() does not seem important, because we only use it with the
page high key in rare cases, where _bt_moveright() will already have
tested the highkey.

> I think it's completely OK to fix broken things when you have to touch
> them.  Probably, Teodor would decide to make that a separate commit.
> So, it's up to him.

You're right to say that this old negative infinity tuple code within
_bt_insertonpg() is broken code, and not just dead code. The code
doesn't just confuse things (e.g. see recent commit 2a67d644). It also
seems like it could actually be harmful. This is code that could only
ever corrupt your database.

I'm fine if Teodor wants to commit that change separately, of course.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-17 Thread Alexander Korotkov
On Mon, Apr 16, 2018 at 1:05 AM, Peter Geoghegan  wrote:

> Attached patch makes the changes that I talked about, and a few
> others. The commit message has full details. The general direction of
> the patch is that it documents our assumptions, and verifies them in
> more cases. Most of the changes I've made are clear improvements,
> though in a few cases I've made changes that are perhaps more
> debatable.


Great, thank you very much!


> These other, more debatable cases are:
>
> * The comments added to _bt_isequal() about suffix truncation may not
> be to your taste. The same is true of the way that I restored the
> previous _bt_isequal() function signature. (Yes, I want to change it
> back despite the fact that I was the person that originally suggested
> we change _bt_isequal().)
>

Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
argument, not a relation?  It doesn't need the number of key attributes anyway,
only the total number of attributes.  Then _bt_isequal() would be able to use
BTreeTupGetNAtts().

> * I added BTreeTupSetNAtts() calls to a few places that don't truly
> need them, such as the point where we generate a dummy 0-attribute
> high key within _bt_mark_page_halfdead(). I think that we should try
> to be as consistent as possible about using BTreeTupSetNAtts(), to set
> a good example. I don't think it's necessary to use BTreeTupSetNAtts()
> for pivot tuples when the number of key attributes matches indnatts
> (it seems inconvenient to have to palloc() our own scratch buffer to
> do this when we don't have to), but that doesn't apply to these
> now-covered cases.
>

+1

>
> > I think we need to move _bt_check_natts() and its call under
> > USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
> > pay for an unused feature.
>
> I eventually decided that you were right about this, and made the
> _bt_compare() call to _bt_check_natts() a simple assertion without
> waiting to hear more opinions on the matter. Concurrency isn't a
> factor here, so adding a check to standard release builds isn't
> particularly likely to detect bugs. Besides, there is really only a
> small number of places that need to do truncation for themselves. And,
> if you want to be sure that the structure is consistent in the field,
> there is always amcheck, which can check _bt_check_natts() (while also
> checking other things that we care about just as much).
>

Good point, the risk of performance degradation caused by _bt_check_natts()
in _bt_compare() is high.  So, let's make it an assertion.

> Note that I removed some dead code from _bt_insertonpg() that wasn't
> added by the INCLUDE patch. It confused matters for this patch, since
> we don't want to consider what's supposed to happen when there is a
> retail insertion of a new, second negative infinity item -- clearly,
> that should simply never happen (I thought about adding a
> BTreeTupSetNAtts() call, but then decided to just remove the dead code
> and add a new "can't happen" elog error).


I think it's completely OK to fix broken things when you have to touch
them.  Probably, Teodor would decide to make that a separate commit.
So, it's up to him.


> Finally, I made sure that we
> don't drop all tables in the regression tests, so that we have some
> pg_dump coverage for INCLUDE indexes, per a request from Tom.
>

Makes sense, because that has already turned out to be broken.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-04-15 Thread Peter Geoghegan
On Thu, Apr 12, 2018 at 9:21 AM, Teodor Sigaev  wrote:
> Agreed, I prefer to add more Asserts, maybe even more than actually needed.
> Assert-documented code :)

Absolutely. The danger with a feature like this is that we'll miss one
place. I suppose that you could say that I am in the Poul-Henning Kamp
camp on assertions [1].

>> I'll add an item to "Decisions to Recheck Mid-Beta" section of the
>> open items page for this patch. We should review the decision to make
>> a call to _bt_check_natts() within _bt_compare(). It might work just
>> as well as an assertion, and it would be unfortunate if workloads that
>> don't use covering indexes had to pay a price for the
>> _bt_check_natts() call, even if it was a small price. I've seen
>> _bt_compare() appear prominently in profiles quite a few times.
>>
>
> Could you show a patch?

Attached patch makes the changes that I talked about, and a few
others. The commit message has full details. The general direction of
the patch is that it documents our assumptions, and verifies them in
more cases. Most of the changes I've made are clear improvements,
though in a few cases I've made changes that are perhaps more
debatable. These other, more debatable cases are:

* The comments added to _bt_isequal() about suffix truncation may not
be to your taste. The same is true of the way that I restored the
previous _bt_isequal() function signature. (Yes, I want to change it
back despite the fact that I was the person that originally suggested
we change _bt_isequal().)

* I added BTreeTupSetNAtts() calls to a few places that don't truly
need them, such as the point where we generate a dummy 0-attribute
high key within _bt_mark_page_halfdead(). I think that we should try
to be as consistent as possible about using BTreeTupSetNAtts(), to set
a good example. I don't think it's necessary to use BTreeTupSetNAtts()
for pivot tuples when the number of key attributes matches indnatts
(it seems inconvenient to have to palloc() our own scratch buffer to
do this when we don't have to), but that doesn't apply to these
now-covered cases.

I imagine that you'll have no problem with the other changes in the
patch, which is why I haven't mentioned them here. Let me know what
you think.

> I think we need to move _bt_check_natts() and its call under
> USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't pay
> for an unused feature.

I eventually decided that you were right about this, and made the
_bt_compare() call to _bt_check_natts() a simple assertion without
waiting to hear more opinions on the matter. Concurrency isn't a
factor here, so adding a check to standard release builds isn't
particularly likely to detect bugs. Besides, there is really only a
small number of places that need to do truncation for themselves. And,
if you want to be sure that the structure is consistent in the field,
there is always amcheck, which can check _bt_check_natts() (while also
checking other things that we care about just as much).

Note that I removed some dead code from _bt_insertonpg() that wasn't
added by the INCLUDE patch. It confused matters for this patch, since
we don't want to consider what's supposed to happen when there is a
retail insertion of a new, second negative infinity item -- clearly,
that should simply never happen (I thought about adding a
BTreeTupSetNAtts() call, but then decided to just remove the dead code
and add a new "can't happen" elog error). Finally, I made sure that we
don't drop all tables in the regression tests, so that we have some
pg_dump coverage for INCLUDE indexes, per a request from Tom.

[1] https://queue.acm.org/detail.cfm?id=2220317
-- 
Peter Geoghegan
From db777cad48f185afb70c589beae15fa166c0ddf1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 9 Apr 2018 17:45:33 -0700
Subject: [PATCH] Adjust INCLUDE index truncation comments and code.

Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected.  Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature.  A
similar check will be performed for the page highkey within
_bt_moveright() in most cases.  Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.

Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer).  This isn't strictly necessary,
but seems more future-proof than the old approach.  It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation.  Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.

In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE 

Re: WIP: Covering + unique indexes.

2018-04-13 Thread Andrey Borodin
Hi!

> On 12 Apr 2018, at 21:21, Teodor Sigaev wrote:

I was adapting the tests for a GiST covering index and found out that the REINDEX
test is not actually a REINDEX test...
I propose the following micropatch.

Best regards, Andrey Borodin.


fix-reindex-test.diff
Description: Binary data


Re: WIP: Covering + unique indexes.

2018-04-12 Thread Teodor Sigaev



Peter Geoghegan wrote:

On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan  wrote:

I did find another problem, though. Looks like the idea to explicitly
represent the number of attributes directly has paid off already:

pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
create unique index cov_idx on covering_bug (f1) include(f2);
insert into covering_bug select i, i * random() * 1000, i * random() *
10 from generate_series(0,10) i;
DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
CREATE TABLE
DEBUG:  building index "cov_idx" on table "covering_bug" serially
CREATE INDEX
ERROR:  tuple has wrong number of attributes in index "cov_idx"


Actually, this was an error on my part (though I'd still maintain that
the check paid off here!). I'll still add defensive assertions inside
_bt_newroot(), and anywhere else that they're needed. There is no
reason to not add defensive assertions in all code that handles page
splits, and needs to fetch a highkey from some other page. We missed a
few of those.
Agreed, I prefer to add more Asserts, maybe even more than actually
needed. Assert-documented code :)




I'll add an item to "Decisions to Recheck Mid-Beta" section of the
open items page for this patch. We should review the decision to make
a call to _bt_check_natts() within _bt_compare(). It might work just
as well as an assertion, and it would be unfortunate if workloads that
don't use covering indexes had to pay a price for the
_bt_check_natts() call, even if it was a small price. I've seen
_bt_compare() appear prominently in profiles quite a few times.



Could you show a patch?

I think we need to move _bt_check_natts() and its call under
USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
pay for an unused feature.

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-11 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan  wrote:
> I did find another problem, though. Looks like the idea to explicitly
> represent the number of attributes directly has paid off already:
>
> pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
> create unique index cov_idx on covering_bug (f1) include(f2);
> insert into covering_bug select i, i * random() * 1000, i * random() *
> 10 from generate_series(0,10) i;
> DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" 
> serially
> CREATE TABLE
> DEBUG:  building index "cov_idx" on table "covering_bug" serially
> CREATE INDEX
> ERROR:  tuple has wrong number of attributes in index "cov_idx"

Actually, this was an error on my part (though I'd still maintain that
the check paid off here!). I'll still add defensive assertions inside
_bt_newroot(), and anywhere else that they're needed. There is no
reason to not add defensive assertions in all code that handles page
splits, and needs to fetch a highkey from some other page. We missed a
few of those.

I'll add an item to "Decisions to Recheck Mid-Beta" section of the
open items page for this patch. We should review the decision to make
a call to _bt_check_natts() within _bt_compare(). It might work just
as well as an assertion, and it would be unfortunate if workloads that
don't use covering indexes had to pay a price for the
_bt_check_natts() call, even if it was a small price. I've seen
_bt_compare() appear prominently in profiles quite a few times.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-11 Thread Teodor Sigaev

_bt_mark_page_halfdead() looked like it had a problem, but it now
looks like I was wrong. I also verified every other
BTreeInnerTupleGetDownLink() caller. It now looks like everything is
good here.


Right - it tries to find the right page by consulting the parent page,
taking the next key.




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-10 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 1:37 PM, Peter Geoghegan  wrote:
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong.

I did find another problem, though. Looks like the idea to explicitly
represent the number of attributes directly has paid off already:

pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
create unique index cov_idx on covering_bug (f1) include(f2);
insert into covering_bug select i, i * random() * 1000, i * random() *
10 from generate_series(0,10) i;
DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
CREATE TABLE
DEBUG:  building index "cov_idx" on table "covering_bug" serially
CREATE INDEX
ERROR:  tuple has wrong number of attributes in index "cov_idx"

Note that amcheck can detect the issue with the index after the fact, too:

pg@~[3711]=# select bt_index_check('cov_idx');
ERROR:  wrong number of index tuple attributes for index "cov_idx"
DETAIL:  Index tid=(3,2) natts=2 points to index tid=(2,92) page lsn=0/170DC88.

I don't think that the issue is complicated. Looks like we missed a
place that we have to truncate within _bt_split(), located directly
after this comment block:

/*
 * If the page we're splitting is not the rightmost page at its level in
 * the tree, then the first entry on the page is the high key for the
 * page.  We need to copy that to the right half.  Otherwise (meaning the
 * rightmost page case), all the items on the right half will be user
 * data.
 */

I believe that the reason that we didn't find this bug prior to commit
is that we only have a single index tuple with the wrong number of
attributes after an initial root page split through insertions, but
the next root page split masks the problems. Not 100% sure that that's
why we missed it just yet, though.

This bug shouldn't be hard to fix. I'll take care of it as part of
that post-commit review patch I'm working on.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-10 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 9:03 AM, Teodor Sigaev  wrote:
>> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
>> to inner tuples, which might be worth doing something about (perhaps
>> just renaming the macro).
>
> Which place is suspicious, in your opinion?

_bt_mark_page_halfdead() looked like it had a problem, but it now
looks like I was wrong. I also verified every other
BTreeInnerTupleGetDownLink() caller. It now looks like everything is
good here.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-10 Thread Teodor Sigaev

* There is no pfree() within _bt_buildadd() for truncated tuples, even
though that's a context where it's clearly not okay.

Agree



* It might be a good idea to also pfree() the truncated tuple for most
other _bt_buildadd() callers. Even though it's arguably okay in other
cases, it seems worth being consistent about it (consistent with old
nbtree code).

It seems I don't see any other calls needing a pfree after that.


* There should probably be some documentation around why it's okay
that we call index_truncate_tuple() with an exclusive buffer lock held
(during a page split). For example, there should probably be a comment
on the VARATT_IS_EXTERNAL() situation.

I have no objection to improving docs/comments.



* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
to inner tuples, which might be worth doing something about (perhaps
just renaming the macro).

Which place is suspicious, in your opinion?



I do not have the time to write a patch right away, but I should be
able to post one in a few days. I want to avoid sending several small
patches.

no problem, we can wait


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-09 Thread Peter Geoghegan
On Sun, Apr 8, 2018 at 11:19 PM, Teodor Sigaev  wrote:
> Thank you, pushed.

I noticed a few more issues following another pass-through of the patch:

* There is no pfree() within _bt_buildadd() for truncated tuples, even
though that's a context where it's clearly not okay.

* It might be a good idea to also pfree() the truncated tuple for most
other _bt_buildadd() callers. Even though it's arguably okay in other
cases, it seems worth being consistent about it (consistent with old
nbtree code).

* There should probably be some documentation around why it's okay
that we call index_truncate_tuple() with an exclusive buffer lock held
(during a page split). For example, there should probably be a comment
on the VARATT_IS_EXTERNAL() situation.

* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
to inner tuples, which might be worth doing something about (perhaps
just renaming the macro).

I do not have the time to write a patch right away, but I should be
able to post one in a few days. I want to avoid sending several small
patches.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-09 Thread Teodor Sigaev

Thanks to both of you, pushed

Shinoda, Noriyoshi wrote:

Hi!

Thank you for your response.

I think your proposal is good.

Regards,

Noriyoshi Shinoda

*From:*Alexander Korotkov [mailto:a.korot...@postgrespro.ru]
*Sent:* Monday, April 9, 2018 11:22 PM
*To:* Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com>
*Cc:* PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev 
<teo...@sigaev.ru>; Peter Geoghegan <p...@bowt.ie>; Jeff Janes 
<jeff.ja...@gmail.com>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru>

*Subject:* Re: WIP: Covering + unique indexes.

Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com 
<mailto:noriyoshi.shin...@hpe.com>> wrote:


I tested this feature and found a document shortage in the columns added to
the pg_constraint catalog.
The attached patch will add the description of the 'conincluding' column to
the manual of the pg_constraint catalog.

Thank you for pointing this!

I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/>
The Russian Postgres Company



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



RE: WIP: Covering + unique indexes.

2018-04-09 Thread Shinoda, Noriyoshi
Hi!

Thank you for your response.
I think your proposal is good.

Regards,
Noriyoshi Shinoda

From: Alexander Korotkov [mailto:a.korot...@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev 
<teo...@sigaev.ru>; Peter Geoghegan <p...@bowt.ie>; Jeff Janes 
<jeff.ja...@gmail.com>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru>
Subject: Re: WIP: Covering + unique indexes.

Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi 
<noriyoshi.shin...@hpe.com<mailto:noriyoshi.shin...@hpe.com>> wrote:
I tested this feature and found a document shortage in the columns added to the 
pg_constraint catalog.
The attached patch will add the description of the 'conincluding' column to the 
manual of the pg_constraint catalog.

Thank you for pointing this!
I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com<http://www.postgrespro.com/>
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-04-09 Thread Alexander Korotkov
Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <
noriyoshi.shin...@hpe.com> wrote:

> I tested this feature and found a document shortage in the columns added
> to the pg_constraint catalog.
> The attached patch will add the description of the 'conincluding' column
> to the manual of the pg_constraint catalog.
>

Thank you for pointing this!
I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_constraint-2.patch
Description: Binary data
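
For reference, a hedged sketch of inspecting the column being documented here ('conincluding' existed in the PostgreSQL 11 beta under discussion; it was later questioned and removed before release, so this only applies to builds of that vintage, and 'some_table' is a placeholder name):

```sql
-- List unique/PK constraints on a table, showing key columns and,
-- on the beta discussed in this thread, the INCLUDE columns recorded
-- for them in the (since-removed) conincluding column.
SELECT conname, contype, conkey, conincluding
FROM pg_constraint
WHERE conrelid = 'some_table'::regclass;  -- 'some_table' is a placeholder
```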


RE: WIP: Covering + unique indexes.

2018-04-09 Thread Shinoda, Noriyoshi
Hi, 

I tested this feature and found that documentation is missing for the columns 
added to the pg_constraint catalog.
The attached patch adds a description of the 'conincluding' column to the 
pg_constraint catalog documentation.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Teodor Sigaev [mailto:teo...@sigaev.ru] 
Sent: Monday, April 9, 2018 3:20 PM
To: Peter Geoghegan <p...@bowt.ie>
Cc: Jeff Janes <jeff.ja...@gmail.com>; Alexander Korotkov 
<a.korot...@postgrespro.ru>; Anastasia Lubennikova 
<a.lubennik...@postgrespro.ru>; PostgreSQL Hackers 
<pgsql-hackers@lists.postgresql.org>
Subject: Re: WIP: Covering + unique indexes.

Thank you, pushed.


Peter Geoghegan wrote:
> On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teo...@sigaev.ru> wrote:
>> Thank you, fixed
> 
> I suggest that we remove some unneeded amcheck tests, as in the 
> attached patch. They don't seem to add anything.
> 

-- 
Teodor Sigaev  E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



pg_constraint.patch
Description: pg_constraint.patch


Re: WIP: Covering + unique indexes.

2018-04-09 Thread Teodor Sigaev

Thank you, pushed.


Peter Geoghegan wrote:

On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev  wrote:

Thank you, fixed


I suggest that we remove some unneeded amcheck tests, as in the
attached patch. They don't seem to add anything.



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-08 Thread Peter Geoghegan
On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev  wrote:
> Thank you, fixed

I suggest that we remove some unneeded amcheck tests, as in the
attached patch. They don't seem to add anything.

-- 
Peter Geoghegan
From 0dbbee5bfff8816cddf86961bf4959192f62f1ff Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 8 Apr 2018 11:56:00 -0700
Subject: [PATCH] Remove some superfluous amcheck INCLUDE tests.

Repeating these tests adds unnecessary cycles, since no improvement in
test coverage is expected.

Cleanup from commit 8224de4f42ccf98e08db07b43d52fed72f962ebb.

Author: Peter Geoghegan
---
 contrib/amcheck/expected/check_btree.out | 20 +---
 contrib/amcheck/sql/check_btree.sql  |  6 +-
 2 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 2a06cce..ed80ac4 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -111,27 +111,9 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
  
 (1 row)
 
-SELECT bt_index_parent_check('bttest_multi_idx', true);
- bt_index_parent_check 
------------------------
- 
-(1 row)
-
--- repeat same checks with index made by insertions
+-- repeat expansive test for index built using insertions
 TRUNCATE bttest_multi;
 INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 10) as i;
-SELECT bt_index_check('bttest_multi_idx');
- bt_index_check 
-----------------
- 
-(1 row)
-
-SELECT bt_index_parent_check('bttest_multi_idx', true);
- bt_index_parent_check 
------------------------
- 
-(1 row)
-
 SELECT bt_index_parent_check('bttest_multi_idx', true);
  bt_index_parent_check 
 -----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index da2f131..4ca9d2d 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -65,15 +65,11 @@ COMMIT;
 SELECT bt_index_check('bttest_multi_idx');
 -- more expansive test for index with included columns
 SELECT bt_index_parent_check('bttest_multi_idx', true);
-SELECT bt_index_parent_check('bttest_multi_idx', true);
 
--- repeat same checks with index made by insertions
+-- repeat expansive test for index built using insertions
 TRUNCATE bttest_multi;
 INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 10) as i;
-SELECT bt_index_check('bttest_multi_idx');
 SELECT bt_index_parent_check('bttest_multi_idx', true);
-SELECT bt_index_parent_check('bttest_multi_idx', true);
-
 
 -- cleanup
 DROP TABLE bttest_a;
-- 
2.7.4



Re: WIP: Covering + unique indexes.

2018-04-08 Thread Teodor Sigaev

Thanks to everyone, pushed.


Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc 
(Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:


indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
   int   indnatts = tupleDescriptor->natts;


Thank you, fixed

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-08 Thread Teodor Sigaev

Thank you, fixed

Jeff Janes wrote:
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev > wrote:


Thanks to everyone, pushed.


Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc 
(Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:


indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
   int   indnatts = tupleDescriptor->natts;

Cheers,

Jeff


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-08 Thread Jeff Janes
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev  wrote:

> Thanks to everyone, pushed.
>
>
Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;

Cheers,

Jeff


Re: WIP: Covering + unique indexes.

2018-04-07 Thread Andrew Gierth
> "Teodor" == Teodor Sigaev  writes:

 >> Support for testing amcaninclude via
 >> pg_indexam_has_property(oid,'can_include') seems to be missing?
 >> 
 >> Also the return values of pg_index_column_has_property for included
 >> columns seem a bit dubious... should probably be returning NULL for most
 >> properties except 'returnable'.
 
 Teodor> Damn, you right, it's missed.

 >> I can look at fixing these for you if you like?

 Teodor> If you will do that I will be very grateful

OK, I will deal with it.

-- 
Andrew (irc:RhodiumToad)



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Teodor Sigaev

Support for testing amcaninclude via
pg_indexam_has_property(oid,'can_include') seems to be missing?

Also the return values of pg_index_column_has_property for included
columns seem a bit dubious... should probably be returning NULL for most
properties except 'returnable'.

Damn, you right, it's missed.


I can look at fixing these for you if you like?


If you will do that I will be very grateful

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Peter Geoghegan
On Sat, Apr 7, 2018 at 1:52 PM, Andrew Gierth
 wrote:
> Support for testing amcaninclude via
> pg_indexam_has_property(oid,'can_include') seems to be missing?
>
> Also the return values of pg_index_column_has_property for included
> columns seem a bit dubious... should probably be returning NULL for most
> properties except 'returnable'.
>
> I can look at fixing these for you if you like?

I'm happy to accept your help with it, for one.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Andrew Gierth
> "Teodor" == Teodor Sigaev  writes:

 >> I'll keep an eye on the buildfarm, since it's late in Russia.
 
 Teodor> Thank you very much! It's now 23:10 MSK and I'll be able to
 Teodor> follow for approximately an hour.

Support for testing amcaninclude via
pg_indexam_has_property(oid,'can_include') seems to be missing?

Also the return values of pg_index_column_has_property for included
columns seem a bit dubious... should probably be returning NULL for most
properties except 'returnable'.

I can look at fixing these for you if you like?

-- 
Andrew (irc:RhodiumToad)
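
As a hedged illustration of the property interfaces under discussion (pg_indexam_has_property and pg_index_column_has_property are real system functions; the table and index names here are hypothetical, and the NULL behavior for INCLUDE columns is the proposal above, not necessarily what any given build returns):

```sql
-- Hypothetical setup: a two-key unique index with one INCLUDE column.
CREATE TABLE prop_demo (a int, b int, c int);
CREATE UNIQUE INDEX prop_demo_idx ON prop_demo USING btree (a, b) INCLUDE (c);

-- AM-level capability: does btree support INCLUDE columns at all?
SELECT pg_indexam_has_property(
         (SELECT oid FROM pg_am WHERE amname = 'btree'),
         'can_include');

-- Column-level properties for the INCLUDE column (attribute 3):
-- per the proposal above, most should be NULL, with 'returnable' true.
SELECT prop,
       pg_index_column_has_property('prop_demo_idx'::regclass, 3, prop)
FROM unnest(ARRAY['orderable', 'asc', 'nulls_first', 'returnable']) AS prop;
```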



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Teodor Sigaev

Thank you. I looked at the buildfarm and completely forgot about the commitfest site

Andres Freund wrote:

On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote:

Thanks to everyone, pushed.


Marked CF entry as committed.

Greetings,

Andres Freund



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Andres Freund
On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote:
> Thanks to everyone, pushed.

Marked CF entry as committed.

Greetings,

Andres Freund



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Teodor Sigaev

I'll keep an eye on the buildfarm, since it's late in Russia.


Thank you very much! It's now 23:10 MSK and I'll be able to follow for 
approximately an hour.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Peter Geoghegan
On Sat, Apr 7, 2018 at 1:02 PM, Teodor Sigaev  wrote:
> Thanks to everyone, pushed.

I'll keep an eye on the buildfarm, since it's late in Russia.

-- 
Peter Geoghegan



Sv: Re: WIP: Covering + unique indexes.

2018-04-07 Thread Andreas Joseph Krogh
On Saturday, 7 April 2018 at 22:02:08, Teodor Sigaev  wrote:
Thanks to everyone, pushed.
 
Rock!
 
--
Andreas Joseph Krogh

 


Re: WIP: Covering + unique indexes.

2018-04-07 Thread Teodor Sigaev

Thanks to everyone, pushed.




Peter Geoghegan wrote:

On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev  wrote:

On closer look, bts_btentry.ip_posid is not used anymore; I changed
bts_btentry's type to BlockNumber. As a result, BTEntrySame() is removed.


That seems like a good idea.


I'm not very happy with the massive usage of
ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)); suggest wrapping it in a
macro, something like this:
#define BTreeInnerTupleGetDownLink(itup) \
 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))


Agreed. We do that with GIN.



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Peter Geoghegan
On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev  wrote:
> On closer look, bts_btentry.ip_posid is not used anymore; I changed
> bts_btentry's type to BlockNumber. As a result, BTEntrySame() is removed.

That seems like a good idea.

> I'm not very happy with the massive usage of
> ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)); suggest wrapping it in a
> macro, something like this:
> #define BTreeInnerTupleGetDownLink(itup) \
> ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))

Agreed. We do that with GIN.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-07 Thread Alvaro Herrera
I didn't like rel.h being included in itup.h.  Do you really need a
Relation as argument to index_truncate_tuple?  It looks to me like you
could pass the tupledesc instead; indnatts could be passed as a separate
argument instead of IndexRelationGetNumberOfAttributes.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: Covering + unique indexes. (the good and the bad)

2018-04-07 Thread Erik Rijkers

On 2018-04-07 14:27, Alexander Korotkov wrote:

On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers  wrote:


On 2018-04-06 20:08, Alexander Korotkov wrote:


[0001-Covering-v15.patch]

After some more testing I notice there is also a down-side/slow-down to
this patch that is not so bad but more than negligible, and I don't think
it has been mentioned (but I may have missed something in this thread
that's now been running for 1.5 years, not to mention the tangential
btree-thread(s)).

I attach my test-program, which compares master (this morning) with
covered_indexes (warning: it takes a while to generate the used tables).

The test tables are created as:
  create table $t (c1 int, c2 int, c3 int, c4 int);
  insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x);
  create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4);

or for HEAD, just:
  create unique index ${t}unique_idx on $t using btree (c1, c2);



Do I understand correctly that you compare a unique index on (c1, c2) on
master to a unique index on (c1, c2) include (c3, c4) on the patched
version?

If so, then I think it's wrong to speak of a down-side/slow-down of this
patch based on this comparison.
The patch *does not* cause a slowdown in this case.  The patch provides the
user a *new option* which has its advantages and disadvantages.  And what
you compare are the advantages and disadvantages of this option, not a
slow-down of the patch.

In the case you compare *the same* index on master and the patched version,
then it's possible to speak of a slow-down of the patch.


OK, I take your point -- you are right.  Although my measurement was (I
think) correct, my comparison was not (as Teodor wrote, not quite 'fair').

Sorry, I should have thought that message through better.  The somewhat
longer time is indeed just a disadvantage of this new option, to be
balanced against the advantages, which are pretty clear too.



Erik Rijkers



Re: WIP: Covering + unique indexes. (the good and the bad)

2018-04-07 Thread Alexander Korotkov
On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers  wrote:

> On 2018-04-06 20:08, Alexander Korotkov wrote:
>
>>
>> [0001-Covering-v15.patch]
>>
>>
> After some more testing I notice there is also a down-side/slow-down to
> this patch that is not so bad but more than negligible, and I don't think
> it has been mentioned (but I may have missed something in this thread
> that's now been running for 1.5 year, not to mention the tangential
> btree-thread(s)).
>
> I attach my test-program, which compares master (this morning) with
> covered_indexes (warning: it takes a while to generate the used tables).
>
> The test tables are created as:
>   create table $t (c1 int, c2 int, c3 int, c4 int);
>   insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount)
> as x);
>   create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2)
> include (c3, c4);
>
> or for HEAD, just:
>   create unique index ${t}unique_idx on $t using btree (c1, c2);
>

Do I understand correctly that you compare a unique index on (c1, c2) on
master to a unique index on (c1, c2) include (c3, c4) on the patched version?
If so, then I think it's wrong to speak of a down-side/slow-down of this
patch based on this comparison.
The patch *does not* cause a slowdown in this case.  The patch provides the
user a *new option* which has its advantages and disadvantages.  And what you
compare are the advantages and disadvantages of this option, not a slow-down
of the patch.
Only if you compare *the same* index on master and the patched version is it
possible to speak of a slow-down of the patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes. (the good and the bad)

2018-04-07 Thread Teodor Sigaev

Thank you!

   create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) 
include (c3, c4);

or for HEAD, just:
   create unique index ${t}unique_idx on $t using btree (c1, c2);




-- explain analyze select c1, c2 from nt0___1 where c1 < 1
-- explain analyze select c1, c2 from nt0___1 where c1 < 1 and c3 < 20


Not a fair comparison: the INCLUDE index is twice as big because of the
included columns. Try comparing with a covering-emulated index:

create unique index ${t}unique_idx on $t using btree (c1, c2, c3, c4)

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/
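
Teodor's suggestion can be made concrete with a sketch like the following (illustrative only; the table name t and index names are hypothetical stand-ins for the generated test tables from Erik's script):

```sql
-- INCLUDE variant (what the test used): unique on (c1, c2) only,
-- with c3, c4 carried as non-key payload columns.
CREATE UNIQUE INDEX t_include_idx
    ON t USING btree (c1, c2) INCLUDE (c3, c4);

-- Covering-emulated baseline Teodor proposes for a fair comparison:
-- same columns stored, but all as key columns.  Note the uniqueness
-- constraint now covers (c1, c2, c3, c4), which is exactly the
-- semantic difference INCLUDE exists to avoid.
CREATE UNIQUE INDEX t_emulated_idx
    ON t USING btree (c1, c2, c3, c4);
```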



Re: WIP: Covering + unique indexes. (the good and the bad)

2018-04-07 Thread Erik Rijkers

On 2018-04-06 20:08, Alexander Korotkov wrote:


[0001-Covering-v15.patch]



After some more testing I notice there is also a down-side/slow-down to 
this patch that is not so bad but more than negligible, and I don't 
think it has been mentioned (but I may have missed something in this 
thread that's now been running for 1.5 years, not to mention the 
tangential btree-thread(s)).


I attach my test-program, which compares master (this morning) with 
covered_indexes (warning: it takes a while to generate the used tables).


The test tables are created as:
  create table $t (c1 int, c2 int, c3 int, c4 int);
  insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x);
  create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4);


or for HEAD, just:
  create unique index ${t}unique_idx on $t using btree (c1, c2);


Here is typical output (edited a bit to prevent email-mangling):

test1:
-- explain analyze select c1, c2 from nt0___1 where c1 < 1   -- 250x
unpatched 6511: 100M rows Execution Time:  (normal/normal)  98 %  exec avg: 2.44
  patched 6976: 100M rows Execution Time: (covered/normal) 108 %  exec avg: 2.67
                                          test1 patched / unpatched: 109.49 %

test4:
-- explain analyze select c1, c2 from nt0___1 where c1 < 1 and c3 < 20
unpatched 6511: 100M rows Execution Time:  (normal/normal)  95 %  exec avg: 1.56
  patched 6976: 100M rows Execution Time: (covered/normal)  60 %  exec avg: 0.95
                                          test4 patched / unpatched:  60.83 %



So the main good thing is that 60%, a good improvement -- but that ~109% 
(a slow-down) is also quite repeatable.


(there are more goodies from the patch (like improved insert-speed)
but I just wanted to draw attention to this particular slow-down too)


I took all timings from explain analyze versions of the statements, on 
the assumption that that would be quite comparable to 'normal' querying. 
(please let me know if that introduces error).



# \dti+ nt0___1*
                              List of relations
 Schema |           Name           | Type  |  Owner   |  Table  |  Size
--------+--------------------------+-------+----------+---------+---------
 public | nt0___1                  | table | aardvark |         | 4224 MB
 public | nt0___1uniqueinclude_idx | index | aardvark | nt0___1 | 3004 MB



(for what it's worth, I'm in favor of getting this patch into v11 
although I can't say I followed the technical details too much)



thanks,


Erik Rijkers
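
The plan-shape difference behind the test4 numbers above can be sketched as follows (hypothetical table and index names, and a hypothetical plan shape rather than a measured run; the point is that with the INCLUDE index the c3 filter can be applied during an index-only scan, while the plain (c1, c2) index must visit the heap to evaluate it):

```sql
EXPLAIN (COSTS OFF)
SELECT c1, c2 FROM t WHERE c1 < 10000 AND c3 < 20;

-- Expected shape with the covering index (illustrative only):
--   Index Only Scan using t_uniqueinclude_idx on t
--     Index Cond: (c1 < 10000)
--     Filter: (c3 < 20)
```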



#!/bin/env perl
#!/opt/perl-5.26/bin/perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw/tv_interval gettimeofday/;
use Getopt::Long;
$| = 1;
our $PGPORT_VANILLA  =  6511;
our $PGPORT_COVERING_INDEXES =  6976;
our $SQL_REPEAT = 251;

main();
exit;

sub size_unit { 
  1_000_000
}
sub main {

  my $size   = 100; #rowcount in millions; this $size variable determines the table used

  GetOptions ("size=i" => \$size)   or die("Error in command line arguments\n");

  my $dbh_patched  = connectdb_covering_indexes();
  my $dbh_vanilla  = connectdb_vanilla();
  my $port_patched = check_debug_state( $dbh_patched  );
  my $port_vanilla = check_debug_state( $dbh_vanilla );

  # create tables on patched instance
  for my $n (1, 10, 100) { # , 250 ) {
my $rowcount = $n * size_unit();
create_tables($dbh_patched, $port_patched, $rowcount, my $overwrite = 0);
  }

  # create tables on vanilla instance
  for my $n (1, 10, 100) { # , 250 ) {
my $rowcount = $n * size_unit();
create_tables($dbh_vanilla, $port_vanilla, $rowcount, my $overwrite = 0);
  }

# print sprintf("-- Perl %vd\n", $^V)
#, "-- ", $dbh_vanilla->selectrow_arrayref( "select version()" )->[0], "\n"
#, "-- ", $dbh_patched->selectrow_arrayref( "select version()" )->[0], "\n" ;

  my $c1 = 1; ##  5000 + int(rand(5000)) + 1;
  my $c3 =20; ##   20 + int(rand(  30)) + 1;

  #  $c1 =  5000; ##  5000 + int(rand(5000)) + 1;
  #  $c3 =20; ##   20 + int(rand(  30)) + 1;

  # enable to vary WHERE-clause a little bit:
  if (0) {
 $c1 = 5000 + int(rand(5000)) + 1;
 $c3 =   20 + int(rand(  30)) + 1;
  }
  

  my $vanilla = test1($dbh_vanilla, $port_vanilla, $size, $c1);
  my $patched = test1($dbh_patched, $port_patched, $size, $c1);
  print " "x84, sprintf( "%6.2f %%  <- test1, patched / unpatched\n",  ((average($patched) * 100) / average($vanilla)) );

# test2($dbh_vanilla, $port_vanilla, $size, $c1);
# test2($dbh_patched, $port_patched, $size, $c1);

# test3($dbh_vanilla, $port_vanilla, $size, $c1, $c3);
# test3($dbh_patched, $port_patched, $size, $c1, $c3);

  $vanilla = test4($dbh_vanilla, $port_vanilla, $size, $c1, $c3);
  $patched = test4($dbh_patched, $port_patched, $size, $c1, $c3);
  print " "x84, sprintf( "%6.2f %%  <- test4, 

Re: WIP: Covering + unique indexes.

2018-04-06 Thread Peter Geoghegan
On Fri, Apr 6, 2018 at 11:08 AM, Alexander Korotkov
 wrote:
> OK, incorporated into v15.  I've also added sentence about pg_upgrade
> to the commit message.

I will summarize my feelings on this patch. I endorse committing the
patch, because I think that the benefits of committing it now
noticeably outweigh the costs. I have various caveats about pushing
the patch, but these are manageable.

Costs
=

First, there is the question of risks, or costs. I think that this
patch has a negligible chance of being problematic in a way that will
become memorable. That seems improbable because the patch only really
changes the representation of what we're calling "pivot keys" (high
keys and internal page downlinks), which is something that VACUUM
doesn't care about. I see this patch as a special case of suffix
truncation, a technique that has been around since the 1970s. Although
you have to look carefully to see it, the amount of extra complexity
is pretty small, and the only place where a critical change is made is
during leaf page splits. As long as we get that right, everything else
should fall into place. There are no risks that I can see that are
related to concurrency, or that crop up when doing an anti-wraparound
VACUUM. There may be problems, but at least they won't be *pernicious*
problems that unravel over a long period of time.

The latest amcheck enhancement, and Alexander's recent changes to the
patch to make the on-disk representation explicit (not implicit)
should change things. We now have the tools to detect any corruption
problem that I can think of. For example, if there was some subtle
reason why assessing HOT safety broke, then we'd have a way of
mechanically detecting that without having to devise a custom test
(like the test Pavan happened to be using when the bug fixed by
2aaec654 was originally discovered). The lessons that I applied to
designing amcheck were in a few cases from actual experience with real
world bugs, including that 2aaec654 bug.

I hope that it goes without saying that I've also taken reasonable
steps to address all of these risks directly, by auditing code. And,
that this remains the first line of defense.

Here are the other specific issues that I see with the patch:

* It's possible that something was missed in the optimizer. I'm not sure.

I share the intuition that very little code is actually needed there,
but I'm far from the best person to judge whether or not some subtle
detail was missed.

* This seems out of date:

> +* NOTE: It is not crucial for reliability in present, but maybe
> +* it will be that in the future. Now the purpose is just to save
> +* more space on inner pages of btree.

* CheckIndexCompatible() didn't seem to get the memo about this patch.
Maybe just a comment?

* It's possible that there are some more bugs in places like
relcache.c, or deparsing, or pg_dump, or indexcmds.c; perhaps simple
omissions, like the one I just mentioned. If there are, I don't expect
them to be particularly serious, or to make me reassess my basic
position. But there could be.

* I was wrong to suggest that _bt_isequal() have an assertion against
truncation, since it is called for the high key. Suggest you weaken the
assertion so that it only applies when the offset isn't P_HIKEY on a
non-rightmost page.

* Suggest adding a comment above BTStackData, about bts_btentry + offset.

* Suggest breaking BTEntrySame() into 3 lines, not 2.

* This comment needs to be updated:

/* get high key from left page == lowest key on new right page */

Suggest "get high key from left page == lower bound for new right page".

* This comment needs to be updated:

13th bit: unused

Suggest "13th bit: AM-defined meaning"

* Suggest adding a note that the use of P_HIKEY here is historical,
since it isn't used to match downlinks:

/*
 * Find the parent buffer and get the parent page.
 *
 * Oops - if we were moved right then we need to change stack item! We
 * want to find parent pointing to where we are, right ?- vadim
 * 05/27/97
 */
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

* I'm slightly concerned that this patch subtly breaks an optimization
within _bt_preprocess_keys(), or _bt_checkkeys(). I cannot find any
evidence of that, though, and I consider it unlikely, based on the
intuition that the simple Pathkey changes in the optimizer don't
provide the executor with a truly new set of constraints for index
scans. Also, if there was a problem here, it would be in the less
serious category of problems -- those that can't really affect anyone
not using the user-visible feature.

* The docs need some more polishing. Didn't spend very much time on this at all.

Benefits
========

There is also the matter of the benefits of this patch, which I think
are considerable, and far greater than they appear. This 

Re: WIP: Covering + unique indexes.

2018-04-06 Thread Peter Geoghegan
On Fri, Apr 6, 2018 at 10:33 AM, Alexander Korotkov
 wrote:
> Thinking about that again, I found that we should relax our requirements for
> "minus infinity" items, because pg_upgraded indexes don't have any
> special bits set for those items.
>
> What do you think about applying the following patch on top of v14?

It's clearly necessary. Looks fine to me.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-06 Thread Alexander Korotkov
On Fri, Apr 6, 2018 at 8:22 PM, Peter Geoghegan  wrote:

> On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev  wrote:
> > As far as I can see, there is no on-disk representation difference for
> > *existing* indexes. So, pg_upgrade is not needed here and there isn't any
> > new code to support "on-the-fly" modification. Am I right?
>
> Yes.
>
> I'm going to look at this again today, and will post something within
> 12 hours. Please hold off on committing until then.
>

Thank you.

Thinking about that again, I found that we should relax our requirements for
"minus infinity" items, because pg_upgraded indexes don't have any
special bits set for those items.

What do you think about applying the following patch on top of v14?

diff --git a/src/backend/access/nbtree/nbtsearch.c
b/src/backend/access/nbtree/nbtsearch.c
index 44605fb5a4..53dc47ff82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2000,8 +2000,12 @@ _bt_check_natts(Relation index, Page page,
OffsetNumber offnum)
}
else if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
{
-   /* Leftmost tuples on non-leaf pages have no attributes */
-   return (BTreeTupGetNAtts(itup, index) == 0);
+   /*
+* Leftmost tuples on non-leaf pages have no attributes, or don't have
+* INDEX_ALT_TID_MASK set in pg_upgraded indexes.
+*/
+   return (BTreeTupGetNAtts(itup, index) == 0 ||
+   ((itup->t_info & INDEX_ALT_TID_MASK) == 0));
}
else
{

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-04-06 Thread Peter Geoghegan
On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev  wrote:
> As far as I can see, there is no on-disk representation difference for
> *existing* indexes. So, pg_upgrade is not needed here and there isn't any new
> code to support "on-the-fly" modification. Am I right?

Yes.

I'm going to look at this again today, and will post something within
12 hours. Please hold off on committing until then.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-06 Thread Teodor Sigaev
As far as I can see, there is no on-disk representation difference for
*existing* indexes. So, pg_upgrade is not needed here and there isn't any new code
to support "on-the-fly" modification. Am I right?




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-05 Thread Peter Geoghegan
On Thu, Apr 5, 2018 at 7:59 AM, Alexander Korotkov
 wrote:
>> * btree_xlog_split() still has this code:

> Right, I think there is absolutely no need in this code.  It's removed in
> the attached patchset.

I'm now a bit nervous about always logging the high key, since that
could impact performance. I think that there is a good way to only do
it when needed. New plan:

1. Add these new fields to split record's set of xl_info fields (it
should be placed directly after XLOG_BTREE_SPLIT_R):

#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated
highkey */
#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated
highkey */

2. Within _bt_split(), restore the old "leaf vs. internal" logic, so
that the high key is only logged for internal (!isleaf) pages.
However, only log it when needed for leaf pages -- only when the new
highkey was *actually* truncated (or when it's an internal page), since
only then will it actually be different from the first item on the right
page. Also, set XLOG_BTREE_SPLIT_L_HIGHKEY instead of
XLOG_BTREE_SPLIT_L when we must log (or set XLOG_BTREE_SPLIT_R_HIGHKEY
instead of XLOG_BTREE_SPLIT_R), so that recovery actually knows that
it should restore the truncated highkey.

(Sometimes I think it would be nice to be able to do more during
recovery, but that's a much bigger issue.)

3. Restore all the master code within btree_xlog_split(), except
instead of restoring the high key when !isleaf, do so when the record
is XLOG_BTREE_SPLIT_L_HIGHKEY|XLOG_BTREE_SPLIT_R_HIGHKEY.

4. Add an assertion within btree_xlog_split(), that ensures that
internal pages never fail to have their high key logged, since there
is no reason why that should ever not happen with internal pages.

5. Fix this struct xl_btree_split comment, which commit 0c504a80 from
2017 missed when it reclaimed two xl_info status bits:

 * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
 * The _L and _R variants indicate whether the inserted tuple went into the
 * left or right split page (and thus, whether newitemoff and the new item
 * are stored or not).  The _ROOT variants indicate that we are splitting
 * the root page, and thus that a newroot record rather than an insert or
 * split record should follow.  Note that a split record never carries a
 * metapage update --- we'll do that in the parent-level update.

6. Add your own xl_btree_split comment in its place, noting the new
usage. Basically, the _ROOT sentence with a similar _HIGHKEY sentence.

7. Don't forget about btree_desc().

I'd say that there is a good chance that Anastasia is correct to think
that it isn't worth worrying about the extra WAL that her approach
implied, and that it is in fact good enough to simply always log the
left page's high key. However, it seems easier and lower risk all
around to do it this way. It doesn't leave us with ambiguity. In my
experience, *ambiguity* on design questions makes a patch miss a
release much more frequently than bugs or regressions make that
happen.

Sorry that I didn't just say this the first time I brought up
btree_xlog_split(). I didn't see the opportunity to avoid creating
more WAL until now.

> OK, I've added an assertion that the number of tuple attributes should be
> either natts or nkeyatts, because we call _bt_mkscankey() for
> pivot index tuples too.

Makes sense.

> If you're talking about these assertions
>
> Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
> indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> Assert(indnkeyatts != 0);

Actually, I was just talking about the first one,
"Assert(IndexRelationGetNumberOfAttributes(rel) != 0)". I was unclear.
Maybe it isn't worth getting rid of even the first one.

> then I would rather leave both of them.  If we know that the index tuple
> length is either natts or nkeyatts, that doesn't ensure that
> both natts and nkeyatts are non-zero.

I suppose.

> I've done so.  Tests are passing for me.

Great. I'm glad that worked out. One simple, broad rule.

> Hmm, we have four bits reserved.  But I'm not sure whether we would use
> *all* of them for non-pivot tuples.  Probably we would use some of them for
> pivot tuples.  I don't know that in advance.  Thus, I propose not to
> rename this.  But I've added a comment that non-pivot tuples might also
> use those bits.

Okay. Good enough.

> Sorry, I didn't get which particular further use of reserved bits you
> mean?
> Did you mean key normalization?

I was being unclear. I was just reiterating my point about having a
non-pivot bit. It doesn't matter, though.

> Thank you for writing that explanation.  Looks good.

I think that once you realize how INCLUDE indexes don't change pivot
tuples, and actually understand what pivot tuples are, the patch seems
a lot less scary.

> This patchset also incorporates docs enhancements by Erik Rijkers and a
> sentence which states that indexes with included columns are also called
> "covering indexes".


Re: WIP: Covering + unique indexes.

2018-04-05 Thread Alexander Korotkov
On Thu, Apr 5, 2018 at 5:02 PM, Erik Rijkers  wrote:

> On 2018-04-05 00:09, Alexander Korotkov wrote:
>
>> Thank you for review!  Revised patchset is attached.
>> [0001-Covering-core-v12.patch]
>> [0002-Covering-btree-v12.patch]
>> [0003-Covering-amcheck-v12.patch]
>> [0004-Covering-natts-v12.patch]
>>
>
> Really nice performance gains.
>
> I read through the docs and made some changes.  I hope it can count as
> an improvement.
>

Thank you for your improvements to the docs.  Your changes will be
incorporated into the new revision of the patchset, which I'm going to post today.


> It would probably also be a good idea to add the term "covering index"
> somewhere, at least in the documentation's index; the term does not
> currently occur anywhere.  (This doc-patch does not add it.)
>

I'll think about it.  Maybe we'll define "covering index" in the docs.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-04-05 Thread Erik Rijkers

On 2018-04-05 00:09, Alexander Korotkov wrote:

Hi!

Thank you for review!  Revised patchset is attached.
[0001-Covering-core-v12.patch]
[0002-Covering-btree-v12.patch]
[0003-Covering-amcheck-v12.patch]
[0004-Covering-natts-v12.patch]


Really nice performance gains.

I read through the docs and made some changes.  I hope it can count as 
an improvement.


It would probably also be a good idea to add the term "covering index" 
somewhere, at least in the documentation's index; the term does not
currently occur anywhere.  (This doc-patch does not add it.)


thanks,

Erik Rijkers
--- doc/src/sgml/ref/create_index.sgml.orig	2018-04-05 14:36:00.904617793 +0200
+++ doc/src/sgml/ref/create_index.sgml	2018-04-05 15:49:03.778805965 +0200
@@ -148,31 +148,27 @@
   INCLUDE
   

-An optional INCLUDE clause allows to specify the
-list of columns which will be included in the non-key part of the index.
-Columns listed in this clause cannot co-exist as index key columns,
-and vice versa.  The INCLUDE columns exist solely to
+The optional INCLUDE clause specifies a
+list of columns which will be included as a non-key part in the index.
+Columns listed in this clause cannot also be present as index key columns.
+The INCLUDE columns exist solely to
 allow more queries to benefit from index-only scans
-by including specified columns into the index.  Values of these columns
+by including the values of the specified columns in the index.  These values
 would otherwise have to be obtained by reading the table's heap.
-Having these columns in the INCLUDE clause
-in some cases allows PostgreSQL to skip
-the heap read completely.

 

-In the UNIQUE indexes, uniqueness is only enforced
+In UNIQUE indexes, uniqueness is only enforced
 for key columns.  Columns listed in the INCLUDE
-clause have no influence to uniqueness enforcement.  Other constraints
+clause have no effect on uniqueness enforcement.  Other constraints
 (PRIMARY KEY and EXCLUDE) work the same way.

 

-Columns listed in the INCLUDE clause doesn't need
-appropriate operator class to exist.  Therefore,
-INCLUDE clause if useful to add non-key index
-columns, whose data types don't have operator classes defined for
-given access method.
+Columns listed in the INCLUDE clause don't need
+appropriate operator classes; the clause can contain non-key index
+columns whose data types don't have operator classes defined for
+a given access method.

 

@@ -182,12 +178,12 @@
 

 Currently, only the B-tree index access method supports this feature.
-In B-tree indexes, values of columns listed in the
-INCLUDE clause are included into leaf tuples which
-are linked to the heap tuples, but aren't included into pivot tuples
+In B-tree indexes, the values of columns listed in the
+INCLUDE clause are included in leaf tuples which
+are linked to the heap tuples, but are not included into pivot tuples
 used for tree navigation.  Therefore, moving columns from the list of
 key columns to the INCLUDE clause can slightly
-reduce index size and improve tree branching factor.
+reduce index size and improve the tree branching factor.

   
  


Re: WIP: Covering + unique indexes.

2018-04-04 Thread Peter Geoghegan
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
 wrote:
> Thank you for review!  Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

/*
 * On leaf level, the high key of the left page is equal to the first key
 * on the right page.
 */
if (isleaf)
{
ItemId  hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
left_hikeysz = ItemIdGetLength(hiItemId);
}

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

> -   /* Log left page */
> -   if (!isleaf)
> -   {
> -   /*
> -* We must also log the left page's high key, because the right
> -* page's leftmost key is suppressed on non-leaf levels.  Show it
> -* as belonging to the left page buffer, so that it is not stored
> -* if XLogInsert decides it needs a full-page image of the left
> -* page.
> -*/
> -   itemid = PageGetItemId(origpage, P_HIKEY);
> -   item = (IndexTuple) PageGetItem(origpage, itemid);
> -   XLogRegisterBufData(0, (char *) item, 
> MAXALIGN(IndexTupleSize(item)));
> -   }
> +   /*
> +* We must also log the left page's high key.  There are two reasons
> +* for that: right page's leftmost key is suppressed on non-leaf 
> levels,
> +* in covering indexes, included columns are truncated from high keys.
> +* For simplicity, we don't distinguish these cases, but log the high
> +* key every time.  Show it as belonging to the left page buffer, so
> +* that it is not stored if XLogInsert decides it needs a full-page
> +* image of the left page.
> +*/
> +   itemid = PageGetItemId(origpage, P_HIKEY);
> +   item = (IndexTuple) PageGetItem(origpage, itemid);
> +   XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code? Note also that this
existing comment has been made obsolete:

/* don't release the buffer yet; we touch right page's first item below */

/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?

* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.

I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worthwhile. It's a big reason for
having BTreeTupGetNAtts().

This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.

* _bt_isequal() should assert that the IndexTuple was not truncated.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> +   /*
> +* Check tuple has correct number of attributes.
> +*/
> +   if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +   {
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INTERNAL_ERROR),
> +errmsg("tuple has wrong number of attributes in index 
> \"%s\"",
> +   RelationGetRelationName(rel))));
> +   }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parentheses here.

* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.

I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level on the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.

Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".

We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.

* We should "use" one of the 4 new status bit that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.

I guess you could say that I want us to reserve 

Re: WIP: Covering + unique indexes.

2018-04-03 Thread Peter Geoghegan
On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
 wrote:
> Great, I'm looking forward to your feedback.

I took a look at V11 (0001-Covering-core-v11.patch,
0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
0004-Covering-natts-v11.patch) today.

* What's a pivot tuple?

This is the same thing as what I call a "separator key", I think --
you're talking about the set of IndexTuples including all high keys
(including leaf level high keys), as well as internal items
(downlinks). I think that it's a good idea to have a standard word
that describes this set of keys, to formalize the two categories
(pivot tuples vs. tuples that point to the heap itself). Your word is
just as good as mine, so we can go with that.

Let's put this somewhere central. Maybe in the nbtree README, and/or
nbtree.h. Also, verify_nbtree.c should probably get some small
explanation of pivot tuples. offset_is_negative_infinity() is a nice
place to mention pivot tuples, since that already has a bit of
high-level commentary about them.

* Compiler warning:

/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45:
warning: ‘opclassTup’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
   if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
 ^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19:
note: ‘opclassTup’ was declared here
   Form_pg_opclass opclassTup;
   ^

* Your new amcheck tests should definitely use the new
"heapallindexed" option. There were a number of bugs I can remember
seeing in earlier versions of this patch that it would catch
(probably not during regression tests, but let's at least do that
much).

* The modified amcheck contrib regression tests don't actually pass. I
see these unexpected errors:

10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.

* I see that we use "- 1" with attribute number, like this:

> +/* Get number of attributes in B-tree index tuple */
> +#define BtreeTupGetNAtts(itup, index)  \
> +   ( \
> +   (itup)->t_info & INDEX_ALT_TID_MASK ? \
> +   ( \
> +   AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & 
> BT_RESERVED_OFFSET_MASK) == 0), \
> +   ItemPointerGetOffsetNumber(&(itup)->t_tid) & 
> BT_N_KEYS_OFFSET_MASK - 1 \
> +   ) \
> +   : \
> +   IndexRelationGetNumberOfAttributes(index) \
> +   )

Is this left behind from before you decided to adopt
INDEX_ALT_TID_MASK? Is it your intention here to encode
InvalidOffsetNumber() without tripping up assertions? Or is it
something else?

Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
and use ItemPointerGetOffsetNumberNoCheck() instead of
ItemPointerGetOffsetNumber(). What do you think? That would allow us
to get rid of the -1 thing, which might be nice. Just because we use
ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
offset representation does not mean we need to use it in existing
places. If existing places had a regression test failure because of
this, that would probably be due to a real bug. No?

* ISTM that the "points to index tid=(289,2)" part of the message just
shown would be a bit clearer if I didn't have to know that 2 actually
means 1 when we talk about the pointed-to offset (yeah, it will
probably become unclear in the future when we start using the reserved
offset status bits, but why not represent the low bits of the offset in a
simple/logical way?). Your new amcheck error message should spell it
out (it should say the number of attributes indicated by the offset,
if any) -- regardless of what we do about the "must apply - 1 to
offset" question.

* "Minus infinity" items do not have the new status bit
INDEX_ALT_TID_MASK set in at least some cases. They should.

* _bt_sortaddtup() should not do "trunctuple.t_info =
sizeof(IndexTupleData)", since that destroys useful information. Maybe
that's the reason for the last bug?

* Ditto for _bt_pgaddtup().

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
it? The only reason we have _bt_sortaddtup() is because we cannot
trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
change needed, because _bt_sortaddtup() knows that it's inserting on a
non-rightmost page both without this patch, and when this patch needs
to truncate and then add the high key back.

It's clear that you can 

Re: WIP: Covering + unique indexes.

2018-04-03 Thread Alexander Korotkov
On Tue, Apr 3, 2018 at 7:02 AM, Peter Geoghegan  wrote:

> On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
>  wrote:
> > I thought about that again and I decided that it would be safer
> > to use the 13th bit in the index tuple flags.  There is already an attempt to
> > use the whole 6 bytes of the tid for non-heap-pointer information [1].  Thus, it
> > would be safe to use the 13th bit for indicating an alternative offset meaning
> > in pivot tuples, because it wouldn't block further work.  The revised
> > patchset in the attachment implements it.
>
> This is definitely not the only time someone has talked about this
> 13th bit -- it's quite coveted. It also came up with UPSERT, and with
> WARM. That's just the cases that I can personally remember.
>
> I'm glad that you found a way to make this work, that will keep things
> flexible for future patches, and make testing easier. I think that we
> can find a flexible representation that makes almost everyone happy.
>

OK, good.

I didn't have time to look at this properly today, but I will try to
> do so tomorrow.
>

Great, I'm looking forward to your feedback.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-04-02 Thread Peter Geoghegan
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
 wrote:
> I thought about that again and I decided that it would be safer
> to use the 13th bit in the index tuple flags.  There is already an attempt to
> use the whole 6 bytes of the tid for non-heap-pointer information [1].  Thus, it
> would be safe to use the 13th bit for indicating an alternative offset meaning
> in pivot tuples, because it wouldn't block further work.  The revised patchset
> in the attachment implements it.

This is definitely not the only time someone has talked about this
13th bit -- it's quite coveted. It also came up with UPSERT, and with
WARM. That's just the cases that I can personally remember.

I'm glad that you found a way to make this work, that will keep things
flexible for future patches, and make testing easier. I think that we
can find a flexible representation that makes almost everyone happy.

> I don't know.  We still need an offset number to check the expected number
> of attributes.  Passing the index tuple as a separate argument would be
> redundant and open the door to extra possible errors.

You're right. I must have been tired when I wrote that. :-)

>> Do you store an attribute number in the "minus infinity" item (the
>> leftmost one of internal pages)? I guess that that should be zero,
>> because it's totally truncated.
>
>
> Yes, I store a zero number of attributes in the "minus infinity" item.  See this
> part of the patch.

> However, note that I have to store (number_of_attributes + 1) in the offset
> in order to correctly store a zero number of attributes.  Otherwise, an
> assertion is raised in the ItemPointerIsValid() macro.

Makes sense.

> Yes.  But that depends on how the difficulty of adapting the patch to the big
> picture compares with the difficulty that a non-adapted patch creates for
> that big picture.  My point was that the second difficulty isn't high.  And
> if we are satisfied with the implementation in the attached patchset
> (probably some small enhancements are still required), then the first
> difficulty isn't high either.

I think it's possible.

I didn't have time to look at this properly today, but I will try to
do so tomorrow.

Thanks
-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-01 Thread Peter Geoghegan
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
 wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
>
> Yes, GIN never stored heap TID pointers in the t_tid of an index tuple.  But GIN
> assumes that the heap TID pointer has at most 11 significant bits during
> posting list encoding.

I think that we should avoid assuming things, unless the cost of
representing them is too high, which I don't think applies here. The
more defensive general purpose code can be, the better.

I will admit to being paranoid here. But experience suggests that
paranoia is a good thing, if it isn't too expensive. Look at the
thread on XFS + fsync() for an example of things being wrong for a
very long time without anyone realizing, and despite the best efforts
of many smart people. As far as anyone can tell, PostgreSQL on Linux +
XFS is kinda, sorta broken, and has been forever. XFS was mature
before ext4 was, and is a popular choice, and yet this is the first
we're hearing about it being kind of broken. After many years.

Look at this check that made it into my amcheck patch, that was
committed yesterday:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/amcheck/verify_nbtree.c;h=a15fe21933b9a5b8baefedaa8f38e517d6c91877;hb=7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e#l745

As it says, nbtree is surprisingly tolerant of corrupt lp_len fields.
You may find it an interesting exercise to use pg_hexedit to corrupt
many lp_len fields in an index page. What's really interesting about
this is that it doesn't appear to break anything at all! We don't get
the length from there in most cases, so reads won't break at all. I
see that we use ItemIdGetLength() in a couple of rare cases (though
even those could be avoided) during a page split. You'd be lucky to
notice a problem if lp_len fields were regularly corrupt. When you
notice, it will probably have already caused big problems.

On a similar note, I've noticed that many of my experimental B-Tree
patches (that I never find time to finish) tend to almost work quite
early on, sometimes without my really understanding why. The whole L
approach of recovering from problems that were detected (detecting
concurrent page splits, and moving right) makes the code *very*
forgiving. I hope that I don't sound trite, but everyone should try to
be modest about what they *don't* know when writing complex system
software with concurrency. It is not a platitude, even though it
probably seems that way. A tiny mistake can have big consequences, so
it's very important that we have a way to easily detect them after the
fact.

> I don't think we should use assertions, because they are typically disabled
> on production PostgreSQL builds.  But we can have an explicit check in some
> common path.  In the attached patch I've added such a check to _bt_compare().
> Probably, together with amcheck, that would be sufficient.

Good idea -- a "can't happen" check in _bt_compare seems better, which
I see here:

> diff --git a/src/backend/access/nbtree/nbtsearch.c 
> b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> +   /*
> +* Check tuple has correct number of attributes.
> +*/
> +   if (!_bt_check_natts(rel, page, offnum))
> +   {
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INTERNAL_ERROR),
> +errmsg("tuple has wrong number of attributes in index 
> \"%s\"",
> +   RelationGetRelationName(rel))));
> +   }
> +
> itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

It seems like it might be a good idea to make this accept an
IndexTuple, though, to possibly save some work. Also, perhaps this
should be an unlikely() condition, if only because it makes the intent
clearer (might actually matter in a tight loop like this too, though).

Do you store an attribute number in the "minus infinity" item (the
leftmost one of internal pages)? I guess that that should be zero,
because it's totally truncated.

> OK, thank for the explanation.  I agree that check of offset is redundant
> here.

Cool.

>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
>
> I don't know.  When I wrote my own implementation of B-tree and debugged
> it, I found 

Re: WIP: Covering + unique indexes.

2018-03-31 Thread Peter Geoghegan
On Fri, Mar 30, 2018 at 6:24 AM, Alexander Korotkov
 wrote:
>> With an extreme enough case, this could result in a failure to find a
>> split point. Or at least, if that isn't true then it's not clear why,
>> and I think it needs to be explained.
>
>
> I don't think this could result in a failure to find a split point.
> So, it finds a split point without taking into account that hikey
> will be shorter.  If such split point exists then split point with
> truncated hikey should also exists.  If not, then it would be
> failure even without covering indexes.  I've updated comment
> accordingly.

You're right. We're going to truncate the unneeded trailing attributes
from whatever tuple is to the immediate right of the final split point
that we choose (that's the tuple that we'll copy to make a new high
key for the left page). Truncation already has to result in a tuple
that is less than or equal to the original tuple.

I also agree that it isn't worth trying harder to make sure that space
is distributed evenly when truncation will go ahead. It will only
matter in very rare cases, but the computational overhead of having an
accurate high key size for every candidate split point would be
significant.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-30 Thread Peter Geoghegan
On Fri, Mar 30, 2018 at 10:39 PM, Peter Geoghegan  wrote:
> It's safe, although I admit that that's a bit hard to see.
> Specifically, look at this code in _bt_insert_parent():
>
> /*
>  * Find the parent buffer and get the parent page.
>  *
>  * Oops - if we were moved right then we need to change stack item! We
>  * want to find parent pointing to where we are, right ?- vadim
>  * 05/27/97
>  */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
> Vadim doesn't seem too sure of why he did it that way. What's clear is
> that the offset on all internal pages is always P_HIKEY (that is, 1),
> because this is the one and only place where new IndexTuples get
> generated for internal pages. That's unambiguous. So how could
> BTEntrySame() truly need to care about offset? How could there ever be
> an internal page offset that wasn't just P_HIKEY? You can look
> yourself, using pg_hexedit or pageinspect.

Sorry, I meant this code, right before:

/* form an index tuple that points at the new right page */
new_item = CopyIndexTuple(ritem);
ItemPointerSet(&(new_item->t_tid), rbknum, P_HIKEY);

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-30 Thread Peter Geoghegan
On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov
 wrote:
>> I'll try it.  But I'm afraid that it's not as easy as you expect.
>
>
> So, I have some implementation of storage of the number of attributes
> inside the index tuple itself.  I made it as an additional patch on top of
> the previous patchset.
> I attach the whole patchset in order to make commitfest.cputube.org happy.

Looks like 0004-* became mangled. Can you send a version that is not
mangled, please?

> I decided not to use the 13th bit of the IndexTuple flags.  Instead I use
> only the high bit of the offset, which is also always free on regular
> tuples.  In fact, we already use the assumption that there are at most 11
> significant bits of index tuple offset in GIN (see ginpostinglist.c).

So? GIN doesn't have the same legacy at all. The GIN posting lists
*don't* have regular heap TID pointers at all. They started out
without them, and still don't have them.

> Anastasia also pointed that if we're going to do on-disk changes, they
> should be compatible not only with suffix truncation, but also with
> duplicate
> compression (which was already posted in thread [1]).

I definitely agree with that, and I think that Anastasia should push
for whatever will make future nbtree enhancements easier, especially
her own pending or planned enhancements.

> However, I think
> there is no problem.  We can use one of the 3 free bits in the offset as a
> flag that it's a tuple with a posting list.  Duplicate compression needs
> to store the number of posting list items and their offset in the tuple.
> The free bits left in the item pointer after reserving 2 bits (1 flag of
> alternative meaning of offset and 1 flag of posting list) are more than
> enough for that.

The issue that I see is that we could easily make this unambiguous,
free of any context, with a tiny bit more work. Why not just do it
that way?

Maybe it won't actually matter, but I see no reason not to do it, since we can.

> However, I find following arguments against implementing this feature
> in covering indexes.
>
>  * We write the number of attributes present into the tuple.  But how do
> we prove that it's correct?  I add appropriate checks to amcheck.  But I
> don't think all the users run amcheck frequently enough.  Thus, in order
> to be sure that it's correct, we should check that the number of
> attributes is written correctly everywhere in the B-tree code.

Use an assertion. Problem solved.

I agree that people aren't using amcheck all that much, but give it
time. Oracle and SQL Server have had tools like amcheck for 30+ years.
We have had amcheck for one year.

> Without that we can face the situation that we've
> introduced a new on-disk representation meant to enable further B-tree
> enhancements, but it's actually broken.  And that would be much worse than
> nothing.  In order to check the number of attributes everywhere in the
> B-tree code, we need to actually implement a significant part of suffix
> compression.  And I really think we shouldn't do this as part of the
> covering indexes patch.

I don't think that you need to do that, actually. I'm not asking you
to go to those lengths. I have only asked that you make the on-disk
representation *compatible* with a future Postgres version that has
full suffix truncation (and other such enhancements, too). I care
about the on-disk representation more than the code.

>  * The offset number is now used for parent refind (see the BTEntrySame()
> macro).  In the attached patch, this condition is relaxed.  But I don't
> think I really like that.  This should be thought out very carefully...

It's safe, although I admit that that's a bit hard to see.
Specifically, look at this code in _bt_insert_parent():

/*
 * Find the parent buffer and get the parent page.
 *
 * Oops - if we were moved right then we need to change stack item! We
 * want to find parent pointing to where we are, right ?- vadim
 * 05/27/97
 */
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

Vadim doesn't seem too sure of why he did it that way. What's clear is
that the offset on all internal pages is always P_HIKEY (that is, 1),
because this is the one and only place where new IndexTuples get
generated for internal pages. That's unambiguous. So how could
BTEntrySame() truly need to care about offset? How could there ever be
an internal page offset that wasn't just P_HIKEY? You can look
yourself, using pg_hexedit or pageinspect.

The comments above BTTidSame()/BTEntrySame() are actually wrong,
including "New Comments". Vadim wanted to make TIDs part of the
keyspace [1], beginning in around 1997. The idea was that we'd have
truly unique keys by including TID, as L&Y intended, but that never
happened. Instead, we got commit 9e85183bf in 2000, which among many
other things changed the L&Y invariant to deal with duplicates. I
think that Tom should have changed BTTidSame() to not care about
offset number in that same commit 

Re: WIP: Covering + unique indexes.

2018-03-29 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova
 wrote:
> Here is the new version of the patch set.
> All patches are rebased to apply without conflicts.
>
> Besides, they contain following fixes:
> - pg_dump bug is fixed
> - index_truncate_tuple() now has 3rd argument new_indnatts.
> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
> - info about opclass implementors and included columns is now in sgml doc

This only changes the arguments given to index_truncate_tuple(), which
is a superficial change. It does not actually change anything about
the on-disk representation, which is what I sought. Why is that a
problem? I don't think it's very complicated.

The patch needs a rebase, as Erik mentioned:

1 out of 19 hunks FAILED -- saving rejects to file
src/backend/utils/cache/relcache.c.rej
(Stripping trailing CRs from patch; use --binary to disable.)

I also noticed that you still haven't done anything differently with
this code in _bt_checksplitloc(), which I mentioned in April of last
year:

/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
(state->olddataitemstotal - olddataitemstoleft);

/*
 * The first item on the right page becomes the high key of the left page;
 * therefore it counts against left space as well as right space.
 */
leftfree -= firstrightitemsz;

/* account for the new item */
if (newitemonleft)
leftfree -= (int) state->newitemsz;
else
rightfree -= (int) state->newitemsz;

With an extreme enough case, this could result in a failure to find a
split point. Or at least, if that isn't true then it's not clear why,
and I think it needs to be explained.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-28 Thread Peter Eisentraut
On 1/25/18 23:19, Thomas Munro wrote:
> +  PRIMARY KEY ( <replaceable class="parameter">column_name</replaceable>
> [, ... ] ) <replaceable class="parameter">index_parameters</replaceable>
> <optional>INCLUDE
> (<replaceable class="parameter">column_name</replaceable> [,
> ...])</optional> |
>
> I hadn't seen that use of "<optional>" before.  Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.

I think this will probably not come out right in the generated psql
help.  Check that please.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: Covering + unique indexes.

2018-03-28 Thread Erik Rijkers

On 2018-03-28 16:59, Anastasia Lubennikova wrote:

Here is the new version of the patch set.


I can't get these to apply:

patch -b -l -F 25 -p 1 < 
/home/aardvark/download/pgpatches/0110/covering_indexes/20180328/0001-Covering-core-v8.patch



1 out of 19 hunks FAILED -- saving rejects to file 
src/backend/utils/cache/relcache.c.rej



$ cat src/backend/utils/cache/relcache.c.rej
--- src/backend/utils/cache/relcache.c
+++ src/backend/utils/cache/relcache.c
@@ -542,7 +542,7 @@
attp = (Form_pg_attribute) 
GETSTRUCT(pg_attribute_tuple);


if (attp->attnum <= 0 ||
-   attp->attnum > relation->rd_rel->relnatts)
+   attp->attnum > 
RelationGetNumberOfAttributes(relation))
elog(ERROR, "invalid attribute number %d for 
%s",
 attp->attnum, 
RelationGetRelationName(relation));






Erik Rijkers




Re: WIP: Covering + unique indexes.

2018-03-27 Thread Peter Geoghegan
On Tue, Mar 27, 2018 at 10:14 AM, Teodor Sigaev  wrote:
>> b) I don't like the idea of limiting usage of that field if we can avoid
>> that. Future usage could use it, for example, for different compression
>> techniques or something else. Or even removing t_tid from inner tuples to
>> save 2 bytes in IndexTupleData. Of course, I remember about alignment,
>> but it could be subject to change too in the future.

This is contradictory. You seem to be arguing that we need to preserve
on-disk compatibility for an optimization that throws out on-disk
compatibility.

Saving a single byte per internal IndexTuple is not worth it. We could
actually save 2 bytes in *all* nbtree pages, by halving the size of
ItemId for nbtree -- we don't need lp_len, which is redundant, and we
could reclaim one of the status bits too, to get back a full 16 bits.
Also, we could use suffix truncation to save at least one byte in
almost all cases, even with the thinnest possible
single-integer-attribute IndexTuples. What you describe just isn't
going to happen.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-27 Thread Peter Geoghegan
On Tue, Mar 27, 2018 at 10:07 AM, Teodor Sigaev  wrote:
> Storing the number of attributes in the now-unused t_tid seems to me not
> such a good idea. a) it could (and, I suppose, should) be a separate
> patch; at least it's not directly connected to the covering patch, and it
> could be added even before the covering patch.

I think that we should do that first. It's not very hard.

> b) I don't like the idea of limiting usage of that field if we can avoid
> that. Future usage could use it, for example, for different compression
> techniques or something else.

The extra status bits that this would leave within the offset field
can be used for that in the future.

>> * It makes diagnosing issues in the field quite a bit easier. Tools
>> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
>> amcheck specifically). The nbtree IndexTuple format should not need to
>> be interpreted in a context-sensitive way, if we can avoid it.
>
> Both pg_filedump and amcheck could correctly parse any tuple based on
> the BTP_LEAF flag and the length of the tuple.

amcheck doesn't just care about the length of the tuple. It would have
to rely on catalog metadata about this being an INCLUDE index.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-27 Thread Teodor Sigaev
b) I don't like the idea of limiting usage of that field if we can avoid
that. Future usage could use it, for example, for different compression
techniques or something else. Or even removing t_tid from inner tuples to
save 2 bytes in IndexTupleData. Of course, I remember about alignment, but
it could be subject to change too in the future.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-03-26 Thread Peter Geoghegan
On Mon, Mar 26, 2018 at 3:10 AM, Alexander Korotkov
 wrote:
> So, as I get it, you're proposing to introduce an INDEX_ALT_TID_MASK flag
> which would indicate that we're storing something special in the t_tid
> offset.  And that should help us not only for covering indexes, but also
> for further btree enhancements including suffix truncation.  What exactly
> do you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK
> flag is set?  Is it the number of attributes in this particular index
> tuple?

Yes. I think that once INDEX_ALT_TID_MASK is available, we should
store the number of attributes in that particular "separator key"
tuple (which has undergone suffix truncation), and always work off of
that. You could then have status bits in offset as follows:

* 1 bit that represents that this is a "separator key" IndexTuple
(high key or internal IndexTuple). Otherwise, it's a leaf IndexTuple
with an ordinary heap TID. (When INDEX_ALT_TID_MASK isn't set, it's
the same as today.)

* 3 reserved bits. I think that one of these bits can eventually be
used to indicate that the internal IndexTuple actually has a
"normalized key" representation [1], which seems like the best way to
do suffix truncation, long term. I think that we should support simple
suffix truncation, of the kind that this patch implements, alongside
normalized key suffix truncation. We need both for various reasons
[2].

Not sure what the other two flag bits might be used for, but they seem
worth having.

* 12 bits for the number of attributes, which should be more than
enough, even when INDEX_MAX_KEYS is significantly higher than 32. A
static assertion can keep this safe when INDEX_MAX_KEYS is set
ridiculously high.

I think that this scheme is future-proof. Maybe you have additional
ideas on the representation. Please let me know what you think.
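A standalone sketch (not PostgreSQL source) of the bit layout proposed above: one "separator key" flag, 3 reserved bits, and 12 bits for the attribute count in the 16-bit t_tid offset. All macro names here (BT_IS_SEPARATOR, BT_NATTS_MASK, etc.) are invented for illustration, and HYPOTHETICAL_INDEX_MAX_KEYS stands in for the real INDEX_MAX_KEYS.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HYPOTHETICAL_INDEX_MAX_KEYS 32

/* proposed layout of the 16-bit offset when INDEX_ALT_TID_MASK is set:
 * bit 15      -> "separator key" flag (high key or internal tuple)
 * bits 14..12 -> reserved for future use (e.g. normalized-key flag)
 * bits 11..0  -> number of attributes (0 for the fully truncated
 *                "minus infinity" item) */
#define BT_IS_SEPARATOR  0x8000u
#define BT_RESERVED_MASK 0x7000u
#define BT_NATTS_MASK    0x0FFFu

/* the suggested static assertion: 12 bits must hold INDEX_MAX_KEYS */
_Static_assert(HYPOTHETICAL_INDEX_MAX_KEYS <= BT_NATTS_MASK,
               "INDEX_MAX_KEYS too large for 12-bit natts field");

static uint16_t
pack_offset(int natts)
{
    assert(natts >= 0 && natts <= (int) BT_NATTS_MASK);
    return BT_IS_SEPARATOR | (uint16_t) natts;
}

static int
unpack_natts(uint16_t offset)
{
    assert(offset & BT_IS_SEPARATOR);
    return (int) (offset & BT_NATTS_MASK);
}

int
main(void)
{
    uint16_t off = pack_offset(3);      /* separator key with 3 attributes */

    assert(unpack_natts(off) == 3);
    assert(unpack_natts(pack_offset(0)) == 0);  /* minus infinity item */
    printf("natts = %d\n", unpack_natts(off));  /* prints natts = 3 */
    return 0;
}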

When we eventually add optimizations that affect IndexTuples on the
leaf level, we can start using the block number (bi_hi + bi_lo)
itself, much like GIN posting lists. No need to further consider that
(the leaf level optimizations) today, because using block number
provides us with many more bits.

In internal page items, the block number is always a block number, so
internal IndexTuples are rather like GIN posting tree pointers in the
main entry tree (its leaf level) -- a conventional item pointer block
number is used, alongside unconventional use of the offset field,
where there are 16 bits available because no real offset is required.

[1] 
https://wiki.postgresql.org/wiki/Key_normalization#Optimizations_enabled_by_key_normalization
[2] 
https://wiki.postgresql.org/wiki/Key_normalization#How_big_can_normalized_keys_get.2C_and_is_it_worth_it.3F
-- 
Peter Geoghegan



Re: Re: WIP: Covering + unique indexes.

2018-03-26 Thread David Steele
On 3/26/18 6:10 AM, Alexander Korotkov wrote:
> 
> So, as I get it, you're proposing to introduce an INDEX_ALT_TID_MASK flag
> which would indicate that we're storing something special in the t_tid
> offset.  And that should help us not only for covering indexes, but also
> for further btree enhancements including suffix truncation.  What exactly
> do you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK
> flag is set?  Is it the number of attributes in this particular index
> tuple?

It appears that discussion and review of this patch is ongoing so it
should not be marked Ready for Committer.  I have changed it to Waiting
on Author since there are several pending reviews and at least one bug.

Regards,
-- 
-David
da...@pgmasters.net



Re: WIP: Covering + unique indexes.

2018-03-26 Thread Alexander Korotkov
On Sun, Mar 25, 2018 at 1:47 AM, Peter Geoghegan  wrote:

> I was going to say that you could just treat the low bit in the t_tid
> offset as representing "see catalog entry". My first idea was that
> nothing would have to change about the existing format, since internal
> page items already have only the low bit set within their offset.
> However, I now see that that won't really work, because we don't
> change the offset in high keys when they're copied from a real item
> during a page split. Whatever we do, it has to work equally well for
> all "separator keys" -- that is, it must work for both downlinks in
> internal pages, and all high keys (including high keys at the leaf
> level).
>

OK.


> A good solution is to use the unused 13th t_bit. If hash can have a
> INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK.
> This avoids a BTREE_VERSION bump, and allows us to deal with the
> highkey offset issue. Actually, it's even more flexible than that --
> it can work with ordinary leaf tuples in the future, too. That is, we
> can eventually implement prefix truncation and deduplication at the
> leaf level using this representation, since there is nothing that
> limits INDEX_ALT_TID_MASK IndexTuples to "separator keys".
>
> The main difference between this approach to leaf prefix
> truncation/compression/deduplication, and the GIN entry tree's posting
> list representation would be that it wouldn't have to be
> super-optimized for duplicates, at the expense of more common case for
> regular nbtree indexes -- having few or no duplicates. A btree_gin
> index on pgbench_accounts(aid) looks very similar to an equivalent
> nbtree index if you just compare internal pages from each, but they
> look quite different at the leaf level, where GIN has 24 byte
> IndexTuples instead of 16 bytes IndexTuples. Of course, this is
> because the leaf pages have posting lists that can never be simple
> heap pointer TIDs.
>

Right, btree_gin is much smaller than regular btree when there are a lot
of duplicates.  When there are no duplicates, btree_gin becomes larger
than regular btree, because GIN stores a single item pointer less
compactly than btree.

> A secondary goal of this INDEX_ALT_TID_MASK representation should be
> that it won't even be necessary to know that an IndexTuple is
> contained within a leaf page rather than an index page (again, unlike
> GIN). I'm pretty confident that we can have a truly universal
> IndexTuple representation for nbtree, while supporting all of these
> standard optimizations.
>
> Sorry for going off in a tangent, but I think it's somewhat necessary
> to have a strategy here. Of course, we don't have to get everything
> right now, but we should be looking in this direction whenever we talk
> about on-disk nbtree changes.
>

So, as I get it, you're proposing to introduce an INDEX_ALT_TID_MASK flag
which would indicate that we're storing something special in the t_tid
offset.  And that should help us not only for covering indexes, but also
for further btree enhancements including suffix truncation.  What exactly
do you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK
flag is set?  Is it the number of attributes in this particular index
tuple?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-03-24 Thread Peter Geoghegan
On Sat, Mar 24, 2018 at 12:39 PM, Alexander Korotkov
 wrote:
> +1, putting an "nattributes" argument into index_truncate_tuple() would
> make this function way more universal.

Great.

> Originally that code was written by Teodor, but I also put my hands there.
> In the GIN entry tree, item pointers are stored in a posting list which is
> located after the index tuple attributes.  So, both the t_tid block number
> and offset are used for GIN needs.

Well, you worked on the posting list compression stuff, at least.  :-)

> That makes sense.  Let's store the number of tuple attributes in t_tid.
> Assuming that our INDEX_MAX_KEYS is a quite small number, we will have
> the higher bits of t_tid free for later use.

I was going to say that you could just treat the low bit in the t_tid
offset as representing "see catalog entry". My first idea was that
nothing would have to change about the existing format, since internal
page items already have only the low bit set within their offset.
However, I now see that that won't really work, because we don't
change the offset in high keys when they're copied from a real item
during a page split. Whatever we do, it has to work equally well for
all "separator keys" -- that is, it must work for both downlinks in
internal pages, and all high keys (including high keys at the leaf
level).

A good solution is to use the unused 13th t_bit. If hash can have a
INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK.
This avoids a BTREE_VERSION bump, and allows us to deal with the
highkey offset issue. Actually, it's even more flexible than that --
it can work with ordinary leaf tuples in the future, too. That is, we
can eventually implement prefix truncation and deduplication at the
leaf level using this representation, since there is nothing that
limits INDEX_ALT_TID_MASK IndexTuples to "separator keys".

The main difference between this approach to leaf prefix
truncation/compression/deduplication, and the GIN entry tree's posting
list representation would be that it wouldn't have to be
super-optimized for duplicates, at the expense of more common case for
regular nbtree indexes -- having few or no duplicates. A btree_gin
index on pgbench_accounts(aid) looks very similar to an equivalent
nbtree index if you just compare internal pages from each, but they
look quite different at the leaf level, where GIN has 24 byte
IndexTuples instead of 16 bytes IndexTuples. Of course, this is
because the leaf pages have posting lists that can never be simple
heap pointer TIDs.

A secondary goal of this INDEX_ALT_TID_MASK representation should be
that it won't even be necessary to know that an IndexTuple is
contained within a leaf page rather than an index page (again, unlike
GIN). I'm pretty confident that we can have a truly universal
IndexTuple representation for nbtree, while supporting all of these
standard optimizations.

Sorry for going off in a tangent, but I think it's somewhat necessary
to have a strategy here. Of course, we don't have to get everything
right now, but we should be looking in this direction whenever we talk
about on-disk nbtree changes.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-24 Thread Alexander Korotkov
On Sat, Mar 24, 2018 at 5:21 AM, Peter Geoghegan  wrote:

> On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov
>  wrote:
> >> * There is minor formatting issue in this part of code.  Some spaces
> need
> >> to be replaced with tabs.
> >> +IndexTuple
> >> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
> >> +{
> >> + TupleDesc   itupdesc =
> >> CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
> >> + Datum   values[INDEX_MAX_KEYS];
> >> + boolisnull[INDEX_MAX_KEYS];
> >> + IndexTuple newitup;
>
> The last time I looked at this patch, in April 2017, I made the point
> that we should add something like an "nattributes" argument to
> index_truncate_tuple() [1], rather than always using
> IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().
> I can see that that change hasn't been made since that time.
>

+1, putting an "nattributes" argument into index_truncate_tuple() would
make this function way more universal.

> With that approach, we can avoid relying on catalog metadata to the
> same degree, which was a specific concern that Tom had around that
> time. It's easy to do something with t_tid's offset, which is unused
> in internal page IndexTuples. We do very similar things in GIN, where
> alternative use of an IndexTuple's t_tid supports all kinds of
> enhancements, some of which were not originally anticipated. Alexander
> surely knows more about this than I do, since he wrote that code.
>

Originally that code was written by Teodor, but I also put my hands there.
In the GIN entry tree, item pointers are stored in a posting list which is
located after the index tuple attributes.  So, both the t_tid block number
and offset are used for GIN needs.

> Having this index_truncate_tuple() "nattributes" argument, and storing
> the number of attributes directly improves quite a lot of things:
>
> * It makes diagnosing issues in the field quite a bit easier. Tools
> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
> amcheck specifically). The nbtree IndexTuple format should not need to
> be interpreted in a context-sensitive way, if we can avoid it.
>
> * It lets you use index_truncate_tuple() for regular suffix truncation
> in the future. These INCLUDE IndexTuples are really just a special
> case of suffix truncation. At least, they should be, because otherwise
> an eventual suffix truncation feature is going to be incompatible with
> the INCLUDE tuple format. *Not* doing this makes suffix truncation
> harder. Suffix truncation is a classic technique, first described by
> Bayer in 1977, and we are very probably going to add it someday.
>
> * Once you can tell a truncated IndexTuple from a non-truncated one
> with little or no context, you can add defensive assertions in various
> places where they're helpful. Similarly, amcheck can use and expect
> this as a cross-check against IndexRelationGetNumberOfKeyAttributes().
> This will increase confidence in the design, both initially and over
> time.
>

That makes sense.  Let's store the number of tuple attributes in t_tid.
Assuming that our INDEX_MAX_KEYS is a quite small number, we will have the
higher bits of t_tid free for later use.

> I must say that I am disappointed that nothing has happened here,
> especially because this really wasn't very much additional work, and
> has essentially no downside. I can see that it doesn't work that way
> in the Postgres Pro fork [2], and diverging from that may
> inconvenience Postgres Pro, but that's a downside of forking. I don't
> think that the community should have to absorb that cost.
>

Sure, the community shouldn't have to take care of the Postgres Pro fork.
If we find that something is better done differently, then let us do it so.

> +Notes to Operator Class Implementors
> > +
> >
> > While I really appreciate it, it seems to be unrelated to the covering
> > indexes.
> > I'd like this to be extracted into a separate patch and be committed
> > separately.
>
> Commit 3785f7ee, from last month, moved the original "Notes to
> Operator Class Implementors" section to the SGML docs. It looks like
> that README section was accidentally reintroduced during rebasing. The
> new information ("Included attributes in B-tree indexes") should be
> moved over to the new section of the user docs -- the section that
> 3785f7ee added.
>

Thank you for noticing that.  I've overlooked that.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-03-23 Thread Peter Geoghegan
On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov
 wrote:
>> * There is minor formatting issue in this part of code.  Some spaces need
>> to be replaced with tabs.
>> +IndexTuple
>> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
>> +{
>> + TupleDesc   itupdesc =
>> CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
>> + Datum   values[INDEX_MAX_KEYS];
>> + boolisnull[INDEX_MAX_KEYS];
>> + IndexTuple newitup;

The last time I looked at this patch, in April 2017, I made the point
that we should add something like an "nattributes" argument to
index_truncate_tuple() [1], rather than always using
IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().
I can see that that change hasn't been made since that time.

With that approach, we can avoid relying on catalog metadata to the
same degree, which was a specific concern that Tom had around that
time. It's easy to do something with t_tid's offset, which is unused
in internal page IndexTuples. We do very similar things in GIN, where
alternative use of an IndexTuple's t_tid supports all kinds of
enhancements, some of which were not originally anticipated. Alexander
surely knows more about this than I do, since he wrote that code.

Having this index_truncate_tuple() "nattributes" argument, and storing
the number of attributes directly improves quite a lot of things:

* It makes diagnosing issues in the field quite a bit easier. Tools
like pg_filedump can do the right thing (Tom mentioned pg_filedump and
amcheck specifically). The nbtree IndexTuple format should not need to
be interpreted in a context-sensitive way, if we can avoid it.

* It lets you use index_truncate_tuple() for regular suffix truncation
in the future. These INCLUDE IndexTuples are really just a special
case of suffix truncation. At least, they should be, because otherwise
an eventual suffix truncation feature is going to be incompatible with
the INCLUDE tuple format. *Not* doing this makes suffix truncation
harder. Suffix truncation is a classic technique, first described by
Bayer in 1977, and we are very probably going to add it someday.

* Once you can tell a truncated IndexTuple from a non-truncated one
with little or no context, you can add defensive assertions in various
places where they're helpful. Similarly, amcheck can use and expect
this as a cross-check against IndexRelationGetNumberOfKeyAttributes().
This will increase confidence in the design, both initially and over
time.

I must say that I am disappointed that nothing has happened here,
especially because this really wasn't very much additional work, and
has essentially no downside. I can see that it doesn't work that way
in the Postgres Pro fork [2], and diverging from that may
inconvenience Postgres Pro, but that's a downside of forking. I don't
think that the community should have to absorb that cost.

> +Notes to Operator Class Implementors
> +
>
> Although I really appreciate it, it seems to be unrelated to the covering
> indexes.
> I'd like this to be extracted into a separate patch and committed
> separately.

Commit 3785f7ee, from last month, moved the original "Notes to
Operator Class Implementors" section to the SGML docs. It looks like
that README section was accidentally reintroduced during rebasing. The
new information ("Included attributes in B-tree indexes") should be
moved over to the new section of the user docs -- the section that
3785f7ee added.

[1] 
https://postgr.es/m/cah2-wzm9y59h2m6izjm4fpdup5r4bsrvzgbn2gtrco1j4nz...@mail.gmail.com
[2] 
https://github.com/postgrespro/postgrespro/blob/PGPRO9_5/src/backend/access/common/indextuple.c#L451
-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-03-22 Thread Alexander Korotkov
On Wed, Mar 21, 2018 at 9:51 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Mar 8, 2018 at 7:13 PM, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 06.03.2018 11:52, Thomas Munro:
>>
>>> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
>>>  wrote:
>>>
 Thank you for reviewing. All mentioned issues are fixed.

>>> == Applying patch 0002-covering-btree_v4.patch...
>>> 1 out of 1 hunk FAILED -- saving rejects to file
>>> src/backend/access/nbtree/README.rej
>>> 1 out of 1 hunk FAILED -- saving rejects to file
>>> src/backend/access/nbtree/nbtxlog.c.rej
>>>
>>> Can we please have a new patch set?
>>>
>>
>> Here it is.
>> Many thanks to Andrey Borodin.
>
>
> I took a look at this patchset.  I have some notes about it.
>
> * I see patch changes dblink, amcheck and tcl contribs.  It would be nice
> to add corresponding
> check to dblink and amcheck regression tests.  It would be good to do the
> same with tcn contrib.
> But tcn doesn't have regression tests at all.  And it's out of scope of
> this patch to add regression
> tests to tcn.  So, it's OK to just check that it's working correctly with
> covering indexes (I hope it's
> already done by other reviewers).
>
> * I think that subscription regression tests in
> src/test/subscription should have some use
> of covering indexes.  Logical decoding and subscription are heavily using
> primary keys.
> So they need to be tested to work correctly with covering indexes.
>
> * I also think some isolation tests in src/test/isolation need to check
> covering indexes too.
> In particular insert-conflict-*.spec and lock-*.spec and probably more.
>
> *  pg_dump doesn't handle old PostgreSQL versions correctly.  If I try to
> dump database
> of PostgreSQL 9.6, pg_dump gives me following error:
>
> pg_dump: [archiver (db)] query failed: ERROR:  column i.indnkeyatts does
> not exist
> LINE 1: ...atalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indnkeya...
>  ^
>
> In fact there is a sequence of "if" ... "else if" blocks in getIndexes()
> which selects the appropriate query depending on the remote server version.
> And for pre-11 we should use indnatts instead of indnkeyatts.
>
> * There is minor formatting issue in this part of code.  Some spaces need
> to be replaced with tabs.
> +IndexTuple
> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
> +{
> + TupleDesc   itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
> + Datum   values[INDEX_MAX_KEYS];
> + bool    isnull[INDEX_MAX_KEYS];
> + IndexTuple newitup;
>
> * I think this comment needs to be rephrased.
> + /*
> +  * Code below is concerned to the opclasses which are not used
> +  * with the included columns.
> +  */
> I would write something like this: "Code below checks opclass key type.
> Included columns
> don't have opclasses, and this check is not required for them.".  Native
> english speakers
> could provide even better phrasing though.
>
> * I would also like all the patches in patchset version to have same
> version number.
> I understand that "Covering-btree" and "Covering-amcheck" have less
> previous
> versions than "Covering-core".  But it's way easier to identify patches
> belonging to
> the same patchset version if they have same version number.  For sure,
> then some
> patches would skip some version numbers, but that doesn't seem to be a
> problem for me.
>

I have a few more notes regarding this patchset.

* indnkeyatts is described in the documentation, but I think that the
description of indnatts should also be updated, clarifying that indnatts
counts "included" columns as well.

+  
+  indnkeyatts
+  int2
+  
+  The number of key columns in the index. "Key columns" are ordinary
+  index columns (as opposed to "included" columns).
+ 

* It seems like this paragraph appears in the patchset without any mention
in the thread.

+Notes to Operator Class Implementors
+

Although I really appreciate it, it seems to be unrelated to the covering
indexes.  I'd like this to be extracted into a separate patch and committed
separately.

* There is a typo here: brtee -> btree
+ * 7. Check various AMs. All but brtee must fail.

* This comment should be updated, assuming that we now put the left page's
high key into the WAL regardless of whether it's a leaf page split.

+ /*
+ * We must also log the left page's high key, because the right
+ * page's leftmost key is suppressed on non-leaf levels.  Show it
+ * as belonging to the left page buffer, so that it is not stored
+ * if XLogInsert decides it needs a full-page image of the left
+ * page.
+ */

* get_index_def() is adjusted to support covering indexes.  I think this
support deserves to be checked in regression tests.

* In PostgreSQL, sentences are sometimes separated by single spacing,
sometimes by double spacing.  I think we should follow 

Re: WIP: Covering + unique indexes.

2018-03-21 Thread Alexander Korotkov
On Thu, Mar 8, 2018 at 7:13 PM, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> 06.03.2018 11:52, Thomas Munro:
>
>> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
>>  wrote:
>>
>>> Thank you for reviewing. All mentioned issues are fixed.
>>>
>> == Applying patch 0002-covering-btree_v4.patch...
>> 1 out of 1 hunk FAILED -- saving rejects to file
>> src/backend/access/nbtree/README.rej
>> 1 out of 1 hunk FAILED -- saving rejects to file
>> src/backend/access/nbtree/nbtxlog.c.rej
>>
>> Can we please have a new patch set?
>>
>
> Here it is.
> Many thanks to Andrey Borodin.


I took a look at this patchset.  I have some notes about it.

* I see the patch changes the dblink, amcheck and tcn contribs.  It would be
nice to add corresponding checks to the dblink and amcheck regression tests.
It would be good to do the same with the tcn contrib.  But tcn doesn't have
regression tests at all, and it's out of scope of this patch to add
regression tests to tcn.  So, it's OK to just check that it's working
correctly with covering indexes (I hope it's already done by other
reviewers).

* I think that the subscription regression tests in src/test/subscription
should have some use of covering indexes.  Logical decoding and subscription
make heavy use of primary keys, so they need to be tested to work correctly
with covering indexes.

* I also think some isolation tests in src/test/isolation need to check
covering indexes too.
In particular insert-conflict-*.spec and lock-*.spec and probably more.

* pg_dump doesn't handle old PostgreSQL versions correctly.  If I try to
dump a database of PostgreSQL 9.6, pg_dump gives me the following error:

pg_dump: [archiver (db)] query failed: ERROR:  column i.indnkeyatts does
not exist
LINE 1: ...atalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indnkeya...
 ^

In fact there is a sequence of "if" ... "else if" blocks in getIndexes()
which selects the appropriate query depending on the remote server version.
And for pre-11 we should use indnatts instead of indnkeyatts.

* There is a minor formatting issue in this part of the code.  Some spaces
need to be replaced with tabs.
+IndexTuple
+index_truncate_tuple(Relation idxrel, IndexTuple olditup)
+{
+ TupleDesc   itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
+ Datum   values[INDEX_MAX_KEYS];
+ bool    isnull[INDEX_MAX_KEYS];
+ IndexTuple newitup;

* I think this comment needs to be rephrased.
+ /*
+  * Code below is concerned to the opclasses which are not used
+  * with the included columns.
+  */
I would write something like this: "Code below checks opclass key type.
Included columns
don't have opclasses, and this check is not required for them.".  Native
English speakers could provide even better phrasing though.

* I would also like all the patches in a patchset version to have the same
version number.  I understand that "Covering-btree" and "Covering-amcheck"
have fewer previous versions than "Covering-core".  But it's way easier to
identify patches belonging to the same patchset version if they have the
same version number.  For sure, then some patches would skip some version
numbers, but that doesn't seem to be a problem for me.
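The pre-11 pg_dump fallback suggested above would presumably follow the usual
getIndexes() pattern of gating the query text on the remote version and
aliasing the old column name; a simplified illustration (the query fragments
here are assumptions for illustration, not pg_dump's real strings):

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified illustration of pg_dump-style version gating: choose the
 * query fragment by remote server version, aliasing the old column so
 * later code can always read "indnkeyatts".  Not pg_dump's real code.
 */
static const char *
index_natts_fragment(int remote_version)
{
	if (remote_version >= 110000)
		return "i.indnkeyatts";				/* PostgreSQL 11+ */
	else
		return "i.indnatts AS indnkeyatts";	/* pre-11 servers */
}
```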

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

2018-03-06 Thread Thomas Munro
On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
 wrote:
> Thank you for reviewing. All mentioned issues are fixed.

== Applying patch 0002-covering-btree_v4.patch...
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/access/nbtree/README.rej
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/access/nbtree/nbtxlog.c.rej

Can we please have a new patch set?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

2018-01-25 Thread Thomas Munro
On Fri, Jan 26, 2018 at 3:01 AM, Anastasia Lubennikova
 wrote:
> Thanks for the reminder. Rebased patches are attached.

This is a really cool and also difficult feature.  Thanks for working
on it!  Here are a couple of quick comments on the documentation,
since I noticed it doesn't build:

SGML->XML change: (1) empty closing tags "</>" are no longer accepted,
(2) <xref linkend="foo"> now needs to be written <xref linkend="foo"/> and
(3) xref IDs are now case-sensitive.

+  PRIMARY KEY ( column_name [, ... ] ) index_parameters INCLUDE (column_name [, ...]) |

I hadn't seen that use of "<optional>" before.  Almost everywhere else
we use explicit [ and ] characters, but I see that there are other
examples, and it is rendered as [ and ] in the output.  OK, cool, but
I think there should be some extra whitespace so that it comes out as:

  [ INCLUDE ... ]

instead of:

  [INCLUDE ...]

to fit with the existing convention.

+... This also allows UNIQUE indexes to be defined on
+one set of columns, which can include another set of columns in the
+INCLUDE clause, on which the uniqueness is not enforced.
+It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
+also can be used for non-unique indexes as any columns which are not required
+for the searching or ordering of records can be used in the
+INCLUDE clause, which can slightly reduce the size of the index.

Can I suggest rewording these three sentences a bit?  Just an idea:

UNIQUE indexes, PRIMARY KEY
constraints and EXCLUDE constraints can be defined
with extra columns in an INCLUDE clause, in which
case uniqueness is not enforced for the extra columns.  Moving columns
that are not needed for searching, ordering or uniqueness into the
INCLUDE clause can sometimes reduce the size of the
index while retaining the possibility of using a faster index-only
scan.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

2018-01-21 Thread Andrey Borodin
I feel sorry for the noise, switching this patch back and forth. But the patch 
needs a rebase again. It still applies with -3, but does not compile anymore.

Best regards, Andrey Borodin.

The new status of this patch is: Waiting on Author


Re: [HACKERS] WIP: Covering + unique indexes.

2018-01-21 Thread Andrey Borodin

> 21 Jan 2018, at 3:36, Peter Geoghegan  wrote:
> 
> On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin  wrote:
>> Unfortunately, amcheck_next does not work currently on HEAD (there are 
>> problems with AllocSetContextCreate() signature), but I've tested 
>> bt_index_check() before, during and after pgbench, on primary and on slave. 
>> Also, I've checked bt_index_parent_check() on master.
> 
> I fixed that recently. It should be fine now.
Oh, sorry, I missed that I was using a stale patched amcheck_next. Thanks!
Affirmative, amcheck_next works fine.

I ran pgbench against several covering indexes, checking before, during and 
after the load, both on master and slave.
I did not observe any errors besides an infrequent "canceling statement due to 
conflict with recovery", which is not a sign of any malfunction.

Best regards, Andrey Borodin.


Re: [HACKERS] WIP: Covering + unique indexes.

2018-01-20 Thread Peter Geoghegan
On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin  wrote:
> Unfortunately, amcheck_next does not work currently on HEAD (there are 
> problems with AllocSetContextCreate() signature), but I've tested 
> bt_index_check() before, during and after pgbench, on primary and on slave. 
> Also, I've checked bt_index_parent_check() on master.

I fixed that recently. It should be fine now.

-- 
Peter Geoghegan



Re: [HACKERS] WIP: Covering + unique indexes.

2018-01-18 Thread Andrey Borodin
Hi!
> 18 Jan 2018, at 18:57, Anastasia Lubennikova 
>  wrote:
> 
> What is amcheck_next ?
amcheck_next is the external version of amcheck, maintained by Peter G. on his 
GitHub. It checks one more thing: that every heap tuple has a twin in the 
B-tree, the so-called heapallindexed check.
Version V3 of your patch was checked with heapallindexed and passed the test, 
both on master and on slave.
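For readers unfamiliar with it, the heapallindexed idea can be sketched with a
toy Bloom filter (an illustration of the concept only, not amcheck's actual
code): index tuples are fingerprinted into the filter while scanning the
index, and probing with a key rebuilt from each heap row can then prove
absence, since Bloom filters never give false negatives.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy Bloom filter illustrating the heapallindexed concept: every index
 * tuple's key is hashed into the filter; a definite miss when probing
 * with a heap tuple's key proves the heap tuple has no twin in the index.
 */
#define BLOOM_NBITS	1024
#define BLOOM_NHASH	3

typedef struct Bloom
{
	uint8_t		bits[BLOOM_NBITS / 8];
} Bloom;

static uint32_t
bloom_hash(uint32_t key, uint32_t seed)
{
	uint32_t	h = (key + seed) * 2654435761u;	/* multiplicative hash */

	h ^= h >> 16;
	return h % BLOOM_NBITS;
}

/* Fingerprint one index tuple's key into the filter. */
static void
bloom_add(Bloom *b, uint32_t key)
{
	for (uint32_t seed = 0; seed < BLOOM_NHASH; seed++)
	{
		uint32_t	bit = bloom_hash(key, seed);

		b->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
	}
}

/* Returns 0 only when the key is definitely absent (a corruption hit). */
static int
bloom_maybe_contains(const Bloom *b, uint32_t key)
{
	for (uint32_t seed = 0; seed < BLOOM_NHASH; seed++)
	{
		uint32_t	bit = bloom_hash(key, seed);

		if (!(b->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}
```

Because the check only needs set membership, it doesn't care about the index
structure at all, which is why it composes well with new tuple formats.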

>> During bt_index_check()  test from time to time I was observing
>> ERROR:  canceling statement due to conflict with recovery
>> DETAIL:  User query might have needed to see row versions that must be 
>> removed.
>> 
> 
> Sorry, I forgot  to attach the amcheck fix to the previous message.
No problem; of course I had fixed that before testing.
> Now all the patches are in attachment.
> Could you recheck if the error is still there?
No need to do that; I was checking exactly the same codebase.
And that error has nothing to do with your patch: amcheck cannot always 
perform bt_index_parent_check() on a slave when the master is heavily loaded. 
It's OK; I reported this error just to be 100% precise about what I observed.

Thanks for working on this feature, hope to see it in 11.

Best regards, Andrey Borodin.

Re: [HACKERS] WIP: Covering + unique indexes.

2018-01-17 Thread Andrey Borodin
Hi!
> 16 Jan 2018, at 21:50, Anastasia Lubennikova 
>  wrote:
> 
> Updated patches are attached.
> 
Cool, thanks!

I've looked into the code, but haven't found anything broken.
Since I had tried to rebase the patch myself and failed on the parser utils, 
I spent some cycles trying to break parsing.
One minor complaint (no need to fix).
This is fine:
x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,upper(filler));
ERROR:  expressions are not supported in included columns
But why not the same error here? The previous message is very descriptive.
x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,aid+1);
ERROR:  syntax error at or near "+"
This works, but should not, IMHO:
x4mmm=# create index on pgbench_accounts (bid) include (aid,aid,aid);
CREATE INDEX
I don't know what this is...
# create index on pgbench_accounts (bid) include (aid desc, aid asc);
CREATE INDEX

All these things allow foot-shooting with a small caliber, but do not break big 
things.

Unfortunately, amcheck_next does not work currently on HEAD (there are problems 
with AllocSetContextCreate() signature), but I've tested bt_index_check() 
before, during and after pgbench, on primary and on slave. Also, I've checked 
bt_index_parent_check() on master.

During the bt_index_check() test, from time to time I was observing
ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.

[install]check[-world] passed :)

From my POV, the patch is in good shape.
I think it is time to make the patch ready for committer again.

Best regards, Andrey Borodin.


Re: [HACKERS] WIP: Covering + unique indexes.

2018-01-08 Thread Andrey Borodin
Hello!

The patch does not apply currently.
Anastasia, can you please rebase the patch?

Best regards, Andrey Borodin.



Re: [HACKERS] WIP: Covering + unique indexes.

2017-12-04 Thread Andrey Borodin

> 30 Nov 2017, at 23:07, Andrey Borodin  wrote:
> 
> Seems like patching it was not a big deal; I've fixed those bits (see 
> attachment). 
> I've done only simple tests as for now, but I'm planning to do better testing 
> before next CF.
> Thanks for mentioning "heapallindexed", I'll use it too.

I've tested the patch with the fixed amcheck (including the "heapallindexed" 
feature); the tests included bulk index creation, pgbenching, and amcheck of 
the index itself and of the WAL-replicated index.
Everything worked fine.

Spotted one more typo:
> Since 10.0 there is an optional INCLUDE clause
should be
> Since 11.0 there is an optional INCLUDE clause

I think that patch set (two patches + 1 amcheck diff) is ready for committer.

Best regards, Andrey Borodin.


Re: [HACKERS] WIP: Covering + unique indexes.

2017-11-30 Thread Andrey Borodin
Hi, Peter!
> 29 Nov 2017, at 8:45, Peter Geoghegan  wrote:
> 
> It looks like amcheck needs to be patched -- a simple oversight.
> amcheck is probably calling _bt_compare() without realizing that
> internal pages don't have the extra attributes (just leaf pages,
> although they should also not participate in comparisons in respect of
> included/extra columns). There were changes to amcheck at one point in
> the past. That must have slipped through again. I don't think it's
> that complicated.
> 
> BTW, it would probably be a good idea to use the new Github version's
> "heapallindexed" verification [1] for testing this patch. Anastasia
> will need to patch the externally maintained amcheck to do this, but
> it's probably no extra work, because this is already needed for
> contrib/amcheck, and because the heapallindexed check doesn't actually
> care about index structure at all.

Seems like patching it was not a big deal; I've fixed those bits (see 
attachment). 
I've done only simple tests as for now, but I'm planning to do better testing 
before next CF.
Thanks for mentioning "heapallindexed", I'll use it too.

Best regards, Andrey Borodin.


amcheck_include.diff
Description: Binary data


Re: [HACKERS] WIP: Covering + unique indexes.

2017-11-28 Thread Michael Paquier
On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin  wrote:
> Postgres crashes:
> TRAP: FailedAssertion("!(((const void*)(scankey) != ((void*)0)) && 
> (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>
> May be I'm doing something wrong or amcheck support will go with different 
> patch?

Usually amcheck complaining is a sign of other problems. I am marking
this patch as returned with feedback for now as no updates have been
provided after two weeks.
-- 
Michael