Re: WIP: Covering + unique indexes.
I'm wondering what's the genesis of this coninclude column actually. As far as I can tell, the only reason this column is there, is to be able to print the INCLUDE clause in a UNIQUE/PK constraint in ruleutils ... but surely the same list can be obtained from the pg_index.indkey instead?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: Covering + unique indexes.
Thank you, pushed

Peter Geoghegan wrote:
> On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev wrote:
>> Thank you, pushed.
>
> Thanks. I saw another preexisting issue, this time one that has been
> around since 2007. Commit bc292937 forgot to remove a comment above
> _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the
> bottom of _bt_findinsertloc(), where it remains today).
>
> The attached patch fixes this, and in passing mentions the fact that
> _bt_insertonpg() only performs retail insertions, and specifically never
> inserts high key items. I don't think it's necessary to add something
> about negative infinity items to the same comment block. While it's true
> that _bt_insertonpg() cannot truncate downlinks to make new minus
> infinity items, I see that as a narrower issue.

-- 
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev wrote:
> Thank you, pushed.

Thanks. I saw another preexisting issue, this time one that has been around since 2007. Commit bc292937 forgot to remove a comment above _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the bottom of _bt_findinsertloc(), where it remains today).

The attached patch fixes this, and in passing mentions the fact that _bt_insertonpg() only performs retail insertions, and specifically never inserts high key items. I don't think it's necessary to add something about negative infinity items to the same comment block. While it's true that _bt_insertonpg() cannot truncate downlinks to make new minus infinity items, I see that as a narrower issue.

-- 
Peter Geoghegan

From dced00be29775965a45c7889ca99e19b96d9e4d0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Wed, 18 Apr 2018 22:38:32 -0700
Subject: [PATCH] Adjust _bt_insertonpg() comments.

Remove an obsolete reference to the 'afteritem' argument, which was
removed by commit bc292937.  Add a comment that clarifies how
_bt_insertonpg() indirectly handles the insertion of high key items.

Author: Peter Geoghegan
---
 src/backend/access/nbtree/nbtinsert.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index dbd5c92..ecf4e53 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -817,18 +817,18 @@ _bt_findinsertloc(Relation rel,
  *		insertion, and the buffer must be pinned and write-locked.  On return,
  *		we will have dropped both the pin and the lock on the buffer.
  *
- *		When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
+ *		This routine only performs retail tuple insertions.  'itup' should
+ *		always be either a non-highkey leaf item, or a downlink (new high
+ *		key items are created indirectly, when a page is split).  When
+ *		inserting to a non-leaf page, 'cbuf' is the left-sibling of the page
+ *		we're inserting the downlink for.  This function will clear the
  *		INCOMPLETE_SPLIT flag on it, and release the buffer.
  *
  *		The locking interactions in this code are critical.  You should
  *		grok Lehman and Yao's paper before making any changes.  In addition,
  *		you need to understand how we disambiguate duplicate keys in this
  *		implementation, in order to be able to find our location using
- *		L&Y "move right" operations.  Since we may insert duplicate user
- *		keys, and since these dups may propagate up the tree, we use the
- *		'afteritem' parameter to position ourselves correctly for the
- *		insertion on internal pages.
+ *		L&Y "move right" operations.
  *----------
  */
 static void
-- 
2.7.4
Re: WIP: Covering + unique indexes.
Thank you, pushed.

> Actually, I see one tiny issue with extra '*' characters here:
>
> +        * The number of attributes won't be explicitly represented if the
> +        * negative infinity tuple was generated during a page split that
> +        * occurred with a version of Postgres before v11.  There must be a
> +        * problem when there is an explicit representation that is
> +        * non-zero, * or when there is no explicit representation and the
> +        * tuple is * evidently not a pre-pg_upgrade tuple.
>
> I also suggest fixing this indentation before commit:
>
> +       /*
> +*Cannot leak memory here, TupleDescCopy() doesn't allocate any
> +* inner structure, so, plain pfree() should clean all allocated memory
> +*/

fixed

-- 
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Wed, Apr 18, 2018 at 1:45 PM, Peter Geoghegan wrote:
> I suggest committing this patch as-is.

Actually, I see one tiny issue with extra '*' characters here:

> +        * The number of attributes won't be explicitly represented if the
> +        * negative infinity tuple was generated during a page split that
> +        * occurred with a version of Postgres before v11.  There must be a
> +        * problem when there is an explicit representation that is
> +        * non-zero, * or when there is no explicit representation and the
> +        * tuple is * evidently not a pre-pg_upgrade tuple.

I also suggest fixing this indentation before commit:

> +       /*
> +*Cannot leak memory here, TupleDescCopy() doesn't allocate any
> +* inner structure, so, plain pfree() should clean all allocated memory
> +*/

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Wed, Apr 18, 2018 at 1:32 PM, Teodor Sigaev wrote:
>> I don't understand. We do check the number of attributes on rightmost
>> pages, but we do so separately, in the main loop. For every item that
>> isn't the high key.
>
> Comment added, pls, verify. And refactored _bt_check_natts(), I hope, now
> it's a bit more readable.

The new comment looks good. Now I understand what you meant about _bt_check_natts(). And, I agree that this is an improvement -- the extra verbosity is worth it.

> I didn't do that in v1, sorry, I was unclear. Attached patch contains all
> changes suggested in my previous email.

The new BTreeTupSetNAtts() assertion looks good to me.

I suggest committing this patch as-is. Thank you

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
> I don't understand. We do check the number of attributes on rightmost
> pages, but we do so separately, in the main loop. For every item that
> isn't the high key.

Comment added, pls, verify. And refactored _bt_check_natts(), I hope, now it's a bit more readable.

>> 4) BTreeTupSetNAtts - seems, it's better to add check of nattrs to fits to
>> BT_N_KEYS_OFFSET_MASK mask, and it should not touch BT_RESERVED_OFFSET_MASK
>> bits, now it will overwrite that bits.
>
> An assertion sounds like it would be an improvement, though I don't see
> that in the patch you posted.

I didn't do that in v1, sorry, I was unclear. Attached patch contains all changes suggested in my previous email.

-- 
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index be0206d58e..1a605f9944 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -698,6 +698,9 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
+ * - That tuples report that they have the expected number of attributes.
+ *	 INCLUDE index pivot tuples should not contain non-key attributes.
+ *
  * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
@@ -722,43 +725,35 @@ bt_target_page_check(BtreeCheckState *state)
 	elog(DEBUG2, "verifying %u items on %s block %u", max,
 		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
 
-	/* Check the number of attributes in high key if any */
-	if (!P_RIGHTMOST(topaque))
+	/*
+	 * Check the number of attributes in high key. Note, rightmost page
+	 * doesn't contain a high key, so nothing to check
+	 */
+	if (!P_RIGHTMOST(topaque) &&
+		!_bt_check_natts(state->rel, state->target, P_HIKEY))
 	{
-		if (!_bt_check_natts(state->rel, state->target, P_HIKEY))
-		{
-			ItemId		itemid;
-			IndexTuple	itup;
-			char	   *itid,
-					   *htid;
+		ItemId		itemid;
+		IndexTuple	itup;
 
-			itemid = PageGetItemId(state->target, P_HIKEY);
-			itup = (IndexTuple) PageGetItem(state->target, itemid);
-			itid = psprintf("(%u,%u)", state->targetblock, P_HIKEY);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+		itemid = PageGetItemId(state->target, P_HIKEY);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
 
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("wrong number of index tuple attributes for index \"%s\"",
-							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.",
-										itid,
-										BTreeTupGetNAtts(itup, state->rel),
-										P_ISLEAF(topaque) ? "heap" : "index",
-										htid,
-										(uint32) (state->targetlsn >> 32),
-										(uint32) state->targetlsn)));
-		}
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("wrong number of high key index tuple attributes in index \"%s\"",
+						RelationGetRelationName(state->rel)),
+				 errdetail_internal("Index block=%u natts=%u block type=%s page lsn=%X/%X.",
+									state->targetblock,
+									BTreeTupleGetNAtts(itup, state->rel),
+									P_ISLEAF(topaque) ? "heap" : "index",
+									(uint32) (state->targetlsn >> 32),
+									(uint32) state->targetlsn)));
 	}
 
 	/*
 	 * Loop over page items, starting from first non-highkey item, not high
-	 * key (if any). Also, immediately skip "negative infinity" real item (if
-	 * any).
+	 * key (if any). Most tests are not performed for the "negative infinity"
+	 * real item (if any).
 	 */
 	for (offset = P_FIRSTDATAKEY(topaque);
 		 offset <= max;
@@ -791,7 +786,7 @@ bt_target_page_check(BtreeCheckState *state)
 							tupsize, ItemIdGetLength(itemid),
 							(uint32) (state->targetlsn >> 32),
 							(uint32) state->targetlsn),
-					 errhint("This could be a torn page problem")));
+					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
 		if (!_bt_check_natts(state->rel, state->target, offset))
@@ -806,11 +801,11 @@ bt_target_page_check(BtreeCheckState *state)
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("wrong number of index tuple attributes for index \"%s\"",
+					 errmsg("wrong number of index tuple attributes in index \"%s\"",
 							RelationGetRelationName(state->rel)),
 					 errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.",
 										itid,
-										BTreeTupGetNAtts(itup, state->rel),
+										BTreeTupleGetNAtts(itup, state->rel),
 										P_ISLEAF(topaque) ? "heap" : "index",
 										htid,
 										(uint32) (state->targetlsn >> 32),
@@ -818,8 +813,8 @@ bt_target_page_check(BtreeCheckState *state)
 		}
 
 		/*
-		 * Don't try to generate scankey using
Re: WIP: Covering + unique indexes.
On Wed, Apr 18, 2018 at 10:10 AM, Teodor Sigaev wrote:
> I mostly agree with your patch, nice work, but I have some notices for your
> patch:

Thanks.

> 1)
> bt_target_page_check():
>         if (!P_RIGHTMOST(topaque) &&
>             !_bt_check_natts(state->rel, state->target, P_HIKEY))
>
> Seems not very obvious: it looks like we don't need to check nattrs on
> rightmost page. Okay, I remember that on rightmost page there is no hikey at
> all, but at least comment should added. Implicitly bt_target_page_check()
> already takes into account 'is page rightmost or not?' by using
> P_FIRSTDATAKEY, so, may be better to move rightmost check into
> bt_target_page_check() with some refactoring if-logic:

I don't understand. We do check the number of attributes on rightmost pages, but we do so separately, in the main loop. For every item that isn't the high key.

This code appears before the main bt_target_page_check() loop because we're checking the high key itself, on its own, which is a new thing. The high key is also involved in the loop (on non-rightmost pages), but that's only because we check real items *against* the high key (we assume the high key is good and that the item might be bad). The high key is involved in every iteration of the main loop (on non-rightmost pages), rather than getting its own loop.

That said, I am quite happy if you want to put a comment about this being the rightmost page at the beginning of the check.

> 2)
> Style notice:
>         ItemPointerSetInvalid(&trunctuple.t_tid);
> +       BTreeTupSetNAtts(&trunctuple, 0);
>         if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData),
>                         P_HIKEY,
>
> It's better to have blank line between BTreeTupSetNAtts() and if clause.

Sure.

> 3) Naming BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use
> full Tuple word in dowlink macroses, here we use just Tup. Seems, better to
> have Tuple in both cases. Or Tup, but still in both cases.

+1

> 4) BTreeTupSetNAtts - seems, it's better to add check of nattrs to fits to
> BT_N_KEYS_OFFSET_MASK mask, and it should not touch BT_RESERVED_OFFSET_MASK
> bits, now it will overwrite that bits.

An assertion sounds like it would be an improvement, though I don't see that in the patch you posted.

> Attached patch is rebased to current head and contains some comment
> improvement in index_truncate_tuple() - you save some amount of memory with
> TupleDescCopy() call but didn't explain why pfree is enough to free all
> allocated memory.

Makes sense.

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
I mostly agree with your patch, nice work, but I have some notices for your patch:

1)
bt_target_page_check():
        if (!P_RIGHTMOST(topaque) &&
            !_bt_check_natts(state->rel, state->target, P_HIKEY))

Seems not very obvious: it looks like we don't need to check nattrs on rightmost page. Okay, I remember that on rightmost page there is no hikey at all, but at least comment should added. Implicitly bt_target_page_check() already takes into account 'is page rightmost or not?' by using P_FIRSTDATAKEY, so, may be better to move rightmost check into bt_target_page_check() with some refactoring if-logic:

        if (offset > maxoff)
                return true;    /* nothing to check, also covers empty rightmost page */

        if (P_ISLEAF)
        {
                if (offnum >= P_FIRSTDATAKEY)
                        ...
                else /* if (offnum == P_HIKEY) */
                        ...
        }
        else /* !P_ISLEAF */
        {
                if (offnum == P_FIRSTDATAKEY)
                        ...
                else if (offnum > P_FIRSTDATAKEY)
                        ...
                else /* if (offnum == P_HIKEY) */
                        ...
        }

I see it's possible only 3 nattrs value: 0, nkey and nkey+nincluded, but collapsing if-clause to three branches causes difficulties for code readers. Let compiler optimize that. Sorry for late notice, but it takes my attention only when I noticed (!P_RIGHTMOST && !_bt_check_natts) condition.

2)
Style notice:
        ItemPointerSetInvalid(&trunctuple.t_tid);
+       BTreeTupSetNAtts(&trunctuple, 0);
        if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData),
                        P_HIKEY,

It's better to have blank line between BTreeTupSetNAtts() and if clause.

3) Naming BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use full Tuple word in dowlink macroses, here we use just Tup. Seems, better to have Tuple in both cases. Or Tup, but still in both cases.

4) BTreeTupSetNAtts - seems, it's better to add check of nattrs to fits to BT_N_KEYS_OFFSET_MASK mask, and it should not touch BT_RESERVED_OFFSET_MASK bits, now it will overwrite that bits.
Attached patch is rebased to current head and contains some comment improvement in index_truncate_tuple() - you save some amount of memory with TupleDescCopy() call but didn't explain why pfree is enough to free all allocated memory.

Peter Geoghegan wrote:
> On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov wrote:
>> Hmm, what do you think about making BTreeTupGetNAtts() take tupledesc
>> argument, not relation? It anyway doesn't need number of key attributes,
>> only total number of attributes. Then _bt_isequal() would be able to use
>> BTreeTupGetNAtts().
>
> That would make the BTreeTupGetNAtts() assertions quite a bit more
> verbose, since there is usually no existing tuple descriptor variable,
> but there is almost always a "rel" variable. The coverage within
> _bt_isequal() does not seem important, because we only use it with the
> page high key in rare cases, where _bt_moveright() will already have
> tested the highkey.
>
>> I think it's completely OK to fix broken things when you've to touch
>> them. Probably, Teodor would decide to make that by separate commit.
>> So, it's up to him.
>
> You're right to say that this old negative infinity tuple code within
> _bt_insertonpg() is broken code, and not just dead code. The code
> doesn't just confuse things (e.g. see recent commit 2a67d644). It also
> seems like it could actually be harmful. This is code that could only
> ever corrupt your database.
>
> I'm fine if Teodor wants to commit that change separately, of course.

-- 
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index be0206d58e..7efd7ac47b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -698,6 +698,9 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
+ * - That tuples report that they have the expected number of attributes.
+ *	 INCLUDE index pivot tuples should not contain non-key attributes.
+ *
  * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
@@ -722,43 +725,32 @@ bt_target_page_check(BtreeCheckState *state)
 	elog(DEBUG2, "verifying %u items on %s block %u", max,
 		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
 
-	/* Check the number of attributes in high key if any */
-	if (!P_RIGHTMOST(topaque))
+	/* Check the number of attributes in high key */
+	if (!P_RIGHTMOST(topaque) &&
+		!_bt_check_natts(state->rel, state->target, P_HIKEY))
 	{
-		if (!_bt_check_natts(state->rel, state->target, P_HIKEY))
-		{
-			ItemId		itemid;
-			IndexTuple	itup;
-			char	   *itid,
-					   *htid;
+		ItemId		itemid;
+		IndexTuple	itup;
 
-			itemid = PageGetItemId(state->target, P_HIKEY);
-			itup = (IndexTuple) PageGetItem(state->target, itemid);
-			itid = psprintf("(%u,%u)",
Re: WIP: Covering + unique indexes.
On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov wrote:
> Hmm, what do you think about making BTreeTupGetNAtts() take tupledesc
> argument, not relation? It anyway doesn't need number of key attributes,
> only total number of attributes. Then _bt_isequal() would be able to use
> BTreeTupGetNAtts().

That would make the BTreeTupGetNAtts() assertions quite a bit more verbose, since there is usually no existing tuple descriptor variable, but there is almost always a "rel" variable. The coverage within _bt_isequal() does not seem important, because we only use it with the page high key in rare cases, where _bt_moveright() will already have tested the highkey.

> I think it's completely OK to fix broken things when you've to touch
> them. Probably, Teodor would decide to make that by separate commit.
> So, it's up to him.

You're right to say that this old negative infinity tuple code within _bt_insertonpg() is broken code, and not just dead code. The code doesn't just confuse things (e.g. see recent commit 2a67d644). It also seems like it could actually be harmful. This is code that could only ever corrupt your database.

I'm fine if Teodor wants to commit that change separately, of course.

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Mon, Apr 16, 2018 at 1:05 AM, Peter Geoghegan wrote:
> Attached patch makes the changes that I talked about, and a few
> others. The commit message has full details. The general direction of
> the patch is that it documents our assumptions, and verifies them in
> more cases. Most of the changes I've made are clear improvements,
> though in a few cases I've made changes that are perhaps more
> debatable.

Great, thank you very much!

> These other, more debatable cases are:
>
> * The comments added to _bt_isequal() about suffix truncation may not
> be to your taste. The same is true of the way that I restored the
> previous _bt_isequal() function signature. (Yes, I want to change it
> back despite the fact that I was the person that originally suggested
> we change _bt_isequal().)

Hmm, what do you think about making BTreeTupGetNAtts() take tupledesc argument, not relation? It anyway doesn't need number of key attributes, only total number of attributes. Then _bt_isequal() would be able to use BTreeTupGetNAtts().

> * I added BTreeTupSetNAtts() calls to a few places that don't truly
> need them, such as the point where we generate a dummy 0-attribute
> high key within _bt_mark_page_halfdead(). I think that we should try
> to be as consistent as possible about using BTreeTupSetNAtts(), to set
> a good example. I don't think it's necessary to use BTreeTupSetNAtts()
> for pivot tuples when the number of key attributes matches indnatts
> (it seems inconvenient to have to palloc() our own scratch buffer to
> do this when we don't have to), but that doesn't apply to these
> now-covered cases.

+1

>> I think, we need move _bt_check_natts() and its call under
>> USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
>> pay for unused feature.
>
> I eventually decided that you were right about this, and made the
> _bt_compare() call to _bt_check_natts() a simple assertion without
> waiting to hear more opinions on the matter. Concurrency isn't a
> factor here, so adding a check to standard release builds isn't
> particularly likely to detect bugs. Besides, there is really only a
> small number of places that need to do truncation for themselves. And,
> if you want to be sure that the structure is consistent in the field,
> there is always amcheck, which can check _bt_check_natts() (while also
> checking other things that we care about just as much).

Good point, risk of performance degradation caused by _bt_check_natts() in _bt_compare() is high. So, let's move it to an assertion.

> Note that I removed some dead code from _bt_insertonpg() that wasn't
> added by the INCLUDE patch. It confused matters for this patch, since
> we don't want to consider what's supposed to happen when there is a
> retail insertion of a new, second negative infinity item -- clearly,
> that should simply never happen (I thought about adding a
> BTreeTupSetNAtts() call, but then decided to just remove the dead code
> and add a new "can't happen" elog error).

I think it's completely OK to fix broken things when you've to touch them. Probably, Teodor would decide to make that by separate commit. So, it's up to him.

> Finally, I made sure that we don't drop all tables in the regression
> tests, so that we have some pg_dump coverage for INCLUDE indexes, per
> a request from Tom.

Makes sense, because that've already appeared to be broken.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On Thu, Apr 12, 2018 at 9:21 AM, Teodor Sigaev wrote:
> Agree, I prefer to add more Assert, even, may be, more than actually needed.
> Assert-documented code :)

Absolutely. The danger with a feature like this is that we'll miss one place. I suppose that you could say that I am in the Poul-Henning Kamp camp on assertions [1].

>> I'll add an item to "Decisions to Recheck Mid-Beta" section of the
>> open items page for this patch. We should review the decision to make
>> a call to _bt_check_natts() within _bt_compare(). It might work just
>> as well as an assertion, and it would be unfortunate if workloads that
>> don't use covering indexes had to pay a price for the
>> _bt_check_natts() call, even if it was a small price. I've seen
>> _bt_compare() appear prominently in profiles quite a few times.
>
> Could you show a patch?

Attached patch makes the changes that I talked about, and a few others. The commit message has full details. The general direction of the patch is that it documents our assumptions, and verifies them in more cases. Most of the changes I've made are clear improvements, though in a few cases I've made changes that are perhaps more debatable.

These other, more debatable cases are:

* The comments added to _bt_isequal() about suffix truncation may not be to your taste. The same is true of the way that I restored the previous _bt_isequal() function signature. (Yes, I want to change it back despite the fact that I was the person that originally suggested we change _bt_isequal().)

* I added BTreeTupSetNAtts() calls to a few places that don't truly need them, such as the point where we generate a dummy 0-attribute high key within _bt_mark_page_halfdead(). I think that we should try to be as consistent as possible about using BTreeTupSetNAtts(), to set a good example. I don't think it's necessary to use BTreeTupSetNAtts() for pivot tuples when the number of key attributes matches indnatts (it seems inconvenient to have to palloc() our own scratch buffer to do this when we don't have to), but that doesn't apply to these now-covered cases.

I imagine that you'll have no problem with the other changes in the patch, which is why I haven't mentioned them here. Let me know what you think.

> I think, we need move _bt_check_natts() and its call under
> USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't pay
> for unused feature.

I eventually decided that you were right about this, and made the _bt_compare() call to _bt_check_natts() a simple assertion without waiting to hear more opinions on the matter. Concurrency isn't a factor here, so adding a check to standard release builds isn't particularly likely to detect bugs. Besides, there is really only a small number of places that need to do truncation for themselves. And, if you want to be sure that the structure is consistent in the field, there is always amcheck, which can check _bt_check_natts() (while also checking other things that we care about just as much).

Note that I removed some dead code from _bt_insertonpg() that wasn't added by the INCLUDE patch. It confused matters for this patch, since we don't want to consider what's supposed to happen when there is a retail insertion of a new, second negative infinity item -- clearly, that should simply never happen (I thought about adding a BTreeTupSetNAtts() call, but then decided to just remove the dead code and add a new "can't happen" elog error).

Finally, I made sure that we don't drop all tables in the regression tests, so that we have some pg_dump coverage for INCLUDE indexes, per a request from Tom.

[1] https://queue.acm.org/detail.cfm?id=2220317

-- 
Peter Geoghegan

From db777cad48f185afb70c589beae15fa166c0ddf1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Mon, 9 Apr 2018 17:45:33 -0700
Subject: [PATCH] Adjust INCLUDE index truncation comments and code.

Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected.  Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature.  A
similar check will be performed for the page highkey within
_bt_moveright() in most cases.  Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.

Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer).  This isn't strictly necessary,
but seems more future-proof than the old approach.  It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation.  Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.

In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE INDEX.
Re: WIP: Covering + unique indexes.
Hi!

> On 12 Apr 2018, at 21:21, Teodor Sigaev wrote:

I was adapting tests for GiST covering index and found out that REINDEX test is somewhat not a REINDEX test... I propose following micropatch.

Best regards, Andrey Borodin.

Attachment: fix-reindex-test.diff
Re: WIP: Covering + unique indexes.
Peter Geoghegan wrote:
> On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan wrote:
>> I did find another problem, though. Looks like the idea to explicitly
>> represent the number of attributes directly has paid off already:
>>
>> pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
>> create unique index cov_idx on covering_bug (f1) include(f2);
>> insert into covering_bug select i, i * random() * 1000, i * random() *
>> 10 from generate_series(0,10) i;
>> DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
>> CREATE TABLE
>> DEBUG:  building index "cov_idx" on table "covering_bug" serially
>> CREATE INDEX
>> ERROR:  tuple has wrong number of attributes in index "cov_idx"
>
> Actually, this was an error on my part (though I'd still maintain that
> the check paid off here!). I'll still add defensive assertions inside
> _bt_newroot(), and anywhere else that they're needed. There is no
> reason to not add defensive assertions in all code that handles page
> splits, and needs to fetch a highkey from some other page. We missed a
> few of those.

Agree, I prefer to add more Assert, even, may be, more than actually needed. Assert-documented code :)

> I'll add an item to "Decisions to Recheck Mid-Beta" section of the
> open items page for this patch. We should review the decision to make
> a call to _bt_check_natts() within _bt_compare(). It might work just
> as well as an assertion, and it would be unfortunate if workloads that
> don't use covering indexes had to pay a price for the
> _bt_check_natts() call, even if it was a small price. I've seen
> _bt_compare() appear prominently in profiles quite a few times.

Could you show a patch?

I think, we need move _bt_check_natts() and its call under USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't pay for unused feature.

-- 
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan wrote:
> I did find another problem, though. Looks like the idea to explicitly
> represent the number of attributes directly has paid off already:
>
> pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
> create unique index cov_idx on covering_bug (f1) include(f2);
> insert into covering_bug select i, i * random() * 1000, i * random() *
> 10 from generate_series(0,10) i;
> DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
> CREATE TABLE
> DEBUG:  building index "cov_idx" on table "covering_bug" serially
> CREATE INDEX
> ERROR:  tuple has wrong number of attributes in index "cov_idx"

Actually, this was an error on my part (though I'd still maintain that the check paid off here!). I'll still add defensive assertions inside _bt_newroot(), and anywhere else that they're needed. There is no reason to not add defensive assertions in all code that handles page splits, and needs to fetch a highkey from some other page. We missed a few of those.

I'll add an item to "Decisions to Recheck Mid-Beta" section of the open items page for this patch. We should review the decision to make a call to _bt_check_natts() within _bt_compare(). It might work just as well as an assertion, and it would be unfortunate if workloads that don't use covering indexes had to pay a price for the _bt_check_natts() call, even if it was a small price. I've seen _bt_compare() appear prominently in profiles quite a few times.

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong. I also verified every other
> BTreeInnerTupleGetDownLink() caller. It now looks like everything is
> good here.

Right - it tries to find the right page by consulting the parent page, taking the next key.

-- 
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Tue, Apr 10, 2018 at 1:37 PM, Peter Geoghegan wrote: > _bt_mark_page_halfdead() looked like it had a problem, but it now > looks like I was wrong. I did find another problem, though. Looks like the idea to explicitly represent the number of attributes directly has paid off already: pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text); create unique index cov_idx on covering_bug (f1) include(f2); insert into covering_bug select i, i * random() * 1000, i * random() * 10 from generate_series(0,10) i; DEBUG: building index "pg_toast_16451_index" on table "pg_toast_16451" serially CREATE TABLE DEBUG: building index "cov_idx" on table "covering_bug" serially CREATE INDEX ERROR: tuple has wrong number of attributes in index "cov_idx" Note that amcheck can detect the issue with the index after the fact, too: pg@~[3711]=# select bt_index_check('cov_idx'); ERROR: wrong number of index tuple attributes for index "cov_idx" DETAIL: Index tid=(3,2) natts=2 points to index tid=(2,92) page lsn=0/170DC88. I don't think that the issue is complicated. Looks like we missed a place that we have to truncate within _bt_split(), located directly after this comment block: /* * If the page we're splitting is not the rightmost page at its level in * the tree, then the first entry on the page is the high key for the * page. We need to copy that to the right half. Otherwise (meaning the * rightmost page case), all the items on the right half will be user * data. */ I believe that the reason that we didn't find this bug prior to commit is that we only have a single index tuple with the wrong number of attributes after an initial root page split through insertions, but the next root page split masks the problems. Not 100% sure that that's why we missed it just yet, though. This bug shouldn't be hard to fix. I'll take care of it as part of that post-commit review patch I'm working on. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Tue, Apr 10, 2018 at 9:03 AM, Teodor Sigaev wrote: >> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited >> to inner tuples, which might be worth doing something about (perhaps >> just renaming the macro). > > What is suspicious place for you opinion? _bt_mark_page_halfdead() looked like it had a problem, but it now looks like I was wrong. I also verified every other BTreeInnerTupleGetDownLink() caller. It now looks like everything is good here. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
* There is no pfree() within _bt_buildadd() for truncated tuples, even though that's a context where it's clearly not okay. Agreed. * It might be a good idea to also pfree() the truncated tuple for most other _bt_buildadd() callers. Even though it's arguably okay in other cases, it seems worth being consistent about it (consistent with old nbtree code). It seems so; I don't see other pfree() calls after those. * There should probably be some documentation around why it's okay that we call index_truncate_tuple() with an exclusive buffer lock held (during a page split). For example, there should probably be a comment on the VARATT_IS_EXTERNAL() situation. I have no objection to improving the docs/comments. * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited to inner tuples, which might be worth doing something about (perhaps just renaming the macro). Which place looks suspicious, in your opinion? I do not have the time to write a patch right away, but I should be able to post one in a few days. I want to avoid sending several small patches. No problem, we can wait. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Sun, Apr 8, 2018 at 11:19 PM, Teodor Sigaev wrote: > Thank you, pushed. I noticed a few more issues following another pass-through of the patch: * There is no pfree() within _bt_buildadd() for truncated tuples, even though that's a context where it's clearly not okay. * It might be a good idea to also pfree() the truncated tuple for most other _bt_buildadd() callers. Even though it's arguably okay in other cases, it seems worth being consistent about it (consistent with old nbtree code). * There should probably be some documentation around why it's okay that we call index_truncate_tuple() with an exclusive buffer lock held (during a page split). For example, there should probably be a comment on the VARATT_IS_EXTERNAL() situation. * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited to inner tuples, which might be worth doing something about (perhaps just renaming the macro). I do not have the time to write a patch right away, but I should be able to post one in a few days. I want to avoid sending several small patches. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
Thanks to both of you, pushed Shinoda, Noriyoshi wrote: Hi! Thank you for your response. I think that it is good with your proposal. Regards, Noriyoshi Shinoda *From:*Alexander Korotkov [mailto:a.korot...@postgrespro.ru] *Sent:* Monday, April 9, 2018 11:22 PM *To:* Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com> *Cc:* PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev <teo...@sigaev.ru>; Peter Geoghegan <p...@bowt.ie>; Jeff Janes <jeff.ja...@gmail.com>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru> *Subject:* Re: WIP: Covering + unique indexes. Hi! On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com <mailto:noriyoshi.shin...@hpe.com>> wrote: I tested this feature and found a document shortage in the columns added to the pg_constraint catalog. The attached patch will add the description of the 'conincluding' column to the manual of the pg_constraint catalog. Thank you for pointing this! I think we need more wordy explanation here. My proposal is attached. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/> The Russian Postgres Company -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
RE: WIP: Covering + unique indexes.
Hi! Thank you for your response. I think that it is good with your proposal. Regards, Noriyoshi Shinoda From: Alexander Korotkov [mailto:a.korot...@postgrespro.ru] Sent: Monday, April 9, 2018 11:22 PM To: Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com> Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev <teo...@sigaev.ru>; Peter Geoghegan <p...@bowt.ie>; Jeff Janes <jeff.ja...@gmail.com>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru> Subject: Re: WIP: Covering + unique indexes. Hi! On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com<mailto:noriyoshi.shin...@hpe.com>> wrote: I tested this feature and found a document shortage in the columns added to the pg_constraint catalog. The attached patch will add the description of the 'conincluding' column to the manual of the pg_constraint catalog. Thank you for pointing this! I think we need more wordy explanation here. My proposal is attached. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com<http://www.postgrespro.com/> The Russian Postgres Company
Re: WIP: Covering + unique indexes.
Hi! On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi < noriyoshi.shin...@hpe.com> wrote: > I tested this feature and found a document shortage in the columns added > to the pg_constraint catalog. > The attached patch will add the description of the 'conincluding' column > to the manual of the pg_constraint catalog. > Thank you for pointing this! I think we need more wordy explanation here. My proposal is attached. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company pg_constraint-2.patch Description: Binary data
RE: WIP: Covering + unique indexes.
Hi, I tested this feature and found a document shortage in the columns added to the pg_constraint catalog. The attached patch will add the description of the 'conincluding' column to the manual of the pg_constraint catalog. Regards, Noriyoshi Shinoda -Original Message- From: Teodor Sigaev [mailto:teo...@sigaev.ru] Sent: Monday, April 9, 2018 3:20 PM To: Peter Geoghegan <p...@bowt.ie> Cc: Jeff Janes <jeff.ja...@gmail.com>; Alexander Korotkov <a.korot...@postgrespro.ru>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org> Subject: Re: WIP: Covering + unique indexes. Thank you, pushed. Peter Geoghegan wrote: > On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teo...@sigaev.ru> wrote: >> Thank you, fixed > > I suggest that we remove some unneeded amcheck tests, as in the > attached patch. They don't seem to add anything. > -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ pg_constraint.patch Description: pg_constraint.patch
Re: WIP: Covering + unique indexes.
Thank you, pushed. Peter Geoghegan wrote: On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev wrote: Thank you, fixed I suggest that we remove some unneeded amcheck tests, as in the attached patch. They don't seem to add anything. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaevwrote: > Thank you, fixed I suggest that we remove some unneeded amcheck tests, as in the attached patch. They don't seem to add anything. -- Peter Geoghegan From 0dbbee5bfff8816cddf86961bf4959192f62f1ff Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sun, 8 Apr 2018 11:56:00 -0700 Subject: [PATCH] Remove some superfluous amcheck INCLUDE tests. Repeating these tests adds unnecessary cycles, since no improvement in test coverage is expected. Cleanup from commit 8224de4f42ccf98e08db07b43d52fed72f962ebb. Author: Peter Geoghegan --- contrib/amcheck/expected/check_btree.out | 20 +--- contrib/amcheck/sql/check_btree.sql | 6 +- 2 files changed, 2 insertions(+), 24 deletions(-) diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out index 2a06cce..ed80ac4 100644 --- a/contrib/amcheck/expected/check_btree.out +++ b/contrib/amcheck/expected/check_btree.out @@ -111,27 +111,9 @@ SELECT bt_index_parent_check('bttest_multi_idx', true); (1 row) -SELECT bt_index_parent_check('bttest_multi_idx', true); - bt_index_parent_check - -(1 row) - --- repeat same checks with index made by insertions +-- repeat expansive test for index built using insertions TRUNCATE bttest_multi; INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 10) as i; -SELECT bt_index_check('bttest_multi_idx'); - bt_index_check - - -(1 row) - -SELECT bt_index_parent_check('bttest_multi_idx', true); - bt_index_parent_check - -(1 row) - SELECT bt_index_parent_check('bttest_multi_idx', true); bt_index_parent_check --- diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql index da2f131..4ca9d2d 100644 --- a/contrib/amcheck/sql/check_btree.sql +++ b/contrib/amcheck/sql/check_btree.sql @@ -65,15 +65,11 @@ COMMIT; SELECT bt_index_check('bttest_multi_idx'); -- more expansive test for index with included columns SELECT bt_index_parent_check('bttest_multi_idx', true); -SELECT 
bt_index_parent_check('bttest_multi_idx', true); --- repeat same checks with index made by insertions +-- repeat expansive test for index built using insertions TRUNCATE bttest_multi; INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 10) as i; -SELECT bt_index_check('bttest_multi_idx'); SELECT bt_index_parent_check('bttest_multi_idx', true); -SELECT bt_index_parent_check('bttest_multi_idx', true); - -- cleanup DROP TABLE bttest_a; -- 2.7.4
Re: WIP: Covering + unique indexes.
Thanks to everyone, pushed. Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609: indextuple.c: In function 'index_truncate_tuple': indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable] int indnatts = tupleDescriptor->natts; Thank you, fixed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, fixed Jeff Janes wrote: On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev wrote: Thanks to everyone, pushed. Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609: indextuple.c: In function 'index_truncate_tuple': indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable] int indnatts = tupleDescriptor->natts; Cheers, Jeff -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev wrote: > Thanks to everyone, pushed. > > Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609: indextuple.c: In function 'index_truncate_tuple': indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable] int indnatts = tupleDescriptor->natts; Cheers, Jeff
Re: WIP: Covering + unique indexes.
> "Teodor" == Teodor Sigaevwrites: >> Support for testing amcaninclude via >> pg_indexam_has_property(oid,'can_include') seems to be missing? >> >> Also the return values of pg_index_column_has_property for included >> columns seem a bit dubious... should probably be returning NULL for most >> properties except 'returnable'. Teodor> Damn, you right, it's missed. >> I can look at fixing these for you if you like? Teodor> If you will do that I will be very grateful OK, I will deal with it. -- Andrew (irc:RhodiumToad)
Re: WIP: Covering + unique indexes.
Support for testing amcaninclude via pg_indexam_has_property(oid,'can_include') seems to be missing? Also the return values of pg_index_column_has_property for included columns seem a bit dubious... should probably be returning NULL for most properties except 'returnable'. Damn, you're right, it's missing. I can look at fixing these for you if you like? If you do that I will be very grateful. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Sat, Apr 7, 2018 at 1:52 PM, Andrew Gierth wrote: > Support for testing amcaninclude via > pg_indexam_has_property(oid,'can_include') seems to be missing? > > Also the return values of pg_index_column_has_property for included > columns seem a bit dubious... should probably be returning NULL for most > properties except 'returnable'. > > I can look at fixing these for you if you like? I'm happy to accept your help with it, for one. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
> "Teodor" == Teodor Sigaevwrites: >> I'll keep an eye on the buildfarm, since it's late in Russia. Teodor> Thank you very much! Now 23:10 MSK and I'll be able to follow Teodor> during approximately hour. Support for testing amcaninclude via pg_indexam_has_property(oid,'can_include') seems to be missing? Also the return values of pg_index_column_has_property for included columns seem a bit dubious... should probably be returning NULL for most properties except 'returnable'. I can look at fixing these for you if you like? -- Andrew (irc:RhodiumToad)
Re: WIP: Covering + unique indexes.
Thank you. I looked at the buildfarm and completely forgot about the commitfest site. Andres Freund wrote: On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote: Thanks to everyone, pushed. Marked CF entry as committed. Greetings, Andres Freund -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote: > Thanks to everyone, pushed. Marked CF entry as committed. Greetings, Andres Freund
Re: WIP: Covering + unique indexes.
I'll keep an eye on the buildfarm, since it's late in Russia. Thank you very much! It's now 23:10 MSK, and I'll be able to follow for approximately an hour. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Sat, Apr 7, 2018 at 1:02 PM, Teodor Sigaev wrote: > Thanks to everyone, pushed. I'll keep an eye on the buildfarm, since it's late in Russia. -- Peter Geoghegan
Sv: Re: WIP: Covering + unique indexes.
On Saturday, 07 April 2018 at 22:02:08, Teodor Sigaev wrote: Thanks to everyone, pushed. Rock! -- Andreas Joseph Krogh
Re: WIP: Covering + unique indexes.
Thanks to everyone, pushed. Peter Geoghegan wrote: On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev wrote: On close look, bts_btentry.ip_posid is not used anymore, I change bts_btentry type to BlockNumber. As result, BTEntrySame() is removed. That seems like a good idea. I'm not very happy with massive usage of ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)), suggest to wrap it to macro something like this: #define BTreeInnerTupleGetDownLink(itup) \ ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)) Agreed. We do that with GIN. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev wrote: > On close look, bts_btentry.ip_posid is not used anymore, I change > bts_btentry type to BlockNumber. As result, BTEntrySame() is removed. That seems like a good idea. > I'm not very happy with massive usage of > ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)), suggest to wrap it to > macro something like this: > #define BTreeInnerTupleGetDownLink(itup) \ > ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)) Agreed. We do that with GIN. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
I didn't like rel.h being included in itup.h. Do you really need a Relation as argument to index_truncate_tuple? It looks to me like you could pass the tupledesc instead; indnatts could be passed as a separate argument instead of IndexRelationGetNumberOfAttributes. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: Covering + unique indexes. (the good and the bad)
On 2018-04-07 14:27, Alexander Korotkov wrote: On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers wrote: On 2018-04-06 20:08, Alexander Korotkov wrote: [0001-Covering-v15.patch] After some more testing I notice there is also a down-side/slow-down to this patch that is not so bad but more than negligible, and I don't think it has been mentioned (but I may have missed something in this thread that's now been running for 1.5 years, not to mention the tangential btree-thread(s)). I attach my test-program, which compares master (this morning) with covered_indexes (warning: it takes a while to generate the used tables). The test tables are created as: create table $t (c1 int, c2 int, c3 int, c4 int); insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x); create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4); or for HEAD, just: create unique index ${t}unique_idx on $t using btree (c1, c2); Do I understand correctly that you compare a unique index on (c1, c2) with master to a unique index on (c1, c2) include (c3, c4) with the patched version? If so, then I think it's wrong to speak of a down-side/slow-down of this patch based on this comparison. The patch *does not* cause a slowdown in this case. The patch provides the user a *new option*, which has its advantages and disadvantages. And what you compare is the advantages and disadvantages of this option, not a slow-down of the patch. Only in the case where you compare *the same* index on master and the patched version is it possible to speak of a slow-down of the patch. OK, I take your point -- you are right. Although my measurement was (I think) correct, my comparison was not (as Teodor wrote, not quite 'fair'). Sorry, I should have thought that message through better. The somewhat longer time is indeed just a disadvantage of this new option, to be balanced against the advantages that are pretty clear too. Erik Rijkers
Re: WIP: Covering + unique indexes. (the good and the bad)
On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers wrote: > On 2018-04-06 20:08, Alexander Korotkov wrote: > >> >> [0001-Covering-v15.patch] >> >> > After some more testing I notice there is also a down-side/slow-down to > this patch that is not so bad but more than negligible, and I don't think > it has been mentioned (but I may have missed something in this thread > that's now been running for 1.5 years, not to mention the tangential > btree-thread(s)). > > I attach my test-program, which compares master (this morning) with > covered_indexes (warning: it takes a while to generate the used tables). > > The test tables are created as: > create table $t (c1 int, c2 int, c3 int, c4 int); > insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) > as x); > create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) > include (c3, c4); > > or for HEAD, just: > create unique index ${t}unique_idx on $t using btree (c1, c2); > Do I understand correctly that you compare a unique index on (c1, c2) with master to a unique index on (c1, c2) include (c3, c4) with the patched version? If so, then I think it's wrong to speak of a down-side/slow-down of this patch based on this comparison. The patch *does not* cause a slowdown in this case. The patch provides the user a *new option*, which has its advantages and disadvantages. And what you compare is the advantages and disadvantages of this option, not a slow-down of the patch. Only in the case where you compare *the same* index on master and the patched version is it possible to speak of a slow-down of the patch. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: Covering + unique indexes. (the good and the bad)
Thank you! create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4); or for HEAD, just: create unique index ${t}unique_idx on $t using btree (c1, c2); -- explain analyze select c1, c2 from nt0___1 where c1 < 1 -- explain analyze select c1, c2 from nt0___1 where c1 < 1 and c3 < 20 Not quite a fair comparison: the INCLUDE index is twice as big because of the included columns. Try to compare with a covering-emulated index: create unique index ${t}unique_idx on $t using btree (c1, c2, c3, c4) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
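For readers following along, the three index definitions under discussion can be put side by side. This is an illustrative sketch, not one of the thread's attachments; the table and column names follow Erik's test script, and the index names are made up:

```sql
-- The patch's new option: uniqueness is enforced on (c1, c2) only,
-- while c3 and c4 ride along in the leaf tuples for index-only scans.
CREATE UNIQUE INDEX t_include_idx ON t USING btree (c1, c2) INCLUDE (c3, c4);

-- What HEAD was benchmarked with: a narrower index that cannot answer
-- "select c1, c2 ... where c3 < 20" with an index-only scan.
CREATE UNIQUE INDEX t_narrow_idx ON t USING btree (c1, c2);

-- Teodor's fairer baseline: emulate covering by making all four columns
-- key columns. Similar size, but the uniqueness guarantee silently
-- becomes (c1, c2, c3, c4) rather than (c1, c2).
CREATE UNIQUE INDEX t_emulated_idx ON t USING btree (c1, c2, c3, c4);
```

The point of the exchange is that only the last two are comparable in size, and only INCLUDE preserves the original unique constraint while still covering the extra columns.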
Re: WIP: Covering + unique indexes. (the good and the bad)
On 2018-04-06 20:08, Alexander Korotkov wrote: [0001-Covering-v15.patch] After some more testing I notice there is also a down-side/slow-down to this patch that is not so bad but more than negligible, and I don't think it has been mentioned (but I may have missed something in this thread that's now been running for 1.5 year, not to mention the tangential btree-thread(s)). I attach my test-program, which compares master (this morning) with covered_indexes (warning: it takes a while to generate the used tables). The test tables are created as: create table $t (c1 int, c2 int, c3 int, c4 int); insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x); create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4); or for HEAD, just: create unique index ${t}unique_idx on $t using btree (c1, c2); Here is typical output (edited a bit to prevent email-mangling): test1: -- explain analyze select c1, c2 from nt0___1 where c1 < 1 -- 250x unpatched 6511: 100M rows Execution Time: (normal/normal) 98 % exec avg: 2.44 patched 6976: 100M rows Execution Time: (covered/normal) 108 % exec avg: 2.67 test1 patched / unpatched: 109.49 % test4: -- explain analyze select c1, c2 from nt0___1 where c1 < 1 and c3 < 20 unpatched 6511: 100M rows Execution Time: (normal/normal) 95 %exec avg: 1.56 patched 6976: 100M rows Execution Time: (covered/normal) 60 %exec avg: 0.95 test4 patched / unpatched: 60.83 % So the main good thing is that 60%, a good improvement -- but that ~109% (a slow-down) is also quite repeatable. (there are a more goodies from the patch (like improved insert-speed) but I just wanted to draw attention to this particular slow-down too) I took all timings from explain analyze versions of the statements, on the assumption that that would be quite comparable to 'normal' querying. (please let me know if that introduces error). 
# \dti+ nt0___1* List of relations Schema | Name | Type | Owner | Table | Size +--+---+--+-+ public | nt0___1 | table | aardvark | | 4224 MB public | nt0___1uniqueinclude_idx | index | aardvark | nt0___1 | 3004 MB (for what it's worth, I'm in favor of getting this patch into v11 although I can't say I followed the technical details too much) thanks, Erik Rijkers #!/bin/env perl #!/opt/perl-5.26/bin/perl use strict; use warnings; use DBI; use Time::HiRes qw/tv_interval gettimeofday/; use Getopt::Long; $| = 1; our $PGPORT_VANILLA = 6511; our $PGPORT_COVERING_INDEXES = 6976; our $SQL_REPEAT = 251; main(); exit; sub size_unit { 1_000_000 } sub main { my $size = 100; #rowcount in millions; this $size variable determines the table used GetOptions ("size=i" => \$size) or die("Error in command line arguments\n"); my $dbh_patched = connectdb_covering_indexes(); my $dbh_vanilla = connectdb_vanilla(); my $port_patched = check_debug_state( $dbh_patched ); my $port_vanilla = check_debug_state( $dbh_vanilla ); # create tables on patched instance for my $n (1, 10, 100) { # , 250 ) { my $rowcount = $n * size_unit(); create_tables($dbh_patched, $port_patched, $rowcount, my $overwrite = 0); } # create tables on vanilla instance for my $n (1, 10, 100) { # , 250 ) { my $rowcount = $n * size_unit(); create_tables($dbh_vanilla, $port_vanilla, $rowcount, my $overwrite = 0); } # print sprintf("-- Perl %vd\n", $^V) #, "-- ", $dbh_vanilla->selectrow_arrayref( "select version()" )->[0], "\n" #, "-- ", $dbh_patched->selectrow_arrayref( "select version()" )->[0], "\n" ; my $c1 = 1; ## 5000 + int(rand(5000)) + 1; my $c3 =20; ## 20 + int(rand( 30)) + 1; # $c1 = 5000; ## 5000 + int(rand(5000)) + 1; # $c3 =20; ## 20 + int(rand( 30)) + 1; # enable to vary WHERE-clause a little bit: if (0) { $c1 = 5000 + int(rand(5000)) + 1; $c3 = 20 + int(rand( 30)) + 1; } my $vanilla = test1($dbh_vanilla, $port_vanilla, $size, $c1); my $patched = test1($dbh_patched, $port_patched, $size, $c1); print " "x84, 
sprintf( "%6.2f %% <- test1, patched / unpatched\n", ((average($patched) * 100) / average($vanilla)) ); # test2($dbh_vanilla, $port_vanilla, $size, $c1); # test2($dbh_patched, $port_patched, $size, $c1); # test3($dbh_vanilla, $port_vanilla, $size, $c1, $c3); # test3($dbh_patched, $port_patched, $size, $c1, $c3); $vanilla = test4($dbh_vanilla, $port_vanilla, $size, $c1, $c3); $patched = test4($dbh_patched, $port_patched, $size, $c1, $c3); print " "x84, sprintf( "%6.2f %% <- test4,
Re: WIP: Covering + unique indexes.
On Fri, Apr 6, 2018 at 11:08 AM, Alexander Korotkov wrote: > OK, incorporated into v15. I've also added a sentence about pg_upgrade > to the commit message. I will summarize my feelings on this patch. I endorse committing the patch, because I think that the benefits of committing it now noticeably outweigh the costs. I have various caveats about pushing the patch, but these are manageable. Costs: First, there is the question of risks, or costs. I think that this patch has a negligible chance of being problematic in a way that will become memorable. That seems improbable because the patch only really changes the representation of what we're calling "pivot keys" (high keys and internal page downlinks), which is something that VACUUM doesn't care about. I see this patch as a special case of suffix truncation, a technique that has been around since the 1970s. Although you have to look carefully to see it, the amount of extra complexity is pretty small, and the only place where a critical change is made is during leaf page splits. As long as we get that right, everything else should fall into place. There are no risks that I can see that are related to concurrency, or that crop up when doing an anti-wraparound VACUUM. There may be problems, but at least they won't be *pernicious* problems that unravel over a long period of time. The latest amcheck enhancement, and Alexander's recent changes to the patch to make the on-disk representation explicit (not implicit), should change things. We now have the tools to detect any corruption problem that I can think of. For example, if there was some subtle reason why assessing HOT safety broke, then we'd have a way of mechanically detecting that without having to devise a custom test (like the test Pavan happened to be using when the bug fixed by 2aaec654 was originally discovered). The lessons that I applied to designing amcheck were in a few cases from actual experience with real world bugs, including that 2aaec654 bug.
I hope that it goes without saying that I've also taken reasonable steps to address all of these risks directly, by auditing code. And, that this remains the first line of defense. Here are the other specific issues that I see with the patch: * It's possible that something was missed in the optimizer. I'm not sure. I share the intuition that very little code is actually needed there, but I'm far from the best person to judge whether or not some subtle detail was missed. * This seems out of date: > +* NOTE: It is not crucial for reliability in present, but maybe > +* it will be that in the future. Now the purpose is just to save > +* more space on inner pages of btree. * CheckIndexCompatible() didn't seem to get the memo about this patch. Maybe just a comment? * It's possible that there are some more bugs in places like relcache.c, or deparsing, or pg_dump, or indexcmds.c; perhaps simple omissions, like the one I just mentioned. If there are, I don't expect them to be particularly serious, or to make me reassess my basic position. But there could be. * I was wrong to suggest _bt_isequal() has an assertion against truncation. It is called for the highkey. Suggest you weaken the assertion, so it only applies when the offset isn't P_HIKEY on non-rightmost page. * Suggest adding a comment above BTStackData, about bts_btentry + offset. * Suggest breaking BTEntrySame() into 3 lines, not 2. * This comment needs to be updated: /* get high key from left page == lowest key on new right page */ Suggest "get high key from left page == lower bound for new right page". * This comment needs to be updated: 13th bit: unused Suggest "13th bit: AM-defined meaning" * Suggest adding a note that the use of P_HIKEY here is historical, since it isn't used to match downlinks: /* * Find the parent buffer and get the parent page. * * Oops - if we were moved right then we need to change stack item! 
We * want to find parent pointing to where we are, right ?- vadim * 05/27/97 */ ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY); pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); * I'm slightly concerned that this patch subtly breaks an optimization within _bt_preprocess_keys(), or _bt_checkkeys(). I cannot find any evidence of that, though, and I consider it unlikely, based on the intuition that the simple Pathkey changes in the optimizer don't provide the executor with a truly new set of constraints for index scans. Also, if there was a problem here, it would be in the less serious category of problems -- those that can't really affect anyone not using the user-visible feature. * The docs need some more polishing. Didn't spend very much time on this at all. Benefits: There is also the matter of the benefits of this patch, which I think are considerable, and far greater than they appear. This
Re: WIP: Covering + unique indexes.
On Fri, Apr 6, 2018 at 10:33 AM, Alexander Korotkov wrote: > Thinking about that again, I found that we should relax our requirements to > "minus infinity" items, because pg_upgraded indexes doesn't have any > special bits set for those items. > > What do you think about applying following patch on the top of v14? It's clearly necessary. Looks fine to me. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Fri, Apr 6, 2018 at 8:22 PM, Peter Geoghegan wrote: > On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev wrote: > > As far I can see, there is no any on-disk representation differece for > > *existing* indexes. So, pg_upgrade is not need here and there isn't any > new > > code to support "on-fly" modification. Am I right? > > Yes. > > I'm going to look at this again today, and will post something within > 12 hours. Please hold off on committing until then. > Thank you. Thinking about that again, I found that we should relax our requirements to "minus infinity" items, because pg_upgraded indexes don't have any special bits set for those items. What do you think about applying the following patch on top of v14? diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index 44605fb5a4..53dc47ff82 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -2000,8 +2000,12 @@ _bt_check_natts(Relation index, Page page, OffsetNumber offnum) } else if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque)) { - /* Leftmost tuples on non-leaf pages have no attributes */ - return (BTreeTupGetNAtts(itup, index) == 0); + /* + * Leftmost tuples on non-leaf pages have no attributes, or haven't + * INDEX_ALT_TID_MASK set in pg_upgraded indexes. + */ + return (BTreeTupGetNAtts(itup, index) == 0 || + ((itup->t_info & INDEX_ALT_TID_MASK) == 0)); } else { -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev wrote:
> As far as I can see, there is no on-disk representation difference for
> *existing* indexes. So, pg_upgrade is not needed here and there isn't any new
> code to support "on-the-fly" modification. Am I right?

Yes.

I'm going to look at this again today, and will post something within 12 hours. Please hold off on committing until then.

--
Peter Geoghegan
Re: WIP: Covering + unique indexes.
As far as I can see, there is no on-disk representation difference for *existing* indexes. So, pg_upgrade is not needed here, and there isn't any new code to support "on-the-fly" modification. Am I right?

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Thu, Apr 5, 2018 at 7:59 AM, Alexander Korotkov wrote:
>> * btree_xlog_split() still has this code:
> Right, I think there is absolutely no need for this code. It's removed in
> the attached patchset.

I'm now a bit nervous about always logging the high key, since that could impact performance. I think that there is a good way to only do it when needed. New plan:

1. Add these new xl_info codes for the split record (placed directly after XLOG_BTREE_SPLIT_R):

#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50	/* as above, include truncated highkey */
#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60	/* as above, include truncated highkey */

2. Within _bt_split(), restore the old "leaf vs. internal" logic, so that the high key is always logged for internal (!isleaf) pages. For leaf pages, however, only log it when needed -- when the new high key was *actually* truncated -- since only then will it actually differ from the first item on the right page. Also, set XLOG_BTREE_SPLIT_L_HIGHKEY instead of XLOG_BTREE_SPLIT_L when we must log (or XLOG_BTREE_SPLIT_R_HIGHKEY instead of XLOG_BTREE_SPLIT_R), so that recovery actually knows that it should restore the truncated high key. (Sometimes I think it would be nice to be able to do more during recovery, but that's a much bigger issue.)

3. Restore all the master code within btree_xlog_split(), except instead of restoring the high key when !isleaf, do so when the record is XLOG_BTREE_SPLIT_L_HIGHKEY or XLOG_BTREE_SPLIT_R_HIGHKEY.

4. Add an assertion within btree_xlog_split() that ensures that internal pages never fail to have their high key logged, since there is no reason why that should ever not happen with internal pages.

5. Fix this struct xl_btree_split comment, which commit 0c504a80 from 2017 missed when it reclaimed two xl_info status bits:

 * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
 * The _L and _R variants indicate whether the inserted tuple went into the
 * left or right split page (and thus, whether newitemoff and the new item
 * are stored or not). The _ROOT variants indicate that we are splitting
 * the root page, and thus that a newroot record rather than an insert or
 * split record should follow. Note that a split record never carries a
 * metapage update --- we'll do that in the parent-level update.

6. Add your own xl_btree_split comment in its place, noting the new usage. Basically, replace the _ROOT sentence with a similar _HIGHKEY sentence.

7. Don't forget about btree_desc().

I'd say that there is a good chance that Anastasia is correct to think that it isn't worth worrying about the extra WAL that her approach implied, and that it is in fact good enough to simply always log the left page's high key. However, it seems easier and lower risk all around to do it this way. It doesn't leave us with ambiguity. In my experience, *ambiguity* on design questions makes a patch miss a release much more frequently than bugs or regressions do. Sorry that I didn't just say this the first time I brought up btree_xlog_split(). I didn't see the opportunity to avoid creating more WAL until now.

> OK, I've added an assertion that the number of tuple attributes should be
> either natts or nkeyatts, because we call _bt_mkscankey() for
> pivot index tuples too.

Makes sense.

> If you're talking about these assertions
>
> Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
> indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> Assert(indnkeyatts != 0);

Actually, I was just talking about the first one, "Assert(IndexRelationGetNumberOfAttributes(rel) != 0)". I was unclear. Maybe it isn't worth getting rid of even the first one.

> then I would rather leave both of them. If we know that the index tuple
> length is either natts or nkeyatts, that doesn't make you sure that
> both natts and nkeyatts are non-zero.

I suppose.

> I've done so. Tests are passing for me.
Great. I'm glad that worked out. One simple, broad rule.

> Hmm, we have four bits reserved. But I'm not sure whether we would use
> *all* of them for non-pivot tuples. Probably we would use some of them for
> pivot tuples. I don't know that in advance. Thus, I propose not to
> rename this. But I've added a comment that non-pivot tuples might also
> use those bits.

Okay. Good enough.

> Sorry, I didn't get which particular further use of reserved bits you mean.
> Did you mean key normalization?

I was being unclear. I was just reiterating my point about having a non-pivot bit. It doesn't matter, though.

> Thank you for writing that explanation.

Looks good. I think that once you realize how INCLUDE indexes don't change pivot tuples, and actually understand what pivot tuples are, the patch seems a lot less scary.

> This patchset also incorporates docs enhancements by Erik Rijkers and
> a sentence which states that indexes with included columns are also called
> "covering indexes".
Re: WIP: Covering + unique indexes.
On Thu, Apr 5, 2018 at 5:02 PM, Erik Rijkers wrote:
> On 2018-04-05 00:09, Alexander Korotkov wrote:
>> Thank you for review! Revised patchset is attached.
>> [0001-Covering-core-v12.patch]
>> [0002-Covering-btree-v12.patch]
>> [0003-Covering-amcheck-v12.patch]
>> [0004-Covering-natts-v12.patch]
>
> Really nice performance gains.
>
> I read through the docs and made some changes. I hope it can count as
> an improvement.

Thank you for your improvements to the docs. Your changes will be incorporated into the new revision of the patchset, which I'm going to post today.

> It would probably also be a good idea to add the term "covering index"
> somewhere, at least in the documentation's index; the term does not now
> occur anywhere. (This doc-patch does not add it)

I'll think about it. Maybe we'll define "covering index" in the docs.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On 2018-04-05 00:09, Alexander Korotkov wrote:

Hi!

Thank you for review! Revised patchset is attached.
[0001-Covering-core-v12.patch]
[0002-Covering-btree-v12.patch]
[0003-Covering-amcheck-v12.patch]
[0004-Covering-natts-v12.patch]

Really nice performance gains.

I read through the docs and made some changes. I hope it can count as an improvement.

It would probably also be a good idea to add the term "covering index" somewhere, at least in the documentation's index; the term does not now occur anywhere. (This doc-patch does not add it)

thanks,
Erik Rijkers

--- doc/src/sgml/ref/create_index.sgml.orig	2018-04-05 14:36:00.904617793 +0200
+++ doc/src/sgml/ref/create_index.sgml	2018-04-05 15:49:03.778805965 +0200
@@ -148,31 +148,27 @@
 INCLUDE
-An optional INCLUDE clause allows to specify the
-list of columns which will be included in the non-key part of the index.
-Columns listed in this clause cannot co-exist as index key columns,
-and vice versa. The INCLUDE columns exist solely to
+The optional INCLUDE clause specifies a
+list of columns which will be included as a non-key part in the index.
+Columns listed in this clause cannot also be present as index key columns.
+The INCLUDE columns exist solely to
 allow more queries to benefit from index-only scans
-by including specified columns into the index. Values of these columns
+by including the values of the specified columns in the index. These values
 would otherwise have to be obtained by reading the table's heap.
-Having these columns in the INCLUDE clause
-in some cases allows PostgreSQL to skip
-the heap read completely.
-In the UNIQUE indexes, uniqueness is only enforced
+In UNIQUE indexes, uniqueness is only enforced
 for key columns. Columns listed in the INCLUDE
-clause have no influence to uniqueness enforcement. Other constraints
+clause have no effect on uniqueness enforcement. Other constraints
 (PRIMARY KEY and EXCLUDE) work the same way.
-Columns listed in the INCLUDE clause doesn't need
-appropriate operator class to exist. Therefore,
-INCLUDE clause if useful to add non-key index
-columns, whose data types don't have operator classes defined for
-given access method.
+Columns listed in the INCLUDE clause don't need
+appropriate operator classes; the clause can contain non-key index
+columns whose data types don't have operator classes defined for
+a given access method.
@@ -182,12 +178,12 @@
 Currently, only the B-tree index access method supports this feature.
-In B-tree indexes, values of columns listed in the
-INCLUDE clause are included into leaf tuples which
-are linked to the heap tuples, but aren't included into pivot tuples
+In B-tree indexes, the values of columns listed in the
+INCLUDE clause are included in leaf tuples which
+are linked to the heap tuples, but are not included into pivot tuples
 used for tree navigation. Therefore, moving columns from the list of
 key columns to the INCLUDE clause can slightly
-reduce index size and improve the tree branching factor.
+reduce index size and improve the tree branching factor.
Re: WIP: Covering + unique indexes.
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov wrote:
> Thank you for review! Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

	/*
	 * On leaf level, the high key of the left page is equal to the first key
	 * on the right page.
	 */
	if (isleaf)
	{
		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
		left_hikeysz = ItemIdGetLength(hiItemId);
	}

However, we never fail to store the high key now, even at the leaf level, because of this change to the corresponding point in _bt_split():

> -	/* Log left page */
> -	if (!isleaf)
> -	{
> -		/*
> -		 * We must also log the left page's high key, because the right
> -		 * page's leftmost key is suppressed on non-leaf levels. Show it
> -		 * as belonging to the left page buffer, so that it is not stored
> -		 * if XLogInsert decides it needs a full-page image of the left
> -		 * page.
> -		 */
> -		itemid = PageGetItemId(origpage, P_HIKEY);
> -		item = (IndexTuple) PageGetItem(origpage, itemid);
> -		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -	}
> +	/*
> +	 * We must also log the left page's high key. There are two reasons
> +	 * for that: right page's leftmost key is suppressed on non-leaf levels,
> +	 * in covering indexes, included columns are truncated from high keys.
> +	 * For simplicity, we don't distinguish these cases, but log the high
> +	 * key every time. Show it as belonging to the left page buffer, so
> +	 * that it is not stored if XLogInsert decides it needs a full-page
> +	 * image of the left page.
> +	 */
> +	itemid = PageGetItemId(origpage, P_HIKEY);
> +	item = (IndexTuple) PageGetItem(origpage, itemid);
> +	XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code?
Note also that this existing comment has been made obsolete:

		/* don't release the buffer yet; we touch right page's first item below */

		/* Now reconstruct left (original) sibling page */
		if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the comment now?

* _bt_mkscankey() should assert that the IndexTuple has the correct number of attributes. I don't expect you to change routines like _bt_mkscankey() so they actually respect the number of attributes from BTreeTupGetNAtts(), rather than just relying on IndexRelationGetNumberOfKeyAttributes(). However, an assertion seems well worthwhile. It's a big reason for having BTreeTupGetNAtts(). This also lets you get rid of at least one assertion from _bt_doinsert(), I think.

* _bt_isequal() should assert that the IndexTuple was not truncated.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>  	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>  		return 1;
>
> +	/*
> +	 * Check tuple has correct number of attributes.
> +	 */
> +	if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +	{
> +		ereport(ERROR,
> +				(errcode(ERRCODE_INTERNAL_ERROR),
> +				 errmsg("tuple has wrong number of attributes in index \"%s\"",
> +						RelationGetRelationName(rel))));
> +	}

In principle, we should also check _bt_check_natts() for "minus infinity" items, just like you did within verify_nbtree.c. Also, there is no need for parentheses here.

* Maybe _bt_truncate_tuple() should assert that the caller has not tried to truncate a tuple that has already been truncated. I'm not sure if our assertion should be quite that strong, but I think that that might be good, because in general we only need to truncate on the leaf level -- truncating at any other level of the tree (e.g. doing traditional suffix truncation) is always subtly wrong.
What we definitely should do, at a minimum, is make sure that attempting to truncate a tuple to 2 attributes when it already has 0 attributes fails with an assertion failure. Can you try adding the strong assertion (truncate only once) to _bt_truncate_tuple()? Maybe that's not possible, but it seems worth a try.

* I suggest we invent a new flag for 0x2000 within itup.h, to replace "/* bit 0x2000 is reserved for index-AM specific usage */". We can call it INDEX_AM_RESERVED_BIT. Then, we can change INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this neater.

* We should "use" one of the 4 new status bits that are available from an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf index tuples. Perhaps call it BT_ALT_TID_NONPIVOT. I guess you could say that I want us to reserve
Re: WIP: Covering + unique indexes.
On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov wrote:
> Great, I'm looking forward to your feedback.

I took a look at V11 (0001-Covering-core-v11.patch, 0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch, 0004-Covering-natts-v11.patch) today.

* What's a pivot tuple? This is the same thing as what I call a "separator key", I think -- you're talking about the set of IndexTuples including all high keys (including leaf level high keys), as well as internal items (downlinks). I think that it's a good idea to have a standard word that describes this set of keys, to formalize the two categories (pivot tuples vs. tuples that point to the heap itself). Your word is just as good as mine, so we can go with that. Let's put this somewhere central. Maybe in the nbtree README, and/or nbtree.h. Also, verify_nbtree.c should probably get some small explanation of pivot tuples. offset_is_negative_infinity() is a nice place to mention pivot tuples, since that already has a bit of high-level commentary about them.

* Compiler warning:

/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c: In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45: warning: ‘opclassTup’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
                                             ^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19: note: ‘opclassTup’ was declared here
  Form_pg_opclass opclassTup;
                  ^

* Your new amcheck tests should definitely use the new "heapallindexed" option. There were a number of bugs I can remember seeing in earlier versions of this patch that that would catch (probably not during regression tests, but let's at least do that much).

* The modified amcheck contrib regression tests don't actually pass.
I see these unexpected errors:

10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to index tid=(289,2) page lsn=0/162407A8.

* I see that we use "- 1" with the attribute number, like this:

> +/* Get number of attributes in B-tree index tuple */
> +#define BTreeTupGetNAtts(itup, index) \
> +	( \
> +		(itup)->t_info & INDEX_ALT_TID_MASK ? \
> +		( \
> +			AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
> +			ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK - 1 \
> +		) \
> +		: \
> +		IndexRelationGetNumberOfAttributes(index) \
> +	)

Is this left behind from before you decided to adopt INDEX_ALT_TID_MASK? Is it your intention here to encode InvalidOffsetNumber without tripping up assertions? Or is it something else?

Maybe we should follow the example of GinItemPointerGetOffsetNumber(), and use ItemPointerGetOffsetNumberNoCheck() instead of ItemPointerGetOffsetNumber(). What do you think? That would allow us to get rid of the -1 thing, which might be nice. Just because we use ItemPointerGetOffsetNumberNoCheck() in places that use an alternative offset representation does not mean we need to use it in existing places. If existing places had a regression test failure because of this, that would probably be due to a real bug. No?

* ISTM that the "points to index tid=(289,2)" part of the message just shown would be a bit clearer if I didn't have to know that 2 actually means 1 when we talk about the pointed-to offset (yeah, it will probably become unclear in the future when we start using the reserved offset status bits, but why not keep the low bits of the offset simple/logical?). Your new amcheck error message should spell it out (it should say the number of attributes indicated by the offset, if any) -- regardless of what we do about the "must apply -1 to offset" question.

* "Minus infinity" items do not have the new status bit INDEX_ALT_TID_MASK set, in at least some cases. They should.

* _bt_sortaddtup() should not do "trunctuple.t_info = sizeof(IndexTupleData)", since that destroys useful information. Maybe that's the reason for the last bug?

* Ditto for _bt_pgaddtup().

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call it? The only reason we have _bt_sortaddtup() is because we cannot trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real change needed, because _bt_sortaddtup() knows that it's inserting on a non-rightmost page both without this patch, and when this patch needs to truncate and then add the high key back. It's clear that you can
Re: WIP: Covering + unique indexes.
On Tue, Apr 3, 2018 at 7:02 AM, Peter Geoghegan wrote:
> On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov wrote:
> > I thought about that another time, and I decided that it would be safer
> > to use the 13th bit in the index tuple flags. There is already an attempt to
> > use the whole 6 bytes of the TID for non-heap-pointer information [1]. Thus, it
> > would be safe to use the 13th bit for indicating an alternative offset meaning
> > in pivot tuples, because it wouldn't block further work. The revised patchset
> > in the attachment implements it.
>
> This is definitely not the only time someone has talked about this
> 13th bit -- it's quite coveted. It also came up with UPSERT, and with
> WARM. That's just the cases that I can personally remember.
>
> I'm glad that you found a way to make this work, that will keep things
> flexible for future patches, and make testing easier. I think that we
> can find a flexible representation that makes almost everyone happy.

OK, good.

> I didn't have time to look at this properly today, but I will try to
> do so tomorrow.

Great, I'm looking forward to your feedback.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov wrote:
> I thought about that another time, and I decided that it would be safer
> to use the 13th bit in the index tuple flags. There is already an attempt to
> use the whole 6 bytes of the TID for non-heap-pointer information [1]. Thus, it
> would be safe to use the 13th bit for indicating an alternative offset meaning
> in pivot tuples, because it wouldn't block further work. The revised patchset
> in the attachment implements it.

This is definitely not the only time someone has talked about this 13th bit -- it's quite coveted. It also came up with UPSERT, and with WARM. That's just the cases that I can personally remember.

I'm glad that you found a way to make this work, that will keep things flexible for future patches, and make testing easier. I think that we can find a flexible representation that makes almost everyone happy.

> I don't know. We still need an offset number to check the expected number
> of attributes. Passing the index tuple as a separate attribute would be
> redundant and open the door to extra possible errors.

You're right. I must have been tired when I wrote that. :-)

>> Do you store an attribute number in the "minus infinity" item (the
>> leftmost one of internal pages)? I guess that that should be zero,
>> because it's totally truncated.
>
> Yes, I store zero as the number of attributes in the "minus infinity" item.
> See this part of the patch.
> However, note that I have to store (number_of_attributes + 1) in the offset
> in order to correctly store zero attributes. Otherwise, an assertion
> is raised in the ItemPointerIsValid() macro.

Makes sense.

> Yes. But that depends on how the difficulty of adapting the patch to the big
> picture correlates with the difficulty that a non-adapted patch creates for
> that big picture. My point was that the second difficulty isn't high. If we can
> be satisfied with the implementation in the attached patchset (probably some
> small enhancements are still required), then the first difficulty isn't high
> either. I think it's possible.
I didn't have time to look at this properly today, but I will try to do so tomorrow. Thanks -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
> Yes, GIN never stored heap TID pointers in the t_tid of the index tuple. But GIN
> assumes that a heap TID pointer has at most 11 significant bits during
> posting list encoding.

I think that we should avoid assuming things, unless the cost of representing them is too high, which I don't think applies here. The more defensive general purpose code can be, the better.

I will admit to being paranoid here. But experience suggests that paranoia is a good thing, if it isn't too expensive. Look at the thread on XFS + fsync() for an example of things being wrong for a very long time without anyone realizing, and despite the best efforts of many smart people. As far as anyone can tell, PostgreSQL on Linux + XFS is kinda, sorta broken, and has been forever. XFS was mature before ext4 was, and is a popular choice, and yet this is the first we're hearing about it being kind of broken. After many years.

Look at this check that made it into my amcheck patch, that was committed yesterday:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/amcheck/verify_nbtree.c;h=a15fe21933b9a5b8baefedaa8f38e517d6c91877;hb=7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e#l745

As it says, nbtree is surprisingly tolerant of corrupt lp_len fields. You may find it an interesting exercise to use pg_hexedit to corrupt many lp_len fields in an index page. What's really interesting about this is that it doesn't appear to break anything at all! We don't get the length from there in most cases, so reads won't break at all. I see that we use ItemIdGetLength() in a couple of rare cases (though even those could be avoided) during a page split. You'd be lucky to notice a problem if lp_len fields were regularly corrupt.
When you notice, it will probably have already caused big problems.

On a similar note, I've noticed that many of my experimental B-Tree patches (that I never find time to finish) tend to almost work quite early on, sometimes without my really understanding why. The whole L&Y approach of recovering from problems that were detected (detecting concurrent page splits, and moving right) makes the code *very* forgiving.

I hope that I don't sound trite, but everyone should try to be modest about what they *don't* know when writing complex system software with concurrency. It is not a platitude, even though it probably seems that way. A tiny mistake can have big consequences, so it's very important that we have a way to easily detect them after the fact.

> I don't think we should use assertions, because they are typically disabled on
> production PostgreSQL builds. But we can have some explicit check in some
> common path. In the attached patch I've added such a check to _bt_compare().
> Probably, together with amcheck, that would be sufficient.

Good idea -- a "can't happen" check in _bt_compare seems better, which I see here:

> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>  	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>  		return 1;
>
> +	/*
> +	 * Check tuple has correct number of attributes.
> +	 */
> +	if (!_bt_check_natts(rel, page, offnum))
> +	{
> +		ereport(ERROR,
> +				(errcode(ERRCODE_INTERNAL_ERROR),
> +				 errmsg("tuple has wrong number of attributes in index \"%s\"",
> +						RelationGetRelationName(rel))));
> +	}
> +
>  	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

It seems like it might be a good idea to make this accept an IndexTuple, though, to possibly save some work.
Also, perhaps this should be an unlikely() condition, if only because it makes the intent clearer (it might actually matter in a tight loop like this too, though).

Do you store an attribute number in the "minus infinity" item (the leftmost one of internal pages)? I guess that that should be zero, because it's totally truncated.

> OK, thanks for the explanation. I agree that the check of the offset is
> redundant here.

Cool.

>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
> I don't know. When I wrote my own implementation of a B-tree and debugged
> it, I found
Re: WIP: Covering + unique indexes.
On Fri, Mar 30, 2018 at 6:24 AM, Alexander Korotkov wrote:
>> With an extreme enough case, this could result in a failure to find a
>> split point. Or at least, if that isn't true then it's not clear why,
>> and I think it needs to be explained.
>
> I don't think this could result in a failure to find a split point.
> So, it finds a split point without taking into account that the hikey
> will be shorter. If such a split point exists, then a split point with
> a truncated hikey should also exist. If not, then it would be a
> failure even without covering indexes. I've updated the comment
> accordingly.

You're right. We're going to truncate the unneeded trailing attributes from whatever tuple is to the immediate right of the final split point that we choose (that's the tuple that we'll copy to make a new high key for the left page). Truncation already has to result in a tuple that is less than or equal to the original tuple.

I also agree that it isn't worth trying harder to make sure that space is distributed evenly when truncation will go ahead. It will only matter in very rare cases, but the computational overhead of having an accurate high key size for every candidate split point would be significant.

--
Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Fri, Mar 30, 2018 at 10:39 PM, Peter Geoghegan wrote:
> It's safe, although I admit that that's a bit hard to see.
> Specifically, look at this code in _bt_insert_parent():
>
> /*
>  * Find the parent buffer and get the parent page.
>  *
>  * Oops - if we were moved right then we need to change stack item! We
>  * want to find parent pointing to where we are, right ?	- vadim
>  * 05/27/97
>  */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
> Vadim doesn't seem too sure of why he did it that way. What's clear is
> that the offset on all internal pages is always P_HIKEY (that is, 1),
> because this is the one and only place where new IndexTuples get
> generated for internal pages. That's unambiguous. So how could
> BTEntrySame() truly need to care about offset? How could there ever be
> an internal page offset that wasn't just P_HIKEY? You can look
> yourself, using pg_hexedit or pageinspect.

Sorry, I meant this code, right before:

/* form an index tuple that points at the new right page */
new_item = CopyIndexTuple(ritem);
ItemPointerSet(&(new_item->t_tid), rbknum, P_HIKEY);

--
Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov wrote:
>> I'll try it. But I'm afraid that it's not as easy as you expect.
>
> So, I have some implementation of storing the number of attributes inside
> the index tuple itself. I made it as an additional patch on top of the
> previous patchset. I attach the whole patchset in order to make
> commitfest.cputube.org happy.

Looks like 0004-* became mangled. Can you send a version that is not mangled, please?

> I decided not to use the 13th bit of the IndexTuple flags. Instead I use only
> the high bit of the offset, which is also always free on regular tuples. In fact,
> we already use the assumption that there are at most 11 significant bits in an
> index tuple offset in GIN (see ginpostinglist.c).

So? GIN doesn't have the same legacy at all. The GIN posting lists *don't* have regular heap TID pointers at all. They started out without them, and still don't have them.

> Anastasia also pointed out that if we're going to make on-disk changes, they
> should be compatible not only with suffix truncation, but also with duplicate
> compression (which was already posted in thread [1]).

I definitely agree with that, and I think that Anastasia should push for whatever will make future nbtree enhancements easier, especially her own pending or planned enhancements.

> However, I think there is no problem. We can use one of the 3 free bits in
> the offset as a flag that it's a tuple with a posting list. Duplicate
> compression needs to store the number of posting list items and their offset
> in the tuple. The free bits left in the item pointer after reserving 2 bits
> (1 flag for the alternative meaning of offset and 1 flag for the posting list)
> are quite enough for that.

The issue that I see is that we could easily make this unambiguous, free of any context, with a tiny bit more work. Why not just do it that way? Maybe it won't actually matter, but I see no reason not to do it, since we can.

> However, I find the following arguments against implementing this feature
> in covering indexes.
> * We write the number of attributes present into the tuple. But how do we
> prove that it's correct? I added appropriate checks to amcheck. But I don't
> think all users run amcheck frequently enough. Thus, in order to be sure that
> it's correct, we should check that the number of attributes is written
> correctly everywhere in the B-tree code.

Use an assertion. Problem solved. I agree that people aren't using amcheck all that much, but give it time. Oracle and SQL Server have had tools like amcheck for 30+ years. We have had amcheck for one year.

> Without that we can face the situation that we've introduced a new on-disk
> representation meant to enable further B-tree enhancements, but actually it's
> broken. And that would be much worse than nothing.
> In order to check the number of attributes everywhere in the B-tree code, we
> need to actually implement a significant part of suffix compression. And I
> really think we shouldn't do this as part of the covering indexes patch.

I don't think that you need to do that, actually. I'm not asking you to go to those lengths. I have only asked that you make the on-disk representation *compatible* with a future Postgres version that has full suffix truncation (and other such enhancements, too). I care about the on-disk representation more than the code.

> * The offset number is used now for parent refind (see the BTEntrySame()
> macro). In the attached patch, this condition is relaxed. But I don't think I
> really like that. This should be thought out very carefully...

It's safe, although I admit that that's a bit hard to see. Specifically, look at this code in _bt_insert_parent():

/*
 * Find the parent buffer and get the parent page.
 *
 * Oops - if we were moved right then we need to change stack item! We
 * want to find parent pointing to where we are, right ?	- vadim
 * 05/27/97
 */
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

Vadim doesn't seem too sure of why he did it that way.
What's clear is that the offset on all internal pages is always P_HIKEY (that is, 1), because this is the one and only place where new IndexTuples get generated for internal pages. That's unambiguous. So how could BTEntrySame() truly need to care about offset? How could there ever be an internal page offset that wasn't just P_HIKEY? You can look yourself, using pg_hexedit or pageinspect. The comments above BTTidSame()/BTEntrySame() are actually wrong, including "New Comments". Vadim wanted to make TIDs part of the keyspace [1], beginning in around 1997. The idea was that we'd have truly unique keys by including TID, as L&Y intended, but that never happened. Instead, we got commit 9e85183bf in 2000, which among many other things changed the L&Y invariant to deal with duplicates. I think that Tom should have changed BTTidSame() to not care about offset number in that same commit.
Re: WIP: Covering + unique indexes.
On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova wrote:
> Here is the new version of the patch set.
> All patches are rebased to apply without conflicts.
>
> Besides, they contain the following fixes:
> - pg_dump bug is fixed
> - index_truncate_tuple() now has a 3rd argument, new_indnatts.
> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
> - info about opclass implementors and included columns is now in the sgml doc

This only changes the arguments given to index_truncate_tuple(), which is a superficial change. It does not actually change anything about the on-disk representation, which is what I sought. Why is that a problem? I don't think it's very complicated.

The patch needs a rebase, as Erik mentioned:

    1 out of 19 hunks FAILED -- saving rejects to file src/backend/utils/cache/relcache.c.rej
    (Stripping trailing CRs from patch; use --binary to disable.)

I also noticed that you still haven't done anything differently with this code in _bt_checksplitloc(), which I mentioned in April of last year:

    /* Account for all the old tuples */
    leftfree = state->leftspace - olddataitemstoleft;
    rightfree = state->rightspace -
        (state->olddataitemstotal - olddataitemstoleft);

    /*
     * The first item on the right page becomes the high key of the left page;
     * therefore it counts against left space as well as right space.
     */
    leftfree -= firstrightitemsz;

    /* account for the new item */
    if (newitemonleft)
        leftfree -= (int) state->newitemsz;
    else
        rightfree -= (int) state->newitemsz;

With an extreme enough case, this could result in a failure to find a split point. Or at least, if that isn't true then it's not clear why, and I think it needs to be explained.

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
On 1/25/18 23:19, Thomas Munro wrote:
> + PRIMARY KEY ( <replaceable class="parameter">column_name</replaceable> [, ... ] ) <replaceable class="parameter">index_parameters</replaceable> <optional>INCLUDE (<replaceable class="parameter">column_name</replaceable> [, ...])</optional> |
>
> I hadn't seen that use of "<optional>" before. Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.

I think this will probably not come out right in the generated psql help. Check that please.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: Covering + unique indexes.
On 2018-03-28 16:59, Anastasia Lubennikova wrote:
> Here is the new version of the patch set.

I can't get these to apply:

patch -b -l -F 25 -p 1 < /home/aardvark/download/pgpatches/0110/covering_indexes/20180328/0001-Covering-core-v8.patch
1 out of 19 hunks FAILED -- saving rejects to file src/backend/utils/cache/relcache.c.rej

$ cat src/backend/utils/cache/relcache.c.rej
--- src/backend/utils/cache/relcache.c
+++ src/backend/utils/cache/relcache.c
@@ -542,7 +542,7 @@
 	attp = (Form_pg_attribute) GETSTRUCT(pg_attribute_tuple);

 	if (attp->attnum <= 0 ||
-		attp->attnum > relation->rd_rel->relnatts)
+		attp->attnum > RelationGetNumberOfAttributes(relation))
 		elog(ERROR, "invalid attribute number %d for %s",
 			 attp->attnum, RelationGetRelationName(relation));

Erik Rijkers
Re: WIP: Covering + unique indexes.
On Tue, Mar 27, 2018 at 10:14 AM, Teodor Sigaev wrote:
> b) I don't like the idea of limiting usage of that field if we can avoid
> that. Future usage could use it, for example, for different compression
> techniques or something else. Or even removing t_tid from inner tuples to
> save 2 bytes in IndexTupleData. Of course, I remember about alignment, but
> it could be subject to change too in future.

This is contradictory. You seem to be arguing that we need to preserve on-disk compatibility for an optimization that throws out on-disk compatibility. Saving a single byte per internal IndexTuple is not worth it.

We could actually save 2 bytes in *all* nbtree pages, by halving the size of ItemId for nbtree -- we don't need lp_len, which is redundant, and we could reclaim one of the status bits too, to get back a full 16 bits. Also, we could use suffix truncation to save at least one byte in almost all cases, even with the thinnest possible single-integer-attribute IndexTuples. What you describe just isn't going to happen.

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Tue, Mar 27, 2018 at 10:07 AM, Teodor Sigaev wrote:
> Storing the number of attributes in the now-unused t_tid seems to me not
> such a good idea. a) it could (and, I suppose, should) be a separate patch;
> at least it's not directly connected to the covering patch, and it could be
> added even before the covering patch.

I think that we should do that first. It's not very hard.

> b) I don't like the idea of limiting usage of that field if we can avoid
> that. Future usage could use it, for example, for different compression
> techniques or something else.

The extra status bits that this would leave within the offset field can be used for that in the future.

>> * It makes diagnosing issues in the field quite a bit easier. Tools
>> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
>> amcheck specifically). The nbtree IndexTuple format should not need to
>> be interpreted in a context-sensitive way, if we can avoid it.
>
> Both pg_filedump and amcheck could correctly parse any tuple based on the
> BTP_LEAF flag and the length of the tuple.

amcheck doesn't just care about the length of the tuple. It would have to rely on catalog metadata about this being an INCLUDE index.

-- 
Peter Geoghegan
Re: WIP: Covering + unique indexes.
b) I don't like the idea of limiting usage of that field if we can avoid that. Future usage could use it, for example, for different compression techniques or something else. Or even removing t_tid from inner tuples to save 2 bytes in IndexTupleData. Of course, I remember about alignment, but it could be subject to change too in future.

-- 
Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
On Mon, Mar 26, 2018 at 3:10 AM, Alexander Korotkov wrote:
> So, as I get it, you're proposing to introduce an INDEX_ALT_TID_MASK flag
> which would indicate that we're storing something special in the t_tid
> offset. And that should help us not only for covering indexes, but also for
> further btree enhancements including suffix truncation. What exactly do
> you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK flag
> is set? Is it the number of attributes in this particular index tuple?

Yes. I think that once INDEX_ALT_TID_MASK is available, we should store the number of attributes in that particular "separator key" tuple (which has undergone suffix truncation), and always work off of that. You could then have status bits in the offset as follows:

* 1 bit that represents that this is a "separator key" IndexTuple (high key or internal IndexTuple). Otherwise, it's a leaf IndexTuple with an ordinary heap TID. (When INDEX_ALT_TID_MASK isn't set, it's the same as today.)

* 3 reserved bits. I think that one of these bits can eventually be used to indicate that the internal IndexTuple actually has a "normalized key" representation [1], which seems like the best way to do suffix truncation, long term. I think that we should support simple suffix truncation, of the kind that this patch implements, alongside normalized key suffix truncation. We need both for various reasons [2]. Not sure what the other two flag bits might be used for, but they seem worth having.

* 12 bits for the number of attributes, which should be more than enough, even when INDEX_MAX_KEYS is significantly higher than 32. A static assertion can keep this safe when INDEX_MAX_KEYS is set ridiculously high.

I think that this scheme is future-proof. Maybe you have additional ideas on the representation. Please let me know what you think. When we eventually add optimizations that affect IndexTuples on the leaf level, we can start using the block number (bi_hi + bi_lo) itself, much like GIN posting lists.
No need to further consider that (the leaf level optimizations) today, because using block number provides us with many more bits. In internal page items, the block number is always a block number, so internal IndexTuples are rather like GIN posting tree pointers in the main entry tree (its leaf level) -- a conventional item pointer block number is used, alongside unconventional use of the offset field, where there are 16 bits available because no real offset is required. [1] https://wiki.postgresql.org/wiki/Key_normalization#Optimizations_enabled_by_key_normalization [2] https://wiki.postgresql.org/wiki/Key_normalization#How_big_can_normalized_keys_get.2C_and_is_it_worth_it.3F -- Peter Geoghegan
Re: Re: WIP: Covering + unique indexes.
On 3/26/18 6:10 AM, Alexander Korotkov wrote: > > So, as I get you're proposing to introduce INDEX_ALT_TID_MASK flag > which would indicate that we're storing something special in the t_tid > offset. And that should help us not only for covering indexes, but also for > further btree enhancements including suffix truncation. What exactly do > you propose to store into t_tid offset when INDEX_ALT_TID_MASK flag > is set? Is it number of attributes in this particular index tuple? It appears that discussion and review of this patch is ongoing so it should not be marked Ready for Committer. I have changed it to Waiting on Author since there are several pending reviews and at least one bug. Regards, -- -David da...@pgmasters.net
Re: WIP: Covering + unique indexes.
On Sun, Mar 25, 2018 at 1:47 AM, Peter Geogheganwrote: > I was going to say that you could just treat the low bit in the t_tid > offset as representing "see catalog entry". My first idea was that > nothing would have to change about the existing format, since internal > page items already have only the low bit set within their offset. > However, I now see that that won't really work, because we don't > change the offset in high keys when they're copied from a real item > during a page split. Whatever we do, it has to work equally well for > all "separator keys" -- that is, it must work for both downlinks in > internal pages, and all high keys (including high keys at the leaf > level). > OK. > A good solution is to use the unused 13th t_bit. If hash can have a > INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK. > This avoids a BTREE_VERSION bump, and allows us to deal with the > highkey offset issue. Actually, it's even more flexible than that -- > it can work with ordinary leaf tuples in the future, too. That is, we > can eventually implement prefix truncation and deduplication at the > leaf level using this representation, since there is nothing that > limits INDEX_ALT_TID_MASK IndexTuples to "separator keys". > > The main difference between this approach to leaf prefix > truncation/compression/deduplication, and the GIN entry tree's posting > list representation would be that it wouldn't have to be > super-optimized for duplicates, at the expense of more common case for > regular nbtree indexes -- having few or no duplicates. A btree_gin > index on pgbench_accounts(aid) looks very similar to an equivalent > nbtree index if you just compare internal pages from each, but they > look quite different at the leaf level, where GIN has 24 byte > IndexTuples instead of 16 bytes IndexTuples. Of course, this is > because the leaf pages have posting lists that can never be simple > heap pointer TIDs. 
> Right, btree_gin is much smaller than regular btree when there are a lot of duplicates. When there is no duplicates then btree_gin becomes larger than regular btree, because gin stores single item pointer less compact than btree. A secondary goal of this INDEX_ALT_TID_MASK representation should be > that it won't even be necessary to know that an IndexTuple is > contained within a leaf page rather than an index page (again, unlike > GIN). I'm pretty confident that we can have a truly universal > IndexTuple representation for nbtree, while supporting all of these > standard optimizations. > > Sorry for going off in a tangent, but I think it's somewhat necessary > to have a strategy here. Of course, we don't have to get everything > right now, but we should be looking in this direction whenever we talk > about on-disk nbtree changes. > So, as I get you're proposing to introduce INDEX_ALT_TID_MASK flag which would indicate that we're storing something special in the t_tid offset. And that should help us not only for covering indexes, but also for further btree enhancements including suffix truncation. What exactly do you propose to store into t_tid offset when INDEX_ALT_TID_MASK flag is set? Is it number of attributes in this particular index tuple? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On Sat, Mar 24, 2018 at 12:39 PM, Alexander Korotkovwrote: > +1, putting "nattributes" to argument of index_truncate_tuple() would > make this function way more universal. Great. > Originally that code was written by Teodor, but I also put my hands there. > In GIN entry tree, item pointers are stored in a posting list which is > located > after index tuple attributes. So, both t_tid block number and offset are > used for GIN needs. Well, you worked on the posting list compression stuff, at least. :-) > That makes sense. Let's store the number of tuple attributes to t_tid. > Assuming that our INDEX_MAX_KEYS is quite small number, we will have > higher bits of t_tid free for latter use. I was going to say that you could just treat the low bit in the t_tid offset as representing "see catalog entry". My first idea was that nothing would have to change about the existing format, since internal page items already have only the low bit set within their offset. However, I now see that that won't really work, because we don't change the offset in high keys when they're copied from a real item during a page split. Whatever we do, it has to work equally well for all "separator keys" -- that is, it must work for both downlinks in internal pages, and all high keys (including high keys at the leaf level). A good solution is to use the unused 13th t_bit. If hash can have a INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK. This avoids a BTREE_VERSION bump, and allows us to deal with the highkey offset issue. Actually, it's even more flexible than that -- it can work with ordinary leaf tuples in the future, too. That is, we can eventually implement prefix truncation and deduplication at the leaf level using this representation, since there is nothing that limits INDEX_ALT_TID_MASK IndexTuples to "separator keys". 
The main difference between this approach to leaf prefix truncation/compression/deduplication, and the GIN entry tree's posting list representation would be that it wouldn't have to be super-optimized for duplicates, at the expense of more common case for regular nbtree indexes -- having few or no duplicates. A btree_gin index on pgbench_accounts(aid) looks very similar to an equivalent nbtree index if you just compare internal pages from each, but they look quite different at the leaf level, where GIN has 24 byte IndexTuples instead of 16 bytes IndexTuples. Of course, this is because the leaf pages have posting lists that can never be simple heap pointer TIDs. A secondary goal of this INDEX_ALT_TID_MASK representation should be that it won't even be necessary to know that an IndexTuple is contained within a leaf page rather than an index page (again, unlike GIN). I'm pretty confident that we can have a truly universal IndexTuple representation for nbtree, while supporting all of these standard optimizations. Sorry for going off in a tangent, but I think it's somewhat necessary to have a strategy here. Of course, we don't have to get everything right now, but we should be looking in this direction whenever we talk about on-disk nbtree changes. -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Sat, Mar 24, 2018 at 5:21 AM, Peter Geogheganwrote: > On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov > wrote: > >> * There is minor formatting issue in this part of code. Some spaces > need > >> to be replaced with tabs. > >> +IndexTuple > >> +index_truncate_tuple(Relation idxrel, IndexTuple olditup) > >> +{ > >> + TupleDesc itupdesc = > >> CreateTupleDescCopyConstr(RelationGetDescr(idxrel)); > >> + Datum values[INDEX_MAX_KEYS]; > >> + boolisnull[INDEX_MAX_KEYS]; > >> + IndexTuple newitup; > > The last time I looked at this patch, in April 2017, I made the point > that we should add something like an "nattributes" argument to > index_truncate_tuple() [1], rather than always using > IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple(). > I can see that that change hasn't been made since that time. > +1, putting "nattributes" to argument of index_truncate_tuple() would make this function way more universal. With that approach, we can avoid relying on catalog metadata to the > same degree, which was a specific concern that Tom had around that > time. It's easy to do something with t_tid's offset, which is unused > in internal page IndexTuples. We do very similar things in GIN, where > alternative use of an IndexTuple's t_tid supports all kinds of > enhancements, some of which were not originally anticipated. Alexander > surely knows more about this than I do, since he wrote that code. > Originally that code was written by Teodor, but I also put my hands there. In GIN entry tree, item pointers are stored in a posting list which is located after index tuple attributes. So, both t_tid block number and offset are used for GIN needs. Having this index_truncate_tuple() "nattributes" argument, and storing > the number of attributes directly improves quite a lot of things: > > * It makes diagnosing issues in the field quite a bit easier. Tools > like pg_filedump can do the right thing (Tom mentioned pg_filedump and > amcheck specifically). 
The nbtree IndexTuple format should not need to > be interpreted in a context-sensitive way, if we can avoid it. > > * It lets you use index_truncate_tuple() for regular suffix truncation > in the future. These INCLUDE IndexTuples are really just a special > case of suffix truncation. At least, they should be, because otherwise > an eventual suffix truncation feature is going to be incompatible with > the INCLUDE tuple format. *Not* doing this makes suffix truncation > harder. Suffix truncation is a classic technique, first described by > Bayer in 1977, and we are very probably going to add it someday. > > * Once you can tell a truncated IndexTuple from a non-truncated one > with little or no context, you can add defensive assertions in various > places where they're helpful. Similarly, amcheck can use and expect > this as a cross-check against IndexRelationGetNumberOfKeyAttributes(). > This will increase confidence in the design, both initially and over > time. > That makes sense. Let's store the number of tuple attributes to t_tid. Assuming that our INDEX_MAX_KEYS is quite small number, we will have higher bits of t_tid free for latter use. I must say that I am disappointed that nothing has happened here, > especially because this really wasn't very much additional work, and > has essentially no downside. I can see that it doesn't work that way > in the Postgres Pro fork [2], and diverging from that may > inconvenience Postgres Pro, but that's a downside of forking. I don't > think that the community should have to absorb that cost. > Sure, community shouldn't take care about Postgres Pro fork. If we find that something is better to be done differently, than let us do it so. > +Notes to Operator Class Implementors > > + > > > > Besides I really appreciate it, it seems to be unrelated to the covering > > indexes. > > I'd like this to be extracted into separate patch and be committed > > separately. 
> > Commit 3785f7ee, from last month, moved the original "Notes to > Operator Class Implementors" section to the SGML docs. It looks like > that README section was accidentally reintroduced during rebasing. The > new information ("Included attributes in B-tree indexes") should be > moved over to the new section of the user docs -- the section that > 3785f7ee added. > Thank you for noticing that. I've overlooked that. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkovwrote: >> * There is minor formatting issue in this part of code. Some spaces need >> to be replaced with tabs. >> +IndexTuple >> +index_truncate_tuple(Relation idxrel, IndexTuple olditup) >> +{ >> + TupleDesc itupdesc = >> CreateTupleDescCopyConstr(RelationGetDescr(idxrel)); >> + Datum values[INDEX_MAX_KEYS]; >> + boolisnull[INDEX_MAX_KEYS]; >> + IndexTuple newitup; The last time I looked at this patch, in April 2017, I made the point that we should add something like an "nattributes" argument to index_truncate_tuple() [1], rather than always using IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple(). I can see that that change hasn't been made since that time. With that approach, we can avoid relying on catalog metadata to the same degree, which was a specific concern that Tom had around that time. It's easy to do something with t_tid's offset, which is unused in internal page IndexTuples. We do very similar things in GIN, where alternative use of an IndexTuple's t_tid supports all kinds of enhancements, some of which were not originally anticipated. Alexander surely knows more about this than I do, since he wrote that code. Having this index_truncate_tuple() "nattributes" argument, and storing the number of attributes directly improves quite a lot of things: * It makes diagnosing issues in the field quite a bit easier. Tools like pg_filedump can do the right thing (Tom mentioned pg_filedump and amcheck specifically). The nbtree IndexTuple format should not need to be interpreted in a context-sensitive way, if we can avoid it. * It lets you use index_truncate_tuple() for regular suffix truncation in the future. These INCLUDE IndexTuples are really just a special case of suffix truncation. At least, they should be, because otherwise an eventual suffix truncation feature is going to be incompatible with the INCLUDE tuple format. *Not* doing this makes suffix truncation harder. 
Suffix truncation is a classic technique, first described by Bayer in 1977, and we are very probably going to add it someday. * Once you can tell a truncated IndexTuple from a non-truncated one with little or no context, you can add defensive assertions in various places where they're helpful. Similarly, amcheck can use and expect this as a cross-check against IndexRelationGetNumberOfKeyAttributes(). This will increase confidence in the design, both initially and over time. I must say that I am disappointed that nothing has happened here, especially because this really wasn't very much additional work, and has essentially no downside. I can see that it doesn't work that way in the Postgres Pro fork [2], and diverging from that may inconvenience Postgres Pro, but that's a downside of forking. I don't think that the community should have to absorb that cost. > +Notes to Operator Class Implementors > + > > Besides I really appreciate it, it seems to be unrelated to the covering > indexes. > I'd like this to be extracted into separate patch and be committed > separately. Commit 3785f7ee, from last month, moved the original "Notes to Operator Class Implementors" section to the SGML docs. It looks like that README section was accidentally reintroduced during rebasing. The new information ("Included attributes in B-tree indexes") should be moved over to the new section of the user docs -- the section that 3785f7ee added. [1] https://postgr.es/m/cah2-wzm9y59h2m6izjm4fpdup5r4bsrvzgbn2gtrco1j4nz...@mail.gmail.com [2] https://github.com/postgrespro/postgrespro/blob/PGPRO9_5/src/backend/access/common/indextuple.c#L451 -- Peter Geoghegan
Re: WIP: Covering + unique indexes.
On Wed, Mar 21, 2018 at 9:51 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Thu, Mar 8, 2018 at 7:13 PM, Anastasia Lubennikova < > a.lubennik...@postgrespro.ru> wrote: > >> 06.03.2018 11:52, Thomas Munro: >> >>> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova >>>wrote: >>> Thank you for reviewing. All mentioned issues are fixed. >>> == Applying patch 0002-covering-btree_v4.patch... >>> 1 out of 1 hunk FAILED -- saving rejects to file >>> src/backend/access/nbtree/README.rej >>> 1 out of 1 hunk FAILED -- saving rejects to file >>> src/backend/access/nbtree/nbtxlog.c.rej >>> >>> Can we please have a new patch set? >>> >> >> Here it is. >> Many thanks to Andrey Borodin. > > > I took a look at this patchset. I have some notes about it. > > * I see patch changes dblink, amcheck and tcl contribs. It would be nice > to add corresponding > check to dblink and amcheck regression tests. It would be good to do the > same with tcn contrib. > But tcn doesn't have regression tests at all. And it's out of scope of > this patch to add regression > tests to tcn. So, it's OK to just check that it's working correctly with > covering indexes (I hope it's > already done by other reviewers). > > * I think that subscription regression tests in > src/test/subscription should have some use > of covering indexes. Logical decoding and subscription are heavily using > primary keys. > So they need to be tested to work correctly with covering indexes. > > * I also think some isolation tests in src/test/isolation need to check > covering indexes too. > In particular insert-conflict-*.spec and lock-*.spec and probably more. > > * pg_dump doesn't handle old PostgreSQL versions correctly. If I try to > dump database > of PostgreSQL 9.6, pg_dump gives me following error: > > pg_dump: [archiver (db)] query failed: ERROR: column i.indnkeyatts does > not exist > LINE 1: ...atalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indnkeya... 
> ^ > > If fact there is a sequence of "if" ... "else if" blocks in getIndexes() > which selects > appropriate query depending on remote server version. And for pre-11 we > should > use indnatts instead of indnkeyatts. > > * There is minor formatting issue in this part of code. Some spaces need > to be replaced with tabs. > +IndexTuple > +index_truncate_tuple(Relation idxrel, IndexTuple olditup) > +{ > + TupleDesc itupdesc = CreateTupleDescCopyConstr(Rela > tionGetDescr(idxrel)); > + Datum values[INDEX_MAX_KEYS]; > + boolisnull[INDEX_MAX_KEYS]; > + IndexTuple newitup; > > * I think this comment needs to be rephrased. > + /* > + * Code below is concerned to the opclasses which are not used > + * with the included columns. > + */ > I would write something like this: "Code below checks opclass key type. > Included columns > don't have opclasses, and this check is not required for them.". Native > english speakers > could provide even better phrasing though. > > * I would also like all the patches in patchset version to have same > version number. > I understand that "Covering-btree" and "Covering-amcheck" have less > previous > versions than "Covering-core". But it's way easier to identify patches > belonging to > the same patchset version if they have same version number. For sure, > then some > patches would skip some version numbers, but that doesn't seem to be a > problem for me. > I have few more notes regarding this patchset. * indkeyatts is described in the documentation, but I think that description of indnatts should be also updated clarifying that indnatts counts "included" columns. + + indnkeyatts + int2 + + The number of key columns in the index. "Key columns" are ordinary + index columns (as opposed to "included" columns). + * It seems like this paragraph appears in the patchset without any mentioning in the thread. +Notes to Operator Class Implementors + Besides I really appreciate it, it seems to be unrelated to the covering indexes. 
I'd like this to be extracted into separate patch and be committed separately. * There is a typo here: brtee -> btree + * 7. Check various AMs. All but brtee must fail. * This comment should be updated assuming that we now put left page hikey to the WAL independently on whether it's leaf page split. + /* + * We must also log the left page's high key, because the right + * page's leftmost key is suppressed on non-leaf levels. Show it + * as belonging to the left page buffer, so that it is not stored + * if XLogInsert decides it needs a full-page image of the left + * page. + */ * get_index_def() is adjusted to support covering indexes. I think this support deserve to be checked in regression tests. * In PostgreSQL sentences are sometimes divided by single spacing, sometimes divided by double spacing. I think we should follow
Re: WIP: Covering + unique indexes.
On Thu, Mar 8, 2018 at 7:13 PM, Anastasia Lubennikova < a.lubennik...@postgrespro.ru> wrote: > 06.03.2018 11:52, Thomas Munro: > >> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova >>wrote: >> >>> Thank you for reviewing. All mentioned issues are fixed. >>> >> == Applying patch 0002-covering-btree_v4.patch... >> 1 out of 1 hunk FAILED -- saving rejects to file >> src/backend/access/nbtree/README.rej >> 1 out of 1 hunk FAILED -- saving rejects to file >> src/backend/access/nbtree/nbtxlog.c.rej >> >> Can we please have a new patch set? >> > > Here it is. > Many thanks to Andrey Borodin. I took a look at this patchset. I have some notes about it. * I see patch changes dblink, amcheck and tcl contribs. It would be nice to add corresponding check to dblink and amcheck regression tests. It would be good to do the same with tcn contrib. But tcn doesn't have regression tests at all. And it's out of scope of this patch to add regression tests to tcn. So, it's OK to just check that it's working correctly with covering indexes (I hope it's already done by other reviewers). * I think that subscription regression tests in src/test/subscription should have some use of covering indexes. Logical decoding and subscription are heavily using primary keys. So they need to be tested to work correctly with covering indexes. * I also think some isolation tests in src/test/isolation need to check covering indexes too. In particular insert-conflict-*.spec and lock-*.spec and probably more. * pg_dump doesn't handle old PostgreSQL versions correctly. If I try to dump database of PostgreSQL 9.6, pg_dump gives me following error: pg_dump: [archiver (db)] query failed: ERROR: column i.indnkeyatts does not exist LINE 1: ...atalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indnkeya... ^ If fact there is a sequence of "if" ... "else if" blocks in getIndexes() which selects appropriate query depending on remote server version. And for pre-11 we should use indnatts instead of indnkeyatts. 
* There is minor formatting issue in this part of code. Some spaces need to be replaced with tabs. +IndexTuple +index_truncate_tuple(Relation idxrel, IndexTuple olditup) +{ + TupleDesc itupdesc = CreateTupleDescCopyConstr(Rela tionGetDescr(idxrel)); + Datum values[INDEX_MAX_KEYS]; + boolisnull[INDEX_MAX_KEYS]; + IndexTuple newitup; * I think this comment needs to be rephrased. + /* + * Code below is concerned to the opclasses which are not used + * with the included columns. + */ I would write something like this: "Code below checks opclass key type. Included columns don't have opclasses, and this check is not required for them.". Native english speakers could provide even better phrasing though. * I would also like all the patches in patchset version to have same version number. I understand that "Covering-btree" and "Covering-amcheck" have less previous versions than "Covering-core". But it's way easier to identify patches belonging to the same patchset version if they have same version number. For sure, then some patches would skip some version numbers, but that doesn't seem to be a problem for me. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: Covering + unique indexes.
On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova wrote:
> Thank you for reviewing. All mentioned issues are fixed.

== Applying patch 0002-covering-btree_v4.patch...
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/nbtree/README.rej
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/nbtree/nbtxlog.c.rej

Can we please have a new patch set?

--
Thomas Munro
http://www.enterprisedb.com
Re: WIP: Covering + unique indexes.
On Fri, Jan 26, 2018 at 3:01 AM, Anastasia Lubennikova wrote:
> Thanks for the reminder. Rebased patches are attached.

This is a really cool and also difficult feature. Thanks for working on it!

Here are a couple of quick comments on the documentation, since I noticed it doesn't build. With the SGML->XML change: (1) empty closing tags "</>" are no longer accepted, (2) <xref> now needs to be written <xref/>, and (3) xref IDs are now case-sensitive.

+ PRIMARY KEY ( column_name [, ... ] ) index_parameters <optional>INCLUDE (column_name [, ...])</optional> |

I hadn't seen that use of "<optional>" before. Almost everywhere else we use explicit [ and ] characters, but I see that there are other examples, and it is rendered as [ and ] in the output. OK, cool, but I think there should be some extra whitespace so that it comes out as:

  [ INCLUDE ... ]

instead of:

  [INCLUDE ...]

to fit with the existing convention.

+... This also allows UNIQUE indexes to be defined on
+one set of columns, which can include another set of columns in the
+INCLUDE clause, on which the uniqueness is not enforced.
+It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
+also can be used for non-unique indexes as any columns which are not required
+for the searching or ordering of records can be used in the
+INCLUDE clause, which can slightly reduce the size of the index.

Can I suggest rewording these three sentences a bit? Just an idea:

UNIQUE indexes, PRIMARY KEY constraints and EXCLUDE constraints can be defined with extra columns in an INCLUDE clause, in which case uniqueness is not enforced for the extra columns. Moving columns that are not needed for searching, ordering or uniqueness into the INCLUDE clause can sometimes reduce the size of the index while retaining the possibility of using a faster index-only scan.

--
Thomas Munro
http://www.enterprisedb.com
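As a concrete illustration of the behavior the documentation paragraph above is describing, a covering unique index might be used like this. The table and column names are invented for the example.

```sql
CREATE TABLE orders (id integer NOT NULL, customer text, note text);

-- Uniqueness is enforced on id only; note is merely stored in the
-- index's leaf tuples, so duplicate note values are fine.
CREATE UNIQUE INDEX orders_id_idx ON orders (id) INCLUDE (note);

-- Both output columns are available from the index, so this query is
-- a candidate for an index-only scan:
SELECT id, note FROM orders WHERE id = 42;
```

The same INCLUDE clause can be attached to PRIMARY KEY and EXCLUDE constraints, per the synopsis quoted above.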
Re: WIP: Covering + unique indexes.
I feel sorry for the noise, switching this patch there and back, but the patch needs a rebase again. It still applies with -3, but does not compile anymore.

Best regards, Andrey Borodin.

The new status of this patch is: Waiting on Author
Re: [HACKERS] WIP: Covering + unique indexes.
> 21 Jan 2018, at 3:36, Peter Geoghegan wrote:
>
> On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin wrote:
>> Unfortunately, amcheck_next does not work currently on HEAD (there are
>> problems with AllocSetContextCreate() signature), but I've tested
>> bt_index_check() before, during and after pgbench, on primary and on slave.
>> Also, I've checked bt_index_parent_check() on master.
>
> I fixed that recently. It should be fine now.

Oh, sorry, I missed that and was using a stale patched amcheck_next. Thanks! Affirmative, amcheck_next works fine. I ran pgbench against several covering indexes, checking before the load, during and after, both on master and slave. I do not observe any errors besides an infrequent "canceling statement due to conflict with recovery", which is not a sign of any malfunction.

Best regards, Andrey Borodin.
Re: [HACKERS] WIP: Covering + unique indexes.
On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin wrote:
> Unfortunately, amcheck_next does not work currently on HEAD (there are
> problems with AllocSetContextCreate() signature), but I've tested
> bt_index_check() before, during and after pgbench, on primary and on slave.
> Also, I've checked bt_index_parent_check() on master.

I fixed that recently. It should be fine now.

--
Peter Geoghegan
Re: [HACKERS] WIP: Covering + unique indexes.
Hi!

> 18 Jan 2018, at 18:57, Anastasia Lubennikova wrote:
>
> What is amcheck_next?

amcheck_next is the external version of amcheck, maintained by Peter G. on his GitHub. It checks one more thing: that every heap tuple has a twin in the B-tree, the so-called heapallindexed check. Version v3 of your patch was checked with heapallindexed and passed the test, both on master and on slave.

>> During the bt_index_check() test from time to time I was observing
>> ERROR: canceling statement due to conflict with recovery
>> DETAIL: User query might have needed to see row versions that must be
>> removed.
>
> Sorry, I forgot to attach the amcheck fix to the previous message.

No problem, of course I had fixed that before testing.

> Now all the patches are in attachment.
> Could you recheck if the error is still there?

No need to do that, I was checking exactly the same codebase. And that error has nothing to do with your patch: amcheck cannot always perform bt_index_parent_check() on a slave when the master is heavily loaded. It's OK. I reported this error just to be 100% precise about the observed behavior.

Thanks for working on this feature, hope to see it in 11.

Best regards, Andrey Borodin.
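For reference, the heapallindexed verification discussed above is requested through an extra argument to bt_index_check(), as it eventually shipped in PostgreSQL 11's contrib/amcheck. This is a sketch; the index name is invented.

```sql
CREATE EXTENSION amcheck;

-- Structural check of the B-tree only:
SELECT bt_index_check('orders_id_idx');

-- Additionally verify that every heap tuple has a twin in the index
-- (the heapallindexed check):
SELECT bt_index_check('orders_id_idx', heapallindexed => true);
```

The heapallindexed pass probes the index for each heap tuple, so it is considerably more expensive than the structural check alone.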
Re: [HACKERS] WIP: Covering + unique indexes.
Hi!

> 16 Jan 2018, at 21:50, Anastasia Lubennikova wrote:
>
> Updated patches are attached.

Cool, thanks! I've looked into the code, but haven't found anything broken. Since I tried to rebase the patch myself and failed on the parse utils, I've spent some cycles trying to break the parsing. One minor complaint (no need to fix).

This is fine:

x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,upper(filler));
ERROR:  expressions are not supported in included columns

But why not the same error here? The previous message is very descriptive.

x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,aid+1);
ERROR:  syntax error at or near "+"

This works, but should not, IMHO:

x4mmm=# create index on pgbench_accounts (bid) include (aid,aid,aid);
CREATE INDEX

Do not know what to make of this...

x4mmm=# create index on pgbench_accounts (bid) include (aid desc, aid asc);
CREATE INDEX

All these things allow foot-shooting with a small caliber, but do not break big things.

Unfortunately, amcheck_next does not work currently on HEAD (there are problems with the AllocSetContextCreate() signature), but I've tested bt_index_check() before, during and after pgbench, on primary and on slave. Also, I've checked bt_index_parent_check() on master. During the bt_index_check() test, from time to time I was observing:

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

[install]check[-world] passed :)

From my POV, the patch is in good shape. I think it is time to make the patch ready for committer again.

Best regards, Andrey Borodin.
Re: [HACKERS] WIP: Covering + unique indexes.
Hello!

The patch does not apply currently. Anastasia, can you please rebase the patch?

Best regards, Andrey Borodin.
Re: [HACKERS] WIP: Covering + unique indexes.
> 30 Nov 2017, at 23:07, Andrey Borodin wrote:
>
> Seems like it was not a big deal of patching, I've fixed those bits (see
> attachment).
> I've done only simple tests as for now, but I'm planning to do better testing
> before next CF.
> Thanks for mentioning "heapallindexed", I'll use it too.

I've tested the patch with the fixed amcheck (including the "heapallindexed" feature); the tests included bulk index creation, pgbenching, and amcheck of the index itself and of a WAL-replicated index. Everything worked fine.

Spotted one more typo:

> Since 10.0 there is an optional INCLUDE clause

should be

> Since 11.0 there is an optional INCLUDE clause

I think the patch set (two patches + 1 amcheck diff) is ready for committer.

Best regards, Andrey Borodin.
Re: [HACKERS] WIP: Covering + unique indexes.
Hi, Peter!

> 29 Nov 2017, at 8:45, Peter Geoghegan wrote:
>
> It looks like amcheck needs to be patched -- a simple oversight.
> amcheck is probably calling _bt_compare() without realizing that
> internal pages don't have the extra attributes (just leaf pages,
> although they should also not participate in comparisons in respect of
> included/extra columns). There were changes to amcheck at one point in
> the past. That must have slipped through again. I don't think it's
> that complicated.
>
> BTW, it would probably be a good idea to use the new GitHub version's
> "heapallindexed" verification [1] for testing this patch. Anastasia
> will need to patch the externally maintained amcheck to do this, but
> it's probably no extra work, because this is already needed for
> contrib/amcheck, and because the heapallindexed check doesn't actually
> care about index structure at all.

Seems like it was not a big deal of patching, I've fixed those bits (see attachment). I've done only simple tests as for now, but I'm planning to do better testing before next CF. Thanks for mentioning "heapallindexed", I'll use it too.

Best regards, Andrey Borodin.

amcheck_include.diff
Description: Binary data
Re: [HACKERS] WIP: Covering + unique indexes.
On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin wrote:
> Postgres crashes:
> TRAP: FailedAssertion("!(((const void*)() != ((void*)0)) &&
> (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>
> May be I'm doing something wrong or amcheck support will go with different
> patch? Usually amcheck complaining is a sign of other symptoms.

I am marking this patch as returned with feedback for now, as no updates have been provided after two weeks.

--
Michael