Re: Improve search for missing parent downlinks in amcheck

2020-03-11 Thread Peter Geoghegan
On Wed, Mar 11, 2020 at 2:02 AM Alexander Korotkov
 wrote:
> Thank you!  Pushed with this comment revised!

Thanks!

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2020-03-11 Thread Alexander Korotkov
On Wed, Mar 11, 2020 at 7:19 AM Peter Geoghegan  wrote:
> This looks committable. I only noticed one thing: The comments above
> bt_target_page_check() need to be updated to reflect the new check,
> which no longer has anything to do with "heapallindexed = true".

Thank you!  Pushed with this comment revised!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Improve search for missing parent downlinks in amcheck

2020-03-10 Thread Peter Geoghegan
On Tue, Mar 10, 2020 at 8:30 AM Alexander Korotkov
 wrote:
> Yes, current example looks confusing in this aspect.  But your comment
> spotted to me an algorithmic issue.  We don't match highkey of
> leftmost child against parent pivot key.  But we can.  The "if
> (!BlockNumberIsValid(blkno))" branch survived from the patch version
> when we didn't match high keys.  I've revised it.  Now we enter the
> loop even for leftmost page on child level and match high key for that
> page.

Great. That looks better.

> > BTW, a P_LEFTMOST() assertion at the beginning of
> > bt_child_highkey_check() would make this easier to follow.
>
> Yes, but why should it be an assert?  We can imagine corruption, when
> there is left sibling of first child of leftmost target.

I agree. I would only make it an assertion when it concerns an
implementation detail of amcheck, but that doesn't apply here.

> Thank you.  I'd like to have another feedback from you assuming there
> are logic changes.

This looks committable. I only noticed one thing: The comments above
bt_target_page_check() need to be updated to reflect the new check,
which no longer has anything to do with "heapallindexed = true".

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2020-03-10 Thread Alexander Korotkov
On Tue, Mar 10, 2020 at 3:07 AM Peter Geoghegan  wrote:
> On Sun, Mar 8, 2020 at 2:52 PM Alexander Korotkov
>  wrote:
> > I've revised this comment.  Hopefully it's better now.
>
> I think that the new comments about why we need a low key for the page
> are much better now.

Good, thank you.

> > I've updated this comment using terms "cousin page" and "subtree".  I
> > didn't refer the diagram example, because it doesn't contain
> > appropriate case.  And I wouldn't like this diagram to contain such
> > case, because that probably makes this diagram too complex.  I've also
> > invented term "uncle page". BTW, should it be "aunt page"?  I don't
> > know.
>
> I have never heard the term "uncle page" before, but I like it --
> though maybe say "right uncle page". That happens to be the exact
> relationship that we're talking about here. I think any one of
> "uncle", "aunt", or "uncle/aunt" are acceptable. We'll probably never
> need to use this term again, but it seems like the right term to use
> here.

According to context that should be left uncle page.  I've changed the
text accordingly.

> Anyway, this section also seems much better now.
>
> Other things that I noticed:
>
> * Typo:
>
> > +   /*
> > +* We don't call bt_child_check() for "negative infinity" items.
> > +* But if we're performatin downlink connectivity check, we do 
> > it
> > +* for every item including "negative infinity" one.
> > +*/
>
> s/performatin/performing/

Fixed.

> * Suggest that you say "has incomplete split flag set" here:
>
> > + * - The call for block 4 will initialize data structure, but doesn't do 
> > actual
> > + *   checks assuming page 4 has incomplete split.

Yes, that sounds better.  Changed here and in the other places.

> * More importantly, is this the right thing to say about page 4? Isn't
> it also true that page 4 is the leftmost leaf page, and therefore kind
> of special in another way? Even without having the incomplete split
> flag set at all? Wouldn't it be better to address the incomplete split
> flag issue by making that apply to some other page that isn't also the
> leftmost? That would allow you to talk about the leftmost case
> directly here. Or it would at least make it less confusing.

Yes, current example looks confusing in this aspect.  But your comment
spotted to me an algorithmic issue.  We don't match highkey of
leftmost child against parent pivot key.  But we can.  The "if
(!BlockNumberIsValid(blkno))" branch survived from the patch version
when we didn't match high keys.  I've revised it.  Now we enter the
loop even for leftmost page on child level and match high key for that
page.

> BTW, a P_LEFTMOST() assertion at the beginning of
> bt_child_highkey_check() would make this easier to follow.

Yes, but why should it be an assert?  We can imagine corruption, when
there is left sibling of first child of leftmost target.  I guess,
current code would report such situation as an error, because this
left sibling lacks of parent downlink.  I've revised that "if" branch,
so we don't load a child page there anymore.  Error reporting is added
to the main loop.

> * Correct spelling is "happens" here:
>
> > +* current child page is not incomplete split, then its high key
> > +* should match to the target's key of current offset number. 
> > This
> > +* happends when child page referenced by previous downlink is
>
> * Actually, maybe this whole sentence should be reworded instead --
> say "This happens when a previous call here (to
> bt_child_highkey_check()) found an incomplete split, and we reach a
> right sibling page without a downlink -- the right sibling page's high
> key still needs to be matched to a separator key on the parent/target
> level".
>
> * Maybe say "Don't apply OffsetNumberNext() to target_downlinkoffnum
> when we already had to step right on the child level. Our traversal of
> the child level must try to move in perfect lockstep behind (to the
> left of) the target/parent level traversal."
>
> I found this detail very confusing at first.
>
> * The docs should say "...relationships, including checking that there
> are no missing downlinks in the index structure" here:
>
> >unlike bt_index_check,
> >bt_index_parent_check also checks
> > -  invariants that span parent/child relationships.
> > +  invariants that span parent/child relationships including check
> > +  that there are no missing downlinks in the index structure.
> >bt_index_parent_check follows the general
> >convention of raising an error if it finds a logical
> >inconsistency or other problem.

This comments are revised as you proposed.

> This is very close now. I would be okay with you committing the patch
> once you deal with this feedback. If you prefer, I can take another
> look at a new revision.

Thank you.  I'd like to have another feedback from you assuming there
are logic 

Re: Improve search for missing parent downlinks in amcheck

2020-03-09 Thread Peter Geoghegan
On Sun, Mar 8, 2020 at 2:52 PM Alexander Korotkov
 wrote:
> I've revised this comment.  Hopefully it's better now.

I think that the new comments about why we need a low key for the page
are much better now.

> I've updated this comment using terms "cousin page" and "subtree".  I
> didn't refer the diagram example, because it doesn't contain
> appropriate case.  And I wouldn't like this diagram to contain such
> case, because that probably makes this diagram too complex.  I've also
> invented term "uncle page". BTW, should it be "aunt page"?  I don't
> know.

I have never heard the term "uncle page" before, but I like it --
though maybe say "right uncle page". That happens to be the exact
relationship that we're talking about here. I think any one of
"uncle", "aunt", or "uncle/aunt" are acceptable. We'll probably never
need to use this term again, but it seems like the right term to use
here.

Anyway, this section also seems much better now.

Other things that I noticed:

* Typo:

> +   /*
> +* We don't call bt_child_check() for "negative infinity" items.
> +* But if we're performatin downlink connectivity check, we do it
> +* for every item including "negative infinity" one.
> +*/

s/performatin/performing/

* Suggest that you say "has incomplete split flag set" here:

> + * - The call for block 4 will initialize data structure, but doesn't do 
> actual
> + *   checks assuming page 4 has incomplete split.

* More importantly, is this the right thing to say about page 4? Isn't
it also true that page 4 is the leftmost leaf page, and therefore kind
of special in another way? Even without having the incomplete split
flag set at all? Wouldn't it be better to address the incomplete split
flag issue by making that apply to some other page that isn't also the
leftmost? That would allow you to talk about the leftmost case
directly here. Or it would at least make it less confusing.

BTW, a P_LEFTMOST() assertion at the beginning of
bt_child_highkey_check() would make this easier to follow.

* Correct spelling is "happens" here:

> +* current child page is not incomplete split, then its high key
> +* should match to the target's key of current offset number. This
> +* happends when child page referenced by previous downlink is

* Actually, maybe this whole sentence should be reworded instead --
say "This happens when a previous call here (to
bt_child_highkey_check()) found an incomplete split, and we reach a
right sibling page without a downlink -- the right sibling page's high
key still needs to be matched to a separator key on the parent/target
level".

* Maybe say "Don't apply OffsetNumberNext() to target_downlinkoffnum
when we already had to step right on the child level. Our traversal of
the child level must try to move in perfect lockstep behind (to the
left of) the target/parent level traversal."

I found this detail very confusing at first.

* The docs should say "...relationships, including checking that there
are no missing downlinks in the index structure" here:

>unlike bt_index_check,
>bt_index_parent_check also checks
> -  invariants that span parent/child relationships.
> +  invariants that span parent/child relationships including check
> +  that there are no missing downlinks in the index structure.
>bt_index_parent_check follows the general
>convention of raising an error if it finds a logical
>inconsistency or other problem.

This is very close now. I would be okay with you committing the patch
once you deal with this feedback. If you prefer, I can take another
look at a new revision.

--
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2020-03-09 Thread Alexander Korotkov
On Mon, Mar 9, 2020 at 12:52 AM Alexander Korotkov
 wrote:
> Attached patch also has revised commit message.  I'll wait for your
> response before commit.

Oh, I found that I haven't attached the patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-8.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2020-03-08 Thread Alexander Korotkov
Hi, Peter!

On Tue, Mar 3, 2020 at 3:04 AM Peter Geoghegan  wrote:
> Apologies for the delayed response. I was a little tired from the
> deduplication project.

No problem.  Apologies for the delayed revision as well.

> I taught pageinspect to display a "htid" field for pivot tuples
> recently, making it easier to visualize this example.

Great!

> I think that you should say more about how "lowkey" is used here:
>
> > /*
> > -* Record if page that is about to become target is the right half 
> > of
> > -* an incomplete page split.  This can go stale immediately in
> > -* !readonly case.
> > +* Copy current target low key as the high key of right sibling.
> > +* Allocate memory in upper level context, so it would be cleared
> > +* after reset of target context.
> > +*
> > +* We only need low key for parent check.
> >  */
> > -   state->rightsplit = P_INCOMPLETE_SPLIT(opaque);
> > +   if (state->readonly && !P_RIGHTMOST(opaque))
> > +   {
>
> Say something about concurrent page splits, since they're the only
> case where we actually use lowkey. Maybe say something like: "We
> probably won't end up doing anything with lowkey, but it's simpler for
> readonly verification to always have it available".

I've revised this comment.  Hopefully it's better now.

> * I think that these comments could still be clearer:
>
> > +   /*
> > +* We're going to try match child high key to "negative
> > +* infinity key".  This normally happens when the last child
> > +* we visited for target's left sibling was an incomplete
> > +* split. So, we must be still on the child of target's left
> > +* sibling. Thus, we should match to target's left sibling
> > +* high key. Thankfully we saved it, it's called a "low 
> > key".
> > +*/
>
> Maybe start with "We cannot try to match child's high key to a
> negative infinity key in target, since there is nothing to compare.
> However...". Perhaps use terms like "cousin page" and "subtree", which
> can be useful. Alternatively, mention this case in the diagram example
> at the top of bt_child_highkey_check(). It's tough to write comments
> like this, but I think it's worth it.

I've updated this comment using terms "cousin page" and "subtree".  I
didn't refer the diagram example, because it doesn't contain
appropriate case.  And I wouldn't like this diagram to contain such
case, because that probably makes this diagram too complex.  I've also
invented term "uncle page". BTW, should it be "aunt page"?  I don't
know.

> Note that a high key is also a pivot tuple, so I wouldn't mention high
> keys here:
>
> > +/*
> > + * Check if two tuples are binary identical except the block number.  So,
> > + * this function is capable to compare high keys with pivot keys.
> > + */
> > +static bool
> > +bt_pivot_tuple_identical(IndexTuple itup1, IndexTuple itup2)
> > +{

Sure, this comment is revised.

> v7 looks pretty close to being commitable, though I'll probably want
> to update some comments that you haven't touched when you commit this.
> I should probably wait until you've committed the patch to go do that.
> I'm thinking of things like old comments in bt_downlink_check().
>
> I will test the patch properly one more time when you produce a new
> revision. I haven't really tested it since the last time.

Attached patch also has revised commit message.  I'll wait for your
response before commit.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Improve search for missing parent downlinks in amcheck

2020-03-02 Thread Peter Geoghegan
Hi Alexander,

Apologies for the delayed response. I was a little tired from the
deduplication project.

On Mon, Feb 24, 2020 at 2:54 PM Alexander Korotkov
 wrote:
> Thank you for your review.  The revised version is attached.

This has bitrot, because of the deduplication patch. Shouldn't be hard
to rebase, though.

> > 'foo, -inf'
> > 'foo, (1,24)'
> > 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
> > page that's filled with duplicates of the value 'foo'
> > 'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
> > leaf page that's filled with duplicates of the value 'food'
> > ...

> Thank you for the explanation!

I taught pageinspect to display a "htid" field for pivot tuples
recently, making it easier to visualize this example.

I think that you should say more about how "lowkey" is used here:

> /*
> -* Record if page that is about to become target is the right half of
> -* an incomplete page split.  This can go stale immediately in
> -* !readonly case.
> +* Copy current target low key as the high key of right sibling.
> +* Allocate memory in upper level context, so it would be cleared
> +* after reset of target context.
> +*
> +* We only need low key for parent check.
>  */
> -   state->rightsplit = P_INCOMPLETE_SPLIT(opaque);
> +   if (state->readonly && !P_RIGHTMOST(opaque))
> +   {

Say something about concurrent page splits, since they're the only
case where we actually use lowkey. Maybe say something like: "We
probably won't end up doing anything with lowkey, but it's simpler for
readonly verification to always have it available".

> Actually, lowkey is used for removing "cousin page verification blind
> spot" when there are incomplete splits.  It might happen that we read
> child with hikey matching its parent high key only when
> bt_child_highkey_check() is called for "minus infinity" key of parent
> right sibling.  Saving low key helps in this case.

That makes sense to me.

> It appears that these false positives were cause by very basic error made 
> here:
>
> if (!first && !P_IGNORE(opaque))
> {
> /* blkno probably has missing parent downlink */
> bt_downlink_missing_check(state, rightsplit, blkno, page);
> }
>
> actually it should be
>
> if (blkno != downlink && !P_IGNORE(opaque))
> {
> /* blkno probably has missing parent downlink */
> bt_downlink_missing_check(state, rightsplit, blkno, page);
> }
>
> So "blkno == downlink" means blkno has downlink, not being checked
> first in the loop.  This is remains of old version of patch which I
> forget to clean.  Now the check you've described works for me.
>
> If you still think lowkey check is a problem, please help me figure it out.

* I think that these comments could still be clearer:

> +   /*
> +* We're going to try match child high key to "negative
> +* infinity key".  This normally happens when the last child
> +* we visited for target's left sibling was an incomplete
> +* split. So, we must be still on the child of target's left
> +* sibling. Thus, we should match to target's left sibling
> +* high key. Thankfully we saved it, it's called a "low key".
> +*/

Maybe start with "We cannot try to match child's high key to a
negative infinity key in target, since there is nothing to compare.
However...". Perhaps use terms like "cousin page" and "subtree", which
can be useful. Alternatively, mention this case in the diagram example
at the top of bt_child_highkey_check(). It's tough to write comments
like this, but I think it's worth it.

Note that a high key is also a pivot tuple, so I wouldn't mention high
keys here:

> +/*
> + * Check if two tuples are binary identical except the block number.  So,
> + * this function is capable to compare high keys with pivot keys.
> + */
> +static bool
> +bt_pivot_tuple_identical(IndexTuple itup1, IndexTuple itup2)
> +{

v7 looks pretty close to being commitable, though I'll probably want
to update some comments that you haven't touched when you commit this.
I should probably wait until you've committed the patch to go do that.
I'm thinking of things like old comments in bt_downlink_check().

I will test the patch properly one more time when you produce a new
revision. I haven't really tested it since the last time.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2020-02-24 Thread Alexander Korotkov
Hi!

Thank you for your review.  The revised version is attached.

On Wed, Feb 19, 2020 at 1:16 AM Peter Geoghegan  wrote:
> On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
>  wrote:
> > > Don't need the "!P_ISLEAF()" here.
> >
> > Why don't I need.  bt_downlink_connectivity_check() checks one level
> > down to the target level.  But there is no one level down to leaf...
>
> Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
> it's better your way, though -- apparently it's clearer.

Oh, I see.  I prefer to leave it my way.  Explicit check is nearly
free and makes it clearer for me.

> > > > Alternatively
> > > > +* we might find child with high key while traversing from
> > > > +* previous downlink to current one.  Then matching key 
> > > > resides
> > > > +* the same offset number as current downlink.
> > > > +*/
> > >
> > > Not sure what "traversing from previous downlink to current one" means at 
> > > all.
> >
> > I've rephrased this comment, please check.
>
> > Agree, replaced _bt_compare() with bt_pivot_tuple_identical().  It
> > becomes even simpler now, thanks!
>
> There was actually an even better reason to invent
> bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs
> to do something like the extra steps that you see in routines like
> invariant_l_offset(). _bt_compare() will return 0 when the insertion
> scankey has a prefix of scankey/column values that are equal, even
> though there may be additional columns in the index tuple that are not
> compared. So, you could have a truncated multi-column high key that is
> "equal" to pivot tuple in parent that is actually to the right in the
> key space. This blind spot would often occur with low cardinality
> indexes, where we often have something like this in pivot tuples on
> internal pages:
>
> 'foo, -inf'
> 'foo, (1,24)'
> 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
> page that's filled with duplicates of the value 'foo'
> 'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
> leaf page that's filled with duplicates of the value 'food'
> ...
>
> The situation is really common in low cardinality indexes because
> nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
> is avoided whenever possible. You reliably get a '-inf' value for the
> TID in the first pivot tuple for the duplicate, followed by a real
> heap TID for later pivot tuples for pages with the same duplicate
> value.
>

Thank you for the explanation!

> > > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > > should be renamed to something broader. In my mind, downlink is
> > > basically a block number. We have been sloppy about using the term
> > > downlink when we really mean "pivot tuple with a downlink" -- I am
> > > guilty of this myself. But it seems more important, now that you have
> > > the new high key check.
> >
> > Hmm... Names are hard for me.  I didn't do any renaming for now.  What
> > about this?
> > bt_downlink_check() => bt_child_check()
> > bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()
>
> I suggest:
>
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_highkey_check()
>
> While bt_downlink_connectivity_check() moves right on the target's
> child level, this isn't the common case. Moving right like that
> doesn't need to be suggested by the name of the function.
>
> Most of the time, we just check the high key -- right?

Good.  Renamed as you proposed.

> * You should say "target page lsn" here instead:
>
> pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, 
> true);
> ERROR:  mismatch between parent key and child high key in index "i_holding2"
> DETAIL:  Target block=1570 child block=1690 parent page lsn=0/0.
> Time: 12.509 ms

Yep, fixed.

> * Maybe say "Move to the right on the child level" in a comment above
> the bt_downlink_connectivity_check() "while (true)" loop here:

Comment is added.

> > +
> > +   while (true)
> > +   {
> > +   /*
> > +* Did we traverse the whole tree level and this is check for pages 
> > to
> > +* the right of rightmost downlink?
> > +*/
>
> * If you are going to save a low key for the target page in memory,
> then you only need to do so for "state->readonly"/parent verification.

Sure, now it's just for state->readonly.

> * You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
> key". This is a term that nbtsort.c now uses, in case you didn't know.

Changed, thank you for pointing.

> * The reason for saving a low key for each target page is very
> unclear. Can we fix this?

Actually, lowkey is used for removing "cousin page verification blind
spot" when there are incomplete splits.  It might happen that we read
child with hikey matching its parent high key only when
bt_child_highkey_check() is called for "minus infinity" key of parent
right 

Re: Improve search for missing parent downlinks in amcheck

2020-02-18 Thread Peter Geoghegan
On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
 wrote:
> Great, thank you very much!

No problem!

My remarks here are based on
"amcheck-btree-improve-missing-parent-downlinks-check-6.patch". I have
found a false positive corruption report bug in this latest version --
see note below about incomplete page splits.

> > Don't need the "!P_ISLEAF()" here.
>
> Why don't I need.  bt_downlink_connectivity_check() checks one level
> down to the target level.  But there is no one level down to leaf...

Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
it's better your way, though -- apparently it's clearer.

> > > Alternatively
> > > +* we might find child with high key while traversing from
> > > +* previous downlink to current one.  Then matching key 
> > > resides
> > > +* the same offset number as current downlink.
> > > +*/
> >
> > Not sure what "traversing from previous downlink to current one" means at 
> > all.
>
> I've rephrased this comment, please check.

> Agree, replaced _bt_compare() with bt_pivot_tuple_identical().  It
> becomes even simpler now, thanks!

There was actually an even better reason to invent
bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs
to do something like the extra steps that you see in routines like
invariant_l_offset(). _bt_compare() will return 0 when the insertion
scankey has a prefix of scankey/column values that are equal, even
though there may be additional columns in the index tuple that are not
compared. So, you could have a truncated multi-column high key that is
"equal" to pivot tuple in parent that is actually to the right in the
key space. This blind spot would often occur with low cardinality
indexes, where we often have something like this in pivot tuples on
internal pages:

'foo, -inf'
'foo, (1,24)'
'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
page that's filled with duplicates of the value 'foo'
'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
leaf page that's filled with duplicates of the value 'food'
...

The situation is really common in low cardinality indexes because
nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
is avoided whenever possible. You reliably get a '-inf' value for the
TID in the first pivot tuple for the duplicate, followed by a real
heap TID for later pivot tuples for pages with the same duplicate
value.

(Anyway, it's not important now.)

> > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > should be renamed to something broader. In my mind, downlink is
> > basically a block number. We have been sloppy about using the term
> > downlink when we really mean "pivot tuple with a downlink" -- I am
> > guilty of this myself. But it seems more important, now that you have
> > the new high key check.
>
> Hmm... Names are hard for me.  I didn't do any renaming for now.  What
> about this?
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

I suggest:

bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_highkey_check()

While bt_downlink_connectivity_check() moves right on the target's
child level, this isn't the common case. Moving right like that
doesn't need to be suggested by the name of the function.

Most of the time, we just check the high key -- right?

> I've revised finding the matching pivot key for high key.  Now, code
> assumes we should always find a matching pivot key.  It could use both
> target high key or left sibling high key (which is memorized as "low
> key").
>
> I've checked this on normal indexes, but I didn't try to exercise this
> with broken indexes.  I would appreciate if you do.

I can confirm that checking the high key in target's child page
against the high key in target (when appropriate) removes the "cousin
page verification blind spot" that I noticed in the last version, as
expected. Great!

* You should say "target page lsn" here instead:

pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true);
ERROR:  mismatch between parent key and child high key in index "i_holding2"
DETAIL:  Target block=1570 child block=1690 parent page lsn=0/0.
Time: 12.509 ms

* Maybe say "Move to the right on the child level" in a comment above
the bt_downlink_connectivity_check() "while (true)" loop here:

> +
> +   while (true)
> +   {
> +   /*
> +* Did we traverse the whole tree level and this is check for pages to
> +* the right of rightmost downlink?
> +*/

* If you are going to save a low key for the target page in memory,
then you only need to do so for "state->readonly"/parent verification.

* You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
key". This is a term that nbtsort.c now uses, in case you didn't know.

* The reason for saving a low key for each target page is very
unclear. Can 

Re: Improve search for missing parent downlinks in amcheck

2020-02-18 Thread Alexander Korotkov
On Tue, Feb 18, 2020 at 1:15 PM Alexander Korotkov
 wrote:
> On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan  wrote:
> > On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
> >  wrote:
> > > Rebased patch is attached.  Sorry for so huge delay.
> >
> > I really like this patch. Your interest in amcheck is something that
> > makes me feel good about having put so much work into it myself.
> >
> > Here are some review comments:
>
> Great, thank you very much!

Sorry, I've forgot to commit some comments before publishing a patch.
The right version is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-6.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2020-02-18 Thread Alexander Korotkov
Hi, Peter!

On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan  wrote:
> On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
>  wrote:
> > Rebased patch is attached.  Sorry for so huge delay.
>
> I really like this patch. Your interest in amcheck is something that
> makes me feel good about having put so much work into it myself.
>
> Here are some review comments:

Great, thank you very much!

> > +   /*
> > +* Rightlink and incomplete split flag of previous block referenced by
> > +* downlink.
> > +*/
> > +   BlockNumber prevrightlink;
> > +   boolprevincompletesplit;
> > +
>
> What downlink? What does this mean? Do you mean the most recently
> followed rightlink on the current level, or InvalidBlockNumber if
> target page is the leftmost page on the current level on the scan?
>
> (Thinks some more...)
>
> Actually, these two new fields track the state *one level down* from
> the target page level when !readonly (unless target page is on the
> leaf level). Right? Comments should be explicit about this. The
> current comments about downlinks isn't clear.

I agree. I've used very vague terms in comments.  Revised now.

> > if (offset_is_negative_infinity(topaque, offset))
> > +   {
> > +   /*
> > +* Initialize downlink connectivity check if needed.
> > +*/
> > +   if (!P_ISLEAF(topaque) && state->readonly)
> > +   {
> > +   bt_downlink_connectivity_check(state,
> > +  offset,
> > +  NULL,
> > +  topaque->btpo.level);
> > +   }
> > continue;
> > +   }
>
> Don't need the "!P_ISLEAF()" here.

Why don't I need.  bt_downlink_connectivity_check() checks one level
down to the target level.  But there is no one level down to leaf...

> Also, you should say something like
> "we need to call this here because the usual callsite in
> bt_downlink_check() won't be reached".

Sure, fixed.

> > /*
> > -* * Check if page has a downlink in parent *
> > -*
> > -* This can only be checked in heapallindexed + readonly case.
> > +* If we traversed the whole level to the rightmost page, there might be
> > +* missing downlinks for the pages to the right of rightmost downlink.
> > +* Check for them.
> >  */
>
> You mean "to the right of the child page pointed to by our rightmost 
> downlink"?

Yep, fixed.

> I think that the final bt_downlink_connectivity_check() call within
> bt_target_page_check() should make it clear that it is kind of special
> compared to the other two calls.

Yes, this is fixed too.

> > +/*
> > + * Check connectivity of downlinks.  Traverse rightlinks from previous 
> > downlink
> > + * to the current one.  Check that there are no intermediate pages with 
> > missing
> > + * downlinks.
> > + *
> > + * If 'loaded_page' is given, it's assumed to be contents of downlink
> > + * referenced by 'downlinkoffnum'.
> > + */
>
> Say "assumed to be the page pointed to by the downlink", perhaps?

Yes, fixed.

> > +static void
> > +bt_downlink_connectivity_check(BtreeCheckState *state,
> > +  OffsetNumber downlinkoffnum,
> > +  Page loaded_page,
> > +  uint32 parent_level)
> > +{
>
> In amcheck, we always have a current target page. Every page gets to
> be the target exactly once, though sometimes other subsidiary pages
> are accessed. We try to blame the target page, even with checks that
> are technically against its child/sibling/whatever. The target page is
> always our conceptual point of reference. Sometimes this is a bit
> artificial, but it's still worth doing consistently. So I think you
> should change these argument names with that in mind (see below).

Yes, the arguments were changes as you proposed.

> > +   /*
> > +* If we visit page with high key, check that it should be equal to
> > +* the target key next to corresponding downlink.
> > +*/
>
> I suggest "...check that it is equal to the target key..."

Agree, fixed.

> > +   /*
> > +* There might be two situations when we examine high key.  If
> > +* current child page is referenced by given downlink, we should
> > +* look to the next offset number for matching key.
>
> You mean "the next offset number for the matching key from the target
> page"? I find it much easier to keep this stuff in my head if
> everything is defined in terms of its relationship with the current
> target page. For example, bt_downlink_connectivity_check()'s
> "parent_level" argument should be called "target_level" instead, while
> its "loaded_page" should be called "loaded_child". Maybe
> "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
> should definitely be explained in comments at the top of
> bt_downlink_connectivity_check() 

Re: Improve search for missing parent downlinks in amcheck

2020-01-23 Thread Peter Geoghegan
On Thu, Jan 23, 2020 at 5:40 PM Peter Geoghegan  wrote:
> I suppose the alternative is to get the high key of the parent's left
> sibling, rather than going to the parent's parent (i.e. our
> grandparent). That would probably be the best way to get a separator
> key to compare against the high key in the leftmost cousin page of a
> subtree, if in fact we wanted to *fully* solve the "cousin problem".

I think I've confused myself here. The
"!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the
bt_downlink_connectivity_check() high key check test that I mentioned
now *also* seem unnecessary. Any high key in a page that isn't marked
"half-dead" or marked "deleted" or marked "has incomplete split" can
be targeted by the check. Once the page meets that criteria, there
must be a pivot tuple in the parent page that contains an
identical-to-highkey separator key (this could be the parent's own
high key).

The only thing that you need to do is be careful about rightmost
parent pages of a non-rightmost page -- stuff like that. But, I think
that that's only needed because an amcheck segfault isn't a very nice
way to detect corruption.

The existing comment about negative infinity items within
bt_downlink_check() that I mentioned in my main e-mail (the "Note:"
part) don't quite apply here. You're not verifying a pivot tuple that
has a downlink (which might be negative infinity) against the lower
levels -- you're doing *the opposite*. That is, you're verifying a
high key using the parent. Which seems like the right way to do it --
you can test for the incomplete split flag and so on by doing it
bottom up (not top down). This must have been why I was confused.

Phew!
-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2020-01-23 Thread Peter Geoghegan
On Thu, Jan 23, 2020 at 5:31 PM Peter Geoghegan  wrote:
> In summary: I suppose that we can also solve "the cousin problem"
> quite easily, but only for rightmost cousins within a subtree --
> leftmost cousins might be too messy to verify for it to be worth it.
> We don't want to have to jump two or three levels up within
> bt_downlink_connectivity_check() just for leftmost cousin pages. But
> maybe you're feeling ambitious! What do you think?

I suppose the alternative is to get the high key of the parent's left
sibling, rather than going to the parent's parent (i.e. our
grandparent). That would probably be the best way to get a separator
key to compare against the high key in the leftmost cousin page of a
subtree, if in fact we wanted to *fully* solve the "cousin problem".
Goetz Graefe recommends keeping both a low key and a high key in every
page for verification purposes. We don't actually have a low key (we
only have a truncated negative infinity item), but this approach isn't
that far off having a low key.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2020-01-23 Thread Peter Geoghegan
On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
 wrote:
> Rebased patch is attached.  Sorry for so huge delay.

I really like this patch. Your interest in amcheck is something that
makes me feel good about having put so much work into it myself.

Here are some review comments:

> +   /*
> +* Rightlink and incomplete split flag of previous block referenced by
> +* downlink.
> +*/
> +   BlockNumber prevrightlink;
> +   boolprevincompletesplit;
> +

What downlink? What does this mean? Do you mean the most recently
followed rightlink on the current level, or InvalidBlockNumber if
target page is the leftmost page on the current level on the scan?

(Thinks some more...)

Actually, these two new fields track the state *one level down* from
the target page level when !readonly (unless target page is on the
leaf level). Right? Comments should be explicit about this. The
current comments about downlinks isn't clear.

> if (offset_is_negative_infinity(topaque, offset))
> +   {
> +   /*
> +* Initialize downlink connectivity check if needed.
> +*/
> +   if (!P_ISLEAF(topaque) && state->readonly)
> +   {
> +   bt_downlink_connectivity_check(state,
> +  offset,
> +  NULL,
> +  topaque->btpo.level);
> +   }
> continue;
> +   }

Don't need the "!P_ISLEAF()" here. Also, you should say something like
"we need to call this here because the usual callsite in
bt_downlink_check() won't be reached".

> /*
> -* * Check if page has a downlink in parent *
> -*
> -* This can only be checked in heapallindexed + readonly case.
> +* If we traversed the whole level to the rightmost page, there might be
> +* missing downlinks for the pages to the right of rightmost downlink.
> +* Check for them.
>  */

You mean "to the right of the child page pointed to by our rightmost downlink"?

I think that the final bt_downlink_connectivity_check() call within
bt_target_page_check() should make it clear that it is kind of special
compared to the other two calls.

> +/*
> + * Check connectivity of downlinks.  Traverse rightlinks from previous 
> downlink
> + * to the current one.  Check that there are no intermediate pages with 
> missing
> + * downlinks.
> + *
> + * If 'loaded_page' is given, it's assumed to be contents of downlink
> + * referenced by 'downlinkoffnum'.
> + */

Say "assumed to be the page pointed to by the downlink", perhaps?

> +static void
> +bt_downlink_connectivity_check(BtreeCheckState *state,
> +  OffsetNumber downlinkoffnum,
> +  Page loaded_page,
> +  uint32 parent_level)
> +{

In amcheck, we always have a current target page. Every page gets to
be the target exactly once, though sometimes other subsidiary pages
are accessed. We try to blame the target page, even with checks that
are technically against its child/sibling/whatever. The target page is
always our conceptual point of reference. Sometimes this is a bit
artificial, but it's still worth doing consistently. So I think you
should change these argument names with that in mind (see below).

> +   /*
> +* If we visit page with high key, check that it should be equal to
> +* the target key next to corresponding downlink.
> +*/

I suggest "...check that it is equal to the target key..."

> +   /*
> +* There might be two situations when we examine high key.  If
> +* current child page is referenced by given downlink, we should
> +* look to the next offset number for matching key.

You mean "the next offset number for the matching key from the target
page"? I find it much easier to keep this stuff in my head if
everything is defined in terms of its relationship with the current
target page. For example, bt_downlink_connectivity_check()'s
"parent_level" argument should be called "target_level" instead, while
its "loaded_page" should be called "loaded_child". Maybe
"downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
should definitely be explained in comments at the top of
bt_downlink_connectivity_check() (e.g., say what it means when it is
InvalidOffsetNumber).

> Alternatively
> +* we might find child with high key while traversing from
> +* previous downlink to current one.  Then matching key resides
> +* the same offset number as current downlink.
> +*/

Not sure what "traversing from previous downlink to current one" means at all.

> +   if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> +   pivotkey_offset <= PageGetMaxOffsetNumber(state->target))
> +   {
> +   uint32  cmp = _bt_compare(state->rel,
> +   

Re: Improve search for missing parent downlinks in amcheck

2020-01-22 Thread Alexander Korotkov
On Fri, Jan 17, 2020 at 1:05 AM Tomas Vondra
 wrote:
> On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:
> >On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> >> The revised patch seems to fix all of above.
> >
> >The latest patch is failing to apply.  Please provide a rebase.
>
> This still does not apply (per cputube). Can you provide a fixed patch?

Rebased patch is attached.  Sorry for so huge delay.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-4.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2020-01-16 Thread Tomas Vondra

On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:

On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:

The revised patch seems to fix all of above.


The latest patch is failing to apply.  Please provide a rebase.


This still does not apply (per cputube). Can you provide a fixed patch?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Improve search for missing parent downlinks in amcheck

2019-11-28 Thread Michael Paquier
On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> The revised patch seems to fix all of above.

The latest patch is failing to apply.  Please provide a rebase.
--
Michael


signature.asc
Description: PGP signature


Re: Improve search for missing parent downlinks in amcheck

2019-08-18 Thread Alexander Korotkov
On Tue, Aug 13, 2019 at 11:44 PM Peter Geoghegan  wrote:
> > In this revision check for missing downlinks is combined with
> > bt_downlink_check().  So, pages visited by bt_downlink_check() patch
> > doesn't cause extra accessed.  It only causes following additional
> > page accesses:
> > 1) Downlinks corresponding to "negative infinity" keys,
> > 2) Pages of child level, which are not referenced by downlinks.
> >
> > But I think these two kinds are very minority, and those accesses
> > could be trade off with more precise missing downlink check without
> > bloom filter.  What do you think?
>
> I am generally in favor of making the downlink check absolutely
> reliable, and am not too worried about the modest additional overhead.
> After all, bt_index_parent_check() is supposed to be thorough though
> expensive. The only reason that I used a Bloom filter for
> fingerprinting downlink blocks was that it seemed important to quickly
> get amcheck coverage for subtle multi-level page deletion bugs just
> after v11 feature freeze. We can now come up with a better design for
> that.

Great!

> I was confused about how this patch worked at first. But then I
> remembered that Lehman and Yao consider downlinks to be distinct
> things to separator keys in internal pages. The high key of an
> internal page in the final separator key, so you have n downlinks and
> n + 1 separator keys per internal page -- two distinct things that
> appear in alternating order (the negative infinity item is not
> considered to have any separator key here). So, while internal page
> items are explicitly "(downlink, separator)" pairs that are grouped
> into a single tuple in nbtree, with a separate tuple just for the high
> key, Lehman and Yao would find it just as natural to treat them as
> "(separator, downlink)" pairs. You have to skip the first downlink on
> each internal level to make that work, but this makes our
> bt_downlink_check() check have something from the target page (child's
> parent page) that is like the high key in the child.
>
> It's already really confusing that we don't quite mean the same thing
> as Lehman and Yao when we say downlink (See also my long "why a
> highkey is never truly a copy of another item" comment block within
> bt_target_page_check()), and that is not your patch's fault. But maybe
> we need to fix that to make your patch easier to understand. (i.e.
> maybe we need to go over every use of the word "downlink" in nbtree,
> and make it say something more precise, to make everything less
> confusing.)

I agree that current terms nbtree use to describe downlinks and
separator keys may be confusing.  I'll try to fix this and come up
with patch if succeed.

> Other feedback:
>
> * Did you miss the opportunity to verify that every high key matches
> its right sibling page's downlink tuple in parent page? We talked
> about this already, but you don't seem to match the key (you only
> match the downlink block).
>
> * You are decoupling the new check from the bt_index_parent_check()
> "heapallindexed" option. That's probably a good thing, but you need to
> update the sgml docs.
>
> * Didn't you forget to remove the BtreeCheckState.rightsplit flag?
>
> * You've significantly changed the behavior of bt_downlink_check() --
> I would note this in its header comments. This is where ~99% of the
> new work happens.
>
> * I don't like that you use the loaded term "target" here -- anything
> called "target" should be BtreeCheckState.target:
>
> >  static void
> > -bt_downlink_missing_check(BtreeCheckState *state)
> > +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
> > + BlockNumber targetblock, Page target)
> >  {
>
> If it's unclear what I mean, this old comment should make it clearer:
>
> /*
>  * State associated with verifying a B-Tree index
>  *
>  * target is the point of reference for a verification operation.
>  *
>  * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
>  * they are current target's child pages).  Conceptually, problems are only
>  * ever found in the current target page (or for a particular heap tuple 
> during
>  * heapallindexed verification).  Each page found by verification's 
> left/right,
>  * top/bottom scan becomes the target exactly once.
>  */

The revised patch seems to fix all of above.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-3.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2019-08-13 Thread Peter Geoghegan
On Mon, Aug 12, 2019 at 12:01 PM Alexander Korotkov
 wrote:
> BTW, there is next revision of patch I'm proposing for v13.

Cool.

> In this revision check for missing downlinks is combined with
> bt_downlink_check().  So, pages visited by bt_downlink_check() patch
> doesn't cause extra accessed.  It only causes following additional
> page accesses:
> 1) Downlinks corresponding to "negative infinity" keys,
> 2) Pages of child level, which are not referenced by downlinks.
>
> But I think these two kinds are very minority, and those accesses
> could be trade off with more precise missing downlink check without
> bloom filter.  What do you think?

I am generally in favor of making the downlink check absolutely
reliable, and am not too worried about the modest additional overhead.
After all, bt_index_parent_check() is supposed to be thorough though
expensive. The only reason that I used a Bloom filter for
fingerprinting downlink blocks was that it seemed important to quickly
get amcheck coverage for subtle multi-level page deletion bugs just
after v11 feature freeze. We can now come up with a better design for
that.

I was confused about how this patch worked at first. But then I
remembered that Lehman and Yao consider downlinks to be distinct
things to separator keys in internal pages. The high key of an
internal page in the final separator key, so you have n downlinks and
n + 1 separator keys per internal page -- two distinct things that
appear in alternating order (the negative infinity item is not
considered to have any separator key here). So, while internal page
items are explicitly "(downlink, separator)" pairs that are grouped
into a single tuple in nbtree, with a separate tuple just for the high
key, Lehman and Yao would find it just as natural to treat them as
"(separator, downlink)" pairs. You have to skip the first downlink on
each internal level to make that work, but this makes our
bt_downlink_check() check have something from the target page (child's
parent page) that is like the high key in the child.

It's already really confusing that we don't quite mean the same thing
as Lehman and Yao when we say downlink (See also my long "why a
highkey is never truly a copy of another item" comment block within
bt_target_page_check()), and that is not your patch's fault. But maybe
we need to fix that to make your patch easier to understand. (i.e.
maybe we need to go over every use of the word "downlink" in nbtree,
and make it say something more precise, to make everything less
confusing.)

Other feedback:

* Did you miss the opportunity to verify that every high key matches
its right sibling page's downlink tuple in parent page? We talked
about this already, but you don't seem to match the key (you only
match the downlink block).

* You are decoupling the new check from the bt_index_parent_check()
"heapallindexed" option. That's probably a good thing, but you need to
update the sgml docs.

* Didn't you forget to remove the BtreeCheckState.rightsplit flag?

* You've significantly changed the behavior of bt_downlink_check() --
I would note this in its header comments. This is where ~99% of the
new work happens.

* I don't like that you use the loaded term "target" here -- anything
called "target" should be BtreeCheckState.target:

>  static void
> -bt_downlink_missing_check(BtreeCheckState *state)
> +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
> + BlockNumber targetblock, Page target)
>  {

If it's unclear what I mean, this old comment should make it clearer:

/*
 * State associated with verifying a B-Tree index
 *
 * target is the point of reference for a verification operation.
 *
 * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
 * they are current target's child pages).  Conceptually, problems are only
 * ever found in the current target page (or for a particular heap tuple during
 * heapallindexed verification).  Each page found by verification's left/right,
 * top/bottom scan becomes the target exactly once.
 */

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-08-12 Thread Alexander Korotkov
On Fri, Jul 19, 2019 at 3:21 AM Peter Geoghegan  wrote:
> On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan  wrote:
> > I will think about a simple fix, but after the upcoming point release.
> > There is no hurry.
>
> Attached draft patch uses RelationGetNumberOfBlocks() to size each of
> the two Bloom filters that may be used by amcheck to perform
> verification.
>
> The basic heapallindexed Bloom filter is now sized based on the
> conservative assumption that there must be *at least*
> "RelationGetNumberOfBlocks()  * 50" elements to fingerprint (reltuples
> will continue to be used to size the basic heapallindexed Bloom filter
> in most cases, though). The patch also uses the same
> RelationGetNumberOfBlocks() value to size the downlink Bloom filter.
> This second change will fix your problem very non-invasively.
>
> I intend to backpatch this to v11 in the next few days.

Thank you for backpatching this!

BTW, there is next revision of patch I'm proposing for v13.

In this revision check for missing downlinks is combined with
bt_downlink_check().  So, pages visited by bt_downlink_check() patch
doesn't cause extra accessed.  It only causes following additional
page accesses:
1) Downlinks corresponding to "negative infinity" keys,
2) Pages of child level, which are not referenced by downlinks.

But I think these two kinds are very minority, and those accesses
could be trade off with more precise missing downlink check without
bloom filter.  What do you think?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-2.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2019-07-18 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan  wrote:
> I will think about a simple fix, but after the upcoming point release.
> There is no hurry.

Attached draft patch uses RelationGetNumberOfBlocks() to size each of
the two Bloom filters that may be used by amcheck to perform
verification.

The basic heapallindexed Bloom filter is now sized based on the
conservative assumption that there must be *at least*
"RelationGetNumberOfBlocks()  * 50" elements to fingerprint (reltuples
will continue to be used to size the basic heapallindexed Bloom filter
in most cases, though). The patch also uses the same
RelationGetNumberOfBlocks() value to size the downlink Bloom filter.
This second change will fix your problem very non-invasively.

I intend to backpatch this to v11 in the next few days.

-- 
Peter Geoghegan


0001-Set-minimum-amcheck-Bloom-filter-size.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2019-07-08 Thread Peter Geoghegan
On Sun, Jul 7, 2019 at 7:53 PM Thomas Munro  wrote:
> On Wed, May 1, 2019 at 12:58 PM Peter Geoghegan  wrote:
> > I will think about a simple fix, but after the upcoming point release.
> > There is no hurry.
>
> A bureaucratic question:  What should the status be for this CF entry?

I have plans to use RelationGetNumberOfBlocks() to size amcheck's
downlink Bloom filter. I think that that will solve the problems with
unreliable estimates of index size. which seemed to be the basis of
Alexander's complaint.

--
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-07-07 Thread Thomas Munro
On Wed, May 1, 2019 at 12:58 PM Peter Geoghegan  wrote:
> On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov
>  wrote:
> > I think this definitely not bug fix.  Bloom filter was designed to be
> > lossy, no way blaming it for that :)
>
> I will think about a simple fix, but after the upcoming point release.
> There is no hurry.

A bureaucratic question:  What should the status be for this CF entry?

-- 
Thomas Munro
https://enterprisedb.com




Re: Improve search for missing parent downlinks in amcheck

2019-04-30 Thread Peter Geoghegan
On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov
 wrote:
> I think this definitely not bug fix.  Bloom filter was designed to be
> lossy, no way blaming it for that :)

I will think about a simple fix, but after the upcoming point release.
There is no hurry.


-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-28 Thread Alexander Korotkov
On Sun, Apr 28, 2019 at 4:36 AM Peter Geoghegan  wrote:
> On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
>  wrote:
> > Yes, increasing of Bloom filter size also helps.  But my intention was
> > to make non-lossy check here.
>
> Why is that your intention? Do you want to do this as a feature for
> Postgres 13, or do you want to treat this as a bug that we need to
> backpatch a fix for?

I think this definitely not bug fix.  Bloom filter was designed to be
lossy, no way blaming it for that :)

> Can we avoid the problem you saw with the Bloom filter approach by
> using the real size of the index (i.e.
> smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter,
> and/or by rethinking the work_mem cap? Maybe we can have a WARNING
> that advertises that work_mem is probably too low?
>
> The  state->downlinkfilter Bloom filter should be small in almost all
> cases, so I still don't fully understand your concern. With a 100GB
> index, we'll have ~13 million blocks. We only need a Bloom filter that
> is ~250MB to have less than a 1% chance of missing inconsistencies
> even with such a large index. I admit that its unfriendly that users
> are not warned about the shortage currently, but that is something we
> can probably find a simple (backpatchable) fix for.

Sounds reasonable.  I'll think about proposing backpatch of something like this.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Peter Geoghegan
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
 wrote:
> Yes, increasing of Bloom filter size also helps.  But my intention was
> to make non-lossy check here.

Why is that your intention? Do you want to do this as a feature for
Postgres 13, or do you want to treat this as a bug that we need to
backpatch a fix for?

Can we avoid the problem you saw with the Bloom filter approach by
using the real size of the index (i.e.
smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter,
and/or by rethinking the work_mem cap? Maybe we can have a WARNING
that advertises that work_mem is probably too low?

The  state->downlinkfilter Bloom filter should be small in almost all
cases, so I still don't fully understand your concern. With a 100GB
index, we'll have ~13 million blocks. We only need a Bloom filter that
is ~250MB to have less than a 1% chance of missing inconsistencies
even with such a large index. I admit that its unfriendly that users
are not warned about the shortage currently, but that is something we
can probably find a simple (backpatchable) fix for.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Peter Geoghegan
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
 wrote:
> Yes, increasing of Bloom filter size also helps.  But my intention was
> to make non-lossy check here.

I agree that that might be a good goal, but I am interested in knowing
if there is something naive about how the downlinkfilter Bloom filter
is sized. I think that we could probably do better than this without
much work:

/*
 * Extra readonly downlink check.
 *
 * In readonly case, we know that there cannot be a concurrent
 * page split or a concurrent page deletion, which gives us the
 * opportunity to verify that every non-ignorable page had a
 * downlink one level up.  We must be tolerant of interrupted page
 * splits and page deletions, though.  This is taken care of in
 * bt_downlink_missing_check().
 */
total_pages = (int64) state->rel->rd_rel->relpages;
state->downlinkfilter = bloom_create(total_pages, work_mem, seed);

Maybe we could use "smgrnblocks(index->rd_smgr, MAIN_FORKNUM))"
instead of relpages, for example.

> It wouldn't be really wasteful, because the same children were
> accessed recently.  So, they are likely not yet evicted from
> shared_buffers.  I think we can fit both checks into one loop over
> downlinks instead of two.

I see your point, but if we're going to treat this as a bug then I
would prefer a simple fix.

> Yes, we can do more checks with bloom filter.  But assuming that we
> anyway iterate over children for each non-leaf page, can we do:
> 1) All the checks, which bt_downlink_check() does for now
> 2) Check there are no missing downlinks
> 3) Check hikeys
> in one pass, can't we?

We can expect every high key in a page to have a copy contained within
its parent, either as one of the keys, or as parent's own high key
(assuming no concurrent or interrupted page splits or page deletions).
This is true today, even though we truncate negative infinity items in
internal pages.

I think that the simple answer to your question is yes. It would be
more complicated that way, and the only extra check would be the check
of high keys against their parent, but overall this does seem
possible.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Alexander Korotkov
On Tue, Apr 16, 2019 at 10:00 PM Peter Geoghegan  wrote:
>
> On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
>  wrote:
> > Currently we amcheck supports lossy checking for missing parent
> > downlinks.  It collects bitmap of downlink hashes and use it to check
> > subsequent tree level.  We've experienced some large corrupted indexes
> > which pass this check due to its looseness.
>
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

AFAIR, the cause of corruption in this case was our (Postgres Pro)
modification.  Not something really present in PostgreSQL core.

>
> Was the issue that the Bloom filter was simply undersized/ineffective?

Yes, increasing of Bloom filter size also helps.  But my intention was
to make non-lossy check here.

>
> > However, it seems to me we can implement this check in non-lossy
> > manner without making it significantly slower.  We anyway traverse
> > downlinks from parent to children in order to verify that hikeys are
> > corresponding to downlink keys.
>
> Actually, we don't check the high keys in children against the parent
> (all other items are checked, though). We probably *should* do
> something with the high key when verifying consistency across levels,
> but currently we don't. (We only use the high key for the same-page
> high key check -- more on this below.)

Nice idea.

> > We can also traverse from one
> > downlink to subsequent using rightlinks.  So, if there are some
> > intermediate pages between them, they are candidates to have missing
> > parent downlinks.  The patch is attached.
> >
> > With this patch amcheck could successfully detect corruption for our
> > customer, which unpatched amcheck couldn't find.
>
> Maybe we can be a lot less conservative about sizing the Bloom filter
> instead? That would be an easier fix IMV -- we can probably change
> that logic to be a lot more aggressive without anybody noticing, since
> the Bloom filter is already usually tiny. We are already not very
> careful about saving work within bt_index_parent_check(), but with
> this patch we follow each downlink twice instead of once. That seems
> wasteful.

It wouldn't be really wasteful, because the same children were
accessed recently.  So, they are likely not yet evicted from
shared_buffers.  I think we can fit both checks into one loop over
downlinks instead of two.

> The reason I used a Bloom filter here is because I would eventually
> like the missing downlink check to fingerprint entire tuples, not just
> block numbers. In L terms, the check could in the future fingerprint
> the separator key and the downlink at the same time, not just the
> downlink. That way, we could probe the Bloom filter on the next level
> down for its high key (with the right sibling pointer set to be
> consistent with the parent) iff we don't see that the page split was
> interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously
> this would be a more effective form of verification, since we would
> notice high key values that don't agree with the parent's values for
> the same sibling/cousin/child block.
>
> I didn't do it that way for v11 because of "minus infinity" items on
> internal pages, which don't store the original key (the key remains
> the high key of the left sibling page, but we truncate the original to
> 0 attributes in _bt_pgaddtup()). I think that we should eventually
> stop truncating minus infinity items, and actually store a "low key"
> at P_FIRSTDATAKEY() within internal pages instead. That would be
> useful for other things anyway (e.g. prefix compression).

Yes, we can do more checks with bloom filter.  But assuming that we
anyway iterate over children for each non-leaf page, can we do:
1) All the checks, which bt_downlink_check() does for now
2) Check there are no missing downlinks
3) Check hikeys
in one pass, can't we?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Peter Geoghegan
On Sat, Apr 27, 2019 at 4:57 PM Alexander Korotkov
 wrote:
> "rootdescend" is cool type of check.  Thank you for noticing, I wasn't aware 
> of it.
> But can it detect the missing downlink in following situation?
>
> A
>  / \
>   B <-> C <-> D
>
> Here A has downlinks to B and D, which downlink to C is missing,
> while B, C and D are correctly connected with leftlinks and rightlinks.
> I can see "rootdescend" calls _bt_search(), which would just step
> right from C to D as if it was concurrent split.

There is a comment about this scenario above bt_rootdescend() in amcheck.

You're right -- this is a kind of corruption that even the new
rootdescend verification option would miss. We can imagine a version
of rootdescend verification that tells the core code to only move
right when there was an *interrupted* page split (i.e.
P_INCOMPLETE_SPLIT() flag bit is set), but that isn't what happens
right now.

That said, the lossy downlink check that you want to improve on
*should* already catch this situation. Of course it might not because
it is lossy (uses a Bloom filter), but I think that that's very
unlikely. That's why I would like to understand the problem that you
found with the check.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Alexander Korotkov
On Tue, Apr 16, 2019 at 10:04 PM Peter Geoghegan  wrote:
>
> On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan  wrote:
> > Can you be more specific? What was the cause of the corruption? I'm
> > always very interested in hearing about cases that amcheck could have
> > detected, but didn't.
>
> FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
> verification option, which isn't lossy, and would certainly have
> detected your customer issue in practice. Admittedly the new check is
> quite expensive, even compared to the other bt_index_parent_check()
> checks, but it is nice that we now have a verification option that is
> *extremely* thorough, and uses _bt_search() directly.

"rootdescend" is cool type of check.  Thank you for noticing, I wasn't
aware of it.
But can it detect the missing downlink in following situation?

A
 / \
  B <-> C <-> D

Here A has downlinks to B and D, which downlink to C is missing,
while B, C and D are correctly connected with leftlinks and rightlinks.
I can see "rootdescend" calls _bt_search(), which would just step
right from C to D as if it was concurrent split.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Improve search for missing parent downlinks in amcheck

2019-04-25 Thread Peter Geoghegan
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan  wrote:
> On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
>  wrote:
> > Currently we amcheck supports lossy checking for missing parent
> > downlinks.  It collects bitmap of downlink hashes and use it to check
> > subsequent tree level.  We've experienced some large corrupted indexes
> > which pass this check due to its looseness.
>
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

Ping?

I am interested in doing better here. The background information seems
very interesting.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-16 Thread Peter Geoghegan
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan  wrote:
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
verification option, which isn't lossy, and would certainly have
detected your customer issue in practice. Admittedly the new check is
quite expensive, even compared to the other bt_index_parent_check()
checks, but it is nice that we now have a verification option that is
*extremely* thorough, and uses _bt_search() directly.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-16 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
 wrote:
> Currently we amcheck supports lossy checking for missing parent
> downlinks.  It collects bitmap of downlink hashes and use it to check
> subsequent tree level.  We've experienced some large corrupted indexes
> which pass this check due to its looseness.

Can you be more specific? What was the cause of the corruption? I'm
always very interested in hearing about cases that amcheck could have
detected, but didn't.

Was the issue that the Bloom filter was simply undersized/ineffective?

> However, it seems to me we can implement this check in non-lossy
> manner without making it significantly slower.  We anyway traverse
> downlinks from parent to children in order to verify that hikeys are
> corresponding to downlink keys.

Actually, we don't check the high keys in children against the parent
(all other items are checked, though). We probably *should* do
something with the high key when verifying consistency across levels,
but currently we don't. (We only use the high key for the same-page
high key check -- more on this below.)

> We can also traverse from one
> downlink to subsequent using rightlinks.  So, if there are some
> intermediate pages between them, they are candidates to have missing
> parent downlinks.  The patch is attached.
>
> With this patch amcheck could successfully detect corruption for our
> customer, which unpatched amcheck couldn't find.

Maybe we can be a lot less conservative about sizing the Bloom filter
instead? That would be an easier fix IMV -- we can probably change
that logic to be a lot more aggressive without anybody noticing, since
the Bloom filter is already usually tiny. We are already not very
careful about saving work within bt_index_parent_check(), but with
this patch we follow each downlink twice instead of once. That seems
wasteful.

The reason I used a Bloom filter here is because I would eventually
like the missing downlink check to fingerprint entire tuples, not just
block numbers. In L terms, the check could in the future fingerprint
the separator key and the downlink at the same time, not just the
downlink. That way, we could probe the Bloom filter on the next level
down for its high key (with the right sibling pointer set to be
consistent with the parent) iff we don't see that the page split was
interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously
this would be a more effective form of verification, since we would
notice high key values that don't agree with the parent's values for
the same sibling/cousin/child block.

I didn't do it that way for v11 because of "minus infinity" items on
internal pages, which don't store the original key (the key remains
the high key of the left sibling page, but we truncate the original to
0 attributes in _bt_pgaddtup()). I think that we should eventually
stop truncating minus infinity items, and actually store a "low key"
at P_FIRSTDATAKEY() within internal pages instead. That would be
useful for other things anyway (e.g. prefix compression).

--
Peter Geoghegan