Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-04-01 Thread Peter Geoghegan
On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan  wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan  wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initdb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding
work on integrating the patch series.

I have some closing thoughts that I would like to share to close out
the project. I was casually discussing this project over IM with Robert
the other day. He asked a question I'd often asked myself about the
"split after new item" heuristics: What if you're wrong? What if some
"black swan" type workload fools your heuristics into bloating an
index uncontrollably?

I gave an answer to his question that may have seemed kind of
inscrutable. My intuition about the worst case for the heuristics is
based on its similarity to the worst case for quicksort. Any
real-world instance of quicksort going quadratic is essentially a case
where we *consistently* do the wrong thing when selecting a pivot. A
random pivot selection will still perform reasonably well, because
we'll still choose the median pivot on average. A malicious actor will
always be able to fool any quicksort implementation into going
quadratic [1] in certain circumstances. We're defending against
Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though.
Attached patch removes every heuristic that limits the application of
the "split after new item" optimization (it doesn't force the
optimization in the case of rightmost splits, or in the case where the
new item happens to be first on the page, since the caller isn't prepared
for that). This is an attempt to come up with a wildly exaggerated
worst case. Nevertheless, the consequences are not actually all that
bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the
patch has a final index that's about 1% larger. However, it was about
16% smaller compared to Postgres without the patch, so this is not a
problem.

* Most of the TPC-C indexes are actually slightly smaller, because we
didn't quite go as far as we could have (TPC-C strongly rewards this
optimization). 8 out of the 10 indexes are either smaller or
unchanged. The customer name index is about 28% larger, though. The
oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new item" optimization,
and yet the picture isn't so bad here either. The holding history PK
is about 40% bigger, which is quite bad, and the biggest regression
overall. However, in other affected cases indexes are about 15%
larger, which is not that bad.

Also attached are the regressions from my test suite in the form of
diff files -- these are the full details of the regressions, just in
case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total
certainty that there can never be a "black swan" workload that the
heuristics consistently mishandle, leading to pathological
performance. However, I think it's fair to say that the risk of that
happening has been managed well. The attached test patch literally
removes any restraint on applying the optimization, and yet we
arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help,
especially Heikki.

[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf
-- 
Peter Geoghegan


always-split-after-new-item.patch
Description: Binary data


land_balanced.diff
Description: Binary data


tpcc_balanced.diff
Description: Binary data


tpce_balanced.diff
Description: Binary data


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-22 Thread Peter Geoghegan
On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan  wrote:
> I've committed the first 4 patches. Many thanks to Heikki for his very
> valuable help! Thanks also to the other reviewers.
>
> I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and
without the "split after new tuple" optimization patch, the largest
system catalog indexes shrink quite noticeably:

Master
==
pg_depend_depender_index 1456 kB
pg_depend_reference_index 1416 kB
pg_class_tblspc_relfilenode_index 224 kB

Patch
=
pg_depend_depender_index 1088 kB   -- ~25% smaller
pg_depend_reference_index 1136 kB   -- ~20% smaller
pg_class_tblspc_relfilenode_index 160 kB -- 28% smaller
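
For reference, index sizes like the ones above can be pulled with a
query along these lines -- just a sketch, not necessarily the exact
query that was used:

SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
WHERE c.relname IN ('pg_depend_depender_index',
                    'pg_depend_reference_index',
                    'pg_class_tblspc_relfilenode_index')
ORDER BY pg_relation_size(c.oid) DESC;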

This is interesting to me because it is further evidence that the
problem that the patch targets is reasonably common. It's also
interesting to me because we benefit despite the fact that there are a lot
of duplicates in parts of these indexes; we vary our strategy across
different parts of the key space, which works well. We pack pages
tightly where they're full of duplicates, using the "single value"
strategy that I've already committed, while we apply the "split
after new tuple" optimization in parts of the index with localized
monotonically increasing insertions. If there were no duplicates in
the indexes, then they'd be about 40% smaller, which is exactly what
we see with the TPC-C indexes (they're all unique indexes, with very
few physical duplicates). Looks like the duplicates are mostly
bootstrap mode entries. Lots of the pg_depend_depender_index
duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for
example.
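
A sketch of one way to see those duplicate groupings (not necessarily
the query that was actually used here):

SELECT classid, objid, objsubid, count(*)
FROM pg_depend
GROUP BY classid, objid, objsubid
ORDER BY count(*) DESC
LIMIT 5;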

I also noticed one further difference: the pg_shdepend_depender_index
index grew from 40 kB to 48 kB. I guess that might count as a
regression, though I'm not sure that it should. I think that we would
do better if the volume of data in the underlying table was greater.
contrib/pageinspect shows that a small number of the leaf pages in the
improved cases are not very full at all, which is more than made up
for by the fact that many more pages are packed as if they were
created by a rightmost split (262 items of 24-byte tuples is exactly
consistent with that). IOW, I suspect that the extra page in
pg_shdepend_depender_index is due to a "local minimum".
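
A sketch of the kind of pageinspect query that shows per-page fill for
an index like this (block 0 is the metapage, so the scan starts at
block 1; the exact query used isn't shown here):

CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT s.blkno, s.live_items, s.avg_item_size, s.free_size
FROM generate_series(1, pg_relation_size('pg_shdepend_depender_index') / 8192 - 1) AS blkno,
     LATERAL bt_page_stats('pg_shdepend_depender_index', blkno::int) AS s
WHERE s.type = 'l'
ORDER BY s.blkno;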

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-21 Thread Peter Geoghegan
On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan  wrote:
> Heikki and I discussed this issue privately, over IM, and reached
> final agreement on remaining loose ends. I'm going to use his code for
> _bt_findsplitloc(). Plan to push a final version of the first four
> patches tomorrow morning PST.

I've committed the first 4 patches. Many thanks to Heikki for his very
valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-19 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 10:17 AM Peter Geoghegan  wrote:
> The big difference is that you make the possible call to
> _bt_stepright() conditional on this being a checkingunique index --
> the duplicate code is indented in that branch of _bt_findsplitloc().
> Whereas I break early in the loop when "checkingunique &&
> heapkeyspace".

Heikki and I discussed this issue privately, over IM, and reached
final agreement on remaining loose ends. I'm going to use his code for
_bt_findsplitloc(). Plan to push a final version of the first four
patches tomorrow morning PST.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan  wrote:
> Smarter choices on page splits pay off with higher client counts
> because they reduce contention at likely hot points. It's kind of
> crazy that the code in _bt_check_unique() sometimes has to move right,
> while holding an exclusive buffer lock on the original page and a
> shared buffer lock on its sibling page at the same time. It then has
> to hold a third buffer lock concurrently, this time on any heap pages
> it is interested in.

Actually, by the time we get to 16 clients, this workload does make
the indexes and tables smaller. Here is pg_buffercache output
collected after the first 16 client case:
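
(The output below is the kind of thing you get from a query along these
lines against pg_buffercache -- a sketch, since the exact query isn't
included here:)

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       b.relforknumber,
       pg_relation_size(c.oid, 'main') / 8192 AS size_main_rel_fork_blocks,
       count(*) AS buffer_count,
       avg(b.usagecount) AS avg_buffer_usg
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
  AND b.relforknumber = 0
  AND c.relname LIKE 'pgbench%'
GROUP BY c.relname, b.relforknumber
ORDER BY buffer_count DESC;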

Master
==

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.0000000000000000
 pgbench_branches      │             0 │                       284 │          284 │ 5.0000000000000000
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.0000000000000000


Patch
=

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.0000000000000000


The main fork for pgbench_history is larger with the patch, obviously,
but that's good. pgbench_accounts_pkey is about 4% smaller, which is
probably the most interesting observation that can be made here, but
the tables are also smaller. pgbench_accounts itself is ~2% smaller.
pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller.
Of course, the smaller tables were already very small, so maybe that
isn't important. I think that this is due to more effective pruning,
possibly because we get better lock arbitration as a consequence of
better splits, and also because duplicates are in heap TID order. I
haven't observed this effect with larger databases, which have been my
focus.

It isn't weird that shared_buffers doesn't have all the
pgbench_accounts blocks, since, of course, this is highly skewed by
design -- most blocks were never accessed from the table.

This effect seems to be robust, at least with this workload. The
second round of benchmarks (which have their own pgbench -i
initialization) show very similar amounts of bloat at the same point.
It may not be that significant, but it's also not a fluke.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 5:00 PM Robert Haas  wrote:
> Blech.  I think the patch has enough other advantages that it's worth
> accepting that, but it's not great.  We seem to keep finding reasons
> to reduce single client performance in the name of scalability, which
> is often reasonable but not wonderful.

The good news is that the quicksort that we now perform in
nbtsplitloc.c is not optimized at all. Heikki thought it premature to
optimize that, for example by inlining/specializing the quicksort. I
can make that 3x faster fairly easily, which could well change the
picture here. The code will be uglier that way, but not much more
complicated. I even prototyped this, and managed to make the serial
microbenchmarks I've used noticeably faster. This could very well fix
the problem here. The quicksort clearly showed up in perf profiles with
serial bulk loads.

> > However, this isn't completely
> > free (particularly the page split stuff), and it doesn't pay for
> > itself until the number of clients ramps up.
>
> I don't really understand that explanation.  It makes sense that more
> intelligent page split decisions could require more CPU cycles, but it
> is not evident to me why more clients would help better page split
> decisions pay off.

Smarter choices on page splits pay off with higher client counts
because they reduce contention at likely hot points. It's kind of
crazy that the code in _bt_check_unique() sometimes has to move right,
while holding an exclusive buffer lock on the original page and a
shared buffer lock on its sibling page at the same time. It then has
to hold a third buffer lock concurrently, this time on any heap page
it is interested in, each in turn, to check whether it contains a
possible conflict. gcov shows that that never happens with the regression
tests once the patch is applied (you can at least get away with only
having one buffer lock on a leaf page at all times in practically all
cases).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan  wrote:
> With pgbench scale factor 20, here are results for patch and master
> with a Gaussian distribution on my 8 thread/4 core home server, with
> each run reported lasting 10 minutes, repeating twice for client
> counts 1, 2, 8, 16, and 64, patch and master branch:
>
> 1 client master:
> tps = 7203.983289 (including connections establishing)
> 1 client patch:
> tps = 7012.575167 (including connections establishing)
>
> 2 clients master:
> tps = 13434.043832 (including connections establishing)
> 2 clients patch:
> tps = 13105.620223 (including connections establishing)

Blech.  I think the patch has enough other advantages that it's worth
accepting that, but it's not great.  We seem to keep finding reasons
to reduce single client performance in the name of scalability, which
is often reasonable but not wonderful.

> However, this isn't completely
> free (particularly the page split stuff), and it doesn't pay for
> itself until the number of clients ramps up.

I don't really understand that explanation.  It makes sense that more
intelligent page split decisions could require more CPU cycles, but it
is not evident to me why more clients would help better page split
decisions pay off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly
the same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are
~40% smaller, in part because lots of people outside the community
find TPC-C interesting, and in part because this patch series is
focused on cases where we currently do unusually badly (cases where
good intuitions about how B-Trees are supposed to perform break down
[1]). These pinpointable problems must affect a lot of users some of
the time, but certainly not all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages
to do that with pgbench, which doesn't do inserts into indexes, except
for those required for non-HOT updates. pgbench requires relatively
few page splits, but is in every other sense a high contention
workload.

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server, with
each run reported lasting 10 minutes, repeating twice for client
counts 1, 2, 8, 16, and 64, patch and master branch:

\set aid random_gaussian(1, 10 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass


(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms
1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms
8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms
16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms
64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass


(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms
1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms
2 clients patch:

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas  wrote:
> I'm getting a regression failure from the 'create_table' test with this:

> Are you seeing that?

Yes -- though the bug is in your revised v18, not the original v18,
which passed CFTester. Your revision fails on Travis/Linux, which is
pretty close to what I see locally, and much less subtle than the test
failures you mentioned:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665

"make check" did pass locally for me with your patch, but "make
check-world" (parallel recipe) did not.

The original v18 passed both CFTester tests about 15 hour ago:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402

I see the bug. You're not supposed to test this way with a heapkeyspace index:

> +   if (P_RIGHTMOST(lpageop) ||
> +   _bt_compare(rel, itup_key, page, P_HIKEY) != 0)
> +   break;

This is because the presence of scantid makes it almost certain that
you'll break when you shouldn't. You're doing it the old way, which is
inappropriate for a heapkeyspace index. Note that it would probably
take much longer to notice this bug if the "consider secondary
factors" patch was also applied, because then you would rarely have
cause to step right here (duplicates would never occupy more than a
single page in the regression tests). The test failures are probably
also timing sensitive, though they happen very reliably on my machine.

> Looking at the patches 1 and 2 again:
>
> I'm still not totally happy with the program flow and all the conditions
> in _bt_findsplitloc(). I have a hard time following which codepaths are
> taken when. I refactored that, so that there is a separate copy of the
> loop for V3 and V4 indexes.

The big difference is that you make the possible call to
_bt_stepright() conditional on this being a checkingunique index --
the duplicate code is indented in that branch of _bt_findsplitloc().
Whereas I break early in the loop when "checkingunique &&
heapkeyspace".

The flow of the original loop not only had less code; it also
contrasted the important differences between the heapkeyspace and
!heapkeyspace cases:

/* If this is the page that the tuple must go on, stop */
if (P_RIGHTMOST(lpageop))
    break;
cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
if (itup_key->heapkeyspace)
{
    if (cmpval <= 0)
        break;
}
else
{
    /*
     * pg_upgrade'd !heapkeyspace index.
     *
     * May have to handle legacy case where there is a choice of which
     * page to place new tuple on, and we must balance space
     * utilization as best we can.  Note that this may invalidate
     * cached bounds for us.
     */
    if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
        break;
}

I thought it was obvious that the "cmpval <= 0" code was different for
a reason. It now seems that this at least needs a comment.

I still believe that the best way to handle the !heapkeyspace case is
to make it similar to the heapkeyspace checkingunique case, regardless
of whether or not we're checkingunique. The fact that this bug slipped
in supports that view. Besides, the alternative that you suggest
treats !heapkeyspace indexes as if they were just as important to the
reader, which seems inappropriate (better to make the legacy case
follow the new case, not the other way around). I'm fine with the
comment tweaks that you made that are not related to
_bt_findsplitloc(), though.

I won't push the patches today, to give you the opportunity to
respond. I am not at all convinced right now, though.

--
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 2:01 PM Peter Geoghegan  wrote:
>
> On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan  wrote:
> > I agree that it's always true that the high key is also in the parent,
> > and we could cross-check that within the child. Actually, I should
> > probably figure out a way of arranging for the Bloom filter used
> > within bt_relocate_from_root() (which has been around since PG v11) to
> > include the key itself when fingerprinting. That would probably
> > necessitate that we don't truncate "negative infinity" items (it was
> > actually that way about 18 years ago), just for the benefit of
> > verification.
>
> Clarification: You'd fingerprint an entire pivot tuple -- key, block
> number, everything. Then, you'd probe the Bloom filter for the high
> key one level down, with the downlink block in the high key set to
> point to the current sibling on the same level (the child level). As I
> said, I think that the only reason that that cannot be done at present
> is because of the micro-optimization of truncating the first item on
> the right page to zero attributes during an internal page split. (We
> can retain the key without getting rid of the hard-coded logic for
> negative infinity within _bt_compare()).
>
> bt_relocate_from_root() already has smarts around interrupted page
> splits (with the incomplete split bit set).

Clarification to my clarification: I meant
bt_downlink_missing_check(), not bt_relocate_from_root(). The former
really has been around since v11, unlike the latter, which is part of
this new amcheck patch we're discussing.


-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan  wrote:
> I agree that it's always true that the high key is also in the parent,
> and we could cross-check that within the child. Actually, I should
> probably figure out a way of arranging for the Bloom filter used
> within bt_relocate_from_root() (which has been around since PG v11) to
> include the key itself when fingerprinting. That would probably
> necessitate that we don't truncate "negative infinity" items (it was
> actually that way about 18 years ago), just for the benefit of
> verification.

Clarification: You'd fingerprint an entire pivot tuple -- key, block
number, everything. Then, you'd probe the Bloom filter for the high
key one level down, with the downlink block in the high key set to
point to the current sibling on the same level (the child level). As I
said, I think that the only reason that that cannot be done at present
is because of the micro-optimization of truncating the first item on
the right page to zero attributes during an internal page split. (We
can retain the key without getting rid of the hard-coded logic for
negative infinity within _bt_compare()).

bt_relocate_from_root() already has smarts around interrupted page
splits (with the incomplete split bit set).

Finally, you'd also make bt_downlink_check follow every downlink, not
all-but-one downlink (no more excuse for leaving out the first one if
we don't truncate during internal page splits).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 1:33 PM Heikki Linnakangas  wrote:
> AFAICS, there is a copy of every page's high key in its immediate
> parent. Either in the downlink of the right sibling, or as the high key
> of the parent page itself. Cross-checking those would catch any
> corruption in high keys.

I agree that it's always true that the high key is also in the parent,
and we could cross-check that within the child. Actually, I should
probably figure out a way of arranging for the Bloom filter used
within bt_relocate_from_root() (which has been around since PG v11) to
include the key itself when fingerprinting. That would probably
necessitate that we don't truncate "negative infinity" items (it was
actually that way about 18 years ago), just for the benefit of
verification. This is almost the same thing as what Graefe argues for
(don't think that you need a low key on the leaf level, since you can
cross a single level there). I wonder if truncating the negative
infinity item in internal pages to zero attributes is actually worth
it, since a low key might be useful for a number of reasons.

> Note that this would potentially catch some corruption that the
> descend-from-root would not. If you have a mismatch between the high key
> of a leaf page and its parent or grandparent, all the current tuples
> might be pass the rootdescend check. But a tuple might get inserted to
> wrong location later.

I also agree with this. However, the rootdescend check will always
work better than this in some cases that you can at least imagine, for
as long as there are negative infinity items to worry about. (And,
even if we decided not to truncate to support easy verification, there
is still a good argument to be made for involving the authoritative
_bt_search() code at some point).

> > Maybe you could actually do something with the high key from leaf page
> > 5 to detect the stray value "20" in leaf page 6, but again, we don't
> > do anything like that right now.
>
> Hmm, yeah, to check for stray values, you could follow the left-link,
> get the high key of the left sibling, and compare against that.

Graefe argues for retaining a low key, even in leaf pages (the left
page's old high key becomes the left page's low key during a split,
and the left page's new high key becomes the new right pages low key
at the same time). It's useful for what he calls "write-optimized
B-Trees", and maybe even for optional compression. As I said earlier,
I guess you can just go left on the leaf level if you need to, and you
have all you need. But I'd need to think about it some more.

Point taken; rootdescend isn't enough to make verification exactly
perfect. But it brings verification close to perfect, because you're
going to get right answers to queries for as long as it passes (I
think). There could be a problem that only affects a future insertion,
which ideally you could also detect, but can't. You'd have to be
extraordinarily unlucky for that to matter for any amount of time,
though. Unlucky even by my own extremely paranoid standard.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 18:55, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the leaf tuple 19 to become non-re-findable; if you descend the tree for
19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't
the existing checks catch this?


No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
are a match, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".


AFAICS, there is a copy of every page's high key in its immediate 
parent. Either in the downlink of the right sibling, or as the high key 
of the parent page itself. Cross-checking those would catch any 
corruption in high keys.


Note that this would potentially catch some corruption that the 
descend-from-root would not. If you have a mismatch between the high key 
of a leaf page and its parent or grandparent, all the current tuples 
might still pass the rootdescend check. But a tuple might get inserted
into the wrong location later.



Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.


Hmm, yeah, to check for stray values, you could follow the left-link, 
get the high key of the left sibling, and compare against that.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 19:32, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan  wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.


WFM.


Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.


Works for me.

- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan  wrote:
> On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:
> > Hmm. "re-find", maybe? We use that term in a few error messages already,
> > to mean something similar.
>
> WFM.

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.
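
If we go with "rootdescend", usage might end up looking something like
this -- purely a sketch, since the final shape of the amcheck interface
is still an assumption at this point:

-- hypothetical: assumes the new check becomes an extra boolean
-- argument to bt_index_parent_check(), after heapallindexed
SELECT bt_index_parent_check('pgbench_accounts_pkey'::regclass,
                             true,   -- heapallindexed
                             true);  -- rootdescend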

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:
> Hmm. "re-find", maybe? We use that term in a few error messages already,
> to mean something similar.

WFM.

> At first, I thought this would be horrendously expensive, but thinking
> about it a bit more, nearby tuples will always follow the same path
> through the upper nodes, so it'll all be cached. So maybe it's not quite
> so bad.

That's deliberate, though you could call bt_relocate_from_root() from
anywhere else if you wanted to. It's a bit like a big nested loop
join, where the inner side has locality.

> Then, a cosmic ray changes the 20 on the root page to 18. That causes
> the leaf tuple 19 to become non-re-findable; if you descend the tree for
> 19, you'll incorrectly land on page 6. But it also causes the high key
> on page 2 to be different from the downlink on the root page. Wouldn't
> the existing checks catch this?

No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
are a match, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 10:51, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas  wrote:

It would be nice if you could take a look at the amcheck "relocate"
patch

When I started looking at this, I thought that "relocate" means "move".
So I thought that the new mode would actually move tuples, i.e. that it
would modify the index. That sounded horrible. Of course, it doesn't
actually do that. It just checks that each tuple can be re-found, or
"relocated", by descending the tree from the root. I'd suggest changing
the language to avoid that confusion.


Okay. What do you suggest? :-)


Hmm. "re-find", maybe? We use that term in a few error messages already, 
to mean something similar.



It seems like a nice way to catch all kinds of index corruption issues.
Although, we already check that the tuples are in order within the page.
Is it really necessary to traverse the tree for every tuple, as well?
Maybe do it just for the first and last item?


It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.


Fair enough.

At first, I thought this would be horrendously expensive, but thinking 
about it a bit more, nearby tuples will always follow the same path 
through the upper nodes, so it'll all be cached. So maybe it's not quite 
so bad.



I don't understand this. Can you give an example of this kind of
inconsistency?


The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).


Hmm. So, the initial situation would be something like this:

         +-----------+
         | 1: root   |
         |           |
         | -inf -> 2 |
         | 20   -> 3 |
         |           |
         +-----------+

+-------------+ +-------------+
| 2: internal | | 3: internal |
|             | |             |
| -inf -> 4   | | -inf -> 6   |
| 10   -> 5   | | 30   -> 7   |
|             | |             |
| hi: 20      | |             |
+-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes 
the leaf tuple 19 to become non-re-findable; if you descend the tree for
19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't 
the existing checks catch this?


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas  wrote:
> > It would be nice if you could take a look at the amcheck "relocate"
> > patch
> When I started looking at this, I thought that "relocate" means "move".
> So I thought that the new mode would actually move tuples, i.e. that it
> would modify the index. That sounded horrible. Of course, it doesn't
> actually do that. It just checks that each tuple can be re-found, or
> "relocated", by descending the tree from the root. I'd suggest changing
> the language to avoid that confusion.

Okay. What do you suggest? :-)

> It seems like a nice way to catch all kinds of index corruption issues.
> Although, we already check that the tuples are in order within the page.
> Is it really necessary to traverse the tree for every tuple, as well?
> Maybe do it just for the first and last item?

It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.

> I don't understand this. Can you give an example of this kind of
> inconsistency?

The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 06:16, Peter Geoghegan wrote:

On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas  wrote:

It doesn't matter how often it happens, the code still needs to deal
with it. So let's try to make it as readable as possible.



Well, IMHO holding the buffer and the bounds in the new struct is cleaner
than the savebinsrch/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any
benchmarking, but I doubt there's any measurable difference.


Fair enough. Attached is v17, which does it using the approach taken
in your earlier prototype. I even came around to your view on
_bt_binsrch_insert() -- I kept that part, too. Note, however, that I
still pass checkingunique to _bt_findinsertloc(), because that's a
distinct condition to whether or not bounds were cached (they happen
to be the same thing right now, but I don't want to assume that).

This revision also integrates your approach to the "continuescan"
optimization patch, with the small tweak I mentioned yesterday (we
also pass ntupatts). I also prefer this approach.


Great, thank you!


It would be nice if you could take a look at the amcheck "relocate"
patch
When I started looking at this, I thought that "relocate" means "move". 
So I thought that the new mode would actually move tuples, i.e. that it 
would modify the index. That sounded horrible. Of course, it doesn't 
actually do that. It just checks that each tuple can be re-found, or 
"relocated", by descending the tree from the root. I'd suggest changing 
the language to avoid that confusion.


It seems like a nice way to catch all kinds of index corruption issues. 
Although, we already check that the tuples are in order within the page. 
Is it really necessary to traverse the tree for every tuple, as well? 
Maybe do it just for the first and last item?



+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)


I don't understand this. Can you give an example of this kind of 
inconsistency?


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-14 Thread Peter Geoghegan
On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas  wrote:
> Oh yeah, that makes perfect sense. I wonder why we haven't done it like
> that before? The new page split logic makes it more likely to help, but
> even without that, I don't see any downside.

The only downside is that we spend a few extra cycles, and that might
not work out. This optimization would have always worked, though. The
new page split logic clearly makes checking the high key in the
"continuescan" path an easy win.

> I find it a bit confusing, that the logic is now split between
> _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage()
> does the high-key check, but the corresponding "first-key" check in a
> backward scan is done in _bt_checkkeys(). I'd suggest moving the logic
> completely to _bt_readpage(), so that it's in one place. With that,
> _bt_checkkeys() can always check the keys as it's told, without looking
> at the LP_DEAD flag. Like the attached.

I'm convinced. I'd like to go a bit further, and also pass tupnatts to
_bt_checkkeys().  That makes it closer to the similar
_bt_check_rowcompare() function that _bt_checkkeys() must sometimes
call. It also allows us to only call BTreeTupleGetNAtts() for the high
key, while passing down a generic, loop-invariant
IndexRelationGetNumberOfAttributes() value for non-pivot tuples.

I'll do it that way in the next revision.

Thanks
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-14 Thread Heikki Linnakangas

On 13/03/2019 03:28, Peter Geoghegan wrote:

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.


Oh yeah, that makes perfect sense. I wonder why we haven't done it like 
that before? The new page split logic makes it more likely to help, but 
even without that, I don't see any downside.


I find it a bit confusing, that the logic is now split between 
_bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() 
does the high-key check, but the corresponding "first-key" check in a 
backward scan is done in _bt_checkkeys(). I'd suggest moving the logic 
completely to _bt_readpage(), so that it's in one place. With that, 
_bt_checkkeys() can always check the keys as it's told, without looking 
at the LP_DEAD flag. Like the attached.


- Heikki
>From 4b5ea0f361e3feda93852bd084fb0d325e808e4c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH 1/1] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.

(This is Heikki's refactored version)
---
 src/backend/access/nbtree/nbtsearch.c |  86 ++---
 src/backend/access/nbtree/nbtutils.c  | 103 +++---
 src/include/access/nbtree.h   |   3 +-
 3 files changed, 122 insertions(+), 70 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b6..243be6f410d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1220,7 +1220,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			itemIndex;
-	IndexTuple	itup;
 	bool		continuescan;
 
 	/*
@@ -1241,6 +1240,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1282,23 +1282,58 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum <= maxoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				offnum = OffsetNumberNext(offnum);
+				continue;
+			}
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			if (_bt_checkkeys(scan, itup, dir, ))
 			{
 /* tuple passes all scan key conditions, so remember it 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-14 Thread Heikki Linnakangas

On 13/03/2019 03:28, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.



/*
   * Do the insertion. First move right to find the correct page to
   * insert to, if necessary. If we're inserting to a non-unique index,
   * _bt_search() already did this when it checked if a move to the
   * right was required for leaf page.  Insertion scankey's scantid
   * would have been filled out at the time. On a unique index, the
   * current buffer is the first buffer containing duplicates, however,
   * so we may need to move right to the correct location for this
   * tuple.
   */
if (checkingunique || itup_key->heapkeyspace)
 _bt_findinsertpage(rel, , stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, );



The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track whether the bounds are valid or not.


Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov).


It doesn't matter how often it happens, the code still needs to deal 
with it. So let's try to make it as readable as possible.



We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.


Well, IMHO holding the buffer and the bounds in the new struct is cleaner
than the savebinsrch/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any 
benchmarking, but I doubt there's any measurable difference.



I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.


Ok. I think having some code duplication is better than one function 
that tries to do many things, but I'm not wedded to that.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-13 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is that the index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.
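
(A sketch of how that kind of breakdown can be produced -- this only
splits total index size by whether the index serves a primary key, and
it is not the query from the blog post:)

SELECT i.indisprimary AS serves_primary_key,
       pg_size_pretty(sum(pg_relation_size(i.indexrelid))) AS total_index_size
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'pg_toast', 'information_schema')
GROUP BY i.indisprimary;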

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization --  a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites the examples of several
*extremely* bloated indexes on a single column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:
> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
> _bt_findinsertpage(rel, , stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, );

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov). We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.

I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.

The first 3 patches seem commitable now, but I think that it's
important to be sure that I've addressed everything you raised
satisfactorily before pushing. Or that everything works in a way that
you can live with, at least.

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.

Thanks
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 2:22 PM Andres Freund  wrote:
> I'm basically just curious which buffers have most of the additional
> contention. Is it the lower number of leaf pages, the inner pages, or
> (somewhat inexplicably) the meta page, or ...?  I was thinking that the
> callstack that e.g. my lwlock tool gives should be able to explain what
> callstack most of the waits are occurring on.

Right -- that's exactly what I'm interested in, too. If we can
characterize the contention in terms of the types of nbtree blocks
that are involved (their level), that could be really helpful. There
are 200x+ more leaf blocks than internal blocks, so the internal
blocks are a lot hotter. But, there is also a lot fewer splits of
internal pages, because you need hundreds of leaf page splits to get
one internal split.

Is the problem contention caused by internal page splits, or is it
contention in internal pages that can be traced back to leaf splits,
that insert a downlink into their parent page? It would be really
cool to have some idea of the answers to questions like these.
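
A much coarser complement to the bcc tooling, just to confirm that
buffer content LWLock waits dominate (it won't tell you which level of
the tree is involved), is to sample pg_stat_activity while the
benchmark runs:

-- run repeatedly; LWLock / buffer_content is the wait of interest here
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count(*) DESC;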

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Andres Freund
On 2019-03-12 14:15:06 -0700, Peter Geoghegan wrote:
> On Tue, Mar 12, 2019 at 12:40 PM Andres Freund  wrote:
> > Have you looked at an offwake or lwlock wait graph (bcc tools) or
> > something in that vein? Would be interesting to see what is waiting for
> > what most often...
> 
> Not recently, though I did use your BCC script for this very purpose
> quite a few months ago. I don't remember it helping that much at the
> time, but then that was with a version of the patch that lacked a
> couple of important optimizations that we have now. We're now very
> careful to not descend to the left with an equal pivot tuple. We
> descend right instead when that's definitely the only place we'll find
> matches (a high key doesn't count as a match in almost all cases!).
> Edge-cases where we unnecessarily move left then right, or
> unnecessarily move right a second time once on the leaf level have
> been fixed. I fixed the regression I was worried about at the time,
> without getting much benefit from the BCC script, and moved on.
> 
> This kind of minutiae is more important than it sounds. I have used
> EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
> understand where every single block access comes from with these
> edge-cases, paying close attention to the structure of the index, and
> how the key space is broken up (the values of pivot tuples in internal
> pages). It is one thing to make the index smaller, and another thing
> to take full advantage of that -- I have both. This is one of the
> reasons why I believe that this minor regression cannot be avoided,
> short of simply allowing the index to get bloated: I'm simply not
> doing things that differently outside of the page split code, and what
> I am doing differently is clearly superior. Both in general, and for
> the NEW_ORDER transaction in particular.
> 
> I'll make that another TODO item -- this regression will be revisited
> using BCC instrumentation. I am currently performing a multi-day
> benchmark on a very large TPC-C/BenchmarkSQL database, and it will
> have to wait for that. (I would like to use the same environment as
> before.)

I'm basically just curious which buffers have most of the additional
contention. Is it the lower number of leaf pages, the inner pages, or
(somewhat inexplicably) the meta page, or ...?  I was thinking that the
callstack that e.g. my lwlock tool gives should be able to explain what
callstack most of the waits are occurring on.

(I should work a bit on that script; I locally had a version that showed
both waiters and the waking-up callstack, but I can't find it anymore.)

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 12:40 PM Andres Freund  wrote:
> Have you looked at an offwake or lwlock wait graph (bcc tools) or
> something in that vein? Would be interesting to see what is waiting for
> what most often...

Not recently, though I did use your BCC script for this very purpose
quite a few months ago. I don't remember it helping that much at the
time, but then that was with a version of the patch that lacked a
couple of important optimizations that we have now. We're now very
careful to not descend to the left with an equal pivot tuple. We
descend right instead when that's definitely the only place we'll find
matches (a high key doesn't count as a match in almost all cases!).
Edge-cases where we unnecessarily move left then right, or
unnecessarily move right a second time once on the leaf level have
been fixed. I fixed the regression I was worried about at the time,
without getting much benefit from the BCC script, and moved on.

This kind of minutiae is more important than it sounds. I have used
EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
understand where every single block access comes from with these
edge-cases, paying close attention to the structure of the index, and
how the key space is broken up (the values of pivot tuples in internal
pages). It is one thing to make the index smaller, and another thing
to take full advantage of that -- I have both. This is one of the
reasons why I believe that this minor regression cannot be avoided,
short of simply allowing the index to get bloated: I'm simply not
doing things that differently outside of the page split code, and what
I am doing differently is clearly superior. Both in general, and for
the NEW_ORDER transaction in particular.

I'll make that another TODO item -- this regression will be revisited
using BCC instrumentation. I am currently performing a multi-day
benchmark on a very large TPC-C/BenchmarkSQL database, and it will
have to wait for that. (I would like to use the same environment as
before.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Andres Freund
Hi,

On 2019-03-11 19:47:29 -0700, Peter Geoghegan wrote:
> I now believe that the problem is with LWLock/buffer lock contention
> on index pages, and that that's an inherent cost with a minority of
> write-heavy high contention workloads. A cost that we should just
> accept.

Have you looked at an offwake or lwlock wait graph (bcc tools) or
something in that vein? Would be interesting to see what is waiting for
what most often...

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> Hey, I understood something today!

And I said something that could be understood!

> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I think so too. There is a feature in other database systems called
"reverse key indexes", which deals with this problem in a rather
extreme way. This situation is very similar to the situation with
rightmost page splits, where fillfactor is applied to pack leaf pages
full. The only difference is that there are multiple groupings, not
just one single implicit grouping (everything in the index). You could
probably make very similar observations about rightmost page splits
that apply leaf fillfactor.

Here is an example of how the largest index looks for master with the
100 warehouse case that was slightly regressed:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         54.000 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | I         |    11,482 |        143.200 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 1,621,316 |        139.458 |          0.003 |        24.000

Here is what we see with the patch:

    table_name    |      index_name       | page_type | npages  | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+---------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |       1 |         29.000 |          0.000 |        22.000
 bmsql_order_line | bmsql_order_line_pkey | I         |   5,957 |        159.149 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 936,170 |        233.496 |          0.052 |        23.999

REINDEX would leave bmsql_order_line_pkey with 262 items per leaf page,
and we see here that it averages 233 after several hours, which is
pretty good given the amount of contention. The index actually looks
very much like it was just REINDEXed when initial bulk loading
finishes, before we get any updates, even though that loading happens
using retail insertions.

Notice that the number of internal pages is reduced by almost a full
50% -- somewhat better than the reduction in the number of leaf pages,
because the benefits compound (items in the root are even a bit
smaller thanks to that compounding, despite alignment effects).
Internal pages are the most important pages to have cached, but also
potentially the biggest points of contention.
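
The same internal/leaf breakdown can be approximated with stock
pgstattuple, for anyone who wants to reproduce these numbers without
the custom page-type query used above:

-- requires CREATE EXTENSION pgstattuple
SELECT tree_level, internal_pages, leaf_pages,
       avg_leaf_density, leaf_fragmentation
FROM pgstatindex('bmsql_order_line_pkey');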

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Robert Haas
On Tue, Mar 12, 2019 at 2:34 PM Peter Geoghegan  wrote:
> On Tue, Mar 12, 2019 at 11:32 AM Robert Haas  wrote:
> > If I wanted to try to say this in fewer words, would it be fair to say
> > that reducing the size of an index by 40% without changing anything
> > else can increase contention on the remaining pages?
>
> Yes.

Hey, I understood something today!

I think it's pretty clear that we have to view that as acceptable.  I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.
Now, maybe in the future we'll want to work on other techniques for
reducing contention, but I don't think we should make that the problem
of this patch, especially because the regressions are small and go
away after a few hours of heavy use.  We should optimize for the case
where the user intends to keep the database around for years, not
hours.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:32 AM Robert Haas  wrote:
> If I wanted to try to say this in fewer words, would it be fair to say
> that reducing the size of an index by 40% without changing anything
> else can increase contention on the remaining pages?

Yes.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Robert Haas
On Mon, Mar 11, 2019 at 10:47 PM Peter Geoghegan  wrote:
>
> On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan  wrote:
> > The regression that I mentioned earlier isn't in pgbench type
> > workloads (even when the distribution is something more interesting
> > that the uniform distribution default). It is only in workloads with
> > lots of page splits and lots of index churn, where we get most of the
> > benefit of the patch, but also where the costs are most apparent.
> > Hopefully it can be fixed, but if not I'm inclined to think that it's
> > a price worth paying. This certainly still needs further analysis and
> > discussion, though. This revision of the patch does not attempt to
> > address that problem in any way.
>
> I believe that I've figured out what's going on here.
>
> At first, I thought that this regression was due to the cycles that
> have been added to page splits, but that doesn't seem to be the case
> at all. Nothing that I did to make page splits faster helped (e.g.
> temporarily go back to doing them "bottom up" made no difference). CPU
> utilization was consistently slightly *higher* with the master branch
> (patch spent slightly more CPU time idle). I now believe that the
> problem is with LWLock/buffer lock contention on index pages, and that
> that's an inherent cost with a minority of write-heavy high contention
> workloads. A cost that we should just accept.

If I wanted to try to say this in fewer words, would it be fair to say
that reducing the size of an index by 40% without changing anything
else can increase contention on the remaining pages?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Mon, Mar 11, 2019 at 11:30 PM Heikki Linnakangas  wrote:
> Yeah, that's fine. I'm curious, though, could you bloat the indexes back
> to the old size by setting the fillfactor?

I think that that might work, though it's hard to say for sure offhand.

The "split after new item" optimization is supposed to be a variation
of rightmost splits, of course. We apply fillfactor in the same way
much of the time. You would still literally split immediately after
the new item some of the time, though, which makes it unclear how much
bloat there would be without testing it.

Some indexes mostly apply fillfactor in non-rightmost pages, while
other indexes mostly split at the exact point past the new item,
depending on details like the size of the groupings.

I am currently doing a multi-day 6,000 warehouse benchmark, since I
want to be sure that the bloat resistance will hold up over days. I
think that it will, because there aren't that many updates, and
they're almost all HOT-safe. I'll put the idea of a 50/50 fillfactor
benchmark with the high-contention/regressed workload on my TODO list,
though.
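
For reference, the experiment itself only needs the index-level
fillfactor knob (50 being the value under discussion), applied here to
the order lines primary key from the earlier tables:

-- affects future page splits; REINDEX rewrites existing pages too
ALTER INDEX bmsql_order_line_pkey SET (fillfactor = 50);
REINDEX INDEX bmsql_order_line_pkey;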

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Heikki Linnakangas

On 12/03/2019 04:47, Peter Geoghegan wrote:

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression *because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.


Yeah, that's fine. I'm curious, though, could you bloat the indexes back 
to the old size by setting the fillfactor?


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Peter Geoghegan
On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas  wrote:
> Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
> changed to "searching_for_leaf_page" at the last minute. My thinking was
> that in the page-deletion case, you're trying to re-locate a particular
> leaf page. Otherwise, you're searching for matching tuples, not a
> particular page. Although during insertion, I guess you are also
> searching for a particular page, the page to insert to.

I prefer something like "searching_for_pivot_tuple", because it's
unambiguous. Okay with that?

> It's a hot codepath, but I doubt it's *that* hot that it matters,
> performance-wise...

I'll figure that out, although I am currently looking into a
regression in workloads that fit in shared_buffers that my
micro-benchmarks didn't catch initially. Indexes are still much
smaller, but we get a ~2% regression all the same. OTOH, we get a
7.5%+ increase in throughput when the workload is I/O bound, and
latency is generally no worse, and often better, with any workload.

I suspect that the nice top-down approach to nbtsplitloc.c has its
costs...will let you know more when I know more.

> > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > rely on its value, though, and must use
> > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > being minus infinity is still relying on its value being something).
>
> Yeah. I find that's a complicated way to think about it. My mental model
> is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Heikki Linnakangas

On 10/03/2019 20:53, Peter Geoghegan wrote:

On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas  wrote:

If we don't flip the meaning of the flag, then maybe calling it
something like "searching_for_leaf_page" would make sense:

/*
   * When set, we're searching for the leaf page with the given high key,
   * rather than leaf tuples matching the search keys.
   *
   * Normally, when !searching_for_pivot_tuple, if a page's high key


I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)


Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but 
changed to "searching_for_leaf_page" at the last minute. My thinking was 
that in the page-deletion case, you're trying to re-locate a particular 
leaf page. Otherwise, you're searching for matching tuples, not a 
particular page. Although during insertion, I guess you are also 
searching for a particular page, the page to insert to.



As the patch stands, you're also setting minusinfkey when dealing with
v3 indexes. I think it would be better to only set
searching_for_leaf_page in nbtpage.c.


That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare().


Yeah.


I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.


It's a hot codepath, but I doubt it's *that* hot that it matters, 
performance-wise...



In general, I think BTScanInsert
should describe the search key, regardless of whether it's a V3 or V4
index. Properties of the index belong elsewhere. (We're violating that
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
probably OK, it is pretty convenient to have it there. But in principle...)


The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).


Yeah. I find that's a complicated way to think about it. My mental model 
is that v4 indexes store heap TIDs, and every tuple is unique thanks to 
that. But on v3, we don't store heap TIDs, and duplicates are possible.
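
Checking which of those two mental models applies to a given index is
easy with pageinspect, along the lines of the bt_metap() query that
shows up in the regression test output further down the thread:

-- version 4 stores heap TIDs in pivot tuples; version 3 (pg_upgrade'd)
-- indexes do not
SELECT version FROM bt_metap('test1_a_idx');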


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Peter Geoghegan
On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas  wrote:
> "descendrighttrunc" doesn't make much sense to me, either. I don't
> understand it. Maybe a comment would make it clear, though.

It's not an easily grasped concept. I don't think that any name will
easily convey the idea to the reader, though. I'm happy to go with
whatever name you prefer.

> I don't feel like this is an optimization. It's natural consequence of
> what the high key means. I guess you can think of it as an optimization,
> in the same way that not fully scanning the whole index for every search
> is an optimization, but that's not how I think of it :-).

I would agree with this in a green field situation, where we don't
have to consider the legacy of v3 indexes. But that's not the case
here.

> If we don't flip the meaning of the flag, then maybe calling it
> something like "searching_for_leaf_page" would make sense:
>
> /*
>   * When set, we're searching for the leaf page with the given high key,
>   * rather than leaf tuples matching the search keys.
>   *
>   * Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)

I'm fine with "searching_for_pivot_tuple", I think. The underscores
are not really stylistically consistent with other stuff in nbtree.h,
but I can use something very similar to your suggestion that is
consistent.

>   * has truncated columns, and the columns that are present are equal to
>   * the search key, the search will not descend to that page. For
>   * example, if an index has two columns, and a page's high key is
>   * ("foo", ), and the search key is also ("foo," ),
>   * the search will not descend to that page, but its right sibling. The
>   * omitted column in the high key means that all tuples on the page must
>   * be strictly < "foo", so we don't need to visit it. However, sometimes
>   * we perform a search to find the parent of a leaf page, using the leaf
>   * page's high key as the search key. In that case, when we search for
>   * ("foo", ), we do want to land on that page, not its right
>   * sibling.
>   */
> boolsearching_for_leaf_page;

That works for me (assuming you meant searching_for_pivot_tuple).

> As the patch stands, you're also setting minusinfkey when dealing with
> v3 indexes. I think it would be better to only set
> searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare(). I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.

> In general, I think BTScanInsert
> should describe the search key, regardless of whether it's a V3 or V4
> index. Properties of the index belong elsewhere. (We're violating that
> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
> probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Now, it's also true that there are a number of things that we have to
do within nbtinsert.c for !heapkeyspace that don't really concern the
key space as such. Even so, thinking about everything with reference
to the keyspace, and keeping that as similar as possible between v3
and v4, is a good thing. It is up to high-level code (such
as _bt_first()) to not allow a !heapkeyspace index scan to do
something that won't work for it. It is not up to low-level code like
_bt_compare() to worry about these differences (beyond asserting that
caller got it right). If page deletion didn't need minusinfkey
semantics (if nobody but v3 indexes needed that), then I would just
make the "move right of separator" !minusinfkey code within
_bt_compare() test heapkeyspace. But we do have a general need for
minusinfkey semantics, so it seems simpler and more future-proof to
keep heapkeyspace out of low-level nbtsearch.c code.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Heikki Linnakangas

On 08/03/2019 23:21, Peter Geoghegan wrote:

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.


This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres Pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?


"descendrighttrunc" doesn't make much sense to me, either. I don't 
understand it. Maybe a comment would make it clear, though.


I don't feel like this is an optimization. It's natural consequence of 
what the high key means. I guess you can think of it as an optimization, 
in the same way that not fully scanning the whole index for every search 
is an optimization, but that's not how I think of it :-).


If we don't flip the meaning of the flag, then maybe calling it 
something like "searching_for_leaf_page" would make sense:


/*
 * When set, we're searching for the leaf page with the given high key,
 * rather than leaf tuples matching the search keys.
 *
 * Normally, when !searching_for_pivot_tuple, if a page's high key
 * has truncated columns, and the columns that are present are equal to
 * the search key, the search will not descend to that page. For
 * example, if an index has two columns, and a page's high key is
 * ("foo", ), and the search key is also ("foo," ),
 * the search will not descend to that page, but its right sibling. The
 * omitted column in the high key means that all tuples on the page must
 * be strictly < "foo", so we don't need to visit it. However, sometimes
 * we perform a search to find the parent of a leaf page, using the leaf
 * page's high key as the search key. In that case, when we search for
 * ("foo", ), we do want to land on that page, not its right
 * sibling.
 */
boolsearching_for_leaf_page;


As the patch stands, you're also setting minusinfkey when dealing with 
v3 indexes. I think it would be better to only set 
searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert 
should describe the search key, regardless of whether it's a V3 or V4 
index. Properties of the index belong elsewhere. (We're violating that 
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:
> All of that said, maybe it would be clearer if page deletion was not
> the special case that has to opt in to minusinfkey semantics. Perhaps
> it would make more sense for everyone else to opt out of minusinfkey
> semantics, and to get the !minusinfkey optimization as a result of
> that. I only did it the other way around because that meant that only
> nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres Pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

Thanks
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:
> > Question: Wouldn't it be more straightforward to use "1 +inf" as page
> > 1's high key? I.e treat any missing columns as positive infinity.
>
> That might also work, but it wouldn't be more straightforward on
> balance. This is because:

I thought of another reason:

* The 'Add high key "continuescan" optimization' is effective because
the high key of a leaf page tends to look relatively dissimilar to
other items on the page. The optimization would almost never help if
the high key was derived from the lastleft item at the time of a split
-- that's no more informative than the lastleft item itself.

As things stand with the patch, a high key usually has a value for its
last untruncated attribute that can only appear on the page to the
right, and never the current page. We'd quite like to be able to
conclude that the page to the right can't be interesting there and
then, without needing to visit it. Making new leaf high keys "as close
as possible to items on the right, without actually touching them"
makes the optimization quite likely to work out with the TPC-C
indexes, when we search for orderline items for an order that is
rightmost of a leaf page in the orderlines primary key.

And another reason:

* This makes it likely that any new items that would go between the
original lastleft and firstright items end up on the right page when
they're inserted after the lastleft/firstright split. This is
generally a good thing, because we've optimized WAL-logging for new
pages that go on the right. (You pointed this out to me in Lisbon, in
fact.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas  wrote:
> Now, what do we have as the high key of page 1? Answer: "2 -inf". The
> "-inf" is not stored in the key itself, the second key column is just
> omitted, and the search code knows to treat it implicitly as a value
> that's lower than any real value. Hence, "minus infinity".

Right.

> However, during page deletion, we need to perform a search to find the
> downlink pointing to a leaf page. We do that by using the leaf page's
> high key as the search key. But the search needs to treat it slightly
> differently in that case. Normally, searching with a single key value,
> "2", we would land on page 2, because any real value beginning with "2"
> would be on that page, but in the page deletion case, we want to find
> page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
> to do that.

Right.

> Question: Wouldn't it be more straightforward to use "1 +inf" as page
> 1's high key? I.e treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

* We have always taken the new high key from the firstright item, and
we also continue to do that on internal pages -- same as before. It
would certainly complicate the nbtsplitloc.c code to have to deal with
this new special case now (leaf and internal pages would have to have
far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal
page, which is explicitly truncated to zero attributes as of Postgres
v11. It seems ugly to me to make truncated attributes mean negative
infinity in that context, but positive infinity in every other
context.

* Another reason that I prefer "-inf" to "+inf" is that you can
imagine an implementation that makes pivot tuples into normalized
binary keys, that are truncated using generic/opclass-agnostic logic,
and compared using strcmp(). If the scankey binary string is longer
than the pivot tuple, then it's greater according to strcmp() -- that
just works. And, you can truncate the original binary strings built
using opclass infrastructure without having to understand where
attributes begin and end (though this relies on encoding things like
NULL-ness a certain way). If we define truncation to be "+inf" now,
then none of this works.

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're
not so much searching for "-inf" values as we are searching for the
first page that could have tuples for the untruncated attributes. But
isn't that how this has always worked, given that we've had to deal
with duplicate pivot tuples on the same level before now? As I said,
we're not doing an extra thing when minusinfkey is true (during page
deletion) -- it's the other way around. Saying that we're searching
for minus infinity values for the truncated attributes is kind of a
lie, although the search does behave that way.

> That way, the search for page deletion wouldn't need to be treated
> differently. That's also how this used to work: all tuples on a page
> used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The
alternative that you've described is possible, I think, but the key
space works just the same with either of our approaches. You've merely
thought of an alternative way of generating new high keys that satisfy
the same invariants as my own scheme. Provided the new separator for
high key is >= last item on the left and < first item on the right,
everything works.

As you point out, the original Lehman and Yao rule for leaf pages
(which Postgres kinda followed before) is that the high key is <=
items on the leaf level. But this patch makes nbtree follow that rule
fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only
checks <= on leaf pages. Perhaps it led you to believe that I did
things differently. Actually, this is classic Lehman and Yao. The keys
in internal pages are all "separators" as far as Lehman and Yao are
concerned, so the high key is less of a special case on internal
pages. We check < on internal pages because all separators are
supposed to be unique on a level. But, as I said, we do check <= on
the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if
this is unclear. That shows that internal pages have unique keys -- we
can therefore expect the high key to be < items in internal pages. It
also shows that leaf pages copy the high key from the last item on the
left page -- we can expect 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Heikki Linnakangas

On 08/03/2019 12:22, Peter Geoghegan wrote:

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.


Ok, after thinking about it for a while, I think I understand the minus 
infinity stuff now. Let me try to explain it in my own words:


Imagine that you have an index with two key columns, A and B. The index 
has two leaf pages, with the following items:


++   ++
| Page 1 |   | Page 2 |
||   ||
|1 1 |   |2 1 |
|1 2 |   |2 2 |
|1 3 |   |2 3 |
|1 4 |   |2 4 |
|1 5 |   |2 5 |
++   ++

The key space is neatly split on the first key column - probably thanks 
to the new magic in the page split code.


Now, what do we have as the high key of page 1? Answer: "2 -inf". The 
"-inf" is not stored in the key itself, the second key column is just 
omitted, and the search code knows to treat it implicitly as a value 
that's lower than any real value. Hence, "minus infinity".
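
For illustration, the high key of a non-rightmost leaf page can be seen
directly with pageinspect (index name and block number are placeholders
here); the truncated second column simply isn't present in the stored
tuple:

-- the item at itemoffset = 1 on a non-rightmost page is the high key
SELECT itemoffset, ctid, itemlen, data
FROM bt_page_items('two_col_idx', 1);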


However, during page deletion, we need to perform a search to find the 
downlink pointing to a leaf page. We do that by using the leaf page's 
high key as the search key. But the search needs to treat it slightly 
differently in that case. Normally, searching with a single key value, 
"2", we would land on page 2, because any real value beginning with "2" 
would be on that page, but in the page deletion case, we want to find 
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code 
to do that.


Question: Wouldn't it be more straightforward to use "1 +inf" as page 
1's high key? I.e treat any missing columns as positive infinity. That 
way, the search for page deletion wouldn't need to be treated 
differently. That's also how this used to work: all tuples on a page 
used to be <= high key, not strictly < high key. And it would also make 
the rightmost page less of a special case: you could pretend the 
rightmost page has a pivot tuple with all columns truncated away, which 
means positive infinity.


You have this comment in _bt_split, which touches on the subject:


/*
 * The "high key" for the new left page will be the first key that's
 * going to go into the new right page, or possibly a truncated version
 * if this is a leaf page split.  This might be either the existing data
 * item at position firstright, or the incoming tuple.
 *
 * The high key for the left page is formed using the first item on the
 * right page, which may seem to be contrary to Lehman & Yao's approach
 * of using the left page's last item as its new high key when splitting
 * on the leaf level.  It isn't, though: suffix truncation will leave
 * the left page's high key fully equal to the last item on the left
 * page when two tuples with equal key values (excluding heap TID)
 * enclose the split point.  It isn't actually necessary for a new leaf
 * high key to be equal to the last item on the left for the L&Y
 * "subtree" invariant to hold.  It's sufficient to make sure that the
 * new leaf high key is strictly less than the first item on the right
 * leaf page, and greater than or equal to (not necessarily equal to)
 * the last item on the left leaf page.
 *
 * In other words, when suffix truncation isn't possible, L&Y's exact
 * approach to leaf splits is taken.  (Actually, even that is slightly
 * inaccurate.  A tuple with all the keys from firstright but the heap
 * TID from lastleft will be used as the new high key, since the last
 * left tuple could be physically larger despite being opclass-equal in
 * respect of all attributes prior to the heap TID attribute.)
 */


But it doesn't explain why it's done like that.

- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas  wrote:
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

One more thing on this: If you force bitmap index scans (by disabling
index-only scans and index scans with the "enable_" GUCs), then you
get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone
(and the heap, separately). No visibility map accesses, which obscure
the same numbers for a similar index-only scan.

You can then observe that most searches of a single value will touch
the bare minimum number of index pages. For example, if there are 3
levels in the index, you should access only 3 index pages total,
unless there are literally hundreds of matches and we cannot avoid
storing them on more than one leaf page. You'll see that the scan
touches the minimum possible number of index pages, because of:

* Many duplicates strategy. (Not single value strategy, which I
incorrectly mentioned in relation to this earlier.)

* The !minusinfkey optimization, which ensures that we go to the
right of an otherwise-equal pivot tuple in an internal page, rather
than left.

* The "continuescan" high key patch, which ensures that the scan
doesn't go to the right from the first leaf page to try to find even
more matches. The high key on the same leaf page will indicate that
the scan is over, without actually visiting the sibling. (Again, I'm
assuming that your search is for a single value.)
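
Concretely, the recipe described above looks something like this (the
table and predicate are only illustrative):

-- force a bitmap index scan so that the "Buffers" numbers reflect the
-- index accesses alone, with heap accesses reported separately and no
-- visibility map reads mixed in
SET enable_indexscan = off;
SET enable_indexonlyscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE order_id = 42;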

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan  wrote:
> When I drew you that picture while we were in Lisbon, I mentioned to
> you that the patch sometimes used a sentinel scantid value that was
> greater than minus infinity, but less than any real scantid. This
> could be used to force an otherwise-equal-to-pivot search to go left
> rather than uselessly going right. I explained this about 30 minutes
> in, when I was drawing you a picture.

I meant the opposite: it could be used to go right, instead of going
left when descending the tree and unnecessarily moving right on the
leaf level.

As I said, moving right on the leaf level (rather than during the
descent) should only happen when it's necessary, such as when there is
a concurrent page split. It shouldn't happen reliably when searching
for the same value, unless there really are matches across multiple
leaf pages, and that's just what we have to do.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas  wrote:
> > BTW, I would like to hear what you think of the idea of minusinfkey
> > (and the !minusinfkey optimization) specifically.
>
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

Here is another way of explaining it:

When I drew you that picture while we were in Lisbon, I mentioned to
you that the patch sometimes used a sentinel scantid value that was
greater than minus infinity, but less than any real scantid. This
could be used to force an otherwise-equal-to-pivot search to go left
rather than uselessly going right. I explained this about 30 minutes
in, when I was drawing you a picture.

Well, that sentinel heap TID thing doesn't exist any more, because it
was replaced by the !minusinfkey optimization, which is a
*generalization* of the same idea, which extends it to all columns
(not just the heap TID column). That way, you never have to go to two
pages just because you searched for a value that happened to be at the
"right at the edge" of a leaf page.

Page deletion wants to assume that truncated attributes from the high
key of the page being deleted have actual negative infinity values --
negative infinity is a value, just like any other, albeit one that can
only appear in pivot tuples. This is simulated by VACUUM using
"minusinfkey = true". We go left in the parent, not right, and land on
the correct leaf page. Technically we don't compare the negative
infinity values in the pivot to the negative infinity values in the
scankey, but we return 0 just as if we had, and found them equal.
Similarly, v3 indexes specify "minusinfkey = true" in all cases,
because they always want to go left -- just like in old Postgres
versions. They don't have negative infinity values (matches can be on
either side of the all-equal pivot, so they must go left).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas  wrote:
> I haven't investigated any deeper, but apparently something's broken.
> This was with patch v14, without any further changes.

Try it with my patch -- attached.

I think that you missed that the INCLUDE indexes thing within
nbtsort.c needs to be changed back.

-- 
Peter Geoghegan
From cdfe29c5479da6198aa918f4c373cb8a1a1acfe1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 21 Jan 2019 15:35:37 -0800
Subject: [PATCH 08/12] DEBUG: Force version 3 artificially.

---
 contrib/amcheck/expected/check_btree.out |  7 ++-
 contrib/pageinspect/expected/btree.out   |  2 +-
 contrib/pgstattuple/expected/pgstattuple.out | 10 +-
 src/backend/access/nbtree/nbtpage.c  |  2 +-
 src/backend/access/nbtree/nbtsort.c  | 16 
 src/backend/postmaster/postmaster.c  |  1 +
 src/test/regress/expected/dependency.out |  4 ++--
 src/test/regress/expected/event_trigger.out  |  4 ++--
 src/test/regress/expected/foreign_data.out   |  8 
 src/test/regress/expected/rowsecurity.out|  4 ++--
 10 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 687fde8fce..60bebb1c00 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -139,11 +139,8 @@ VACUUM delete_test_table;
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
- bt_index_parent_check 

- 
-(1 row)
-
+ERROR:  index "delete_test_table_pkey" does not support relocating tuples
+HINT:  Only indexes initialized on PostgreSQL 12 support relocation verification.
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
 -- tuple.  Bloom filter must fingerprint normalized index tuple representation.
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 067e73f21a..7f003bf801 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]---+---
 magic   | 340322
-version | 4
+version | 3
 root| 1
 level   | 0
 fastroot| 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9920dbfd40..9858ea69d4 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
 from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 -+++---+++-+---+--+
-   4 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
+   3 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
 from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 -+++---+++-+---+--+
-   4 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
+   3 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
 from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 -+++---+++-+---+--+
-   4 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
+   3 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Heikki Linnakangas

On 05/03/2019 05:16, Peter Geoghegan wrote:
Attached is v14, which has changes based on your feedback. 
As a quick check of the backwards-compatibility code, i.e. 
!heapkeyspace, I hacked _bt_initmetapage to force the version number to 
3, and ran the regression tests. It failed an assertion in the 
'create_index' test:


(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f2943f9a535 in __GI_abort () at abort.c:79
#2  0x5622c7d9d6b4 in ExceptionalCondition 
(conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace, 
page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion",

fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3  0x5622c78627fb in _bt_compare (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4  0x5622c7862640 in _bt_binsrch (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5  0x5622c7861ec9 in _bt_search (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1, 
snapshot=0x5622c8353740) at nbtsearch.c:142
#6  0x5622c7863a44 in _bt_first (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtsearch.c:1183
#7  0x5622c785f8b0 in btgettuple (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtree.c:245
#8  0x5622c78522e3 in index_getnext_tid (scan=0x5622c841e828, 
direction=ForwardScanDirection) at indexam.c:542
#9  0x5622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at 
nodeIndexonlyscan.c:120
#10 0x5622c7a438d5 in ExecScanFetch (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 , recheckMtd=0x5622c7a67bc9 
) at execScan.c:95
#11 0x5622c7a4394a in ExecScan (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 , recheckMtd=0x5622c7a67bc9 
) at execScan.c:145
#12 0x5622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at 
nodeIndexonlyscan.c:322
#13 0x5622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at 
execProcnode.c:445
#14 0x5622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at 
../../../src/include/executor/executor.h:231
#15 0x5622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at 
nodeAgg.c:406
#16 0x5622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0) 
at nodeAgg.c:1737

#17 0x5622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x5622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at 
execProcnode.c:445
#19 0x5622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at 
../../../src/include/executor/executor.h:231
#20 0x5622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78, 
planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x5622c8462088, 
execute_once=true) at execMain.c:1645
#21 0x5622c7a36872 in standard_ExecutorRun 
(queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:363
#22 0x5622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:307
#23 0x5622c7c357dc in PortalRunSelect (portal=0x5622c8336778, 
forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x5622c7c3546f in PortalRun (portal=0x5622c8336778, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x5622c8462088, altdest=0x5622c8462088,

completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x5622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508 
"SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 
IS NULL;") at postgres.c:1215
#26 0x5622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0, 
dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at 
postgres.c:4256
#27 0x5622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at 
postmaster.c:4378
#28 0x5622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at 
postmaster.c:4069

#29 0x5622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x5622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160) 
at postmaster.c:1372

#31 0x5622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken. 
This was with patch v14, without any further changes.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 07/03/2019 15:41, Heikki Linnakangas wrote:

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should call insertion scan
keys by a less confusing name. I don't know what to suggest, though.
"Positioning key"?


I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.


Yeah. It's been like that forever, but I must confess I hadn't paid any
attention to it, until now. I had not understood that text in the README
explaining the difference between search and insertion scan keys, before
looking at this patch. Not sure I ever read it with any thought. Now
that I understand it, I don't like the "insertion scan key" name.


BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.


I don't understand it :-(. I guess that's valuable feedback on its own.
I'll spend more time reading the code around that, but meanwhile, if you
can think of a simpler way to explain it in the comments, that'd be good.


The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.


I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.


I haven't given performance much thought, really. I don't expect this
method to be any slower, but the point of the refactoring is to make the
code easier to understand.


/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for leaf page.  Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
  _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?


I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.


The idea is that _bt_findinsertpage() would not need to know whether the
unique checks were performed or not. I'd like to encapsulate all the
information about the "insert position we're considering" in the
BTInsertStateData struct. Passing 'checkingunique' as a separate
argument violates that, because when it's set, the key means something
slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether
'scantid' is set, instead of 'checkingunique'. That would seem better.
If it looks at 'checkingunique', it's making the assumption that if
unique checks were not performed, then we are already positioned on the
correct page, according to the heap TID. But looking at 'scantid' seems
like a more direct way of getting the same information. And then we
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

if (!checkingunique && itup_key->heapkeyspace)
break;

With this:

if (itup_key->scantid)
break;

And remove the 'checkingunique' argument from _bt_findinsertloc.


Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has 
already been restored, even if it was not set originally when we did 
_bt_search.


My dislike here is that passing 'checkingunique' as a separate argument 
acts like a "modifier", changing slightly the meaning of the insertion 
scan key. If it's not set, we know we're positioned on the correct page. 
Otherwise, we might not be. And it's a pretty indirect way of saying 
that, as it also depends on 'heapkeyspace'. Perhaps add a flag to 
BTInsertStateData, to indicate the same thing more explicitly. Something 
like "bool is_final_insertion_page; /* when set, no need to move right */".


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should call insertion scan
keys by a less confusing name. I don't know what to suggest, though.
"Positioning key"?


I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.


Yeah. It's been like that forever, but I must confess I hadn't paid any 
attention to it, until now. I had not understood that text in the README 
explaining the difference between search and insertion scan keys, before 
looking at this patch. Not sure I ever read it with any thought. Now 
that I understand it, I don't like the "insertion scan key" name.



BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.


I don't understand it :-(. I guess that's valuable feedback on its own. 
I'll spend more time reading the code around that, but meanwhile, if you 
can think of a simpler way to explain it in the comments, that'd be good.



The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.


I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.


I haven't given performance much thought, really. I don't expect this 
method to be any slower, but the point of the refactoring is to make the 
code easier to understand.



/*
   * Do the insertion. First move right to find the correct page to
   * insert to, if necessary. If we're inserting to a non-unique index,
   * _bt_search() already did this when it checked if a move to the
   * right was required for leaf page.  Insertion scankey's scantid
   * would have been filled out at the time. On a unique index, the
   * current buffer is the first buffer containing duplicates, however,
   * so we may need to move right to the correct location for this
   * tuple.
   */
if (checkingunique || itup_key->heapkeyspace)
 _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?


I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.


The idea is that _bt_findinsertpage() would not need to know whether the 
unique checks were performed or not. I'd like to encapsulate all the 
information about the "insert position we're considering" in the 
BTInsertStateData struct. Passing 'checkingunique' as a separate 
argument violates that, because when it's set, the key means something 
slightly different.


Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether 
'scantid' is set, instead of 'checkingunique'. That would seem better. 
If it looks at 'checkingunique', it's making the assumption that if 
unique checks were not performed, then we are already positioned on the 
correct page, according to the heap TID. But looking at 'scantid' seems 
like a more direct way of getting the same information. And then we 
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.


So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

if (!checkingunique && itup_key->heapkeyspace)
break;

With this:

if (itup_key->scantid)
break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan  wrote:
> It will also have to store heapkeyspace, of course. And minusinfkey.
> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.
> For a second, I thought that you forgot to invalidate the bounds_valid
> flag, because you didn't pass it directly, by value to
> _bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move
right within _bt_findinsertloc() while the regression tests run. We
always land on the correct leaf page to begin with. (It works with
unique index insertions, where scantid is NULL when we descend the
tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just
missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there
are perhaps dozens of matches -- this is not relevant to
_bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of
making it unlikely that _bt_check_unique()-wise duplicates will cross
leaf pages, so there will almost always be one leaf page to visit.
And, the !minusinfkey optimization ensures that the only reason we'll
move right is because of a concurrent page split, within
_bt_moveright().

The buffer lock coupling move to the right that _bt_findinsertloc()
does should be considered an edge case with all of these measures, at
least with v4 indexes.
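
To spell out that last point: the move-right loop in _bt_moveright()
boils down to something like the following (a condensed sketch, not the
exact code -- the real loop also finishes incomplete splits and skips
half-dead pages):

    cmpval = key->nextkey ? 0 : 1;
    for (;;)
    {
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        if (P_RIGHTMOST(opaque))
            break;
        if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
        {
            /*
             * High key doesn't bound our scankey.  With scantid filled in
             * and the !minusinfkey optimization, this should only happen
             * when a concurrent split moved our keyspace to the right.
             */
            buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
            continue;
        }
        break;              /* this is the page the key belongs on */
    }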

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:
> After staring at the first patch for a bit longer, a few things started to
> bother me:
>
> * The new struct is called BTScanInsert, but it's used for searches,
> too. It makes sense when you read the README, which explains the
> difference between "search scan keys" and "insertion scan keys", but now
> that we have a separate struct for this, perhaps we should call insertion scan
> keys by a less confusing name. I don't know what to suggest, though.
> "Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

> * We store the binary search bounds in BTScanInsertData, but they're
> only used during insertions.
>
> * The binary search bounds are specific for a particular buffer. But
> that buffer is passed around separately from the bounds. It seems easy
> to have them go out of sync, so that you try to use the cached bounds
> for a different page. The savebinsrch and restorebinsrch is used to deal
> with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state
in the insertion scankey to restrict a search is an idea that could
work well in at least one other way. That really isn't a once-off
thing, even though it looks that way.

> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.
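
In other words, the insertion scan key would end up with roughly these
fields (a sketch to show which fields we're talking about, not the exact
layout from the patch):

    typedef struct BTScanInsertData
    {
        bool        heapkeyspace;   /* rel has v4 keyspace (heap TID is a key column)? */
        bool        minusinfkey;    /* descend to leftmost page among equal pivots? */
        bool        nextkey;        /* position after, rather than on, first match? */
        ItemPointer scantid;        /* tie-breaker heap TID, or NULL */
        int         keysz;          /* number of entries in scankeys[] */
        ScanKeyData scankeys[INDEX_MAX_KEYS];   /* must appear last */
    } BTScanInsertData;

    typedef BTScanInsertData *BTScanInsert;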

> The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value to
_bt_useduplicatepage().

> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

I'll have to think about that some more. Having a separate
_bt_binsrch_insert() may be worth it, but I'll need to do some
profiling.

> Hmm. Perhaps it would be better to move the call to _bt_binsrch (or
> _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
> that _bt_findinsertloc would only be responsible for finding the correct
> page to insert to. So the overall code, after patch #2, would be like:

Maybe, but as I said it's not like _bt_findinsertloc() doesn't know
all about unique indexes already. This is pointed out in a comment in
_bt_doinsert(), even. I guess that it might have to be changed to say
_bt_findinsertpage() instead, with your new approach.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
> _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>
> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
> newitemoff, false);
>
> Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

I suppose that this could be an improvement, provided we keep
assertions that verify that the work "_bt_findinsertpage()" would have
done, had it been called, was in fact unnecessary (e.g., checking the
high key/rightmost-ness).

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow.
Need to sleep on it.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 06/03/2019 04:03, Peter Geoghegan wrote:

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas  wrote:

I'm looking at the first patch in the series now. I'd suggest that you
commit that very soon. It's useful on its own, and seems pretty much
ready to be committed already. I don't think it will be much affected by
whatever changes we make to the later patches, anymore.


After staring at the first patch for a bit longer, a few things started to 
bother me:


* The new struct is called BTScanInsert, but it's used for searches, 
too. It makes sense when you read the README, which explains the 
difference between "search scan keys" and "insertion scan keys", but now 
that we have a separate struct for this, perhaps we should call insertion scan 
keys by a less confusing name. I don't know what to suggest, though. 
"Positioning key"?


* We store the binary search bounds in BTScanInsertData, but they're 
only used during insertions.


* The binary search bounds are specific for a particular buffer. But 
that buffer is passed around separately from the bounds. It seems easy 
to have them go out of sync, so that you try to use the cached bounds 
for a different page. The savebinsrch and restorebinsrch is used to deal 
with that, but it is pretty complicated.



I came up with the attached (against master), which addresses the 2nd 
and 3rd points. I added a whole new BTInsertStateData struct, to hold 
the binary search bounds. BTScanInsert now only holds the 'scankeys' 
array, and the 'nextkey' flag. The new BTInsertStateData struct also 
holds the current buffer we're considering to insert to, and a 
'bounds_valid' flag to indicate if the saved bounds are valid for the 
current buffer. That way, it's more straightforward to clear the 
'bounds_valid' flag whenever we move right.


I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary 
search like _bt_binsrch does, but the bounds caching is only done in 
_bt_binsrch_insert. Seems more clear to have separate functions for them 
now, even though there's some duplication.
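
The gist of the insert variant is just the bounds caching at the end.
Roughly (an abridged sketch of the idea, not the exact code in the
attached patch):

    low = P_FIRSTDATAKEY(opaque);
    high = PageGetMaxOffsetNumber(page) + 1;    /* one past the last slot */
    stricthigh = high;
    cmpval = 1;                 /* insertions never use nextkey semantics */

    while (high > low)
    {
        OffsetNumber mid = low + ((high - low) / 2);
        int32        result = _bt_compare(rel, key, page, mid);

        if (result >= cmpval)
            low = mid + 1;
        else
        {
            high = mid;
            if (result != 0)
                stricthigh = high;
        }
    }

    /* remember the bounds for reuse when the caller goes on to insert */
    insertstate->low = low;
    insertstate->stricthigh = stricthigh;
    insertstate->bounds_valid = true;

    return low;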



+/* HEIKKI:
+Do we need 'checkunique' as an argument? If unique checks were not
+performed, the insertion key will simply not have saved state.
+*/


We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).


Hmm. Perhaps it would be better to move the call to _bt_binsrch (or 
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So 
that _bt_findinsertloc would only be responsible for finding the correct 
page to insert to. So the overall code, after patch #2, would be like:


/*
 * Do the insertion. First move right to find the correct page to
 * insert to, if necessary. If we're inserting to a non-unique index,
 * _bt_search() already did this when it checked if a move to the
 * right was required for leaf page.  Insertion scankey's scantid
 * would have been filled out at the time. On a unique index, the
 * current buffer is the first buffer containing duplicates, however,
 * so we may need to move right to the correct location for this
 * tuple.
 */
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, 
newitemoff, false);


Does this make sense?


Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.


The attached new version simplifies this, IMHO. The bounds and the 
current buffer go together in the same struct, so it's easier to keep 
track whether the bounds are valid or not.



- * starting a regular index scan some can be omitted.  The array is used as a
+ * starting a regular index scan, some can be omitted.  The array is used as a
   * flexible array member, though it's sized in a way that makes it possible to
   * use stack allocations.  See nbtree/README for full details.
+
+HEIKKI: I don't see anything in the README about stack allocations. What
+exactly does the README reference refer to? No code seems to actually allocate
+this in the stack, so we don't 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas  wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point.  I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape.  It sounds like
> really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes is reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan  wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point.  I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape.  It sounds like
really good stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-05 Thread Heikki Linnakangas
I'm looking at the first patch in the series now. I'd suggest that you 
commit that very soon. It's useful on its own, and seems pretty much 
ready to be committed already. I don't think it will be much affected by 
whatever changes we make to the later patches, anymore.


I did some copy-editing of the code comments, see attached patch which 
applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. 
Mostly, to use more Plain English: use active voice instead of passive, 
split long sentences, avoid difficult words.


I also had a few comments and questions on some details. I added them in 
the same patch, marked with "HEIKKI:". Please take a look.


- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89a..eb4df2ebbe6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -609,6 +609,9 @@ original search scankey is consulted as each index entry is sequentially
 scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
+HEIKKI: The above probably needs some updating, now that we have a
+separate BTScanInsert struct to represent an insertion scan key.
+
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
 all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b3fbba276dd..2a2d6576060 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -97,9 +97,12 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.  If rel is a unique index, then every call
- *		here is a checkingunique call (i.e. every call does a duplicate
- *		check, though perhaps only a tentative check).
+ *		don't actually insert.
+
+HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
+weird to talk about it in the function comment. I didn't understand what
+the point of adding this sentence was, so I removed it.
+
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -285,9 +288,10 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
 		/*
-		 * Do the insertion.  Note that itup_key contains mutable state used
-		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
-		 * binary search.  !checkingunique case must start own binary search.
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start its own binary
+		 * search.
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, , checkingunique,
 	   itup, stack, heapRel);
@@ -311,10 +315,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
- * reuse most of the work of our initial binary search to find conflicting
- * tuples.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -326,6 +326,10 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
 _bt_check_unique(Relation rel, BTScanInsert itup_key,
@@ -352,8 +356,8 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	/*
-	 * Save binary search bounds.  Note that this is also used within
-	 * _bt_findinsertloc() later.
+	 * Save binary search bounds.  We use them in the fastpath below, but
+	 * also in the _bt_findinsertloc() call later.
 	 */
 	itup_key->savebinsrch = true;
 	offset = _bt_binsrch(rel, itup_key, buf);
@@ -375,16 +379,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 		if (offset <= maxoff)
 		{
 			/*
-			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
-			 * consideration to items that are definitely duplicates in most
-			 * cases (though not when original page is empty, or when initial
-			 * offset is past the end of the original page, which may indicate
-			 * that we'll have to examine a second or subsequent page).
+			 * 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-04 Thread Peter Geoghegan
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas  wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here; that just
> confuses the reader.

Fixed in v14.

> > /*
> >  *_bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We want to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.
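
Concretely, the tie-breaker at the end of _bt_compare() amounts to
something like this, once all untruncated key attributes have compared
equal (a simplified sketch, not the exact code from the patch):

    if (key->scantid == NULL)
        return 0;       /* no scantid -- caller never "peeks" at heap TID */

    heapTid = BTreeTupleGetHeapTID(itup);
    if (heapTid == NULL)
        return 1;       /* pivot with truncated TID sorts before any real TID */

    return ItemPointerCompare(key->scantid, heapTid);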

There are a few specific cross-version things that we need to think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-03 Thread Heikki Linnakangas
Some comments on 
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. 
Mostly about code comments. In general, I think a round of copy-editing 
the comments, to use simpler language, would do good. The actual code 
changes look good to me.



/*
 *  _bt_findinsertloc() -- Finds an insert location for a tuple
 *
 *  On entry, *bufptr contains the page that the new tuple 
unambiguously
 *  belongs on.  This may not be quite right for callers that just 
called
 *  _bt_check_unique(), though, since they won't have initially 
searched
 *  using a scantid.  They'll have to insert into a page somewhere 
to the
 *  right in rare cases where there are many physical duplicates in 
a
 *  unique index, and their scantid directs us to some page full of
 *  duplicates to the right, where the new tuple must go.  
(Actually,
 *  since !heapkeyspace pg_upgraded'd non-unique indexes never get a
 *  scantid, they too may require that we move right.  We treat them
 *  somewhat like unique indexes.)


Seems confusing to first say assertively that "*bufptr contains the page 
that the new tuple unambiguously belongs to", and then immediately go on 
to list a whole bunch of exceptions. Maybe just remove "unambiguously".



@@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
 * If this page was incompletely split, finish the 
split now. We
 * do this while holding a lock on the left sibling, 
which is not
 * good because finishing the split could be a fairly 
lengthy
-* operation.  But this should happen very seldom.
+* operation.  But this should only happen when 
inserting into a
+* unique index that has more than an entire page for 
duplicates
+* of the value being inserted.  (!heapkeyspace 
non-unique indexes
+* are an exception, once again.)
 */
if (P_INCOMPLETE_SPLIT(lpageop))
{


This happens very seldom, because you only get an incomplete split if 
you crash in the middle of a page split, and that should be very rare. I 
don't think we need to list more fine-grained conditions here; that just 
confuses the reader.



/*
 *  _bt_useduplicatepage() -- Settle for this page of duplicates?
 *
 *  Prior to PostgreSQL 12/Btree version 4, heap TID was never 
treated
 *  as a part of the keyspace.  If there were many tuples of the 
same
 *  value spanning more than one leaf page, a new tuple of that same
 *  value could legally be placed on any one of the pages.
 *
 *  This function handles the question of whether or not an 
insertion
 *  of a duplicate into a pg_upgrade'd !heapkeyspace index should
 *  insert on the page contained in buf when a choice must be made.
 *  Preemptive microvacuuming is performed here when that could 
allow
 *  caller to insert on to the page in buf.
 *
 *  Returns true if caller should proceed with insert on buf's page.
 *  Otherwise, caller should move on to the page to the right 
(caller
 *  must always be able to still move right following call here).
 */


So, this function is only used for legacy pg_upgraded indexes. The 
comment implies that, but doesn't actually say it.



/*
 * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
 * and non-pivot tuples, despite differences in how heap TID is represented.
 *
 * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
 * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
 * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
 * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
 * probably also contain a heap TID at the end of the tuple.  We currently
 * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
 * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
 * tuple), but it might also be used by non-pivot tuples in the future.
 * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
 * tuples that actually originated with the truncation of one or more
 * attributes.
 */
#define BTreeTupleGetHeapTID(itup) ...


The comment claims that "all pivot tuples must be as of BTREE_VERSION 
4". I thought that all internal tuples are called pivot tuples, even on 
version 3. I think what this means to say is that this macro is only 
used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only 
have a heap TID in BTREE_VERSION 4 indexes.


This macro (and many others in nbtree.h) is quite complicated. A static 
inline function might be easier to 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-03 Thread Heikki Linnakangas

On 26/02/2019 12:31, Peter Geoghegan wrote:

On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas  wrote:

I spent some time first trying to understand the current algorithm, and
then rewriting it in a way that I find easier to understand. I came up
with the attached. I think it optimizes for the same goals as your
patch, but the approach is quite different.


Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once.


Great, looks much simpler now, indeed! Now I finally understand the 
algorithm.



I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?


If the performance is acceptable without it, let's not bother. We can 
optimize later.


What would be the worst case scenario for this? Splitting a page that 
has as many tuples as possible, I guess, so maybe inserting into a table 
with a single-column index, with 32k BLCKSZ. Have you done performance 
testing on something like that?



Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)


Rounding up the allocation to BLCKSZ seems like a premature 
optimization. Even if it saved some cycles, I don't think it's worth the 
trouble of having to explain all that in the comment.


I think you could change the curdelta, leftfree, and rightfree fields in 
SplitPoint to int16, to make the array smaller.
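
That is, something along these lines (a sketch based on my reading of the
patch; the field names may not match v13 exactly):

    typedef struct
    {
        /* details of free space left by the split */
        int16       curdelta;       /* current leftfree/rightfree delta */
        int16       leftfree;       /* space left on left page post-split */
        int16       rightfree;      /* space left on right page post-split */

        /* split point identifying fields */
        OffsetNumber firstoldonright;   /* first item on new right page */
        bool        newitemonleft;      /* new item goes on left or right page? */
    } SplitPoint;

With int16 fields that packs into about 10 bytes per candidate split,
rather than 16 or more, which adds up when hundreds of candidates are
materialized for a single page split.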



Other changes from your prototype:

*  I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you've allocating space for hundreds of
items in a critical path like this.


Ok.


* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?


My patch treated the new item the same as all the old items, in 
_bt_checksplitloc(), so it didn't need newitemisonright. You still need 
it with your approach.



Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.


I looked at it very briefly. Yeah, it's pretty simple now. Nice!


About this comment on _bt_findsplitloc():


/*
*   _bt_findsplitloc() -- find an appropriate place to split a page.
*
* The main goal here is to equalize the free space that will be on each
* split page, *after accounting for the inserted tuple*.  (If we fail to
* account for it, we might find ourselves with too little room on the page
* that it needs to go into!)
*
* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full.  In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.
* This is the same as nbtsort.c produces for a newly-created tree.  Note
* that leaf and nonleaf pages use different fillfactors.
*
* We are passed the intended insert position of the new tuple, expressed as
* the offsetnumber of the tuple it must go in front of (this could be
* maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
* 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-02-13 Thread Peter Geoghegan
On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan  wrote:
> Notable improvements in v12:

I've been benchmarking v12, once again using a slightly modified
BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since
the problems with index bloat don't take so long to manifest
themselves when the indexes are inserted into incrementally from the
very beginning. This benchmarking process took over 20 hours, with a
database that started off at about 90GB (700 TPC-C/BenchmarkSQL
warehouses were used). That easily exceeded available main memory on
my test server, which was 32GB. This is a pretty I/O bound workload,
and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB,
NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only
tablespace.

Importantly, I figured out that I should disable both hash joins and
merge joins with BenchmarkSQL, in order to force all joins to be
nested loop joins. Otherwise, the "stock level" transaction eventually
starts to use a hash join, even though that's about 10x slower than a
nestloop join (~4ms vs. ~40ms on this machine) -- the hash join
produces a lot of noise without really testing anything. It usually
takes a couple of hours before we start to get obviously-bad plans,
but it also usually takes about that long until the patch series
starts to noticeably overtake the master branch. I don't think that
TPC-C will ever benefit from using a hash join or a merge join, since
it's supposed to be a pure OLTP benchmark, and is a benchmark that
MySQL is known to do at least respectably-well on.

This is the first benchmark I've published that was considerably I/O
bound. There are significant improvements in performance across the
board, on every measure, though it takes several hours for that to
really show. The benchmark was not rate-limited. 16
clients/"terminals" are used throughout. There were 5 runs for master
and 5 for patch, interlaced, each lasting 2 hours. Initialization
occurred once, so it's expected that both databases will gradually get
larger across runs.

Summary (appears in same order as the execution of each run) -- each
run is 2 hours, so 20 hours total excluding initial load time (2 hours
* 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC
(NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC
(NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC
(NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC
(NewOrders) = 36508.38<-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC
(NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC
(NewOrders) = 34021.98   <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC
(NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC
(NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC
(NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders)
= 30885.56  <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for
the patch -- see the full html reports in the provided tar archive for
full details (URL provided below). The html reports have nice SVG
graphs, generated by BenchmarkSQL using R -- one for transaction
throughput, and another for transaction latency. The overall picture
is that the patched version starts out ahead, and has a much more
gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction
in blocks read into shared_buffers for the duration of these runs. For
example, here is what pg_stat_database shows for run 3 (I reset the
stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517  <-- 10.7%
reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are
evicted in each case. pg_stat_bgwriter indicates that more buffers are
written out during checkpoints, while fewer are written out by
backends. I won't speculate further on what all of this means right
now, though.

You can find the raw details for blks_read for each and every run in
the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each
run, logs for each run, etc. There's more information than anybody
else is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy
to assist in that. There is a shell script (zsh) included in the tar
archive, although that will need to be changed a bit to point to the
correct installations and so on. Independent 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-02-05 Thread Peter Geoghegan
On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan  wrote:
> Thanks for going to the trouble of implementing what you have in mind!
>
> I agree that the code that I wrote within nbtsplitloc.c is very
> subtle, and I do think that I have further work to do to make its
> design clearer. I think that this high level description of the goals
> of the algorithm is inaccurate in subtle but important ways, though.
> Hopefully there will be a way of making it more understandable that
> preserves certain important characteristics.

Heikki and I had the opportunity to talk about this recently. We found
an easy way forward. I believe that the nbtsplitloc.c algorithm itself
is fine -- the code will need to be refactored, though.

nbtsplitloc.c can be refactored to assemble a list of legal split
points up front, before deciding which one to go with in a separate
pass (using one of two "alternative modes", as before). I now
understand that Heikki simply wants to separate the questions of "Is
this candidate split point legal?" from "Is this known-legal candidate
split point good/ideal based on my current criteria?". This seems like
a worthwhile goal to me. Heikki accepts the need for multiple
modes/passes, provided recursion isn't used in the implementation.

It's clear to me that the algorithm should start off trying to split
towards the middle of the page (or towards the end in the rightmost
case), while possibly making a small compromise on the exact split
point to maximize the effectiveness of suffix truncation. We must
change strategy entirely if and only if the middle of the page (or
wherever we'd like to split initially) is found to be completely full
of duplicates -- that's where the need for a second pass comes in.
This should almost never happen in most applications. Even when it
happens, we only care about not splitting inside a group of
duplicates. That's not the same thing as caring about maximizing the
number of attributes truncated away. Those two things seem similar,
but are actually very different.
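
In rough outline, the refactored control flow would look something like
this (the helper names here are placeholders for illustration, not the
actual nbtsplitloc.c functions):

    /* Pass 1: record every legal split point for this page + new tuple */
    nlegal = record_all_legal_split_points(page, newitem, splits);

    /* Default mode: aim near the middle (or fillfactor% split when rightmost) */
    split = best_split_by_delta(splits, nlegal);

    /* Change strategy only when that region is packed with duplicates */
    if (split_interval_is_all_duplicates(page, splits, split))
        split = best_split_avoiding_heap_tid_in_new_high_key(splits, nlegal);

    return split;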

It might have sounded like Heikki and I disagreed on the design of the
algorithm at a high level, or what its goals ought to be. That is not
the case, though. (At least not so far.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-28 Thread Peter Geoghegan
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas  wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different. At a very high level, I
> believe the goals can be described as:
>
> 1. Find out how much suffix truncation is possible, i.e. how many key
> columns can be truncated away, in the best case, among all possible ways
> to split the page.
>
> 2. Among all the splits that achieve that optimum suffix truncation,
> find the one with smallest "delta".

Thanks for going to the trouble of implementing what you have in mind!

I agree that the code that I wrote within nbtsplitloc.c is very
subtle, and I do think that I have further work to do to make its
design clearer. I think that this high level description of the goals
of the algorithm is inaccurate in subtle but important ways, though.
Hopefully there will be a way of making it more understandable that
preserves certain important characteristics. If you had my test
cases/data that would probably help you a lot (more on that later).

The algorithm I came up with almost always does these two things in
the opposite order, with each considered in clearly separate phases.
There are good reasons for this. We start with the same criteria as
the old algorithm. We assemble a small array of candidate split
points, rather than one split point, but otherwise it's almost exactly
the same (the array is sorted by delta). Then, at the very end, we
iterate through the small array to find the best choice for suffix
truncation. IOW, we only consider suffix truncation as a *secondary*
goal. The delta is still by far the most important thing 99%+ of the
time. I assume it's fairly rare to not have two distinct tuples within
9 or so tuples of the delta-wise optimal split position -- 99% is
probably a low estimate, at least in OLTP apps, or within unique
indexes. I see that you do something with a "good enough" delta that
seems like it also makes delta the most important thing, but that
doesn't appear to be, uh, good enough. ;-)
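
In rough outline, the final step looks something like the sketch below
(made-up types, not the patch's real ones). The candidate array is
already sorted by delta, so suffix truncation only ever acts as a
secondary criterion:

typedef struct
{
    int     firstright;     /* candidate split point */
    int     delta;          /* free space imbalance; array is sorted on this */
    int     pivotnatts;     /* key attributes the new pivot would need */
} Candidate;

/*
 * 'cand' is a small array (~10 entries) sorted by delta, ascending.  Pick
 * the entry whose pivot needs the fewest attributes; ties resolve in favor
 * of the lower delta simply because of the sort order.
 */
static int
final_split_choice(const Candidate *cand, int ncand)
{
    int     best = 0;

    for (int i = 1; i < ncand; i++)
    {
        if (cand[i].pivotnatts < cand[best].pivotnatts)
            best = i;
    }
    return cand[best].firstright;
}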

Now, it's true that my approach does occasionally work in a way close
to what you describe above -- it does this when we give up on default
mode and check "how much suffix truncation is possible?" exhaustively,
for every possible candidate split point. "Many duplicates" mode kicks
in when we need to be aggressive about suffix truncation. Even then,
the exact goals are different to what you have in mind in subtle but
important ways. While "truncating away the heap TID" isn't really a
special case in other places, it is a special case for my version of
nbtsplitloc.c, which more or less avoids it at all costs. Truncating
away heap TID is more important than truncating away any other
attribute by a *huge* margin. Many duplicates mode *only* specifically
cares about truncating the final TID attribute. That is the only thing
that is ever treated as more important than delta, though even there
we don't forget about delta entirely. That is, we assume that the
"perfect penalty" is nkeyatts when in many duplicates mode, because we
don't care about suffix truncation beyond heap TID truncation by then.
So, if we find 5 split points out of 250 in the final array that avoid
appending heap TID, we use the earliest/lowest delta out of those 5.
We're not going to try to maximize the number of *additional*
attributes that get truncated, because that can make the leaf pages
unbalanced in an *unbounded* way. None of these 5 split points are
"good enough", but the distinction between their deltas still matters
a lot. We strongly prefer a split with a *mediocre* delta to a split
with a *terrible* delta -- a bigger high key is the least of our
worries here. (I made similar mistakes myself months ago, BTW.)
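
Spelled out as code (again with made-up types rather than the patch's
own), the many duplicates rule amounts to this:

#include <stdbool.h>

typedef struct
{
    int     firstright;
    int     delta;                  /* array is sorted on this, ascending */
    bool    pivot_needs_heap_tid;   /* would the new pivot need a heap TID? */
} DupCandidate;

/*
 * "Perfect penalty" is nkeyatts: any candidate whose pivot avoids the heap
 * TID is acceptable, and among those we just take the lowest delta.  We
 * never try to maximize additional attribute truncation here.
 */
static int
many_duplicates_choice(const DupCandidate *cand, int ncand)
{
    for (int i = 0; i < ncand; i++)
    {
        if (!cand[i].pivot_needs_heap_tid)
            return cand[i].firstright;  /* earliest == lowest delta */
    }
    return cand[0].firstright;  /* nothing avoids heap TID; lowest delta */
}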

Your version of the algorithm makes a test case of mine (UK land
registry test case [1]) go from having an index that's 1101 MB with my
version of the algorithm/patch and 1329 MB on the master branch to an
index that's 3030 MB in size. I think that this happens because it
effectively fails to give any consideration to delta at all at certain
points. On leaf pages with lots of unique keys, your algorithm does
about as well as mine because all possible split points look equally
good suffix-truncation-wise, plus you have the "good enough" test, so
delta isn't ignored. I think that your algorithm also works well when
there are many duplicates but only one non-TID index column, since the
heap TID truncation versus other truncation issue does not arise. The
test case I used is an index on "(county, city, locality)", though --
low cardinality, but more than a single column. That's a *combination*
of two separate considerations, that seem to get conflated. I don't
think that you can avoid doing "a second pass" in some sense, because
these really are separate 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-28 Thread Heikki Linnakangas

On 09/01/2019 02:47, Peter Geoghegan wrote:

On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan  wrote:

On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas  wrote:

I'm envisioning that you have an array, with one element for each item
on the page (including the tuple we're inserting, which isn't really on
the page yet). In the first pass, you count up from left to right,
filling the array. Next, you compute the complete penalties, starting
from the middle, walking outwards.



Ah, right. I think I see what you mean now.



Leave it with me. I'll need to think about this some more.


Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.


I spent some time first trying to understand the current algorithm, and 
then rewriting it in a way that I find easier to understand. I came up 
with the attached. I think it optimizes for the same goals as your 
patch, but the approach  is quite different. At a very high level, I 
believe the goals can be described as:


1. Find out how much suffix truncation is possible, i.e. how many key 
columns can be truncated away, in the best case, among all possible ways 
to split the page.


2. Among all the splits that achieve that optimum suffix truncation, 
find the one with smallest "delta".


For performance reasons, it doesn't actually do it in that order. It's 
more like this:


1. First, scan all split positions, recording the 'leftfree' and 
'rightfree' at every valid split position. The array of possible splits 
is kept in order by offset number. (This scans through all items, but 
the math is simple, so it's pretty fast)


2. Compute the optimum suffix truncation, by comparing the leftmost and 
rightmost keys, among all the possible split positions.


3. Split the array of possible splits in half, and process both halves 
recursively. The recursive process "zooms in" to the place where we'd 
expect to find the best candidate, but will ultimately scan through all 
split candidates, if no "good enough" match is found.
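
To illustrate step 3, here is a toy version of the recursion (this is
just a sketch to show the shape, not the code from the attached patch):
visit the half that contains the expected best position first, and stop
as soon as a good enough penalty turns up.

#include <stdbool.h>

typedef struct
{
    int     offset;     /* split position, kept in page order */
    int     penalty;    /* computed lazily in the real thing */
} CandSplit;

#define GOOD_ENOUGH     8   /* arbitrary threshold for this sketch */

/* Returns true once a good-enough split is seen; *best tracks the best so far */
static bool
zoom_in(const CandSplit *splits, int lo, int hi, int target, int *best)
{
    if (lo > hi)
        return false;
    if (lo == hi)
    {
        if (*best < 0 || splits[lo].penalty < splits[*best].penalty)
            *best = lo;
        return splits[lo].penalty <= GOOD_ENOUGH;
    }

    int     mid = lo + (hi - lo) / 2;

    /* "Zoom in": recurse into the half containing the target first */
    if (target <= mid)
        return zoom_in(splits, lo, mid, target, best) ||
            zoom_in(splits, mid + 1, hi, target, best);
    else
        return zoom_in(splits, mid + 1, hi, target, best) ||
            zoom_in(splits, lo, mid, target, best);
}

/* A caller starts with best = -1: zoom_in(splits, 0, n - 1, n / 2, &best) */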



I've only been testing this on leaf splits. I didn't understand how the 
penalty worked for internal pages in your patch. In this version, the 
same algorithm is used for leaf and internal pages. I'm sure this still 
has bugs in it, and could use some polishing, but I think this will be 
more readable way of doing it.



What have you been using to test this? I wrote the attached little test 
extension, to explore what _bt_findsplitloc() decides in different 
scenarios. It's pretty rough, but that's what I've been using while 
hacking on this. It prints output like this:


postgres=# select test_split();
NOTICE:  test 1:
left2/358: 1 0
left  358/358: 1 356
right   1/ 51: 1 357
right  51/ 51: 1 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 2:
left2/358: 0 0
left  358/358: 356 356
right   1/ 51: 357 357
right  51/ 51: 407 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 3:
left2/358: 0 0
left  320/358: 10 10  SPLIT TUPLE
left  358/358: 48 48
right   1/ 51: 49 49
right  51/ 51: 99 99
split ratio: 12/87

NOTICE:  test 4:
left2/309: 1 100
left  309/309: 1 407  SPLIT TUPLE
right   1/100: 2 0
right 100/100: 2 99
split ratio: 24/75

Each test consists of creating a temp table with one index, and 
inserting rows in a certain pattern, until the root page splits. It then 
prints the first and last tuples on both pages, after the split, as well 
as the tuple that caused the split. I don't know if this is useful to 
anyone but myself, but I thought I'd share it.


- Heikki
/*-
 *
 * nbtsplitloc.c
 *	  Choose split point code for Postgres btree implementation.
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/access/nbtree/nbtsplitloc.c
 *
 *-
 */
#include "postgres.h"

#include "access/nbtree.h"
#include "storage/lmgr.h"

typedef struct
{
	/* FindSplitData candidate split */
	int			leftfree;
	int			rightfree;
	OffsetNumber firstoldonright;
	bool		newitemonleft;
	IndexTuple	lastleft_tuple;
	IndexTuple	firstright_tuple;
}			SplitPoint;

typedef struct
{
	/* context data for _bt_checksplitloc */
	Relation	rel;
	bool		is_leaf;		/* T if splitting a leaf page */
	OffsetNumber newitemoff;	/* where the new item is to be inserted */
	int			leftspace;		/* space available for items on left page */
	int			rightspace;		/* space available for items on right page */
	int			dataitemstotal; /* space taken by all items, old and new */

	int			ncandidates;	/* current 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-04 Thread Peter Geoghegan
Hi Alexander,

On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov
 wrote:
> I'm starting to look at this patchset.  Not ready to post detail
> review, but have a couple of questions.

Thanks for taking a look!

> Yes, it shouldn't be too hard, but it seems like we have to keep two
> branches of code for different handling of duplicates.  Is that true?

Not really. If you take a look at v9, you'll see the approach I've
taken is to make insertion scan keys aware of which rules apply (the
"heapkeyspace" field field controls this). I think that there are
about 5 "if" statements for that outside of amcheck. It's pretty
manageable.

I like to imagine that the existing code already has unique keys, but
nobody ever gets to look at the final attribute. It works that way
most of the time -- the only exception is insertion with user keys
that aren't unique already. Note that the way we move left on equal
pivot tuples, rather than right (rather than following the pivot's
downlink) wasn't invented by Postgres to deal with the lack of unique
keys. That's actually a part of the Lehman and Yao design itself.
Almost all of the special cases are optimizations rather than truly
necessary infrastructure.

> I didn't get the point of this paragraph.  Does it might happen that
> first right tuple is under tuple size restriction, but new pivot tuple
> is beyond that restriction?  If so, would we have an error because of
> too long pivot tuple?  If not, I think this needs to be explained
> better.

The v9 version of the function _bt_check_third_page() shows what it
means (comments on this will be improved in v10, too). The old limit
of 2712 bytes still applies to pivot tuples, while a new, lower limit
of 2704 bytes applies to non-pivot tuples. This difference is
necessary because an extra MAXALIGN() quantum could be needed to add a
heap TID to a pivot tuple during truncation in the worst case. To
users, the limit is 2704 bytes, because that's the limit that actually
needs to be enforced during insertion.
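
As back-of-the-envelope arithmetic (assuming the usual 8-byte MAXALIGN
and a 6-byte ItemPointerData; these are illustrative definitions, not the
macros from the patch), the relationship between the two limits is:

#include <stdio.h>

#define MAXALIGN(LEN)   (((unsigned) (LEN) + 7) & ~((unsigned) 7))

int
main(void)
{
    int     nonpivot_limit = 2704;              /* enforced during insertion */
    int     heap_tid_overhead = MAXALIGN(6);    /* worst case: one quantum, 8 bytes */
    int     pivot_limit = nonpivot_limit + heap_tid_overhead;

    /* prints 2712 -- the old "1/3 of a page" limit, still honored by pivots */
    printf("pivot tuple limit: %d\n", pivot_limit);
    return 0;
}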

We never actually say "1/3 of a page means 2704 bytes" in the docs,
since the definition was always a bit fuzzy. There will need to be a
compatibility note in the release notes, though.
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-04 Thread Alexander Korotkov
Hi!

I'm starting to look at this patchset.  Not ready to post detail
review, but have a couple of questions.

On Wed, Sep 19, 2018 at 9:24 PM Peter Geoghegan  wrote:
> I still haven't managed to add pg_upgrade support, but that's my next
> step. I am more or less happy with the substance of the patch in v5,
> and feel that I can now work backwards towards figuring out the best
> way to deal with on-disk compatibility. It shouldn't be too hard --
> most of the effort will involve coming up with a good test suite.

Yes, it shouldn't be too hard, but it seems like we have to keep two
branches of code for different handling of duplicates.  Is that true?

+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.

I didn't get the point of this paragraph.  Does it might happen that
first right tuple is under tuple size restriction, but new pivot tuple
is beyond that restriction?  If so, would we have an error because of
too long pivot tuple?  If not, I think this needs to be explained
better.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-28 Thread Heikki Linnakangas

On 29/12/2018 01:04, Peter Geoghegan wrote:

However, naively computing the penalty upfront for every offset would be
a bit wasteful. Instead, start from the middle of the page, and walk
"outwards" towards both ends, until you find a "good enough" penalty.


You can't start at the middle of the page, though.

You have to start at the left (though you could probably start at the
right instead). This is because of page fragmentation -- it's not
correct to assume that the line pointer offset into tuple space on the
page (firstright line pointer lp_off for candidate split point) tells
you anything about what the space delta will be after the split. You
have to exhaustively add up the free space before the line pointer
(the free space for all earlier line pointers) before seeing if the
line pointer works as a split point, since each previous line
pointer's tuple could be located anywhere in the original page's tuple
space (anywhere to the left or to the right of where it would be in
the simple/unfragmented case).


Right. You'll need to do the free space computations from left to right, 
but once you have done that, you can compute the penalties in any order.


I'm envisioning that you have an array, with one element for each item 
on the page (including the tuple we're inserting, which isn't really on 
the page yet). In the first pass, you count up from left to right, 
filling the array. Next, you compute the complete penalties, starting 
from the middle, walking outwards.
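
Something like the following sketch (stand-in types, not real nbtree
code): the first pass copes with page fragmentation because it never
looks at lp_off, and the second pass walks outwards from the middle
until the penalty is good enough.

#include <stdlib.h>

#define MAX_ITEMS   512     /* plenty for a sketch; assumes nitems <= this */

/* toy penalty: just the free space imbalance */
static int
penalty(int leftsz, int rightsz)
{
    return abs(leftsz - rightsz);
}

static int
find_split(const int *itemsz, int nitems, int good_enough)
{
    int     leftsz[MAX_ITEMS];  /* bytes on the left if we split before item i */
    int     total = 0;

    /* Pass 1: accumulate sizes from left to right */
    for (int i = 0; i < nitems; i++)
    {
        leftsz[i] = total;
        total += itemsz[i];
    }

    /* Pass 2: compute penalties from the middle, walking outwards */
    int     mid = nitems / 2;
    int     best = mid;
    int     bestpenalty = penalty(leftsz[mid], total - leftsz[mid]);

    for (int step = 1; step < nitems && bestpenalty > good_enough; step++)
    {
        int     i = (step % 2) ? mid + (step + 1) / 2 : mid - step / 2;

        if (i <= 0 || i >= nitems)
            continue;

        int     p = penalty(leftsz[i], total - leftsz[i]);

        if (p < bestpenalty)
        {
            bestpenalty = p;
            best = i;
        }
    }
    return best;
}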


That's not so different from what you're doing now, but I find it more 
natural to explain the algorithm that way.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-28 Thread Peter Geoghegan
On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas  wrote:
> I spent some time reviewing this. I skipped the first patch, to add a
> column to pg_depend, and I got through patches 2, 3 and 4. Impressive
> results, and the code looks sane.

Thanks! I really appreciate your taking the time to do such a thorough review.

You were right to skip the first patch, because there is a fair chance
that it won't be used in the end. Tom is looking into the pg_depend
problem that I paper over with the first patch.

> I wrote a laundry list of little comments on minor things, suggested
> rewordings of comments etc. I hope they're useful, but feel free to
> ignore/override my opinions of any of those, as you see best.

I think that that feedback is also useful, and I'll end up using 95%+
of it. Much of the information I'm trying to get across is very
subtle.

> But first, a few slightly bigger (medium-sized?) issues that caught my eye:
>
> 1. How about doing the BTScanInsertData refactoring as a separate
> commit, first? It seems like a good thing for readability on its own,
> and would slim the big main patch. (And make sure to credit Andrey for
> that idea in the commit message.)

Good idea. I'll do that.

> This 'assumeheapkeyspace' flag feels awkward. What if the caller knows
> that it is a v3 index? There's no way to tell _bt_mkscankey() that.
> (There's no need for that, currently, but seems a bit weird.)

This is there for CREATE INDEX -- we cannot access the metapage during
an index build. We'll only be able to create new v4 indexes with the
patch applied, so we can assume that heap TID is part of the key space
safely.
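
In other words, the decision boils down to something like this
hypothetical helper (the names here are invented for illustration; they
are not from the patch):

#include <stdbool.h>

static bool
index_is_heapkeyspace(int btree_version, bool assumeheapkeyspace)
{
    /*
     * During CREATE INDEX the metapage cannot be consulted, so the caller
     * vouches for the fact that a newly built index follows the v4
     * ("heapkeyspace") rules.  Otherwise, trust the on-disk version.
     */
    if (assumeheapkeyspace)
        return true;
    return btree_version >= 4;
}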

> _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which
> calls _bt_mkscankey(). It's holding a lock on the page being split. Do
> we risk deadlock by locking the metapage at the same time?

I already had vague concerns along the same lines. I am also concerned
about index_getprocinfo() calls that happen in the same code path,
with a buffer lock held. (SP-GiST's doPickSplit() function can be
considered a kind of precedent that makes the second issue okay, I
suppose.)

See also: My later remarks on the use of "authoritative comparisons"
from this same e-mail.

> I don't have any great ideas on what to do about this, but it's awkward
> as it is. Can we get away without the new argument? Could we somehow
> arrange things so that rd_amcache would be guaranteed to already be set?

These are probably safe in practice, but the way that we rely on them
being safe from a distance is a concern. Let me get back to you on
this.

> 3. In the "Pick nbtree split points discerningly" patch
>
> I find the different modes and the logic in _bt_findsplitloc() very hard
> to understand. I've spent a while looking at it now, and I think I have
> a vague understanding of what things it takes into consideration, but I
> don't understand why it performs those multiple stages, what each stage
> does, and how that leads to an overall strategy. I think a rewrite would
> be in order, to make that more understandable. I'm not sure what exactly
> it should look like, though.

I've already refactored that a little bit for the upcoming v10. The
way _bt_findsplitloc() state is initially set up becomes slightly more
streamlined. It still works in the same way, though, so you'll
probably only think that the new version is a minor improvement.
(Actually, v10 focuses on making _bt_splitatnewitem() a bit less
magical, at least right now.)

> If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or
> SINGLE_VALUE modes, it has to redo a lot of the work that was done in
> the DEFAULT mode already. That's probably not a big deal in practice,
> performance-wise, but I feel that it's another hint that some
> refactoring would be in order.

The logic within _bt_findsplitloc() has been very hard to refactor all
along. You're right that there is a fair amount of redundant-ish work
that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform.
The idea is to not burden the common DEFAULT case, and to keep the
control flow relatively simple.

I'm sure that if I was in your position I'd say something similar. It
is complicated in subtle ways, that looks like they might not matter,
but actually do. I am working off a fair variety of test cases, which
really came in handy. I remember thinking that I'd simplified it a
couple of times back in August or September, only to realize that I'd
regressed a case that I cared about. I eventually realized that I
needed to come up with a comprehensive though relatively fast test
suite, which seems essential for refactoring _bt_findsplitloc(), and
maybe even for fully understanding how _bt_findsplitloc() works.

Another complicating factor is that I have to worry about the number
of cycles used under a buffer lock (not just the impact on space
utilization).

With all of that said, I am willing to give it another try. You've
seen opportunities to refactor that I missed before now. 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-28 Thread Heikki Linnakangas

On 04/12/2018 05:10, Peter Geoghegan wrote:

Attached is v9, ...


I spent some time reviewing this. I skipped the first patch, to add a 
column to pg_depend, and I got through patches 2, 3 and 4. Impressive 
results, and the code looks sane.


I wrote a laundry list of little comments on minor things, suggested 
rewordings of comments etc. I hope they're useful, but feel free to 
ignore/override my opinions of any of those, as you see best.


But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate 
commit, first? It seems like a good thing for readability on its own, 
and would slim the big main patch. (And make sure to credit Andrey for 
that idea in the commit message.)



2. In the "Treat heap TID as part of the nbtree key space" patch:


  * Build an insertion scan key that contains comparison data from 
itup
  * as well as comparator routines appropriate to the key datatypes.
  *
+ * When itup is a non-pivot tuple, the returned insertion scan key 
is
+ * suitable for finding a place for it to go on the leaf level.  
When
+ * itup is a pivot tuple, the returned insertion scankey is 
suitable
+ * for locating the leaf page with the pivot as its high key (there
+ * must have been one like it at some point if the pivot tuple
+ * actually came from the tree).
+ *
+ * Note that we may occasionally have to share lock the metapage, 
in
+ * order to determine whether or not the keys in the index are 
expected
+ * to be unique (i.e. whether or not heap TID is treated as a 
tie-breaker
+ * attribute).  Callers that cannot tolerate this can request that 
we
+ * assume that this is a heapkeyspace index.
+ *
  * The result is intended for use with _bt_compare().
  */
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)


This 'assumeheapkeyspace' flag feels awkward. What if the caller knows 
that it is a v3 index? There's no way to tell _bt_mkscankey() that. 
(There's no need for that, currently, but seems a bit weird.)


_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which 
calls _bt_mkscankey(). It's holding a lock on the page being split. Do 
we risk deadlock by locking the metapage at the same time?


I don't have any great ideas on what to do about this, but it's awkward 
as it is. Can we get away without the new argument? Could we somehow 
arrange things so that rd_amcache would be guaranteed to already be set?



3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard 
to understand. I've spent a while looking at it now, and I think I have 
a vague understanding of what things it takes into consideration, but I 
don't understand why it performs those multiple stages, what each stage 
does, and how that leads to an overall strategy. I think a rewrite would 
be in order, to make that more understandable. I'm not sure what exactly 
it should look like, though.


If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or 
SINGLE_VALUE modes, it has to redo a lot of the work that was done in 
the DEFAULT mode already. That's probably not a big deal in practice, 
performance-wise, but I feel that it's another hint that some 
refactoring would be in order.


One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at 
that location. Like the current code does. For each offset, calculate a 
"penalty" based on two factors:


* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to 
store the heap TID


Define the penalty function so that having to add a heap TID to the 
pivot tuple is considered very expensive, more expensive than anything 
else, and truncating away other attributes gives a reward of some size.
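
One purely illustrative way to write that down (arbitrary constants;
nothing here is from the patch):

#include <stdbool.h>
#include <stdlib.h>

#define ATTR_KEPT_COST  64      /* each key attribute the pivot must keep */
#define HEAP_TID_COST   100000  /* dwarfs anything the other terms can save */

static int
split_penalty(int leftfree, int rightfree, int pivot_natts,
              bool pivot_needs_heap_tid)
{
    int     penalty = abs(leftfree - rightfree);

    /* charging for kept attributes is the same as rewarding truncation */
    penalty += pivot_natts * ATTR_KEPT_COST;
    if (pivot_needs_heap_tid)
        penalty += HEAP_TID_COST;
    return penalty;
}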


However, naively computing the penalty upfront for every offset would be 
a bit wasteful. Instead, start from the middle of the page, and walk 
"outwards" towards both ends, until you find a "good enough" penalty.


Or something like that...


Now, the laundry list of smaller items:

- laundry list begins -

1st commits commit message:


Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, at least in principle.


What do you mean by "at least in principle"?


Secondary index insertions will descend
straight to the leaf page that they'll insert on to (unless there is a
concurrent page split).


What is a "Secondary" index insertion?


Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-03 Thread Peter Geoghegan
On Mon, Dec 3, 2018 at 7:10 PM Peter Geoghegan  wrote:
> Attached is v9, which does things that way. There are no interesting
> changes, though I have set things up so that a later patch in the
> series can add "dynamic prefix truncation" -- I do not include any
> such patch in v9, though. I'm going to start a new thread on that
> topic, and include the patch there, since it's largely unrelated to
> this work, and in any case still isn't in scope for Postgres 12 (the
> patch is still experimental, for reasons that are of general
> interest).

The dynamic prefix truncation thread that I started:

https://postgr.es/m/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-01 Thread Peter Geoghegan
On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Just for the information, cfbot says there are problems on windows:
>
> src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
> undeclared identifier

Thanks. Looks like I should have used PG_INT32_MAX.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-01 Thread Dmitry Dolgov
> On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan  wrote:
>
> Attached is v8 of the patch series, which has some relatively minor changes:

Thank you for working on this patch,

Just for the information, cfbot says there are problems on windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
undeclared identifier



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-04 Thread Peter Geoghegan
On Sun, Nov 4, 2018 at 8:21 AM Andrey Lepikhov
 wrote:
> I mean that your code has no problems that I can detect by
> regression tests or by the retail index tuple deletion patch.
> The difference in the number of dropped objects is not a problem. It is
> caused by pos 2293 - 'else if (thisobj->objectSubId == 0)' - in
> catalog/dependency.c, and it is legal behavior: the column row object is
> deleted without any report because we already decided to drop its whole
> table.

The behavior implied by using ASC heap TID order is always "legal",
but it may cause a regression in certain functionality -- something
that an ordinary user might complain about. There were some changes
when DESC heap TID order is used too, of course, but those were safe
to ignore (it seemed like nobody could ever care). It might have been
okay to just use DESC order, but since it now seems like I must use
ASC heap TID order for performance reasons, I have to tackle a couple
of these issues head-on (e.g.  'cannot drop trigger trg1').

> Also, I checked the triggers test. The difference in the ERROR message
> 'cannot drop trigger trg1' is caused by a different order of tuples in
> the relation with the dependDependerIndexId relid. It is legal behavior,
> and we can simply replace the test results.

Let's look at this specific "trg1" case:

"""
 create table trigpart (a int, b int) partition by range (a);
 create table trigpart1 partition of trigpart for values from (0) to (1000);
 create trigger trg1 after insert on trigpart for each row execute
procedure trigger_nothing();
 ...
 drop trigger trg1 on trigpart1; -- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger
trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table
trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
"""

The original hint suggests "you need to drop the object on the
partition parent instead of its child", which is useful. The new hint
suggests "instead of dropping the trigger on the partition child,
maybe drop the child itself!". That's almost an insult to the user.

Now, I suppose that I could claim that it's not my responsibility to
fix this, since we get the useful behavior only due to accidental
implementation details. I'm not going to take that position, though. I
think that I am obliged to follow both the letter and the spirit of
the law. I'm almost certain that this regression test was written
because somebody specifically cared about getting the original, useful
message. The underlying assumptions may have been a bit shaky, but we
all know how common it is for software to evolve to depend on
implementation-defined details. We've all written code that does it,
but hopefully it didn't hurt us much because we also wrote regression
tests that exercised the useful behavior.

> Maybe you know of other problems with the patch?

Just the lack of pg_upgrade support. That is progressing nicely,
though. I'll probably have that part in the next revision of the
patch. I've found what looks like a workable approach, though I need
to work on a testing strategy for pg_upgrade.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-04 Thread Andrey Lepikhov




On 04.11.2018 9:31, Peter Geoghegan wrote:

On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
 wrote:

I applied your patches on top of master. After correcting the tests
(related to TID ordering in the DROP...CASCADE operation on index
relations), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test - the 'drop cascades to
62 other objects' problem - I verified an Álvaro Herrera hypothesis [1],
and it is true. You can verify it by tracking the
object_address_present_add_flags() routine's return value.


I'll have to go and fix the problem directly, so that ASC sort order
can be used.


Some doubts remain, however, regarding the 'triggers' test.
Could you specify which test failures you mean?


Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.


I mean that your code has no problems that I can detect by regression
tests or by the retail index tuple deletion patch.
The difference in the number of dropped objects is not a problem. It is
caused by pos 2293 - 'else if (thisobj->objectSubId == 0)' - in
catalog/dependency.c, and it is legal behavior: the column row object is
deleted without any report because we already decided to drop its whole
table.


Also, I checked the triggers test. The difference in the ERROR message
'cannot drop trigger trg1' is caused by a different order of tuples in
the relation with the dependDependerIndexId relid. It is legal behavior,
and we can simply replace the test results.


Maybe you know of other problems with the patch?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-03 Thread Peter Geoghegan
On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
 wrote:
> I applied your patches on top of master. After correcting the tests
> (related to TID ordering in the DROP...CASCADE operation on index
> relations), 'make check-world' passed successfully many times.
> In the case of the 'create view' regression test - the 'drop cascades
> to 62 other objects' problem - I verified an Álvaro Herrera hypothesis
> [1], and it is true. You can verify it by tracking the
> object_address_present_add_flags() routine's return value.

I'll have to go and fix the problem directly, so that ASC sort order
can be used.

> Some doubts remain, however, regarding the 'triggers' test.
> Could you specify which test failures you mean?

Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.

Since I'm going to have to fix the problem head-on, I'll have to study
it in detail anyway.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-03 Thread Peter Geoghegan
On Fri, Nov 2, 2018 at 5:00 PM Peter Geoghegan  wrote:
> I had the opportunity to discuss this patch at length with Heikki
> during pgConf.EU.

> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used.

I've found that TPC-C testing with ASC heap TID order fixes the
regression that I've been concerned about these past few weeks. Making
this change leaves the patch a little bit faster than the master
branch for TPC-C, while still leaving TPC-C indexes about as small as
they were with v6 of the patch (i.e. much smaller). I now get about a
1% improvement in transaction throughput, an improvement that seems
fairly consistent. It seems likely that the next revision of the patch
series will be an unambiguous across the board win for performance. I
think that I come out ahead with ASC heap TID order because that has
the effect of reducing the volume of WAL generated by page splits.
Page splits are already optimized for splitting right, not left.

I should thank Heikki for pointing me in the right direction here.

--
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-03 Thread Andrey Lepikhov




On 03.11.2018 5:00, Peter Geoghegan wrote:

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)


I applied your patches on top of master. After correcting the tests
(related to TID ordering in the DROP...CASCADE operation on index
relations), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test - the 'drop cascades to
62 other objects' problem - I verified an Álvaro Herrera hypothesis [1],
and it is true. You can verify it by tracking the
object_address_present_add_flags() routine's return value.

Some doubts remain, however, regarding the 'triggers' test.
Could you specify which test failures you mean?

[1] 
https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-02 Thread Peter Geoghegan
On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov
 wrote:
> Documentation is full and clear. All non-trivial logic is commented
> accurately.

Glad you think so.

I had the opportunity to discuss this patch at length with Heikki
during pgConf.EU. I don't want to speak on his behalf, but I will say
that he seemed to understand all aspects of the patch series, and
seemed generally well disposed towards the high level design. The
high-level design is the most important aspect -- B-Trees can be
optimized in many ways, all at once, and we must be sure to come up
with something that enables most or all of them. I really care about
the long term perspective.

That conversation with Heikki eventually turned into a conversation
about reimplementing GIN using the nbtree code, which is actually
related to my patch series (sorted on heap TID is the first step to
optional run length encoding for duplicates). Heikki seemed to think
that we can throw out a lot of the optimizations within GIN, and add a
few new ones to nbtree, while still coming out ahead. This made the
general nbtree-as-GIN idea (which we've been talking about casually
for years) seem a lot more realistic to me. Anyway, he requested that
I support this long term goal by getting rid of the DESC TID sort
order thing -- that breaks GIN-style TID compression. It also
increases the WAL volume unnecessarily when a page is split that
contains all duplicates.

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)

> Patch applies cleanly on top of current master. Regression tests passed
> and my "Retail Indextuple deletion" use cases works without mistakes.

Cool.

> The new BTScanInsert structure reduces the parameter lists of many
> functions and looks fine. But it contains an optimization part (the
> 'restorebinsrch' field et al.) that is used very locally in the code -
> the _bt_findinsertloc()->_bt_binsrch() call path. Maybe you could
> localize this logic into a separate struct, passed to _bt_binsrch() as a
> pointer; other routines could then pass NULL. That might simplify use of
> the struct.

Hmm. I see your point. I did it that way because the knowledge of
having cached an upper and lower bound for a binary search of a leaf
page needs to last for a relatively long time. I'll look into it
again, though.

> Due to the optimization, _bt_binsrch() has roughly doubled in size.
> Maybe you could move this to some service routine?

Maybe. There are some tricky details that seem to work against it.
I'll see if it's possible to polish that some more, though.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-02 Thread Andrey Lepikhov

I am doing the code review.
For now, it covers the first patch - v6-0001... - which is dedicated to 
logical duplicate ordering.


Documentation is full and clear. All non-trivial logic is commented 
accurately.


Patch applies cleanly on top of current master. Regression tests passed 
and my "Retail Indextuple deletion" use cases works without mistakes.

But I have two comments on the code.
The new BTScanInsert structure reduces the parameter lists of many 
functions and looks fine. But it contains an optimization part (the 
'restorebinsrch' field et al.) that is used very locally in the code - 
the _bt_findinsertloc()->_bt_binsrch() call path. Maybe you could 
localize this logic into a separate struct, passed to _bt_binsrch() as a 
pointer; other routines could then pass NULL. That might simplify use of 
the struct.
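
For example, something like this hypothetical struct (only
'restorebinsrch' comes from your patch; the other names are invented
here for illustration):

#include <stdbool.h>

typedef unsigned short OffsetNumber;    /* stand-in for the PostgreSQL typedef */

/*
 * Cached bounds from a previous binary search of the same leaf page.
 * _bt_findinsertloc() could pass this to _bt_binsrch(); all other callers
 * would simply pass NULL.
 */
typedef struct BTBinsrchCache
{
    bool            restorebinsrch; /* reuse the bounds below? */
    OffsetNumber    low;            /* cached lower bound */
    OffsetNumber    high;           /* cached upper bound */
} BTBinsrchCache;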


Due to the optimization, _bt_binsrch() has roughly doubled in size. 
Maybe you could move this to some service routine?



--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-23 Thread Peter Geoghegan
On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov
 wrote:
> I have the same problem with the background heap & index cleaner (based
> on your patch). In that case the bottleneck is the WAL record that I
> need to write for each cleaned block, and the locks that are held while
> the WAL record is being written.

Part of the problem here is that v6 uses up to 25 candidate split
points, even during regularly calls to _bt_findsplitloc(). That was
based on some synthetic test-cases. I've found that I can get most of
the benefit in index size with far fewer spilt points, though. The
extra work done with an exclusive buffer lock held will be
considerably reduced in v7. I'll probably post that in a couple of
weeks, since I'm in Europe for pgConf.EU. I don't fully understand the
problems here, but even still I know that what you were testing wasn't
very well optimized for write-heavy workloads. It would be especially
bad with pgbench, since there isn't much opportunity to reduce the
size of indexes there.

> Maybe you will do a test without writing any data to disk?

Yeah, I should test that on its own. I'm particularly interested in
TPC-C, because it's a particularly good target for my patch. I can
find a way of only executing the read TPC-C queries, to see where they
are on their own. TPC-C is particularly write-heavy, especially
compared to the much more recent though less influential TPC-E
benchmark.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-23 Thread Andrey Lepikhov




On 19.10.2018 0:54, Peter Geoghegan wrote:

I would welcome any theories as to what could be the problem here. I'm
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.


I have the same problem with the background heap & index cleaner (based 
on your patch). In that case the bottleneck is the WAL record that I need 
to write for each cleaned block, and the locks that are held while the 
WAL record is being written.

Maybe you will do a test without writing any data to disk?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-19 Thread Peter Geoghegan
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund  wrote:
> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I tried this. It looks like we're calling FlushBuffer() with more than
a single LWLock held (not just the single buffer lock) somewhat *less*
with the patch. This is a positive sign for the patch, but also means
that I'm no closer to figuring out what's going on.

I tested a case with a 1GB shared_buffers + a TPC-C database sized at
about 10GB. I didn't want the extra LOG instrumentation to influence
the outcome.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund  wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that. How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's
backend_flush_after=0 in the benchmark I posted. See the
"postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                         List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                          List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0"
here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this.

Thanks
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public'
is the patch case. Sorry for not mentioning it initially.

--
Peter Geoghegan
(Sent from my phone)


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Andres Freund
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on
that. How large is the database here, and how high is shared_buffers?


> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there's new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO
when evicting the previous contents of the victim buffer, and obviously
that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.


> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
> 
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear on which runs are what here. I assume "public" is your
patchset, and master is master? Do you reset the stats in between runs?

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan  wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark
these past few weeks, which has been very frustrating -- the picture
is very mixed. I'm testing a patch that has evolved from v6, but isn't
too different.

In one way, the patch does exactly what it's supposed to do when these
benchmarks are run: it leaves indexes *significantly* smaller than the
master branch will on the same (rate-limited) workload, without
affecting the size of tables in any noticeable way. The numbers that I
got from my much earlier synthetic single client benchmark mostly hold
up. For example, the stock table's primary key is about 35% smaller,
and the order line index is only about 20% smaller relative to master,
which isn't quite as good as in the synthetic case, but I'll take it
(this is all because of the
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).
However, despite significant effort, and despite the fact that the
index shrinking is reliable, I cannot yet consistently show an
increase in either transaction throughput, or transaction latency.

I can show a nice improvement in latency on a slightly-rate-limited
TPC-C workload when backend_flush_after=0 (something like a 40%
reduction on average), but that doesn't hold up when oltpbench isn't
rate-limited and/or has backend_flush_after set. Usually, there is a
1% - 2% regression, despite the big improvements in index size, and
despite the big reduction in the amount of buffers that backends must
write out themselves.

The obvious explanation is that throughput is decreased due to our
doing extra work (truncation) while under an exclusive buffer lock.
However, I've worked hard on that, and, as I said, I can sometimes
observe a nice improvement in latency. This makes me doubt the obvious
explanation. My working theory is that this has something to do with
shared_buffers eviction. Maybe we're making worse decisions about
which buffer to evict, or maybe the scalability of eviction is hurt.
Perhaps both.

You can download results from a recent benchmark to get some sense of
this. It includes latency and throughput graphs, plus detailed
statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I'm
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.

--
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-30 Thread Peter Geoghegan
On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov
 wrote:
> I am reviewing this patch too, and I join Peter Eisentraut's opinion
> about splitting the patch into a hierarchy of two or three patches:
> "functional" (the tid stuff) and "optimizational" (suffix truncation &
> splitting). My reasons are simplification of code review, investigation,
> and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap
TID stuff from the suffix truncation stuff. At least not without
making the patch even more complicated, for no benefit. I will split
out the "brain" of the patch (the _bt_findsplitloc() stuff, which
decides on a split point using sophisticated rules) from the "brawn"
(the actually changes to how index scans work, including the heap TID
stuff, as well as the code for actually physically performing suffix
truncation). The brain of the patch is where most of the complexity
is, as well as most of the code. The brawn of the patch is _totally
unusable_ without intelligence around split points, but I'll split
things up along those lines anyway. Doing so should make the whole
design a little easier to follow.

> Benchmarking is not clear-cut right now: possible performance
> degradation from TID ordering interferes with the positive effects of
> the optimizations in a non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've
been using pgbench, which didn't show any. That's not a sympathetic
case for the patch, though it would be nice to confirm if there was
some small improvement there. I've seen contradictory results (slight
improvements and slight regressions), but that was with a much earlier
version, so it just isn't relevant now. pgbench is mostly interesting
as a thing that we want to avoid regressing.

Once I post the next version, it would be great if somebody could use
HammerDB's OLTP test, which seems like the best fair use
implementation of TPC-C that's available. I would like to make that
the "this is why you should care, even if you happen to not believe in
the patch's strategic importance" benchmark. TPC-C is clearly the most
influential database benchmark ever, so I think that that's a fair
request. (See the TPC-C commentary at
https://www.hammerdb.com/docs/ch03s02.html, for example.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-28 Thread Andrey Lepikhov

28.09.2018 23:08, Peter Geoghegan wrote:

On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
 wrote:

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.


I'll do my best. I don't think I can sensibly split out suffix
truncation from the TID stuff -- those seem truly inseparable, since
my mental model for suffix truncation breaks without fully unique
keys. I can break out all the cleverness around choosing a split point
into its own patch, though -- _bt_findsplitloc() has only been changed
to give weight to several factors that become important. It's the
"brain" of the optimization, where 90% of the complexity actually
lives.

Removing the _bt_findsplitloc() changes will make the performance of
the other stuff pretty poor, and in particular will totally remove the
benefit for cases that "become tired" on the master branch. That could
be slightly interesting, I suppose.


I am reviewing this patch too, and I join Peter Eisentraut's opinion 
about splitting the patch into a hierarchy of two or three patches: 
"functional" (the tid stuff) and "optimizational" (suffix truncation & 
splitting). My reasons are simplification of code review, investigation, 
and benchmarking.
Benchmarking is not clear-cut right now: possible performance degradation 
from TID ordering interferes with the positive effects of the 
optimizations in a non-trivial way.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-28 Thread Peter Geoghegan
On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
 wrote:
> So.  I don't know much about the btree code, so don't believe anything I
> say.

I think that showing up and reviewing this patch makes you somewhat of
an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily
as a project to make B-Trees less bloated when there are many
duplicates, with maybe as many as a dozen or more secondary benefits.
That's what I thought it would say in the release notes, even though
the patch was always a broader strategic thing. Now I think that the
TPC-C multiple insert point bloat issue might be the primary headline
benefit, though.

I hate to add more complexity to get it to work well, but just look at
how much smaller the indexes are following an initial bulk load (bulk
insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're
seeing the full picture for TPC-C bulk loading here (actually, there
is another index that has an identical definition to
oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted
as redundant). As you can see, all the largest indexes are
*significantly* smaller, with the exception of
oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this
improvement until I post the next version, though, since this is a
brand new development. Note that VACUUM hasn't been run at all, and
doesn't need to be run, as there are no dead tuples. Note also that
this has *nothing* to do with getting tired -- almost all of these
indexes are unique indexes.

Note that I'm also testing TPC-E and TPC-H in a very similar way,
which have both been improved noticeably, but to a degree that's much
less compelling than what we see with TPC-C. They have "getting tired"
cases that benefit quite a bit, but those are the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I
think that HammerDB might be the way to go with TPC-C testing
Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search.  This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location.  This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well,
because while the TIDs may seem to be random in some sense, they have
various properties that are very useful from a high level, data life
cycle point of view. For insertions of duplicates, heap TID has
temporal locality --  you are only going to dirty one or two leaf
pages, rather than potentially dirtying dozens or hundreds.
Furthermore, heap TID is generally strongly correlated with primary
key values, so VACUUM can be much much more effective at killing
duplicates in low cardinality secondary indexes when there are DELETEs
with a range predicate on the primary key. This is a lot more
realistic than the 2018-07-09 test case, but it could still make just
as big a difference.

>  I didn't quite get this to fully work yet, but at least it
> doesn't blow up, and it gets the same regression test ordering
> differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating
a new problem. There are plenty of instances of "\set VERBOSITY terse"
in the regression tests already, for the same reason. If you run the
regression tests with ignore_system_indexes=on, there are very similar
failures [1].

> As far as the code is concerned, I agree with Andrey Lepikhov that one
> more abstraction layer that somehow combines the scankey and the tid or
> some combination like that would be useful, instead of passing the tid
> as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear
improvement. You can expect it in the next version.

> I think it might help this patch move along if it were split up a bit,
> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
> That way, it would also be easier to test out each piece separately.
> For example, 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-28 Thread Peter Eisentraut
On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So.  I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more.  The current method for
inserting a duplicate value into a btree is to go to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case we just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios when we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.
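
For reference, the loop in question (in _bt_findinsertloc()) looks
roughly like this simplified sketch -- not a verbatim copy, and with the
move-right step elided:

/*
 * Simplified sketch of the "getting tired" search for free space among
 * pages full of duplicates.  Loosely based on _bt_findinsertloc(); not
 * verbatim.
 */
while (PageGetFreeSpace(page) < itemsz)
{
    /*
     * Stop walking right if this is the rightmost page, if the page's
     * high key no longer matches the value being inserted, or -- with
     * roughly 1% probability per step -- because we've gotten "tired".
     * Stopping means splitting the current page, even if there is free
     * space on a page further to the right.
     */
    if (P_RIGHTMOST(opaque) ||
        _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
        random() <= (MAX_RANDOM_VALUE / 100))
        break;

    /* ... otherwise lock the right sibling and continue from there ... */
}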

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful.  You can use that to
postpone the bloat, but you won't stop it, and performance becomes terrible.

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search.  This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a
random location.  This should be equivalent since tids are effectively
random.  I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
 That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.  In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-24 Thread Peter Geoghegan
On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan  wrote:
> 3 modes
> ---
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective though totally complementary
to the other 3 -- "multiple insertion point" mode. Credit goes to
Kevin Grittner for pointing out that this technique exists about 2
years ago [1]. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges.  This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.
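
In code terms the choice is roughly the following sketch (both helper
names are hypothetical placeholders, not functions from the patch):

/*
 * Sketch of the "multiple insertion point" split choice described
 * above.  The helpers are hypothetical; the real heuristics for
 * detecting a localized ascending run still need to be worked out.
 */
OffsetNumber firstright;

if (looks_like_local_ascending_run(page, newitemoff, newitem))
    firstright = OffsetNumberNext(newitemoff);  /* new tuple ends the left half */
else
    firstright = choose_default_split_point(page, newitemoff, newitem);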

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
 Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
this is something that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out of control bloat on when people run
TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that there are multiple
pass-by-value attributes for the index, with the new/incoming tuple
having distinct-to-immediate-left-tuple attribute values for the last
column, but not the first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.

[1] 
https://postgr.es/m/CACjxUsN5fV0kV=yirxwa0s7lqoojuy7soptipdhucemhgwo...@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-20 Thread Peter Geoghegan
On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov
 wrote:
> Note that the interface of the _bt_moveright() and _bt_binsrch() functions,
> with the combination of scankey, scantid and nextkey parameters, carries
> too much semantic load.
> Every time I read the code I have to spend time remembering what these
> functions do exactly.
> Maybe the comments need to be rewritten.

I think that it might be a good idea to create an "BTInsertionScankey"
struct, or similar, since keysz, nextkey, the scankey array and now
scantid are all part of that, and are all common to these 4 or so
functions. It could have a flexible array at the end, so that we still
only need a single palloc(). I'll look into that.
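
Something along these lines, purely as a sketch of that idea (the name
and fields are illustrative, not a final design):

/*
 * Sketch of an insertion scan key struct that bundles the state which
 * currently travels as separate arguments.  Illustrative only.
 */
typedef struct BTInsertionScankey
{
    bool        nextkey;        /* land on first page > scankey, not >= ? */
    ItemPointer scantid;        /* heap TID tiebreaker, or NULL */
    int         keysz;          /* number of entries in scankeys[] */
    ScanKeyData scankeys[FLEXIBLE_ARRAY_MEMBER];    /* keeps it to one palloc() */
} BTInsertionScankey;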

> What do you think about submitting the patch to the next CF?

Clearly the project that you're working on is a difficult one. It's
easy for me to understand why you might want to take an iterative
approach, with lots of prototyping. Your patch needs attention to
advance, and IMV the CF is the best way to get that attention. So, I
think that it would be fine to go submit it now.

I must admit that I didn't even notice that your patch lacked a CF
entry. Everyone has a different process, perhaps.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-19 Thread Andrey Lepikhov
I am using the v5 version in the quick vacuum strategy and in the heap
cleaner (new patches will appear on the corresponding thread a little
later). It works fine and gives quick vacuum a 2-3% performance gain
over version v3 on my 24-core test server.
Note that the interface of the _bt_moveright() and _bt_binsrch()
functions, with the combination of scankey, scantid and nextkey
parameters, carries too much semantic load. Every time I read the code
I have to spend time remembering what these functions do exactly.
Maybe the comments need to be rewritten. For example, the
_bt_moveright() comments could include a phrase like:
nextkey=false: traverse to the next suitable index page if the current
page does not contain the value (scan key; scan tid).


What do you think about submitting the patch to the next CF?

12.09.2018 23:11, Peter Geoghegan wrote:

Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.
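
As a very rough illustration of the kind of naive probing described
above (just a sketch of the general idea, not the patch's actual code):

/*
 * Illustrative sketch only: measure how "discriminating" a candidate
 * split point is by the length of the byte prefix shared by the
 * prospective last tuple on the left and first tuple on the right.  A
 * shorter shared prefix means the tuples diverge earlier, which makes
 * the split point more attractive for suffix truncation.
 */
static size_t
shared_byte_prefix(const char *lastleft, size_t leftlen,
                   const char *firstright, size_t rightlen)
{
    size_t  smaller = (leftlen < rightlen) ? leftlen : rightlen;
    size_t  i;

    for (i = 0; i < smaller; i++)
    {
        if (lastleft[i] != firstright[i])
            break;
    }

    return i;
}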

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way that can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] 
https://postgr.es/m/cah2-wzmf0fvvhu+sszpgw4qe9t--j_dmxdx3it5jcdb8ff2...@mail.gmail.com



--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-08-01 Thread Peter Geoghegan
On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov
 wrote:
> I am using the v3 version of the patch for retail index tuple deletion, and
> from time to time I hit a regression test failure (see attachment).
> As far as I can see in regression.diff, the problem is an unstable ordering
> of DROP ... CASCADE deletions.
> Most frequently I get the failure on the 'updatable views' test.
> I check the nbtree invariants during all tests, but the index relations are
> in a consistent state the whole time.
> My hypothesis is that the ordering of logical duplicates in indexes on the
> pg_depend relation is unstable.
> But the 'updatable views' test does not contain any sources of instability:
> concurrent insertions, updates, vacuum and so on. This fact discourages me.
> Do you have any ideas about this problem?

It's caused by an implicit dependency on the order of items in an
index. See 
https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it
happens in a new place. It seems to have finally stopped happening.
Note that this is a preexisting issue; there are already places in the
regression tests where we paper over the problem in a similar way. I
notice that it tends to happen when the machine running the regression
tests is heavily loaded.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-08-01 Thread Andrey Lepikhov
I am using the v3 version of the patch for retail index tuple deletion,
and from time to time I hit a regression test failure (see attachment).
As far as I can see in regression.diff, the problem is an unstable
ordering of DROP ... CASCADE deletions.

Most frequently I get the failure on the 'updatable views' test.
I check the nbtree invariants during all tests, but the index relations
are in a consistent state the whole time.
My hypothesis is that the ordering of logical duplicates in indexes on
the pg_depend relation is unstable.
But the 'updatable views' test does not contain any sources of
instability: concurrent insertions, updates, vacuum and so on. This
fact discourages me.

Do you have any ideas about this problem?


18.07.2018 00:21, Peter Geoghegan wrote:

Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to
_bt_compare().
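
Concretely, any TID value that no heap tuple can ever have will do as
the sentinel; the sketch below uses block 0 with offset number 0 purely
as an illustration of the idea, not necessarily the patch's exact
choice:

/*
 * Sketch of a sentinel scan TID: offset number 0 is never a valid heap
 * TID offset, so (0, 0) sorts below every real heap TID while still
 * being an explicit value rather than the "minus infinity" that an
 * absent scantid represents.  Illustrative only.
 */
ItemPointerData sentinel_scantid;

ItemPointerSet(&sentinel_scantid, 0, InvalidOffsetNumber);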

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps its related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look credible as a stand-alone
patch, but I have to admit that there are probably still significant
gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-07-02 Thread Peter Geoghegan
On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan  wrote:
> I attach an unfinished prototype of suffix truncation, that also
> sometimes *adds* a new attribute in pivot tuples. It adds an extra
> heap TID from the leaf level when truncating away non-distinguishing
> attributes during a leaf page split, though only when it must. The
> patch also has nbtree treat heap TID as a first class part of the key
> space of the index. Claudio wrote a patch that did something similar,
> though without the suffix truncation part [2] (I haven't studied his
> patch, to be honest). My patch is actually a very indirect spin-off of
> Anastasia's covering index patch, and I want to show what I have in
> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members. On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I attach a revised version, which is still very much of prototype
quality, but manages to solve a few of the problems that v1 had.
Andrey Lepikhov (CC'd) asked me to post any improved version I might
have for use with his retail index tuple deletion patch, so I thought
I'd post what I have.

The main development for v2 is that the sort order of the implicit
heap TID attribute is flipped. In v1, it was in "ascending" order. In
v2, comparisons of heap TIDs are inverted to make the attribute order
"descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are
repeated insertions of duplicates. Currently, this tends to result in
page splits of the leftmost leaf page among pages that mostly consist
of the same duplicated value. This means that the destabilizing impact
on DROP SCHEMA ... CASCADE regression test output noted before [1] is
totally eliminated. There is now only a single trivial change to
regression test "expected" files, whereas in v1 dozens of "expected"
files had to be changed, often resulting in less useful reports for
the user.

* The performance regression I observed with various pgbench workloads
seems to have gone away, or is now within the noise range. A patch
like this one requires a lot of validation and testing, so this should
be taken with a grain of salt.
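
To be concrete about what "inverted" means, the tiebreaker comparison
is simply negated, roughly like this minimal sketch (not the patch's
exact code):

/*
 * Minimal sketch of a descending heap TID tiebreaker: negate the usual
 * ItemPointer comparison so that larger TIDs sort first.  Illustrative
 * only.
 */
static inline int
bt_compare_scantid_desc(ItemPointer itup_tid, ItemPointer scan_tid)
{
    return -(int) ItemPointerCompare(itup_tid, scan_tid);
}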

I may have been too quick to give up on my original ambition of
writing a stand-alone patch that can be justified entirely on its own
merits, without being tied to some much more ambitious project like
retail index tuple deletion by VACUUM, or zheap's deletion marking. I
still haven't tried to replace the kludgey handling of unique index
enforcement, even though that would probably have a measurable
additional performance benefit. I think that this patch could become
an unambiguous win.

[1] 
https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-mvcm_ph...@mail.gmail.com
-- 
Peter Geoghegan


v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch
Description: Binary data


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Peter Geoghegan
On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila  wrote:
> Both values will be present in the index, but the old value will be
> delete-marked.  It is correct that we can't remove the value (index
> tuple) from the index until it is truly dead (not visible to anyone),
> but during a delete or index-update operation, we need to traverse the
> index to mark the entries as delete-marked.  See, at this stage, I
> don't want to go in too much detail discussion of how delete-marking
> will happen in zheap and also I am not sure this thread is the right
> place to discuss details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.


-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Amit Kapila
On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan  wrote:
> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila  wrote:
>>> I imagine that retail index tuple deletion (the whole point of this
>>> project) would be run by a VACUUM-like process that kills tuples that
>>> are dead to everyone. Even with something like zheap, you cannot just
>>> delete index tuples until you establish that they're truly dead. I
>>> guess that the delete marking stuff that Robert mentioned marks tuples
>>> as dead when the deleting transaction commits.
>>>
>>
>> No, I don't think that is the case because we want to perform in-place
>> updates for indexed-column-updates.  If we won't delete-mark the index
>> tuple before performing in-place update, then we will have two tuples
>> in the index which point to the same heap-TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using
> some now-obsolete key values get to the heap tuple via an index scan
> if there are no index tuples that stick around until "recently dead"
> heap tuples become "fully dead"? How can you avoid keeping around both
> old and new index tuples at the same time?
>

Both values will be present in the index, but the old value will be
delete-marked.  It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked.  See, at this stage, I
don't want to go in too much detail discussion of how delete-marking
will happen in zheap and also I am not sure this thread is the right
place to discuss details of that technology.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Peter Geoghegan
On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila  wrote:
>> I imagine that retail index tuple deletion (the whole point of this
>> project) would be run by a VACUUM-like process that kills tuples that
>> are dead to everyone. Even with something like zheap, you cannot just
>> delete index tuples until you establish that they're truly dead. I
>> guess that the delete marking stuff that Robert mentioned marks tuples
>> as dead when the deleting transaction commits.
>>
>
> No, I don't think that is the case because we want to perform in-place
> updates for indexed-column-updates.  If we won't delete-mark the index
> tuple before performing in-place update, then we will have two tuples
> in the index which point to the same heap-TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Amit Kapila
On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan  wrote:
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  
> wrote:
>> Way back when I was dabbling in this kind of endeavor, my main idea to
>> counteract that, and possibly improve performance overall, was a
>> microvacuum kind of thing that would do some on-demand cleanup to
>> remove duplicates or make room before page splits. Since nbtree
>> uniqueification enables efficient retail deletions, that could end up
>> as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits.
>

No, I don't think that is the case because we want to perform in-place
updates for indexed-column-updates.  If we won't delete-mark the index
tuple before performing in-place update, then we will have two tuples
in the index which point to the same heap-TID.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Claudio Freire
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  
> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently bt_vacuum_one_page does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to
> counteract that, and possibly improve performance overall, was a
> microvacuum kind of thing that would do some on-demand cleanup to
> remove duplicates or make room before page splits. Since nbtree
> uniqueification enables efficient retail deletions, that could end up
> as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.
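
For context, that last-second cleanup looks roughly like the following
simplified sketch in the insertion path (not verbatim; split_needed is
a hypothetical flag):

/*
 * Simplified sketch of the existing last-second cleanup before a page
 * split, loosely based on the nbtree insertion path.
 */
if (P_HAS_GARBAGE(opaque))
{
    /* Remove LP_DEAD-marked items before resorting to a page split. */
    _bt_vacuum_one_page(rel, buf, heapRel);

    if (PageGetFreeSpace(page) >= itemsz)
        split_needed = false;   /* the split turned out to be avoidable */
}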

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.

-- 
Peter Geoghegan


