Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-04-01 Thread Peter Geoghegan
On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan  wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan  wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initdb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding
work on integrating the patch series.

I have some closing thoughts that I would like to share to close out
the project. I was casually discussing this project over IM with Robert
the other day. He asked a question I'd often asked myself about the
"split after new item" heuristics: What if you're wrong? What if some
"black swan" type workload fools your heuristics into bloating an
index uncontrollably?

I gave an answer to his question that may have seemed kind of
inscrutable. My intuition about the worst case for the heuristics is
based on its similarity to the worst case for quicksort. Any
real-world instance of quicksort going quadratic is essentially a case
where we *consistently* do the wrong thing when selecting a pivot. A
random pivot selection will still perform reasonably well, because
we'll still choose the median pivot on average. A malicious actor will
always be able to fool any quicksort implementation into going
quadratic [1] in certain circumstances. We're defending against
Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though.
Attached patch removes every heuristic that limits the application of
the "split after new item" optimization (it doesn't force the
optimization in the case of rightmost splits, or in the case where the
new item happens to be first on the page, since the caller isn't prepared
for that). This is an attempt to come up with a wildly exaggerated
worst case. Nevertheless, the consequences are not actually all that
bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the
patch has a final index that's about 1% larger. However, it was about
16% smaller compared to Postgres without the patch, so this is not a
problem.

* Most of the TPC-C indexes are actually slightly smaller, because we
didn't quite go as far as we could have (TPC-C strongly rewards this
optimization). 8 out of the 10 indexes are either smaller or
unchanged. The customer name index is about 28% larger, though. The
oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new item" optimization,
and yet the picture isn't so bad here either. The holding history PK
is about 40% bigger, which is quite bad, and the biggest regression
overall. However, in other affected cases indexes are about 15%
larger, which is not that bad.

Also attached are the regressions from my test suite in the form of
diff files -- these are the full details of the regressions, just in
case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total
certainty that there can never be a "black swan" workload that the
heuristics consistently mishandle, leading to pathological
performance. However, I think it's fair to say that the risk of that
happening has been managed well. The attached test patch literally
removes any restraint on applying the optimization, and yet we
arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help,
especially Heikki.

[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf
-- 
Peter Geoghegan


always-split-after-new-item.patch
Description: Binary data


land_balanced.diff
Description: Binary data


tpcc_balanced.diff
Description: Binary data


tpce_balanced.diff
Description: Binary data


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-22 Thread Peter Geoghegan
On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan  wrote:
> I've committed the first 4 patches. Many thanks to Heikki for his very
> valuable help! Thanks also to the other reviewers.
>
> I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and
without the "split after new tuple" optimization patch, the largest
system catalog indexes shrink quite noticeably:

Master
==
pg_depend_depender_index 1456 kB
pg_depend_reference_index 1416 kB
pg_class_tblspc_relfilenode_index 224 kB

Patch
=
pg_depend_depender_index 1088 kB   -- ~25% smaller
pg_depend_reference_index 1136 kB   -- ~20% smaller
pg_class_tblspc_relfilenode_index 160 kB -- 28% smaller
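
For reference, index sizes like the ones above can be pulled with a
query along these lines -- just a sketch, not necessarily the exact
query that was used:

SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
WHERE c.relname IN ('pg_depend_depender_index',
                    'pg_depend_reference_index',
                    'pg_class_tblspc_relfilenode_index')
ORDER BY pg_relation_size(c.oid) DESC;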

This is interesting to me because it is further evidence that the
problem that the patch targets is reasonably common. It's also
interesting to me because we benefit despite the fact that there are a lot
of duplicates in parts of these indexes; we vary our strategy across
different parts of the key space, which works well. We pack pages
tightly where they're full of duplicates, using the "single value"
strategy that I've already committed, while we apply the "split
after new tuple" optimization in parts of the index with localized
monotonically increasing insertions. If there were no duplicates in
the indexes, then they'd be about 40% smaller, which is exactly what
we see with the TPC-C indexes (they're all unique indexes, with very
few physical duplicates). Looks like the duplicates are mostly
bootstrap mode entries. Lots of the pg_depend_depender_index
duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for
example.
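
A sketch of one way to see those duplicate groupings (not necessarily
the query that was actually used here):

SELECT classid, objid, objsubid, count(*)
FROM pg_depend
GROUP BY classid, objid, objsubid
ORDER BY count(*) DESC
LIMIT 5;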

I also noticed one further difference: the pg_shdepend_depender_index
index grew from 40 kB to 48 kB. I guess that might count as a
regression, though I'm not sure that it should. I think that we would
do better if the volume of data in the underlying table was greater.
contrib/pageinspect shows that a small number of the leaf pages in the
improved cases are not very full at all, which is more than made up
for by the fact that many more pages are packed as if they were
created by a rightmost split (262 items of 24-byte tuples is exactly
consistent with that). IOW, I suspect that the extra page in
pg_shdepend_depender_index is due to a "local minimum".
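
A sketch of the kind of pageinspect query that shows per-page fill for
an index like this (block 0 is the metapage, so the scan starts at
block 1; the exact query used isn't shown here):

CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT s.blkno, s.live_items, s.avg_item_size, s.free_size
FROM generate_series(1, pg_relation_size('pg_shdepend_depender_index') / 8192 - 1) AS blkno,
     LATERAL bt_page_stats('pg_shdepend_depender_index', blkno::int) AS s
WHERE s.type = 'l'
ORDER BY s.blkno;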

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-21 Thread Peter Geoghegan
On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan  wrote:
> Heikki and I discussed this issue privately, over IM, and reached
> final agreement on remaining loose ends. I'm going to use his code for
> _bt_findsplitloc(). Plan to push a final version of the first four
> patches tomorrow morning PST.

I've committed the first 4 patches. Many thanks to Heikki for his very
valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-19 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 10:17 AM Peter Geoghegan  wrote:
> The big difference is that you make the possible call to
> _bt_stepright() conditional on this being a checkingunique index --
> the duplicate code is indented in that branch of _bt_findsplitloc().
> Whereas I break early in the loop when "checkingunique &&
> heapkeyspace".

Heikki and I discussed this issue privately, over IM, and reached
final agreement on remaining loose ends. I'm going to use his code for
_bt_findsplitloc(). Plan to push a final version of the first four
patches tomorrow morning PST.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan  wrote:
> Smarter choices on page splits pay off with higher client counts
> because they reduce contention at likely hot points. It's kind of
> crazy that the code in _bt_check_unique() sometimes has to move right,
> while holding an exclusive buffer lock on the original page and a
> shared buffer lock on its sibling page at the same time. It then has
> to hold a third buffer lock concurrently, this time on any heap pages
> it is interested in.

Actually, by the time we get to 16 clients, this workload does make
the indexes and tables smaller. Here is pg_buffercache output
collected after the first 16 client case:
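
(The output below is the kind of thing you get from a query along these
lines against pg_buffercache -- a sketch, since the exact query isn't
included here:)

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       b.relforknumber,
       pg_relation_size(c.oid, 'main') / 8192 AS size_main_rel_fork_blocks,
       count(*) AS buffer_count,
       avg(b.usagecount) AS avg_buffer_usg
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
  AND b.relforknumber = 0
  AND c.relname LIKE 'pgbench%'
GROUP BY c.relname, b.relforknumber
ORDER BY buffer_count DESC;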

Master
==

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.0000000000000000
 pgbench_branches      │             0 │                       284 │          284 │ 5.0000000000000000
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.0000000000000000


Patch
=

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.0000000000000000


The main fork for pgbench_history is larger with the patch, obviously,
but that's good. pgbench_accounts_pkey is about 4% smaller, which is
probably the most interesting observation that can be made here, but
the tables are also smaller. pgbench_accounts itself is ~2% smaller.
pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller.
Of course, the smaller tables were already very small, so maybe that
isn't important. I think that this is due to more effective pruning,
possibly because we get better lock arbitration as a consequence of
better splits, and also because duplicates are in heap TID order. I
haven't observed this effect with larger databases, which have been my
focus.

It isn't weird that shared_buffers doesn't have all the
pgbench_accounts blocks, since, of course, this is highly skewed by
design -- most blocks were never accessed from the table.

This effect seems to be robust, at least with this workload. The
second round of benchmarks (which have their own pgbench -i
initialization) show very similar amounts of bloat at the same point.
It may not be that significant, but it's also not a fluke.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 5:00 PM Robert Haas  wrote:
> Blech.  I think the patch has enough other advantages that it's worth
> accepting that, but it's not great.  We seem to keep finding reasons
> to reduce single client performance in the name of scalability, which
> is often reasonable but not wonderful.

The good news is that the quicksort that we now perform in
nbtsplitloc.c is not optimized at all. Heikki thought it premature to
optimize that, for example by inlining/specializing the quicksort. I
can make that 3x faster fairly easily, which could well change the
picture here. The code will be uglier that way, but not much more
complicated. I even prototyped this, and managed to make the serial
microbenchmarks I've used noticeably faster. This could very well fix
the problem here. The quicksort clearly showed up in perf profiles with
serial bulk loads.

> > However, this isn't completely
> > free (particularly the page split stuff), and it doesn't pay for
> > itself until the number of clients ramps up.
>
> I don't really understand that explanation.  It makes sense that more
> intelligent page split decisions could require more CPU cycles, but it
> is not evident to me why more clients would help better page split
> decisions pay off.

Smarter choices on page splits pay off with higher client counts
because they reduce contention at likely hot points. It's kind of
crazy that the code in _bt_check_unique() sometimes has to move right,
while holding an exclusive buffer lock on the original page and a
shared buffer lock on its sibling page at the same time. It then has
to hold a third buffer lock concurrently, this time on any heap page
it is interested in, each in turn, to check whether it contains a
possible conflict. gcov shows that that never happens with the regression
tests once the patch is applied (you can at least get away with only
having one buffer lock on a leaf page at all times in practically all
cases).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan  wrote:
> With pgbench scale factor 20, here are results for patch and master
> with a Gaussian distribution on my 8 thread/4 core home server, with
> each run reported lasting 10 minutes, repeating twice for client
> counts 1, 2, 8, 16, and 64, patch and master branch:
>
> 1 client master:
> tps = 7203.983289 (including connections establishing)
> 1 client patch:
> tps = 7012.575167 (including connections establishing)
>
> 2 clients master:
> tps = 13434.043832 (including connections establishing)
> 2 clients patch:
> tps = 13105.620223 (including connections establishing)

Blech.  I think the patch has enough other advantages that it's worth
accepting that, but it's not great.  We seem to keep finding reasons
to reduce single client performance in the name of scalability, which
is often reasonable but not wonderful.

> However, this isn't completely
> free (particularly the page split stuff), and it doesn't pay for
> itself until the number of clients ramps up.

I don't really understand that explanation.  It makes sense that more
intelligent page split decisions could require more CPU cycles, but it
is not evident to me why more clients would help better page split
decisions pay off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly
the same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are
~40% smaller, in part because lots of people outside the community
find TPC-C interesting, and in part because this patch series is
focused on cases where we currently do unusually badly (cases where
good intuitions about how B-Trees are supposed to perform break down
[1]). These pinpointable problems must affect a lot of users some of
the time, but certainly not all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages
to do that with pgbench, which doesn't do inserts into indexes, except
for those required for non-HOT updates. pgbench requires relatively
few page splits, but is in every other sense a high contention
workload.

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server, with
each run reported lasting 10 minutes, repeating twice for client
counts 1, 2, 8, 16, and 64, patch and master branch:

\set aid random_gaussian(1, 10 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass


(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms
1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms
8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms
16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms
64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass


(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms
1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms
2 clients patch:

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-18 Thread Peter Geoghegan
On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas  wrote:
> I'm getting a regression failure from the 'create_table' test with this:

> Are you seeing that?

Yes -- though the bug is in your revised v18, not the original v18,
which passed CFTester. Your revision fails on Travis/Linux, which is
pretty close to what I see locally, and much less subtle than the test
failures you mentioned:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665

"make check" did pass locally for me with your patch, but "make
check-world" (parallel recipe) did not.

The original v18 passed both CFTester tests about 15 hour ago:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402

I see the bug. You're not supposed to test this way with a heapkeyspace index:

> +   if (P_RIGHTMOST(lpageop) ||
> +   _bt_compare(rel, itup_key, page, P_HIKEY) != 0)
> +   break;

This is because the presence of scantid makes it almost certain that
you'll break when you shouldn't. You're doing it the old way, which is
inappropriate for a heapkeyspace index. Note that it would probably
take much longer to notice this bug if the "consider secondary
factors" patch was also applied, because then you would rarely have
cause to step right here (duplicates would never occupy more than a
single page in the regression tests). The test failures are probably
also timing sensitive, though they happen very reliably on my machine.

> Looking at the patches 1 and 2 again:
>
> I'm still not totally happy with the program flow and all the conditions
> in _bt_findsplitloc(). I have a hard time following which codepaths are
> taken when. I refactored that, so that there is a separate copy of the
> loop for V3 and V4 indexes.

The big difference is that you make the possible call to
_bt_stepright() conditional on this being a checkingunique index --
the duplicate code is indented in that branch of _bt_findsplitloc().
Whereas I break early in the loop when "checkingunique &&
heapkeyspace".

The flow of the original loop not only had less code; it also
contrasted the important differences between the heapkeyspace and
!heapkeyspace cases:

/* If this is the page that the tuple must go on, stop */
if (P_RIGHTMOST(lpageop))
    break;
cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
if (itup_key->heapkeyspace)
{
    if (cmpval <= 0)
        break;
}
else
{
    /*
     * pg_upgrade'd !heapkeyspace index.
     *
     * May have to handle legacy case where there is a choice of which
     * page to place new tuple on, and we must balance space
     * utilization as best we can.  Note that this may invalidate
     * cached bounds for us.
     */
    if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
        break;
}

I thought it was obvious that the "cmpval <= 0" code was different for
a reason. It now seems that this at least needs a comment.

I still believe that the best way to handle the !heapkeyspace case is
to make it similar to the heapkeyspace checkingunique case, regardless
of whether or not we're checkingunique. The fact that this bug slipped
in supports that view. Besides, the alternative that you suggest
treats !heapkeyspace indexes as if they were just as important to the
reader, which seems inappropriate (better to make the legacy case
follow the new case, not the other way around). I'm fine with the
comment tweaks that you made that are not related to
_bt_findsplitloc(), though.

I won't push the patches today, to give you the opportunity to
respond. I am not at all convinced right now, though.

--
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 2:01 PM Peter Geoghegan  wrote:
>
> On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan  wrote:
> > I agree that it's always true that the high key is also in the parent,
> > and we could cross-check that within the child. Actually, I should
> > probably figure out a way of arranging for the Bloom filter used
> > within bt_relocate_from_root() (which has been around since PG v11) to
> > include the key itself when fingerprinting. That would probably
> > necessitate that we don't truncate "negative infinity" items (it was
> > actually that way about 18 years ago), just for the benefit of
> > verification.
>
> Clarification: You'd fingerprint an entire pivot tuple -- key, block
> number, everything. Then, you'd probe the Bloom filter for the high
> key one level down, with the downlink block in the high key set to
> point to the current sibling on the same level (the child level). As I
> said, I think that the only reason that that cannot be done at present
> is because of the micro-optimization of truncating the first item on
> the right page to zero attributes during an internal page split. (We
> can retain the key without getting rid of the hard-coded logic for
> negative infinity within _bt_compare()).
>
> bt_relocate_from_root() already has smarts around interrupted page
> splits (with the incomplete split bit set).

Clarification to my clarification: I meant
bt_downlink_missing_check(), not bt_relocate_from_root(). The former
really has been around since v11, unlike the latter, which is part of
this new amcheck patch we're discussing.


-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan  wrote:
> I agree that it's always true that the high key is also in the parent,
> and we could cross-check that within the child. Actually, I should
> probably figure out a way of arranging for the Bloom filter used
> within bt_relocate_from_root() (which has been around since PG v11) to
> include the key itself when fingerprinting. That would probably
> necessitate that we don't truncate "negative infinity" items (it was
> actually that way about 18 years ago), just for the benefit of
> verification.

Clarification: You'd fingerprint an entire pivot tuple -- key, block
number, everything. Then, you'd probe the Bloom filter for the high
key one level down, with the downlink block in the high key set to
point to the current sibling on the same level (the child level). As I
said, I think that the only reason that that cannot be done at present
is because of the micro-optimization of truncating the first item on
the right page to zero attributes during an internal page split. (We
can retain the key without getting rid of the hard-coded logic for
negative infinity within _bt_compare()).

bt_relocate_from_root() already has smarts around interrupted page
splits (with the incomplete split bit set).

Finally, you'd also make bt_downlink_check follow every downlink, not
all-but-one downlink (no more excuse for leaving out the first one if
we don't truncate during internal page splits).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 1:33 PM Heikki Linnakangas  wrote:
> AFAICS, there is a copy of every page's high key in its immediate
> parent. Either in the downlink of the right sibling, or as the high key
> of the parent page itself. Cross-checking those would catch any
> corruption in high keys.

I agree that it's always true that the high key is also in the parent,
and we could cross-check that within the child. Actually, I should
probably figure out a way of arranging for the Bloom filter used
within bt_relocate_from_root() (which has been around since PG v11) to
include the key itself when fingerprinting. That would probably
necessitate that we don't truncate "negative infinity" items (it was
actually that way about 18 years ago), just for the benefit of
verification. This is almost the same thing as what Graefe argues for
(don't think that you need a low key on the leaf level, since you can
cross a single level there). I wonder if truncating the negative
infinity item in internal pages to zero attributes is actually worth
it, since a low key might be useful for a number of reasons.

> Note that this would potentially catch some corruption that the
> descend-from-root would not. If you have a mismatch between the high key
> of a leaf page and its parent or grandparent, all the current tuples
> might be pass the rootdescend check. But a tuple might get inserted to
> wrong location later.

I also agree with this. However, the rootdescend check will always
work better than this in some cases that you can at least imagine, for
as long as there are negative infinity items to worry about. (And,
even if we decided not to truncate to support easy verification, there
is still a good argument to be made for involving the authoritative
_bt_search() code at some point).

> > Maybe you could actually do something with the high key from leaf page
> > 5 to detect the stray value "20" in leaf page 6, but again, we don't
> > do anything like that right now.
>
> Hmm, yeah, to check for stray values, you could follow the left-link,
> get the high key of the left sibling, and compare against that.

Graefe argues for retaining a low key, even in leaf pages (the left
page's old high key becomes the left page's low key during a split,
and the left page's new high key becomes the new right pages low key
at the same time). It's useful for what he calls "write-optimized
B-Trees", and maybe even for optional compression. As I said earlier,
I guess you can just go left on the leaf level if you need to, and you
have all you need. But I'd need to think about it some more.

Point taken; rootdescend isn't enough to make verification exactly
perfect. But it brings verification close to perfect, because you're
going to get right answers to queries for as long as it passes (I
think). There could be a problem that only affects a future insertion,
which ideally you could also detect, but can't. You'd have to be
extraordinarily unlucky for that to matter for any amount of time,
though. Unlucky even by my own extremely paranoid standard.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 18:55, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the leaf tuple 19 to become non-re-findable; if you descend the tree for
19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't
the existing checks catch this?


No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
are a match, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".


AFAICS, there is a copy of every page's high key in its immediate 
parent. Either in the downlink of the right sibling, or as the high key 
of the parent page itself. Cross-checking those would catch any 
corruption in high keys.


Note that this would potentially catch some corruption that the 
descend-from-root would not. If you have a mismatch between the high key 
of a leaf page and its parent or grandparent, all the current tuples 
might still pass the rootdescend check. But a tuple might get inserted
into the wrong location later.



Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.


Hmm, yeah, to check for stray values, you could follow the left-link, 
get the high key of the left sibling, and compare against that.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 19:32, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan  wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.


WFM.


Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.


Works for me.

- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan  wrote:
> On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:
> > Hmm. "re-find", maybe? We use that term in a few error messages already,
> > to mean something similar.
>
> WFM.

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.
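
If we go with "rootdescend", usage might end up looking something like
this -- purely a sketch, since the final shape of the amcheck interface
is still an assumption at this point:

-- hypothetical: assumes the new check becomes an extra boolean
-- argument to bt_index_parent_check(), after heapallindexed
SELECT bt_index_parent_check('pgbench_accounts_pkey'::regclass,
                             true,   -- heapallindexed
                             true);  -- rootdescend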

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas  wrote:
> Hmm. "re-find", maybe? We use that term in a few error messages already,
> to mean something similar.

WFM.

> At first, I thought this would be horrendously expensive, but thinking
> about it a bit more, nearby tuples will always follow the same path
> through the upper nodes, so it'll all be cached. So maybe it's not quite
> so bad.

That's deliberate, though you could call bt_relocate_from_root() from
anywhere else if you wanted to. It's a bit like a big nested loop
join, where the inner side has locality.

> Then, a cosmic ray changes the 20 on the root page to 18. That causes
> the leaf tuple 19 to become non-re-findable; if you descend the tree for
> 19, you'll incorrectly land on page 6. But it also causes the high key
> on page 2 to be different from the downlink on the root page. Wouldn't
> the existing checks catch this?

No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
are a match, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 10:51, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas  wrote:

It would be nice if you could take a look at the amcheck "relocate"
patch

When I started looking at this, I thought that "relocate" means "move".
So I thought that the new mode would actually move tuples, i.e. that it
would modify the index. That sounded horrible. Of course, it doesn't
actually do that. It just checks that each tuple can be re-found, or
"relocated", by descending the tree from the root. I'd suggest changing
the language to avoid that confusion.


Okay. What do you suggest? :-)


Hmm. "re-find", maybe? We use that term in a few error messages already, 
to mean something similar.



It seems like a nice way to catch all kinds of index corruption issues.
Although, we already check that the tuples are in order within the page.
Is it really necessary to traverse the tree for every tuple, as well?
Maybe do it just for the first and last item?


It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.


Fair enough.

At first, I thought this would be horrendously expensive, but thinking 
about it a bit more, nearby tuples will always follow the same path 
through the upper nodes, so it'll all be cached. So maybe it's not quite 
so bad.



I don't understand this. Can you give an example of this kind of
inconsistency?


The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).


Hmm. So, the initial situation would be something like this:

         +-----------+
         | 1: root   |
         |           |
         | -inf -> 2 |
         | 20   -> 3 |
         |           |
         +-----------+

+-------------+ +-------------+
| 2: internal | | 3: internal |
|             | |             |
| -inf -> 4   | | -inf -> 6   |
| 10   -> 5   | | 30   -> 7   |
|             | |             |
| hi: 20      | |             |
+-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes 
the leaf tuple 19 to become non-re-findable; if you descend the tree for
19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't 
the existing checks catch this?


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Peter Geoghegan
On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas  wrote:
> > It would be nice if you could take a look at the amcheck "relocate"
> > patch
> When I started looking at this, I thought that "relocate" means "move".
> So I thought that the new mode would actually move tuples, i.e. that it
> would modify the index. That sounded horrible. Of course, it doesn't
> actually do that. It just checks that each tuple can be re-found, or
> "relocated", by descending the tree from the root. I'd suggest changing
> the language to avoid that confusion.

Okay. What do you suggest? :-)

> It seems like a nice way to catch all kinds of index corruption issues.
> Although, we already check that the tuples are in order within the page.
> Is it really necessary to traverse the tree for every tuple, as well?
> Maybe do it just for the first and last item?

It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.

> I don't understand this. Can you give an example of this kind of
> inconsistency?

The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-16 Thread Heikki Linnakangas

On 16/03/2019 06:16, Peter Geoghegan wrote:

On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas  wrote:

It doesn't matter how often it happens, the code still needs to deal
with it. So let's try to make it as readable as possible.



Well, IMHO holding the buffer and the bounds in the new struct is cleaner
than the savebinsrch/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any
benchmarking, but I doubt there's any measurable difference.


Fair enough. Attached is v17, which does it using the approach taken
in your earlier prototype. I even came around to your view on
_bt_binsrch_insert() -- I kept that part, too. Note, however, that I
still pass checkingunique to _bt_findinsertloc(), because that's a
distinct condition to whether or not bounds were cached (they happen
to be the same thing right now, but I don't want to assume that).

This revision also integrates your approach to the "continuescan"
optimization patch, with the small tweak I mentioned yesterday (we
also pass ntupatts). I also prefer this approach.


Great, thank you!


It would be nice if you could take a look at the amcheck "relocate"
patch
When I started looking at this, I thought that "relocate" means "move". 
So I thought that the new mode would actually move tuples, i.e. that it 
would modify the index. That sounded horrible. Of course, it doesn't 
actually do that. It just checks that each tuple can be re-found, or 
"relocated", by descending the tree from the root. I'd suggest changing 
the language to avoid that confusion.


It seems like a nice way to catch all kinds of index corruption issues. 
Although, we already check that the tuples are in order within the page. 
Is it really necessary to traverse the tree for every tuple, as well? 
Maybe do it just for the first and last item?



+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)


I don't understand this. Can you give an example of this kind of 
inconsistency?


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-14 Thread Peter Geoghegan
On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas  wrote:
> Oh yeah, that makes perfect sense. I wonder why we haven't done it like
> that before? The new page split logic makes it more likely to help, but
> even without that, I don't see any downside.

The only downside is that we spend a few extra cycles, and that might
not work out. This optimization would have always worked, though. The
new page split logic clearly makes checking the high key in the
"continuescan" path an easy win.

> I find it a bit confusing, that the logic is now split between
> _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage()
> does the high-key check, but the corresponding "first-key" check in a
> backward scan is done in _bt_checkkeys(). I'd suggest moving the logic
> completely to _bt_readpage(), so that it's in one place. With that,
> _bt_checkkeys() can always check the keys as it's told, without looking
> at the LP_DEAD flag. Like the attached.

I'm convinced. I'd like to go a bit further, and also pass tupnatts to
_bt_checkkeys().  That makes it closer to the similar
_bt_check_rowcompare() function that _bt_checkkeys() must sometimes
call. It also allows us to only call BTreeTupleGetNAtts() for the high
key, while passing down a generic, loop-invariant
IndexRelationGetNumberOfAttributes() value for non-pivot tuples.

I'll do it that way in the next revision.

Thanks
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-14 Thread Heikki Linnakangas

On 13/03/2019 03:28, Peter Geoghegan wrote:

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.


Oh yeah, that makes perfect sense. I wonder why we haven't done it like 
that before? The new page split logic makes it more likely to help, but 
even without that, I don't see any downside.


I find it a bit confusing, that the logic is now split between 
_bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() 
does the high-key check, but the corresponding "first-key" check in a 
backward scan is done in _bt_checkkeys(). I'd suggest moving the logic 
completely to _bt_readpage(), so that it's in one place. With that, 
_bt_checkkeys() can always check the keys as it's told, without looking 
at the LP_DEAD flag. Like the attached.


- Heikki
>From 4b5ea0f361e3feda93852bd084fb0d325e808e4c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH 1/1] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.

(This is Heikki's refactored version)
---
 src/backend/access/nbtree/nbtsearch.c |  86 ++---
 src/backend/access/nbtree/nbtutils.c  | 103 +++---
 src/include/access/nbtree.h   |   3 +-
 3 files changed, 122 insertions(+), 70 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b6..243be6f410d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1220,7 +1220,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			itemIndex;
-	IndexTuple	itup;
 	bool		continuescan;
 
 	/*
@@ -1241,6 +1240,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1282,23 +1282,58 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum <= maxoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				offnum = OffsetNumberNext(offnum);
+				continue;
+			}
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			if (_bt_checkkeys(scan, itup, dir, ))
 			{
 /* tuple passes all scan key conditions, so remember it 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-14 Thread Heikki Linnakangas

On 13/03/2019 03:28, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.



/*
   * Do the insertion. First move right to find the correct page to
   * insert to, if necessary. If we're inserting to a non-unique index,
   * _bt_search() already did this when it checked if a move to the
   * right was required for leaf page.  Insertion scankey's scantid
   * would have been filled out at the time. On a unique index, the
   * current buffer is the first buffer containing duplicates, however,
   * so we may need to move right to the correct location for this
   * tuple.
   */
if (checkingunique || itup_key->heapkeyspace)
 _bt_findinsertpage(rel, , stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, );



The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track whether the bounds are valid or not.


Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov).


It doesn't matter how often it happens, the code still needs to deal 
with it. So let's try to make it as readable as possible.



We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.


Well, IMHO holding the buffer and the bounds in the new struct is cleaner
than the savebinsrch/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any 
benchmarking, but I doubt there's any measurable difference.



I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.


Ok. I think having some code duplication is better than one function 
that tries to do many things, but I'm not wedded to that.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-13 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is that the index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.
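
(A sketch of how that kind of breakdown can be produced -- this only
splits total index size by whether the index serves a primary key, and
it is not the query from the blog post:)

SELECT i.indisprimary AS serves_primary_key,
       pg_size_pretty(sum(pg_relation_size(i.indexrelid))) AS total_index_size
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'pg_toast', 'information_schema')
GROUP BY i.indisprimary;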

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization --  a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites the examples of several
*extremely* bloated indexes on a single column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:
> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
> _bt_findinsertpage(rel, , stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, );

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov). We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.

I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.

The first 3 patches seem commitable now, but I think that it's
important to be sure that I've addressed everything you raised
satisfactorily before pushing. Or that everything works in a way that
you can live with, at least.

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.

Thanks
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 2:22 PM Andres Freund  wrote:
> I'm basically just curious which buffers have most of the additional
> contention. Is it the lower number of leaf pages, the inner pages, or
> (somewhat inexplicably) the meta page, or ...?  I was thinking that the
> callstack that e.g. my lwlock tool gives should be able to explain what
> callstack most of the waits are occurring on.

Right -- that's exactly what I'm interested in, too. If we can
characterize the contention in terms of the types of nbtree blocks
that are involved (their level), that could be really helpful. There
are 200x+ more leaf blocks than internal blocks, so the internal
blocks are a lot hotter. But, there is also a lot fewer splits of
internal pages, because you need hundreds of leaf page splits to get
one internal split.

Is the problem contention caused by internal page splits, or is it
contention in internal pages that can be traced back to leaf splits,
that insert a downlink into their parent page? It would be really
cool to have some idea of the answers to questions like these.
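
A much coarser complement to the bcc tooling, just to confirm that
buffer content LWLock waits dominate (it won't tell you which level of
the tree is involved), is to sample pg_stat_activity while the
benchmark runs:

-- run repeatedly; LWLock / buffer_content is the wait of interest here
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count(*) DESC;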

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Andres Freund
On 2019-03-12 14:15:06 -0700, Peter Geoghegan wrote:
> On Tue, Mar 12, 2019 at 12:40 PM Andres Freund  wrote:
> > Have you looked at an offwake or lwlock wait graph (bcc tools) or
> > something in that vein? Would be interesting to see what is waiting for
> > what most often...
> 
> Not recently, though I did use your BCC script for this very purpose
> quite a few months ago. I don't remember it helping that much at the
> time, but then that was with a version of the patch that lacked a
> couple of important optimizations that we have now. We're now very
> careful to not descend to the left with an equal pivot tuple. We
> descend right instead when that's definitely the only place we'll find
> matches (a high key doesn't count as a match in almost all cases!).
> Edge-cases where we unnecessarily move left then right, or
> unnecessarily move right a second time once on the leaf level have
> been fixed. I fixed the regression I was worried about at the time,
> without getting much benefit from the BCC script, and moved on.
> 
> This kind of minutiae is more important than it sounds. I have used
> EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
> understand where every single block access comes from with these
> edge-cases, paying close attention to the structure of the index, and
> how the key space is broken up (the values of pivot tuples in internal
> pages). It is one thing to make the index smaller, and another thing
> to take full advantage of that -- I have both. This is one of the
> reasons why I believe that this minor regression cannot be avoided,
> short of simply allowing the index to get bloated: I'm simply not
> doing things that differently outside of the page split code, and what
> I am doing differently is clearly superior. Both in general, and for
> the NEW_ORDER transaction in particular.
> 
> I'll make that another TODO item -- this regression will be revisited
> using BCC instrumentation. I am currently performing a multi-day
> benchmark on a very large TPC-C/BenchmarkSQL database, and it will
> have to wait for that. (I would like to use the same environment as
> before.)

I'm basically just curious which buffers have most of the additional
contention. Is it the lower number of leaf pages, the inner pages, or
(somewhat inexplicably) the meta page, or ...?  I was thinking that the
callstack that e.g. my lwlock tool gives should be able to explain what
callstack most of the waits are occurring on.

(I should work a bit on that script; I locally had a version that showed
both waiters and the waking-up callstack, but I can't find it anymore.)

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 12:40 PM Andres Freund  wrote:
> Have you looked at an offwake or lwlock wait graph (bcc tools) or
> something in that vein? Would be interesting to see what is waiting for
> what most often...

Not recently, though I did use your BCC script for this very purpose
quite a few months ago. I don't remember it helping that much at the
time, but then that was with a version of the patch that lacked a
couple of important optimizations that we have now. We're now very
careful to not descend to the left with an equal pivot tuple. We
descend right instead when that's definitely the only place we'll find
matches (a high key doesn't count as a match in almost all cases!).
Edge-cases where we unnecessarily move left then right, or
unnecessarily move right a second time once on the leaf level have
been fixed. I fixed the regression I was worried about at the time,
without getting much benefit from the BCC script, and moved on.

This kind of minutiae is more important than it sounds. I have used
EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
understand where every single block access comes from with these
edge-cases, paying close attention to the structure of the index, and
how the key space is broken up (the values of pivot tuples in internal
pages). It is one thing to make the index smaller, and another thing
to take full advantage of that -- I have both. This is one of the
reasons why I believe that this minor regression cannot be avoided,
short of simply allowing the index to get bloated: I'm simply not
doing things that differently outside of the page split code, and what
I am doing differently is clearly superior. Both in general, and for
the NEW_ORDER transaction in particular.

I'll make that another TODO item -- this regression will be revisited
using BCC instrumentation. I am currently performing a multi-day
benchmark on a very large TPC-C/BenchmarkSQL database, and it will
have to wait for that. (I would like to use the same environment as
before.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Andres Freund
Hi,

On 2019-03-11 19:47:29 -0700, Peter Geoghegan wrote:
> I now believe that the problem is with LWLock/buffer lock contention
> on index pages, and that that's an inherent cost with a minority of
> write-heavy high contention workloads. A cost that we should just
> accept.

Have you looked at an offwake or lwlock wait graph (bcc tools) or
something in that vein? Would be interesting to see what is waiting for
what most often...

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> Hey, I understood something today!

And I said something that could be understood!

> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I think so too. There is a feature in other database systems called
"reverse key indexes", which deals with this problem in a rather
extreme way. This situation is very similar to the situation with
rightmost page splits, where fillfactor is applied to pack leaf pages
full. The only difference is that there are multiple groupings, not
just one single implicit grouping (everything in the index). You could
probably make very similar observations about rightmost page splits
that apply leaf fillfactor.

Here is an example of how the largest index looks for master with the
100 warehouse case that was slightly regressed:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         54.000 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | I         |    11,482 |        143.200 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 1,621,316 |        139.458 |          0.003 |        24.000

Here is what we see with the patch:

    table_name    |      index_name       | page_type | npages  | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+---------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |       1 |         29.000 |          0.000 |        22.000
 bmsql_order_line | bmsql_order_line_pkey | I         |   5,957 |        159.149 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 936,170 |        233.496 |          0.052 |        23.999

REINDEX would leave bmsql_order_line_pkey with 262 items per leaf page,
and we see here that it averages 233 after several hours, which is
pretty good given the amount of contention. The index actually looks
very much like it was just REINDEXed when initial bulk loading
finishes, before we get any updates, even though that loading happens
using retail insertions.

Notice that the number of internal pages is reduced by almost a full
50% -- somewhat better than the reduction in the number of leaf pages,
because the benefits compound (items in the root are even a bit
smaller thanks to that compounding, despite alignment effects).
Internal pages are the most important pages to have cached, but also
potentially the biggest points of contention.
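
The same internal/leaf breakdown can be approximated with stock
pgstattuple, for anyone who wants to reproduce these numbers without
the custom page-type query used above:

-- requires CREATE EXTENSION pgstattuple
SELECT tree_level, internal_pages, leaf_pages,
       avg_leaf_density, leaf_fragmentation
FROM pgstatindex('bmsql_order_line_pkey');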

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Robert Haas
On Tue, Mar 12, 2019 at 2:34 PM Peter Geoghegan  wrote:
> On Tue, Mar 12, 2019 at 11:32 AM Robert Haas  wrote:
> > If I wanted to try to say this in fewer words, would it be fair to say
> > that reducing the size of an index by 40% without changing anything
> > else can increase contention on the remaining pages?
>
> Yes.

Hey, I understood something today!

I think it's pretty clear that we have to view that as acceptable.  I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.
Now, maybe in the future we'll want to work on other techniques for
reducing contention, but I don't think we should make that the problem
of this patch, especially because the regressions are small and go
away after a few hours of heavy use.  We should optimize for the case
where the user intends to keep the database around for years, not
hours.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:32 AM Robert Haas  wrote:
> If I wanted to try to say this in fewer words, would it be fair to say
> that reducing the size of an index by 40% without changing anything
> else can increase contention on the remaining pages?

Yes.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Robert Haas
On Mon, Mar 11, 2019 at 10:47 PM Peter Geoghegan  wrote:
>
> On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan  wrote:
> > The regression that I mentioned earlier isn't in pgbench type
> > workloads (even when the distribution is something more interesting
> > that the uniform distribution default). It is only in workloads with
> > lots of page splits and lots of index churn, where we get most of the
> > benefit of the patch, but also where the costs are most apparent.
> > Hopefully it can be fixed, but if not I'm inclined to think that it's
> > a price worth paying. This certainly still needs further analysis and
> > discussion, though. This revision of the patch does not attempt to
> > address that problem in any way.
>
> I believe that I've figured out what's going on here.
>
> At first, I thought that this regression was due to the cycles that
> have been added to page splits, but that doesn't seem to be the case
> at all. Nothing that I did to make page splits faster helped (e.g.
> temporarily go back to doing them "bottom up" made no difference). CPU
> utilization was consistently slightly *higher* with the master branch
> (patch spent slightly more CPU time idle). I now believe that the
> problem is with LWLock/buffer lock contention on index pages, and that
> that's an inherent cost with a minority of write-heavy high contention
> workloads. A cost that we should just accept.

If I wanted to try to say this in fewer words, would it be fair to say
that reducing the size of an index by 40% without changing anything
else can increase contention on the remaining pages?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Peter Geoghegan
On Mon, Mar 11, 2019 at 11:30 PM Heikki Linnakangas  wrote:
> Yeah, that's fine. I'm curious, though, could you bloat the indexes back
> to the old size by setting the fillfactor?

I think that that might work, though it's hard to say for sure offhand.

The "split after new item" optimization is supposed to be a variation
of rightmost splits, of course. We apply fillfactor in the same way
much of the time. You would still literally split immediately after
the new item some of the time, though, which makes it unclear how much
bloat there would be without testing it.

Some indexes mostly apply fillfactor in non-rightmost pages, while
other indexes mostly split at the exact point past the new item,
depending on details like the size of the groupings.

I am currently doing a multi-day 6,000 warehouse benchmark, since I
want to be sure that the bloat resistance will hold up over days. I
think that it will, because there aren't that many updates, and
they're almost all HOT-safe. I'll put the idea of a 50/50 fillfactor
benchmark with the high-contention/regressed workload on my TODO list,
though.
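
For reference, the experiment itself only needs the index-level
fillfactor knob (50 being the value under discussion), applied here to
the order lines primary key from the earlier tables:

-- affects future page splits; REINDEX rewrites existing pages too
ALTER INDEX bmsql_order_line_pkey SET (fillfactor = 50);
REINDEX INDEX bmsql_order_line_pkey;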

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-12 Thread Heikki Linnakangas

On 12/03/2019 04:47, Peter Geoghegan wrote:

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression *because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.


Yeah, that's fine. I'm curious, though, could you bloat the indexes back 
to the old size by setting the fillfactor?


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Peter Geoghegan
On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas  wrote:
> Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
> changed to "searching_for_leaf_page" at the last minute. My thinking was
> that in the page-deletion case, you're trying to re-locate a particular
> leaf page. Otherwise, you're searching for matching tuples, not a
> particular page. Although during insertion, I guess you are also
> searching for a particular page, the page to insert to.

I prefer something like "searching_for_pivot_tuple", because it's
unambiguous. Okay with that?

> It's a hot codepath, but I doubt it's *that* hot that it matters,
> performance-wise...

I'll figure that out, although I am currently looking into a
regression in workloads that fit in shared_buffers that my
micro-benchmarks didn't catch initially. Indexes are still much
smaller, but we get a ~2% regression all the same. OTOH, we get a
7.5%+ increase in throughput when the workload is I/O bound, and
latency is generally no worse, and often better, with any workload.

I suspect that the nice top-down approach to nbtsplitloc.c has its
costs...will let you know more when I know more.

> > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > rely on its value, though, and must use
> > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > being minus infinity is still relying on its value being something).
>
> Yeah. I find that's a complicated way to think about it. My mental model
> is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Heikki Linnakangas

On 10/03/2019 20:53, Peter Geoghegan wrote:

On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas  wrote:

If we don't flip the meaning of the flag, then maybe calling it
something like "searching_for_leaf_page" would make sense:

/*
   * When set, we're searching for the leaf page with the given high key,
   * rather than leaf tuples matching the search keys.
   *
   * Normally, when !searching_for_pivot_tuple, if a page's high key


I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)


Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but 
changed to "searching_for_leaf_page" at the last minute. My thinking was 
that in the page-deletion case, you're trying to re-locate a particular 
leaf page. Otherwise, you're searching for matching tuples, not a 
particular page. Although during insertion, I guess you are also 
searching for a particular page, the page to insert to.



As the patch stands, you're also setting minusinfkey when dealing with
v3 indexes. I think it would be better to only set
searching_for_leaf_page in nbtpage.c.


That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare().


Yeah.


I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.


It's a hot codepath, but I doubt it's *that* hot that it matters, 
performance-wise...



In general, I think BTScanInsert
should describe the search key, regardless of whether it's a V3 or V4
index. Properties of the index belong elsewhere. (We're violating that
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
probably OK, it is pretty convenient to have it there. But in principle...)


The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).


Yeah. I find that's a complicated way to think about it. My mental model 
is that v4 indexes store heap TIDs, and every tuple is unique thanks to 
that. But on v3, we don't store heap TIDs, and duplicates are possible.
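
Checking which of those two mental models applies to a given index is
easy with pageinspect, along the lines of the bt_metap() query that
shows up in the regression test output further down the thread:

-- version 4 stores heap TIDs in pivot tuples; version 3 (pg_upgrade'd)
-- indexes do not
SELECT version FROM bt_metap('test1_a_idx');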


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Peter Geoghegan
On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas  wrote:
> "descendrighttrunc" doesn't make much sense to me, either. I don't
> understand it. Maybe a comment would make it clear, though.

It's not an easily grasped concept. I don't think that any name will
easily convey the idea to the reader, though. I'm happy to go with
whatever name you prefer.

> I don't feel like this is an optimization. It's natural consequence of
> what the high key means. I guess you can think of it as an optimization,
> in the same way that not fully scanning the whole index for every search
> is an optimization, but that's not how I think of it :-).

I would agree with this in a green field situation, where we don't
have to consider the legacy of v3 indexes. But that's not the case
here.

> If we don't flip the meaning of the flag, then maybe calling it
> something like "searching_for_leaf_page" would make sense:
>
> /*
>   * When set, we're searching for the leaf page with the given high key,
>   * rather than leaf tuples matching the search keys.
>   *
>   * Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)

I'm fine with "searching_for_pivot_tuple", I think. The underscores
are not really stylistically consistent with other stuff in nbtree.h,
but I can use something very similar to your suggestion that is
consistent.

>   * has truncated columns, and the columns that are present are equal to
>   * the search key, the search will not descend to that page. For
>   * example, if an index has two columns, and a page's high key is
>   * ("foo", ), and the search key is also ("foo," ),
>   * the search will not descend to that page, but its right sibling. The
>   * omitted column in the high key means that all tuples on the page must
>   * be strictly < "foo", so we don't need to visit it. However, sometimes
>   * we perform a search to find the parent of a leaf page, using the leaf
>   * page's high key as the search key. In that case, when we search for
>   * ("foo", ), we do want to land on that page, not its right
>   * sibling.
>   */
> boolsearching_for_leaf_page;

That works for me (assuming you meant searching_for_pivot_tuple).

> As the patch stands, you're also setting minusinfkey when dealing with
> v3 indexes. I think it would be better to only set
> searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare(). I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.

> In general, I think BTScanInsert
> should describe the search key, regardless of whether it's a V3 or V4
> index. Properties of the index belong elsewhere. (We're violating that
> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
> probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Now, it's also true that there are a number of things that we have to
do within nbtinsert.c for !heapkeyspace that don't really concern the
key space as such. Even so, thinking about everything with reference
to the keyspace, and keeping that as similar as possible between v3
and v4, is a good thing. It is up to high-level code (such
as _bt_first()) to not allow a !heapkeyspace index scan to do
something that won't work for it. It is not up to low-level code like
_bt_compare() to worry about these differences (beyond asserting that
caller got it right). If page deletion didn't need minusinfkey
semantics (if nobody but v3 indexes needed that), then I would just
make the "move right of separator" !minusinfkey code within
_bt_compare() test heapkeyspace. But we do have a general need for
minusinfkey semantics, so it seems simpler and more future-proof to
keep heapkeyspace out of low-level nbtsearch.c code.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Heikki Linnakangas

On 08/03/2019 23:21, Peter Geoghegan wrote:

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.


This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres Pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?


"descendrighttrunc" doesn't make much sense to me, either. I don't 
understand it. Maybe a comment would make it clear, though.


I don't feel like this is an optimization. It's natural consequence of 
what the high key means. I guess you can think of it as an optimization, 
in the same way that not fully scanning the whole index for every search 
is an optimization, but that's not how I think of it :-).


If we don't flip the meaning of the flag, then maybe calling it 
something like "searching_for_leaf_page" would make sense:


/*
 * When set, we're searching for the leaf page with the given high key,
 * rather than leaf tuples matching the search keys.
 *
 * Normally, when !searching_for_pivot_tuple, if a page's high key
 * has truncated columns, and the columns that are present are equal to
 * the search key, the search will not descend to that page. For
 * example, if an index has two columns, and a page's high key is
 * ("foo", ), and the search key is also ("foo," ),
 * the search will not descend to that page, but its right sibling. The
 * omitted column in the high key means that all tuples on the page must
 * be strictly < "foo", so we don't need to visit it. However, sometimes
 * we perform a search to find the parent of a leaf page, using the leaf
 * page's high key as the search key. In that case, when we search for
 * ("foo", ), we do want to land on that page, not its right
 * sibling.
 */
boolsearching_for_leaf_page;


As the patch stands, you're also setting minusinfkey when dealing with 
v3 indexes. I think it would be better to only set 
searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert 
should describe the search key, regardless of whether it's a V3 or V4 
index. Properties of the index belong elsewhere. (We're violating that 
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:
> All of that said, maybe it would be clearer if page deletion was not
> the special case that has to opt in to minusinfkey semantics. Perhaps
> it would make more sense for everyone else to opt out of minusinfkey
> semantics, and to get the !minusinfkey optimization as a result of
> that. I only did it the other way around because that meant that only
> nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres Pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

Thanks
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:
> > Question: Wouldn't it be more straightforward to use "1 +inf" as page
> > 1's high key? I.e treat any missing columns as positive infinity.
>
> That might also work, but it wouldn't be more straightforward on
> balance. This is because:

I thought of another reason:

* The 'Add high key "continuescan" optimization' is effective because
the high key of a leaf page tends to look relatively dissimilar to
other items on the page. The optimization would almost never help if
the high key was derived from the lastleft item at the time of a split
-- that's no more informative than the lastleft item itself.

As things stand with the patch, a high key usually has a value for its
last untruncated attribute that can only appear on the page to the
right, and never the current page. We'd quite like to be able to
conclude that the page to the right can't be interesting there and
then, without needing to visit it. Making new leaf high keys "as close
as possible to items on the right, without actually touching them"
makes the optimization quite likely to work out with the TPC-C
indexes, when we search for orderline items for an order that is
rightmost of a leaf page in the orderlines primary key.

And another reason:

* This makes it likely that any new items that would go between the
original lastleft and firstright items end up on the right page when
they're inserted after the lastleft/firstright split. This is
generally a good thing, because we've optimized WAL-logging for new
pages that go on the right. (You pointed this out to me in Lisbon, in
fact.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas  wrote:
> Now, what do we have as the high key of page 1? Answer: "2 -inf". The
> "-inf" is not stored in the key itself, the second key column is just
> omitted, and the search code knows to treat it implicitly as a value
> that's lower than any real value. Hence, "minus infinity".

Right.

> However, during page deletion, we need to perform a search to find the
> downlink pointing to a leaf page. We do that by using the leaf page's
> high key as the search key. But the search needs to treat it slightly
> differently in that case. Normally, searching with a single key value,
> "2", we would land on page 2, because any real value beginning with "2"
> would be on that page, but in the page deletion case, we want to find
> page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
> to do that.

Right.

> Question: Wouldn't it be more straightforward to use "1 +inf" as page
> 1's high key? I.e treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

* We have always taken the new high key from the firstright item, and
we also continue to do that on internal pages -- same as before. It
would certainly complicate the nbtsplitloc.c code to have to deal with
this new special case now (leaf and internal pages would have to have
far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal
page, which is explicitly truncated to zero attributes as of Postgres
v11. It seems ugly to me to make truncated attributes mean negative
infinity in that context, but positive infinity in every other
context.

* Another reason that I prefer "-inf" to "+inf" is that you can
imagine an implementation that makes pivot tuples into normalized
binary keys, that are truncated using generic/opclass-agnostic logic,
and compared using strcmp(). If the scankey binary string is longer
than the pivot tuple, then it's greater according to strcmp() -- that
just works. And, you can truncate the original binary strings built
using opclass infrastructure without having to understand where
attributes begin and end (though this relies on encoding things like
NULL-ness a certain way). If we define truncation to be "+inf" now,
then none of this works.

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're
not so much searching for "-inf" values as we are searching for the
first page that could have tuples for the untruncated attributes. But
isn't that how this has always worked, given that we've had to deal
with duplicate pivot tuples on the same level before now? As I said,
we're not doing an extra thing when minusinfkey is true (during page
deletion) -- it's the other way around. Saying that we're searching
for minus infinity values for the truncated attributes is kind of a
lie, although the search does behave that way.

> That way, the search for page deletion wouldn't need to be treated
> differently. That's also how this used to work: all tuples on a page
> used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The
alternative that you've described is possible, I think, but the key
space works just the same with either of our approaches. You've merely
thought of an alternative way of generating new high keys that satisfy
the same invariants as my own scheme. Provided the new separator for
high key is >= last item on the left and < first item on the right,
everything works.

As you point out, the original Lehman and Yao rule for leaf pages
(which Postgres kinda followed before) is that the high key is <=
items on the leaf level. But this patch makes nbtree follow that rule
fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only
checks <= on leaf pages. Perhaps it led you to believe that I did
things differently. Actually, this is classic Lehman and Yao. The keys
in internal pages are all "separators" as far as Lehman and Yao are
concerned, so the high key is less of a special case on internal
pages. We check < on internal pages because all separators are
supposed to be unique on a level. But, as I said, we do check <= on
the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if
this is unclear. That shows that internal pages have unique keys -- we
can therefore expect the high key to be < items in internal pages. It
also shows that leaf pages copy the high key from the last item on the
left page -- we can expect 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Heikki Linnakangas

On 08/03/2019 12:22, Peter Geoghegan wrote:

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.


Ok, after thinking about it for a while, I think I understand the minus 
infinity stuff now. Let me try to explain it in my own words:


Imagine that you have an index with two key columns, A and B. The index 
has two leaf pages, with the following items:


++   ++
| Page 1 |   | Page 2 |
||   ||
|1 1 |   |2 1 |
|1 2 |   |2 2 |
|1 3 |   |2 3 |
|1 4 |   |2 4 |
|1 5 |   |2 5 |
++   ++

The key space is neatly split on the first key column - probably thanks 
to the new magic in the page split code.


Now, what do we have as the high key of page 1? Answer: "2 -inf". The 
"-inf" is not stored in the key itself, the second key column is just 
omitted, and the search code knows to treat it implicitly as a value 
that's lower than any real value. Hence, "minus infinity".
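
For illustration, the high key of a non-rightmost leaf page can be seen
directly with pageinspect (index name and block number are placeholders
here); the truncated second column simply isn't present in the stored
tuple:

-- the item at itemoffset = 1 on a non-rightmost page is the high key
SELECT itemoffset, ctid, itemlen, data
FROM bt_page_items('two_col_idx', 1);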


However, during page deletion, we need to perform a search to find the 
downlink pointing to a leaf page. We do that by using the leaf page's 
high key as the search key. But the search needs to treat it slightly 
differently in that case. Normally, searching with a single key value, 
"2", we would land on page 2, because any real value beginning with "2" 
would be on that page, but in the page deletion case, we want to find 
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code 
to do that.


Question: Wouldn't it be more straightforward to use "1 +inf" as page 
1's high key? I.e treat any missing columns as positive infinity. That 
way, the search for page deletion wouldn't need to be treated 
differently. That's also how this used to work: all tuples on a page 
used to be <= high key, not strictly < high key. And it would also make 
the rightmost page less of a special case: you could pretend the 
rightmost page has a pivot tuple with all columns truncated away, which 
means positive infinity.


You have this comment in _bt_split, which touches on the subject:


/*
 * The "high key" for the new left page will be the first key that's
 * going to go into the new right page, or possibly a truncated version
 * if this is a leaf page split.  This might be either the existing data
 * item at position firstright, or the incoming tuple.
 *
 * The high key for the left page is formed using the first item on the
 * right page, which may seem to be contrary to Lehman & Yao's approach
 * of using the left page's last item as its new high key when splitting
 * on the leaf level.  It isn't, though: suffix truncation will leave
 * the left page's high key fully equal to the last item on the left
 * page when two tuples with equal key values (excluding heap TID)
 * enclose the split point.  It isn't actually necessary for a new leaf
 * high key to be equal to the last item on the left for the L&Y
 * "subtree" invariant to hold.  It's sufficient to make sure that the
 * new leaf high key is strictly less than the first item on the right
 * leaf page, and greater than or equal to (not necessarily equal to)
 * the last item on the left leaf page.
 *
 * In other words, when suffix truncation isn't possible, L&Y's exact
 * approach to leaf splits is taken.  (Actually, even that is slightly
 * inaccurate.  A tuple with all the keys from firstright but the heap
 * TID from lastleft will be used as the new high key, since the last
 * left tuple could be physically larger despite being opclass-equal in
 * respect of all attributes prior to the heap TID attribute.)
 */


But it doesn't explain why it's done like that.

- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas  wrote:
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

One more thing on this: If you force bitmap index scans (by disabling
index-only scans and index scans with the "enable_" GUCs), then you
get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone
(and the heap, separately). No visibility map accesses, which obscure
the same numbers for a similar index-only scan.

You can then observe that most searches of a single value will touch
the bare minimum number of index pages. For example, if there are 3
levels in the index, you should access only 3 index pages total,
unless there are literally hundreds of matches and we cannot avoid
storing them on more than one leaf page. You'll see that the scan
touches the minimum possible number of index pages, because of:

* Many duplicates strategy. (Not single value strategy, which I
incorrectly mentioned in relation to this earlier.)

* The !minusinfkey optimization, which ensures that we go to the
right of an otherwise-equal pivot tuple in an internal page, rather
than left.

* The "continuescan" high key patch, which ensures that the scan
doesn't go to the right from the first leaf page to try to find even
more matches. The high key on the same leaf page will indicate that
the scan is over, without actually visiting the sibling. (Again, I'm
assuming that your search is for a single value.)
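
Concretely, the recipe described above looks something like this (the
table and predicate are only illustrative):

-- force a bitmap index scan so that the "Buffers" numbers reflect the
-- index accesses alone, with heap accesses reported separately and no
-- visibility map reads mixed in
SET enable_indexscan = off;
SET enable_indexonlyscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE order_id = 42;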

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan  wrote:
> When I drew you that picture while we were in Lisbon, I mentioned to
> you that the patch sometimes used a sentinel scantid value that was
> greater than minus infinity, but less than any real scantid. This
> could be used to force an otherwise-equal-to-pivot search to go left
> rather than uselessly going right. I explained this about 30 minutes
> in, when I was drawing you a picture.

I meant the opposite: it could be used to go right, instead of going
left when descending the tree and unnecessarily moving right on the
leaf level.

As I said, moving right on the leaf level (rather than during the
descent) should only happen when it's necessary, such as when there is
a concurrent page split. It shouldn't happen reliably when searching
for the same value, unless there really are matches across multiple
leaf pages, and that's just what we have to do.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas  wrote:
> > BTW, I would like to hear what you think of the idea of minusinfkey
> > (and the !minusinfkey optimization) specifically.
>
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

Here is another way of explaining it:

When I drew you that picture while we were in Lisbon, I mentioned to
you that the patch sometimes used a sentinel scantid value that was
greater than minus infinity, but less than any real scantid. This
could be used to force an otherwise-equal-to-pivot search to go left
rather than uselessly going right. I explained this about 30 minutes
in, when I was drawing you a picture.

Well, that sentinel heap TID thing doesn't exist any more, because it
was replaced by the !minusinfkey optimization, which is a
*generalization* of the same idea, which extends it to all columns
(not just the heap TID column). That way, you never have to go to two
pages just because you searched for a value that happened to be at the
"right at the edge" of a leaf page.

Page deletion wants to assume that truncated attributes from the high
key of the page being deleted have actual negative infinity values --
negative infinity is a value, just like any other, albeit one that can
only appear in pivot tuples. This is simulated by VACUUM using
"minusinfkey = true". We go left in the parent, not right, and land on
the correct leaf page. Technically we don't compare the negative
infinity values in the pivot to the negative infinity values in the
scankey, but we return 0 just as if we had, and found them equal.
Similarly, v3 indexes specify "minusinfkey = true" in all cases,
because they always want to go left -- just like in old Postgres
versions. They don't have negative infinity values (matches can be on
either side of the all-equal pivot, so they must go left).

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Peter Geoghegan
On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas  wrote:
> I haven't investigated any deeper, but apparently something's broken.
> This was with patch v14, without any further changes.

Try it with my patch -- attached.

I think that you missed that the INCLUDE indexes thing within
nbtsort.c needs to be changed back.

-- 
Peter Geoghegan
From cdfe29c5479da6198aa918f4c373cb8a1a1acfe1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 21 Jan 2019 15:35:37 -0800
Subject: [PATCH 08/12] DEBUG: Force version 3 artificially.

---
 contrib/amcheck/expected/check_btree.out |  7 ++-
 contrib/pageinspect/expected/btree.out   |  2 +-
 contrib/pgstattuple/expected/pgstattuple.out | 10 +-
 src/backend/access/nbtree/nbtpage.c  |  2 +-
 src/backend/access/nbtree/nbtsort.c  | 16 
 src/backend/postmaster/postmaster.c  |  1 +
 src/test/regress/expected/dependency.out |  4 ++--
 src/test/regress/expected/event_trigger.out  |  4 ++--
 src/test/regress/expected/foreign_data.out   |  8 
 src/test/regress/expected/rowsecurity.out|  4 ++--
 10 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 687fde8fce..60bebb1c00 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -139,11 +139,8 @@ VACUUM delete_test_table;
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
- bt_index_parent_check 

- 
-(1 row)
-
+ERROR:  index "delete_test_table_pkey" does not support relocating tuples
+HINT:  Only indexes initialized on PostgreSQL 12 support relocation verification.
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
 -- tuple.  Bloom filter must fingerprint normalized index tuple representation.
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 067e73f21a..7f003bf801 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]---+---
 magic   | 340322
-version | 4
+version | 3
 root| 1
 level   | 0
 fastroot| 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9920dbfd40..9858ea69d4 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
 from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 -+++---+++-+---+--+
-   4 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
+   3 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
 from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 -+++---+++-+---+--+
-   4 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
+   3 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
 from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 -+++---+++-+---+--+
-   4 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
+   3 |  0 |  1 | 0 |  0 |  0 |   0 | 0 |  NaN |NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-07 Thread Heikki Linnakangas

On 05/03/2019 05:16, Peter Geoghegan wrote:
Attached is v14, which has changes based on your feedback. 
As a quick check of the backwards-compatibility code, i.e. 
!heapkeyspace, I hacked _bt_initmetapage to force the version number to 
3, and ran the regression tests. It failed an assertion in the 
'create_index' test:


(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f2943f9a535 in __GI_abort () at abort.c:79
#2  0x5622c7d9d6b4 in ExceptionalCondition 
(conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace, 
page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion",

fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3  0x5622c78627fb in _bt_compare (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4  0x5622c7862640 in _bt_binsrch (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5  0x5622c7861ec9 in _bt_search (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1, 
snapshot=0x5622c8353740) at nbtsearch.c:142
#6  0x5622c7863a44 in _bt_first (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtsearch.c:1183
#7  0x5622c785f8b0 in btgettuple (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtree.c:245
#8  0x5622c78522e3 in index_getnext_tid (scan=0x5622c841e828, 
direction=ForwardScanDirection) at indexam.c:542
#9  0x5622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at 
nodeIndexonlyscan.c:120
#10 0x5622c7a438d5 in ExecScanFetch (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 , recheckMtd=0x5622c7a67bc9 
) at execScan.c:95
#11 0x5622c7a4394a in ExecScan (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 , recheckMtd=0x5622c7a67bc9 
) at execScan.c:145
#12 0x5622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at 
nodeIndexonlyscan.c:322
#13 0x5622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at 
execProcnode.c:445
#14 0x5622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at 
../../../src/include/executor/executor.h:231
#15 0x5622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at 
nodeAgg.c:406
#16 0x5622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0) 
at nodeAgg.c:1737

#17 0x5622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x5622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at 
execProcnode.c:445
#19 0x5622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at 
../../../src/include/executor/executor.h:231
#20 0x5622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78, 
planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x5622c8462088, 
execute_once=true) at execMain.c:1645
#21 0x5622c7a36872 in standard_ExecutorRun 
(queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:363
#22 0x5622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:307
#23 0x5622c7c357dc in PortalRunSelect (portal=0x5622c8336778, 
forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x5622c7c3546f in PortalRun (portal=0x5622c8336778, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x5622c8462088, altdest=0x5622c8462088,

completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x5622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508 
"SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 
IS NULL;") at postgres.c:1215
#26 0x5622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0, 
dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at 
postgres.c:4256
#27 0x5622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at 
postmaster.c:4378
#28 0x5622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at 
postmaster.c:4069

#29 0x5622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x5622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160) 
at postmaster.c:1372

#31 0x5622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken. 
This was with patch v14, without any further changes.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 07/03/2019 15:41, Heikki Linnakangas wrote:

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should call insertion scan
keys by a less confusing name. I don't know what to suggest, though.
"Positioning key"?


I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.


Yeah. It's been like that forever, but I must confess I hadn't paid any
attention to it, until now. I had not understood that text in the README
explaining the difference between search and insertion scan keys, before
looking at this patch. Not sure I ever read it with any thought. Now
that I understand it, I don't like the "insertion scan key" name.


BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.


I don't understand it :-(. I guess that's valuable feedback on its own.
I'll spend more time reading the code around that, but meanwhile, if you
can think of a simpler way to explain it in the comments, that'd be good.


The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.


I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.


I haven't given performance much thought, really. I don't expect this
method to be any slower, but the point of the refactoring is to make the
code easier to understand.


/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for leaf page.  Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
  _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?


I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.


The idea is that _bt_findinsertpage() would not need to know whether the
unique checks were performed or not. I'd like to encapsulate all the
information about the "insert position we're considering" in the
BTInsertStateData struct. Passing 'checkingunique' as a separate
argument violates that, because when it's set, the key means something
slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether
'scantid' is set, instead of 'checkingunique'. That would seem better.
If it looks at 'checkingunique', it's making the assumption that if
unique checks were not performed, then we are already positioned on the
correct page, according to the heap TID. But looking at 'scantid' seems
like a more direct way of getting the same information. And then we
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

if (!checkingunique && itup_key->heapkeyspace)
break;

With this:

if (itup_key->scantid)
break;

And remove the 'checkingunique' argument from _bt_findinsertloc.


Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has 
already been restored, even if it was not set originally when we did 
_bt_search.


My dislike here is that passing 'checkingunique' as a separate argument 
acts like a "modifier", changing slightly the meaning of the insertion 
scan key. If it's not set, we know we're positioned on the correct page. 
Otherwise, we might not be. And it's a pretty indirect way of saying 
that, as it also depends on 'heapkeyspace'. Perhaps add a flag to 
BTInsertStateData, to indicate the same thing more explicitly. Something 
like "bool is_final_insertion_page; /* when set, no need to move right */".


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should call insertion scan
keys by a less confusing name. I don't know what to suggest, though.
"Positioning key"?


I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.


Yeah. It's been like that forever, but I must confess I hadn't paid any 
attention to it, until now. I had not understood that text in the README 
explaining the difference between search and insertion scan keys, before 
looking at this patch. Not sure I ever read it with any thought. Now 
that I understand it, I don't like the "insertion scan key" name.



BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.


I don't understand it :-(. I guess that's valuable feedback on its own. 
I'll spend more time reading the code around that, but meanwhile, if you 
can think of a simpler way to explain it in the comments, that'd be good.



The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.


I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.


I haven't given performance much thought, really. I don't expect this 
method to be any slower, but the point of the refactoring is to make the 
code easier to understand.



/*
   * Do the insertion. First move right to find the correct page to
   * insert to, if necessary. If we're inserting to a non-unique index,
   * _bt_search() already did this when it checked if a move to the
   * right was required for leaf page.  Insertion scankey's scantid
   * would have been filled out at the time. On a unique index, the
   * current buffer is the first buffer containing duplicates, however,
   * so we may need to move right to the correct location for this
   * tuple.
   */
if (checkingunique || itup_key->heapkeyspace)
 _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?


I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.


The idea is that _bt_findinsertpage() would not need to know whether the 
unique checks were performed or not. I'd like to encapsulate all the 
information about the "insert position we're considering" in the 
BTInsertStateData struct. Passing 'checkingunique' as a separate 
argument violates that, because when it's set, the key means something 
slightly different.


Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether 
'scantid' is set, instead of 'checkingunique'. That would seem better. 
If it looks at 'checkingunique', it's making the assumption that if 
unique checks were not performed, then we are already positioned on the 
correct page, according to the heap TID. But looking at 'scantid' seems 
like a more direct way of getting the same information. And then we 
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.


So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

if (!checkingunique && itup_key->heapkeyspace)
break;

With this:

if (itup_key->scantid)
break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan  wrote:
> It will also have to store heapkeyspace, of course. And minusinfkey.
> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.
> For a second, I thought that you forgot to invalidate the bounds_valid
> flag, because you didn't pass it directly, by value to
> _bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move
right within _bt_findinsertloc() while the regression tests run. We
always land on the correct leaf page to begin with. (It works with
unique index insertions, where scantid is NULL when we descend the
tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just
missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there
are perhaps dozens of matches -- this is not relevant to
_bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of
making it unlikely that _bt_check_unique()-wise duplicates will cross
leaf pages, so there will almost always be one leaf page to visit.
And, the !minusinfkey optimization ensures that the only reason we'll
move right is because of a concurrent page split, within
_bt_moveright().

The buffer lock coupling move to the right that _bt_findinsertloc()
does should be considered an edge case with all of these measures, at
least with v4 indexes.
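
To spell out that last point: the move-right loop in _bt_moveright()
boils down to something like the following (a condensed sketch, not the
exact code -- the real loop also finishes incomplete splits and skips
half-dead pages):

    cmpval = key->nextkey ? 0 : 1;
    for (;;)
    {
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        if (P_RIGHTMOST(opaque))
            break;
        if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
        {
            /*
             * High key doesn't bound our scankey.  With scantid filled in
             * and the !minusinfkey optimization, this should only happen
             * when a concurrent split moved our keyspace to the right.
             */
            buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
            continue;
        }
        break;              /* this is the page the key belongs on */
    }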

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:
> After staring at the first patch for a bit longer, a few things started to
> bother me:
>
> * The new struct is called BTScanInsert, but it's used for searches,
> too. It makes sense when you read the README, which explains the
> difference between "search scan keys" and "insertion scan keys", but now
> that we have a separate struct for this, perhaps we should call insertion scan
> keys by a less confusing name. I don't know what to suggest, though.
> "Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

> * We store the binary search bounds in BTScanInsertData, but they're
> only used during insertions.
>
> * The binary search bounds are specific for a particular buffer. But
> that buffer is passed around separately from the bounds. It seems easy
> to have them go out of sync, so that you try to use the cached bounds
> for a different page. The savebinsrch and restorebinsrch is used to deal
> with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state
in the insertion scankey to restrict a search is an idea that could
work well in at least one other way. That really isn't a once-off
thing, even though it looks that way.

> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.
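
In other words, the insertion scan key would end up with roughly these
fields (a sketch to show which fields we're talking about, not the exact
layout from the patch):

    typedef struct BTScanInsertData
    {
        bool        heapkeyspace;   /* rel has v4 keyspace (heap TID is a key column)? */
        bool        minusinfkey;    /* descend to leftmost page among equal pivots? */
        bool        nextkey;        /* position after, rather than on, first match? */
        ItemPointer scantid;        /* tie-breaker heap TID, or NULL */
        int         keysz;          /* number of entries in scankeys[] */
        ScanKeyData scankeys[INDEX_MAX_KEYS];   /* must appear last */
    } BTScanInsertData;

    typedef BTScanInsertData *BTScanInsert;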

> The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value to
_bt_useduplicatepage().

> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

I'll have to think about that some more. Having a separate
_bt_binsrch_insert() may be worth it, but I'll need to do some
profiling.

> Hmm. Perhaps it would be better to move the call to _bt_binsrch (or
> _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
> that _bt_findinsertloc would only be responsible for finding the correct
> page to insert to. So the overall code, after patch #2, would be like:

Maybe, but as I said it's not like _bt_findinsertloc() doesn't know
all about unique indexes already. This is pointed out in a comment in
_bt_doinsert(), even. I guess that it might have to be changed to say
_bt_findinsertpage() instead, with your new approach.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
> _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>
> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
> newitemoff, false);
>
> Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

I suppose that this could be an improvement, provided we keep
assertions that verify that the work "_bt_findinsertpage()" would have
done, had it been called, was in fact unnecessary (e.g., checking the
high key/rightmost-ness).

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow.
Need to sleep on it.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 06/03/2019 04:03, Peter Geoghegan wrote:

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas  wrote:

I'm looking at the first patch in the series now. I'd suggest that you
commit that very soon. It's useful on its own, and seems pretty much
ready to be committed already. I don't think it will be much affected by
whatever changes we make to the later patches, anymore.


After staring at the first patch for a bit longer, a few things started to 
bother me:


* The new struct is called BTScanInsert, but it's used for searches, 
too. It makes sense when you read the README, which explains the 
difference between "search scan keys" and "insertion scan keys", but now 
that we have a separate struct for this, perhaps we should call insertion scan 
keys by a less confusing name. I don't know what to suggest, though. 
"Positioning key"?


* We store the binary search bounds in BTScanInsertData, but they're 
only used during insertions.


* The binary search bounds are specific for a particular buffer. But 
that buffer is passed around separately from the bounds. It seems easy 
to have them go out of sync, so that you try to use the cached bounds 
for a different page. The savebinsrch and restorebinsrch is used to deal 
with that, but it is pretty complicated.



I came up with the attached (against master), which addresses the 2nd 
and 3rd points. I added a whole new BTInsertStateData struct, to hold 
the binary search bounds. BTScanInsert now only holds the 'scankeys' 
array, and the 'nextkey' flag. The new BTInsertStateData struct also 
holds the current buffer we're considering to insert to, and a 
'bounds_valid' flag to indicate if the saved bounds are valid for the 
current buffer. That way, it's more straightforward to clear the 
'bounds_valid' flag whenever we move right.


I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary 
search like _bt_binsrch does, but the bounds caching is only done in 
_bt_binsrch_insert. Seems more clear to have separate functions for them 
now, even though there's some duplication.
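
The gist of the insert variant is just the bounds caching at the end.
Roughly (an abridged sketch of the idea, not the exact code in the
attached patch):

    low = P_FIRSTDATAKEY(opaque);
    high = PageGetMaxOffsetNumber(page) + 1;    /* one past the last slot */
    stricthigh = high;
    cmpval = 1;                 /* insertions never use nextkey semantics */

    while (high > low)
    {
        OffsetNumber mid = low + ((high - low) / 2);
        int32        result = _bt_compare(rel, key, page, mid);

        if (result >= cmpval)
            low = mid + 1;
        else
        {
            high = mid;
            if (result != 0)
                stricthigh = high;
        }
    }

    /* remember the bounds for reuse when the caller goes on to insert */
    insertstate->low = low;
    insertstate->stricthigh = stricthigh;
    insertstate->bounds_valid = true;

    return low;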



+/* HEIKKI:
+Do we need 'checkunique' as an argument? If unique checks were not
+performed, the insertion key will simply not have saved state.
+*/


We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).


Hmm. Perhaps it would be better to move the call to _bt_binsrch (or 
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So 
that _bt_findinsertloc would only be responsible for finding the correct 
page to insert to. So the overall code, after patch #2, would be like:


/*
 * Do the insertion. First move right to find the correct page to
 * insert to, if necessary. If we're inserting to a non-unique index,
 * _bt_search() already did this when it checked if a move to the
 * right was required for leaf page.  Insertion scankey's scantid
 * would have been filled out at the time. On a unique index, the
 * current buffer is the first buffer containing duplicates, however,
 * so we may need to move right to the correct location for this
 * tuple.
 */
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, 
newitemoff, false);


Does this make sense?


Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.


The attached new version simplifies this, IMHO. The bounds and the 
current buffer go together in the same struct, so it's easier to keep 
track whether the bounds are valid or not.



- * starting a regular index scan some can be omitted.  The array is used as a
+ * starting a regular index scan, some can be omitted.  The array is used as a
   * flexible array member, though it's sized in a way that makes it possible to
   * use stack allocations.  See nbtree/README for full details.
+
+HEIKKI: I don't see anything in the README about stack allocations. What
+exactly does the README reference refer to? No code seems to actually allocate
+this in the stack, so we don't 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas  wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point.  I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape.  It sounds like
> really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes is reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan  wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point.  I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape.  It sounds like
really good stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-05 Thread Heikki Linnakangas
I'm looking at the first patch in the series now. I'd suggest that you 
commit that very soon. It's useful on its own, and seems pretty much 
ready to be committed already. I don't think it will be much affected by 
whatever changes we make to the later patches, anymore.


I did some copy-editing of the code comments, see attached patch which 
applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. 
Mostly, to use more Plain English: use active voice instead of passive, 
split long sentences, avoid difficult words.


I also had a few comments and questions on some details. I added them in 
the same patch, marked with "HEIKKI:". Please take a look.


- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89a..eb4df2ebbe6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -609,6 +609,9 @@ original search scankey is consulted as each index entry is sequentially
 scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
+HEIKKI: The above probably needs some updating, now that we have a
+separate BTScanInsert struct to represent an insertion scan key.
+
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
 all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b3fbba276dd..2a2d6576060 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -97,9 +97,12 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.  If rel is a unique index, then every call
- *		here is a checkingunique call (i.e. every call does a duplicate
- *		check, though perhaps only a tentative check).
+ *		don't actually insert.
+
+HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
+weird to talk about it in the function comment. I didn't understand what
+the point of adding this sentence was, so I removed it.
+
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -285,9 +288,10 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
 		/*
-		 * Do the insertion.  Note that itup_key contains mutable state used
-		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
-		 * binary search.  !checkingunique case must start own binary search.
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start its own binary
+		 * search.
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, , checkingunique,
 	   itup, stack, heapRel);
@@ -311,10 +315,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
- * reuse most of the work of our initial binary search to find conflicting
- * tuples.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -326,6 +326,10 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
 _bt_check_unique(Relation rel, BTScanInsert itup_key,
@@ -352,8 +356,8 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	/*
-	 * Save binary search bounds.  Note that this is also used within
-	 * _bt_findinsertloc() later.
+	 * Save binary search bounds.  We use them in the fastpath below, but
+	 * also in the _bt_findinsertloc() call later.
 	 */
 	itup_key->savebinsrch = true;
 	offset = _bt_binsrch(rel, itup_key, buf);
@@ -375,16 +379,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 		if (offset <= maxoff)
 		{
 			/*
-			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
-			 * consideration to items that are definitely duplicates in most
-			 * cases (though not when original page is empty, or when initial
-			 * offset is past the end of the original page, which may indicate
-			 * that we'll have to examine a second or subsequent page).
+			 * 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-04 Thread Peter Geoghegan
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas  wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here; that just
> confuses the reader.

Fixed in v14.

> > /*
> >  *_bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We want to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.
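
Concretely, the tie-breaker at the end of _bt_compare() amounts to
something like this, once all untruncated key attributes have compared
equal (a simplified sketch, not the exact code from the patch):

    if (key->scantid == NULL)
        return 0;       /* no scantid -- caller never "peeks" at heap TID */

    heapTid = BTreeTupleGetHeapTID(itup);
    if (heapTid == NULL)
        return 1;       /* pivot with truncated TID sorts before any real TID */

    return ItemPointerCompare(key->scantid, heapTid);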

There are a few specific cross-version things that we need to think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-03 Thread Heikki Linnakangas
Some comments on 
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. 
Mostly about code comments. In general, I think a round of copy-editing 
the comments, to use simpler language, would do good. The actual code 
changes look good to me.



/*
 *  _bt_findinsertloc() -- Finds an insert location for a tuple
 *
 *  On entry, *bufptr contains the page that the new tuple 
unambiguously
 *  belongs on.  This may not be quite right for callers that just 
called
 *  _bt_check_unique(), though, since they won't have initially 
searched
 *  using a scantid.  They'll have to insert into a page somewhere 
to the
 *  right in rare cases where there are many physical duplicates in 
a
 *  unique index, and their scantid directs us to some page full of
 *  duplicates to the right, where the new tuple must go.  
(Actually,
 *  since !heapkeyspace pg_upgraded'd non-unique indexes never get a
 *  scantid, they too may require that we move right.  We treat them
 *  somewhat like unique indexes.)


Seems confusing to first say assertively that "*bufptr contains the page 
that the new tuple unambiguously belongs to", and then immediately go on 
to list a whole bunch of exceptions. Maybe just remove "unambiguously".



@@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
 * If this page was incompletely split, finish the 
split now. We
 * do this while holding a lock on the left sibling, 
which is not
 * good because finishing the split could be a fairly 
lengthy
-* operation.  But this should happen very seldom.
+* operation.  But this should only happen when 
inserting into a
+* unique index that has more than an entire page for 
duplicates
+* of the value being inserted.  (!heapkeyspace 
non-unique indexes
+* are an exception, once again.)
 */
if (P_INCOMPLETE_SPLIT(lpageop))
{


This happens very seldom, because you only get an incomplete split if 
you crash in the middle of a page split, and that should be very rare. I 
don't think we need to list more fine-grained conditions here; that just 
confuses the reader.



/*
 *  _bt_useduplicatepage() -- Settle for this page of duplicates?
 *
 *  Prior to PostgreSQL 12/Btree version 4, heap TID was never 
treated
 *  as a part of the keyspace.  If there were many tuples of the 
same
 *  value spanning more than one leaf page, a new tuple of that same
 *  value could legally be placed on any one of the pages.
 *
 *  This function handles the question of whether or not an 
insertion
 *  of a duplicate into a pg_upgrade'd !heapkeyspace index should
 *  insert on the page contained in buf when a choice must be made.
 *  Preemptive microvacuuming is performed here when that could 
allow
 *  caller to insert on to the page in buf.
 *
 *  Returns true if caller should proceed with insert on buf's page.
 *  Otherwise, caller should move on to the page to the right 
(caller
 *  must always be able to still move right following call here).
 */


So, this function is only used for legacy pg_upgraded indexes. The 
comment implies that, but doesn't actually say it.



/*
 * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
 * and non-pivot tuples, despite differences in how heap TID is represented.
 *
 * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
 * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
 * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
 * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
 * probably also contain a heap TID at the end of the tuple.  We currently
 * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
 * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
 * tuple), but it might also be used by non-pivot tuples in the future.
 * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
 * tuples that actually originated with the truncation of one or more
 * attributes.
 */
#define BTreeTupleGetHeapTID(itup) ...


The comment claims that "all pivot tuples must be as of BTREE_VERSION 
4". I thought that all internal tuples are called pivot tuples, even on 
version 3. I think what this means to say is that this macro is only 
used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only 
have a heap TID in BTREE_VERSION 4 indexes.


This macro (and many others in nbtree.h) is quite complicated. A static 
inline function might be easier to 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-03 Thread Heikki Linnakangas

On 26/02/2019 12:31, Peter Geoghegan wrote:

On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas  wrote:

I spent some time first trying to understand the current algorithm, and
then rewriting it in a way that I find easier to understand. I came up
with the attached. I think it optimizes for the same goals as your
patch, but the approach is quite different.


Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once.


Great, looks much simpler now, indeed! Now I finally understand the 
algorithm.



I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?


If the performance is acceptable without it, let's not bother. We can 
optimize later.


What would be the worst case scenario for this? Splitting a page that 
has as many tuples as possible, I guess, so maybe inserting into a table 
with a single-column index, with 32k BLCKSZ. Have you done performance 
testing on something like that?



Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)


Rounding up the allocation to BLCKSZ seems like a premature 
optimization. Even if it saved some cycles, I don't think it's worth the 
trouble of having to explain all that in the comment.


I think you could change the curdelta, leftfree, and rightfree fields in 
SplitPoint to int16, to make the array smaller.
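
That is, something along these lines (a sketch based on my reading of the
patch; the field names may not match v13 exactly):

    typedef struct
    {
        /* details of free space left by the split */
        int16       curdelta;       /* current leftfree/rightfree delta */
        int16       leftfree;       /* space left on left page post-split */
        int16       rightfree;      /* space left on right page post-split */

        /* split point identifying fields */
        OffsetNumber firstoldonright;   /* first item on new right page */
        bool        newitemonleft;      /* new item goes on left or right page? */
    } SplitPoint;

With int16 fields that packs into about 10 bytes per candidate split,
rather than 16 or more, which adds up when hundreds of candidates are
materialized for a single page split.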



Other changes from your prototype:

*  I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you've allocating space for hundreds of
items in a critical path like this.


Ok.


* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?


My patch treated the new item the same as all the old items, in 
_bt_checksplitloc(), so it didn't need newitemisonright. You still need 
it with your approach.



Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.


I looked at it very briefly. Yeah, it's pretty simple now. Nice!


About this comment on _bt_findsplitloc():


/*
*   _bt_findsplitloc() -- find an appropriate place to split a page.
*
* The main goal here is to equalize the free space that will be on each
* split page, *after accounting for the inserted tuple*.  (If we fail to
* account for it, we might find ourselves with too little room on the page
* that it needs to go into!)
*
* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full.  In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.
* This is the same as nbtsort.c produces for a newly-created tree.  Note
* that leaf and nonleaf pages use different fillfactors.
*
* We are passed the intended insert position of the new tuple, expressed as
* the offsetnumber of the tuple it must go in front of (this could be
* maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
* 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-02-13 Thread Peter Geoghegan
On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan  wrote:
> Notable improvements in v12:

I've been benchmarking v12, once again using a slightly modified
BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since
the problems with index bloat don't take so long to manifest
themselves when the indexes are inserted into incrementally from the
very beginning. This benchmarking process took over 20 hours, with a
database that started off at about 90GB (700 TPC-C/BenchmarkSQL
warehouses were used). That easily exceeded available main memory on
my test server, which was 32GB. This is a pretty I/O bound workload,
and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB,
NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only
tablespace.

Importantly, I figured out that I should disable both hash joins and
merge joins with BenchmarkSQL, in order to force all joins to be
nested loop joins. Otherwise, the "stock level" transaction eventually
starts to use a hash join, even though that's about 10x slower than a
nestloop join (~4ms vs. ~40ms on this machine) -- the hash join
produces a lot of noise without really testing anything. It usually
takes a couple of hours before we start to get obviously-bad plans,
but it also usually takes about that long until the patch series
starts to noticeably overtake the master branch. I don't think that
TPC-C will ever benefit from using a hash join or a merge join, since
it's supposed to be a pure OLTP benchmark, and is a benchmark that
MySQL is known to do at least respectably-well on.

This is the first benchmark I've published that was considerably I/O
bound. There are significant improvements in performance across the
board, on every measure, though it takes several hours for that to
really show. The benchmark was not rate-limited. 16
clients/"terminals" are used throughout. There were 5 runs for master
and 5 for patch, interlaced, each lasting 2 hours. Initialization
occurred once, so it's expected that both databases will gradually get
larger across runs.

Summary (appears in same order as the execution of each run) -- each
run is 2 hours, so 20 hours total excluding initial load time (2 hours
* 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC
(NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC
(NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC
(NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC
(NewOrders) = 36508.38<-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC
(NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC
(NewOrders) = 34021.98   <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC
(NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC
(NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC
(NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders)
= 30885.56  <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for
the patch -- see the full html reports in the provided tar archive for
full details (URL provided below). The html reports have nice SVG
graphs, generated by BenchmarkSQL using R -- one for transaction
throughput, and another for transaction latency. The overall picture
is that the patched version starts out ahead, and has a much more
gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction
in blocks read into shared_buffers for the duration of these runs. For
example, here is what pg_stat_database shows for run 3 (I reset the
stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517  <-- 10.7%
reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are
evicted in each case. pg_stat_bgwriter indicates that more buffers are
written out during checkpoints, while fewer are written out by
backends. I won't speculate further on what all of this means right
now, though.

You can find the raw details for blks_read for each and every run in
the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each
run, logs for each run, etc. There's more information than anybody
else is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy
to assist in that. There is a shell script (zsh) included in the tar
archive, although that will need to be changed a bit to point to the
correct installations and so on. Independent 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-02-05 Thread Peter Geoghegan
On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan  wrote:
> Thanks for going to the trouble of implementing what you have in mind!
>
> I agree that the code that I wrote within nbtsplitloc.c is very
> subtle, and I do think that I have further work to do to make its
> design clearer. I think that this high level description of the goals
> of the algorithm is inaccurate in subtle but important ways, though.
> Hopefully there will be a way of making it more understandable that
> preserves certain important characteristics.

Heikki and I had the opportunity to talk about this recently. We found
an easy way forward. I believe that the nbtsplitloc.c algorithm itself
is fine -- the code will need to be refactored, though.

nbtsplitloc.c can be refactored to assemble a list of legal split
points up front, before deciding which one to go with in a separate
pass (using one of two "alternative modes", as before). I now
understand that Heikki simply wants to separate the questions of "Is
this candidate split point legal?" from "Is this known-legal candidate
split point good/ideal based on my current criteria?". This seems like
a worthwhile goal to me. Heikki accepts the need for multiple
modes/passes, provided recursion isn't used in the implementation.

It's clear to me that the algorithm should start off trying to split
towards the middle of the page (or towards the end in the rightmost
case), while possibly making a small compromise on the exact split
point to maximize the effectiveness of suffix truncation. We must
change strategy entirely if and only if the middle of the page (or
wherever we'd like to split initially) is found to be completely full
of duplicates -- that's where the need for a second pass comes in.
This should almost never happen in most applications. Even when it
happens, we only care about not splitting inside a group of
duplicates. That's not the same thing as caring about maximizing the
number of attributes truncated away. Those two things seem similar,
but are actually very different.
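
In rough outline, the refactored control flow would look something like
this (the helper names here are placeholders for illustration, not the
actual nbtsplitloc.c functions):

    /* Pass 1: record every legal split point for this page + new tuple */
    nlegal = record_all_legal_split_points(page, newitem, splits);

    /* Default mode: aim near the middle (or fillfactor% split when rightmost) */
    split = best_split_by_delta(splits, nlegal);

    /* Change strategy only when that region is packed with duplicates */
    if (split_interval_is_all_duplicates(page, splits, split))
        split = best_split_avoiding_heap_tid_in_new_high_key(splits, nlegal);

    return split;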

It might have sounded like Heikki and I disagreed on the design of the
algorithm at a high level, or what its goals ought to be. That is not
the case, though. (At least not so far.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-28 Thread Peter Geoghegan
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas  wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different. At a very high level, I
> believe the goals can be described as:
>
> 1. Find out how much suffix truncation is possible, i.e. how many key
> columns can be truncated away, in the best case, among all possible ways
> to split the page.
>
> 2. Among all the splits that achieve that optimum suffix truncation,
> find the one with smallest "delta".

Thanks for going to the trouble of implementing what you have in mind!

I agree that the code that I wrote within nbtsplitloc.c is very
subtle, and I do think that I have further work to do to make its
design clearer. I think that this high level description of the goals
of the algorithm is inaccurate in subtle but important ways, though.
Hopefully there will be a way of making it more understandable that
preserves certain important characteristics. If you had my test
cases/data that would probably help you a lot (more on that later).

The algorithm I came up with almost always does these two things in
the opposite order, with each considered in clearly separate phases.
There are good reasons for this. We start with the same criteria as
the old algorithm. We assemble a small array of candidate split
points, rather than one split point, but otherwise it's almost exactly
the same (the array is sorted by delta). Then, at the very end, we
iterate through the small array to find the best choice for suffix
truncation. IOW, we only consider suffix truncation as a *secondary*
goal. The delta is still by far the most important thing 99%+ of the
time. I assume it's fairly rare to not have two distinct tuples within
9 or so tuples of the delta-wise optimal split position -- 99% is
probably a low estimate, at least in OLTP apps, or within unique
indexes. I see that you do something with a "good enough" delta that
seems like it also makes delta the most important thing, but that
doesn't appear to be, uh, good enough. ;-)
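
In rough outline, the final step looks something like the sketch below
(made-up types, not the patch's real ones). The candidate array is
already sorted by delta, so suffix truncation only ever acts as a
secondary criterion:

typedef struct
{
    int     firstright;     /* candidate split point */
    int     delta;          /* free space imbalance; array is sorted on this */
    int     pivotnatts;     /* key attributes the new pivot would need */
} Candidate;

/*
 * 'cand' is a small array (~10 entries) sorted by delta, ascending.  Pick
 * the entry whose pivot needs the fewest attributes; ties resolve in favor
 * of the lower delta simply because of the sort order.
 */
static int
final_split_choice(const Candidate *cand, int ncand)
{
    int     best = 0;

    for (int i = 1; i < ncand; i++)
    {
        if (cand[i].pivotnatts < cand[best].pivotnatts)
            best = i;
    }
    return cand[best].firstright;
}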

Now, it's true that my approach does occasionally work in a way close
to what you describe above -- it does this when we give up on default
mode and check "how much suffix truncation is possible?" exhaustively,
for every possible candidate split point. "Many duplicates" mode kicks
in when we need to be aggressive about suffix truncation. Even then,
the exact goals are different to what you have in mind in subtle but
important ways. While "truncating away the heap TID" isn't really a
special case in other places, it is a special case for my version of
nbtsplitloc.c, which more or less avoids it at all costs. Truncating
away heap TID is more important than truncating away any other
attribute by a *huge* margin. Many duplicates mode *only* specifically
cares about truncating the final TID attribute. That is the only thing
that is ever treated as more important than delta, though even there
we don't forget about delta entirely. That is, we assume that the
"perfect penalty" is nkeyatts when in many duplicates mode, because we
don't care about suffix truncation beyond heap TID truncation by then.
So, if we find 5 split points out of 250 in the final array that avoid
appending heap TID, we use the earliest/lowest delta out of those 5.
We're not going to try to maximize the number of *additional*
attributes that get truncated, because that can make the leaf pages
unbalanced in an *unbounded* way. None of these 5 split points are
"good enough", but the distinction between their deltas still matters
a lot. We strongly prefer a split with a *mediocre* delta to a split
with a *terrible* delta -- a bigger high key is the least of our
worries here. (I made similar mistakes myself months ago, BTW.)
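
Spelled out as code (again with made-up types rather than the patch's
own), the many duplicates rule amounts to this:

#include <stdbool.h>

typedef struct
{
    int     firstright;
    int     delta;                  /* array is sorted on this, ascending */
    bool    pivot_needs_heap_tid;   /* would the new pivot need a heap TID? */
} DupCandidate;

/*
 * "Perfect penalty" is nkeyatts: any candidate whose pivot avoids the heap
 * TID is acceptable, and among those we just take the lowest delta.  We
 * never try to maximize additional attribute truncation here.
 */
static int
many_duplicates_choice(const DupCandidate *cand, int ncand)
{
    for (int i = 0; i < ncand; i++)
    {
        if (!cand[i].pivot_needs_heap_tid)
            return cand[i].firstright;  /* earliest == lowest delta */
    }
    return cand[0].firstright;  /* nothing avoids heap TID; lowest delta */
}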

Your version of the algorithm makes a test case of mine (UK land
registry test case [1]) go from having an index that's 1101 MB with my
version of the algorithm/patch and 1329 MB on the master branch to an
index that's 3030 MB in size. I think that this happens because it
effectively fails to give any consideration to delta at all at certain
points. On leaf pages with lots of unique keys, your algorithm does
about as well as mine because all possible split points look equally
good suffix-truncation-wise, plus you have the "good enough" test, so
delta isn't ignored. I think that your algorithm also works well when
there are many duplicates but only one non-TID index column, since the
heap TID truncation versus other truncation issue does not arise. The
test case I used is an index on "(county, city, locality)", though --
low cardinality, but more than a single column. That's a *combination*
of two separate considerations, that seem to get conflated. I don't
think that you can avoid doing "a second pass" in some sense, because
these really are separate 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-28 Thread Heikki Linnakangas

On 09/01/2019 02:47, Peter Geoghegan wrote:

On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan  wrote:

On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas  wrote:

I'm envisioning that you have an array, with one element for each item
on the page (including the tuple we're inserting, which isn't really on
the page yet). In the first pass, you count up from left to right,
filling the array. Next, you compute the complete penalties, starting
from the middle, walking outwards.



Ah, right. I think I see what you mean now.



Leave it with me. I'll need to think about this some more.


Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.


I spent some time first trying to understand the current algorithm, and 
then rewriting it in a way that I find easier to understand. I came up 
with the attached. I think it optimizes for the same goals as your 
patch, but the approach  is quite different. At a very high level, I 
believe the goals can be described as:


1. Find out how much suffix truncation is possible, i.e. how many key 
columns can be truncated away, in the best case, among all possible ways 
to split the page.


2. Among all the splits that achieve that optimum suffix truncation, 
find the one with smallest "delta".


For performance reasons, it doesn't actually do it in that order. It's 
more like this:


1. First, scan all split positions, recording the 'leftfree' and 
'rightfree' at every valid split position. The array of possible splits 
is kept in order by offset number. (This scans through all items, but 
the math is simple, so it's pretty fast)


2. Compute the optimum suffix truncation, by comparing the leftmost and 
rightmost keys, among all the possible split positions.


3. Split the array of possible splits in half, and process both halves 
recursively. The recursive process "zooms in" to the place where we'd 
expect to find the best candidate, but will ultimately scan through all 
split candidates, if no "good enough" match is found.
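
To illustrate step 3, here is a toy version of the recursion (this is
just a sketch to show the shape, not the code from the attached patch):
visit the half that contains the expected best position first, and stop
as soon as a good enough penalty turns up.

#include <stdbool.h>

typedef struct
{
    int     offset;     /* split position, kept in page order */
    int     penalty;    /* computed lazily in the real thing */
} CandSplit;

#define GOOD_ENOUGH     8   /* arbitrary threshold for this sketch */

/* Returns true once a good-enough split is seen; *best tracks the best so far */
static bool
zoom_in(const CandSplit *splits, int lo, int hi, int target, int *best)
{
    if (lo > hi)
        return false;
    if (lo == hi)
    {
        if (*best < 0 || splits[lo].penalty < splits[*best].penalty)
            *best = lo;
        return splits[lo].penalty <= GOOD_ENOUGH;
    }

    int     mid = lo + (hi - lo) / 2;

    /* "Zoom in": recurse into the half containing the target first */
    if (target <= mid)
        return zoom_in(splits, lo, mid, target, best) ||
            zoom_in(splits, mid + 1, hi, target, best);
    else
        return zoom_in(splits, mid + 1, hi, target, best) ||
            zoom_in(splits, lo, mid, target, best);
}

/* A caller starts with best = -1: zoom_in(splits, 0, n - 1, n / 2, &best) */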



I've only been testing this on leaf splits. I didn't understand how the 
penalty worked for internal pages in your patch. In this version, the 
same algorithm is used for leaf and internal pages. I'm sure this still 
has bugs in it, and could use some polishing, but I think this will be 
more readable way of doing it.



What have you been using to test this? I wrote the attached little test 
extension, to explore what _bt_findsplitloc() decides in different 
scenarios. It's pretty rough, but that's what I've been using while 
hacking on this. It prints output like this:


postgres=# select test_split();
NOTICE:  test 1:
left2/358: 1 0
left  358/358: 1 356
right   1/ 51: 1 357
right  51/ 51: 1 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 2:
left2/358: 0 0
left  358/358: 356 356
right   1/ 51: 357 357
right  51/ 51: 407 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 3:
left2/358: 0 0
left  320/358: 10 10  SPLIT TUPLE
left  358/358: 48 48
right   1/ 51: 49 49
right  51/ 51: 99 99
split ratio: 12/87

NOTICE:  test 4:
left2/309: 1 100
left  309/309: 1 407  SPLIT TUPLE
right   1/100: 2 0
right 100/100: 2 99
split ratio: 24/75

Each test consists of creating a temp table with one index, and 
inserting rows in a certain pattern, until the root page splits. It then 
prints the first and last tuples on both pages, after the split, as well 
as the tuple that caused the split. I don't know if this is useful to 
anyone but myself, but I thought I'd share it.


- Heikki
/*-
 *
 * nbtsplitloc.c
 *	  Choose split point code for Postgres btree implementation.
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/access/nbtree/nbtsplitloc.c
 *
 *-
 */
#include "postgres.h"

#include "access/nbtree.h"
#include "storage/lmgr.h"

typedef struct
{
	/* FindSplitData candidate split */
	int			leftfree;
	int			rightfree;
	OffsetNumber firstoldonright;
	bool		newitemonleft;
	IndexTuple	lastleft_tuple;
	IndexTuple	firstright_tuple;
}			SplitPoint;

typedef struct
{
	/* context data for _bt_checksplitloc */
	Relation	rel;
	bool		is_leaf;		/* T if splitting a leaf page */
	OffsetNumber newitemoff;	/* where the new item is to be inserted */
	int			leftspace;		/* space available for items on left page */
	int			rightspace;		/* space available for items on right page */
	int			dataitemstotal; /* space taken by all items, old and new */

	int			ncandidates;	/* current 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-04 Thread Peter Geoghegan
Hi Alexander,

On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov
 wrote:
> I'm starting to look at this patchset.  Not ready to post detail
> review, but have a couple of questions.

Thanks for taking a look!

> Yes, it shouldn't be too hard, but it seems like we have to keep two
> branches of code for different handling of duplicates.  Is that true?

Not really. If you take a look at v9, you'll see the approach I've
taken is to make insertion scan keys aware of which rules apply (the
"heapkeyspace" field field controls this). I think that there are
about 5 "if" statements for that outside of amcheck. It's pretty
manageable.

I like to imagine that the existing code already has unique keys, but
nobody ever gets to look at the final attribute. It works that way
most of the time -- the only exception is insertion with user keys
that aren't unique already. Note that the way we move left on equal
pivot tuples, rather than right (rather than following the pivot's
downlink) wasn't invented by Postgres to deal with the lack of unique
keys. That's actually a part of the Lehman and Yao design itself.
Almost all of the special cases are optimizations rather than truly
necessary infrastructure.

> I didn't get the point of this paragraph.  Does it might happen that
> first right tuple is under tuple size restriction, but new pivot tuple
> is beyond that restriction?  If so, would we have an error because of
> too long pivot tuple?  If not, I think this needs to be explained
> better.

The v9 version of the function _bt_check_third_page() shows what it
means (comments on this will be improved in v10, too). The old limit
of 2712 bytes still applies to pivot tuples, while a new, lower limit
of 2704 bytes applies to non-pivot tuples. This difference is
necessary because an extra MAXALIGN() quantum could be needed to add a
heap TID to a pivot tuple during truncation in the worst case. To
users, the limit is 2704 bytes, because that's the limit that actually
needs to be enforced during insertion.
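
As back-of-the-envelope arithmetic (assuming the usual 8-byte MAXALIGN
and a 6-byte ItemPointerData; these are illustrative definitions, not the
macros from the patch), the relationship between the two limits is:

#include <stdio.h>

#define MAXALIGN(LEN)   (((unsigned) (LEN) + 7) & ~((unsigned) 7))

int
main(void)
{
    int     nonpivot_limit = 2704;              /* enforced during insertion */
    int     heap_tid_overhead = MAXALIGN(6);    /* worst case: one quantum, 8 bytes */
    int     pivot_limit = nonpivot_limit + heap_tid_overhead;

    /* prints 2712 -- the old "1/3 of a page" limit, still honored by pivots */
    printf("pivot tuple limit: %d\n", pivot_limit);
    return 0;
}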

We never actually say "1/3 of a page means 2704 bytes" in the docs,
since the definition was always a bit fuzzy. There will need to be a
compatibility note in the release notes, though.
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-01-04 Thread Alexander Korotkov
Hi!

I'm starting to look at this patchset.  Not ready to post detail
review, but have a couple of questions.

On Wed, Sep 19, 2018 at 9:24 PM Peter Geoghegan  wrote:
> I still haven't managed to add pg_upgrade support, but that's my next
> step. I am more or less happy with the substance of the patch in v5,
> and feel that I can now work backwards towards figuring out the best
> way to deal with on-disk compatibility. It shouldn't be too hard --
> most of the effort will involve coming up with a good test suite.

Yes, it shouldn't be too hard, but it seems like we have to keep two
branches of code for different handling of duplicates.  Is that true?

+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.

I didn't get the point of this paragraph.  Does it might happen that
first right tuple is under tuple size restriction, but new pivot tuple
is beyond that restriction?  If so, would we have an error because of
too long pivot tuple?  If not, I think this needs to be explained
better.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-28 Thread Heikki Linnakangas

On 29/12/2018 01:04, Peter Geoghegan wrote:

However, naively computing the penalty upfront for every offset would be
a bit wasteful. Instead, start from the middle of the page, and walk
"outwards" towards both ends, until you find a "good enough" penalty.


You can't start at the middle of the page, though.

You have to start at the left (though you could probably start at the
right instead). This is because of page fragmentation -- it's not
correct to assume that the line pointer offset into tuple space on the
page (firstright line pointer lp_off for candidate split point) tells
you anything about what the space delta will be after the split. You
have to exhaustively add up the free space before the line pointer
(the free space for all earlier line pointers) before seeing if the
line pointer works as a split point, since each previous line
pointer's tuple could be located anywhere in the original page's tuple
space (anywhere to the left or to the right of where it would be in
the simple/unfragmented case).


Right. You'll need to do the free space computations from left to right, 
but once you have done that, you can compute the penalties in any order.


I'm envisioning that you have an array, with one element for each item 
on the page (including the tuple we're inserting, which isn't really on 
the page yet). In the first pass, you count up from left to right, 
filling the array. Next, you compute the complete penalties, starting 
from the middle, walking outwards.
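
Something like the following sketch (stand-in types, not real nbtree
code): the first pass copes with page fragmentation because it never
looks at lp_off, and the second pass walks outwards from the middle
until the penalty is good enough.

#include <stdlib.h>

#define MAX_ITEMS   512     /* plenty for a sketch; assumes nitems <= this */

/* toy penalty: just the free space imbalance */
static int
penalty(int leftsz, int rightsz)
{
    return abs(leftsz - rightsz);
}

static int
find_split(const int *itemsz, int nitems, int good_enough)
{
    int     leftsz[MAX_ITEMS];  /* bytes on the left if we split before item i */
    int     total = 0;

    /* Pass 1: accumulate sizes from left to right */
    for (int i = 0; i < nitems; i++)
    {
        leftsz[i] = total;
        total += itemsz[i];
    }

    /* Pass 2: compute penalties from the middle, walking outwards */
    int     mid = nitems / 2;
    int     best = mid;
    int     bestpenalty = penalty(leftsz[mid], total - leftsz[mid]);

    for (int step = 1; step < nitems && bestpenalty > good_enough; step++)
    {
        int     i = (step % 2) ? mid + (step + 1) / 2 : mid - step / 2;

        if (i <= 0 || i >= nitems)
            continue;

        int     p = penalty(leftsz[i], total - leftsz[i]);

        if (p < bestpenalty)
        {
            bestpenalty = p;
            best = i;
        }
    }
    return best;
}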


That's not so different from what you're doing now, but I find it more 
natural to explain the algorithm that way.


- Heikki



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-28 Thread Peter Geoghegan
On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas  wrote:
> I spent some time reviewing this. I skipped the first patch, to add a
> column to pg_depend, and I got through patches 2, 3 and 4. Impressive
> results, and the code looks sane.

Thanks! I really appreciate your taking the time to do such a thorough review.

You were right to skip the first patch, because there is a fair chance
that it won't be used in the end. Tom is looking into the pg_depend
problem that I paper over with the first patch.

> I wrote a laundry list of little comments on minor things, suggested
> rewordings of comments etc. I hope they're useful, but feel free to
> ignore/override my opinions of any of those, as you see best.

I think that that feedback is also useful, and I'll end up using 95%+
of it. Much of the information I'm trying to get across is very
subtle.

> But first, a few slightly bigger (medium-sized?) issues that caught my eye:
>
> 1. How about doing the BTScanInsertData refactoring as a separate
> commit, first? It seems like a good thing for readability on its own,
> and would slim the big main patch. (And make sure to credit Andrey for
> that idea in the commit message.)

Good idea. I'll do that.

> This 'assumeheapkeyspace' flag feels awkward. What if the caller knows
> that it is a v3 index? There's no way to tell _bt_mkscankey() that.
> (There's no need for that, currently, but seems a bit weird.)

This is there for CREATE INDEX -- we cannot access the metapage during
an index build. We'll only be able to create new v4 indexes with the
patch applied, so we can assume that heap TID is part of the key space
safely.
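
In other words, the decision boils down to something like this
hypothetical helper (the names here are invented for illustration; they
are not from the patch):

#include <stdbool.h>

static bool
index_is_heapkeyspace(int btree_version, bool assumeheapkeyspace)
{
    /*
     * During CREATE INDEX the metapage cannot be consulted, so the caller
     * vouches for the fact that a newly built index follows the v4
     * ("heapkeyspace") rules.  Otherwise, trust the on-disk version.
     */
    if (assumeheapkeyspace)
        return true;
    return btree_version >= 4;
}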

> _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which
> calls _bt_mkscankey(). It's holding a lock on the page being split. Do
> we risk deadlock by locking the metapage at the same time?

I already had vague concerns along the same lines. I am also concerned
about index_getprocinfo() calls that happen in the same code path,
with a buffer lock held. (SP-GiST's doPickSplit() function can be
considered a kind of precedent that makes the second issue okay, I
suppose.)

See also: My later remarks on the use of "authoritative comparisons"
from this same e-mail.

> I don't have any great ideas on what to do about this, but it's awkward
> as it is. Can we get away without the new argument? Could we somehow
> arrange things so that rd_amcache would be guaranteed to already be set?

These are probably safe in practice, but the way that we rely on them
being safe from a distance is a concern. Let me get back to you on
this.

> 3. In the "Pick nbtree split points discerningly" patch
>
> I find the different modes and the logic in _bt_findsplitloc() very hard
> to understand. I've spent a while looking at it now, and I think I have
> a vague understanding of what things it takes into consideration, but I
> don't understand why it performs those multiple stages, what each stage
> does, and how that leads to an overall strategy. I think a rewrite would
> be in order, to make that more understandable. I'm not sure what exactly
> it should look like, though.

I've already refactored that a little bit for the upcoming v10. The
way _bt_findsplitloc() state is initially set up becomes slightly more
streamlined. It still works in the same way, though, so you'll
probably only think that the new version is a minor improvement.
(Actually, v10 focuses on making _bt_splitatnewitem() a bit less
magical, at least right now.)

> If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or
> SINGLE_VALUE modes, it has to redo a lot of the work that was done in
> the DEFAULT mode already. That's probably not a big deal in practice,
> performance-wise, but I feel that it's another hint that some
> refactoring would be in order.

The logic within _bt_findsplitloc() has been very hard to refactor all
along. You're right that there is a fair amount of redundant-ish work
that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform.
The idea is to not burden the common DEFAULT case, and to keep the
control flow relatively simple.

I'm sure that if I was in your position I'd say something similar. It
is complicated in subtle ways, that looks like they might not matter,
but actually do. I am working off a fair variety of test cases, which
really came in handy. I remember thinking that I'd simplified it a
couple of times back in August or September, only to realize that I'd
regressed a case that I cared about. I eventually realized that I
needed to come up with a comprehensive though relatively fast test
suite, which seems essential for refactoring _bt_findsplitloc(), and
maybe even for fully understanding how _bt_findsplitloc() works.

Another complicating factor is that I have to worry about the number
of cycles used under a buffer lock (not just the impact on space
utilization).

With all of that said, I am willing to give it another try. You've
seen opportunities to refactor that I missed before now. 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-28 Thread Heikki Linnakangas

On 04/12/2018 05:10, Peter Geoghegan wrote:

Attached is v9, ...


I spent some time reviewing this. I skipped the first patch, to add a 
column to pg_depend, and I got through patches 2, 3 and 4. Impressive 
results, and the code looks sane.


I wrote a laundry list of little comments on minor things, suggested 
rewordings of comments etc. I hope they're useful, but feel free to 
ignore/override my opinions of any of those, as you see best.


But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate 
commit, first? It seems like a good thing for readability on its own, 
and would slim the big main patch. (And make sure to credit Andrey for 
that idea in the commit message.)



2. In the "Treat heap TID as part of the nbtree key space" patch:


  * Build an insertion scan key that contains comparison data from 
itup
  * as well as comparator routines appropriate to the key datatypes.
  *
+ * When itup is a non-pivot tuple, the returned insertion scan key 
is
+ * suitable for finding a place for it to go on the leaf level.  
When
+ * itup is a pivot tuple, the returned insertion scankey is 
suitable
+ * for locating the leaf page with the pivot as its high key (there
+ * must have been one like it at some point if the pivot tuple
+ * actually came from the tree).
+ *
+ * Note that we may occasionally have to share lock the metapage, 
in
+ * order to determine whether or not the keys in the index are 
expected
+ * to be unique (i.e. whether or not heap TID is treated as a 
tie-breaker
+ * attribute).  Callers that cannot tolerate this can request that 
we
+ * assume that this is a heapkeyspace index.
+ *
  * The result is intended for use with _bt_compare().
  */
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)


This 'assumeheapkeyspace' flag feels awkward. What if the caller knows 
that it is a v3 index? There's no way to tell _bt_mkscankey() that. 
(There's no need for that, currently, but seems a bit weird.)


_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which 
calls _bt_mkscankey(). It's holding a lock on the page being split. Do 
we risk deadlock by locking the metapage at the same time?


I don't have any great ideas on what to do about this, but it's awkward 
as it is. Can we get away without the new argument? Could we somehow 
arrange things so that rd_amcache would be guaranteed to already be set?



3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard 
to understand. I've spent a while looking at it now, and I think I have 
a vague understanding of what things it takes into consideration, but I 
don't understand why it performs those multiple stages, what each stage 
does, and how that leads to an overall strategy. I think a rewrite would 
be in order, to make that more understandable. I'm not sure what exactly 
it should look like, though.


If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or 
SINGLE_VALUE modes, it has to redo a lot of the work that was done in 
the DEFAULT mode already. That's probably not a big deal in practice, 
performance-wise, but I feel that it's another hint that some 
refactoring would be in order.


One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at 
that location. Like the current code does. For each offset, calculate a 
"penalty" based on two factors:


* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to 
store the heap TID


Define the penalty function so that having to add a heap TID to the 
pivot tuple is considered very expensive, more expensive than anything 
else, and truncating away other attributes gives a reward of some size.
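
One purely illustrative way to write that down (arbitrary constants;
nothing here is from the patch):

#include <stdbool.h>
#include <stdlib.h>

#define ATTR_KEPT_COST  64      /* each key attribute the pivot must keep */
#define HEAP_TID_COST   100000  /* dwarfs anything the other terms can save */

static int
split_penalty(int leftfree, int rightfree, int pivot_natts,
              bool pivot_needs_heap_tid)
{
    int     penalty = abs(leftfree - rightfree);

    /* charging for kept attributes is the same as rewarding truncation */
    penalty += pivot_natts * ATTR_KEPT_COST;
    if (pivot_needs_heap_tid)
        penalty += HEAP_TID_COST;
    return penalty;
}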


However, naively computing the penalty upfront for every offset would be 
a bit wasteful. Instead, start from the middle of the page, and walk 
"outwards" towards both ends, until you find a "good enough" penalty.


Or something like that...


Now, the laundry list of smaller items:

- laundry list begins -

1st commits commit message:


Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, at least in principle.


What do you mean by "at least in principle"?


Secondary index insertions will descend
straight to the leaf page that they'll insert on to (unless there is a
concurrent page split).


What is a "Secondary" index insertion?


Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-03 Thread Peter Geoghegan
On Mon, Dec 3, 2018 at 7:10 PM Peter Geoghegan  wrote:
> Attached is v9, which does things that way. There are no interesting
> changes, though I have set things up so that a later patch in the
> series can add "dynamic prefix truncation" -- I do not include any
> such patch in v9, though. I'm going to start a new thread on that
> topic, and include the patch there, since it's largely unrelated to
> this work, and in any case still isn't in scope for Postgres 12 (the
> patch is still experimental, for reasons that are of general
> interest).

The dynamic prefix truncation thread that I started:

https://postgr.es/m/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-01 Thread Peter Geoghegan
On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Just for the information, cfbot says there are problems on windows:
>
> src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
> undeclared identifier

Thanks. Looks like I should have used PG_INT32_MAX.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-12-01 Thread Dmitry Dolgov
> On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan  wrote:
>
> Attached is v8 of the patch series, which has some relatively minor changes:

Thank you for working on this patch,

Just for the information, cfbot says there are problems on windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
undeclared identifier



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-04 Thread Peter Geoghegan
On Sun, Nov 4, 2018 at 8:21 AM Andrey Lepikhov
 wrote:
> I mean that your code has no problems that I can detect by
> regression tests or by the retail index tuple deletion patch.
> The difference in the number of dropped objects is not a problem. It is
> caused by pos 2293 - 'else if (thisobj->objectSubId == 0)' - in
> catalog/dependency.c, and it is legal behavior: the column row object is
> deleted without any report because we already decided to drop its whole
> table.

The behavior implied by using ASC heap TID order is always "legal",
but it may cause a regression in certain functionality -- something
that an ordinary user might complain about. There were some changes
when DESC heap TID order is used too, of course, but those were safe
to ignore (it seemed like nobody could ever care). It might have been
okay to just use DESC order, but since it now seems like I must use
ASC heap TID order for performance reasons, I have to tackle a couple
of these issues head-on (e.g.  'cannot drop trigger trg1').

> Also, I checked the triggers test. The difference in the ERROR message
> 'cannot drop trigger trg1' is caused by a different order of tuples in
> the relation with the dependDependerIndexId relid. It is legal behavior,
> and we can simply replace the test results.

Let's look at this specific "trg1" case:

"""
 create table trigpart (a int, b int) partition by range (a);
 create table trigpart1 partition of trigpart for values from (0) to (1000);
 create trigger trg1 after insert on trigpart for each row execute
procedure trigger_nothing();
 ...
 drop trigger trg1 on trigpart1; -- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger
trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table
trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
"""

The original hint suggests "you need to drop the object on the
partition parent instead of its child", which is useful. The new hint
suggests "instead of dropping the trigger on the partition child,
maybe drop the child itself!". That's almost an insult to the user.

Now, I suppose that I could claim that it's not my responsibility to
fix this, since we get the useful behavior only due to accidental
implementation details. I'm not going to take that position, though. I
think that I am obliged to follow both the letter and the spirit of
the law. I'm almost certain that this regression test was written
because somebody specifically cared about getting the original, useful
message. The underlying assumptions may have been a bit shaky, but we
all know how common it is for software to evolve to depend on
implementation-defined details. We've all written code that does it,
but hopefully it didn't hurt us much because we also wrote regression
tests that exercised the useful behavior.

> Maybe you know of other problems with the patch?

Just the lack of pg_upgrade support. That is progressing nicely,
though. I'll probably have that part in the next revision of the
patch. I've found what looks like a workable approach, though I need
to work on a testing strategy for pg_upgrade.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-04 Thread Andrey Lepikhov




On 04.11.2018 9:31, Peter Geoghegan wrote:

On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
 wrote:

I applied your patches on top of master. After correcting the tests
(related to TID ordering in the DROP...CASCADE operation on index
relations), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test - the 'drop cascades to
62 other objects' problem - I verified an Álvaro Herrera hypothesis [1],
and it is true. You can verify it by tracking the
object_address_present_add_flags() routine's return value.


I'll have to go and fix the problem directly, so that ASC sort order
can be used.


Some doubts remain, however, regarding the 'triggers' test.
Could you specify which test failures you mean?


Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.


I mean that your code has no problems that I can detect by regression
tests or by the retail index tuple deletion patch.
The difference in the number of dropped objects is not a problem. It is
caused by pos 2293 - 'else if (thisobj->objectSubId == 0)' - in
catalog/dependency.c, and it is legal behavior: the column row object is
deleted without any report because we already decided to drop its whole
table.


Also, I checked the triggers test. The difference in the ERROR message
'cannot drop trigger trg1' is caused by a different order of tuples in
the relation with the dependDependerIndexId relid. It is legal behavior,
and we can simply replace the test results.


Maybe you know of other problems with the patch?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-03 Thread Peter Geoghegan
On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
 wrote:
> I applied your patches on top of master. After correcting the tests
> (related to TID ordering in the DROP...CASCADE operation on index
> relations), 'make check-world' passed successfully many times.
> In the case of the 'create view' regression test - the 'drop cascades
> to 62 other objects' problem - I verified an Álvaro Herrera hypothesis
> [1], and it is true. You can verify it by tracking the
> object_address_present_add_flags() routine's return value.

I'll have to go and fix the problem directly, so that ASC sort order
can be used.

> Some doubts remain, however, regarding the 'triggers' test.
> Could you specify which test failures you mean?

Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.

Since I'm going to have to fix the problem head-on, I'll have to study
it in detail anyway.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-03 Thread Peter Geoghegan
On Fri, Nov 2, 2018 at 5:00 PM Peter Geoghegan  wrote:
> I had the opportunity to discuss this patch at length with Heikki
> during pgConf.EU.

> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used.

I've found that TPC-C testing with ASC heap TID order fixes the
regression that I've been concerned about these past few weeks. Making
this change leaves the patch a little bit faster than the master
branch for TPC-C, while still leaving TPC-C indexes about as small as
they were with v6 of the patch (i.e. much smaller). I now get about a
1% improvement in transaction throughput, an improvement that seems
fairly consistent. It seems likely that the next revision of the patch
series will be an unambiguous across the board win for performance. I
think that I come out ahead with ASC heap TID order because that has
the effect of reducing the volume of WAL generated by page splits.
Page splits are already optimized for splitting right, not left.

I should thank Heikki for pointing me in the right direction here.

--
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-03 Thread Andrey Lepikhov




On 03.11.2018 5:00, Peter Geoghegan wrote:

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)


I applied your patches on top of master. After correcting the tests
(related to TID ordering in the DROP...CASCADE operation on index
relations), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test - the 'drop cascades to
62 other objects' problem - I verified an Álvaro Herrera hypothesis [1],
and it is true. You can verify it by tracking the
object_address_present_add_flags() routine's return value.

Some doubts remain, however, regarding the 'triggers' test.
Could you specify which test failures you mean?

[1] 
https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-02 Thread Peter Geoghegan
On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov
 wrote:
> Documentation is full and clear. All non-trivial logic is commented
> accurately.

Glad you think so.

I had the opportunity to discuss this patch at length with Heikki
during pgConf.EU. I don't want to speak on his behalf, but I will say
that he seemed to understand all aspects of the patch series, and
seemed generally well disposed towards the high level design. The
high-level design is the most important aspect -- B-Trees can be
optimized in many ways, all at once, and we must be sure to come up
with something that enables most or all of them. I really care about
the long term perspective.

That conversation with Heikki eventually turned into a conversation
about reimplementing GIN using the nbtree code, which is actually
related to my patch series (sorted on heap TID is the first step to
optional run length encoding for duplicates). Heikki seemed to think
that we can throw out a lot of the optimizations within GIN, and add a
few new ones to nbtree, while still coming out ahead. This made the
general nbtree-as-GIN idea (which we've been talking about casually
for years) seem a lot more realistic to me. Anyway, he requested that
I support this long term goal by getting rid of the DESC TID sort
order thing -- that breaks GIN-style TID compression. It also
increases the WAL volume unnecessarily when a page is split that
contains all duplicates.

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)

> Patch applies cleanly on top of current master. Regression tests passed
> and my "Retail Indextuple deletion" use cases works without mistakes.

Cool.

> The new BTScanInsert structure reduces the parameter lists of many
> functions and looks fine. But it contains an optimization part (the
> 'restorebinsrch' field et al.) that is used very locally in the code -
> the _bt_findinsertloc()->_bt_binsrch() call path. Maybe you could
> localize this logic into a separate struct, passed to _bt_binsrch() as a
> pointer; other routines could then pass NULL. That might simplify use of
> the struct.

Hmm. I see your point. I did it that way because the knowledge of
having cached an upper and lower bound for a binary search of a leaf
page needs to last for a relatively long time. I'll look into it
again, though.

> Due to the optimization, _bt_binsrch() has roughly doubled in size.
> Maybe you could move this to some service routine?

Maybe. There are some tricky details that seem to work against it.
I'll see if it's possible to polish that some more, though.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-11-02 Thread Andrey Lepikhov

I am doing the code review.
For now, it covers the first patch - v6-0001... - which is dedicated to 
logical duplicate ordering.


Documentation is full and clear. All non-trivial logic is commented 
accurately.


Patch applies cleanly on top of current master. Regression tests passed 
and my "Retail Indextuple deletion" use cases works without mistakes.

But I have two comments on the code.
The new BTScanInsert structure reduces the parameter lists of many 
functions and looks fine. But it contains an optimization part (the 
'restorebinsrch' field et al.) that is used very locally in the code - 
the _bt_findinsertloc()->_bt_binsrch() call path. Maybe you could 
localize this logic into a separate struct, passed to _bt_binsrch() as a 
pointer; other routines could then pass NULL. That might simplify use of 
the struct.
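
For example, something like this hypothetical struct (only
'restorebinsrch' comes from your patch; the other names are invented
here for illustration):

#include <stdbool.h>

typedef unsigned short OffsetNumber;    /* stand-in for the PostgreSQL typedef */

/*
 * Cached bounds from a previous binary search of the same leaf page.
 * _bt_findinsertloc() could pass this to _bt_binsrch(); all other callers
 * would simply pass NULL.
 */
typedef struct BTBinsrchCache
{
    bool            restorebinsrch; /* reuse the bounds below? */
    OffsetNumber    low;            /* cached lower bound */
    OffsetNumber    high;           /* cached upper bound */
} BTBinsrchCache;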


Due to the optimization, _bt_binsrch() has roughly doubled in size. 
Maybe you could move this to some service routine?



--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-23 Thread Peter Geoghegan
On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov
 wrote:
> I have the same problem with the background heap & index cleaner (based
> on your patch). In that case the bottleneck is the WAL record that I
> need to write for each cleaned block, and the locks that are held while
> the WAL record is being written.

Part of the problem here is that v6 uses up to 25 candidate split
points, even during regularly calls to _bt_findsplitloc(). That was
based on some synthetic test-cases. I've found that I can get most of
the benefit in index size with far fewer spilt points, though. The
extra work done with an exclusive buffer lock held will be
considerably reduced in v7. I'll probably post that in a couple of
weeks, since I'm in Europe for pgConf.EU. I don't fully understand the
problems here, but even still I know that what you were testing wasn't
very well optimized for write-heavy workloads. It would be especially
bad with pgbench, since there isn't much opportunity to reduce the
size of indexes there.

> Maybe you will do a test without writing any data to disk?

Yeah, I should test that on its own. I'm particularly interested in
TPC-C, because it's a particularly good target for my patch. I can
find a way of only executing the read TPC-C queries, to see where they
are on their own. TPC-C is particularly write-heavy, especially
compared to the much more recent though less influential TPC-E
benchmark.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-23 Thread Andrey Lepikhov




On 19.10.2018 0:54, Peter Geoghegan wrote:

I would welcome any theories as to what could be the problem here. I'm
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.


I have the same problem with the background heap & index cleaner (based 
on your patch). In that case the bottleneck is the WAL record that I need 
to write for each cleaned block, and the locks that are held while the 
WAL record is being written.

Maybe you will do a test without writing any data to disk?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-19 Thread Peter Geoghegan
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund  wrote:
> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I tried this. It looks like we're calling FlushBuffer() with more than
a single LWLock held (not just the single buffer lock) somewhat *less*
with the patch. This is a positive sign for the patch, but also means
that I'm no closer to figuring out what's going on.

I tested a case with a 1GB shared_buffers + a TPC-C database sized at
about 10GB. I didn't want the extra LOG instrumentation to influence
the outcome.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund  wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that. How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's
backend_flush_after=0 in the benchmark I posted. See the
"postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                         List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                          List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0"
here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this.

Thanks
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public'
is the patch case. Sorry for not mentioning it initially.

--
Peter Geoghegan
(Sent from my phone)


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Andres Freund
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on
that. How large is the database here, and how high is shared_buffers?


> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there's new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO
when evicting the previous contents of the victim buffer, and obviously
that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.


> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
> 
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear on which runs are what here. I assume "public" is your
patchset, and master is master? Do you reset the stats in between runs?

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan  wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark
these past few weeks, which has been very frustrating -- the picture
is very mixed. I'm testing a patch that has evolved from v6, but isn't
too different.

In one way, the patch does exactly what it's supposed to do when these
benchmarks are run: it leaves indexes *significantly* smaller than the
master branch will on the same (rate-limited) workload, without
affecting the size of tables in any noticeable way. The numbers that I
got from my much earlier synthetic single client benchmark mostly hold
up. For example, the stock table's primary key is about 35% smaller,
and the order line index is only about 20% smaller relative to master,
which isn't quite as good as in the synthetic case, but I'll take it
(this is all because of the
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).
However, despite significant effort, and despite the fact that the
index shrinking is reliable, I cannot yet consistently show an
increase in either transaction throughput, or transaction latency.

I can show a nice improvement in latency on a slightly-rate-limited
TPC-C workload when backend_flush_after=0 (something like a 40%
reduction on average), but that doesn't hold up when oltpbench isn't
rate-limited and/or has backend_flush_after set. Usually, there is a
1% - 2% regression, despite the big improvements in index size, and
despite the big reduction in the amount of buffers that backends must
write out themselves.

The obvious explanation is that throughput is decreased due to our
doing extra work (truncation) while under an exclusive buffer lock.
However, I've worked hard on that, and, as I said, I can sometimes
observe a nice improvement in latency. This makes me doubt the obvious
explanation. My working theory is that this has something to do with
shared_buffers eviction. Maybe we're making worse decisions about
which buffer to evict, or maybe the scalability of eviction is hurt.
Perhaps both.

You can download results from a recent benchmark to get some sense of
this. It includes latency and throughput graphs, plus detailed
statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I'm
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.

--
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-30 Thread Peter Geoghegan
On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov
 wrote:
> I am reviewing this patch too, and I join Peter Eisentraut's opinion
> about splitting the patch into a hierarchy of two or three patches:
> "functional" (the tid stuff) and "optimizational" (suffix truncation &
> splitting). My reasons are simplification of code review, investigation,
> and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap
TID stuff from the suffix truncation stuff. At least not without
making the patch even more complicated, for no benefit. I will split
out the "brain" of the patch (the _bt_findsplitloc() stuff, which
decides on a split point using sophisticated rules) from the "brawn"
(the actually changes to how index scans work, including the heap TID
stuff, as well as the code for actually physically performing suffix
truncation). The brain of the patch is where most of the complexity
is, as well as most of the code. The brawn of the patch is _totally
unusable_ without intelligence around split points, but I'll split
things up along those lines anyway. Doing so should make the whole
design a little easier to follow.

> Benchmarking is not clear-cut right now: possible performance
> degradation from TID ordering interferes with the positive effects of
> the optimizations in a non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've
been using pgbench, which didn't show any. That's not a sympathetic
case for the patch, though it would be nice to confirm if there was
some small improvement there. I've seen contradictory results (slight
improvements and slight regressions), but that was with a much earlier
version, so it just isn't relevant now. pgbench is mostly interesting
as a thing that we want to avoid regressing.

Once I post the next version, it would be great if somebody could use
HammerDB's OLTP test, which seems like the best fair use
implementation of TPC-C that's available. I would like to make that
the "this is why you should care, even if you happen to not believe in
the patch's strategic importance" benchmark. TPC-C is clearly the most
influential database benchmark ever, so I think that that's a fair
request. (See the TPC-C commentary at
https://www.hammerdb.com/docs/ch03s02.html, for example.)

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-28 Thread Andrey Lepikhov

28.09.2018 23:08, Peter Geoghegan wrote:

On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
 wrote:

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.


I'll do my best. I don't think I can sensibly split out suffix
truncation from the TID stuff -- those seem truly inseparable, since
my mental model for suffix truncation breaks without fully unique
keys. I can break out all the cleverness around choosing a split point
into its own patch, though -- _bt_findsplitloc() has only been changed
to give weight to several factors that become important. It's the
"brain" of the optimization, where 90% of the complexity actually
lives.

Removing the _bt_findsplitloc() changes will make the performance of
the other stuff pretty poor, and in particular will totally remove the
benefit for cases that "become tired" on the master branch. That could
be slightly interesting, I suppose.


I am reviewing this patch too, and I join Peter Eisentraut's opinion 
about splitting the patch into a hierarchy of two or three patches: 
"functional" (the tid stuff) and "optimizational" (suffix truncation & 
splitting). My reasons are simplification of code review, investigation, 
and benchmarking.
Benchmarking is not clear-cut right now: possible performance degradation 
from TID ordering interferes with the positive effects of the 
optimizations in a non-trivial way.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-28 Thread Peter Geoghegan
On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
 wrote:
> So.  I don't know much about the btree code, so don't believe anything I
> say.

I think that showing up and reviewing this patch makes you somewhat of
an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily
as a project to make B-Trees less bloated when there are many
duplicates, with maybe as many as a dozen or more secondary benefits.
That's what I thought it would say in the release notes, even though
the patch was always a broader strategic thing. Now I think that the
TPC-C multiple insert point bloat issue might be the primary headline
benefit, though.

I hate to add more complexity to get it to work well, but just look at
how much smaller the indexes are following an initial bulk load (bulk
insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're
seeing the full picture for TPC-C bulk loading here (actually, there
is another index that has an identical definition to
oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted
as redundant). As you can see, all the largest indexes are
*significantly* smaller, with the exception of
oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this
improvement until I post the next version, though, since this is a
brand new development. Note that VACUUM hasn't been run at all, and
doesn't need to be run, as there are no dead tuples. Note also that
this has *nothing* to do with getting tired -- almost all of these
indexes are unique indexes.

Note that I'm also testing TPC-E and TPC-H in a very similar way,
which have both been improved noticeably, but to a degree that's much
less compelling than what we see with TPC-C. They have "getting tired"
cases that benefit quite a bit, but those are the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I
think that HammerDB might be the way to go with TPC-C testing
Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search.  This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location.  This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well,
because while the TIDs may seem to be random in some sense, they have
various properties that are very useful from a high level, data life
cycle point of view. For insertions of duplicates, heap TID has
temporal locality --  you are only going to dirty one or two leaf
pages, rather than potentially dirtying dozens or hundreds.
Furthermore, heap TID is generally strongly correlated with primary
key values, so VACUUM can be much much more effective at killing
duplicates in low cardinality secondary indexes when there are DELETEs
with a range predicate on the primary key. This is a lot more
realistic than the 2018-07-09 test case, but it could still make just
as big a difference.

>  I didn't quite get this to fully work yet, but at least it
> doesn't blow up, and it gets the same regression test ordering
> differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating
a new problem. There are plenty of instances of "\set VERBOSITY terse"
in the regression tests already, for the same reason. If you run the
regression tests with ignore_system_indexes=on, there are very similar
failures [1].

> As far as the code is concerned, I agree with Andrey Lepikhov that one
> more abstraction layer that somehow combines the scankey and the tid or
> some combination like that would be useful, instead of passing the tid
> as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear
improvement. You can expect it in the next version.

> I think it might help this patch move along if it were split up a bit,
> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
> That way, it would also be easier to test out each piece separately.
> For example, 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-28 Thread Peter Eisentraut
On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So.  I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more.  The current method for
inserting a duplicate value into a btree is to go to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case we just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios when we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.
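
For reference, the loop in question (in _bt_findinsertloc()) looks
roughly like this simplified sketch -- not a verbatim copy, and with the
move-right step elided:

/*
 * Simplified sketch of the "getting tired" search for free space among
 * pages full of duplicates.  Loosely based on _bt_findinsertloc(); not
 * verbatim.
 */
while (PageGetFreeSpace(page) < itemsz)
{
    /*
     * Stop walking right if this is the rightmost page, if the page's
     * high key no longer matches the value being inserted, or -- with
     * roughly 1% probability per step -- because we've gotten "tired".
     * Stopping means splitting the current page, even if there is free
     * space on a page further to the right.
     */
    if (P_RIGHTMOST(opaque) ||
        _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
        random() <= (MAX_RANDOM_VALUE / 100))
        break;

    /* ... otherwise lock the right sibling and continue from there ... */
}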

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful.  You can use that to
postpone the bloat, but you won't stop it, and performance becomes terrible.

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search.  This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a
random location.  This should be equivalent since tids are effectively
random.  I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
 That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.  In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-24 Thread Peter Geoghegan
On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan  wrote:
> 3 modes
> ---
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective though totally complementary
to the other 3 -- "multiple insertion point" mode. Credit goes to
Kevin Grittner for pointing out that this technique exists about 2
years ago [1]. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges.  This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.
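
In code terms the choice is roughly the following sketch (both helper
names are hypothetical placeholders, not functions from the patch):

/*
 * Sketch of the "multiple insertion point" split choice described
 * above.  The helpers are hypothetical; the real heuristics for
 * detecting a localized ascending run still need to be worked out.
 */
OffsetNumber firstright;

if (looks_like_local_ascending_run(page, newitemoff, newitem))
    firstright = OffsetNumberNext(newitemoff);  /* new tuple ends the left half */
else
    firstright = choose_default_split_point(page, newitemoff, newitem);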

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
 Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
this is something that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out of control bloat on when people run
TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that there are multiple
pass-by-value attributes for the index, with the new/incoming tuple
having distinct-to-immediate-left-tuple attribute values for the last
column, but not the first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.

[1] 
https://postgr.es/m/CACjxUsN5fV0kV=yirxwa0s7lqoojuy7soptipdhucemhgwo...@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-20 Thread Peter Geoghegan
On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov
 wrote:
> Note that the interface of the _bt_moveright() and _bt_binsrch() functions,
> with the combination of scankey, scantid and nextkey parameters, carries
> too much semantic load.
> Every time I read the code I have to spend time remembering what these
> functions do exactly.
> Maybe the comments need to be rewritten.

I think that it might be a good idea to create an "BTInsertionScankey"
struct, or similar, since keysz, nextkey, the scankey array and now
scantid are all part of that, and are all common to these 4 or so
functions. It could have a flexible array at the end, so that we still
only need a single palloc(). I'll look into that.
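
Something along these lines, purely as a sketch of that idea (the name
and fields are illustrative, not a final design):

/*
 * Sketch of an insertion scan key struct that bundles the state which
 * currently travels as separate arguments.  Illustrative only.
 */
typedef struct BTInsertionScankey
{
    bool        nextkey;        /* land on first page > scankey, not >= ? */
    ItemPointer scantid;        /* heap TID tiebreaker, or NULL */
    int         keysz;          /* number of entries in scankeys[] */
    ScanKeyData scankeys[FLEXIBLE_ARRAY_MEMBER];    /* keeps it to one palloc() */
} BTInsertionScankey;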

> What do you think about submitting the patch to the next CF?

Clearly the project that you're working on is a difficult one. It's
easy for me to understand why you might want to take an iterative
approach, with lots of prototyping. Your patch needs attention to
advance, and IMV the CF is the best way to get that attention. So, I
think that it would be fine to go submit it now.

I must admit that I didn't even notice that your patch lacked a CF
entry. Everyone has a different process, perhaps.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-09-19 Thread Andrey Lepikhov
I am using the v5 version in the quick vacuum strategy and in the heap
cleaner (new patches will appear on the corresponding thread a little
later). It works fine and gives quick vacuum a 2-3% performance gain
over version v3 on my 24-core test server.
Note that the interface of the _bt_moveright() and _bt_binsrch()
functions, with the combination of scankey, scantid and nextkey
parameters, carries too much semantic load. Every time I read the code
I have to spend time remembering what these functions do exactly.
Maybe the comments need to be rewritten. For example, the
_bt_moveright() comments could include a phrase like:
nextkey=false: traverse to the next suitable index page if the current
page does not contain the value (scan key; scan tid).


What do you think about submitting the patch to the next CF?

12.09.2018 23:11, Peter Geoghegan wrote:

Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.
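
As a very rough illustration of the kind of naive probing described
above (just a sketch of the general idea, not the patch's actual code):

/*
 * Illustrative sketch only: measure how "discriminating" a candidate
 * split point is by the length of the byte prefix shared by the
 * prospective last tuple on the left and first tuple on the right.  A
 * shorter shared prefix means the tuples diverge earlier, which makes
 * the split point more attractive for suffix truncation.
 */
static size_t
shared_byte_prefix(const char *lastleft, size_t leftlen,
                   const char *firstright, size_t rightlen)
{
    size_t  smaller = (leftlen < rightlen) ? leftlen : rightlen;
    size_t  i;

    for (i = 0; i < smaller; i++)
    {
        if (lastleft[i] != firstright[i])
            break;
    }

    return i;
}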

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way that can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] 
https://postgr.es/m/cah2-wzmf0fvvhu+sszpgw4qe9t--j_dmxdx3it5jcdb8ff2...@mail.gmail.com



--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-08-01 Thread Peter Geoghegan
On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov
 wrote:
> I am using the v3 version of the patch for retail index tuple deletion, and
> from time to time I hit a regression test failure (see attachment).
> As far as I can see in regression.diff, the problem is an unstable ordering
> of DROP ... CASCADE deletions.
> Most frequently I get the failure on the 'updatable views' test.
> I check the nbtree invariants during all tests, but the index relations are
> in a consistent state the whole time.
> My hypothesis is that the ordering of logical duplicates in indexes on the
> pg_depend relation is unstable.
> But the 'updatable views' test does not contain any sources of instability:
> concurrent insertions, updates, vacuum and so on. This fact discourages me.
> Do you have any ideas about this problem?

It's caused by an implicit dependency on the order of items in an
index. See 
https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it
happens in a new place. It seems to have finally stopped happening.
Note that this is a preexisting issue; there are already places in the
regression tests where we paper over the problem in a similar way. I
notice that it tends to happen when the machine running the regression
tests is heavily loaded.

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-08-01 Thread Andrey Lepikhov
I am using the v3 version of the patch for retail index tuple deletion,
and from time to time I hit a regression test failure (see attachment).
As far as I can see in regression.diff, the problem is an unstable
ordering of DROP ... CASCADE deletions.

Most frequently I get the failure on the 'updatable views' test.
I check the nbtree invariants during all tests, but the index relations
are in a consistent state the whole time.
My hypothesis is that the ordering of logical duplicates in indexes on
the pg_depend relation is unstable.
But the 'updatable views' test does not contain any sources of
instability: concurrent insertions, updates, vacuum and so on. This
fact discourages me.

Do you have any ideas about this problem?


18.07.2018 00:21, Peter Geoghegan wrote:

Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to
_bt_compare().
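
Concretely, any TID value that no heap tuple can ever have will do as
the sentinel; the sketch below uses block 0 with offset number 0 purely
as an illustration of the idea, not necessarily the patch's exact
choice:

/*
 * Sketch of a sentinel scan TID: offset number 0 is never a valid heap
 * TID offset, so (0, 0) sorts below every real heap TID while still
 * being an explicit value rather than the "minus infinity" that an
 * absent scantid represents.  Illustrative only.
 */
ItemPointerData sentinel_scantid;

ItemPointerSet(&sentinel_scantid, 0, InvalidOffsetNumber);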

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps its related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look credible as a stand-alone
patch, but I have to admit that there are probably still significant
gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-07-02 Thread Peter Geoghegan
On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan  wrote:
> I attach an unfinished prototype of suffix truncation, that also
> sometimes *adds* a new attribute in pivot tuples. It adds an extra
> heap TID from the leaf level when truncating away non-distinguishing
> attributes during a leaf page split, though only when it must. The
> patch also has nbtree treat heap TID as a first class part of the key
> space of the index. Claudio wrote a patch that did something similar,
> though without the suffix truncation part [2] (I haven't studied his
> patch, to be honest). My patch is actually a very indirect spin-off of
> Anastasia's covering index patch, and I want to show what I have in
> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members. On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I attach a revised version, which is still very much of prototype
quality, but manages to solve a few of the problems that v1 had.
Andrey Lepikhov (CC'd) asked me to post any improved version I might
have for use with his retail index tuple deletion patch, so I thought
I'd post what I have.

The main development for v2 is that the sort order of the implicit
heap TID attribute is flipped. In v1, it was in "ascending" order. In
v2, comparisons of heap TIDs are inverted to make the attribute order
"descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are
repeated insertions of duplicates. Currently, this tends to result in
page splits of the leftmost leaf page among pages that mostly consist
of the same duplicated value. This means that the destabilizing impact
on DROP SCHEMA ... CASCADE regression test output noted before [1] is
totally eliminated. There is now only a single trivial change to
regression test "expected" files, whereas in v1 dozens of "expected"
files had to be changed, often resulting in less useful reports for
the user.

* The performance regression I observed with various pgbench workloads
seems to have gone away, or is now within the noise range. A patch
like this one requires a lot of validation and testing, so this should
be taken with a grain of salt.
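
To be concrete about what "inverted" means, the tiebreaker comparison
is simply negated, roughly like this minimal sketch (not the patch's
exact code):

/*
 * Minimal sketch of a descending heap TID tiebreaker: negate the usual
 * ItemPointer comparison so that larger TIDs sort first.  Illustrative
 * only.
 */
static inline int
bt_compare_scantid_desc(ItemPointer itup_tid, ItemPointer scan_tid)
{
    return -(int) ItemPointerCompare(itup_tid, scan_tid);
}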

I may have been too quick to give up on my original ambition of
writing a stand-alone patch that can be justified entirely on its own
merits, without being tied to some much more ambitious project like
retail index tuple deletion by VACUUM, or zheap's deletion marking. I
still haven't tried to replace the kludgey handling of unique index
enforcement, even though that would probably have a measurable
additional performance benefit. I think that this patch could become
an unambiguous win.

[1] 
https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-mvcm_ph...@mail.gmail.com
-- 
Peter Geoghegan


v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch
Description: Binary data


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Peter Geoghegan
On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila  wrote:
> Both values will be present in the index, but the old value will be
> delete-marked.  It is correct that we can't remove the value (index
> tuple) from the index until it is truly dead (not visible to anyone),
> but during a delete or index-update operation, we need to traverse the
> index to mark the entries as delete-marked.  See, at this stage, I
> don't want to go in too much detail discussion of how delete-marking
> will happen in zheap and also I am not sure this thread is the right
> place to discuss details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.


-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Amit Kapila
On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan  wrote:
> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila  wrote:
>>> I imagine that retail index tuple deletion (the whole point of this
>>> project) would be run by a VACUUM-like process that kills tuples that
>>> are dead to everyone. Even with something like zheap, you cannot just
>>> delete index tuples until you establish that they're truly dead. I
>>> guess that the delete marking stuff that Robert mentioned marks tuples
>>> as dead when the deleting transaction commits.
>>>
>>
>> No, I don't think that is the case because we want to perform in-place
>> updates for indexed-column-updates.  If we won't delete-mark the index
>> tuple before performing in-place update, then we will have two tuples
>> in the index which point to the same heap-TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using
> some now-obsolete key values get to the heap tuple via an index scan
> if there are no index tuples that stick around until "recently dead"
> heap tuples become "fully dead"? How can you avoid keeping around both
> old and new index tuples at the same time?
>

Both values will be present in the index, but the old value will be
delete-marked.  It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked.  See, at this stage, I
don't want to go in too much detail discussion of how delete-marking
will happen in zheap and also I am not sure this thread is the right
place to discuss details of that technology.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Peter Geoghegan
On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila  wrote:
>> I imagine that retail index tuple deletion (the whole point of this
>> project) would be run by a VACUUM-like process that kills tuples that
>> are dead to everyone. Even with something like zheap, you cannot just
>> delete index tuples until you establish that they're truly dead. I
>> guess that the delete marking stuff that Robert mentioned marks tuples
>> as dead when the deleting transaction commits.
>>
>
> No, I don't think that is the case because we want to perform in-place
> updates for indexed-column-updates.  If we won't delete-mark the index
> tuple before performing in-place update, then we will have two tuples
> in the index which point to the same heap-TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

-- 
Peter Geoghegan



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Amit Kapila
On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan  wrote:
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  
> wrote:
>> Way back when I was dabbling in this kind of endeavor, my main idea to
>> counteract that, and possibly improve performance overall, was a
>> microvacuum kind of thing that would do some on-demand cleanup to
>> remove duplicates or make room before page splits. Since nbtree
>> uniqueification enables efficient retail deletions, that could end up
>> as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits.
>

No, I don't think that is the case because we want to perform in-place
updates for indexed-column-updates.  If we won't delete-mark the index
tuple before performing in-place update, then we will have two tuples
in the index which point to the same heap-TID.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Claudio Freire
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  
> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently bt_vacuum_one_page does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to
> counteract that, and possibly improve performance overall, was a
> microvacuum kind of thing that would do some on-demand cleanup to
> remove duplicates or make room before page splits. Since nbtree
> uniqueification enables efficient retail deletions, that could end up
> as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.
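
For context, that last-second cleanup looks roughly like the following
simplified sketch in the insertion path (not verbatim; split_needed is
a hypothetical flag):

/*
 * Simplified sketch of the existing last-second cleanup before a page
 * split, loosely based on the nbtree insertion path.
 */
if (P_HAS_GARBAGE(opaque))
{
    /* Remove LP_DEAD-marked items before resorting to a page split. */
    _bt_vacuum_one_page(rel, buf, heapRel);

    if (PageGetFreeSpace(page) >= itemsz)
        split_needed = false;   /* the split turned out to be avoidable */
}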

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.

-- 
Peter Geoghegan


