Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Mar 6, 2020 at 11:00 AM Andres Freund wrote: > > Pushed. > > Congrats! Thanks Andres! -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 2020-02-26 14:43:27 -0800, Peter Geoghegan wrote: > On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan wrote: > > Attached is v34, which has this change. My plan is to commit something > > very close to this on Wednesday morning (barring any objections). > > Pushed. Congrats!
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Feb 26, 2020 at 10:03 PM Fujii Masao wrote: > Thanks for committing this nice feature! You're welcome! > Here is one minor comment. > > + deduplicate_items > + storage parameter > > This should be > > deduplicate_items storage parameter I pushed a fix for this. Thanks -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 2020/02/27 7:43, Peter Geoghegan wrote:
> On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan wrote:
>> Attached is v34, which has this change. My plan is to commit something
>> very close to this on Wednesday morning (barring any objections).
>
> Pushed.

Thanks for committing this nice feature! Here is one minor comment.

+ deduplicate_items
+ storage parameter

This should be

deduplicate_items storage parameter

because that markup for a reloption is necessary only when a GUC parameter with the same name as the reloption exists. So, for example, you can see that it is used for vacuum_cleanup_index_scale_factor but not for the buffering reloption.

Regards,

-- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan wrote: > Attached is v34, which has this change. My plan is to commit something > very close to this on Wednesday morning (barring any objections). Pushed. I'm going to delay committing the pageinspect patch until tomorrow, since I haven't thought about that aspect of the project in a while. Seems like a good idea to go through it one more time, once it's clear that the buildfarm is stable. The buildfarm appears to be stable now, though there was an issue with a compiler warning earlier. I quickly pushed a fix for that, and can see that longfin is green/passing now. Thanks for sticking with this project, Anastasia. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Feb 20, 2020 at 10:58 AM Peter Geoghegan wrote: > I think that there is a good chance that it just won't matter. The > number of indexes that won't be able to support deduplication will be > very small in practice. The important exceptions are INCLUDE indexes > and nondeterministic collations. These exceptions make sense > intuitively, and will be documented as limitations of those other > features. I wasn't clear about the implication of what I was saying here, which is: I will make the NOTICE a DEBUG1 message, and leave everything else as-is in the initial committed version. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Feb 20, 2020 at 7:38 AM Anastasia Lubennikova wrote: > I don't think this patch really needs more nitpicking ) But when has that ever stopped it? :-) > User can discover this with a complex query to pg_index and pg_opclass. > To simplify this, we can probably wrap this into function or some field > in pg_indexes. A function isn't a real user interface, though -- it probably won't be noticed. I think that there is a good chance that it just won't matter. The number of indexes that won't be able to support deduplication will be very small in practice. The important exceptions are INCLUDE indexes and nondeterministic collations. These exceptions make sense intuitively, and will be documented as limitations of those other features. The numeric/float thing doesn't really make intuitive sense, and numeric is an important datatype. Still, numeric columns and float columns seem to rarely get indexed. That just leaves container type opclasses, like anyarray and jsonb. Nobody cares about indexing container types with a B-Tree index, with the possible exception of expression indexes on a jsonb column. I don't see a way around that, but it doesn't seem all that important. Again, applications are unlikely to have more than one or two of those. The *overall* space saving will probably be almost as good as if the limitation did not exist. > Anyway, I would wait for feedback from pre-release testers. Right -- let's delay making a final decision on it. Just like the decision to enable it by default. It will work this way in the committed version, but that isn't supposed to be the final word on it. -- Peter Geoghegan
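The numeric exception mentioned above comes down to "datum image" equality being stricter than comparator equality: two numeric values can compare as equal while having different binary representations (different display scale), so one cannot silently replace the other during deduplication. Python's Decimal preserves display scale in a loosely analogous way, which makes for a rough illustration -- this is a toy sketch, not PostgreSQL code, and equalimage_safe() is an invented name:

```python
from decimal import Decimal

# Two values that compare as equal but carry different display scale --
# their "images" differ, so substituting one for the other (as
# deduplication effectively would) could change query output.
a = Decimal("1.0")
b = Decimal("1.00")

assert a == b            # equal according to the comparator...
assert str(a) != str(b)  # ...but not byte-for-byte interchangeable

def equalimage_safe(x: Decimal, y: Decimal) -> bool:
    """True only if x could silently stand in for y (and vice versa)."""
    return str(x) == str(y)

print(equalimage_safe(a, b))  # False: deduplicating these would lose scale
```

B-Tree opclasses whose equality really is bitwise (integers, text under deterministic collations, and so on) have no such hazard, which is why deduplication support is the rule rather than the exception.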
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 19.02.2020 22:16, Peter Geoghegan wrote:
> On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova wrote:
>> Thank you for this work. I've looked through the patches and they seem
>> to be ready for commit.
>> I haven't yet read recent documentation and readme changes, so maybe
>> I'll send some more feedback tomorrow.

The only thing I found is a typo in the comment

+ int nhtids; /* Number of heap TIDs in nhtids array */

s/nhtids/htids

I don't think this patch really needs more nitpicking )

>> In my opinion, this message is too specific for default behavior. It
>> exposes internal details without explanation and may look to user like
>> something went wrong.
>
> You're probably right about that. I just wish that there was some way
> of showing the same information that was discoverable, and didn't
> require the use of pageinspect. If I make it a DEBUG1 message, then it
> cannot really be documented.

User can discover this with a complex query to pg_index and pg_opclass. To simplify this, we can probably wrap this into function or some field in pg_indexes.

Anyway, I would wait for feedback from pre-release testers.
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Feb 19, 2020 at 11:16 AM Peter Geoghegan wrote: > On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova > wrote: > > Thank you for this work. I've looked through the patches and they seem > > to be ready for commit. > > I haven't yet read recent documentation and readme changes, so maybe > > I'll send some more feedback tomorrow. > > Great. I should add: I plan to commit the patch within the next 7 days. I believe that the design of deduplication itself is solid; it has many more strengths than weaknesses. It works in a way that complements the existing approach to page splits. The optimization can easily be turned off (and easily turned back on again). contrib/amcheck can detect almost any possible form of corruption that could affect a B-Tree index that has posting list tuples. I have spent months microbenchmarking every little aspect of this patch in isolation. I've also spent a lot of time on conventional benchmarking. It seems quite possible that somebody won't like some aspect of the user interface. I am more than willing to work with other contributors on any issue in that area that comes to light. I don't see any point in waiting for other hackers to speak up before the patch is committed, though. Anastasia posted the first version of this patch in August of 2015, and there have been over 30 revisions of it since the project was revived in 2019. Everyone has been given ample opportunity to offer input. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova wrote: > Thank you for this work. I've looked through the patches and they seem > to be ready for commit. > I haven't yet read recent documentation and readme changes, so maybe > I'll send some more feedback tomorrow. Great. > Is there any specific reason, why we need separate btnameequalimage, > bpchar_equalimage and bttextequalimage functions? > As far as I see, they have the same implementation. Not really. This approach allows us to reverse the decision to enable deduplication in a point release, which is theoretically useful. OTOH, if that's so important, why not have many more support function 4 implementations (one per opclass)? I suspect that we would just disable deduplication in a hard-coded fashion if we needed to disable it due to some issue that transpired. For example, we could do this by modifying _bt_allequalimage() itself. > I would simply move it to debug level for all cases. Since from user's > perspective it doesn't differ that much from the case where > deduplication is applicable in general, but not very efficient due to > data distribution. I was more concerned about cases where the user would really like to use deduplication, but wants to make sure that it gets used. And doesn't want to install pageinspect to find out. > I also noticed that this is not consistent with ALTER index. For > example, alter index idx_n set (deduplicate_items =true); won't show any > message about deduplication. But that's a change in the user's preference. Not a change in whether or not it's safe in principle. > In my opinion, this message is too specific for default behavior. It > exposes internal details without explanation and may look to user like > something went wrong. You're probably right about that. I just wish that there was some way of showing the same information that was discoverable, and didn't require the use of pageinspect. If I make it a DEBUG1 message, then it cannot really be documented. 
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 14.02.2020 05:57, Peter Geoghegan wrote:
> Attached is v33, which adds the last piece we need: opclass
> infrastructure that tells nbtree whether or not deduplication can be
> applied safely. This is based on work by Anastasia that was shared
> with me privately.

Thank you for this work. I've looked through the patches and they seem to be ready for commit. I haven't yet read recent documentation and readme changes, so maybe I'll send some more feedback tomorrow.

> New opclass proc
>
> In general, supporting deduplication is the rule for B-Tree opclasses,
> rather than the exception. Most can use the generic btequalimagedatum()
> routine as their support function 4, which unconditionally indicates
> that deduplication is safe. There is a new test that tries to catch
> opclasses that omitted to do this. Here is the opr_sanity.out changes
> added by the first patch:
>
> -- Almost all Btree opclasses can use the generic btequalimagedatum function
> -- as their equalimage proc (support function 4). Look for opclasses that
> -- don't do so; newly added Btree opclasses will usually be able to support
> -- deduplication with little trouble.
> SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
>        opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
> FROM pg_am am
> JOIN pg_opclass opc ON opc.opcmethod = am.oid
> JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
> LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
>     amproclefttype = opcintype AND
>     amprocnum = 4
> WHERE am.amname = 'btree' AND
>     amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
> ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
>
>        proc        |  opfamily_name   |   opclass_name   |    opcintype
> -------------------+------------------+------------------+------------------
>  bpchar_equalimage | bpchar_ops       | bpchar_ops       | character
>  btnameequalimage  | text_ops         | name_ops         | name
>  bttextequalimage  | text_ops         | text_ops         | text
>  bttextequalimage  | text_ops         | varchar_ops      | text
>                    | array_ops        | array_ops        | anyarray
>                    | enum_ops         | enum_ops         | anyenum
>                    | float_ops        | float4_ops       | real
>                    | float_ops        | float8_ops       | double precision
>                    | jsonb_ops        | jsonb_ops        | jsonb
>                    | money_ops        | money_ops        | money
>                    | numeric_ops      | numeric_ops      | numeric
>                    | range_ops        | range_ops        | anyrange
>                    | record_image_ops | record_image_ops | record
>                    | record_ops       | record_ops       | record
>                    | tsquery_ops      | tsquery_ops      | tsquery
>                    | tsvector_ops     | tsvector_ops     | tsvector
> (16 rows)

Is there any specific reason, why we need separate btnameequalimage, bpchar_equalimage and bttextequalimage functions? As far as I see, they have the same implementation.

> Since using deduplication is supposed to pretty much be the norm from
> now on, it seemed like it might make sense to add a NOTICE about it
> during CREATE INDEX -- a notice letting the user know that it isn't
> being used due to a lack of opclass support:
>
> regression=# create table foo(bar numeric);
> CREATE TABLE
> regression=# create index on foo(bar);
> NOTICE:  index "foo_bar_idx" cannot use deduplication
> CREATE INDEX
>
> Note that this NOTICE isn't seen with an INCLUDE index, since that's
> expected to not support deduplication. I have a feeling that not
> everybody will like this, which is why I'm pointing it out. Thoughts?
I would simply move it to debug level for all cases, since from the user's perspective it doesn't differ that much from the case where deduplication is applicable in general, but not very efficient due to data distribution.

I also noticed that this is not consistent with ALTER INDEX. For example, alter index idx_n set (deduplicate_items = true); won't show any message about deduplication.

I've tried several combinations with an index on a numeric column:

1) postgres=# create index idx_nd on tbl (n) with (deduplicate_items = true);
NOTICE:  index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think we should restrict creation of the index even when the deduplicate_items parameter is set explicitly; rather, we may warn the user that it won't be efficient.

2) postgres=# create index idx_n on tbl (n) with (deduplicate_items = false);
NOTICE:  index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me. Why should we show a notice about the fact that deduplication is not possible if that is exactly what was requested?

3) postgres=# create index idx on tbl (n);
NOTICE:  index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It exposes internal details without explanation and may look to user like something went wrong.
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan wrote: > * HEIKKI: Do we only generate one posting list in one WAL record? I > would assume it's better to deduplicate everything on the page, since > we're modifying it anyway. Still thinking about this one. > * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the > remaining posting list on an updated tuple? Wouldn't it be simpler and > more space-efficient to store just the deleted TIDs? This probably makes sense. The btreevacuumposting() code that generates "updated" index tuples (tuples that VACUUM uses to replace existing ones when some but not all of the TIDs need to be removed) was derived from GIN's ginVacuumItemPointers(). That approach works well enough, but we can do better now. It shouldn't be that hard. My preferred approach is a little different to your suggested approach of storing the deleted TIDs directly. I would like to make it work by storing an array of uint16 offsets into a posting list, one array per "updated" tuple (with one offset per deleted TID within each array). These arrays (which must include an array size indicator at the start) can appear in the xl_btree_vacuum record, at the same place the patch currently puts a raw IndexTuple. They'd be equivalent to a raw IndexTuple -- the REDO routine would reconstruct the same raw posting list tuple on its own. This approach seems simpler, and is clearly very space efficient. This approach is similar to the approach used by REDO routines to handle posting list splits. Posting list splits must call _bt_swap_posting() on the primary, while the corresponding REDO routines also call _bt_swap_posting(). For space efficient "updates", we'd have to invent a sibling utility function -- we could call it _bt_delete_posting(), and put it next to _bt_swap_posting() within nbtdedup.c. How do you feel about that approach? (And how do you feel about the existing "REDO routines call _bt_swap_posting()" business that it's based on?) 
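The offset-array idea can be made concrete with a toy model of the encoding: the WAL payload carries only uint16 offsets into the existing posting list, and the REDO side rebuilds the updated tuple from the original page contents plus those offsets. This is an illustrative Python sketch under assumed layouts -- the helper names and the payload format are invented for the example, not the actual xl_btree_vacuum record or _bt_delete_posting():

```python
import struct
from typing import List, Set, Tuple

TID = Tuple[int, int]  # (block number, offset number) -- simplified heap TID

def encode_deletions(posting: List[TID], deleted: Set[TID]) -> bytes:
    """WAL payload: a uint16 count followed by uint16 offsets into the
    posting list, one per deleted TID (illustrative layout only)."""
    offsets = [i for i, tid in enumerate(posting) if tid in deleted]
    return struct.pack(f"<H{len(offsets)}H", len(offsets), *offsets)

def redo_delete_posting(posting: List[TID], payload: bytes) -> List[TID]:
    """REDO side: reconstruct the updated posting list from the original
    tuple plus the offset array, rather than re-logging the whole tuple."""
    (n,) = struct.unpack_from("<H", payload)
    offsets = set(struct.unpack_from(f"<{n}H", payload, 2))
    return [tid for i, tid in enumerate(posting) if i not in offsets]

posting = [(1, 1), (1, 2), (2, 5), (3, 7)]
payload = encode_deletions(posting, {(1, 2), (3, 7)})
assert redo_delete_posting(posting, payload) == [(1, 1), (2, 5)]
# 6 bytes logged here, versus a whole copy of the remaining posting list.
print(len(payload))  # -> 6
```

The space advantage grows with posting list size: the payload stays at 2 bytes per deleted TID regardless of how many TIDs remain in the tuple.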
> * HEIKKI: Would it be more clear to have a separate struct for the > posting list split case? (i.e. don't reuse xl_btree_insert) I've concluded that this one probably isn't worthwhile. We'd have to carry a totally separate record on the stack within _bt_insertonpg(). If you feel strongly about it, I will reconsider. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jan 8, 2020 at 5:56 AM Heikki Linnakangas wrote:
> On 04/01/2020 03:47, Peter Geoghegan wrote:
>> Attached is v28, which fixes bitrot from my recent commits to refactor
>> VACUUM-related code in nbtpage.c.
>
> I started to read through this gigantic patch.

Oh come on, it's not that big. :-)

> I got about 1/3 way through. I wrote minor comments directly in the
> attached patch file, search for "HEIKKI:". I wrote them as I read the
> patch from beginning to end, so it's possible that some of my questions
> are answered later in the patch. I didn't have the stamina to read
> through the whole patch yet, I'll continue later.

Thanks for the review! Anything that you've written that I do not respond to directly can be assumed to have been accepted by me. I'll start with responses to the points that you raise in your patch that need a response.

Patch comments
==============

* "Furthermore, deduplication can be turned on or off as needed, or applied" -- HEIKKI: When would it be needed?

I believe that hardly anybody will want to turn off deduplication in practice. My point here is that we're flexible -- we're not maintaining posting lists like GIN. We're just deduplicating as and when needed. We can change our preference about that any time. Turning off deduplication won't magically undo past deduplications, of course, but everything mostly works in the same way when deduplication is on or off. We're just talking about an alternative physical representation of the same logical contents.

* HEIKKI: How do LP_DEAD work on posting list tuples?

Same as before, except that it applies to all TIDs in the tuple together (will mention this in commit message, though). Note that the fact that we delay deduplication also means that we delay merging the LP_DEAD bits. And we always prefer to remove LP_DEAD items.
Finally, we refuse to do a posting list split when its LP_DEAD bit is set, so it's now possible to delete LP_DEAD bit set tuples a little early, before a page split has to be avoided -- see the new code and comments at the end of _bt_findinsertloc(). See also: my later response to your e-mail remarks on LP_DEAD bits, unique indexes, and space accounting.

* HEIKKI: When is it [deduplication] not safe?

With opclasses like btree/numeric_ops, where display scale messes things up. See this thread for more information on the infrastructure that we need for that:

https://www.postgresql.org/message-id/flat/cah2-wzn3ee49gmxb7v1vj3-ac8fwn-fr8pfwqebhe8ryrxt...@mail.gmail.com

* HEIKKI: Why is it safe to read on version 3 indexes? Because unused space is set to zeros?

Yes. Same applies to version 4 indexes that come from Postgres 12 -- users must REINDEX to call _bt_opclasses_support_dedup() and set metapage field, but we can rely on the new field being all zeroes before that happens. (It would be possible to teach pg_upgrade to set the field for compatible indexes from Postgres 12, but I don't want to bother with that. We probably cannot safely call _bt_opclasses_support_dedup() with a buffer lock held, so that seems like the only way.)

* HEIKKI: Do we need it as a separate flag, isn't it always safe with version 4 indexes, and never with version 3?

No, it isn't *always* safe with version 4 indexes, for reasons that have nothing to do with the on-disk representation (like the display scale issue, nondeterministic collations, etc). It really is a distinct condition. (Deduplication is never safe with version 3 indexes, obviously.)

It occurs to me now that we probably don't even want to make the metapage field about deduplication (though that's what it says right now). Rather, it should be about supporting a general category of optimizations that include deduplication, and might also include prefix compression in the future.
Note that whether or not we should actually apply these optimizations is always a separate question.

* "+ * Non-pivot tuples complement pivot tuples, which only have key columns." -- HEIKKI: What does it mean that they complement pivot tuples?

It means that all tuples are either pivot tuples, or are non-pivot tuples.

* "+ * safely (index storage parameter separately indicates if deduplication is" -- HEIKKI: Is there really an "index storage parameter" for that? What is that, something in the WITH clause?

Yes, there is actually an index storage parameter named "deduplication" (something in the WITH clause). This is deliberately not named "btree_deduplication", the current name of the GUC. This exists to make the optimization controllable at the index level. (Though I should probably mention the GUC first in this code comment, or not even mention the less significant storage parameter.)

* HEIKKI: How much memory does this [BTScanPosData.items array of width MaxBTreeIndexTuplesPerPage] need now? Should we consider pallocing this separately?

But BTScanPosData isn't allocated on the stack anyway.

* HEIKKI: Would it be more clear to have a separate struct for the posting list split case?
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Dec 12, 2019 at 6:21 PM Peter Geoghegan wrote: > Still waiting for some review of the first patch, to get it out of the > way. Anastasia? I plan to commit this first patch [1] in the next day or two, barring any objections. It's clear that the nbtree "pin scan" VACUUM code is totally unnecessary -- it really should have been fully removed by commit 3e4b7d87 back in 2016. [1] https://www.postgresql.org/message-id/flat/CAH2-WzkWLRDzCaxsGvA_pZoaix_2AC9S6%3D-D6JMLkQYhqrJuEg%40mail.gmail.com#daed349a71ff9d7ac726cc0e3e01a436 -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Dec 17, 2019 at 5:18 PM Bruce Momjian wrote: > On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > > With many real world unique indexes, the true reason behind most or > > all B-Tree page splits is "version churn". I view these page splits as > > a permanent solution to a temporary problem -- we *permanently* > > degrade the index structure in order to deal with a *temporary* burst > > in versions that need to be stored. That's really bad. > > Yes, I was thinking why do we need to optimize duplicates in a unique > index but then remembered is a version problem. The whole idea of deduplication in unique indexes is hard to explain. It just sounds odd. Also, it works using the same infrastructure as regular deduplication, while having rather different goals. Fortunately, it seems like we don't really have to tell users about it in order for them to see a benefit -- there will be no choice for them to make there (they just get it). The regular deduplication stuff isn't confusing at all, though. It has some noticeable though small downside, so it will be documented and configurable. (I'm optimistic that it can be enabled by default, because even with high cardinality non-unique indexes the downside is rather small -- we waste some CPU cycles just before a page is split.) -- Peter Geoghegan
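The regular deduplication mechanism referred to here -- collapsing runs of tuples with equal keys into a single "posting list" tuple that stores the key once plus all of its heap TIDs -- can be modeled in a few lines. This is a toy sketch of the transformation only, not the C implementation; the real code also has to respect tuple size limits and the page's physical layout:

```python
from itertools import groupby

def deduplicate_page(items):
    """Collapse runs of equal keys into (key, [TIDs]) posting tuples,
    the way nbtree deduplication rewrites a leaf page just before it
    would otherwise have to split. Toy model: items are (key, tid)
    pairs, already in key order as they would be on a leaf page."""
    out = []
    for key, group in groupby(items, key=lambda kv: kv[0]):
        tids = [tid for _, tid in group]
        out.append((key, tids))  # the key is now stored only once
    return out

page = [("abc", 1), ("abc", 2), ("abc", 3), ("xyz", 4)]
assert deduplicate_page(page) == [("abc", [1, 2, 3]), ("xyz", [4])]
```

This also shows why the downside for high cardinality indexes is small: when every key is distinct, the pass produces one single-TID "posting tuple" per input tuple and saves nothing, costing only the CPU cycles spent discovering that.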
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > With many real world unique indexes, the true reason behind most or > all B-Tree page splits is "version churn". I view these page splits as > a permanent solution to a temporary problem -- we *permanently* > degrade the index structure in order to deal with a *temporary* burst > in versions that need to be stored. That's really bad. Yes, I was thinking why do we need to optimize duplicates in a unique index, but then remembered it is a version problem. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian wrote: > > Attached is v26, which adds this new criteria/heuristic for unique > > indexes. We now seem to consistently get good results with unique > > indexes. > > In the past we tried to increase the number of cases where HOT updates > can happen but were unable to. Right -- the WARM project. The Z-heap project won't change the fundamentals here. It isn't going to solve the fundamental problem of requiring that the index AM create a new set of physical index tuples in at least *some* cases. A heap tuple cannot be updated in-place when even one indexed column changes -- you're not much better off than you were with the classic heapam, because indexes get bloated in a way that wouldn't happen with Oracle. (Even still, Z-heap isn't sensitive to when and how opportunistic heap pruning takes place, and doesn't have the same issue with having to fit the heap tuple on the same page or create a new HOT chain. This will make things much better with some workloads.) > Would this help with non-HOT updates? Definitely, yes. The strategy used with unique indexes is specifically designed to avoid "unnecessary" page splits altogether -- it only makes sense because of the possibility of non-HOT UPDATEs with mostly-unchanged index tuples. Thinking about what's going on here from first principles is what drove the unique index deduplication design: With many real world unique indexes, the true reason behind most or all B-Tree page splits is "version churn". I view these page splits as a permanent solution to a temporary problem -- we *permanently* degrade the index structure in order to deal with a *temporary* burst in versions that need to be stored. That's really bad. Consider a classic pgbench workload, for example. 
The smaller indexes on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey) have leaf pages that will almost certainly be split a few minutes in, even though the UPDATEs on the underlying tables never modify indexed columns (i.e. even though HOT is as effective as it possibly could be with this unthrottled workload). Actually, even the resulting split pages will themselves usually be split again, and maybe even once more after that. We started out with leaf pages that stored just under 370 items on each leaf page (with fillfactor 90 + 8KiB BLCKSZ), and end up with leaf pages that often have less than 50 items (sometimes as few as 10). Even though the "logical contents" of the index are *totally* unchanged. This could almost be considered pathological by users. Of course, it's easy to imagine a case where it matters a lot more than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey are always small, so it's easy to see the effect, which is why I went with that example). For example, you could have a long running transaction, which would probably have the effect of significantly bloating even the large pgbench index (pgbench_accounts_pkey) -- typically you won't see that with classic pgbench until you do something to frustrate VACUUM (and opportunistic cleanup). (I have mostly been using non-HOT UPDATEs to test the patch, though.) In theory we could go even further than this by having some kind of version store for indexes, and using this to stash old versions rather than performing a page split. Then you wouldn't have any page splits in the pgbench indexes; VACUUM would eventually be able to return the index to its "pristine" state. The trade-off with that design would be that index scans would have to access two index pages for a while (a leaf page, plus its subsidiary old version page). 
Maybe we can actually go that far in the future -- there are various database research papers that describe designs like this (the designs described within these papers do things like determine whether a "version split" or a "value split" should be performed). What we have now is an incremental improvement, that doesn't have any apparent downside with unique indexes -- the way that deduplication is triggered for unique indexes is almost certain to be a win. When deduplication isn't triggered, everything works in the same way as before -- it's "zero overhead" for unique indexes that don't benefit. The design augments existing garbage collection mechanisms, particularly the way in which we set LP_DEAD bits within _bt_check_unique(). > Do we have any benchmarks where non-HOT updates cause slowdowns that we > can test on this? AFAICT, any workload that has lots of non-HOT updates will benefit at least a little bit -- indexes will finish up smaller, there will be higher throughput, and there will be a reduction in latency for queries. With the right distribution of values, it's not that hard to mostly control bloat in an index that doubles in size without the optimization, which is much more significant. I have already reported on this [1]. I've also been able to observe increases of 15%-20% in TPS with similar workloads (with commensurate
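The leaf-page arithmetic cited earlier in this message (just under 370 items per pgbench leaf page with fillfactor 90 and 8KiB BLCKSZ) can be sanity checked roughly. The sizes below are approximations chosen for the sketch -- a plain page header, nbtree special space, a 4-byte line pointer, and a MAXALIGNed index tuple with an 8-byte header plus an 8-byte integer key -- not authoritative values from the headers:

```python
BLCKSZ = 8192
PAGE_HEADER = 24    # approx. sizeof(PageHeaderData)
SPECIAL = 16        # approx. nbtree special space (BTPageOpaqueData)
LINE_POINTER = 4    # ItemIdData
INDEX_TUPLE = 16    # 8-byte tuple header + 8-byte key, MAXALIGNed
FILLFACTOR = 0.90

usable = (BLCKSZ - PAGE_HEADER - SPECIAL) * FILLFACTOR
items = int(usable // (LINE_POINTER + INDEX_TUPLE))
print(items)  # 366 under these assumptions -- "just under 370"
```

The same arithmetic makes the bloat figure vivid: a split-heavy page holding 50 items instead of ~366 uses more than 7x the space for the same logical contents.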
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Dec 12, 2019 at 06:21:20PM -0800, Peter Geoghegan wrote: > On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan wrote: > > The new criteria/heuristic for unique indexes is very simple: If a > > unique index has an existing item that is a duplicate on the incoming > > item at the point that we might have to split the page, then apply > > deduplication. Otherwise (when the incoming item has no duplicates), > > don't apply deduplication at all -- just accept that we'll have to > > split the page. We already cache the bounds of our initial binary > > search in insert state, so we can reuse that information within > > _bt_findinsertloc() when considering deduplication in unique indexes. > > Attached is v26, which adds this new criteria/heuristic for unique > indexes. We now seem to consistently get good results with unique > indexes. In the past we tried to increase the number of cases where HOT updates can happen but were unable to. Would this help with non-HOT updates? Do we have any benchmarks where non-HOT updates cause slowdowns that we can test on this? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan wrote: > * Decided to go back to turning deduplication on by default with > non-unique indexes, and off by default using unique indexes. > > The unique index stuff was regressed enough with INSERT-heavy > workloads that I was put off, despite my initial enthusiasm for > enabling deduplication everywhere. I have changed my mind about this again. I now think that it would make sense to treat deduplication within unique indexes as a separate feature that cannot be disabled by the GUC at all (though we'd probably still respect the storage parameter for debugging purposes). I have found that fixing the WAL record size issue has helped remove what looked like a performance penalty for deduplication (but was actually just a general regression). Also, I have found a way of selectively applying deduplication within unique indexes that seems to have no downside, and considerable upside. The new criteria/heuristic for unique indexes is very simple: If a unique index has an existing item that is a duplicate on the incoming item at the point that we might have to split the page, then apply deduplication. Otherwise (when the incoming item has no duplicates), don't apply deduplication at all -- just accept that we'll have to split the page. We already cache the bounds of our initial binary search in insert state, so we can reuse that information within _bt_findinsertloc() when considering deduplication in unique indexes. This heuristic makes sense because deduplication within unique indexes should only target leaf pages that cannot possibly receive new values. In many cases, the only reason why almost all primary key leaf pages can ever split is because of non-HOT updates whose new HOT chain needs a new, equal entry in the primary key. 
This is the case with your standard identity column/serial primary key, for example (only the rightmost page will have a page split due to the insertion of new logical rows -- every other variety of page split must be due to new physical tuples/versions). I imagine that it is possible for a leaf page to be a "mixture" of these two basic/general tendencies, but not for long. It really doesn't matter if we occasionally fail to delay a page split where that was possible, nor does it matter if we occasionally apply deduplication when that won't delay a split for very long -- pretty soon the page will split anyway. A split ought to separate the parts of the keyspace that exhibit each tendency. In general, we're only interested in delaying page splits in unique indexes *indefinitely*, since in effect that will prevent them *entirely*. (So the goal is *significantly* different to our general goal for deduplication -- it's about buying time for VACUUM to run or whatever, rather than buying space.) This heuristic keeps the TPC-C "old order" table's PK from bloating quite noticeably, since that was the only unique index that is really affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that table happen to not be HOT-safe in general, which is not the case for any other table). It doesn't regress anything else from TPC-C, since there really isn't a benefit for other tables. More importantly, the working/draft version of the patch will often avoid a huge amount of bloat in a pgbench-style workload that has an extra index on the pgbench_accounts table, to prevent HOT updates. The accounts primary key (pgbench_accounts_pkey) hardly grows at all with the patch, but grows 2x on master. This 2x space saving seems to occur reliably, unless there is a lot of contention on individual *pages*, in which case the bloat can be delayed but not prevented. We get that 2x space saving with either uniformly distributed random updates on pgbench_accounts (i.e. 
the pgbench default), or with a skewed distribution that hashes the PRNG's value. Hashing like this simulates a workload where the skew isn't concentrated in one part of the key space (i.e. there is skew, but very popular values are scattered throughout the index evenly, rather than being concentrated together in just a few leaf pages). Can anyone think of an adversarial case that we may not do so well on with the new "only deduplicate within unique indexes when new item already has a duplicate" strategy? I'm having difficulty identifying some kind of worst case. -- Peter Geoghegan
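For readers following along, the unique-index heuristic described above amounts to something like the following sketch (illustrative Python, not the patch's actual C code; the names and the namedtuple page representation are invented for the example):

```python
from collections import namedtuple

# Toy stand-in for a leaf page item; real nbtree pages hold IndexTuples.
Item = namedtuple("Item", ["key", "tid"])

def should_dedup_unique_index(page_items, new_key, bounds):
    """Decide whether to deduplicate a unique-index leaf page that is
    about to split: only do it if the incoming key already has a
    duplicate on the page (the page is filling up with extra versions
    from non-HOT updates, not with new logical rows).

    `bounds` is the (low, high) range left behind by the initial binary
    search, reused here instead of performing a fresh search.
    """
    low, high = bounds
    return any(item.key == new_key for item in page_items[low:high])
```

If no duplicate exists within the cached bounds, the split is simply accepted, since deduplication could only delay it briefly rather than prevent it.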
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat, but it does fix the bitrot against the > master branch that was caused by commit 50d22de9. (This patch has had > a surprisingly large number of conflicts against the master branch > recently.) Please note that I have moved this patch to next CF per this last update. Anastasia, the ball is waiting on your side of the field, as the CF entry is marked as waiting on author for some time now. -- Michael
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Moin,

On 2019-11-16 01:04, Peter Geoghegan wrote:
> On Fri, Nov 15, 2019 at 5:16 AM Robert Haas wrote:
>> Hmm. Well, maybe I'm just behind the times. But that same wikipedia article also says that deduplication works on large chunks "such as entire files or large sections of files" thus differentiating it from compression algorithms which work on the byte level, so it seems to me that what you are doing still sounds more like ad-hoc compression.
> I see your point. One reason for my avoiding the word "compression" is that other DB systems that have something similar don't use the word compression either. Actually, they don't really call it *anything*. Posting lists are simply the way that secondary indexes work. The "Modern B-Tree techniques" book/survey paper mentions the idea of using a TID list in its "3.7 Duplicate Key Values" section, not in the two related sections that follow ("Bitmap Indexes", and "Data Compression"). That doesn't seem like a very good argument, now that I've typed it out. The patch applies deduplication/compression/whatever at the point where we'd otherwise have to split the page, unlike GIN. GIN eagerly maintains posting lists (doing in-place updates for most insertions seems pretty bad to me). My argument could reasonably be made about GIN, which really does consider posting lists the natural way to store duplicate tuples. I cannot really make that argument about nbtree with this patch, though -- delaying a page split by re-encoding tuples (changing their physical representation without changing their logical contents) justifies using the word "compression" in the name.
>>> Can you suggest an alternative?
>> My instinct is to pick a name that somehow involves compression and just put enough other words in there to make it clear e.g. duplicate value compression, or something of that sort.
> Does anyone else want to weigh in on this? Anastasia? I will go along with whatever the consensus is. I'm very close to the problem we're trying to solve, which probably isn't helping me here.

I'm in favor of deduplication and not compression. Compression is a more generic term, and can involve deduplication, but it doesn't have to. (It could, for instance, just encode things in a more compact form.) Deduplication, by contrast, doesn't involve compression; it just means storing each repeated thing once, which by coincidence also amounts to using less space, as compression does. ZFS also follows this by having both deduplication (store the same blocks only once, with references) and compression (compress block contents, regardless of whether they are stored once or many times).

So my vote is for deduplication (if I understand the thread correctly, this is what the code now does: storing the exact same key not many times over, but only once, with references or a count?).

best regards,

Tels
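A toy illustration of the distinction Tels draws (hypothetical code, not from the patch): deduplication stores each key once with a list of references, while the stored bytes themselves are never re-encoded:

```python
def deduplicate(index_entries):
    """Collapse (key, tid) entries into one posting entry per key.

    Each key is stored once, together with the list of heap TIDs that
    share it -- the same idea as the patch's posting lists. No
    byte-level encoding is involved, which is why "deduplication"
    arguably fits better than "compression" here.
    """
    posting = {}
    for key, tid in index_entries:
        posting.setdefault(key, []).append(tid)
    return posting

entries = [("foo", (0, 1)), ("foo", (0, 2)), ("bar", (1, 1))]
# "foo" is stored once, referencing two heap TIDs
```

A compressor, by contrast, would rewrite the representation of the entries themselves; here every key and TID survives verbatim.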
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Nov 15, 2019 at 5:43 PM Mark Dilger wrote: > On 11/13/19 11:51 AM, Peter Geoghegan wrote: > > Can you suggest an alternative? > > Dupression This suggestion makes me feel better about "deduplication". -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 11/13/19 11:51 AM, Peter Geoghegan wrote: Can you suggest an alternative? Dupression -- Mark Dilger
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan wrote: > > I haven't measured how these changes affect WAL size yet. > > Do you have any suggestions on how to automate testing of new WAL records? > > Is there any suitable place in regression tests? > > I don't know about the regression tests (I doubt that there is a > natural place for such a test), but I came up with a rough test case. > I more or less copied the approach that you took with the index build > WAL reduction patches, though I also figured out a way of subtracting > heapam WAL overhead to get a real figure. I attach the test case -- > note that you'll need to use the "land" database with this. (This test > case might need to be improved, but it's a good start.) Today I used a test script similar to the "nbtree_wal_test.sql" test script I posted on September 11th. I am concerned about the WAL overhead for cases that don't benefit from the patch (usually because they turn off deduplication altogether). The details of the index tested were different this time, though. I used an index that had the smallest possible tuple size: 16 bytes (that's the smallest possible size on 64-bit systems, which is what almost everybody uses these days). So any index with one or two int4 columns (or one int8 column) will generally have 16 byte IndexTuples, at least when there are no NULLs in the index. In general, 16 byte wide tuples are very, very common. What I saw suggests that we will need to remove the new "postingoff" field from xl_btree_insert. (We can create a new XLog record for leaf page inserts that also need to split a posting list, without changing much else.) The way that *alignment* of WAL records affects these common 16 byte IndexTuple cases is the real problem. Adding "postingoff" to xl_btree_insert increases the WAL required for INSERT_LEAF records by two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output shows that they're 66 bytes, whereas they're only 64 bytes on the master branch. 
That doesn't sound that bad, but once you consider the alignment of whole records, it's really an extra 8 bytes. That is totally unacceptable. The vast majority of nbtree WAL records are bound to be INSERT_LEAF records, so as things stand we have added (almost) 12.5% space overhead to nbtree for these common cases, that don't benefit. I haven't really looked into other types of WAL record just yet. The real world overhead that we're adding to xl_btree_vacuum records is something that I will have to look into separately. I'm already pretty sure that adding two bytes to xl_btree_split is okay, though, because they're far less numerous than xl_btree_insert records, and aren't affected by alignment in the same way (they're already several hundred bytes in almost all cases). I also noticed something positive: The overhead of xl_btree_dedup WAL records seems to be very low with indexes that have hundreds of logical tuples for each distinct integer value. We don't seem to have a problem with "deduplication thrashing". -- Peter Geoghegan
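The alignment arithmetic is easy to check. On typical 64-bit platforms, record sizes get rounded up to a multiple of 8 bytes, so a two-byte field can cost a full eight bytes once padding is counted. A quick sketch (plain Python standing in for the C MAXALIGN macro):

```python
def maxalign(size, alignment=8):
    """Round size up to the next multiple of `alignment`, in the same
    spirit as PostgreSQL's MAXALIGN macro on a 64-bit platform."""
    return (size + alignment - 1) & ~(alignment - 1)

# A 64-byte INSERT_LEAF record is already aligned; adding the two-byte
# postingoff field pushes the padded size from 64 to 72 bytes.
master_record = maxalign(64)    # stays 64
patched_record = maxalign(66)   # rounds up to 72
overhead = (patched_record - master_record) / master_record  # 0.125
```

That 8-byte growth on a 64-byte record is exactly the (almost) 12.5% overhead figure cited above.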
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Nov 15, 2019 at 5:16 AM Robert Haas wrote: > Hmm. Well, maybe I'm just behind the times. But that same wikipedia > article also says that deduplication works on large chunks "such as > entire files or large sections of files" thus differentiating it from > compression algorithms which work on the byte level, so it seems to me > that what you are doing still sounds more like ad-hoc compression. I see your point. One reason for my avoiding the word "compression" is that other DB systems that have something similar don't use the word compression either. Actually, they don't really call it *anything*. Posting lists are simply the way that secondary indexes work. The "Modern B-Tree techniques" book/survey paper mentions the idea of using a TID list in its "3.7 Duplicate Key Values" section, not in the two related sections that follow ("Bitmap Indexes", and "Data Compression"). That doesn't seem like a very good argument, now that I've typed it out. The patch applies deduplication/compression/whatever at the point where we'd otherwise have to split the page, unlike GIN. GIN eagerly maintains posting lists (doing in-place updates for most insertions seems pretty bad to me). My argument could reasonably be made about GIN, which really does consider posting lists the natural way to store duplicate tuples. I cannot really make that argument about nbtree with this patch, though -- delaying a page split by re-encoding tuples (changing their physical representation without changing their logical contents) justifies using the word "compression" in the name. > > Can you suggest an alternative? > > My instinct is to pick a name that somehow involves compression and > just put enough other words in there to make it clear e.g. duplicate > value compression, or something of that sort. Does anyone else want to weigh in on this? Anastasia? I will go along with whatever the consensus is. I'm very close to the problem we're trying to solve, which probably isn't helping me here. 
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Nov 13, 2019 at 2:51 PM Peter Geoghegan wrote: > "Deduplication" never means that you get rid of duplicates. According > to Wikipedia's deduplication article: "Whereas compression algorithms > identify redundant data inside individual files and encodes this > redundant data more efficiently, the intent of deduplication is to > inspect large volumes of data and identify large sections – such as > entire files or large sections of files – that are identical, and > replace them with a shared copy". Hmm. Well, maybe I'm just behind the times. But that same wikipedia article also says that deduplication works on large chunks "such as entire files or large sections of files" thus differentiating it from compression algorithms which work on the byte level, so it seems to me that what you are doing still sounds more like ad-hoc compression. > Can you suggest an alternative? My instinct is to pick a name that somehow involves compression and just put enough other words in there to make it clear e.g. duplicate value compression, or something of that sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Nov 13, 2019 at 11:33 AM Robert Haas wrote: > On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan wrote: > > * Disabled deduplication in system catalog indexes by deeming it > > generally unsafe. > > I (continue to) think that deduplication is a terrible name, because > you're not getting rid of the duplicates. You are using a compressed > representation of the duplicates. "Deduplication" never means that you get rid of duplicates. According to Wikipedia's deduplication article: "Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy". This seemed like it fit what this patch does. We're concerned with a specific, simple kind of redundancy. Also: * From the user's point of view, we're merging together what they'd call duplicates. They don't really think of the heap TID as part of the key. * The term "compression" suggests a decompression penalty when reading, which is not the case here. * The term "compression" confuses the feature added by the patch with TOAST compression. Now we may have two very different varieties of compression in the same index. Can you suggest an alternative? -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan wrote: > * Disabled deduplication in system catalog indexes by deeming it > generally unsafe. I (continue to) think that deduplication is a terrible name, because you're not getting rid of the duplicates. You are using a compressed representation of the duplicates. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan wrote: > I've found that my "regular pgbench, but with a low cardinality index > on pgbench_accounts(abalance)" benchmark works best with the specific > heuristics used in the patch, especially over many hours. I ran pgbench without the pgbench_accounts(abalance) index, and with slightly adjusted client counts -- you could say that this was a classic pgbench benchmark of v20 of the patch. Still scale 500, with single hour runs. Here are the results for each 1 hour run, with client counts of 8, 16, and 32, with two rounds of runs total:

master_1_run_8.out: "tps = 25156.689415 (including connections establishing)"
patch_1_run_8.out: "tps = 25135.472084 (including connections establishing)"
master_1_run_16.out: "tps = 30947.053714 (including connections establishing)"
patch_1_run_16.out: "tps = 31225.044305 (including connections establishing)"
master_1_run_32.out: "tps = 29550.231969 (including connections establishing)"
patch_1_run_32.out: "tps = 29425.011249 (including connections establishing)"
master_2_run_8.out: "tps = 24678.792084 (including connections establishing)"
patch_2_run_8.out: "tps = 24891.130465 (including connections establishing)"
master_2_run_16.out: "tps = 30878.930585 (including connections establishing)"
patch_2_run_16.out: "tps = 30982.306091 (including connections establishing)"
master_2_run_32.out: "tps = 29555.453436 (including connections establishing)"
patch_2_run_32.out: "tps = 29591.767136 (including connections establishing)"

This interlaced order is the same order that each 1 hour pgbench run actually ran in. The patch wasn't expected to do any better here -- it was expected to not be any slower for a workload that it cannot really help. Though the two small pgbench indexes do remain a lot smaller with the patch. While a lot of work remains to validate the performance of the patch, this looks good to me. -- Peter Geoghegan
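The "not any slower" reading holds up numerically: recomputing the relative change for each run above shows that every master-vs-patch difference is under one percent in either direction. A quick check:

```python
# tps figures copied from the pgbench runs quoted above
results = {
    # (clients, round): (master_tps, patch_tps)
    (8, 1): (25156.689415, 25135.472084),
    (16, 1): (30947.053714, 31225.044305),
    (32, 1): (29550.231969, 29425.011249),
    (8, 2): (24678.792084, 24891.130465),
    (16, 2): (30878.930585, 30982.306091),
    (32, 2): (29555.453436, 29591.767136),
}

for (clients, rnd), (master, patch) in sorted(results.items()):
    delta_pct = 100.0 * (patch - master) / master
    print(f"{clients:>2} clients, round {rnd}: {delta_pct:+.2f}%")
```

The largest swings (about +0.9% at 16 clients and -0.4% at 32 clients) are within ordinary pgbench run-to-run noise.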
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova wrote: > Attached is v19. Cool. > * By default deduplication is on for non-unique indexes and off for > unique ones. I think that it makes sense to enable deduplication by default -- even with unique indexes. It looks like deduplication can be very helpful with non-HOT updates. I have been benchmarking this using more or less standard pgbench at scale 200, with one big difference -- I also create an index on "pgbench_accounts (abalance)". This is a low cardinality index, which ends up about 3x smaller with the patch, as expected. It also makes all updates non-HOT updates, greatly increasing index bloat in the primary key of the accounts table -- this is what I found really interesting about this workload. The theory behind deduplication within unique indexes seems quite different to the cases we've focussed on so far -- that's why my working copy of the patch makes a few small changes to how _bt_dedup_one_page() works with unique indexes specifically (more on that later). With unique indexes, deduplication doesn't help by creating space -- it helps by creating *time* for garbage collection to run before the real "damage" is done -- it delays page splits. This is only truly valuable when page splits caused by non-HOT updates are delayed by so much that they're actually prevented entirely, typically because the _bt_vacuum_one_page() stuff can now happen before pages split, not after. In general, these page splits are bad because they degrade the B-Tree structure, more or less permanently (it's certainly permanent with this workload). Having a huge number of page splits *purely* because of non-HOT updates is particularly bad -- it's just awful. I believe that this is the single biggest problem with the Postgres approach to versioned storage (we know that other DB systems have no primary key page splits with this kind of workload). 
Anyway, if you run this pgbench workload without rate-limiting, so that a patched Postgres does as much work as physically possible, the accounts table primary key (pgbench_accounts_pkey) at least grows at a slower rate -- the patch clearly beats master at the start of the benchmark/test (as measured by index size). As the clients are ramped up by my testing script, and as time goes on, eventually the size of the pgbench_accounts_pkey index "catches up" with master. The patch delays page splits, but ultimately the system as a whole cannot prevent the page splits altogether, since the server is being absolutely hammered by pgbench. Actually, the index is *exactly* the same size for both the master case and the patch case when we reach this "bloat saturation point". We can delay the problem, but we cannot prevent it. But what about a more realistic workload, with rate-limiting? When I add some rate limiting, so that the TPS/throughput is at about 50% of what I got the first time around (i.e. 50% of what is possible), or 15k TPS, it's very different. _bt_dedup_one_page() can now effectively cooperate with _bt_vacuum_one_page(). Now deduplication is able to "soak up all the extra garbage tuples" for long enough to delay and ultimately *prevent* almost all page splits. pgbench_accounts_pkey starts off at 428 MB for both master and patch (CREATE INDEX makes it that size). After about an hour, the index is 447 MB with the patch. The master case ends up with a pgbench_accounts_pkey size of 854 MB, though (this is very close to 857 MB, the "saturation point" index size from before). This is a very significant improvement, obviously -- the patch has an index that is ~52% of the size seen for the same index with the master branch! Here is how I changed _bt_dedup_one_page() for unique indexes to get this result: * We limit the size of posting lists to 5 heap TIDs in the checkingunique case. 
Right now, we will actually accept a checkingunique page split before we'll merge together items that result in a posting list with more heap TIDs than that (not sure about these details at all, though). * Avoid creating a new posting list that caller will have to split immediately anyway (this is based on details of _bt_dedup_one_page() caller's newitem tuple). (Not sure how much this customization contributes to this favorable test result -- maybe it doesn't make that much difference.) The goal here is for duplicates that are close together in both time and space to get "clumped together" into their own distinct, small-ish posting list tuples with no more than 5 TIDs. This is intended to help _bt_vacuum_one_page(), which is known to be a very important mechanism for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set very frequently within _bt_check_unique()). The general idea is to balance deduplication against LP_DEAD killing, and to increase spatial/temporal locality within these smaller posting lists. If we have one huge posting list for each value, then we can't set the LP_DEAD bit on anything at all, which is very bad.
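The "no more than 5 heap TIDs per posting list in the checkingunique case" idea can be sketched as follows (illustrative Python; the real _bt_dedup_one_page() works on raw page items, and as noted above these details were not final):

```python
def chunk_duplicates(tids, max_tids=5):
    """Split the heap TIDs for one duplicated key into several small
    posting lists rather than one big one.

    Keeping posting lists small preserves granularity for the LP_DEAD
    optimization: _bt_check_unique() can still mark a whole small
    posting list dead, whereas one huge posting list could never get
    its LP_DEAD bit set while any single TID in it remained live.
    """
    return [tids[i:i + max_tids] for i in range(0, len(tids), max_tids)]

# 12 versions of one value become three posting lists of 5, 5, and 2 TIDs
tids = [(0, n) for n in range(1, 13)]
```

Versions that arrive close together in time land in the same small posting list, which is exactly the clumping the message describes.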
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 25, 2019 at 8:05 AM Anastasia Lubennikova wrote: > Attached is v18. In this version bt_dedup_one_page() is refactored so that: > - no temp page is used, all updates are applied to the original page. > - each posting tuple wal logged separately. > This also allowed to simplify btree_xlog_dedup significantly. This looks great! Even if it isn't faster than using a temp page buffer, the flexibility seems like an important advantage. We can do things like have the _bt_dedup_one_page() caller hint that deduplication should start at a particular offset number. If that doesn't work out by the time the end of the page is reached (whatever "works out" may mean), then we can just start at the beginning of the page, and work through the items we skipped over initially. > > We still haven't added an "off" switch to deduplication, which seems > > necessary. I suppose that this should look like GIN's "fastupdate" > > storage parameter. > Why is it necessary to save this information somewhere but rel->rd_options, > while we can easily access this field from _bt_findinsertloc() and > _bt_load(). Maybe, but we also need to access a flag that says it's safe to use deduplication. Obviously deduplication is not safe for datatypes like numeric and text with a nondeterministic collation. The "is deduplication safe for this index?" mechanism will probably work by doing several catalog lookups. This doesn't seem like something we want to do very often, especially with a buffer lock held -- ideally it will be somewhere that's convenient to access. Do we want to do that separately, and have a storage parameter that says "I would like to use deduplication in principle, if it's safe"? Or, do we store both pieces of information together, and forbid setting the storage parameter to on when it's known to be unsafe for the underlying opclasses used by the index? I don't know. I think that you can start working on this without knowing exactly how we'll do those catalog lookups. 
What you come up with has to work with that before the patch can be committed, though. -- Peter Geoghegan
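The two questions Peter separates could combine into a single check roughly like this (a hypothetical sketch; the actual catalog-lookup and caching mechanism was still undecided at this point, and the names here are invented):

```python
def btree_allow_dedup(storage_param_on, opclasses_deterministic):
    """Hypothetical combination of the two questions above: the user's
    preference (a storage parameter analogous to GIN's "fastupdate")
    and a hard safety check (e.g. no nondeterministic collations).

    Deduplication runs only when it is both requested and safe. The
    safety answer would ideally be cached somewhere cheap to read,
    since repeating catalog lookups with a buffer lock held is
    unappealing.
    """
    if not opclasses_deterministic:
        return False             # never safe, regardless of the parameter
    return storage_param_on      # safe: respect the user's choice
```

This also illustrates the design question in the message: the safe/unsafe flag could live alongside the storage parameter, or setting the parameter could simply be forbidden when the opclasses are known to be unsafe.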
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 24.09.2019 3:13, Peter Geoghegan wrote:
> On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan wrote:
>> I attach version 16. This revision merges your recent work on WAL logging with my recent work on simplifying _bt_dedup_one_page(). See my e-mail from earlier today for details.
> I attach version 17. This version has changes that are focussed on further polishing certain things, including fixing some minor bugs. It seemed worth creating a new version for that. (I didn't get very far with the space utilization stuff I talked about, so no changes there.)

Attached is v18. In this version bt_dedup_one_page() is refactored so that:
- no temp page is used, all updates are applied to the original page;
- each posting tuple is WAL-logged separately.
This also allowed to simplify btree_xlog_dedup significantly.

> Another infrastructure thing that the patch needs to handle to be committable: We still haven't added an "off" switch to deduplication, which seems necessary. I suppose that this should look like GIN's "fastupdate" storage parameter. It's not obvious how to do this in a way that's easy to work with, though. Maybe we could do something like copy GIN's GinGetUseFastUpdate() macro, but the situation with nbtree is actually quite different. There are two questions for nbtree when it comes to deduplication within an index: 1) Does the user want to use deduplication, because that will help performance?, and 2) Is it safe/possible to use deduplication at all?

I'll send another version with the dedup option soon.

> I think that we should probably stash this information (deduplication is both possible and safe) in the metapage. Maybe we can copy it over to our insertion scankey, just like the "heapkeyspace" field -- that information also comes from the metapage (it's based on the nbtree version). The "heapkeyspace" field is a bit ugly, so maybe we shouldn't go further by adding something similar, but I don't see any great alternative right now.
Why is it necessary to save this information somewhere other than rel->rd_options, while we can easily access this field from _bt_findinsertloc() and _bt_load()?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..d65e2a7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 					  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 					 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 					       OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	/*
 	 * Size Bloom filter based on estimated number of tuples in index,
 	 * while conservatively assuming that each block must contain at least
-	 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-	 * contain non-pivot tuples.  That's okay because they generally make
-	 * up no more than about 1% of all pages in the index.)
+	 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+	 * verification fingerprints posting list heap TIDs as plain non-pivot
+	 * tuples, complete with index keys.  This allows its heap scan to
+	 * behave as if posting lists do not exist.
 	 */
 	total_pages = RelationGetNumberOfBlocks(rel);
-	total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+	total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 			  (int64) state->rel->rd_rel->reltuples);
 	/* Random seed relies on backend srandom() call to avoid repetition */
 	seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;

 		CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);

 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan wrote: > I attach version 17. I attach a patch that applies on top of v17. It adds support for deduplication within unique indexes. Actually, this is a snippet of code that appeared in my prototype from August 5 (we need very little extra code for this now). Unique index support kind of looked like a bad idea at the time, but things have changed a lot since then. I benchmarked this overnight using a custom pgbench-based test that used a Zipfian distribution, with a single-row SELECT and an UPDATE of pgbench_accounts per xact. I used regular/logged tables this time around, since WAL-logging is now fairly efficient. I added a separate low cardinality index on pgbench_accounts(abalance). A low cardinality index is the most interesting case for this patch, obviously, but it also serves to prevent all HOT updates, increasing bloat for both indexes. We want a realistic case that creates a lot of index bloat. This wasn't a rigorous enough benchmark to present here in full, but the results were very encouraging. With reasonable client counts for the underlying hardware, we seem to have a small increase in TPS, and a small decrease in latency. There is a regression with 128 clients, when contention is ridiculously high (this is my home server, which only has 4 cores). More importantly: * The low cardinality index is almost 3x smaller with the patch -- no surprises there. * The read latency is where latency goes up, if it goes up at all. Whatever it is that might increase latency, it doesn't look like it's deduplication itself. Yeah, deduplication is expensive, but it's not nearly as expensive as a page split. (I'm talking about the immediate cost, not the bigger picture, though the bigger picture matters even more.) * The growth in primary key size over time is the thing I find really interesting. The patch seems to really control the number of pages splits over many hours with lots of non-HOT updates. 
I think that a timeline of days or weeks could be really interesting. I am now about 75% convinced that adding deduplication to unique indexes is a good idea, at least as an option that is disabled by default. We're already doing well here, even though there has been no work on optimizing deduplication in unique indexes. Further optimizations seem quite possible, though. I'm mostly thinking about optimizing non-HOT updates by teaching nbtree some basic things about versioning with unique indexes. For example, we could remember "recently dead" duplicates of the value we are about to insert (as part of an UPDATE statement) from within _bt_check_unique(). Then, when it looks like a page split may be necessary, we can hint to _bt_dedup_one_page(): "please just deduplicate the group of duplicates starting from this offset, which are duplicates of this new item I am inserting -- do not create a posting list that I will have to split, though". We already cache the binary search bounds established within _bt_check_unique() in insertstate, so perhaps we could reuse that to make this work. The goal here is that the old/recently dead versions end up together in their own posting list (or maybe two posting lists), whereas our new/most current tuple is on its own. There is a very good chance that our transaction will commit, leaving somebody else to set the LP_DEAD bit on the posting list that contains those old versions. In short, we'd be making deduplication and opportunistic garbage collection cooperate closely. This can work both ways -- maybe we should also teach _bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is, if we add the mechanism I just described in the last paragraph, maybe _bt_dedup_one_page() marks the posting list that is likely to get its LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here, LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit on this posting list tuple very soon". 
That way, if nobody actually does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If it goes to the heap and finds that the latest version is visible to any possible MVCC snapshot, then it's safe to kill all the other versions, even without the LP_DEAD bit set -- this is a unique index. So, it often gets to kill lots of extra garbage that it wouldn't otherwise get to kill, preventing page splits. The cost is pretty low: the risk that the single heap page check will be a wasted effort. (Of course, we still have to visit the heap pages of things that we go on to kill, to get the XIDs to generate recovery conflicts -- the important point is that we only need to visit one heap page in _bt_vacuum_one_page(), to *decide* if it's possible to do all this -- cases that don't benefit at all also don't pay very much.)

I don't think that we need to do either of these two other things to justify committing the patch with unique index support. But, teaching nbtree a little bit about versioning
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan wrote: > This also suggests that making _bt_dedup_one_page() do raw page adds > and page deletes to the page in shared_buffers (i.e. don't use a temp > buffer page) could pay off. As I went into at the start of this > e-mail, unnecessarily doing expensive things like copying large > posting lists around is a real concern. Even if it isn't truly useful > for _bt_dedup_one_page() to operate in a very incremental fashion, > incrementalism is probably still a good thing to aim for -- it seems > to make deduplication faster in all cases. I think that I forgot to mention that I am concerned that the kill_prior_tuple/LP_DEAD optimization could be applied less often because _bt_dedup_one_page() operates too aggressively. That is a big part of my general concern. Maybe I'm wrong about this -- who knows? I definitely think that LP_DEAD setting by _bt_check_unique() is generally a lot more important than LP_DEAD setting by the kill_prior_tuple optimization, and the patch won't affect unique indexes. Only very serious benchmarking can give us a clear answer, though. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova wrote: > > 3. Third, there is incremental writing of the page itself -- avoiding > > using a temp buffer. Not sure where I stand on this. > > I think it's a good idea. memmove must be much faster than copying > items tuple by tuple. > I'll send next patch by the end of the week. I think that the biggest problem is that we copy all of the tuples, including existing posting list tuples that can't be merged with anything. Even if you assume that we'll never finish early (e.g. by using logic like the "if (pagesaving >= newitemsz) deduplicate = false;" thing), this can still unnecessarily slow down deduplication. Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the space on the page is already used by a small number of very large posting list tuples. > > The loop within _bt_dedup_one_page() is very confusing in both v13 and > > v14 -- I couldn't figure out why the accounting worked like this: > I'll look at it. I'm currently working on merging my refactored version of _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky. (I have finished merging the other WAL-logging stuff, though -- that was easy.) The general idea is that the loop in _bt_dedup_one_page() now explicitly operates with a "base" tuple, rather than *always* saving the prev tuple from the last loop iteration. We always have a "pending posting list", which won't be written-out as a posting list if it isn't possible to merge at least one existing page item. The "base" tuple doesn't change. "pagesaving" space accounting works in a way that doesn't care about whether or not the base tuple was already a posting list -- it saves the size of the IndexTuple without any existing posting list size, and calculates the contribution to the total size of the new posting list separately (heap TIDs from the original base tuple and subsequent tuples are counted together). 
This has a number of advantages: * The loop is a lot clearer now, and seems to have slightly better space utilization because of improved accounting (with or without the "if (pagesaving >= newitemsz) deduplicate = false;" thing). * I think that we're going to need to be disciplined about which tuple is the "base" tuple for correctness reasons -- we should always use the leftmost existing tuple to form a new posting list tuple. I am concerned about rare cases where we deduplicate tuples that are equal according to _bt_keep_natts_fast()/datum_image_eq() that nonetheless have different sizes (and are not bitwise equal). There are rare cases involving TOAST compression where that is just about possible (see the temp comments I added to _bt_keep_natts_fast() in the patch). * It's clearly faster, because there is far less palloc() overhead -- the "land" unlogged table test completes in about 95.5% of the time taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate = false;" for both versions here, to keep it simple and fair). This also suggests that making _bt_dedup_one_page() do raw page adds and page deletes to the page in shared_buffers (i.e. don't use a temp buffer page) could pay off. As I went into at the start of this e-mail, unnecessarily doing expensive things like copying large posting lists around is a real concern. Even if it isn't truly useful for _bt_dedup_one_page() to operate in a very incremental fashion, incrementalism is probably still a good thing to aim for -- it seems to make deduplication faster in all cases. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
16.09.2019 21:58, Peter Geoghegan wrote:

On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova wrote:

I tested the patch with nbtree_wal_test, and found out that the real issue is not the dedup WAL records themselves, but the full page writes that they trigger. Here are test results (config is standard, except fsync=off to speed up tests): 'FPW on' and 'FPW off' are tests on v14. NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that it makes sense to focus on synthetic cases without FPWs/FPIs from checkpoints. At least for now.

With random insertions into the btree it's highly possible that deduplication will often be the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed.

<...>

I think that the problem here is that you didn't copy this old code from _bt_split() over to _bt_dedup_one_page():

    /*
     * Copy the original page's LSN into leftpage, which will become the
     * updated version of the page.  We need this because XLogInsert will
     * examine the LSN and possibly dump it in a page image.
     */
    PageSetLSN(leftpage, PageGetLSN(origpage));
    isleaf = P_ISLEAF(oopaque);

Note that this happens at the start of _bt_split() -- the temp page buffer based on origpage starts out with the same LSN as origpage. This is an important step of the WAL volume optimization used by _bt_split().

That's it. I suspected that such an enormous amount of FPW reflected some bug. That's why there is no significant difference with the log_newpage_buffer() approach. And that's why "lazy" deduplication doesn't help to decrease the amount of WAL. My point was that the problem is extra FPWs, so it doesn't matter whether we deduplicate just several entries to free enough space or all of them.

The term "lazy deduplication" is seriously overloaded here. I think that this could cause miscommunications. Let me list the possible meanings of that term here:

1.
First of all, the basic approach to deduplication is already lazy, unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid a page split. I'm 100% sure that we both think that that works well compared to an eager approach (like GIN's).

Sure.

2. Second of all, there is the need to incrementally WAL log. It looks like v14 does that well, in that it doesn't create "xlrec_dedup.n_intervals" space when it isn't truly needed. That's good.

In v12-v15 I mostly concentrated on this feature. The last version looks good to me.

3. Third, there is incremental writing of the page itself -- avoiding using a temp buffer. Not sure where I stand on this.

I think it's a good idea. memmove must be much faster than copying items tuple by tuple. I'll send the next patch by the end of the week.

4. Finally, there is the possibility that we could make deduplication incremental, in order to avoid work that won't be needed altogether -- this would probably be combined with 3. Not sure where I stand on this, either.

We should try to be careful when using these terms, as there is a very real danger of talking past each other.

Another, and more realistic, approach is to make deduplication less intensive: if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record.

I see that v14 uses the "dedupInterval" struct, which provides a logical description of a deduplicated set of tuples. That general approach is at least 95% of what I wanted from the _bt_dedup_one_page() WAL-logging.

Probably that was the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions -- not because deduplication itself is CPU bound or something, but because the WAL load decreased.

I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression faster as things are. I am not against bringing back BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I think that it's a distraction. It may hide problems that we want to fix.
Like the PageSetLSN() problem I mentioned just now, and maybe others. We will definitely need to have page space accounting that's a bit similar to nbtsplitloc.c, to avoid the case where a leaf page is 100% full (or has 4 bytes left, or something). That happens regularly now. That must start with teaching _bt_dedup_one_page() about how much space it will free. Basing it on the number of items on the page or whatever is not going to work that well. I think that it would be possible to have something like BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the deduplication incremental, in the sense that it can give up on deduplication when it frees enough space (i.e. something like v13's 0002-* patch). I said that these two things are closely related, which is true, but it's also true that they don't overlap. Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so made a handful of specific indexes (mostly from TPC-H) significantly smaller. I never tried to debug the
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova wrote:
> Attached is v14 based on v12 (v13 changes are not merged).
>
> In this version, I fixed the bug you mentioned and also fixed nbtinsert,
> so that it doesn't save newposting in xlog record anymore.

Cool.

> I tested patch with nbtree_wal_test, and found out that the real issue is
> not the dedup WAL records themselves, but the full page writes that they
> trigger.
> Here are test results (config is standard, except fsync=off to speedup tests):
>
> 'FPW on' and 'FPW off' are tests on v14.
> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that it makes sense to focus on synthetic cases without FPWs/FPIs from checkpoints. At least for now.

> With random insertions into btree it's highly possible that deduplication
> will often be the first write after checkpoint, and thus will trigger FPW,
> even if only a few tuples were compressed.

I find that hard to believe. Deduplication only occurs when we're about to split the page. If that's almost as likely to occur as a simple insert, then we're in big trouble (maybe it's actually true, but if it is then that's the real problem). Also, fewer pages for the index naturally lead to far fewer FPIs after a checkpoint.

I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the same case on v13. It was practically the same as the master branch, apart from the huge difference in FPIs for the XLOG rmgr. Aside from that one huge difference, there was a similar volume of the same types of WAL records in each case. Mostly leaf inserts, and far fewer internal page inserts. I suppose this isn't surprising.
It probably makes sense for the final version of the patch to increase the volume of WAL a little overall, since the savings for internal page stuff cannot make up for the cost of having to WAL log something extra (deduplication operations) on leaf pages, regardless of the size of those extra dedup WAL records (I am ignoring FPIs after a checkpoint in this analysis). So the patch is more or less certain to add *some* WAL overhead in cases that benefit, and that's okay. But, it adds way too much WAL overhead today (even in v14), for reasons that we don't understand yet, which is not okay. I may have misunderstood your approach to WAL-logging in v12. I thought that you were WAL-logging things that didn't change, which doesn't seem to be the case with v14. I thought that v12 was very similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does its WAL-logging. v14 looks good, though. "pg_waldump -z" and "pg_waldump --stats=record" will break down the contributing factor of FPIs, so it should be possible to account for the overhead in the test case exactly. We can debug the problem by using pg_waldump to count the absolute number of each type of record, and the size of each type of record. (Thinks some more...) I think that the problem here is that you didn't copy this old code from _bt_split() over to _bt_dedup_one_page(): /* * Copy the original page's LSN into leftpage, which will become the * updated version of the page. We need this because XLogInsert will * examine the LSN and possibly dump it in a page image. */ PageSetLSN(leftpage, PageGetLSN(origpage)); isleaf = P_ISLEAF(oopaque); Note that this happens at the start of _bt_split() -- the temp page buffer based on origpage starts out with the same LSN as origpage. This is an important step of the WAL volume optimization used by _bt_split(). > That's why there is no significant difference with log_newpage_buffer() > approach. > And that's why "lazy" deduplication doesn't help to decrease amount of WAL. 
The term "lazy deduplication" is seriously overloaded here. I think that this could cause miscommunications. Let me list the possible meanings of that term here: 1. First of all, the basic approach to deduplication is already lazy, unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid a page split. I'm 100% sure that we both think that that works well compared to an eager approach (like GIN's). 2. Second of all, there is the need to incrementally WAL log. It looks like v14 does that well, in that it doesn't create "xlrec_dedup.n_intervals" space when it isn't truly needed. That's good. 3. Third, there is incremental writing of the page itself -- avoiding using a temp buffer. Not sure where I stand on this. 4. Finally, there is the possibility that we could make deduplication incremental, in order to avoid work that won't be needed altogether -- this would probably be combined with 3. Not sure where I stand on this, either. We should try to be careful when using these terms, as there is a very real danger of talking past each other. > Another, and more realistic approach is to make deduplication less intensive: > if freed space is less than some threshold, fall back to not changing page at > all and not generating xlog record. I see
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
13.09.2019 4:04, Peter Geoghegan wrote:

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan wrote: I think that the new WAL record has to be created once per posting list that is generated, not once per page that is deduplicated -- that's the only way that I can see that avoids a huge increase in total WAL volume. Even if we assume that I am wrong about there being value in making deduplication incremental, it is still necessary to make the WAL-logging behave incrementally.

It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue.

Attached is v14 based on v12 (v13 changes are not merged). In this version, I fixed the bug you mentioned and also fixed nbtinsert, so that it doesn't save newposting in the xlog record anymore.

I tested the patch with nbtree_wal_test, and found out that the real issue is not the dedup WAL records themselves, but the full page writes that they trigger. Here are test results (config is standard, except fsync=off to speed up tests). 'FPW on' and 'FPW off' are tests on v14. NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

                    | FPW on    | FPW off   | FORCE_NO_IMAGE | master
 -------------------+-----------+-----------+----------------+-----------
  time              | 09:12 min | 06:56 min | 06:24 min      | 08:10 min
  nbtree_wal_volume | 8083 MB   | 2128 MB   | 2327 MB        | 2439 MB
  index_size        | 169 MB    | 169 MB    | 169 MB         | 1118 MB

With random insertions into the btree it's highly possible that deduplication will often be the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed. That's why there is no significant difference with the log_newpage_buffer() approach. And that's why "lazy" deduplication doesn't help to decrease the amount of WAL. Also, since the index is packed way better than before, it probably benefits less from wal_compression.

One possible "fix" to decrease WAL amplification is to add the REGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().
As you can see from the test results, it really eliminates the problem of inadequate WAL volume. However, I doubt that it is a crash-safe idea.

Another, and more realistic, approach is to make deduplication less intensive: if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record. Probably that was the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions -- not because deduplication itself is CPU bound or something, but because the WAL load decreased. So I propose to develop this idea. The question is how to choose the threshold. I wouldn't like to introduce new user settings. Any ideas?

I also noticed that the number of checkpoints differs between tests:

select checkpoints_req from pg_stat_bgwriter;

                  | FPW on | FPW off | FORCE_NO_IMAGE | master
 -----------------+--------+---------+----------------+--------
  checkpoints_req | 16     | 7       | 8              | 10

And I struggle to explain the reason for this. Do you understand what can cause the difference?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;

 		CHECK_FOR_INTERRUPTS();

@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);

 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 3:09 PM Peter Geoghegan wrote: > Hmm. So v12 seems to have some problems with the WAL logging for > posting list splits. With wal_debug = on and > wal_consistency_checking='all', I can get a replica to fail > consistency checking very quickly when "make installcheck" is run on > the primary I see the bug here. The problem is that we WAL-log a version of the new item that already has its heap TID changed. On the primary, the call to _bt_form_newposting() has a new item with the original heap TID, which is then rewritten before being inserted -- that's correct. But during recovery, we *start out with* a version of the new item that *already* had its heap TID swapped. So we have nowhere to get the original heap TID from during recovery. Attached patch fixes the problem in a hacky way -- it WAL-logs the original heap TID, just in case. Obviously this fix isn't usable, but it should make the problem clearer. Can you come up with a proper fix, please? I can think of one way of doing it, but I'll leave the details to you. The same issue exists in _bt_split(), so the tests will still fail with wal_consistency_checking -- it just takes a lot longer to reach a point where an inconsistent page is found, because posting list splits that occur at the same point that we need to split a page are much rarer than posting list splits that occur when we simply need to insert, without splitting the page. I suggest using wal_consistency_checking to test the fix that you come up with. As I mentioned, I regularly use it. Also note that there are further subtleties to doing this within _bt_split() -- see the FIXME comments there. Thanks -- Peter Geoghegan 0001-Save-original-new-heap-TID-in-insert-WAL-record.patch Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova wrote:
> Attached is v12, which contains WAL optimizations for posting split and
> page deduplication.

Hmm. So v12 seems to have some problems with the WAL logging for posting list splits. With wal_debug = on and wal_consistency_checking='all', I can get a replica to fail consistency checking very quickly when "make installcheck" is run on the primary:

4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30423A0; LSN 0/30425A0: prev 0/3041C78; xid 506; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 20 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30425A0; LSN 0/3042F78: prev 0/30423A0; xid 506; len 4; blkref #0: rel 1663/16385/2673, blk 13 FPW - Btree/INSERT_LEAF: off 138; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3042F78; LSN 0/3043788: prev 0/30425A0; xid 506; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 68; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043788; LSN 0/30437C0: prev 0/3042F78; xid 506; len 28 - Transaction/ABORT: 2019-09-11 15:01:06.291717-07; rels: pg_tblspc/16388/PG_13_201909071/16385/16399
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30437C0; LSN 0/3043A30: prev 0/3043788; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 9 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043A30; LSN 0/3043D08: prev 0/30437C0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043D08; LSN 0/3044948: prev 0/3043A30; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 169; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044948; LSN 0/3044B58: prev 0/3043D08; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 21 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044B58; LSN 0/30454A0: prev 0/3044948; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30454A0; LSN 0/3045CC0: prev 0/3044B58; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 71; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045CC0; LSN 0/3045F48: prev 0/30454A0; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 10 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045F48; LSN 0/3046240: prev 0/3045CC0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046240; LSN 0/3046E70: prev 0/3045F48; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 44; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046E70; LSN 0/3047090: prev 0/3046240; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 22 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3047090; LSN 0/30479E0: prev 0/3046E70; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30479E0; LSN 0/3048420: prev 0/3047090; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 38 FPW - Btree/INSERT_LEAF: off 10; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048420; LSN 0/30486B0: prev 0/30479E0; xid 507; len 3; blkref #0: rel 1663/16385/1259, blk 0 FPW - Heap/INSERT: off 6 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30486B0; LSN 0/3048C30: prev 0/3048420; xid 507; len 4; blkref #0: rel 1663/16385/2662, blk 2 FPW - Btree/INSERT_LEAF: off 119; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048C30; LSN 0/3049668: prev 0/30486B0; xid 507; len 4; blkref #0: rel 1663/16385/2663, blk 1 FPW - Btree/INSERT_LEAF: off 42; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3049668; LSN 0/304A550: prev 0/3048C30; xid 507; len 4; blkref #0: rel 1663/16385/3455, blk 1 FPW - Btree/INSERT_LEAF: off 2; in_posting_offset 1
4448/2019-09-11 15:01:06 PDT FATAL: inconsistent page found, rel 1663/16385/3455, forknum 0, blkno 1
4448/2019-09-11 15:01:06 PDT CONTEXT: WAL redo at 0/3049668 for Btree/INSERT_LEAF: off 2; in_posting_offset 1
4447/2019-09-11 15:01:06 PDT LOG: startup process (PID 4448) exited with exit code 1
4447/2019-09-11 15:01:06 PDT LOG: terminating any other active server processes
4447/2019-09-11 15:01:06 PDT LOG: database system is shut down

I regularly use this test case for the patch -- I think that I fixed a similar problem in v11, when I changed the same WAL logging, but I didn't mention it until now. I will debug this myself in a few days, though you may prefer to do it before then.

-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova wrote:
> I reviewed them and everything looks good except the idea of not
> splitting dead posting tuples.
> According to the comments on scan->ignore_killed_tuples in genam.c:107,
> it may lead to incorrect tuple order on a replica.
> I'm not sure if it leads to any real problem, though, or whether it will be
> resolved by subsequent visibility checks.

Fair enough, but I didn't do that because it's compelling on its own -- it isn't. I did it because it seemed like the best way to handle posting list splits in a version of the patch where LP_DEAD bits can be set on posting list tuples. I think that we have 3 high level options here:

1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting lists at all. This is clearly the easiest approach.

2. We do what I did in v11 of the patch -- we make it so that _bt_insertonpg() and _bt_split() never have to deal with LP_DEAD posting lists that they must split in passing.

3. We add additional code to _bt_insertonpg() and _bt_split() to deal with the rare case where they must split an LP_DEAD posting list, probably by unsetting the bit or something like that. Obviously it would be wrong to leave the LP_DEAD bit set for the newly inserted heap tuple's TID that must go in a posting list that had its LP_DEAD bit set -- that would make it dead to index scans even after its xact successfully committed.

I think that you already agree that we want to have the kill_prior_tuple optimizations with posting lists, so #1 isn't really an option. That just leaves #2 and #3. Since posting list splits are already assumed to be quite rare, it seemed far simpler to take the conservative approach of forcing clean-up that removes LP_DEAD bits so that _bt_insertonpg() and _bt_split() don't have to think about it. Obviously I think it's important that we make as few changes as possible to _bt_insertonpg() and _bt_split(), in general.

I don't understand what you mean about visibility checks.
There is nothing truly special about the way in which _bt_findinsertloc() will sometimes have to kill LP_DEAD items so that _bt_insertonpg() and _bt_split() don't have to think about LP_DEAD posting lists. As far as recovery is concerned, it is just another XLOG_BTREE_DELETE record, like any other. Note that there is a second call to _bt_binsrch_insert() within _bt_findinsertloc() when it has to generate a new XLOG_BTREE_DELETE record (by calling _bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that isn't dependent on the BTP_HAS_GARBAGE status bit being set). > Anyway, it's worth to add more comments in > _bt_killitems() explaining why it's safe. There is no question that the little snippet of code I added to _bt_killitems() in v11 is still too complicated. We also have to consider cases where the array overflows because the scan direction was changed (see the kill_prior_tuple comment block in btgetuple()). Yeah, it's messy. > Attached is v12, which contains WAL optimizations for posting split and > page > deduplication. Cool. > * xl_btree_split record doesn't contain posting tuple anymore, instead > it keeps > 'in_posting offset' and repeats the logic of _bt_insertonpg() as you > proposed > upthread. That looks good. > * I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains > info about > groups of tuples deduplicated into posting tuples. In principle, it is > possible > to fit it into some existing record, but I preferred to keep things clear. I definitely think that inventing a new WAL record was the right thing to do. > I haven't measured how these changes affect WAL size yet. > Do you have any suggestions on how to automate testing of new WAL records? > Is there any suitable place in regression tests? I don't know about the regression tests (I doubt that there is a natural place for such a test), but I came up with a rough test case. 
I more or less copied the approach that you took with the index build WAL reduction patches, though I also figured out a way of subtracting heapam WAL overhead to get a real figure. I attach the test case -- note that you'll need to use the "land" database with this. (This test case might need to be improved, but it's a good start.)

> * I also noticed that _bt_dedup_one_page() can be optimized to return early
> when no tuples were deduplicated. I wonder if we can introduce an internal
> statistic to tune deduplication? That is returning to the idea of
> BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
> have very few duplicates or pages that are already full of posting lists.

I think that the BT_COMPRESS_THRESHOLD idea is closely related to making _bt_dedup_one_page() behave incrementally. On my machine, v12 of the patch actually uses slightly more WAL than v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL compression, so if I turned that off v12 would
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan wrote:
> I see some Valgrind errors on v9, all of which look like the following
> two sample errors I go into below.

I've found a fix for these Valgrind issues. It's a matter of making sure that _bt_truncate() sizes new pivot tuples properly, which is quite subtle:

--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	{
 		BTreeTupleClearBtIsPosting(pivot);
 		BTreeTupleSetNAtts(pivot, keepnatts);
-		pivot->t_info &= ~INDEX_SIZE_MASK;
-		pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		if (keepnatts == natts)
+		{
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
 	}

I'm varying how the new pivot tuple is sized here according to whether or not index_truncate_tuple() just does a CopyIndexTuple(). This very slightly changes the behavior of the nbtsplitloc.c stuff, but that's not a concern for me. I will post a patch with this and other tweaks next week.
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan wrote:
> I agree that v9 might be ever so slightly more space efficient than v5
> was, on balance.

I see some Valgrind errors on v9, all of which look like the following two sample errors I go into below.

First one:

==11193== VALGRINDERROR-BEGIN
==11193== Unaddressable byte(s) found during client check request
==11193==    at 0x4C0E03: PageAddItemExtended (bufpage.c:332)
==11193==    by 0x20F6C3: _bt_split (nbtinsert.c:1643)
==11193==    by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206)
==11193==    by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==    by 0x3B85A1: main (main.c:210)
==11193==  Address 0x9c11350 is 0 bytes after a recently re-allocated block of size 8,192 alloc'd
==11193==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193==    by 0x61085A: AllocSetAlloc (aset.c:914)
==11193==    by 0x617AD8: palloc (mcxt.c:938)
==11193==    by 0x21A829: _bt_mkscankey (nbtutils.c:107)
==11193==    by 0x2118F3: _bt_doinsert (nbtinsert.c:93)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==
==11193== VALGRINDERROR-END
{
   Memcheck:User
   fun:PageAddItemExtended
   fun:_bt_split
   fun:_bt_insertonpg
   fun:_bt_doinsert
   fun:btinsert
   fun:index_insert
   fun:ExecInsertIndexTuples
   fun:ExecInsert
   fun:ExecModifyTable
   fun:ExecProcNodeFirst
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:ProcessQuery
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}

nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the lefthikey call. 
Second one:

==11193== VALGRINDERROR-BEGIN
==11193== Invalid read of size 2
==11193==    at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126)
==11193==    by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova wrote: > Your explanation helped me to understand that this approach can be > extended to > the case of insertion into posting list, that doesn't trigger posting > split, > and that nbtsplitloc indeed doesn't need to know about posting tuples > specific. > The code is much cleaner now. Fantastic! > Some individual indexes are larger, some are smaller compared to the > expected output. I agree that v9 might be ever so slightly more space efficient than v5 was, on balance. In any case v9 completely fixes the regression that I saw in the last version. I have pushed the changes to the test output for the serial tests that I privately maintain, that I gave you access to. The MGD test output also looks perfect. We may find that deduplication is a little too effective, in the sense that it packs so many tuples on to leaf pages that *concurrent* inserters will tend to get excessive page splits. We may find that it makes sense to aim for posting lists that are maybe 96% of BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this reason. Concurrent inserters will tend to have heap TIDs that are slightly out of order, so we want to at least have enough space remaining on the left half of a "single value mode" split. We may end up with a design where deduplication anticipates what will be useful for nbtsplitloc.c. I still think that it's too early to start worrying about problems like this one -- I feel it will be useful to continue to focus on the code and the space utilization of the serial test cases for now. We can look at it at the same time that we think about adding back something like BT_COMPRESS_THRESHOLD. I am mentioning it now because it's probably a good time for you to start thinking about it, if you haven't already (actually, maybe I'm just describing what BT_COMPRESS_THRESHOLD was supposed to do in the first place). 
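The "96% of BTMaxItemSize()" idea above amounts to a simple size-target calculation, mirroring how BTREE_SINGLEVAL_FILLFACTOR works for page splits. The sketch below is hypothetical: the constant 2704 only approximates BTMaxItemSize() for an 8KB page, and the names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical figures: BTMaxItemSize() is roughly a third of an 8KB page
 * in the real code, and 96 mirrors BTREE_SINGLEVAL_FILLFACTOR.
 */
#define MAX_ITEM_SIZE       2704
#define POSTING_FILLFACTOR  96

/*
 * Cap posting lists at a percentage of the maximum item size, leaving a
 * little slack so that concurrent inserters with slightly out-of-order
 * heap TIDs don't immediately force a page split.
 */
static size_t
posting_size_target(void)
{
    return (MAX_ITEM_SIZE * POSTING_FILLFACTOR) / 100;
}
```

The design choice is the same one BTREE_SINGLEVAL_FILLFACTOR embodies: deliberately not packing to 100% so the left half of a "single value mode" split retains free space.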
We'll need to have a good benchmark to assess these questions, and it's not obvious what that will be. Two possible candidates are TPC-H and TPC-E. (Of course, I mean running them for real -- not using their indexes to make sure that the nbtsplitloc.c stuff works well in isolation.) Any thoughts on a conventional benchmark that allows us to understand the patch's impact on both throughput and latency? BTW, I notice that we often have indexes that are quite a lot smaller when they were created with retail insertions rather than with CREATE INDEX/REINDEX. This is not new, but the difference is much larger than it typically is without the patch. For example, the TPC-E index on trade.t_ca_id (which is named "i_t_ca_id" or "i_t_ca_id2" in my test) is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions (assuming the insertions use the actual order from the test). I'm not sure what to do about this, if anything. I mean, the reason that the retail insertions do better is that they have the nbtsplitloc.c stuff, and because we don't split the page until it's 100% full and until deduplication stops helping -- we could apply several rounds of deduplication before we actually have to split the page. So the difference that we see here is both logical and surprising. How do you feel about this CREATE INDEX index-size-is-larger business? -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. 
In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. * When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. 
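The "split to the right" rule sketched in the bullets above can be illustrated with a toy model (TIDs reduced to integers, a fixed capacity standing in for BTMaxItemSize(); all names hypothetical): a new TID is memmove()'d into its sorted position in the posting list, and when the list is full, either the old rightmost TID or the new TID itself is pushed out as a plain tuple to the immediate right.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* TIDs reduced to ints; capacity stands in for BTMaxItemSize(). */
#define POSTING_MAX_TIDS 4

typedef struct PostingList
{
    int    tids[POSTING_MAX_TIDS];
    size_t ntids;               /* always sorted ascending */
} PostingList;

/*
 * Insert newtid into a sorted posting list.  If the list is already at
 * capacity, the TID that ends up rightmost is pushed out through *evicted
 * (it becomes a plain tuple immediately to the right of the posting
 * list); returns true when that happens.
 */
static bool
posting_list_insert(PostingList *plist, int newtid, int *evicted)
{
    size_t pos = 0;

    while (pos < plist->ntids && plist->tids[pos] < newtid)
        pos++;

    if (plist->ntids < POSTING_MAX_TIDS)
    {
        /* Simple case: memmove() the tail to open a gap at pos. */
        memmove(&plist->tids[pos + 1], &plist->tids[pos],
                (plist->ntids - pos) * sizeof(int));
        plist->tids[pos] = newtid;
        plist->ntids++;
        return false;
    }

    if (pos == plist->ntids)
    {
        /* New TID is rightmost: it becomes the plain tuple itself. */
        *evicted = newtid;
        return true;
    }

    /* Old rightmost TID is pushed out as a plain tuple to the right. */
    *evicted = plist->tids[plist->ntids - 1];
    memmove(&plist->tids[pos + 1], &plist->tids[pos],
            (plist->ntids - 1 - pos) * sizeof(int));
    plist->tids[pos] = newtid;
    return true;
}
```

Note how the split is always "to the right": the posting list stays as large as possible, and only a single plain tuple emerges, matching the space-accounting simplification described above.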
(Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
23.08.2019 7:33, Peter Geoghegan wrote:
> On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova wrote:
> > I'm going to look through the patch once more to update nbtxlog
> > comments, where needed, and answer your remarks that are still left
> > in the comments.
> Have you been using amcheck's rootdescend verification?

No, I haven't checked it with the latest version yet.

> There were many large indexes that amcheck didn't detect a problem
> with. I don't yet understand what the problem is, or why we only see
> the problem for a small number of indexes. Note that all of these
> indexes passed verification with v5, so this is some kind of
> regression. I also noticed that there were some regressions in the
> size of indexes -- indexes were not nearly as small as they were in v5
> in some cases. The overall picture was a clear regression in how
> effective deduplication is.

Do these indexes have something in common? Maybe some specific workload? Are there any error messages in the log? I'd like to pinpoint what caused the problem.

There were several major changes between v5 and v8:
- dead tuples handling added in v6;
- _bt_split changes for posting tuples in v7;
- WAL logging of posting tuple changes in v8.

I don't think the last one could break regular indexes on master. Do you see the same regression in v6 and v7?

> I think that it would save time if you had direct access to my test
> data, even though it's a bit cumbersome. You'll have to download about
> 10GB of dumps, which require plenty of disk space when restored:
> Want me to send this data and the associated tests script over to you?

Yes, I think it will help me to debug the patch faster.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed, and answer your remarks that are still left in
> the comments.

Have you been using amcheck's rootdescend verification? I see this problem with v8, with the TPC-H test data:

DEBUG:  finished verifying presence of 150 tuples from table "customer" with bitset 51.09% set
ERROR:  could not find tuple using search from root page in index "idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true, rootdescend => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
      AND c.relpersistence != 't'
      AND c.relkind = 'i'
      AND i.indisready
      AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem with. I don't yet understand what the problem is, or why we only see the problem for a small number of indexes. Note that all of these indexes passed verification with v5, so this is some kind of regression. I also noticed that there were some regressions in the size of indexes -- indexes were not nearly as small as they were in v5 in some cases. The overall picture was a clear regression in how effective deduplication is. I think that it would save time if you had direct access to my test data, even though it's a bit cumbersome. 
You'll have to download about 10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                              List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |                Description
------------+-------+----------+------------+------------+-------------------+---------+------------+--------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing nbtsplitloc.c, or anything that could affect where page splits make free space available. If this is too much data to handle conveniently, then you could skip "mgd" and almost have as much test coverage. There really does seem to be a benefit to using diverse test cases like this, because sometimes regressions only affect a small number of specific indexes for specific reasons. For example, only TPC-H has a small number of indexes that have tuples that are inserted in order, but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD stuff really helped with those indexes. Want me to send this data and the associated tests script over to you?
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
20.08.2019 4:04, Peter Geoghegan wrote:
> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote:
> > It seems that now all replace operations are crash-safe. The new
> > patch passes all regression tests, so I think it's ready for review
> > again.
> I'm looking at it now. I'm going to spend a significant amount of time
> on this tomorrow. I think that we should start to think about efficient
> WAL-logging now.

Thank you for the review. The new version v8 is attached. Compared to the previous version, this patch includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL records now only contain data about the updated posting tuple and don't require full page writes. I haven't updated pg_waldump yet; that is postponed until we agree on the nbtxlog changes. Also in this patch I renamed all 'compress' keywords to 'deduplicate' and did minor cleanup of outdated comments. I'm going to look through the patch once more to update nbtxlog comments where needed, and answer your remarks that are still left in the comments.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit d73c1b8e10177dfb55ff1b1bac999f85d2a0298d
Author: Anastasia
Date:   Wed Aug 21 20:00:54 2019 +0300

    v8-0001-Deduplication-in-nbtree.patch

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..ddc511a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 	size_t		tupsize;
 	BTScanInsert skey;
 	bool		lowersizelimit;
+	ItemPointer scantid;

 	CHECK_FOR_INTERRUPTS();

@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)

 	/*
 	 * Readonly callers may optionally verify that non-pivot tuples can
-	 * each be found by an independent search that starts from the root
+	 * each be found by an independent search that starts from the root.
+	 * Note that we deliberately don't do individual searches for each
+	 * "logical" posting list tuple, since the posting list itself is
+	 * validated by other checks.
 	 */
 	if (state->rootdescend && P_ISLEAF(topaque) &&
 		!bt_rootdescend(state, itup))
 	{
 		char	   *itid,
 				   *htid;
+		ItemPointer tid = BTreeTupleGetHeapTID(itup);

 		itid = psprintf("(%u,%u)", state->targetblock, offset);
 		htid = psprintf("(%u,%u)",
-						ItemPointerGetBlockNumber(&(itup->t_tid)),
-						ItemPointerGetOffsetNumber(&(itup->t_tid)));
+						ItemPointerGetBlockNumber(tid),
+						ItemPointerGetOffsetNumber(tid));

 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("could not find tuple using search from root page in index \"%s\"",
 						RelationGetRelationName(state->rel)),
-				 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+				 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
									itid, htid,
									(uint32) (state->targetlsn >> 32),
									(uint32) state->targetlsn)));
 	}

+	/*
+	 * If tuple is actually a posting list, make sure posting list TIDs
+	 * are in order.
+	 */
+	if (BTreeTupleIsPosting(itup))
+	{
+		ItemPointerData last;
+		ItemPointer current;
+
+		ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+		for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+		{
+			current = BTreeTupleGetPostingN(itup, i);
+
+			if (ItemPointerCompare(current, &last) <= 0)
+			{
+				char	   *itid,
+						   *htid;
+
+				itid = psprintf("(%u,%u)", state->targetblock, offset);
+				htid = psprintf("(%u,%u)",
+								ItemPointerGetBlockNumberNoCheck(current),
+								ItemPointerGetOffsetNumberNoCheck(current));
+
+				ereport(ERROR,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("posting list heap TIDs out of order in index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
											itid, htid,
											(uint32) (state->targetlsn >> 32),
											(uint32) state->targetlsn)));
+			}
+
+			ItemPointerCopy(current, &last);
+		}
+	}
+
 	/* Build insertion scankey for current page offset */
 	skey = bt_mkscankey_pivotsearch(state->rel, itup);

@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		IndexTuple	norm;

-		norm = bt_normalize_tuple(state, itup);
-		bloom_add_element(state->filter, (unsigned char *) norm,
-						  IndexTupleSize(norm));
-		/* Be tidy */
-		if (norm != itup)
-			pfree(norm);
+		if (BTreeTupleIsPosting(itup))
+		{
+			IndexTuple	onetup;
+
+			/* Fingerprint all elements of posting tuple one by one */
+			for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+				norm = bt_normalize_tuple(state, onetup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != onetup)
+
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > > - If bt_findinsertloc() found out that tuple belongs to existing posting > tuple's > TID interval, it sets 'in_posting_offset' variable and passes it to > _bt_insertonpg() > > - If 'in_posting_offset' is valid and origtup is valid, > merge our itup into origtup. > > It can result in one tuple neworigtup, that must replace origtup; or two > tuples: > neworigtup and newrighttup, if the result exceeds BTMaxItemSize, That sounds like the right way to do it. > - If two new tuple(s) fit into the old page, we're lucky. > call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to > atomically replace oldtup with new tuple(s) and generate xlog record. > > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. That makes sense, since _bt_split() is responsible for both splitting the page, and inserting the new item on either the left or right page, as part of the first phase of a page split. In other words, if you're adding something new to _bt_insertonpg(), you probably also need to add something new to _bt_split(). So that's what you did. > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I'm looking at it now. I'm going to spend a significant amount of time on this tomorrow. I think that we should start to think about efficient WAL-logging now. > In the meantime, I'll run more stress-tests. As you probably realize, wal_consistency_checking is a good thing to use with your tests here. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova wrote:
> > I still need to think about the exact details of alignment within
> > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> > could be wrong.
> Could you explain more about these cases? Now I don't understand the
> problem. Maybe there is no problem.
> Thank you for the patch. Still, I'd suggest leaving it as a possible
> future improvement, so that it doesn't distract us from the original
> feature.

I don't even think that it's useful work for the future. It's just nice to be sure that we could support unique index deduplication if it made sense. Which it doesn't. If I didn't write the patch that implements deduplication for unique indexes, I might still not realize that we need the index_compute_xid_horizon_for_tuples() stuff in certain other places. I'm not serious about it at all, except as a learning exercise/experiment.

> I added to v6 another related fix for _bt_compress_one_page(). Previous
> code implicitly deleted DEAD items without calling
> index_compute_xid_horizon_for_tuples(). New code has a check whether
> DEAD items on the page exist and removes them if any. Another possible
> solution is to copy dead items as is from the old page to the new one,
> but I think it's good to remove dead tuples as fast as possible.

I think that what you've done in v7 is probably the best way to do it. It's certainly simple, which is appropriate given that we're not really expecting to see LP_DEAD items within _bt_compress_one_page() (we just need to be prepared for them).

> > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
> > existing item if it's an LP_DEAD item that falls in the same TID range
> > (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> > may not be a posting list tuple already). I haven't made this code do
> > something like call index_compute_xid_horizon_for_tuples(), even
> > though that's needed for correctness (i.e. this new code is currently
> > broken in the same way that I mentioned unique index support is
> > broken).
> Is it possible that the DEAD tuple to delete was smaller than itup?

I'm not sure what you mean by this. I suppose that it doesn't matter, since we both prefer the alternative that you came up with anyway.

> > How do you feel about officially calling this deduplication, not
> > compression? I think that it's a more accurate name for the technique.
> I agree. Should I rename all related names of functions and variables
> in the patch?

Please rename them when convenient.
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
13.08.2019 18:45, Anastasia Lubennikova wrote:
>> I also added a nearby FIXME comment to _bt_insertonpg_in_posting() --
>> I don't think that the code for splitting a posting list in two is
>> currently crash-safe.
> Good catch. It seems that I need to rearrange the code. I'll send an
> updated patch this week.

Attached is v7.

In this version of the patch, I heavily refactored the code of insertion
into a posting tuple. The bt_split logic is quite complex, so I omitted a
couple of optimizations. They are mentioned in TODO comments.

Now the algorithm is the following:

- If _bt_findinsertloc() found out that the tuple belongs to an existing
  posting tuple's TID interval, it sets the 'in_posting_offset' variable
  and passes it to _bt_insertonpg().

- If 'in_posting_offset' is valid and origtup is valid, merge our itup
  into origtup. It can result in one tuple, neworigtup, that must
  replace origtup; or two tuples, neworigtup and newrighttup, if the
  result exceeds BTMaxItemSize.

- If the new tuple(s) fit into the old page, we're lucky: call
  _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
  atomically replace oldtup with the new tuple(s) and generate an xlog
  record.

- In case a page split is needed, pass both tuples to _bt_split().
  _bt_findsplitloc() is now aware of the upcoming replacement of origtup
  with neworigtup, so it uses the correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch
passes all regression tests, so I think it's ready for review again. In
the meantime, I'll run more stress-tests.
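The merge-then-maybe-split step described above can be sketched in miniature. This is a hypothetical, self-contained model -- plain `(block, offset)` pairs stand in for ItemPointerData, and a simple item-count cap stands in for BTMaxItemSize -- not the patch's actual code:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Merge newtid into a sorted posting list in place.  If the merged list
 * would exceed maxitems, move the upper half into "right" -- the
 * neworigtup/newrighttup outcome described in the message.  Returns the
 * number of TIDs remaining in "posting"; *nright is 0 when no split was
 * needed.  "posting" must have room for *nposting + 1 entries.
 */
int
posting_merge(Tid *posting, int *nposting, int maxitems,
              Tid newtid, Tid *right, int *nright)
{
    int pos = 0;

    /* Find the insertion point that keeps TIDs in ascending order. */
    while (pos < *nposting && tid_cmp(posting[pos], newtid) < 0)
        pos++;
    memmove(&posting[pos + 1], &posting[pos],
            (size_t) (*nposting - pos) * sizeof(Tid));
    posting[pos] = newtid;
    (*nposting)++;

    *nright = 0;
    if (*nposting > maxitems)
    {
        int nleft = *nposting / 2;

        *nright = *nposting - nleft;
        memcpy(right, &posting[nleft], (size_t) *nright * sizeof(Tid));
        *nposting = nleft;
    }
    return *nposting;
}
```

The real code additionally has to reconstruct whole index tuples, size them in bytes rather than item counts, and WAL-log the replacement atomically; this only models the TID bookkeeping.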
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/*
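The invariant enforced by the amcheck hunk above can be shown as a tiny self-contained sketch. The types and names here are hypothetical stand-ins for ItemPointerData and the nbtree posting-list accessors, not PostgreSQL code:

```c
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Returns 1 when every TID in the posting list is strictly greater than
 * its predecessor -- the invariant the verify_nbtree.c check enforces --
 * and 0 as soon as a duplicate or out-of-order TID is found.
 */
int
posting_tids_in_order(const Tid *posting, int nposting)
{
    for (int i = 1; i < nposting; i++)
    {
        if (tid_cmp(posting[i], posting[i - 1]) <= 0)
            return 0;
    }
    return 1;
}
```

Note the `<= 0` comparison: duplicate TIDs within one posting list are treated as corruption too, since a heap TID can appear at most once per index.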
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
06.08.2019 4:28, Peter Geoghegan wrote:
> Attached is v5, which is based on your v4. The three main differences
> between this and v4 are:
>
> * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
> July 24 e-mail. We can always add something like this back during
> performance validation of the patch. Right now, having no
> BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
> certain important cases, which seems more important than the
> uncertain/speculative downside.

Fair enough. I think we can measure performance and make a decision once
the patch stabilizes.

> * We now have experimental support for unique indexes. This is broken
> out into its own patch.
>
> * We now handle LP_DEAD items in a special way within
> _bt_insertonpg_in_posting(). As you pointed out already, we do need to
> think about LP_DEAD items directly, rather than assuming that they
> cannot be on the page that _bt_insertonpg_in_posting() must process.
> More on that later.
>
> > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting
> > tuple can be larger. It may happen if keysize <= 4 byte. In this
> > situation original tuples must have been aligned to size 16 bytes
> > each, and resulting tuple is at most 24 bytes (6+2+4+6+6). So this
> > case is also safe.
>
> I still need to think about the exact details of alignment within
> _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> could be wrong.

Could you explain more about these cases? Now I don't understand the
problem.

> > The main reason why I decided to avoid applying compression to
> > unique indexes is the performance of microvacuum. It is not applied
> > to items inside a posting tuple. And I expect it to be important for
> > unique indexes, which ideally contain only a few live values.
>
> I found that the performance of my experimental patch with unique
> index was significantly worse. It looks like this is a bad idea, as
> you predicted, though we may still want to do deduplication/compression
> with NULL values in unique indexes. I did learn a few things from
> implementing unique index support, though.
>
> BTW, there is a subtle bug in how my unique index patch does
> WAL-logging -- see my comments within
> index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
> replication isn't used. I don't think that we're going to use this
> experimental patch at all, so I didn't bother fixing the bug.

Thank you for the patch.
Still, I'd suggest to leave it as a possible future improvement, so that
it doesn't distract us from the original feature.

> > if (ItemIdIsDead(itemId))
> >     continue;
> >
> > In the previous review Rafia asked about "some reason". Trying to
> > figure out if this situation is possible, I changed this line to
> > Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in
> > a performance test. Unfortunately, I was not able to reproduce it.
>
> I found it easy enough to see LP_DEAD items within
> _bt_insertonpg_in_posting() when running pgbench with the extra unique
> index patch. To give you a simple example of how this can happen,
> consider the comments about BTP_HAS_GARBAGE within
> _bt_delitems_vacuum(). That probably isn't the only way it can happen,
> either. ISTM that we need to be prepared for LP_DEAD items during
> deduplication, rather than trying to prevent deduplication from ever
> having to see an LP_DEAD item.

I added to v6 another related fix for _bt_compress_one_page(). Previous
code implicitly deleted DEAD items without calling
index_compute_xid_horizon_for_tuples(). New code checks whether DEAD
items exist on the page and removes them if any. Another possible
solution is to copy dead items as-is from the old page to the new one,
but I think it's good to remove dead tuples as fast as possible.

> v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing
> item if it's an LP_DEAD item that falls in the same TID range (that's
> _bt_compare()-wise "equal" to an existing tuple, which may or may not
> be a posting list tuple already). I haven't made this code do
> something like call index_compute_xid_horizon_for_tuples(), even
> though that's needed for correctness (i.e. this new code is currently
> broken in the same way that I mentioned unique index support is
> broken).

Is it possible that the DEAD tuple to delete was smaller than itup?

> I also added a nearby FIXME comment to _bt_insertonpg_in_posting() --
> I don't think that the code for splitting a posting list in two is
> currently crash-safe.

Good catch. It seems that I need to rearrange the code. I'll send an
updated patch this week.

> How do you feel about officially calling this deduplication, not
> compression? I think that it's a more accurate name for the technique.

I agree. Should I rename all related names of functions and variables in
the patch?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 9ac37503c71f7623413a2e406d81f5c9a4b02742
Author: Anastasia
Date:   Tue Aug 13 17:00:41 2019 +0300
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova wrote: > > * Included my own pageinspect hack to visualize the minimum TIDs in > > posting lists. It's broken out into a separate patch file. The code is > > very rough, but it might help someone else, so I thought I'd include > > it. > Cool, I think we should add it to the final patchset, > probably, as separate function by analogy with tuple_data_split. Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are: * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside. * We now have experimental support for unique indexes. This is broken out into its own patch. * We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later. > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple > can be > larger. It may happen if keysize <= 4 byte. > In this situation original tuples must have been aligned to size 16 > bytes each, > and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is > also safe. I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong. > I changed DEBUG message to ERROR in v4 and it passes all regression tests. > I doubt that it covers all corner cases, so I'll try to add more special > tests. It also passes my tests, FWIW. > Hmm, I can't get the problem. 
> In current implementation each posting tuple is smaller than BTMaxItemSize, > so no split can lead to having tuple of larger size. That sounds correct, then. > No, we don't need them both. I don't mind combining them into one macro. > Actually, we never needed BTreeTupleGetMinTID(), > since its functionality is covered by BTreeTupleGetHeapTID. I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID(). > The main reason why I decided to avoid applying compression to unique > indexes > is the performance of microvacuum. It is not applied to items inside a > posting > tuple. And I expect it to be important for unique indexes, which ideally > contain only a few live values. I found that the performance of my experimental patch with unique index was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though. BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug. > if (ItemIdIsDead(itemId)) > continue; > > In the previous review Rafia asked about "some reason". > Trying to figure out if this situation possible, I changed this line to > Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a > performance > test. Unfortunately, I was not able to reproduce it. I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). 
That probably isn't the only way it can happen, either. ISTM that we
need to be prepared for LP_DEAD items during deduplication, rather than
trying to prevent deduplication from ever having to see an LP_DEAD item.

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing
item if it's an LP_DEAD item that falls in the same TID range (that's
_bt_compare()-wise "equal" to an existing tuple, which may or may not be
a posting list tuple already). I haven't made this code do something
like call index_compute_xid_horizon_for_tuples(), even though that's
needed for correctness (i.e. this new code is currently broken in the
same way that I mentioned unique index support is broken).

I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I
don't think that the code for splitting a posting list in two is
currently crash-safe.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.

-- 
Peter Geoghegan

v5-0001-Compression-deduplication-in-nbtree.patch
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
24.07.2019 4:22, Peter Geoghegan wrote:
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3. In general, my goal for the revision was to make sure
> that all of my old tests from the v12 work passed, and to make sure
> that amcheck can detect almost any possible problem. I tested the
> amcheck changes by corrupting random state in a test index using
> pg_hexedit, then making sure that amcheck actually complained in each
> case. I also fixed one or two bugs in passing, including the bug that
> caused an assertion failure in _bt_truncate(). That was down to a
> subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I
> didn't make that many changes to your v2. There are probably some
> things about the patch that I still don't understand, or things that I
> have misunderstood.

Thank you for this review and fixes.

> * Changed the custom binary search code within _bt_compare_posting()
> to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
> of any reason not to do it that way?

It's ok to update it. There was no particular reason, just my habit.

> * Added quite a few "FIXME"/"XXX" comments at various points, to
> indicate where I have general concerns that need more discussion.

+ * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate memory,

If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting();
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples. Fixed in version 4.

> * Included my own pageinspect hack to visualize the minimum TIDs in
> posting lists. It's broken out into a separate patch file. The code is
> very rough, but it might help someone else, so I thought I'd include
> it.

Cool, I think we should add it to the final patchset, probably as a
separate function, by analogy with tuple_data_split.
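For readers unfamiliar with the `_bt_binsrch()` style being referred to, here is a hypothetical, self-contained sketch of a lower-bound binary search over a sorted posting list. The `Tid` type and function names are stand-ins, not the patch's `_bt_compare_posting()`:

```c
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Lower-bound binary search over a sorted posting list, in the low/high
 * style of _bt_binsrch(): returns the index of the first TID >= key,
 * or nposting when every TID is smaller than key.
 */
int
posting_binsrch(const Tid *posting, int nposting, Tid key)
{
    int low = 0;
    int high = nposting;

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (tid_cmp(posting[mid], key) < 0)
            low = mid + 1;      /* key is strictly above the midpoint */
        else
            high = mid;         /* midpoint is a candidate answer */
    }
    return low;
}
```

The half-open `[low, high)` invariant is what makes this shape easy to verify, which is presumably why matching the existing `_bt_binsrch()` convention was preferred over a custom loop.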
> I also have some new concerns about the code in the patch that I will
> point out now (though only as something to think about a solution on
> -- I am unsure myself):
>
> * It's a bad sign that compression involves calls to PageAddItem()
> that are allowed to fail (we just give up on compression when that
> happens). For one thing, all existing calls to PageAddItem() in
> Postgres are never expected to fail -- if they do fail we get a "can't
> happen" error that suggests corruption. It was a good idea to take
> this approach to get the patch to work, and to prove the general idea,
> but we now need to fully work out all the details about the use of
> space. This includes complicated new questions around how alignment is
> supposed to work.

The main reason to implement this gentle error handling is that
deduplication could cause storage overhead, which leads to running out
of space on the page.

First of all, it is a legacy of the previous versions, where
BTreeFormPostingTuple was not able to form a non-posting tuple even when
the number of posting items is 1.

Another case that was on my mind is the situation where we have 2
tuples:

    t_tid | t_info | key
  + t_tid | t_info | key

and the compressed result is:

    t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
tuple can be larger. It may happen if keysize <= 4 bytes. In this
situation the original tuples must each have been aligned to 16 bytes,
and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.

I changed the DEBUG message to ERROR in v4 and it passes all regression
tests. I doubt that it covers all corner cases, so I'll try to add more
special tests.
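The size arithmetic above can be worked through explicitly. This sketch assumes the figures from the message -- a 6-byte t_tid, 2-byte t_info, and an 8-byte MAXALIGN quantum; the helper names are hypothetical, not nbtree's:

```c
#include <stddef.h>

/* Round a length up to the next 8-byte boundary, like MAXALIGN() with
 * MAXIMUM_ALIGNOF == 8. */
size_t
maxalign(size_t len)
{
    return (len + 7) & ~(size_t) 7;
}

/* Two separate index tuples with the same key, each aligned on its own:
 * 2 * MAXALIGN(t_tid + t_info + key). */
size_t
plain_pair_size(size_t keysize)
{
    return 2 * maxalign(6 + 2 + keysize);
}

/* One posting tuple holding both heap TIDs, aligned once:
 * MAXALIGN(t_tid + t_info + key + 2 * t_tid). */
size_t
posting_pair_size(size_t keysize)
{
    return maxalign(6 + 2 + keysize + 2 * 6);
}
```

With a 4-byte key each plain tuple occupies MAXALIGN(12) = 16 bytes, so the pair costs 32 bytes, while the posting tuple costs MAXALIGN(24) = 24 bytes: larger than either single tuple (the overhead being discussed), but still smaller than the pair it replaces, which is why the message concludes the case is safe.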
> Alignment in nbtree is already complicated today -- you're supposed to
> MAXALIGN() everything in nbtree, so that the MAXALIGN() within
> bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
> length (note that heapam can have tuples whose lp_len isn't aligned,
> so nbtree could do it differently if it proved useful). Code within
> nbtsplitloc.c fully understands the space requirements for the
> bufpage.c routines, and is very careful about it. (The bufpage.c
> details are supposed to be totally hidden from code like
> nbtsplitloc.c, but I guess that that ideal isn't quite possible in
> reality. Code comments don't really explain the situation today.)
>
> I'm not sure what it would look like for this patch to be as precise
> about free space as nbtsplitloc.c already is, even though that seems
> desirable (I just know that it would mean you would trust
> PageAddItem() to work in all cases). The patch is different to what we
> already have today in that it tries to add *less than* a single
> MAXALIGN() quantum at a time in some places (when a posting list needs
> to grow by one item). The devil is in the details.
>
> * As you know, the current approach to WAL logging is very
> inefficient. I think that this is subtly related to the last item
> (i.e. the one about alignment).
>
> I have done basic performance tests using unlogged tables. The patch
> seems to either make big INSERT
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan wrote:
>
> On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan wrote:
> > There seems to be a kind of "synergy" between the nbtsplitloc.c
> > handling of pages that have lots of duplicates and posting list
> > compression. It seems as if the former mechanism "sets up the bowling
> > pins", while the latter mechanism "knocks them down", which is really
> > cool. We should try to gain a better understanding of how that works,
> > because it's possible that it could be even more effective in some
> > cases.
>
> I found another important way in which this synergy can fail to take
> place, which I can fix.
>
> By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
> from my test suite become much smaller, while most are not affected.
> These indexes were not helped too much by the patch before. For
> example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
> of duplicates of a single value (that's how it appears after an
> initial TPC-E bulk load), as are a couple of other TPC-E indexes.
> TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
> idx_lineitem_orderkey index becomes ~15% smaller.
>
> I believe that this happened because rightmost page splits were an
> inefficient case for compression. But rightmost page split heavy
> indexes with lots of duplicates are not that uncommon. Think of any
> index with many NULL values, for example.
>
> I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
> not sure what the idea is behind it. My sense is that we're likely to
> benefit by delaying page splits, no matter what. Though I am still
> looking at it purely from a space utilization point of view, at least
> for now.

Minor comment fix, pointes-->pointer, plus, are we really doing the
half, or is it just splitting into two.

+/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */

Some remains of 'he'.

+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */

Everything reads just fine without 'us'.

+/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */

-- 
Regards,
Rafia Sabih
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan wrote: > There seems to be a kind of "synergy" between the nbtsplitloc.c > handling of pages that have lots of duplicates and posting list > compression. It seems as if the former mechanism "sets up the bowling > pins", while the latter mechanism "knocks them down", which is really > cool. We should try to gain a better understanding of how that works, > because it's possible that it could be even more effective in some > cases. I found another important way in which this synergy can fail to take place, which I can fix. By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes from my test suite become much smaller, while most are not affected. These indexes were not helped too much by the patch before. For example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full of duplicates of a single value (that's how it appears after an initial TPC-E bulk load), as are a couple of other TPC-E indexes. TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its idx_lineitem_orderkey index becomes ~15% smaller. I believe that this happened because rightmost page splits were an inefficient case for compression. But rightmost page split heavy indexes with lots of duplicates are not that uncommon. Think of any index with many NULL values, for example. I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm not sure what the idea is behind it. My sense is that we're likely to benefit by delaying page splits, no matter what. Though I am still looking at it purely from a space utilization point of view, at least for now. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan wrote: > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. Remember that index that I said was 5.5x smaller with the patch applied, following retail insertions (a single big INSERT ... SELECT ...)? Well, it's 6.5x faster with this small additional patch applied on top of the v3 I posted yesterday. Many of the indexes in my test suite are about ~20% smaller __in addition to__ very big size reductions. Some are even ~30% smaller than they were with v3 of the patch. For example, the fair use implementation of TPC-H that my test data comes from has an index on the "orders" o_orderdate column, named idx_orders_orderdate, which is made ~30% smaller by the addition of this simple patch (once again, this is following a single big INSERT ... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller than it is with master/Postgres 12, in case you were wondering. This new patch teaches nbtsplitloc.c to subtract posting list overhead when sizing the new high key for the left half of a candidate split point, since we know for sure that _bt_truncate() will at least manage to truncate away that much from the new high key, even in the worst case. Since posting lists are often very large, this can make a big difference. This is actually just a bugfix, not a new idea -- I merely made nbtsplitloc.c understand how truncation works with posting lists. There seems to be a kind of "synergy" between the nbtsplitloc.c handling of pages that have lots of duplicates and posting list compression. It seems as if the former mechanism "sets up the bowling pins", while the latter mechanism "knocks them down", which is really cool. We should try to gain a better understanding of how that works, because it's possible that it could be even more effective in some cases. -- Peter Geoghegan 0003-Account-for-posting-list-overhead-during-splits.patch Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan wrote: > Hmm. So, the attached test case fails amcheck verification for me with > the latest version of the patch: Attached is a revised version of your v2 that fixes this issue -- I'll call this v3. In general, my goal for the revision was to make sure that all of my old tests from the v12 work passed, and to make sure that amcheck can detect almost any possible problem. I tested the amcheck changes by corrupting random state in a test index using pg_hexedit, then making sure that amcheck actually complained in each case. I also fixed one or two bugs in passing, including the bug that caused an assertion failure in _bt_truncate(). That was down to a subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't make that many changes to your v2. There are probably some things about the patch that I still don't understand, or things that I have misunderstood. Other changes: * We now support system catalog indexes. There is no reason not to support them. * Removed unnecessary code from _bt_buildadd(). * Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which I used to fix that bug I mentioned. I agree that we should keep the DEBUG4 traces around until the overall design settles down. I found the ones that you added helpful, too. * Added quite a few new assertions. For example, we need to still support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot let them use compression -- new defensive assertions were added to make this break loudly. * Changed the custom binary search code within _bt_compare_posting() to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know of any reason not to do it that way? * Added quite a few "FIXME"/"XXX" comments at various points, to indicate where I have general concerns that need more discussion. * Included my own pageinspect hack to visualize the minimum TIDs in posting lists. It's broken out into a separate patch file. 
The code is very rough, but it might help someone else, so I thought I'd include it. I also have some new concerns about the code in the patch that I will point out now (though only as something to think about a solution on -- I am unsure myself): * It's a bad sign that compression involves calls to PageAddItem() that are allowed to fail (we just give up on compression when that happens). For one thing, all existing calls to PageAddItem() in Postgres are never expected to fail -- if they do fail we get a "can't happen" error that suggests corruption. It was a good idea to take this approach to get the patch to work, and to prove the general idea, but we now need to fully work out all the details about the use of space. This includes complicated new questions around how alignment is supposed to work. Alignment in nbtree is already complicated today -- you're supposed to MAXALIGN() everything in nbtree, so that the MAXALIGN() within bufpage.c routines cannot be different to the lp_len/IndexTupleSize() length (note that heapam can have tuples whose lp_len isn't aligned, so nbtree could do it differently if it proved useful). Code within nbtsplitloc.c fully understands the space requirements for the bufpage.c routines, and is very careful about it. (The bufpage.c details are supposed to be totally hidden from code like nbtsplitloc.c, but I guess that that ideal isn't quite possible in reality. Code comments don't really explain the situation today.) I'm not sure what it would look like for this patch to be as precise about free space as nbtsplitloc.c already is, even though that seems desirable (I just know that it would mean you would trust PageAddItem() to work in all cases). The patch is different to what we already have today in that it tries to add *less than* a single MAXALIGN() quantum at a time in some places (when a posting list needs to grow by one item). The devil is in the details. * As you know, the current approach to WAL logging is very inefficient. 
It's okay for now, but we'll need a fine-grained approach for the patch to be commitable. I think that this is subtly related to the last item (i.e. the one about alignment). I have done basic performance tests using unlogged tables. The patch seems to either make big INSERT queries run as fast or faster than before when inserting into unlogged tables, which is a very good start. * Since we can now split a posting list in two, we may also have to reconsider BTMaxItemSize, or some similar mechanism that worries about extreme cases where it becomes impossible to split because even two pages are not enough to fit everything. Think of what happens when there is a tuple with a single large datum, that gets split in two (the tuple is split, not the page), with each half receiving its own copy of the datum. I haven't proven to myself that this is broken, but that may just be because I haven't spent any time on it. OTOH, maybe you already have it right, in which case it seems like it should be
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan wrote: > On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova > wrote: > > Patch 0002 (must be applied on top of 0001) implements preserving of > > correct TID order > > inside posting list when inserting new tuples. > > This version passes all regression tests including amcheck test. > > I also used following script to test insertion into the posting list: > > Nice! Hmm. So, the attached test case fails amcheck verification for me with the latest version of the patch: $ psql -f amcheck-compress-test.sql DROP TABLE CREATE TABLE CREATE INDEX CREATE EXTENSION INSERT 0 2001 psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound invariant violated for index "idx_desc_nl" DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438. Note that this test only has an INSERT statement. You have to use bt_index_parent_check() to see the problem -- bt_index_check() will not detect the problem. -- Peter Geoghegan amcheck-compress-test.sql Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova wrote:
> Patch 0002 (must be applied on top of 0001) implements preserving of
> correct TID order inside posting list when inserting new tuples.
> This version passes all regression tests including amcheck test.
> I also used following script to test insertion into the posting list:

Nice!

> I suppose it is not the final version of the patch yet,
> so I left some debug messages and TODO comments to ease review.

I'm fine with leaving them in. I have sometimes distributed a separate patch with debug messages, but now that I think about it, that probably wasn't a good use of time. You will probably want to remove at least some of the debug messages during performance testing. I'm thinking of code that appears in very tight inner loops, such as the _bt_compare() code.

> Please, in your review, pay particular attention to usage of
> BTreeTupleGetHeapTID. For posting tuples it returns the first tid
> from posting list like BTreeTupleGetMinTID, but maybe some callers
> are not ready for that and want BTreeTupleGetMaxTID instead.
> Incorrect usage of these macros may cause some subtle bugs,
> which are probably not covered by tests. So, please double-check it.

One testing strategy that I plan to use for the patch is to deliberately corrupt a compressed index in a subtle way using pg_hexedit, and then see if amcheck detects the problem. For example, I may swap the order of two TIDs in the middle of a posting list, which is something that is unlikely to produce wrong answers to queries, and won't even be detected by the "heapallindexed" check, but is still wrong. If we can detect very subtle, adversarial corruption like this, then we can detect any real-world problem. Once we have confidence in amcheck's ability to detect problems with posting lists in general, we can use it in many different contexts without much thought.

For example, we'll probably need to do long running benchmarks to validate the performance of the patch. It's easy to add amcheck testing at the end of each run. Every benchmark is now also a correctness/stress test, for free.

> Next week I'm going to check performance and try to find specific
> scenarios where this feature can lead to degradation and measure it,
> to understand if we need to make this deduplication optional.

Sounds good, though I think it might be a bit too early to decide whether or not it needs to be enabled by default. For one thing, the approach to WAL-logging within _bt_compress_one_page() is probably fairly inefficient, which may be a problem for certain workloads. It's okay to leave it that way for now, because it is not relevant to the core design of the patch. I'm sure that _bt_compress_one_page() can be carefully optimized when the time comes.

My current focus is not on the raw performance itself. For now, I am focussed on making sure that the compression works well, and that the resulting indexes "look nice" in general.

FWIW, the first few versions of my v12 work on nbtree didn't actually make *anything* go faster. It took a couple of months to fix the more important regressions, and a few more months to fix all of them. I think that the work on this patch may develop in a similar way. I am willing to accept regressions in the unoptimized code during development because it seems likely that you have the right idea about the data structure itself, which is the one thing that I *really* care about. Once you get that right, the remaining problems are very likely to either be fixable with further work on optimizing specific code, or a price that users will mostly be happy to pay to get the benefits.

--
Peter Geoghegan
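The TID-order invariant that the corruption test above targets can be modeled in a few lines. This is a toy Python sketch, not PostgreSQL code; representing a heap TID as a (block, offset) pair is an assumption made only for illustration:

```python
# Toy model of the amcheck invariant discussed above: heap TIDs inside
# a posting list must appear in strictly increasing order. Swapping two
# middle TIDs (the hex-editor corruption described in the message) is
# exactly what this check catches.

def tid_key(tid):
    """Order heap TIDs as (block number, offset number), like ItemPointerCompare."""
    block, offset = tid
    return (block, offset)

def posting_list_is_valid(tids):
    """Return True iff every TID is strictly greater than its predecessor."""
    return all(tid_key(a) < tid_key(b) for a, b in zip(tids, tids[1:]))

# A well-formed posting list for one duplicate value:
good = [(2, 1), (2, 7), (3, 4)]
# The same list with two TIDs swapped -- subtle corruption that would
# not produce wrong query answers, but is still wrong:
bad = [(2, 1), (3, 4), (2, 7)]

assert posting_list_is_valid(good)
assert not posting_list_is_valid(bad)
```

The point of the sketch is only that the invariant is cheap to verify once the posting-list representation is known, which is why such corruption is a good adversarial test for amcheck.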
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
17.07.2019 19:36, Anastasia Lubennikova:
> There is one major issue left - preserving TID order in posting lists.
> For a start, I added a sort into BTreeFormPostingTuple function.
> It turned out to be not very helpful, because we cannot check this
> invariant lazily.
> Now I work on patching _bt_binsrch_insert() and _bt_insertonpg()
> to implement insertion into the middle of the posting list.
> I'll send a new version this week.

Patch 0002 (must be applied on top of 0001) implements preserving of correct TID order inside the posting list when inserting new tuples. This version passes all regression tests, including the amcheck test.

I also used the following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 < 500;
vacuum tbl;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in debug messages.

I suppose it is not the final version of the patch yet, so I left some debug messages and TODO comments to ease review.

Please, in your review, pay particular attention to the usage of BTreeTupleGetHeapTID. For posting tuples it returns the first TID from the posting list, like BTreeTupleGetMinTID, but maybe some callers are not ready for that and want BTreeTupleGetMaxTID instead. Incorrect usage of these macros may cause some subtle bugs, which are probably not covered by tests. So, please double-check it.

Next week I'm going to check performance and try to find specific scenarios where this feature can lead to degradation and measure it, to understand if we need to make this deduplication optional.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;

-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+				int			i;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}

 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);

 /*
  * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 		}

 		/*
+		 * Before considering moving right, try to compress the page
+		 */
+		_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+		/*
 		 * Nope, so check conditions (b) and (c) enumerated above
 		 *
 		 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page,
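The _bt_binsrch_insert()/_bt_insertonpg() work this patch describes has to splice a new duplicate's heap TID into the middle of an existing posting list whenever the new TID falls inside the list's min/max range. A toy Python sketch of that splice, under the same illustrative (block, offset) TID representation (none of these names are PostgreSQL's):

```python
# Sketch of inserting a new duplicate's TID into an existing posting
# list while preserving TID order. bisect works directly on tuples
# because Python compares them lexicographically, i.e. block number
# first, then offset -- the same order as heap TIDs.

import bisect

def posting_list_insert(tids, new_tid):
    """Return a new posting list with new_tid spliced in at its TID-order position."""
    pos = bisect.bisect_left(tids, new_tid)
    return tids[:pos] + [new_tid] + tids[pos:]

posting = [(2, 1), (2, 7), (3, 4)]
# (2, 5) falls strictly inside the list's min/max TID range, so it
# must land in the middle, not be appended:
assert posting_list_insert(posting, (2, 5)) == [(2, 1), (2, 5), (2, 7), (3, 4)]
```

The real code has the additional complication that the expanded tuple may no longer fit on the page, which is where the page-compression and split logic above comes in.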
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
11.07.2019 21:19, Peter Geoghegan wrote:
> On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih wrote:
> > Hi,

Peter, Rafia, thanks for the review. New version is attached.

> > + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> > IndexTupleSize %zu free %zu",
> > + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> > +
> > and other such DEBUG4 statements are meant to be removed, right...?
>
> I hope so too.

Yes, these messages are only for debugging. I haven't deleted them, since this is still work in progress and it's handy to be able to print inner details. Maybe I should also write a patch for pageinspect.

> > /*
> >  * If we have only 10 uncompressed items on the full page, it probably
> >  * won't worth to compress them.
> >  */
> > if (maxoff - n_posting_on_page < 10)
> >     return;
> >
> > Is this a magic number...?
>
> I think that this should be a constant or something.

Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the value. When the code stabilizes, we can benchmark it and find the optimal value.

> > /*
> >  * We do not expect to meet any DEAD items, since this function is
> >  * called right after _bt_vacuum_one_page(). If for some reason we
> >  * found dead item, don't compress it, to allow upcoming microvacuum
> >  * or vacuum clean it up.
> >  */
> > if (ItemIdIsDead(itemId))
> >     continue;
> >
> > This makes me wonder about those 'some' reasons.
>
> I think that this is just defensive. Note that _bt_vacuum_one_page()
> is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
> is set for the page.

You are right, right now it is impossible to meet dead items in this function. Though that could change in the future if, for example, _bt_vacuum_one_page() starts to behave lazily. So this is just a sanity check. Maybe it's worth moving it into an Assert.

> > Caller is responsible for checking BTreeTupleIsPosting to ensure that
> > + * he will get what he expects
> >
> > This can be re-framed to make the caller more gender neutral.
>
> Agreed. I also don't like anthropomorphizing code like this.

Fixed.

> > Other than that, I am curious about the plans for its backward compatibility.
>
> Me too. There is something about a new version 5 in comments in
> nbtree.h, but the version number isn't changed. I think that we may be
> able to get away with not increasing the B-Tree version from 4 to 5,
> actually. Deduplication is performed lazily when it looks like we
> might have to split the page, so there isn't any expectation that
> tuples will either be compressed or uncompressed in any context.

The current implementation is backward compatible. To distinguish posting tuples, it only adds one new flag combination. This combination was never possible before. The comment about version 5 has been deleted.

I also added a patch for amcheck.

There is one major issue left - preserving TID order in posting lists. For a start, I added a sort into the BTreeFormPostingTuple function. It turned out to be not very helpful, because we cannot check this invariant lazily. Now I am working on patching _bt_binsrch_insert() and _bt_insertonpg() to implement insertion into the middle of the posting list. I'll send a new version this week.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;

-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+				int			i;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}

 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan wrote:
> > I think unique indexes may benefit from deduplication not only because
> > of NULL values. Non-HOT updates produce duplicates of non-NULL values
> > in unique indexes. And those duplicates can take significant space.
>
> I agree that we should definitely have an open mind about unique
> indexes, even with non-NULL values. If we can prevent a page split by
> deduplicating the contents of a unique index page, then we'll probably
> win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting within _bt_check_unique() is generally more important than the more complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even though the _bt_check_unique() thing can only be used with unique indexes. Also, I have often thought that we don't do enough to take advantage of the special characteristics of unique indexes -- they really are quite different. I believe that other database systems do this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one tuples of the same value that are visible to any possible MVCC snapshot. Within the index AM, there is little difference between an UPDATE by a transaction and a DELETE + INSERT of the same value by a transaction. If there are 3 or 5 duplicates within a unique index, then there is a strong chance that VACUUM could reclaim some of them, given the chance. It is worth going to a little effort to find out.

In a traditional serial/bigserial primary key, the key space that is typically "owned" by the left half of a rightmost page split describes a range of about ~366 items, with few or no gaps for other values that didn't exist at the time of the split (i.e. the two pivot tuples on each side cover a range that is equal to the number of items itself). If the page ever splits again, the chances of it being due to non-HOT updates is perhaps 100%. Maybe VACUUM just didn't get around to the index in time, or maybe there is a long running xact, or whatever. If we can delay page splits in indexes like this, then we could easily prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes will probably always be LP_DEAD bits set within _bt_check_unique(), because it costs so little -- same as today. We could also add a second line of defense: deduplication -- same as with non-unique indexes with the patch. But we can even add a third line of defense on top of those two: more aggressive reclaiming of posting list space, by going to the heap to check the visibility status of earlier posting list entries. We can do this optimistically when there is no LP_DEAD bit set, based on heuristics.

The high level principle here is that we can justify going to a small amount of extra effort for the chance to avoid a page split, and maybe even more than a small amount. Our chances of reversing the split by merging pages later on are almost zero. The two halves of the split will probably each get dirtied again and again in the future if we cannot avoid it, plus we have to dirty the parent page, and the old sibling page (to update its left link). In general, a page split is already really expensive.

We could do something like amortize the cost of accessing the heap a second time for tuples that we won't have considered setting the LP_DEAD bit on within _bt_check_unique() by trying the *same* heap page a *second* time where possible (distinct values are likely to be nearby on the same page). I think that an approach like this could work quite well for many workloads. You only pay a cost (visiting the heap an extra time) when it looks like you'll get a benefit (not splitting the page).

As you know, Andres already changed nbtree to get an XID for conflict purposes on the primary by visiting the heap a second time (see commit 558a9165e08), when we need to actually reclaim LP_DEAD space. I anticipated that we could extend this to do more clever/eager/lazy cleanup of additional items before that went in, which is a closely related idea. See:

https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7

I know that this is a bit hand-wavy; the details certainly need to be worked out. However, it is not so different to the "ghost bit" design that other systems use with their non-unique indexes (though this idea applies specifically to unique indexes in our case). The main difference is that we're going to the heap rather than to UNDO, because that's where we store our visibility information. That doesn't seem like such a big difference -- we are also reasonably confident that we'll find that the TID is dead, even without LP_DEAD bits being set, because we only do the extra stuff with unique indexes. And, we do it lazily.

--
Peter Geoghegan
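The amortization idea in the message above -- visit each heap page at most once when probing the visibility of posting-list entries -- boils down to grouping TIDs by heap block before going to the heap. A toy Python sketch of just that grouping step; the structures and names are illustrative, not PostgreSQL's:

```python
# Group heap TIDs by heap block number, so that a hypothetical
# visibility-probing pass touches each heap page at most once.
# Duplicates in a unique index are likely to be near each other on the
# same heap page, which is what makes this grouping pay off.

from collections import defaultdict

def group_tids_by_heap_block(tids):
    """Map each heap block number to the list of offsets to probe on that page."""
    by_block = defaultdict(list)
    for block, offset in tids:
        by_block[block].append(offset)
    return dict(by_block)

tids = [(7, 1), (7, 4), (8, 2), (7, 9)]
# Two heap page visits instead of four:
assert group_tids_by_heap_block(tids) == {7: [1, 4, 9], 8: [2]}
```

The real heuristic would also have to decide *when* the extra heap visits are worth it (i.e. only when a page split otherwise looks imminent), which this sketch deliberately leaves out.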
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih wrote:
> I tried this patch and found the improvements impressive. However,
> when I tried with multi-column indexes it wasn't giving any
> improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column indexes, provided the entire set of values is duplicated -- not just a prefix. Prefix compression is possible, but it's more complicated. It seems to generally require the DBA to specify a prefix length, expressed as a number of prefix columns.

> I am surprised to find that such a patch is on radar since quite some
> years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to have made the general approach a lot more effective. Compression is performed lazily, not eagerly, which seems to work a lot better.

> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> IndexTupleSize %zu free %zu",
> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> +
> and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

> /*
>  * If we have only 10 uncompressed items on the full page, it probably
>  * won't worth to compress them.
>  */
> if (maxoff - n_posting_on_page < 10)
>     return;
>
> Is this a magic number...?

I think that this should be a constant or something.

> /*
>  * We do not expect to meet any DEAD items, since this function is
>  * called right after _bt_vacuum_one_page(). If for some reason we
>  * found dead item, don't compress it, to allow upcoming microvacuum
>  * or vacuum clean it up.
>  */
> if (ItemIdIsDead(itemId))
>     continue;
>
> This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page() is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag is set for the page.

> Caller is responsible for checking BTreeTupleIsPosting to ensure that
> + * he will get what he expects
>
> This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

> Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in nbtree.h, but the version number isn't changed. I think that we may be able to get away with not increasing the B-Tree version from 4 to 5, actually. Deduplication is performed lazily when it looks like we might have to split the page, so there isn't any expectation that tuples will either be compressed or uncompressed in any context.

--
Peter Geoghegan
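The "full duplicates only" rule explained above -- merge index tuples only when *all* key columns are equal, never just a prefix -- can be sketched in a few lines. Toy Python model with (key_columns, tid) pairs; the representation is illustrative, not the patch's on-disk format:

```python
# Model of the deduplication rule: adjacent index tuples merge into one
# posting tuple only when their whole key (every column) is equal.
# Input must already be in index order, as it would be on a B-tree page.

from itertools import groupby

def deduplicate(sorted_tuples):
    """Merge runs of identical keys into (key, posting_list) pairs."""
    return [(key, [tid for _, tid in grp])
            for key, grp in groupby(sorted_tuples, key=lambda t: t[0])]

tuples = [(("a", 1), (2, 1)),
          (("a", 1), (2, 2)),
          (("a", 2), (2, 3))]
# The two ("a", 1) rows merge; ("a", 2) shares only a prefix with them,
# so it stays a separate tuple -- no prefix compression:
assert deduplicate(tuples) == [(("a", 1), [(2, 1), (2, 2)]),
                               (("a", 2), [(2, 3)])]
```

This is also why a multi-column index with distinct trailing columns sees no benefit, matching Rafia's observation.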
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov wrote:
> BTW, I think deduplication could cause some small performance
> degradation in some particular cases, because page-level locks became
> more coarse grained once pages hold more tuples. However, this
> doesn't seem like something we should much care about. Providing an
> option to turn deduplication off looks enough for me.

There was an issue like this with my v12 work on nbtree, with the TPC-C indexes. They were always ~40% smaller, but there was a regression when TPC-C was used with a small number of warehouses, when the data could easily fit in memory (which is not allowed by the TPC-C spec, in effect). TPC-C is very write-heavy, which combined with everything else causes this problem. I wasn't doing anything too fancy there -- the regression seemed to happen simply because the index was smaller, not because of the overhead of doing page splits differently or anything like that (there were far fewer splits).

I expect there to be some regression for workloads like this. I am willing to accept that provided it's not too noticeable, and doesn't have an impact on other workloads. I am optimistic about it.

> Regarding bitmap indexes itself, I think our BRIN could provide them.
> However, it would be useful to have opclass parameters to make them
> tunable.

I thought that we might implement them in nbtree myself. But we don't need to decide now.

--
Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov wrote:
> Could you please elaborate more on preserving the logical contents? I
> can understand it as following: "B-Tree should have the same structure
> and invariants as if each TID in posting list be a separate tuple".

That's exactly what I mean.

> So, if we imagine each TID to become separate tuple it would be the
> same B-tree, which just can magically sometimes store more tuples in
> page. Is my understanding correct?

Yes.

> But outside code will still notice changes as soon as it directly
> accesses B-tree pages (like contrib/amcheck does). Do you mean we
> need an API for accessing logical B-tree tuples or something?

Well, contrib/amcheck isn't really outside code. But amcheck's "rootdescend" option will still need to be able to supply a heap TID as just another column, and get back zero or one logical tuples from the index. This is important because retail index tuple deletion needs to be able to think about logical tuples in the same way. I also think that it might be useful for the planner to expect to get back duplicates in heap TID order in the future (or in reverse order in the case of a backwards scan). Query execution and VACUUM code outside of nbtree should be able to pretend that there is no such thing as a posting list.

The main thing that the patch is missing that is needed to "preserve logical contents" is the ability to update/expand an *existing* posting list due to a retail insertion of a new duplicate that happens to be within the range of that existing posting list. This will usually be a non-HOT update that doesn't change the value for the row in the index -- that must change the posting list, even when there is available space on the page without recompressing. We must still occasionally be eager, like GIN always is, though in practice we'll almost always add to posting lists in a lazy fashion, when it looks like we might have to split the page -- the lazy approach seems to perform best.

> I think in order to deduplicate "equal but distinct" values we need at
> least to give up with index only scans. Because we have no
> restriction that equal according to B-tree opclass values are same for
> other operations and/or user output.

We can either prevent index-only scans in the case of affected indexes, or prevent compression, or give the user a choice. I'm not too worried about how that will work for users just yet.

> Do I understand correctly that current patch may produce posting lists
> of the same value with overlapping ranges of TIDs? If so, it's
> definitely wrong.

Yes, it can, since the assertion fails. It looks like the assertion itself was changed to match what I expect, so I assume that this bug will be fixed in the next version of the patch. It fails with a fairly big index on text for me.

> > * Maybe we could do compression with unique indexes when inserting
> > values with NULLs? Note that we now treat an insertion of a tuple with
> > NULLs into a unique index as if it wasn't even a unique index -- see
> > the "checkingunique" optimization at the beginning of _bt_doinsert().
> > Having many NULL values in a unique index is probably fairly common.
>
> I think unique indexes may benefit from deduplication not only because
> of NULL values. Non-HOT updates produce duplicates of non-NULL values
> in unique indexes. And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique indexes, even with non-NULL values. If we can prevent a page split by deduplicating the contents of a unique index page, then we'll probably win. Why not try? This will need to be tested.

--
Peter Geoghegan
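"Preserving the logical contents" as discussed above means code outside nbtree can treat a posting tuple with N TIDs as N separate logical tuples. A toy Python sketch of that expansion; the (key, [tids]) physical representation is an assumption for illustration only:

```python
# Expand "physical" page items into the "logical" tuples that query
# execution, VACUUM, and amcheck's rootdescend verification should be
# able to reason about: one (key, tid) pair per posting-list entry.

def expand_logical_tuples(physical_items):
    """Flatten (key, [tids]) physical items into (key, tid) logical tuples."""
    return [(key, tid)
            for key, tids in physical_items
            for tid in tids]

page = [("apple", [(1, 1), (1, 2)]),
        ("pear", [(2, 5)])]
assert expand_logical_tuples(page) == [("apple", (1, 1)),
                                       ("apple", (1, 2)),
                                       ("pear", (2, 5))]
```

Under this view, a rootdescend-style lookup that supplies both a key and a heap TID should find exactly zero or one logical tuples, regardless of how the physical posting lists are laid out.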
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian wrote:
> Wow, I never thought of that. The only things I know we lock until
> transaction end are rows we update (against concurrent updates), and
> additions to unique indexes. By definition, indexes with many
> duplicates are not unique, so that doesn't apply.

Right. Another advantage of their approach is that you can make queries like this work:

UPDATE tab SET unique_col = unique_col + 1

This will not throw a unique violation error on most/all other DB systems when the updated column (in this case "unique_col") has a unique constraint/is the primary key. This behavior is actually required by the SQL standard. An SQL statement is supposed to be all-or-nothing, which Postgres doesn't quite manage here.

The section "6.6 Interdependencies of Transactional Storage" from the paper "Architecture of a Database System" provides additional background information (I should have suggested reading both 6.6 and 6.7 together).

--
Peter Geoghegan
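The statement-atomicity point above can be made concrete with a toy model: checking uniqueness row by row trips over a transient duplicate that the statement's final state never contains. This is a simplified Python illustration, not how any real database implements constraint checking:

```python
# "UPDATE tab SET unique_col = unique_col + 1" on values {1, 2, 3}:
# a per-row uniqueness check fails on the transient duplicate "2",
# while checking only the statement's end state succeeds, which is the
# all-or-nothing behavior the SQL standard asks for.

def update_increment(rows, deferred):
    """Increment every value; enforce uniqueness per row or at statement end."""
    seen = set(rows)
    out = []
    for value in rows:
        seen.discard(value)          # the old value is being replaced
        new = value + 1
        if not deferred and new in seen:
            raise ValueError("duplicate key value: %d" % new)
        seen.add(new)
        out.append(new)
    if deferred and len(set(out)) != len(out):
        raise ValueError("duplicate key value at end of statement")
    return out

# End-of-statement check: the final state {2, 3, 4} is unique, so it works.
assert update_increment([1, 2, 3], deferred=True) == [2, 3, 4]
# Per-row check: incrementing 1 to 2 collides with the not-yet-updated 2.
try:
    update_increment([1, 2, 3], deferred=False)
except ValueError as err:
    print(err)  # duplicate key value: 2
```

In Postgres terms, a deferrable unique constraint gets the standard's behavior at some extra cost, while the default immediate check behaves like the per-row case.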
Fwd: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan wrote:
> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with

+1

I tried this patch and found the improvements impressive. However, when I tried it with multi-column indexes it wasn't giving any improvement -- is that a known limitation of the patch?

I am surprised to find that such a patch has been on the radar for quite some years now and is not yet committed.

Going through the patch, here are a few comments from me,

/* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+

and other such DEBUG4 statements are meant to be removed, right...? Just because I didn't find any other such statements in this API and there are many in this patch, so I am not sure how much they are needed.

/*
 * If we have only 10 uncompressed items on the full page, it probably
 * won't worth to compress them.
 */
if (maxoff - n_posting_on_page < 10)
    return;

Is this a magic number...?

/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan wrote:
> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantage of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance degradation in some particular cases, because page-level locks become more coarse grained once pages hold more tuples. However, this doesn't seem like something we should much care about. Providing an option to turn deduplication off seems enough to me.

Regarding bitmap indexes themselves, I think our BRIN could provide them. However, it would be useful to have opclass parameters to make them tunable.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Hi Peter, Thank you very much for your attention to this patch. Let me comment some points of your message. On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan wrote: > On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova > wrote: > > The new version of the patch is attached. > > This version is even simpler than the previous one, > > thanks to the recent btree design changes and all the feedback I received. > > I consider it ready for review and testing. > > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. > > Preserving the *logical contents* of B-Tree indexes that use > compression is very important -- that should not change in a way that > outside code can notice. The heap TID itself should count as logical > contents here, since we want to be able to implement retail index > tuple deletion in the future. Even without retail index tuple > deletion, amcheck's "rootdescend" verification assumes that it can > find one specific tuple (which could now just be one specific "logical > tuple") using specific key values from the heap, including the heap > tuple's heap TID. This requirement makes things a bit harder for your > patch, because you have to deal with one or two edge-cases that you > currently don't handle: insertion of new duplicates that fall inside > the min/max range of some existing posting list. That should be rare > enough in practice, so the performance penalty won't be too bad. This > probably means that code within _bt_findinsertloc() and/or > _bt_binsrch_insert() will need to think about a logical tuple as a > distinct thing from a physical tuple, though that won't be necessary > in most places. Could you please elaborate more on preserving the logical contents? I can understand it as following: "B-Tree should have the same structure and invariants as if each TID in posting list be a separate tuple". 
So, if we imagine each TID becoming a separate tuple, it would be the same B-tree, which just can sometimes magically store more tuples per page. Is my understanding correct? But outside code will still notice changes as soon as it directly accesses B-tree pages (like contrib/amcheck does). Do you mean we need an API for accessing logical B-tree tuples or something? > The need to "preserve the logical contents" also means that the patch > will need to recognize when indexes are not safe as a target for > compression/deduplication (maybe we should call this feature > deduplication, so it's clear how it differs from TOAST?). For > example, if we have a case-insensitive ICU collation, then it is not > okay to treat an opclass-equal pair of text strings that use the > collation as having the same value when considering merging the two > into one. You don't actually do that in the patch, but you also don't > try to deal with the fact that such a pair of strings are equal, and > so must have their final positions determined by the heap TID column > (deduplication within _bt_compress_one_page() must respect that). > Possibly equal-but-distinct values seems like a problem that's not > worth truly fixing, but it will be necessary to store metadata about > whether or not we're willing to do deduplication in the meta page, > based on operator class and collation details. That seems like a > restriction that we're just going to have to accept, though I'm not > too worried about exactly what that will look like right now. We can > work it out later. I think that in order to deduplicate "equal but distinct" values we would at least have to give up on index-only scans, because there is no guarantee that values that are equal according to a B-tree opclass are the same for other operations and/or user output. > I think that the need to be careful about the logical contents of > indexes already causes bugs, even with "safe for compression" indexes. 
> For example, I can sometimes see an assertion failure within _bt_truncate(), at the point where we check if heap TID values are safe:
>
> /*
>  * Lehman and Yao require that the downlink to the right page, which is to
>  * be inserted into the parent page in the second phase of a page split be
>  * a strict lower bound on items on the right page, and a non-strict upper
>  * bound for items on the left page. Assert that heap TIDs follow these
>  * invariants, since a heap TID value is apparently needed as a
>  * tiebreaker.
>  */
> #ifndef DEBUG_NO_TRUNCATE
> Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
>                           BTreeTupleGetMinTID(firstright)) < 0);
> ...
>
> This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple
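The physical-versus-logical distinction being discussed can be made concrete with a toy sketch. Everything below is illustrative only — simplified stand-in structs, not the patch's actual IndexTuple or posting-list layout:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only -- not the patch's on-disk structures */
typedef struct PhysicalTuple
{
	int		key;		/* stands in for the indexed key values */
	size_t	ntids;		/* number of TIDs in the posting list */
	int		tids[4];	/* stands in for ItemPointerData posting_list[] */
} PhysicalTuple;

typedef struct LogicalTuple
{
	int		key;
	int		tid;
} LogicalTuple;

/*
 * Expand physical (possibly posting-list) tuples into the logical
 * (key, TID) tuples they represent; returns the logical tuple count.
 */
static size_t
expand_logical(const PhysicalTuple *phys, size_t nphys, LogicalTuple *out)
{
	size_t	n = 0;

	for (size_t i = 0; i < nphys; i++)
		for (size_t j = 0; j < phys[i].ntids; j++)
		{
			out[n].key = phys[i].key;
			out[n].tid = phys[i].tids[j];
			n++;
		}
	return n;
}
```

Under this model, the index is "the same B-tree" whether or not deduplication ran: expanding every page's physical tuples must yield the identical ordered sequence of logical (key, TID) pairs.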
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote: > Anyway, I think that *hundreds* or even *thousands* of rows are > effectively locked all at once when a bitmap index needs to be updated > in these other systems -- and I mean a heavyweight lock that lasts > until the xact commits or aborts, like a Postgres row lock. As I said, > this is necessary simply because the transaction might need to roll > back. Of course, your patch never needs to do anything like that -- > the only risk is that buffer lock contention will be increased. Maybe > VACUUM isn't so bad after all! > > Doing deduplication adaptively and automatically in nbtree seems like > it might play to the strengths of Postgres, while also ameliorating > its weaknesses. As the same paper goes on to say, it's actually quite > unusual that PostgreSQL has *transactional* full text search built in > (using GIN), and offers transactional, high concurrency spatial > indexing (using GiST). Actually, this is an additional advantage of > our "pure" approach to MVCC -- we can add new high concurrency, > transactional access methods relatively easily. Wow, I never thought of that. The only things I know we lock until transaction end are rows we update (against concurrent updates), and additions to unique indexes. By definition, indexes with many duplicates are not unique, so that doesn't apply. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan wrote: > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. I have some high level concerns about how the patch might increase contention, which could make queries slower. Apparently that is a real problem in other systems that use MVCC when their bitmap index feature is used -- they are never really supposed to be used with OLTP apps. This patch makes nbtree behave rather a lot like a bitmap index. That's not exactly true, because you're not storing a bitmap or compressing the TID lists, but they're definitely quite similar. It's easy to imagine a hybrid approach, that starts with a B-Tree with deduplication/TID lists, and eventually becomes a bitmap index as more duplicates are added [1]. It doesn't seem like it would be practical for these other MVCC database systems to have standard B-Tree secondary indexes that compress duplicates gracefully in the way that you propose to with this patch, because lock contention would presumably be a big problem for the same reason as it is with their bitmap indexes (whatever the true reason actually is). Is it really possible to have something that's adaptive, offering the best of both worlds? Having dug into it some more, I think that the answer for us might actually be "yes, we can have it both ways". Other database systems that are also based on MVCC still probably use a limited form of index locking, even in READ COMMITTED mode, though this isn't very widely known. They need this for unique indexes, but they also need it for transaction rollback, to remove old entries from the index when the transaction must abort. The section "6.7 Standard Practice" from the paper "Architecture of a Database System" [2] goes into this, saying: "All production databases today support ACID transactions. As a rule, they use write-ahead logging for durability, and two-phase locking for concurrency control. 
An exception is PostgreSQL, which uses multiversion concurrency control throughout." I suggest reading "6.7 Standard Practice" in full. Anyway, I think that *hundreds* or even *thousands* of rows are effectively locked all at once when a bitmap index needs to be updated in these other systems -- and I mean a heavyweight lock that lasts until the xact commits or aborts, like a Postgres row lock. As I said, this is necessary simply because the transaction might need to roll back. Of course, your patch never needs to do anything like that -- the only risk is that buffer lock contention will be increased. Maybe VACUUM isn't so bad after all! Doing deduplication adaptively and automatically in nbtree seems like it might play to the strengths of Postgres, while also ameliorating its weaknesses. As the same paper goes on to say, it's actually quite unusual that PostgreSQL has *transactional* full text search built in (using GIN), and offers transactional, high concurrency spatial indexing (using GiST). Actually, this is an additional advantage of our "pure" approach to MVCC -- we can add new high concurrency, transactional access methods relatively easily. [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova wrote: > The new version of the patch is attached. > This version is even simpler than the previous one, > thanks to the recent btree design changes and all the feedback I received. > I consider it ready for review and testing. I took a closer look at this patch, and have some general thoughts on its design, and specific feedback on the implementation. Preserving the *logical contents* of B-Tree indexes that use compression is very important -- that should not change in a way that outside code can notice. The heap TID itself should count as logical contents here, since we want to be able to implement retail index tuple deletion in the future. Even without retail index tuple deletion, amcheck's "rootdescend" verification assumes that it can find one specific tuple (which could now just be one specific "logical tuple") using specific key values from the heap, including the heap tuple's heap TID. This requirement makes things a bit harder for your patch, because you have to deal with one or two edge-cases that you currently don't handle: insertion of new duplicates that fall inside the min/max range of some existing posting list. That should be rare enough in practice, so the performance penalty won't be too bad. This probably means that code within _bt_findinsertloc() and/or _bt_binsrch_insert() will need to think about a logical tuple as a distinct thing from a physical tuple, though that won't be necessary in most places. The need to "preserve the logical contents" also means that the patch will need to recognize when indexes are not safe as a target for compression/deduplication (maybe we should call this feature deduplication, so it's clear how it differs from TOAST?). For example, if we have a case-insensitive ICU collation, then it is not okay to treat an opclass-equal pair of text strings that use the collation as having the same value when considering merging the two into one. 
You don't actually do that in the patch, but you also don't try to deal with the fact that such a pair of strings are equal, and so must have their final positions determined by the heap TID column (deduplication within _bt_compress_one_page() must respect that). Possibly equal-but-distinct values seems like a problem that's not worth truly fixing, but it will be necessary to store metadata about whether or not we're willing to do deduplication in the meta page, based on operator class and collation details. That seems like a restriction that we're just going to have to accept, though I'm not too worried about exactly what that will look like right now. We can work it out later. I think that the need to be careful about the logical contents of indexes already causes bugs, even with "safe for compression" indexes. For example, I can sometimes see an assertion failure within _bt_truncate(), at the point where we check if heap TID values are safe:

/*
 * Lehman and Yao require that the downlink to the right page, which is to
 * be inserted into the parent page in the second phase of a page split be
 * a strict lower bound on items on the right page, and a non-strict upper
 * bound for items on the left page. Assert that heap TIDs follow these
 * invariants, since a heap TID value is apparently needed as a
 * tiebreaker.
 */
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
                          BTreeTupleGetMinTID(firstright)) < 0);
...

This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple logical tuples in the context of inserting new tuples, even when you're not performing compression or splitting the page. 
I also see that amcheck's bt_index_parent_check() function fails, though bt_index_check() does not fail when I don't use any of its extra verification options. (You haven't updated amcheck, but I don't think that you need to update it for these basic checks to continue to work.)

Other feedback on specific things:

* A good way to assess whether or not the "logical tuple versus physical tuple" thing works is to make sure that amcheck's "rootdescend" verification works with a variety of indexes. As I said, it has the same requirements for nbtree as retail index tuple deletion will.

* _bt_findinsertloc() should not call _bt_compress_one_page() for !heapkeyspace (version 3) indexes -- the second call to _bt_compress_one_page() should be removed.

* Why can't compression be used on system catalog indexes? I understand that they are not a compelling case, but we tend to do things the same way with catalog tables and indexes unless there is a very good reason not to (e.g. HOT, suffix
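The heap-TID invariant behind the _bt_truncate() assertion quoted earlier — the largest TID on the left half of a split must compare strictly less than the smallest TID on the right half — can be sketched with a toy TID type. This is a simplified model for illustration; the struct and function names are not PostgreSQL's:

```c
#include <assert.h>

/* Toy stand-in for ItemPointerData: (block, offset), ordered lexicographically */
typedef struct TID
{
	int		block;
	int		offset;
} TID;

/* Returns -1, 0, or 1, like ItemPointerCompare() */
static int
tid_compare(TID a, TID b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/*
 * For equal keys, a page split is only valid when the max TID on the
 * left page sorts strictly before the min TID on the right page.
 */
static int
split_invariant_holds(TID lastleft_max, TID firstright_min)
{
	return tid_compare(lastleft_max, firstright_min) < 0;
}
```

A posting list whose TID range straddles the split point would violate this check, which is why a posting list has to be treated as its component logical tuples when choosing a split point.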
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote: > This result is very impressive. We'll need to revisit what the right > trade-off is for the compression scheme, which Heikki had some > thoughts on when we left off 3 years ago, but that should be a lot > easier now. I am very encouraged by the fact that this relatively > simple approach already works quite nicely. It's also great to see > that bulk insertions with lots of compression are very clearly faster > with this latest revision of your patch, unlike earlier versions from > 2016 that made those cases slower (though I haven't tested indexes > that don't really use compression). I think that this is because you > now do the compression lazily, at the point where it looks like we may > need to split the page. Previous versions of the patch had to perform > compression eagerly, just like GIN, which is not really appropriate > for nbtree. I am also encouraged and am happy we can finally move this duplicate optimization forward. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan wrote: > I tried this on my own "UK land registry" test data [1], which was > originally used for the v12 nbtree work. My test case has a low > cardinality, multi-column text index. I chose this test case because > it was convenient for me. > > On v12/master, the index is 1100MB. Whereas with your patch, it ends > up being 196MB -- over 5.5x smaller! I also see a huge and consistent space saving for TPC-H. All 9 indexes are significantly smaller. The lineitem orderkey index is "just" 1/3 smaller, which is the smallest improvement among TPC-H indexes in my index bloat test case. The two largest indexes after the initial bulk load are *much* smaller: the lineitem parts supplier index is ~2.7x smaller, while the lineitem ship date index is a massive ~4.2x smaller. Also, the orders customer key index is ~2.8x smaller, and the order date index is ~2.43x smaller. Note that the test involved retail insertions, not CREATE INDEX. I haven't seen any regression in the size of any index so far, including when the number of internal pages is all that we measure. Actually, there seem to be cases where there is a noticeably larger reduction in internal pages than in leaf pages, probably because of interactions with suffix truncation. This result is very impressive. We'll need to revisit what the right trade-off is for the compression scheme, which Heikki had some thoughts on when we left off 3 years ago, but that should be a lot easier now. I am very encouraged by the fact that this relatively simple approach already works quite nicely. It's also great to see that bulk insertions with lots of compression are very clearly faster with this latest revision of your patch, unlike earlier versions from 2016 that made those cases slower (though I haven't tested indexes that don't really use compression). I think that this is because you now do the compression lazily, at the point where it looks like we may need to split the page. 
Previous versions of the patch had to perform compression eagerly, just like GIN, which is not really appropriate for nbtree. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova wrote: > i - number of distinct values in the index. > So i=1 means that all rows have the same key, > and i=1000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1 / 215 / 88
> 1000 / 215 / 90
> 10 / 215 / 71
> 1000 / 214 / 214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was originally used for the v12 nbtree work. My test case has a low cardinality, multi-column text index. I chose this test case because it was convenient for me. On v12/master, the index is 1100MB. Whereas with your patch, it ends up being 196MB -- over 5.5x smaller! I also tried it out with the "Mouse genome informatics" database [2], which was already improved considerably by the v12 work on duplicates. This is helped tremendously by your patch. It's not quite 5.5x across the board, of course. There are 187 indexes (on 28 tables), and almost all of the indexes are smaller. Actually, *most* of the indexes are *much* smaller. Very often 50% smaller. I don't have time to do an in-depth analysis of these results today, but clearly the patch is very effective on real world data. I think that we tend to underestimate just how common indexes with a huge number of duplicates are. [1] https://postgr.es/m/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com [2] http://www.informatics.jax.org/software.shtml -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
The new version of the patch is attached. This version is even simpler than the previous one, thanks to the recent btree design changes and all the feedback I received. I consider it ready for review and testing.

[feature overview]

This patch implements deduplication of btree non-pivot tuples on leaf pages in a manner similar to GIN index "posting lists". A non-pivot posting tuple has the following format:

t_tid | t_info | key values | posting_list[]

where the t_tid and t_info fields are used to store meta info about the tuple's posting list. The posting list is an array of ItemPointerData. Currently, compression is applied to all indexes except system indexes, unique indexes, and indexes with included columns. On insertion, compression is applied not to each tuple, but to the whole page before a split: if the target page is full, we try to compress it.

[benchmark results]

idx ON tbl(c1); the index contains 1000 integer values. i - number of distinct values in the index, so i=1 means that all rows have the same key, and i=1000 means that all keys are different.

i / old size (MB) / new size (MB)
1 / 215 / 88
1000 / 215 / 90
10 / 215 / 71
1000 / 214 / 214

For more, see the attached diagram with test results.

[future work]

Many things can be improved in this feature. Personally, I'd prefer to keep this patch as small as possible and work on other improvements after the basic part is committed, though I understand that some of these may be considered essential for this patch to be approved.

1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index, which allows enabling/disabling compression for a particular index.
4. Implement posting list compression. 
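The page-level pass described in the overview (merge runs of equal keys into posting tuples only when a page would otherwise split) can be sketched over flat arrays. Everything here is illustrative — toy types with a fixed-capacity posting list, not the patch's actual _bt_compress_one_page():

```c
#include <assert.h>
#include <stddef.h>

/* Toy posting tuple: one key stored once, followed by its heap TIDs */
typedef struct PostingTuple
{
	int		key;		/* stands in for "key values" */
	size_t	ntids;		/* entries used in tids[] */
	int		tids[8];	/* stands in for ItemPointerData posting_list[] */
} PostingTuple;

/*
 * Deduplicate sorted (key, tid) pairs into posting tuples: each run of
 * equal keys collapses into one tuple, up to the toy capacity of 8 TIDs.
 * Returns the number of posting tuples produced.
 */
static size_t
deduplicate(const int *keys, const int *tids, size_t n, PostingTuple *out)
{
	size_t	nout = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (nout > 0 && out[nout - 1].key == keys[i] &&
			out[nout - 1].ntids < 8)
		{
			/* same key as previous tuple: append TID to its posting list */
			out[nout - 1].tids[out[nout - 1].ntids++] = tids[i];
		}
		else
		{
			/* new key (or posting list full): start a fresh tuple */
			out[nout].key = keys[i];
			out[nout].ntids = 1;
			out[nout].tids[0] = tids[i];
			nout++;
		}
	}
	return nout;
}
```

Because the input is already in (key, TID) order, the TIDs inside each posting list stay sorted, which is what lets the B-tree keep its usual invariants after deduplication.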
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..fce499b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 		}
 
 		/*
+		 * Before considering moving right, try to compress the page
+		 */
+		_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+		/*
 		 * Nope, so check conditions (b) and (c) enumerated above
 		 *
 		 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,232 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+		/*
+		 * this may happen if tuple is bigger than freespace
+		 * fallback to uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		return false;
+	}