Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-11-06 Thread Peter Geoghegan
On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas  wrote:
> I think we should pack the TIDs more tightly, like GIN does with the varbyte
> encoding. It's tempting to commit this without it for now, and add the
> compression later, but I'd like to avoid having to deal with multiple
> binary-format upgrades, so let's figure out the final on-disk format that we
> want, right from the beginning.

While the idea of duplicate storage is pretty obviously compelling,
there could be other, non-obvious benefits. I think that it could
bring further benefits if we could use duplicate storage to change
this property of nbtree (this is from the README):

"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page.  This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up).  Therefore we assume
Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page.  An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page.  (Currently, we try to find a page where
there is room for the new key without a split.)

"""

If we could *guarantee* that all keys in the index are unique, then we
could maintain the keyspace as L&Y originally described.

The practical benefits to this would be:

* We wouldn't need to take the extra step described above -- finding a
bounding key/separator key that's fully equal to our scankey would no
longer necessitate a probably-useless descent to the left of that key.
(BTW, I wonder if we could get away with not inserting a downlink into
parent when a leaf page split finds an identical IndexTuple in parent,
*without* changing the keyspace invariant I mention -- if we're always
going to go to the left of an equal-to-scankey key in an internal
page, why even have more than one?)

* This would make suffix truncation of internal index tuples easier,
and that's important.

The traditional reason why suffix truncation is important is that it
can keep the tree a lot shorter than it would otherwise be. These
days, that might not seem that important, because even if you have
twice the number of internal pages than strictly necessary, that still
isn't that many relative to typical main memory size (and even CPU
cache sizes, perhaps).

The reason I think it's important these days is that not having suffix
truncation makes our "separator keys" overly prescriptive about what
part of the keyspace is owned by each internal page. With a pristine
index (following REINDEX), this doesn't matter much. But, I think that
we get much bigger problems with index bloat due to the poor fan-out
that we sometimes see due to not having suffix truncation, *combined*
with the page deletion algorithm's restriction on deleting internal
pages (it can only be done for internal pages with *no* children).

Adding another level or two to the B-Tree makes it so that your
workload's "sparse deletion patterns" really don't need to be that
sparse in order to bloat the B-Tree badly, necessitating a REINDEX to
get back to acceptable performance (VACUUM won't do it). To avoid
this, we should make the internal pages represent the key space in the
least restrictive way possible, by applying suffix truncation so that
it's much more likely that things will *stay* balanced as churn
occurs. This is probably a really bad problem with things like
composite indexes over text columns, or indexes with many NULL values.
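To make the suffix truncation point concrete, here is a standalone sketch
of picking a split separator. It works at the byte level on strings, as a
stand-in for the attribute-level logic nbtree would actually need, and the
keys are made up:

#include <stdio.h>
#include <string.h>

/*
 * Given the last key on the left half of a split and the first key on the
 * right half (lastleft < firstright assumed), return the length of the
 * shortest prefix of firstright that still compares greater than lastleft.
 * Everything past that prefix is suffix that the parent's separator key
 * does not need to carry.
 */
static size_t
truncated_separator_len(const char *lastleft, const char *firstright)
{
    size_t i = 0;

    while (lastleft[i] != '\0' && lastleft[i] == firstright[i])
        i++;
    /* keep one distinguishing byte beyond the common prefix, if any */
    return firstright[i] != '\0' ? i + 1 : i;
}

int
main(void)
{
    const char *lastleft = "Smith, Johannes, 1982-03-04";
    const char *firstright = "Stewart, Anna, 1991-11-23";
    size_t      len = truncated_separator_len(lastleft, firstright);

    /* the parent needs only "St", not the full 25-byte key */
    printf("separator: \"%.*s\" (%zu of %zu bytes)\n",
           (int) len, firstright, len, strlen(firstright));
    return 0;
}

The less of the keyspace the separator pins down, the more freedom the
tree has to absorb churn without the fan-out collapse described above.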

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-07-04 Thread Heikki Linnakangas

On 18/03/16 19:19, Anastasia Lubennikova wrote:

Please find the new version of the patch attached. Now it has WAL
functionality.

A detailed description of the feature can be found in the README draft:
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone who is interested in
this feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please don't complain about code style; it is still a work in progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach a couple of SQL
scripts for testing in the next message.
2. Fix NULLs processing.
3. Add a flag to pg_index that allows enabling/disabling compression
for each particular index.
4. Recheck locking considerations. I tried to write the code as
non-invasively as possible, but we need to make sure that the algorithm
is still correct.
5. Change BTMaxItemSize.
6. Bring back microvacuum functionality.


I think we should pack the TIDs more tightly, like GIN does with the 
varbyte encoding. It's tempting to commit this without it for now, and 
add the compression later, but I'd like to avoid having to deal with 
multiple binary-format upgrades, so let's figure out the final on-disk 
format that we want, right from the beginning.


It would be nice to reuse the varbyte encoding code from GIN, but we 
might not want to use that exact scheme for B-tree. Firstly, an 
important criterion when we designed GIN's encoding scheme was to avoid 
expanding on-disk size for any data set, which meant that a TID had to 
always be encoded in 6 bytes or less. We don't have that limitation with 
B-tree, because in B-tree, each item is currently stored as a separate 
IndexTuple, which is much larger. So we are free to choose an encoding 
scheme that's better at packing some values, at the expense of using 
more bytes for other values, if we want to. Some analysis on what we 
want would be nice. (It's still important that removing a TID from the 
list never makes the list larger, for VACUUM.)
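To illustrate the kind of encoding I mean, here is a minimal standalone
sketch of varbyte-encoded TID deltas. The 11-bit offset packing and the
7-bits-per-byte continuation scheme mirror what GIN's ginpostinglist.c
does, but treat the details as illustrative rather than as the final
B-tree format:

#include <stdint.h>
#include <stdio.h>

/* Pack a TID into one integer: heap block number in the high bits, the
 * offset within the block (11 bits is enough) in the low bits. */
static uint64_t
tid_to_uint64(uint32_t block, uint16_t offset)
{
    return ((uint64_t) block << 11) | offset;
}

/* Append val to buf as a varbyte: 7 payload bits per byte, high bit set
 * on every byte except the last.  Returns the number of bytes written. */
static int
encode_varbyte(uint64_t val, unsigned char *buf)
{
    int n = 0;

    while (val >= 0x80)
    {
        buf[n++] = (unsigned char) (val | 0x80);
        val >>= 7;
    }
    buf[n++] = (unsigned char) val;
    return n;
}

int
main(void)
{
    /* three TIDs on adjacent heap pages, sorted as they would be on page */
    uint64_t tids[] = {
        tid_to_uint64(41720, 3),
        tid_to_uint64(41720, 97),
        tid_to_uint64(41721, 1),
    };
    unsigned char buf[30];
    uint64_t prev = 0;
    int total = 0;

    /* The first value is stored absolute, the rest as deltas from their
     * predecessor; on densely packed heap pages most deltas fit in one
     * or two bytes, versus the plain 6-byte ItemPointerData. */
    for (int i = 0; i < 3; i++)
    {
        total += encode_varbyte(tids[i] - prev, buf + total);
        prev = tids[i];
    }
    printf("3 TIDs in %d bytes (vs. %d uncompressed)\n", total, 3 * 6);
    return 0;
}

One nice property: removing a middle TID merges two deltas into their sum,
and the varbyte encoding of a sum is never longer than the two encodings
combined - which is exactly the "removal never enlarges the list"
requirement for VACUUM mentioned above.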


Secondly, to be able to just always enable this feature, without a GUC 
or reloption, we might need something that's faster for random access 
than GIN's posting lists. Or we could just add the setting; but it would 
be nice to have some more analysis of the worst-case performance before 
we decide on that.


I find the macros in nbtree.h in the patch quite confusing. They're 
similar to what we did in GIN, but again we might want to choose 
differently here. So some discussion on the desired IndexTuple layout is 
in order. (One clear bug is that using the high bit of BlockNumber for 
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)
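To spell that out: BlockNumber is a plain 32-bit unsigned integer, and
block numbers all the way up to 0xFFFFFFFE are valid, so the flag collides
once a table passes 2^31 blocks (16 TB at the default 8 kB block size). A
standalone illustration, with BT_POSTING standing in for the patch's flag
bit:

#include <stdint.h>
#include <stdio.h>

#define BT_POSTING 0x80000000U   /* the assumed flag: high bit of ip_blkid */

int
main(void)
{
    uint32_t block = 0x80000001; /* a perfectly legal heap block number */

    /* the flag test misfires: a plain tuple looks like a posting tuple */
    printf("looks like posting: %d\n", (block & BT_POSTING) != 0);

    /* and unmasking silently redirects to the wrong heap block */
    printf("block after unmasking: 0x%08x\n", (unsigned) (block & ~BT_POSTING));
    return 0;
}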


- Heikki





Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-28 Thread Claudio Freire
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan  wrote:
> On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas  wrote:
>> I really like this idea, and the performance results seem impressive,
>> but I think we should push this out to 9.7.  A btree patch that didn't
>> have WAL support until two and a half weeks into the final CommitFest
>> just doesn't seem to me like a good candidate.  First, as a general
>> matter, if a patch isn't code-complete at the start of a CommitFest,
>> it's reasonable to say that it should be reviewed but not necessarily
>> committed in that CommitFest.  This patch has had some review, but I'm
>> not sure how deep that review is, and I think it's had no code review
>> at all of the WAL logging changes, which were submitted only a week
>> ago, well after the CF deadline.  Second, the btree AM is a
>> particularly poor place to introduce possibly destabilizing changes.
>> Everybody depends on it, all the time, for everything.  And despite
>> new tools like amcheck, it's not a particularly easy thing to debug.
>
> Regrettably, I must agree. I don't see a plausible path to commit for
> this patch in the ongoing CF.
>
> I think that Anastasia did an excellent job here, and I wish I could
> have been of greater help sooner. Nevertheless, it would be unwise to
> commit this given the maturity of the code. There have been very few
> instances of performance improvements to the B-Tree code for as long
> as I've been interested, because it's so hard, and the standard is so
> high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which
> were far less invasive, and Simon's commit c7111d11b1, which we just
> outright reverted from 9.5 due to subtle bugs (and even that was
> significantly less invasive than this patch). Improving nbtree is
> something that requires several rounds of expert review, and that's
> something that's in short supply for the B-Tree code in particular. I
> think that a new testing strategy is needed to make this easier, and I
> hope to get that going with amcheck. I need help with formalizing a
> "testing first" approach for improving the B-Tree code, because I
> think it's the only way that we can move forward with projects like
> this. It's *incredibly* hard to push forward patches like this given
> our current, limited testing strategy.

I've been toying with prefix compression myself (having gotten nowhere
concrete, really), so I agree that messing with the btree code is quite
a bit harder than it ought to be.

Perhaps trying experimental format changes in a separate experimental
AM wouldn't be all that bad (say, nxbtree?). People could opt in to
those by creating the indexes with nxbtree instead of plain btree
(say, in development environments) and get some testing going without
risking much.

Normally the same effect should be achievable with mere flags, but
since format changes to btree tend to be rather invasive, ensuring the
patch doesn't change behavior with the flag off is hard as well, hence
the wholly separate am idea.




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-28 Thread Anastasia Lubennikova

25.03.2016 01:12, Peter Geoghegan:

On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas  wrote:

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7.  A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate.  First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest.

You're right.
Frankly, I thought that someone would help me with the patch, but I had to 
finish it myself.

*off-topic*
I wonder if we can add a new flag to the commitfest, something like "Needs 
assistance",
which would be used to mark big and complicated patches in progress,
while "Needs review" would mean that the patch is almost ready and only 
requires the final review.



   This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline.  Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything.  And despite
new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.

I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.


Unfortunately, I must agree. This patch is not going to reach its final 
version before the feature freeze.

I'll move it to the future commitfest.

Anyway, it means that now we have more time to improve the patch.
If you have any ideas related to this patch, like prefix/suffix 
compression, I'll be glad to discuss them.

Same goes for any other ideas for B-tree optimization.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-24 Thread Jim Nasby

On 3/24/16 10:21 AM, Alexander Korotkov wrote:

1) It's a great feature many users dream about.


Doesn't matter if it starts eating their data...


2) The patch is not very big.
3) The patch doesn't introduce significant infrastructural changes.  It just
changes some well-isolated places.


It doesn't really matter how big the patch is; it's a question of "What 
did the patch fail to consider?". With something as complicated as the 
btree code, there are ample opportunities to miss things. (And FWIW, 
I'd argue that a 51kB patch is certainly not small, and a patch that is 
doing things in critical sections isn't terribly isolated.)


I do think this will be a great addition, but it's just too late to be 
adding this to 9.6.


(BTW, I'm getting bounces from a.lebe...@postgrespro.ru, as well as 
postmaster@. I emailed i...@postgrespro.ru about this but never heard back.)

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-24 Thread Peter Geoghegan
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas  wrote:
> I really like this idea, and the performance results seem impressive,
> but I think we should push this out to 9.7.  A btree patch that didn't
> have WAL support until two and a half weeks into the final CommitFest
> just doesn't seem to me like a good candidate.  First, as a general
> matter, if a patch isn't code-complete at the start of a CommitFest,
> it's reasonable to say that it should be reviewed but not necessarily
> committed in that CommitFest.  This patch has had some review, but I'm
> not sure how deep that review is, and I think it's had no code review
> at all of the WAL logging changes, which were submitted only a week
> ago, well after the CF deadline.  Second, the btree AM is a
> particularly poor place to introduce possibly destabilizing changes.
> Everybody depends on it, all the time, for everything.  And despite
> new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.

I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-24 Thread Alexander Korotkov
On Thu, Mar 24, 2016 at 5:17 PM, Robert Haas  wrote:

> On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
>  wrote:
> > Please find the new version of the patch attached. Now it has WAL
> > functionality.
> >
> > A detailed description of the feature can be found in the README draft:
> > https://goo.gl/50O8Q0
> >
> > This patch is pretty complicated, so I ask everyone who is interested in
> > this feature to help with reviewing and testing it. I will be grateful
> > for any feedback. But please don't complain about code style; it is
> > still a work in progress.
> >
> > Next things I'm going to do:
> > 1. More debugging and testing. I'm going to attach a couple of SQL
> > scripts for testing in the next message.
> > 2. Fix NULLs processing.
> > 3. Add a flag to pg_index that allows enabling/disabling compression
> > for each particular index.
> > 4. Recheck locking considerations. I tried to write the code as
> > non-invasively as possible, but we need to make sure that the algorithm
> > is still correct.
> > 5. Change BTMaxItemSize.
> > 6. Bring back microvacuum functionality.
>
> I really like this idea, and the performance results seem impressive,
> but I think we should push this out to 9.7.  A btree patch that didn't
> have WAL support until two and a half weeks into the final CommitFest
> just doesn't seem to me like a good candidate.  First, as a general
> matter, if a patch isn't code-complete at the start of a CommitFest,
> it's reasonable to say that it should be reviewed but not necessarily
> committed in that CommitFest.  This patch has had some review, but I'm
> not sure how deep that review is, and I think it's had no code review
> at all of the WAL logging changes, which were submitted only a week
> ago, well after the CF deadline.  Second, the btree AM is a
> particularly poor place to introduce possibly destabilizing changes.
> Everybody depends on it, all the time, for everything.  And despite
> new tools like amcheck, it's not a particularly easy thing to debug.
>

It's all true.  But:
1) It's a great feature many users dream about.
2) The patch is not very big.
3) The patch doesn't introduce significant infrastructural changes.  It just
changes some well-isolated places.

Let's give it a chance.  I've signed up as an additional reviewer, and I'll
do my best to spot all possible issues in this patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-24 Thread Robert Haas
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
 wrote:
> Please find the new version of the patch attached. Now it has WAL
> functionality.
>
> A detailed description of the feature can be found in the README draft:
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone who is interested in
> this feature to help with reviewing and testing it. I will be grateful
> for any feedback. But please don't complain about code style; it is
> still a work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach a couple of SQL
> scripts for testing in the next message.
> 2. Fix NULLs processing.
> 3. Add a flag to pg_index that allows enabling/disabling compression
> for each particular index.
> 4. Recheck locking considerations. I tried to write the code as
> non-invasively as possible, but we need to make sure that the algorithm
> is still correct.
> 5. Change BTMaxItemSize.
> 6. Bring back microvacuum functionality.

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7.  A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate.  First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest.  This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline.  Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything.  And despite
new tools like amcheck, it's not a particularly easy thing to debug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-23 Thread Alexandr Popov



On 18.03.2016 20:19, Anastasia Lubennikova wrote:
Please find the new version of the patch attached. Now it has WAL 
functionality.

A detailed description of the feature can be found in the README draft: 
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone who is interested in 
this feature
to help with reviewing and testing it. I will be grateful for any 
feedback.
But please don't complain about code style; it is still a work in 
progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach a couple of SQL 
scripts for testing in the next message.
2. Fix NULLs processing.
3. Add a flag to pg_index that allows enabling/disabling compression 
for each particular index.
4. Recheck locking considerations. I tried to write the code as 
non-invasively as possible, but we need to make sure that the algorithm 
is still correct.
5. Change BTMaxItemSize.
6. Bring back microvacuum functionality.




Hi, hackers.

It's my first review, so please don't be too strict with me.

I have tested this patch on the following table:

create table message
(
    id      serial,
    usr_id  integer,
    text    text
);
CREATE INDEX message_usr_id ON message (usr_id);

The table has 1000 records.

I found the following: the fewer the unique keys, the smaller the index.

The next two tables demonstrate it.

New B-tree
Count of unique keys (usr_id); index's size; time of creation
1000 ; "214 MB" ; "00:00:34.193441"
333  ; "214 MB" ; "00:00:45.731173"
200  ; "129 MB" ; "00:00:41.445876"
100  ; "129 MB" ; "00:00:38.455616"
10   ; "86 MB"  ; "00:00:40.887626"
1    ; "79 MB"  ; "00:00:47.199774"

Old B-tree
Count of unique keys (usr_id); index's size; time of creation
1000 ; "214 MB" ; "00:00:35.043677"
333  ; "286 MB" ; "00:00:40.922845"
200  ; "300 MB" ; "00:00:46.454846"
100  ; "278 MB" ; "00:00:42.323525"
10   ; "287 MB" ; "00:00:47.438132"
1    ; "280 MB" ; "00:01:00.307873"

I inserted data randomly and sequentially; it did not influence the 
index's size.
The time of selecting, inserting, and updating random rows is not changed. 
That is great, but it certainly needs some more detailed study.


Alexander Popov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-19 Thread Anastasia Lubennikova
Please find the new version of the patch attached. Now it has WAL 
functionality.

A detailed description of the feature can be found in the README draft: 
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone who is interested in 
this feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please don't complain about code style; it is still a work in progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach a couple of SQL 
scripts for testing in the next message.
2. Fix NULLs processing.
3. Add a flag to pg_index that allows enabling/disabling compression 
for each particular index.
4. Recheck locking considerations. I tried to write the code as 
non-invasively as possible, but we need to make sure that the algorithm 
is still correct.
5. Change BTMaxItemSize.
6. Bring back microvacuum functionality.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..72acc0f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
+#include "utils/datum.h"
 
 typedef struct
 {
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 			int keysz, ScanKey scankey);
+
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page 		page;
+	TupleDesc	itupdesc;
+	int			nipd;
+	IndexTuple 	olditup;
+	Size 		sizetoadd;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		bool updposting = false;
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -201,7 +210,42 @@ top:
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
 						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Decide whether we can apply compression
+		 */
+		page = BufferGetPage(buf);
+
+		if (!IsSystemRelation(rel)
+			&& !rel->rd_index->indisunique
+			&& offset != InvalidOffsetNumber
+			&& offset <= PageGetMaxOffsetNumber(page))
+		{
+			itupdesc = RelationGetDescr(rel);
+			sizetoadd = sizeof(ItemPointerData);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if (_bt_isbinaryequal(itupdesc, olditup,
+								  rel->rd_index->indnatts, itup))
+			{
+				if (!BtreeTupleIsPosting(olditup))
+				{
+					nipd = 1;
+					sizetoadd = sizetoadd * 2;
+				}
+				else
+					nipd = BtreeGetNPosting(olditup);
+
+				if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+					&& PageGetFreeSpace(page) > sizetoadd)
+					updposting = true;
+			}
+		}
+
+		if (updposting)
+			_bt_pgupdtup(rel, buf, offset, itup, olditup, nipd);
+		else
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -1042,6 +1086,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 		false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1117,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size		hikeysize = BtreeGetPostingOffset(item);
+		IndexTuple	hikey = palloc0(hikeysize);
+
+		/* Truncate the posting before inserting it as a hikey. */
+		memcpy(hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 	false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+" while splitting block %u of index \"%s\"",
+origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, 

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-15 Thread Anastasia Lubennikova

14.03.2016 16:02, David Steele:

Hi Anastasia,

On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
  wrote:

I fixed it in the new version (attached).


Thank you for the review.
At last, there is a new patch, version 3.0. After some refactoring it
looks much better.
I described all the details of the compression in this document:
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about
tricky moments of the implementation and questions about future work.
Please don't hesitate to comment on it.


Sorry, previous patch was dirty. Hotfix is attached.


This looks like an extremely valuable optimization for btree indexes 
but unfortunately it is not getting a lot of attention. It still 
applies cleanly for anyone interested in reviewing.




Thank you for the attention.
I would be indebted to any reviewers who can just try this patch on 
real data and workloads (except WAL for now).

The B-tree needs a great deal of testing.

It's not clear to me that you answered all of Peter's questions in 
[1].  I understand that you've provided a README but it may not be 
clear if the answers are in there (and where).


I addressed in the README all the points Peter asked about.
But I see that it'd be better to answer directly.
Thanks for the reminder; I'll do it tomorrow.


Also, at the end of the README it says:

13. Xlog. TODO.

Does that mean the patch is not yet complete?


Yes, you're right.
Frankly speaking, I assumed that someone would help me with that stuff,
but now I have almost completed it. I'll send the updated patch in the next 
message.

I'm still doubtful about some details of the patch. I mentioned them in the 
README (in bold type).

But they are mostly about future improvements.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-14 Thread David Steele

Hi Anastasia,

On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
  wrote:

I fixed it in the new version (attached).


Thank you for the review.
At last, there is a new patch, version 3.0. After some refactoring it
looks much better.
I described all the details of the compression in this document:
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about
tricky moments of the implementation and questions about future work.
Please don't hesitate to comment on it.


Sorry, previous patch was dirty. Hotfix is attached.


This looks like an extremely valuable optimization for btree indexes but 
unfortunately it is not getting a lot of attention.  It still applies 
cleanly for anyone interested in reviewing.


It's not clear to me that you answered all of Peter's questions in [1]. 
 I understand that you've provided a README but it may not be clear if 
the answers are in there (and where).


Also, at the end of the README it says:

13. Xlog. TODO.

Does that mean the patch is not yet complete?

Thanks,
--
-David
da...@pgmasters.net

[1] 
http://www.postgresql.org/message-id/cam3swzq3_plqch4w7uq8q_f2t4hesektr2n0rq5pxa18oer...@mail.gmail.com





Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-18 Thread Anastasia Lubennikova

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
  wrote:

I fixed it in the new version (attached).


Thank you for the review.
At last, there is a new patch, version 3.0. After some refactoring it 
looks much better.
I described all the details of the compression in this document: 
https://goo.gl/50O8Q0 (the same text without pictures is attached in 
btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about 
tricky moments of the implementation and questions about future work.

Please don't hesitate to comment on it.


Sorry, previous patch was dirty. Hotfix is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Compression. To be precise, it's not actually compression, but just a more effective layout of ItemPointers on an index page.
compressed tuple = IndexTuple (with metadata in the TID field + key) + PostingList
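Here is a rough standalone model of that layout; the field names and exact packing are illustrative, the real accessors being the BtreeTupleIsPosting()/BtreeGetNPosting()/BtreeGetPostingOffset() macros in the patch's nbtree.h:

#include <stdint.h>
#include <stdio.h>

/* 6-byte heap pointer, shaped like ItemPointerData in storage/itemptr.h
 * (the block number is split into two uint16s to avoid padding) */
typedef struct
{
    uint16_t bi_hi;
    uint16_t bi_lo;
    uint16_t ip_posid;
} TidSketch;

/*
 * An ordinary leaf tuple is laid out as [heap TID][t_info][key].
 * A compressed tuple keeps that shape but reuses the 6-byte TID field
 * for metadata, with the heap TIDs moved to an array after the key:
 *
 *   [n_posting | posting offset][t_info][key][TID array]
 */
typedef struct
{
    uint16_t n_posting;     /* number of heap TIDs stored for this key */
    uint16_t posting_hi;    /* byte offset of the TID array within the  */
    uint16_t posting_lo;    /* tuple, split into halves like a TID      */
    uint16_t t_info;        /* size and flag bits, as usual             */
    /* key attribute(s) follow, then: TidSketch posting[n_posting] */
} PostingTupleSketch;

int
main(void)
{
    /* the fixed cost paid once per posting list, besides repeating the key */
    printf("metadata bytes per compressed tuple: %zu\n",
           sizeof(PostingTupleSketch));
    return 0;
}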


1. A GIN index fits extremely well for really large sets of repeating keys, but on the other hand, it completely fails to handle unique keys. For btree it is essential to have good performance and concurrency in any corner case, with any number of duplicates. That's why we can't just copy the GIN implementation of item pointer compression.

The first difference is that the btree algorithm performs compression (or, in other words, changes the index tuple layout) only if there's more than one tuple with a given key. It allows us to avoid the overhead of storing useless metadata for mostly distinct keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I'm not sure whether this use case really exists; those tuples should be deleted by microvacuum as soon as possible. Anyway, I think it's worth adding a storage parameter for btree which enables/disables compression for each particular index, and setting compression of unique indexes to off by default.

System indexes do not support compression for several reasons. First of all, because of the WIP state of the patch (debugging the system catalog isn't a big pleasure). The next reason is that I know many places in the code where hardcoded or non-obvious syscache routines are used, and I do not feel brave enough to change that code. And last but not least, I don't see good reasons to do it.

2. If the index key is very small (smaller than the metadata) and the number of duplicates is small, compression could lead to index bloat instead of an index size decrease (see picture below). I'm not sure whether it's worth handling this case separately, because it's really rare, and I consider it the DBA's job to disable compression on such indexes. But if you see any clear way to do this, it would be great.

3. For GIN indexes, if a posting list is too large, a posting tree is created. This rests on the following assumptions:
- Indexed keys are never deleted. That makes all the tree algorithms much easier.
- There are always many duplicates. Otherwise, GIN becomes really inefficient.
- There is no high concurrency requirement. In order to add a new entry into a posting tree, we hold a lock on its root, so only one backend at a time can perform insertion.

In btree we can't afford these assumptions, so we have to handle big posting lists in another way: if there are too many ItemPointers to fit into a single posting list, we just create another one. The overhead of this approach is that we have to store a duplicate of the key and the metadata. That leads to the problem of big keys: if the key size is close to BTMaxItemSize, compression will give us a really small benefit, if any at all (see picture below).
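To make the overhead arithmetic concrete, a standalone back-of-the-envelope sketch; the 8-byte tuple header, 4-byte line pointer, and 6-byte TID are the usual nbtree sizes, while the 300-TID cap per posting list is just an illustrative number:

#include <stdio.h>

int
main(void)
{
    int keysz = 16;         /* illustrative MAXALIGNed key size */
    int ntids = 1000;       /* heap rows sharing this one key */
    int per_list = 300;     /* assumed cap on TIDs per posting list */

    /* one IndexTuple (8-byte header + key) plus line pointer per heap row */
    int plain = ntids * (8 + keysz + 4);

    /* header + key + line pointer repeat once per posting list,
     * then each TID costs its bare 6 bytes */
    int nlists = (ntids + per_list - 1) / per_list;
    int packed = nlists * (8 + keysz + 4) + ntids * 6;

    printf("plain: %d bytes, posting lists: %d bytes\n", plain, packed);
    return 0;
}

With these numbers the posting-list form takes 6112 bytes against 28000 plain, and the repeated key+metadata term is exactly what grows painful as keysz approaches BTMaxItemSize.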

4. The more item pointers fit into a single posting list, the more rarely we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is, the better. The comment in nbtree.h says: “We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space.” That is quite right for regular items, but a compressed index tuple already contains more than one item. Taking that into account, we can assert that BTMaxItemSize should be ~1/3 of the page size for regular items, and ~1/2 of the page size for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize under some other assumption. The problem I see here is that a varlena high key could be as big as the compressed tuple.
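To make the numbers concrete, here is the arithmetic for the default 8 kB page as a standalone sketch. The structure sizes are the stock ones (24-byte page header, 4-byte line pointers, 16-byte btree special space); the halved variant is only the proposal above, nothing implemented:

#include <stdio.h>
#include <stddef.h>

#define MAXALIGN(x)      (((x) + 7) & ~(size_t) 7)
#define MAXALIGN_DOWN(x) ((x) & ~(size_t) 7)

int
main(void)
{
    size_t blcksz = 8192;
    size_t page_header = 24;    /* SizeOfPageHeaderData */
    size_t item_id = 4;         /* sizeof(ItemIdData) */
    size_t bt_opaque = 16;      /* sizeof(BTPageOpaqueData) */

    /* space left after the header, three line pointers, and special space */
    size_t avail = blcksz - MAXALIGN(page_header + 3 * item_id)
                 - MAXALIGN(bt_opaque);

    printf("1/3 rule (today):    %zu bytes\n", MAXALIGN_DOWN(avail / 3)); /* 2712 */
    printf("1/2 rule (proposed): %zu bytes\n", MAXALIGN_DOWN(avail / 2)); /* 4064 */
    return 0;
}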

5. CREATE INDEX. _bt_load. The algorithm of a btree build is as follows: do the heap scan, add tuples to the spool, sort the data, insert the ordered data from the spool into leaf index pages (_bt_load), and then build the inner pages and the root. The main changes are applied to the _bt_load function. While loading tuples, we do not insert them one by one; instead, we compare each tuple with the previous one, and if they are equal, we put them into a posting list. If the posting list is large enough to 

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Peter Geoghegan
On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown  wrote:
>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |

I see the same.

I use my regular SQL query to see the breakdown of leaf/internal/root pages:

postgres=# with tots as (
  SELECT count(*) c,
  avg(live_items) avg_live_items,
  avg(dead_items) avg_dead_items,
  u.type,
  r.oid
  from (select c.oid,
  c.relpages,
  generate_series(1, c.relpages - 1) i
  from pg_index i
  join pg_opclass op on i.indclass[0] = op.oid
  join pg_am am on op.opcmethod = am.oid
  join pg_class c on i.indexrelid = c.oid
  where am.amname = 'btree') r,
lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
  group by r.oid, type)
select ct.relname table_name,
  tots.oid::regclass::text index_name,
  (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
  upper(type) page_type,
  c npages,
  to_char(avg_live_items, '990.999'),
  to_char(avg_dead_items, '990.999'),
  to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
%' as prop_of_index
  from tots
  join pg_index i on i.indexrelid = tots.oid
  join pg_class ct on ct.oid = i.indrelid
  where tots.oid = 'pgbench_accounts_pkey'::regclass
  order by ct.relnamespace, table_name, index_name, npages, type;
    table_name    │      index_name       │ non_meta_pages │ page_type │ npages │ to_char  │ to_char  │ prop_of_index
──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ R         │      1 │   97.000 │    0.000 │    0.004 %
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ I         │     97 │  282.670 │    0.000 │    0.354 %
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ L         │ 27,323 │  366.992 │    0.000 │   99.643 %
(3 rows)

But this looks healthy -- I see the same with master. And since the
accounts table is listed as 1281 MB, this looks like a plausible ratio
in the size of the table to its primary index (which I would not say
is true of an 87MB primary key index).

Are you sure you have the details right, Thom?
-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Peter Geoghegan
On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown  wrote:
>
> No, I'm not.  I've just realised that all I've been checking is the
> primary key expecting it to change in size, which is, of course,
> nonsense.  I should have been creating an index on the bid field of
> pgbench_accounts and reviewing the size of that.

Right. Because, apart from everything else, unique indexes are not
currently supported.

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Thom Brown
On 4 February 2016 at 15:07, Peter Geoghegan  wrote:
> On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown  wrote:
>>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
>
> I see the same.
>
> I use my regular SQL query to see the breakdown of leaf/internal/root pages:
>
> postgres=# with tots as (
>   SELECT count(*) c,
>   avg(live_items) avg_live_items,
>   avg(dead_items) avg_dead_items,
>   u.type,
>   r.oid
>   from (select c.oid,
>   c.relpages,
>   generate_series(1, c.relpages - 1) i
>   from pg_index i
>   join pg_opclass op on i.indclass[0] = op.oid
>   join pg_am am on op.opcmethod = am.oid
>   join pg_class c on i.indexrelid = c.oid
>   where am.amname = 'btree') r,
> lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
>   group by r.oid, type)
> select ct.relname table_name,
>   tots.oid::regclass::text index_name,
>   (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
>   upper(type) page_type,
>   c npages,
>   to_char(avg_live_items, '990.999'),
>   to_char(avg_dead_items, '990.999'),
>   to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
> %' as prop_of_index
>   from tots
>   join pg_index i on i.indexrelid = tots.oid
>   join pg_class ct on ct.oid = i.indrelid
>   where tots.oid = 'pgbench_accounts_pkey'::regclass
>   order by ct.relnamespace, table_name, index_name, npages, type;
>     table_name    │      index_name       │ non_meta_pages │ page_type │ npages │ to_char  │ to_char  │ prop_of_index
> ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ R         │      1 │   97.000 │    0.000 │    0.004 %
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ I         │     97 │  282.670 │    0.000 │    0.354 %
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ L         │ 27,323 │  366.992 │    0.000 │   99.643 %
> (3 rows)
>
> But this looks healthy -- I see the same with master. And since the
> accounts table is listed as 1281 MB, this looks like a plausible ratio
> in the size of the table to its primary index (which I would not say
> is true of an 87MB primary key index).
>
> Are you sure you have the details right, Thom?

*facepalm*

No, I'm not.  I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense.  I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.

Now I've checked it with the latest patch, and can see it working
fine.  Apologies for the confusion.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Peter Geoghegan
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
 wrote:
> I fixed it in the new version (attached).

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
 * scan key), which could be the last slot + 1.
 */
if (P_ISLEAF(opaque))
+   {
+   if (low <= PageGetMaxOffsetNumber(page))
+   {
+   IndexTuple oitup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, low));
+   /* one excessive check of equality. for possible posting
tuple update or creation */
+   if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+   && (IndexTupleSize(oitup) + sizeof(ItemPointerData) <
BTMaxItemSize(page)))
+   *updposing = true;
+   }
return low;
+   }

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
 ?column?
──
 t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.
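In other words, the check for putting two tuples in the same posting list
should be strict byte-wise equality, roughly this standalone sketch (in
the backend you would get the per-attribute test from datumIsEqual(),
which is what heap_tuple_attr_equals() leans on):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* strict byte-wise equality of two attribute images: any length or byte
 * difference means "not equal", even when the opclass comparator would
 * report equality */
static bool
attr_binary_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

int
main(void)
{
    /* stand-ins for the on-disk images of numeric 5.0 and 5.000: equal
     * under the btree opclass, but not byte-identical */
    const char *a = "5.0";
    const char *b = "5.000";

    printf("binary equal: %d\n",
           (int) attr_binary_equal(a, strlen(a), b, strlen(b)));
    return 0;
}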

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)
 * it off the old page, not the new one, in case we are not at leaf
 * level.
 */
-   state->btps_minkey = CopyIndexTuple(oitup);
+   ItemId iihk = PageGetItemId(opage, P_HIKEY);
+   IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+   state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.

I understand that this code is incomplete/a draft:

+#define MaxPackedIndexTuplesPerPage \
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+   (sizeof(ItemPointerData))))

But why is it different from the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?
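For comparison, the generic ceiling next to what the packed macro
computes, as a standalone sketch. The first formula follows itup.h's
MaxIndexTuplesPerPage; the closing parentheses of the packed one are my
reading of the truncated definition above:

#include <stdio.h>

#define BLCKSZ          8192
#define PAGE_HEADER     24      /* SizeOfPageHeaderData */
#define MAXALIGN(x)     (((x) + 7) & ~7)

int
main(void)
{
    /* each regular tuple costs at least a MAXALIGNed 8-byte header
     * plus a 4-byte line pointer */
    int max_plain = (BLCKSZ - PAGE_HEADER) / (MAXALIGN(8) + 4);

    /* the packed variant counts bare 6-byte TIDs with no per-TID
     * line pointer, which is why the numbers differ */
    int max_packed = (BLCKSZ - PAGE_HEADER) / 6;

    printf("MaxIndexTuplesPerPage:       %d\n", max_plain);     /* 680 */
    printf("MaxPackedIndexTuplesPerPage: %d\n", max_packed);    /* 1361 */
    return 0;
}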

* More explanation of the design would be nice. I suggest modifying
the nbtree README file, so it's easy to tell what the current design
is. It's hard to follow this from the thread. When I reviewed Heikki's
B-Tree patches from a couple of years ago, we spent ~75% of the time
on design, and only ~25% on code.

* I have a paranoid feeling that the deletion locking protocol
(VACUUMing index tuples concurrently and safely) may need special
consideration here. Basically, with the B-Tree code, there are several
complicated locking protocols, like for page splits, page deletion,
and interlocking with vacuum ("super exclusive lock" stuff). These are
why the B-Tree code is complicated in general, and it's very important
to pin down exactly how we deal with each. Ideally, you'd have an
explanation for why your code was correct in each of these existing
cases (especially deletion). With very complicated and important code
like this, it's often wise to be very clear about when we are talking
about your design, and when we are talking about your code. It's
generally too hard to review both at the same time.

Ideally, when you talk about your design, you'll be able to say things
like "it's clear that this existing thing is correct; at least we have
no complaints from the field. Therefore, it must be true that my new
technique is also correct, because it makes that general situation no
worse". Obviously that kind of rigor is just something we aspire to,
and still fall short of at times. Still, it would be nice to
specifically see a reason why the new code isn't special from the
point of view of the super-exclusive lock thing (which is what I mean
by deletion locking protocol + special consideration). Or why it is
special, but that's okay, or whatever. This style of review is normal
when writing B-Tree code. Some other things don't need this rigor, or
have no invariants that need to be respected/used. Maybe this is
obvious to you already, but it isn't obvious to me.

It's okay if you don't know why, but knowing that you don't have a
strong opinion about something is itself useful information.

* I see you disabled the LP_DEAD thing; why? Just because that made
bugs go away?

* Have you done much stress testing? Using pgbench with many
concurrent VACUUM FREEZE operations would be a good idea, if you
haven't already, because that is insistent about getting super
exclusive locks, unlike regular VACUUM.

* Are you keeping the restriction of 1/3 of a buffer 

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-02 Thread Anastasia Lubennikova



29.01.2016 20:43, Thom Brown:

On 29 January 2016 at 16:50, Anastasia Lubennikova
  wrote:

29.01.2016 19:01, Thom Brown:

On 29 January 2016 at 15:47, Aleksander Alekseev
  wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying the patch. Besides that, everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Thank you for testing. I rechecked that, and insertions are really very very
very slow. It seems like a bug.


Okay, now for some badness.  I've restored a database containing 2
tables, one 318MB, another 24kB.  The 318MB table contains 5 million
rows with a sequential id column.  I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


   ➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
...
WARNING:  out of shared memory
WARNING:  out of shared memory
DELETE 667
Time: 22218.804 ms

There were 358 lines of that warning message.  I don't get these
messages without the patch.

Thom

Thank you for this report.
I tried to reproduce it, but I couldn't. Debugging will be much easier now.

I hope I'll fix these issues within the next few days.

BTW, I found a silly mistake: the previous patch contained some unrelated
changes. I fixed it in the new version (attached).

Thanks.  Well I've tested this latest patch, and the warnings are no
longer generated.  However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.


Huh, this patch seems to be enchanted. :) It works fine for me. Did you 
perform "make distclean"?

Anyway, I'll send a new version soon.
I'm just writing here to say that I haven't disappeared and I do remember 
about the issue.
I have even almost fixed the insert speed problem, but I'm very, very busy 
this week.

I'll send an updated patch next week as soon as possible.

Thank you for the attention to this work.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-02 Thread Thom Brown
On 2 February 2016 at 11:47, Anastasia Lubennikova
 wrote:
>
>
> 29.01.2016 20:43, Thom Brown:
>
>> On 29 January 2016 at 16:50, Anastasia Lubennikova
>>   wrote:
>>>
>>> 29.01.2016 19:01, Thom Brown:

 On 29 January 2016 at 15:47, Aleksander Alekseev
   wrote:
>
> I tested this patch on x64 and ARM servers for a few hours today. The
> only problem I could find is that INSERT works considerably slower
> after
> applying the patch. Besides that, everything looks fine - no crashes, tests
> pass, memory doesn't seem to leak, etc.
>>>
>>> Thank you for testing. I rechecked that, and insertions are really very
>>> very
>>> very slow. It seems like a bug.
>>>
>> Okay, now for some badness.  I've restored a database containing 2
>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>> rows with a sequential id column.  I get a problem if I try to delete
>> many rows from it:
>> # delete from contacts where id % 3 != 0 ;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>
> I didn't manage to reproduce this. Thom, could you describe exact steps
> to reproduce this issue please?

 Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
 -r0), which creates an instance with a custom config, which is as
 follows:

 shared_buffers = 8MB
 max_connections = 7
 wal_level = 'hot_standby'
 cluster_name = 'primary'
 max_wal_senders = 3
 wal_keep_segments = 6

 Then create a pgbench data set (I didn't originally use pgbench, but
 you can get the same results with it):

 createdb -p 5530 pgbench
 pgbench -p 5530 -i -s 100 pgbench

 And delete some stuff:

 thom@swift:~/Development/test$ psql -p 5530 pgbench
 Timing is on.
 psql (9.6devel)
 Type "help" for help.


➤ psql://thom@[local]:5530/pgbench

 # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 ...
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 DELETE 667
 Time: 22218.804 ms

 There were 358 lines of that warning message.  I don't get these
 messages without the patch.

 Thom
>>>
>>> Thank you for this report.
>>> I tried to reproduce it, but I couldn't. Debugging will be much easier now.
>>>
>>> I hope I'll fix these issues within the next few days.
>>>
>>> BTW, I found a dummy mistake, the previous patch contains some unrelated
>>> changes. I fixed it in the new version (attached).
>>
>> Thanks.  Well I've tested this latest patch, and the warnings are no
>> longer generated.  However, the index sizes show that the patch
>> doesn't seem to be doing its job, so I'm wondering if you removed too
>> much from it.
>
>
> Huh, this patch seems to be enchanted. :) It works fine for me. Did you perform
> "make distclean"?

Yes.  Just tried it again:

git clean -fd
git stash
make distclean
patch -p1 < ~/Downloads/btree_compression_2.0.patch
../dopg.sh (script I've always used to build with)
pg_ctl start
createdb pgbench
pgbench -i -s 100 pgbench

$ psql pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


 ➤ psql://thom@[local]:5488/pgbench

# \di+
                                     List of relations
 Schema |         Name          | Type  | Owner |      Table       |  Size  | Description
--------+-----------------------+-------+-------+------------------+--------+-------------
 public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
 public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
 public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
(3 rows)

Previously, this would show an index size of 87MB for pgbench_accounts_pkey.

> Anyway, I'll send a new version soon.
> I'm just writing here to say that I haven't disappeared and I do remember
> about the issue.
> I have even almost fixed the insert speed problem, but I'm very, very busy
> this week.
> I'll send an updated patch next week as soon as possible.

Thanks.

> Thank you for your attention to this work.

Thanks for your awesome patches.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Anastasia Lubennikova

28.01.2016 20:03, Thom Brown:
On 28 January 2016 at 16:12, Anastasia Lubennikova wrote:



28.01.2016 18:12, Thom Brown:


On 28 January 2016 at 14:06, Anastasia Lubennikova wrote:


31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate
keys in a B-tree index.
The main idea is to implement posting lists and posting
trees for B-tree index pages, as is already done for GIN.

In a nutshell, effective storage of duplicates in GIN is
organised as follows.
The index stores a single index tuple for each unique key.
That index tuple points to a posting list which contains
pointers to heap tuples (TIDs). If too many rows have the
same key, multiple pages are allocated for the TIDs, and
these constitute a so-called posting tree.
You can find wonderful detailed descriptions in the gin
readme and related articles.
It also makes it possible to apply a compression algorithm
to the posting list/tree and significantly decrease index
size. Read more in the presentation (part 1).

Now a new B-tree index tuple must be inserted for each
table row that we index.
It can cause page splits. Because of MVCC, even a unique
index could contain duplicates.
Storing duplicates in a posting list/tree helps to avoid
superfluous splits.


I'd like to share the progress of my work. So here is a WIP
patch.
It provides effective duplicate handling using posting
lists, the same way GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) +
key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID
(ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID
(ip_blkid, ip_posid)
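
To make the layout above concrete, here is a rough sketch of
accessors for such a "posting" head tuple. The names mirror the
accessors the patch uses (BtreeTupleIsPosting and friends), but
the exact bit layout below is only an assumption for
illustration, not the patch's verified encoding:

#include "postgres.h"
#include "access/itup.h"

/* assumed flag bit marking a head tuple as a posting tuple */
#define BT_POSTING_FLAG 0x8000

/* true if itup carries a posting list rather than a single heap TID */
static inline bool
BtreeTupleIsPosting(IndexTuple itup)
{
	return (ItemPointerGetOffsetNumber(&itup->t_tid) & BT_POSTING_FLAG) != 0;
}

/* number of TIDs in the posting list; assumed to live in ip_blkid */
static inline int
BtreeGetNPosting(IndexTuple itup)
{
	return ItemPointerGetBlockNumber(&itup->t_tid);
}

/* start of the TID array; assumed byte offset stored in ip_posid */
static inline ItemPointer
BtreeGetPosting(IndexTuple itup)
{
	return (ItemPointer) ((char *) itup +
		(ItemPointerGetOffsetNumber(&itup->t_tid) & ~BT_POSTING_FLAG));
}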

It seems that backward compatibility works well without any
changes. But I haven't tested it properly yet.

Here are some test results. They were obtained by the test
functions test_btbuild and test_ginbuild, which you can find
in the attached sql file.
i is the number of distinct values in the index. So i=1 means
that all rows have the same key, and i=10000000 means that
all keys are different.
The other columns contain the index size (MB).

i          B-tree Old    B-tree New    GIN
1          214,234375    87,7109375    10,2109375
10         214,234375    87,7109375    10,71875
100        214,234375    87,4375       15,640625
1000       214,234375    86,2578125    31,296875
10000      214,234375    78,421875     104,3046875
100000     214,234375    65,359375     49,078125
1000000    214,234375    90,140625     106,8203125
10000000   214,234375    214,234375    534,0625


You can note that the last row contains the same index sizes
for B-tree, which is quite logical - there is no compression
if all the keys are distinct.
The other cases look really nice to me.
Next thing to say is that I haven't implemented posting list
compression yet. So there is still potential to decrease the
size of the compressed btree.
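
For reference, GIN gets its additional savings by storing the
sorted TID list as varbyte-encoded deltas. Here is a minimal
standalone sketch of that technique (an illustration of the
idea only, not GIN's actual ginpostinglist.c code):

#include <stddef.h>
#include <stdint.h>

/* pack a TID into one orderable 48-bit value: block number high, offset low */
static uint64_t
tid_to_uint64(uint32_t blkid, uint16_t posid)
{
	return ((uint64_t) blkid << 16) | posid;
}

/*
 * Varbyte-encode the deltas of a sorted TID array: 7 payload bits
 * per byte, high bit set means "more bytes follow".  Dense duplicates
 * produce small deltas, so most TIDs shrink from 6 bytes to 1-2 bytes.
 */
static size_t
encode_tid_deltas(const uint64_t *tids, int ntids, unsigned char *out)
{
	uint64_t	prev = 0;
	unsigned char *p = out;
	int			i;

	for (i = 0; i < ntids; i++)
	{
		uint64_t	delta = tids[i] - prev;		/* input must be sorted */

		prev = tids[i];
		while (delta >= 0x80)
		{
			*p++ = (unsigned char) (delta | 0x80);
			delta >>= 7;
		}
		*p++ = (unsigned char) delta;			/* last byte: high bit clear */
	}
	return p - out;								/* encoded size in bytes */
}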

I'm almost sure there are still some tiny bugs and missing
functions, but on the whole, the patch is ready for testing.
I'd like to get feedback about testing the patch on some
real datasets. Any bug reports and suggestions are welcome.

Here are a couple of useful queries to inspect the data inside
the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral
bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the
near future:
1. Add a storage_parameter 'enable_compression' for the btree
access method, which specifies whether the index handles
duplicates. The default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly
slower with the compressed btree, which is obviously not what
we want.
4. Clean up the code and comments, add related documentation.

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Anastasia Lubennikova

29.01.2016 19:01, Thom Brown:

On 29 January 2016 at 15:47, Aleksander Alekseev wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying the patch. Besides that, everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.
Thank you for testing. I rechecked that, and insertions are really very 
very very slow. It seems like a bug.

Okay, now for some badness.  I've restored a database containing 2
tables, one 318MB, another 24kB.  The 318MB table contains 5 million
rows with a sequential id column.  I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory

I didn't manage to reproduce this. Thom, could you describe the exact
steps to reproduce this issue, please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


  ➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
...
WARNING:  out of shared memory
WARNING:  out of shared memory
DELETE 667
Time: 22218.804 ms

There were 358 lines of that warning message.  I don't get these
messages without the patch.

Thom


Thank you for this report.
I tried to reproduce it, but I couldn't. Debugging will be much easier now.

I hope I'll fix these issues within the next few days.

BTW, I found a silly mistake: the previous patch contained some unrelated
changes. I fixed it in the new version (attached).


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..3908cc1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
 
 typedef struct
 {
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
   ScanKey scankey,
   IndexTuple newtup,
   BTStack stack,
-  Relation heapRel);
+  Relation heapRel,
+  bool *updposing);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	bool updposting = false;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		bool fakeupdposting = false; /* Never update posting in unique index */
 
-		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -200,8 +204,54 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
-		  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		  stack, heapRel, &updposting);
+
+		if (IsSystemRelation(rel))
+			updposting = false;
+
+		/*
+		 * New tuple has the same key with tuple at the page.
+		 * Unite them into one posting.
+		 */
+		if (updposting)
+		{
+			Page		page;
+			IndexTuple olditup, newitup;
+			ItemPointerData *ipd;
+			int nipd;
+
+			page = BufferGetPage(buf);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if (BtreeTupleIsPosting(olditup))
+				nipd = BtreeGetNPosting(olditup);
+			else
+				nipd = 1;
+
+			ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+			/* copy item pointers from old tuple into ipd */
+			if (BtreeTupleIsPosting(olditup))
+				memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+			else
+				memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+			/* add item pointer of the new tuple into ipd */
+			memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+			/*
+			 * Form posting tuple, then delete old tuple and insert posting tuple.
+			 */
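
The hunk is cut off by the archive at this point. The remaining
steps would plausibly look something like the sketch below; the
helper BtreeFormPostingTuple and the WAL-logging details are
assumptions, not the patch's verified code:

			/* build the merged tuple (assumed helper, not verified) */
			newitup = BtreeFormPostingTuple(itup, ipd, nipd + 1);

			START_CRIT_SECTION();
			/* swap the old tuple for the merged posting tuple in place */
			PageIndexTupleDelete(page, offset);
			if (PageAddItem(page, (Item) newitup, IndexTupleSize(newitup),
							offset, false, false) == InvalidOffsetNumber)
				elog(ERROR, "failed to add posting tuple to index page");
			MarkBufferDirty(buf);
			/* the real patch would emit an XLOG record here as well */
			END_CRIT_SECTION();

			_bt_relbuf(rel, buf);
			pfree(ipd);
			pfree(newitup);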

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Thom Brown
On 29 January 2016 at 15:47, Aleksander Alekseev wrote:
> I tested this patch on x64 and ARM servers for a few hours today. The
> only problem I could find is that INSERT works considerably slower after
> applying the patch. Besides that, everything looks fine - no crashes, tests
> pass, memory doesn't seem to leak, etc.
>
>> Okay, now for some badness.  I've restored a database containing 2
>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>> rows with a sequential id column.  I get a problem if I try to delete
>> many rows from it:
>> # delete from contacts where id % 3 != 0 ;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>
> I didn't manage to reproduce this. Thom, could you describe the exact
> steps to reproduce this issue, please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


 ➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
...
WARNING:  out of shared memory
WARNING:  out of shared memory
DELETE 667
Time: 22218.804 ms

There were 358 lines of that warning message.  I don't get these
messages without the patch.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Aleksander Alekseev
I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying the patch. Besides that, everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

> Okay, now for some badness.  I've restored a database containing 2
> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
> rows with a sequential id column.  I get a problem if I try to delete
> many rows from it:
> # delete from contacts where id % 3 != 0 ;
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory

I didn't manage to reproduce this. Thom, could you describe the exact
steps to reproduce this issue, please?




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Thom Brown
On 29 January 2016 at 16:50, Anastasia Lubennikova
 wrote:
> 29.01.2016 19:01, Thom Brown:
>>
>> On 29 January 2016 at 15:47, Aleksander Alekseev wrote:
>>>
>>> I tested this patch on x64 and ARM servers for a few hours today. The
>>> only problem I could find is that INSERT works considerably slower after
>>> applying the patch. Besides that, everything looks fine - no crashes, tests
>>> pass, memory doesn't seem to leak, etc.
>
> Thank you for testing. I rechecked that, and insertions are really very very
> very slow. It seems like a bug.
>
 Okay, now for some badness.  I've restored a database containing 2
 tables, one 318MB, another 24kB.  The 318MB table contains 5 million
 rows with a sequential id column.  I get a problem if I try to delete
 many rows from it:
 # delete from contacts where id % 3 != 0 ;
 WARNING:  out of shared memory
 WARNING:  out of shared memory
 WARNING:  out of shared memory
>>>
>>> I didn't manage to reproduce this. Thom, could you describe the exact
>>> steps to reproduce this issue, please?
>>
>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
>> -r0), which creates an instance with a custom config, which is as
>> follows:
>>
>> shared_buffers = 8MB
>> max_connections = 7
>> wal_level = 'hot_standby'
>> cluster_name = 'primary'
>> max_wal_senders = 3
>> wal_keep_segments = 6
>>
>> Then create a pgbench data set (I didn't originally use pgbench, but
>> you can get the same results with it):
>>
>> createdb -p 5530 pgbench
>> pgbench -p 5530 -i -s 100 pgbench
>>
>> And delete some stuff:
>>
>> thom@swift:~/Development/test$ psql -p 5530 pgbench
>> Timing is on.
>> psql (9.6devel)
>> Type "help" for help.
>>
>>
>>   ➤ psql://thom@[local]:5530/pgbench
>>
>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> ...
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> DELETE 667
>> Time: 22218.804 ms
>>
>> There were 358 lines of that warning message.  I don't get these
>> messages without the patch.
>>
>> Thom
>
>
> Thank you for this report.
> I tried to reproduce it, but I couldn't. Debugging will be much easier now.
>
> I hope I'll fix these issues within the next few days.
>
> BTW, I found a silly mistake: the previous patch contained some unrelated
> changes. I fixed it in the new version (attached).

Thanks.  Well, I've tested this latest patch, and the warnings are no
longer generated.  However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Peter Geoghegan
On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown wrote:
> I'm surprised that efficiencies can't be realised beyond this point.  Your
> results show a sweet spot at around i=10000 / 100000, with it getting
> slightly worse beyond that.  I kind of expected a lot of efficiency where
> all the values are the same, but perhaps that's due to my lack of
> understanding regarding the way they're being stored.

I think that you'd need an I/O bound workload to see significant
benefits. That seems unsurprising. I believe that random I/O from
index writes is a big problem for us.



-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Anastasia Lubennikova

28.01.2016 18:12, Thom Brown:
On 28 January 2016 at 14:06, Anastasia Lubennikova wrote:



31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate
keys in a B-tree index.
The main idea is to implement posting lists and posting
trees for B-tree index pages, as is already done for GIN.

In a nutshell, effective storage of duplicates in GIN is
organised as follows.
The index stores a single index tuple for each unique key. That
index tuple points to a posting list which contains pointers to
heap tuples (TIDs). If too many rows have the same key, multiple
pages are allocated for the TIDs, and these constitute a
so-called posting tree.
You can find wonderful detailed descriptions in the gin readme
and related articles.
It also makes it possible to apply a compression algorithm to
the posting list/tree and significantly decrease index size.
Read more in the presentation (part 1).

Now a new B-tree index tuple must be inserted for each table row
that we index.
It can cause page splits. Because of MVCC, even a unique
index could contain duplicates.
Storing duplicates in a posting list/tree helps to avoid
superfluous splits.


I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists, the
same way GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key,
TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any
changes. But I haven't tested it properly yet.

Here are some test results. They were obtained by the test
functions test_btbuild and test_ginbuild, which you can find
in the attached sql file.
i is the number of distinct values in the index. So i=1 means
that all rows have the same key, and i=10000000 means that
all keys are different.
The other columns contain the index size (MB).

i          B-tree Old    B-tree New    GIN
1          214,234375    87,7109375    10,2109375
10         214,234375    87,7109375    10,71875
100        214,234375    87,4375       15,640625
1000       214,234375    86,2578125    31,296875
10000      214,234375    78,421875     104,3046875
100000     214,234375    65,359375     49,078125
1000000    214,234375    90,140625     106,8203125
10000000   214,234375    214,234375    534,0625


You can note that the last row contains the same index sizes for
B-tree, which is quite logical - there is no compression if all
the keys are distinct.
The other cases look really nice to me.
Next thing to say is that I haven't implemented posting list
compression yet. So there is still potential to decrease the
size of the compressed btree.

I'm almost sure there are still some tiny bugs and missing
functions, but on the whole, the patch is ready for testing.
I'd like to get feedback about testing the patch on some
real datasets. Any bug reports and suggestions are welcome.

Here are a couple of useful queries to inspect the data inside
index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral
bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the
near future:
1. Add a storage_parameter 'enable_compression' for the btree
access method, which specifies whether the index handles
duplicates. The default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower
with the compressed btree, which is obviously not what we want.
4. Clean up the code and comments, add related documentation.


This doesn't apply cleanly against current git head. Have you caught 
up past commit 65c5fcd35?


Thank you for the notice. New patch is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ 

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 17:09, Peter Geoghegan wrote:
> On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown wrote:
>> I'm surprised that efficiencies can't be realised beyond this point.  Your
>> results show a sweet spot at around i=10000 / 100000, with it getting
>> slightly worse beyond that.  I kind of expected a lot of efficiency where
>> all the values are the same, but perhaps that's due to my lack of
>> understanding regarding the way they're being stored.
>
> I think that you'd need an I/O bound workload to see significant
> benefits. That seems unsurprising. I believe that random I/O from
> index writes is a big problem for us.

I was thinking more from the point of view of the index size.  An
index containing 10 million duplicate values is around 40% of the size
of an index with 10 million unique values.
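
That ratio is close to what back-of-envelope arithmetic predicts
for an int4 index (rough figures, ignoring page headers and
fillfactor):

  unique keys:   4-byte line pointer + 8-byte IndexTuple header
                 + 4-byte key maxaligned to 8  =  ~20 bytes/row
  posting list:  header and key are shared by all duplicates, so
                 each extra row costs roughly one 6-byte TID,
                 i.e. ~8 bytes/row once the shared parts and
                 line pointers are amortised

  8 / 20 = 0.4, which matches the observed 87MB / 214MB = ~0.41.
  Varbyte-compressing the TIDs could cut that further, to 1-2
  bytes per row for dense duplicates.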

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 17:03, Thom Brown  wrote:

>
> On 28 January 2016 at 16:12, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>>
>> 28.01.2016 18:12, Thom Brown:
>>
>> On 28 January 2016 at 14:06, Anastasia Lubennikova <
>> a.lubennik...@postgrespro.ru> wrote:
>>
>>>
>>> 31.08.2015 10:41, Anastasia Lubennikova:
>>>
>>> Hi, hackers!
>>> I'm going to begin work on effective storage of duplicate keys in a B-tree
>>> index.
>>> The main idea is to implement posting lists and posting trees for B-tree
>>> index pages, as is already done for GIN.
>>>
>>> In a nutshell, effective storage of duplicates in GIN is organised as
>>> follows.
>>> The index stores a single index tuple for each unique key. That index
>>> tuple points to a posting list which contains pointers to heap tuples
>>> (TIDs). If too many rows have the same key, multiple pages are allocated
>>> for the TIDs, and these constitute a so-called posting tree.
>>> You can find wonderful detailed descriptions in the gin readme
>>> and related articles.
>>> It also makes it possible to apply a compression algorithm to the posting
>>> list/tree and significantly decrease index size. Read more in the
>>> presentation (part 1).
>>>
>>> Now a new B-tree index tuple must be inserted for each table row that we
>>> index.
>>> It can cause page splits. Because of MVCC, even a unique index
>>> could contain duplicates.
>>> Storing duplicates in a posting list/tree helps to avoid superfluous
>>> splits.
>>>
>>>
>>> I'd like to share the progress of my work. So here is a WIP patch.
>>> It provides effective duplicate handling using posting lists, the same
>>> way GIN does it.
>>>
>>> Layout of the tuples on the page is changed in the following way:
>>> before:
>>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>>> (ip_blkid, ip_posid) + key
>>> with patch:
>>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>>
>>> It seems that backward compatibility works well without any changes. But
>>> I haven't tested it properly yet.
>>>
>>> Here are some test results. They were obtained by the test functions
>>> test_btbuild and test_ginbuild, which you can find in the attached sql file.
>>> i is the number of distinct values in the index. So i=1 means that all rows
>>> have the same key, and i=10000000 means that all keys are different.
>>> The other columns contain the index size (MB).
>>>
>>> i          B-tree Old    B-tree New    GIN
>>> 1          214,234375    87,7109375    10,2109375
>>> 10         214,234375    87,7109375    10,71875
>>> 100        214,234375    87,4375       15,640625
>>> 1000       214,234375    86,2578125    31,296875
>>> 10000      214,234375    78,421875     104,3046875
>>> 100000     214,234375    65,359375     49,078125
>>> 1000000    214,234375    90,140625     106,8203125
>>> 10000000   214,234375    214,234375    534,0625
>>> You can note that the last row contains the same index sizes for B-tree,
>>> which is quite logical - there is no compression if all the keys are
>>> distinct.
>>> The other cases look really nice to me.
>>> Next thing to say is that I haven't implemented posting list compression
>>> yet. So there is still potential to decrease the size of the compressed btree.
>>>
>>> I'm almost sure there are still some tiny bugs and missing functions,
>>> but on the whole, the patch is ready for testing.
>>> I'd like to get feedback about testing the patch on some real
>>> datasets. Any bug reports and suggestions are welcome.
>>>
>>> Here are a couple of useful queries to inspect the data inside the index
>>> pages:
>>> create extension pageinspect;
>>> select * from bt_metap('idx');
>>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>>> n) as bt;
>>> select n, bt.* from generate_series(1,1) as n, lateral
>>> bt_page_items('idx', n) as bt;
>>>
>>> And at last, the list of items I'm going to complete in the near future:
>>> 1. Add a storage_parameter 'enable_compression' for the btree access method,
>>> which specifies whether the index handles duplicates. The default is 'off'.
>>> 2. Bring back microvacuum functionality for compressed indexes.
>>> 3. Improve insertion speed. Insertions became significantly slower with
>>> the compressed btree, which is obviously not what we want.
>>> 4. Clean up the code and comments, add related documentation.
>>>
>>
>> This doesn't apply cleanly against current git head.  Have you caught up
>> past commit 65c5fcd35?
>>
>>
>> Thank you for the notice. New patch is attached.
>>
>
> Thanks for the quick rebase.
>
> Okay, a quick check with pgbench:
>
> CREATE INDEX ON pgbench_accounts(bid);
>
> Timing
> Scale: master / patch
> 100: 10657ms / 13555ms (rechecked and got 9745ms)
> 500: 56909ms / 56985ms
>
> Size
> Scale: master / patch
> 100: 214MB / 87MB (40.7%)
> 500: 1071MB / 437MB (40.8%)
>
> No performance issues from what I can tell.

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Anastasia Lubennikova

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in
a B-tree index.
The main idea is to implement posting lists and posting trees for
B-tree index pages, as is already done for GIN.


In a nutshell, effective storage of duplicates in GIN is organised as
follows.
The index stores a single index tuple for each unique key. That index
tuple points to a posting list which contains pointers to heap tuples
(TIDs). If too many rows have the same key, multiple pages are
allocated for the TIDs, and these constitute a so-called posting tree.
You can find wonderful detailed descriptions in the gin readme
and related articles.
It also makes it possible to apply a compression algorithm to the
posting list/tree and significantly decrease index size. Read more
in the presentation (part 1).


Now a new B-tree index tuple must be inserted for each table row that
we index.
It can cause page splits. Because of MVCC, even a unique index
could contain duplicates.

Storing duplicates in a posting list/tree helps to avoid superfluous splits.


I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists, the same
way GIN does it.


Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID 
(ip_blkid, ip_posid) + key

with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, 
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)


It seems that backward compatibility works well without any changes. But 
I haven't tested it properly yet.


Here are some test results. They were obtained by the test functions
test_btbuild and test_ginbuild, which you can find in the attached sql file.
i is the number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.

The other columns contain the index size (MB).

i          B-tree Old    B-tree New    GIN
1          214,234375    87,7109375    10,2109375
10         214,234375    87,7109375    10,71875
100        214,234375    87,4375       15,640625
1000       214,234375    86,2578125    31,296875
10000      214,234375    78,421875     104,3046875
100000     214,234375    65,359375     49,078125
1000000    214,234375    90,140625     106,8203125
10000000   214,234375    214,234375    534,0625


You can note that the last row contains the same index sizes for B-tree, 
which is quite logical - there is no compression if all the keys are 
distinct.

The other cases look really nice to me.
Next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease the size of the compressed btree.


I'm almost sure there are still some tiny bugs and missing functions,
but on the whole, the patch is ready for testing.
I'd like to get feedback about testing the patch on some real
datasets. Any bug reports and suggestions are welcome.


Here are a couple of useful queries to inspect the data inside the index
pages:

create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', 
n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral 
bt_page_items('idx', n) as bt;


And at last, the list of items I'm going to complete in the near future:
1. Add a storage_parameter 'enable_compression' for the btree access method,
which specifies whether the index handles duplicates. The default is 'off'.

2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
the compressed btree, which is obviously not what we want.

4. Clean up the code and comments, add related documentation.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 77c2fdf..3b61e8f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
 
 typedef struct
 {
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
   ScanKey scankey,
   IndexTuple newtup,
   BTStack stack,
-  Relation heapRel);
+  Relation heapRel,
+  bool *updposing);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	bool updposting = false;
 
 	/* we need an insertion scan key to do our search, so build one */

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 16:12, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

>
> 28.01.2016 18:12, Thom Brown:
>
> On 28 January 2016 at 14:06, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>>
>> 31.08.2015 10:41, Anastasia Lubennikova:
>>
>> Hi, hackers!
>> I'm going to begin work on effective storage of duplicate keys in a B-tree
>> index.
>> The main idea is to implement posting lists and posting trees for B-tree
>> index pages, as is already done for GIN.
>>
>> In a nutshell, effective storage of duplicates in GIN is organised as
>> follows.
>> The index stores a single index tuple for each unique key. That index tuple
>> points to a posting list which contains pointers to heap tuples (TIDs). If
>> too many rows have the same key, multiple pages are allocated for the
>> TIDs, and these constitute a so-called posting tree.
>> You can find wonderful detailed descriptions in the gin readme
>> and related articles.
>> It also makes it possible to apply a compression algorithm to the posting
>> list/tree and significantly decrease index size. Read more in the
>> presentation (part 1).
>>
>> Now a new B-tree index tuple must be inserted for each table row that we
>> index.
>> It can cause page splits. Because of MVCC, even a unique index could
>> contain duplicates.
>> Storing duplicates in a posting list/tree helps to avoid superfluous splits.
>>
>>
>> I'd like to share the progress of my work. So here is a WIP patch.
>> It provides effective duplicate handling using posting lists, the same way
>> GIN does it.
>>
>> Layout of the tuples on the page is changed in the following way:
>> before:
>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>> (ip_blkid, ip_posid) + key
>> with patch:
>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>
>> It seems that backward compatibility works well without any changes. But
>> I haven't tested it properly yet.
>>
>> Here are some test results. They were obtained by the test functions
>> test_btbuild and test_ginbuild, which you can find in the attached sql file.
>> i is the number of distinct values in the index. So i=1 means that all rows
>> have the same key, and i=10000000 means that all keys are different.
>> The other columns contain the index size (MB).
>>
>> i          B-tree Old    B-tree New    GIN
>> 1          214,234375    87,7109375    10,2109375
>> 10         214,234375    87,7109375    10,71875
>> 100        214,234375    87,4375       15,640625
>> 1000       214,234375    86,2578125    31,296875
>> 10000      214,234375    78,421875     104,3046875
>> 100000     214,234375    65,359375     49,078125
>> 1000000    214,234375    90,140625     106,8203125
>> 10000000   214,234375    214,234375    534,0625
>> You can note that the last row contains the same index sizes for B-tree,
>> which is quite logical - there is no compression if all the keys are
>> distinct.
>> The other cases look really nice to me.
>> Next thing to say is that I haven't implemented posting list compression
>> yet. So there is still potential to decrease the size of the compressed btree.
>>
>> I'm almost sure there are still some tiny bugs and missing functions, but
>> on the whole, the patch is ready for testing.
>> I'd like to get feedback about testing the patch on some real datasets.
>> Any bug reports and suggestions are welcome.
>>
>> Here are a couple of useful queries to inspect the data inside the index
>> pages:
>> create extension pageinspect;
>> select * from bt_metap('idx');
>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>> n) as bt;
>> select n, bt.* from generate_series(1,1) as n, lateral
>> bt_page_items('idx', n) as bt;
>>
>> And at last, the list of items I'm going to complete in the near future:
>> 1. Add a storage_parameter 'enable_compression' for the btree access method,
>> which specifies whether the index handles duplicates. The default is 'off'.
>> 2. Bring back microvacuum functionality for compressed indexes.
>> 3. Improve insertion speed. Insertions became significantly slower with
>> the compressed btree, which is obviously not what we want.
>> 4. Clean up the code and comments, add related documentation.
>>
>
> This doesn't apply cleanly against current git head.  Have you caught up
> past commit 65c5fcd35?
>
>
> Thank you for the notice. New patch is attached.
>

Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point.  Your
results show a sweet spot at around i=10000 / 100000, with it getting
slightly worse beyond that.  I kind of expected a lot of efficiency where
all the values are the same, but perhaps that's due to my lack of
understanding regarding the way they're being stored.

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

>
> 31.08.2015 10:41, Anastasia Lubennikova:
>
> Hi, hackers!
> I'm going to begin work on effective storage of duplicate keys in a B-tree
> index.
> The main idea is to implement posting lists and posting trees for B-tree
> index pages, as is already done for GIN.
>
> In a nutshell, effective storage of duplicates in GIN is organised as
> follows.
> The index stores a single index tuple for each unique key. That index tuple
> points to a posting list which contains pointers to heap tuples (TIDs). If
> too many rows have the same key, multiple pages are allocated for the
> TIDs, and these constitute a so-called posting tree.
> You can find wonderful detailed descriptions in the gin readme
> and related articles.
> It also makes it possible to apply a compression algorithm to the posting
> list/tree and significantly decrease index size. Read more in the
> presentation (part 1).
>
> Now a new B-tree index tuple must be inserted for each table row that we
> index.
> It can cause page splits. Because of MVCC, even a unique index could
> contain duplicates.
> Storing duplicates in a posting list/tree helps to avoid superfluous splits.
>
>
> I'd like to share the progress of my work. So here is a WIP patch.
> It provides effective duplicate handling using posting lists, the same way
> GIN does it.
>
> Layout of the tuples on the page is changed in the following way:
> before:
> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
> (ip_blkid, ip_posid) + key
> with patch:
> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>
> It seems that backward compatibility works well without any changes. But I
> haven't tested it properly yet.
>
> Here are some test results. They were obtained by the test functions
> test_btbuild and test_ginbuild, which you can find in the attached sql file.
> i is the number of distinct values in the index. So i=1 means that all rows
> have the same key, and i=10000000 means that all keys are different.
> The other columns contain the index size (MB).
>
> i          B-tree Old    B-tree New    GIN
> 1          214,234375    87,7109375    10,2109375
> 10         214,234375    87,7109375    10,71875
> 100        214,234375    87,4375       15,640625
> 1000       214,234375    86,2578125    31,296875
> 10000      214,234375    78,421875     104,3046875
> 100000     214,234375    65,359375     49,078125
> 1000000    214,234375    90,140625     106,8203125
> 10000000   214,234375    214,234375    534,0625
> You can note that the last row contains the same index sizes for B-tree,
> which is quite logical - there is no compression if all the keys are
> distinct.
> The other cases look really nice to me.
> Next thing to say is that I haven't implemented posting list compression
> yet. So there is still potential to decrease the size of the compressed btree.
>
> I'm almost sure there are still some tiny bugs and missing functions, but
> on the whole, the patch is ready for testing.
> I'd like to get feedback about testing the patch on some real datasets.
> Any bug reports and suggestions are welcome.
>
> Here are a couple of useful queries to inspect the data inside the index
> pages:
> create extension pageinspect;
> select * from bt_metap('idx');
> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
> n) as bt;
> select n, bt.* from generate_series(1,1) as n, lateral
> bt_page_items('idx', n) as bt;
>
> And at last, the list of items I'm going to complete in the near future:
> 1. Add a storage_parameter 'enable_compression' for the btree access method,
> which specifies whether the index handles duplicates. The default is 'off'.
> 2. Bring back microvacuum functionality for compressed indexes.
> 3. Improve insertion speed. Insertions became significantly slower with
> the compressed btree, which is obviously not what we want.
> 4. Clean up the code and comments, add related documentation.
>

This doesn't apply cleanly against current git head.  Have you caught up
past commit 65c5fcd35?

Thom