Re: [HACKERS] SP-GiST bug.

2014-06-05 Thread Tom Lane
I wrote:
 I think what we have to do is use a different dummy value for the node
 label of a pushed-down allTheSame tuple than we do for end-of-string.
 This requires widening the node labels from uint8 to (at least) int16.
 However, I think that's essentially free because pass-by-value node
 labels are stored as Datums anyway.  In fact, it might even be upward
 compatible, at least for cases that aren't broken today.

Here's a draft patch along these lines.  AFAICT it is fully upward
compatible with existing indexes, although in cases where the current code
creates a zero node label for strings ending at that level, this code will
choose to push new strings into a new child node with label -1 instead, so
that there's some very small added inefficiency.  I don't think we need to
tell people to reindex if we use this fix, though.  I went with new
dummy labels of -1 and -2 partially so that an existing zero label would
not be a hazard for future insertions, and partly with the thought that if
we ever allow embedded \0 in strings, this coding would be able to handle
it with minimal adjustment.

Note that this fix slightly reduces the maximum safe length of a prefix
string (SPGIST_MAX_PREFIX_LENGTH).  In an existing index, there might be
prefixes longer than that.  The code will still work, but it's
theoretically possible that addition of nodes to such a tuple could result
in SPGiST inner tuple size exceeds maximum errors, if you managed to
populate all possible next-byte values at that tuple.  I think the odds of
this being a problem in the field are negligible (and if it did happen,
reindexing would fix the problem).

regards, tom lane

diff --git a/src/backend/access/spgist/spgtextproc.c b/src/backend/access/spgist/spgtextproc.c
index 5b7a5a0..b278916 100644
*** a/src/backend/access/spgist/spgtextproc.c
--- b/src/backend/access/spgist/spgtextproc.c
***
*** 3,8 
--- 3,34 
   * spgtextproc.c
   *	  implementation of radix tree (compressed trie) over text
   *
+  * In a text_ops SPGiST index, inner tuples can have a prefix which is the
+  * common prefix of all strings indexed under that tuple.  The node labels
+  * represent the next byte of the string(s) after the prefix.  Assuming we
+  * always use the longest possible prefix, we will get more than one node
+  * label unless the prefix length is restricted by SPGIST_MAX_PREFIX_LENGTH.
+  *
+  * To reconstruct the indexed string for any index entry, concatenate the
+  * inner-tuple prefixes and node labels starting at the root and working
+  * down to the leaf entry, then append the datum in the leaf entry.
+  * (While descending the tree, level is the number of bytes reconstructed
+  * so far.)
+  *
+  * However, there are two special cases for node labels: -1 indicates that
+  * there are no more bytes after the prefix-so-far, and -2 indicates that we
+  * had to split an existing prefix-less allTheSame tuple (in such a case we
+  * have to create a node label that doesn't correspond to any string byte).
+  * In either case, the node label does not contribute anything to the
+  * reconstructed string.
+  *
+  * Previously, we used a node label of zero for both special cases, but
+  * this was problematic because one can't tell whether a string ending at
+  * the current level can be pushed down into such a child node.  For
+  * backwards compatibility, we still support such node labels for reading;
+  * but no new entries will ever be pushed down into a zero-labeled child.
+  * No new entries ever get pushed into a -2-labeled child, either.
+  *
   *
   * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
***
*** 24,51 
  
  /*
   * In the worst case, an inner tuple in a text radix tree could have as many
!  * as 256 nodes (one for each possible byte value).  Each node can take 16
!  * bytes on MAXALIGN=8 machines.  The inner tuple must fit on an index page
!  * of size BLCKSZ.  Rather than assuming we know the exact amount of overhead
   * imposed by page headers, tuple headers, etc, we leave 100 bytes for that
   * (the actual overhead should be no more than 56 bytes at this writing, so
   * there is slop in this number).  So we can safely create prefixes up to
!  * BLCKSZ - 256 * 16 - 100 bytes long.  Unfortunately, because 256 * 16 is
!  * already 4K, there is no safe prefix length when BLCKSZ is less than 8K;
   * it is always possible to get SPGiST inner tuple size exceeds maximum
   * if there are too many distinct next-byte values at a given place in the
   * tree.  Since use of nonstandard block sizes appears to be negligible in
   * the field, we just live with that fact for now, choosing a max prefix
   * size of 32 bytes when BLCKSZ is configured smaller than default.
   */
! #define SPGIST_MAX_PREFIX_LENGTH	Max((int) (BLCKSZ - 256 * 16 - 100), 32)
  
  /* Struct for sorting values in 

Re: [HACKERS] SP-GiST bug.

2014-06-03 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
 The point I'm making is that the scenario your test case exposes is not
 an infinite loop of picksplits, but an infinite loop of choose calls.

 Thank you, now I see what I missed before. After some brainstorming, it's 
 possible to use '\0' only as end of string marker.  The idea is: when we 
 split 
 allthesame tuple to upper/lower we copy node label to upper tuple instead of 
 set 
 it to '\0'. Node labels of lower tuple will be unchanged but should be 
 ignored 
 for reconstructionValue - and they are ignored because of allTheSame flag.  
 See 
 attached patch.

I don't believe this patch works at all.  In particular, for an allTheSame
node occurring at the root level, omitting the nodeChar from the
reconstructed value is certainly wrong since there was no parent that
could have had a label containing the same character.  More generally, the
patch is assuming that any allTheSame tuple is one that is underneath a
split node; but that's surely wrong (where did such a tuple come from
before the split happened?).

If the root case were the only possible one then we could fix the problem
by adding a test on whether level == 0, but AFAICS, allTheSame tuples can
also get created at lower tree levels.  All you need is a bunch of strings
that are duplicates for more than the maximum prefix length.

I think what we have to do is use a different dummy value for the node
label of a pushed-down allTheSame tuple than we do for end-of-string.
This requires widening the node labels from uint8 to (at least) int16.
However, I think that's essentially free because pass-by-value node
labels are stored as Datums anyway.  In fact, it might even be upward
compatible, at least for cases that aren't broken today.

 I rewrited a patch to fix missed way - allTheSame result of picksplit and 
 tooBig 
 is set. I believe this patch is still needed because it could make a tree 
 more 
 efficient as it was demonstrated for quad tree.

The question here continues to be how sure we are that the case described
in checkAllTheSame's header comment can't actually occur.  We certainly
thought it could when we originally wrote this stuff.  I think possibly
we were wrong, but I'd like to see a clear proof of that before we discuss
dropping the logic that's meant to cope with the situation.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-30 Thread Teodor Sigaev

The point I'm making is that the scenario your test case exposes is not
an infinite loop of picksplits, but an infinite loop of choose calls.


Thank you, now I see what I missed before. After some brainstorming, it's 
possible to use '\0' only as end of string marker.  The idea is: when we split 
allthesame tuple to upper/lower we copy node label to upper tuple instead of set 
it to '\0'. Node labels of lower tuple will be unchanged but should be ignored 
for reconstructionValue - and they are ignored because of allTheSame flag.  See 
attached patch.


Unfortunately, it will break on-disk compatibility. Although it will not cause a 
panic/error/coredump allTheSame approach will not find tuples. Users should 
recreate spgist indexes over text column.




It's certainly possible that we could/should change what checkAllTheSame
is doing --- on re-reading the comment, I'm not really sure that the
scenario it's concerned about can happen.  However, that will not fix


I rewrited a patch to fix missed way - allTheSame result of picksplit and tooBig 
is set. I believe this patch is still needed because it could make a tree more 
efficient as it was demonstrated for quad tree.


this patch doesn't break on-disk compatibility, although index recreation is 
recommended.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_textproc.patch.gz
Description: Unix tar archive


spgist_allthesame.patch.2.gz
Description: Unix tar archive

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-30 Thread Bruce Momjian
On Fri, May 30, 2014 at 06:55:18PM +0400, Teodor Sigaev wrote:
 The point I'm making is that the scenario your test case exposes is not
 an infinite loop of picksplits, but an infinite loop of choose calls.
 
 Thank you, now I see what I missed before. After some brainstorming,
 it's possible to use '\0' only as end of string marker.  The idea
 is: when we split allthesame tuple to upper/lower we copy node label
 to upper tuple instead of set it to '\0'. Node labels of lower tuple
 will be unchanged but should be ignored for reconstructionValue -
 and they are ignored because of allTheSame flag.  See attached
 patch.
 
 Unfortunately, it will break on-disk compatibility. Although it will
 not cause a panic/error/coredump allTheSame approach will not find
 tuples. Users should recreate spgist indexes over text column.

If we bump the system catalog version, pg_upgrade can mark those indexes
as invalid and output a script to reindex them.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-30 Thread Teodor Sigaev

If we bump the system catalog version, pg_upgrade can mark those indexes
as invalid and output a script to reindex them.



Between 9.3 - 9.4 catalog version is changed anyway.
--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-30 Thread Bruce Momjian
On Fri, May 30, 2014 at 08:19:07PM +0400, Teodor Sigaev wrote:
 If we bump the system catalog version, pg_upgrade can mark those indexes
 as invalid and output a script to reindex them.
 
 
 Between 9.3 - 9.4 catalog version is changed anyway.

Right.  I was thinking of beta users between beta1 and beta2 as well. 
We would need to trigger the index invalidation on the catalog version
used for the commit that changes the index format, not the major version
like 9.4.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-29 Thread Teodor Sigaev

I don't think that checkAllTheSame is broken, but it probably would be
after this patch --- ISTM you're breaking the corner case described in the
header comment for checkAllTheSame, where we need to force splitting of a
set of existing tuples that the opclass can't distinguish.
INHO, this patch will not break - because it forces (in corner case) to create a 
node for new tuple but exclude it from list to insert. On next iteration we will 
find a needed node with empty list.


Comment:
 * If we know that the leaf tuples wouldn't all fit on one page, then we
 * exclude the last tuple (which is the incoming new tuple that forced a split)
 * from the check to see if more than one node is used.  The reason for this
 * is that if the existing tuples are put into only one chain, then even if
 * we move them all to an empty page, there would still not be room for the
 * new tuple, so we'd get into an infinite loop of picksplit attempts.

If we reserve a node for new tuple then on next iteration we will not fall into 
picksplit again - we will have an empty list. So new tuple will fall into 
another page.


BTW, look for following example:
create table xxx (p point);

insert into xxx select '(1,1)'::point from generate_series(1, 226) as v;
insert into xxx values ('(3,3)'::point);

create index xxxidx on xxx using spgist (p quad_point_ops);

easy to see that the single node is allTheSame although underlying leafs are 
different. Due to allTheSame search logic scans are correct but inefficient. And 
patch fixes it too.




What actually is broken, I think, is the spgist text opclass, because it's
using a zero node label for two different situations.  The scenario we're
looking at here is:


Agree for different situations, but disagree that's broken :)


 It seems like we have to
 distinguish the case of zero-length string from that of a dummy label
 for a pushed-down allTheSame tuple.

Actually, we do this: if allTheSame is true then nodes have a dummy labels.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-29 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
 I don't think that checkAllTheSame is broken, but it probably would be
 after this patch --- ISTM you're breaking the corner case described in the
 header comment for checkAllTheSame, where we need to force splitting of a
 set of existing tuples that the opclass can't distinguish.

 INHO, this patch will not break - because it forces (in corner case) to 
 create a 
 node for new tuple but exclude it from list to insert. On next iteration we 
 will 
 find a needed node with empty list.

 Comment:
   * If we know that the leaf tuples wouldn't all fit on one page, then we
   * exclude the last tuple (which is the incoming new tuple that forced a 
 split)
   * from the check to see if more than one node is used.  The reason for this
   * is that if the existing tuples are put into only one chain, then even if
   * we move them all to an empty page, there would still not be room for the
   * new tuple, so we'd get into an infinite loop of picksplit attempts.

 If we reserve a node for new tuple then on next iteration we will not fall 
 into 
 picksplit again - we will have an empty list. So new tuple will fall into 
 another page.

The point I'm making is that the scenario your test case exposes is not
an infinite loop of picksplits, but an infinite loop of choose calls.
There are certainly ways to get into that loop that don't involve this
specific checkAllTheSame logic.  Here's an example:

regression=# create table xxx ( t text);
CREATE TABLE
regression=# create index xxxidx on xxx using spgist (t);
CREATE INDEX
regression=# insert into xxx select repeat('x',4000);
INSERT 0 1
regression=# insert into xxx select repeat('x',4000);
INSERT 0 1
regression=# insert into xxx select repeat('x',4000);
INSERT 0 1
regression=# insert into xxx select repeat('x',3996);

In this example, we get a picksplit at the third insert, and here we
*must* take the allTheSame path because the values are, in fact, all the
same (the SPGIST_MAX_PREFIX_LENGTH constraint means we can only put 3996
characters into the prefix, and then the 3997'th character of each value
is 'x').  Then the fourth insert triggers this same infinite loop, because
spg_text_choose is asked how to insert the slightly-shorter value into the
allTheSame tuple, and it does the wrong thing.

It's certainly possible that we could/should change what checkAllTheSame
is doing --- on re-reading the comment, I'm not really sure that the
scenario it's concerned about can happen.  However, that will not fix
the bug in spgtextproc.c; it will only block one avenue for reaching
the bug.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug.

2014-05-28 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
 create table xxx ( t text);

 insert into xxx select 'x' || v::text from generate_series(1, 291) as v;
 insert into xxx  values ('');

 create index xxxidx on xxx using spgist (t);

Fun!

 And postgres will eat memory forever. It seems to me that checkAllTheSame 
 wrongly decides that all tuples are the same. All tuples except tuple to be 
 inserted have the same character 'x' and only new tuple has '\0'.

I don't think that checkAllTheSame is broken, but it probably would be
after this patch --- ISTM you're breaking the corner case described in the
header comment for checkAllTheSame, where we need to force splitting of a
set of existing tuples that the opclass can't distinguish.

What actually is broken, I think, is the spgist text opclass, because it's
using a zero node label for two different situations.  The scenario we're
looking at here is:

1. We call picksplit with all 292 of these entries.  Not surprisingly,
it puts the first 291 into a single node with label 'x' and the last one
into another node with label '\0'.

2. Because this is too many to fit on one index page, checkAllTheSame
decides it had better create an allTheSame tuple for the first 291 and
then try again to insert the empty string into that tuple.  While you
could argue whether this is *necessary*, it is not *incorrect*, so ISTM
the opclass had better cope with it.

3. We call spg_text_choose with the allTheSame tuple and the empty-string
value to be inserted.  It finds the empty-string value doesn't match
(since all the nodes have label 'x') so it requests a split:

out-resultType = spgSplitTuple;
out-result.splitTuple.prefixHasPrefix = in-hasPrefix;
out-result.splitTuple.prefixPrefixDatum = in-prefixDatum;
out-result.splitTuple.nodeLabel = UInt8GetDatum('\0');
out-result.splitTuple.postfixHasPrefix = false;

After splitting, we have a new upper tuple with a single node with label
'\0', pointing to a new allTheSame tuple containing the original 291
entries.

4. We call spg_text_choose with the new upper tuple and the empty-string
value to be inserted.  It looks for a node labeled '\0', and finds it, so
it reports that we should descend into that node --- ie, down to the new
allTheSame tuple.

5. We call spg_text_choose with the new allTheSame tuple and the
empty-string value to be inserted.  This is exactly like the situation
at step 3, so we're in an infinite loop.

It doesn't look to me like the core code has done anything wrong here;
it just did what the opclass requested.  Rather, the problem is we're
using '\0' node label both to represent a no op label descent and
to represent we're past the end of the string.

I'm afraid there may not be any way to fix this without breaking on-disk
compatibility for spgist text indexes :-(.  It seems like we have to
distinguish the case of zero-length string from that of a dummy label
for a pushed-down allTheSame tuple.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SP-GiST bug.

2014-05-20 Thread Teodor Sigaev

Hi!

create table xxx ( t text);

insert into xxx select 'x' || v::text from generate_series(1, 291) as v;
insert into xxx  values ('');

create index xxxidx on xxx using spgist (t);

And postgres will eat memory forever. It seems to me that checkAllTheSame 
wrongly decides that all tuples are the same. All tuples except tuple to be 
inserted have the same character 'x' and only new tuple has '\0'. But 
checkAllTheSame() removes new tuple because set of tuples is too big to fit page 
and founds that all tuples are the same. Patch attached, suppose, all branches 
with spgist should be fixed.


Original data is too big to send in e-mail list or even send link, and the bug 
caused not in root page as in example above.





--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_allthesame.patch.gz
Description: Unix tar archive

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST bug and fix

2013-11-02 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
 SP-GiST has a bug during creation:
 % create table ranges as select int4range( (random()*5)::int,
 (random()*5)::int+5) as range
 from generate_series(1,100) x;

 %  create index ranges_range_spgist_idx on ranges using spgist(range);
 ERROR: unexpected spgdoinsert() failure

 Bug is discovered by Jonathan S. Katz jonathan.k...@excoventures.com

 When it was found deadlock possibility it was fixed by using 
 ConditionalLockBuffer() instead of LockBuffer(EXCLUSIVE) and retrying 
 insertion 
 from the scratch. Build index method doesn't believe in concurrent access and 
 throws an error if ConditionalLockBuffer() fails. But I missed that 
 checkpointer process could take a share lock on buffer to write it on disk.

 Attached patch just intoduces retrying during index creation.

The comments could use some work (both here and in spgdoinsert), but
I agree this is a real bug and the fix is sane.  Will fix the comments
and commit.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SP-GiST bug and fix

2013-10-31 Thread Teodor Sigaev

SP-GiST has a bug during creation:
% create table ranges as select int4range( (random()*5)::int,
   (random()*5)::int+5) as range
from generate_series(1,100) x;

%  create index ranges_range_spgist_idx on ranges using spgist(range);
ERROR: unexpected spgdoinsert() failure

Bug is discovered by Jonathan S. Katz jonathan.k...@excoventures.com

When it was found deadlock possibility it was fixed by using 
ConditionalLockBuffer() instead of LockBuffer(EXCLUSIVE) and retrying insertion 
from the scratch. Build index method doesn't believe in concurrent access and 
throws an error if ConditionalLockBuffer() fails. But I missed that 
checkpointer process could take a share lock on buffer to write it on disk.


Attached patch just intoduces retrying during index creation.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_build.patch.gz
Description: Unix tar archive

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers