Re: Yet another fast GiST build

2021-11-25 Thread Andrey Borodin



> 17 нояб. 2021 г., в 16:33, Daniel Gustafsson  написал(а):
> 
>> On 5 Jul 2021, at 08:27, Emre Hasegeli  wrote:
> 
>> ...
>> 
>> I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
>> seems like something rather useful.

Emre, thanks for the review! And sorry for this delay. Properly answering 
questions is still in my queue.

> 
> These questions have gone unanswered since July, and the patch fails to apply
> anymore.  Is there an updated version on the way?

Yes. In future versions I also want to address IOS vs pinned buffers issue[0]. 
And, probably, sort items on leaf pages. And, maybe, split pages more 
intelligently.
I hope to get to this in December.

I'll post rebased version ASAP.

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5q...@mail.gmail.com



Re: Yet another fast GiST build

2021-11-17 Thread Daniel Gustafsson
> On 5 Jul 2021, at 08:27, Emre Hasegeli  wrote:

> ...
> 
> I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
> seems like something rather useful.

These questions have gone unanswered since July, and the patch fails to apply
anymore.  Is there an updated version on the way?

--
Daniel Gustafsson   https://vmware.com/





Re: Yet another fast GiST build

2021-07-05 Thread Emre Hasegeli
I tried reviewing the remaining patches.  It seems to work correctly,
and passes the tests on my laptop.

> In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower), 
> because it seems to me correct. I've followed rule of thumb: every sort 
> function must extract and use "lower" somehow. Though I suspect numeric a 
> bit. Is it regular varlena?

As far as I understand, we cannot use the sortsupport functions from
the btree operator classes because the btree_gist extension handles
things differently.  This is unfortunate and a source of bugs [1], but
we cannot do anything about it.

Given that the lower and upper datums must be the same for the leaf
nodes, it makes sense to me to compare one of them.

Using numeric_cmp() for numeric in line with using bttextcmp() for text.

> +   /*
> +* Numeric has abbreviation routines in numeric.c, but we don't try to use
> +* them here. Maybe later.
> +*/

This is also true for text.  Perhaps we should also add a comment there.

> PFA patchset with v6 intact + two fixes of discovered issues.

> +   /* Use byteacmp(), like gbt_bitcmp() does */

We can improve this comment by incorporating Heikki's previous email:

> Ok, I think I understand that now. In btree_gist, the *_cmp() function
> operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf
> values. For all other datatypes, the leaf and non-leaf representation is
> the same, but for bit/varbit, the non-leaf representation is different.
> The leaf representation is VarBit, and non-leaf is just the bits without
> the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to
> just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len'
> field separately. That's subtle, and 100% uncommented.

I think patch number 3 should be squashed to patch number 1.

I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
seems like something rather useful.

[1] 
https://www.postgresql.org/message-id/flat/201010112055.o9BKtZf7011251%40wwwmaster.postgresql.org




Re: Yet another fast GiST build

2021-04-07 Thread Heikki Linnakangas

On 07/04/2021 22:41, Andrey Borodin wrote:

I see there is a problem with "SET client_min_messages = DEBUG1;" on
buildfarm. I think these checks were necessary to make sure test
paths are triggered. When we know that code paths are tested, maybe
we can omit checks?
Yeah. We don't have very reliable coverage of different GiST build 
methods, as noted earlier in this thread. But that's not this patch's fault.


I've been investigating the varbit issue, and realized that all the 
comparison functions in this patch for varlen datatypes are broken. The 
representation that the sortsupport function sees is the one that the 
'compress' function returns. The fixed-length versions got this right, 
but the varlen versions assume that the input is a Datum of the original 
datatype. It happens to not crash, because the representation returned 
by gbt_var_compress() is also a varlena, and all of the comparison 
functions tolerate the garbage inputs, but it's bogus. The correct 
pattern would be something like this (without the debugging NOTICE, of 
course):



static int
gbt_text_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
{
GBT_VARKEY_R ra = gbt_var_key_readable((GBT_VARKEY *) 
PG_DETOAST_DATUM(a));
GBT_VARKEY_R rb = gbt_var_key_readable((GBT_VARKEY *) 
PG_DETOAST_DATUM(b));

int x = DatumGetInt32(DirectFunctionCall2Coll(bttextcmp,
   
  ssup->ssup_collation,

 PointerGetDatum(a),

 PointerGetDatum(b)));
elog(NOTICE, "cmp: %s vs %s: %d",
 TextDatumGetCString(ra.lower),
 TextDatumGetCString(rb.lower),
 x);
return x;
}

- Heikki




Re: Yet another fast GiST build

2021-04-07 Thread Andrey Borodin



> 7 апр. 2021 г., в 16:18, Heikki Linnakangas  написал(а):
> 

I see there is a problem with "SET client_min_messages = DEBUG1;" on buildfarm. 
I think these checks were necessary to make sure test paths are triggered. When 
we know that code paths are tested, maybe we can omit checks?

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2021-04-07 Thread Heikki Linnakangas

On 07/04/2021 15:12, Andrey Borodin wrote:

7 апр. 2021 г., в 14:56, Heikki Linnakangas 
написал(а):

Ok, I think I understand that now. In btree_gist, the *_cmp()
function operates on non-leaf values, and *_lt(), *_gt() et al
operate on leaf values. For all other datatypes, the leaf and
non-leaf representation is the same, but for bit/varbit, the
non-leaf representation is different. The leaf representation is
VarBit, and non-leaf is just the bits without the 'bit_len' field.
That's why it is indeed correct for gbt_bitcmp() to just use
byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' field
separately. That's subtle, and 100% uncommented.

What that means for this patch is that gbt_bit_sort_build_cmp()
should *not* call byteacmp(), but bitcmp(). Because it operates on
the original datatype stored in the table.


+1 Thanks for investigating this. If I understand things right,
adding test values with different lengths of bit sequences would not
uncover the problem anyway?


That's right, the only consequence of a "wrong" sort order is that the 
quality of the tree suffers, and scans need to scan more pages 
unnecessarily.


I tried to investigate this by creating a varbit index with and without 
sorting, and compared them with pageinspect, but in quick testing, I 
wasn't able to find cases where the sorted version was badly ordered. I 
guess I didn't find the right data set yet.


- Heikki




Re: Yet another fast GiST build

2021-04-07 Thread Andrey Borodin



> 7 апр. 2021 г., в 13:23, Heikki Linnakangas  написал(а):
> 
> Committed with small fixes.

Thanks!

> 7 апр. 2021 г., в 14:56, Heikki Linnakangas  написал(а):
> 
> Ok, I think I understand that now. In btree_gist, the *_cmp() function 
> operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf values. 
> For all other datatypes, the leaf and non-leaf representation is the same, 
> but for bit/varbit, the non-leaf representation is different. The leaf 
> representation is VarBit, and non-leaf is just the bits without the 'bit_len' 
> field. That's why it is indeed correct for gbt_bitcmp() to just use 
> byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' field 
> separately. That's subtle, and 100% uncommented.
> 
> What that means for this patch is that gbt_bit_sort_build_cmp() should *not* 
> call byteacmp(), but bitcmp(). Because it operates on the original datatype 
> stored in the table.

+1
Thanks for investigating this.
If I understand things right, adding test values with different lengths of bit 
sequences would not uncover the problem anyway?

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2021-04-07 Thread Heikki Linnakangas

On 07/04/2021 09:00, Heikki Linnakangas wrote:

On 08/03/2021 19:06, Andrey Borodin wrote:

There were numerous GiST-build-related patches in this thread. Yet uncommitted 
is a patch with sortsupport routines for btree_gist contrib module.
Here's its version which needs review.


Reviewing this now again. One thing caught my eye:


+static int
+gbt_bit_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
+{
+   return DatumGetInt32(DirectFunctionCall2(byteacmp,
+   
 PointerGetDatum(a),
+   
 PointerGetDatum(b)));
+}


That doesn't quite match the sort order used by the comparison
functions, gbt_bitlt and such. The comparison functions compare the bits
first, and use the length as a tie-breaker. Using byteacmp() will
compare the "bit length" first. However, gbt_bitcmp() also uses
byteacmp(), so I'm a bit confused. So, huh?


Ok, I think I understand that now. In btree_gist, the *_cmp() function 
operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf 
values. For all other datatypes, the leaf and non-leaf representation is 
the same, but for bit/varbit, the non-leaf representation is different. 
The leaf representation is VarBit, and non-leaf is just the bits without 
the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to 
just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' 
field separately. That's subtle, and 100% uncommented.


What that means for this patch is that gbt_bit_sort_build_cmp() should 
*not* call byteacmp(), but bitcmp(). Because it operates on the original 
datatype stored in the table.


- Heikki




Re: Yet another fast GiST build

2021-04-07 Thread Heikki Linnakangas

On 07/04/2021 09:00, Heikki Linnakangas wrote:

On 08/03/2021 19:06, Andrey Borodin wrote:

There were numerous GiST-build-related patches in this thread. Yet uncommitted 
is a patch with sortsupport routines for btree_gist contrib module.
Here's its version which needs review.


Committed with small fixes. I changed the all functions to use 
*GetDatum() and DatumGet*() macros, instead of just comparing Datums 
with < and >. Datum is unsigned, while int2, int4, int8 and money are 
signed, so that changes the sort order around 0 for those types to be 
the same as the picksplit and picksplit functions use. Not a correctness 
issue, the sorting only affects the quality of the index, but let's be tidy.


This issue remains:


Reviewing this now again. One thing caught my eye:


+static int
+gbt_bit_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
+{
+   return DatumGetInt32(DirectFunctionCall2(byteacmp,
+   
 PointerGetDatum(a),
+   
 PointerGetDatum(b)));
+}


That doesn't quite match the sort order used by the comparison
functions, gbt_bitlt and such. The comparison functions compare the bits
first, and use the length as a tie-breaker. Using byteacmp() will
compare the "bit length" first. However, gbt_bitcmp() also uses
byteacmp(), so I'm a bit confused. So, huh?


Since we used byteacmp() previously for picksplit, too, this is 
consistent with the order you got previously. It still seems wrong to me 
and should be investigated, but it doesn't need to block this patch.


- Heikki




Re: Yet another fast GiST build

2021-04-07 Thread Heikki Linnakangas

On 08/03/2021 19:06, Andrey Borodin wrote:

There were numerous GiST-build-related patches in this thread. Yet uncommitted 
is a patch with sortsupport routines for btree_gist contrib module.
Here's its version which needs review.


Reviewing this now again. One thing caught my eye:


+static int
+gbt_bit_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
+{
+   return DatumGetInt32(DirectFunctionCall2(byteacmp,
+   
 PointerGetDatum(a),
+   
 PointerGetDatum(b)));
+}


That doesn't quite match the sort order used by the comparison 
functions, gbt_bitlt and such. The comparison functions compare the bits 
first, and use the length as a tie-breaker. Using byteacmp() will 
compare the "bit length" first. However, gbt_bitcmp() also uses 
byteacmp(), so I'm a bit confused. So, huh?


- Heikki




Re: Yet another fast GiST build

2021-03-08 Thread Andrey Borodin
Thanks, Ibrar!

> 8 марта 2021 г., в 21:15, Ibrar Ahmed  написал(а):
> 
> 
> 
> On Mon, Mar 8, 2021 at 8:59 PM Peter Geoghegan  wrote:
> On Mon, Mar 8, 2021 at 6:41 AM Ibrar Ahmed  wrote:
> > The patch (0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch)
> > does not apply successfully and has multiple hanks failed.
> 
> That's because it was committed.
> 
> Thanks for the clarification, its status was not changed which confused me :)
> 

There were numerous GiST-build-related patches in this thread. Yet uncommitted 
is a patch with sortsupport routines for btree_gist contrib module.
Here's its version which needs review.

Thanks for bringing this up!

Best regards, Andrey Borodin.



v5-0001-Sortsupport-for-sorting-GiST-build-for-gist_btree.patch
Description: Binary data


Re: Yet another fast GiST build

2021-03-08 Thread Ibrar Ahmed
On Mon, Mar 8, 2021 at 8:59 PM Peter Geoghegan  wrote:

> On Mon, Mar 8, 2021 at 6:41 AM Ibrar Ahmed  wrote:
> > The patch
> (0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch)
> > does not apply successfully and has multiple hanks failed.
>
> That's because it was committed.
>
> Thanks for the clarification, its status was not changed which confused me
:)



> --
> Peter Geoghegan
>


-- 
Ibrar Ahmed


Re: Yet another fast GiST build

2021-03-08 Thread Peter Geoghegan
On Mon, Mar 8, 2021 at 6:41 AM Ibrar Ahmed  wrote:
> The patch (0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch)
> does not apply successfully and has multiple hanks failed.

That's because it was committed.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-03-08 Thread Ibrar Ahmed
On Mon, Jan 18, 2021 at 3:52 AM Heikki Linnakangas  wrote:

> On 18/01/2021 00:35, Peter Geoghegan wrote:
> > On Sun, Jan 17, 2021 at 12:50 PM Tom Lane  wrote:
> >> I noticed that gist_page_items() thinks it can hold inter_call_data->rel
> >> open across a series of calls.  That's completely unsafe: the executor
> >> might not run the call series to completion (see LIMIT), resulting in
> >> relcache leak complaints.
>
> Fixed, thanks! I changed it to return a tuplestore.
>
> > It also has the potential to run into big problems should the user
> > input a raw page image with an regclass-argument-incompatible tuple
> > descriptor. Maybe that's okay (this is a tool for experts), but it
> > certainly is a consideration.
>
> I'm not sure I understand. It's true that the raw page image can contain
> data from a different index, or any garbage really. And the function
> will behave badly if you do that. That's an accepted risk with
> pageinspect functions, that's why they're superuser-only, although some
> of them are more tolerant of corrupt pages than others. The
> gist_page_items_bytea() variant doesn't try to parse the key data and is
> less likely to crash on bad input.
>
> - Heikki
>
>
> The patch (0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch
)
does not apply successfully and has multiple hanks failed.

http://cfbot.cputube.org/patch_32_2824.log

patching file contrib/pageinspect/gistfuncs.c
Hunk #1 FAILED at 151.
Hunk #2 FAILED at 175.
Hunk #3 FAILED at 245.
Hunk #4 FAILED at 271.

...

Can we get a rebase?

I am marking the patch "Waiting on Author"

-- 
Ibrar Ahmed


Re: Yet another fast GiST build

2021-01-18 Thread Heikki Linnakangas

On 18/01/2021 01:10, Peter Geoghegan wrote:

On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan  wrote:

I personally agree with you - it's not like there aren't other ways
for superusers to crash the server (most of which seem very similar to
this gist_page_items() issue, in fact). I just think that it's worth
being clear about that being a trade-off that we've accepted.


Can we rename gist_page_items_bytea() to gist_page_items(), and at the
same time rename the current gist_page_items() -- perhaps call it
gist_page_items_output()?

That way we could add a bt_page_items_output() function later, while
leaving everything consistent (actually not quite, since
bt_page_items() outputs text instead of bytea -- but that seems worth
fixing too). This also has the merit of making the unsafe "output"
variant into the special case.


bt_page_items() and bt_page_items_bytea() exist already. And 
brin_page_items() also calls the output functions (there's no bytea 
version of that). Perhaps it would've been better to make the 
bytea-variants the default, but I'm afraid that ship has already sailed.


We're not terribly consistent; heap_page_items(), hash_page_items() and 
gin_page_items() don't attempt to call the output functions.


Then again, I don't think we need to worry much about backwards 
compatibility in pageinspect, so I guess we could rename them all. It 
doesn't bother me enough to make me do it, but I won't object if you 
want to.


- Heikki




Re: Yet another fast GiST build

2021-01-17 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan  wrote:
> I personally agree with you - it's not like there aren't other ways
> for superusers to crash the server (most of which seem very similar to
> this gist_page_items() issue, in fact). I just think that it's worth
> being clear about that being a trade-off that we've accepted.

Can we rename gist_page_items_bytea() to gist_page_items(), and at the
same time rename the current gist_page_items() -- perhaps call it
gist_page_items_output()?

That way we could add a bt_page_items_output() function later, while
leaving everything consistent (actually not quite, since
bt_page_items() outputs text instead of bytea -- but that seems worth
fixing too). This also has the merit of making the unsafe "output"
variant into the special case.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-01-17 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 2:52 PM Heikki Linnakangas  wrote:
> I'm not sure I understand. It's true that the raw page image can contain
> data from a different index, or any garbage really. And the function
> will behave badly if you do that. That's an accepted risk with
> pageinspect functions, that's why they're superuser-only, although some
> of them are more tolerant of corrupt pages than others. The
> gist_page_items_bytea() variant doesn't try to parse the key data and is
> less likely to crash on bad input.

I personally agree with you - it's not like there aren't other ways
for superusers to crash the server (most of which seem very similar to
this gist_page_items() issue, in fact). I just think that it's worth
being clear about that being a trade-off that we've accepted.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-01-17 Thread Heikki Linnakangas

On 18/01/2021 00:35, Peter Geoghegan wrote:

On Sun, Jan 17, 2021 at 12:50 PM Tom Lane  wrote:

I noticed that gist_page_items() thinks it can hold inter_call_data->rel
open across a series of calls.  That's completely unsafe: the executor
might not run the call series to completion (see LIMIT), resulting in
relcache leak complaints.


Fixed, thanks! I changed it to return a tuplestore.


It also has the potential to run into big problems should the user
input a raw page image with an regclass-argument-incompatible tuple
descriptor. Maybe that's okay (this is a tool for experts), but it
certainly is a consideration.


I'm not sure I understand. It's true that the raw page image can contain 
data from a different index, or any garbage really. And the function 
will behave badly if you do that. That's an accepted risk with 
pageinspect functions, that's why they're superuser-only, although some 
of them are more tolerant of corrupt pages than others. The 
gist_page_items_bytea() variant doesn't try to parse the key data and is 
less likely to crash on bad input.


- Heikki




Re: Yet another fast GiST build

2021-01-17 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 12:50 PM Tom Lane  wrote:
> I noticed that gist_page_items() thinks it can hold inter_call_data->rel
> open across a series of calls.  That's completely unsafe: the executor
> might not run the call series to completion (see LIMIT), resulting in
> relcache leak complaints.

It also has the potential to run into big problems should the user
input a raw page image with an regclass-argument-incompatible tuple
descriptor. Maybe that's okay (this is a tool for experts), but it
certainly is a consideration.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-01-17 Thread Tom Lane
Heikki Linnakangas  writes:
> On 12/01/2021 18:19, Andrey Borodin wrote:
>> Thanks! Looks good to me.

> Pushed, thanks!

I noticed that gist_page_items() thinks it can hold inter_call_data->rel
open across a series of calls.  That's completely unsafe: the executor
might not run the call series to completion (see LIMIT), resulting in
relcache leak complaints.  I suspect that it might have cache-flush
hazards even without that.  I think this code needs to be rewritten to do
all the interesting work in the first call.  Or maybe better, return the
results as a tuplestore so you don't have to do multiple calls at all.

regards, tom lane




Re: Yet another fast GiST build

2021-01-15 Thread Andrey Borodin



> 15 янв. 2021 г., в 10:24, Peter Eisentraut 
>  написал(а):
> 
> I noticed this patch while working on another patch for pageinspect [0], and 
> this one appears to introduce a problem similar to the one the other patch 
> attempts to fix: The "itemlen" output parameters are declared to be of type 
> smallint, but the underlying C data is of type uint16 (OffsetNumber).  I 
> don't know the details of gist enough to determine whether overflow is 
> possible here.  If not, perhaps a check or at least a comment would be 
> useful.  Otherwise, these parameters should be of type int in SQL.

Item offsets cannot exceed maximum block size of 32768. And even 
32768/sizeof(ItemId). Thus overflow is impossible.
Interesting question is wether pageinspect should protect itself from corrupted 
input?
Generating description from bogus tuple, probably, can go wrong.

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2021-01-14 Thread Peter Eisentraut

On 2021-01-12 14:49, Heikki Linnakangas wrote:

I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

Thanks for the hint, Peter!
This function does exactly what I want to do. But I have no Relation inside 
gist_page_items(bytea) function... probably, I'll add gist_page_items(relname, 
blockno) overload to fetch keys.


PFA patch with implementation.


I did a bit of cleanup on the function signature. The .sql script
claimed that gist_page_items() took bytea as argument, but in reality it
was a relation name, as text. I changed it so that it takes a page image
as argument, instead of reading the block straight from the index.
Mainly to make it consistent with brin_page_items(), if it wasn't for
that precedence I might've gone either way on it.


I noticed this patch while working on another patch for pageinspect [0], 
and this one appears to introduce a problem similar to the one the other 
patch attempts to fix: The "itemlen" output parameters are declared to 
be of type smallint, but the underlying C data is of type uint16 
(OffsetNumber).  I don't know the details of gist enough to determine 
whether overflow is possible here.  If not, perhaps a check or at least 
a comment would be useful.  Otherwise, these parameters should be of 
type int in SQL.


[0]: 
https://www.postgresql.org/message-id/09e2dd82-4eb6-bba1-271a-d2b58bf6c...@enterprisedb.com





Re: Yet another fast GiST build

2021-01-14 Thread Andrey Borodin


> 14 янв. 2021 г., в 04:47, Peter Geoghegan  написал(а):
> 
> On Tue, Jan 12, 2021 at 5:49 AM Heikki Linnakangas  wrote:
>> I did a bit of cleanup on the function signature. The .sql script
>> claimed that gist_page_items() took bytea as argument, but in reality it
>> was a relation name, as text. I changed it so that it takes a page image
>> as argument, instead of reading the block straight from the index.
>> Mainly to make it consistent with brin_page_items(), if it wasn't for
>> that precedence I might've gone either way on it.
> 
> BTW it would be nice if gist_page_items() had a "dead" boolean output
> argument for the item's LP_DEAD bit, just like bt_page_items().
+1. PFA patch.

> I plan
> on adding some testing for GiST's opportunistic index deletion soon. I
> may also add some of the same enhancements that nbtree got today
> (following commit d168b666).
> 
> This feature was originally heavily based on the nbtree LP_DEAD
> deletion mechanism (now called simple deletion), and I see no reason
> (or at least no good reason) why it shouldn't be possible to keep it
> in sync (except maybe with bottom-up deletion, where that it at least
> isn't straightforward/mechanical).

Sound great!

Best regards, Andrey Borodin.


0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch
Description: Binary data


Re: Yet another fast GiST build

2021-01-13 Thread Peter Geoghegan
On Tue, Jan 12, 2021 at 5:49 AM Heikki Linnakangas  wrote:
> I did a bit of cleanup on the function signature. The .sql script
> claimed that gist_page_items() took bytea as argument, but in reality it
> was a relation name, as text. I changed it so that it takes a page image
> as argument, instead of reading the block straight from the index.
> Mainly to make it consistent with brin_page_items(), if it wasn't for
> that precedence I might've gone either way on it.

BTW it would be nice if gist_page_items() had a "dead" boolean output
argument for the item's LP_DEAD bit, just like bt_page_items(). I plan
on adding some testing for GiST's opportunistic index deletion soon. I
may also add some of the same enhancements that nbtree got today
(following commit d168b666).

This feature was originally heavily based on the nbtree LP_DEAD
deletion mechanism (now called simple deletion), and I see no reason
(or at least no good reason) why it shouldn't be possible to keep it
in sync (except maybe with bottom-up deletion, where that it at least
isn't straightforward/mechanical).

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-01-13 Thread Heikki Linnakangas



On 13 January 2021 13:53:39 EET, Heikki Linnakangas  wrote:
>Buildfarm animal thorntail is still not happy:
>
>> --- 
>> /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/contrib/pageinspect/expected/gist.out
>>  2021-01-13 13:38:09.721752365 +0300
>> +++ 
>> /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/contrib/pageinspect/results/gist.out
>>2021-01-13 14:12:21.540046507 +0300
>> @@ -3,21 +3,21 @@
>>  CREATE INDEX test_gist_idx ON test_gist USING gist (p);
>>  -- Page 0 is the root, the rest are leaf pages
>>  SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
>> - lsn | nsn | rightlink  | flags 
>> --+-++---
>> - 0/1 | 0/0 | 4294967295 | {}
>> +lsn | nsn | rightlink  | flags 
>> ++-++---
>> + 0/1B8357F8 | 0/0 | 4294967295 | {}
>>  (1 row)
>
>Looks like the LSN on the page is not set to GistBuildLSN as expected. 
>Weird.
>
>Thorntail is a sparc64 system, so little-endian, but the other 
>little-endian buildfarm members are not reporting this error. Any idea 
>what might be going on?

Status update on this: I am building Postgres in a qemu sparc64 emulated 
virtual machine, hoping to be able to reproduce this. It's very slow, so it 
will take hours still to complete.

I don't think this is a problem with the test, or with the new pageinspect 
functions, but a genuine bug in the gist building  code. Or there is something 
special on that animal that causes the just-created index pages to be dirtied. 
It does seem to happen consistently on thorntail, but not on other animals.

- Heikki




Re: Yet another fast GiST build

2021-01-13 Thread Heikki Linnakangas
On 13 January 2021 20:04:10 EET, Heikki Linnakangas  wrote:
>
>
>On 13 January 2021 13:53:39 EET, Heikki Linnakangas  wrote:
>>Looks like the LSN on the page is not set to GistBuildLSN as expected. 
>>Weird.
>>
>>Thorntail is a sparc64 system, so little-endian, but the other 
>>little-endian buildfarm members are not reporting this error. Any idea 
>>what might be going on?
>
>Status update on this: I am building Postgres in a qemu sparc64 emulated 
>virtual machine, hoping to be able to reproduce this. It's very slow, so it 
>will take hours still to complete.
>
>I don't think this is a problem with the test, or with the new pageinspect 
>functions, but a genuine bug in the gist building  code. Or there is something 
>special on that animal that causes the just-created index pages to be dirtied. 
>It does seem to happen consistently on thorntail, but not on other animals.

Ah, silly me. Thorntail uses "wal_level=minimal". With that, I can readily 
reproduce this.

I'm still not sure why it happens, but now it should be straightforward to 
debug.

- Heikki




Re: Yet another fast GiST build

2021-01-13 Thread Tom Lane
Heikki Linnakangas  writes:
> On 13/01/2021 11:46, Andrey Borodin wrote:
>> Maybe we can just omit key_data from tests?

> Make sense, fixed it that way. Thanks!

thorntail, at least, is still unhappy.

regards, tom lane




Re: Yet another fast GiST build

2021-01-13 Thread Heikki Linnakangas

On 13/01/2021 12:34, Heikki Linnakangas wrote:

On 13/01/2021 11:46, Andrey Borodin wrote:

13 янв. 2021 г., в 13:41, Heikki Linnakangas 
написал(а):


One more question: will bytea tests run correctly on
32bit\different-endian systems?


Good question. Somehow I thought we were printing esseantilly text
values as bytea. But they are Points, which consists of float8's.
Since I already pushed this, I'm going to just wait and see what
the buildfarm says, and fix if needed. I think the fix is going to
be to just remove the test for the bytea-variant, it doesn't seem
that important to test.


Maybe we can just omit key_data from tests?


Make sense, fixed it that way. Thanks!


Buildfarm animal thorntail is still not happy:


--- 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/contrib/pageinspect/expected/gist.out
2021-01-13 13:38:09.721752365 +0300
+++ 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/contrib/pageinspect/results/gist.out
  2021-01-13 14:12:21.540046507 +0300
@@ -3,21 +3,21 @@
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 -- Page 0 is the root, the rest are leaf pages
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
- lsn | nsn | rightlink  | flags 
--+-++---

- 0/1 | 0/0 | 4294967295 | {}
+lsn | nsn | rightlink  | flags 
++-++---

+ 0/1B8357F8 | 0/0 | 4294967295 | {}
 (1 row)


Looks like the LSN on the page is not set to GistBuildLSN as expected. 
Weird.


Thorntail is a sparc64 system, so little-endian, but the other 
little-endian buildfarm members are not reporting this error. Any idea 
what might be going on?


- Heikki




Re: Yet another fast GiST build

2021-01-13 Thread Heikki Linnakangas

On 13/01/2021 11:46, Andrey Borodin wrote:

13 янв. 2021 г., в 13:41, Heikki Linnakangas 
написал(а):

One more question: will bytea tests run correctly on 
32bit\different-endian systems?


Good question. Somehow I thought we were printing esseantilly text
values as bytea. But they are Points, which consists of float8's.
Since I already pushed this, I'm going to just wait and see what
the buildfarm says, and fix if needed. I think the fix is going to
be to just remove the test for the bytea-variant, it doesn't seem
that important to test.


Maybe we can just omit key_data from tests?


Make sense, fixed it that way. Thanks!

- Heikki




Re: Yet another fast GiST build

2021-01-13 Thread Andrey Borodin



> 13 янв. 2021 г., в 13:41, Heikki Linnakangas  написал(а):
> 
>> One more question: will bytea tests run correctly on
>> 32bit\different-endian systems?
> 
> Good question. Somehow I thought we were printing esseantilly text values as 
> bytea. But they are Points, which consists of float8's. Since I already 
> pushed this, I'm going to just wait and see what the buildfarm says, and fix 
> if needed. I think the fix is going to be to just remove the test for the 
> bytea-variant, it doesn't seem that important to test.

Maybe we can just omit key_data from tests?

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2021-01-13 Thread Heikki Linnakangas

On 12/01/2021 18:19, Andrey Borodin wrote:

12 янв. 2021 г., в 18:49, Heikki Linnakangas 
написал(а):

Fixed the docs accordingly, and ran pgindent. New patch version
attached.


Thanks! Looks good to me.


Pushed, thanks!


One more question: will bytea tests run correctly on
32bit\different-endian systems?


Good question. Somehow I thought we were printing esseantilly text 
values as bytea. But they are Points, which consists of float8's. Since 
I already pushed this, I'm going to just wait and see what the buildfarm 
says, and fix if needed. I think the fix is going to be to just remove 
the test for the bytea-variant, it doesn't seem that important to test.


- Heikki




Re: Yet another fast GiST build

2021-01-12 Thread Andrey Borodin



> 12 янв. 2021 г., в 18:49, Heikki Linnakangas  написал(а):
> 
>> PFA patch with implementation.
> 
> I did a bit of cleanup on the function signature. The .sql script claimed 
> that gist_page_items() took bytea as argument, but in reality it was a 
> relation name, as text. I changed it so that it takes a page image as 
> argument, instead of reading the block straight from the index. Mainly to 
> make it consistent with brin_page_items(), if it wasn't for that precedence I 
> might've gone either way on it.
bt_page_items() takes relation name and block number, that was a reason for 
doing so. But all others *_page_items() (heap, brin, hash) are doing as in v4. 
So I think it's more common way.

> 
> Fixed the docs accordingly, and ran pgindent. New patch version attached.

Thanks! Looks good to me.

One more question: will bytea tests run correctly on 32bit\different-endian 
systems?

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2021-01-12 Thread Heikki Linnakangas

On 10/12/2020 12:16, Andrey Borodin wrote:

9 дек. 2020 г., в 14:47, Andrey Borodin  написал(а):

7 дек. 2020 г., в 23:56, Peter Geoghegan  написал(а):

On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:

Here's version with tests and docs. I still have no idea how to print some 
useful information about tuples keys.


I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

Thanks for the hint, Peter!
This function does exactly what I want to do. But I have no Relation inside 
gist_page_items(bytea) function... probably, I'll add gist_page_items(relname, 
blockno) overload to fetch keys.


PFA patch with implementation.


I did a bit of cleanup on the function signature. The .sql script 
claimed that gist_page_items() took bytea as argument, but in reality it 
was a relation name, as text. I changed it so that it takes a page image 
as argument, instead of reading the block straight from the index. 
Mainly to make it consistent with brin_page_items(), if it wasn't for 
that precedence I might've gone either way on it.


Fixed the docs accordingly, and ran pgindent. New patch version attached.

- Heikki
>From 791e45bcc86b205b372ea0679c8c036fff941f56 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 12 Jan 2021 15:43:09 +0200
Subject: [PATCH v4 1/1] Add functions to 'pageinspect' to inspect GiST
 indexes.

Author: Andrey Borodin and me
Discussion: https://www.postgresql.org/message-id/3E4F9093-A1B5-4DF8-A292-0B48692E3954%40yandex-team.ru
---
 contrib/pageinspect/Makefile  |   6 +-
 contrib/pageinspect/expected/gist.out | 152 ++
 contrib/pageinspect/gistfuncs.c   | 287 ++
 contrib/pageinspect/pageinspect--1.8--1.9.sql |  41 +++
 contrib/pageinspect/pageinspect.control   |   2 +-
 contrib/pageinspect/sql/gist.sql  |  23 ++
 doc/src/sgml/pageinspect.sgml |  89 ++
 7 files changed, 597 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pageinspect/expected/gist.out
 create mode 100644 contrib/pageinspect/gistfuncs.c
 create mode 100644 contrib/pageinspect/pageinspect--1.8--1.9.sql
 create mode 100644 contrib/pageinspect/sql/gist.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index d9d8177116..4539f0aef7 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -7,19 +7,21 @@ OBJS = \
 	btreefuncs.o \
 	fsmfuncs.o \
 	ginfuncs.o \
+	gistfuncs.o \
 	hashfuncs.o \
 	heapfuncs.o \
 	rawpage.o
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
+DATA =  pageinspect--1.8--1.9.sql \
+	pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
 	pageinspect--1.0--1.1.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
-REGRESS = page btree brin gin hash checksum
+REGRESS = page btree brin gin gist hash checksum
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
new file mode 100644
index 00..3c7e08b710
--- /dev/null
+++ b/contrib/pageinspect/expected/gist.out
@@ -0,0 +1,152 @@
+CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+generate_series(1,1000) i;
+CREATE INDEX test_gist_idx ON test_gist USING gist (p);
+-- Page 0 is the root, the rest are leaf pages
+SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
+ lsn | nsn | rightlink  | flags 
+-+-++---
+ 0/1 | 0/0 | 4294967295 | {}
+(1 row)
+
+SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 1));
+ lsn | nsn | rightlink  | flags  
+-+-++
+ 0/1 | 0/0 | 4294967295 | {leaf}
+(1 row)
+
+SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
+ lsn | nsn | rightlink | flags  
+-+-+---+
+ 0/1 | 0/0 | 1 | {leaf}
+(1 row)
+
+SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
+ itemoffset |   ctid| itemlen |   keys
++---+-+---
+  1 | (1,65535) |  40 | (p)=((166,166))
+  2 | (2,65535) |  40 | (p)=((332,332))
+  3 | (3,65535) |  40 | (p)=((498,498))
+  4 | (4,65535) |  40 | (p)=((664,664))
+  5 | (5,65535) |  40 | (p)=((830,830))
+  6 | (6,65535) |  40 | (p)=((996,996))
+  7 | (7,65535) |  40 | (p)=((1000,1000))
+(7 rows)
+
+SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
+ itemoffset | ctid  | itemlen |keys 
++---+-+-
+  1 | (0,1) |  40 | (p)=((1,1))
+  2 | (0,2) |  40 | (p)=((2,2))
+  3 | (0,3) |  40 | (p)=((3,3))
+

Re: Yet another fast GiST build

2020-12-10 Thread Andrey Borodin


> 9 дек. 2020 г., в 14:47, Andrey Borodin  написал(а):
> 
> 
> 
>> 7 дек. 2020 г., в 23:56, Peter Geoghegan  написал(а):
>> 
>> On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:
>>> Here's version with tests and docs. I still have no idea how to print some 
>>> useful information about tuples keys.
>> 
>> I suggest calling BuildIndexValueDescription() from your own custom
>> debug instrumentation code.
> Thanks for the hint, Peter!
> This function does exactly what I want to do. But I have no Relation inside 
> gist_page_items(bytea) function... probably, I'll add 
> gist_page_items(relname, blockno) overload to fetch keys.

PFA patch with implementation.

Best regards, Andrey Borodin.



v3-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch
Description: Binary data




Re: Yet another fast GiST build

2020-12-09 Thread Andrey Borodin



> 7 дек. 2020 г., в 23:56, Peter Geoghegan  написал(а):
> 
> On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:
>> Here's version with tests and docs. I still have no idea how to print some 
>> useful information about tuples keys.
> 
> I suggest calling BuildIndexValueDescription() from your own custom
> debug instrumentation code.
Thanks for the hint, Peter!
This function does exactly what I want to do. But I have no Relation inside 
gist_page_items(bytea) function... probably, I'll add gist_page_items(relname, 
blockno) overload to fetch keys.

Thanks!

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-12-07 Thread Peter Geoghegan
On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:
> Here's version with tests and docs. I still have no idea how to print some 
> useful information about tuples keys.

I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2020-12-07 Thread Andrey Borodin


> 28 сент. 2020 г., в 13:12, Heikki Linnakangas  написал(а):
> 
> I wrote a couple of 'pageinspect' function to inspect GiST pages for this. 
> See attached.
> <0001-Add-functions-to-pageinspect-to-inspect-GiST-indexes.patch>

Here's version with tests and docs. I still have no idea how to print some 
useful information about tuples keys.

Thanks!

Best regards, Andrey Borodin.


v2-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch
Description: Binary data




Re: Yet another fast GiST build

2020-11-07 Thread Andrey Borodin


> 5 нояб. 2020 г., в 22:20, Justin Pryzby  написал(а):
> 
> On Thu, Nov 05, 2020 at 10:11:52PM +0500, Andrey Borodin wrote:
>> To test that functions are actually called for sorting build we should 
>> support directive sorting build like "CREATE INDEX ON A USING GIST(B) 
>> WITH(sorting=surely,and fail if not)".
> 
> Maybe you could add a DEBUG1 message for that, and include that in regression
> tests, which would then fail if sorting wasn't used.

That's a good idea. Thanks!
> 
> Maybe you'd need to make it consistent by setting GUCs like work_mem /
> max_parallel_maintenance_workers / ??
> 
> Similar to this
> 
> postgres=# SET client_min_messages =debug;
> postgres=# CREATE INDEX ON A USING GIST(i) WITH(buffering=off);
> DEBUG:  building index "a_i_idx2" on table "a" serially
> CREATE INDEX

Currently, only B-tree uses parallel build, so no need to tweak GUCs except 
client_min_messages.
Before these tests, actually, ~20% of opclasses were not working as expected. 
Despite I've checked each one by hand. I have 

PFA patch with fixed comments from Heikki.

Thanks!

Best regards, Andrey Borodin.


v4-0001-Sortsupport-for-sorting-GiST-build-for-gist_btree.patch
Description: Binary data


Re: Yet another fast GiST build

2020-11-05 Thread Justin Pryzby
On Thu, Nov 05, 2020 at 10:11:52PM +0500, Andrey Borodin wrote:
> To test that functions are actually called for sorting build we should 
> support directive sorting build like "CREATE INDEX ON A USING GIST(B) 
> WITH(sorting=surely,and fail if not)".

Maybe you could add a DEBUG1 message for that, and include that in regression
tests, which would then fail if sorting wasn't used.

Maybe you'd need to make it consistent by setting GUCs like work_mem /
max_parallel_maintenance_workers / ??

Similar to this

postgres=# SET client_min_messages =debug;
postgres=# CREATE INDEX ON A USING GIST(i) WITH(buffering=off);
DEBUG:  building index "a_i_idx2" on table "a" serially
CREATE INDEX

> If we have unconditional sorting build and unconditional buffered build we 
> can check opclasses code better.
> The problem is current reloption is called "buffering". It would be strange 
> if we gave this enum possible value like "not buffering, but very much like 
> buffering, just another way".
> How do you think, is it ok to add reloption "buffering=sorting" to enhance 
> tests?
> 
> Best regards, Andrey Borodin.




Re: Yet another fast GiST build

2020-11-05 Thread Andrey Borodin



> 2 нояб. 2020 г., в 19:45, Heikki Linnakangas  написал(а):
> 
>> How do you think, should this patch and patch with pageinspect GiST 
>> functions be registered on commitfest?
> 
> Yeah, that'd be good.
I've registered both patches on January CF.
pageinspect patch's code looks goot to me (besides TODO item there), but it 
lacks docs and tests. I can add some info and calls in future reviews.

> 
> On 30/10/2020 20:20, Andrey Borodin wrote:
> A few quick comments:
> 
> * Currently with float8, you immediately abort abbreviation if SIZEOF_DATUM 
> == 4. Like in the 'inet' above, you could convert the float8 to float4 
> instead.
> 
> * Some of the ALTER OPERATOR FAMILY commands in btree_gist--1.6--1.7.sql are 
> copy-pasted wrong. Here's one example:
> 
> ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
>   FUNCTION11  (text, text) gbt_text_sortsupport;
> 
> * It's easy to confuse the new comparison functions with the existing 
> comparisons used by the picksplit functions. Some comments and/or naming 
> changes would be good to clarify how they're different.
> 
> * It would be straightforward to have abbreviated functions for macaddr and 
> macaddr8 too.

I'll fix these issues soon. But things like this bother me a lot:
> ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
>   FUNCTION11  (text, text) gbt_text_sortsupport;

To test that functions are actually called for sorting build we should support 
directive sorting build like "CREATE INDEX ON A USING GIST(B) 
WITH(sorting=surely,and fail if not)".
If we have unconditional sorting build and unconditional buffered build we can 
check opclasses code better.
The problem is current reloption is called "buffering". It would be strange if 
we gave this enum possible value like "not buffering, but very much like 
buffering, just another way".
How do you think, is it ok to add reloption "buffering=sorting" to enhance 
tests?

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-11-02 Thread Heikki Linnakangas

On 30/10/2020 20:20, Andrey Borodin wrote:

27 окт. 2020 г., в 16:43, Heikki Linnakangas  написал(а):

static bool
gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
{
#if SIZEOF_DATUM == 8
return false;
#else
return true;
#endif
}


Better to not set the 'abbrev_converter' function in the first place. Or would 
it be better to cast the float8 to float4 if SIZEOF_DATUM == 4?

Ok, now for 4 bytes Datum we do return (Datum) Float4GetDatum((float) z);


A few quick comments:

* Currently with float8, you immediately abort abbreviation if 
SIZEOF_DATUM == 4. Like in the 'inet' above, you could convert the 
float8 to float4 instead.


* Some of the ALTER OPERATOR FAMILY commands in btree_gist--1.6--1.7.sql 
are copy-pasted wrong. Here's one example:


ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
FUNCTION11  (text, text) gbt_text_sortsupport;

* It's easy to confuse the new comparison functions with the existing 
comparisons used by the picksplit functions. Some comments and/or naming 
changes would be good to clarify how they're different.


* It would be straightforward to have abbreviated functions for macaddr 
and macaddr8 too.



How do you think, should this patch and patch with pageinspect GiST functions 
be registered on commitfest?


Yeah, that'd be good.

- Heikki




Re: Yet another fast GiST build

2020-10-30 Thread Andrey Borodin
Thanks!

> 27 окт. 2020 г., в 16:43, Heikki Linnakangas  написал(а):
> 
> gbt_ts_cmp(), gbt_time_cmp_sort() and gbt_date_cmp_sort() still have the 
> above issue, they still compare "upper" for no good reason.
Fixed.

>> +static Datum
>> +gbt_bit_abbrev_convert(Datum original, SortSupport ssup)
>> +{
>> +   return (Datum) 0;
>> +}
>> +
>> +static int
>> +gbt_bit_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup)
>> +{
>> +   return 0;
>> +}
> 
> If an abbreviated key is not useful, just don't define abbrev functions and 
> don't set SortSupport->abbrev_converter in the first place.
Fixed.
> 
>> static bool
>> gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
>> {
>> #if SIZEOF_DATUM == 8
>>  return false;
>> #else
>>  return true;
>> #endif
>> }
> 
> Better to not set the 'abbrev_converter' function in the first place. Or 
> would it be better to cast the float8 to float4 if SIZEOF_DATUM == 4?
Ok, now for 4 bytes Datum we do return (Datum) Float4GetDatum((float) z);

How do you think, should this patch and patch with pageinspect GiST functions 
be registered on commitfest?

Thanks!

Best regards, Andrey Borodin.


v3-0001-Sortsupport-for-sorting-GiST-build-for-gist_btree.patch
Description: Binary data


Re: Yet another fast GiST build

2020-10-27 Thread Heikki Linnakangas

On 21/10/2020 20:13, Andrey Borodin wrote:

7 окт. 2020 г., в 17:38, Heikki Linnakangas  написал(а):

On 07/10/2020 15:27, Andrey Borodin wrote:

Here's draft patch with implementation of sortsupport for ints and floats.



+static int
+gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
+{
+   int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
+   int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
+
+   if (ia->lower == ib->lower)
+   {
+   if (ia->upper == ib->upper)
+   return 0;
+
+   return (ia->upper > ib->upper) ? 1 : -1;
+   }
+
+   return (ia->lower > ib->lower) ? 1 : -1;
+}


We're only dealing with leaf items during index build, so the 'upper' and 
'lower' should always be equal here, right? Maybe add a comment and an 
assertion on that.

(It's pretty sad that the on-disk representation is identical for leaf and 
internal items, because that wastes a lot of disk space, but that's way out of 
scope for this patch.)


Thanks, I've added assert() where is was easy to test equalty.

PFA patch with all types.


gbt_ts_cmp(), gbt_time_cmp_sort() and gbt_date_cmp_sort() still have the 
above issue, they still compare "upper" for no good reason.



+static Datum
+gbt_bit_abbrev_convert(Datum original, SortSupport ssup)
+{
+   return (Datum) 0;
+}
+
+static int
+gbt_bit_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup)
+{
+   return 0;
+}


If an abbreviated key is not useful, just don't define abbrev functions 
and don't set SortSupport->abbrev_converter in the first place.



static bool
gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
{
#if SIZEOF_DATUM == 8
return false;
#else
return true;
#endif
}


Better to not set the 'abbrev_converter' function in the first place. Or 
would it be better to cast the float8 to float4 if SIZEOF_DATUM == 4?



I had a plan to implement and test one type each day. I did not quite 
understood how rich our type system is.


:-)

- Heikki




Re: Yet another fast GiST build

2020-10-21 Thread Pavel Borisov
>
>
> >> +return (ia->lower > ib->lower) ? 1 : -1;
> >> +}
> >
> > We're only dealing with leaf items during index build, so the 'upper'
> and 'lower' should always be equal here, right? Maybe add a comment and an
> assertion on that.
> >
> > (It's pretty sad that the on-disk representation is identical for leaf
> and internal items, because that wastes a lot of disk space, but that's way
> out of scope for this patch.)
>
> Thanks, I've added assert() where is was easy to test equalty.
>
> PFA patch with all types.
>
> I had a plan to implement and test one type each day. I did not quite
> understood how rich our type system is.
>

Cool, thanks!

BTW there is a somewhat parallel discussion on this gist patch in
pgsql-bugs which you may miss
https://www.postgresql.org/message-id/flat/3dda6945-c147-5afc-7720-38ec6848dafa%40gmail.com#9be5b8a75c4921498748514cb77c6e68

The main point is that buffering build is better to be enforced with
buffering=on
as otherwise buffering build becomes hard to test in the presence of sort
support.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Yet another fast GiST build

2020-10-21 Thread Andrey Borodin


> 7 окт. 2020 г., в 17:38, Heikki Linnakangas  написал(а):
> 
> On 07/10/2020 15:27, Andrey Borodin wrote:
>> Here's draft patch with implementation of sortsupport for ints and floats.
> 
>> +static int
>> +gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
>> +{
>> +int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
>> +int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
>> +
>> +if (ia->lower == ib->lower)
>> +{
>> +if (ia->upper == ib->upper)
>> +return 0;
>> +
>> +return (ia->upper > ib->upper) ? 1 : -1;
>> +}
>> +
>> +return (ia->lower > ib->lower) ? 1 : -1;
>> +}
> 
> We're only dealing with leaf items during index build, so the 'upper' and 
> 'lower' should always be equal here, right? Maybe add a comment and an 
> assertion on that.
> 
> (It's pretty sad that the on-disk representation is identical for leaf and 
> internal items, because that wastes a lot of disk space, but that's way out 
> of scope for this patch.)

Thanks, I've added assert() where is was easy to test equalty.

PFA patch with all types.

I had a plan to implement and test one type each day. I did not quite 
understood how rich our type system is.

Thanks!

Best regards, Andrey Borodin.


0001-Sortsupport-for-sorting-GiST-build-for-gist_btree-ty.patch
Description: Binary data


Re: Yet another fast GiST build

2020-10-07 Thread Heikki Linnakangas

On 07/10/2020 15:27, Andrey Borodin wrote:

Here's draft patch with implementation of sortsupport for ints and floats.



+static int
+gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
+{
+   int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
+   int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
+
+   if (ia->lower == ib->lower)
+   {
+   if (ia->upper == ib->upper)
+   return 0;
+
+   return (ia->upper > ib->upper) ? 1 : -1;
+   }
+
+   return (ia->lower > ib->lower) ? 1 : -1;
+}


We're only dealing with leaf items during index build, so the 'upper' 
and 'lower' should always be equal here, right? Maybe add a comment and 
an assertion on that.


(It's pretty sad that the on-disk representation is identical for leaf 
and internal items, because that wastes a lot of disk space, but that's 
way out of scope for this patch.)


- Heikki




Re: Yet another fast GiST build

2020-10-07 Thread Andrey Borodin


> 29 сент. 2020 г., в 23:04, Andrey M. Borodin  
> написал(а):
> 
> 
> Also I'm working on btree_gist opclasses and found out that new sortsupport 
> function is not mentioned in gistadjustmembers(). I think this can be fixed 
> with gist_btree patch.

Here's draft patch with implementation of sortsupport for ints and floats.

Thanks!

Best regards, Andrey Borodin.


0001-Sortsupport-for-sorting-GiST-build-for-ints-and-floa.patch
Description: Binary data


Re: Yet another fast GiST build

2020-10-06 Thread Pavel Borisov
It became normal with
-fsanitize=signed-integer-overflow,null,alignment
instead of
-fsanitize=undefined
(which is strictly a 'default' list of needed and unnecessary things to
check, can be overridden anyway but needed some reading for it)

Thanks!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Yet another fast GiST build

2020-10-06 Thread Tom Lane
Heikki Linnakangas  writes:
> You get the same error with:
> select (float8 '1e+300')::float4;
> float.c:1204:11: runtime error: 1e+300 is outside the range of 
> representable values of type 'float'
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior float.c:1204:11 in

> It boils down to casting a C double to float, when the value doesn't fit 
> in float. I'm surprised that's undefined behavior, but I'm no C99 
> lawyer. The code in dtof() expects it to yield Inf.

I think UBSan read C99 6.3.1.5:

   [#2]  When  a double is demoted to float or a long double to
   double or float, if the value being converted is outside the
   range  of  values  that  can be represented, the behavior is
   undefined.

and stopped reading at that point, which they should not have.
If you go on to read the portions around, particularly, ,
you get a different picture of affairs.  If we're relying on IEEE
float semantics in other places, which we are, we're perfectly
entitled to assume that the cast will yield Inf (and a floating
point exception flag, which we ignore).  I think the "undefined"
here is just meant to say that there's no single behavior promised
across all possible C implementations.  They'd have been better to
write "implementation-defined", though.

> I'm inclined to shrug this off and say that the sanitizer is being 
> over-zealous. Is there some compiler flag we should be setting, to tell 
> it that we require specific behavior? Any other ideas?

If UBSan doesn't have a flag to tell it to assume IEEE math,
I'd say that makes it next door to worthless for our purposes.

regards, tom lane




Re: Yet another fast GiST build

2020-10-06 Thread Heikki Linnakangas

On 06/10/2020 14:05, Pavel Borisov wrote:

I've been making tests with memory sanitizer


Thanks!

and got one another error 
in regression test create_index:

CREATE INDEX gpointind ON point_tbl USING gist (f1);
server closed the connection unexpectedly

with logfile:
gistproc.c:1714:28: runtime error: 1e+300 is outside the range of 
representable values of type 'float'

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior gistproc.c:1714:28

gistproc.c:
1714     z = point_zorder_internal(p->x, p->y);

Consider this a minor issue but unrelated to the other issues discussed. 
It is reproduced on the last master 
commit 0a3c864c32751fd29d021929cf70af421fd27370 after all changes into 
Gist committed.


cflags="-DUSE_VALGRIND -Og -O0 -fsanitize=address -fsanitize=undefined 
-fno-sanitize-recover=all -fno-sanitize=alignment -fstack-protector"


You get the same error with:

select (float8 '1e+300')::float4;

float.c:1204:11: runtime error: 1e+300 is outside the range of 
representable values of type 'float'

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior float.c:1204:11 in

It boils down to casting a C double to float, when the value doesn't fit 
in float. I'm surprised that's undefined behavior, but I'm no C99 
lawyer. The code in dtof() expects it to yield Inf.


I'm inclined to shrug this off and say that the sanitizer is being 
over-zealous. Is there some compiler flag we should be setting, to tell 
it that we require specific behavior? Any other ideas?


- Heikki




Re: Yet another fast GiST build

2020-10-06 Thread Pavel Borisov
I've been making tests with memory sanitizer and got one another error in
regression test create_index:
CREATE INDEX gpointind ON point_tbl USING gist (f1);
server closed the connection unexpectedly

with logfile:
gistproc.c:1714:28: runtime error: 1e+300 is outside the range of
representable values of type 'float'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior gistproc.c:1714:28

gistproc.c:
1714 z = point_zorder_internal(p->x, p->y);

Consider this a minor issue but unrelated to the other issues discussed. It
is reproduced on the last master
commit 0a3c864c32751fd29d021929cf70af421fd27370 after all changes into Gist
committed.

cflags="-DUSE_VALGRIND -Og -O0 -fsanitize=address -fsanitize=undefined
-fno-sanitize-recover=all -fno-sanitize=alignment -fstack-protector"


-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Yet another fast GiST build

2020-09-29 Thread Heikki Linnakangas

On 29/09/2020 21:04, Andrey M. Borodin wrote:

28 сент. 2020 г., в 13:12, Heikki Linnakangas  написал(а):

I did some testing with your patch. It seems that the rightlinks
are still not always set. I didn't try to debug why.


Yes, there is a bug. Now it seems to me so obvious, yet it took some 
time to understand that links were shifted by one extra jump. PFA 
fixed rightlinks installation.

Ah, that was simple. I propose adding a comment on it:

--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -540,6 +540,19 @@ gist_indexsortbuild_pagestate_flush(GISTBuildState 
*state,

/* Re-initialize the page buffer for next page on this level. */
pagestate->page = palloc(BLCKSZ);
gistinitpage(pagestate->page, isleaf ? F_LEAF : 0);
+
+   /*
+* Set the right link to point to the previous page. This is just for
+* debugging purposes: GiST only follows the right link if a page is 
split
+* concurrently to a scan, and that cannot happen during index build.
+*
+* It's a bit counterintuitive that we set the right link on the new 
page
+* to point to the previous page, and not the other way round. But GiST
+* pages are not ordered like B-tree pages are, so as long as the
+* right-links form a chain through all the pages in the same level, the
+* order doesn't matter.
+*/
+   GistPageGetOpaque(pagestate->page)->rightlink = blkno;
 }


BTW some one more small thing: we initialise page buffers with
palloc() and palloc0(), while first one is sufficient for
gistinitpage().
Hmm. Only the first one, in gist_indexsortbuild(), but that needs to be 
palloc0, because it's used to write an all-zeros placeholder for the 
root page.



Also I'm working on btree_gist opclasses and found out that new
sortsupport function is not mentioned in gistadjustmembers(). I think
this can be fixed with gist_btree patch.


Thanks!


Your pageinspect patch seems very useful. How do you think, should we
provide a way to find invalid tuples in GiST within
gist_page_items()? At some point we will have to ask user to reindex
GiSTs with invalid tuples.
You mean invalid tuples created by crash on PostgreSQL version 9.0 or 
below, and pg_upgraded? I doubt there are any of those still around in 
the wild. We have to keep the code to detect them, though.


It would be nice to improve gist_page_items() to display more 
information about the items, although I wouldn't worry much about 
invalid tuples. The 'gevel' extension that Oleg mentioned upthread does 
more, it would be nice to incorporate that into pageinspect somehow.


- Heikki




Re: Yet another fast GiST build

2020-09-29 Thread Andrey M. Borodin


> 28 сент. 2020 г., в 13:12, Heikki Linnakangas  написал(а):
> 
> On 21/09/2020 17:19, Andrey M. Borodin wrote:
>>> 21 сент. 2020 г., в 18:29, Andrey M. Borodin  
>>> написал(а):
>>> 
>>> It was a conscious decision with incorrect motivation. I was thinking that 
>>> it will help to reduce number of "false positive" inspecting right pages. 
>>> But now I see that:
>>> 1. There should be no such "false positives" that we can avoid
>>> 2. Valid rightlinks could help to do amcheck verification in future
>> Well, point number 2 here is invalid. There exist one leaf page p, so that 
>> if we start traversing rightlink from p we will reach all leaf pages. But we 
>> practically have no means to find this page. This makes rightlinks not very 
>> helpful in amcheck for GiST.
> 
> Well, if you store all the right links in a hash table or something, you can 
> "connect the dots" after you have scanned all the pages to see that the chain 
> is unbroken. Probably would not be worth the trouble, since the rightlinks 
> are not actually needed after concurrent scans have completed.
> 
>> But for consistency I think it worth to install them.
> 
> I agree. I did some testing with your patch. It seems that the rightlinks are 
> still not always set. I didn't try to debug why.
> 
> I wrote a couple of 'pageinspect' function to inspect GiST pages for this. 
> See attached. I then created a test table and index like this:
> 
> create table points (p point);
> insert into points select point(x,y) from generate_series(-2000, 2000) x, 
> generate_series(-2000, 2000) y;
> create index points_idx on points using gist (p);
> 
> And this is what the root page looks like:
> 
> postgres=# select * from gist_page_items(get_raw_page('points_idx', 0));
> itemoffset | ctid  | itemlen
> +---+-
>  1 | (27891,65535) |  40
>  2 | (55614,65535) |  40
>  3 | (83337,65535) |  40
>  4 | (97019,65535) |  40
> (4 rows)
> 
> And the right links on the next level:
> 
> postgres=# select * from (VALUES (27891), (55614), (83337), (97019)) b 
> (blkno), lateral gist_page_opaque_info(get_raw_page('points_idx', blkno));
> blkno | lsn | nsn | rightlink  | flags
> ---+-+-++---
> 27891 | 0/1 | 0/0 | 4294967295 | {}
> 55614 | 0/1 | 0/0 | 4294967295 | {}
> 83337 | 0/1 | 0/0 |  27891 | {}
> 97019 | 0/1 | 0/0 |  55614 | {}
> (4 rows)
> 
> I expected there to be only one page with invalid right link, but there are 
> two.

Yes, there is a bug. Now it seems to me so obvious, yet it took some time to 
understand that links were shifted by one extra jump. PFA fixed rightlinks 
installation.

BTW some one more small thing: we initialise page buffers with palloc() and 
palloc0(), while first one is sufficient for gistinitpage().
Also I'm working on btree_gist opclasses and found out that new sortsupport 
function is not mentioned in gistadjustmembers(). I think this can be fixed 
with gist_btree patch.

Your pageinspect patch seems very useful. How do you think, should we provide a 
way to find invalid tuples in GiST within gist_page_items()? At some point we 
will have to ask user to reindex GiSTs with invalid tuples.


> 28 сент. 2020 г., в 11:53, Heikki Linnakangas  написал(а):
> 
> On 21/09/2020 16:29, Andrey M. Borodin wrote:
>> But thing that bothers me now: when we vacuum leaf page, we bump it's
>> NSN. But we do not bump internal page LSN. Does this means we will
>> follow rightlinks after vacuum? It seems superflous.
> 
> Sorry, I did not understand what you said above. Vacuum doesn't update any 
> NSNs, only LSNs. Can you elaborate?
I've misunderstood difference between NSN and LSN. Seems like everything is 
fine.

> 
>> And btw we did not adjust internal page tuples after vacuum...
> What do you mean by that?

When we delete rows from table internal tuples in GiST stay wide.


Thanks!

Best regards, Andrey Borodin.



install_rightlinks_v2.diff
Description: Binary data


Re: Yet another fast GiST build

2020-09-28 Thread Heikki Linnakangas

On 21/09/2020 17:19, Andrey M. Borodin wrote:

21 сент. 2020 г., в 18:29, Andrey M. Borodin  написал(а):

It was a conscious decision with incorrect motivation. I was thinking that it will help 
to reduce number of "false positive" inspecting right pages. But now I see that:
1. There should be no such "false positives" that we can avoid
2. Valid rightlinks could help to do amcheck verification in future


Well, point number 2 here is invalid. There exist one leaf page p, so that if 
we start traversing rightlink from p we will reach all leaf pages. But we 
practically have no means to find this page. This makes rightlinks not very 
helpful in amcheck for GiST.


Well, if you store all the right links in a hash table or something, you 
can "connect the dots" after you have scanned all the pages to see that 
the chain is unbroken. Probably would not be worth the trouble, since 
the rightlinks are not actually needed after concurrent scans have 
completed.



But for consistency I think it worth to install them.


I agree. I did some testing with your patch. It seems that the 
rightlinks are still not always set. I didn't try to debug why.


I wrote a couple of 'pageinspect' function to inspect GiST pages for 
this. See attached. I then created a test table and index like this:


create table points (p point);
insert into points select point(x,y) from generate_series(-2000, 2000) 
x, generate_series(-2000, 2000) y;

create index points_idx on points using gist (p);

And this is what the root page looks like:

postgres=# select * from gist_page_items(get_raw_page('points_idx', 0));
 itemoffset | ctid  | itemlen
+---+-
  1 | (27891,65535) |  40
  2 | (55614,65535) |  40
  3 | (83337,65535) |  40
  4 | (97019,65535) |  40
(4 rows)

And the right links on the next level:

postgres=# select * from (VALUES (27891), (55614), (83337), (97019)) b 
(blkno), lateral gist_page_opaque_info(get_raw_page('points_idx', blkno));

 blkno | lsn | nsn | rightlink  | flags
---+-+-++---
 27891 | 0/1 | 0/0 | 4294967295 | {}
 55614 | 0/1 | 0/0 | 4294967295 | {}
 83337 | 0/1 | 0/0 |  27891 | {}
 97019 | 0/1 | 0/0 |  55614 | {}
(4 rows)

I expected there to be only one page with invalid right link, but there 
are two.


- Heikki
>From c388e458b196454d3535d84f6f7a617f0f3f819a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 28 Sep 2020 11:01:45 +0300
Subject: [PATCH 1/1] Add functions to 'pageinspect' to inspect GiST indexes.

---
 contrib/pageinspect/Makefile  |   4 +-
 contrib/pageinspect/gistfuncs.c   | 173 ++
 contrib/pageinspect/pageinspect--1.8--1.9.sql |  27 +++
 contrib/pageinspect/pageinspect.control   |   2 +-
 4 files changed, 204 insertions(+), 2 deletions(-)
 create mode 100644 contrib/pageinspect/gistfuncs.c
 create mode 100644 contrib/pageinspect/pageinspect--1.8--1.9.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index d9d8177116b..0f9561616e1 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -7,12 +7,14 @@ OBJS = \
 	btreefuncs.o \
 	fsmfuncs.o \
 	ginfuncs.o \
+	gistfuncs.o \
 	hashfuncs.o \
 	heapfuncs.o \
 	rawpage.o
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
+DATA =  pageinspect--1.8--1.9.sql \
+	pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
diff --git a/contrib/pageinspect/gistfuncs.c b/contrib/pageinspect/gistfuncs.c
new file mode 100644
index 000..8517b4c241f
--- /dev/null
+++ b/contrib/pageinspect/gistfuncs.c
@@ -0,0 +1,173 @@
+/*
+ * gistfuncs.c
+ *		Functions to investigate the content of GiST indexes
+ *
+ * Copyright (c) 2014-2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pageinspect/gitfuncs.c
+ */
+#include "postgres.h"
+
+#include "access/gist.h"
+#include "access/htup.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pageinspect.h"
+#include "storage/itemptr.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+
+PG_FUNCTION_INFO_V1(gist_page_opaque_info);
+PG_FUNCTION_INFO_V1(gist_page_items);
+
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
+
+
+Datum
+gist_page_opaque_info(PG_FUNCTION_ARGS)
+{
+	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
+	TupleDesc	tupdesc;
+	Page		page;
+	GISTPageOpaque opaq;
+	HeapTuple	resultTuple;
+	Datum		values[4];
+	bool		nulls[4];
+	Datum		flags[16];
+	int			nflags = 0;
+	uint16		flagbits;
+
+	if (!superuser())
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser to use raw page functions")));
+
+	page = get_page_from_raw(raw_page);
+
+	opaq = (GISTPageOpaque) PageGetSpecialPointer(page);
+

Re: Yet another fast GiST build

2020-09-28 Thread Heikki Linnakangas



On 21/09/2020 16:29, Andrey M. Borodin wrote:

But thing that bothers me now: when we vacuum leaf page, we bump it's
NSN. But we do not bump internal page LSN. Does this means we will
follow rightlinks after vacuum? It seems superflous.


Sorry, I did not understand what you said above. Vacuum doesn't update 
any NSNs, only LSNs. Can you elaborate?



And btw we did not adjust internal page tuples after vacuum...

What do you mean by that?

- Heikki




Re: Yet another fast GiST build

2020-09-21 Thread Tom Lane
Heikki Linnakangas  writes:
> On 21/09/2020 02:06, Tom Lane wrote:
>> Another interesting point is that all the other index AMs seem to WAL-log
>> the new page before the smgrextend call, whereas this code is doing it
>> in the other order.  I strongly doubt that both patterns are equally
>> correct.  Could be that the other AMs are in the wrong though.

> My thinking was that it's better to call smgrextend() first, so that if 
> you run out of disk space, you get the error before WAL-logging it. That 
> reduces the chance that WAL replay will run out of disk space. A lot of 
> things are different during WAL replay, so it's quite likely that WAL 
> replay runs out of disk space anyway if you're living on the edge, but 
> still.

Yeah.  access/transam/README points out that such failures need to be
planned for, and explains what we do for heap pages;

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all.  We extend a table by writing a page
of zeroes at its end.  We must actually do this write so that we are sure
the filesystem has allocated the space.  If the write fails we can just
error out normally.  Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions.  Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition.  In such cases we can just reclaim
the space for re-use.

So GIST seems to be acting according to that design.  (Someday we need
to update this para to acknowledge that not all filesystems behave as
it's assuming.)

> I didn't notice that the other callers are doing it the other way round, 
> though. I think they need to, so that they can stamp the page with the 
> LSN of the WAL record. But GiST build is special in that regard, because 
> it stamps all pages with GistBuildLSN.

Kind of unpleasant; that means they risk what the README points out:

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery.  The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery.  This is part of the reason for not writing a WAL
entry until we've successfully done the original action.

I'm not sufficiently motivated to go and change it right now, though.

regards, tom lane




Re: Yet another fast GiST build

2020-09-21 Thread Andrey M. Borodin


> 21 сент. 2020 г., в 18:29, Andrey M. Borodin  
> написал(а):
> 
> It was a conscious decision with incorrect motivation. I was thinking that it 
> will help to reduce number of "false positive" inspecting right pages. But 
> now I see that:
> 1. There should be no such "false positives" that we can avoid
> 2. Valid rightlinks could help to do amcheck verification in future

Well, point number 2 here is invalid. There exist one leaf page p, so that if 
we start traversing rightlink from p we will reach all leaf pages. But we 
practically have no means to find this page. This makes rightlinks not very 
helpful in amcheck for GiST.

But for consistency I think it worth to install them.

Thanks!

Best regards, Andrey Borodin.



0001-Install-rightlinks-on-GiST-pages-in-case-of-sorting-.patch
Description: Binary data


Re: Yet another fast GiST build

2020-09-21 Thread Andrey M. Borodin



> 21 сент. 2020 г., в 17:15, Heikki Linnakangas  написал(а):
> 
> On 21/09/2020 12:06, Andrey M. Borodin wrote
>> 
>> I think we don't set rightlinks during index build.
> 
> The new GiST sorting code does not, but the regular insert-based code does.
> 
> That's a bit questionable in the new code actually. Was that a conscious
> decision? The right-links are only needed when there are concurrent page
> splits, so I think it's OK, but the checks for InvalidBlockNumber in
> gistScanPage() and gistFindPage() have comment "/* sanity check */".
> Comment changes are needed, at least.

It was a conscious decision with incorrect motivation. I was thinking that it 
will help to reduce number of "false positive" inspecting right pages. But now 
I see that:
1. There should be no such "false positives" that we can avoid
2. Valid rightlinks could help to do amcheck verification in future

But thing that bothers me now: when we vacuum leaf page, we bump it's NSN. But 
we do not bump internal page LSN. Does this means we will follow rightlinks 
after vacuum? It seems superflous. And btw we did not adjust internal page 
tuples after vacuum...

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 12:06, Andrey M. Borodin wrote:

21 сент. 2020 г., в 13:45, Heikki Linnakangas 
написал(а):

Actually, don't we have a problem with that, even before this
patch? Even though we set the LSN to the magic GistBuildLSN value
when we build the index, WAL replay will write the LSN of the
record instead. That would mess with the LSN-NSN interlock. After
WAL replay (or in a streaming replica), a scan on the GiST index
might traverse right-links unnecessarily.


I think we don't set rightlinks during index build.


The new GiST sorting code does not, but the regular insert-based code does.

That's a bit questionable in the new code actually. Was that a conscious
decision? The right-links are only needed when there are concurrent page
splits, so I think it's OK, but the checks for InvalidBlockNumber in
gistScanPage() and gistFindPage() have comment "/* sanity check */".
Comment changes are needed, at least.

- Heikki




Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 02:06, Tom Lane wrote:

Justin Pryzby  writes:

This also appears to break checksums.


Fixed, thanks for the report!


I was wondering about that, because the typical pattern for use of
smgrextend for indexes seems to be

RelationOpenSmgr(rel);
PageSetChecksumInplace(page, lastblock);
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);

and gist_indexsortbuild wasn't doing either of the first two things.

gist_indexsortbuild_flush_ready_pages looks like it might be
a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
run hasn't gotten to anything except the pretty-trivial index
made in point.sql, so I don't have evidence about it.


I added a RelationOpenSmgr() call there too, although it's not needed 
currently. It seems to be enough to do it before the first smgrextend() 
call. But if you removed or refactored the first call someohow, so it 
was not the first call anymore, it would be easy to miss that you'd 
still need the RelationOpenSmgr() call there. It's more consistent with 
the code in nbtsort.c now, too.


- Heikki




Re: Yet another fast GiST build

2020-09-21 Thread Andrey M. Borodin



> 21 сент. 2020 г., в 13:45, Heikki Linnakangas  написал(а):
> 
> On 21/09/2020 11:08, Heikki Linnakangas wrote:
>> I think they need to, so that they can stamp the page with the LSN of
>> the WAL record. But GiST build is special in that regard, because it
>> stamps all pages with GistBuildLSN.
> 
> Actually, don't we have a problem with that, even before this patch? Even 
> though we set the LSN to the magic GistBuildLSN value when we build the 
> index, WAL replay will write the LSN of the record instead. That would mess 
> with the LSN-NSN interlock. After WAL replay (or in a streaming replica), a 
> scan on the GiST index might traverse right-links unnecessarily.

I think we don't set rightlinks during index build.

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 11:08, Heikki Linnakangas wrote:

I think they need to, so that they can stamp the page with the LSN of
the WAL record. But GiST build is special in that regard, because it
stamps all pages with GistBuildLSN.


Actually, don't we have a problem with that, even before this patch? 
Even though we set the LSN to the magic GistBuildLSN value when we build 
the index, WAL replay will write the LSN of the record instead. That 
would mess with the LSN-NSN interlock. After WAL replay (or in a 
streaming replica), a scan on the GiST index might traverse right-links 
unnecessarily.


- Heikki




Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 02:06, Tom Lane wrote:

Justin Pryzby  writes:

This also appears to break checksums.


Thanks, I'll go fix it.


I was wondering about that, because the typical pattern for use of
smgrextend for indexes seems to be

RelationOpenSmgr(rel);
PageSetChecksumInplace(page, lastblock);
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);

and gist_indexsortbuild wasn't doing either of the first two things.

gist_indexsortbuild_flush_ready_pages looks like it might be
a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
run hasn't gotten to anything except the pretty-trivial index
made in point.sql, so I don't have evidence about it.


I don't think a relcache invalidation can happen on the index we're 
building. Other similar callers call RelationOpenSmgr(rel) before every 
write though (e.g. _bt_blwritepage()), so perhaps it's better to copy 
that pattern here too.



Another interesting point is that all the other index AMs seem to WAL-log
the new page before the smgrextend call, whereas this code is doing it
in the other order.  I strongly doubt that both patterns are equally
correct.  Could be that the other AMs are in the wrong though.


My thinking was that it's better to call smgrextend() first, so that if 
you run out of disk space, you get the error before WAL-logging it. That 
reduces the chance that WAL replay will run out of disk space. A lot of 
things are different during WAL replay, so it's quite likely that WAL 
replay runs out of disk space anyway if you're living on the edge, but 
still.


I didn't notice that the other callers are doing it the other way round, 
though. I think they need to, so that they can stamp the page with the 
LSN of the WAL record. But GiST build is special in that regard, because 
it stamps all pages with GistBuildLSN.


- Heikki




Re: Yet another fast GiST build

2020-09-20 Thread Tom Lane
Justin Pryzby  writes:
> This also appears to break checksums.

I was wondering about that, because the typical pattern for use of
smgrextend for indexes seems to be

RelationOpenSmgr(rel);
PageSetChecksumInplace(page, lastblock);
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);

and gist_indexsortbuild wasn't doing either of the first two things.

gist_indexsortbuild_flush_ready_pages looks like it might be
a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
run hasn't gotten to anything except the pretty-trivial index
made in point.sql, so I don't have evidence about it.

Another interesting point is that all the other index AMs seem to WAL-log
the new page before the smgrextend call, whereas this code is doing it
in the other order.  I strongly doubt that both patterns are equally
correct.  Could be that the other AMs are in the wrong though.

regards, tom lane




Re: Yet another fast GiST build

2020-09-20 Thread Justin Pryzby
On Sun, Sep 20, 2020 at 05:10:05PM -0400, Tom Lane wrote:
> I wrote:
> > It appears that hyrax (CLOBBER_CACHE_ALWAYS) is not very happy
> > with this:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax=2020-09-19%2021%3A27%3A23
> 
> I reproduced that and traced it to a missing RelationOpenSmgr call.
> Fixed now.

This also appears to break checksums.

postgres=#  CREATE TABLE pvactst (i INT, a INT[], p POINT) with 
(autovacuum_enabled = off);
postgres=#  CREATE INDEX gist_pvactst ON pvactst USING gist (p);
postgres=#  INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM 
generate_series(1,1000) i;
WARNING:  page verification failed, calculated checksum 34313 but expected 0
ERROR:  invalid page in block 0 of relation base/12859/16389

I was able to make this work like so:

@@ -449,6 +450,7 @@ gist_indexsortbuild(GISTBuildState *state)
 
/* Write out the root */
PageSetLSN(pagestate->page, GistBuildLSN);
+   PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO, 
state->indexrel->rd_smgr);
smgrwrite(state->indexrel->rd_smgr, MAIN_FORKNUM, GIST_ROOT_BLKNO,
  pagestate->page, true);
if (RelationNeedsWAL(state->indexrel))
@@ -555,6 +557,7 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 
PageSetLSN(page, GistBuildLSN);
 
+   PageSetChecksumInplace(page, state->pages_written, 
state->indexrel->rd_smgr);
smgrextend(state->indexrel->rd_smgr,
   MAIN_FORKNUM,
   state->pages_written++,

-- 
Justin




Re: Yet another fast GiST build

2020-09-20 Thread Tom Lane
I wrote:
> It appears that hyrax (CLOBBER_CACHE_ALWAYS) is not very happy
> with this:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax=2020-09-19%2021%3A27%3A23

I reproduced that and traced it to a missing RelationOpenSmgr call.
Fixed now.

regards, tom lane




Re: Yet another fast GiST build

2020-09-20 Thread Tom Lane
Heikki Linnakangas  writes:
> Pushed. Thanks everyone!

It appears that hyrax (CLOBBER_CACHE_ALWAYS) is not very happy
with this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax=2020-09-19%2021%3A27%3A23

We have a recent pass from prion, showing that -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE doesn't cause a problem, so maybe hyrax's
result is just random cosmic rays or something.  But I doubt it.

regards, tom lane




Re: Yet another fast GiST build

2020-09-17 Thread Justin Pryzby
On Thu, Sep 17, 2020 at 11:38:47AM +0300, Heikki Linnakangas wrote:
> On 15/09/2020 14:36, Heikki Linnakangas wrote:
> > Another patch version, fixed a few small bugs pointed out by assertion
> > failures in the regression tests.
> 
> Pushed. Thanks everyone!

+/* FIXME: bump this before pushing! */
 #define CATALOG_VERSION_NO 202009031





Re: Yet another fast GiST build

2020-09-17 Thread Andrey M. Borodin



> 17 сент. 2020 г., в 13:38, Heikki Linnakangas  написал(а):
> 
> On 15/09/2020 14:36, Heikki Linnakangas wrote:
>> Another patch version, fixed a few small bugs pointed out by assertion
>> failures in the regression tests.
> 
> Pushed. Thanks everyone!

That's wonderful! Thank you, Heikki!

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-17 Thread Heikki Linnakangas

On 15/09/2020 14:36, Heikki Linnakangas wrote:

Another patch version, fixed a few small bugs pointed out by assertion
failures in the regression tests.


Pushed. Thanks everyone!

- Heikki




Re: Yet another fast GiST build

2020-09-16 Thread Kyotaro Horiguchi
At Wed, 16 Sep 2020 12:27:09 +0500, "Andrey M. Borodin"  
wrote in 
> I was thinking that machine epsilon is near 1e-300, but I was
> wrong. It's actually near 1e-15.

FWIW, the mantissa of double is effectively 52+1 bits, about 15.9
digits. so 1+(1e-16) is basically indistincitve from
1+(2e-16). Actually two double precisions 1+2e-16 and 1+3e-16 are
indistinctive from each other.

> Actually, I just want to understand what changes between v18 and v19 changed 
> on-page order of items. I look into patch diff and cannot figure it out. 
> There are only logging changes. How this affects scan?

FWIW, I saw the same symptom by my another patch after adding a value
to POINT_TBL. (But I didn't pursue the cause further..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Yet another fast GiST build

2020-09-16 Thread Heikki Linnakangas

On 16/09/2020 10:27, Andrey M. Borodin wrote:

Actually, I just want to understand what changes between v18 and v19
changed on-page order of items. I look into patch diff and cannot
figure it out. There are only logging changes. How this affects
scan?


The test was failing with v18 too.

- Heikki




Re: Yet another fast GiST build

2020-09-16 Thread Andrey M. Borodin



> 15 сент. 2020 г., в 22:07, Heikki Linnakangas  написал(а):
> 
> regression=#  SELECT f1, f1 <-> '0,1' FROM point_tbl ORDER BY f1 <-> '0,1';
>f1 | ?column?
> ---+--
> (0,0) |1
> (1e-300,-1e-300)  |1
> (-3,4)| 4.24264068711929
> (-10,0)   | 10.0498756211209
> (10,10)   | 13.4536240470737
> (-5,-12)  | 13.9283882771841
> (5.1,34.5)|  33.885985303662
> (1e+300,Infinity) | Infinity
> (NaN,NaN) |  NaN
>   |
> (10 rows)
> 
> It is arbitrary which one you get first.
> 
> It's not very nice to have a not-well defined order of rows in the expected 
> output, as it could change in the future if we change the index build 
> algorithm again. But we have plenty of cases that depend on the physical row 
> order, and it's not like this changes very often, so I think it's ok to just 
> memorize the new order in the expected output.


I think this is valid reasoning. GiST choose subtree algorithm is not 
deterministic, it calls random(), but not in tested paths. 
I was thinking that machine epsilon is near 1e-300, but I was wrong. It's 
actually near 1e-15.

Actually, I just want to understand what changes between v18 and v19 changed 
on-page order of items. I look into patch diff and cannot figure it out. There 
are only logging changes. How this affects scan?

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-15 Thread Heikki Linnakangas

On 15/09/2020 19:46, Andrey M. Borodin wrote:




15 сент. 2020 г., в 16:36, Heikki Linnakangas  написал(а):

Another patch version, fixed a few small bugs pointed out by assertion failures 
in the regression tests.

- Heikki



These changes in create_index.out do not seem correct to me

  SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
  f1
  ---
- (0,0)
   (1e-300,-1e-300)
+ (0,0)

I did not figure out the root cause yet. We do not touch anything related to 
distance computation..


Ah yeah, that's subtle. Those rows are considered to be equally distant 
from (0, 1), given the precision of the <-> operator:


regression=#  SELECT f1, f1 <-> '0,1' FROM point_tbl ORDER BY f1 <-> '0,1';
f1 | ?column?
---+--
 (0,0) |1
 (1e-300,-1e-300)  |1
 (-3,4)| 4.24264068711929
 (-10,0)   | 10.0498756211209
 (10,10)   | 13.4536240470737
 (-5,-12)  | 13.9283882771841
 (5.1,34.5)|  33.885985303662
 (1e+300,Infinity) | Infinity
 (NaN,NaN) |  NaN
   |
(10 rows)

It is arbitrary which one you get first.

It's not very nice to have a not-well defined order of rows in the 
expected output, as it could change in the future if we change the index 
build algorithm again. But we have plenty of cases that depend on the 
physical row order, and it's not like this changes very often, so I 
think it's ok to just memorize the new order in the expected output.


- Heikki




Re: Yet another fast GiST build

2020-09-15 Thread Andrey M. Borodin



> 15 сент. 2020 г., в 16:36, Heikki Linnakangas  написал(а):
> 
> Another patch version, fixed a few small bugs pointed out by assertion 
> failures in the regression tests.
> 
> - Heikki
> 

These changes in create_index.out do not seem correct to me

 SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
 f1 
 ---
- (0,0)
  (1e-300,-1e-300)
+ (0,0)

I did not figure out the root cause yet. We do not touch anything related to 
distance computation..

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-15 Thread Heikki Linnakangas

On 11/09/2020 09:02, Andrey M. Borodin wrote:

10 сент. 2020 г., в 17:43, Heikki Linnakangas 
написал(а):

One more patch version attached. I fixed some memory leaks, and
fixed the abbreviation on 32-bit systems, and a bunch of small
comment changes and such.




The patch looks fine to me. On my machine GiST for points is builded
10x faster than before the patch.


Another patch version, fixed a few small bugs pointed out by assertion 
failures in the regression tests.


- Heikki
>From fdf51af02513384949d4a26c2c8381e7715703b7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 15 Sep 2020 14:32:26 +0300
Subject: [PATCH v19 1/1] Add support for building GiST index by sorting.

This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by one.
The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.

The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.

As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml |  70 +++
 src/backend/access/gist/gistbuild.c| 510 +
 src/backend/access/gist/gistproc.c | 229 +
 src/backend/access/gist/gistutil.c |  53 ++-
 src/backend/access/gist/gistvalidate.c |   6 +-
 src/backend/access/transam/xloginsert.c|  57 +++
 src/backend/utils/sort/sortsupport.c   |  34 ++
 src/backend/utils/sort/tuplesort.c |  57 +++
 src/include/access/gist.h  |   3 +-
 src/include/access/gist_private.h  |   3 +
 src/include/access/xloginsert.h|   2 +
 src/include/catalog/catversion.h   |   1 +
 src/include/catalog/pg_amproc.dat  |   2 +
 src/include/catalog/pg_proc.dat|   3 +
 src/include/utils/sortsupport.h|   1 +
 src/include/utils/tuplesort.h  |   4 +
 src/test/regress/expected/create_index.out |   6 +-
 17 files changed, 935 insertions(+), 106 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..7c72a547409 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to
+   speed up building a GiST index.
  
 
  
@@ -1065,6 +1067,74 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator function preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field.
+   The comparator takes three arguments: two Datums to compare, and
+   a pointer to the SortSupport struct. The
+   Datums are the two indexed values in the format that they are stored
+   in the index; that is, in the format returned by the
+   compress method. The full API is defined in
+   src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum

Re: Yet another fast GiST build

2020-09-11 Thread Andrey M. Borodin



> 10 сент. 2020 г., в 17:43, Heikki Linnakangas  написал(а):
> 
> One more patch version attached. I fixed some memory leaks, and fixed the 
> abbreviation on 32-bit systems, and a bunch of small comment changes and such.
> 
> - Heikki
> 

The patch looks fine to me. On my machine GiST for points is builded 10x faster 
than before the patch.

Future action items:
1. Sort support for gist_btree data types
2. Better page borders with split and fillfactor
3. Consider sort build for tsvector

I'll certainly do 1 before next CF and most probably 2.
Item 1 is basically a lot of similar code for many many different types.
In Item 2 I plan to use Oleg's gevel to evaluation possibilities of MBR overlap 
reduction.

Item 3 seems tricky and need deeper evaluation: chances are sort build will 
decrease IndexScan performance in this case.

Thanks, Heikki!

Best regards, Andrey Borodin,



Re: Yet another fast GiST build

2020-09-10 Thread Heikki Linnakangas

On 09/09/2020 19:50, Andrey M. Borodin wrote:

9 сент. 2020 г., в 20:39, Heikki Linnakangas  написал(а):

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:

Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.

That's how it works in PostGIS, only that we moved to more
effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and noticed that 
the patch as it was never used it. I reworked the patch so that 
tuplesort_begin_index_gist() is responsible for looking up the sortsupport 
function, like tuplesort_begin_index_btree() does, and uses abbreviation when 
possible.

Wow, abbreviated sort made gist for points construction even 1.5x faster!
btw there is small typo in arg names in gist_bbox_zorder_cmp_abbrev(); z1,z2 -> 
a,b


One more patch version attached. I fixed some memory leaks, and fixed 
the abbreviation on 32-bit systems, and a bunch of small comment changes 
and such.


- Heikki
>From 79800152b6305e93129293452002cd082daadff4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 10 Sep 2020 15:37:08 +0300
Subject: [PATCH v18 1/1] Add support for building GiST index by sorting.

This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by one.
The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.

The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.

As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml  |  70 
 src/backend/access/gist/gistbuild.c | 506 
 src/backend/access/gist/gistproc.c  | 229 +++
 src/backend/access/gist/gistutil.c  |  53 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/sortsupport.c|  34 ++
 src/backend/utils/sort/tuplesort.c  |  57 +++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/catversion.h|   1 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/sortsupport.h |   1 +
 src/include/utils/tuplesort.h   |   4 +
 16 files changed, 928 insertions(+), 103 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..7c72a547409 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to
+   speed up building a GiST index.
  
 
  
@@ -1065,6 +1067,74 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator function preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field.
+   The comparator takes three arguments: two Datums to compare, and
+   a pointer to the 

Re: Yet another fast GiST build

2020-09-10 Thread Pavel Borisov
>
> Interesting to see also the size of index, it should be several times less.
>
> I've tested this above in CF thread and got ordered GiST index ~1.7 times
smaller than non-ordered one for single column of real points.


-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Yet another fast GiST build

2020-09-10 Thread Oleg Bartunov
On Mon, Sep 7, 2020 at 7:50 PM Heikki Linnakangas  wrote:
>
> On 07/09/2020 13:59, Pavel Borisov wrote:
>  I suppose there is a big jump in integer value (whether signed or
>  unsigned) as you cross from positive to negative floats, and then the
>  sort order is reversed.  I have no idea if either of those things is a
>  problem worth fixing.  That made me wonder if there might also be an
> >>
> >> I took a stab at fixing this, see attached patch (applies on top of your
> >> patch v14).
> >>
> >> To evaluate this, I used the other attached patch to expose the zorder
> >> function to SQL, and plotted points around zero with gnuplot. See the
> >> attached two images, one with patch v14, and the other one with this patch.
> >
> > I'd made testing of sorted SpGist build in cases of points distributed only
> > in 2d quadrant and points in all 4 quadrants and it appears that this
> > abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> > is really nice way to avoid what can be avoided and I'd like it is included
> > together with Andrey's patch.
>
> Thanks! Did you measure the quality of the built index somehow? The
> ordering shouldn't make any difference to the build speed, but it
> affects the shape of the resulting index and the speed of queries
> against it.
>
> I played with some simple queries like this:
>
> explain (analyze, buffers) select count(*) from points_good where p <@
> box(point(50, 50), point(75, 75));
>
> and looking at the "Buffers" line for how many pages were accessed.
> There doesn't seem to be any consistent difference between v14 and my
> fix. So I concur it doesn't seem to matter much.
>
>
> I played some more with plotting the curve. I wrote a little python
> program to make an animation of it, and also simulated how the points
> would be divided into pages, assuming that each GiST page can hold 200
> tuples (I think the real number is around 150 with default page size).
> In the animation, the leaf pages appear as rectangles as it walks
> through the Z-order curve. This is just a simulation by splitting all
> the points into batches of 200 and drawing a bounding box around each
> batch. I haven't checked the actual pages as the GiST creates, but I
> think this gives a good idea of how it works.

Heikki, you may use our gevel extension to visualize index tree
http://www.sai.msu.su/~megera/wiki/Gevel
I used it to investigate rtree index
http://www.sai.msu.su/~megera/wiki/Rtree_Index


>
> The animation shows that there's quite a lot of overlap between the
> pages. It's not necessarily this patch's business to try to improve
> that, and the non-sorting index build isn't perfect either. But it
> occurs to me that there's maybe one pretty simple trick we could do:
> instead of blindly filling the leaf pages in Z-order, collect tuples
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples,
> or something in that ballpark, or maybe go all the way up to work_mem.
> When the buffer fills up, call the picksplit code to divide the buffer
> into the actual pages, and flush them to disk. If you look at the
> animation and imagine that you would take a handful of pages in the
> order they're created, and re-divide the points with the split
> algorithm, there would be much less overlap.

Interesting to see also the size of index, it should be several times less.

>
> - Heikki



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin



> 9 сент. 2020 г., в 20:39, Heikki Linnakangas  написал(а):
> 
> On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
>> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:
>>> Come to think of it, the point z-order comparator could benefit a lot
>>> from key abbreviation, too. You could do the point -> zorder conversion
>>> in the abbreviation routine.
>> That's how it works in PostGIS, only that we moved to more
>> effecient Hilbert curve:
>> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171
> 
> Thanks, that's interesting.
> 
> I implemented the abbreviated keys for the point opclass, too, and noticed 
> that the patch as it was never used it. I reworked the patch so that 
> tuplesort_begin_index_gist() is responsible for looking up the sortsupport 
> function, like tuplesort_begin_index_btree() does, and uses abbreviation when 
> possible.
Wow, abbreviated sort made gist for points construction even 1.5x faster!
btw there is small typo in arg names in gist_bbox_zorder_cmp_abbrev(); z1,z2 -> 
a,b

> do we have regression test coverage for this?
Yes, sorting build for points is tested in point.sql, but with small dataset. 
index_including_gist.sql seems to be working with boxes, but triggers point 
paths too.

> , also on a SIZEOF_DATUM==4 system since the abbreviation works differently 
> with that, and push if nothing new comes up. And clarify the documentation 
> and/or comments that the sortsupport function sees "compressed" values.
> 
> I wonder if we could use sorting to also speed up building tsvector indexes? 
> The values stored there are bit signatures, what would be a good sort order 
> for those?
We need an order so that nearby values have a lot of bits in common.
What is the length of this signature?
For each 4 bytes we can compute number of 1s in it's binary representation. 
Then z-order these dwords as values 0-32.

This will be very inefficient grouping, but it will tend to keep empty and 
dense 4-byte regions apart.

Thanks for working on this!


Best regards, Andrey Borodin.




Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:


Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.


That's how it works in PostGIS, only that we moved to more
effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and 
noticed that the patch as it was never used it. I reworked the patch so 
that tuplesort_begin_index_gist() is responsible for looking up the 
sortsupport function, like tuplesort_begin_index_btree() does, and uses 
abbreviation when possible.


I think this is pretty much ready for commit now. I'll do a bit more 
testing (do we have regression test coverage for this?), also on a 
SIZEOF_DATUM==4 system since the abbreviation works differently with 
that, and push if nothing new comes up. And clarify the documentation 
and/or comments that the sortsupport function sees "compressed" values.


I wonder if we could use sorting to also speed up building tsvector 
indexes? The values stored there are bit signatures, what would be a 
good sort order for those?


- Heikki
>From 3a4d9c14631ae54b983d75433e0286f3dfedf432 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 9 Sep 2020 18:33:18 +0300
Subject: [PATCH v17 1/1] Add sort support for point gist_point_sortsupport

Implement GiST build using sort support

We use special sorting function provided by opclass to approximate
GiST tree with B-tree-like structure. This approach allows to
radically reduce build time in some cases.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml  |  66 
 src/backend/access/gist/gistbuild.c | 499 
 src/backend/access/gist/gistproc.c  | 224 +++
 src/backend/access/gist/gistutil.c  |  53 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/sortsupport.c|  34 ++
 src/backend/utils/sort/tuplesort.c  |  57 +++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/catversion.h|   1 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/sortsupport.h |   1 +
 src/include/utils/tuplesort.h   |   4 +
 16 files changed, 912 insertions(+), 103 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..b049094c811 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to speed up
+   building a GiST index.
  
 
  
@@ -1065,6 +1067,70 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator routine preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field,
+   the full API is defined in
+   src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum
+my_sortsupport(PG_FUNCTION_ARGS)
+{
+  SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+  ssup->comparator = my_fastcmp;
+  PG_RETURN_VOID();
+}
+
+  
+ 
+
   
 
   
diff 

Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:


Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.


That's how it works in PostGIS, only that we moved to more
effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and 
noticed that the patch as it was never used it. I reworked the patch so 
that tuplesort_begin_index_gist() is responsible for looking up the 
sortsupport function, like tuplesort_begin_index_btree() does, and uses 
abbreviation when possible.


I think this is pretty much ready for commit now. I'll do a bit more 
testing (do we have regression test coverage for this?), also on a 
SIZEOF_DATUM==4 system since the abbreviation works differently with 
that, and push if nothing new comes up. And clarify the documentation 
and/or comments that the sortsupport function sees "compressed" values.


I wonder if we could use sorting to also speed up building tsvector 
indexes? The values stored there are bit signatures, what would be a 
good sort order for those?


- Heikki




Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:

> On 09/09/2020 13:28, Andrey M. Borodin wrote:
> > Thanks Darafei!
> >
> >> 9 сент. 2020 г., в 12:05, Darafei Komяpa Praliaskouski
> >>  написал(а):
> >>
> >>> How does the 'sortsupport' routine interact with
> >>> 'compress'/'decompress'? Which representation is passed to the
> >>> comparator routine: the original value from the table, the
> >>> compressed representation, or the decompressed representation? Do
> >>> the comparetup_index_btree() and readtup_index() routines agree
> >>> with that?
> >>
> 
>
> Come to think of it, the point z-order comparator could benefit a lot
> from key abbreviation, too. You could do the point -> zorder conversion
> in the abbreviation routine.
>

That's how it works in PostGIS, only that we moved to more
effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 13:28, Andrey M. Borodin wrote:

Thanks Darafei!


9 сент. 2020 г., в 12:05, Darafei Komяpa Praliaskouski
 написал(а):


How does the 'sortsupport' routine interact with
'compress'/'decompress'? Which representation is passed to the
comparator routine: the original value from the table, the
compressed representation, or the decompressed representation? Do
the comparetup_index_btree() and readtup_index() routines agree
with that?


Currently we pass compressed values, which seems not very good. But
there was a request from PostGIS maintainers to pass values before
decompression. Darafei, please, correct me if I'm wrong. Also can
you please provide link on PostGIS B-tree sorting functions?

We were expecting to reuse btree opclass for this thing. This way
btree_gist extension will become a lot thinner. :)


I think if we aim at reusing B-tree sort support functions we have to
pass uncompressed values. They can be a lot bigger and slower in case
of PostGIS. We will be sorting actual geometries instead of MBRs.

In my view it's better to implement GiST-specific sort support in
btree_gist, rather than trying to reuse existing sort supports.


Yeah, I don't think reusing existing sortsupport functions directly is 
important. The comparison function should be short anyway for 
performance reasons, so it won't be a lot of code to copy-paste. And if 
there are some common subroutines, you can put them in a separate 
internal functions for reuse.


Using the 'compressed' format seems reasonable to me. It's natural to 
the gistbuild.c code, and the comparison routine can 'decompress' itself 
if it wishes. If the decompressions is somewhat expensive, it's 
unfortunate if you need to do it repeatedly in the comparator, but 
tuplesort.c would need pretty big changes to keep around a separate 
in-memory representation compare. However, you could use the sort 
"abbreviation" functionality to mitigate that.


Come to think of it, the point z-order comparator could benefit a lot 
from key abbreviation, too. You could do the point -> zorder conversion 
in the abbreviation routine.


- Heikki




Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin
Thanks Darafei!

> 9 сент. 2020 г., в 12:05, Darafei Komяpa Praliaskouski  
> написал(а):
> 
> > How does the 'sortsupport' routine interact with 'compress'/'decompress'? 
> > Which representation is passed to the comparator routine: the original 
> > value from the table, the compressed representation, or the decompressed 
> > representation? Do the comparetup_index_btree() and readtup_index() 
> > routines agree with that?
> 
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before 
> decompression.
> Darafei, please, correct me if I'm wrong. Also can you please provide link on 
> PostGIS B-tree sorting functions?
> 
> We were expecting to reuse btree opclass for this thing. This way btree_gist 
> extension will become a lot thinner. :)

I think if we aim at reusing B-tree sort support functions we have to pass 
uncompressed values. They can be a lot bigger and slower in case of PostGIS. We 
will be sorting actual geometries instead of MBRs.

In my view it's better to implement GiST-specific sort support in btree_gist, 
rather than trying to reuse existing sort supports.

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
Hi,


On Wed, Sep 9, 2020 at 9:43 AM Andrey M. Borodin 
wrote:

>
>
> > 9 сент. 2020 г., в 00:05, Heikki Linnakangas 
> написал(а):
> >
> > I've been reviewing the patch today. The biggest changes I've made have
> been in restructuring the code in gistbuild.c for readability, but there
> are a bunch of smaller changes throughout. Attached is what I've got so
> far, squashed into one patch.
> Thanks!
>
> > I'm continuing to review it, but a couple of questions so far:
> >
> > In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive
> == false'. That seems fishy, surely we need to index recently-dead tuples,
> too. The normal index build path isn't skipping them either.
> That's an oversight.
> >
> > How does the 'sortsupport' routine interact with
> 'compress'/'decompress'? Which representation is passed to the comparator
> routine: the original value from the table, the compressed representation,
> or the decompressed representation? Do the comparetup_index_btree() and
> readtup_index() routines agree with that?
>
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before
> decompression.
> Darafei, please, correct me if I'm wrong. Also can you please provide link
> on PostGIS B-tree sorting functions?
>

We were expecting to reuse btree opclass for this thing. This way
btree_gist extension will become a lot thinner. :)

Core routine for current sorting implementation is Hilbert curve, which is
based on 2D center of a box - and used for abbreviated sort:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gbox.c#L893


All the btree functions are wrappers around gserialized_cmp which just adds
a bunch of tiebreakers that don't matter in practice:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gserialized.c#L313

Base representation for index compressed datatype is GIDX, which is also a
box. We can make it work on top of it instead of the original
representation.
There is no such thing as "decompressed representation" unfortunately as
compression is lossy.


Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin



> 9 сент. 2020 г., в 00:05, Heikki Linnakangas  написал(а):
> 
> I've been reviewing the patch today. The biggest changes I've made have been 
> in restructuring the code in gistbuild.c for readability, but there are a 
> bunch of smaller changes throughout. Attached is what I've got so far, 
> squashed into one patch.
Thanks!

> I'm continuing to review it, but a couple of questions so far:
> 
> In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive == 
> false'. That seems fishy, surely we need to index recently-dead tuples, too. 
> The normal index build path isn't skipping them either.
That's an oversight.
> 
> How does the 'sortsupport' routine interact with 'compress'/'decompress'? 
> Which representation is passed to the comparator routine: the original value 
> from the table, the compressed representation, or the decompressed 
> representation? Do the comparetup_index_btree() and readtup_index() routines 
> agree with that?

Currently we pass compressed values, which seems not very good.
But there was a request from PostGIS maintainers to pass values before 
decompression.
Darafei, please, correct me if I'm wrong. Also can you please provide link on 
PostGIS B-tree sorting functions?

Thanks!

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-08 Thread Heikki Linnakangas

On 08/09/2020 21:33, Pavel Borisov wrote:

 > Thanks! Did you measure the quality of the built index somehow? The
 > ordering shouldn't make any difference to the build speed, but it
 > affects the shape of the resulting index and the speed of queries
 > against it.

Again I've tried random select tests near axes and haven't noticed any 
performance difference between ordinary gist build and z-ordered one. 
The same is for selects far from axes. Theoretically, there may be a 
possible slowdown for particular points inside the MBR which crosses the 
axis but I haven't tried to dig so deep and haven't tested performance 
as a function of coordinate.


So I feel this patch is not about select speed optimization.


Ok, thank for confirming.

I've been reviewing the patch today. The biggest changes I've made have 
been in restructuring the code in gistbuild.c for readability, but there 
are a bunch of smaller changes throughout. Attached is what I've got so 
far, squashed into one patch. I'm continuing to review it, but a couple 
of questions so far:


In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive 
== false'. That seems fishy, surely we need to index recently-dead 
tuples, too. The normal index build path isn't skipping them either.


How does the 'sortsupport' routine interact with 
'compress'/'decompress'? Which representation is passed to the 
comparator routine: the original value from the table, the compressed 
representation, or the decompressed representation? Do the 
comparetup_index_btree() and readtup_index() routines agree with that?


- Heikki
>From 7a9331bbd43799150d6a0b9dad2e98604c6b7dfc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 8 Sep 2020 21:56:41 +0300
Subject: [PATCH v16 1/1] Add sort support for point gist_point_sortsupport

Implement GiST build using sort support

We use special sorting function provided by opclass to approximate
GiST tree with B-tree-like structure. This approach allows to
radically reduce build time in some cases.

Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
Reviewed-by: Pavel Borisov, Thomas Munro
---
 doc/src/sgml/gist.sgml  |  64 +++
 src/backend/access/gist/gistbuild.c | 511 
 src/backend/access/gist/gistproc.c  | 154 +++
 src/backend/access/gist/gistutil.c  |  59 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/tuplesort.c  |  37 ++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/tuplesort.h   |   6 +
 13 files changed, 801 insertions(+), 106 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..bc45b3260f2 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to speed up
+   building a GiST index.
  
 
  
@@ -1065,6 +1067,68 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator routine preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+The SQL declaration of the function must look like this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+The argument is a pointer to a SortSupport struct.
+At a minimum, the function must fill in its comparator field, the full API
+is defined in src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum
+my_sortsupport(PG_FUNCTION_ARGS)
+{
+  SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+  ssup->comparator = 

Re: Yet another fast GiST build

2020-09-08 Thread Pavel Borisov
>
> > Thanks! Did you measure the quality of the built index somehow? The
> > ordering shouldn't make any difference to the build speed, but it
> > affects the shape of the resulting index and the speed of queries
> > against it.


Again I've tried random select tests near axes and haven't noticed any
performance difference between ordinary gist build and z-ordered one. The
same is for selects far from axes. Theoretically, there may be a possible
slowdown for particular points inside the MBR which crosses the axis but I
haven't tried to dig so deep and haven't tested performance as a function
of coordinate.

So I feel this patch is not about select speed optimization.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Yet another fast GiST build

2020-09-07 Thread Andrey M. Borodin



> 7 сент. 2020 г., в 19:10, Heikki Linnakangas  написал(а):
> 
> On 07/09/2020 13:59, Pavel Borisov wrote:
> I suppose there is a big jump in integer value (whether signed or
> unsigned) as you cross from positive to negative floats, and then the
> sort order is reversed.  I have no idea if either of those things is a
> problem worth fixing.  That made me wonder if there might also be an
>>> 
>>> I took a stab at fixing this, see attached patch (applies on top of your
>>> patch v14).
>>> 
>>> To evaluate this, I used the other attached patch to expose the zorder
>>> function to SQL, and plotted points around zero with gnuplot. See the
>>> attached two images, one with patch v14, and the other one with this patch.
>> 
>> I'd made testing of sorted SpGist build in cases of points distributed only
>> in 2d quadrant and points in all 4 quadrants and it appears that this
>> abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
>> is really nice way to avoid what can be avoided and I'd like it is included
>> together with Andrey's patch.
> 
> Thanks! Did you measure the quality of the built index somehow? The 
> ordering shouldn't make any difference to the build speed, but it 
> affects the shape of the resulting index and the speed of queries 
> against it.
I've tried to benchmark the difference between build time v14 and v15. v15 
seems to be slightly slower, but with negligible difference.

> I played with some simple queries like this:
> 
> explain (analyze, buffers) select count(*) from points_good where p <@ 
> box(point(50, 50), point(75, 75));
To observe IndexScan difference query should touch 4 quadrants. i.e. search 
within ((-25,-25),point(25,25))

> and looking at the "Buffers" line for how many pages were accessed. 
> There doesn't seem to be any consistent difference between v14 and my 
> fix. So I concur it doesn't seem to matter much.
> 
> 
> I played some more with plotting the curve. I wrote a little python 
> program to make an animation of it, and also simulated how the points 
> would be divided into pages, assuming that each GiST page can hold 200 
> tuples (I think the real number is around 150 with default page size). 
> In the animation, the leaf pages appear as rectangles as it walks 
> through the Z-order curve. This is just a simulation by splitting all 
> the points into batches of 200 and drawing a bounding box around each 
> batch. I haven't checked the actual pages as the GiST creates, but I 
> think this gives a good idea of how it works.
> The animation shows that there's quite a lot of overlap between the 
> pages. It's not necessarily this patch's business to try to improve 
> that, and the non-sorting index build isn't perfect either. But it 
> occurs to me that there's maybe one pretty simple trick we could do: 
> instead of blindly filling the leaf pages in Z-order, collect tuples 
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples, 
> or something in that ballpark, or maybe go all the way up to work_mem. 
> When the buffer fills up, call the picksplit code to divide the buffer 
> into the actual pages, and flush them to disk. If you look at the 
> animation and imagine that you would take a handful of pages in the 
> order they're created, and re-divide the points with the split 
> algorithm, there would be much less overlap.

Animation looks cool! It really pins the inefficiency of resulting MBRs.
But in R*-tree one of Beckman's points was that overlap optimisation worth 
doing on higher levels, not lower.
But we can do this for splits on each level, I think. We do not know tree depth 
in advance to divide maintenance workmem among level.. But, probably we don't 
need to, let's allocate half to first level, quarter to second, 1/8 to third 
etc until it's one page. Should we take allocations inside picksplit() into 
account?
The more I think about it the cooler idea seem to me.

BTW I've found one more bug in the patch: it writes WAL even for unlogged 
tables. I'm not sending a patch because changes are trivial and currently we 
already have lengthy patchset in different messages.
Also, to avoid critical section we can use log_new_page() instead of 
log_buffer().


Thanks!

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2020-09-07 Thread Pavel Borisov
>
>
> >> I suppose there is a big jump in integer value (whether signed or
> >> unsigned) as you cross from positive to negative floats, and then the
> >> sort order is reversed.  I have no idea if either of those things is a
> >> problem worth fixing.  That made me wonder if there might also be an
>
> I took a stab at fixing this, see attached patch (applies on top of your
> patch v14).
>
> To evaluate this, I used the other attached patch to expose the zorder
> function to SQL, and plotted points around zero with gnuplot. See the
> attached two images, one with patch v14, and the other one with this patch.
>

I'd made testing of sorted SpGist build in cases of points distributed only
in 2d quadrant and points in all 4 quadrants and it appears that this
abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
is really nice way to avoid what can be avoided and I'd like it is included
together with Andrey's patch.

Pavel.


Re: Yet another fast GiST build

2020-09-07 Thread Heikki Linnakangas

On 24/02/2020 10:50, Andrey M. Borodin wrote:

On 24 февр. 2020 г., at 01:58, Thomas Munro  wrote:

On Thu, Feb 20, 2020 at 10:14 AM Thomas Munro  wrote:

1.  We expect floats to be in IEEE format, and the sort order of IEEE
floats is mostly correlated to the binary sort order of the bits
reinterpreted as an int.  It isn't in some special cases, but for this
use case we don't really care about that, we're just trying to
encourage locality.


I suppose there is a big jump in integer value (whether signed or
unsigned) as you cross from positive to negative floats, and then the
sort order is reversed.  I have no idea if either of those things is a
problem worth fixing.  That made me wonder if there might also be an
endianness problem.  It seems from some quick googling that all
current architectures have integers and floats of the same endianness.
Apparently this wasn't always the case, and some ARMs have a weird
half-flipped arrangement for 64 bit floats, but not 32 bit floats as
you are using here.


Yes, this leap is a problem for point as generic data type. And I do not know
how to fix it. It can cause inefficient Index Scans when searching near (0,0) 
and query
window touches simultaneously all quadrants (4x slower).


I took a stab at fixing this, see attached patch (applies on top of your 
patch v14).


To evaluate this, I used the other attached patch to expose the zorder 
function to SQL, and plotted points around zero with gnuplot. See the 
attached two images, one with patch v14, and the other one with this patch.


I'll continue looking at these patches in whole tomorrow. I think it's 
getting close to a committable state.



But everything will be just fine when all data is in 2nd quadrant.


Simon Riggs and friends would agree :-)

- Heikki
>From c7cadbb017aa3ec446136c36bb58b10d35ed095a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 7 Sep 2020 12:23:20 +0300
Subject: [PATCH v15 3/4] Expose point_zorder() to SQL.

---
 src/backend/access/gist/gistproc.c | 19 +++
 src/include/catalog/pg_proc.dat|  4 
 2 files changed, 23 insertions(+)

diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index 387c66d3ca3..4ed9f46c9bb 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -1626,3 +1626,22 @@ gist_point_sortsupport(PG_FUNCTION_ARGS)
 	ssup->comparator = gist_bbox_fastcmp;
 	PG_RETURN_VOID();
 }
+
+/*
+ * Expose the Z-Order for debugging purposes
+ */
+Datum
+point_zorder(PG_FUNCTION_ARGS)
+{
+	Point	   *p = PG_GETARG_POINT_P(0);
+	uint64		zorder;
+
+	zorder = point_zorder_internal(p);
+
+	/*
+	 * XXX: Shift by one, so that when it's interpreted as a signed integer,
+	 * it's always positive. We lose the least-significant bit, but that's OK
+	 * for the quick plotting I'm using this for.
+	 */
+	return Int64GetDatum(zorder >> 1);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 96d7efd4270..9a98bed79e8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3004,6 +3004,10 @@
   proname => 'point_div', prorettype => 'point', proargtypes => 'point point',
   prosrc => 'point_div' },
 
+{ oid => '9110',
+  proname => 'point_zorder', prorettype => 'int8', proargtypes => 'point',
+  prosrc => 'point_zorder' },
+
 { oid => '1445',
   proname => 'poly_npoints', prorettype => 'int4', proargtypes => 'polygon',
   prosrc => 'poly_npoints' },
-- 
2.20.1

>From a7e0237d0bfd4909c0c4e0237013efe0cd071dd4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 7 Sep 2020 12:33:36 +0300
Subject: [PATCH v15 4/4] Map negative values better.

---
 src/backend/access/gist/gistproc.c | 169 +
 1 file changed, 121 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index 4ed9f46c9bb..7ca7eda84b0 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -31,10 +31,11 @@ static bool gist_box_leaf_consistent(BOX *key, BOX *query,
 	 StrategyNumber strategy);
 static bool rtree_internal_consistent(BOX *key, BOX *query,
 	  StrategyNumber strategy);
-static int64 part_bits32_by2(uint32 x);
-static int64 interleave_bits32(uint32 x, uint32 y);
-static inline uint64 point_zorder_internal(Point *p);
-static int gist_bbox_fastcmp(Datum x, Datum y, SortSupport ssup);
+static uint64 part_bits32_by2(uint32 x);
+static uint64 interleave_bits32(uint32 x, uint32 y);
+static uint32 ieee_float32_to_uint32(float f);
+static uint64 point_zorder_internal(Point *p);
+static int	gist_bbox_fastcmp(Datum x, Datum y, SortSupport ssup);
 
 
 /* Minimum accepted ratio of split */
@@ -1549,75 +1550,147 @@ gist_poly_distance(PG_FUNCTION_ARGS)
 
 /* Z-order routines */
 /* Interleave 32 bits with zeroes */
-static int64
+static uint64
 part_bits32_by2(uint32 x)
 {
 	uint64		n = x;
 
 	n = (n | (n << 16)) & 

Re: Yet another fast GiST build (typo)

2020-09-07 Thread Andrey M. Borodin


> 7 сент. 2020 г., в 12:14, Andrey M. Borodin  написал(а):
> 
> Maybe I'm missing something? Like forgot to log 10% of pages, or something 
> like that...

Indeed, there was a bug. I've fixed it, and I still observe same performance 
gain.

Best regards, Andrey Borodin.



v2-0001-nbtree-faster-logging.patch
Description: Binary data


Re: Yet another fast GiST build (typo)

2020-09-07 Thread Andrey M. Borodin


> 6 сент. 2020 г., в 18:26, Heikki Linnakangas  написал(а):
> 
> On 05/09/2020 14:53, Andrey M. Borodin wrote:
>> Thanks for ideas, Heikki. Please see v13 with proposed changes.
> 
> Thanks, that was quick!
> 
>> But I've found out that logging page-by-page slows down GiST build by
>> approximately 15% (when CPU constrained). Though In think that this
>> is IO-wise.
> Hmm, any ideas why that is? log_newpage_range() writes one WAL record for 32 
> pages, while now you're writing one record per page, so you'll have a little 
> bit more overhead from that. But 15% seems like a lot.

Hmm, this works for B-tree too.
this index creation
psql -c '\timing' -c "create table x as select random() from 
generate_series(1,1000,1);" -c "create index ON x (random );"
takes 7 seconds on may machine, but with one weird patch it takes only 6 :)

Maybe I'm missing something? Like forgot to log 10% of pages, or something like 
that...

Best regards, Andrey Borodin.


0001-nbtree-faster-logging.patch
Description: Binary data


Re: Yet another fast GiST build (typo)

2020-09-06 Thread Andrey M. Borodin


> 6 сент. 2020 г., в 18:26, Heikki Linnakangas  написал(а):
> 
> On 05/09/2020 14:53, Andrey M. Borodin wrote:
>> Thanks for ideas, Heikki. Please see v13 with proposed changes.
> 
> Thanks, that was quick!
> 
>> But I've found out that logging page-by-page slows down GiST build by
>> approximately 15% (when CPU constrained). Though In think that this
>> is IO-wise.
> Hmm, any ideas why that is? log_newpage_range() writes one WAL record for 32 
> pages, while now you're writing one record per page, so you'll have a little 
> bit more overhead from that. But 15% seems like a lot.
I do not know. I guess this can be some effect of pglz compression during cold 
stage. It can be slower and less compressive than pglz with cache table? But 
this is pointing into the sky.
Nevertheless, here's the patch identical to v13, but with 3rd part: log flushed 
pages with bunches of 32.
This brings CPU performance back and slightly better than before page-by-page 
logging.

Some details about test:
MacOS, 6-core i7
psql -c '\timing' -c "create table x as select point (random(),random()) from 
generate_series(1,1000,1);" -c "create index on x using gist (point);"

With patch v13 this takes 20,567 seconds, with v14 18,149 seconds, v12 ~18,3s 
(which is closer to 10% btw, sorry for miscomputation). This was not 
statistically significant testing, just a quick laptop benchmark with 2-3 tests 
to verify stability.

Best regards, Andrey Borodin.


v14-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch
Description: Binary data


v14-0002-Implement-GiST-build-using-sort-support.patch
Description: Binary data


v14-0003-Log-GiST-build-with-packs-of-32-pages.patch
Description: Binary data


Re: Yet another fast GiST build (typo)

2020-09-06 Thread Heikki Linnakangas

On 05/09/2020 14:53, Andrey M. Borodin wrote:

Thanks for ideas, Heikki. Please see v13 with proposed changes.


Thanks, that was quick!


But I've found out that logging page-by-page slows down GiST build by
approximately 15% (when CPU constrained). Though In think that this
is IO-wise.
Hmm, any ideas why that is? log_newpage_range() writes one WAL record 
for 32 pages, while now you're writing one record per page, so you'll 
have a little bit more overhead from that. But 15% seems like a lot.


- Heikki




Re: Yet another fast GiST build (typo)

2020-09-05 Thread Andrey M. Borodin


> 3 сент. 2020 г., в 23:40, Heikki Linnakangas  написал(а):
> 
> On 30/08/2020 15:04, Andrey M. Borodin wrote:
>>> 23 авг. 2020 г., в 14:39, Andrey M. Borodin  
>>> написал(а):
>>> 
>>> Thanks for reviewing and benchmarking, Pavel!
>> Pavel sent me few typos offlist. PFA v12 fixing these typos.
> 
> In gist_indexsortbuild(), you first build all the leaf pages. Then, you read 
> through all the index pages you just built, to form the tuples for the next 
> level, and repeat for all the upper levels. That seems inefficient, it would 
> be more better to form the tuples for the downlinks as you go, when you build 
> the leaf pages in the first place. That's how nbtsort.c works. Also, you 
> could WAL-log the pages as you go.
> 
> In gist_indexsortbuild_flush(), can't you just memcpy() the page from
> memory to the buffer?
> 
> - Heikki
Thanks for ideas, Heikki. Please see v13 with proposed changes.
But I've found out that logging page-by-page slows down GiST build by 
approximately 15% (when CPU constrained).
Though In think that this is IO-wise.

Thanks!

Best regards, Andrey Borodin.


v13-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch
Description: Binary data


v13-0002-Implement-GiST-build-using-sort-support.patch
Description: Binary data


Re: Yet another fast GiST build (typo)

2020-09-03 Thread Heikki Linnakangas

On 30/08/2020 15:04, Andrey M. Borodin wrote:

23 авг. 2020 г., в 14:39, Andrey M. Borodin  написал(а):

Thanks for reviewing and benchmarking, Pavel!


Pavel sent me few typos offlist. PFA v12 fixing these typos.


In gist_indexsortbuild(), you first build all the leaf pages. Then, you 
read through all the index pages you just built, to form the tuples for 
the next level, and repeat for all the upper levels. That seems 
inefficient, it would be more better to form the tuples for the 
downlinks as you go, when you build the leaf pages in the first place. 
That's how nbtsort.c works. Also, you could WAL-log the pages as you go.


In gist_indexsortbuild_flush(), can't you just memcpy() the page from
memory to the buffer?

- Heikki




Re: Yet another fast GiST build (typo)

2020-08-31 Thread Pavel Borisov
>
> Pavel sent me few typos offlist. PFA v12 fixing these typos.
> Thanks!
>

Now I consider the patch ready to be committed and mark it so on CF.
Thank you!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


  1   2   >