Re: [HACKERS] GSoC 2017 : Patch for predicate locking in Gist index

2017-10-02 Thread Andrew Borodin
On Mon, Oct 2, 2017 at 8:00 PM, Alexander Korotkov
 wrote:
> What happen if exactly this "continue" fires?
>
>> if (GistFollowRight(stack->page))
>> {
>> if (!xlocked)
>> {
>> LockBuffer(stack->buffer, GIST_UNLOCK);
>> LockBuffer(stack->buffer, GIST_EXCLUSIVE);
>> xlocked = true;
>> /* someone might've completed the split when we unlocked */
>> if (!GistFollowRight(stack->page))
>> continue;
>
>
> In this case we might get xlocked == true without calling
> CheckForSerializableConflictIn().
Indeed! I overlooked that. I remember this issue: we considered not
fixing incomplete splits under Serializable isolation, but dropped the
idea.
CheckForSerializableConflictIn() must follow every exclusive lock.

> I think it would be rather safe and easy for understanding to move
> CheckForSerializableConflictIn() directly before gistinserttuple().
The difference is that after taking the lock we have the conditions to
change the page, while before gistinserttuple() we have the actual
intent to change the page.

From the point of view of future development the first version is
better (it stays correct if new call sites appear later), but it has 3
calls while your proposal has 2.
It seems to me that CheckForSerializableConflictIn() before
gistinserttuple() is the better option for now.
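For reference, the two placements can be sketched side by side
(simplified C pseudocode assembled from the fragments quoted in this
thread, not the actual patch):

```c
/* Variant 1: a check after every point where the exclusive lock is
 * (re)taken, including the follow-right path that may "continue" */
LockBuffer(stack->buffer, GIST_UNLOCK);
LockBuffer(stack->buffer, GIST_EXCLUSIVE);
CheckForSerializableConflictIn(r, NULL, stack->buffer);
xlocked = true;
/* someone might've completed the split when we unlocked */
if (!GistFollowRight(stack->page))
    continue;

/* Variant 2: a single check placed where the page is actually modified */
CheckForSerializableConflictIn(r, NULL, stack->buffer);
gistinserttuple(...);
```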

Best regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2017 : Patch for predicate locking in Gist index

2017-10-02 Thread Andrew Borodin
Hi, Alexander!

Thanks for looking into the patch!

On Thu, Sep 28, 2017 at 3:59 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

>
>
> In gistdoinsert() you do CheckForSerializableConflictIn() only if page
> wasn't exclusively locked before (xlocked is false).
>
> if (!xlocked)
>> {
>> LockBuffer(stack->buffer, GIST_UNLOCK);
>> LockBuffer(stack->buffer, GIST_EXCLUSIVE);
>> CheckForSerializableConflictIn(r, NULL, stack->buffer);
>> xlocked = true;
>
>
> However, page might be exclusively locked before.  And in this case
> CheckForSerializableConflictIn() would be skipped.  That happens very
> rarely (someone fixes incomplete split before we did), but nevertheless.
>

If xlocked == true, the page was already checked for conflicts after
we took the exclusive lock on its buffer. I still do not see any
problem here...

Best regards, Andrey Borodin.


[HACKERS] Re: [HACKERS] Re: GiST API Advancement

2017-06-16 Thread Andrew Borodin
2017-06-16 17:06 GMT+03:00 Tom Lane <t...@sss.pgh.pa.us>:
> Yuan Dong <doff...@hotmail.com> writes:
>> From: Andrew Borodin <boro...@octonica.com>
>>> I think there is one more aspect of development: backward
>>> compatibility: it's impossible to update all existing extensions, and
>>> this feature is not major enough to justify breaking them.
>
>> I should still maintain original API of GiST after modification.
>
> To put a little context on that: we have always felt that it's okay
> to require extension authors to make small adjustments when going to
> a new major release.

Then maybe it's worth considering a more general API advancement at once?
Here's my wish list:
1. A choose-subtree function as an alternative to penalty
2. Bulk load support
3. A better split framework: currently many extensions pick two seeds
and spread the split candidates among them, some with subtle but noisy
mistakes
4. A better way to compare keys (this thread, actually)
5. Splitting into many portions

Best regards, Andrey Borodin.




Re: [HACKERS] GiST API Advancement

2017-06-16 Thread Andrew Borodin
Hi, Dong!

2017-06-15 21:19 GMT+05:00 Yuan Dong <doff...@hotmail.com>:
> I'm going to hack on my own. With the help of Andrew Borodin, I want to
> start the project by adding a third state to the collision check. The third
> state is:
>  the subtree is totally within the query. In this case, a GiST scan can
> scan down the subtree without key checks.

That's fine!

> After reading some code, I get this plan to modify the code:
>
> 1. Modify the consistent function of datatypes supported by GiST.
>
> 1.1 Start with cube, g_cube_consistent should return a third state when
> calling g_cube_internal_consistent. Inspired by Andrew Borodin, I have two
> solutions, 1)modify return value, 2) modify a reference type parameter.
>
> 2. Modify the gistindex_keytest in gistget.c to return multiple states
>
> 2.1 Need declare a new enum type(or define values) in gist_private.h or
> somewhere
>
> 3. Modify the gitsScanPage in gistget.c to deal with the extra state
>
> 4. Add a state to mark that the nodes under this GISTSearchItem are all
> within the query
>
> 4.1 Two ways to do this: 1) add a state to GISTSearchItem (preferred) 2)
> record all items of this kind somewhere else
>
> 4.2 We may use the block number as the key
>
> 4.3 Next time gistScanPage meets this item, just find all the leaves
> directly (needs a new function)
>
> After this, I shall start to benchmark the improvement and edit the code of
> other datatypes.
>
> Hope you hackers can give me some suggestions~

I think there is one more aspect of development: backward
compatibility. It's impossible to update all existing extensions, and
this feature is not major enough to justify breaking them.

Though for benchmarking purposes backward compatibility can be ignored.

Best regards, Andrey Borodin




Re: [HACKERS] GSoC 2017 weekly progress reports (week 2)

2017-06-13 Thread Andrew Borodin
2017-06-13 18:00 GMT+05:00 Shubham Barai :
>
> Project: Explicitly support predicate locks in index AMs besides b-tree
>
Hi, Shubham
Good job!

So, in current HEAD the test predicate_gist_2.spec generates false
positives, but with your patch it does not?
I'd suggest keeping the spec tests in the same branch as your code;
it's easier. It's also worth cleaning up the specs' style and adding a
few words to the documentation.

Kevin, all, what do you think: is it necessary to extend these tests
beyond Index Only Scan to Bitmap Index Scan, and maybe to the KNN
version of the scan too?
I couldn't find such tests for B-tree; do we have them?

Best regards, Andrey Borodin, Octonica.




Re: [HACKERS] Range Merge Join v1

2017-06-03 Thread Andrew Borodin
2017-06-02 19:42 GMT+05:00 Jeff Davis <pg...@j-davis.com>:
> On Tue, May 30, 2017 at 11:44 PM, Andrew Borodin <boro...@octonica.com> wrote:
>> 1. Are there any types, which could benefit from Range Merge and are
>> not covered by this patch?
>
> I thought about this for a while, and the only thing I can think of
> are range joins that don't explicitly use range types.

Let me try to write && in SQL:

select * from a, b where (a.min<=b.max and a.min>=b.min) or
(a.max<=b.max and a.max>=b.min);

Quite complicated. Here the user knows that min <= max, but the DB
doesn't. If we write it out precisely we get a huge number of permutations.
Here's what I think:
1. To me, this feature seems hard to implement.
2. This feature will also be hard to commit on its own, since its use
case will be rare.
3. It would be cool if this could yield unexpected performance for queries like
select * from t where x < c or a != z or [other conditions]
where the optimizer could think: "Aha! I'll sort it and do it fast".

I do not think range joins that don't explicitly use range types are
possible right now...

>> 2. Can Range Merge handle merge of different ranges? Like int4range()
>> && int8range() ?
>
> Right now, there aren't even casts between range types. I think the
> best way to handle that at this point would be to add casts among the
> numeric ranges. There may be advantages to supporting any two ranges
> where the contained types are part of the same opfamily, but it seems
> a little early to add that complication.
Agree.


> I think there are a couple more things that could be done if we want
> to. Let me know if you think these things should be done now, or if
> they should be a separate patch later when the need arises:
>
> * Support for r1 @> r2 joins (join on "contains" rather than "overlaps").
> * Better integration with the catalog so that users could add their
> own types that support range merge join.

1. I believe these changes can be incremental. As long as they provide
value to the end user, they are committable. This patch is not overly
complicated, but it's easier to do this stuff in small pieces.
2. The commitfest is 3 months away. If you have new versions of the
patch, I'll review them again. Maybe I'll spot some new things :)

Best regards, Andrey Borodin, Octonica.




Re: [HACKERS] GSoC 2017 : Proposal for predicate locking in gist index

2017-05-31 Thread Andrew Borodin
Hi, hackers!

2017-06-01 1:50 GMT+05:00 Kevin Grittner :
> > The main difference between b-tree and gist index while searching for a
> > target tuple is that in gist index, we can determine if there is a match or
> > not at any level of the index.
>
> Agreed.  A gist scan can fail at any level, but for that scan to
> have produced a different result due to insertion of an index entry,
> *some* page that the scan looked at must be modified.

Two things seem non-obvious to me.
First, I just do not know: can VACUUM erase a page that holds a
predicate lock? Right now GiST never deletes pages, as far as I know,
so this question is only a matter of future compatibility.

Second, when doing GiST inserts we can encounter an unfinished split.
That's not a frequent situation, but still: should we skip finishing
the split, or should we add it to the collision check too?

Best regards, Andrey Borodin, Octonica.




Re: [HACKERS] Range Merge Join v1

2017-05-31 Thread Andrew Borodin
Hi, Jeff!

Sorry for being late. Actually, I had several unsuccessful attempts to
find something wrong with the patch.
Here's my review.

in pathkey.c

ecs = (EquivalenceClass **) palloc(nClauses * sizeof(EquivalenceClass *));
scores = (int *) palloc(nClauses * sizeof(int));
range_ecs = palloc(nClauses * sizeof(bool));

The third assignment lacks a cast; for consistency it should read
range_ecs = (bool *) palloc(nClauses * sizeof(bool));

And I have a few questions:
1. Are there any types, which could benefit from Range Merge and are
not covered by this patch?
2. Can Range Merge handle merge of different ranges? Like int4range()
&& int8range() ?

My perf test script from the previous message was broken; here's a
fixed one in the attachment.

This patch implements the feature, contains new tests, passes old
tests, is documented, and is spec compliant. I do not see any reason
not to mark it "Ready for committer".

Best regards, Andrey Borodin.

\timing
create table r as
  select int4range(g, g+10) ir, g g
  from generate_series(1,100) g
  order by random();

create index r_idx on r using gist (ir);

create table s as
  select int4range(g+5, g+15) ir,g g
  from generate_series(1,100) g
  order by random();

create index s_idx on s using gist (ir);

vacuum analyze r;
vacuum analyze s;

--baseline performance using GiST
set enable_mergejoin=false;
explain analyze
  select * from r, s where r.ir && s.ir;
explain analyze
  select * from r, s where r.ir && s.ir;

--performance without GiST
set enable_mergejoin=true;
explain analyze
  select * from r, s where r.ir && s.ir;
explain analyze
  select * from r, s where r.ir && s.ir;

--performance in presence of expression index
create index r_idx1 on r(int4range(r.g, r.g+10));
create index s_idx1 on s(int4range(s.g+5, s.g+15));


explain analyze
  select * from r, s where int4range(r.g, r.g+10) && int4range(s.g+5, s.g+15);
explain analyze
  select * from r, s where int4range(r.g, r.g+10) && int4range(s.g+5, s.g+15);


--performance in presence of direct B-tree index
create index r_idx2 on r(ir);
create index s_idx2 on s(ir);

explain analyze
  select * from r, s where r.ir && s.ir;
explain analyze
  select * from r, s where r.ir && s.ir;
 

drop table r;
drop table s;

--here we test that performance is not affected by payload of other attributes in heap tuples
create table r as
  select int4range(g, g+10) ir, g g, 1::float pl1,1::float pl2,1::float 
pl3,1::float pl4,1::float pl5,1::float pl6,1::float pl7,1::float pl8,1::float 
pl9,1::float pl0
  from generate_series(1,100) g
  order by random();

create table s as
  select int4range(g+5, g+15) ir,g g, 1::float pl1,1::float pl2,1::float 
pl3,1::float pl4,1::float pl5,1::float pl6,1::float pl7,1::float pl8,1::float 
pl9,1::float pl0
  from generate_series(1,100) g
  order by random();


explain analyze
  select r.ir,s.ir from r, s where r.ir && s.ir;
explain analyze
  select r.ir,s.ir from r, s where r.ir && s.ir;
 

drop table r;
drop table s;


Re: [HACKERS] Allow GiST opclasses without compress/decompress functions

2017-05-29 Thread Andrew Borodin
2017-05-29 0:00 GMT+05:00 Tom Lane :
> But the opclass's consistent function might expect to be always given an
> untoasted input, in which case you'd need the decompress function to fix
> that up.

In cube, data is detoasted at least 4 times before reaching
g_cube_internal_consistent(), including once in g_cube_consistent().
But I'll address that in another patch.

Here's a new version of the patch, with:
1. Corrected docs
2. Any combination of compress/decompress/fetch allowed
3. (Existence of fetch) or (absence of compress) enables index-only scans

The last point seems counterintuitive, but I do not see anything wrong with it.

Best regards, Andrey Borodin, Octonica.


0001-Allow-uncompressed-GiST-2.patch
Description: Binary data



Re: [HACKERS] Allow GiST opclasses without compress/decompress functions

2017-05-28 Thread Andrew Borodin
On May 28, 2017 at 11:15 PM, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

Andrew Borodin <boro...@octonica.com> writes:
> 2017-05-28 22:22 GMT+05:00 Tom Lane <t...@sss.pgh.pa.us>:
>> 2. If compress/decompress are omitted, then we could support index-only
>> scans all the time, that is a no-op fetch function would work.  The
>> patch should cover that interaction too.

> I do not think so. Decompress has to bring the attribute to a state
> the consistent function can understand. In theory. I've checked a
> dozen or so types and have not found places where fetch should differ
> from decompress.

But if compress is omitted, then we know the in-index representation
is the same as the original.  Therefore the fetch function would always
be a no-op and we can implement IOS whether or not the opclass designer
bothered to supply one.

regards, tom lane

True. If there is no code to "hash" the original value, it is not hashed;
it's stored whole...

I'm no expert in toasting: cube's compress does nothing while cube's
decompress does detoasting. Is there a reason for this?

Best regards, Andrey Borodin.


Re: [HACKERS] Allow GiST opclasses without compress/decompress functions

2017-05-28 Thread Andrew Borodin
Thanks, Tom!

2017-05-28 22:22 GMT+05:00 Tom Lane <t...@sss.pgh.pa.us>:
> Andrew Borodin <boro...@octonica.com> writes:
>> Maybe we should make compress\decompress functions optional?
>
> 1. You'll need to adjust the documentation (gist.sgml) not just the code.
Sure, I'll check out where compression is mentioned and update docs.

> 2. If compress/decompress are omitted, then we could support index-only
> scans all the time, that is a no-op fetch function would work.  The
> patch should cover that interaction too.
I do not think so. Decompress has to bring the attribute to a state
the consistent function can understand. In theory. I've checked a
dozen or so types and have not found places where fetch should differ
from decompress.

> 3. Personally I wouldn't bother with the separate compressed[] flags,
> just look at OidIsValid(giststate->compressFn[i].fn_oid).
That would save a few bytes, but it would make the compress/decompress
code slightly harder to read. Though some comments would help.

> 4. I don't think you thought very carefully about the combinations
> where only one of the two functions is supplied.  I can definitely
> see that compress + no decompress could be sensible.  Less sure
> about the other case, but why not allow it?  We could just say that
> an omitted function is taken to represent a no-op transformation.
Well, I had only thought about excluding this situation (when only
one of the two functions is supplied). But indeed, I've seen many
opclasses where compress does real work (like simplifying the data)
while decompress is just a no-op.
I'll think more about it.
A decompress-only opclass seems to make no sense for adjusting a
tuple on an internal page: we decompress the attribute, update it,
then store it back, and later decompress it again. Looks odd. But this
does not look like a place where an opclass developer can introduce
new bugs, so I see no reason to forbid having only a decompress
function.

> Please add this to the next commitfest.

OK.
Thank you for your comments. I'll address them and put a patch to the
commitfest.

Best regards, Andrey Borodin.




[HACKERS] Allow GiST opclasses without compress/decompress functions

2017-05-25 Thread Andrew Borodin
Hi, hackers!

Currently, GiST stores each attribute in a compressed form.
That is, each time an attribute is written, the compress function is
called, and each time the attribute is accessed, the decompress
function is called.
Some types can't get any advantage out of this technique, since the
context of a single value is not enough to seed an effective
compression algorithm.
And so we have compress functions like this:
Datum
gist_box_compress(PG_FUNCTION_ARGS)
{
PG_RETURN_POINTER(PG_GETARG_POINTER(0));
}

https://github.com/postgres/postgres/blob/master/src/backend/access/gist/gistproc.c#L195
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/rangetypes_gist.c#L221
https://github.com/postgres/postgres/blob/master/contrib/seg/seg.c#L255
https://github.com/postgres/postgres/blob/master/contrib/cube/cube.c#L384

Maybe we should make the compress/decompress functions optional?
This also brings a bit of a performance gain: for the attached test I
observe 6% faster GiST builds and 4% faster scans.
Not a big deal, but something. What do you think?

In some cases there are strange things in the compress/decompress
code. E.g. cube's decompress function detoasts the entry twice, just
to be very sure :)



Best regards, Andrey Borodin, Octonica.
CREATE EXTENSION IF NOT EXISTS CUBE ;

DROP TABLE IF EXISTS TESTTABLE ;
DROP OPERATOR CLASS IF EXISTS gist_cube_ops_nocompress USING GIST;

CREATE OPERATOR CLASS gist_cube_ops_nocompress
FOR TYPE CUBE USING GIST AS
OPERATOR  3  && ,
OPERATOR  6  = ,
OPERATOR  7  @> ,
OPERATOR  8  <@ ,
OPERATOR  13 @ ,
OPERATOR  14 ~ ,

FUNCTION  1  g_cube_consistent (INTERNAL, CUBE, SMALLINT, OID, INTERNAL),
FUNCTION  2  g_cube_union (INTERNAL, INTERNAL),
--we do not have functions 3 and 4 (compress and decompress)
FUNCTION  5  g_cube_penalty (INTERNAL, INTERNAL, INTERNAL),
FUNCTION  6  g_cube_picksplit (INTERNAL, INTERNAL),
FUNCTION  7  g_cube_same (CUBE, CUBE, INTERNAL),
FUNCTION  8  g_cube_distance (INTERNAL, CUBE, SMALLINT, OID, INTERNAL),
FUNCTION  9  g_cube_decompress (INTERNAL); --fetch function, not for compression



BEGIN;
SELECT SETSEED(0.5);

CREATE TABLE TESTTABLE AS SELECT CUBE(RANDOM()) C FROM 
GENERATE_SERIES(1,50) I;

\timing

--testing time to create compressed index
CREATE INDEX COMPRESSED ON TESTTABLE USING GIST(C gist_cube_ops);
--testing time to create uncompressed index
CREATE INDEX UNCOMPRESSED ON TESTTABLE USING GIST(C gist_cube_ops_nocompress);

\timing
set enable_bitmapscan = false;

UPDATE PG_INDEX SET INDISVALID = FALSE WHERE INDEXRELID = 
'UNCOMPRESSED'::REGCLASS;
UPDATE PG_INDEX SET INDISVALID = TRUE WHERE INDEXRELID = 'COMPRESSED'::REGCLASS;

EXPLAIN ANALYZE select (SELECT COUNT(*) FROM TESTTABLE WHERE C <@ CUBE(-b,1)) 
from generate_series(1,50) b;
EXPLAIN ANALYZE select (SELECT COUNT(*) FROM TESTTABLE WHERE C <@ CUBE(-b,1)) 
from generate_series(1,50) b;

UPDATE PG_INDEX SET INDISVALID = TRUE WHERE INDEXRELID = 
'UNCOMPRESSED'::REGCLASS;
UPDATE PG_INDEX SET INDISVALID = FALSE WHERE INDEXRELID = 
'COMPRESSED'::REGCLASS;

EXPLAIN ANALYZE select (SELECT COUNT(*) FROM TESTTABLE WHERE C <@ CUBE(-b,1)) 
from generate_series(1,50) b;
EXPLAIN ANALYZE select (SELECT COUNT(*) FROM TESTTABLE WHERE C <@ CUBE(-b,1)) 
from generate_series(1,50) b;

COMMIT;





0001-Allow-uncompressed-GiST.patch
Description: Binary data



[HACKERS] Index Only Scan support for cube

2017-05-23 Thread Andrew Borodin
Hi, hackers!

Here's a small patch that implements the fetch function necessary for
Index Only Scans over the cube data type.
I reuse g_cube_decompress() instead of creating a new function
g_cube_fetch(); essentially, they both just have to detoast the data.

What do you think: would it be better to create a shallow copy of
g_cube_decompress instead?
Any other suggestions on the functionality?

This

CREATE TABLE SOMECUBES AS SELECT CUBE(X,X+1) C FROM GENERATE_SERIES(1,100) X;
CREATE INDEX SOMECUBES_IDX ON SOMECUBES USING GIST(C);
SET ENABLE_SEQSCAN = FALSE;
EXPLAIN (COSTS OFF ) SELECT C FROM SOMECUBES WHERE C<@CUBE(30,40);

now produces

Index Only Scan using somecubes_idx on somecubes
  Index Cond: (c <@ '(30),(40)'::cube)

instead of

Index Scan using somecubes_idx on somecubes
  Index Cond: (c <@ '(30),(40)'::cube)



Best regards, Andrey Borodin, Octonica.


cubefetch.patch
Description: Binary data



Re: [HACKERS] On How To Shorten the Steep Learning Curve Towards PG Hacking...

2017-04-30 Thread Andrew Borodin
Hi, Kang and everyone in this thread.

I'm planning to present the online course "Hacking PostgreSQL: data
access methods in action and under the hood" on edX on June 1st. It's
not announced yet, links will be available later.
In this course I describe the information that was crucial for me to
start hacking. Currently my knowledge of the technologies behind
Postgres is quite limited, though I know the boundary of my knowledge
quite well.

Chances are that a description from my point of view will not be helpful
in some cases: before starting to contribute to Postgres I already held
a PhD in CS in database technology and had already implemented 3
different commercial DBMSs (all with different technologies, PLs,
paradigms, focuses, and problems being solved). And still, producing a
minimally viable contribution took 3 months (I was hacking for an hour
a day, mostly in the evenings).
That's why I decided it's worth talking about how to get there before
I've fully arrived. It's quite easy to forget that some concepts are
really hard before you grasp them.

The course will cover:
1. Major differences of Postgres from others
2. Dev tools as I use them
3. Concept of paged memory, latches and paged data structures
4. WAL, recovery, replication
5. Concurrency and locking in B-trees
6. GiST internals
7. Extensions
8. Text search and some of GIN
9. Postgres community mechanics
Every topic will consist of two parts: 1 - video lectures on YouTube
(in English and Russian; BTW my English is far from perfect) with
references to docs and other resources, 2 - practical tasks where you
change the code slightly and observe the differences (this part is
mostly to help the student find easy entry points).

Best regards, Andrey Borodin, Octonica.




Re: [HACKERS] Merge join for GiST

2017-04-28 Thread Andrew Borodin
Hi, hackers!

I've adapted the crossmatch join from pgSphere to cube for performance tests.
I've placed spatial join code here
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c
and node code here
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c

If you have ideas for improving the performance of this code, please
do not hesitate to share them.
One of the performance bottlenecks is the code around heap_getattr()
in fetch_next_pair().

==Performance Tests==
I've tested performance on queries which aggregate result of the
spatial join. See cube_test.sql attached.

On 3d data, Spatial Join performs 3x faster than Nested Loop Join + GiST Scan

Nested Loop + Index Scan plan:
 HashAggregate  (cost=36841568.00..36841570.00 rows=200 width=40)
(actual time=206565.869..206738.307 rows=298731 loops=1)
   Group Key: r.nx
   ->  Nested Loop  (cost=0.41..25591568.00 rows=9 width=40)
(actual time=0.357..200838.416 rows=8464147 loops=1)
 ->  Seq Scan on regions r  (cost=0.00..6410.00 rows=30
width=40) (actual time=0.015..324.436 rows=30 loops=1)
 ->  Index Scan using idx on datatable a  (cost=0.41..55.28
rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=30)
   Index Cond: (r.c @> c)
 Planning time: 17.175 ms
 Execution time: 206806.926 ms

Time: 206852,635 ms (03:26,853)

Spatial Join plan:
HashAggregate  (cost=56250001.00..56250003.00 rows=200 width=40)
(actual time=67373.644..67553.118 rows=298731 loops=1)
   Group Key: r.nx
   ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=45
width=40) (actual time=0.151..61718.804 rows=8464147 loops=1)
 Outer index: idx
 Inner index: idx1
 Planning time: 0.182 ms
 Execution time: 67630.742 ms

Time: 67631,557 ms (01:07,632)

But on more realistic 7D data, with queries emulating an OLAP system,
Spatial Join performs 2 times worse than Nested Loop Join + GiST Scan,
which comes as a complete surprise to me.
I do not see any algorithmic reason for Spatial Join to be slower.
Thus I strongly suspect that my implementation is inefficient, but as
of now I have no ideas on how to improve it.

Here are plans for 7D
Nested Loop + Index Scan
 HashAggregate  (cost=3425143.00..3425743.00 rows=6 width=72)
(actual time=122794.715..122822.893 rows=6 loops=1)
   Group Key: r.nx
   ->  Nested Loop  (cost=0.41..2075143.00 rows=6000 width=72)
(actual time=0.311..100478.710 rows=39817008 loops=1)
 ->  Seq Scan on r1 r  (cost=0.00..2419.00 rows=6
width=128) (actual time=0.043..60.579 rows=6 loops=1)
 ->  Index Scan using ix_a1_cube on a1 a  (cost=0.41..24.55
rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=6)
   Index Cond: (c <@ r.c)
 Planning time: 0.349 ms
 Execution time: 122831.353 ms
(8 rows)

Spatial Join
 HashAggregate  (cost=6750001.00..6750601.00 rows=6 width=72)
(actual time=241832.855..241889.360 rows=6 loops=1)
   Group Key: r.nx
   ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=3
width=72) (actual time=0.140..216187.111 rows=39817008 loops=1)
 Outer index: ix_r1_cube
 Inner index: ix_a1_cube
 Planning time: 0.533 ms
 Execution time: 241907.569 ms
(7 rows)

Time: 241910,440 ms (04:01,910)

Any ideas will be highly appreciated.

Best regards, Andrey Borodin.
create extension if not exists cube;

begin transaction;
SELECT setseed(.43);

\timing

create table dataTable as select cube(array[random(),random(),random()]) c, 
1::float as x1, 1::float as x2, 1::float as x3, 1::float as x4 from 
generate_series(1,3e6,1) s;
create index idx on dataTable using gist(c);

create table regions as select cube(array[x,y,z],array[x+0.1,y+0.01,z+0.01]) c, 
row_number() over() as nx from (select random() x,random() y, random() z from 
generate_series(1,3e5,1) s) q;
create index idx1 on regions using gist(c);

set max_parallel_workers_per_gather = 0;

explain analyze select 
sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4
from dataTable a
join regions r
on 
r.c @> a.c
group by r.nx;

explain analyze select 
sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4
from dataTable a
join regions r
on 
r.c && a.c
group by r.nx;

commit;



Re: [HACKERS] PG 10 release notes

2017-04-26 Thread Andrew Borodin
Hi, Bruce!

2017-04-25 6:31 GMT+05:00 Bruce Momjian :
> The only unusual thing is that this release has ~180 items while most
> recent release have had ~220.  The pattern I see that there are more
> large features in this release than previous ones.

I'm not sure, but it's probably worth mentioning this [1,2] in the
E.1.3.1.2. Indexes section:
Better tuple management on the GiST page allows faster inserts and
updates. (In numbers: 15% for randomized data, 3x for ordered data in
specific cases)

[1] https://commitfest.postgresql.org/10/661/
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b1328d78f88cdf4f7504004159e530b776f0de16

Best regards, Andrey Borodin.




Re: [HACKERS] Range Merge Join v1

2017-04-21 Thread Andrew Borodin
Hi, Jeff!

I'm planning to provide a review for this patch in the future. I've
done some performance testing (see attachment).
All in all, nothing of concern; everything works fine.
Tests were executed on current HEAD on Ubuntu 16 on a MacBook Air.

I observe that:
1. The patch brings a performance improvement for the specified query
2. The performance improvement of Range Merge Join vs GiST Nested Loop
Join (on Index Only Scan) is between 4x and 6x
3. The performance improvement of RMJ with a B-tree index vs GiST is
between 4x and 10x
(due to the lack of actually competing cases I omit t-tests here; they
are not that important when speaking about a 4x difference)
4. Range Merge Join is capable of consuming an expression index with
exactly the same effect as a direct index
5. Other attributes residing in the joined tables do not affect the
performance improvement

I'll do the code review some time later, probably next week. (I hope
there is no urgency here, since the commitfest is in July; or is it
urgent?) BTW, I fixed a typo in the index specification of the original
performance test (both indexes were on the same table).

Best regards, Andrey Borodin.


perf2.sql
Description: Binary data



Re: [HACKERS] Merge join for GiST

2017-04-13 Thread Andrew Borodin
2017-04-13 11:30 GMT+05:00 Jeff Davis :
> I don't quite follow. I don't think any of these proposals uses btree,
> right? Range merge join doesn't need any index, your proposal uses
> gist, and PgSphere's crossmatch uses gist.

Merge join uses presorted data, and B-tree provides sorted data.
Merge Join cannot join unsorted data, can it?

Indeed, B-tree is not the only provider of sorted data. In this light,
Range Merge Join is even more generic.
Best regards, Andrey Borodin.




Re: [HACKERS] Merge join for GiST

2017-04-12 Thread Andrew Borodin
2017-04-13 7:01 GMT+05:00 Jeff Davis :
> On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov
>  wrote:
>> On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis  wrote:
>>> Do you have a sense of how this might compare with range merge join?
>>
>>
>> If you have GiST indexes over ranges for both sides of join, then this
>> method could be used for range join.  Hence, it could be compared with range
>> merge join.
>> However, particular implementation in pgsphere uses hardcoded datatypes and
>> operations.
>> Thus, for range join we need either generalized version of GiST-based join
>> or special implementation for ranges.
>
> Alexander, Andrew,
>
> How do you think we should proceed? Which projects do you think should
> eventually be in core, versus which are fine as extensions?

Some points in favor of range joins via nbtree:
1. It's more efficient, since B-tree beats GiST here
2. It is already in patch form

Point against:
1. The particular implementation is somewhat of a leaky abstraction. Inside
the planner, you check not for capabilities of the operator and type, but
for a specific op and type. But I do not know how to fix this.

So, here is my opinion: if, inside the planner, we can understand that a
join condition involves range specifics (for all available range types) and
do a Range Merge Join, then this is the preferred solution.

Yes, Spatial Join, when available, will provide roughly the same scan
performance. But B-trees are better known to users new to
PostgreSQL, and B-trees are faster.

Best regards, Andrey Borodin.




Re: [HACKERS] Merge join for GiST

2017-04-11 Thread Andrew Borodin
2017-04-11 14:17 GMT+05:00 Alexander Korotkov :
> FYI, I've implemented this algorithm for pgsphere.  See following branch.
> https://github.com/akorotkov/pgsphere/tree/experimental
> It's implemented as crossmatch() function which takes as arguments names of
> two indexes over spoint and maximum distance (it checks not overlapping but
> proximity of points).  This function returns setof pairs of TIDs.
>
> Later, Dmitry Ivanov made it a custom scan node.
> https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode
>
> You also can find some experimental evaluation here:
> http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf
>
> Feel free to experiment with this code and/or ask any question.

Wow, that's cool! Thanks! That code actually answers a lot of questions.
I'll try to generalize that for && operator

Best regards, Andrey Borodin.




Re: [HACKERS] Merge join for GiST

2017-04-11 Thread Andrew Borodin
2017-04-10 20:38 GMT+05:00 Robert Haas <robertmh...@gmail.com>:
> On Mon, Apr 10, 2017 at 7:53 AM, Andrew Borodin <boro...@octonica.com> wrote:
>> I think this idea is somewhat related to this patch [2], but as for
>> now cannot describe how exactly GiST merge and Range Merge features
>> relate.
>
> It also seems somewhat related to Peter Moser's work on ALIGN and
> NORMALIZE.  It would be nice if the various groups of people
> interested in improving PostgreSQL's spatial stuff got together and
> reviewed each others' patches.  As a non-spatial guy myself, it's
> pretty hard to decide on the relative merits of different proposed
> approaches.

Hi, Robert!

Thank you for the pointer. Temporal features are not exactly within my
scope, but you are right, the topics are close to each other. I'll look
into the patch with temporal features and assess whether I can provide
a meaningful review.

I do not know how to gather the attention of everyone who is interested in
this kind of feature. I read the hackers@ digest regularly and use search
a lot, but that temporal work slipped away from my attention.

Best regards, Andrey Borodin.




[HACKERS] Merge join for GiST

2017-04-10 Thread Andrew Borodin
Hello, hackers!

==Spatial joins==
Scientific papers from the dawn of R-trees and multidimensional
indexes feature a lot of algorithms for spatial joins.
That is: you have two sets of geometries s1 and s2, and you need to produce
all colliding pairs (p1, p2) where p1 is in s1 and p2 is in s2. For two
R-trees of equal height with roots R and S, the most straightforward[1]
algorithm looks like:

SpatialJoin (R, S: Node)
{
  FOR (all ES in S)
    FOR (all ER in R with ER.rect intersecting ES.rect)
      IF (R is a leaf page)
      {
        Output (ER, ES)
      }
      ELSE
      {
        SpatialJoin (ER.ref, ES.ref)
      }
}
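The SpatialJoin pseudocode above can be sketched as a runnable toy, here with 1-D intervals standing in for rectangles (all names and the interval representation are illustrative assumptions, not anything from the R-tree papers or PostgreSQL):

```python
# Toy model of the synchronized R-tree traversal above (illustrative only).
# A "rect" is a 1-D interval (lo, hi); children are (rect, ref) entries,
# where ref is a child Node for internal pages or a tuple id for leaves.

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

class Node:
    def __init__(self, rect, children=None, leaf=False):
        self.rect = rect                  # bounding interval of this page
        self.children = children or []    # list of (rect, ref) entries
        self.leaf = leaf

def spatial_join(r, s, out):
    # Mirrors SpatialJoin(R, S): for every entry of S, scan the entries
    # of R whose rects intersect it; recurse until the leaf level.
    for es_rect, es_ref in s.children:
        for er_rect, er_ref in r.children:
            if overlaps(er_rect, es_rect):
                if r.leaf:
                    out.append((er_ref, es_ref))      # colliding pair
                else:
                    spatial_join(er_ref, es_ref, out)
```

Both trees are assumed to be of equal height, as in the Brinkhoff et al. formulation, so r and s reach the leaf level together.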


==Motivation==
I’m mostly interested in OLAP-style aggregates like:

SELECT rows.Id, sum(data.attribute1), sum(data.attribute2), sum(data.attribute3)
FROM rows LEFT JOIN data ON rows.IndexedAttribute && data.IndexedAttribute
GROUP BY 1

When we have two GiSTs, one for rows.IndexedAttribute and one for
data.IndexedAttribute.
Currently, this query produces plans with a Nested Loop Join, so for
every row from "rows" there is one GiST scan into "data".


==Proposal==
Obviously, for this scenario, we can use a GiST-based join algorithm
identical to the SpatialJoin algorithm above.

This algorithm will work iff (key1 && key2) ==> (union(key1,key3) &&
key2) and (union(key2,key3) && key1) for any key3. This is quite
straightforward for the && operator, while I'm not sure this will work for
<@, @>, and others.
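For the && (overlap) case, the condition is easy to convince oneself of with a brute-force check over small 1-D intervals (a sketch under the assumption that && is interval overlap and union is the bounding union):

```python
from itertools import product

def overlap(a, b):        # the && operator on 1-D intervals
    return a[0] <= b[1] and b[0] <= a[1]

def union(a, b):          # bounding union, as in R-tree internal keys
    return (min(a[0], b[0]), max(a[1], b[1]))

# Check exhaustively: (k1 && k2) ==> (union(k1, k3) && k2) for any k3.
intervals = [(lo, hi) for lo in range(5) for hi in range(lo, 5)]
for k1, k2, k3 in product(intervals, repeat=3):
    if overlap(k1, k2):
        assert overlap(union(k1, k3), k2)
```

The property holds because union(k1, k3) always contains k1, and enlarging an interval can only add overlaps, never remove them.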

I can try to implement this kind of join as an extension or try to
embed this into the planner.

I think this idea is somewhat related to this patch [2], but as for
now cannot describe how exactly GiST merge and Range Merge features
relate.


What do you think:
1. Has this idea been considered before? I have not found anything on
an actual spatial join of GiSTs.
2. What is the easiest way to implement this feature?
3. What kinds of operators may be spatial-joined without intervention
into the existing GiST scan strategies API?

Many thanks for reading.

Best regards, Andrey Borodin.

[1] Brinkhoff, Thomas, Hans-Peter Kriegel, and Bernhard Seeger.
Efficient processing of spatial joins using R-trees. Vol. 22. No. 2.
ACM, 1993.
[2] 
https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com#camp0ubfwaffw3o_ngkqprpmm56m4wetexjprb2gp_nrdaec...@mail.gmail.com




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-03-22 Thread Andrew Borodin
2017-03-22 22:48 GMT+05:00 Teodor Sigaev :
> hasEmptyChild? and hasNonEmptyChild (BTW, isAnyNonempy has missed 't')

Yes, I think this naming is good. It's clear what's in common in these
flags and what's different.

> And if the whole posting tree is empty,then we could mark root page as leaf
> and remove all other pages in tree without any locking. Although, it could
> be a task for separate patch.

From the performance point of view, this is a very good idea, for both
the performance of VACUUM and the performance of scans. But in doing so we
risk leaving some garbage pages behind in case of a crash, and I do not see
how to avoid that without unlinking pages one by one. I agree that
leaving this trick for a separate patch is quite reasonable.

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-03-21 Thread Andrew Borodin
Hi, Teodor!

2017-03-21 20:32 GMT+05:00 Teodor Sigaev :
> I had a look on patch

That's great, thanks!

>
> /*
>  * All subtree is empty - just return TRUE to indicate that parent
> must
>  * do a cleanup. Unless we are ROOT an there is way to go upper.
>  */
>
> if(isChildHasVoid && !isAnyNonempy && !isRoot)
> return TRUE;
>
> if(isChildHasVoid)
> {
> ...
> ginScanToDelete(gvs, blkno, TRUE, ,
> InvalidOffsetNumber);
> }
>
> In first 'if' clause I see !isRoot, so second if and corresponding
> ginScanToDelete() could be called only for root in posting tree.

No, the second conditional code will be called for any subtree which
contains a totally empty subtree. The !isRoot check covers the case when
the entire posting tree should be erased: we cannot just quit out of the
recursive cleanup, we have to make a scan here, starting from the root.

Probably, the variable isChildHasVoid has a somewhat confusing name. This
flag indicates that some subtree:
1. Had empty pages
2. Did not bother deleting them, because there is a chance that it is
a part of a bigger empty subtree.
Maybe it'd be better to call the variable "someChildIsVoidSubtree".

>just remove lock for cleanup over ginVacuumPostingTreeLeaves() and if it 
>returns that tree contains empty page then lock the whole posting tree to do 
>ginScanToDelete() work.

It is, indeed, a viable approach. But the currently proposed patch is
capable of dealing with small page deletes without much locking
fuss, even in 4-5 level trees.

How do you think, which way should we take?

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-03-16 Thread Andrew Borodin
2017-03-16 23:55 GMT+05:00 Peter Geoghegan <p...@bowt.ie>:
> On Thu, Mar 16, 2017 at 11:53 AM, Andrew Borodin <boro...@octonica.com> wrote:
>> 2.  Thus, a L&Y-style fully concurrent vacuum is possible, indeed, and
>> furthermore Teodor suggested that I should implement not only page
>> eviction, but also page merge and a tree condense algorithm.
>
> I think that it's very hard to make merging of pages that are not
> completely empty work, while also using the L&Y algorithm.

That's true. This is a distant plan...

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-03-16 Thread Andrew Borodin
2017-03-16 21:27 GMT+05:00 David Steele :
> This patch applies cleanly and compiles at cccbdde.
>
> Jeff, any thoughts on Andrew's responses?

Hi, David!

I've got some updates on the matter of this patch, since my
understanding of the B-tree bothered me much.
Currently, I'm at PgConf.Russia, where I've contacted Teodor Sigaev,
and he answered my questions about GIN.
0. I think that the proposed patch is safe (deadlock free, does not
introduce new livelocks, all the resources are guarded properly)
1. There _are_ high keys in the posting trees, they are just called
rightmost keys, but in fact they are high keys in terms of the L&Y
algorithm.
2. Thus, a L&Y-style fully concurrent vacuum is possible, indeed, and
furthermore Teodor suggested that I should implement not only page
eviction, but also page merge and a tree condense algorithm.
3. Eventually, I'll do that, certainly, but currently I can't
predict the time it'll take. I think I'll start somewhere in the
summer, maybe right after GiST intrapage indexing.

As for now, I think that having this patch in PostgreSQL 10 is viable.

Best regards, Andrey Borodin.




Re: [HACKERS] [GSoC] Self introduction and question to mentors

2017-03-09 Thread Andrew Borodin
Hi, Kirill!

2017-03-06 12:41 GMT+05:00 Кирилл Бороздин :
> My name is Kirill Borozdin. I am a student of Ural Federal University from
> Yekaterinburg, Russia.
That's fine. I'm an associate professor at Ural Federal University,
though not at your institute.

> I understand that this task will require a big number of experiments because
> we want to achieve
> a speed-up in real life cases.
>
> Can you recommend me some articles or specific source files in PostgreSQL
> codebase (it is so huge!)
> which I can explore to be better prepared (in case I will be lucky enough to
> be selected for GSoC)?

Well, this project is not that specific to PostgreSQL. Everything you
need to compile and use Postgres you can find on the Wiki, or you can
just ask me directly.

For starters, I'd recommend looking up some sorting-algorithm survey
papers on Google Scholar. There is a good paper by Goetz Graefe on
sorting, but it reflects the current situation rather than some future
algorithms :)

Something on cache-oblivious sorting and on cache
http://igoro.com/archive/gallery-of-processor-cache-effects/

Finally, the papers referenced in the Wiki page about GSoC for this
project are a good starting point.

As for tests, TPC is kind of ubiquitous; we could just adapt the standard
pgbench script to involve more sorting.

Best regards, Andrey Borodin.




Re: [HACKERS] [GSoC] Personal presentation and request for clarification

2017-03-03 Thread Andrew Borodin
Hi!

It's great that you are interested in PostgreSQL!

I'll answer the question on the matter of GSoC project proposed by me.
I hope someone else will handle questions on your primary objective.


2017-03-03 4:15 GMT+05:00 João Miguel Afonso :
> My second choice would be the "Sorting algorithms benchmark and
> implementation", that although is not directly related to my current work, I
> am more familiarised with and looks quite easier to accomplish.
> As I said, I had not enough time to explore the whole project documentation
> or source code, but I read the code of the sorting algorithm and I realised
> that it is sequential. Would a parallel implementation take some benefits
> here?

"Parallel" has meant "cool" for about 15 years already.
There is related work on the parallelization of sorting by Peter Geoghegan:
https://commitfest.postgresql.org/13/690/
Sure, it is viable and important work.

In this project, I was aiming to increase the performance of qsort. In
the context of Peter Geoghegan's work, qsort() is used only in queries
sorting relatively small amounts of data, but it is used in other
places too (and, probably, external sort involves it).
Surely, it is related, and something on the matter of parallelization
can be done here.

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-02-05 Thread Andrew Borodin
Hi, Jeff!

2017-02-05 3:45 GMT+05:00 Jeff Davis <pg...@j-davis.com>:
> On Sun, Jan 22, 2017 at 10:32 PM, Jeff Davis <pg...@j-davis.com> wrote:
>> On Sat, Jan 21, 2017 at 4:25 AM, Andrew Borodin <boro...@octonica.com> wrote:
>> One idea I had that might be simpler is to use a two-stage page
>> delete. The first stage would remove the link from the parent and mark
>> the page deleted, but leave the right link intact and prevent
>> recycling. The second stage would follow the chain of right links
>> along each level, removing the right links to deleted pages and
>> freeing the page to be recycled.
>
> Do you think this approach is viable as a simplification?

To consider this approach I need to ask several questions.

1. Are we discussing a simplification of the existing GIN vacuum, or of
the proposed GIN vacuum? Please note that they do not differ in the way
a page is unlinked; the function ginDeletePage() is almost untouched.

2. What do you mean by "stage"? In terms of the paper "A symmetric
concurrent B-tree algorithm" by Lanin and Shasha: a stage is an
uninterrupted period of holding a lock on a nonempty page set.
Here's the picture https://www.screencast.com/t/xUpGKgkkU from L&S.
Both papers (L&Y and L&S) tend to avoid lock coupling, which is
inevitable if you want to do the parent unlink first. To prevent insertion
of records on a page, you have to mark it. If you do this in
the stage when you unlink from the parent, you have to own both locks.

3. What do you want to simplify? Currently we unlink a page from the
parent and the left sibling in one stage, at the cost of acquiring a
CleanUp lock (the way we acquire it is the difference between the current
and patched versions).
A 2-stage algorithm will not be simpler, yet it will be more concurrent.
Please note that, given the absence of high fence keys in the GIN B-tree,
we actually would have to implement the algorithm from figure 3A
https://www.screencast.com/t/2cfGZtrzaz0z  (it would be incorrect, but
only in the presence of high keys)

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-29 Thread Andrew Borodin
2017-01-25 12:09 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
> 2017-01-24 22:29 GMT+05:00 Jeff Davis <pg...@j-davis.com>:
>> By the way, can you show me where the Lehman and Yao paper addresses
>> page recycling?
>
> Here J. Hellerstein comments on the L&Y paper [1], saying that effectively
> there is no deletion algorithm provided.
> Here [2] is the paper referenced from the nbtree docs. I'll check if it is
> applicable to the GIN B-tree.
>
> [1] http://db.cs.berkeley.edu/jmh/cs262b/treeCCR.html
> [2] http://dl.acm.org/citation.cfm?id=324589

Hi!

I'll summarize here my state of studying concurrent methods of page unlinking.

The GIN B-tree does not have a "high key". That means that the rightmost
key on a page is the maximum for a non-leaf page.
But I do not see any theoretical obstacle to implementing
Lanin and Shasha's methods of page merging, with slight modifications.
Their paper does not even mention the high key (the high fence key in
papers by Goetz Graefe).

But it's not a simple task, due to the large portions of code shared
between the entry tree and the posting tree.

Also, I do not see a reason why this method would be practically
superior to the proposed patch.

Currently, I do not have the resources to implement a proof of concept of
fully concurrent page unlinking to do benchmarking.
Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2017-01-27 Thread Andrew Borodin
2017-01-27 19:14 GMT+05:00 Peter Eisentraut :
> I suppose we should decide first whether we want pg_background as a
> separate extension or rather pursue extending dblink as proposed elsewhere.
>
> I don't know if pg_background allows any use case that dblink can't
> handle (yet).
pg_background in its current version is just the start of a feature.
The question is: are they coherent in desired features? I do not know.
E.g., will it be possible to copy from stdin in dblink and in possible
incarnations of pg_background functionality?

2017-01-27 19:38 GMT+05:00 Robert Haas :
> For the record, I have no big problem with extending dblink to allow
> this instead of adding pg_background.  But I think we should try to
> get one or the other done in time for this release.
+1!
That's why I hesitate between keeping my points to myself and stirring
controversy... I need to settle it somehow.

Parallelism is a "selling" feature; everything has had to be parallel for
a decade already (don't we have parallel sequential scan yet?).
It's fine to go with dblink, but the dblink docs start with roughly "this
is an outdated substandard feature" (not a direct quote [0]).
What will we add there? "Do not use dblink for linking to databases.
This is the standard for doing concurrency."?
Please excuse the exaggeration. BTW, pg_background does not have docs at all.

[0] https://www.postgresql.org/docs/devel/static/dblink.html

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-24 Thread Andrew Borodin
2017-01-24 22:29 GMT+05:00 Jeff Davis <pg...@j-davis.com>:
> On Tue, Jan 24, 2017 at 3:18 AM, Andrew Borodin <boro...@octonica.com> wrote:
>> Technically, approach of locking a subtree is not novel. Lehman and
>> Yao focused on "that any process for manipulating the tree uses only a
>> small (constant) number of locks at any time." We are locking unknown
>> and possibly large amount of pages.
>
> By the way, can you show me where the Lehman and Yao paper addresses
> page recycling?
>
> It says that one approach is to allow fewer than K entries on a leaf
> node; presumably as few as zero. But it doesn't seem to show how to
> remove all references to the page and recycle it in a new place in the
> tree.
>
> Regards,
>  Jeff Davis

Here J. Hellerstein comments on the L&Y paper [1], saying that effectively
there is no deletion algorithm provided.
Here [2] is the paper referenced from the nbtree docs. I'll check whether
it is applicable to the GIN B-tree.

[1] http://db.cs.berkeley.edu/jmh/cs262b/treeCCR.html
[2] http://dl.acm.org/citation.cfm?id=324589




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-24 Thread Andrew Borodin
Hello!

I see your point on alternatives. I will do my best to generate more
ideas on how vacuum can be done within the existing index structure;
this will take a day or two.
In this message I'll answer some questions.

2017-01-23 22:45 GMT+05:00 Jeff Davis :
> 1. I haven't seen the approach before of finding the parent of any
> empty subtree. While that sounds like a reasonable thing to do, it
> worries me to invent new approaches in an area as well-studied as
> btrees. The problem is less that it's too complex, and more that it's
> different. Future developers looking at this code will expect it to be
> either very simple or very standard, because it's just a posting list.

Technically, the approach of locking a subtree is not novel. Lehman and
Yao insisted "that any process for manipulating the tree uses only a
small (constant) number of locks at any time." We are locking an unknown
and possibly large number of pages.

> 2. It relies on searches to only start from the very leftmost page
> (correct?). If someone wants to optimize the intersection of two
> posting trees later, they might break it if they don't understand the
> assumptions in this patch.

Searches start from the root, to find a correct position.
But the concurrency of the patch does not depend on that.
The patch relies on the idea that, for any given subtree, if:
1. There are no inserts within the subtree (guaranteed by the cleanup lock)
2. Every page was locked with an exclusive lock at least once, and locks
were taken from left to right, without lock coupling
then there are no read searches within the subtree.

> 3. This adds a heap allocation, and copies the contents of the page,
> and then releases the pin and lock. GinStepRight is careful to avoid
> doing that (it locks the target before releasing the page containing
> the pointer). I don't see a problem in your patch, but again, we are
> breaking an assumption that future developers might make.

Yes, the heap allocation worries me a lot too, but let me explain the details.

GinStepRight is the place where lock coupling is done. That is, you
lock the rightlink before releasing the already locked page.
Generally, a B-tree tends to lock just one page at a time, but this
place is an important exception (not the only exception, but the others
behave similarly).

The proposed patch does the tree traversal at worst two times.

First: scan through the tree, lock internal pages for read, lock leaves
for write, delete dead tuples. If empty subtrees are found, invoke the
second stage for them.
Here, children's block numbers are stored in a palloc'd array, to do this
scan with a minimum of locks. This would be unacceptable if there could be
a concurrent VACUUM, which could delete a page while we are asleep for
some reason. (My point of worry number 1)

Second: this stage receives a subtree containing a lot of pages which
must simply be deleted. Here we take a cleanup lock on the subtree root,
ensuring no inserts are within (they hold pins on the stack from root to
leaf). Then we take exclusive locks on all pages from left to right,
without lock coupling. When we find an empty page, we lock the left page
too, thus doing lock coupling from right to left. If a search could be
within the subtree, it could deadlock with us. (My point of worry number 2)

But both points seem to have a reasonable explanation of why this cannot happen.

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-23 Thread Andrew Borodin
Hi!

2017-01-23 11:32 GMT+05:00 Jeff Davis :
> I have some reservations about the complexity and novelty of the
> approach. I don't think I've seen an approach quite like this one
> before. It seems like the main reason you are using this approach is
> because the btree structure was too simple (no left links); is that
> right? My gut is telling me we should either leave it simple, or use a
> real concurrent btree implementation.

The GIN B-tree departs from backend/access/nbtree almost as much as nbtree
departs from the L&Y algorithm. As far as I understand, there is no
"high key" concept as in nbtree. I'm not sure it's possible to
implement the same level of concurrency as in nbtree without it.
Also, the GIN B-tree is actually two different B-trees; large portions of
code are shared between the Entries tree and the Postings tree. I'm not
sure going fully concurrent in vacuum is worth it, as there would be a lot
of changes. And then there is compression... installing high keys into GIN
may be a mess, with pg_upgrade.

The proposed patch, on the contrary, simplifies the code. There is more
code deleted than added. It does not affect WAL and changes nothing in the
page layout.

> One idea I had that might be simpler is to use a two-stage page
> delete. The first stage would remove the link from the parent and mark
> the page deleted, but leave the right link intact and prevent
> recycling. The second stage would follow the chain of right links
> along each level, removing the right links to deleted pages and
> freeing the page to be recycled.

As far as I recall, this resembles the Lanin and Shasha algorithm.
I'll look into it, but I think it relies somehow on the concept of a
"high key" on a page (the high key is the key from a sibling page, not
the local rightmost key that the GIN B-tree uses).

> Another idea is to partition the posting tree so that the root page
> actually points to K posting trees. Scans would need to merge them,
> but it would allow inserts to progress in one tree while the other is
> being vacuumed. I think this would give good concurrency even for K=2.
> I just had this idea now, so I didn't think it through very well.

This is efficient from the point of view of frequent vacuums;
concurrency of same-key inserts can benefit from this approach. But
all other operations become more complicated and less efficient.
GIN already has the Fast Update list to tackle problems in this manner.
But I do not feel I have the resources to try to implement it...


Thank you for your considerations. I'll think about concurrency and
"high keys" more, but I'm really afraid of the amount of changes which
may be prerequisites.

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-21 Thread Andrew Borodin
Hello Jeff!

>Review of the code itself:
What do you think: is there anything else to improve in the patch, or
can we mark it as 'Ready for committer'?

I've checked the concurrency algorithm against the original work of
Lehman and Yao on the B-link tree. To me everything looks fine (safe,
deadlock free, and with no increased probability of livelock due to the
LockBufferForCleanup call).

Best regards, Andrey Borodin.




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-19 Thread Andrew Borodin
Hi, Jeff!

Thanks for review!

Here's a patch corrected according to your suggestions.

2017-01-19 11:48 GMT+05:00 Jeff Davis :
> https://www.postgresql.org/message-id/CAJEAwVE4UAmm8fr%2BNW8XTnKV6M--ACoNhL3ES8yoKL2sKhbaiw%40mail.gmail.com
>
> Let me re-summarize what's been done here to make sure I understand:
>
> Each key in GIN has its own posting tree, which is itself a btree,
> holding all of the tuples that contain that key. This posting tree is
> really just a set of tuples, and searches always scan an entire
> posting tree (because all the tuples contain the key, so all are a
> potential match).
>
> Currently, the cleanup process requires blocking all reads and writes
> in the posting list. I assume this is only a problem when a key is
> common enough that its posting list is quite large (how large?).
The posting tree should be at least 3 levels high to make this patch
useful. I guess that means at least 1 postings of a single term
(assuming an average of a hundred tuples per page).

> This patch makes a couple changes to improve concurrency:
>
> 1. Vacuum leaves first, without removing any pages. Instead, just keep
> track of pages which are completely empty (more generally, subtrees
> that contain empty pages).
>
> 2. Free the empty pages in those subtrees. That requires blocking
> reads and writes to that subtree, but not the entire posting list.
> Blocking writes is easy: acquiring a cleanup lock on the root of any
> subtree always blocks any writes to that subtree, because the write
> keeps a pin on all pages in that path. Blocking reads is accomplished
> by locking the page to be deleted, then locking/unlocking the page one
> left of that (which may be outside the subtree?).
No, the leftmost page within the subtree. It may be referenced by a
rightlink from outside the subtree; that's the reason we keep it alive
even if it's empty.

> Maybe a basic question, but why is the posting list a btree at all,
> and not, say a doubly-linked list?
When GIN executes searches, it usually has to fetch documents which
contain the intersection of posting lists. It is faster to intersect
B-trees if common elements are rare.
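A toy illustration of that point (not GIN's actual code): intersecting two sorted posting sets can skip ahead with binary search, which is roughly what descending a B-tree buys you when common elements are rare:

```python
from bisect import bisect_left

def intersect_postings(a, b):
    # a and b are sorted lists of item pointers (modelled here as ints).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # leap a forward to b[j]
        else:
            j = bisect_left(b, a[i], j + 1)   # leap b forward to a[i]
    return out
```

When matches are rare, each leap skips a large run of non-matching postings, instead of stepping through them one by one as a plain linked list would force.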

>
> Review of the code itself:
>
> * Nice and simple, with a net line count of only +45.
> * Remove commented-out code.
I've removed the code, though its purpose was to highlight the
tricky concurrency places that changed.
> * In the README, "leaf-to-right" should be left-to-right; and "takinf"
> should be "taking".
Fixed
> * When you say the leftmost page is never deleted; do you mean the
> leftmost of the entire posting tree, or leftmost of the subtree that
> you are removing pages from?
The leftmost page of a subtree containing a totally empty subtree. It's
not necessarily the leftmost page of the whole posting tree.
> * In order to keep out concurrent reads, you need to lock/unlock the
> left page while holding exclusive lock on the page being deleted, but
> I didn't see how that happens exactly in the code. Where does that
> happen?
In the function ginDeletePage():
LockBuffer(lBuffer, GIN_EXCLUSIVE);
We have to mark this page dirty, since its rightlink is changed. We
cannot mark it dirty without locking it, even if we know for sure that
no reader can reach it (due to the cleanup lock at the root of the
cleaned subtree).

Thank you for your suggestions; do not hesitate to ask any questions,
as concurrency and GIN are both very interesting topics.

Best regards, Andrey Borodin.
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index fade0cb..990b5ff 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -314,10 +314,17 @@ deleted.
 The previous paragraph's reasoning only applies to searches, and only to
 posting trees. To protect from inserters following a downlink to a deleted
 page, vacuum simply locks out all concurrent insertions to the posting tree,
-by holding a super-exclusive lock on the posting tree root. Inserters hold a
-pin on the root page, but searches do not, so while new searches cannot begin
-while root page is locked, any already-in-progress scans can continue
-concurrently with vacuum. In the entry tree, we never delete pages.
+by holding a super-exclusive lock on the parent page of subtree with deletable
+pages. Inserters hold a pin on the root page, but searches do not, so while
+new searches cannot begin while root page is locked, any already-in-progress
+scans can continue concurrently with vacuum in corresponding subtree of
+posting tree. To exclude interference with readers vacuum takes exclusive
+locks in a depth-first scan in left-to-right order of page tuples. Leftmost
+page is never deleted. Thus before deleting any page we obtain exclusive
+lock on any left page, effectively excluding deadlock with any reader, despite
+taking parent lock before current and left lock after current. We take left
+lock not for a concurrency reasons, but rather in need to mark page dirty.
+In the entry tree, we never delete pages.
 
 (This is quite different from the 

Re: [HACKERS] background sessions

2017-01-11 Thread Andrew Borodin
2017-01-12 9:01 GMT+05:00 Peter Eisentraut :
> On 1/10/17 10:58 AM, Robert Haas wrote:
>> On Thu, Dec 29, 2016 at 5:18 PM, Peter Eisentraut
>>  wrote:
>>> For additional entertainment, I include patches that integrate
>>> background sessions into dblink.  So dblink can open a connection to a
>>> background session, and then you can use the existing dblink functions
>>> to send queries, read results, etc.  People use dblink to make
>>> self-connections to get autonomous subsessions, so this would directly
>>> address that use case.  The 0001 patch is some prerequisite refactoring
>>> to remove an ugly macro mess, which is useful independent of this.  0002
>>> is the actual patch.
>>
>> Would that constitute a complete replacement for pg_background?
>
> I think so.
That constitutes a replacement for the existing set of functionality.
It's not certain whether new features for pg_background would be
coherent with the dblink ideology.
E.g., if one day we implement data exchange between two running queries
for pg_background, it would conflict with the dblink ideology.

I have no opinion on whether dblink or pg_background is the appropriate
place for this functionality. Just mentioning some thoughts.

BTW, can we have an automatic FDW for bgsessions? It sounds crazy to me,
but technically it makes sense.

Best regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2017

2017-01-10 Thread Andrew Borodin
2017-01-10 14:53 GMT+05:00 Alexander Korotkov :
> 1. What project ideas we have?

I have one more project of interest which I can mentor.

Topic. GiST API advancement

===Idea===
The GiST API was designed in the early '90s to reduce boilerplate
code around data access methods over a balanced tree. Now, after 30
years, there are some ideas for improving this API.

===Project details===
An opclass developer must specify four core operations to make a type
GiST-indexable:
1. Split: a function to split a set of datatype instances into two parts.
2. Penalty calculation: a function to measure the penalty for unifying
two keys.
3. Collision check: a function which determines whether two keys may
overlap or do not intersect.
4. Unification: a function to combine two keys into one such that the
combined key collides with both input keys.

Functions 2 and 3 can be improved.
For example, the Revised R*-tree [1] insertion algorithm cannot be
expressed in terms of penalty-based algorithms. There have been some
attempts to bring in parts of the RR*-tree insertion, but they come down
to ugly hacks [2]. The current GiST API, due to its penalty-based
insertion algorithm, does not allow implementing an important feature of
the RR*-tree: overlap optimization. As Norbert Beckman, author of the
RR*-tree, put it in discussion: “Overlap optimization is one of the main
elements, if not the main query performance tuning element of the
RR*-tree. You would fall back to old R-Tree times if that would be left
off.”

The collision check currently returns a binary result:
1. The query may collide with the subtree MBR.
2. The query does not collide with the subtree.
This result may be augmented with a third state: the subtree is totally
within the query. In this case the GiST scan can descend the subtree
without key checks.
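
A minimal sketch of such a ternary check, with a 1-D interval standing in for an MBR (the enum and function here are hypothetical illustrations, not part of the actual GiST API):

```c
#include <assert.h>

/* Hypothetical three-state result for the collision check. */
typedef enum CollisionResult
{
    COLLISION_NONE,             /* prune the subtree entirely */
    COLLISION_POSSIBLE,         /* descend and keep checking keys */
    COLLISION_CONTAINED         /* subtree fully inside query: emit all,
                                 * no further key checks needed */
} CollisionResult;

/* 1-D example: query [qlo, qhi] against a subtree "MBR" [lo, hi] */
static CollisionResult
interval_collision(int qlo, int qhi, int lo, int hi)
{
    if (hi < qlo || lo > qhi)
        return COLLISION_NONE;
    if (qlo <= lo && hi <= qhi)
        return COLLISION_CONTAINED;
    return COLLISION_POSSIBLE;
}
```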

The potential effect of these improvements must be benchmarked.
Probably, implementing these two will spawn more ideas for GiST
performance improvements.

Finally, GiST does not provide an API for bulk loading. During GSoC
2011, Alexander Korotkov implemented buffered GiST build. This index
construction is faster, but yields an index tree with virtually the same
querying performance. There are different algorithms aiming to produce a
better index tree by exploiting some knowledge of the data, e.g. [3].
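
To illustrate why sort-based loading is attractive (a toy sketch, not PostgreSQL code): once the input is sorted, leaves can be packed left to right at full fanout and parent levels built bottom-up in a single pass. The function below merely counts the pages such a build produces:

```c
#include <assert.h>

/*
 * Sketch: page count of a bottom-up bulk load.  Each level packs the
 * previous level's entries (or downlinks) into pages of the given fanout
 * until a single root page remains.
 */
static int
bulk_load_pages(int nitems, int fanout)
{
    int         total = 0;
    int         level = nitems; /* entries on the current level */

    while (level > 1)
    {
        int         pages = (level + fanout - 1) / fanout;

        total += pages;
        level = pages;          /* each page yields one downlink upward */
    }
    return total;
}
```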


[1] Beckmann, Norbert, and Bernhard Seeger. "A revised r*-tree in
comparison with related index structures." Proceedings of the 2009 ACM
SIGMOD International Conference on Management of data. ACM, 2009.
[2] 
https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#cajeawvfmo-fxaj6lkj8wtb1br0mtby48egmvejboodroegy...@mail.gmail.com
[3] Achakeev, Daniar, Bernhard Seeger, and Peter Widmayer. "Sort-based
query-adaptive loading of r-trees." Proceedings of the 21st ACM
international conference on Information and knowledge management. ACM,
2012.

Best regards, Andrey Borodin.




Re: [HACKERS] GSoC 2017

2017-01-10 Thread Andrew Borodin
2017-01-10 14:53 GMT+05:00 Alexander Korotkov :
> 1. What project ideas we have?

Hi!
I'd like to propose a project on sorting algorithm research. I'm ready
to be a mentor on this project.

===Topic===
Sorting algorithms benchmark and implementation.

===Idea===
Currently PostgreSQL uses a Hoare Quicksort implementation based
on the 1993 work of Bentley and McIlroy [1], while there exist some
more novel algorithms [2], [3], and [4] which are actively used by
highly optimized code like Java and .NET. Probably, using an optimized
sorting algorithm could yield a general system performance improvement.
Also, the use of non-comparison-based algorithms deserves attention and
benchmarking [5].

===Project details===
The project has four essential parts:
1.   Implementation of a benchmark for sorting, making sure that
operations using sorting are represented proportionally to some
“average” use cases.
2.   Selection of benchmark algorithms. Selection can be based,
for example, on scientific papers or community opinions.
3.   Benchmark implementation of the selected algorithms. Analysis of
results, picking of a winner.
4.   Industrial implementation for pg_qsort(), pg_qsort_args() and
gen_qsort_tuple.pl. The implemented patch is submitted to a commitfest,
and another patch is reviewed by the student.
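
For reference, a compact sketch of introsort [2], one candidate such a benchmark could compare against pg_qsort(); this is an illustrative int-only implementation, not a drop-in replacement:

```c
#include <assert.h>

/*
 * Sketch of introsort [2]: quicksort that switches to insertion sort on
 * small ranges and falls back to heapsort once recursion depth exceeds
 * 2*floor(log2(n)), bounding the worst case at O(n log n).
 */
static void
swap_int(int *a, int *b)
{
    int         t = *a;

    *a = *b;
    *b = t;
}

static void
insertion_sort(int *a, int lo, int hi)
{
    for (int i = lo + 1; i <= hi; i++)
        for (int j = i; j > lo && a[j - 1] > a[j]; j--)
            swap_int(&a[j - 1], &a[j]);
}

/* restore max-heap property below a[lo + start] within a heap of size n */
static void
sift_down(int *a, int lo, int start, int n)
{
    int         p = start;

    for (;;)
    {
        int         c = 2 * p + 1;

        if (c >= n)
            break;
        if (c + 1 < n && a[lo + c + 1] > a[lo + c])
            c++;
        if (a[lo + p] >= a[lo + c])
            break;
        swap_int(&a[lo + p], &a[lo + c]);
        p = c;
    }
}

static void
heapsort_range(int *a, int lo, int hi)
{
    int         n = hi - lo + 1;

    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(a, lo, i, n);
    for (int end = n - 1; end > 0; end--)
    {
        swap_int(&a[lo], &a[lo + end]);
        sift_down(a, lo, 0, end);
    }
}

static void
introsort_rec(int *a, int lo, int hi, int depth)
{
    while (hi - lo > 16)
    {
        int         pivot,
                    i = lo,
                    j = hi;

        if (depth-- == 0)
        {
            heapsort_range(a, lo, hi);
            return;
        }
        pivot = a[lo + (hi - lo) / 2];
        while (i <= j)
        {
            while (a[i] < pivot)
                i++;
            while (a[j] > pivot)
                j--;
            if (i <= j)
                swap_int(&a[i++], &a[j--]);
        }
        introsort_rec(a, i, hi, depth); /* recurse into one side... */
        hi = j;                 /* ...and loop on the other */
    }
    insertion_sort(a, lo, hi);
}

static void
introsort(int *a, int n)
{
    int         depth = 0;

    for (int m = n; m > 1; m >>= 1)
        depth++;                /* floor(log2(n)) */
    if (n > 1)
        introsort_rec(a, 0, n - 1, 2 * depth);
}
```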

[1] Bentley, Jon L., and M. Douglas McIlroy. "Engineering a sort
function." Software: Practice and Experience 23.11 (1993): 1249-1265.
[2] Musser, David R. "Introspective sorting and selection algorithms."
Softw., Pract. Exper. 27.8 (1997): 983-993.
[3] Auger, Nicolas, Cyril Nicaud, and Carine Pivoteau. "Merge
Strategies: from Merge Sort to TimSort." (2015).
[4] Beniwal, Sonal, and Deepti Grover. "Comparison of various sorting
algorithms: A review." International Journal of Emerging Research in
Management  2 (2013).
[5] Mcllroy, Peter M., Keith Bostic, and M. Douglas Mcllroy.
"Engineering radix sort." Computing systems 6.1 (1993): 5-27.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2017-01-07 Thread Andrew Borodin
Hi!

2017-01-07 11:44 GMT+05:00 amul sul :

> Changes:
> 1. pg_background_launch renamed to pg_background_start
> 2. pg_background_detach renamed to pg_background_close
> 3. Added new api to display previously launch background worker & its
> result count waiting to be read.
Great work!


> Note that attached patches need to apply to the top of the Peter
> Eisentraut's v2 patch[1].
I've looked a bit into that patch. I thought you would switch
MemoryContext before calling BackgroundSessionStart() from
pg_background? Have I got something wrong?

Best regards, Andrey Borodin.




Re: [HACKERS] background sessions

2017-01-04 Thread Andrew Borodin
2017-01-04 10:23 GMT+05:00 amul sul :
> One more query, can we modify
> BackgroundSessionStart()/BackgroundSession struct to get background
> worker PID as well?
I think that since a session always has a PID, it's absolutely reasonable to return the PID.

> I can understand this requirement could be sound useless for now,
> because it only for the benefit of pg_background contrib module only.
As far as I can understand, BackgroundSession is not just a feature
itself, it's an API. So the PID would benefit pg_background and all
the API use cases we haven't implemented yet. I do not think that one
PID in the structure will waste a huge amount of memory, cycles, dev
time, readability of docs, clarity of the API, etc. AFAIK the only
reason against it would be if the PID is not always there.

Best regards, Andrey Borodin.




Re: [HACKERS] background sessions

2017-01-03 Thread Andrew Borodin
2017-01-03 19:39 GMT+05:00 Peter Eisentraut :

> On 1/3/17 1:26 AM, amul sul wrote:
> > One more requirement for pg_background is session, command_qh,
> > response_qh and worker_handle should be last longer than current
> > memory context, for that we might need to allocate these in
> > TopMemoryContext.  Please find attach patch does the same change in
> > BackgroundSessionStart().
>
> I had pondered this issue extensively.  The standard coding convention
> in postgres is that the caller sets the memory context.  See the dblink
> and plpython patches that make this happen in their own way.
>
> I agree it would make sense that you either pass in a memory context or
> always use TopMemoryContext.  I'm open to these ideas, but they did not
> seem to match any existing usage.
>
+1
Please excuse me if I'm not getting something obvious, but it seems like
the BackgroundSessionNew() caller in pg_background_launch() can pick up
TopMemoryContext. I think that context setup should be done by the
extension, not by the bg_session API.

Best regards, Andrey Borodin.


Re: [HACKERS] pg_background contrib module proposal

2016-12-21 Thread Andrew Borodin
2016-12-21 20:42 GMT+05:00 Robert Haas :
> This whole subthread seems like a distraction to me.  I find it hard
> to believe that this test case would be stable enough to survive the
> buildfarm where, don't forget, we have things like
> CLOBBER_CACHE_ALWAYS machines where queries take 100x longer to run.
> But even if it is, surely we can pick a less contrived test case.  So
> why worry about this?

David Fetter's test is deterministic and shall pass no matter how slow
and unpredictable performance is on a server.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-20 Thread Andrew Borodin
2016-12-21 4:55 GMT+05:00 David Fetter :
> I see.
>
> I find the following a little easier to follow, and the sleeps still
> work even when very short.

Now I know a little more SQL :) Thank you.

I'm not sure every platform supports microsecond sleeps, but it will
work anyway. This test is deterministic.
Without the sequence there is no race condition, so it is not sleepsort;
it is deterministic. Though I agree that this is a good thing for a
test, I'd still add some milliseconds to the test case so that the main
query definitely has to wait for the other sleeping query to finish.
>
> CREATE TABLE input AS
> SELECT x, row_number() OVER (ORDER BY x) n
> FROM
> generate_series(0,.05,0.01) x
> ORDER BY x DESC;
>
> CREATE TABLE output(place int,value float);
>
> CREATE TABLE handles AS
> SELECT pg_background_launch(
> format($$
> SELECT pg_sleep(%s);
> INSERT INTO output VALUES (%s, %s)
> $$,
> x, n, x
> )
> ) h
> FROM input;
>
> --wait until everyone finishes
> SELECT * FROM handles JOIN LATERAL pg_background_result(handles.h) AS (x 
> TEXT) ON (true);
>
> --output results
> SELECT * FROM output ORDER BY place;

Best regrards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-19 Thread Andrew Borodin
2016-12-19 4:21 GMT+05:00 David Fetter :
> Couldn't it sleep in increments smaller than a second?  Like maybe
> milliseconds?  Also, it's probably cleaner (or at least more
> comprehensible) to write something using format() and dollar quoting
> than the line with the double 's.

Right. Here's a version which waits for half a second. I do not see how
to pass doubles via dollar quoting; pg_background_launch() gets a string
as a parameter, it's not EXECUTE USING.

Best regards, Andrey Borodin.
diff --git a/contrib/pg_background/Makefile b/contrib/pg_background/Makefile
index c4e717d..085fbff 100644
--- a/contrib/pg_background/Makefile
+++ b/contrib/pg_background/Makefile
@@ -6,6 +6,8 @@ OBJS = pg_background.o
 EXTENSION = pg_background
 DATA = pg_background--1.0.sql
 
+REGRESS = pg_background
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_background/expected/pg_background.out 
b/contrib/pg_background/expected/pg_background.out
new file mode 100644
index 000..a6613ce
--- /dev/null
+++ b/contrib/pg_background/expected/pg_background.out
@@ -0,0 +1,31 @@
+CREATE EXTENSION pg_background;
+--run 6 workers which wait .0, .1, .2, .3, .4, .5 seconds respectively
+CREATE TABLE input as SELECT x FROM generate_series(0,.5,0.1) x ORDER BY x 
DESC;
+CREATE TABLE output(place int,value float);
+--sequence for indication of who's finished on which place
+CREATE sequence s start 1;
+CREATE TABLE handles as SELECT pg_background_launch(format('select 
pg_sleep(%s); insert into output values (nextval(''s''),%s);',x,x)) h FROM 
input;
+--wait until everyone finishes
+SELECT (SELECT * FROM pg_background_result(h) as (x text) limit 1) FROM 
handles;
+x 
+--
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+(6 rows)
+
+--output results
+SELECT * FROM output ORDER BY place;
+ place | value 
+---+---
+ 1 | 0
+ 2 |   0.1
+ 3 |   0.2
+ 4 |   0.3
+ 5 |   0.4
+ 6 |   0.5
+(6 rows)
+
diff --git a/contrib/pg_background/sql/pg_background.sql 
b/contrib/pg_background/sql/pg_background.sql
new file mode 100644
index 000..770f0b8
--- /dev/null
+++ b/contrib/pg_background/sql/pg_background.sql
@@ -0,0 +1,12 @@
+CREATE EXTENSION pg_background;
+
+--run 6 workers which wait .0, .1, .2, .3, .4, .5 seconds respectively
+CREATE TABLE input as SELECT x FROM generate_series(0,.5,0.1) x ORDER BY x 
DESC;
+CREATE TABLE output(place int,value float);
+--sequence for indication of who's finished on which place
+CREATE sequence s start 1;
+CREATE TABLE handles as SELECT pg_background_launch(format('select 
pg_sleep(%s); insert into output values (nextval(''s''),%s);',x,x)) h FROM 
input;
+--wait until everyone finishes
+SELECT (SELECT * FROM pg_background_result(h) as (x text) limit 1) FROM 
handles;
+--output results
+SELECT * FROM output ORDER BY place;



Re: [HACKERS] background sessions

2016-12-16 Thread Andrew Borodin
2016-12-16 20:17 GMT+05:00 Peter Eisentraut :
>> And one more thing... Can we have BackgroundSessionExecute() splitted
>> into two parts: start query and wait for results?
>> It would allow pg_background to reuse bgsession's code.
>
> Yes, I will look into that.

Thank you. I'm marking both patches as "Waiting for author", keeping
in mind that pg_background is waiting for bgsessions.
After updates I'll review these patches.

Best regards, Andrey Borodin.




Re: [HACKERS] background sessions

2016-12-14 Thread Andrew Borodin
2016-12-15 0:30 GMT+05:00 Peter Eisentraut :
 TryBeginSession()?
>>>
>>> What exactly would that do?
>> Return status (success\failure) and session object, if a function succeeded.
>>
>> If there is max_connections exceeded, then (false,null).
>>
>> I'm not sure whether this idiom is common for Python.
>
> You can catch PostgreSQL exceptions in PL/Python, so this can be handled
> in user code.
>
> Some better connection management or pooling can probably be built on
> top of the primitives later, I'd say.

Agree, doing this in Python is the better option.

And one more thing... Can we have BackgroundSessionExecute() split
into two parts: start the query and wait for results?
It would allow pg_background to reuse bgsession's code.

Best regards, Andrey Borodin.




Re: [HACKERS] background sessions

2016-12-14 Thread Andrew Borodin
2016-12-14 20:45 GMT+05:00 Peter Eisentraut <peter.eisentr...@2ndquadrant.com>:
> On 12/11/16 5:38 AM, Andrew Borodin wrote:
>> 2. From my point of view the interface of the feature is the strong
>> point here (compared to pg_background). But it does not help
>> programmer to follow good practice: bgworker is a valuable and limited
>> resource, may be we could start session with something like
>> TryBeginSession()?
>
> What exactly would that do?
Return a status (success/failure) and a session object if the function succeeded.

If max_connections is exceeded, then (false, null).

I'm not sure whether this idiom is common for Python.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-13 Thread Andrew Borodin
2016-12-13 12:55 GMT+05:00 amul sul :
> I think background-session code is not that much deviated from
> pg_background code,
It is not derived, though it does not deviate much. The background
sessions code does not have detailed exceptions and code comments, but
it is doing somewhat similar things.
>IIUC background-session patch we can manage to
> reuse it, as suggested by Robert.  This will allow us to maintain
> session in long run, we could reuse this session to run subsequent
> queries instead of maintaining separate worker pool.  Thoughts?

One API to deal with both features would be much better, sure.
"Objects" like session pools are much easier to implement on top of a
session "object" than on top of worker processes, PIDs, etc.

>> 4. Query as a future (actually it is implemented already by
>> pg_background_result)
>> 5. Promised result. Declare that you are waiting for data of specific
>> format, execute a query, someone (from another backend) will
>> eventually place a data to promised result and your query continues
>> execution.
>
> Could you please elaborate more?
> Do you mean two way communication between foreground & background process?

It is from C++11 threading: future, promise and async - these are
related but different ways of acquiring a result between threads.
A future is when a caller Cl starts a thread T (or dequeues a thread
from a pool) and Cl can wait until T finishes and provides a result.
Here waiting for the result is just like selecting from pg_background_result().
A promise is when you declare a variable and the caller does not know
which thread will put the data into this variable, but any thread
reading the promise will wait until another thread puts data into it.
There are more parallelism patterns, like async, deferred, lazy,
but futures and promises are, from my point of view, the most used.

>> 6. Cancelation: a way to signal to background query that it's time to
>> quit gracefully.
> Let me check this too.
I think Craig is right: any background query must be ready to be shut
down. That's what databases are about: you can safely pull the plug at
any moment.
I've remembered one more parallelism pattern: the critical region. It's
when you ask the system "plz no TERM now, and, if you can, postpone
the vacuums, OOMKs and that kind of stuff" from user code. But I do
not think it's a usable pattern here.

Thank you for your work.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-12 Thread Andrew Borodin
Hi!

Just in case you'd like to include sleepsort as a test, here it is
wrapped as a regression test (see attachment). But it has a serious
downside: it runs for no less than 5 seconds.

Also I'll list here every parallelism feature I managed to imagine. This
is not to say that I suggest having some of these features in v1. We
certainly have to start somewhere; this list is just for information
purposes.
1. Poolable workers. Just to be sure you can reliably queue your task
somewhere without hitting the "increase max_connections" hint.
2. Inside one pool, you can have task chaining: after completion of
task A (query A) start task B; in case of failure start task C. Task C
is followed by task D.
3. Reliably read task status: running/awaiting/completed/errored/in a
lock. Get the query plan of a task? (I know there are reasons why this
is not possible now.)
4. Query as a future (actually this is implemented already by
pg_background_result).
5. Promised result. Declare that you are waiting for data of a specific
format, execute a query, and someone (from another backend) will
eventually place data into the promised result, at which point your
query continues execution.
6. Cancellation: a way to signal to a background query that it's time to
quit gracefully.

Best regards, Andrey Borodin.
diff --git a/contrib/pg_background/Makefile b/contrib/pg_background/Makefile
index c4e717d..085fbff 100644
--- a/contrib/pg_background/Makefile
+++ b/contrib/pg_background/Makefile
@@ -6,6 +6,8 @@ OBJS = pg_background.o
 EXTENSION = pg_background
 DATA = pg_background--1.0.sql
 
+REGRESS = pg_background
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_background/expected/pg_background.out 
b/contrib/pg_background/expected/pg_background.out
new file mode 100644
index 000..dbc344e
--- /dev/null
+++ b/contrib/pg_background/expected/pg_background.out
@@ -0,0 +1,26 @@
+CREATE EXTENSION pg_background;
+--run 5 workers which wait about 1 second
+CREATE TABLE input as SELECT x FROM generate_series(1,5,1) x ORDER BY x DESC;
+CREATE TABLE output(place int,value int);
+CREATE sequence s start 1;
+CREATE TABLE handles as SELECT pg_background_launch('select pg_sleep('||x||'); 
insert into output values (nextval(''s''),'||x||');') h FROM input;
+SELECT (SELECT * FROM pg_background_result(h) as (x text) limit 1) FROM 
handles;
+x 
+--
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+(5 rows)
+
+SELECT * FROM output ORDER BY place;
+ place | value 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
diff --git a/contrib/pg_background/sql/pg_background.sql 
b/contrib/pg_background/sql/pg_background.sql
new file mode 100644
index 000..d7cbd44
--- /dev/null
+++ b/contrib/pg_background/sql/pg_background.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION pg_background;
+
+--run 5 workers which wait about 1 second
+CREATE TABLE input as SELECT x FROM generate_series(1,5,1) x ORDER BY x DESC;
+CREATE TABLE output(place int,value int);
+CREATE sequence s start 1;
+CREATE TABLE handles as SELECT pg_background_launch('select pg_sleep('||x||'); 
insert into output values (nextval(''s''),'||x||');') h FROM input;
+SELECT (SELECT * FROM pg_background_result(h) as (x text) limit 1) FROM 
handles;
+SELECT * FROM output ORDER BY place;



Re: [HACKERS] background sessions

2016-12-11 Thread Andrew Borodin
Hi!

I'm planning to review this patch.
I have some questions, maybe answers could help me understand patch better.
1. As far as I can see, we cannot use COPY FROM STDIN in a bg session?
Since one of the purposes is to orchestrate transactions, maybe that
would be valuable.
2. From my point of view the interface of the feature is the strong
point here (compared to pg_background). But it does not help the
programmer follow good practice: a bgworker is a valuable and limited
resource; maybe we could start a session with something like
TryBeginSession()? Maybe this violates some policies or idioms which
I'm not aware of.

Thank you for your work.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-11 Thread Andrew Borodin
2016-12-09 18:00 GMT+05:00 Robert Haas :
> It looks like this could be reworked as a client of Peter Eisentraut's
> background sessions code, which I think is also derived from
> pg_background:
>
> http://archives.postgresql.org/message-id/e1c2d331-ee6a-432d-e9f5-dcf85cffa...@2ndquadrant.com
>
> That might be good, because then we wouldn't have to maintain two
> copies of the code.

The code looks quite different. I mean the bgsession.c code and the
pg_background.c code.
Definitely, there is a possibility to refactor both patches to share a
common subset of base routines; they operate with similar concepts.
But to start, it's better to choose which patch goes first, or to merge
them.

There is no possibility to make one on the base of the other, since they
both require some work.
Personally, I like the C code from pg_background more. It is far better
commented and has more exception info for the user. But the interface of
bgsessions is crisper. Finally, they solve different problems.

I signed up for review there too (in the background sessions patch). I
hope I'll have enough resources to provide a decent review for both in
December, before the commitfest.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-09 Thread Andrew Borodin
I've read the code and now I have more suggestions.

1. An easy one. The case of a different natts between the expected and
actual result throws the same errmsg as the case of wrong types.
I think we should assist the user more here

if (natts != tupdesc->natts)
    ereport(ERROR,
            (errcode(ERRCODE_DATATYPE_MISMATCH),
             errmsg("remote query result rowtype does not match the specified FROM clause rowtype")));

and here

    if (type_id != tupdesc->attrs[i]->atttypid)
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("remote query result rowtype does not match the specified FROM clause rowtype")));

2. This code
errhint("You may need to increase max_worker_processes.")
is technically giving the right hint, but this extension requires
something like pgbouncer or what is called a "thread pool".
You just cannot safely fire a background task because you do not know
how many workers can be spawned. Maybe it'd be better to create a 'pool'
of workers and form a queue of queries for them?
This would make it possible to start millions of subtasks.

3. About getting it into core, rather than keeping it as an extension:
IMO this is too sharp a thing to place into core; I think its best place
is in contrib.
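
The pool-plus-queue idea from point 2 could look roughly like the ring buffer below. Everything here (names, sizes, the single-process setting) is a hypothetical sketch; a real version would live in shared memory with proper locking, with a fixed number of bgworkers draining the queue:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch of a task queue for a fixed worker pool: instead of
 * erroring out when all background workers are busy, the query text is
 * queued and an idle worker picks it up later.  QUEUE_LEN is illustrative.
 */
#define QUEUE_LEN 64

typedef struct TaskQueue
{
    const char *sql[QUEUE_LEN];
    int         head;           /* next slot to dequeue */
    int         count;          /* number of queued tasks */
} TaskQueue;

/* returns 1 on success, 0 if the queue is full (caller may retry) */
static int
tq_enqueue(TaskQueue *q, const char *query)
{
    if (q->count == QUEUE_LEN)
        return 0;
    q->sql[(q->head + q->count) % QUEUE_LEN] = query;
    q->count++;
    return 1;
}

/* returns NULL when the queue is empty (idle worker goes back to sleep) */
static const char *
tq_dequeue(TaskQueue *q)
{
    const char *query;

    if (q->count == 0)
        return NULL;
    query = q->sql[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    q->count--;
    return query;
}
```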

Thanks for the work on such a cool and fun extension.

Best regards, Andrey Borodin.




Re: [HACKERS] pg_background contrib module proposal

2016-12-07 Thread Andrew Borodin
This is simply wonderful!
Finally, I can implement my favorite sleepsort in Postgres:

create table input as select round(random()*20) x from generate_series(1,5,1);
create table output(place int,value int);
create sequence s start 1;
create table handles as select pg_background_launch('select
pg_sleep('||x||'); insert into output values
(nextval(''s''),'||x||');') h from input;
select (select * from pg_background_result(h) as (x text) limit 1) from handles;
select * from output;

Works like a charm. It has perfect running time O(1), where n is the
number of items to sort, but it cannot sort more than
max_worker_processes-1 items.
The next step in user concurrency mechanics is future indexes (indexes
under construction are taken into plans; the query is executed when the
index is actually constructed) and promised tables (a query waits until
there is actually data in the table).


I like the idea and the implementation, and I'm planning to post a
review by the end of December.
Right now I have just one useful idea: I do not see a comfortable way to
check the state of a task (finished/failed) without the possibility of
hanging for long or catching fire.

Best regards, Andrey Borodin.




Re: [HACKERS] GIN non-intrusive vacuum of posting tree

2016-11-30 Thread Andrew Borodin
Here is v1 of the patch. Now it has changes to the README and contains
more comments clarifying the changes to the locking model.

Also I will elaborate some more on what is patched. The main portion of
changes is made to the function ginVacuumPostingTreeLeaves(). Before the
patch it traversed the tree in depth-first fashion, acquiring an
exclusive lock on each page and removing dead tuples from leaves; the
function also used to hold a cleanup lock. Now this function does the
same DFS, but without the cleanup lock, acquiring only read locks on
inner pages and an exclusive lock on leaves before eliminating dead
tuples. If this function finds empty leaves, it computes the minimal
subtree containing only empty pages and starts a scan for empty pages
from the parent page pointing to the found subtree.

This scan acquires a cleanup lock on the root of the scan (not necessarily the
root of the posting tree). The cleanup lock ensures no inserts are inside the
subtree. The scan traverses the subtree depth-first, taking exclusive locks
from left to right. For any page being deleted, all pages to its left were
locked and unlocked before. New reads cannot enter the subtree, and all old
read scans were excluded by the lock/unlock. Thus there shall be no deadlocks
with ginStepRight().

We take a lock on the page being deleted, then on its left page.
ginStepRight() takes a lock on the left page, then on the right page. But its
presence is excluded by the cleanup lock and the DFS scan with locks on the
upper and left parts of the tree.
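For reference, the patched control flow described above can be sketched in pseudocode (names follow the patch; details are simplified):

```
ginVacuumPostingTreeLeaves(blkno):
    take read (shared) lock on blkno          -- inner pages: no cleanup lock
    if page is a leaf:
        upgrade to exclusive lock
        ginVacuumPostingTreeLeaf(...)         -- delete dead tuples
        report to caller whether the leaf became empty
    else:
        recurse into children, collecting empty subtrees
        if a child subtree consists only of empty pages:
            LockBufferForCleanup(parent of subtree)   -- shut out inserts
            ginScanToDelete(...)              -- DFS, exclusive locks, left to right
```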


Thank you for reading this. Concurrency bothers me a lot. If you see
anything wrong or suspicious, please do not hesitate to post
your thoughts.


Best regards, Andrey Borodin.
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index fade0cb..29dafce 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -314,10 +314,16 @@ deleted.
 The previous paragraph's reasoning only applies to searches, and only to
 posting trees. To protect from inserters following a downlink to a deleted
 page, vacuum simply locks out all concurrent insertions to the posting tree,
-by holding a super-exclusive lock on the posting tree root. Inserters hold a
-pin on the root page, but searches do not, so while new searches cannot begin
-while root page is locked, any already-in-progress scans can continue
-concurrently with vacuum. In the entry tree, we never delete pages.
+by holding a super-exclusive lock on the parent page of subtree with deletable
+pages. Inserters hold a pin on the root page, but searches do not, so while
+new searches cannot begin while root page is locked, any already-in-progress
+scans can continue concurrently with vacuum in corresponding subtree of
+posting tree. To exclude interference with readers vacuum takes exclusive
+locks in a depth-first scan in leaf-to-right order of page tuples. Leftmost
+page is never deleted. Thus before deleting any page we obtain exclusive
+lock on any left page, effectively excluding deadlock with any reader, despite
+taking parent lock first and not holding left lock at all.
+In the entry tree, we never delete pages.
 
 (This is quite different from the mechanism the btree indexam uses to make
 page-deletions safe; it stamps the deleted pages with an XID and keeps the
diff --git a/src/backend/access/gin/ginbtree.c 
b/src/backend/access/gin/ginbtree.c
index a0afec4..dc28c81 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack 
*stack,
 /*
  * Lock buffer by needed method for search.
  */
-static int
+int
 ginTraverseLock(Buffer buffer, bool searchMode)
 {
Pagepage;
diff --git a/src/backend/access/gin/ginvacuum.c 
b/src/backend/access/gin/ginvacuum.c
index 2685a1c..9bd2f50 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer)
PageSetLSN(page, recptr);
 }
 
-static bool
-ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool 
isRoot, Buffer *rootBuffer)
-{
-   Buffer  buffer;
-   Pagepage;
-   boolhasVoidPage = FALSE;
-   MemoryContext oldCxt;
-
-   buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-   RBM_NORMAL, 
gvs->strategy);
-   page = BufferGetPage(buffer);
-
-   /*
-* We should be sure that we don't concurrent with inserts, insert 
process
-* never release root page until end (but it can unlock it and lock
-* again). New scan can't start but previously started ones work
-* concurrently.
-*/
-   if (isRoot)
-   LockBufferForCleanup(buffer);
-   else
-   LockBuffer(buffer, GIN_EXCLUSIVE);
-
-   Assert(GinPageIsData(page));
-
-   if (GinPageIsLeaf(page))
-   {
-   oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
-   

[HACKERS] GIN non-intrusive vacuum of posting tree

2016-11-28 Thread Andrew Borodin
Hi hackers!

Here is a patch with a more concurrency-friendly vacuum of GIN.

===Short problem statement===
Currently, GIN vacuum can lock the posting tree for a long time while
cleaning it, without real need.

===Problem statement===
Vacuum of a posting tree (function ginVacuumPostingTree() in ginvacuum.c)
makes two passes through the posting tree:

1. ginVacuumPostingTreeLeaves() takes LockBufferForCleanup(),
effectively excluding all inserts. Then it traverses down through the tree
taking exclusive locks, effectively excluding all reads. On the leaf level
it calls ginVacuumPostingTreeLeaf(), which deletes all dead tuples. If
there are any empty pages, the root lock is not released, and it proceeds
to stage two.

In between: vacuum_delay_point(), which can hang for a while, holding
the cleanup lock on the root.

2. If there are any empty pages, ginScanToDelete() scans through the tree,
deleting empty leaf pages, then deleting empty inner pages if they
emerged after leaf page deletion.


===Proposed algorithm===
ginVacuumPostingTreeLeaves() takes SHARED locks on inner pages and
EXCLUSIVE locks on leaf pages. On leaf pages it calls
ginVacuumPostingTreeLeaf().

If ginVacuumPostingTreeLeaves() encounters a subtree with all leaves
empty, it takes LockBufferForCleanup() on the page P pointing to the
empty subtree. After taking the lock on P, ginScanToDelete() is called for
P. For every parent of P we can guarantee that this procedure will not
be necessary: if P were part of an empty subtree itself, we would pass this
procedure to its parent (unless P is the root, in which case the behavior is
effectively equal to the pre-patch one).


===Testing===
This patch solved a problem encountered by Evgeniy Efimkin and
Vladimir Borodin from Yandex.Mail.
They implemented a testbed with a GIN index:

CREATE TABLE BOX (uid bigint,
  lids integer[] NOT NULL
  CHECK (array_ndims(lids) = 1));

CREATE OR REPLACE FUNCTION ulids(
i_uid bigint,
i_lids integer[]
) RETURNS bigint[] AS $$
SELECT array_agg((i_uid << 32) | lid)
  FROM unnest(i_lids) lid;
$$ LANGUAGE SQL IMMUTABLE STRICT;

CREATE INDEX i_box_uid_lids
ON box USING gin (ulids(uid, lids)) WITH (FASTUPDATE=OFF);

Executing the following query with pgbench

\setrandom uid 1 150
\setrandom lid 1 1000
\setrandom lid2 1 1000
\setrandom lid3 1 1000
BEGIN;
insert into box values(:uid,'{:lid,:lid2,:lid3}');
insert into box values(:uid,'{}');
END;

and eventually deleting some of the data. This testbed showed VACUUM
holding inserts for up to tens of seconds. They claim that the proposed
patch made vacuum invisible in this test.

Evgeniy, Vladimir, if I missed something or you have something to add,
please join the discussion.

===Known issues===
1. Probably, better algorithms exist. I could not adapt the B-tree
vacuum wizardry to GIN, but that does not mean it is impossible.

2. More cleanup locks may be taken. Under write-dense scenarios,
this can lead to a longer vacuum, but it should not consume more
resources, just wait longer.

3. I have changed the lock sequence during page deletion. I think that
it is safe; the comments there feared inserts, which are excluded by the
cleanup lock. More details in the code of ginDeletePage().

4. ginVacuumPostingTreeLeaves() traverses child pages after releasing
the parent lock (even a SHARED one). This is safe if there is only one
vacuum at a time. Is there a reason to be afraid of concurrent
vacuums?


I will be happy if someone points me to the best place to read
about the B-tree vacuum process, or explains to me how it
works in a few words.
Dev process is here
https://github.com/x4m/postgres_g/pull/2

Thank you for reading. I'm looking forward to hearing your thoughts on the matter.


Best regards, Andrey Borodin.
diff --git a/src/backend/access/gin/ginbtree.c 
b/src/backend/access/gin/ginbtree.c
index a0afec4..dc28c81 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack 
*stack,
 /*
  * Lock buffer by needed method for search.
  */
-static int
+int
 ginTraverseLock(Buffer buffer, bool searchMode)
 {
Pagepage;
diff --git a/src/backend/access/gin/ginvacuum.c 
b/src/backend/access/gin/ginvacuum.c
index 2685a1c..bc54284 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer)
PageSetLSN(page, recptr);
 }
 
-static bool
-ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool 
isRoot, Buffer *rootBuffer)
-{
-   Buffer  buffer;
-   Pagepage;
-   boolhasVoidPage = FALSE;
-   MemoryContext oldCxt;
-
-   buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-   RBM_NORMAL, 
gvs->strategy);
-   page = BufferGetPage(buffer);
-
-   /*
-* We should be sure that we don't concurrent with inserts, insert 
process
-* 

Re: [HACKERS] Fractal tree indexing

2016-11-16 Thread Andrew Borodin
Hi hackers!

Here is a prototype of procrastinating GiST. Or lazy GiST. Or buffered GiST.
Indeed, there is nothing fractal about it.

The concept is that leaf tuples stall on internal pages; from time to time
they are pushed down in batches. This code is far from perfect; I've coded it
only for research purposes.

I have tested the code with insertion of random 3D cubes. On 1 million
insertions classic GiST is 8% faster, on 3M performance is equal, and on 30M
lazy GiST is 12% faster.
I've tested it with an SSD disk; maybe with an HDD the difference will be
more significant.
Test scenario is here https://github.com/x4m/postgres_g/blob/bufdev/test.sql

In theory, this is a cache-friendly implementation (but not cache-oblivious),
and it should be faster even for small datasets without heavy disk work. But
there is the extra cost of copying a batch of tuples to a palloc'd array;
I couldn't avoid that.

This is just a proof-of-concept for performance measurements:
1. make check fails for two tests
2. WAL is broken
3. code is a mess in some places

I'm not sure 12% of performance is worth it. I'd appreciate any ideas on what
to do next.

Best regards, Andrey Borodin.

2013-02-13 22:54 GMT+05:00 Simon Riggs :

> On 13 February 2013 16:48, Heikki Linnakangas 
> wrote:
> > On 13.02.2013 18:20, Tom Lane wrote:
> >>
> >> Heikki Linnakangas  writes:
> >>>
> >>> The basic idea of a fractal tree index is to attach a buffer to every
> >>> non-leaf page. On insertion, instead of descending all the way down to
> >>> the correct leaf page, the new tuple is put on the buffer at the root
> >>> page. When that buffer fills up, all the tuples in the buffer are
> >>> cascaded down to the buffers on the next level pages. And recursively,
> >>> whenever a buffer fills up at any level, it's flushed to the next
> level.
> >>
> >>
> >> [ scratches head... ]  What's "fractal" about that?  Or is that just a
> >> content-free marketing name for this technique?
> >
> >
> > I'd call it out as a marketing name. I guess it's fractal in the sense
> that
> > all levels of the tree can hold "leaf tuples" in the buffers; the
> structure
> > looks the same no matter how deep you zoom, like a fractal.. But
> "Buffered"
> > would be more appropriate IMO.
>
> I hope for their sake there is more to it than that. It's hard to see
> how buffering can be patented.
>
> --
>  Simon Riggs   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..29770d2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -265,6 +265,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE 
*giststate,
boolis_rootsplit;
int npage;
 
+
is_rootsplit = (blkno == GIST_ROOT_BLKNO);
 
/*
@@ -524,6 +525,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE 
*giststate,
gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
}
 
+   GistPageGetOpaque(page)->nlazy=1; //DO NOT FORGET: remark pages 
after split
+
MarkBufferDirty(buffer);
 
if (BufferIsValid(leftchildbuf))
@@ -589,6 +592,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE 
*giststate,
  * this routine assumes it is invoked in a short-lived memory context,
  * so it does not bother releasing palloc'd allocations.
  */
+
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 {
@@ -597,7 +601,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, 
GISTSTATE *giststate)
GISTInsertStack firststack;
GISTInsertStack *stack;
GISTInsertState state;
-   boolxlocked = false;
 
	memset(&state, 0, sizeof(GISTInsertState));
state.freespace = freespace;
@@ -610,6 +613,21 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, 
GISTSTATE *giststate)
firststack.downlinkoffnum = InvalidOffsetNumber;
	state.stack = stack = &firststack;
 
+
+   gistdoinsertloop(r,itup,freespace, giststate, stack, state, 0);
+}
+
+void
+gistdoinsertloop(Relation r, IndexTuple itup, Size freespace, GISTSTATE 
*giststate,GISTInsertStack *givenstack, GISTInsertState state, GISTInsertStack 
*nolazy)
+{
+   ItemId  iid;
+   IndexTuple  idxtuple;
+
+   boolxlocked = false;
+   GISTInsertStack *stack = givenstack;
+
+
+
/*
 * Walk down along the path of smallest penalty, updating the parent
 * pointers with the key we're inserting as we go. If we crash in the
@@ -685,9 +703,86 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, 
GISTSTATE *giststate)
GISTInsertStack *item;
OffsetNumber downlinkoffnum;
 
+
+
+   

Re: [HACKERS] GiST penalty functions [PoC]

2016-10-02 Thread Andrew Borodin
> This thread has basically died, so marking as returned with feedback.
Well, not exactly died, but it's kind of in a blind alley.

I'll summarize a bit for future references:
1. The cube extension for indexing lowered-dimensionality data (2D in 3D)
is broken. Queries will effectively degrade to a full index scan.
2. In the existing GiST API this can be fixed only with the technique
implemented in this patch. But this technique seems controversial.
3. This patch also brings a 30%-40% speedup, but the expected speedup of
advanced GiST API techniques is much higher.
As Norbert Beckmann put it in other discussion:
> Overlap optimization [requires API advancement] is one of the main elements, 
> if not the main query performance tuning element of the RR*-tree. You would 
> fall back to old R-Tree times if that would be left off.

Outcome: I'm planning to roll out another patch for both GiST and cube,
probably for the next CF, which will render this patch unnecessary.
Thanks to everyone for discussion.

Best regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-21 Thread Andrew Borodin
Hi Heikki!

> Got a link for a description of the RR*-tree? Would be good to reference it
> in the patch comments, too.
Well, as for now, the best way to reach the paper is
DOI 10.1145/1559845.1559929
http://sci-hub.cc/
The authors in conversations clearly stated that they endorse (not sure
that's the correct word) an implementation in PostgreSQL, so I do not think
it's a bad kind of piracy.
More or less persistent link is http://dl.acm.org/citation.cfm?id=1559929

> If I understand correctly, cases #1 and #3 arise when one of the dimensions
> is 0. For example, in a 3D space, if the existing entry is a rectangle on a
> plane, with zero-length edge on one of the dimensions, and the new entry is
> on the same plane. Case #1 arises if the new entry falls within that
> rectangle, and case #3 if it's outside it. Currently, we treat all such
> cases as 0-penalty, because the degenerate 0-dimension causes all the
> calculated volumes to become zero. So clearly we can do better, which is
> what this patch does.
>
> At first blush, I'm surprised that you switch to using the sum of the edges
> in those cases. I would expect to ignore the degenerate 0-dimension, and
> calculate the volume using the other dimensions. So in a 3D space, you would
> calculate the increase of the area of the rectangle (A*B), not the sum of
> the edges (A+B). And it probably would be best to take into account how many
> of the dimensions are zero. So in a 3D space, if there is an existing line
> segment that the new point falls into, and also a rectangle that the new
> point falls into, you should prefer the 1-dimensional line segment over the
> 2-dimensional rectangle.
>
> I don't know how big a difference that makes in practice. But it seems odd
> that if you e.g. have a set of 3D points, but the Z dimension in all of the
> points is 0, the algorithm behaves differently than if you had the exact
> same points in a 2D space.
>
> (If this is explained in the RR*-tree paper, feel free to just point me to
> that and I'll ask again if I have further questions.)

As far as I know, your version of penalty-function degradation is a
new concept for R-trees. I have not seen this idea before.
It is based on two assertions:
1. Degrading algorithms should resemble general algorithm, if choice
of general algorithm is correct.
2. Degradation happens when and only when at least on edge of MBB has zero size.

The first assertion seems correct, while the second is wrong. When you index
high-dimensional data (say 10 dimensions), you can easily reach 0 by
multiplication of values around 10^-4. And such data is often a result
of scaling and normalization in machine learning; these concepts are
natural there, along with high dimensionality.

We can rejig your algorithm: the edges are sorted in descending order and
multiplied, stopping just before reaching zero. Choose-subtree picks a tuple
by the count of numbers multiplied, resolving ties by the result of the
multiplication.

We can overflow with big edges, but the current cube has no more than
100 dimensions. That allows a little more than 1000 per edge before
overflow (if the multiplication is done in doubles).
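A minimal sketch of that rejigged comparison key (illustrative only; the struct and function names below are mine, not from any patch):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the "rejigged" choose-subtree key described above:
 * sort the MBB edge lengths in descending order, multiply until the running
 * product would underflow to zero, and report (how many edges were folded in,
 * the product so far).  A subtree whose key folds in more edges wins; ties
 * are resolved by the product. */
typedef struct EdgeKey
{
	int			nfolded;		/* how many edges were multiplied in */
	double		product;		/* their product */
} EdgeKey;

static int
cmp_desc(const void *a, const void *b)
{
	double		x = *(const double *) a,
				y = *(const double *) b;

	return (x < y) - (x > y);	/* descending order */
}

static EdgeKey
edge_key(const double *edges, int n)
{
	EdgeKey		k = {0, 1.0};
	double	   *e = malloc(n * sizeof(double));

	memcpy(e, edges, n * sizeof(double));
	qsort(e, n, sizeof(double), cmp_desc);
	for (int i = 0; i < n; i++)
	{
		if (e[i] * k.product == 0.0)	/* stop just before reaching zero */
			break;
		k.product *= e[i];
		k.nfolded++;
	}
	free(e);
	return k;
}
```

With doubles, ten edges of 1e-4 fold in completely, while edges around 1e-200 stop the product after two factors, which is exactly the degenerate case the text worries about.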

In practice, this algorithm cannot be implemented in the current GiST API
(only if we provide a non-penalty-based choose-subtree function, optional
for a GiST extension), but it certainly has scientific value.

Regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST: interpretation of NaN from penalty function

2016-09-15 Thread Andrew Borodin
> Certainly, "NaN means infinity", as your patch proposes, has no more basis to 
> it than "NaN means zero".
You are absolutely right. Now I see that the best interpretation is "NaN
means NaN". It seems we need only drop the check. Nodes with NaN
penalties will then be considered even worse than those with an infinity
penalty.
Penalty calculation is CPU-performance-critical; it is called for
every tuple on a page along the insertion path. Omitting this check will
speed this up... a tiny bit.
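A tiny standalone illustration (not PostgreSQL code) of how NaN behaves in a "pick the smallest penalty" loop. One subtlety: since every comparison with NaN is false, a NaN never displaces a non-NaN best, but a NaN that happens to be first is never displaced either:

```c
#include <math.h>

/* Illustrative only: choose the array index with the smallest penalty,
 * using a strict "<" to displace the current best, as a GiST-like
 * chooser would.  Every comparison involving NaN is false, so a NaN
 * can never displace a non-NaN best -- but a NaN sitting in the first
 * slot is never displaced either. */
static int
choose_min_penalty(const float *penalty, int n)
{
	int			which = 0;

	for (int i = 1; i < n; i++)
		if (penalty[i] < penalty[which])	/* false if either side is NaN */
			which = i;
	return which;
}
```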
> If the penalty function doesn't like that interpretation, it shouldn't return 
> NaN.
It may return NaN accidentally. If NaN passes through the union()
function, the index will be poisoned.
That's not a good contract for an extension developer to remember.


Regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GiST: interpretation of NaN from penalty function

2016-09-14 Thread Andrew Borodin
Hi hackers!

Currently GiST treats a NaN penalty as zero penalty; in terms of a
generalized tree this means "perfect fit". I think this situation
should be considered "worst fit" instead.
Here is a patch to highlight place in the code.
I could not construct a test that generates a bad tree which would be fixed
by this patch. There are not so many cases in which you get NaN. None of
them can be a result of the usual additions and multiplications of real
values.

Am I missing something? Is there any case when NaN should be considered a good fit?

Greg Stark was talking about this in
BANLkTi=d+bpps1cm4yc8kukhj63hwj4...@mail.gmail.com but that topic
didn't go far (due to triangles). I'm currently messing with floats in
penalties, very close to NaNs, and I think this question can be
settled.

Regards, Andrey Borodin.
diff --git a/src/backend/access/gist/gistutil.c 
b/src/backend/access/gist/gistutil.c
index fac166d..0a62561 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -675,8 +675,14 @@ gistpenalty(GISTSTATE *giststate, int attno,
  PointerGetDatum(orig),
  PointerGetDatum(add),
  PointerGetDatum(&penalty));
-   /* disallow negative or NaN penalty */
-   if (isnan(penalty) || penalty < 0.0)
+   /*
+* disallow negative or NaN penalty
+* NaN is considered float infinity - i.e. the most inappropriate fit
+* negative is considered a penaltiless fit
+*/
+   if (isnan(penalty))
+   penalty = get_float4_infinity();
+   if (penalty < 0.0)
penalty = 0.0;
}
else if (isNullOrig && isNullAdd)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-14 Thread Andrew Borodin
Here is v6 of cube penalty patch.
There was a warning about the unused edge function on systems without
__STDC_IEC_559__ defined; this patch fixes it.

Regards, Andrey Borodin.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..ad868ac 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -18,6 +18,7 @@
 
 #include "cubedata.h"
 
+
 PG_MODULE_MAGIC;
 
 /*
@@ -27,6 +28,15 @@ PG_MODULE_MAGIC;
 #define ARRNELEMS(x)  ArrayGetNItems( ARR_NDIM(x), ARR_DIMS(x))
 
 /*
+ * If IEEE754 floats are fully supported we use advanced penalty
+ * which takes into account cube volume, perimeter and tend to favor
+ * small nodes over big ones. For more info see g_cube_penalty implementation
+ */
+#ifdef __STDC_IEC_559__
+#define ADVANCED_PENALTY
+#endif
+
+/*
 ** Input/Output routines
 */
 PG_FUNCTION_INFO_V1(cube_in);
@@ -91,14 +101,18 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static int32   cube_cmp_v0(NDBOX *a, NDBOX *b);
+static boolcube_contains_v0(NDBOX *a, NDBOX *b);
+static boolcube_overlap_v0(NDBOX *a, NDBOX *b);
+static NDBOX  *cube_union_v0(NDBOX *a, NDBOX *b);
+#ifdef ADVANCED_PENALTY
+static float   pack_float(float value, int realm);
+static voidrt_cube_edge(NDBOX *a, double *sz);
+#endif
+static voidrt_cube_size(NDBOX *a, double *sz);
+static NDBOX  *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
+static boolg_cube_leaf_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static boolg_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
@@ -419,7 +433,36 @@ g_cube_decompress(PG_FUNCTION_ARGS)
}
PG_RETURN_POINTER(entry);
 }
+/*
+ * Function to pack floats of different realms
+ * This function serves to pack bit flags inside float type
+ * The resulting value can be from one of four different "realms"
+ * Every value from realm 3 is greater than any value from realms 2,1 and 0.
+ * Every value from realm 2 is less than every value from realm 3 and greater
+ * than any value from realm 1 and 0, and so on. Values from the same realm
+ * lose two bits of precision. This technique is possible due to floating
+ * point numbers specification according to IEEE 754: exponent bits are highest
+ * (excluding sign bits, but here penalty is always positive). If float a is
+ * greater than float b, integer A with same bit representation as a is greater
+ * than integer B with same bits as b.
+ */
+#ifdef ADVANCED_PENALTY
+static float
+pack_float(const float value, const int realm)
+{
+  union {
+float f;
+struct { unsigned value:31, sign:1; } vbits;
+struct { unsigned value:29, realm:2, sign:1; } rbits;
+  } a;
+
+  a.f = value;
+  a.rbits.value = a.vbits.value >> 2;
+  a.rbits.realm = realm;
 
+  return a.f;
+}
+#endif
 
 /*
 ** The GiST Penalty method for boxes
@@ -441,9 +484,42 @@ g_cube_penalty(PG_FUNCTION_ARGS)
	rt_cube_size(DatumGetNDBOX(origentry->key), &tmp2);
*result = (float) (tmp1 - tmp2);
 
-   /*
-* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result);
-*/
+#ifdef ADVANCED_PENALTY
+   /* Realm tricks are used only in case of IEEE754 support(IEC 60559) */
+
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
+   if( *result == 0 )
+   {
+   double tmp3 = tmp1; /* remember entry volume */
+   rt_cube_edge(ud, &tmp1);
+   rt_cube_edge(DatumGetNDBOX(origentry->key), &tmp2);
+   *result = (float) (tmp1 - tmp2);
+   if( *result == 0 )
+   {
+   if( tmp3 != 0 )
+   {
+   *result = pack_float(tmp3, 1); /* REALM 1 */
+   }
+   else
+   {
+   *result = pack_float(tmp1, 0); /* REALM 0 */
+   }
+   }
+   else
+   {
+   *result = pack_float(*result, 2); /* REALM 2 */
+   }
+   }
+   else
+   {
+  

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-13 Thread Andrew Borodin
Hi hackers!

Here is the GiST cube penalty patch, v5.
Changes:
1. The union version of pack_float() is used, without warnings and with a
minimum of asm instructions emitted.
2. A test for __STDC_IEC_559__ is added; this is the standard C99 way to
determine float compliance with IEEE 754. ANSI-only compilers and libs,
along with non-IEEE-float systems, will fall back to the pre-patch
behavior.

Best regards, Andrey Borodin.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..c86418d 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -91,14 +91,18 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static int32   cube_cmp_v0(NDBOX *a, NDBOX *b);
+static boolcube_contains_v0(NDBOX *a, NDBOX *b);
+static boolcube_overlap_v0(NDBOX *a, NDBOX *b);
+static NDBOX  *cube_union_v0(NDBOX *a, NDBOX *b);
+#ifdef __STDC_IEC_559__
+static float   pack_float(float value, int realm);
+#endif
+static voidrt_cube_size(NDBOX *a, double *sz);
+static voidrt_cube_edge(NDBOX *a, double *sz);
+static NDBOX  *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
+static boolg_cube_leaf_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static boolg_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
@@ -419,7 +423,36 @@ g_cube_decompress(PG_FUNCTION_ARGS)
}
PG_RETURN_POINTER(entry);
 }
+/*
+ * Function to pack floats of different realms
+ * This function serves to pack bit flags inside float type
+ * The resulting value can be from one of four different "realms"
+ * Every value from realm 3 is greater than any value from realms 2,1 and 0.
+ * Every value from realm 2 is less than every value from realm 3 and greater
+ * than any value from realm 1 and 0, and so on. Values from the same realm
+ * lose two bits of precision. This technique is possible due to floating
+ * point numbers specification according to IEEE 754: exponent bits are highest
+ * (excluding sign bits, but here penalty is always positive). If float a is
+ * greater than float b, integer A with same bit representation as a is greater
+ * than integer B with same bits as b.
+ */
+#ifdef __STDC_IEC_559__
+static float
+pack_float(const float value, const int realm)
+{
+  union {
+float f;
+struct { unsigned value:31, sign:1; } vbits;
+struct { unsigned value:29, realm:2, sign:1; } rbits;
+  } a;
 
+  a.f = value;
+  a.rbits.value = a.vbits.value >> 2;
+  a.rbits.realm = realm;
+
+  return a.f;
+}
+#endif
 
 /*
 ** The GiST Penalty method for boxes
@@ -428,6 +461,11 @@ g_cube_decompress(PG_FUNCTION_ARGS)
 Datum
 g_cube_penalty(PG_FUNCTION_ARGS)
 {
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
float  *result = (float *) PG_GETARG_POINTER(2);
@@ -441,9 +479,36 @@ g_cube_penalty(PG_FUNCTION_ARGS)
	rt_cube_size(DatumGetNDBOX(origentry->key), &tmp2);
*result = (float) (tmp1 - tmp2);
 
-   /*
-* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result);
-*/
+#ifdef __STDC_IEC_559__
+   /* Realm tricks are in use only in case of IEEE754 support(IEC 60559)*/
+   if( *result == 0 )
+   {
+   double tmp3 = tmp1; /* remember entry volume */
+   rt_cube_edge(ud, &tmp1);
+   rt_cube_edge(DatumGetNDBOX(origentry->key), &tmp2);
+   *result = (float) (tmp1 - tmp2);
+   if( *result == 0 )
+   {
+   if( tmp3 != 0 )
+   {
+   *result = pack_float(tmp3, 1); /* REALM 1 */
+   }
+   else
+   {
+   *result = pack_float(tmp1, 0); /* REALM 0 */
+   }
+   }
+   else
+   {
+   *result = pack_float(*result, 2); /* REALM 2 */
+   }
+   }
+   else
+   {
+   

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-10 Thread Andrew Borodin
>Personally I wouldn't be very happy about an IEEE754 assumption.
OK, so let's avoid the autoconf manual. There is no real explanation of the
grounds for this assumption, just a reference to a paper by David
Goldberg (and there is no mention that IEEE 754 is absolutely
everywhere). BTW, maybe we can ask David Goldberg how to hack that
float properly? I do not see any mathematical sense in this:

pack_float(float actualValue, int realm)
{
...
  int realmAjustment = *((int*)&actualValue)/4;
  float something = *((float*)&realmAjustment);
...
}
No wonder it's not in libraries.
Nor do I see a way to implement it with ldexp and its siblings.
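For what it's worth, the bit trick itself can be written without pointer aliasing at all, via memcpy-based type punning. This is a sketch under the assumptions that floats are IEEE 754 binary32 and the packed value is non-negative (the function name is mine, not from the patch):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: squeeze a 2-bit "realm" tag into a non-negative float while
 * preserving ordering.  For non-negative IEEE 754 floats, the raw bit
 * patterns order the same way as the values, so shifting the bits right
 * by two (losing two mantissa bits of precision) and writing the realm
 * into bits 29-30 yields a float that sorts first by realm, then by the
 * original value. */
static float
pack_float_sketch(float value, uint32_t realm)
{
	uint32_t	bits;

	memcpy(&bits, &value, sizeof(bits));	/* well-defined type punning */
	bits = (bits >> 2) | (realm << 29);		/* realm dominates comparison */
	memcpy(&value, &bits, sizeof(bits));
	return value;
}
```

Any value packed into realm r+1 then compares greater than any value packed into realm r, and ordering within a realm is preserved up to the two dropped mantissa bits.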
>I did go to the trouble of testing Postgres on a VAX and we fixed the few
instances where it had such dependencies for a net gain.
Greg, could you please point out those places, to see how exactly it
should be #ifdef'd?

Regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-08 Thread Andrew Borodin
>BTW, would someone explain to me why using a float here will not fail 
>catastrophically for inputs outside the range of float?

Indeed, it will. And that's one of the reasons I'm considering
advancing the GiST API instead of advancing extensions.

It won't crash, but it will produce a terrible index, not suitable for
performing efficient queries.
GiST treats penalty this way:
/* disallow negative or NaN penalty */
if (isnan(penalty) || penalty < 0.0)
penalty = 0.0;

Any NaN is a "perfect fit".

Calculation of a real (say float) penalty and the choice of the subtree with
the smallest penalty is an idea from '92. A modern spatial index requires
more than just a float to compare subtrees.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-08 Thread Andrew Borodin
>autoconf check for IEEE 754 floats
The autoconf manual says the following:
>it is safe to assume IEEE-754 in most portable code these days
https://www.gnu.org/software/autoconf/manual/autoconf.html#Floating-Point-Portability

> A union might be more readable
Here is the union version of the patch. It's 10% slower than the original
cube and the dereference version. I have no idea why.
Select performance is improved as in v3.

Also, I've investigated a bit why the linear packing failed. It's useful
to think in terms of bijections, not in terms of arithmetic functions.

float 1 is 1065353216 hex 3f800000
FLT_MAX / ( 8 >> 3 ) is 2139095039 hex 7f7fffff
FLT_MAX / ( 8 >> 2 ) is 2130706431 hex 7effffff
FLT_MAX / ( 8 >> 1 ) is 2122317823 hex 7e7fffff
FLT_MAX / ( 8 >> 0 ) is 2113929215 hex 7dffffff

These realm borders are completely wrong.
That means I used 800 thousand values to pack 2 billion values of
realms 1-3, and all the other 2.1 billion values to pack realm 0. The
fragile arithmetic was done wrong.

What I should have used was:

Realm 3
INT32_MAX is NaN (or something slightly below NaN)
3*INT32_MAX/4 is 3.689348e+19
Total: slightly fewer than 512M different values

Realm 2
3*INT32_MAX/4 is 3.689348e+19
INT32_MAX/2 is 2.00e+00
Total: 512M different values

Realm 1
INT32_MAX/2 is 2.00e+00
INT32_MAX/4 is 1.084202e-19
Total: 512M different values

Realm 0
INT32_MAX/4 is 1.084202e-19
0 is 0.00e+00
Total: 512M different values

Though I haven't tested it yet. I'm not sure, BTW, that hardcoding these
constants is a good idea.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..ded639b 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -91,14 +91,16 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static int32   cube_cmp_v0(NDBOX *a, NDBOX *b);
+static boolcube_contains_v0(NDBOX *a, NDBOX *b);
+static boolcube_overlap_v0(NDBOX *a, NDBOX *b);
+static NDBOX  *cube_union_v0(NDBOX *a, NDBOX *b);
+static float   pack_float(float actualValue, int realm);
+static voidrt_cube_size(NDBOX *a, double *sz);
+static voidrt_cube_edge(NDBOX *a, double *sz);
+static NDBOX  *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
+static boolg_cube_leaf_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static boolg_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
@@ -419,7 +421,32 @@ g_cube_decompress(PG_FUNCTION_ARGS)
}
PG_RETURN_POINTER(entry);
 }
+/*
+ * Function to pack floats of different realms
+ * This function serves to pack bit flags inside float type
+ * Resulted value represent can be from four different "realms"
+ * Every value from realm 3 is greater than any value from realms 2,1 and 0.
+ * Every value from realm 2 is less than every value from realm 3 and greater
+ * than any value from realm 1 and 0, and so on. Values from the same realm
+ * loose two bits of precision. This technique is possible due to floating
+ * point numbers specification according to IEEE 754: exponent bits are highest
+ * (excluding sign bits, but here penalty is always positive). If float a is
+ * greater than float b, integer A with same bit representation as a is greater
+ * than integer B with same bits as b.
+ */
+static float
+pack_float(float actualValue, int realm)
+{
+   union
+   {
+   float fp;
+   int32 bits;
+   } buffer;
 
+   buffer.fp = actualValue;
+   buffer.bits = (buffer.bits >> 2) + (INT32_MAX / 4) * realm;
+   return buffer.fp;
+}
 
 /*
 ** The GiST Penalty method for boxes
@@ -428,6 +455,11 @@ g_cube_decompress(PG_FUNCTION_ARGS)
 Datum
 g_cube_penalty(PG_FUNCTION_ARGS)
 {
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
float  *result = (float *) PG_GETARG_POINTER(2);
@@ -441,9 +473,33 @@ g_cube_penalty(PG_FUNCTION_ARGS)

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Andrew Borodin
Well, arithmetics is too fragile.

This version of float packing with arithmetical packaging
static float
pack_float(float actualValue, int realm)
{
double max,min;
max = FLT_MAX / ( 8 >> realm );
min = FLT_MAX / ( 16 >> realm );
if( realm == 0 )
min = 0;
/* squeeze the actual value between min and max */
return ( min + (actualValue * ( max - min ) / FLT_MAX));
}

Inserts are the same as with the bit-hacked pack_float, but selects are 5
times slower.
When we try to pack a value linearly into a range, we lose too many bits.
Here is how it happens: in floats, adding small values to big
values causes the small values to be lost.
Applied to Heikki's algorithm this means that nV, nE and dV can all be
in different mantissa ranges. (Please note that dV can be many times
larger than nV, or many times smaller.)

An integer bit shift of a float has no arithmetical meaning. It would be
hard to describe in formulas what carrying an exponent bit into the
significand field means. But this bit shift preserves the order among
almost all float values (except for 2 controllably lost bits, and some
values become sNaN). The entire idea of the advanced GiST penalty stands
on this.

The other way is to add to the API an optional handler which executes the
whole choose-subtree algorithm inside cube (or another extension).

Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Andrew Borodin
Oh, sorry, I made one more attempt and now I see your algorithm differently.

So you propose to use oE and oV as border markers for what I call a realm.
But there may be very few significant bits in one of these ranges.
The pg_sphere and PostGIS extensions tried to use 1 as a marker, with a
similar formula, and that did not generate a good tree.
Here's how they did it:
https://github.com/postgis/postgis/commit/9569e898078eeac8928891e8019ede2cbf27d06f

I'll try to compose a patch for your formula later. I think we should
just try and see; it is easier than reasoning about floating-point
internals.

Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Andrew Borodin
Hi Heikki!
Thank you for reviewing the code, it's always inspiring when a work is
noted (:

>Looking at the code, by "margin", you mean the sum of all "edges", i.e. of
each dimension, of the cube.
Indeed. As far as I remember, this is the terminology of the old R*-tree
paper. Now they use the word "perimeter"; it seems closer for
English-speaking folks, so in future patches I'll switch to "perimeter".

>I guess the point of that is to differentiate between cubes that have the
same volume, but a different shape.

Somewhat like this. For an R[something]-tree, volume is the ultimate
measure for comparing figures close to a multidimensional cube. The tree
must tend to form MBRs with equal edges to ensure search optimizations
spread across all dimensions equally.
But in practice volume degrades quickly. When you index points, you quickly
get a page with a "plane": lots of points with one dimension at a fixed
value. Then the R-tree does not work anymore; inserts are somewhat
randomized by GiST, but in practice the tree is a lot worse than
randomized, due to the randomization's skew towards early insert
candidates.
Perimeter is not as good as volume in the general case, but it does not
degrade until you have truly equal object MBRs.

So this was volume vs. perimeter. For example, MySQL uses a #define to
choose one of these strategies, which does not seem like good design to
me. In this patch I propose to use perimeter when the volume strategy has
degraded, effectively when one dimension's edge is 0 or some dimensions'
edges are close to epsilon.

But there is one more tie to break. When you choose a subtree for
insertion, some candidates have zero volume extension and zero edge
extension: they do not need to be extended to accommodate the new tuple.
In this case we choose the smallest one, by volume. If all of them have 0
volume, we choose by perimeter.

All in all, we have 4 different cases here, ordered by priority.

>So, let's try to come up with a scheme that doesn't require IEEE 754
floats.
1. Converting my algorithm to float ops. I actually hadn't thought
about it before, but it seems impossible. A bit shift makes no sense as a
regular float operation: under no circumstances can exponent bits shift
into the significand field and vice versa. If we do not move the last bit
of the exponent, all significand bits are garbage, which leaves us with
only 6 bits for the actual value; that is unacceptable.
2. Your algorithm, besides losing some bits of precision (which is
absolutely acceptable, we have like 31 of them and that's a lot), relies
on a false assumption. We compare tuples on a page not by the MBR of the
inserted tuple, but by the MBR of each tuple on the page, which is
different for every penalty calculation.
I'll put it in your variable names; they are good for reasoning about the
proposed algorithm.

oE: Original tuple's edge sum (perimeter measure). This tuple is on a page.
nE: New tuple's edge sum. This tuple is to be inserted. We calculate the
penalty to choose the subtree with the minimum penalty.
dE: Edge increase. Edge of union(original|new) minus oE.

oV: Original tuple's volume
nV: New tuple's volume
dV: Volume increase

Best desired algorithm: sort tuples by dV, then by dE, then by oV, then by
oE (all ascending) and pick the top 1.
Unfortunately, I do not see a way to do that in 31 bits of float (in GiST
we cannot use the sign bit; <=0 is used as a magic condition).
So implemented is following:
Say we have a number ALOT which is guaranteed to be bigger than dV, dE,
oV and oE.
if( dV > 0 )
  penalty = ALOT*3 + dV; // nonzero dV goes last
elif ( dE > 0 )
  penalty = ALOT*2 + dE; // then goes dE
elif ( oV > 0 )
  penalty = ALOT + oV;
else
  penalty = oE;
// oE is zero for subtrees containing only equal points. In this case we
// pass the split optimization possibility to the next GiST attribute.

The volume and perimeter of the new entry do not affect the choice of
subtree directly, whether they are zero or not.

>Hmm. That's a pretty large increase in construction time. Have you done
any profiling on where the time goes?
Since I've changed only one function, I didn't consider profiling. I
asked Michael to check the disassembly of the function call tree,
implemented his recommendations, and patch v2 has all the build
performance back (: And now it computes aggregates 40% faster.
>continue work within cube
There is one more option: implement the index as an extension using the
CREATE ACCESS METHOD feature of 9.6. It is an option: we can skip all the
GiST API hacks and implement a real RR*-tree, and there are some other
improvements. But, on the other hand, improving the GiST API could be
beneficial to other extensions.

Again, thanks for your attention. Possibly we can come up with some
solution without dirty hacks.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-01 Thread Andrew Borodin
Here is new version of the patch.
Now function pack_float is commented better.
All inline keywords are removed; I haven't found any performance loss
from that. Removing the statics showed a 1%-7% performance loss in SELECT
computation (3 test runs), so I left the statics where they are.

Actually, to avoid this kind of hack, it would probably be better to
fork GiST into GiSTv2 and implement many API changes there:
1. Bulk load API
2. Richer choose-subtree API
3. Allow key tests to return not just NO\MAYBE answers, but a SURE answer
to skip key tests for the underlying tree
Some other improvements also require breaking changes for extensions.
The GiST idea is almost 25 years old; modern spatial indices differ a lot
from what existed in the 90s.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..9ce3cd6 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -91,14 +91,16 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static int32   cube_cmp_v0(NDBOX *a, NDBOX *b);
+static boolcube_contains_v0(NDBOX *a, NDBOX *b);
+static boolcube_overlap_v0(NDBOX *a, NDBOX *b);
+static NDBOX  *cube_union_v0(NDBOX *a, NDBOX *b);
+static float   pack_float(float actualValue, int realm);
+static voidrt_cube_size(NDBOX *a, double *sz);
+static voidrt_cube_edge(NDBOX *a, double *sz);
+static NDBOX  *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
+static boolg_cube_leaf_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static boolg_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
@@ -419,7 +421,27 @@ g_cube_decompress(PG_FUNCTION_ARGS)
}
PG_RETURN_POINTER(entry);
 }
-
+/*
+ * Function to pack bit flags inside float type
+ * Resulted value represent can be from four different "realms"
+ * Every value from realm 3 is greater than any value from realms 2,1 and 0.
+ * Every value from realm 2 is less than every value from realm 3 and greater
+ * than any value from realm 1 and 0, and so on. Values from the same realm
+ * loose two bits of precision. This technique is possible due to floating
+ * point numbers specification according to IEEE 754: exponent bits are highest
+ * (excluding sign bits, but here penalty is always positive). If float a is
+ * greater than float b, integer A with same bit representation as a is greater
+ * than integer B with same bits as b.
+ */
+static float
+pack_float(float actualValue, int realm)
+{
+   /* two bits for realm, others for value */
+   /* we have 4 realms */
+   int realmAjustment = *((int*)&actualValue)/4;
+   int realCode = realm * (INT32_MAX/4) + realmAjustment;
+   return *((float*)&realCode);
+}
 
 /*
 ** The GiST Penalty method for boxes
@@ -428,6 +450,11 @@ g_cube_decompress(PG_FUNCTION_ARGS)
 Datum
 g_cube_penalty(PG_FUNCTION_ARGS)
 {
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
float  *result = (float *) PG_GETARG_POINTER(2);
@@ -441,9 +468,33 @@ g_cube_penalty(PG_FUNCTION_ARGS)
	rt_cube_size(DatumGetNDBOX(origentry->key), &tmp2);
*result = (float) (tmp1 - tmp2);
 
-   /*
-* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result);
-*/
+   if( *result == 0 )
+   {
+   double tmp3 = tmp1; /* remember entry volume */
+   rt_cube_edge(ud, &tmp1);
+   rt_cube_edge(DatumGetNDBOX(origentry->key), &tmp2);
+   *result = (float) (tmp1 - tmp2);
+   if( *result == 0 )
+   {
+   if( tmp3 != 0 )
+   {
+   *result = pack_float(tmp3, 1); /* REALM 1 */
+   }
+   else
+   {
+   *result = pack_float(tmp1, 0); /* REALM 0 */
+   }
+   }
+  

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-01 Thread Andrew Borodin
Thank you for your coments, Tom.

> Modern compilers are generally able to make their own decisions about it, and 
> trying to put your thumb on the scales this heavily is not likely to improve 
> the code.
Well, I have tested that the combination of "static inline" affects
index build performance on the scale of 5%, though I didn't test with
"static" only.
AFAIK the compiler cannot prove that the function's input and output
arrays do not overlap, so it emits lots of writes to the output address
inside the loop body.

>That pack_float function is absolutely not acceptable
Well, possibly there are ways to achieve penalty optimization without
such hacks, but that failed for pg_sphere and in the PostGIS code. They
reverted the plain arithmetic optimization without bit hacks because it
didn't work. This one works.
There is other way: advance GiST API. But I'm not sure it is possible
to do so without breaking compatibility with many existing extensions.

Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] GiST penalty functions [PoC]

2016-09-01 Thread Andrew Borodin
Hi hackers!

Here is the new patch version.
With the help of Mikhail Bakhterev I've optimized the subroutines of the
penalty function. Index build time is now roughly equivalent to the build
time before the patch (test attached at the start of the thread).
SELECT statement execution time is reduced by 40%.
Changes in the patch:
1. Wrong usage of realms is fixed
2. Cube size and margin (edge) functions are optimized to reduce the
count of memory write instructions (the results of these functions were
written on every iteration of a loop)
3. All private functions are marked as static inline
4. Comments are formatted per project style

I'm going to put this into the commitfest queue, because the performance
of GiST queries is improved significantly and I do not see any serious
drawbacks.

Any ideas about this patch are welcome. Especially, I'm concerned about
the portability of the pack_float function.
Does every supported Postgres platform conform to the IEEE 754 floating
point specification?

Also, I'm not sure whether it is possible to hit any problems with float
NaNs during float packing.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..dec314f 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -91,20 +91,22 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static inline int32cube_cmp_v0(NDBOX *a, NDBOX *b);
+static inline bool cube_contains_v0(NDBOX *a, NDBOX *b);
+static inline bool cube_overlap_v0(NDBOX *a, NDBOX *b);
+static inline NDBOX   *cube_union_v0(NDBOX *a, NDBOX *b);
+static inline floatpack_float(float actualValue, int realm);
+static inline void rt_cube_size(NDBOX *a, double *sz);
+static inline void rt_cube_edge(NDBOX *a, double *sz);
+static inline NDBOX   *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int 
*sizep);
+static inline bool g_cube_leaf_consistent(NDBOX *key, NDBOX 
*query, StrategyNumber strategy);
+static inline bool g_cube_internal_consistent(NDBOX *key, NDBOX 
*query, StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
 */
-static double distance_1D(double a1, double a2, double b1, double b2);
-static bool cube_is_point_internal(NDBOX *cube);
+static inline double distance_1D(double a1, double a2, double b1, double b2);
+static inline bool cube_is_point_internal(NDBOX *cube);
 
 
 /*
@@ -420,6 +422,15 @@ g_cube_decompress(PG_FUNCTION_ARGS)
PG_RETURN_POINTER(entry);
 }
 
+static inline float
+pack_float(float actualValue, int realm)
+{
+   /* two bits for realm, other for value  */
+   /* we have 4 realms */
+   int realmAjustment = *((int*)&actualValue)/4;
+   int realCode = realm * (INT32_MAX/4) + realmAjustment;
+   return *((float*)&realCode);
+}
 
 /*
 ** The GiST Penalty method for boxes
@@ -428,6 +439,11 @@ g_cube_decompress(PG_FUNCTION_ARGS)
 Datum
 g_cube_penalty(PG_FUNCTION_ARGS)
 {
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
float  *result = (float *) PG_GETARG_POINTER(2);
@@ -441,9 +457,33 @@ g_cube_penalty(PG_FUNCTION_ARGS)
	rt_cube_size(DatumGetNDBOX(origentry->key), &tmp2);
*result = (float) (tmp1 - tmp2);
 
-   /*
-* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result);
-*/
+   if( *result == 0 )
+   {
+   double tmp3 = tmp1; /* remember entry volume */
+   rt_cube_edge(ud, &tmp1);
+   rt_cube_edge(DatumGetNDBOX(origentry->key), &tmp2);
+   *result = (float) (tmp1 - tmp2);
+   if( *result == 0 )
+   {
+   if( tmp3 != 0 )
+   {
+   *result = pack_float(tmp3, 1); /* REALM 1 */
+   }
+   else
+   {
+   *result = pack_float(tmp1, 0); /* REALM 0 */
+   

[HACKERS] GiST penalty functions [PoC]

2016-08-29 Thread Andrew Borodin
Hi, hackers!


Some time ago I put a patch into the commitfest that slightly optimizes
GiST inserts and updates. Its effect in general is quite modest (a 15%
perf improvement), but for sorted data inserts are 3 times quicker. I
attribute this effect to mitigation of deficiencies in the old split
algorithm.
Alexander Korotkov advised me to lookup some features of the RR*-tree.

The RR*-tree differs from the combination of GiST + the cube extension
in two algorithms:
1.Algorithm to choose subtree for insertion
2.Split algorithm

This message regards the implementation of the first one in GiST. I
think that both of these algorithms will improve querying performance.

Long story short, there are two problems in the choose-subtree algorithm.
1. GiST chooses a subtree via a penalty calculation for each entry,
while the RR*-tree employs complicated conditional logic: when there are
MBBs (Minimum Bounding Boxes) which do not require extension, the
smallest of them is chosen; in some cases, entries are chosen not by
volume, but by their margin.
2. The RR*-tree algorithm jumps through entry comparisons non-linearly;
I think this means that the GiST penalty computation function will have
to access other entries on a page.

In this message I address only the first problem. I want to construct a
penalty function that will choose:
1. An entry with zero volume and the smallest margin that can
accommodate the insertion tuple without extension, if one exists;
2. An entry with the smallest volume that can accommodate the insertion
tuple without extension, if one exists;
3. An entry with zero volume increase after insert and the smallest
margin increase, if one exists;
4. An entry with minimal volume increase after insert.

Current cube behavior inspects only the 4th option by returning the MBB's
volume increase as a (float) penalty. To implement all 4 options, I use a
hack: the order of positive floats corresponds exactly to the order of
integers with the same binary representation (I'm not sure this holds on
every single supported platform). So I use two bits of the float as if it
were an integer, and all the others are used as shifted bits of the
corresponding float: option 4 - volume increase, 3 - margin increase,
2 - MBB volume, 1 - MBB margin. You can check the reinterpretation cast
function pack_float() in the patch.

I've tested the patch's performance with the attached test. On my machine
the patch slows index construction from 60 to 76 seconds, reduces the
size of the index from 85 MB to 82 MB, and reduces the time of aggregate
computation from 36 seconds to 29 seconds.

I do not know: should I continue this work in cube, or would it be better
to fork cube?
Has anyone already tried an RR*-tree implementation? Scientific papers
show a very good search performance increase.


Please let me know if you have any ideas, information or interest in this topic.


Best regards, Andrey Borodin, Octonica & Ural Federal University.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..9c8c63d 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -96,6 +96,7 @@ bool  cube_contains_v0(NDBOX *a, NDBOX *b);
 bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
 NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
 void   rt_cube_size(NDBOX *a, double *sz);
+void   rt_cube_edge(NDBOX *a, double *sz);
 NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
 bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
 bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
@@ -420,6 +421,13 @@ g_cube_decompress(PG_FUNCTION_ARGS)
PG_RETURN_POINTER(entry);
 }
 
+float pack_float(float actualValue, int realm)
+{
+   // two bits for realm, other for value
+   int realmAjustment = *((int*)&actualValue)/4;
+   int realCode = realm * (INT32_MAX/4) + realmAjustment; // we have 4 realms
+   return *((float*)&realCode);
+}
 
 /*
 ** The GiST Penalty method for boxes
@@ -428,6 +436,11 @@ g_cube_decompress(PG_FUNCTION_ARGS)
 Datum
 g_cube_penalty(PG_FUNCTION_ARGS)
 {
+   // REALM 0: No extension is required, volume is zero, return edge
+   // REALM 1: No extension is required, return nonzero volume
+   // REALM 2: Volume extension is zero, return nonzero edge extension
+   // REALM 3: Volume extension is nonzero, return it
+
GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
float  *result = (float *) PG_GETARG_POINTER(2);
@@ -440,6 +453,32 @@ g_cube_penalty(PG_FUNCTION_ARGS)
	rt_cube_size(ud, &tmp1);
	rt_cube_size(DatumGetNDBOX(origentry->key), &tmp2);
*result = (float) (tmp1 - tmp2);
+   if(*result == 0)
+   {
+   double tmp3 = tmp1;
+   rt_cube_edge(ud, &tmp1);
+   rt_cube_edge(DatumGetNDBOX(origentry->key), &tmp2);
+   *result = (float) (tmp1 - tmp2);
+   if(*result == 0)
+   {
+   if(tmp3!=0)
+   

Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-08-18 Thread Andrew Borodin
Thank you for your corrections.
Here is the patch with the suggestions taken into account, except the 6th.
>6) I'd rather use alignednewsize here.
> +ItemIdSetNormal(tupid, offset + size_diff, newsize);
This behavior is according to the ubiquitous PageAddItem.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-08-16 20:23 GMT+05:00 Anastasia Lubennikova <a.lubennik...@postgrespro.ru>:
> 09.08.2016 19:45, Andrew Borodin:
>>
>> Here is new version of the patch, now it includes recommendations from
>> Anastasia Lubennikova.
>>
>>
>> I've investigated anamalous index size decrease. Most probable version
>> appeared to be true.
>> Cube extension, as some others, use Guttman's polynomial time split
>> algorithm. It is known to generate "needle-like" MBBs (MBRs) for
>> sorted data due to imbalanced splits (splitting 100 tuples as 98 to
>> 2). Due to previous throw-to-the-end behavior of GiST this imbalance
>> was further amplificated (most of inserts were going to bigger part
>> after split). So GiST inserts were extremely slow for sorted data.
>> There is no need to do exact sorting to trigger this behavior.
>> This behavior can be fused by implementation of small-m restriction in
>> split (AFAIR this is described in original R-tree paper from 84),
>> which prescribes to do a split in a parts no smaller than m, where m
>> is around 20% of a page capacity (in tuples number). After application
>> of this patch performance for sorted data is roughly the same as
>> performance for randomized data.
>
>
> Thank you for explanation. Now it's clear to me. I did some more testing and
> found nothing special. The declared feature is implemented correctly.
> It passes all regression tests and improves performance.
>
> I still have a few minor nitpicks about the patch style.
> You can find a style guide on
> https://www.postgresql.org/docs/9.6/static/source.html
>
> 1) remove extra whitespace in README
>
> 2) add whitespace: if(ntup == 1)
>
> 3) fix comments in accordance with coding conventions
>
> /* In case of single tuple update GiST calls overwrite
>  * In all other cases function gistplacetopage deletes
>  * old tuples and place updated at the end
>  *  */
>
>
> +/* Normally here was following assertion
> + * Assert(ItemIdHasStorage(ii));
> + * This assertion was introduced in PageIndexTupleDelete
> + * But if this function will be used from BRIN index
> + * this assertion will fail. Thus, here we do not check that
> + * item has the storage.
> + * */
>
> 4) remove unrelated changes
> -data += sizeof(OffsetNumber) * xldata->ntodelete;
> +data += sizeof(OffsetNumber) *xldata->ntodelete;
>
> 5) If the comment is correct, maxalignment is not necessary.
> +/* tuples on a page are always maxaligned */
> +oldsize = MAXALIGN(oldsize);
>
> 6) I'd rather use alignednewsize here.
>  +ItemIdSetNormal(tupid, offset + size_diff, newsize);
>
>
> After the cleanup you can change status to "Ready for Committer"
> without waiting for the response.
>
>> If data is randomized this effect of Guttman poly-time split is not in
>> effect; test from the start of the thread will show no statistically
>> confident improvement by the patch.
>> To prove performance increase in randomized case I've composed
>> modified test https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
>> This test with five runs shows following:
>> Before patch
>> Insert Time AVG 78 seconds STD 9.5
>> Afer patch
>> Insert Time AVG 68 seconds STD 7.7
>> This is marginal but statistically viable performance improvement.
>>
>> For sorted data performance is improved by a factor of 3.
>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>
>
> --
> Anastasia Lubennikova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


PageIndexTupleOverwrite v8.patch
Description: Binary data



Re: [HACKERS] WIP: Covering + unique indexes.

2016-08-17 Thread Andrew Borodin
> That was a bug caused by high key truncation, that occurs when index has more 
> than 3 levels. Fixed.
Affirmative. I've tested index construction with 100M rows and
subsequent execution of select queries using the index; it works fine.

Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

2016-08-15 Thread Andrew Borodin
>So on average in a large randomly filled index, pages
spend more time nearer 50% full than 100% full.

I think we can make this number more... controllable.
Before a split we can check whether the left and right sibling pages are
both in shared buffers, and if they are seriously under a certain
fillfactor, say under 75%, we can unload some of the data there instead
of splitting. This will slow down insertions a bit, but we can have any
fillfactor we want for random inserts. For sure, someone can construct
bad input to force a low fillfactor, as with qsort (
http://www.cs.dartmouth.edu/~doug/mdmspe.pdf ), but no real
scenario will trigger this behavior.

But then we have to think about locks, WALs etc.


Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-08-09 Thread Andrew Borodin
Here is a new version of the patch; it now includes recommendations from
Anastasia Lubennikova.


I've investigated the anomalous index size decrease. The most probable
explanation turned out to be true.
The cube extension, like some others, uses Guttman's polynomial-time split
algorithm. It is known to generate "needle-like" MBBs (MBRs) for
sorted data due to imbalanced splits (splitting 100 tuples as 98 to
2). Due to the previous throw-to-the-end behavior of GiST this imbalance
was further amplified (most inserts were going to the bigger part
after the split). So GiST inserts were extremely slow for sorted data.
There is no need to do exact sorting to trigger this behavior.
This behavior can be defused by implementing the small-m restriction in the
split (AFAIR this is described in the original R-tree paper from '84),
which prescribes splitting into parts no smaller than m, where m
is around 20% of page capacity (in number of tuples). After application
of this patch, performance for sorted data is roughly the same as
performance for randomized data.
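The small-m restriction described above can be sketched as a clamp on the chosen split point. A minimal illustration, assuming m is taken as 20% of the tuple count (the helper name is mine, not PostgreSQL's):

```c
#include <assert.h>

/*
 * Sketch of the small-m restriction from Guttman's original R-tree paper:
 * each side of a split must receive at least m tuples, with m around 20%
 * of the tuple count.  Given a proposed split point, clamp it into
 * [m, n - m], so a 98/2 split of 100 tuples becomes 80/20 at worst.
 */
static int
clamp_split_point(int split, int ntuples)
{
    int m = ntuples / 5;        /* ~20% of tuples must go to each side */

    if (m < 1)
        m = 1;                  /* never allow an empty side */
    if (split < m)
        split = m;
    if (split > ntuples - m)
        split = ntuples - m;
    return split;
}
```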

If the data is randomized, this effect of the Guttman poly-time split does
not occur; the test from the start of the thread shows no statistically
significant improvement from the patch.
To prove the performance increase in the randomized case I've composed a
modified test: https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
Five runs of this test show the following:
Before patch
Insert Time AVG 78 seconds STD 9.5
After patch
Insert Time AVG 68 seconds STD 7.7
This is a marginal but statistically significant performance improvement.

For sorted data performance is improved by a factor of 3.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


PageIndexTupleOverwrite v7.patch
Description: Binary data



Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-08-06 Thread Andrew Borodin
Anastasia, thank you for your attentive code review.

2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova :
> First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
> Although, I'm quite sure that it was already aligned somewhere before.
> I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
> I'd rather add  the check: (offset+size_diff < pd_lower).
Actually, that's a tricky question. There are different possible code
expectations:
1. If we expect that a not-maxaligned tuple may be placed by some other
routine, we should remove the check (size_diff != MAXALIGN(size_diff)),
since it will fail for a not-maxaligned tuple.
2. If we expect that no one should use PageIndexTupleOverwrite with
not-maxaligned tuples, then the current checks are OK: we will stop
execution if we find non-maxaligned tuples, with an almost-correct
message.
I suggest that this check may be a debug-only assertion: in production
the routine will work with not-maxaligned tuples if they already
reside on the page, while in a dev build it will inform the developer that
something is going wrong. Is this behavior Postgres-style compliant?
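The two policies under discussion can be sketched side by side. The MAXALIGN macro here mirrors the TYPEALIGN idea from PostgreSQL's c.h, assuming MAXIMUM_ALIGNOF is 8 (it is actually platform-dependent); the two policy functions are illustrative names of mine:

```c
#include <assert.h>
#include <stdint.h>

/*
 * MAXALIGN rounds a length up to the platform's maximum alignment,
 * mirroring the TYPEALIGN macro in PostgreSQL's c.h (assuming
 * MAXIMUM_ALIGNOF == 8 here).
 */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* Policy 1: reject unaligned sizes (the current patch behavior). */
static int
size_ok_strict(int size_diff)
{
    return size_diff == (int) MAXALIGN(size_diff);
}

/* Policy 2: silently align the size, as PageAddItem effectively does. */
static int
aligned_size(int newsize)
{
    return (int) MAXALIGN(newsize);
}
```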


I agree that pd_lower check makes sense.

> BTW, I'm very surprised that it improves performance so much.
> And also size is reduced significantly. 89MB against 289MB without patch.
> Could you explain in details, why does it happen?
The size reduction is unexpected for me.
There are 4 plausible explanations. I'll list them in order of
descending probability:
1. Before this patch, every update threw the recently updated tuple
to the end of a page. Thus, in somewhat ordered data, the recent best fit
would be discovered last. This can increase MBB overlap in a
spatial index and make the tree slightly higher. But a 3x size decrease is
unlikely.
How did you obtain those results? Can I look at a test case?
2. A bug in PageIndexTupleDelete leaving unused space behind. I've
searched for it, unsuccessfully.
3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a
bug. Maybe we are failing to place something big but not very important on
a page?
4. Magic.
I do not see what one could do to the R-tree to change its size by
a factor of 3. The first three explanations are not much better than the
fourth, actually.
Those 89 MB, they do not include WAL, right?

Thank you for the review.


Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-30 Thread Andrew Borodin
Here is a BRIN-compatible version of the patch. Now the function
PageIndexTupleOverwrite takes the tuple size as a parameter; this
behavior is consistent with PageAddItem.
Also, I had to remove the assertion that the item has storage in the loop
that adjusts line pointers. It would be great if someone could check
that place (the commented-out Assert(ItemIdHasStorage(ii)); in
PageIndexTupleOverwrite). I suspect that there might be a case when
linp's should not be adjusted.
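The single-memmove overwrite and the line-pointer adjustment discussed here can be sketched on a toy page model. These structures are illustrative only, not the real bufpage.c layout:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 256
#define MAX_ITEMS 8

/*
 * Toy model of the PageIndexTupleOverwrite idea.  Tuples are packed at
 * the end of the page growing downward; line pointers hold
 * (offset, length).  Overwriting a tuple with a different-size
 * replacement moves the tail of the page once and adjusts only the line
 * pointers that point below the overwritten tuple -- instead of
 * PageIndexTupleDelete() followed by an append, which moves data twice.
 */
typedef struct { int off; int len; } ToyLinePointer;

typedef struct
{
    char data[PAGE_SIZE];
    ToyLinePointer linp[MAX_ITEMS];
    int  nitems;
    int  upper;                 /* lowest used offset in the tuple area */
} ToyPage;

static void
toy_overwrite(ToyPage *page, int itemno, const char *tuple, int newlen)
{
    ToyLinePointer *lp = &page->linp[itemno];
    int  diff = newlen - lp->len;   /* >0 grows, <0 shrinks */
    int  i;

    if (diff != 0)
    {
        /* one memmove shifts everything stored below the old tuple */
        memmove(page->data + page->upper - diff,
                page->data + page->upper,
                lp->off - page->upper);
        page->upper -= diff;
        for (i = 0; i < page->nitems; i++)
            if (page->linp[i].off < lp->off)
                page->linp[i].off -= diff;
        lp->off -= diff;        /* tuple keeps its end, its start moves */
    }
    /* finally overwrite the tuple body in place */
    memcpy(page->data + lp->off, tuple, newlen);
    lp->len = newlen;
}

/* Self-check: grow the first of two tuples and verify the layout. */
static int
toy_demo(void)
{
    ToyPage p;

    memset(&p, 0, sizeof(p));
    memcpy(p.data + 252, "AAAA", 4);
    p.linp[0].off = 252; p.linp[0].len = 4;
    memcpy(p.data + 248, "BBBB", 4);
    p.linp[1].off = 248; p.linp[1].len = 4;
    p.nitems = 2;
    p.upper = 248;

    toy_overwrite(&p, 0, "CCCCCC", 6);

    return p.linp[0].off == 250 && p.linp[0].len == 6 &&
           p.linp[1].off == 246 && p.upper == 246 &&
           memcmp(p.data + 250, "CCCCCC", 6) == 0 &&
           memcmp(p.data + 246, "BBBB", 4) == 0;
}
```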

Best regards, Andrey Borodin, Octonica & Ural Federal University.


PageIndexTupleOverwrite v6.patch
Description: Binary data



Re: [HACKERS] Optimizing numeric SUM() aggregate

2016-07-29 Thread Andrew Borodin
I've tested patch with this query
https://gist.github.com/x4m/fee16ed1a55217528f036983d88771b4
Test specs were: Ubuntu 14 LTS VM, dynamic RAM, hypervisor Windows
Server 2016, SSD disk, core i5-2500. Configuration: out of the box
master make.

On 10 test runs timing of select statement: AVG 3739.8 ms, STD  435.4193
With patch applied (as is) : 3017.8 ms, STD 319.893

Increasing the overflow constant showed no statistically significant
performance improvement (though I do not claim it is not worth doing).

2016-07-27 17:32 GMT+05:00 Tom Lane :
> For that matter, spelling INT_MAX as 0x7FFFFFFF is also not per project style.

Sure, 0x7FFFFFFF was not for code but for mental arithmetic. I would
even add that INT_MAX is system-dependent and varies across different
specs. I'd suggest INT32_MAX.

Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Optimizing numeric SUM() aggregate

2016-07-27 Thread Andrew Borodin
>   if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
Worth noting that (INT_MAX - INT_MAX / NBASE) / (NBASE - 1) == INT_MAX
/ NBASE for any NBASE > 1.
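A quick check of this identity for NBASE = 10000, the value numeric.c actually uses. With C integer division the two expressions agree for this NBASE, though not for literally every NBASE > 1 (NBASE = 2 is off by one). The function names are mine, for illustration:

```c
#include <assert.h>
#include <limits.h>

/* The NBASE numeric.c actually uses for its base-10000 digits. */
#define NBASE 10000

/* The threshold as written in the mul_var()-style guard. */
static int
maxdig_threshold_exact(void)
{
    return (INT_MAX - INT_MAX / NBASE) / (NBASE - 1);
}

/* The claimed simplification. */
static int
maxdig_threshold_simplified(void)
{
    return INT_MAX / NBASE;
}
```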
>I don't think there is any reason for this new code to assume NBASE=10000
There is a comment on line 64 stating that the value 10000 is hardcoded
somewhere else, any other value is not recommended, and a bunch of code
is left for historical reasons.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-27 13:57 GMT+05:00 Dean Rasheed <dean.a.rash...@gmail.com>:
> On 27 July 2016 at 07:33, Andrew Borodin <boro...@octonica.com> wrote:
>>>I think we could do carry every 0x7FFFFFFF / 10000 accumulation, couldn't we?
>>
>> I feel that I have to elaborate a bit. Probably my calculations are wrong.
>>
>> Let's assume we have already accumulated INT_MAX worth of 9999-valued
>> digits in the previous-place accumulator. That's almost an overflow, but
>> not an overflow. Carrying that accumulator to the current one gives us an
>> INT_MAX / 10000 carried sum.
>> So in the current-place accumulator we can accumulate ( INT_MAX - INT_MAX
>> / 10000 ) / 9999, where 9999 is the max value dropped into the
>> current-place accumulator on each addition.
>> That is INT_MAX * 9999 / 10000 / 9999, or simply INT_MAX / 10000.
>>
>> If we use an unsigned 32-bit integer, that is 429496. Which is 43 times
>> less frequent carrying.
>>
>> As a bonus, we get rid of the 9999 constant in the code (:
>>
>> Please correct me if I'm wrong.
>>
>
> This is basically the same problem that has already been solved in
> mul_var() and other places in numeric.c, so in this case it could be
> coded using something like
>
> accum->maxdig += NBASE - 1;
> if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
> Propagate carries...
>
> I agree that the new code should avoid explicitly referring to
> constants like 9999, and I don't think there is any reason for this
> new code to assume NBASE=10000.
>
> Regards,
> Dean




Re: [HACKERS] Optimizing numeric SUM() aggregate

2016-07-27 Thread Andrew Borodin
>I think we could do carry every 0x7FFFFFFF / 10000 accumulation, couldn't we?

I feel that I have to elaborate a bit. Probably my calculations are wrong.

Let's assume we have already accumulated INT_MAX worth of 9999-valued
digits in the previous-place accumulator. That's almost an overflow, but
not an overflow. Carrying that accumulator to the current one gives us an
INT_MAX / 10000 carried sum.
So in the current-place accumulator we can accumulate ( INT_MAX - INT_MAX
/ 10000 ) / 9999, where 9999 is the max value dropped into the
current-place accumulator on each addition.
That is INT_MAX * 9999 / 10000 / 9999, or simply INT_MAX / 10000.

If we use an unsigned 32-bit integer, that is 429496. Which is 43 times
less frequent carrying.

As a bonus, we get rid of the 9999 constant in the code (:

Please correct me if I'm wrong.
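The arithmetic above can be checked mechanically. This check is my addition, assuming NBASE = 10000 as in numeric.c, so each addition drops at most NBASE - 1 = 9999 into one accumulator digit and an incoming carry contributes at most MAX / NBASE:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* numeric.c's base for its internal digits */
#define NBASE 10000

/* Safe additions between carry propagations for a signed 32-bit digit,
 * leaving room for an incoming carry of up to INT_MAX / NBASE. */
static long
carries_signed32(void)
{
    return (INT_MAX - INT_MAX / NBASE) / (NBASE - 1);
}

/* The same bound for an unsigned 32-bit digit. */
static long
carries_unsigned32(void)
{
    return (long) ((UINT32_MAX - UINT32_MAX / NBASE) / (NBASE - 1));
}
```

carries_unsigned32() gives 429496, matching the figure in the message; compared with propagating carries every NBASE - 1 = 9999 additions, that is roughly 43 times less frequent.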


Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Optimizing numeric SUM() aggregate

2016-07-26 Thread Andrew Borodin
Hi!

I like the idea and implementation, but I have one suggestion.
> Instead of propagating carry after each new value, it's done only every 9999
> values (or at the end).

I think we could do carry every 0x7FFFFFFF / 10000 accumulation, couldn't we?

Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-26 Thread Andrew Borodin
>Can you please patch BRIN to use this new function too?

On my machine, replacement of both BRIN update cases (see
https://github.com/x4m/pggistopt/commit/a6d301ff79104b977619339d53aebf748045418a
) showed no performance changes on the following update query (6 seconds of
updates on average):
create table dataTable(x int, y int);
insert into dataTable(x,y) select x,y from generate_series(1,1e3,1)
x,generate_series(1,1e3,1) y;
create index idx on dataTable using brin(x,y);
update datatable set x = random()*1024, y = random()*1024;

https://gist.github.com/x4m/7e69fd924b9ffd2fdc9c6100e741171d

Probably I was looking in the wrong place. I do not see other cases where
PageIndexTupleOverwrite can improve the performance of BRIN. Though I'll
make PageIndexTupleOverwrite BRIN-compatible in the forthcoming patch
version: BRIN tuples have no length in the header, so it must be taken as
a parameter, just as PageAddItem does.


Best regards, Andrey Borodin, Octonica & Ural Federal University.




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-25 Thread Andrew Borodin
>Can you please patch BRIN to use this new function too?
AFAIK Sergey Mirvoda was going to implement and test it.
It is easy to do this optimization for brin_samepage_update (see the patch
draft in the attachment; make check passes), but the WAL handling of a
regular BRIN update is not clear to me.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


Brin_PageIndexTupleOverwrite.patch
Description: Binary data



Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-24 Thread Andrew Borodin
>If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is bumped to 
>prevent any problems.
I've attached patch with a bump, but, obviously, it'll be irrelevant
after any other commit changing WAL shape.

Here is a patch with an updated GiST README.
I haven't found an appropriate place to describe the
PageIndexTupleOverwrite function in the docs, since all its siblings are
described only in the code.
The code comments describing this function are consistent with the others
(PageAddItem, PageIndexTupleDelete).

Best regards, Andrey Borodin, Octonica & Ural Federal University.


XLog-magic-bump.patch
Description: Binary data


PageIndexTupleOverwrite v5.patch
Description: Binary data



Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-07 Thread Andrew Borodin
One more thing came to my mind:
WAL records written by GiST before the patch will corrupt a GiST index
after the patch if replayed.
Is that a problem? Maybe it's prevented at some higher level?

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-06 22:11 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
> Here is a new patch. Updates:
> 1. Uses MAXALIGN to allocate space on page
> 2. Uses ItemIdSetNormal in case when ItemId was not normal for some
> reasons before call
> 3. Improved comments and var names
>
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>
> 2016-07-05 17:54 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
>> Here is a patch with updated WAL behavior.
>>
>> I'm not certain about MAXALIGN for size of appended tuple. Currently
>> if anyone calls PageIndexTupleOverwrite() with unalligned tuple it
>> will ereport(ERROR).
>> I suspect that it should allign size instead.
>>
>> Also I suspect that PageIndexTupleOverwrite() should make a call to
>> ItemIdSetNormal() to pretend it is generic function and not just a
>> private part of GiST.
>>
>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>
>> 2016-07-04 22:59 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
>>> Thank you, Alvaro.
>>>
>>> Yes, now I see. I'll update gistRedoPageUpdateRecord() to work
>>> accordingly with patched gistplacetopage().
>>> Also, i've noticed that PageAddItem uses MAXALIGN() macro to calculate
>>> tuple size. So, PageIndexTupleOverwrite should behave the same.
>>>
>>> About PageIndexDeleteNoCompact() in BRIN. I do not see why
>>> PageIndexDeleteNoCompact() is not a good solution instead of
>>> PageIndexTupleOverwrite? Will it mark tuple as unused until next
>>> PageAddItem will use it's place?
>>>
>>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>>
>>> 2016-07-04 22:16 GMT+05:00 Alvaro Herrera <alvhe...@2ndquadrant.com>:
>>>> Andrew Borodin wrote:
>>>>> Thank you, Amit.
>>>>>
>>>>> Currently, if WAL reconstruct page in another order it won't be a problem.
>>>>
>>>> We require that replay of WAL leads to identical bytes than the
>>>> original.  Among other things, this enables use of a tool that verifies
>>>> that WAL is working correctly simply by comparing page images.  So
>>>> please fix it instead of just verifying that this works for GiST.
>>>>
>>>> By the way, BRIN indexes have a need of this operation too.  The current
>>>> approach is to call PageIndexDeleteNoCompact followed by PageAddItem.
>>>> I suppose it would be beneficial to use your new routine there too.
>>>>
>>>> --
>>>> Álvaro Herrera    http://www.2ndQuadrant.com/
>>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-06 Thread Andrew Borodin
Here is a new patch. Updates:
1. Uses MAXALIGN to allocate space on the page
2. Uses ItemIdSetNormal in case the ItemId was not normal for some
reason before the call
3. Improved comments and var names

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-05 17:54 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
> Here is a patch with updated WAL behavior.
>
> I'm not certain about MAXALIGN for size of appended tuple. Currently
> if anyone calls PageIndexTupleOverwrite() with unalligned tuple it
> will ereport(ERROR).
> I suspect that it should allign size instead.
>
> Also I suspect that PageIndexTupleOverwrite() should make a call to
> ItemIdSetNormal() to pretend it is generic function and not just a
> private part of GiST.
>
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>
> 2016-07-04 22:59 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
>> Thank you, Alvaro.
>>
>> Yes, now I see. I'll update gistRedoPageUpdateRecord() to work
>> accordingly with patched gistplacetopage().
>> Also, i've noticed that PageAddItem uses MAXALIGN() macro to calculate
>> tuple size. So, PageIndexTupleOverwrite should behave the same.
>>
>> About PageIndexDeleteNoCompact() in BRIN. I do not see why
>> PageIndexDeleteNoCompact() is not a good solution instead of
>> PageIndexTupleOverwrite? Will it mark tuple as unused until next
>> PageAddItem will use it's place?
>>
>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>
>> 2016-07-04 22:16 GMT+05:00 Alvaro Herrera <alvhe...@2ndquadrant.com>:
>>> Andrew Borodin wrote:
>>>> Thank you, Amit.
>>>>
>>>> Currently, if WAL reconstruct page in another order it won't be a problem.
>>>
>>> We require that replay of WAL leads to identical bytes than the
>>> original.  Among other things, this enables use of a tool that verifies
>>> that WAL is working correctly simply by comparing page images.  So
>>> please fix it instead of just verifying that this works for GiST.
>>>
>>> By the way, BRIN indexes have a need of this operation too.  The current
>>> approach is to call PageIndexDeleteNoCompact followed by PageAddItem.
>>> I suppose it would be beneficial to use your new routine there too.
>>>
>>> --
>>> Álvaro Herrera    http://www.2ndQuadrant.com/
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


PageIndexTupleOverwrite v4.patch
Description: Binary data



Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-05 Thread Andrew Borodin
Here is a patch with updated WAL behavior.

I'm not certain about MAXALIGN for the size of the appended tuple.
Currently, if anyone calls PageIndexTupleOverwrite() with an unaligned
tuple it will ereport(ERROR).
I suspect that it should align the size instead.

Also, I suspect that PageIndexTupleOverwrite() should make a call to
ItemIdSetNormal() so that it is a generic function and not just a
private part of GiST.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 22:59 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
> Thank you, Alvaro.
>
> Yes, now I see. I'll update gistRedoPageUpdateRecord() to work
> accordingly with patched gistplacetopage().
> Also, i've noticed that PageAddItem uses MAXALIGN() macro to calculate
> tuple size. So, PageIndexTupleOverwrite should behave the same.
>
> About PageIndexDeleteNoCompact() in BRIN. I do not see why
> PageIndexDeleteNoCompact() is not a good solution instead of
> PageIndexTupleOverwrite? Will it mark tuple as unused until next
> PageAddItem will use it's place?
>
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>
> 2016-07-04 22:16 GMT+05:00 Alvaro Herrera <alvhe...@2ndquadrant.com>:
>> Andrew Borodin wrote:
>>> Thank you, Amit.
>>>
>>> Currently, if WAL reconstruct page in another order it won't be a problem.
>>
>> We require that replay of WAL leads to identical bytes than the
>> original.  Among other things, this enables use of a tool that verifies
>> that WAL is working correctly simply by comparing page images.  So
>> please fix it instead of just verifying that this works for GiST.
>>
>> By the way, BRIN indexes have a need of this operation too.  The current
>> approach is to call PageIndexDeleteNoCompact followed by PageAddItem.
>> I suppose it would be beneficial to use your new routine there too.
>>
>> --
>> Álvaro Herrera    http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


PageIndexTupleOverwrite v3.patch
Description: Binary data



Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-04 Thread Andrew Borodin
Thank you, Alvaro.

Yes, now I see. I'll update gistRedoPageUpdateRecord() to work
in accordance with the patched gistplacetopage().
Also, I've noticed that PageAddItem uses the MAXALIGN() macro to calculate
tuple size. So PageIndexTupleOverwrite should behave the same.

About PageIndexDeleteNoCompact() in BRIN: I do not see why
PageIndexDeleteNoCompact() is not a good solution instead of
PageIndexTupleOverwrite. Will it mark the tuple as unused until the next
PageAddItem uses its place?

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 22:16 GMT+05:00 Alvaro Herrera <alvhe...@2ndquadrant.com>:
> Andrew Borodin wrote:
>> Thank you, Amit.
>>
>> Currently, if WAL reconstruct page in another order it won't be a problem.
>
> We require that replay of WAL leads to identical bytes than the
> original.  Among other things, this enables use of a tool that verifies
> that WAL is working correctly simply by comparing page images.  So
> please fix it instead of just verifying that this works for GiST.
>
> By the way, BRIN indexes have a need of this operation too.  The current
> approach is to call PageIndexDeleteNoCompact followed by PageAddItem.
> I suppose it would be beneficial to use your new routine there too.
>
> --
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-04 Thread Andrew Borodin
Thank you, Amit.

Currently, if WAL replay reconstructs the page in a different order, it
won't be a problem.
How can I check that it works? Would it be sufficient if I kill -9 the
backend and postmaster after a commit and it starts normally, executing
select queries after restart?

I'll make a patch with the missing spaces later. I also found some more
missing spaces in the PageIndexTupleOverwrite call and a typo in the
comments ("on-place").

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 18:58 GMT+05:00 Amit Kapila <amit.kapil...@gmail.com>:

> On Mon, Jul 4, 2016 at 4:52 PM, Andrew Borodin <boro...@octonica.com>
> wrote:
> > Here is a patch with corrections from Alexander Korotkov.
> > My tests, check and check-world show no problems against Ubuntu LTS 14
> x64 VM.
> >
> >
>
> - PageIndexTupleDelete(page, oldoffnum);
> - gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
> + {
> + /*if we have just one tuple to update we replace it on-place on page*/
> + if(ntup == 1)
> + {
> + PageIndexTupleOverwrite(page,oldoffnum,*itup);
> + }
> + else
> + {
> + /*this corner case is here to support mix calls case (see comment
> above)*/
> + PageIndexTupleDelete(page, oldoffnum);
> + gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
> + }
>
>
> Here, you have changed the way tuple is added on a page, but still the
> WAL record is same as before.  I think WAL replay should perform the
> action in a same way as we have done when original operation is
> performed.  If not, then it can change the organization of page which
> might lead to failure in replay of consecutive WAL records.
>
> + /*if we have just one tuple to update we replace it on-place on page*/
>
> In comments, we always leave a space both in the beginning and at the
> end of a comment.  See comments in nearby code.
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


[HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-04 Thread Andrew Borodin
Here is a patch with corrections from Alexander Korotkov.
My tests, plus check and check-world, show no problems on an Ubuntu 14 LTS x64 VM.


Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 10:41 GMT+05:00 Andrew Borodin <boro...@octonica.com>:
> Hi!
>>I think you should implement PageReplaceItem() version and add it to the 
>>commitfest.
> Here is the patch.
> I've called function PageIndexTupleOverwrite() because it's suitable
> only for indices. It works on my tests and performance is the same as
> in proof-of-concept (despite some sanity checks copied from
> PageIndexTupleDelete).
> Next weekend I'll do more testing, and then add it to commitfest.
>
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>
> 2016-07-03 15:21 GMT+05:00 Alexander Korotkov <a.korot...@postgrespro.ru>:
>> Hi!
>>
>> On Sun, Jul 3, 2016 at 12:24 PM, Andrew Borodin <boro...@octonica.com>
>> wrote:
>>>
>>> I think there is some room for improving GiST inserts. Following is
>>> the description of what I think is performance problem.
>>>
>>> Function gistplacetopage in file /src/backend/access/gist/gist.c is
>>> responsible for updating or appending new tuple on GiST page.
>>> Currently, after checking necessity of page split due to overflow, it
>>> essentially executes following:
>>>
>>>if (OffsetNumberIsValid(oldoffnum))
>>>PageIndexTupleDelete(page, oldoffnum);
>>>gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
>>>
>>> That is: remove old tuple if it’s there, then place updated tuple at the
>>> end.
>>>
>>> Half of the old data have to be shifted by memmove inside
>>> PageIndexTupleDelete() call, half of the linp-s have to be corrected.
>>>
>>> If the updated tuple has same size as already residing on page we can
>>> just overwrite it. Attached patch demonstrates that concept. Attached
>>> test.sql inserts million rows into GiST index based on cube extension.
>>> My machine is Hyper-V VM running Ubuntu on i5-2500 CPU with SSD
>>> storage. Before patch, insert part of test is executed on average
>>> within 159 second, after patch application: insert part is executed
>>> within 77 seconds on average. That is almost twice faster (for
>>> CPU\Mem-bounded inserts, disk-constrained test will show no
>>> improvement). But it works only for fixed-size tuple inserts.
>>
>>
>> Very promising results!
>>
>>> I know that code in patch is far from beautiful: it operates with
>>> three different levels of abstraction within 5 lines of code. Those
>>> are low level memmove(), system-wide PageAddItem() and GiST private
>>> gistfillBuffer().
>>>
>>> By the way PageAddItem() have overwrite flag, but it only works with
>>> unused ItemId’s. Marking old ItemId as unused before PageAddItem()
>>> didn’t work for me. Unfortunately bufpage.c routines do not contain
>>> one for updating(replacing with new) tuple on page. It is important
>>> for me because I’m working on advanced GiST page layout (
>>>
>>> https://www.postgresql.org/message-id/CAJEAwVE0rrr%2BOBT-P0gDCtXbVDkBBG_WcXwCBK%3DGHo4fewu3Yg%40mail.gmail.com
>>> ), current approach is to use skip-tuples which can be used to skip
>>> many flowing tuples with one key check. Obviously, this design cares
>>> about tuples order. And update in a fashion “place updated tuple at
>>> the end” won’t work for me.
>>>
>>> So, I think it would be better to implement PageReplaceItem()
>>> functionality to make code better, to make existing GiST inserts
>>> faster and to enable new advanced page layouts in indices.
>>
>>
>> +1 for PageReplaceItem()
>> Even if item is not the same size, we can move the tail of page once instead
>> of twice.
>> I think you should implement PageReplaceItem() version and add it to the
>> commitfest.
>>
>> --
>> Alexander Korotkov
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company


PageIndexTupleOverwrite v2.patch
Description: Binary data



[HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-03 Thread Andrew Borodin
Hi!
>I think you should implement PageReplaceItem() version and add it to the 
>commitfest.
Here is the patch.
I've called function PageIndexTupleOverwrite() because it's suitable
only for indices. It works on my tests and performance is the same as
in proof-of-concept (despite some sanity checks copied from
PageIndexTupleDelete).
Next weekend I'll do more testing, and then add it to commitfest.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-03 15:21 GMT+05:00 Alexander Korotkov <a.korot...@postgrespro.ru>:
> Hi!
>
> On Sun, Jul 3, 2016 at 12:24 PM, Andrew Borodin <boro...@octonica.com>
> wrote:
>>
>> I think there is some room for improving GiST inserts. Following is
>> the description of what I think is performance problem.
>>
>> Function gistplacetopage in file /src/backend/access/gist/gist.c is
>> responsible for updating or appending new tuple on GiST page.
>> Currently, after checking necessity of page split due to overflow, it
>> essentially executes following:
>>
>>if (OffsetNumberIsValid(oldoffnum))
>>PageIndexTupleDelete(page, oldoffnum);
>>gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
>>
>> That is: remove old tuple if it’s there, then place updated tuple at the
>> end.
>>
>> Half of the old data have to be shifted by memmove inside
>> PageIndexTupleDelete() call, half of the linp-s have to be corrected.
>>
>> If the updated tuple has same size as already residing on page we can
>> just overwrite it. Attached patch demonstrates that concept. Attached
>> test.sql inserts million rows into GiST index based on cube extension.
>> My machine is Hyper-V VM running Ubuntu on i5-2500 CPU with SSD
>> storage. Before patch, insert part of test is executed on average
>> within 159 second, after patch application: insert part is executed
>> within 77 seconds on average. That is almost twice faster (for
>> CPU\Mem-bounded inserts, disk-constrained test will show no
>> improvement). But it works only for fixed-size tuple inserts.
>
>
> Very promising results!
>
>> I know that code in patch is far from beautiful: it operates with
>> three different levels of abstraction within 5 lines of code. Those
>> are low level memmove(), system-wide PageAddItem() and GiST private
>> gistfillBuffer().
>>
>> By the way PageAddItem() have overwrite flag, but it only works with
>> unused ItemId’s. Marking old ItemId as unused before PageAddItem()
>> didn’t work for me. Unfortunately bufpage.c routines do not contain
>> one for updating(replacing with new) tuple on page. It is important
>> for me because I’m working on advanced GiST page layout (
>>
>> https://www.postgresql.org/message-id/CAJEAwVE0rrr%2BOBT-P0gDCtXbVDkBBG_WcXwCBK%3DGHo4fewu3Yg%40mail.gmail.com
>> ); the current approach is to use skip-tuples, which can skip many
>> following tuples with one key check. Obviously, this design depends on
>> tuple order, so an update in the "place the updated tuple at the end"
>> fashion won't work for me.
>>
>> So, I think it would be better to implement PageReplaceItem()
>> functionality to make the code cleaner, to make existing GiST inserts
>> faster and to enable new advanced page layouts in indices.
>
>
> +1 for PageReplaceItem()
> Even if the item is not the same size, we can move the tail of the page once
> instead of twice.
> I think you should implement the PageReplaceItem() version and add it to the
> commitfest.
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company


PageIndexTupleOverwrite.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-03 Thread Andrew Borodin
Hi, hackers!

I think there is some room for improving GiST inserts. Following is
the description of what I think is performance problem.

Function gistplacetopage in file /src/backend/access/gist/gist.c is
responsible for updating or appending a new tuple on a GiST page.
Currently, after checking the necessity of a page split due to overflow, it
essentially executes the following:

   if (OffsetNumberIsValid(oldoffnum))
   PageIndexTupleDelete(page, oldoffnum);
   gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);

That is: remove the old tuple if it's there, then place the updated tuple at the end.

Half of the old data has to be shifted by memmove inside the
PageIndexTupleDelete() call, and half of the linp-s have to be corrected.

If the updated tuple has the same size as the one already residing on the
page, we can just overwrite it. The attached patch demonstrates that
concept. The attached test.sql inserts a million rows into a GiST index
based on the cube extension. My machine is a Hyper-V VM running Ubuntu on
an i5-2500 CPU with SSD storage. Before the patch, the insert part of the
test took 159 seconds on average; after applying the patch, it took 77
seconds on average. That is almost twice as fast (for CPU/memory-bound
inserts; a disk-constrained test will show no improvement). But it works
only for fixed-size tuple inserts.

I know that the code in the patch is far from beautiful: it operates at
three different levels of abstraction within 5 lines of code. Those
are the low-level memmove(), the system-wide PageAddItem() and the
GiST-private gistfillbuffer().

By the way, PageAddItem() has an overwrite flag, but it only works with
unused ItemIds. Marking the old ItemId as unused before PageAddItem()
didn't work for me. Unfortunately, the bufpage.c routines do not contain
one for updating (replacing) a tuple on a page. It is important
for me because I'm working on an advanced GiST page layout (
https://www.postgresql.org/message-id/CAJEAwVE0rrr%2BOBT-P0gDCtXbVDkBBG_WcXwCBK%3DGHo4fewu3Yg%40mail.gmail.com
); the current approach is to use skip-tuples, which can skip many
following tuples with one key check. Obviously, this design depends on
tuple order, so an update in the "place the updated tuple at the end"
fashion won't work for me.

So, I think it would be better to implement PageReplaceItem()
functionality to make the code cleaner, to make existing GiST inserts
faster and to enable new advanced page layouts in indices.

Current work is here https://github.com/x4m/pggistopt/tree/simplefix

What do you think about found performance problem and about patch
attached? Am I missing something?



Best regards, Andrey Borodin, Octonica & Ural Federal University.


test.sql
Description: Binary data


GiST memmove.patch
Description: Binary data



[HACKERS] [Proposal] Improvement of GiST page layout

2016-02-08 Thread Andrew Borodin
Hi, hackers!

I want to propose improvement of GiST page layout.

GiST is optimized to reduce disk operations on search queries, for
example, window search queries in the case of an R-tree.

I expect that a more complicated page layout will help to trade some
CPU efficiency for disk efficiency.

A GiST tree is a balanced tree structure with two kinds of pages:
internal pages and leaf pages. Each tree page contains a bunch of tuples
with the same structure. Tuples of an internal page reference other
pages of the same tree, while leaf page tuples hold heap TIDs.

During execution of a search query, GiST invokes a key comparison
algorithm for each tuple on a page, with two possible outcomes: 'no' and
'maybe'. A 'maybe' answer recursively descends the search to the
referenced page (in the case of an internal page) or yields the tuple
(in the case of a leaf page).

The expected tuple count on a page is on the order of tens to hundreds.
This count is big enough to make it worth trying to save some cache
lines from being loaded into the CPU during enumeration.

For B-trees, during inspection of a page we effectively apply a binary
search algorithm, which is not possible in a GiST tree.

Let's consider an R-tree with arbitrary fan-out f. If a given query will
find exactly one data tuple, it is easy to show that the key comparison
count is minimal when f -> e (the tree has to examine f*h keys, where
h = log_f(N); f*log_f(N) is minimal when f -> e ~ 2.718, so the optimal
integer fan-out is 3). A smaller key comparison count means fewer cache
lines are touched. So fan-out reduction means cache pressure reduction
(except for an average fan-out of 2, which seems to be too small) and
less time waiting for RAM. I suppose all that reasoning holds true in
cases when not just one tuple will be found.

How do we reduce tree fan-out? Obviously, we can’t fill a page with just
3 tuples. But we can install a small tree-like structure inside one
page. A general GiST index has a root page, but a page tree should have
a “root” layer of tuples. Private (or internal, intermediate, auxiliary;
I can’t pick a precise word) tuples have only keys and a fixed-size (F)
array of underlying record offsets. Each layer is a linked list. After
a page has just been allocated, there is only the “ground” level of
regular tuples. Eventually the record count reaches F-1 and we create a
new root layer with two tuples. Each new tuple references half of the
preexisting records. Placement of new “ground” tuples on the page will
eventually cause an internal tuple to split. If there is not enough
space to split an internal tuple, we mark the page for a whole page
split during the next iteration of the insertion algorithm of the owning
tree. That is why a tuple split happens at F-1 tuples, not at F: if we
have no space for splitting, we just add a reference to the last slot.
In this algorithm, a page split will cause major page defragmentation:
we take the root layer, halve it and place the halves on different
pages. When half of the data is gone to another page, the restructuring
should tend to place records in such a fashion that tuples accessed
together lie together. I think placing a whole level together is a good
strategy.

Let’s look at how a page grows with fan-out factor F=5. RLS – root layer
start, G – ground tuple, Ix – internal tuple of level x.

When we have added 3 ground tuples, it’s just a ground layer:
RLS=0|G G G
Then we place one more tuple and the layer splits:
RLS=4|G G G G I0 I0
Each I0 tuple now references two G tuples. We keep placing G tuples:
RLS=4|G G G G I0 I0 G G
And then one of the I0 tuples is split:
RLS=4|G G G G I0 I0 G G G I0
And right after, one more I0 split causes a new layer:
RLS=12|G G G G I0 I0 G G G I0 G I0 I1 I1

And so on, until we run out of space on the page. In regular GiST we run
out of space on a page before we insert a tuple onto it. Now we can run
out of space during insertion. But this will not be fatal: we will still
be able to place the ground-level tuple. The inner index structure will
use the one extra slot for reference allocation, and the next insertion
on the page will deal with it. On pages marked for split, we have to
find which index tuple exactly has run out of its extra slot during the
split and fix it.

Several years ago I made an unsuccessful attempt to implement a similar
algorithm in the database engine of a proprietary system. I got stuck in
debugging the deletion algorithm and tree condensation, which are in use
in that system. I suppose it was a mistake to defragment pages with a
creeping heuristic; eventually I dropped the idea and just moved on to
more important tasks, as there is always a deadline rush in business.
I’m a newbie in Postgres internals, but as I see it, there is no
deletion on a GiST page. So I feel the itch to try this feature one more
time.

I expect that implementation of this proposal could speed up insertions
into the index by 20% and query performance by 30% when the whole index
fits in shared buffers. In the case of buffer starvation, when the index
is accessed through disk, this feature will cause a 15% performance
degradation, since the effective page capacity will be smaller. Should
this feature be configurable? Maybe it should be a separate access
method?

I need help to