Re: [HACKERS] Merge join for GiST
Hi, hackers! I've adapted crossmatch join from pgSphere to cube for performance tests. I've placed spatial join code here https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c and node code here https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c If you have an idea of improving the performance of this code, please do not hesitate to express them. One of the performance bottlenecks is code nearby heap_getattr() in fetch_next_pair(). ==Performance Tests== I've tested performance on queries which aggregate result of the spatial join. See cube_test.sql attached. On 3d data, Spatial Join performs 3x faster than Nested Loop Join + GiST Scan Nested Loop + Index Scan plan: HashAggregate (cost=36841568.00..36841570.00 rows=200 width=40) (actual time=206565.869..206738.307 rows=298731 loops=1) Group Key: r.nx -> Nested Loop (cost=0.41..25591568.00 rows=9 width=40) (actual time=0.357..200838.416 rows=8464147 loops=1) -> Seq Scan on regions r (cost=0.00..6410.00 rows=30 width=40) (actual time=0.015..324.436 rows=30 loops=1) -> Index Scan using idx on datatable a (cost=0.41..55.28 rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=30) Index Cond: (r.c @> c) Planning time: 17.175 ms Execution time: 206806.926 ms Time: 206852,635 ms (03:26,853) Spatial Join plan: HashAggregate (cost=56250001.00..56250003.00 rows=200 width=40) (actual time=67373.644..67553.118 rows=298731 loops=1) Group Key: r.nx -> Custom Scan (SpatialJoin) (cost=0.00..1.00 rows=45 width=40) (actual time=0.151..61718.804 rows=8464147 loops=1) Outer index: idx Inner index: idx1 Planning time: 0.182 ms Execution time: 67630.742 ms Time: 67631,557 ms (01:07,632) But on more realistic 7D data with queries emulating OLAP system performance of Spatial Join is 2 times worse than Nested Loop Join + GiST Scan. Which comes as a complete surprise to me. I do not see any algorithmic reason for Spatial Join to be slower. Thus I strongly suspect that my implementation is not efficient, but as for now I have no ideas how to improve it. Here are plans for 7D Nested Loop + Index Scan HashAggregate (cost=3425143.00..3425743.00 rows=6 width=72) (actual time=122794.715..122822.893 rows=6 loops=1) Group Key: r.nx -> Nested Loop (cost=0.41..2075143.00 rows=6000 width=72) (actual time=0.311..100478.710 rows=39817008 loops=1) -> Seq Scan on r1 r (cost=0.00..2419.00 rows=6 width=128) (actual time=0.043..60.579 rows=6 loops=1) -> Index Scan using ix_a1_cube on a1 a (cost=0.41..24.55 rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=6) Index Cond: (c <@ r.c) Planning time: 0.349 ms Execution time: 122831.353 ms (8 rows) Spatial Join HashAggregate (cost=6750001.00..6750601.00 rows=6 width=72) (actual time=241832.855..241889.360 rows=6 loops=1) Group Key: r.nx -> Custom Scan (SpatialJoin) (cost=0.00..1.00 rows=3 width=72) (actual time=0.140..216187.111 rows=39817008 loops=1) Outer index: ix_r1_cube Inner index: ix_a1_cube Planning time: 0.533 ms Execution time: 241907.569 ms (7 rows) Time: 241910,440 ms (04:01,910) Any ideas will be highly appreciated. Best regards, Andrey Borodin. create extension if not exists cube; begin transaction; SELECT setseed(.43); \timing create table dataTable as select cube(array[random(),random(),random()]) c, 1::float as x1, 1::float as x2, 1::float as x3, 1::float as x4 from generate_series(1,3e6,1) s; create index idx on dataTable using gist(c); create table regions as select cube(array[x,y,z],array[x+0.1,y+0.01,z+0.01]) c, row_number() over() as nx from (select random() x,random() y, random() z from generate_series(1,3e5,1) s) q; create index idx1 on regions using gist(c); set max_parallel_workers_per_gather = 0; explain analyze select sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4 from dataTable a join regions r on r.c @> a.c group by r.nx; explain analyze select sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4 from dataTable a join regions r on r.c && a.c group by r.nx; commit; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
2017-04-13 11:30 GMT+05:00 Jeff Davis : > I don't quite follow. I don't think any of these proposals uses btree, > right? Range merge join doesn't need any index, your proposal uses > gist, and PgSphere's crossmatch uses gist. Merge join will use presorted data, B-tree provides sorted data. Merge Join cannot join non-sorted data, can it? Indeed, B-tree is not the only sorted data provider. In this sight, Range Merge join is even more generic. Best regards, Andrey Borodin. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Wed, Apr 12, 2017 at 10:44 PM, Andrew Borodin wrote: >> How do you think we should proceed? Which projects do you think should >> eventually be in core, versus which are fine as extensions? > > Some points in favor of Range joins via nbtree: My patch doesn't require indexes, it can sort the input (and the 5X improvement that I got included the effort to sort). In fact, I expect using sort will be more common than maintaining btree indexes on a range column. > 1. It's more efficient than B-tree over GiST > 2. It is already in a patch form > > Point against: > 1. Particular implementation is somewhat leaked abstraction. Inside > the planner, you check not for capabilities of operator and type, but > for specific op and type. But I do not know how to fix this. It's not a specific type, it's the "anyrange" type, so you can define new range types to take advantage of range merge join. I can drive the planner changes from the catalog rather than hard-coded OIDs if we think range merge join is a good solution. > So, here is my opinion: if we can inside the planner understand that > join condition involves range specifics (for all available ranges) and > do Range Merge Join, then this is preferred solution. Yes, I can do that. > Yes, Spatial Join, when available, will provide roughly the same scan > performance. But B-trees are more well known to users new to > PostgreSQL, and B-trees are faster. I don't quite follow. I don't think any of these proposals uses btree, right? Range merge join doesn't need any index, your proposal uses gist, and PgSphere's crossmatch uses gist. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
2017-04-13 7:01 GMT+05:00 Jeff Davis : > On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov > wrote: >> On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis wrote: >>> Do you have a sense of how this might compare with range merge join? >> >> >> If you have GiST indexes over ranges for both sides of join, then this >> method could be used for range join. Hence, it could be compared with range >> merge join. >> However, particular implementation in pgsphere uses hardcoded datatypes and >> operations. >> Thus, for range join we need either generalized version of GiST-based join >> or special implementation for ranges. > > Alexander, Andrew, > > How do you think we should proceed? Which projects do you think should > eventually be in core, versus which are fine as extensions? Some points in favor of Range joins via nbtree: 1. It's more efficient than B-tree over GiST 2. It is already in a patch form Point against: 1. Particular implementation is somewhat leaked abstraction. Inside the planner, you check not for capabilities of operator and type, but for specific op and type. But I do not know how to fix this. So, here is my opinion: if we can inside the planner understand that join condition involves range specifics (for all available ranges) and do Range Merge Join, then this is preferred solution. Yes, Spatial Join, when available, will provide roughly the same scan performance. But B-trees are more well known to users new to PostgreSQL, and B-trees are faster. Best regards, Andrey Borodin. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov wrote: > On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis wrote: >> Do you have a sense of how this might compare with range merge join? > > > If you have GiST indexes over ranges for both sides of join, then this > method could be used for range join. Hence, it could be compared with range > merge join. > However, particular implementation in pgsphere uses hardcoded datatypes and > operations. > Thus, for range join we need either generalized version of GiST-based join > or special implementation for ranges. Alexander, Andrew, How do you think we should proceed? Which projects do you think should eventually be in core, versus which are fine as extensions? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
Thank you, Alexander! This is definitely the example we are looking for! Hat tip to Dmitry especially for this commit https://github.com/akorotkov/pgsphere/commit/971d2c5d61f17774a6d8d137ca3ad87e2883048f Regards, Sergey Mirvoda On Tue, Apr 11, 2017 at 2:17 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Mon, Apr 10, 2017 at 2:53 PM, Andrew Borodin > wrote: > >> ==Spatial joins== >> Scientific papers from the dawn of R-trees and multidimensional >> indexes feature a lot of algorithms for spatial joins. >> I.e. you have two sets of geometries s1 and s2, you need to produce >> all colliding pairs (p1,p2) where p1 in s1 and p2 in s2. For 2 R-trees >> of equal heights with roots R and S most straightforward[1] algorithm >> look like: >> >> SpatialJoin (R,S: Node) >> { >> FOR (all E in S) >> FOR (all ER in R with ER.rect intersecting E.rect) >>IF (R is a leaf page) >>{ >> Output ER,ES >>} >>ELSE >>{ >> SpatialJoin (ER.ref, ES.ref) >>} >> } >> > > FYI, I've implemented this algorithm for pgsphere. See following branch. > https://github.com/akorotkov/pgsphere/tree/experimental > It's implemented as crossmatch() function which takes as arguments names > of two indexes over spoint and maximum distance (it checks not overlapping > but proximity of points). This function returns setof pairs of TIDs. > > Later, Dmitry Ivanov made it a custom scan node. > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > You also can find some experimental evaluation here: > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf > > Feel free to experiment with this code and/or ask any question. > > -- > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company > -- --Regards, Sergey Mirvoda
Re: [HACKERS] Merge join for GiST
2017-04-11 14:17 GMT+05:00 Alexander Korotkov : > FYI, I've implemented this algorithm for pgsphere. See following branch. > https://github.com/akorotkov/pgsphere/tree/experimental > It's implemented as crossmatch() function which takes as arguments names of > two indexes over spoint and maximum distance (it checks not overlapping but > proximity of points). This function returns setof pairs of TIDs. > > Later, Dmitry Ivanov made it a custom scan node. > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > You also can find some experimental evaluation here: > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf > > Feel free to experiment with this code and/or ask any question. Wow, that's cool! Thanks! That code actually answers a lot of questions. I'll try to generalize that for && operator Best regards, Andrey Borodin. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis wrote: > On Tue, Apr 11, 2017 at 2:17 AM, Alexander Korotkov > wrote: > > FYI, I've implemented this algorithm for pgsphere. See following branch. > > https://github.com/akorotkov/pgsphere/tree/experimental > > It's implemented as crossmatch() function which takes as arguments names > of > > two indexes over spoint and maximum distance (it checks not overlapping > but > > proximity of points). This function returns setof pairs of TIDs. > > > > Later, Dmitry Ivanov made it a custom scan node. > > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > > > You also can find some experimental evaluation here: > > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf > > Do you have a sense of how this might compare with range merge join? > If you have GiST indexes over ranges for both sides of join, then this method could be used for range join. Hence, it could be compared with range merge join. However, particular implementation in pgsphere uses hardcoded datatypes and operations. Thus, for range join we need either generalized version of GiST-based join or special implementation for ranges. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Merge join for GiST
On Tue, Apr 11, 2017 at 2:17 AM, Alexander Korotkov wrote: > FYI, I've implemented this algorithm for pgsphere. See following branch. > https://github.com/akorotkov/pgsphere/tree/experimental > It's implemented as crossmatch() function which takes as arguments names of > two indexes over spoint and maximum distance (it checks not overlapping but > proximity of points). This function returns setof pairs of TIDs. > > Later, Dmitry Ivanov made it a custom scan node. > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > You also can find some experimental evaluation here: > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf Do you have a sense of how this might compare with range merge join? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Mon, Apr 10, 2017 at 2:53 PM, Andrew Borodin wrote: > ==Spatial joins== > Scientific papers from the dawn of R-trees and multidimensional > indexes feature a lot of algorithms for spatial joins. > I.e. you have two sets of geometries s1 and s2, you need to produce > all colliding pairs (p1,p2) where p1 in s1 and p2 in s2. For 2 R-trees > of equal heights with roots R and S most straightforward[1] algorithm > look like: > > SpatialJoin (R,S: Node) > { > FOR (all E in S) > FOR (all ER in R with ER.rect intersecting E.rect) >IF (R is a leaf page) >{ > Output ER,ES >} >ELSE >{ > SpatialJoin (ER.ref, ES.ref) >} > } > FYI, I've implemented this algorithm for pgsphere. See following branch. https://github.com/akorotkov/pgsphere/tree/experimental It's implemented as crossmatch() function which takes as arguments names of two indexes over spoint and maximum distance (it checks not overlapping but proximity of points). This function returns setof pairs of TIDs. Later, Dmitry Ivanov made it a custom scan node. https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode You also can find some experimental evaluation here: http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf Feel free to experiment with this code and/or ask any question. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Merge join for GiST
2017-04-10 20:38 GMT+05:00 Robert Haas : > On Mon, Apr 10, 2017 at 7:53 AM, Andrew Borodin wrote: >> I think this idea is somewhat related to this patch [2], but as for >> now cannot describe how exactly GiST merge and Range Merge features >> relate. > > It also seems somewhat related to Peter Moser's work on ALIGN and > NORMALIZE. It would be nice if the various groups of people > interested in improving PostgreSQL's spatial stuff got together and > reviewed each others' patches. As a non-spatial guy myself, it's > pretty hard to decide on the relative merits of different proposed > approaches. Hi, Robert! Thank you for the pointer. Temporal features are not exactly within my scope, but you are right, topics are close to each other. I'll look into the patch with temporal features and assess whether I can provide a meaningful review. I do not know how to gather the attention of all how is interested in this kind of features. I read hackers@ digest regularly, used search a lot, but that temporal work slipped away from my attention. Best regards, Andrey Borodin. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Mon, Apr 10, 2017 at 7:53 AM, Andrew Borodin wrote: > I think this idea is somewhat related to this patch [2], but as for > now cannot describe how exactly GiST merge and Range Merge features > relate. It also seems somewhat related to Peter Moser's work on ALIGN and NORMALIZE. It would be nice if the various groups of people interested in improving PostgreSQL's spatial stuff got together and reviewed each others' patches. As a non-spatial guy myself, it's pretty hard to decide on the relative merits of different proposed approaches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Merge join for GiST
Hello, hackers! ==Spatial joins== Scientific papers from the dawn of R-trees and multidimensional indexes feature a lot of algorithms for spatial joins. I.e. you have two sets of geometries s1 and s2, you need to produce all colliding pairs (p1,p2) where p1 in s1 and p2 in s2. For 2 R-trees of equal heights with roots R and S most straightforward[1] algorithm look like: SpatialJoin (R,S: Node) { FOR (all E in S) FOR (all ER in R with ER.rect intersecting E.rect) IF (R is a leaf page) { Output ER,ES } ELSE { SpatialJoin (ER.ref, ES.ref) } } ==Motivation== I’m mostly interested in OLAP-style aggregates like: SELECT rows.Id, sum(data.attribute1), sum(data.attribute2), sum(data.attribute3) FROM rows LEFT JOIN data ON rows.IndexedAttribute && data.IndexedAttribute GROUP BY 1 When we have two GiST for rows.IndexedAttribute and data.IndexedAttribute. Currently, this query produces plans with Nested Loop Join, so for every row from “rows” there is one GiST scan into “data”. ==Proposal== Obviously, for this scenario, we can use GiST-based join algorithm identical to the SpatialJoin algorithm above. This algorithm will work iif (key1 && key2) ==> (union(key1,key3) && key2 ) and (union(key2,key3) && key1 ) for any key3. This is quite straightforward for && operator, while I’m not sure this will work for <@ and @> and others. I can try to implement this kind of join as an extension or try to embed this into the planner. I think this idea is somewhat related to this patch [2], but as for now cannot describe how exactly GiST merge and Range Merge features relate. How do you think: 1. Has this idea been considered before? I had not found anything on actual spatial join of GiSTS. 2. What is the easiest way to implement this feature? 3. What kind of operators may be spatial-joined without intervention to existing GiST scan strategies API? Many thanks for reading. Best regards, Andrey Borodin. [1] Brinkhoff, Thomas, Hans-Peter Kriegel, and Bernhard Seeger. Efficient processing of spatial joins using R-trees. Vol. 22. No. 2. ACM, 1993. [2] https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com#camp0ubfwaffw3o_ngkqprpmm56m4wetexjprb2gp_nrdaec...@mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers