Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
I was just reading the PostgreSQL 11 roadmap and it mentions native graph support. I would be interested in following the design work for this. Would this require the new pluggable storage which is currently in development, or would the existing storage engine be sufficient? I am just wondering if there are any rough designs/plans for this...

https://wiki.postgresql.org/wiki/Fujitsu_roadmap#Multi-model_database - *graph: Natively support graph data model. Implement Cypher and/or Gremlin as the query language through UDFs.*

Thank you,
Henry

On Sun, Sep 3, 2017 at 1:14 PM MauMau wrote:
> From: Henry M
>> This may be interesting... they implement Cypher (unfortunately they had to fork in order to have Cypher be a first-class query language with SQL).
>>
>> https://github.com/bitnine-oss/agensgraph
>
> I'm sorry for my very late reply.
>
> Thanks for the information. AgensGraph is certainly interesting, but the problem is that it's a fork of PostgreSQL, as you mentioned. I would like the data models, including query languages, to be pluggable extensions, so that various people (especially database researchers?) can develop them flexibly. Of course, I want various data models to be incorporated in the core as early as possible, but I'm afraid that's not easy. If new data models can be added as extensions, they can be developed outside the PostgreSQL community process, become popular and mature, and then be embraced in core the way GiST/SP-GiST indexes and full-text search were.
>
> Regards
> MauMau
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
From: Henry M
> This may be interesting... they implement Cypher (unfortunately they had to fork in order to have Cypher be a first-class query language with SQL).
>
> https://github.com/bitnine-oss/agensgraph

I'm sorry for my very late reply.

Thanks for the information. AgensGraph is certainly interesting, but the problem is that it's a fork of PostgreSQL, as you mentioned. I would like the data models, including query languages, to be pluggable extensions, so that various people (especially database researchers?) can develop them flexibly. Of course, I want various data models to be incorporated in the core as early as possible, but I'm afraid that's not easy. If new data models can be added as extensions, they can be developed outside the PostgreSQL community process, become popular and mature, and then be embraced in core the way GiST/SP-GiST indexes and full-text search were.

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
This may be interesting... they implement Cypher (unfortunately they had to fork in order to have Cypher be a first-class query language with SQL).

https://github.com/bitnine-oss/agensgraph

On Mon, Aug 21, 2017 at 12:44 AM Chris Travers wrote:
> On Sun, Aug 20, 2017 at 4:10 AM, MauMau wrote:
>> From: Chris Travers
>>> Why can't you do all this in a language handler and treat it as a user-defined function?
>>> ...
>>> If you have a language handler for cypher, why do you need in_region or cast_region? Why not just have a graph_search() function which takes in a cypher query and returns a set of records?
>>
>> The language handler is for *stored* functions. The user-defined function (UDF) doesn't participate in the planning of the outer (top-level) query. And they both assume that they are executed in SQL commands.
>
> Sure, but stored functions can take arguments, such as a query string which gets handled by the language handler. There's absolutely no reason you cannot declare a function in C that takes in a Cypher query and returns a set of tuples. And you can do a whole lot with preloaded shared libraries if you need to.
>
> The planning bit is more difficult, but see below as to where I see major limits here.
>
>> I want the data models to meet these requirements:
>>
>> 1) The query language can be used as a top-level session language. For example, if an app specifies "region=cypher_graph" at database connection, it can use the database as a graph database and submit Cypher queries without embedding them in SQL.
>
> That sounds like a foot gun. I would probably think of those cases as being ideal for a custom background worker, similar to Mongress. Expecting to be able to switch query languages on the fly strikes me as adding totally needless complexity everywhere, to be honest. Having different listeners on different ports simplifies this a lot, and having, additionally, query languages for ad-hoc mixing via language handlers might be able to get most of what you want already.
>
>> 2) When a query contains multiple query fragments of different data models, all those fragments are parsed and planned before execution. The planner comes up with the best plan, crossing the data model boundary. Take the query example in my first mail, which joins a relational table and the result of a graph query: the relational planner considers how to scan the table, the graph planner considers how to search the graph, and the relational planner considers how to join the two fragments.
>
> It seems like all you really need is a planner hook for user-defined languages (i.e. "how many rows does this function return with these parameters", right?). Right now we allow hints, but they are static. I wonder how hard this would be using preloaded shared libraries.
>
>> So in_region() and cast_region() are not functions to be executed during the execution phase, but syntax constructs that are converted, during the analysis phase, into calls to another region's parser/analyzer and an inter-model cast routine.
>
> So basically they work like immutable functions, except that you cannot index the output?
>
>> 1. The relational parser finds in_region('cypher_graph', 'graph query') and produces a parse node InRegion(region_name, query) in the parse tree.
>>
>> 2. The relational analyzer looks up the system catalog to check if the specified region exists, then calls its parser/analyzer to produce the query tree for the graph query fragment. The relational analyzer attaches the graph query tree to the InRegion node.
>>
>> 3. When the relational planner finds the graph query tree, it passes the graph query tree to the graph planner to produce the graph execution plan.
>>
>> 4. The relational planner produces a join plan node, based on the costs/statistics of the relational table scan and graph query. The graph execution plan is attached to the join plan node.
>>
>> The parse/query/plan nodes have a label to denote a region, so that the appropriate region's routines can be called.
>
> It would be interesting to see how much of what you want you can get with what we currently have, and what pieces are really missing.
>
> Am I right that if you wrote a function in C to take a Cypher query, plan and analyse it, and execute it, the only thing really missing would be feedback to the PostgreSQL planner regarding the number of rows expected?
>
>> Regards
>> MauMau
>
> --
> Best Regards,
> Chris Travers
> Database Administrator
>
> Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
> Saarbrücker Straße 37a, 10405 Berlin
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
On Sun, Aug 20, 2017 at 4:10 AM, MauMau wrote:
> From: Chris Travers
>> Why can't you do all this in a language handler and treat it as a user-defined function?
>> ...
>> If you have a language handler for cypher, why do you need in_region or cast_region? Why not just have a graph_search() function which takes in a cypher query and returns a set of records?
>
> The language handler is for *stored* functions. The user-defined function (UDF) doesn't participate in the planning of the outer (top-level) query. And they both assume that they are executed in SQL commands.

Sure, but stored functions can take arguments, such as a query string which gets handled by the language handler. There's absolutely no reason you cannot declare a function in C that takes in a Cypher query and returns a set of tuples. And you can do a whole lot with preloaded shared libraries if you need to.

The planning bit is more difficult, but see below as to where I see major limits here.

> I want the data models to meet these requirements:
>
> 1) The query language can be used as a top-level session language. For example, if an app specifies "region=cypher_graph" at database connection, it can use the database as a graph database and submit Cypher queries without embedding them in SQL.

That sounds like a foot gun. I would probably think of those cases as being ideal for a custom background worker, similar to Mongress. Expecting to be able to switch query languages on the fly strikes me as adding totally needless complexity everywhere, to be honest. Having different listeners on different ports simplifies this a lot, and having, additionally, query languages for ad-hoc mixing via language handlers might be able to get most of what you want already.

> 2) When a query contains multiple query fragments of different data models, all those fragments are parsed and planned before execution. The planner comes up with the best plan, crossing the data model boundary. Take the query example in my first mail, which joins a relational table and the result of a graph query: the relational planner considers how to scan the table, the graph planner considers how to search the graph, and the relational planner considers how to join the two fragments.

It seems like all you really need is a planner hook for user-defined languages (i.e. "how many rows does this function return with these parameters", right?). Right now we allow hints, but they are static. I wonder how hard this would be using preloaded shared libraries.

> So in_region() and cast_region() are not functions to be executed during the execution phase, but syntax constructs that are converted, during the analysis phase, into calls to another region's parser/analyzer and an inter-model cast routine.

So basically they work like immutable functions, except that you cannot index the output?

> 1. The relational parser finds in_region('cypher_graph', 'graph query') and produces a parse node InRegion(region_name, query) in the parse tree.
>
> 2. The relational analyzer looks up the system catalog to check if the specified region exists, then calls its parser/analyzer to produce the query tree for the graph query fragment. The relational analyzer attaches the graph query tree to the InRegion node.
>
> 3. When the relational planner finds the graph query tree, it passes the graph query tree to the graph planner to produce the graph execution plan.
>
> 4. The relational planner produces a join plan node, based on the costs/statistics of the relational table scan and graph query. The graph execution plan is attached to the join plan node.
>
> The parse/query/plan nodes have a label to denote a region, so that the appropriate region's routines can be called.

It would be interesting to see how much of what you want you can get with what we currently have, and what pieces are really missing.

Am I right that if you wrote a function in C to take a Cypher query, plan and analyse it, and execute it, the only thing really missing would be feedback to the PostgreSQL planner regarding the number of rows expected?

> Regards
> MauMau

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
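[Editor's note: the "static hints" Chris mentions are the ROWS clause of CREATE FUNCTION. A minimal sketch, assuming a hypothetical C-language graph_search() function and shared library (neither exists); only the ROWS clause is the point here:]

```sql
-- Sketch only: "graph_search" and the 'graph_handler' library are
-- hypothetical stand-ins for a C function that executes a Cypher query.
CREATE FUNCTION graph_search(cypher_query text)
RETURNS SETOF record
AS 'graph_handler', 'graph_search'
LANGUAGE c STRICT
ROWS 1000;  -- static row-count hint: the planner assumes 1000 rows
            -- no matter what query string is passed in
```

(PostgreSQL 12 later added planner support functions, which can supply a dynamic row estimate per call instead of this static hint.)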
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
On 20 August 2017 at 10:10, MauMau wrote:
> From: Chris Travers
>> Why can't you do all this in a language handler and treat it as a user-defined function?
>> ...
>> If you have a language handler for cypher, why do you need in_region or cast_region? Why not just have a graph_search() function which takes in a cypher query and returns a set of records?
>
> The language handler is for *stored* functions. The user-defined function (UDF) doesn't participate in the planning of the outer (top-level) query. And they both assume that they are executed in SQL commands.

While I generally agree with Tom on this, I think there are some useful ideas to examine: allow a UDF to emit multiple result sets that can then be incorporated into an outer query. IMO it'd be fine to support this by returning a wide row of REFCURSORs and then allowing FETCH to be used in a subquery. The UDF would need to be invoked before the rest of the query was planned, so the planner could learn the structure of the cursors' result sets. Or some higher-level concept could be introduced, like it was for aggregates and window functions, where one call can be made to get the output structure and some stats estimates, and another call (or series) to get the rows.

I guess you're going two steps further than that, seeking a more integrated model where the plugin can generate paths and participate more actively in planning, and where you can optionally make it the default so you don't need a SQL function call to access it. If you want to pursue that, I suggest you start small and go step by step. Things like:

* Allow FETCH ... to be used in subqueries with an explicitly listed output relation structure, like calling a function that returns record
* Allow pre-execution of parts of a query that produce refcursors used in subqueries, then finish planning the outer query once the cursor output types are known
* A construct that can inject arbitrary virtual relations into the namespace at parse time, so you don't have to do the dance with refcursors (like WITH)
* A construct that can supply stats estimates for the virtual relations

So try to build it in stages. You could also potentially use the FDW interface.

> I want the data models to meet these requirements:
>
> 1) The query language can be used as a top-level session language. For example, if an app specifies "region=cypher_graph" at database connection, it can use the database as a graph database and submit Cypher queries without embedding them in SQL.

Why? What does this offer over the app or client tool wrapping its queries in "SELECT cypher_graph('')" ?

> 2) When a query contains multiple query fragments of different data models, all those fragments are parsed and planned before execution. The planner comes up with the best plan, crossing the data model boundary. Take the query example in my first mail, which joins a relational table and the result of a graph query: the relational planner considers how to scan the table, the graph planner considers how to search the graph, and the relational planner considers how to join the two fragments.

Here, what you need is a way to define a set of virtual relations on a per-query basis, where you can get stats estimates for the relations during planning.

I guess what you're imagining is something more sophisticated, where you're generating some kind of sub-plan candidates, like the path model, with some kind of interaction so the sub-planner for the other model could know to generate a different sub-plan based on the context of the outer plan. I have no idea how that could work. But I think you have about zero chance of achieving what you want by going straight there. Focus on small incremental steps, preferably ones you can find other uses for too.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
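[Editor's note: to make the first two incremental steps concrete, a sketch in hypothetical syntax. Neither FETCH-in-a-subquery nor make_graph_cursor() exists in any PostgreSQL release; this only illustrates the shape of the proposal:]

```sql
-- Hypothetical syntax: a UDF returns a refcursor, and FETCH is allowed
-- in a FROM-clause subquery with an explicitly listed output structure,
-- analogous to calling a function that returns record.
SELECT r.name, g.num_likers
FROM restaurant r,
     (FETCH ALL FROM make_graph_cursor('MATCH ...')
        AS (name text, num_likers int)) AS g
WHERE r.name = g.name;
```

The column definition list is what lets the outer query be planned before the cursor's result structure is otherwise known.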
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
From: Chris Travers
> Why can't you do all this in a language handler and treat it as a user-defined function?
> ...
> If you have a language handler for cypher, why do you need in_region or cast_region? Why not just have a graph_search() function which takes in a cypher query and returns a set of records?

The language handler is for *stored* functions. The user-defined function (UDF) doesn't participate in the planning of the outer (top-level) query. And they both assume that they are executed in SQL commands.

I want the data models to meet these requirements:

1) The query language can be used as a top-level session language. For example, if an app specifies "region=cypher_graph" at database connection, it can use the database as a graph database and submit Cypher queries without embedding them in SQL.

2) When a query contains multiple query fragments of different data models, all those fragments are parsed and planned before execution. The planner comes up with the best plan, crossing the data model boundary. Take the query example in my first mail, which joins a relational table and the result of a graph query: the relational planner considers how to scan the table, the graph planner considers how to search the graph, and the relational planner considers how to join the two fragments.

So in_region() and cast_region() are not functions to be executed during the execution phase, but syntax constructs that are converted, during the analysis phase, into calls to another region's parser/analyzer and an inter-model cast routine:

1. The relational parser finds in_region('cypher_graph', 'graph query') and produces a parse node InRegion(region_name, query) in the parse tree.

2. The relational analyzer looks up the system catalog to check if the specified region exists, then calls its parser/analyzer to produce the query tree for the graph query fragment. The relational analyzer attaches the graph query tree to the InRegion node.

3. When the relational planner finds the graph query tree, it passes the graph query tree to the graph planner to produce the graph execution plan.

4. The relational planner produces a join plan node, based on the costs/statistics of the relational table scan and graph query. The graph execution plan is attached to the join plan node.

The parse/query/plan nodes have a label to denote a region, so that the appropriate region's routines can be called.

Regards
MauMau
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
"MauMau" writes: > I'm thinking of making PostgreSQL a multi-model database by supporting > data models other than the current relational model. A data model > consists of a query language (e.g. SQL for relational model, Cypher > for graph model), a parser and analyzer to transform a query into a > query tree, a planner to transform the query tree into an execution > plan, an executor, and a storage engine. It sounds like what you want is to replace all of Postgres except the name. I'm not clear on the point. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
On Sat, Aug 19, 2017 at 4:29 PM, MauMau wrote:
> Hello,
>
> Please forgive me for asking such a stupid and rough question.
>
> I'm thinking of making PostgreSQL a multi-model database by supporting data models other than the current relational model. A data model consists of a query language (e.g. SQL for the relational model, Cypher for the graph model), a parser and analyzer to transform a query into a query tree, a planner to transform the query tree into an execution plan, an executor, and a storage engine.
>
> To promote the data model development, I want to make data models pluggable. The rough sketch is:
>
> 1) A data model developer implements the parser, analyzer, transformer, planner, executor, and storage engine functions in a shared library.
>
> 2) The DBA registers the data model.
>
> CREATE QUERY LANGUAGE Cypher (
>     PARSER =
> );
>
> CREATE DATA MODEL graph (
>     QUERY LANGUAGE = Cypher,
>     ANALYZER = ,
>     TRANSFORMER = ,
>     PLANNER = ,
>     EXECUTOR = ,
>     STORAGE ENGINE = ,
> );
>
> CREATE REGION cypher_graph (
>     QUERY LANGUAGE = Cypher,
>     DATA MODEL = graph
> );

Why can't you do all this in a language handler and treat it as a user-defined function?

> The region is just a combination of a query language and a data model, much like a locale is a combination of a language and a country. This is because there may be multiple popular query languages for a data model.
>
> 3) The application connects to the database, specifying a desired region. The specified region's query language becomes the default query language for the session.
>
> The application can use the data of multiple data models in one query by specifying another region and its query via in_region(). For example, the following query combines the relational restaurant table and a social graph to get the five Chinese restaurants in Tokyo that are most popular among friends of John and their friends.
>
> SELECT r.name, g.num_likers
> FROM restaurant r,
>      cast_region(
>        in_region('cypher_graph',
>          'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(friend),
>                 (friend)-[:LIKES]->(restaurant:Restaurant)
>           RETURN restaurant.name, count(*)'),
>        'relational', 'g', '(name text, num_likers int)')
> WHERE r.name = g.name AND
>       r.city = 'Tokyo' AND r.cuisine = 'chinese'
> ORDER BY g.num_likers DESC LIMIT 5;

If you have a language handler for cypher, why do you need in_region or cast_region? Why not just have a graph_search() function which takes in a cypher query and returns a set of records?

> What do you think would be difficult about making data models pluggable, especially plugging the parser, planner, executor, etc.? One possible concern is that various PostgreSQL components might be too dependent on the data model being relational, and it would be difficult to separate the tight coupling.

I guess I am missing why the current language handler structure is not enough. Maybe I am missing something?

> Regards
> MauMau

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
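[Editor's note: Chris's alternative, sketched concretely. It assumes a hypothetical graph_search(text) RETURNS SETOF record function backed by a Cypher language handler; the column definition list after the alias supplies the output structure, as set-returning functions already allow today:]

```sql
-- Sketch: graph_search() is hypothetical, not an existing function.
SELECT r.name, g.num_likers
FROM restaurant r
JOIN graph_search(
       'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(friend),
              (friend)-[:LIKES]->(restaurant:Restaurant)
        RETURN restaurant.name, count(*)'
     ) AS g(name text, num_likers int)
  ON r.name = g.name
WHERE r.city = 'Tokyo' AND r.cuisine = 'chinese'
ORDER BY g.num_likers DESC
LIMIT 5;
```

This reproduces the in_region()/cast_region() example query without any new syntax; what it cannot do, as discussed above, is feed per-query row estimates to the planner.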
[HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?
Hello,

Please forgive me for asking such a stupid and rough question.

I'm thinking of making PostgreSQL a multi-model database by supporting data models other than the current relational model. A data model consists of a query language (e.g. SQL for the relational model, Cypher for the graph model), a parser and analyzer to transform a query into a query tree, a planner to transform the query tree into an execution plan, an executor, and a storage engine.

To promote the data model development, I want to make data models pluggable. The rough sketch is:

1) A data model developer implements the parser, analyzer, transformer, planner, executor, and storage engine functions in a shared library.

2) The DBA registers the data model.

CREATE QUERY LANGUAGE Cypher (
    PARSER =
);

CREATE DATA MODEL graph (
    QUERY LANGUAGE = Cypher,
    ANALYZER = ,
    TRANSFORMER = ,
    PLANNER = ,
    EXECUTOR = ,
    STORAGE ENGINE = ,
);

CREATE REGION cypher_graph (
    QUERY LANGUAGE = Cypher,
    DATA MODEL = graph
);

The region is just a combination of a query language and a data model, much like a locale is a combination of a language and a country. This is because there may be multiple popular query languages for a data model.

3) The application connects to the database, specifying a desired region. The specified region's query language becomes the default query language for the session.

The application can use the data of multiple data models in one query by specifying another region and its query via in_region(). For example, the following query combines the relational restaurant table and a social graph to get the five Chinese restaurants in Tokyo that are most popular among friends of John and their friends.

SELECT r.name, g.num_likers
FROM restaurant r,
     cast_region(
       in_region('cypher_graph',
         'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(friend),
                (friend)-[:LIKES]->(restaurant:Restaurant)
          RETURN restaurant.name, count(*)'),
       'relational', 'g', '(name text, num_likers int)')
WHERE r.name = g.name AND
      r.city = 'Tokyo' AND r.cuisine = 'chinese'
ORDER BY g.num_likers DESC LIMIT 5;

What do you think would be difficult about making data models pluggable, especially plugging the parser, planner, executor, etc.? One possible concern is that various PostgreSQL components might be too dependent on the data model being relational, and it would be difficult to separate the tight coupling.

Regards
MauMau