Re: [HACKERS] Recursive query syntax ambiguity
On Mon, Jan 29, 2007 at 01:38:02PM +, Gregory Stark wrote: Instead I suggest we create one type of reentrant node, which memoizes its output. It would be a kind of on-demand Materialize. It would be paired with a second node type which would be the only type of node which can invoke it. This RecursiveCall node would invoke the Memoize node using a special interface in which it passes enough state for the Memoize node to seek to the correct place in its output. That I beleive is the right approach. I think an equivalent way of looking at it is a Loop node with an InitPlan and a StepPlan. Initially, the Loop node executes the InitPlan to get it's initial set of tuples, storing them like a Materialize node does. After that it keeps calling the StepNode, where somewhere inside the it has a node that extracts a row from the aforementioned tuplestore. It stores these returned tuples in the tuplestore also, thus giving you recursion. snip (I've convinced myself that that's true but I should probably work out a good proof of it before I make all this depend on it.) Yeah, proving it is going to be tricky, I'm not sure what the standard says about infinite recursion. There are three general cases of the Memoize node. Breadth-first, Depth-first, and non-linearly-recursive. I think the the only difference between depth and bredth-first searches is (if you consider the tuplesort to be a queue) whether the new tuples go to the front or the back of the list. But a data structures and algorithms book will know this. There are a few open issues to consider. Namely, how to cost a RecursiveCall node. One note: if you know that if you get p tuples out for every tuple in (where p1) then the asymptotic result of 1 + p + p*p+ ... is 1/(1-p) However, I don't know it matters. You only need to cost the plan if there are alternate paths and given the plan structure is strongly constrained, I'm not sure how much it matters. Also, if a subplan has exactly one call-site we really ought to inline it as it will get much more reliable plans. Similarly if there are two call sites we should consider inlining it if the cost of materializing the results (and reading them back) is more than n-call-sites x the cost of executing the query. I would expect That would happen with plain sequential scans for example. In the case where the subplan has side-effects, you can't optimise at all. In the case of read-committed mode, will two seq-scans always return the same result? Hope this helps, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] Recursive query syntax ambiguity
Martijn van Oosterhout kleptog@svana.org writes: However, I don't know it matters. You only need to cost the plan if there are alternate paths and given the plan structure is strongly constrained, I'm not sure how much it matters. It does, since the whole thing could be a subquery, in which case there could be options available at the outer level. I doubt we'll be able to be really smart, but that doesn't mean we can just punt. In the case of read-committed mode, will two seq-scans always return the same result? They definitely should, since we'll be using the same snapshot throughout the query. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Recursive query syntax ambiguity
Tom Lane [EMAIL PROTECTED] writes: Gregory Stark [EMAIL PROTECTED] writes: Having fixed that everything works fine with SET and WITH being reserved keywords. You didn't mean to say I should be able to leave WITH unreserved did you? I think we'd decided that was a lost cause, unless you see a way? No, I decided it was a lost cause right at the start of this. I was just double checking that you didn't think otherwise when you said you didn't see anything else that needed to be reserved. Of course that was the easy part... Yup ;-) I think I've finally wrapped my head around the idea of mutually recursive and non-linearly recursive queries. Good thing too since in thinking about them I realized that the approach I had intended to use for plain non-recursive queries wasn't really adequate and needs the same machinery. The fundamental feature Postgres needs to implement this that it doesn't have is the ability to be running the same subplan from two different contexts in the plan simultaneously. That is, the ability to start a scan from the beginning without disturbing a scan already in progress for that same node. In retrospect this isn't terribly surprising since it's the same characteristic functions need in a programming language to be safely callable recursively. They need to be reentrant. The normal way to make functions reentrant is to remove all local state and make the caller responsible for providing space to allocate memory for temporary state. But teaching every node type about passing its state up to the parent including bundling up all the state of its children seems like too much work. Both for me and the computer :) In any case that would result in an n^2 algorithm for recursive queries as it would force the executor to re-execute each node for the same record over and over. Much as the obvious recursive fibonacci algorithm is n^2 for example. It also would mean making non-recursive WITH clauses pointless since they would still be executed once per call-site. Instead I suggest we create one type of reentrant node, which memoizes its output. It would be a kind of on-demand Materialize. It would be paired with a second node type which would be the only type of node which can invoke it. This RecursiveCall node would invoke the Memoize node using a special interface in which it passes enough state for the Memoize node to seek to the correct place in its output. Because it memoizes its output it should never actually need to be able to handle recursive calls. When the call depth is maximum each call site would be able to make at most 1 call to the Memoize node. A second one would necessarily have been stopped higher up the call chain by the memoization table. (I've convinced myself that that's true but I should probably work out a good proof of it before I make all this depend on it.) There are three general cases of the Memoize node. Breadth-first, Depth-first, and non-linearly-recursive. In the case of breadth-first search we would want to keep two tuplestore around, one for the current generation, and one for its parent. Once we finish a generation we can free its parent since we know a linearly-recursive scan won't be needing it any more. In the case of depth-first we really want to memoize into a stack. I'm not sure how to fit that into our current model of tuplestores. At worst though, we just allocate a tuplestore for each generation as it appears. We can still throw away old generations once their scans are finished, it just won't be until quite late. Non-linearly-recursive searches are the same case as non-recursive queries. Memoize would just allocate one tuplestore and store tuples blindly without any idea of generation. It could arrange to throw away old tuples whenever all call-sites' scans have past them. That's the same optimization Simon wants to make in the merge-join case when the mark is moved forward. There are a few open issues to consider. Namely, how to cost a RecursiveCall node. Also, if a subplan has exactly one call-site we really ought to inline it as it will get much more reliable plans. Similarly if there are two call sites we should consider inlining it if the cost of materializing the results (and reading them back) is more than n-call-sites x the cost of executing the query. I would expect That would happen with plain sequential scans for example. Note that we need the Memoize/RecursiveCall nodes even for non-recursive queries. Except in the simplest cases where each common table expression only has a single call site (in which case we could have just inlined them anyways) we need some way to retain state and avoid reexecuting the plan multiple times. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Recursive query syntax ambiguity
Hm, I had hoped that the DB2/ANSI syntax would only require making WITH a fully reserved word, and not the other tokens it uses. Certainly for non-recursive queries that's the case as the only other token it uses is AS which is already a fully reserved word. However to fully support the DB2/ANSI syntax we would definitely have an ambiguity and I think we would have to make CYCLE a fully reserved word which seems like a much bigger concession than WITH. Observe the following case: WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x,y SET ... The parser can't search arbitrarily far checking for a SET to see if the CYCLE is a keyword or a binary operator. Even if it could things like this would be entirely ambiguous: WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x, y CYCLE y SET ... I'm nowhere near actually implementing this functionality yet so there's no pressing need for action. In fact I think the search clause is actually an ANSIism that isn't supported by DB2 itself yet either. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
[HACKERS] Recursive query syntax ambiguity
Woah, I just realized it's much worse than that. I think the syntax in the ANSI is not parsable in LALR(1) at all. Note the following: WITH RECURSIVE foo (a,b) AS (subq) SEARCH BREADTH FIRST BY a,b,c(x,z),d(y,z) AS (subq) SELECT ... To determine whether c is the name of a new with list element it has to scan as far ahead as the , before the d. Note that d here is in fact not part of the search clause at all, it's the name of a second with list element. bleagh. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Recursive query syntax ambiguity
Gregory Stark wrote: Woah, I just realized it's much worse than that. I think the syntax in the ANSI is not parsable in LALR(1) at all. Note the following: WITH RECURSIVE foo (a,b) AS (subq) SEARCH BREADTH FIRST BY a,b,c(x,z),d(y,z) AS (subq) SELECT ... To determine whether c is the name of a new with list element it has to scan as far ahead as the , before the d. Note that d here is in fact not part of the search clause at all, it's the name of a second with list element. bleagh. Can you post the rules you have so far that you're playing around with? (Also maybe the rules from the standard - I don't have a copy handy). cheers andrew ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Recursive query syntax ambiguity
Andrew Dunstan [EMAIL PROTECTED] writes: Can you post the rules you have so far that you're playing around with? (Also maybe the rules from the standard - I don't have a copy handy). This is the best compromise I've come up with so far. It makes CYCLE a reserved word and requires a CYCLE clause if there's a SEARCH clause. --- gram.y 09 Jan 2007 02:14:14 + 2.573 +++ gram.y 26 Jan 2007 20:02:21 + @@ -350,6 +350,10 @@ %type node xml_root_version opt_xml_root_standalone %type booleandocument_or_content xml_whitespace_option +%type node common_table_expression +%type list with_cte_list cte_list +%type boolean recursive_is_depth_first + /* * If you make any token changes, update the keyword table in @@ -364,7 +368,7 @@ ASSERTION ASSIGNMENT ASYMMETRIC AT AUTHORIZATION BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT - BOOLEAN_P BOTH BY + BOOLEAN_P BOTH BREADTH BY CACHE CALLED CASCADE CASCADED CASE CAST CHAIN CHAR_P CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE @@ -376,7 +380,7 @@ DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS - DESC DISABLE_P DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP + DEPTH DESC DISABLE_P DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ESCAPE EXCEPT EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN EXTERNAL EXTRACT @@ -416,11 +420,11 @@ QUOTE - READ REAL REASSIGN RECHECK REFERENCES REINDEX RELATIVE_P RELEASE RENAME + READ REAL REASSIGN RECHECK RECURSIVE REFERENCES REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROW ROWS RULE - SAVEPOINT SCHEMA SCROLL SECOND_P SECURITY SELECT SEQUENCE + SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SERIALIZABLE SESSION SESSION_USER SET SETOF SHARE SHOW SIMILAR SIMPLE SMALLINT SOME STABLE STANDALONE_P START STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING SUPERUSER_P @@ -5681,6 +5685,25 @@ list_nth($3, 0), list_nth($3, 1)); $$ = $1; } + | with_cte_list simple_select { $$ = $2; } + | with_cte_list select_clause sort_clause + { + insertSelectOptions((SelectStmt *) $2, $3, NIL, + NULL, NULL); + $$ = $2; + } + | with_cte_list select_clause opt_sort_clause for_locking_clause opt_select_limit + { + insertSelectOptions((SelectStmt *) $2, $3, $4, + list_nth($5, 0), list_nth($5, 1)); + $$ = $2; + } + | with_cte_list select_clause opt_sort_clause select_limit opt_for_locking_clause + { + insertSelectOptions((SelectStmt *) $2, $3, $5, + list_nth($4, 0), list_nth($4, 1)); + $$ = $2; + } ; select_clause: @@ -5742,6 +5765,72 @@ } ; +/* + * ANSI standard WITH clause looks like: + * WITH [ RECURSIVE ] query name [ (column,...) ] AS (query) [SEARCH or CYCLE clause] + * + * It seems with_cte_list has to be a separate token or else there's a s/r + * conflict between RECURSIVE and the cte name. This means we'll need a silly + * node just to hold the list and the recursive flag :( it doesn't seem like + * making RECURSIVE a fully reserved word would be very nice as it's probably a + * common column name. + * + * For now we don't support recursive so just ignore it and pass up the list + * + */ +with_cte_list: + WITH cte_list + { + $$ = $2; + } + | WITH RECURSIVE cte_list + { + elog(WARNING, WITH RECURSIVE not supported yet); + $$ = $3; + } + ; + +cte_list: + common_table_expression { $$ = list_make1($1); } + |
Re: [HACKERS] Recursive query syntax ambiguity
On Fri, Jan 26, 2007 at 05:10:01PM +, Gregory Stark wrote: However to fully support the DB2/ANSI syntax we would definitely have an ambiguity and I think we would have to make CYCLE a fully reserved word which seems like a much bigger concession than WITH. Observe the following case: WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x,y SET ... The parser can't search arbitrarily far checking for a SET to see if the CYCLE is a keyword or a binary operator. Er, CYCLE isn't a binary operator, and users can't make binary operators that are words, so I'm not sure of the problem here. I think the parser can tell that the expression ends at the word cycle. Or am I missing obvious? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] Recursive query syntax ambiguity
Ok, looking at your example: WITH RECURSIVE foo (a,b) AS (subq) SEARCH BREADTH FIRST BY a,b , c(x,z),d(y,z) AS (subq) SELECT ... What you're trying to say is that the c is a with list element, not a cycle column. But the parser will see that as soon as it hits the open parenthesis, since a cycle column is always just a column name. Also, the AS is the with list element doesn't appear to be optional, I assume you left that out after the c(x,z) for clarity. I think bison should be able to handle this as long as the name in common_table_expression matches exactly the same things as whatever columnList uses. It can the merge the two parse paths, allowing it to see further. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] Recursive query syntax ambiguity
Martijn van Oosterhout kleptog@svana.org writes: Er, CYCLE isn't a binary operator, and users can't make binary operators that are words, so I'm not sure of the problem here. Well, the problem typically is not being able to tell whether an operator is supposed to be infix or postfix; hence keywords that can terminate arbitrary expressions usually have to be reserved words. However, now that I look at the syntax I think Greg may be misreading it. I see search or cycle clause ::= search clause | cycle clause | search clause cycle clause search clause ::= SEARCH recursive search order SET sequence column recursive search order ::= DEPTH FIRST BY sort specification list | BREADTH FIRST BY sort specification list sequence column ::= column name cycle clause ::= CYCLE cycle column list SET cycle mark column TO cycle mark value DEFAULT non-cycle mark value USING path column cycle column list ::= cycle column [ {commacycle column}...] cycle column ::= column name cycle mark column ::= column name path column ::= column name cycle mark value ::= value expression non-cycle mark value ::= value expression and so CYCLE would come *after* SET sequence column not before it. It looks to me like we'd have to promote SET to fully reserved status, but that probably isn't going to surprise anyone. DEFAULT and USING already are fully reserved. I don't see anything else here that looks like it should need to be reserved. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Recursive query syntax ambiguity
Tom Lane [EMAIL PROTECTED] writes: search clause ::= SEARCH recursive search order SET sequence column and so CYCLE would come *after* SET sequence column not before it. Ah, thanks, I had glossed right over the SET sequence column bit. The SET that I had was the SET cycle column which remains after the CYCLE keyword. It looks to me like we'd have to promote SET to fully reserved status, but that probably isn't going to surprise anyone. DEFAULT and USING already are fully reserved. I don't see anything else here that looks like it should need to be reserved. Having fixed that everything works fine with SET and WITH being reserved keywords. You didn't mean to say I should be able to leave WITH unreserved did you? Of course that was the easy part... Implementing non-recursive common table expressions should be fairly mechanical though I think I'll have lots of questions about how to get all the variable references fixed up. Non-recursive common table expressions are always non-correlated. They can refer to previous common table expressions but only to select from them either in the FROM clause or in subqueries. So as far as I can see they can just go in an InitPlan (or One-Time-Plan? I'm not sure what the distinction is) and be referred to in the same way. Recursive queries are of course a whole lot trickier. I've been slowly wrapping my head around them. So far I have a pretty good idea how to churn out a typical recursive query analogous to a CONNECT BY query. But the spec is a lot more ambitious than that. I haven't quite wrapped my head around the idea of mutually recursive or non-linearly-recursive queries yet. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Recursive query syntax ambiguity
Gregory Stark [EMAIL PROTECTED] writes: Having fixed that everything works fine with SET and WITH being reserved keywords. You didn't mean to say I should be able to leave WITH unreserved did you? I think we'd decided that was a lost cause, unless you see a way? Of course that was the easy part... Yup ;-) regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate