Re: Fix XML handling with DOCTYPE
Hi Chapman, On 2019-Sep-05, Chapman Flack wrote: > Are these on your radar to maybe backpatch in this round of activity? > > The latest patches I did for 11 and 10 are in > https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net Thanks! I just pushed these to those branches. I think we're finally done with these. Many thanks for your persistence. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix XML handling with DOCTYPE
Hi Alvaro, On 08/03/19 12:15, Alvaro Herrera wrote: >> I don't know if it's too late to get in the upcoming minor releases, >> but maybe it can, if it looks ok, or the next ones, if that's too rushed. > > Hmm, I'm travelling back home from a conference the weekend, so yeah I > think it would be rushed for me to handle for the upcoming set. But I > can look at it before the *next* set. Are these on your radar to maybe backpatch in this round of activity? The latest patches I did for 11 and 10 are in https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net Cheers, -Chap
Re: Fix XML handling with DOCTYPE
On 2019-Aug-03, Chapman Flack wrote: > I don't know if it's too late to get in the upcoming minor releases, > but maybe it can, if it looks ok, or the next ones, if that's too rushed. Hmm, I'm travelling back home from a conference the weekend, so yeah I think it would be rushed for me to handle for the upcoming set. But I can look at it before the *next* set. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix XML handling with DOCTYPE
On 04/01/19 17:34, Alvaro Herrera wrote: > I think there were some outright bugs in the docs, at least for > XMLTABLE, that maybe we should backpatch. If you have the energy to > cherry-pick a minimal doc update to 10/11, I offer to back-patch it. I don't know if this fits your intention for "minimal". What I've done is taken the doc commit made by Tom for 12 (12d46a), then revised it so it describes the unfixed behavior for the bugs whose fixes weren't backpatched to 11 or 10. I don't know if it's too late to get in the upcoming minor releases, but maybe it can, if it looks ok, or the next ones, if that's too rushed. 11.patch applies cleanly to 11, 10.patch to 10. I've confirmed the 11 docs build successfully, but without sgml tools, I haven't confirmed that for 10. Regards, -Chap >From f55fb4daa47ed249e87bc417b111e842403fc1a9 Mon Sep 17 00:00:00 2001 From: nobody Date: Fri, 2 Aug 2019 22:47:10 -0400 Subject: [PATCH] Improve documentation about our XML functionality. Add a section explaining how our XML features depart from current versions of the SQL standard. Update and clarify the descriptions of some XML functions. Chapman Flack, reviewed by Ryan Lambert Discussion: https://postgr.es/m/5bd1284c.1010...@anastigmatix.net Discussion: https://postgr.es/m/5c81f8c0.6090...@anastigmatix.net Discussion: https://postgr.es/m/can-v+g-6jquqeqz55q3toxen6d5ez5uvzl4vr+8ktvjkj31...@mail.gmail.com This version for backpatching PG 11 and 10, taken from Tom's commit for 12, then edited to correctly describe behaviors that are fixed in 12 but still broken in 11 and 10. --- doc/src/sgml/datatype.sgml | 5 + doc/src/sgml/features.sgml | 381 ++- doc/src/sgml/func.sgml | 184 + src/backend/catalog/sql_features.txt | 6 +- 4 files changed, 490 insertions(+), 86 deletions(-) diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 401a2f0..fa505e0 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4228,6 +4228,11 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11 value is a full document or only a content fragment. + +Limits and compatibility notes for the xml data type +can be found in . + + Creating XML Values diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml index 6c22d69..253ec87 100644 --- a/doc/src/sgml/features.sgml +++ b/doc/src/sgml/features.sgml @@ -16,7 +16,8 @@ Language SQL. A revised version of the standard is released from time to time; the most recent update appearing in 2011. The 2011 version is referred to as ISO/IEC 9075:2011, or simply as SQL:2011. - The versions prior to that were SQL:2008, SQL:2003, SQL:1999, and SQL-92. Each version + The versions prior to that were SQL:2008, SQL:2006, SQL:2003, SQL:1999, + and SQL-92. Each version replaces the previous one, so claims of conformance to earlier versions have no official merit. PostgreSQL development aims for @@ -155,4 +156,382 @@ + + XML Limits and Conformance to SQL/XML + + +SQL/XML +limits and conformance + + + +Significant revisions to the XML-related specifications in ISO/IEC 9075-14 +(SQL/XML) were introduced with SQL:2006. +PostgreSQL's implementation of the XML data +type and related functions largely follows the earlier 2003 edition, +with some borrowing from later editions. In particular: + + + + Where the current standard provides a family of XML data types + to hold document or content in + untyped or XML Schema-typed variants, and a type + XML(SEQUENCE) to hold arbitrary pieces of XML content, + PostgreSQL provides the single + xml type, which can hold document or + content. There is no equivalent of the + standard's sequence type. + + + + + + PostgreSQL provides two functions + introduced in SQL:2006, but in variants that use the XPath 1.0 + language, rather than XML Query as specified for them in the + standard. + + + + + + +This section presents some of the resulting differences you may encounter. + + + +Queries are restricted to XPath 1.0 + + + The PostgreSQL-specific functions + xpath() and xpath_exists() + query XML documents using the XPath language. + PostgreSQL also provides XPath-only variants + of the standard functions XMLEXISTS and + XMLTABLE, which officially use + the XQuery language. For all of these functions, + PostgreSQL relies on the + libxml2 library, which provides only XPath 1.0. + + + + There is a strong connection between the XQuery language and XPath + versions 2.0 and later: any expression that is syntactically valid and + executes successfully in both produces the same result (with a minor + exception for expressions containing numeric
Re: Fix XML handling with DOCTYPE
On 04/01/19 17:34, Alvaro Herrera wrote: > I think there were some outright bugs in the docs, at least for > XMLTABLE, that maybe we should backpatch. If you have the energy to > cherry-pick a minimal doc update to 10/11, I offer to back-patch it. I'll see what I can do. There's breathing room for that after the end of the CF, right? It seems to me that the conformance-appendix part is worth using, along with all of the clarifications in datatype.sgml and func.sgml except the ones clarifying fixed behavior, where the behavior fix wasn't backpatched. That'll be where the cherry-picking effort lies. Regards, -Chap
Re: Fix XML handling with DOCTYPE
On 2019-Apr-01, Chapman Flack wrote: > When I get a moment, I'll update the PostgreSQL vs. SQL/XML wiki page > to reflect the things that were fixed. I think there were some outright bugs in the docs, at least for XMLTABLE, that maybe we should backpatch. If you have the energy to cherry-pick a minimal doc update to 10/11, I offer to back-patch it. Thanks everyone for taking care of this! -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix XML handling with DOCTYPE
On 4/1/19 4:22 PM, Tom Lane wrote: > Chapman Flack writes: >> So, xml-functions-type-docfix-6.patch. > > Pushed with some light(?) copy-editing. > > I believe this closes out everything discussed in > > https://commitfest.postgresql.org/22/1872/ > > but I haven't gone through all three threads in detail. > Please confirm whether that CF entry can be closed or not. I think that does wrap up everything in the CF entry. Thanks! And thanks for the copy-edits; they do read better than what I came up with. When I get a moment, I'll update the PostgreSQL vs. SQL/XML wiki page to reflect the things that were fixed. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > So, xml-functions-type-docfix-6.patch. Pushed with some light(?) copy-editing. I believe this closes out everything discussed in https://commitfest.postgresql.org/22/1872/ but I haven't gone through all three threads in detail. Please confirm whether that CF entry can be closed or not. regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/27/19 19:27, Chapman Flack wrote: > A column marked FOR ORDINALITY will be populated with row numbers > matching the order in which the output rows appeared in the original > input XML document. > > I've been skimming right over it all this time, and that right there is > a glaring built-in reliance on the observable-but-disclaimed iteration > order of a libxml2 node-set. So, xml-functions-type-docfix-6.patch. I changed that language to say "populated with row numbers, starting with 1, in the order of nodes retrieved from the row_expression's result node-set." That's not such a terrible thing to have to say; in fact, it's the *correct* description for the standard, XQuery-based, XMLTABLE (where the language gives you control of the result sequence's order). I followed that with a short note saying since XPath 1.0 doesn't specify that order, relying on it is implementation-dependent, and linked to the existing Appendix D discussion. I would have like to link directly to the , but of course doesn't know what to call that, so I linked to the instead. Regards, -Chap diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 52c28e7..0aed14c 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4219,6 +4219,12 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11 value is a full document or only a content fragment. + +Limits and compatibility notes for the xml data type +in PostgreSQL can be found in +. + + Creating XML Values diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml index 6c22d69..e8015a9 100644 --- a/doc/src/sgml/features.sgml +++ b/doc/src/sgml/features.sgml @@ -16,7 +16,8 @@ Language SQL. A revised version of the standard is released from time to time; the most recent update appearing in 2011. The 2011 version is referred to as ISO/IEC 9075:2011, or simply as SQL:2011. - The versions prior to that were SQL:2008, SQL:2003, SQL:1999, and SQL-92. Each version + The versions prior to that were SQL:2008, SQL:2006, SQL:2003, SQL:1999, + and SQL-92. Each version replaces the previous one, so claims of conformance to earlier versions have no official merit. PostgreSQL development aims for @@ -155,4 +156,329 @@ + + XML Limits and Conformance to SQL/XML + + +SQL/XML +limits and conformance + + + +Significant revisions to the ISO/IEC 9075-14 XML-related specifications +(SQL/XML) were introduced with SQL:2006. The +PostgreSQL implementation of the XML data type +and related functions largely follows the earlier, 2003 edition, with some +borrowing from the later editions. In particular: + + + + Where the current standard provides a family of XML data types + to hold document or content in + untyped or XML Schema-typed variants, and a type + XML(SEQUENCE) to hold arbitrary pieces of XML content, + PostgreSQL provides the single + xml type, which can hold document or + content, and no equivalent of the sequence + type. + + + + + + PostgreSQL provides two functions introduced + in SQL:2006, but in variants that use the language XPath 1.0, rather than + XML Query as specified for them in the standard. + + + + + + +This section presents some of the resulting differences you may encounter. + + + +Queries restricted to XPath 1.0 + + + The PostgreSQL-specific functions + xpath and xpath_exists query + XML documents using the XPath language, and + PostgreSQL also provides XPath-only variants of + the standard functions XMLEXISTS and + XMLTABLE, which officially use + the XQuery language. For all of these functions, + PostgreSQL relies on the + libxml2 library, which provides only XPath 1.0. + + + + There is a strong connection between the XQuery language and XPath versions + 2.0 and later: any expression that is syntactically valid and executes + successfully in both produces the same result (with a minor exception for + expressions containing numeric character references or predefined entity + references, which XQuery replaces with the corresponding character while + XPath leaves them alone). But there is no such connection between XPath 1.0 + and XQuery or the later XPath versions; it was an earlier language and + differs in many respects. + + + + There are two categories of limitation to keep in mind: the restriction + from XQuery to XPath for the functions specified in the SQL standard, and + the restriction of XPath to version 1.0 for both the standard and the + PostgreSQL-specific functions. + + + + Restriction of XQuery to XPath + + + Features of XQuery beyond those of XPath include: + + + + + XQuery expressions can construct and
Re: Fix XML handling with DOCTYPE
On 03/27/19 19:07, Chapman Flack wrote: > xml-functions-type-docfix-5.patch attached, with node-sets instead of > nodesets, libxml2 instead of libxml, and parenthesized full sentences > now au naturel. > > I ended up turning the formerly-parenthesized note about libxml2's > node-set ordering into a DocBook : there is really something > parenthetical about it, with the official statement of node-set > element ordering being that there is none, and the description of > what the library happens to do being of possible interest, but set > apart, with the necessary caveats about relying on it. I have just suffered a giant sinking feeling upon re-reading this sentence in our XMLTABLE doc: A column marked FOR ORDINALITY will be populated with row numbers matching the order in which the output rows appeared in the original input XML document. I've been skimming right over it all this time, and that right there is a glaring built-in reliance on the observable-but-disclaimed iteration order of a libxml2 node-set. I'm a bit unsure what any clarifying language should even say. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Hi, xml-functions-type-docfix-5.patch attached, with node-sets instead of nodesets, libxml2 instead of libxml, and parenthesized full sentences now au naturel. I ended up turning the formerly-parenthesized note about libxml2's node-set ordering into a DocBook : there is really something parenthetical about it, with the official statement of node-set element ordering being that there is none, and the description of what the library happens to do being of possible interest, but set apart, with the necessary caveats about relying on it. Spotted and fixed a couple more typos in the process. Regards, -Chap diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 52c28e7..0aed14c 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4219,6 +4219,12 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11 value is a full document or only a content fragment. + +Limits and compatibility notes for the xml data type +in PostgreSQL can be found in +. + + Creating XML Values diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml index 6c22d69..095fcb9 100644 --- a/doc/src/sgml/features.sgml +++ b/doc/src/sgml/features.sgml @@ -16,7 +16,8 @@ Language SQL. A revised version of the standard is released from time to time; the most recent update appearing in 2011. The 2011 version is referred to as ISO/IEC 9075:2011, or simply as SQL:2011. - The versions prior to that were SQL:2008, SQL:2003, SQL:1999, and SQL-92. Each version + The versions prior to that were SQL:2008, SQL:2006, SQL:2003, SQL:1999, + and SQL-92. Each version replaces the previous one, so claims of conformance to earlier versions have no official merit. PostgreSQL development aims for @@ -155,4 +156,329 @@ + + XML Limits and Conformance to SQL/XML + + +SQL/XML +limits and conformance + + + +Significant revisions to the ISO/IEC 9075-14 XML-related specifications +(SQL/XML) were introduced with SQL:2006. The +PostgreSQL implementation of the XML data type +and related functions largely follows the earlier, 2003 edition, with some +borrowing from the later editions. In particular: + + + + Where the current standard provides a family of XML data types + to hold document or content in + untyped or XML Schema-typed variants, and a type + XML(SEQUENCE) to hold arbitrary pieces of XML content, + PostgreSQL provides the single + xml type, which can hold document or + content, and no equivalent of the sequence + type. + + + + + + PostgreSQL provides two functions introduced + in SQL:2006, but in variants that use the language XPath 1.0, rather than + XML Query as specified for them in the standard. + + + + + + +This section presents some of the resulting differences you may encounter. + + + +Queries restricted to XPath 1.0 + + + The PostgreSQL-specific functions + xpath and xpath_exists query + XML documents using the XPath language, and + PostgreSQL also provides XPath-only variants of + the standard functions XMLEXISTS and + XMLTABLE, which officially use + the XQuery language. For all of these functions, + PostgreSQL relies on the + libxml2 library, which provides only XPath 1.0. + + + + There is a strong connection between the XQuery language and XPath versions + 2.0 and later: any expression that is syntactically valid and executes + successfully in both produces the same result (with a minor exception for + expressions containing numeric character references or predefined entity + references, which XQuery replaces with the corresponding character while + XPath leaves them alone). But there is no such connection between XPath 1.0 + and XQuery or the later XPath versions; it was an earlier language and + differs in many respects. + + + + There are two categories of limitation to keep in mind: the restriction + from XQuery to XPath for the functions specified in the SQL standard, and + the restriction of XPath to version 1.0 for both the standard and the + PostgreSQL-specific functions. + + + + Restriction of XQuery to XPath + + + Features of XQuery beyond those of XPath include: + + + + + XQuery expressions can construct and return new XML nodes, in addition + to all possible XPath values. XPath can introduce and return values of + the atomic types (numbers, strings, and so on) but can only return XML + nodes already present in documents supplied as input to the expression. + + + + + + XQuery has control constructs for iteration, sorting, and grouping. + + + + + + XQuery allows the declaration and use of
Re: Fix XML handling with DOCTYPE
On 3/27/19 9:31 AM, Alvaro Herrera wrote: > Everyone calls it "libxml2" nowadays. Let's just use that and avoid any > possible confusion. If some libxml3 emerges one day, it's quite likely > we'll need to revise much more than our docs in order to use it. That's persuasive to me. I'll change the references to say libxml2 and let a committer serve as tiebreaker. >> [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes >> in no particular order" > > What this means is "we don't guarantee any specific order". It's like a > query without ORDER BY: you may currently always get document order, but > if you upgrade the library one day, it's quite possible to get the nodes > in another order and you'll not get a refund. So you (the user) should > not rely on the order, or at least be mindful that it may change in the > future. Exactly. I called the behavior "counter-documented" to distinguish this from the usual "undocumented" case, where you notice that a library is behaving in a way you like, but its docs are utterly silent on the matter, so you know you're going out on a limb to count on what you've noticed. In this case, you can notice the handy behavior but the doc *comes right out and disclaims it* so if you count on it, you're going out on a limb that has no bark left and looks punky. And yet it seems worthwhile to mention how the library does in fact seem to behave, because you might well be in the situation of porting code over from SQL/XML:2006+ or XQuery or XPath 2+, or those are the languages you've learned, so you may have order assumptions you've made, and be surprised that XPath 1 doesn't let you make them, and at least we can say "in a pinch, if you don't mind standing on this punky limb here, you may be able to use the code you've got without having to refactor every XMLTABLE() or xpath() into something wrapped in an outer SQL query with ORDER BY. You just don't get your money back if a later library upgrade changes the order." The wiki page remembers[1] that I had tried some pretty gnarly XPath 1 queries to see if I could make libxml2 return things in a different order, but no, got document order every time. Regards, -Chap [1] https://www.postgresql.org/message-id/5C465A65.4030305%40anastigmatix.net
Re: Fix XML handling with DOCTYPE
On 2019-Mar-27, Chapman Flack wrote: > On 03/26/19 21:39, Ryan Lambert wrote: > > Should this be clarified as the libxml2 library? That's what I installed > > to build postgres from source (Ubuntu 16/18). If it is the libxml library > > and the "2" is irrelevant > > That's a good catch. I'm not actually sure whether there is any "libxml" > library that isn't libxml2. Maybe there was once and nobody admits to > hanging out with it. Most Google hits on "libxml" seem to be modules > that have libxml in their names and libxml2 as their actual dependency. > > Perl XML:LibXML: "This module is an interface to libxml2" > Haskell libxml: "Binding to libxml2" > libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings > for the GNOME Libxml2 ..." > > --with-libxml is the PostgreSQL configure option to make it use libxml2. > > The very web page http://xmlsoft.org/index.html says "The XML C parser > and toolkit of Gnome: libxml" and is all about libxml2. > > So I think I was unsure what convention to follow, and threw up my hands > and went with libxml. I could just as easily throw them up and go with > libxml2. Which do you think would be better? Daniel Veillard actually had libxml version 1 in that repository (mostly of GNOME provenance, it seems, put together during some W3C meeting in 1998). The version number changed to 2 sometime during year 2000. Version 1 was mostly abandoned at that point, and for some reason everyone keeps using "libxml2" as the name as though it was a different thing from "libxml". I suppose the latter name is just too generic, or because they wanted to differentiate from the old (probably incompatible API) code. https://gitlab.gnome.org/GNOME/libxml2/tree/LIB_XML_1_BRANCH Everyone calls it "libxml2" nowadays. Let's just use that and avoid any possible confusion. If some libxml3 emerges one day, it's quite likely we'll need to revise much more than our docs in order to use it. > On 03/26/19 23:52, Tom Lane wrote: > > Do we need to mention that at all? If you're not building from source, > > it doesn't seem very interesting ... but maybe I'm missing some reason > > why end users would care. > > The three places I've mentioned it were the ones where I thought users > might care: These seem relevant details. > If you agree, I should go through and fix my nodesets to be node-sets. +1 > [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes > in no particular order" What this means is "we don't guarantee any specific order". It's like a query without ORDER BY: you may currently always get document order, but if you upgrade the library one day, it's quite possible to get the nodes in another order and you'll not get a refund. So you (the user) should not rely on the order, or at least be mindful that it may change in the future. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix XML handling with DOCTYPE
Thanks for putting up with a new reviewer! --with-libxml is the PostgreSQL configure option to make it use libxml2. > > The very web page http://xmlsoft.org/index.html says "The XML C parser > and toolkit of Gnome: libxml" and is all about libxml2. > > So I think I was unsure what convention to follow, and threw up my hands > and went with libxml. I could just as easily throw them up and go with > libxml2. Which do you think would be better? I think leaving it as libxml makes sense with all that. Good point that --with-libxml is used to build so I think staying with that works and is consistent. I agree that having this point included does clarify the how and why of the limitations of this implementation. I also over-parenthesize so I'm used to looking for that in my own writing. The full sentences were the ones that seemed excessive to me, I think the others are ok and I won't nit-pick either way on those (unless you want me to!). If you agree, I should go through and fix my nodesets to be node-sets. Yes, I like node-sets better, especially knowing it conforms to the spec's language. Thanks, *Ryan Lambert* On Tue, Mar 26, 2019 at 11:05 PM Chapman Flack wrote: > On 03/26/19 21:39, Ryan Lambert wrote: > > The following review has been posted through the commitfest application: > > make installcheck-world: tested, passed > > Implements feature: tested, passed > > Spec compliant: not tested > > Documentation:tested, passed > > Thanks for the review! > > > I have two recommendations for features.sgml. You state: > > > >> relies on the libxml library > > > > Should this be clarified as the libxml2 library? That's what I installed > > to build postgres from source (Ubuntu 16/18). If it is the libxml > library > > and the "2" is irrelevant > > That's a good catch. I'm not actually sure whether there is any "libxml" > library that isn't libxml2. Maybe there was once and nobody admits to > hanging out with it. Most Google hits on "libxml" seem to be modules > that have libxml in their names and libxml2 as their actual dependency. > > Perl XML:LibXML: "This module is an interface to libxml2" > Haskell libxml: "Binding to libxml2" > libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings > for the GNOME Libxml2 ..." > > --with-libxml is the PostgreSQL configure option to make it use libxml2. > > The very web page http://xmlsoft.org/index.html says "The XML C parser > and toolkit of Gnome: libxml" and is all about libxml2. > > So I think I was unsure what convention to follow, and threw up my hands > and went with libxml. I could just as easily throw them up and go with > libxml2. Which do you think would be better? > > On 03/26/19 23:52, Tom Lane wrote: > > Do we need to mention that at all? If you're not building from source, > > it doesn't seem very interesting ... but maybe I'm missing some reason > > why end users would care. > > The three places I've mentioned it were the ones where I thought users > might care: > > - why are we stuck at XPath 1.0? It's what we get from the library we use. > > - in what order do we get things out from a (hmm) node-set? Per XPath 1.0, >it's indeterminate (it's a set!), unlike XPath 2.0/XQuery where there's >a sequence type and you have order control. Observable behavior from >libxml2 (and you could certainly want to know this) is you get things > out >in document order, whether that's what you wanted or not, even though >this is undocumented, and even counter-documented[1], libxml2 behavior. >So it's an example of something you would fundamentally like to know, >where the only available answer depends precariously on the library >we happen to be using. > > - which limits in our implementation are inherent to the library, and >which are just current limits in our embedding of it? (Maybe this is >right at the border of what a user would care to know, but I know it's >a question that crosses my mind when I bonk into a limit I wasn't >expecting.) > > > There are a few places where the parenthesis around a block of text > > seem unnecessary. > > )blush( that's a long-standing wart in my writing ... seems I often think > in parentheses, then look and say "those aren't needed" and take them out, > only sometimes I don't. > > I skimmed just now and found a few instances of parenthesized whole > sentence: the one you quoted, and some (if argument is null, the result > is null), and (No rows will be produced if ). Shall I deparenthesize > them all? Did you have other instances in mind? > > > It seems you are standardizing from "node set" to "nodeset", is that > > the preferred nomenclature or preference? > > Another good catch. I remember consciously making a last pass to get them > all consistent, and I wanted them consistent with the spec, and I see now > I messed up. > > XPath 1.0 [2] has zero instances of "nodeset", two of "node set" and
Re: Fix XML handling with DOCTYPE
On 03/26/19 21:39, Ryan Lambert wrote: > I can't verify technical accuracy for many of the details (nuances between > XPath 1.0, et. al), but overall my experience with the XML functionality > lines up with what has been documented here. By the way, in case it's buried too far back in the email thread now, much of the early drafting for this happened on the wiki page https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards which includes a lot of reference links, including a nice paper by Andrew Eisenberg and Jim Melton that introduced the major changes from the SQL:2003 to :2006 editions of SQL/XML. Cheers, -Chap
Re: Fix XML handling with DOCTYPE
On 03/26/19 21:39, Ryan Lambert wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation:tested, passed Thanks for the review! > I have two recommendations for features.sgml. You state: > >> relies on the libxml library > > Should this be clarified as the libxml2 library? That's what I installed > to build postgres from source (Ubuntu 16/18). If it is the libxml library > and the "2" is irrelevant That's a good catch. I'm not actually sure whether there is any "libxml" library that isn't libxml2. Maybe there was once and nobody admits to hanging out with it. Most Google hits on "libxml" seem to be modules that have libxml in their names and libxml2 as their actual dependency. Perl XML:LibXML: "This module is an interface to libxml2" Haskell libxml: "Binding to libxml2" libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings for the GNOME Libxml2 ..." --with-libxml is the PostgreSQL configure option to make it use libxml2. The very web page http://xmlsoft.org/index.html says "The XML C parser and toolkit of Gnome: libxml" and is all about libxml2. So I think I was unsure what convention to follow, and threw up my hands and went with libxml. I could just as easily throw them up and go with libxml2. Which do you think would be better? On 03/26/19 23:52, Tom Lane wrote: > Do we need to mention that at all? If you're not building from source, > it doesn't seem very interesting ... but maybe I'm missing some reason > why end users would care. The three places I've mentioned it were the ones where I thought users might care: - why are we stuck at XPath 1.0? It's what we get from the library we use. - in what order do we get things out from a (hmm) node-set? Per XPath 1.0, it's indeterminate (it's a set!), unlike XPath 2.0/XQuery where there's a sequence type and you have order control. Observable behavior from libxml2 (and you could certainly want to know this) is you get things out in document order, whether that's what you wanted or not, even though this is undocumented, and even counter-documented[1], libxml2 behavior. So it's an example of something you would fundamentally like to know, where the only available answer depends precariously on the library we happen to be using. - which limits in our implementation are inherent to the library, and which are just current limits in our embedding of it? (Maybe this is right at the border of what a user would care to know, but I know it's a question that crosses my mind when I bonk into a limit I wasn't expecting.) > There are a few places where the parenthesis around a block of text > seem unnecessary. )blush( that's a long-standing wart in my writing ... seems I often think in parentheses, then look and say "those aren't needed" and take them out, only sometimes I don't. I skimmed just now and found a few instances of parenthesized whole sentence: the one you quoted, and some (if argument is null, the result is null), and (No rows will be produced if ). Shall I deparenthesize them all? Did you have other instances in mind? > It seems you are standardizing from "node set" to "nodeset", is that > the preferred nomenclature or preference? Another good catch. I remember consciously making a last pass to get them all consistent, and I wanted them consistent with the spec, and I see now I messed up. XPath 1.0 [2] has zero instances of "nodeset", two of "node set" and about six dozen of "node-set". The only appearances of "node set" without the hyphen are in a heading and its ToC entry. The stuff under that heading consistently uses node-set. It seems that's the XPath 1.0 term for sure. When I made my consistency pass, I must have been looking too recently in libxml2 C source, rather than the spec. On 03/26/19 23:52, Tom Lane wrote: > That seemed a bit jargon-y to me too. If that's standard terminology > in the XML world, maybe it's fine; but I'd have stuck with "node set". It really was my intention (though I flubbed it) to use XPath 1.0's term for XPath 1.0's concept; in my doc philosophy, that gives readers the most breadcrumbs to follow for the rest of the details if they want them. "Node set" might be some sort of squishy expository concept I'm using, but node-set is a thing, in a spec. If you agree, I should go through and fix my nodesets to be node-sets. I do think the terminology matters here, especially because of the differences between what you can do with a node-set (XPath 1.0 thing) and with a sequence (XPath 2.0+,XQuery,SQL/XML thing). Let me know what you'd like best on these points and I'll revise the patch. Regards, -Chap [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes in no particular order" [2]
Re: Fix XML handling with DOCTYPE
Ryan Lambert writes: > I have two recommendations for features.sgml. You state: >> relies on the libxml library > Should this be clarified as the libxml2 library? That's what I installed to > build postgres from source (Ubuntu 16/18). If it is the libxml library and > the "2" is irrelevant, or it it works with either, it might be nice to have a > clarifying note to indicate that. Do we need to mention that at all? If you're not building from source, it doesn't seem very interesting ... but maybe I'm missing some reason why end users would care. > It seems you are standardizing from "node set" to "nodeset", is that the > preferred nomenclature or preference? That seemed a bit jargon-y to me too. If that's standard terminology in the XML world, maybe it's fine; but I'd have stuck with "node set". regards, tom lane
Re: Fix XML handling with DOCTYPE
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation:tested, passed Overall I think this patch [1] improves the docs and help explain edge case functionality that many of us, myself included, don't fully understand. I can't verify technical accuracy for many of the details (nuances between XPath 1.0, et. al), but overall my experience with the XML functionality lines up with what has been documented here. Adding the clear declaration of "XPath 1.0" instead of the generic "XPath" helps make it clear of the functional differences and helps frame the differences for new users. I have two recommendations for features.sgml. You state: > relies on the libxml library Should this be clarified as the libxml2 library? That's what I installed to build postgres from source (Ubuntu 16/18). If it is the libxml library and the "2" is irrelevant, or it it works with either, it might be nice to have a clarifying note to indicate that. There are a few places where the parenthesis around a block of text seem unnecessary. I don't think parens serve a purpose when a full sentence is contained within. > (The libxml library does seem to always return nodesets to PostgreSQL with > their members in the same relative order they had in the input document; it > does not commit to this behavior, and an XPath 1.0 expression cannot control > it.) It seems you are standardizing from "node set" to "nodeset", is that the preferred nomenclature or preference? Hopefully this helps! Thanks, Ryan Lambert [1] https://www.postgresql.org/message-id/attachment/100016/xml-functions-type-docfix-4.patch The new status of this patch is: Waiting on Author
Re: Fix XML handling with DOCTYPE
Ryan Lambert writes: > Is the xml-functions-type-docfix-4.patch [1] the one needing review? I'll > test applying it and review the changes in better detail. Is there a > section in the docs that shows how to verify if the updated pages render > properly? I would assume the pages are build when installing from source. Plain old "make all" doesn't build the docs. See https://www.postgresql.org/docs/devel/docguide.html for tooling prerequisites and build instructions. (Usually people just build the HTML docs and look at them with a browser; the other doc formats are less interesting.) regards, tom lane
Re: Fix XML handling with DOCTYPE
Ok, I'll give it a go. > If you happened to feel moved to look over a documentation patch, that > would be what this CF entry most needs in the waning days of the > commitfest. Is the xml-functions-type-docfix-4.patch [1] the one needing review? I'll test applying it and review the changes in better detail. Is there a section in the docs that shows how to verify if the updated pages render properly? I would assume the pages are build when installing from source. Ryan [1] https://www.postgresql.org/message-id/attachment/100016/xml-functions-type-docfix-4.patch On Mon, Mar 25, 2019 at 4:52 PM Chapman Flack wrote: > On 03/25/19 18:03, Ryan Lambert wrote: > > The following review has been posted through the commitfest application: > > make installcheck-world: tested, passed > > Implements feature: tested, passed > > Spec compliant: not tested > > Documentation:not tested > > Hi, > > Thanks for the review! Yes, that part of this commitfest entry has been > committed already and will appear in the next minor releases of those > branches. > > That leaves only one patch in this commitfest entry that is still in > need of review, namely the update to the documentation. > > If you happened to feel moved to look over a documentation patch, that > would be what this CF entry most needs in the waning days of the > commitfest. > > There seem to be community members reluctant to review it because of not > feeling sufficiently expert in XML to scrutinize every technical detail, > but there are other valuable angles for documentation review. (And the > reason there *is* a documentation patch is the plentiful room for > improvement in the documentation that's already there, so as far as > reviewing goes, the old yarn about the two guys, the running shoes, and > the bear comes to mind.) > > I can supply pointers to specs, etc., for anyone who does see some > technical > details in the patch and has questions about them. > > Regards, > -Chap >
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > Thanks for the review! Yes, that part of this commitfest entry has been > committed already and will appear in the next minor releases of those > branches. Indeed, thanks for verifying that this fixes your problem. > That leaves only one patch in this commitfest entry that is still in > need of review, namely the update to the documentation. Yeah. Since it *is* in need of review, I changed the CF entry's state back to Needs Review. regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/25/19 18:03, Ryan Lambert wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation:not tested Hi, Thanks for the review! Yes, that part of this commitfest entry has been committed already and will appear in the next minor releases of those branches. That leaves only one patch in this commitfest entry that is still in need of review, namely the update to the documentation. If you happened to feel moved to look over a documentation patch, that would be what this CF entry most needs in the waning days of the commitfest. There seem to be community members reluctant to review it because of not feeling sufficiently expert in XML to scrutinize every technical detail, but there are other valuable angles for documentation review. (And the reason there *is* a documentation patch is the plentiful room for improvement in the documentation that's already there, so as far as reviewing goes, the old yarn about the two guys, the running shoes, and the bear comes to mind.) I can supply pointers to specs, etc., for anyone who does see some technical details in the patch and has questions about them. Regards, -Chap
Re: Fix XML handling with DOCTYPE
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation:not tested I tested the master branch (commit 8edd0e7), REL_11_STABLE (commit 24df866) and REL9_6_STABLE (commit 5097368) and verified functionality. This patch fixes the bug I had reported [1] previously. With this in the stable branches is it safe to assume this will be included with the next minor releases? Thanks for your work on this!! Ryan [1] https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org The new status of this patch is: Ready for Committer
Re: Fix XML handling with DOCTYPE
Perfect, thank you! I do remember seeing that message now, but hadn't understood what it really meant. I will test later today. Thanks *Ryan* On Sun, Mar 24, 2019 at 9:19 PM Tom Lane wrote: > Chapman Flack writes: > > On 03/24/19 21:04, Ryan Lambert wrote: > >> I am unable to get latest patches I found [1] to apply cleanly to > current > >> branches. It's possible I missed the latest patches so if I'm using the > >> wrong ones please let me know. I tried against master, 11.2 stable and > the > >> 11.2 tag with similar results. > > > Tom pushed the content-with-DOCTYPE patch, it's now included in master, > > REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and > > REL9_4_STABLE. > > Right. If you want to test (and please do!) you could grab the relevant > branch tip from our git repo, or download a nightly snapshot tarball from > > https://www.postgresql.org/ftp/snapshot/ > > regards, tom lane >
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > On 03/24/19 21:04, Ryan Lambert wrote: >> I am unable to get latest patches I found [1] to apply cleanly to current >> branches. It's possible I missed the latest patches so if I'm using the >> wrong ones please let me know. I tried against master, 11.2 stable and the >> 11.2 tag with similar results. > Tom pushed the content-with-DOCTYPE patch, it's now included in master, > REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and > REL9_4_STABLE. Right. If you want to test (and please do!) you could grab the relevant branch tip from our git repo, or download a nightly snapshot tarball from https://www.postgresql.org/ftp/snapshot/ regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/24/19 21:04, Ryan Lambert wrote: > I am unable to get latest patches I found [1] to apply cleanly to current > branches. It's possible I missed the latest patches so if I'm using the > wrong ones please let me know. I tried against master, 11.2 stable and the > 11.2 tag with similar results. Tom pushed the content-with-DOCTYPE patch, it's now included in master, REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and REL9_4_STABLE. The only patch that's left to be reviewed and applied is the documentation fix, latest in [1]. If you were interested in giving a review opinion on some XML documentation. Regards, -Chap [1] https://www.postgresql.org/message-id/5c96dbb5.2080...@anastigmatix.net
Re: Fix XML handling with DOCTYPE
I am unable to get latest patches I found [1] to apply cleanly to current branches. It's possible I missed the latest patches so if I'm using the wrong ones please let me know. I tried against master, 11.2 stable and the 11.2 tag with similar results. It's quite possible it's user error on my end, I am new to this process but didn't have issues with the previous patches when I tested those a couple weeks ago. [1] https://www.postgresql.org/message-id/5c8ecaa4.3090...@anastigmatix.net Ryan Lambert On Sat, Mar 23, 2019 at 5:07 PM Chapman Flack wrote: > On 03/23/19 18:22, Tom Lane wrote: > >> Out of curiosity, what further processing would you expect libxml to do? > > > > Hm, I'd have thought it'd try to parse the arguments to some extent, > > but maybe not. Does everybody reimplement attribute parsing for > > themselves when using PIs? > > Yeah, the content of a PI (whatever's after the target name) is left > all to be defined by whatever XML-using application might care about > that PI. > > It could have an attribute=value syntax inspired by XML elements, or > some other form entirely, but there'd just better not be any ?> in it. > > Regards, > -Chap >
Re: Fix XML handling with DOCTYPE
On 03/23/19 18:22, Tom Lane wrote: >> Out of curiosity, what further processing would you expect libxml to do? > > Hm, I'd have thought it'd try to parse the arguments to some extent, > but maybe not. Does everybody reimplement attribute parsing for > themselves when using PIs? Yeah, the content of a PI (whatever's after the target name) is left all to be defined by whatever XML-using application might care about that PI. It could have an attribute=value syntax inspired by XML elements, or some other form entirely, but there'd just better not be any ?> in it. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > On 03/23/19 16:59, Tom Lane wrote: >> You're not really validating that the input >> is something that libxml would accept, unless its processing of XML PIs >> is far stupider than I would expect it to be. > Out of curiosity, what further processing would you expect libxml to do? Hm, I'd have thought it'd try to parse the arguments to some extent, but maybe not. Does everybody reimplement attribute parsing for themselves when using PIs? regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/23/19 16:59, Tom Lane wrote: > Unicode-code-point numbers. I removed that, made some other changes to > bring the patch more in line with PG coding style, and pushed it. Thanks! It looks good. I'm content with the extra PI checking being gone. The magic Unicode-code-point numbers come straight from the XML standard; I couldn't make that stuff up. :) > > You're not really validating that the input > is something that libxml would accept, unless its processing of XML PIs > is far stupider than I would expect it to be. Out of curiosity, what further processing would you expect libxml to do? XML parsers are supposed to be transparent PI-preservers, except in the rare case of seeing a PI that actually means something to the embedding application, which isn't going to be the case for a database simply implementing an XML data type. The standard literally requires that the target must be a NAME, and can't match [Xx][Mm][Ll], and if there's whitespace and anything after that, there can't be an embedded ?> ... and that's it. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > I decided, for a first point of reference, to wear the green eyeshade and > write a pre-check that exactly implements the applicable rules. That gives > a starting point for simplifications that are probably safe. > For example, a bunch of lines at the end have to do with verifying the > content inside of a processing-instruction, after finding where it ends. > We could reasonably decide that, for the purpose of skipping it, knowing > where it ends is enough, as libxml will parse it next and report any errors > anyway. Yeah, I did not like that code too much, particularly not all the magic Unicode-code-point numbers. I removed that, made some other changes to bring the patch more in line with PG coding style, and pushed it. > That made me just want to try it now, and--surprise!--the messages from > libxml are not the same. So maybe I would lean to keeping the green-eyeshade > rules in the test, if you can stomach them, but I would understand taking > them out. I doubt anyone will care too much about whether error messages for bad XML input are exactly like what they were before; and even if someone does, I doubt that these extra tests would be enough to ensure that the messages don't change. You're not really validating that the input is something that libxml would accept, unless its processing of XML PIs is far stupider than I would expect it to be. regards, tom lane
Re: Fix XML handling with DOCTYPE
There might be too many different email threads on this with patches, but in case it went under the radar, xml-content-2006-3.patch appeared in my previous message on this thread[1]. It is based on a simple pre-check of the prefix of the input, determining which form of parse to apply. That may or may not be simpler than parse- once-save-error-parse-again-report-first-error, but IMV it's a more direct solution and clearer (the logic is clearly about "how do I determine the way this input should be parsed?" which is the problem on the table, rather than "how should I save and regurgitate this libxml error?" which turns the problem on the table to a different one). I decided, for a first point of reference, to wear the green eyeshade and write a pre-check that exactly implements the applicable rules. That gives a starting point for simplifications that are probably safe. For example, a bunch of lines at the end have to do with verifying the content inside of a processing-instruction, after finding where it ends. We could reasonably decide that, for the purpose of skipping it, knowing where it ends is enough, as libxml will parse it next and report any errors anyway. That would slightly violate my intention of sending input to (the parser that wasn't asked for) /only/ when it's completely clear (from the prefix we've seen) that that's where it should go. The relaxed version could do that in completely-clear cases and cases with an invalid PI ahead of what looks like a DTD. But you'd pretty much expect both parsers to produce the same message for a bad PI anyway. That made me just want to try it now, and--surprise!--the messages from libxml are not the same. So maybe I would lean to keeping the green-eyeshade rules in the test, if you can stomach them, but I would understand taking them out. Regards, -Chap [1] https://www.postgresql.org/message-id/5c8ecaa4.3090...@anastigmatix.net
Re: Fix XML handling with DOCTYPE
On 03/17/19 15:06, Tom Lane wrote: > The error message issue is indeed a concern, but I don't see why it's > complicated if you agree that > >> If the query asked for CONTENT, any error result should be one you could get >> when parsing as CONTENT. > > That just requires us to save the first error message and be sure to issue > that one not the DOCUMENT one, no? I confess I haven't looked hard yet at how to do that. The way errors come out of libxml is more involved than "here's a message", so there's a choice of (a) try to copy off that struct in a way that's sure to survive re-executing the parser, and then use the copy, or (b) generate a message right away from the structured information and save that, and I guess b might not be too bad; a might not be too bad, or it might, and slide right back into the kind of libxml-behavior-assumptions you're wanting to avoid. Meanwhile, here is a patch on the lines I proposed earlier, with a pre-check. Any performance hit that it could entail (which I'd really expect to be de minimis, though I haven't benchmarked) ought to be compensated by the strlen I changed to strnlen in parse_xml_decl (as there's really no need to run off and count the whole rest of the input just to know if 1, 2, 3, or 4 bytes are available to decode a UTF-8 char). ... and, yes, I know that could be an independent patch, and then the performance effect here should be measured from there. But it was near what I was doing anyway, so I included it here. Attaching both still-outstanding patches (this one and docfix) so the CF app doesn't lose one. Regards, -Chap diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index b462c06..94b46a6 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4282,16 +4282,21 @@ SET XML OPTION { DOCUMENT | CONTENT }; SET xmloption TO { DOCUMENT | CONTENT }; The default is CONTENT, so all forms of XML -data are allowed. +data are allowed except as noted below. - With the default XML option setting, you cannot directly cast - character strings to type xml if they contain a - document type declaration, because the definition of XML content - fragment does not accept them. If you need to do that, either - use XMLPARSE or change the XML option. + In the SQL:2006 and later standard, the CONTENT form + is a proper superset of the DOCUMENT form, and so the + default XML option setting would allow casting to XML from character + strings in either form. PostgreSQL, however, + uses the SQL:2003 definition in which CONTENT form + cannot contain a document type declaration. Therefore, there is no one + setting of the XML option that will allow casting to XML from every valid + character string. The default will work almost always, but for a document + with a DTD, it will be necessary to change the XML option or + use XMLPARSE specifying DOCUMENT. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 03859a7..0017aab 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -10140,8 +10140,13 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple + XML Functions + + XML Functions + + The functions and function-like expressions described in this section operate on values of type xml. Check . The particular behavior for individual data types is expected to evolve in order to align the - SQL and PostgreSQL data types with the XML Schema specification, - at which point a more precise description will appear. + PostgreSQL mappings with those specified in SQL:2006 and later, + as discussed in . @@ -10587,10 +10592,12 @@ SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab; - The function xmlexists returns true if the - XPath expression in the first argument returns any nodes, and - false otherwise. (If either argument is null, the result is - null.) + The function xmlexists evaluates an XPath 1.0 + expression (the first argument), with the passed value as its context item. + The function returns false if the result of that evaluation yields an + empty nodeset, true if it yields any other value. (The function returns + null if an argument is null.) A nonnull value passed as the context item + must be an XML document, not a content fragment or any non-XML value. @@ -10607,24 +10614,12 @@ SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY VALUE 'T The BY REF or BY VALUE clauses - have no effect in PostgreSQL, but are allowed - for compatibility with other implementations. Per the SQL - standard, the one that precedes any argument is required, and indicates - the default for arguments that follow, and one may follow any argument to - override the default. - PostgreSQL ignores BY REF - and
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > On 03/17/19 13:16, Tom Lane wrote: >> Do we need a pre-scan at all? > Without it, we double the time to a failure result in every case that > should actually fail, as well as in this one corner case that we want to > see succeed, and the question you posed earlier about which error message > to return becomes thornier. I have absolutely zero concern about whether it takes twice as long to detect bad input; nobody should be sending bad input if they're concerned about performance. (The costs of the ensuing transaction abort would likely dwarf xml_in's runtime in any case.) Besides, with what we're talking about doing here, (1) the extra runtime is consumed only in cases that would fail up to now, so nobody can complain about a performance regression; (2) doing a pre-scan *would* be a performance regression for cases that work today; not a large one we hope, but still... The error message issue is indeed a concern, but I don't see why it's complicated if you agree that > If the query asked for CONTENT, any error result should be one you could get > when parsing as CONTENT. That just requires us to save the first error message and be sure to issue that one not the DOCUMENT one, no? That's what we'd want to do from a backwards-compatibility standpoint anyhow, since that's the error message wording you'd get with today's code. regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/17/19 13:16, Tom Lane wrote: > Chapman Flack writes: >> What I was doing in the patch is the reverse: parsing with the expectation >> of CONTENT to see if a DTD gets tripped over. It isn't allowed for an >> element to precede a DTD, so that approach can be expected to fail fast >> if the other branch needs to be taken. > > Ah, right. I don't have any problem with trying the CONTENT approach > before the DOCUMENT approach rather than vice-versa. What I was concerned > about was adding a lot of assumptions about exactly how libxml would > report the failure. IMO a maximally-safe patch would just rearrange > things we're already doing without adding new things. > >> But a quick pre-scan for the same thing would have the same property, >> without the libxml dependencies that bother you here. Watch this space. > > Do we need a pre-scan at all? Without it, we double the time to a failure result in every case that should actually fail, as well as in this one corner case that we want to see succeed, and the question you posed earlier about which error message to return becomes thornier. If the query asked for CONTENT, any error result should be one you could get when parsing as CONTENT. If we switch and try parsing as DOCUMENT _because the input is claiming to have the form of a DOCUMENT_, then it's defensible to return errors explaining why it's not a DOCUMENT ... but not in the general case of just throwing DOCUMENT at it any time CONTENT parse fails. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > What I was doing in the patch is the reverse: parsing with the expectation > of CONTENT to see if a DTD gets tripped over. It isn't allowed for an > element to precede a DTD, so that approach can be expected to fail fast > if the other branch needs to be taken. Ah, right. I don't have any problem with trying the CONTENT approach before the DOCUMENT approach rather than vice-versa. What I was concerned about was adding a lot of assumptions about exactly how libxml would report the failure. IMO a maximally-safe patch would just rearrange things we're already doing without adding new things. > But a quick pre-scan for the same thing would have the same property, > without the libxml dependencies that bother you here. Watch this space. Do we need a pre-scan at all? regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/17/19 11:45, Tom Lane wrote: > Chapman Flack writes: >> On 03/16/19 17:21, Tom Lane wrote: >>> (1) always try to parse as document. If successful, we're done. >>> (2) otherwise, if allowed by xmloption, try to parse using our > >> What I don't like about that is that (a) the input could be >> arbitrarily long and complex to parse (not that you can't imagine >> a database populated with lots of short little XML snippets, but >> at the same time, a query could quite plausibly deal in yooge ones), >> and (b), step (1) could fail at the last byte of the input, followed >> by total reparsing as (2). > > That doesn't seem particularly likely to me: based on what's been > said here, I'd expect parsing with the wrong expectation to usually > fail near the start of the input. In any case, the other patch > also requires repeat parsing, no? It's just doing that in a different > set of cases. I'll do up a version with the open-coded prescan I proposed last night. Whether parsing with the wrong expectation is likely to fail near the start of the input depends on both the input and the expectation. If your expectation is DOCUMENT and the input is CONTENT, it's possible for the determining difference to be something that follows the first element, and a first element can be (and often is) nearly all of the input. What I was doing in the patch is the reverse: parsing with the expectation of CONTENT to see if a DTD gets tripped over. It isn't allowed for an element to precede a DTD, so that approach can be expected to fail fast if the other branch needs to be taken. But a quick pre-scan for the same thing would have the same property, without the libxml dependencies that bother you here. Watch this space. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > On 03/16/19 17:21, Tom Lane wrote: >> Hm, so, maybe just >> >> (1) always try to parse as document. If successful, we're done. >> >> (2) otherwise, if allowed by xmloption, try to parse using our >> current logic for the CONTENT case. > What I don't like about that is that (a) the input could be > arbitrarily long and complex to parse (not that you can't imagine > a database populated with lots of short little XML snippets, but > at the same time, a query could quite plausibly deal in yooge ones), > and (b), step (1) could fail at the last byte of the input, followed > by total reparsing as (2). That doesn't seem particularly likely to me: based on what's been said here, I'd expect parsing with the wrong expectation to usually fail near the start of the input. In any case, the other patch also requires repeat parsing, no? It's just doing that in a different set of cases. The reason I'm pressing you for a simpler patch is that dump/reload failures are pretty bad, so ideally we'd find a fix that we're comfortable with back-patching into the released branches. Personally I would never dare to back-patch the proposed patch: it's too complex, so it's not real clear that it doesn't have unwanted side-effects, and it's not at all certain that there aren't libxml version dependencies in it. (Maybe another committer with more familiarity with libxml would evaluate the risks differently, but I doubt it.) But I think that something close to what I sketched above would pass muster as safe-to-backpatch. regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/16/19 18:33, Chapman Flack wrote: > The pre-scan is a simple linear search and will ordinarily say yes or no > within a couple dozen characters--you could *have* an input with 20k of > leading whitespace and comments, but it's hardly the norm. Just trying to If the available regexp functions want to start by munging the entire input into a pg_wchar array, then it may be better to implement the pre-scan as open code, the same way parse_xml_decl() is already implemented. Given that parse_xml_decl() already covers the first optional thing that can precede the doctype, the remaining scan routine would only need to recognize comments, PIs, and whitespace. That would be pretty straightforward. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Thank you both! I had glanced at that item in the commitfest but didn't notice it would fix this issue. I'll try to test/review this before the end of the month, much better than starting from scratch myself. A quick glance at the patch looks logical and looks like it should work for my use case. Thanks, Ryan Lambert On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack wrote: > On 03/16/19 17:21, Tom Lane wrote: > > Chapman Flack writes: > >> On 03/16/19 16:55, Tom Lane wrote: > >>> What do you think of the idea I just posted about parsing off the > DOCTYPE > >>> thing for ourselves, and not letting libxml see it? > > > >> The principled way of doing that would be to pre-parse to find a > DOCTYPE, > >> and if there is one, leave it there and parse the input as we do for > >> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy > >> the 'document' syntax requirements, and per SQL/XML:2006-and-later, > >> 'content' is a proper superset of 'document', so if we were asked for > >> 'content' and can successfully parse it as 'document', we're good, > >> and if we see a DOCTYPE and yet it incurs a parse error as 'document', > >> well, that's what needed to happen. > > > > Hm, so, maybe just > > > > (1) always try to parse as document. If successful, we're done. > > > > (2) otherwise, if allowed by xmloption, try to parse using our > > current logic for the CONTENT case. > > What I don't like about that is that (a) the input could be > arbitrarily long and complex to parse (not that you can't imagine > a database populated with lots of short little XML snippets, but > at the same time, a query could quite plausibly deal in yooge ones), > and (b), step (1) could fail at the last byte of the input, followed > by total reparsing as (2). > > I think the safer structure is clearly that of the current patch, > modulo whether the "has a DOCTYPE" test is done by libxml itself > (with the assumptions you don't like) or by a pre-scan. > > So the current structure is: > > restart: > asked for document? > parse as document, or fail > else asked for content: > parse as content > failed? > because DOCTYPE? restart as if document > else fail > > and a pre-scan structure could be very similar: > > restart: > asked for document? > parse as document, or fail > else asked for content: > pre-scan finds DOCTYPE? > restart as if document > else parse as content, or fail > > The pre-scan is a simple linear search and will ordinarily say yes or no > within a couple dozen characters--you could *have* an input with 20k of > leading whitespace and comments, but it's hardly the norm. Just trying to > parse as 'document' first could easily parse a large fraction of the input > before discovering it's followed by something that can't follow a document > element. > > Regards, > -Chap >
Re: Fix XML handling with DOCTYPE
On 03/16/19 17:21, Tom Lane wrote: > Chapman Flack writes: >> On 03/16/19 16:55, Tom Lane wrote: >>> What do you think of the idea I just posted about parsing off the DOCTYPE >>> thing for ourselves, and not letting libxml see it? > >> The principled way of doing that would be to pre-parse to find a DOCTYPE, >> and if there is one, leave it there and parse the input as we do for >> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy >> the 'document' syntax requirements, and per SQL/XML:2006-and-later, >> 'content' is a proper superset of 'document', so if we were asked for >> 'content' and can successfully parse it as 'document', we're good, >> and if we see a DOCTYPE and yet it incurs a parse error as 'document', >> well, that's what needed to happen. > > Hm, so, maybe just > > (1) always try to parse as document. If successful, we're done. > > (2) otherwise, if allowed by xmloption, try to parse using our > current logic for the CONTENT case. What I don't like about that is that (a) the input could be arbitrarily long and complex to parse (not that you can't imagine a database populated with lots of short little XML snippets, but at the same time, a query could quite plausibly deal in yooge ones), and (b), step (1) could fail at the last byte of the input, followed by total reparsing as (2). I think the safer structure is clearly that of the current patch, modulo whether the "has a DOCTYPE" test is done by libxml itself (with the assumptions you don't like) or by a pre-scan. So the current structure is: restart: asked for document? parse as document, or fail else asked for content: parse as content failed? because DOCTYPE? restart as if document else fail and a pre-scan structure could be very similar: restart: asked for document? parse as document, or fail else asked for content: pre-scan finds DOCTYPE? restart as if document else parse as content, or fail The pre-scan is a simple linear search and will ordinarily say yes or no within a couple dozen characters--you could *have* an input with 20k of leading whitespace and comments, but it's hardly the norm. Just trying to parse as 'document' first could easily parse a large fraction of the input before discovering it's followed by something that can't follow a document element. Regards, -Chap
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > On 03/16/19 16:55, Tom Lane wrote: >> What do you think of the idea I just posted about parsing off the DOCTYPE >> thing for ourselves, and not letting libxml see it? > The principled way of doing that would be to pre-parse to find a DOCTYPE, > and if there is one, leave it there and parse the input as we do for > 'document'. Per XML, if there is a DOCTYPE, the document must satisfy > the 'document' syntax requirements, and per SQL/XML:2006-and-later, > 'content' is a proper superset of 'document', so if we were asked for > 'content' and can successfully parse it as 'document', we're good, > and if we see a DOCTYPE and yet it incurs a parse error as 'document', > well, that's what needed to happen. Hm, so, maybe just (1) always try to parse as document. If successful, we're done. (2) otherwise, if allowed by xmloption, try to parse using our current logic for the CONTENT case. This avoids adding any new assumptions about how libxml acts, which is what I was hoping to achieve. One interesting question is which error to report if both (1) and (2) fail. regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/16/19 16:55, Tom Lane wrote: > What do you think of the idea I just posted about parsing off the DOCTYPE > thing for ourselves, and not letting libxml see it? The principled way of doing that would be to pre-parse to find a DOCTYPE, and if there is one, leave it there and parse the input as we do for 'document'. Per XML, if there is a DOCTYPE, the document must satisfy the 'document' syntax requirements, and per SQL/XML:2006-and-later, 'content' is a proper superset of 'document', so if we were asked for 'content' and can successfully parse it as 'document', we're good, and if we see a DOCTYPE and yet it incurs a parse error as 'document', well, that's what needed to happen. The DOCTYPE can appear arbitrarily far in, but the only things that can precede it are the XML decl, whitespace, XML comments, and XML processing instructions. None of those things nest, so the preceding stuff makes a regular language, and a regular expression that matches any amount of that stuff ending in
Re: Fix XML handling with DOCTYPE
Chapman Flack writes: > A patch for your issue is currently registered in the 2019-03 commitfest[1]. Oh! I apologize for saying nobody was working on this issue. Taking a very quick look at your patch, though, I dunno --- it seems like it adds a boatload of new assumptions about libxml's data structures and error-reporting behavior. I'm sure it works for you, but will it work across a wide range of libxml versions? What do you think of the idea I just posted about parsing off the DOCTYPE thing for ourselves, and not letting libxml see it? regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/16/19 16:42, Tom Lane wrote: > Ryan Lambert writes: >> I'm investigating the issue I reported here: >> https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org >> I'd like to work on a patch to address this issue and make it work as >> advertised. > > Good idea, because it doesn't seem like anybody else cares ... ahem
Re: Fix XML handling with DOCTYPE
Ryan Lambert writes: > I'm investigating the issue I reported here: > https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org > I'd like to work on a patch to address this issue and make it work as > advertised. Good idea, because it doesn't seem like anybody else cares ... > I see xmlParseBalancedChunkMemoryRecover that might provide the > functionality needed. TBH, our experience with libxml has not been so positive that I'd think adding dependencies on new parts of its API would be a good plan. Experimenting with different inputs, it seems like removing the "" tag is enough to make it work. So what I'm wondering about is writing something like parse_xml_decl() to skip over that. Bear in mind though that I know next to zip about XML. There may be some good reason why we don't want to strip off the !DOCTYPE part from what libxml sees. regards, tom lane
Re: Fix XML handling with DOCTYPE
On 03/16/19 16:10, Ryan Lambert wrote: > As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT > should accept all valid XML. At this time, XML with a DOCTYPE declaration > is not accepted with this setting even though it is considered valid XML. Hello Ryan, A patch for your issue is currently registered in the 2019-03 commitfest[1]. If it attracts somebody to review it before the end of the month, it might make it into PG v12. It is the xml-content-2006-2.patch found on the email thread [2]. (The other patch found there is associated documentation fixes, and also needs to be reviewed.) Further conversation should probably be on that email thread so that it stays associated with the commitfest entry. Thanks for your interest in the issue! Regards, Chapman Flack [1] https://commitfest.postgresql.org/22/1872/ [2] https://www.postgresql.org/message-id/flat/5c81f8c0.6090...@anastigmatix.net
Fix XML handling with DOCTYPE
Hi all, I'm investigating the issue I reported here: https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT should accept all valid XML. At this time, XML with a DOCTYPE declaration is not accepted with this setting even though it is considered valid XML. I'd like to work on a patch to address this issue and make it work as advertised. I traced the source of the error to line ~1500 in /src/backend/utils/adt/xml.c res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string + count, NULL); It looks like it is xmlParseBalancedChunkMemory from libxml that doesn't work when there's a DOCTYPE in the XML data. My assumption is the DOCTYPE element makes the XML not well-balanced. From: http://xmlsoft.org/html/libxml-parser.html#xmlParseBalancedChunkMemory This function returns: > 0 if the chunk is well balanced, -1 in case of args problem and the parser > error code otherwise I see xmlParseBalancedChunkMemoryRecover that might provide the functionality needed. That function returns: 0 if the chunk is well balanced, -1 in case of args problem and the parser > error code otherwise In case recover is set to 1, the nodelist will not be > empty even if the parsed chunk is not well balanced, assuming the parsing > succeeded to some extent. I haven't tested yet to see if this parses the data w/ DOCTYPE successfully yet. If it does, I don't think it would be difficult to update the check on res_code to not fail. I'm making another assumption that there is a distinct code from libxml to differentiate from other errors, but I couldn't find those codes quickly. The current check is this: if (res_code != 0 || xmlerrcxt->err_occurred) Does this sound reasonable? Have I missed some major aspect? If this is on the right track I can work on creating a patch to move this forward. Thanks, *Ryan Lambert* RustProof Labs www.rustprooflabs.com