subject:"Fix XML handling with DOCTYPE"

Re: Fix XML handling with DOCTYPE

2019-09-12 Thread Alvaro Herrera

Hi Chapman,

On 2019-Sep-05, Chapman Flack wrote:

> Are these on your radar to maybe backpatch in this round of activity?
> 
> The latest patches I did for 11 and 10 are in
> https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net

Thanks!  I just pushed these to those branches.

I think we're finally done with these.  Many thanks for your
persistence.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fix XML handling with DOCTYPE

2019-09-05 Thread Chapman Flack

Hi Alvaro,

On 08/03/19 12:15, Alvaro Herrera wrote:

>> I don't know if it's too late to get in the upcoming minor releases,
>> but maybe it can, if it looks ok, or the next ones, if that's too rushed.
> 
> Hmm, I'm travelling back home from a conference the weekend, so yeah I
> think it would be rushed for me to handle for the upcoming set.  But I
> can look at it before the *next* set.

Are these on your radar to maybe backpatch in this round of activity?

The latest patches I did for 11 and 10 are in
https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net

Cheers,
-Chap

Re: Fix XML handling with DOCTYPE

2019-08-03 Thread Alvaro Herrera

On 2019-Aug-03, Chapman Flack wrote:

> I don't know if it's too late to get in the upcoming minor releases,
> but maybe it can, if it looks ok, or the next ones, if that's too rushed.

Hmm, I'm travelling back home from a conference the weekend, so yeah I
think it would be rushed for me to handle for the upcoming set.  But I
can look at it before the *next* set.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fix XML handling with DOCTYPE

2019-08-03 Thread Chapman Flack

On 04/01/19 17:34, Alvaro Herrera wrote:
> I think there were some outright bugs in the docs, at least for
> XMLTABLE, that maybe we should backpatch.  If you have the energy to
> cherry-pick a minimal doc update to 10/11, I offer to back-patch it.

I don't know if this fits your intention for "minimal". What I've done
is taken the doc commit made by Tom for 12 (12d46a), then revised it
so it describes the unfixed behavior for the bugs whose fixes weren't
backpatched to 11 or 10.

I don't know if it's too late to get in the upcoming minor releases,
but maybe it can, if it looks ok, or the next ones, if that's too rushed.

11.patch applies cleanly to 11, 10.patch to 10.

I've confirmed the 11 docs build successfully, but without sgml tools,
I haven't confirmed that for 10.

Regards,
-Chap
>From f55fb4daa47ed249e87bc417b111e842403fc1a9 Mon Sep 17 00:00:00 2001
From: nobody 
Date: Fri, 2 Aug 2019 22:47:10 -0400
Subject: [PATCH] Improve documentation about our XML functionality.

Add a section explaining how our XML features depart from current
versions of the SQL standard.  Update and clarify the descriptions
of some XML functions.

Chapman Flack, reviewed by Ryan Lambert

Discussion: https://postgr.es/m/5bd1284c.1010...@anastigmatix.net
Discussion: https://postgr.es/m/5c81f8c0.6090...@anastigmatix.net
Discussion: https://postgr.es/m/can-v+g-6jquqeqz55q3toxen6d5ez5uvzl4vr+8ktvjkj31...@mail.gmail.com

This version for backpatching PG 11 and 10, taken from Tom's commit
for 12, then edited to correctly describe behaviors that are fixed
in 12 but still broken in 11 and 10.
---
 doc/src/sgml/datatype.sgml   |   5 +
 doc/src/sgml/features.sgml   | 381 ++-
 doc/src/sgml/func.sgml   | 184 +
 src/backend/catalog/sql_features.txt |   6 +-
 4 files changed, 490 insertions(+), 86 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 401a2f0..fa505e0 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4228,6 +4228,11 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
 value is a full document or only a content fragment.

 
+   
+Limits and compatibility notes for the xml data type
+can be found in .
+   
+

 Creating XML Values

diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml
index 6c22d69..253ec87 100644
--- a/doc/src/sgml/features.sgml
+++ b/doc/src/sgml/features.sgml
@@ -16,7 +16,8 @@
   Language SQL.  A revised version of the standard is released
   from time to time; the most recent update appearing in 2011.
   The 2011 version is referred to as ISO/IEC 9075:2011, or simply as SQL:2011.
-  The versions prior to that were SQL:2008, SQL:2003, SQL:1999, and SQL-92.  Each version
+  The versions prior to that were SQL:2008, SQL:2006, SQL:2003, SQL:1999,
+  and SQL-92.  Each version
   replaces the previous one, so claims of conformance to earlier
   versions have no official merit.
   PostgreSQL development aims for
@@ -155,4 +156,382 @@

   
 
+  
+   XML Limits and Conformance to SQL/XML
+
+   
+SQL/XML
+limits and conformance
+   
+
+   
+Significant revisions to the XML-related specifications in ISO/IEC 9075-14
+(SQL/XML) were introduced with SQL:2006.
+PostgreSQL's implementation of the XML data
+type and related functions largely follows the earlier 2003 edition,
+with some borrowing from later editions.  In particular:
+
+ 
+  
+   Where the current standard provides a family of XML data types
+   to hold document or content in
+   untyped or XML Schema-typed variants, and a type
+   XML(SEQUENCE) to hold arbitrary pieces of XML content,
+   PostgreSQL provides the single
+   xml type, which can hold document or
+   content.  There is no equivalent of the
+   standard's sequence type.
+  
+ 
+
+ 
+  
+   PostgreSQL provides two functions
+   introduced in SQL:2006, but in variants that use the XPath 1.0
+   language, rather than XML Query as specified for them in the
+   standard.
+  
+ 
+
+   
+
+   
+This section presents some of the resulting differences you may encounter.
+   
+
+   
+Queries are restricted to XPath 1.0
+
+
+ The PostgreSQL-specific functions
+ xpath() and xpath_exists()
+ query XML documents using the XPath language.
+ PostgreSQL also provides XPath-only variants
+ of the standard functions XMLEXISTS and
+ XMLTABLE, which officially use
+ the XQuery language. For all of these functions,
+ PostgreSQL relies on the
+ libxml2 library, which provides only XPath 1.0.
+
+
+
+ There is a strong connection between the XQuery language and XPath
+ versions 2.0 and later: any expression that is syntactically valid and
+ executes successfully in both produces the same result (with a minor
+ exception for expressions containing numeric

Re: Fix XML handling with DOCTYPE

2019-04-01 Thread Chapman Flack

On 04/01/19 17:34, Alvaro Herrera wrote:
> I think there were some outright bugs in the docs, at least for
> XMLTABLE, that maybe we should backpatch.  If you have the energy to
> cherry-pick a minimal doc update to 10/11, I offer to back-patch it.

I'll see what I can do. There's breathing room for that after the end of
the CF, right?

It seems to me that the conformance-appendix part is worth using,
along with all of the clarifications in datatype.sgml and func.sgml
except the ones clarifying fixed behavior, where the behavior fix
wasn't backpatched. That'll be where the cherry-picking effort lies.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-04-01 Thread Alvaro Herrera

On 2019-Apr-01, Chapman Flack wrote:

> When I get a moment, I'll update the PostgreSQL vs. SQL/XML wiki page
> to reflect the things that were fixed.

I think there were some outright bugs in the docs, at least for
XMLTABLE, that maybe we should backpatch.  If you have the energy to
cherry-pick a minimal doc update to 10/11, I offer to back-patch it.

Thanks everyone for taking care of this!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fix XML handling with DOCTYPE

2019-04-01 Thread Chapman Flack

On 4/1/19 4:22 PM, Tom Lane wrote:
> Chapman Flack  writes:
>> So, xml-functions-type-docfix-6.patch.
> 
> Pushed with some light(?) copy-editing.
> 
> I believe this closes out everything discussed in
> 
> https://commitfest.postgresql.org/22/1872/
> 
> but I haven't gone through all three threads in detail.
> Please confirm whether that CF entry can be closed or not.

I think that does wrap up everything in the CF entry. Thanks!
And thanks for the copy-edits; they do read better than what
I came up with.

When I get a moment, I'll update the PostgreSQL vs. SQL/XML wiki page
to reflect the things that were fixed.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-04-01 Thread Tom Lane

Chapman Flack  writes:
> So, xml-functions-type-docfix-6.patch.

Pushed with some light(?) copy-editing.

I believe this closes out everything discussed in

https://commitfest.postgresql.org/22/1872/

but I haven't gone through all three threads in detail.
Please confirm whether that CF entry can be closed or not.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-28 Thread Chapman Flack

On 03/27/19 19:27, Chapman Flack wrote:
>   A column marked FOR ORDINALITY will be populated with row numbers
>   matching the order in which the output rows appeared in the original
>   input XML document.
> 
> I've been skimming right over it all this time, and that right there is
> a glaring built-in reliance on the observable-but-disclaimed iteration
> order of a libxml2 node-set.

So, xml-functions-type-docfix-6.patch.

I changed that language to say "populated with row numbers, starting
with 1, in the order of nodes retrieved from the row_expression's
result node-set."

That's not such a terrible thing to have to say; in fact, it's the
*correct* description for the standard, XQuery-based, XMLTABLE (where
the language gives you control of the result sequence's order).

I followed that with a short note saying since XPath 1.0 doesn't
specify that order, relying on it is implementation-dependent, and
linked to the existing Appendix D discussion.

I would have like to link directly to the , but of course
 doesn't know what to call that, so I linked to the 
instead.

Regards,
-Chap
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 52c28e7..0aed14c 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4219,6 +4219,12 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
 value is a full document or only a content fragment.

 
+   
+Limits and compatibility notes for the xml data type
+in PostgreSQL can be found in
+.
+   
+

 Creating XML Values

diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml
index 6c22d69..e8015a9 100644
--- a/doc/src/sgml/features.sgml
+++ b/doc/src/sgml/features.sgml
@@ -16,7 +16,8 @@
   Language SQL.  A revised version of the standard is released
   from time to time; the most recent update appearing in 2011.
   The 2011 version is referred to as ISO/IEC 9075:2011, or simply as SQL:2011.
-  The versions prior to that were SQL:2008, SQL:2003, SQL:1999, and SQL-92.  Each version
+  The versions prior to that were SQL:2008, SQL:2006, SQL:2003, SQL:1999,
+  and SQL-92.  Each version
   replaces the previous one, so claims of conformance to earlier
   versions have no official merit.
   PostgreSQL development aims for
@@ -155,4 +156,329 @@

   
 
+  
+   XML Limits and Conformance to SQL/XML
+
+   
+SQL/XML
+limits and conformance
+   
+
+   
+Significant revisions to the ISO/IEC 9075-14 XML-related specifications
+(SQL/XML) were introduced with SQL:2006. The
+PostgreSQL implementation of the XML data type
+and related functions largely follows the earlier, 2003 edition, with some
+borrowing from the later editions. In particular:
+
+ 
+  
+   Where the current standard provides a family of XML data types
+   to hold document or content in
+   untyped or XML Schema-typed variants, and a type
+   XML(SEQUENCE) to hold arbitrary pieces of XML content,
+   PostgreSQL provides the single
+   xml type, which can hold document or
+   content, and no equivalent of the sequence
+   type.
+  
+ 
+
+ 
+  
+   PostgreSQL provides two functions introduced
+   in SQL:2006, but in variants that use the language XPath 1.0, rather than
+   XML Query as specified for them in the standard.
+  
+ 
+
+   
+
+   
+This section presents some of the resulting differences you may encounter.
+   
+
+   
+Queries restricted to XPath 1.0
+
+
+ The PostgreSQL-specific functions
+ xpath and xpath_exists query
+ XML documents using the XPath language, and
+ PostgreSQL also provides XPath-only variants of
+ the standard functions XMLEXISTS and
+ XMLTABLE, which officially use
+ the XQuery language. For all of these functions,
+ PostgreSQL relies on the
+ libxml2 library, which provides only XPath 1.0.
+
+
+
+ There is a strong connection between the XQuery language and XPath versions
+ 2.0 and later: any expression that is syntactically valid and executes
+ successfully in both produces the same result (with a minor exception for
+ expressions containing numeric character references or predefined entity
+ references, which XQuery replaces with the corresponding character while
+ XPath leaves them alone). But there is no such connection between XPath 1.0
+ and XQuery or the later XPath versions; it was an earlier language and
+ differs in many respects.
+
+
+
+ There are two categories of limitation to keep in mind: the restriction
+ from XQuery to XPath for the functions specified in the SQL standard, and
+ the restriction of XPath to version 1.0 for both the standard and the
+ PostgreSQL-specific functions.
+
+
+
+ Restriction of XQuery to XPath
+
+ 
+  Features of XQuery beyond those of XPath include:
+
+  
+   
+
+ XQuery expressions can construct and

Re: Fix XML handling with DOCTYPE

2019-03-27 Thread Chapman Flack

On 03/27/19 19:07, Chapman Flack wrote:
> xml-functions-type-docfix-5.patch attached, with node-sets instead of
> nodesets, libxml2 instead of libxml, and parenthesized full sentences
> now au naturel.
> 
> I ended up turning the formerly-parenthesized note about libxml2's
> node-set ordering into a DocBook : there is really something
> parenthetical about it, with the official statement of node-set
> element ordering being that there is none, and the description of
> what the library happens to do being of possible interest, but set
> apart, with the necessary caveats about relying on it.

I have just suffered a giant sinking feeling upon re-reading this
sentence in our XMLTABLE doc:

  A column marked FOR ORDINALITY will be populated with row numbers
  matching the order in which the output rows appeared in the original
  input XML document.

I've been skimming right over it all this time, and that right there is
a glaring built-in reliance on the observable-but-disclaimed iteration
order of a libxml2 node-set.

I'm a bit unsure what any clarifying language should even say.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-27 Thread Chapman Flack

Hi,

xml-functions-type-docfix-5.patch attached, with node-sets instead of
nodesets, libxml2 instead of libxml, and parenthesized full sentences
now au naturel.

I ended up turning the formerly-parenthesized note about libxml2's
node-set ordering into a DocBook : there is really something
parenthetical about it, with the official statement of node-set
element ordering being that there is none, and the description of
what the library happens to do being of possible interest, but set
apart, with the necessary caveats about relying on it.

Spotted and fixed a couple more typos in the process.

Regards,
-Chap
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 52c28e7..0aed14c 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4219,6 +4219,12 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
 value is a full document or only a content fragment.

 
+   
+Limits and compatibility notes for the xml data type
+in PostgreSQL can be found in
+.
+   
+

 Creating XML Values

diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml
index 6c22d69..095fcb9 100644
--- a/doc/src/sgml/features.sgml
+++ b/doc/src/sgml/features.sgml
@@ -16,7 +16,8 @@
   Language SQL.  A revised version of the standard is released
   from time to time; the most recent update appearing in 2011.
   The 2011 version is referred to as ISO/IEC 9075:2011, or simply as SQL:2011.
-  The versions prior to that were SQL:2008, SQL:2003, SQL:1999, and SQL-92.  Each version
+  The versions prior to that were SQL:2008, SQL:2006, SQL:2003, SQL:1999,
+  and SQL-92.  Each version
   replaces the previous one, so claims of conformance to earlier
   versions have no official merit.
   PostgreSQL development aims for
@@ -155,4 +156,329 @@

   
 
+  
+   XML Limits and Conformance to SQL/XML
+
+   
+SQL/XML
+limits and conformance
+   
+
+   
+Significant revisions to the ISO/IEC 9075-14 XML-related specifications
+(SQL/XML) were introduced with SQL:2006. The
+PostgreSQL implementation of the XML data type
+and related functions largely follows the earlier, 2003 edition, with some
+borrowing from the later editions. In particular:
+
+ 
+  
+   Where the current standard provides a family of XML data types
+   to hold document or content in
+   untyped or XML Schema-typed variants, and a type
+   XML(SEQUENCE) to hold arbitrary pieces of XML content,
+   PostgreSQL provides the single
+   xml type, which can hold document or
+   content, and no equivalent of the sequence
+   type.
+  
+ 
+
+ 
+  
+   PostgreSQL provides two functions introduced
+   in SQL:2006, but in variants that use the language XPath 1.0, rather than
+   XML Query as specified for them in the standard.
+  
+ 
+
+   
+
+   
+This section presents some of the resulting differences you may encounter.
+   
+
+   
+Queries restricted to XPath 1.0
+
+
+ The PostgreSQL-specific functions
+ xpath and xpath_exists query
+ XML documents using the XPath language, and
+ PostgreSQL also provides XPath-only variants of
+ the standard functions XMLEXISTS and
+ XMLTABLE, which officially use
+ the XQuery language. For all of these functions,
+ PostgreSQL relies on the
+ libxml2 library, which provides only XPath 1.0.
+
+
+
+ There is a strong connection between the XQuery language and XPath versions
+ 2.0 and later: any expression that is syntactically valid and executes
+ successfully in both produces the same result (with a minor exception for
+ expressions containing numeric character references or predefined entity
+ references, which XQuery replaces with the corresponding character while
+ XPath leaves them alone). But there is no such connection between XPath 1.0
+ and XQuery or the later XPath versions; it was an earlier language and
+ differs in many respects.
+
+
+
+ There are two categories of limitation to keep in mind: the restriction
+ from XQuery to XPath for the functions specified in the SQL standard, and
+ the restriction of XPath to version 1.0 for both the standard and the
+ PostgreSQL-specific functions.
+
+
+
+ Restriction of XQuery to XPath
+
+ 
+  Features of XQuery beyond those of XPath include:
+
+  
+   
+
+ XQuery expressions can construct and return new XML nodes, in addition
+ to all possible XPath values. XPath can introduce and return values of
+ the atomic types (numbers, strings, and so on) but can only return XML
+ nodes already present in documents supplied as input to the expression.
+
+   
+
+   
+
+ XQuery has control constructs for iteration, sorting, and grouping.
+
+   
+
+   
+
+ XQuery allows the declaration and use of

Re: Fix XML handling with DOCTYPE

2019-03-27 Thread Chapman Flack

On 3/27/19 9:31 AM, Alvaro Herrera wrote:
> Everyone calls it "libxml2" nowadays.  Let's just use that and avoid any
> possible confusion.  If some libxml3 emerges one day, it's quite likely
> we'll need to revise much more than our docs in order to use it.

That's persuasive to me. I'll change the references to say libxml2
and let a committer serve as tiebreaker.

>> [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes
>> in no particular order"
> 
> What this means is "we don't guarantee any specific order".  It's like a
> query without ORDER BY: you may currently always get document order, but
> if you upgrade the library one day, it's quite possible to get the nodes
> in another order and you'll not get a refund.  So you (the user) should
> not rely on the order, or at least be mindful that it may change in the
> future.

Exactly. I called the behavior "counter-documented" to distinguish this
from the usual "undocumented" case, where you notice that a library is
behaving in a way you like, but its docs are utterly silent on the
matter, so you know you're going out on a limb to count on what you've
noticed.

In this case, you can notice the handy behavior but the doc *comes
right out and disclaims it* so if you count on it, you're going out
on a limb that has no bark left and looks punky.

And yet it seems worthwhile to mention how the library does in fact
seem to behave, because you might well be in the situation of porting
code over from SQL/XML:2006+ or XQuery or XPath 2+, or those are the
languages you've learned, so you may have order assumptions you've made,
and be surprised that XPath 1 doesn't let you make them, and at least
we can say "in a pinch, if you don't mind standing on this punky limb
here, you may be able to use the code you've got without having to
refactor every XMLTABLE() or xpath() into something wrapped in an
outer SQL query with ORDER BY. You just don't get your money back if
a later library upgrade changes the order."

The wiki page remembers[1] that I had tried some pretty gnarly XPath 1
queries to see if I could make libxml2 return things in a different
order, but no, got document order every time.

Regards,
-Chap

[1]
https://www.postgresql.org/message-id/5C465A65.4030305%40anastigmatix.net

Re: Fix XML handling with DOCTYPE

2019-03-27 Thread Alvaro Herrera

On 2019-Mar-27, Chapman Flack wrote:

> On 03/26/19 21:39, Ryan Lambert wrote:

> > Should this be clarified as the libxml2 library?  That's what I installed
> > to build postgres from source (Ubuntu 16/18).  If it is the libxml library
> > and the "2" is irrelevant
> 
> That's a good catch. I'm not actually sure whether there is any "libxml"
> library that isn't libxml2. Maybe there was once and nobody admits to
> hanging out with it. Most Google hits on "libxml" seem to be modules
> that have libxml in their names and libxml2 as their actual dependency.
> 
>   Perl XML:LibXML: "This module is an interface to libxml2"
>   Haskell libxml: "Binding to libxml2"
>   libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings
> for the GNOME Libxml2 ..."
> 
>   --with-libxml is the PostgreSQL configure option to make it use libxml2.
> 
>   The very web page http://xmlsoft.org/index.html says "The XML C parser
>   and toolkit of Gnome: libxml" and is all about libxml2.
> 
> So I think I was unsure what convention to follow, and threw up my hands
> and went with libxml. I could just as easily throw them up and go with
> libxml2. Which do you think would be better?

Daniel Veillard actually had libxml version 1 in that repository (mostly
of GNOME provenance, it seems, put together during some W3C meeting in
1998).  The version number changed to 2 sometime during year 2000.
Version 1 was mostly abandoned at that point, and for some reason
everyone keeps using "libxml2" as the name as though it was a different
thing from "libxml".  I suppose the latter name is just too generic, or
because they wanted to differentiate from the old (probably
incompatible API) code.
https://gitlab.gnome.org/GNOME/libxml2/tree/LIB_XML_1_BRANCH

Everyone calls it "libxml2" nowadays.  Let's just use that and avoid any
possible confusion.  If some libxml3 emerges one day, it's quite likely
we'll need to revise much more than our docs in order to use it.

> On 03/26/19 23:52, Tom Lane wrote:
> > Do we need to mention that at all?  If you're not building from source,
> > it doesn't seem very interesting ... but maybe I'm missing some reason
> > why end users would care.
> 
> The three places I've mentioned it were the ones where I thought users
> might care:

These seem relevant details.

> If you agree, I should go through and fix my nodesets to be node-sets.

+1

> [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes
> in no particular order"

What this means is "we don't guarantee any specific order".  It's like a
query without ORDER BY: you may currently always get document order, but
if you upgrade the library one day, it's quite possible to get the nodes
in another order and you'll not get a refund.  So you (the user) should
not rely on the order, or at least be mindful that it may change in the
future.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fix XML handling with DOCTYPE

2019-03-27 Thread Ryan Lambert

Thanks for putting up with a new reviewer!

  --with-libxml is the PostgreSQL configure option to make it use libxml2.
>


>   The very web page http://xmlsoft.org/index.html says "The XML C parser
>   and toolkit of Gnome: libxml" and is all about libxml2.
>


> So I think I was unsure what convention to follow, and threw up my hands
> and went with libxml. I could just as easily throw them up and go with
> libxml2. Which do you think would be better?


I think leaving it as libxml makes sense with all that.  Good point that
--with-libxml is used to build so I think staying with that works and is
consistent.  I agree that having this point included does clarify the how
and why of the limitations of this implementation.

I also over-parenthesize so I'm used to looking for that in my own
writing.  The full sentences were the ones that seemed excessive to me, I
think the others are ok and I won't nit-pick either way on those (unless
you want me to!).

If you agree, I should go through and fix my nodesets to be node-sets.


Yes, I like node-sets better, especially knowing it conforms to the spec's
language.

Thanks,

*Ryan Lambert*


On Tue, Mar 26, 2019 at 11:05 PM Chapman Flack 
wrote:

> On 03/26/19 21:39, Ryan Lambert wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   tested, passed
> > Spec compliant:   not tested
> > Documentation:tested, passed
>
> Thanks for the review!
>
> > I have two recommendations for features.sgml.  You state:
> >
> >>  relies on the libxml library
> >
> > Should this be clarified as the libxml2 library?  That's what I installed
> > to build postgres from source (Ubuntu 16/18).  If it is the libxml
> library
> > and the "2" is irrelevant
>
> That's a good catch. I'm not actually sure whether there is any "libxml"
> library that isn't libxml2. Maybe there was once and nobody admits to
> hanging out with it. Most Google hits on "libxml" seem to be modules
> that have libxml in their names and libxml2 as their actual dependency.
>
>   Perl XML:LibXML: "This module is an interface to libxml2"
>   Haskell libxml: "Binding to libxml2"
>   libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings
> for the GNOME Libxml2 ..."
>
>   --with-libxml is the PostgreSQL configure option to make it use libxml2.
>
>   The very web page http://xmlsoft.org/index.html says "The XML C parser
>   and toolkit of Gnome: libxml" and is all about libxml2.
>
> So I think I was unsure what convention to follow, and threw up my hands
> and went with libxml. I could just as easily throw them up and go with
> libxml2. Which do you think would be better?
>
> On 03/26/19 23:52, Tom Lane wrote:
> > Do we need to mention that at all?  If you're not building from source,
> > it doesn't seem very interesting ... but maybe I'm missing some reason
> > why end users would care.
>
> The three places I've mentioned it were the ones where I thought users
> might care:
>
>  - why are we stuck at XPath 1.0? It's what we get from the library we use.
>
>  - in what order do we get things out from a (hmm) node-set? Per XPath 1.0,
>it's indeterminate (it's a set!), unlike XPath 2.0/XQuery where there's
>a sequence type and you have order control. Observable behavior from
>libxml2 (and you could certainly want to know this) is you get things
> out
>in document order, whether that's what you wanted or not, even though
>this is undocumented, and even counter-documented[1], libxml2 behavior.
>So it's an example of something you would fundamentally like to know,
>where the only available answer depends precariously on the library
>we happen to be using.
>
>  - which limits in our implementation are inherent to the library, and
>which are just current limits in our embedding of it? (Maybe this is
>right at the border of what a user would care to know, but I know it's
>a question that crosses my mind when I bonk into a limit I wasn't
>expecting.)
>
> > There are a few places where the parenthesis around a block of text
> > seem unnecessary.
>
> )blush( that's a long-standing wart in my writing ... seems I often think
> in parentheses, then look and say "those aren't needed" and take them out,
> only sometimes I don't.
>
> I skimmed just now and found a few instances of parenthesized whole
> sentence: the one you quoted, and some (if argument is null, the result
> is null), and (No rows will be produced if ). Shall I deparenthesize
> them all? Did you have other instances in mind?
>
> > It seems you are standardizing from "node set" to "nodeset", is that
> > the preferred nomenclature or preference?
>
> Another good catch. I remember consciously making a last pass to get them
> all consistent, and I wanted them consistent with the spec, and I see now
> I messed up.
>
> XPath 1.0 [2] has zero instances of "nodeset", two of "node set" and

Re: Fix XML handling with DOCTYPE

2019-03-26 Thread Chapman Flack

On 03/26/19 21:39, Ryan Lambert wrote:

> I can't verify technical accuracy for many of the details (nuances between
> XPath 1.0, et. al), but overall my experience with the XML functionality
> lines up with what has been documented here.

By the way, in case it's buried too far back in the email thread now,
much of the early drafting for this happened on the wiki page

https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards

which includes a lot of reference links, including a nice paper by
Andrew Eisenberg and Jim Melton that introduced the major changes
from the SQL:2003 to :2006 editions of SQL/XML.

Cheers,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-26 Thread Chapman Flack

On 03/26/19 21:39, Ryan Lambert wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:tested, passed

Thanks for the review!

> I have two recommendations for features.sgml.  You state: 
> 
>>  relies on the libxml library
> 
> Should this be clarified as the libxml2 library?  That's what I installed
> to build postgres from source (Ubuntu 16/18).  If it is the libxml library
> and the "2" is irrelevant

That's a good catch. I'm not actually sure whether there is any "libxml"
library that isn't libxml2. Maybe there was once and nobody admits to
hanging out with it. Most Google hits on "libxml" seem to be modules
that have libxml in their names and libxml2 as their actual dependency.

  Perl XML:LibXML: "This module is an interface to libxml2"
  Haskell libxml: "Binding to libxml2"
  libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings
for the GNOME Libxml2 ..."

  --with-libxml is the PostgreSQL configure option to make it use libxml2.

  The very web page http://xmlsoft.org/index.html says "The XML C parser
  and toolkit of Gnome: libxml" and is all about libxml2.

So I think I was unsure what convention to follow, and threw up my hands
and went with libxml. I could just as easily throw them up and go with
libxml2. Which do you think would be better?

On 03/26/19 23:52, Tom Lane wrote:
> Do we need to mention that at all?  If you're not building from source,
> it doesn't seem very interesting ... but maybe I'm missing some reason
> why end users would care.

The three places I've mentioned it were the ones where I thought users
might care:

 - why are we stuck at XPath 1.0? It's what we get from the library we use.

 - in what order do we get things out from a (hmm) node-set? Per XPath 1.0,
   it's indeterminate (it's a set!), unlike XPath 2.0/XQuery where there's
   a sequence type and you have order control. Observable behavior from
   libxml2 (and you could certainly want to know this) is you get things out
   in document order, whether that's what you wanted or not, even though
   this is undocumented, and even counter-documented[1], libxml2 behavior.
   So it's an example of something you would fundamentally like to know,
   where the only available answer depends precariously on the library
   we happen to be using.

 - which limits in our implementation are inherent to the library, and
   which are just current limits in our embedding of it? (Maybe this is
   right at the border of what a user would care to know, but I know it's
   a question that crosses my mind when I bonk into a limit I wasn't
   expecting.)

> There are a few places where the parenthesis around a block of text
> seem unnecessary.

)blush( that's a long-standing wart in my writing ... seems I often think
in parentheses, then look and say "those aren't needed" and take them out,
only sometimes I don't.

I skimmed just now and found a few instances of parenthesized whole
sentence: the one you quoted, and some (if argument is null, the result
is null), and (No rows will be produced if ). Shall I deparenthesize
them all? Did you have other instances in mind?

> It seems you are standardizing from "node set" to "nodeset", is that
> the preferred nomenclature or preference?

Another good catch. I remember consciously making a last pass to get them
all consistent, and I wanted them consistent with the spec, and I see now
I messed up.

XPath 1.0 [2] has zero instances of "nodeset", two of "node set" and about
six dozen of "node-set". The only appearances of "node set" without the
hyphen are in a heading and its ToC entry. The stuff under that heading
consistently uses node-set. It seems that's the XPath 1.0 term for sure.

When I made my consistency pass, I must have been looking too recently
in libxml2 C source, rather than the spec.

On 03/26/19 23:52, Tom Lane wrote:
> That seemed a bit jargon-y to me too.  If that's standard terminology
> in the XML world, maybe it's fine; but I'd have stuck with "node set".

It really was my intention (though I flubbed it) to use XPath 1.0's term
for XPath 1.0's concept; in my doc philosophy, that gives readers
the most breadcrumbs to follow for the rest of the details if they want
them. "Node set" might be some sort of squishy expository concept I'm
using, but node-set is a thing, in a spec.

If you agree, I should go through and fix my nodesets to be node-sets.

I do think the terminology matters here, especially because of the
differences between what you can do with a node-set (XPath 1.0 thing)
and with a sequence (XPath 2.0+,XQuery,SQL/XML thing).

Let me know what you'd like best on these points and I'll revise the patch.

Regards,
-Chap

[1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes
in no particular order"

[2]

Re: Fix XML handling with DOCTYPE

2019-03-26 Thread Tom Lane

Ryan Lambert  writes:
> I have two recommendations for features.sgml.  You state: 

>> relies on the libxml library

> Should this be clarified as the libxml2 library?  That's what I installed to 
> build postgres from source (Ubuntu 16/18).  If it is the libxml library and 
> the "2" is irrelevant, or it it works with either, it might be nice to have a 
> clarifying note to indicate that.

Do we need to mention that at all?  If you're not building from source,
it doesn't seem very interesting ... but maybe I'm missing some reason
why end users would care.

> It seems you are standardizing from "node set" to "nodeset", is that the 
> preferred nomenclature or preference?

That seemed a bit jargon-y to me too.  If that's standard terminology
in the XML world, maybe it's fine; but I'd have stuck with "node set".

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-26 Thread Ryan Lambert

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Overall I think this patch [1] improves the docs and help explain edge case 
functionality that many of us, myself included, don't fully understand.  I 
can't verify technical accuracy for many of the details (nuances between XPath 
1.0, et. al), but overall my experience with the XML functionality lines up 
with what has been documented here.  Adding the clear declaration of "XPath 
1.0" instead of the generic "XPath" helps make it clear of the functional 
differences and helps frame the differences for new users.

I have two recommendations for features.sgml.  You state: 

>  relies on the libxml library

Should this be clarified as the libxml2 library?  That's what I installed to 
build postgres from source (Ubuntu 16/18).  If it is the libxml library and the 
"2" is irrelevant, or it it works with either, it might be nice to have a 
clarifying note to indicate that.

There are a few places where the parenthesis around a block of text seem 
unnecessary.  I don't think parens serve a purpose when a full sentence is 
contained within.

> (The libxml library does seem to always return nodesets to PostgreSQL with 
> their members in the same relative order they had in the input document; it 
> does not commit to this behavior, and an XPath 1.0 expression cannot control 
> it.)


It seems you are standardizing from "node set" to "nodeset", is that the 
preferred nomenclature or preference?

Hopefully this helps!  Thanks,

Ryan Lambert

[1] 
https://www.postgresql.org/message-id/attachment/100016/xml-functions-type-docfix-4.patch

The new status of this patch is: Waiting on Author

Re: Fix XML handling with DOCTYPE

2019-03-26 Thread Tom Lane

Ryan Lambert  writes:
> Is the xml-functions-type-docfix-4.patch [1] the one needing review?  I'll
> test applying it and review the changes in better detail.  Is there a
> section in the docs that shows how to verify if the updated pages render
> properly?  I would assume the pages are build when installing from source.

Plain old "make all" doesn't build the docs.  See
https://www.postgresql.org/docs/devel/docguide.html
for tooling prerequisites and build instructions.

(Usually people just build the HTML docs and look at them
with a browser; the other doc formats are less interesting.)

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-26 Thread Ryan Lambert

Ok, I'll give it a go.


> If you happened to feel moved to look over a documentation patch, that
> would be what this CF entry most needs in the waning days of the
> commitfest.


Is the xml-functions-type-docfix-4.patch [1] the one needing review?  I'll
test applying it and review the changes in better detail.  Is there a
section in the docs that shows how to verify if the updated pages render
properly?  I would assume the pages are build when installing from source.

Ryan

[1]
https://www.postgresql.org/message-id/attachment/100016/xml-functions-type-docfix-4.patch

On Mon, Mar 25, 2019 at 4:52 PM Chapman Flack  wrote:

> On 03/25/19 18:03, Ryan Lambert wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   tested, passed
> > Spec compliant:   not tested
> > Documentation:not tested
>
> Hi,
>
> Thanks for the review! Yes, that part of this commitfest entry has been
> committed already and will appear in the next minor releases of those
> branches.
>
> That leaves only one patch in this commitfest entry that is still in
> need of review, namely the update to the documentation.
>
> If you happened to feel moved to look over a documentation patch, that
> would be what this CF entry most needs in the waning days of the
> commitfest.
>
> There seem to be community members reluctant to review it because of not
> feeling sufficiently expert in XML to scrutinize every technical detail,
> but there are other valuable angles for documentation review. (And the
> reason there *is* a documentation patch is the plentiful room for
> improvement in the documentation that's already there, so as far as
> reviewing goes, the old yarn about the two guys, the running shoes, and
> the bear comes to mind.)
>
> I can supply pointers to specs, etc., for anyone who does see some
> technical
> details in the patch and has questions about them.
>
> Regards,
> -Chap
>

Re: Fix XML handling with DOCTYPE

2019-03-25 Thread Tom Lane

Chapman Flack  writes:
> Thanks for the review! Yes, that part of this commitfest entry has been
> committed already and will appear in the next minor releases of those
> branches.

Indeed, thanks for verifying that this fixes your problem.

> That leaves only one patch in this commitfest entry that is still in
> need of review, namely the update to the documentation.

Yeah.  Since it *is* in need of review, I changed the CF entry's
state back to Needs Review.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-25 Thread Chapman Flack

On 03/25/19 18:03, Ryan Lambert wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:not tested

Hi,

Thanks for the review! Yes, that part of this commitfest entry has been
committed already and will appear in the next minor releases of those
branches.

That leaves only one patch in this commitfest entry that is still in
need of review, namely the update to the documentation.

If you happened to feel moved to look over a documentation patch, that
would be what this CF entry most needs in the waning days of the commitfest.

There seem to be community members reluctant to review it because of not
feeling sufficiently expert in XML to scrutinize every technical detail,
but there are other valuable angles for documentation review. (And the
reason there *is* a documentation patch is the plentiful room for
improvement in the documentation that's already there, so as far as
reviewing goes, the old yarn about the two guys, the running shoes, and
the bear comes to mind.)

I can supply pointers to specs, etc., for anyone who does see some technical
details in the patch and has questions about them.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-25 Thread Ryan Lambert

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I tested the master branch (commit 8edd0e7), REL_11_STABLE (commit 24df866) and 
REL9_6_STABLE (commit 5097368) and verified functionality.  This patch fixes 
the bug I had reported [1] previously.

With this in the stable branches is it safe to assume this will be included 
with the next minor releases?  Thanks for your work on this!!

Ryan

[1] 
https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org

The new status of this patch is: Ready for Committer

Re: Fix XML handling with DOCTYPE

2019-03-25 Thread Ryan Lambert

Perfect, thank you!  I do remember seeing that message now, but hadn't
understood what it really meant.
I will test later today.  Thanks

*Ryan*

On Sun, Mar 24, 2019 at 9:19 PM Tom Lane  wrote:

> Chapman Flack  writes:
> > On 03/24/19 21:04, Ryan Lambert wrote:
> >> I am unable to get latest patches I found  [1] to apply cleanly to
> current
> >> branches. It's possible I missed the latest patches so if I'm using the
> >> wrong ones please let me know.  I tried against master, 11.2 stable and
> the
> >> 11.2 tag with similar results.
>
> > Tom pushed the content-with-DOCTYPE patch, it's now included in master,
> > REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and
> > REL9_4_STABLE.
>
> Right.  If you want to test (and please do!) you could grab the relevant
> branch tip from our git repo, or download a nightly snapshot tarball from
>
> https://www.postgresql.org/ftp/snapshot/
>
> regards, tom lane
>

Re: Fix XML handling with DOCTYPE

2019-03-24 Thread Tom Lane

Chapman Flack  writes:
> On 03/24/19 21:04, Ryan Lambert wrote:
>> I am unable to get latest patches I found  [1] to apply cleanly to current
>> branches. It's possible I missed the latest patches so if I'm using the
>> wrong ones please let me know.  I tried against master, 11.2 stable and the
>> 11.2 tag with similar results.

> Tom pushed the content-with-DOCTYPE patch, it's now included in master,
> REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and
> REL9_4_STABLE.

Right.  If you want to test (and please do!) you could grab the relevant
branch tip from our git repo, or download a nightly snapshot tarball from

https://www.postgresql.org/ftp/snapshot/

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-24 Thread Chapman Flack

On 03/24/19 21:04, Ryan Lambert wrote:
> I am unable to get latest patches I found  [1] to apply cleanly to current
> branches. It's possible I missed the latest patches so if I'm using the
> wrong ones please let me know.  I tried against master, 11.2 stable and the
> 11.2 tag with similar results.

Tom pushed the content-with-DOCTYPE patch, it's now included in master,
REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and
REL9_4_STABLE.

The only patch that's left to be reviewed and applied is the documentation
fix, latest in [1].

If you were interested in giving a review opinion on some XML documentation.

Regards,
-Chap

[1] https://www.postgresql.org/message-id/5c96dbb5.2080...@anastigmatix.net

Re: Fix XML handling with DOCTYPE

2019-03-24 Thread Ryan Lambert

I am unable to get latest patches I found  [1] to apply cleanly to current
branches. It's possible I missed the latest patches so if I'm using the
wrong ones please let me know.  I tried against master, 11.2 stable and the
11.2 tag with similar results.  It's quite possible it's user error on my
end, I am new to this process but didn't have issues with the previous
patches when I tested those a couple weeks ago.

[1] https://www.postgresql.org/message-id/5c8ecaa4.3090...@anastigmatix.net

Ryan Lambert

On Sat, Mar 23, 2019 at 5:07 PM Chapman Flack  wrote:

> On 03/23/19 18:22, Tom Lane wrote:
> >> Out of curiosity, what further processing would you expect libxml to do?
> >
> > Hm, I'd have thought it'd try to parse the arguments to some extent,
> > but maybe not.  Does everybody reimplement attribute parsing for
> > themselves when using PIs?
>
> Yeah, the content of a PI (whatever's after the target name) is left
> all to be defined by whatever XML-using application might care about
> that PI.
>
> It could have an attribute=value syntax inspired by XML elements, or
> some other form entirely, but there'd just better not be any ?> in it.
>
> Regards,
> -Chap
>

Re: Fix XML handling with DOCTYPE

2019-03-23 Thread Chapman Flack

On 03/23/19 18:22, Tom Lane wrote:
>> Out of curiosity, what further processing would you expect libxml to do?
> 
> Hm, I'd have thought it'd try to parse the arguments to some extent,
> but maybe not.  Does everybody reimplement attribute parsing for
> themselves when using PIs?

Yeah, the content of a PI (whatever's after the target name) is left
all to be defined by whatever XML-using application might care about
that PI.

It could have an attribute=value syntax inspired by XML elements, or
some other form entirely, but there'd just better not be any ?> in it.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-23 Thread Tom Lane

Chapman Flack  writes:
> On 03/23/19 16:59, Tom Lane wrote:
>> You're not really validating that the input
>> is something that libxml would accept, unless its processing of XML PIs
>> is far stupider than I would expect it to be.

> Out of curiosity, what further processing would you expect libxml to do?

Hm, I'd have thought it'd try to parse the arguments to some extent,
but maybe not.  Does everybody reimplement attribute parsing for
themselves when using PIs?

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-23 Thread Chapman Flack

On 03/23/19 16:59, Tom Lane wrote:
> Unicode-code-point numbers.  I removed that, made some other changes to
> bring the patch more in line with PG coding style, and pushed it.

Thanks! It looks good. I'm content with the extra PI checking being gone.

The magic Unicode-code-point numbers come straight from the XML standard;
I couldn't make that stuff up. :)

> > You're not really validating that the input
> is something that libxml would accept, unless its processing of XML PIs
> is far stupider than I would expect it to be.

Out of curiosity, what further processing would you expect libxml to do?

XML parsers are supposed to be transparent PI-preservers, except in the
rare case of seeing a PI that actually means something to the embedding
application, which isn't going to be the case for a database simply
implementing an XML data type.

The standard literally requires that the target must be a NAME, and
can't match [Xx][Mm][Ll], and if there's whitespace and anything after
that, there can't be an embedded ?> ... and that's it.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-23 Thread Tom Lane

Chapman Flack  writes:
> I decided, for a first point of reference, to wear the green eyeshade and
> write a pre-check that exactly implements the applicable rules. That gives
> a starting point for simplifications that are probably safe.
> For example, a bunch of lines at the end have to do with verifying the
> content inside of a processing-instruction, after finding where it ends.
> We could reasonably decide that, for the purpose of skipping it, knowing
> where it ends is enough, as libxml will parse it next and report any errors
> anyway.

Yeah, I did not like that code too much, particularly not all the magic
Unicode-code-point numbers.  I removed that, made some other changes to
bring the patch more in line with PG coding style, and pushed it.

> That made me just want to try it now, and--surprise!--the messages from
> libxml are not the same. So maybe I would lean to keeping the green-eyeshade
> rules in the test, if you can stomach them, but I would understand taking
> them out.

I doubt anyone will care too much about whether error messages for bad
XML input are exactly like what they were before; and even if someone
does, I doubt that these extra tests would be enough to ensure that
the messages don't change.  You're not really validating that the input
is something that libxml would accept, unless its processing of XML PIs
is far stupider than I would expect it to be.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-18 Thread Chapman Flack

There might be too many different email threads on this with patches,
but in case it went under the radar, xml-content-2006-3.patch appeared
in my previous message on this thread[1].

It is based on a simple pre-check of the prefix of the input, determining
which form of parse to apply. That may or may not be simpler than parse-
once-save-error-parse-again-report-first-error, but IMV it's a more direct
solution and clearer (the logic is clearly about "how do I determine the way
this input should be parsed?" which is the problem on the table, rather
than "how should I save and regurgitate this libxml error?" which turns the
problem on the table to a different one).

I decided, for a first point of reference, to wear the green eyeshade and
write a pre-check that exactly implements the applicable rules. That gives
a starting point for simplifications that are probably safe.

For example, a bunch of lines at the end have to do with verifying the
content inside of a processing-instruction, after finding where it ends.
We could reasonably decide that, for the purpose of skipping it, knowing
where it ends is enough, as libxml will parse it next and report any errors
anyway.

That would slightly violate my intention of sending input to (the parser
that wasn't asked for) /only/ when it's completely clear (from the prefix
we've seen) that that's where it should go. The relaxed version could do
that in completely-clear cases and cases with an invalid PI ahead of what
looks like a DTD. But you'd pretty much expect both parsers to produce
the same message for a bad PI anyway.

That made me just want to try it now, and--surprise!--the messages from
libxml are not the same. So maybe I would lean to keeping the green-eyeshade
rules in the test, if you can stomach them, but I would understand taking
them out.

Regards,
-Chap

[1] https://www.postgresql.org/message-id/5c8ecaa4.3090...@anastigmatix.net

Re: Fix XML handling with DOCTYPE

2019-03-17 Thread Chapman Flack

On 03/17/19 15:06, Tom Lane wrote:
> The error message issue is indeed a concern, but I don't see why it's
> complicated if you agree that
> 
>> If the query asked for CONTENT, any error result should be one you could get
>> when parsing as CONTENT.
> 
> That just requires us to save the first error message and be sure to issue
> that one not the DOCUMENT one, no?

I confess I haven't looked hard yet at how to do that. The way errors come
out of libxml is more involved than "here's a message", so there's a choice
of (a) try to copy off that struct in a way that's sure to survive
re-executing the parser, and then use the copy, or (b) generate a message
right away from the structured information and save that, and I guess b
might not be too bad; a might not be too bad, or it might, and slide right
back into the kind of libxml-behavior-assumptions you're wanting to avoid.

Meanwhile, here is a patch on the lines I proposed earlier, with a
pre-check. Any performance hit that it could entail (which I'd really
expect to be de minimis, though I haven't benchmarked) ought to be
compensated by the strlen I changed to strnlen in parse_xml_decl (as
there's really no need to run off and count the whole rest of the input
just to know if 1, 2, 3, or 4 bytes are available to decode a UTF-8 char).

... and, yes, I know that could be an independent patch, and then the
performance effect here should be measured from there. But it was near
what I was doing anyway, so I included it here.

Attaching both still-outstanding patches (this one and docfix) so the
CF app doesn't lose one.

Regards,
-Chap
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index b462c06..94b46a6 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4282,16 +4282,21 @@ SET XML OPTION { DOCUMENT | CONTENT };
 SET xmloption TO { DOCUMENT | CONTENT };
 
 The default is CONTENT, so all forms of XML
-data are allowed.
+data are allowed except as noted below.

 

 
- With the default XML option setting, you cannot directly cast
- character strings to type xml if they contain a
- document type declaration, because the definition of XML content
- fragment does not accept them.  If you need to do that, either
- use XMLPARSE or change the XML option.
+ In the SQL:2006 and later standard, the CONTENT form
+ is a proper superset of the DOCUMENT form, and so the
+ default XML option setting would allow casting to XML from character
+ strings in either form. PostgreSQL, however,
+ uses the SQL:2003 definition in which CONTENT form
+ cannot contain a document type declaration. Therefore, there is no one
+ setting of the XML option that will allow casting to XML from every valid
+ character string. The default will work almost always, but for a document
+ with a DTD, it will be necessary to change the XML option or
+ use XMLPARSE specifying DOCUMENT.
 

 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 03859a7..0017aab 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10140,8 +10140,13 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 
 
  
+
   XML Functions
 
+  
+   XML Functions
+  
+
   
The functions and function-like expressions described in this
section operate on values of type xml.  Check .  The particular behavior for
  individual data types is expected to evolve in order to align the
- SQL and PostgreSQL data types with the XML Schema specification,
- at which point a more precise description will appear.
+ PostgreSQL mappings with those specified in SQL:2006 and later,
+ as discussed in .
 

 
@@ -10587,10 +10592,12 @@ SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
 
 
 
- The function xmlexists returns true if the
- XPath expression in the first argument returns any nodes, and
- false otherwise.  (If either argument is null, the result is
- null.)
+ The function xmlexists evaluates an XPath 1.0
+ expression (the first argument), with the passed value as its context item.
+ The function returns false if the result of that evaluation yields an
+ empty nodeset, true if it yields any other value.  (The function returns
+ null if an argument is null.) A nonnull value passed as the context item
+ must be an XML document, not a content fragment or any non-XML value.
 
 
 
@@ -10607,24 +10614,12 @@ SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY VALUE 'T
 
 
  The BY REF or BY VALUE clauses
- have no effect in PostgreSQL, but are allowed
- for compatibility with other implementations.  Per the SQL
- standard, the one that precedes any argument is required, and indicates
- the default for arguments that follow, and one may follow any argument to
- override the default.
- PostgreSQL ignores BY REF
- and

Re: Fix XML handling with DOCTYPE

2019-03-17 Thread Tom Lane

Chapman Flack  writes:
> On 03/17/19 13:16, Tom Lane wrote:
>> Do we need a pre-scan at all?

> Without it, we double the time to a failure result in every case that
> should actually fail, as well as in this one corner case that we want to
> see succeed, and the question you posed earlier about which error message
> to return becomes thornier.

I have absolutely zero concern about whether it takes twice as long to
detect bad input; nobody should be sending bad input if they're concerned
about performance.  (The costs of the ensuing transaction abort would
likely dwarf xml_in's runtime in any case.)  Besides, with what we're
talking about doing here,

(1) the extra runtime is consumed only in cases that would fail up to now,
so nobody can complain about a performance regression;
(2) doing a pre-scan *would* be a performance regression for cases that
work today; not a large one we hope, but still...

The error message issue is indeed a concern, but I don't see why it's
complicated if you agree that

> If the query asked for CONTENT, any error result should be one you could get
> when parsing as CONTENT.

That just requires us to save the first error message and be sure to issue
that one not the DOCUMENT one, no?  That's what we'd want to do from a
backwards-compatibility standpoint anyhow, since that's the error message
wording you'd get with today's code.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-17 Thread Chapman Flack

On 03/17/19 13:16, Tom Lane wrote:
> Chapman Flack  writes:
>> What I was doing in the patch is the reverse: parsing with the expectation
>> of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
>> element to precede a DTD, so that approach can be expected to fail fast
>> if the other branch needs to be taken.
> 
> Ah, right.  I don't have any problem with trying the CONTENT approach
> before the DOCUMENT approach rather than vice-versa.  What I was concerned
> about was adding a lot of assumptions about exactly how libxml would
> report the failure.  IMO a maximally-safe patch would just rearrange
> things we're already doing without adding new things.
> 
>> But a quick pre-scan for the same thing would have the same property,
>> without the libxml dependencies that bother you here. Watch this space.
> 
> Do we need a pre-scan at all?

Without it, we double the time to a failure result in every case that
should actually fail, as well as in this one corner case that we want to
see succeed, and the question you posed earlier about which error message
to return becomes thornier.

If the query asked for CONTENT, any error result should be one you could get
when parsing as CONTENT. If we switch and try parsing as DOCUMENT _because
the input is claiming to have the form of a DOCUMENT_, then it's defensible
to return errors explaining why it's not a DOCUMENT ... but not in the
general case of just throwing DOCUMENT at it any time CONTENT parse fails.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-17 Thread Tom Lane

Chapman Flack  writes:
> What I was doing in the patch is the reverse: parsing with the expectation
> of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
> element to precede a DTD, so that approach can be expected to fail fast
> if the other branch needs to be taken.

Ah, right.  I don't have any problem with trying the CONTENT approach
before the DOCUMENT approach rather than vice-versa.  What I was concerned
about was adding a lot of assumptions about exactly how libxml would
report the failure.  IMO a maximally-safe patch would just rearrange
things we're already doing without adding new things.

> But a quick pre-scan for the same thing would have the same property,
> without the libxml dependencies that bother you here. Watch this space.

Do we need a pre-scan at all?

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-17 Thread Chapman Flack

On 03/17/19 11:45, Tom Lane wrote:
> Chapman Flack  writes:
>> On 03/16/19 17:21, Tom Lane wrote:
>>> (1) always try to parse as document.  If successful, we're done.
>>> (2) otherwise, if allowed by xmloption, try to parse using our
> 
>> What I don't like about that is that (a) the input could be
>> arbitrarily long and complex to parse (not that you can't imagine
>> a database populated with lots of short little XML snippets, but
>> at the same time, a query could quite plausibly deal in yooge ones),
>> and (b), step (1) could fail at the last byte of the input, followed
>> by total reparsing as (2).
> 
> That doesn't seem particularly likely to me: based on what's been
> said here, I'd expect parsing with the wrong expectation to usually
> fail near the start of the input.  In any case, the other patch
> also requires repeat parsing, no?  It's just doing that in a different
> set of cases.

I'll do up a version with the open-coded prescan I proposed last night.

Whether parsing with the wrong expectation is likely to fail near the
start of the input depends on both the input and the expectation. If
your expectation is DOCUMENT and the input is CONTENT, it's possible
for the determining difference to be something that follows the first
element, and a first element can be (and often is) nearly all of the input.

What I was doing in the patch is the reverse: parsing with the expectation
of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
element to precede a DTD, so that approach can be expected to fail fast
if the other branch needs to be taken.

But a quick pre-scan for the same thing would have the same property,
without the libxml dependencies that bother you here. Watch this space.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-17 Thread Tom Lane

Chapman Flack  writes:
> On 03/16/19 17:21, Tom Lane wrote:
>> Hm, so, maybe just
>> 
>> (1) always try to parse as document.  If successful, we're done.
>> 
>> (2) otherwise, if allowed by xmloption, try to parse using our
>> current logic for the CONTENT case.

> What I don't like about that is that (a) the input could be
> arbitrarily long and complex to parse (not that you can't imagine
> a database populated with lots of short little XML snippets, but
> at the same time, a query could quite plausibly deal in yooge ones),
> and (b), step (1) could fail at the last byte of the input, followed
> by total reparsing as (2).

That doesn't seem particularly likely to me: based on what's been
said here, I'd expect parsing with the wrong expectation to usually
fail near the start of the input.  In any case, the other patch
also requires repeat parsing, no?  It's just doing that in a different
set of cases.

The reason I'm pressing you for a simpler patch is that dump/reload
failures are pretty bad, so ideally we'd find a fix that we're
comfortable with back-patching into the released branches.
Personally I would never dare to back-patch the proposed patch:
it's too complex, so it's not real clear that it doesn't have unwanted
side-effects, and it's not at all certain that there aren't libxml
version dependencies in it.  (Maybe another committer with more
familiarity with libxml would evaluate the risks differently, but
I doubt it.)  But I think that something close to what I sketched
above would pass muster as safe-to-backpatch.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Chapman Flack

On 03/16/19 18:33, Chapman Flack wrote:
> The pre-scan is a simple linear search and will ordinarily say yes or no
> within a couple dozen characters--you could *have* an input with 20k of
> leading whitespace and comments, but it's hardly the norm. Just trying to

If the available regexp functions want to start by munging the entire input
into a pg_wchar array, then it may be better to implement the pre-scan as
open code, the same way parse_xml_decl() is already implemented.

Given that parse_xml_decl() already covers the first optional thing that
can precede the doctype, the remaining scan routine would only need to
recognize comments, PIs, and whitespace. That would be pretty straightforward.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Ryan Lambert

Thank you both!  I had glanced at that item in the commitfest but didn't
notice it would fix this issue.
I'll try to test/review this before the end of the month, much better than
starting from scratch myself.   A quick glance at the patch looks logical
and looks like it should work for my use case.

Thanks,

Ryan Lambert


On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack  wrote:

> On 03/16/19 17:21, Tom Lane wrote:
> > Chapman Flack  writes:
> >> On 03/16/19 16:55, Tom Lane wrote:
> >>> What do you think of the idea I just posted about parsing off the
> DOCTYPE
> >>> thing for ourselves, and not letting libxml see it?
> >
> >> The principled way of doing that would be to pre-parse to find a
> DOCTYPE,
> >> and if there is one, leave it there and parse the input as we do for
> >> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
> >> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
> >> 'content' is a proper superset of 'document', so if we were asked for
> >> 'content' and can successfully parse it as 'document', we're good,
> >> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
> >> well, that's what needed to happen.
> >
> > Hm, so, maybe just
> >
> > (1) always try to parse as document.  If successful, we're done.
> >
> > (2) otherwise, if allowed by xmloption, try to parse using our
> > current logic for the CONTENT case.
>
> What I don't like about that is that (a) the input could be
> arbitrarily long and complex to parse (not that you can't imagine
> a database populated with lots of short little XML snippets, but
> at the same time, a query could quite plausibly deal in yooge ones),
> and (b), step (1) could fail at the last byte of the input, followed
> by total reparsing as (2).
>
> I think the safer structure is clearly that of the current patch,
> modulo whether the "has a DOCTYPE" test is done by libxml itself
> (with the assumptions you don't like) or by a pre-scan.
>
> So the current structure is:
>
> restart:
>   asked for document?
> parse as document, or fail
>   else asked for content:
> parse as content
> failed?
>   because DOCTYPE? restart as if document
>   else fail
>
> and a pre-scan structure could be very similar:
>
> restart:
>   asked for document?
> parse as document, or fail
>   else asked for content:
> pre-scan finds DOCTYPE?
>   restart as if document
> else parse as content, or fail
>
> The pre-scan is a simple linear search and will ordinarily say yes or no
> within a couple dozen characters--you could *have* an input with 20k of
> leading whitespace and comments, but it's hardly the norm. Just trying to
> parse as 'document' first could easily parse a large fraction of the input
> before discovering it's followed by something that can't follow a document
> element.
>
> Regards,
> -Chap
>

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Chapman Flack

On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack  writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
> 
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
> 
> Hm, so, maybe just
> 
> (1) always try to parse as document.  If successful, we're done.
> 
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.

What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).

I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.

So the current structure is:

restart:
  asked for document?
parse as document, or fail
  else asked for content:
parse as content
failed?
  because DOCTYPE? restart as if document
  else fail

and a pre-scan structure could be very similar:

restart:
  asked for document?
parse as document, or fail
  else asked for content:
pre-scan finds DOCTYPE?
  restart as if document
else parse as content, or fail

The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Tom Lane

Chapman Flack  writes:
> On 03/16/19 16:55, Tom Lane wrote:
>> What do you think of the idea I just posted about parsing off the DOCTYPE
>> thing for ourselves, and not letting libxml see it?

> The principled way of doing that would be to pre-parse to find a DOCTYPE,
> and if there is one, leave it there and parse the input as we do for
> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
> 'content' is a proper superset of 'document', so if we were asked for
> 'content' and can successfully parse it as 'document', we're good,
> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
> well, that's what needed to happen.

Hm, so, maybe just

(1) always try to parse as document.  If successful, we're done.

(2) otherwise, if allowed by xmloption, try to parse using our
current logic for the CONTENT case.

This avoids adding any new assumptions about how libxml acts,
which is what I was hoping to achieve.

One interesting question is which error to report if both (1) and (2)
fail.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Chapman Flack

On 03/16/19 16:55, Tom Lane wrote:
> What do you think of the idea I just posted about parsing off the DOCTYPE
> thing for ourselves, and not letting libxml see it?

The principled way of doing that would be to pre-parse to find a DOCTYPE,
and if there is one, leave it there and parse the input as we do for
'document'. Per XML, if there is a DOCTYPE, the document must satisfy
the 'document' syntax requirements, and per SQL/XML:2006-and-later,
'content' is a proper superset of 'document', so if we were asked for
'content' and can successfully parse it as 'document', we're good,
and if we see a DOCTYPE and yet it incurs a parse error as 'document',
well, that's what needed to happen.

The DOCTYPE can appear arbitrarily far in, but the only things that
can precede it are the XML decl, whitespace, XML comments, and XML
processing instructions. None of those things nest, so the preceding
stuff makes a regular language, and a regular expression that matches
any amount of that stuff ending in

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Tom Lane

Chapman Flack  writes:
> A patch for your issue is currently registered in the 2019-03 commitfest[1].

Oh!  I apologize for saying nobody was working on this issue.

Taking a very quick look at your patch, though, I dunno --- it seems like
it adds a boatload of new assumptions about libxml's data structures and
error-reporting behavior.  I'm sure it works for you, but will it work
across a wide range of libxml versions?

What do you think of the idea I just posted about parsing off the DOCTYPE
thing for ourselves, and not letting libxml see it?

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Chapman Flack

On 03/16/19 16:42, Tom Lane wrote:
> Ryan Lambert  writes:
>> I'm investigating the issue I reported here:
>> https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org
>> I'd like to work on a patch to address this issue and make it work as
>> advertised.
> 
> Good idea, because it doesn't seem like anybody else cares ...

ahem

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Tom Lane

Ryan Lambert  writes:
> I'm investigating the issue I reported here:
> https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org
> I'd like to work on a patch to address this issue and make it work as
> advertised.

Good idea, because it doesn't seem like anybody else cares ...

> I see xmlParseBalancedChunkMemoryRecover that might provide the
> functionality needed.

TBH, our experience with libxml has not been so positive that I'd think
adding dependencies on new parts of its API would be a good plan.

Experimenting with different inputs, it seems like removing the
"" tag is enough to make it work.  So what I'm wondering
about is writing something like parse_xml_decl() to skip over that.

Bear in mind though that I know next to zip about XML.  There may be
some good reason why we don't want to strip off the !DOCTYPE part
from what libxml sees.

regards, tom lane

Re: Fix XML handling with DOCTYPE

2019-03-16 Thread Chapman Flack

On 03/16/19 16:10, Ryan Lambert wrote:
> As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
> should accept all valid XML.  At this time, XML with a DOCTYPE declaration
> is not accepted with this setting even though it is considered valid XML.

Hello Ryan,

A patch for your issue is currently registered in the 2019-03 commitfest[1].

If it attracts somebody to review it before the end of the month, it might
make it into PG v12.

It is the xml-content-2006-2.patch found on the email thread [2]. (The other
patch found there is associated documentation fixes, and also needs to be
reviewed.)

Further conversation should probably be on that email thread so that it
stays associated with the commitfest entry.

Thanks for your interest in the issue!

Regards,
Chapman Flack

[1] https://commitfest.postgresql.org/22/1872/
[2] https://www.postgresql.org/message-id/flat/5c81f8c0.6090...@anastigmatix.net

Fix XML handling with DOCTYPE

2019-03-16 Thread Ryan Lambert

Hi all,

I'm investigating the issue I reported here:
https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org

As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
should accept all valid XML.  At this time, XML with a DOCTYPE declaration
is not accepted with this setting even though it is considered valid XML.
I'd like to work on a patch to address this issue and make it work as
advertised.

I traced the source of the error to line ~1500 in
/src/backend/utils/adt/xml.c

res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string +
count, NULL);

It looks like it is xmlParseBalancedChunkMemory from libxml that doesn't
work when there's a DOCTYPE in the XML data. My assumption is the DOCTYPE
element makes the XML not well-balanced.  From:

http://xmlsoft.org/html/libxml-parser.html#xmlParseBalancedChunkMemory

This function returns:

> 0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise


I see xmlParseBalancedChunkMemoryRecover that might provide the
functionality needed. That function returns:

0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise In case recover is set to 1, the nodelist will not be
> empty even if the parsed chunk is not well balanced, assuming the parsing
> succeeded to some extent.


I haven't tested yet to see if this parses the data w/ DOCTYPE successfully
yet.  If it does, I don't think it would be difficult to update the check
on res_code to not fail.  I'm making another assumption that there is a
distinct code from libxml to differentiate from other errors, but I
couldn't find those codes quickly.  The current check is this:

if (res_code != 0 || xmlerrcxt->err_occurred)

Does this sound reasonable?  Have I missed some major aspect?  If this is
on the right track I can work on creating a patch to move this forward.

Thanks,

*Ryan Lambert*
RustProof Labs
www.rustprooflabs.com

48 matches

Mail list logo