Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-12 Thread Thom Brown
On 10 July 2010 14:12, Mike Fowler <m...@mlfowler.com> wrote:
> Robert Haas wrote:
>>
>> On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut <pete...@gmx.net> wrote:
>>>
>>> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>>>>
>>>> Here's the patch to add the 'xml_is_well_formed' function.
>>>
>>> I suppose we should remove the function from contrib/xml2 at the same
>>> time.
>>
>> Yep
>
> Revised patch deleting the contrib/xml2 version of the function attached.
>
> Regards,
>
> --
> Mike Fowler
> Registered Linux user: 379787


Would a test for mismatched or undefined namespaces be necessary?

For example:

Mismatched namespace:
<pg:foo xmlns:pg="http://postgresql.org/stuff"><bar/></my:foo>

Undefined namespace when used in conjunction with IS DOCUMENT:
<pg:foo xmlns:my="http://postgresql.org/stuff"><bar/></pg:foo>

Also, having a look at the following example from the patch:
SELECT xml_is_well_formed('<local:data
xmlns:local="http://127.0.0.1";><local:piece id="1">number
one</local:piece><local:piece id="2" /></local:data>');
 xml_is_well_formed
--------------------
 t
(1 row)

Just wondering about that semi-colon after the namespace definition.

Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-12 Thread Mike Fowler

Thom Brown wrote:

> Would a test for mismatched or undefined namespaces be necessary?
>
> For example:
>
> Mismatched namespace:
> <pg:foo xmlns:pg="http://postgresql.org/stuff"><bar/></my:foo>
>
> Undefined namespace when used in conjunction with IS DOCUMENT:
> <pg:foo xmlns:my="http://postgresql.org/stuff"><bar/></pg:foo>


Thanks for looking at my patch, Thom. I hadn't thought of that particular
scenario, and even though I didn't specifically code for it, the
underlying libxml call does correctly reject the mismatched namespace:


template1=# SELECT xml_is_well_formed('<pg:foo
xmlns:pg="http://postgresql.org/stuff"><bar/></my:foo>');
 xml_is_well_formed
--------------------
 f
(1 row)



In the attached patch I've added the example to the SGML documentation 
and the regression tests.
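
For anyone who wants to try the two namespace cases without a patched backend, here is a minimal sketch in Python rather than the patch's C. It assumes that the standard-library xml.etree.ElementTree parser (built on expat, not libxml) is an acceptable stand-in. The helper name is_well_formed_document is made up for illustration, and unlike the patch it only accepts single-rooted documents, not content fragments. The test above only confirms the mismatched case for libxml; the result for the undefined prefix below is expat's behaviour and may not match libxml's.

```python
import xml.etree.ElementTree as ET


def is_well_formed_document(text: str) -> bool:
    """Return True if text parses as a single-rooted XML document.

    Hypothetical helper: this is expat via ElementTree, not the libxml
    call used by the patch, so edge cases may differ.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The two cases raised in the thread, plus a correctly matched element.
mismatched = '<pg:foo xmlns:pg="http://postgresql.org/stuff"><bar/></my:foo>'
undefined = '<pg:foo xmlns:my="http://postgresql.org/stuff"><bar/></pg:foo>'
matched = '<pg:foo xmlns:pg="http://postgresql.org/stuff"><bar/></pg:foo>'

for sample in (mismatched, undefined, matched):
    print(is_well_formed_document(sample), sample)
```

The point of the sketch is that the namespace check comes from the underlying parser for free, which is the same reason the patch needed no namespace-specific code.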



> Also, having a look at the following example from the patch:
> SELECT xml_is_well_formed('<local:data
> xmlns:local="http://127.0.0.1";><local:piece id="1">number
> one</local:piece><local:piece id="2" /></local:data>');
>  xml_is_well_formed
> --------------------
>  t
> (1 row)
>
> Just wondering about that semi-colon after the namespace definition.
>
> Thom


The semi-colon is not supposed to be there, and I'm not sure where it's
come from. In Thunderbird I see the email with my patch as an
attachment; having downloaded and viewed the file, there are no
instances of a " followed by a ;. However, if I look at the message on
the archive at
http://archives.postgresql.org/message-id/4c3871c2.8000...@mlfowler.com
I can see that every URL that ends with a " has a ; following it. Should
I be escaping the " in the patch file in some way, or is this just an
artifact of HTML parsing a patch?


Regards,

--
Mike Fowler
Registered Linux user: 379787

*** a/contrib/xml2/xpath.c
--- b/contrib/xml2/xpath.c
***************
*** 27,33 **** PG_MODULE_MAGIC;
  
  /* externally accessible functions */
  
- Datum		xml_is_well_formed(PG_FUNCTION_ARGS);
  Datum		xml_encode_special_chars(PG_FUNCTION_ARGS);
  Datum		xpath_nodeset(PG_FUNCTION_ARGS);
  Datum		xpath_string(PG_FUNCTION_ARGS);
--- 27,32 ----
***************
*** 70,97 **** pgxml_parser_init(void)
  	xmlLoadExtDtdDefaultValue = 1;
  }
  
- 
- /* Returns true if document is well-formed */
- 
- PG_FUNCTION_INFO_V1(xml_is_well_formed);
- 
- Datum
- xml_is_well_formed(PG_FUNCTION_ARGS)
- {
- 	text	   *t = PG_GETARG_TEXT_P(0);		/* document buffer */
- 	int32		docsize = VARSIZE(t) - VARHDRSZ;
- 	xmlDocPtr	doctree;
- 
- 	pgxml_parser_init();
- 
- 	doctree = xmlParseMemory((char *) VARDATA(t), docsize);
- 	if (doctree == NULL)
- 		PG_RETURN_BOOL(false);	/* i.e. not well-formed */
- 	xmlFreeDoc(doctree);
- 	PG_RETURN_BOOL(true);
- }
- 
- 
  /* Encodes special characters (<, >, &, " and \r) as XML entities */
  
  PG_FUNCTION_INFO_V1(xml_encode_special_chars);
--- 69,74 ----
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
  
     <sect3>
!     <title>XML Predicates</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+ 
+   <sect2>
+    <title>XML Predicates</title>
  
     <sect3>
!     <title>IS DOCUMENT</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8675 ----
       between documents and content fragments.
      </para>
     </sect3>
+ 
+    <sect3>
+     <title>xml_is_well_formed</title>
+ 
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+ 
+ <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+ 
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo><bar>stuff</foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     In addition to the structure checks, the function ensures that namespaces are correctly matched.
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff"><bar/></my:foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff"><bar/></pg:foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     This function can be combined with the IS DOCUMENT predicate to prevent
+     invalid XML content errors from occurring in queries. For example, given a
+     table that may have rows with invalid XML mixed in with rows of valid
+ 

Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-12 Thread Thom Brown
On 12 July 2010 13:07, Mike Fowler <m...@mlfowler.com> wrote:
> Thom Brown wrote:
>>
>> Just wondering about that semi-colon after the namespace definition.
>>
>> Thom
>
> The semi-colon is not supposed to be there, and I'm not sure where it's come
> from. With Thunderbird I see the email with my patch as an attachment,
> downloaded and viewing the file there are no instances of a " followed by a
> ;. However, if I look at the message on the archive at
> http://archives.postgresql.org/message-id/4c3871c2.8000...@mlfowler.com I
> can see every URL that ends with a " has a ; following it. Should I be
> escaping the " in the patch file in some way or this just an artifact of
> HTML parsing a patch?

Yeah, I guess it's a parsing issue related to the archive viewer.  I
arrived there from the commitfest page and should have really looked
directly at the patch.  No problem there then I guess.

Thanks for the work you've done on this. :)

Thom


Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-10 Thread Mike Fowler

Robert Haas wrote:
> On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut <pete...@gmx.net> wrote:
>> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>>> Here's the patch to add the 'xml_is_well_formed' function.
>>
>> I suppose we should remove the function from contrib/xml2 at the same
>> time.
>
> Yep

Revised patch deleting the contrib/xml2 version of the function attached.

Regards,

--
Mike Fowler
Registered Linux user: 379787

*** a/contrib/xml2/xpath.c
--- b/contrib/xml2/xpath.c
***************
*** 27,33 **** PG_MODULE_MAGIC;
  
  /* externally accessible functions */
  
- Datum		xml_is_well_formed(PG_FUNCTION_ARGS);
  Datum		xml_encode_special_chars(PG_FUNCTION_ARGS);
  Datum		xpath_nodeset(PG_FUNCTION_ARGS);
  Datum		xpath_string(PG_FUNCTION_ARGS);
--- 27,32 ----
***************
*** 70,97 **** pgxml_parser_init(void)
  	xmlLoadExtDtdDefaultValue = 1;
  }
  
- 
- /* Returns true if document is well-formed */
- 
- PG_FUNCTION_INFO_V1(xml_is_well_formed);
- 
- Datum
- xml_is_well_formed(PG_FUNCTION_ARGS)
- {
- 	text	   *t = PG_GETARG_TEXT_P(0);		/* document buffer */
- 	int32		docsize = VARSIZE(t) - VARHDRSZ;
- 	xmlDocPtr	doctree;
- 
- 	pgxml_parser_init();
- 
- 	doctree = xmlParseMemory((char *) VARDATA(t), docsize);
- 	if (doctree == NULL)
- 		PG_RETURN_BOOL(false);	/* i.e. not well-formed */
- 	xmlFreeDoc(doctree);
- 	PG_RETURN_BOOL(true);
- }
- 
- 
  /* Encodes special characters (<, >, &, " and \r) as XML entities */
  
  PG_FUNCTION_INFO_V1(xml_encode_special_chars);
--- 69,74 ----
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
  
     <sect3>
!     <title>XML Predicates</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+ 
+   <sect2>
+    <title>XML Predicates</title>
  
     <sect3>
!     <title>IS DOCUMENT</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8653 ----
       between documents and content fragments.
      </para>
     </sect3>
+ 
+    <sect3>
+     <title>xml_is_well_formed</title>
+ 
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+ 
+ <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+ 
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     This function can be combined with the IS DOCUMENT predicate to prevent
+     invalid XML content errors from occurring in queries. For example, given a
+     table that may have rows with invalid XML mixed in with rows of valid
+     XML, <function>xml_is_well_formed</function> can be used to filter out all
+     the invalid rows.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+ 
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+ 
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
  
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
  
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+ 	text				*data = PG_GETARG_TEXT_P(0);
+ 	bool				result;
+ 	int					res_code;
+ 	int32				len;
+ 	const xmlChar		*string;
+ 	xmlParserCtxtPtr	ctxt;
+ 	xmlDocPtr			doc = NULL;
+ 
+ 	len = VARSIZE(data) - VARHDRSZ;
+ 	string = xml_text2xmlChar(data);
+ 
+ 	/* Start up libxml and its parser (no-ops if already done) */
+ 	pg_xml_init();
+ 	xmlInitParser();
+ 
+ 	ctxt = xmlNewParserCtxt();
+ 	if (ctxt == NULL)
+ 		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+ 					"could not allocate parser context");
+ 
+ 	PG_TRY();
+ 	{
+ 		size_t		count;
+ 		xmlChar				*version = NULL;
+ 		int			standalone = -1;
+ 
+ 		res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+ 		if (res_code != 0)
+ 			xml_ereport_by_code(ERROR, 

Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-09 Thread Peter Eisentraut
On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
> Here's the patch to add the 'xml_is_well_formed' function.

I suppose we should remove the function from contrib/xml2 at the same
time.



Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-09 Thread Robert Haas
On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>> Here's the patch to add the 'xml_is_well_formed' function.
>
> I suppose we should remove the function from contrib/xml2 at the same
> time.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


[PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

2010-07-07 Thread Mike Fowler

Peter Eisentraut wrote:
> On lör, 2010-07-03 at 09:26 +0100, Mike Fowler wrote:
>> What I will do instead is implement the xml_is_well_formed function
>> and get a patch out in the next day or two.
>
> That sounds very useful.
  
Here's the patch to add the 'xml_is_well_formed' function. Paraphrasing
the SGML, the syntax is:


|xml_is_well_formed|(/text/)

The function |xml_is_well_formed| evaluates whether the /text/ is well 
formed XML content, returning a boolean. I've done some tests (included 
in the patch) with tables containing a mixture of well formed documents 
and content and the function is happily returning the expected result. 
Combining with IS (NOT) DOCUMENT is working nicely for pulling out 
content or documents from a table of text.
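
To make the "filter first, then test IS DOCUMENT" pattern concrete outside the database, here is a rough sketch in Python. It is not the patch's implementation: it assumes the standard-library xml.etree.ElementTree parser (expat) in place of libxml, approximates "well formed content" by wrapping the text in a throwaway root element, and uses made-up helper names (parses, is_well_formed_content, is_document). The sample rows are modelled on the mixed table used in the patch's documentation example. The wrapper trick is only an approximation; for instance it would reject input that starts with an XML declaration.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if text parses as a single-rooted XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def is_well_formed_content(text: str) -> bool:
    # Assumption: wrapping in a dummy root lets multi-rooted fragments
    # count as content, roughly like a balanced-chunk parse.
    return parses("<wrapper>" + text + "</wrapper>")


def is_document(text: str) -> bool:
    # Rough analogue of IS DOCUMENT: exactly one root element.
    return parses(text)


rows = [
    "<foo>bar</foo>",                # well-formed document
    "<foo>bar</foo",                 # malformed
    "<foo>bar</foo><bar>foo</bar>",  # well-formed content, not a document
    "<foo>bar</foo><bar>foo</bar",   # malformed
]

# Filter out malformed rows before applying the document test.
well_formed = [r for r in rows if is_well_formed_content(r)]
documents = [r for r in well_formed if is_document(r)]
print(len(well_formed), len(documents))  # prints "2 1"
```

In the SQL version the order of the two conditions carries the weight, since casting a malformed row to xml raises an error rather than returning false; the sketch sidesteps that because both helpers swallow the parse error.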


Unless I missed something in the original correspondence, I think this 
patch will solve the issue.


Regards,

--
Mike Fowler
Registered Linux user: 379787

*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
  
     <sect3>
!     <title>XML Predicates</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+ 
+   <sect2>
+    <title>XML Predicates</title>
  
     <sect3>
!     <title>IS DOCUMENT</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8653 ----
       between documents and content fragments.
      </para>
     </sect3>
+ 
+    <sect3>
+     <title>xml_is_well_formed</title>
+ 
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+ 
+ <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+ 
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     This function can be combined with the IS DOCUMENT predicate to prevent
+     invalid XML content errors from occurring in queries. For example, given a
+     table that may have rows with invalid XML mixed in with rows of valid
+     XML, <function>xml_is_well_formed</function> can be used to filter out all
+     the invalid rows.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+ 
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+ 
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
  
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
  
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+ 	text				*data = PG_GETARG_TEXT_P(0);
+ 	bool				result;
+ 	int					res_code;
+ 	int32				len;
+ 	const xmlChar		*string;
+ 	xmlParserCtxtPtr	ctxt;
+ 	xmlDocPtr			doc = NULL;
+ 
+ 	len = VARSIZE(data) - VARHDRSZ;
+ 	string = xml_text2xmlChar(data);
+ 
+ 	/* Start up libxml and its parser (no-ops if already done) */
+ 	pg_xml_init();
+ 	xmlInitParser();
+ 
+ 	ctxt = xmlNewParserCtxt();
+ 	if (ctxt == NULL)
+ 		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+ 					"could not allocate parser context");
+ 
+ 	PG_TRY();
+ 	{
+ 		size_t		count;
+ 		xmlChar				*version = NULL;
+ 		int			standalone = -1;
+ 
+ 		res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+ 		if (res_code != 0)
+ 			xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+ 						  "invalid XML content: invalid XML declaration",
+ 							res_code);
+ 
+ 		doc = xmlNewDoc(version);
+ 		doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
+ 		doc->standalone = 1;
+ 
+ 		res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, string + count, NULL);
+ 
+ 		result = !res_code;
+ 	}
+ 	PG_CATCH();
+ 	{
+ 		if (doc)
+ 			xmlFreeDoc(doc);
+ 		if (ctxt)
+ 			xmlFreeParserCtxt(ctxt);
+ 
+ 		PG_RE_THROW();
+ 	}
+ 	PG_END_TRY();
+ 
+ 	if (doc)
+ 		xmlFreeDoc(doc);
+ 	if (ctxt)
+ 		xmlFreeParserCtxt(ctxt);
+ 
+ 	return result;
+ #else
+ 	NO_XML_SUPPORT();
+ 	return 0;
+