Re: [HACKERS] Native XML

2011-03-10 Thread Robert Haas
On Wed, Mar 9, 2011 at 7:03 PM, Josh Berkus j...@agliodbs.com wrote:
 Then I think the answer is that we need both data types.  One for
 text-XML and one for binary-XML.

That's what I think, too.  I'm not sure whether we want both of them
in core, but I think the binary-XML one would, at a minimum, make an
awfully nice extension to ship in contrib.  I'd also like to have text
and binary JSON types... very MongoDB-ish...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Native XML

2011-03-09 Thread Bruce Momjian
Robert Haas wrote:
 On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Well, in principle we could allow them to work on both, just the same
  way that (for instance) + is a standardized operator but works on more
  than one datatype.  But I agree that the prospect of two parallel types
  with essentially duplicate functionality isn't pleasing at all.
 
 The real issue here is whether we want to store XML as text (as we do
 now) or as some predigested form which would make output the whole
 thing slower but speed up things like xpath lookups.  We had the same
 issue with JSON, and due to the uncertainty about which way to go with
 it we ended up integrating nothing into core at all.  It's really not
 clear that there is one way of doing this that is right for all use
 cases.  If you are storing xml in an xml column just to get it
 validated, and doing no processing in the DB, then you'd probably
 prefer our current representation.  If you want to build functional
 indexes on xpath expressions, and then run queries that extract data
 using other xpath expressions, you would probably prefer the other
 representation.

Someone should measure how much overhead the indexing of xml values
might have.  If it is minor, we might be OK with only an indexed xml
type.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Native XML

2011-03-09 Thread Robert Haas
On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian br...@momjian.us wrote:
 Robert Haas wrote:
 On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Well, in principle we could allow them to work on both, just the same
  way that (for instance) + is a standardized operator but works on more
  than one datatype.  But I agree that the prospect of two parallel types
  with essentially duplicate functionality isn't pleasing at all.

 The real issue here is whether we want to store XML as text (as we do
 now) or as some predigested form which would make output the whole
 thing slower but speed up things like xpath lookups.  We had the same
 issue with JSON, and due to the uncertainty about which way to go with
 it we ended up integrating nothing into core at all.  It's really not
 clear that there is one way of doing this that is right for all use
 cases.  If you are storing xml in an xml column just to get it
 validated, and doing no processing in the DB, then you'd probably
 prefer our current representation.  If you want to build functional
 indexes on xpath expressions, and then run queries that extract data
 using other xpath expressions, you would probably prefer the other
 representation.

 Someone should measure how much overhead the indexing of xml values
 might have.  If it is minor, we might be OK with only an indexed xml
 type.

I think the relevant thing to measure would be how fast the
predigested representation speeds up the evaluation of xpath
expressions.
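
A rough way to get a first feel for that measurement, entirely outside the database: the sketch below compares re-parsing the text on every lookup with evaluating against a tree that was parsed once. It uses only the Python standard library; the document shape, the helper names and the path ./article/title are assumptions made up for illustration, and ElementTree supports only a limited XPath subset, so this says nothing definite about libxml2 or about any real on-disk format.

```python
import timeit
import xml.etree.ElementTree as ET


def make_doc(n):
    """Build a synthetic document with n <article> children (hypothetical shape)."""
    items = "".join(f"<article><title>t{i}</title></article>" for i in range(n))
    return f"<articles>{items}</articles>"


def lookup_text(xml_text, path):
    """Text storage: the whole document is parsed on every lookup."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall(path)]


def lookup_parsed(root, path):
    """Stand-in for a 'predigested' form: the tree already exists."""
    return [e.text for e in root.findall(path)]


doc = make_doc(1000)
tree = ET.fromstring(doc)
path = "./article/title"

# Time both strategies over the same number of lookups.
t_text = timeit.timeit(lambda: lookup_text(doc, path), number=20)
t_parsed = timeit.timeit(lambda: lookup_parsed(tree, path), number=20)
print(f"parse on every lookup: {t_text:.4f}s, parsed once: {t_parsed:.4f}s")
```

The numbers only show how much of a lookup is parse cost in this toy setting; a real comparison would also have to count the cost of producing the full text again from the predigested form.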

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Native XML

2011-03-09 Thread Yeb Havinga

On 2011-03-09 19:30, Robert Haas wrote:
 On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian br...@momjian.us wrote:
  Robert Haas wrote:
   On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
    Well, in principle we could allow them to work on both, just the same
    way that (for instance) + is a standardized operator but works on more
    than one datatype.  But I agree that the prospect of two parallel types
    with essentially duplicate functionality isn't pleasing at all.

   The real issue here is whether we want to store XML as text (as we do
   now) or as some predigested form which would make output the whole
   thing slower but speed up things like xpath lookups.  We had the same
   issue with JSON, and due to the uncertainty about which way to go with
   it we ended up integrating nothing into core at all.  It's really not
   clear that there is one way of doing this that is right for all use
   cases.  If you are storing xml in an xml column just to get it
   validated, and doing no processing in the DB, then you'd probably
   prefer our current representation.  If you want to build functional
   indexes on xpath expressions, and then run queries that extract data
   using other xpath expressions, you would probably prefer the other
   representation.

  Someone should measure how much overhead the indexing of xml values
  might have.  If it is minor, we might be OK with only an indexed xml
  type.

 I think the relevant thing to measure would be how fast the
 predigested representation speeds up the evaluation of xpath
 expressions.

About a predigested representation, I hope I'm not insulting anyone's
education here, but a lot of XML database 'accelerators' seem to be
using the pre and post orders (see
http://en.wikipedia.org/wiki/Tree_traversal) of the document nodes. The
following two pdfs show how these orders can be used to query for e.g.
all ancestors of a node: second pdf slide 10: for nodes x,y : x is an
ancestor of y when x.pre < y.pre AND x.post > y.post.


www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor4.pdf  about the format
www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor10.pdf about querying 
the format
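
To make the pre/post idea concrete, here is a minimal sketch in Python (standard library only). It numbers the nodes of a small document in one depth-first walk and then tests the ancestor relation purely by comparing ranks. The function names and the sample document are made up for illustration; this is not how any particular accelerator stores its data.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Assign (pre, post) ranks to every element in a single depth-first walk."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]          # rank on first arrival at the node
        counters["pre"] += 1
        for child in node:
            visit(child)
        ranks[node] = (pre, counters["post"])  # post rank once all children are done
        counters["post"] += 1

    visit(root)
    return ranks


def is_ancestor(ranks, x, y):
    """x is an ancestor of y when x.pre < y.pre AND x.post > y.post."""
    (x_pre, x_post), (y_pre, y_post) = ranks[x], ranks[y]
    return x_pre < y_pre and x_post > y_post


doc = ET.fromstring("<a><b><c/></b><d/></a>")
ranks = number_nodes(doc)
a, b, c, d = doc, doc.find("b"), doc.find("b/c"), doc.find("d")

print(is_ancestor(ranks, a, c))  # a encloses c
print(is_ancestor(ranks, b, d))  # b and d are siblings
```

Once the two ranks are stored per node, "all ancestors of y" becomes a plain range condition on two integer columns, which is what makes the scheme attractive for indexing.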


regards,
Yeb Havinga



Re: [HACKERS] Native XML

2011-03-09 Thread Anton
On 03/09/2011 08:21 PM, Yeb Havinga wrote:
 On 2011-03-09 19:30, Robert Haas wrote:
 On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian br...@momjian.us wrote:
 
 Robert Haas wrote:
   
 On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 
 Well, in principle we could allow them to work on both, just the same
 way that (for instance) + is a standardized operator but works on more
 than one datatype.  But I agree that the prospect of two parallel types
 with essentially duplicate functionality isn't pleasing at all.
   
 The real issue here is whether we want to store XML as text (as we do
 now) or as some predigested form which would make output the whole
 thing slower but speed up things like xpath lookups.  We had the same
 issue with JSON, and due to the uncertainty about which way to go with
 it we ended up integrating nothing into core at all.  It's really not
 clear that there is one way of doing this that is right for all use
 cases.  If you are storing xml in an xml column just to get it
 validated, and doing no processing in the DB, then you'd probably
 prefer our current representation.  If you want to build functional
 indexes on xpath expressions, and then run queries that extract data
 using other xpath expressions, you would probably prefer the other
 representation.
 
 Someone should measure how much overhead the indexing of xml values
 might have.  If it is minor, we might be OK with only an indexed xml
 type.
   
 I think the relevant thing to measure would be how fast the
 predigested representation speeds up the evaluation of xpath
 expressions.
 
 About a predigested representation, I hope I'm not insulting anyone's
 education here, but a lot of XML database 'accelerators' seem to be
 using the pre and post orders (see
 http://en.wikipedia.org/wiki/Tree_traversal) of the document nodes.
 The following two pdfs show how these orders can be used to query for
 e.g. all ancestors of a node: second pdf slide 10: for nodes x,y : x
 is an ancestor of y when x.pre < y.pre AND x.post > y.post.

 www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor4.pdf  about the format
 www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor10.pdf about querying
 the format

 regards,
 Yeb Havinga

This looks rather like a special kind of XML shredding, which is
listed at http://wiki.postgresql.org/wiki/Todo

About the predigested / plain storage and the evaluation: I haven't yet
fully given up the idea of playing with it, even if only on a purely
experimental basis (i.e. with little or no ambition to contribute to the
core product). If I do, it might also be interesting to use TOAST
slicing during the xpath evaluation so that large documents are not
fetched immediately as a whole if the xpath is rather 'short'.

But first I should let all the thoughts 'settle down'. There may well be
other areas of Postgres where it's worth spending some time, whether
writing something or just reading.


Re: [HACKERS] Native XML

2011-03-09 Thread Josh Berkus
On 3/9/11 10:11 AM, Bruce Momjian wrote:
 If you are storing xml in an xml column just to get it
 validated, and doing no processing in the DB, then you'd probably
 prefer our current representation.  If you want to build functional
 indexes on xpath expressions, and then run queries that extract data
 using other xpath expressions, you would probably prefer the other
 representation.

Then I think the answer is that we need both data types.  One for
text-XML and one for binary-XML.

For my part, I don't use PostgreSQL's native XML tools for storage of
XML data because the xpath functions are too slow and limited to make PG
useful as an XML database.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Native XML

2011-03-02 Thread Nicolas Barbier
2011/3/1 Andrew Dunstan and...@dunslane.net:

 I think hierarchical data really only scratches the surface of the problem.
 It would be nice to be able to specify all sorts of context for searches:

   * foo after bar
   * foo near bar
   * foo and bar in the same paragraph
   * foo as a parent/child/ancestor/descendent/sibling/cousin of bar

I wonder whether you are deliberately describing XPath here? :-)

Nicolas



Re: [HACKERS] Native XML

2011-03-01 Thread Robert Haas
On Mon, Feb 28, 2011 at 6:54 PM, Andrew Dunstan and...@dunslane.net wrote:
 There seems to be an almost universal assumption that storing XML in its
 native form (i.e. a text stream) is going to produce inefficient results.
 Maybe it will, but I think it needs to be fairly convincingly demonstrated.
 And then we would have to consider the costs. For example, unless we
 implemented our own XPath processor to work with our own XML format (do we
 really want to do that?), to evaluate an XPath expression for a piece of XML
 we'd actually need to produce the text format from our internal format
 before passing it to some external library to parse into its internal format
 and then process the XPath expression. That means we'd actually be making
 things worse, not better. But this is clearly the sort of processing people
 want to do - see today's discussion upthread about xpath_table.

Well, obviously the only point of having our own internal format is if
we have our own xpath processor &c to match.  One would think that
this would be a lot faster than parsing the string with libxml2 every
time we want to xpath it, especially for large documents.  But then
again, I haven't seen any benchmarks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Native XML

2011-03-01 Thread Andrew Dunstan



On 03/01/2011 08:16 AM, Robert Haas wrote:
 On Mon, Feb 28, 2011 at 6:54 PM, Andrew Dunstan and...@dunslane.net wrote:
  There seems to be an almost universal assumption that storing XML in its
  native form (i.e. a text stream) is going to produce inefficient results.
  Maybe it will, but I think it needs to be fairly convincingly demonstrated.
  And then we would have to consider the costs. For example, unless we
  implemented our own XPath processor to work with our own XML format (do we
  really want to do that?), to evaluate an XPath expression for a piece of XML
  we'd actually need to produce the text format from our internal format
  before passing it to some external library to parse into its internal format
  and then process the XPath expression. That means we'd actually be making
  things worse, not better. But this is clearly the sort of processing people
  want to do - see today's discussion upthread about xpath_table.

 Well, obviously the only point of having our own internal format is if
 we have our own xpath processor &c to match.  One would think that
 this would be a lot faster than parsing the string with libxml2 every
 time we want to xpath it, especially for large documents.  But then
 again, I haven't seen any benchmarks.



That would be a huge body of code we'd need to maintain, complex and
full of subtleties which, if we weren't deeply invested in the XML
standards, would bite us, I have no doubt.


Now, if someone wanted to start a project that added efficient 
serialization/de-serialization of libxml2 (or other library) objects so 
we could avoid constant parsing overhead, that would make lots more 
sense to me.


cheers

andrew





Re: [HACKERS] Native XML

2011-03-01 Thread Kevin Grittner
Andrew Dunstan and...@dunslane.net wrote:
 On 02/28/2011 05:28 PM, Kevin Grittner wrote:
 Anton antonin.hou...@gmail.com wrote:

 it was actually the focal point of my considerations: whether to
 store plain text or 'something else'.
 
 There seems to be an almost universal assumption that storing XML
 in its native form (i.e. a text stream) is going to produce
 inefficient results.
 
Well, certainly not in all cases.  Finding those rows which satisfy
an XPath search among a few million rows with 20KB XML fields might
benefit from some sort of indexing, though.
 
 unless we implemented our own XPath processor to work with our own
 XML format (do we really want to do that?), to evaluate an XPath
 expression for a piece of XML we'd actually need to produce the
 text format from our internal format before passing it to some
 external library to parse into its internal format and then
 process the XPath expression.
 
My suggestion was that you would store the text format, and allow
the developer to create a sharded format in a different column with
a different type if desired, not the other way around.  As I said,
similar to what a developer would do for tsvector to allow text
searches.  I agree that creating the text from an internal format
doesn't sound good.
 
 Given that there were similar issues for other hierarchical data
 types, perhaps we need something similar to tsvector, but for
 hierarchical data.  The extra layer of abstraction might not cost
 much when used for XML compared to the possible benefit with
 other data.  It seems likely to be a very nice fit with GiST
 indexes.

 So under this idea, you would always have the text (or maybe byte
 array?) version of the XML, and you could shard it to a
 separate column for fast searches.
 
 Tsearch should be able to handle XML now. It certainly knows how
 to recognize XML tags.
 
I apparently didn't express myself very well, since you seem to have
*completely* missed my point.  I know we can do tsearch2 searches
against XML, or JSON, or YAML, or (insert next week's new favorite
format here).  What we can't currently do efficiently is search for
particular values in some particular place in the hierarchy of a
document.  I've had loads of fun approximating it with regular
expressions, but some days I'd like life to be easier.
 
What I was arguing for is a new type which would represent the
structure in a fashion which was independent of the particular text
format and was efficient to traverse hierarchically.  Done right,
that would map well to GiST.  Although, thinking about that some
more, perhaps there would be a way to create a GiST index suitable
for that straight from the XML text, and avoid the sharded column. 
A GiST index actually seems pretty close to what such a structure
would look like anyway
 
-Kevin



Re: [HACKERS] Native XML

2011-03-01 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 I apparently didn't express myself very well, since you seem to have
 *completely* missed my point.  I know we can do tsearch2 searches
 against XML, or JSON, or YAML, or (insert next week's new favorite
 format here).  What we can't currently do efficiently is search for
 particular values in some particular place in the hierarchy of a
 document.  I've had loads of fun approximating it with regular
 expressions, but some days I'd like life to be easier.

Check.
 
 What I was arguing for is a new type which would represent the
 structure in a fashion which was independent of the particular text
 format and was efficient to traverse hierarchically.  Done right,
 that would map well to GiST.  Although, thinking about that some
 more, perhaps there would be a way to create a GiST index suitable
 for that straight from the XML text, and avoid the sharded column. 
 A GiST index actually seems pretty close to what such a structure
 would look like anyway

FWIW, GIN might be a more natural match, at least for the cases where
place in the document has a scalar value.  If you need to search for
place with something other than equality or prefix match semantics,
maybe not.

But in any case I think your point is that this is an indexing problem,
and whether the full document in the table column is pre-parsed or not
isn't all that relevant for performance.  I agree.  tsearch2 is really a
precedent for your argument, not a distinct approach, because it doesn't
expect pre-parsed text columns either.
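
As a toy illustration of treating this as an indexing problem, the sketch below (Python, standard library only) leaves each document as plain text and builds a separate inverted index keyed on (place in the document, scalar value), which is roughly the shape of lookup that equality semantics would need. All names and the sample documents are hypothetical, and this is only an in-memory analogy, not a description of how a GIN operator class would actually be written.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def path_value_keys(xml_text):
    """Yield a (path, text) key for every element that carries text."""
    root = ET.fromstring(xml_text)

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            yield (path, text)
        for child in node:
            yield from walk(child, path)

    yield from walk(root, "")


def build_index(docs):
    """Map each (path, value) key to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for key in path_value_keys(xml_text):
            index[key].add(doc_id)
    return index


# The stored column stays plain text; only the index is derived from it.
docs = {
    1: "<article><author>tom</author><title>GIN</title></article>",
    2: "<article><author>kevin</author><title>GiST</title></article>",
    3: "<note><author>tom</author></note>",
}
index = build_index(docs)

# "author = tom, but only inside an article" without scanning every document.
hits = index.get(("/article/author", "tom"), set())
print(sorted(hits))
```

The documents are parsed once, when the index entry is built, and queries on a known place in the hierarchy never touch the stored text at all.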

regards, tom lane



Re: [HACKERS] Native XML

2011-03-01 Thread Andrew Dunstan



On 03/01/2011 02:15 PM, Kevin Grittner wrote:
   Given that there were similar issues for other hierarchical data
   types, perhaps we need something similar to tsvector, but for
   hierarchical data.  The extra layer of abstraction might not cost
   much when used for XML compared to the possible benefit with
   other data.  It seems likely to be a very nice fit with GiST
   indexes.

   So under this idea, you would always have the text (or maybe byte
   array?) version of the XML, and you could shard it to a
   separate column for fast searches.

  Tsearch should be able to handle XML now. It certainly knows how
  to recognize XML tags.

 I apparently didn't express myself very well, since you seem to have
 *completely* missed my point.  I know we can do tsearch2 searches
 against XML, or JSON, or YAML, or (insert next week's new favorite
 format here).  What we can't currently do efficiently is search for
 particular values in some particular place in the hierarchy of a
 document.  I've had loads of fun approximating it with regular
 expressions, but some days I'd like life to be easier.

 What I was arguing for is a new type which would represent the
 structure in a fashion which was independent of the particular text
 format and was efficient to traverse hierarchically.  Done right,
 that would map well to GiST.  Although, thinking about that some
 more, perhaps there would be a way to create a GiST index suitable
 for that straight from the XML text, and avoid the sharded column.
 A GiST index actually seems pretty close to what such a structure
 would look like anyway




I probably didn't read your suggestion closely enough.


I think hierarchical data really only scratches the surface of the 
problem. It would be nice to be able to specify all sorts of context for 
searches:


   * foo after bar
   * foo near bar
   * foo and bar in the same paragraph
   * foo as a parent/child/ancestor/descendent/sibling/cousin of bar


cheers

andrew



Re: [HACKERS] Native XML

2011-02-28 Thread Anton
On 02/27/2011 11:57 PM, Peter Eisentraut wrote:
 On Sun, 2011-02-27 at 10:45 -0500, Tom Lane wrote:
   
 Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
 that library has caused us, getting out from under it seems like a
 mighty attractive idea.
 
 This doesn't replace the existing xml functionality, so it won't help
 getting rid of libxml.

   
Right, what I published on github.com doesn't replace the libxml2
functionality, and I didn't say it does at this moment. The idea is to
design (or rather start designing) a low-level XML API on which SQL/XML
functionality can be based. As long as XSLT can be considered a sort of
separate topic, Postgres uses a very small subset of what libxml2
offers, and thus it might not be that difficult to implement the same
level of functionality in a new way.

In addition, I think that using a low-level API that the Postgres
development team fully controls would speed up enhancements of the XML
functionality in the future. When I thought of implementing some
functionality listed on the official TODO, I was a little bit
discouraged by the workarounds that need to be added in order to deal
with libxml2 memory management. Also, parsing the document each time it's
accessed (which involves parser initialization and finalization) is
neither convenient nor, in the end, efficient.

A question is, of course, whether a potential new implementation must
necessarily replace the existing one, immediately or at all. What I
published is implemented as a new data type, and thus pg_type.h and
pg_proc.h are the only files where something needs to be merged. From a
technical point of view, the new type can easily co-exist with the
existing one.

This however raises the question of whether such co-existence (whether
temporary or permanent) would be acceptable for users, i.e. whether it
would bring some, or even significant, confusion. That's something I'm
not able to answer.




Re: [HACKERS] Native XML

2011-02-28 Thread Andrew Dunstan



On 02/28/2011 04:25 AM, Anton wrote:
 On 02/27/2011 11:57 PM, Peter Eisentraut wrote:
  On Sun, 2011-02-27 at 10:45 -0500, Tom Lane wrote:
   Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
   that library has caused us, getting out from under it seems like a
   mighty attractive idea.

  This doesn't replace the existing xml functionality, so it won't help
  getting rid of libxml.

 Right, what I published on github.com doesn't replace the libxml2
 functionality, and I didn't say it does at this moment. The idea is to
 design (or rather start designing) a low-level XML API on which SQL/XML
 functionality can be based. As long as XSLT can be considered a sort of
 separate topic, Postgres uses a very small subset of what libxml2
 offers, and thus it might not be that difficult to implement the same
 level of functionality in a new way.

 In addition, I think that using a low-level API that the Postgres
 development team fully controls would speed up enhancements of the XML
 functionality in the future. When I thought of implementing some
 functionality listed on the official TODO, I was a little bit
 discouraged by the workarounds that need to be added in order to deal
 with libxml2 memory management. Also, parsing the document each time it's
 accessed (which involves parser initialization and finalization) is
 neither convenient nor, in the end, efficient.

 A question is, of course, whether a potential new implementation must
 necessarily replace the existing one, immediately or at all. What I
 published is implemented as a new data type, and thus pg_type.h and
 pg_proc.h are the only files where something needs to be merged. From a
 technical point of view, the new type can easily co-exist with the
 existing one.

 This however raises the question of whether such co-existence (whether
 temporary or permanent) would be acceptable for users, i.e. whether it
 would bring some, or even significant, confusion. That's something I'm
 not able to answer.



The only reason we need the XML stuff in core at all and not in a 
separate module is because of the odd syntax requirements of SQL/XML. 
But those operators work on the xml type, and not on any new type you 
might invent.


Which TODO items were you trying to implement? And what were the blockers?

We really can't just consider XSLT, and more importantly XPath, as 
separate topics. Any alternative XML implementation that doesn't include 
XPath is going to be unacceptably incomplete, IMNSHO.


cheers

andrew





Re: [HACKERS] Native XML

2011-02-28 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/28/2011 04:25 AM, Anton wrote:
 A question is, of course, whether a potential new implementation must
 necessarily replace the existing one, immediately or at all. What I
 published is implemented as a new data type, and thus pg_type.h and
 pg_proc.h are the only files where something needs to be merged. From a
 technical point of view, the new type can easily co-exist with the
 existing one.
 
 This however raises the question of whether such co-existence (whether
 temporary or permanent) would be acceptable for users, i.e. whether it
 would bring some, or even significant, confusion. That's something I'm
 not able to answer.

 The only reason we need the XML stuff in core at all and not in a 
 separate module is because of the odd syntax requirements of SQL/XML. 
 But those operators work on the xml type, and not on any new type you 
 might invent.

Well, in principle we could allow them to work on both, just the same
way that (for instance) + is a standardized operator but works on more
than one datatype.  But I agree that the prospect of two parallel types
with essentially duplicate functionality isn't pleasing at all.

I think a reasonable path forwards for this work would be to develop and
extend the non-libxml-based type as an extension, outside of core, with
the idea that it might replace the core implementation if it ever gets
complete enough.  The main thing that that would imply that you might
not bother with otherwise is an ability to deal with existing
plain-text-style stored values.  This doesn't seem terribly hard to do
IMO --- one easy way would be to insert an initial zero byte in all
new-style values as a flag to distinguish them from old-style.  The
forced parsing that would occur to deal with an old-style value would be
akin to detoasting and could be hidden in the same access macros.
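
A minimal sketch of that flag-byte idea, in Python for brevity: well-formed XML text cannot contain a zero byte, so (assuming the stored text is in a byte-oriented encoding such as UTF-8) a leading zero byte is enough to tell the two representations apart. The function names are hypothetical, and the "new-style" payload here is just a stand-in; the real format and the real access macros are not specified in this thread.

```python
import xml.etree.ElementTree as ET

NEW_STYLE_FLAG = b"\x00"  # cannot be the first byte of old-style XML text


def encode_new_style(payload: bytes) -> bytes:
    """Prefix a new-style (predigested) payload with the flag byte."""
    return NEW_STYLE_FLAG + payload


def is_new_style(stored: bytes) -> bool:
    """True when the stored value carries the new-style flag."""
    return stored[:1] == NEW_STYLE_FLAG


def load(stored: bytes, parse_text, decode_binary):
    """Single access path: callers never need to know which form was stored."""
    if is_new_style(stored):
        return decode_binary(stored[1:])
    # Old-style value: forced parse, comparable to detoasting on access.
    return parse_text(stored.decode("utf-8"))


old_value = b"<a><b/></a>"
# Stand-in payload: a real implementation would store a binary tree format here.
new_value = encode_new_style(b"<a><b/></a>")

for value in (old_value, new_value):
    root = load(value, ET.fromstring, ET.fromstring)
    print(is_new_style(value), root.tag)
```

Because the check sits inside the single load path, existing plain-text values keep working untouched and only pay the parse cost when they are actually accessed.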

 We really can't just consider XSLT, and more importantly XPath, as 
 separate topics. Any alternative XML implementation that doesn't include 
 XPath is going to be unacceptably incomplete, IMNSHO.

Agreed.  The single most pressing problem we've got with XML right now
is the poor state of the XPath extensions in contrib/xml2.  If we don't
see a meaningful step forward in that area, a new implementation of the
xml datatype isn't likely to win acceptance.

regards, tom lane



Re: [HACKERS] Native XML

2011-02-28 Thread Robert Haas
On Sun, Feb 27, 2011 at 10:20 PM, Andrew Dunstan and...@dunslane.net wrote:
 No, I think the xpath implementation is from libxml2. But in any case, I
 think the problem is in the whole design of the xpath_table function, and
 not in the library used for running the xpath queries, i.e. it's our fault,
 and not the library's. (mutters about workmen and tools)

Yeah, I think the problem is that we picked a poor definition for the
xpath_table() function.  That poor definition will be equally capable
of causing us headaches on top of any other implementation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Native XML

2011-02-28 Thread Andrew Dunstan



On 02/28/2011 10:30 AM, Tom Lane wrote:
 The single most pressing problem we've got with XML right now
 is the poor state of the XPath extensions in contrib/xml2.  If we don't
 see a meaningful step forward in that area, a new implementation of the
 xml datatype isn't likely to win acceptance.




xpath_table is severely broken by design IMNSHO. We need a new design, 
but I'm reluctant to work on that until someone does LATERAL, because a 
replacement would be much nicer to design with it than without it.


But I don't believe replacing the underlying XML/XPath implementation 
would help us fix it at all.


cheers

andrew



Re: [HACKERS] Native XML

2011-02-28 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 xpath_table is severely broken by design IMNSHO. We need a new design, 
 but I'm reluctant to work on that until someone does LATERAL, because a 
 replacement would be much nicer to design with it than without it.

Well, maybe I'm missing something, but I don't really understand why
xpath_table's design is so unreasonable.  Also, what would a better
solution look like exactly?  (Feel free to assume LATERAL is available.)

regards, tom lane



Re: [HACKERS] Native XML

2011-02-28 Thread Robert Haas
On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, in principle we could allow them to work on both, just the same
 way that (for instance) + is a standardized operator but works on more
 than one datatype.  But I agree that the prospect of two parallel types
 with essentially duplicate functionality isn't pleasing at all.

The real issue here is whether we want to store XML as text (as we do
now) or as some predigested form which would make output the whole
thing slower but speed up things like xpath lookups.  We had the same
issue with JSON, and due to the uncertainty about which way to go with
it we ended up integrating nothing into core at all.  It's really not
clear that there is one way of doing this that is right for all use
cases.  If you are storing xml in an xml column just to get it
validated, and doing no processing in the DB, then you'd probably
prefer our current representation.  If you want to build functional
indexes on xpath expressions, and then run queries that extract data
using other xpath expressions, you would probably prefer the other
representation.
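
To make the trade-off concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is only an illustration of the two storage strategies, not PostgreSQL code: the class names TextXml and ParsedXml and the sample document are hypothetical, and a real predigested type would use its own on-disk format rather than an in-memory tree.

```python
import xml.etree.ElementTree as ET


class TextXml:
    """Keeps the document as text: output is free, every lookup re-parses."""

    def __init__(self, text):
        ET.fromstring(text)  # validate on input, then discard the tree
        self.text = text

    def output(self):
        return self.text

    def lookup(self, path):
        return [e.text for e in ET.fromstring(self.text).findall(path)]


class ParsedXml:
    """Keeps a parsed tree: lookups skip parsing, output must re-serialize."""

    def __init__(self, text):
        self.root = ET.fromstring(text)

    def output(self):
        return ET.tostring(self.root, encoding="unicode")

    def lookup(self, path):
        return [e.text for e in self.root.findall(path)]


# Hypothetical sample document.
doc = "<article><title>First</title><pages>10</pages></article>"
as_text, as_tree = TextXml(doc), ParsedXml(doc)
```

Both objects answer the same lookups; they differ only in where the parsing or serializing cost is paid, which is why neither form is obviously right for every workload.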

I tend to think that it would be useful to have both text and
predigested types for both XML and JSON, but I am not too eager to
begin integrating more stuff into core or contrib until it spends some
time on pgfoundry or github or wherever people publish their
PostgreSQL extensions these days and we have a few users prepared to
testify to its awesomeness.

In any case, the definitional problems with xpath_table(), and/or the
memory management problems with libxml2, are not the basis on which we
should be making this decision.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Native XML

2011-02-28 Thread Andrew Dunstan



On 02/28/2011 10:51 AM, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
 xpath_table is severely broken by design IMNSHO. We need a new design,
 but I'm reluctant to work on that until someone does LATERAL, because a
 replacement would be much nicer to design with it than without it.

 Well, maybe I'm missing something, but I don't really understand why
 xpath_table's design is so unreasonable.  Also, what would a better
 solution look like exactly?  (Feel free to assume LATERAL is available.)

What's unreasonable about it is that the supplied paths are independent 
of each other, and evaluated in the context of the entire XML document.


Let's take the given example in the docs, changed slightly to assume 
each piece of XML can have more than one article listing in it (i.e., 
'article' is not the root node of the document):


   SELECT * FROM
   xpath_table('article_id',
               'article_xml',
               'articles',
               '//article/author|//article/pages|//article/title',
               'date_entered > ''2003-01-01'' ')
   AS t(article_id integer, author text, page_count integer, title text);

There is nothing that says that the author has to come from the same 
article as the title, nor is there any way of saying that they must. If 
an article node is missing author or pages or title, or has more than 
one where its siblings do not, they will line up wrongly.
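
The misalignment can be reproduced outside the database. The following is a minimal sketch in Python with the standard xml.etree.ElementTree module; it only mimics the positional pairing of independently evaluated paths described above and is not xpath_table's actual code. The sample document and the helper name pad are made up for the illustration, and './/article/...' stands in for '//article/...' because ElementTree accepts only relative paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second article has no author element.
DOC = """
<articles>
  <article><author>Ann</author><pages>10</pages><title>First</title></article>
  <article><pages>20</pages><title>Second</title></article>
  <article><author>Cy</author><pages>30</pages><title>Third</title></article>
</articles>
"""

root = ET.fromstring(DOC)

# Each path is evaluated against the whole document, independently of the others.
authors = [e.text for e in root.findall(".//article/author")]
pages = [e.text for e in root.findall(".//article/pages")]
titles = [e.text for e in root.findall(".//article/title")]


def pad(values, width):
    """Extend a result list with None so that all columns have equal length."""
    return values + [None] * (width - len(values))


# Pair the results up purely by position.
width = max(len(authors), len(pages), len(titles))
rows = list(zip(pad(authors, width), pad(pages, width), pad(titles, width)))

for row in rows:
    print(row)
```

The second row pairs the author of the third article with the title of the second one, and the third row has no author at all, because nothing ties the three result lists to a common article node.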


An alternative would be to supply a single xpath expression that would 
specify the context nodes to be iterated over (in this case that would 
be '//article') and a set of xpath expressions to be evaluated in the 
context of those nodes (in this case 'author|pages|title' or better 
yet, supply these as a text array). We'd produce exactly one row for 
each node found by the context expression, and take the first value 
found by each of the column expressions in that context (or we could 
error out if we found more than one, or supply an array if the result 
field is an array). So with LATERAL taking care of the rest, the 
function signature could be something like:


   xpath_table_new(
       doc xml,
       context_xpath text,
       column_xpath text[])
   returns setof record


Given this, you could not get a row with title and author from different 
article nodes in the source document like you can now.
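
As a rough sketch of the intended semantics (not an implementation proposal), the same idea can be written in Python with the standard xml.etree.ElementTree module. The function reuses the name xpath_table_new from the signature above, the helper first_text and the sample document are hypothetical, and it takes the "first value found" option for each column. ElementTree needs the relative form './/article' where the text above says '//article'.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second article has no author element.
DOC = """
<articles>
  <article><author>Ann</author><pages>10</pages><title>First</title></article>
  <article><pages>20</pages><title>Second</title></article>
  <article><author>Cy</author><pages>30</pages><title>Third</title></article>
</articles>
"""


def first_text(node, path):
    """Return the text of the first match of path under node, or None."""
    match = node.find(path)
    return None if match is None else match.text


def xpath_table_new(doc, context_xpath, column_xpaths):
    """Yield exactly one row per context node found by context_xpath."""
    root = ET.fromstring(doc)
    for node in root.findall(context_xpath):
        # Column paths are evaluated relative to the context node only.
        yield tuple(first_text(node, path) for path in column_xpaths)


rows = list(xpath_table_new(DOC, ".//article", ["author", "pages", "title"]))

for row in rows:
    print(row)
```

Here the article without an author yields a row with a missing author instead of shifting the later authors up by one.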


cheers

andrew




Re: [HACKERS] Native XML

2011-02-28 Thread Anton
On 02/28/2011 05:23 PM, Robert Haas wrote:
 On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
   
 Well, in principle we could allow them to work on both, just the same
 way that (for instance) + is a standardized operator but works on more
 than one datatype.  But I agree that the prospect of two parallel types
 with essentially duplicate functionality isn't pleasing at all.
 
 The real issue here is whether we want to store XML as text (as we do
 now) or as some predigested form which would make output the whole
 thing slower but speed up things like xpath lookups.  We had the same
 issue with JSON, and due to the uncertainty about which way to go with
 it we ended up integrating nothing into core at all.  It's really not
 clear that there is one way of doing this that is right for all use
 cases.  If you are storing xml in an xml column just to get it
 validated, and doing no processing in the DB, then you'd probably
 prefer our current representation.  If you want to build functional
 indexes on xpath expressions, and then run queries that extract data
 using other xpath expressions, you would probably prefer the other
 representation.
   
Yes, it was actually the focal point of my considerations: whether to
store plain text or 'something else'.
It's interesting to know that such uncertainty already existed in
another area. Maybe it's specific to other open source projects too...

 I tend to think that it would be useful to have both text and
 predigested types for both XML and JSON, but I am not too eager to
 begin integrating more stuff into core or contrib until it spends some
 time on pgfoundry or github or wherever people publish their
 PostgreSQL extensions these days and we have a few users prepared to
 testify to its awesomeness.
   
It definitely makes sense to develop this new functionality separately for
some time.
It's kind of exciting to develop something new, but spending significant
effort on the 'native XML' probably needs a bit higher level of consensus
than what appeared in this discussion. In that context, the remark about
users and their needs is something that I can't ignore.

Thanks to all for contributions to this discussion.

 In any case, the definitional problems with xpath_table(), and/or the
 memory management problems with libxml2, are not the basis on which we
 should be making this decision.

   



Re: [HACKERS] Native XML

2011-02-28 Thread Kevin Grittner
Anton antonin.hou...@gmail.com wrote:
 
 it was actually the focal point of my considerations: whether to
 store plain text or 'something else'.
 
Given that there were similar issues for other hierarchical data
types, perhaps we need something similar to tsvector, but for
hierarchical data.  The extra layer of abstraction might not cost
much when used for XML compared to the possible benefit with other
data.  It seems likely to be a very nice fit with GiST indexes.
 
So under this idea, you would always have the text (or maybe byte
array?) version of the XML, and you could shard it to a separate
column for fast searches.
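
One way to picture this, as a loose analogy to how a tsvector is derived from text: compute a set of path/value tokens from the XML text and keep it next to the original. The sketch below is Python with the standard xml.etree.ElementTree module; the function name path_tokens, the column names doc and doc_paths, and the 'path=value' token format are all assumptions made for the illustration, and it says nothing about how a GiST opclass over such tokens would actually work.

```python
import xml.etree.ElementTree as ET


def path_tokens(xml_text):
    """Return the sorted, de-duplicated 'path=value' tokens for leaf elements."""
    tokens = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        children = list(elem)
        # Only leaf elements with non-blank text contribute a token.
        if not children and elem.text and elem.text.strip():
            tokens.add(f"{path}={elem.text.strip()}")
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return sorted(tokens)


# The text stays authoritative; the token list is derived and searchable.
doc = "<article><author>Ann</author><pages>10</pages></article>"
row = {"doc": doc, "doc_paths": path_tokens(doc)}
```

The original text is never rewritten, so output stays cheap, while searches can consult the derived column first.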
 
-Kevin



Re: [HACKERS] Native XML

2011-02-28 Thread Andrew Dunstan



On 02/28/2011 05:28 PM, Kevin Grittner wrote:
 Anton antonin.hou...@gmail.com wrote:
 it was actually the focal point of my considerations: whether to
 store plain text or 'something else'.

There seems to be an almost universal assumption that storing XML in its 
native form (i.e. a text stream) is going to produce inefficient 
results. Maybe it will, but I think it needs to be fairly convincingly 
demonstrated. And then we would have to consider the costs. For example, 
unless we implemented our own XPath processor to work with our own XML 
format (do we really want to do that?), to evaluate an XPath expression 
for a piece of XML we'd actually need to produce the text format from 
our internal format before passing it to some external library to parse 
into its internal format and then process the XPath expression. That 
means we'd actually be making things worse, not better. But this is 
clearly the sort of processing people want to do - see today's 
discussion upthread about xpath_table.


I'm still waiting to hear what it is that the OP is finding hard to do 
because we use libxml2.




 Given that there were similar issues for other hierarchical data
 types, perhaps we need something similar to tsvector, but for
 hierarchical data.  The extra layer of abstraction might not cost
 much when used for XML compared to the possible benefit with other
 data.  It seems likely to be a very nice fit with GiST indexes.

 So under this idea, you would always have the text (or maybe byte
 array?) version of the XML, and you could shard it to a separate
 column for fast searches.

Tsearch should be able to handle XML now. It certainly knows how to 
recognize XML tags.


cheers

andrew



Re: [HACKERS] Native XML

2011-02-27 Thread Tom Lane
Anton antonin.hou...@gmail.com writes:
 I've been playing with 'native XML' for a while and now wondering if
 further development of such a feature makes sense for Postgres.
 ...
 Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
 independent from any 3rd party code.

Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea.  How big a chunk of code do you think it'd be
by the time you complete the missing features?

regards, tom lane



Re: [HACKERS] Native XML

2011-02-27 Thread Andrew Dunstan



On 02/27/2011 10:45 AM, Tom Lane wrote:
 Anton antonin.hou...@gmail.com writes:
 I've been playing with 'native XML' for a while and now wondering if
 further development of such a feature makes sense for Postgres.
 ...
 Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
 independent from any 3rd party code.

 Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
 that library has caused us, getting out from under it seems like a
 mighty attractive idea.  How big a chunk of code do you think it'd be
 by the time you complete the missing features?

TBH, by the time it does all the things that libxml2, and libxslt, which 
depends on it, do for us, I think it will be huge. Do we really want to 
be maintaining a complete xpath and xslt implementation? I think that's 
likely to be a waste of our scarce resources.


I use Postgres' XML functionality a lot, so I'm all in favor of 
improving it, but rolling our own doesn't seem like the best way to go.


As for the pain, we seem to be over the worst of it, AFAICT. It would be 
nice to move the remaining pieces of the xml2 contrib module into the core.


cheers

andrew



Re: [HACKERS] Native XML

2011-02-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/27/2011 10:45 AM, Tom Lane wrote:
 Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
 that library has caused us, getting out from under it seems like a
 mighty attractive idea.  How big a chunk of code do you think it'd be
 by the time you complete the missing features?

 TBH, by the time it does all the things that libxml2, and libxslt, which 
 depends on it, do for us, I think it will be huge. Do we really want to 
 be maintaining a complete xpath and xslt implementation? I think that's 
 likely to be a waste of our scarce resources.

Well, that's why I asked --- if it's going to be a huge chunk of code,
then I agree this is the wrong path to pursue.  However, I do feel that
libxml pretty well sucks, so if we could replace it with a relatively
small amount of code, that might be the right thing to do.

 I use Postgres' XML functionality a lot, so I'm all in favor of 
 improving it, but rolling our own doesn't seem like the best way to go.

 As for the pain, we seem to be over the worst of it, AFAICT.

No, because the xpath stuff is fundamentally broken, and nobody seems to
know how to make libxslt do what we actually need.  See the open bugs
on the TODO list.

regards, tom lane



Re: [HACKERS] Native XML

2011-02-27 Thread David E. Wheeler
On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:

 Well, that's why I asked --- if it's going to be a huge chunk of code,
 then I agree this is the wrong path to pursue.  However, I do feel that
 libxml pretty well sucks, so if we could replace it with a relatively
 small amount of code, that might be the right thing to do.

I think that XML parsers must be hard to get really right, because of all those 
I've used in Perl, XML::LibXML is far and away the best. Its docs suck, but it 
does the work really well.

 No, because the xpath stuff is fundamentally broken, and nobody seems to
 know how to make libxslt do what we actually need.  See the open bugs
 on the TODO list.

XPath is broken? I use it heavily in the Perl module Test::XPath and now, in 
PostgreSQL, with my explanation extension.

  http://github.com/theory/explanation/

Is this something I need to worry about?

Best,

David




Re: [HACKERS] Native XML

2011-02-27 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:
 No, because the xpath stuff is fundamentally broken, and nobody seems to
 know how to make libxslt do what we actually need.  See the open bugs
 on the TODO list.

 XPath is broken? I use it heavily in the Perl module Test::XPath and now, in 
 PostgreSQL, with my explanation extension.

Well, if you're only using cases that work, you don't need to worry.

regards, tom lane



Re: [HACKERS] Native XML

2011-02-27 Thread Mike Fowler

On 27/02/11 19:37, David E. Wheeler wrote:
 On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:
 Well, that's why I asked --- if it's going to be a huge chunk of code,
 then I agree this is the wrong path to pursue.  However, I do feel that
 libxml pretty well sucks, so if we could replace it with a relatively
 small amount of code, that might be the right thing to do.

 I think that XML parsers must be hard to get really right, because of all those 
 I've used in Perl, XML::LibXML is far and away the best. Its docs suck, but it 
 does the work really well.

 No, because the xpath stuff is fundamentally broken, and nobody seems to
 know how to make libxslt do what we actually need.  See the open bugs
 on the TODO list.

 XPath is broken? I use it heavily in the Perl module Test::XPath and now, in 
 PostgreSQL, with my explanation extension.

   http://github.com/theory/explanation/

 Is this something I need to worry about?

I don't believe that XPath is fundamentally broken, but I think Tom 
may have meant xslt. When reviewing a recent patch to xml2/xslt I found 
a few bugs in the way we're using libxslt, as well as a bug in the 
library itself (see 
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).


However if Tom does mean that xpath is the culprit, it may be with the 
way the libxml2 library works. It's a very messy singleton. If I'm 
wrong, I'm sure I'll be corrected!


Regards,
--
Mike Fowler
Registered Linux user: 379787




Re: [HACKERS] Native XML

2011-02-27 Thread David E. Wheeler
On Feb 27, 2011, at 11:43 AM, Tom Lane wrote:

 XPath is broken? I use it heavily in the Perl module Test::XPath and now, in 
 PostgreSQL, with my explanation extension.
 
 Well, if you're only using cases that work, you don't need to worry.

Okay then.

David




Re: [HACKERS] Native XML

2011-02-27 Thread Tom Lane
Mike Fowler m...@mlfowler.com writes:
 I don't believe that XPath is fundamentally broken, but I think Tom 
 may have meant xslt. When reviewing a recent patch to xml2/xslt I found 
 a few bugs in the way we're using libxslt, as well as a bug in the 
 library itself (see 
 http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

The case that I don't think we have any idea how to solve is
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

Most of the other stuff on the TODO list looks like it just requires
application of round tuits, although some of it seems to me to reinforce
the thesis that libxml/libxslt don't do quite what we need.

regards, tom lane



Fwd: Re: [HACKERS] Native XML

2011-02-27 Thread Anton
Sorry for resending, I forgot to add 'pgsql-hackers' to CC.

-------- Original Message --------
Subject: Re: [HACKERS] Native XML
Date: Sun, 27 Feb 2011 23:18:03 +0100
From: Anton antonin.hou...@gmail.com
To: Tom Lane t...@sss.pgh.pa.us



On 02/27/2011 04:45 PM, Tom Lane wrote:
 Anton antonin.hou...@gmail.com writes:
   
 I've been playing with 'native XML' for a while and now wondering if
 further development of such a feature makes sense for Postgres.
 ...
 Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
 independent from any 3rd party code.
 
 Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
 that library has caused us, getting out from under it seems like a
 mighty attractive idea.  How big a chunk of code do you think it'd be
 by the time you complete the missing features?

   regards, tom lane
   
Right, no dependency, everything coded from scratch.
For the initial stable version, my plan is to make the parser conform to
the standard as much as possible and the same for XPath / XQuery.
(In all cases the question is which version of the standard to start at.)

Integration of SQL and XML data in queries is my primary interest. I
didn't really think to re-implement XSLT. For those who really need to
use XSLT functionality at the database level, can't the API be left for
optional installation?

Also I'm not sure if document validation is necessary for the initial
version - I still see a related item on the current TODO list.

Sincerely,
Tony



Re: [HACKERS] Native XML

2011-02-27 Thread Peter Eisentraut
On Sun, 2011-02-27 at 10:45 -0500, Tom Lane wrote:
 Hmm, so this doesn't rely on libxml2 at all?  Given the amount of pain
 that library has caused us, getting out from under it seems like a
 mighty attractive idea.

This doesn't replace the existing xml functionality, so it won't help
getting rid of libxml.




Re: [HACKERS] Native XML

2011-02-27 Thread Andrew Dunstan



On 02/27/2011 03:06 PM, Tom Lane wrote:
 Mike Fowler m...@mlfowler.com writes:
 I don't believe that XPath is fundamentally broken, but I think Tom
 may have meant xslt. When reviewing a recent patch to xml2/xslt I found
 a few bugs in the way we're using libxslt, as well as a bug in the
 library itself (see
 http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

 The case that I don't think we have any idea how to solve is
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

I'd forgotten about this. But as ugly as it is, I don't think it's 
libxml2's fault.



cheers

andrew



Re: [HACKERS] Native XML

2011-02-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/27/2011 03:06 PM, Tom Lane wrote:
 The case that I don't think we have any idea how to solve is
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

 I'd forgotten about this. But as ugly as it is, I don't think it's 
 libxml2's fault.

Well, strictly speaking it's libxslt's fault, no?  But AFAIK those two
things are a package.

regards, tom lane



Re: [HACKERS] Native XML

2011-02-27 Thread Andrew Dunstan



On 02/27/2011 10:07 PM, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 02/27/2011 03:06 PM, Tom Lane wrote:
 The case that I don't think we have any idea how to solve is
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

 I'd forgotten about this. But as ugly as it is, I don't think it's
 libxml2's fault.

 Well, strictly speaking it's libxslt's fault, no?  But AFAIK those two
 things are a package.

No, I think the xpath implementation is from libxml2. But in any case, I 
think the problem is in the whole design of the xpath_table function, 
and not in the library used for running the xpath queries. i.e. it's our 
fault, and not the libraries'. (mutters about workmen and tools)


cheers

andrew



Re: [HACKERS] Native XML

2011-02-26 Thread Josh Berkus
On 2/26/11 3:40 PM, Anton wrote:
 I've been playing with 'native XML' for a while and now wondering if
 further development of such a feature makes sense for Postgres.
 (By not having brought this up earlier I'm taking the chance that the
 effort will be wasted, but that's not something you should worry about.)

Nah, just if you don't get any feedback, bring it up again in June when
9.2 development officially starts.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com
