Re: [xml] xml Digest, Vol 60, Issue 3

D Kimmel Mon, 06 Apr 2009 22:06:08 -0700

Michael, I wanted to avoid using CDATA because what I have *IS* valid XML;
I just didn't want my first/top-level parser to worry about subelements.
I liked the idea you presented and looked into it, but I don't think it will
quite do what I want, since I don't want to abort parsing or skip the
subelements but rather forward them on, in bulk, to a different parser.

  I decided to go the route of reconstituting the XML.  I added a switch
which keeps track of the node depth and starts buffering up chunks of
reconstructed XML.  When the buffer is full or the node depth returns to
the starting depth, the buffer is forwarded on.  I rationalized it with 1)
it's a good idea to make sure the XML stream is valid, 2) it's a good place
to strip out things that I don't care about forwarding, such as comments,
and 3) it's at least as fast as the other ways I process XML data through
lookup tables.  I'm at the throughput-assessment stage right now.  Thanks for
the info, Michael, and thanks, Daniel, for libxml2; it's very nice.
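
  For anyone reading along, here is a rough sketch of the depth-tracking
idea.  It is not the actual code: it uses Python's standard xml.sax module
instead of libxml2's xmlCreatePushParserCtxt/xmlParseChunk, and the names
SubtreeForwarder, forward_subtrees, start_depth and max_buffer are made up
for illustration.  Comments are dropped simply because a plain SAX content
handler never receives them.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class SubtreeForwarder(xml.sax.ContentHandler):
    """Rebuild the markup found below start_depth and hand it to forward()."""

    def __init__(self, forward, start_depth=1, max_buffer=4096):
        super().__init__()
        self.forward = forward          # callable that receives reconstructed XML text
        self.start_depth = start_depth  # depth the top-level parser handles itself
        self.max_buffer = max_buffer    # flush threshold, in characters
        self.depth = 0
        self.parts = []
        self.size = 0

    def _emit(self, text):
        self.parts.append(text)
        self.size += len(text)
        if self.size >= self.max_buffer:
            self._flush()               # buffer full: forward a partial chunk

    def _flush(self):
        if self.parts:
            self.forward("".join(self.parts))
            self.parts, self.size = [], 0

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth > self.start_depth:
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self._emit(f"<{name}{attr_text}>")

    def characters(self, content):
        if self.depth > self.start_depth:
            self._emit(escape(content))

    def endElement(self, name):
        if self.depth > self.start_depth:
            self._emit(f"</{name}>")
        self.depth -= 1
        if self.depth == self.start_depth:
            self._flush()               # back at the starting depth: forward the subtree


def forward_subtrees(chunks, forward, **kwargs):
    """Feed the stream chunk by chunk, as a push parser would receive it."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(SubtreeForwarder(forward, **kwargs))
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
```

  Because the parser still checks every byte, a malformed stream raises an
error before anything invalid is forwarded.  Note that empty elements come
out in the long form (<sub></sub>), which is equivalent XML.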

  I came across this by chance; it sounds like the exact opposite of what
I'm looking for: http://freshmeat.net/projects/xmldego
:-)

On Fri, Apr 3, 2009 at 5:00 AM, <[email protected]> wrote:

> Today's Topics:
>
>   1. Re: IO callbacks are not thread-safe (Daniel Veillard)
>   2. Re: serialize nodes returned by successive XPath evaluation;
>      preserving namespaces (Daniel Veillard)
>   3. Re: SAX question (Daniel Veillard)
>   4. Re: IO callbacks are not thread-safe (Petr Pajas)
>   5. Re: SAX question (Michael Ludwig)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 2 Apr 2009 17:09:26 +0200
> From: Daniel Veillard <[email protected]>
> Subject: Re: [xml] IO callbacks are not thread-safe
> To: Nick Wellnhofer <[email protected]>
> Cc: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=us-ascii
>
> On Thu, Mar 26, 2009 at 07:06:14PM +0100, Nick Wellnhofer wrote:
> >
> > The input and output callbacks of libxml are stored in static arrays in
> > xmlIO.c, so any use of the callback functions is not thread-safe.
> >
> > In many cases this shouldn't be a problem, if callbacks are registered
> > only at the start of a program. But the Perl bindings register and
> > unregister callbacks every time a document is parsed. I can reproduce
>
>  Uhhhh ????
> That sounds severely broken to me. Can you detail why, and how?
>
> > random segfaults or other errors when processing many thousand documents
> > in concurrent threads with the libxml Perl bindings.
> >
> > I'm willing to help fix this, but I'm not sure about the correct
> > approach. Should the callback arrays be added to the global variables in
> > globals.c?
>
> Those variables are not public, so I guess a different way would be
> preferable. Still I can't see any good valid reason to change the values
> all the time. Something is severely broken there in the perl bindings !
> If they need a per parsing instance processing they should use the data
> block provided by the I/O to make the switch, but register an unified
> routine for all threads. No really this doesn't make any sense to me,
> but maybe you can come up with a valid reason,
>
> Daniel
>
> --
> Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
> [email protected]  | Rpmfind RPM search engine http://rpmfind.net/
> http://veillard.com/ | virtualization library  http://libvirt.org/
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 2 Apr 2009 17:23:51 +0200
> From: Daniel Veillard <[email protected]>
> Subject: Re: [xml] serialize nodes returned by successive XPath
>        evaluation;     preserving namespaces
> To: Matt Magoffin <[email protected]>
> Cc: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=us-ascii
>
> On Mon, Mar 30, 2009 at 01:56:22PM +1300, Matt Magoffin wrote:
> > I'm trying to find the correct way to serialize the nodes returned in a
> > node list via an XPath evaluation and preserving the namespaces of the
> > source document. My problem originates in the XML support in PostgreSQL
> > (http://archives.postgresql.org//pgsql-bugs/2008-06/msg00124.php) which
> > shows a small test case... but in effect if I have a document like
> >
> > <a:foo xmlns:a="a:urn">
> >   <a:bar x="y">bar1</a:bar>
> >   <a:bar x="y">bar2</a:bar>
> > </a:foo>
> >
> > and I evaluate the XPath /a:foo/a:bar[1] (with the "a:urn" namespace
> > mapping registered) to get a single node
> >
> > <a:bar x="y">bar1</a:bar>
> >
> > I want to then be able to evaluate another XPath on that node like
> > /a:bar/@x and get a matching attribute @x.
> >
> > This second XPath evaluation is what is not working... but it _does_ work
> > if no namespaces are present in the source document.
> >
> > In the context of how PostgreSQL is using libxml, after the first XPath
> > evaluation it is serializing the results by calling xmlNodeDump() on each
> > node returned in the node list returned by the XPath evaluation. And
> > xmlNodeDump() is returning the string literal
> >
> > <a:bar x="y">bar1</a:bar>
> >
> > which does not have the "a:urn" namespace declaration as one might expect
> > (at least, for a document), e.g.
> >
> > <a:bar xmlns:a="a:urn" x="y">bar1</a:bar>
> >
> > Is there a way for xmlNodeDump(), or some other function, to serialize a
> > node such as this one in this latter way rather than the former?
>
>  Hum, no. Still, I don't really understand the need to serialize, but I
> assume it's not an option to reevaluate the XPath (as a relative one
> i.e. ./a:bar/@x ) on the node(s) selected from the first query.
>
>  That could possibly be added to libxml2 but won't be available by
> default, until people update.
>  It's very weird that the implementation has been made this way, XPath
> was designed to be namespace aware, so whoever plugged XPath in pgsql
> completely missed the namespace issue, a simple node dump is not
> preserving namespaces, and if you add them and reserialize you may
> change the semantic from XPath on the original document.
>  So I really wonder how hard the design based on serialization of the
> intermediate result really is, maybe that should be revisited, maybe
> that's impossible, but in that case you will have to play tricks
> like use xmlGetNsList() on the node (or rather its parent), make a copy
> at the node level (verifying they don't clash with existing namespace on
> the node), and then do the xmlNodeDump(). A bit messy...
>
> Daniel
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 2 Apr 2009 17:26:24 +0200
> From: Daniel Veillard <[email protected]>
> Subject: Re: [xml] SAX question
> To: D Kimmel <[email protected]>
> Cc: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=us-ascii
>
> On Thu, Apr 02, 2009 at 12:31:06AM -0700, D Kimmel wrote:
> > Currently, I have been using the xmlCreatePushParserCtxt along with the
> > xmlParseChunk for some applications that have to read from an XML stream.
> >  Is there a way to ignore (or not parse) subelements and just have them
> > returned as a chunk of data?  I was hoping to avoid using CDATA blocks,
> > but basically that's the functionality I am looking for.  Thanks,
>
>  No, basically the XML spec mandates that the parser examine and
> process every byte of the document input data (and fail with a fatal
> error if they don't match the XML character range or grammar).
>
> Daniel
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 2 Apr 2009 17:33:54 +0200
> From: Petr Pajas <[email protected]>
> Subject: Re: [xml] IO callbacks are not thread-safe
> To: [email protected], [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain;  charset="iso-8859-2"
>
> On Thu 2 April 2009, Daniel Veillard wrote:
> > On Thu, Mar 26, 2009 at 07:06:14PM +0100, Nick Wellnhofer wrote:
> > > The input and output callbacks of libxml are stored in static
> > > arrays in xmlIO.c, so any use of the callback functions is not
> > > thread-safe.
> > >
> > > In many cases this shouldn't be a problem, if callbacks are
> > > registered only at the start of a program. But the Perl
> > > bindings register and unregister callbacks every time a
> > > document is parsed. I can reproduce
> >
> >   Uhhhh ????
> > That sounds severely broken to me. Can you detail why, and how?
> >
> >
> > > random segfaults or other errors when processing many thousand
> > > documents in concurrent threads with the libxml Perl bindings.
> > >
> > > I'm willing to help fix this, but I'm not sure about the
> > > correct approach. Should the callback arrays be added to the
> > > global variables in globals.c?
> >
> > Those variables are not public, so I guess a different way would
> > be preferable. Still I can't see any good valid reason to change
> > the values all the time. Something is severely broken there in
> > the perl bindings ! If they need a per parsing instance
> > processing they should use the data block provided by the I/O to
> > make the switch, but register an unified routine for all threads.
> > No really this doesn't make any sense to me, but maybe you can
> > come up with a valid reason,
>
> Hi,
>
> I think the original reason for this was that when Perl bindings are
> used with mod_perl, there may be other (non-Perl) components using
> the global callbacks differently; that's why XML::LibXML Perl
> module tries to clean after itself (restoring whatever was in the
> callbacks previously). Is there any other way around this?
>
> -- Petr
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 02 Apr 2009 17:50:51 +0200
> From: Michael Ludwig <[email protected]>
> Subject: Re: [xml] SAX question
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> D Kimmel wrote:
> > Currently, I have been using the xmlCreatePushParserCtxt along with
> > the xmlParseChunk for some applications that have to read from an XML
> > stream.
> >  Is there a way to ignore (or not parse) subelements and just have
> > them returned as a chunk of data?  I was hoping to avoid using CDATA
> > blocks, but basically that's the functionality I am looking for.
>
> XML doesn't need CDATA, but it may be a convenience. If the reason for
> not parsing the data is to prevent parse errors, then what you
> have isn't XML.
>
> Using the push parser, you should be able to abort parsing once you've
> collected the data you're interested in. Only learnt about it the day
> before yesterday.
>
> http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/3707312
>
> The same thing is possible using SAX (which the subject of your mail
> refers to) at the price of throwing and catching an exception.
>
> http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/3707238
>
> I hope this helps.
>
> Michael Ludwig
>
>
> ------------------------------
>
> _______________________________________________
> xml mailing list
> [email protected]
> http://mail.gnome.org/mailman/listinfo/xml
>
>
> End of xml Digest, Vol 60, Issue 3
> **********************************
>

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
