Daniel,
sorry that I'm returning to this topic after two months. I'm still struggling
(read below).
On Sunday 09 September 2007, Daniel Veillard wrote:
> > On Sunday 10 June 2007 23:10, Petr Pajas wrote:
> > > Hi,
> > >
> > > I have two files (also attached)
> > >
> > > 1) test.xml:
> > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > <!DOCTYPE a [
> > > <!ENTITY b SYSTEM "b.txt">
> > > ]>
> > > <a>&b;</a>
> > >
> > > 2) b.txt, which contains just "B"
> > >
> > > When parsing test.xml via the SAX2 interface, I get two character
> > > callbacks for the string "B". The problem can be reproduced with
> > > testSAX --noent from the libxml2 distribution:
> > >
> > > $ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
> > > SAX.setDocumentLocator()
> > > SAX.startDocument()
> > > SAX.internalSubset(a, , )
> > > SAX.entityDecl(b, 2, (null), b.txt, (null))
> > > SAX.externalSubset(a, , )
> > > SAX.startElement(a)
> > > SAX.getEntity(b)
> > > SAX.characters(B, 1)
> > > SAX.characters(B, 1) <--- why?
>
> One when parsing the entity to make sure it's well formed the first time
> you use the entity.
> One each time the entity must be delivered to user land.
Ok, I understand. But so far I have found no way to either avoid one of these
callbacks or at least distinguish between them from within the callback (even
my _private is copied to the ctxt passed to the extra callbacks). Assuming my
codebase is basically analogous to testSAX --noent, what specifically do I
have to do? I tried installing a resolveEntity callback, but it is not called
at all.
Also, looking into parser.c for some hints, I was struck by this (possible)
inconsistency: In parser.c near line 6141, one reads:
    if (ent->children == NULL) {
        /*
         * Probably running in SAX mode and the callbacks don't
         * build the entity content. So unless we already went
         * though parsing for first checking go though the entity
         * content to generate callbacks associated to the entity
         */
        if (was_checked == 1) {
I think the block that follows is responsible for one of the callbacks.
What strikes me is that the comment says "unless" while the implementation
says "if" (provided I understand the comment correctly).
When I changed == to !=, I got rid of one of the character callbacks. With
this change, most regression tests pass, but a few regression tests of SAX
callbacks fail (I assume they are the ones that simply expect this
"duplication" of the callbacks). I do not claim this is a bug; it is just a
suspicion.
> > > SAX.endElement(a)
> > > SAX.endDocument()
> > >
> > > (similarly if b.txt is complex XML - I get the same callbacks for
> > > nodes in the entity twice)
> > >
> > > Is this an expected behavior? If yes, can I somehow distinguish
> > > between the two calls (e.g. based on ctxt) so that I can filter
> > > one of them out?
> > >
> > > P.S. this was observed by one of the users of the Perl bindings
> > > for libxml2. We also have interface for libxml2's reader API in
> > > Perl too, but there are hundreds of very popular Perl modules
> > > build upon the SAX interface (mainly because Perl has really
> > > advanced sax filtering and pipelining with interchangeable SAX
> > > implementations varying from pure-perl, expat, to libxml2;
> > > libxml2 is the fastest among them which makes it very popular and
> > > thus worth maintaining).
>
> it's all dependant on how your entity handler is implemented I think.
I do not install any entity handler by default. When I installed
resolveEntity callback, it didn't get called.
> It's very tricky, I agree, that's why I suggest to not use SAX in general.
I agree, but as I pointed out above, removing SAX support from the Perl
bindings would be a great loss for the Perl-xml community.
> One important point is to ask
> the parser to do entity substitution if you provide your own SAX routines
> so it does as much of the work as possible.
I do that (testSAX --noent). I get all entities substituted but receive
doubled SAX events for their content.
What I would like to get is a stream of SAX events that looks as if I were
parsing the output of xmllint --noent, i.e.
...
<a>B</a>
Instead, what I get is a SAX stream that looks (approximately) as if I were
parsing
...
<a>BB</a>
Thanks,
-- Petr
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
http://mail.gnome.org/mailman/listinfo/xml