On Monday 10 September 2007, Daniel Veillard wrote:
> On Sun, Sep 09, 2007 at 04:51:55PM +0200, Petr Pajas wrote:
> > Daniel,
> >
> > sorry that I'm returning to this topic after two months. I'm
> > still struggling (read below).
> >
> > On Sunday 09 September 2007, Daniel Veillard wrote:
> > > > On Sunday 10 June 2007 23:10, Petr Pajas wrote:
> > > > > Hi,
> > > > >
> > > > > I have two files (also attached)
> > > > >
> > > > > 1) test.xml:
> > > > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > > > <!DOCTYPE a [
> > > > >   <!ENTITY b SYSTEM "b.txt">
> > > > > ]>
> > > > > <a>&b;</a>
> > > > >
> > > > > 2) b.txt, which contains just "B"
> > > > >
> > > > > When parsing test.xml via the SAX2 interface, I get two
> > > > > character callbacks for the string "B". The problem can
> > > > > be reproduced with testSAX --noent from the libxml2
> > > > > distribution:
> > > > >
> > > > > $ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
> > > > > SAX.setDocumentLocator()
> > > > > SAX.startDocument()
> > > > > SAX.internalSubset(a, , )
> > > > > SAX.entityDecl(b, 2, (null), b.txt, (null))
> > > > > SAX.externalSubset(a, , )
> > > > > SAX.startElement(a)
> > > > > SAX.getEntity(b)
> > > > > SAX.characters(B, 1)
> > > > > SAX.characters(B, 1)  <--- why?
> > >
> > >   One when parsing the entity to make sure it's well formed
> > > the first time you use the entity.
> > >   One each time the entity must be delivered to user land.
> >
> > Ok, I understand. But so far I found no way to either avoid one
> > of these callbacks or at least distinguish between them from
> > within the callback (even my _private is copied to the ctxt
> > passed to the extra callbacks). Assuming my codebase was
> > basically an analogy of testSAX --noent, what specifically do I
> > have to do? I tried installing a resolveEntity callback, but it
> > is not called at all.
> >
> > Also, looking into parser.c for some hints, I was struck by
> > this (possible) inconsistency: In parser.c near line 6141, one
> > reads:
> >
> >  if (ent->children == NULL) {
> >                 /*
> >                  * Probably running in SAX mode and the callbacks don't
> >                  * build the entity content. So unless we already went
> >                  * though parsing for first checking go though the entity
> >                  * content to generate callbacks associated to the entity
> >                  */
> >                 if (was_checked == 1) {
> >
> > I think the block that follows is responsible for one of the
> > callbacks.
> >
> > What strikes me is that the comment says "unless" while the
> > implementation says "if" (provided I understand the comment
> > correctly).
> >
> > When I changed == to !=, I got rid of one of the character
> > callbacks. With this change, most regression tests pass but a few
> > regression tests of SAX callbacks fail (I assume they are those
> > that just expect this "duplication" of the callbacks). I do not
> > claim this is a bug, just a suspicion.
>
>   And I guess if you do this you won't see fatal errors if they
> occur in entities, right?

Probably right, I'll have to check. Hm, so then maybe the first call
to xmlParseExternalEntityPrivate could get a SAX handler structure
that is NULL except for the fatal error callback, which is copied
from the original SAX structure? But again this is something I
can't do from "user-land".
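
To make the idea concrete, here is a minimal sketch of such an
"errors only" handler. It deliberately uses a stand-in struct, not
libxml2's real xmlSAXHandler; every sketch_* name is hypothetical
and only illustrates the copying step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a SAX handler table, reduced to two slots. */
typedef void (*sketch_chars_cb)(void *ctx, const char *text, int len);
typedef void (*sketch_error_cb)(void *ctx, const char *msg);

typedef struct {
    sketch_chars_cb characters;  /* content callback, silenced while checking */
    sketch_error_cb fatalError;  /* kept so errors in entities still surface */
} sketch_sax_handler;

/* Fill *check for the checking pass: all NULL except fatalError. */
void sketch_make_check_handler(const sketch_sax_handler *orig,
                               sketch_sax_handler *check)
{
    sketch_sax_handler empty = {0};

    assert(orig != NULL && check != NULL);
    *check = empty;
    check->fatalError = orig->fatalError;
}

/* Example user callbacks, present only so the sketch can be exercised. */
void sketch_demo_chars(void *ctx, const char *text, int len)
{
    (void)ctx; (void)text; (void)len;
}

void sketch_demo_fatal(void *ctx, const char *msg)
{
    (void)ctx; (void)msg;
}
```

The checking pass would then run with the reduced table and only the
delivery pass would see the user's full set of callbacks.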

> > > > > SAX.endElement(a)
> > > > > SAX.endDocument()
> > > > >
> > > > > (similarly if b.txt is complex XML - I get the same
> > > > > callbacks for nodes in the entity twice)
> > > > >
> > > > > Is this an expected behavior? If yes, can I somehow
> > > > > distinguish between the two calls (e.g. based on ctxt) so
> > > > > that I can filter one of them out?
> > > > >
> > > > > P.S. this was observed by one of the users of the Perl
> > > > > bindings for libxml2. We also have an interface to
> > > > > libxml2's reader API in Perl, but there are hundreds
> > > > > of very popular Perl modules built upon the SAX interface
> > > > > (mainly because Perl has really advanced SAX filtering
> > > > > and pipelining with interchangeable SAX implementations
> > > > > varying from pure-perl, expat, to libxml2; libxml2 is the
> > > > > fastest among them which makes it very popular and thus
> > > > > worth maintaining).
> > >
> > >   it's all dependent on how your entity handler is
> > > implemented I think.
> >
> > I do not install any entity handler by default. When I
> > installed resolveEntity callback, it didn't get called.
> >
> > > It's very tricky, I agree, that's why I suggest not using
> > > SAX in general.
> >
> > I agree, but as I pointed out above, removing SAX support from the
> > Perl bindings would be a great loss for the Perl-xml community.
>
>   Well, how did the bindings change? Because libxml2 behaviour
> didn't, as far as I understand.

They didn't. The problem has probably always existed; it's just that
nobody reported it before. But now I know I cannot guarantee the
same behavior as other SAX interfaces and do not have a workaround.

> > > One important point is to ask
> > > the parser to do entity substitution if you provide your own
> > > SAX routines so it does as much of the work as possible.
> >
> > I do that (testSAX --noent). I get all entities substituted but
> > receive doubled SAX events for their content.
> >
> > What I would like to get is a stream of SAX events that looks
> > as if I was parsing the output of xmllint --noent, i.e.
> >
> > ...
> > <a>B</a>
> >
> > Instead, what I get is a SAX stream that looks (approximately)
> > like I was parsing
> >
> > ...
> > <a>BB</a>
>
>  It should happen on the first occurrence of the entity reference
> only, i.e. when it is first used.
>
> Daniel

Yes, of course. But that does not make the problem any smaller.
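
To put numbers on it, here is a toy model of the behaviour as
described above: one extra pass the first time an entity is used,
plus one delivery per reference. It is only a sketch of that
description, not libxml2 code; the sketch_* names and the
was_checked field are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one external entity; not a libxml2 type. */
typedef struct {
    int was_checked;  /* has the one-off well-formedness pass run yet? */
} sketch_entity;

/* characters() callbacks the user sees for one reference to the entity. */
int sketch_callbacks_for_reference(sketch_entity *ent)
{
    int callbacks = 0;

    assert(ent != NULL);
    if (!ent->was_checked) {
        callbacks += 1;        /* checking pass also reaches user callbacks */
        ent->was_checked = 1;
    }
    callbacks += 1;            /* delivery of the content for this reference */
    return callbacks;
}

/* Total callbacks for a document that references the entity n times. */
int sketch_total_callbacks(int n_references)
{
    sketch_entity ent = {0};
    int total = 0;
    int i;

    for (i = 0; i < n_references; i++)
        total += sketch_callbacks_for_reference(&ent);
    return total;
}
```

Under this model a document with a single &b; reference yields two
callbacks, and each further reference adds one more, so the extra
event appears exactly once per entity but is still indistinguishable
from the real one.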

-- Petr
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml