Daniel,

sorry that I'm returning to this topic after two months. I'm still struggling 
(read below).

On Sunday 09 September 2007, Daniel Veillard wrote:
> > On Sunday 10 June 2007 23:10, Petr Pajas wrote:
> > > Hi,
> > >
> > > I have two files (also attached)
> > >
> > > 1) test.xml:
> > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > <!DOCTYPE a [
> > >   <!ENTITY b SYSTEM "b.txt">
> > > ]>
> > > <a>&b;</a>
> > >
> > > 2) b.txt, which contains just "B"
> > >
> > > When parsing test.xml via the SAX2 interface, I get two character
> > > callbacks for the string "B". The problem can be reproduced with
> > > testSAX --noent from the libxml2 distribution:
> > >
> > > $ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
> > > SAX.setDocumentLocator()
> > > SAX.startDocument()
> > > SAX.internalSubset(a, , )
> > > SAX.entityDecl(b, 2, (null), b.txt, (null))
> > > SAX.externalSubset(a, , )
> > > SAX.startElement(a)
> > > SAX.getEntity(b)
> > > SAX.characters(B, 1)
> > > SAX.characters(B, 1)  <--- why?
>
>   One when parsing the entity to make sure it's well formed the first time
> you use the entity.
>   One each time the entity must be delivered to user land.

Ok, I understand. But so far I have found no way to either avoid one of these 
callbacks or at least distinguish between them from within the callback (even 
my _private is copied to the ctxt passed to the extra callbacks). Assuming my 
code is essentially analogous to testSAX --noent, what specifically do I 
have to do? I tried installing a resolveEntity callback, but it is not called 
at all.

Also, looking into parser.c for some hints, I was struck by this (possible) 
inconsistency: In parser.c near line 6141, one reads:

            if (ent->children == NULL) {
                /*
                 * Probably running in SAX mode and the callbacks don't
                 * build the entity content. So unless we already went
                 * though parsing for first checking go though the entity
                 * content to generate callbacks associated to the entity
                 */
                if (was_checked == 1) {

I think the block that follows is responsible for one of the callbacks.

What strikes me is that the comment says "unless" while the implementation 
says "if" (provided I understand the comment correctly).

When I changed == to !=, I got rid of one of the character callbacks. With 
this change, most regression tests pass, but a few regression tests of SAX 
callbacks fail (I assume they are the ones that simply expect this 
"duplication" of the callbacks). I do not claim this is a bug; it is just a 
suspicion.

> > > SAX.endElement(a)
> > > SAX.endDocument()
> > >
> > > (similarly if b.txt is complex XML - I get the same callbacks for
> > > nodes in the entity twice)
> > >
> > > Is this an expected behavior? If yes, can I somehow distinguish
> > > between the two calls (e.g. based on ctxt) so that I can filter
> > > one of them out?
> > >
> > > P.S. this was observed by one of the users of the Perl bindings
> > > for libxml2. We also have interface for libxml2's reader API in
> > > Perl too, but there are hundreds of very popular Perl modules
> > > build upon the SAX interface (mainly because Perl has really
> > > advanced sax filtering and pipelining with interchangeable SAX
> > > implementations varying from pure-perl, expat, to libxml2;
> > > libxml2 is the fastest among them which makes it very popular and
> > > thus worth maintaining).
>
>   it's all dependant on how your entity handler is implemented I think.

I do not install any entity handler by default. When I installed a
resolveEntity callback, it didn't get called. 

> It's very tricky, I agree, that's why I suggest to not use SAX in general.

I agree, but as I pointed out above, removing SAX support from the Perl 
bindings would be a great loss for the Perl-xml community.

> One important point is to ask 
> the parser to do entity substitution if you provide your own SAX routines
> so it does as much of the work as possible.

I do that (testSAX --noent). I get all entities substituted but receive 
doubled SAX events for their content. 

What I would like to get is a stream of SAX events that looks as if I were 
parsing the output of xmllint --noent, i.e.

...
<a>B</a>

Instead, what I get is a SAX stream that looks (approximately) as if I were 
parsing

...
<a>BB</a>

Thanks,
-- Petr
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
