Re: [xml] external DTD validation of large XML's

Jon Tue, 16 Aug 2011 07:15:21 -0700

> > > >> In many cases you don't even need that. Write a shell XML file,
> > > >> 
> > > >> <!DOCTYPE wrapper SYSTEM "the-dtd-file.dtd" [
> > > >>   <!ELEMENT wrapper (the-real-root-element)>
> > > >>   <!ENTITY the-real-document SYSTEM "bigfile.xml">
> > > >> ]>
> > > >> <wrapper>&the-real-document;</wrapper>
> > > >
> > > > Will the libxml2 implementation try to bring the entire 
> > > > &the-real-document; entity into memory, or will it stream it if I use 
> > > > the SAX2 or Reader API?  My gut tells me both the dtd and the 
> > > > bigfile.xml will be completely parsed into memory. This is fine for the 
> > > > dtd but not for the bigfile.xml.
> > > 
> > > A reading of xmlParseReference suggests your gut is wrong. :)
> > > 
> > > http://git.gnome.org/browse/libxml2/tree/parser.c#n6823
> > 
> >   Yeah, I would think that for external parsed entities we create a
> > new input stream and feed it to the parser, hence progressively.
> > This may work in constant memory for SAX, but unfortunately I'm afraid
> > that for the reader we still build a tree for the entity content
> > (stored in ent->children), so yes, we do it progressively, but no,
> > unfortunately we accumulate the tree in memory :-\
> 
> OK, I'll catch up and learn what xmlParseReference is doing. Good to know 
> it's constant memory in SAX and I'll focus my testing of the wrapping idea 
> with SAX. 
>  
> >   The real solution would be to allow DTD validation from a preparsed
> > DTD at the xmlreader level directly. In my defense, validating from
> > a DTD not referenced from the document is not a scenario actually
> > described by XML-1.0, and the way it's implemented will diverge slightly
> > from when you reference with a DOCTYPE. Which is why I think the
> > cleanest is to use a custom I/O which will automatically add the DOCTYPE
> > at the beginning of the document, that's the safest and fastest at this
> > point in my opinion.


Daniel,

I'm getting spare moments again to play with external DTD validation on my 
https://github.com/jonforums/xvalid pet project.

I've concluded that implementing some type of buffer transformation scheme and 
feeding the buffer to the parser is the most reliable and adaptable solution. 
But I've not yet tried the idea with the existing SAX, push, or reader API to 
see if it's workable.
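
To make the buffer idea concrete, here is a minimal sketch of what I have in 
mind: a read callback that hands out a DOCTYPE string first and then the bytes 
of the real file. The names prefix_ctx, prefix_read, and prefix_close are made 
up for illustration. The callbacks are shaped like libxml2's 
xmlInputReadCallback and xmlInputCloseCallback, so in principle they could be 
passed to something like xmlReaderForIO, but I have not tested that. It also 
assumes the file has no XML declaration or DOCTYPE of its own, since a DOCTYPE 
placed in front of an XML declaration would not be well-formed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical context: a prefix to emit first, then the wrapped file. */
typedef struct {
    const char *prefix;     /* e.g. a DOCTYPE declaration */
    size_t      prefix_len; /* number of bytes in prefix */
    size_t      prefix_pos; /* bytes of prefix already handed out */
    FILE       *fp;         /* the real document */
} prefix_ctx;

/* Read callback: returns bytes copied, 0 at end of input, -1 on error. */
static int prefix_read(void *context, char *buffer, int len)
{
    prefix_ctx *ctx = context;

    if (ctx == NULL || ctx->fp == NULL || buffer == NULL || len <= 0)
        return -1;
    assert(ctx->prefix_pos <= ctx->prefix_len);

    /* Drain the injected prefix before touching the file. */
    if (ctx->prefix_pos < ctx->prefix_len) {
        size_t n = ctx->prefix_len - ctx->prefix_pos;
        if (n > (size_t)len)
            n = (size_t)len;
        memcpy(buffer, ctx->prefix + ctx->prefix_pos, n);
        ctx->prefix_pos += n;
        return (int)n;
    }

    /* Prefix exhausted: stream the file one chunk at a time. */
    size_t n = fread(buffer, 1, (size_t)len, ctx->fp);
    if (n == 0 && ferror(ctx->fp))
        return -1;
    return (int)n;
}

/* Close callback: releases the underlying file. */
static int prefix_close(void *context)
{
    prefix_ctx *ctx = context;
    return (ctx != NULL && ctx->fp != NULL) ? fclose(ctx->fp) : 0;
}
```

Only one chunk is in flight at a time on the I/O side, so memory use there 
stays constant. Whether the parser then accumulates a tree is a separate 
question that depends on the API used.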

However, from your comments it appears you prefer integrating with xmlIO.c?

Would you quickly summarize (maybe with code snippets) your idea, its 
applicability to the SAX, push, and reader APIs, and what you see as the key 
issues/gotchas?

Or if you've already discussed this ad infinitum, I'd appreciate an RTFM link ;)

Jon

---
blog: http://jonforums.github.com/
twitter: @jonforums

"Anyone who can only think of one way to spell a word obviously lacks 
imagination." - Mark Twain
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
