Re: [xml] loading concatenated documents

Ethan Tira-Thompson Mon, 29 Mar 2010 18:22:13 -0700

Thanks for all the information, I'll try to collate things :)

> you have to indicate where the data ends or what the last chunks is.


Unfortunately, this is not very attractive... if I have to invent some 
arbitrary data format to wrap around the XML, it defeats a significant goal of 
using XML. (i.e. I still end up writing a custom/buggy parser... even something 
simple like looking for a \0 delimiter, depending on the charset, I might see 
those in the XML document; if I add a length field between documents, will it 
be binary?  little endian or big endian?  If it's serialized as text, will 
there be a newline afterward?  Is that included in the count?  Plus then I need 
more documentation of this new format for everyone who wants to use it.  I'm 
using XML because I *don't* want to deal with all of these issues.)
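
To make the \0 concern concrete, here is a minimal sketch in plain C, assuming an ASCII-only document re-encoded as UTF-16LE without a byte order mark. The helper names `ascii_to_utf16le` and `first_nul` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode an ASCII-only string as UTF-16LE: each character becomes its
 * byte value followed by a 0x00 high byte. Returns the number of bytes
 * written, or 0 if `out` is too small. Hypothetical helper. */
size_t ascii_to_utf16le(const char *ascii, unsigned char *out, size_t out_cap)
{
    size_t n = strlen(ascii);
    if (out_cap < 2 * n)
        return 0;
    for (size_t i = 0; i < n; i++) {
        assert((unsigned char)ascii[i] < 0x80);  /* ASCII input only */
        out[2 * i]     = (unsigned char)ascii[i];
        out[2 * i + 1] = 0x00;
    }
    return 2 * n;
}

/* Offset of the first 0x00 byte, i.e. where a naive "\0 ends the
 * document" reader would stop. Returns len if there is none. */
size_t first_nul(const unsigned char *buf, size_t len)
{
    const unsigned char *p = memchr(buf, 0, len);
    return p ? (size_t)(p - buf) : len;
}
```

Encoding "<a/>" this way yields 8 bytes, and `first_nul` reports offset 1, so a reader that splits on \0 would cut the document after its first character. The same text in a single-byte encoding contains no zero byte at all, which is why the delimiter only works if the charset is fixed in advance.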

So anyway, I like to think there's a better solution, like perhaps the XMPP 
<stream> type thing, or what I actually wound up doing, described at the end.

>  No any XML parser MUST report
> "<foo/><foo/>"
> as a not well formed document if passed this data.

You've got what I'm saying backwards.  I don't claim that's a single well 
formed document, I claim that's *two* documents.  Once you reach the end of one 
document, that's it, parsing complete, no need to go looking for trouble in the 
dark alley that follows.  If the user chooses to independently parse the next 
document vs. treat it as an error, that's up to them.

Or regardless of whether you want to call that two documents or not, it's two 
elements, and it would be nice to have a feature to parse each fragment, one 
tree at a time.

However, I guess it's not too bad to do two passes: one lightweight SAX pass 
which just skims through looking for the end of the current root element and 
buffers all the data up to that point, and then a second pass which builds a 
tree from that buffer of data with xmlParseBalancedChunkMemory() or such.

> Failure to do so would just make the parser non-conformant to the XML-1.0 
> specification.

Are you sure about this?  Like I said, I'm not aware of the specification that 
it must be an error if more data follows the document.  The spec does define 
that this extra data is not part of the document, but AFAIK not what you should do 
with/about it.  It would better serve interoperability to simply ignore it and 
let the user decide if it's an issue, probably issuing a warning by default.  
But I'm no expert on the spec, it would be educational if you could point me to 
the section.

> I assume you're heard of XMPP aka Jabber they solved this 10 years ago. Send 
> everything as 1 document, chunk by chunk, and close the top element when 
> closing the connection.

Yeah, I'm actually doing exactly this already in a different part of my 
project.  I'm not strongly opposed to inserting something like a '<stream>' at 
the beginning of the connection... actually I don't even need to modify the 
stream, I could just have my read callback hallucinate the root element on 
first access.
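
The "hallucinated root element" idea can be sketched in plain C as follows. The callback has the same shape as libxml2's xmlInputReadCallback (context pointer, output buffer, capacity; returns the number of bytes produced, 0 at end of input). The `wrapped_input` struct, its fields, and the in-memory `data` source are assumptions made for illustration; a real version would read from the live connection instead.

```c
#include <assert.h>
#include <string.h>

/* State for a read callback that invents an opening root tag.
 * All names here are hypothetical. */
typedef struct {
    const char *prefix;     /* synthetic text, e.g. "<stream>" */
    size_t prefix_len;
    size_t prefix_sent;     /* how much of the prefix was handed out */
    const char *data;       /* stand-in for the real input source */
    size_t data_len;
    size_t data_pos;
} wrapped_input;

/* Fill `buffer` with up to `len` bytes and return the count.
 * The synthetic prefix is delivered first, then the real data. */
int wrapped_read(void *context, char *buffer, int len)
{
    wrapped_input *in = context;
    size_t want, n;

    assert(len >= 0);
    want = (size_t)len;

    if (in->prefix_sent < in->prefix_len) {
        n = in->prefix_len - in->prefix_sent;
        if (n > want)
            n = want;
        memcpy(buffer, in->prefix + in->prefix_sent, n);
        in->prefix_sent += n;
        return (int)n;
    }

    n = in->data_len - in->data_pos;
    if (n > want)
        n = want;
    memcpy(buffer, in->data + in->data_pos, n);
    in->data_pos += n;
    return (int)n;   /* 0 once the source is exhausted */
}
```

The parser then sees "<stream>" followed by the unmodified stream contents, so the sender does not have to change anything. As in XMPP, the invented root is never closed while the connection is open, so the parser only ever sees one unfinished document.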

However the problem is I want to build a tree for each of the chunks (i.e. 
"stanzas" to use the XMPP term), and there does not seem to be an obvious way 
to do this, even if everything is wrapped in a single root node.  The 
fundamental problem is it is difficult to pass a balanced fragment without 
already having the fragment parsed to know where it ends.

This is what brought me to my originally proposed solution, which only uses a 
single pass: as libxml builds the tree, use a SAX callback on endElement to 
jump in at the end of the chunk/stanza to interrupt the parser and reset the 
stream for the next chunk/stanza.  I have implemented this solution and it does 
seem to work well.

One caveat for those who follow: my original plan to use 
xmlCreateIOParserCtxt() to pull the data out of a realtime istream failed 
because libxml internally requests additional data before it's actually done 
with the previous buffer.  This causes the parsing to block and wait for the 
next update instead of finishing the current update, so parsing is always one 
update late.  Further, once it finishes the old buffer, the code puts the 
unused new buffer back into the stream for the next round of parsing.  However 
I loop based on whether the stream already has more data waiting, so the cycle 
immediately repeats: the loop is always behind on the latest data, and never 
actually breaks out to handle that data.

Instead, switching to xmlCreatePushParserCtxt() allows me to control the data 
flow better, only pushing what's available and correctly detecting when the 
parser is caught up with the data stream.
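
A rough sketch of that push-style control flow, in plain C and independent of libxml2: the `live_source` struct, `push_fn`, and `recording_parser` are hypothetical stand-ins. With libxml2, the `push_fn` could be a thin wrapper around xmlParseChunk(ctxt, chunk, size, 0), which also returns 0 on success. This is not the code linked below, only an illustration of the idea.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a live stream: only the first
 * `arrived` bytes of `data` are readable so far. */
typedef struct {
    const char *data;
    size_t arrived;
    size_t pos;
} live_source;

/* Forwards one chunk to a push parser; returns 0 on success. */
typedef int (*push_fn)(void *parser, const char *chunk, int size);

/* Push everything that has already arrived, never waiting for more.
 * Returns the number of bytes pushed, or -1 if the parser reported an
 * error. On a non-negative return the parser has been given every
 * byte the source has delivered so far. */
long pump_available(live_source *src, push_fn push, void *parser)
{
    long total = 0;

    assert(src->pos <= src->arrived);
    while (src->pos < src->arrived) {
        size_t n = src->arrived - src->pos;
        if (n > 4096)
            n = 4096;                      /* bounded chunk size */
        if (push(parser, src->data + src->pos, (int)n) != 0)
            return -1;
        src->pos += n;
        total += (long)n;
    }
    return total;
}

/* Test double that records what was pushed instead of parsing it. */
typedef struct {
    char seen[64];
    size_t len;
} recording_parser;

int record_push(void *parser, const char *chunk, int size)
{
    recording_parser *r = parser;

    if (r->len + (size_t)size >= sizeof r->seen)
        return -1;
    memcpy(r->seen + r->len, chunk, (size_t)size);
    r->len += (size_t)size;
    r->seen[r->len] = '\0';
    return 0;
}
```

Because the caller decides how many bytes to hand over, there is no read-ahead to block on: a call that returns 0 means the parser is caught up, and the outer loop can go do something else until more data arrives.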

My custom parser 'StreamParser' context defined here:
http://cvs.tekkotsu.org/viewvc/Tekkotsu/Shared/XMLLoadSave.cc?revision=1.23&view=markup#l538

And usage in 'loadStream()' here:
http://cvs.tekkotsu.org/viewvc/Tekkotsu/Shared/XMLLoadSave.cc?revision=1.23&view=markup#l590

Thanks again,
 -Ethan

