[xml] Recovering from errors in an XML "stream"

Webb Scales Thu, 05 Sep 2019 22:57:58 -0700

Greetings, all. My apologies if this has already been addressed... I had no luck searching the archive.

My code is being presented with a stream of XML-like data which looks similar to this:


<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer1 attr1="1.0" attr2="Xxx 1" attr3="Xxx" 
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" 
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer1>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer2 attr1="1.0" attr2="Xxx 1" attr3="Xxx" 
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" 
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer2>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer3 attr1="1.0" attr2="Xxx 1" attr3="Xxx" 
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" 
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer3>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer4 attr1="1.0" attr2="Xxx 1" attr3="Xxx" 
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" 
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer4>

I cannot read it all into memory, because it might be "big" or even "infinite" in size.

What I think I want to do is to use the xmlTextReader interface to parse the file in chunks, ideally producing a parse of each successive "root" document.


I've had only very limited success doing this, so far.

The first issue is that the XML parser seems to balk entirely at the fact that the document is preceded by a comment before the XML declaration. (I'm less than shocked, but it is kind of disappointing.) I cannot seem to get the parser to skip over it, so I wrote my own I/O handler (specified via xmlReaderForIO()) which filters out all comments.
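As an illustration only (this is not the code from the post), a comment-filtering step of the kind described above might be sketched as a small state machine that removes `<!-- ... -->` spans from each chunk before it reaches the parser. The names `comment_filter` and `comment_filter_feed` are hypothetical, and the sketch uses no libxml2 calls. The state is kept between calls so that a comment split across two reads is still removed. It assumes comment delimiters never occur inside CDATA sections or attribute values, which it would strip incorrectly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter state, carried across chunks. */
typedef struct {
    int in_comment;   /* nonzero while inside <!-- ... --> */
    size_t matched;   /* bytes of the current delimiter matched so far */
} comment_filter;

static const char COMMENT_OPEN[] = "<!--";

/*
 * Copy `in` to `out` with XML comments removed; returns bytes written.
 * `out` must have room for len + 3 bytes, because up to three bytes of a
 * partial "<!--" held back from the previous chunk may be released here.
 */
size_t comment_filter_feed(comment_filter *f, const char *in, size_t len,
                           char *out)
{
    size_t n = 0;
    assert(f != NULL && in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        char c = in[i];
        if (f->in_comment) {
            /* Look for the closing "-->"; extra dashes keep the match alive. */
            if (c == '-') {
                if (f->matched < 2)
                    f->matched++;
            } else if (c == '>' && f->matched == 2) {
                f->in_comment = 0;
                f->matched = 0;
            } else {
                f->matched = 0;
            }
        } else if (c == COMMENT_OPEN[f->matched]) {
            /* Hold back bytes that might be the start of "<!--". */
            if (++f->matched == 4) {
                f->in_comment = 1;
                f->matched = 0;
            }
        } else {
            /* Not a comment after all: release the held prefix. */
            memcpy(out + n, COMMENT_OPEN, f->matched);
            n += f->matched;
            f->matched = 0;
            if (c == '<')
                f->matched = 1;   /* may begin a new "<!--" */
            else
                out[n++] = c;
        }
    }
    return n;
}
```

A read callback could call `comment_filter_feed()` on each chunk it obtains from the underlying source and hand only the filtered bytes to the parser. At end of input, any bytes still held back (a trailing partial `<!--`) would need to be released separately.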

The next issue is that the XML parser reports an error near the end of the document, when it notices that the document is followed by an XML declaration. (I'm a little closer to shocked by this.) I managed to work around this by specifying my own error handler (via xmlTextReaderSetErrorHandler()) and calling xmlTextReaderRead()/xmlTextReaderNext() repeatedly until it returns something other than -1. (I found a partial explanation of this effect in the archive, but it was still surprising, because the errors are reported well before the point in the parse where the offending text appears, and especially because the offending text doesn't appear until after the closing tag for the root.) I'm afraid, though, that my workaround only works if the documents are large.

The crushing problem arises when I try to read the second document in the stream (or when I try to retrieve the nodes near the end of a small initial document): in my application code, every time I call xmlTextReaderNext(), I get a -1 return, and the parser doesn't advance past the offending tokens (and, in the small document case, it doesn't advance to the tokens prior to the offense). So my code is just stuck.

Is there something I'm missing? Is there some way that I can acknowledge the error and allow the XML parser to proceed? Or, is there some way to get the parser to ignore the fact that there is additional text after the closing tag for the root? (Why is the parser requesting more input when it hasn't returned all the tags to the reader yet? I arranged to have the input routine return exactly up to the closing tag for the root, and the parser went ahead and asked for more instead of returning the parse of what it already had to the reader!)
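For illustration, an input routine that hands the parser exactly one document and then reports end of input might look like the sketch below. It is not the routine from the post, and it swaps in a different boundary rule: instead of stopping at the root's closing tag, it stops just before the next `<?xml`. The names `doc_reader`, `doc_reader_read`, and `doc_reader_next` are hypothetical. The read function follows the shape of libxml2's `xmlInputReadCallback` (`int (*)(void *context, char *buffer, int len)`), but the sketch itself calls no libxml2 functions. It assumes comments have already been removed so that every document begins with its declaration. It would also treat a processing instruction such as `<?xml-stylesheet` as a boundary. Whether a fresh reader per document avoids the errors described above has not been verified here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DECL "<?xml"
#define DECL_LEN 5

/* Hypothetical per-stream state. */
typedef struct {
    FILE *fp;
    size_t held;      /* bytes of DECL matched but not yet delivered */
    int started;      /* nonzero once the current document has delivered bytes */
    int at_boundary;  /* nonzero when the NEXT document's DECL is being held */
} doc_reader;

/* Deliver bytes of the current document only; return 0 at its end. */
int doc_reader_read(void *context, char *buffer, int len)
{
    doc_reader *r = context;
    int n = 0;
    assert(r != NULL && buffer != NULL && len >= DECL_LEN);

    if (r->at_boundary)
        return 0;                      /* end of input for this document */
    if (r->held == DECL_LEN) {         /* declaration saved by the last split */
        memcpy(buffer, DECL, DECL_LEN);
        n = DECL_LEN;
        r->held = 0;
        r->started = 1;
    }
    /* Keep room for a held prefix (up to 4 bytes) plus one more byte. */
    while (n + DECL_LEN <= len) {
        int c = fgetc(r->fp);
        if (c == EOF) {                /* release any partial match */
            memcpy(buffer + n, DECL, r->held);
            n += (int)r->held;
            r->held = 0;
            break;
        }
        if (c == DECL[r->held]) {
            if (++r->held == DECL_LEN) {
                if (r->started) {      /* next document begins here */
                    r->at_boundary = 1;
                    break;
                }
                memcpy(buffer + n, DECL, DECL_LEN);
                n += DECL_LEN;
                r->held = 0;
                r->started = 1;
            }
            continue;
        }
        /* Mismatch: release the held prefix, then re-test this byte. */
        memcpy(buffer + n, DECL, r->held);
        n += (int)r->held;
        if (r->held)
            r->started = 1;
        r->held = 0;
        if (c == '<') {
            r->held = 1;
        } else {
            buffer[n++] = (char)c;
            r->started = 1;
        }
    }
    return n;
}

/* Move on to the next document; returns 1 if there is one, else 0. */
int doc_reader_next(doc_reader *r)
{
    assert(r != NULL);
    if (!r->at_boundary)
        return 0;
    r->at_boundary = 0;
    r->started = 0;   /* held == DECL_LEN, so the declaration is sent first */
    return 1;
}
```

The intended use would be to create a new reader for each document with `doc_reader_read` as its read callback, parse until the reader finishes, then call `doc_reader_next()` and repeat while it returns 1.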

Is there some other approach which is better for my situation than the xmlTextReader?



            Thanks for your help!

                Webb



--

Webb Scales
Principal Software Architect
603-673-2306
www.ursasecure.com <https://www.ursasecure.com>
w...@ursasecure.com <mailto:w...@ursasecure.com>

