Greetings, all. My apologies if this has already been addressed...I had
no luck searching the archive.
My code is being presented with a stream of XML-like data which looks
similar to this:
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer1 attr1="1.0" attr2="Xxx 1" attr3="Xxx"
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9"
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer1>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer2 attr1="1.0" attr2="Xxx 1" attr3="Xxx"
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9"
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer2>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer3 attr1="1.0" attr2="Xxx 1" attr3="Xxx"
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9"
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer3>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer4 attr1="1.0" attr2="Xxx 1" attr3="Xxx"
attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9"
b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer4>
I cannot read it all into memory, because it might be "big" or even
"infinite" in size.
What I think I want to do is to use the xmlTextReader interface to parse
the file in chunks, ideally producing a parse of each successive "root"
document.
So far, I've had only very limited success doing this.
The first issue is that the XML parser seems to balk entirely at the
fact that the document is preceded by a comment before the XML
declaration. (I'm less than shocked, but it is kind of disappointing.)
I cannot seem to get the parser to skip over it, so I wrote my own I/O
handler (specified via xmlReaderForIO()) which filters out all comments.
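For illustration, a comment-stripping filter of the kind described above
might look like the sketch below. It is not the actual handler; the names
comment_filter, comment_filter_feed() and comment_filter_finish() are
hypothetical. It assumes the data arrives in arbitrary chunks (so a comment
may be split across reads), and it does not special-case "<!--" appearing
inside CDATA sections or attribute values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter state, carried across chunks of the stream. */
typedef struct {
    int in_comment;   /* nonzero while inside <!-- ... --> */
    size_t matched;   /* bytes of "<!--" (or '-' of "-->") matched so far */
} comment_filter;

static const char open_seq[] = "<!--";

/* Copies src[0..n) to dst, dropping XML comments.
 * dst must have room for n + 3 bytes, because a held-back partial
 * "<!--" from the previous chunk may be flushed here.
 * Returns the number of bytes written to dst. */
size_t comment_filter_feed(comment_filter *f, const char *src, size_t n,
                           char *dst)
{
    size_t out = 0;
    assert(f != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        char c = src[i];
        if (f->in_comment) {
            if (c == '-') {
                if (f->matched < 2)
                    f->matched++;          /* count up to "--" */
            } else if (c == '>' && f->matched == 2) {
                f->in_comment = 0;         /* saw "-->" */
                f->matched = 0;
            } else {
                f->matched = 0;
            }
        } else if (c == open_seq[f->matched]) {
            if (++f->matched == 4) {       /* saw "<!--" */
                f->in_comment = 1;
                f->matched = 0;
            }
        } else {
            /* Not a comment after all: emit the held-back prefix. */
            for (size_t k = 0; k < f->matched; k++)
                dst[out++] = open_seq[k];
            f->matched = 0;
            if (c == '<')
                f->matched = 1;            /* may start a new "<!--" */
            else
                dst[out++] = c;
        }
    }
    return out;
}

/* Call once at end of input to flush a held-back partial "<!--".
 * dst must have room for 3 bytes. Returns the bytes written. */
size_t comment_filter_finish(comment_filter *f, char *dst)
{
    size_t out = 0;
    assert(f != NULL && dst != NULL);
    if (!f->in_comment)
        for (size_t k = 0; k < f->matched; k++)
            dst[out++] = open_seq[k];
    f->matched = 0;
    return out;
}
```

In a read callback passed to xmlReaderForIO(), one would presumably read
at most len - 3 bytes from the underlying source and pass them through
comment_filter_feed() into the parser's buffer, so the flushed prefix
always fits.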
The next issue is that the XML parser reports an error near the end of
the document, when it notices that the document is followed by an XML
declaration. (I'm a little closer to shocked by this.) I managed to
work around this by specifying my own error handler (via
xmlTextReaderSetErrorHandler()) and calling
xmlTextReaderRead()/xmlTextReaderNext() repeatedly until it returns
something other than -1. (I found a partial explanation of this effect
in the archive, but it was still surprising, because the errors are
reported well before the point in the parse where the offending text
appears and especially because the offending text doesn't appear until
after the closing tag for the root.) However, I'm afraid my workaround
only works if the documents are large.
The crushing problem arises when I try to read the second document in
the stream (or when I try to retrieve the nodes near the end of a small
initial document): in my application code, every time I call
xmlTextReaderNext(), I get a -1 return, and the parser doesn't advance
past the offending tokens (and, in the small document case, it doesn't
advance to the tokens prior to the offense). As a result, my code is just stuck.
Is there something I'm missing? Is there some way that I can
acknowledge the error and allow the XML parser to proceed? Or, is there
some way to get the parser to ignore the fact that there is additional
text after the closing tag for the root? (Why is the parser requesting
more input when it hasn't returned all the tags to the reader yet? I
arranged to have the input routine return exactly up to the closing tag
for the root, and the parser went ahead and asked for more instead of
returning the parse of what it already had to the reader!)
Is there some other approach which is better for my situation than the
xmlTextReader?
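One alternative worth sketching, offered as an untested idea and not as a
known-good answer: split the stream into individual documents before the
parser ever sees it, using each XML declaration as the boundary, and hand
each piece to its own reader (for example via xmlReaderForMemory()). The
helper names next_xml_decl() and next_document() below are hypothetical,
and the sketch assumes every document starts with "<?xml" followed by
whitespace, as in the sample above. The leading comment of the following
document ends up trailing the previous one, which is still well-formed,
since comments are allowed after the root element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first "<?xml" followed by whitespace at or
 * after `from`, or len if there is none in buf[0..len). */
size_t next_xml_decl(const char *buf, size_t len, size_t from)
{
    static const char decl[] = "<?xml";
    const size_t dlen = sizeof decl - 1;
    assert(buf != NULL);
    for (size_t i = from; i + dlen < len; i++) {
        if (memcmp(buf + i, decl, dlen) == 0) {
            char c = buf[i + dlen];
            /* Require whitespace so "<?xml-stylesheet" is not matched. */
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
                return i;
        }
    }
    return len;
}

/* Finds the document starting at or after `from`.
 * On success returns 1 and sets [*start, *end) to its span, where *end
 * is the next declaration or len. Returns 0 if no declaration is found. */
int next_document(const char *buf, size_t len, size_t from,
                  size_t *start, size_t *end)
{
    size_t s = next_xml_decl(buf, len, from);
    if (s == len)
        return 0;
    *start = s;
    *end = next_xml_decl(buf, len, s + 1);
    return 1;
}
```

With a truly unbounded stream, a span whose end equals len is not known
to be complete until either the next declaration or end of input arrives,
so the caller would need to keep buffering in that case.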
Thanks for your help!
Webb
--
Webb Scales
Principal Software Architect
603-673-2306
www.ursasecure.com <https://www.ursasecure.com>
w...@ursasecure.com <mailto:w...@ursasecure.com>