Matt,

I have tackled a similar problem, though I didn't have to worry about
different encodings. The first problem is that the parsers only stop
when the input source returns EOF; a document is not delimited by the
start/end of the root element, since there can be multiple processing
instructions before and after the root element. I got around that by
creating my own input stream class: the parser intercepted end-element
events, and when I hit the end element for the root element I
instructed the input source to return EOF. There is another issue,
though: when you tell the input stream to return EOF, the parser will
correctly end the parse of the current document, but when you reset
the parser it throws away any extra data it has already read from the
input stream - and that data could be part of the next document. The
only way I could fix this was to only ever return one byte at a time
from the input stream, and the whole thing turned out to be too slow.
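In case it helps, the byte-at-a-time trick can be sketched
independently of the parser. The idea is to wrap the underlying data
and hand out at most one byte per read, so the parser can never buffer
past the current document's boundary. The class below is only an
illustration in plain C++ (the name OneByteStream and its interface
are mine, not Xerces API); a real solution would put the same logic in
a BinInputStream subclass's readBytes().

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper: serves at most one byte per read so the
// parser can never over-buffer past the current document's end.
class OneByteStream {
public:
    explicit OneByteStream(const std::string& data)
        : data_(data), pos_(0), eof_(false) {}

    // Returns the number of bytes written to 'out' (0 or 1).
    std::size_t read(char* out, std::size_t maxToRead) {
        if (eof_ || pos_ >= data_.size() || maxToRead == 0) return 0;
        *out = data_[pos_++];
        return 1;  // never more than one byte, whatever maxToRead says
    }

    // Called from the end-element handler once the root element closes.
    void signalEof() { eof_ = true; }

    // Where the next document starts once this one has been parsed.
    std::size_t pos() const { return pos_; }

private:
    std::string data_;
    std::size_t pos_;
    bool eof_;
};
```

Because only one byte is ever consumed per call, nothing is lost when
signalEof() fires - the unread remainder (the start of the next
document) is still sitting at pos(). That is exactly why it works, and
exactly why it is slow.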

I wanted to do this so I could have client/server applications
streaming XML to each other over sockets. I have iostream socket
classes and could have parsed the XML directly from the socket, which
would have been very clean and simple, but alas I have not been able
to find a good solution. Instead I serialise the XML into another
message format; I read these messages, extract the XML section, and
parse it. It's a bit of double handling, but it's still quick and
works well.
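The double handling amounts to framing: each XML document travels in
an envelope that carries its own length, so the receiver can slice
documents apart without parsing them first. A minimal sketch of
length-prefixed framing in C++ (this format is my own invention for
illustration, not the actual message format I use):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical framing: a 4-byte big-endian length, then the payload.
// Each document is extracted whole and parsed in isolation, so a
// malformed one can be discarded without losing the rest of the stream.

std::string frame(const std::string& xml) {
    std::uint32_t n = static_cast<std::uint32_t>(xml.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += xml;
    return out;
}

// Splits a buffer of concatenated frames back into documents.
std::vector<std::string> unframe(const std::string& buf) {
    std::vector<std::string> docs;
    std::size_t pos = 0;
    while (pos + 4 <= buf.size()) {
        std::uint32_t n =
            (static_cast<std::uint32_t>(static_cast<std::uint8_t>(buf[pos])) << 24) |
            (static_cast<std::uint32_t>(static_cast<std::uint8_t>(buf[pos + 1])) << 16) |
            (static_cast<std::uint32_t>(static_cast<std::uint8_t>(buf[pos + 2])) << 8) |
             static_cast<std::uint32_t>(static_cast<std::uint8_t>(buf[pos + 3]));
        if (pos + 4 + n > buf.size()) break;  // incomplete frame: wait for more data
        docs.push_back(buf.substr(pos + 4, n));
        pos += 4 + n;
    }
    return docs;
}
```

Each extracted string can then be fed to the parser (e.g. via a
MemBufInputSource). Mixed encodings are no problem this way, because
every document still carries its own XML declaration.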

Gary.

> -----Original Message-----
> From: Matt Nemenman [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, 2 October 2003 10:19 AM
> To: [EMAIL PROTECTED]
> Subject: Parsing a stream of XML documents
> 
> 
> Hello Everyone!
> 
> I am trying to write an application that has to parse a sequence of
> XML documents (thousands of them) from a file/stream. Every document
> in the sequence should be well-formed XML, but they are not
> necessarily in the same encoding. The stream will look somewhat like
> this:
> 
> :BEGIN EXAMPLE: 
> 
> <?xml version="1.0" encoding="utf-8"?>
> <document id="1">
>   ... content ...
> </document>
> 
> <?xml version="1.0" encoding="iso-8859-1"?>
> <document id="2">
>   ... content ...
> </document>
> 
> ...
> 
> :END EXAMPLE:
> 
> The problem is that if there is a well-formedness error in any of
> the documents, I don't want to discard the whole stream, since there
> may be thousands of good, well-formed XML documents in it. I want to
> discard just the one document, but try to recover and continue
> parsing the next one.
> 
> Does anyone have any suggestions on how to do it "the right way"?
> 
> I was thinking of deriving my own InputSource class that would be
> similar to LocalFileInputSource, but would keep reusing the same
> BinFileInputStream object for every makeStream() call. Then I would
> supply this InputSource to SAX2XMLReader::parse(), reset the
> SAX2XMLReader after the doc is complete, and call parse() again and
> again ...
> 
> This should work fine (I haven't tried it yet, though) if all the
> documents in the stream are well-formed. If not, the parser will die
> half-way through a document. At that point I will have to recover by
> searching for the closing </document> tag, so I can start parsing
> the next document right after it. But in order to do that I need to
> know what encoding the malformed document was in. Is there any way
> to get access to that info?
> 
> I can see other problems with such an approach too (e.g. what if the
> well-formedness error occurs even before the opening <document>
> tag?), so I am wondering whether I am on the right path at all.
> 
> Any advice on this is really appreciated. Thanks a lot,
> 
>       -- Matt


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
