> It might never be possible to reproduce character for character any
> arbitrary XML file. This is not the job of an XML parser really, so I
> personally feel it's unrealistic to expect this to happen. The overhead to
> allow it to happen would place a very large burden on the vast majority of
> users of the parser who don't need this functionality. The job of the
> parser is to present via its internal event APIs the info set, properly
> massaged. The infoset doesn't include whitespace that is not relevant.

I agree. I was thinking more or less of what you call a proper infoset, not an identical copy. Nevertheless, I think it should be possible to write a file back out without modifying its entities. Why? In my opinion there are several reasons:

1. We don't live in a perfect world. Even if Xerces handles UTF-8, ISO-8859-1 and other encodings according to the spec, there are other XML applications (e.g. XML editors) that get a little upset if you serve them Unicode. In my case the editor not only misinterprets the Unicode but also swallows "normal" characters if Unicode is included. These tools, especially those that come from the SGML world, work well with external entities.

2. Debugging. If you have more than one XML application in your processing chain and errors occur, you want to isolate the error. The easiest way to do this is to separate the steps and run module tests with data you "prepare by hand". In my case, everything worked fine as long as I served app-1, app-2, app-3 etc. "handcrafted" XML files. When I integrated the pieces, the whole chain collapsed, and I had to spend numerous hours finding out where the problem was, because I wasn't able to build a "do-nothing" pipe and selectively activate the modules.

By the way: how can you evaluate the correctness of a parser without an identity transformation? If you cannot produce output that is identical to the input, you have to use your own parser and at least one other to check whether the output is correct (unless you manually consult the Unicode spec). At least in my experience it is hard to choose a parser for that check, since there is no reference implementation of an XML parser.

3. The traditional publishing industry. If you have to deliver XML to a traditional publisher, he will not accept Unicode.
Not because he or she was born in the 1920s and hasn't caught up with modern technology yet, but because many tools used for print preparation have to be adjusted to the character sets you deliver. No printing engine, at least none I know of, is capable of handling Unicode. Therefore you usually deliver your entity set and have the publisher adjust the engine to this defined set of "irregular" characters. That was one major advantage of SGML over WinWord: you could be sure that your special characters would be mapped the way you expected.

4. The browsers. Take a simple XHTML 1.0 file, process it with Xerces and transform it with Xalan. Can you be sure the result can be displayed in IE5, IE3, Opera 1.0, Netscape 2, Mozilla M5? If your answer is yes, why? The only thing you know is that these browsers support HTML 2.0, an SGML DTD that uses external entities. You therefore have to preserve the entities. Why is there an XHTML serializer in Xerces? XHTML is an XML DTD! Why can't you use the "normal" XMLSerializer (if the issue is the XML declaration, which might be misinterpreted, just say ommitDeclaration())?

> Any application which is sensitive to the fact that two
> XML files (which are um.... lets say topologically equivalent by XML
> infoset rules) are different, maybe shouldn't be using XML.

If you are transferring money to your bank through XML-EDI, would you accept it if the bank said that they can't reproduce your original transfer order from their internal processes and instead stated that your order and their infoset are topologically equivalent?

> And of course I do agree that the file that is spit back out by any
> standard 'rewriter' tools we provide should create a legal document :-) So
> if that's not happening then we should fix it.

Please don't misinterpret this. This error was easy to locate and I'm sure you'll fix it asap.
Things like this happen, and they are the reason why the open-source community will always be ahead of "monopolistic" software producers. I know this bug will be fixed in one of your next versions and not in the Y3K version of a commercial product.

Armin
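
P.S. To make the "do-nothing" pipe from point 2 concrete, here is a minimal sketch. It assumes the standard JAXP `javax.xml.transform` API rather than the Xerces-specific XMLSerializer, and the class name `IdentityPipe`, the method `roundTrip` and the sample document are made up for illustration. It pushes a document with an internally declared entity through an identity transformation and checks whether the entity reference survives. With a typical parser setup I would expect the reference to be expanded on the way through, which is exactly the loss discussed above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class IdentityPipe {

    // Hypothetical helper: parse the input and serialize it again
    // without applying any stylesheet (an identity transformation).
    static String roundTrip(String xml) throws Exception {
        // newTransformer() with no arguments yields an identity transformer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(new StreamSource(new StringReader(xml)),
                           new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption): one entity declared in the internal subset.
        String input =
            "<!DOCTYPE doc [<!ENTITY auml \"&#228;\">]>\n"
            + "<doc>M&auml;rz</doc>";

        String output = roundTrip(input);

        System.out.println(output);
        // Reports whether the literal reference "&auml;" is still in the output.
        System.out.println("entity reference preserved: "
                           + output.contains("&auml;"));
    }
}
```

Running each module of a chain against such a pipe first would at least show which step changes the entities, before any real processing is switched on.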
