> It might never be possible to reproduce character for character any
> arbitrary XML file. This is not the job of an XML parser really, so I
> personally feel it's unrealistic to expect this to happen. The overhead to
> allow it to happen would place a very large burden on the vast majority of
> users of the parser who don't need this functionality. The job of the
> parser is to present via its internal event APIs the info set, properly
> massaged. The infoset doesn't include whitespace that is not relevant.

I agree. I was thinking more or less about what you call a proper infoset
and not an identical copy.
Nevertheless I think it should be possible to reproduce a file without
modifying the entities. Why?
In my opinion there are several reasons for that:

1. We don't live in a perfect world. Even if Xerces is able to handle UTF-8,
ISO-8859-1 etc. encodings according to the spec, there are other
XML applications (e.g. XML editors) that get a little upset if you serve
them Unicode. In my special case the editor not only misinterprets the
Unicode but also swallows "normal" characters if Unicode is included. These
tools (especially if they come from the SGML world) work well with external
entities.

2. Debugging. If you have more than one XML-app in your processing chain and
errors occur you want to isolate that error. The easiest way to do this is
to separate the steps by doing module tests with data you "prepare by hand".
In my case, everything worked fine as long as I served app-1, app-2, app-3 etc.
"handcrafted" XML-files. When integrating the pieces the whole chain
collapsed and I had to spend numerous hours to find out where the problem
was, because I wasn't able to produce a "do-nothing" pipe and selectively
activate the modules.
By the way: How can you evaluate the correctness of a parser without an
identity transformation? If you cannot produce an output that is identical
to the input you have to use your own and at least one other parser to check
whether this output is correct (unless you manually consult the
Unicode spec). At least from my experience it is hard to choose a parser for
that check since there is no reference implementation of an XML parser.

3. The traditional publishing industry. If you have to deliver XML to a
traditional publisher, he or she will not accept Unicode. Not because he or
she was born in the 1920s and hasn't caught up with modern technologies yet,
but because many tools used for print preparation have to be adjusted to the
character sets you deliver. No printing engine - at least none I know of -
is capable of handling Unicode. Therefore you usually deliver your entity
set and have the publisher adjust the engine to this defined set of
"irregular" characters. That was one major advantage of SGML over WinWord!
You could be sure that your special characters would be mapped the way you
expected.

4. The browsers. Take a simple XHTML 1.0 file, process it with Xerces and
transform it with Xalan. Can you be sure the result can be displayed in IE5,
IE3, Opera 1.0, Netscape 2, Mozilla M5? If your answer is Yes, why? The only
thing you know is that these browsers support HTML 2.0 - an SGML-DTD that
uses external entities. You therefore have to conserve the entities.
Why is there an XHTML-serializer in Xerces? XHTML is an XML-DTD! Why can't
you use the "normal" XMLSerializer (if it is for the XML-declaration that
might be misinterpreted, just say omitDeclaration())?
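
To make points 1 and 4 a bit more concrete, here is a minimal sketch of the
kind of output I have in mind for tools that choke on Unicode. It is written
against the standard JAXP API (javax.xml.transform) rather than the Xerces
serializer classes, only so that it is self-contained. The class name
AsciiXmlWriter and the method toAsciiXml are made up for this mail. It asks
for US-ASCII output without an XML declaration, so characters outside ASCII
should come out as numeric character references. Note that this is only a
fallback: it does not preserve named entities from the input.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration; not part of Xerces or Xalan.
class AsciiXmlWriter {

    // Re-serializes the given XML text as US-ASCII without an XML declaration.
    static String toAsciiXml(String xml) {
        try {
            // newTransformer() without a stylesheet yields an identity transformer.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Characters that cannot be encoded in US-ASCII have to be
            // written by the serializer as character references.
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }
}
```

Calling AsciiXmlWriter.toAsciiXml("<p>M\u00e4rz</p>") should give a string
without an XML declaration in which the umlaut appears as a character
reference instead of a raw character.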

> Any application which is sensitive to the fact that two
> XML files (which are um.... lets say topologically equivalent by XML
> infoset rules) are different, maybe shouldn't be using XML.

If you are transferring money to your bank through XML-EDI, would you accept
it if this bank says that they can't reproduce your original transfer-bill
from their internal processes and instead state that your order and their
infoset are topologically equivalent?
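
The gap between "identical" and "equivalent" is easy to demonstrate. The
following is a rough sketch of the "do-nothing" pipe from point 2, again
using plain JAXP instead of Xerces-specific classes so that it stands alone.
RoundTripCheck and roundTrip are names I invented here. The input declares
an internal entity and uses it once; after a parse and re-serialize the
entity reference is expanded, so the output is no longer the text that went
in, even though the infoset is the same.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration; not part of Xerces or Xalan.
class RoundTripCheck {

    // Parses the XML text and writes it straight back out ("do-nothing" pipe).
    static String roundTrip(String xml) {
        try {
            // Identity transformer: no stylesheet, input events are copied to output.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    // True only if the pipe reproduced the input character for character.
    static boolean isIdentical(String xml) {
        return xml.equals(roundTrip(xml));
    }
}
```

For an input such as
<!DOCTYPE a [<!ENTITY auml "&#228;">]><a>&auml;</a>
isIdentical should return false, because the parser has already replaced
&auml; before the serializer ever sees it.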

> And of course I do agree that the file that is spit back out by any
> standard 'rewriter' tools we provide should create a legal document :-) So
> if that's not happening then we should fix it.
>
Please don't misinterpret this. This error was easy to locate and I'm sure
you'll fix it asap. Things like this happen and are the reason why the
OpenSource-community will always be ahead of "monopolistic"
software-producers. I know this bug will be fixed in one of your next
versions and not with the Y3K-version of a commercial product.

Armin

