"By the way: How can you evaluate the correctness of a parser without an
identity transformation? If you cannot produce an output that is identical
to the input you have to use your own and at least one other parser to
check whether this output is correct (unless you manually consult the
Unicode spec). At least from my experience it is hard to choose a parser
for that check since there is no reference implementation of an XML parser"
There are some 'canonical' output formats defined. They aren't very useful
for real world stuff, but they ensure that any XML parser, when parsing the
same file, will produce identical results in that format. And of course
those formats depend only on the infoset information. So we have lots of
test files, each with an associated file that holds the expected canonical
output. If the output from the program (which is designed to emit this
canonical format) differs from what is expected, then either the canonical
output program is bad (easy enough to check), or the parser is not
delivering the correctly munged XML infoset data.
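
To make the idea concrete, here is a minimal sketch of such a check in
Python. It is only an illustration: it uses the standard library's
canonicalize() function, which implements W3C Canonical XML 2.0 and is not
necessarily the canonical format our test files use. The names doc_a, doc_b,
EXPECTED and matches_expected are made up for this example.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8 or later

# Two inputs that differ byte for byte (declaration, quoting, attribute
# order, spacing inside tags) but carry the same information.
doc_a = '<?xml version="1.0"?>\n<order id="7" currency=\'EUR\'><amount>10</amount></order>'
doc_b = "<order   currency=\"EUR\"  id='7'><amount>10</amount></order  >"

# Stand-in for the stored "expected canonical output" file of a test case.
EXPECTED = '<order currency="EUR" id="7"><amount>10</amount></order>'


def matches_expected(xml_text: str, expected_canonical: str) -> bool:
    """Parse xml_text, serialize it canonically, compare with the expectation."""
    return canonicalize(xml_text) == expected_canonical


if __name__ == "__main__":
    for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
        print(name, "passes" if matches_expected(doc, EXPECTED) else "FAILS")
```

A mismatch then points either at the canonical writer or at the parser, which
is the same two-way split described above.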
"3. The traditional publishing industry. If you have to deliver XML to a
traditional publisher he will not accept Unicode. Not because he or she was
born in the 1920s and didn't catch up with modern technologies yet, but
because many tools used for print preparation have to be adjusted to the
character sets you deliver. No printing engine - at least none I know of -
is capable of handling Unicode. Therefore you usually deliver your entity
set and have the publisher adjust his engine to this defined set of
"irregular" characters. That was one major advantage of SGML over WinWord!
You were sure that your special characters could be mapped the way you
expected it."
I agree with this, but I'm not sure how it relates to the topic at hand.
"4. The browsers. Take a simple XHTML 1.0 file, process it with Xerces and
transform it with Xalan. Can you be sure the result can be displayed in
IE5, IE3, Opera 1.0, Netscape 2, Mozilla M5? If your answer is Yes, why?
The only thing you know is that these browsers support HTML 2.0 - an
SGML-DTD that uses external entities. You therefore have to conserve the
entities.
Why is there an XHTML-serializer in Xerces? XHTML is an XML-DTD! Why can't
you use the "normal" XMLSerializer (if it is for the XML-declaration that
might be misinterpreted, just say ommitDeclaration())?"
Not sure what the point is here either, though it sounds reasonable enough
to me. I agreed that entities should be reproducible, so if that's what
you are saying, then we agree.
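
For anyone wondering why entity references get lost in the first place, the
following small sketch shows the effect with Python's standard ElementTree
parser (used here purely as an illustration, not as a statement about Xerces).
The document and the "auml" entity are invented for the example. The parser
expands the internal entity while parsing, so the serializer never sees the
original reference.

```python
import xml.etree.ElementTree as ET

# A document that declares an internal entity and then references it.
source = '<!DOCTYPE p [<!ENTITY auml "&#228;">]><p>M&auml;rz</p>'

root = ET.fromstring(source)

# The tree only holds the expanded character, not the reference.
text = root.text

# Serializing the tree again therefore cannot bring "&auml;" back.
round_trip = ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(repr(text))
    print(round_trip)
```

Reproducing the reference would need the parser to report entity boundaries
and the serializer to honor them, which goes beyond the plain infoset data.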
"If you are transferring money to your bank through XML-EDI, would you
accept it if this bank says that they can't reproduce your original
transfer-bill from their internal processes and instead state that your
order and their infoset are topologically equivalent?"
The non-infoset information in my transaction is irrelevant. As long as the
relevant data (the infoset) is captured by them, that's all I care about.
They don't have to reproduce byte for byte the original document I sent
them, just the data that mattered. If my original bill and their stored
version of it are 'topologically' the same from an infoset point of view,
then the data is absolutely in agreement between them. I don't see why this
would keep them from providing me with a reproduction of my original. Does
the bank statement you get each month have literal pictures of the checks
you wrote, or just a transcript of the relevant data?
In a non-XML type EDI transfer, if my name were in a 128 character fixed
field, would they save the unused spaces after my name?
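
As a rough sketch of what "the same from an infoset point of view" can mean
in practice, the Python snippet below compares two byte-wise different
versions of a transfer by their canonical form, and reads a padded fixed
field the way a non-XML system typically would. Everything in it (the
transfer markup, same_data, read_fixed_field, the sample values) is
hypothetical, and canonicalize() is the standard library's Canonical XML 2.0
implementation, used here only as a convenient stand-in for an infoset
comparison.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8 or later

# What I sent, and what the bank might hand back from its own storage:
# different quoting, attribute order, and a character reference vs. a literal.
sent = '<transfer to="DE00-1234" amount="250.00"><payee>Ren&#233; Example</payee></transfer>'
stored = "<transfer amount='250.00' to='DE00-1234'><payee>Ren\u00e9 Example</payee></transfer>"


def same_data(a: str, b: str) -> bool:
    """True when both documents have the same canonical serialization."""
    return canonicalize(a) == canonicalize(b)


def read_fixed_field(record: str, width: int = 128) -> str:
    """Read a fixed-width field and drop the trailing pad spaces."""
    return record[:width].rstrip(" ")


# A name stored in a 128 character field, padded with spaces.
padded = "A. Customer".ljust(128)

if __name__ == "__main__":
    print("byte-identical:", sent == stored)
    print("same data:     ", same_data(sent, stored))
    print("field value:   ", repr(read_fixed_field(padded)))
```

The pad spaces and the quoting style play the same role here: neither is part
of the data that the transaction is about.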
----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]