This is not an answer to your question, but a general purpose rant about
what parsers should and should not be required to do...
It might never be possible to reproduce any arbitrary XML file character for
character. That is not really the job of an XML parser, so I personally feel
it's unrealistic to expect it. The overhead needed to allow it would place a
very large burden on the vast majority of the parser's users, who don't need
this functionality. The job of the parser is to present the infoset, properly
massaged, via its internal event APIs. The infoset doesn't include whitespace
that is not relevant.
The C++ parser does go somewhat further than the infoset and returns
whitespace between markup in the internal and external subsets so that they
can be reasonably recreated, but that's it. Returning irrelevant whitespace
inside markup would be such a performance burden that it wouldn't really be
realistic to do it. Any application that is sensitive to differences between
two XML files which are (um... let's say) topologically equivalent by XML
infoset rules maybe shouldn't be using XML.
I do agree that being able to know what is being expanded from inside
referenced entities is useful and we are supposed to do that (though the
C++ parser doesn't do it for entities used in attributes right now.) But in
general, I don't think that parsers should be required to allow client code
to reproduce an XML file exactly, byte for byte, because any two XML files
which are 'topologically equivalent' will produce the same infoset data
from the parser.
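To make 'topologically equivalent' concrete, here is a minimal sketch. It assumes the JAXP DocumentBuilder that ships with a current JDK rather than the Xerces internals discussed here, and the class and method names (InfosetDemo, sameTree) are made up for illustration. It parses two strings that differ in attribute order, quoting, whitespace inside tags, and empty-element syntax, and compares the resulting trees.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InfosetDemo {
    // Parse an XML string into a DOM tree with the JDK's default parser.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable: " + xml, e);
        }
    }

    // True when both strings yield structurally equal element trees
    // (same names, same attributes regardless of order, same children).
    static boolean sameTree(String a, String b) {
        return parse(a).getDocumentElement()
                .isEqualNode(parse(b).getDocumentElement());
    }

    public static void main(String[] args) {
        String one = "<doc id=\"1\" lang='en'><item/></doc>";
        String two = "<doc   lang=\"en\"\n     id='1' ><item></item></doc>";
        System.out.println("same characters: " + one.equals(two));
        System.out.println("same tree:       " + sameTree(one, two));
    }
}
```

The two inputs are different character sequences, yet nothing the parser reports lets client code tell them apart, which is why a byte-for-byte rewrite cannot be driven from the parse alone.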
And of course I do agree that the file that is spit back out by any
standard 'rewriter' tools we provide should create a legal document :-) So
if that's not happening then we should fix it.
----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]
"Armin Pfarr" <[EMAIL PROTECTED]> on 02/25/2000 05:59:27 AM
Please respond to [EMAIL PROTECTED]
To: <[EMAIL PROTECTED]>
cc:
Subject: Identity-transformation
Hi,
I'm parsing documents with the Xerces DOMParser, modifying some nodes, and then
want to write these documents back to disk. At the moment, there doesn't seem
to be a working solution for this problem. If you leave out my
DOM processing, the simple question is whether there is a standard way to
parse a document into memory via DOMParser and stream it out again so that
both input and output are identical.
1. Serializing with Xerces 1.0.2's XMLSerializer doesn't work
When trying to serialize the DOM Document with
    import java.io.PrintWriter;
    import org.apache.xerces.parsers.DOMParser;
    import org.apache.xml.serialize.*;
    import org.w3c.dom.Document;

    // Parse the input into a DOM tree.
    DOMParser parser = new DOMParser();
    parser.parse(input);
    Document d = parser.getDocument();
    // Configure the serializer for plain XML output.
    PrintWriter writer = new PrintWriter(.....);
    OutputFormat format = new OutputFormat();
    format.setMethod(Method.XML);
    format.setOmitXMLDeclaration(false);
    format.setPreserveSpace(true);
    format.setVersion("1.0");
    Serializer serializer = SerializerFactory
        .getSerializerFactory(Method.XML)
        .makeSerializer(writer, format);
    // Write out the document that was just parsed.
    serializer.asDOMSerializer().serialize(d);
After serializing, the file does not contain a space between the public
and the system identifier. I don't know if this is the only problem, but the
resulting file doesn't parse and is not identical to the input.
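For comparison, a parse-and-write round trip can be sketched with the javax.xml.transform identity Transformer found in current JDKs. This is an assumption for illustration, not the Xerces 1.0.2 XMLSerializer used above, and the class name RoundTripDemo and its methods are made up. The sketch copies the public and system identifiers from the parsed DocumentType onto the output properties so that the DOCTYPE is written back, and it answers external DTD requests with empty content so that it runs without network access.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

public class RoundTripDemo {
    // Parse a string into a DOM tree without fetching any external DTD.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Demo-only resolver: every external entity resolves to empty content.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Serialize the DOM with an identity transform, carrying the DOCTYPE over.
    static String write(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DocumentType doctype = doc.getDoctype();
            if (doctype != null && doctype.getSystemId() != null) {
                if (doctype.getPublicId() != null) {
                    identity.setOutputProperty(
                        OutputKeys.DOCTYPE_PUBLIC, doctype.getPublicId());
                }
                identity.setOutputProperty(
                    OutputKeys.DOCTYPE_SYSTEM, doctype.getSystemId());
            }
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<!DOCTYPE note PUBLIC \"-//EXAMPLE//DTD Note//EN\" "
            + "\"http://example.invalid/note.dtd\"><note>hi</note>";
        System.out.println(write(parse(input)));
    }
}
```

The output is meant to be a legal document that parses back to the same tree. It is not guaranteed to match the input character for character.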
2. When using Xalan 0.19.5, you run into major entity problems
My file contains entity references to the standard XHTML entity sets (e.g.
&auml;) which are declared in a separate file. I don't want to convert these
references to Unicode but want to leave them as they are. I tried several
stylesheets with several encodings, but wasn't able to produce proper
output.
Here is a sample XSLT stylesheet:
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- I also tried several other encodings -->
      <xsl:output method="xml" encoding="UTF-8"/>
      <xsl:template match="*|@*|comment()|processing-instruction()|text()">
        <xsl:copy>
          <xsl:apply-templates
              select="*|@*|comment()|processing-instruction()|text()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>
As you can see, I just do a straight copy-over.
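On the entity question, one thing worth checking is whether the references survive parsing at all. The sketch below assumes the JAXP DocumentBuilderFactory in a current JDK, not Xalan 0.19.5 or the Xerces DOMParser, and EntityRefDemo is a made-up name. It turns off entity expansion and lists the EntityReference nodes left in the tree. It does not show how any particular serializer or XSLT processor writes such nodes back out, and behaviour may differ between parser versions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EntityRefDemo {
    // Parse without expanding general entities and return the names of
    // the entity references that remain as nodes in the element content.
    static List<String> entityRefNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setExpandEntityReferences(false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collect(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Depth-first walk; entity reference nodes are recorded, not descended into.
    private static void collect(Node node, List<String> names) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                names.add(c.getNodeName());
            } else {
                collect(c, names);
            }
        }
    }

    public static void main(String[] args) {
        // The entity is declared inline here; in the original case it
        // would come from the external XHTML entity sets.
        String xml = "<!DOCTYPE p [<!ENTITY auml \"&#228;\">]><p>M&auml;rz</p>";
        System.out.println(entityRefNames(xml));
    }
}
```

If the reference shows up here, the remaining question is only whether the output stage emits it by name or expands it.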
Has anybody run into the same problem before or does anybody have an idea
how to solve this without writing a specialized DOM-Serializer?
Armin