[ http://issues.apache.org/jira/browse/XERCESJ-1019?page=history ]
Michael Glavassevich resolved XERCESJ-1019:
-------------------------------------------
Resolution: Won't Fix
The native serializer doesn't check well-formedness. If you want a serializer that
checks well-formedness, you should use DOM Level 3 [1]. With DOM Level 3 you can also
check the well-formedness of your DOM, which gives you an opportunity to do whatever
your application wants, for instance replacing the illegal characters with
placeholders as you suggested.
[1] http://xml.apache.org/xerces2-j/dom3.html
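
As a rough illustration of the DOM Level 3 check mentioned above, here is a minimal sketch. It assumes a DOM Level 3 capable implementation (such as the one bundled with the JDK) and uses the standard DOMConfiguration parameters "well-formed" and "error-handler" together with Document.normalizeDocument(). The class WellFormedCheck and its methods are hypothetical names chosen for this example; they are not part of Xerces. The DOM Level 3 Core specification names the relevant error type "wf-invalid-character", but the exact messages depend on the implementation.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMError;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class for illustration; not part of Xerces.
class WellFormedCheck {

    // Builds a one-element document like the one in the report,
    // with the given string as the only text node.
    static Document buildDocument(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("rootelement");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode(text));
        return doc;
    }

    // Normalizes the document with the "well-formed" parameter enabled and
    // returns the DOMError types that were reported to the error handler.
    static List<String> check(Document doc) {
        final List<String> types = new ArrayList<String>();
        DOMConfiguration config = doc.getDomConfig();
        config.setParameter("well-formed", Boolean.TRUE);
        config.setParameter("error-handler", new DOMErrorHandler() {
            public boolean handleError(DOMError error) {
                types.add(error.getType());
                return true; // keep going so that all problems are collected
            }
        });
        doc.normalizeDocument();
        return types;
    }

    public static void main(String[] args) throws Exception {
        // 0x0b (vertical tab) is not a legal XML 1.0 character.
        Document doc = buildDocument(String.valueOf((char) 0x0b));
        System.out.println("Reported error types: " + check(doc));
    }
}
```

An application could run such a check before serializing and then decide for itself whether to reject the document or to replace the offending characters.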
> produces invalid character reference
> ------------------------------------
>
> Key: XERCESJ-1019
> URL: http://issues.apache.org/jira/browse/XERCESJ-1019
> Project: Xerces2-J
> Type: Bug
> Components: Serialization
> Versions: 2.0.2, 2.6.2
> Environment: W2K, Sun JDK 1.4.2
> Reporter: Thomas Bensler
> Priority: Minor
>
> When an org.w3c.dom.Document contains a text node with control characters below 0x20,
> e.g. 0x0b, these characters end up (encoded as character references) in the XML file.
> The code snippet below demonstrates this:
> ----------------------- 8< -----------------------
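> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import org.apache.xerces.dom.DocumentImpl;
> import org.apache.xerces.parsers.DOMParser;
> import org.apache.xml.serialize.OutputFormat;
> import org.apache.xml.serialize.XMLSerializer;
> import org.w3c.dom.Element;
> import org.xml.sax.InputSource;
>
> // build a document whose only text node holds the control character 0x0b
> final File file = new File("E:\\temp\\illegalCharacter.xml");
> final FileOutputStream fout = new FileOutputStream(file);
> final XMLSerializer serializer = new XMLSerializer();
> final OutputFormat outFormat = new OutputFormat();
> final DocumentImpl doc = new DocumentImpl();
> final Element rootElement = doc.createElement("rootelement");
> final DOMParser parser = new DOMParser();
> doc.appendChild(rootElement);
> rootElement.appendChild(doc.createTextNode(new String(new char[] {11})));
> // serialize as UTF-8 without indentation
> outFormat.setEncoding("UTF-8");
> outFormat.setIndenting(false);
> serializer.setOutputFormat(outFormat);
> serializer.setOutputByteStream(fout);
> serializer.serialize(doc);
> fout.close();
> // reparsing the serialization result
> parser.parse(new InputSource(new FileInputStream(file)));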
> ----------------------- 8< -----------------------
> The produced XML file looks like this:
> ----------------------- 8< -----------------------
> <?xml version="1.0" encoding="UTF-8"?>
> <rootelement>&#xb;</rootelement>
> ----------------------- 8< -----------------------
> Reparsing the file fails:
> [Fatal Error] :2:19: Character reference "&#b" is an invalid XML character.
> As I understand the XML spec, the parser is right to reject the file. So I think the
> serializer should replace illegal characters with some legal placeholder character
> (space or '?').
> The whole case came up in a content management system using Xerces2-J for parsing
> and serializing. The content typed into JTextFields by users is put into text nodes
> of a DOM tree and serialized. One user picked up the 0x0b character while copying
> and pasting from a PowerPoint presentation. Even if this kind of character is not
> very common in a Java String, I think the serializer should handle it without
> producing invalid XML.
> If you define the right behaviour for handling the control characters (which
> chars should be replaced by which placeholder), I would like to provide a patch (a
> hint about the classes involved would be appreciated).
> Thanks for listening!
> Ciao, Thomas.
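
For the placeholder replacement discussed in this issue, an application-side filter applied before the text reaches the DOM might look like the sketch below. It follows the Char production of the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class XmlTextSanitizer and its methods are hypothetical names chosen for this example, not Xerces API, and the choice of placeholder is left to the caller.

```java
// Hypothetical helper class for illustration; not part of Xerces.
class XmlTextSanitizer {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text in which every code point that is not a
    // legal XML 1.0 character (including unpaired surrogates) is replaced
    // by the given placeholder.
    static String sanitize(String text, char placeholder) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(placeholder);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pasted = "a" + (char) 0x0b + "b";
        System.out.println(sanitize(pasted, '?'));
    }
}
```

In the scenario described in the report, the content read from a JTextField could be passed through such a filter before it is handed to createTextNode.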