[jira] Resolved: (XERCESJ-1019) produces invalid character reference

Michael Glavassevich (JIRA) Thu, 28 Oct 2004 10:59:25 -0700

     [ http://issues.apache.org/jira/browse/XERCESJ-1019?page=history ]
     
Michael Glavassevich resolved XERCESJ-1019:
-------------------------------------------


    Resolution: Won't Fix

The native serializer doesn't check well-formedness. If you want a serializer that 
checks well-formedness, you should use DOM Level 3 [1]. With DOM Level 3 you can also 
check the well-formedness of your DOM, which gives you an opportunity to do whatever 
your application wants, for instance replacing the illegal characters with 
placeholders as you suggested.

[1] http://xml.apache.org/xerces2-j/dom3.html
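
For illustration, the placeholder replacement mentioned above could be done by the 
application before serializing. The following is only a minimal sketch, not Xerces 
code: the class XmlCharSanitizer and its method names are hypothetical, and it assumes 
XML 1.0 (the Char production: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, 
#x10000-#x10FFFF). It only touches text and CDATA nodes; attribute values would need 
the same treatment.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Xerces: replaces characters that XML 1.0
// does not allow with a placeholder before the DOM is serialized.
final class XmlCharSanitizer {

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every illegal code point (including lone surrogates) with the placeholder. */
    static String replaceIllegalXmlChars(String text, char placeholder) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(placeholder);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Walks the subtree below node and sanitizes all text and CDATA nodes in place. */
    static void sanitizeTextNodes(Node node, char placeholder) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            node.setNodeValue(replaceIllegalXmlChars(node.getNodeValue(), placeholder));
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sanitizeTextNodes(child, placeholder);
        }
    }

    public static void main(String[] args) throws Exception {
        // Same document as in the report: one text node holding the character 0x0b.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Node root = doc.appendChild(doc.createElement("rootelement"));
        root.appendChild(doc.createTextNode(new String(new char[] {11})));

        sanitizeTextNodes(doc, '?');
        System.out.println(root.getTextContent());
    }
}
```

Calling sanitizeTextNodes(doc, '?') right before serializer.serialize(doc) in the 
reported snippet should then keep the illegal character out of the output. Whether a 
space or '?' is the better placeholder is left to the application.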

> produces invalid character reference
> ------------------------------------
>
>          Key: XERCESJ-1019
>          URL: http://issues.apache.org/jira/browse/XERCESJ-1019
>      Project: Xerces2-J
>         Type: Bug
>   Components: Serialization
>     Versions: 2.0.2, 2.6.2
>  Environment: W2K, Sun JDK 1.4.2
>     Reporter: Thomas Bensler
>     Priority: Minor

>
> When an org.w3c.dom.Document contains a text node with control characters below 
> 0x20, e.g. 0x0b, these characters end up in the XML file as character references. 
> The code snippet below demonstrates it:
> ----------------------- 8< ----------------------- 
> final File          file        = new File("E:\\temp\\illegalCharacter.xml");
> final FileOutputStream  fout    = new FileOutputStream(file);
> final XMLSerializer serializer  = new XMLSerializer();
> final OutputFormat  outFormat   = new OutputFormat();
> final DocumentImpl  doc         = new DocumentImpl();
> final Element       rootElement = doc.createElement("rootelement");
> final DOMParser     parser      = new DOMParser();
> doc.appendChild(rootElement);
> rootElement.appendChild(doc.createTextNode(new String(new char[] {11})));
> outFormat.setEncoding("UTF-8");
> outFormat.setIndenting(false);
> serializer.setOutputFormat(outFormat);
> serializer.setOutputByteStream(fout);
> serializer.serialize(doc);
> fout.close();
> // reparsing the serialization result 
> parser.parse(new InputSource(new FileInputStream(file)));
> ----------------------- 8< ----------------------- 
> The produced XML file looks like this:
> ----------------------- 8< ----------------------- 
> <?xml version="1.0" encoding="UTF-8"?>
> <rootelement>&#xb;</rootelement>
> ----------------------- 8< ----------------------- 
> Reparsing the file fails:
> [Fatal Error] :2:19: Character reference "&#b" is an invalid XML character.
> As I understand the XML spec, the parser is right to reject the file. So I think the 
> serializer should replace illegal characters with some legal placeholder character 
> (space or '?').
> The whole case came up in a content management system using Xerces2-J for parsing 
> and serializing. The content typed into JTextFields by users is put into text nodes 
> of a DOM tree and serialized. One user picked up the 0x0b character by copying and 
> pasting from a PowerPoint presentation. Even if it is not very common to have this 
> kind of character in a Java String, I think the serializer should handle it without 
> producing invalid XML.
> If you define the right behaviour for handling the control characters (which 
> chars should be replaced by which placeholder), I would like to provide a patch (a 
> hint about the classes involved would be appreciated).
> Thanks for listening!
> Ciao, Thomas.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


