Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
---------------------------------------------------------------------------------------------

                 Key: XALANJ-2419
                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
             Project: XalanJ2
          Issue Type: Bug
          Components: Serialization
    Affects Versions: 2.7.1
            Reporter: Henri Sivonen

org.apache.xml.serializer.ToStream contains the following code:

    else if (m_encodingInfo.isInEncoding(ch)) {
        // If the character is in the encoding, and
        // not in the normal ASCII range, we also
        // just leave it get added on to the clean characters
    }
    else {
        // This is a fallback plan, we should never get here
        // but if the character wasn't previously handled
        // (i.e. isn't in the encoding, etc.) then what
        // should we do?  We choose to write out an entity
        writeOutCleanChars(chars, i, lastDirtyCharProcessed);
        writer.write("&#");
        writer.write(Integer.toString(ch));
        writer.write(';');
        lastDirtyCharProcessed = i;
    }

For surrogates, the wrong (latter) branch runs, because isInEncoding() for UTF-8 returns false when it is given a single surrogate code unit. Each half of a surrogate pair is therefore written out as its own numeric character reference (NCR).

It is always wrong (regardless of encoding) to escape a surrogate as an NCR: an XML character reference must refer to a legal XML character, and surrogate code points are not legal XML characters.

The practical effect of this bug is that any document containing astral characters is serialized as ill-formed XML, which an XML parser will refuse to parse back.
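
To make the difference concrete, here is a minimal sketch of escaping by code point rather than by UTF-16 code unit. It is not Xalan code and not a proposed patch: the class AstralNcr and the method escapeNonAscii are hypothetical names invented for this illustration, and the sketch only handles the NCR case (it does not escape markup characters such as & or <). For UTF-8 output the astral character could simply be written directly. The point is only that, when an NCR is written, a surrogate pair has to be combined into one code point first.

```java
final class AstralNcr {

    // Hypothetical helper, not part of Xalan: writes every non-ASCII
    // character as a decimal NCR, combining surrogate pairs into a
    // single code point before escaping.
    static String escapeNonAscii(CharSequence text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            if (Character.isHighSurrogate(ch)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // A well-formed pair: emit one NCR for the astral code point.
                int codePoint = Character.toCodePoint(ch, text.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i += 2;
            } else if (Character.isSurrogate(ch)) {
                // A lone surrogate cannot be represented in XML at all.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else if (ch < 0x80) {
                // ASCII passes through unchanged in this sketch.
                out.append(ch);
                i++;
            } else {
                // Other BMP characters: the code unit is the code point.
                out.append("&#").append((int) ch).append(';');
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, stored as the pair D834 DD1E.
        String input = "a\uD834\uDD1Eb";
        System.out.println(escapeNonAscii(input)); // prints a&#119070;b
    }
}
```

Escaping each code unit separately, as the fallback branch quoted in this report does, would instead produce a&#55348;&#56606;b for the same input, which is the ill-formed output described above.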