Johannes Willnecker created XERCESC-2158:
--------------------------------------------

             Summary: XMLUTF8Transcoder: One multibyte UTF8 character is swallowed from the srcData when the resulting surrogate pair does not fit in toFill at the end
                 Key: XERCESC-2158
                 URL: https://issues.apache.org/jira/browse/XERCESC-2158
             Project: Xerces-C++
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 3.2.2, 3.1.4
         Environment: OS independent: Linux (RedHat 7.5)/Windows 10
                      Compiler independent
            Reporter: Johannes Willnecker
         Attachments: UTF8.xml, xerces.patch

*Bug found in Xerces-C++ Version 3.1.4* (based on code reviews, newer versions are also affected)

*How to reproduce:*
Call SAX2Print for the attached UTF8.xml file: "SAX2Print UTF8.xml". One Chinese character is missing in the name attribute of the last but one Instance element.

*Fix:*
The fix for this bug is included in the xerces.patch file. XMLUTF8Transcoder.cpp already contained a check for this issue, but its conclusion that the bytes read are updated at the end of the loop was wrong. The bytes read (bytesEaten) calculation is based on srcPtr, which has already been advanced when the check is made. Therefore srcPtr needs to be repositioned when the surrogate pair does not fit into the toFill buffer.

*Contributor related:*
Author Name of the code being contributed: Johannes Willnecker
Employer: Siemens AG
I have the right to grant the copyright licenses for the contribution.
My employer has rights to the code that I have written.
My employer gave me permission to contribute this code on its behalf.
I am not aware of any third-party license or other restrictions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org
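
*Illustration of the fix (sketch only):*
The repositioning described under *Fix* can be illustrated with a minimal, self-contained sketch. This is not the code from XMLUTF8Transcoder.cpp or from xerces.patch: the functions transcodeChunk and transcodeAll and the local seqStart are hypothetical names introduced here, the input is assumed to be well-formed UTF-8, and only the names srcPtr and bytesEaten are taken from the report. The sketch advances srcPtr past a 4-byte sequence before it knows whether the surrogate pair fits, and then moves srcPtr back to the start of that sequence when it does not, so that the byte count derived from srcPtr excludes the character that was not written.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified chunked UTF-8 -> UTF-16 transcoder (not Xerces code).
// Assumes well-formed UTF-8. Returns the number of UTF-16 units written and
// stores the number of source bytes consumed in bytesEaten.
std::size_t transcodeChunk(const unsigned char* src, std::size_t srcLen,
                           char16_t* out, std::size_t outCap,
                           std::size_t& bytesEaten)
{
    const unsigned char* srcPtr = src;
    const unsigned char* srcEnd = src + srcLen;
    char16_t* outPtr = out;
    char16_t* outEnd = out + outCap;

    while (srcPtr < srcEnd && outPtr < outEnd) {
        const unsigned char* seqStart = srcPtr;  // remembered for repositioning
        const unsigned char lead = *srcPtr;
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (static_cast<std::size_t>(srcEnd - srcPtr) < len)
            break;  // incomplete sequence at the end of the source chunk

        // Decode the code point from the lead byte and its continuation bytes.
        char32_t cp = (len == 1) ? lead
                                 : static_cast<char32_t>(lead & (0xFF >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (srcPtr[i] & 0x3F);
        srcPtr += len;  // srcPtr is advanced BEFORE the capacity check below

        if (cp < 0x10000) {
            *outPtr++ = static_cast<char16_t>(cp);
        } else {
            if (outPtr + 1 >= outEnd) {
                // The surrogate pair does not fit. Without this repositioning,
                // bytesEaten would include a character that was never written.
                srcPtr = seqStart;
                break;
            }
            cp -= 0x10000;
            *outPtr++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *outPtr++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }

    bytesEaten = static_cast<std::size_t>(srcPtr - src);
    return static_cast<std::size_t>(outPtr - out);
}

// Hypothetical driver: transcodes a whole string through a small output buffer.
std::u16string transcodeAll(const std::string& utf8, std::size_t outCap)
{
    assert(outCap >= 2);  // a surrogate pair must fit, or no progress is possible
    std::u16string result;
    std::vector<char16_t> buf(outCap);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8.data());
    std::size_t remaining = utf8.size();
    while (remaining > 0) {
        std::size_t eaten = 0;
        const std::size_t written = transcodeChunk(p, remaining, buf.data(), outCap, eaten);
        if (eaten == 0)
            break;  // truncated trailing sequence; stop instead of looping forever
        result.append(buf.data(), written);
        p += eaten;
        remaining -= eaten;
    }
    return result;
}
```

With a two-unit output buffer and the input "a", U+20000, "b", the first call writes only "a" and reports one byte consumed, so the next call starts again at the 4-byte sequence. If the line srcPtr = seqStart is removed, the first call reports five bytes consumed and the supplementary character is lost, which matches the symptom described in the report.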