All,
I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have a servlet generating XML in UTF-8 encoding, and a pipeline with a few transforms in it that ultimately serializes to XHTML.

If I have a Unicode character in the XML which is outside of the BMP, such as this one:

🇺🇸

(that's an American flag, in case your mail reader doesn't render it correctly), then I end up getting output from Cocoon after the transform that looks like UTF-16.

Here's what's in the XML:

<first-name>Test🇺🇸</first-name>

Just like that. The bytes in the message for the flag character are:

f0 9f 87 ba f0 9f 87 b8

When rendering that into XHTML, I'm getting this in the output:

Test&#55356;&#56826;&#55356;&#56824;

The Unicode reference for the American flag can be found here:
https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8

You can see it broken down a bit better here for "Regional U":
http://www.fileformat.info/info/unicode/char/1f1fa/index.htm

and "Regional S":
http://www.fileformat.info/info/unicode/char/1f1f8/index.htm

What's happening is that some component in Cocoon has decided to generate HTML entities instead of just emitting the character. That's okay IMO. But what it does doesn't make sense for a UTF-8 output encoding.

The first two entities "&#55356;&#56826;" are the decimal values of the UTF-16 surrogate pair for "Regional Indicator Symbol Letter U", and they are correct... for UTF-16. If I change the output encoding from UTF-8 to UTF-16, then the browser will render these correctly. Using UTF-8, they show as four of those ugly [?] characters on the screen.

I had originally just decided to throw up my hands and use UTF-16 encoding even though it's dumb. But it seems that MSIE cannot be convinced to use UTF-16 no matter what, and I must continue to support MSIE. :(

So it's back to UTF-8 for me.

How can I get Cocoon to output that character (or "those characters") correctly?
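The surrogate arithmetic described above can be checked with a few lines of plain Java. This is only a minimal sketch: the class and method names (SurrogateCheck, utf16Units, utf8Hex) are made up for illustration and are not part of Cocoon. It prints the decimal UTF-16 code units and the UTF-8 bytes for "Regional Indicator Symbol Letter U" (U+1F1FA).

```java
import java.nio.charset.StandardCharsets;

class SurrogateCheck {
    // Decimal values of the UTF-16 code units that encode one code point.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append((int) c);
        }
        return sb.toString();
    }

    // Hex dump of the UTF-8 bytes that encode one code point.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int regionalU = 0x1F1FA; // Regional Indicator Symbol Letter U
        System.out.println(utf16Units(regionalU)); // 55356 56826
        System.out.println(utf8Hex(regionalU));    // f0 9f 87 ba
    }
}
```

The two decimal numbers are the high and low surrogate, which is why they only make sense to a consumer that reassembles them as UTF-16.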
It needs to be one of the following:

&#127482;&#127480; (HTML decimal entities)
&#x1F1FA;&#x1F1F8; (HTML hex entities)
f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)

Does anyone know how/where this conversion is being performed in Cocoon? Probably in an XHTML serializer (I'm using org.apache.cocoon.serialization.XMLSerializer). I'm using mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap for that serializer (the one named "xhtml"). I believe I've made very few changes from the default, if any.

I haven't yet figured out how to get from what Java sees (\uE50C for the "S" for example) to &#127480;, but knowing where the code is that is making that decision would be very helpful.

Any ideas?

-chris
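For reference, the conversion asked about (from the UTF-16 chars Java sees to one decimal entity per code point) can be sketched in plain Java as follows. This is an illustration under assumptions, not the Cocoon serializer's code: CodePointEntities and escapeNonAscii are hypothetical names, and the sketch does not escape markup characters such as & or <.

```java
class CodePointEntities {
    // Replace every non-ASCII code point with a decimal numeric
    // character reference, treating a surrogate pair as ONE code point.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // joins a valid surrogate pair
            if (cp < 0x80) {
                out.append((char) cp);    // plain ASCII passes through
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // 2 for supplementary, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Test" followed by U+1F1FA U+1F1F8, written as surrogate pairs.
        String name = "Test\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(escapeNonAscii(name)); // Test&#127482;&#127480;
    }
}
```

Iterating by char instead of by code point would emit one entity per surrogate, which matches the broken output described above.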