All,

Some additional information at the end.

On 10/30/18 11:58, Christopher Schultz wrote:
> All,
>
> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
> a servlet generating XML in UTF-8 encoding and I have a pipeline
> with a few transforms in it, ultimately serializing to XHTML.
>
> If I have a Unicode character in the XML which is outside of the
> BMP, such as this one: 🇺🇸 (that's an American flag, in case your
> mail reader doesn't render it correctly), then I end up getting a
> series of bytes coming from Cocoon after the transform that look
> like UTF-16.
>
> Here's what's in the XML:
>
> <first-name>Test🇺🇸</first-name>
>
> Just like that. The bytes in the message for the flag character
> are:
>
> f0 9f 87 ba f0 9f 87 b8
>
> When rendering that into XHTML, I'm getting this in the output:
>
> Test&#55356;&#56826;&#55356;&#56824;
>
> The American flag in Unicode reference can be found here:
> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
>
> You can see it broken down a bit better here for "Regional U":
> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
>
> and "Regional S":
> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
>
> What's happening is that some component in Cocoon has decided to
> generate HTML entities instead of just emitting the character.
> That's okay IMO. But what it does doesn't make sense for a UTF-8
> output encoding.
>
> The first two entities "&#55356;&#56826;" are the decimal numbers
> that represent the UTF-16 surrogate pair for that "Regional
> Indicator Symbol Letter U" and they are correct... for UTF-16. If I
> change the output encoding from UTF-8 to UTF-16, then the browser
> will render these correctly. Using UTF-8, they show as four of those
> ugly [?] characters on the screen.
>
> I had originally just decided to throw up my hands and use UTF-16
> encoding even though it's dumb. But it seems that MSIE cannot be
> convinced to use UTF-16 no matter what, and I must continue to
> support MSIE.
> :(
>
> So it's back to UTF-8 for me.
>
> How can I get Cocoon to output that character (or "those
> characters") correctly?
>
> It needs to be one of the following:
>
> &#127482;&#127480; (HTML decimal entities)
> &#x1f1fa;&#x1f1f8; (HTML hex entities)
> f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)
>
> Does anyone know how/where this conversion is being performed in
> Cocoon? Probably in an XHTML serializer (I'm using
> org.apache.cocoon.serialization.XMLSerializer). I'm using
> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
> for that serializer (the one named "xhtml"). I believe I've made
> very few changes from the default, if any.
>
> I haven't yet figured out how to get from what Java sees (\uE50C
> for the "S" for example) to &#x1f1f8;, but knowing where the code
> is that is making that decision would be very helpful.
>
> Any ideas?
>
> -chris

I created a UTF-8 text file containing only the flag, read it in
using Java, and printed all of the code points. There should be 2
"characters" in the file. Each one is 4 bytes in UTF-8, so I assumed
I'd end up with 2 'char' primitives in the array, but I ended up
with more. Here's the loop and the output:

    try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
        char[] chars = new char[10];
        int count = in.read(chars);
        for(int i=0; i<count; ++i)
            System.out.println("Code point at " + i + " is "
                + Integer.toHexString(Character.codePointAt(chars, i)));
    } catch (Exception e) {
        e.printStackTrace();
    }

== output ==
Code point at 0 is 1f1fa
Code point at 1 is ddfa
Code point at 2 is 1f1f8
Code point at 3 is ddf8
Code point at 4 is a

So Java thinks there are 4 things there, not 2 (index 4 is just the
trailing newline). That could be a part of the confusion. The code
points shown for indexes 0 and 2 are the "correct" ones. Those at
indexes 1 and 3 are the second halves of UTF-16 surrogate pairs and
should actually be *skipped*.
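
To make the skipping concrete, here is a minimal sketch of walking a
String one code point at a time rather than one 'char' at a time. The
class and method names (CodePointWalk, codePoints) are made up for
illustration, and the flag is written as UTF-16 escapes so the
example does not depend on the source file's encoding.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    /** Collects the code points of s, visiting each surrogate pair once. */
    public static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            result.add(cp);
            // Advance by 2 chars for a supplementary code point, 1 otherwise.
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        // The flag as Java sees it: two surrogate pairs, four chars.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println("chars: " + flag.length());
        for (int cp : codePoints(flag)) {
            // Show the UTF-16 code units that make up this code point.
            StringBuilder units = new StringBuilder();
            for (char u : Character.toChars(cp)) {
                units.append(String.format(" %04X", (int) u));
            }
            System.out.printf("U+%X =%s%n", cp, units);
        }
    }
}
```

Run on the flag, this should report four chars but only two code
points, U+1F1FA and U+1F1F8, each made of a high surrogate (D83C)
followed by a low surrogate (DDFA or DDF8). Those low surrogates are
the values that showed up at indexes 1 and 3 above.
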

So, to render this string as HTML numeric entities, we'd do
something like this:

    String str = ...; // this is the input
    for(int i=0; i<str.length(); ++i) {
        int cp = str.codePointAt(i);
        out.print("&#x");
        out.print(Integer.toHexString(cp));
        out.print(';');
        // Skip any trailing "characters" that are actually a part of
        // this one
        if(1 < Character.charCount(cp))
            i += Character.charCount(cp) - 1;
    }

The above code is completely encoding-agnostic, because it
describes the Unicode code point and not some set of bytes in a
particular flavor of UTF-x.

-chris