Re: Getting UTF-16 encoding on dynamic content regardless of output content type

Christopher Schultz Tue, 30 Oct 2018 09:18:45 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

All,


Some additional information at the end.

On 10/30/18 11:58, Christopher Schultz wrote:
> All,
> 
> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
> a servlet generating XML in UTF-8 encoding and I have a pipeline
> with a few transforms in it, ultimately serializing to XHTML.
> 
> If I have a Unicode character in the XML which is outside of the
> BMP, such as this one: 🇺🇸  (that's an American flag, in case your
> mail reader doesn't render it correctly), then I end up getting a
> series of bytes coming from Cocoon after the transform that look
> like UTF-16.
> 
> Here's what's in the XML:
> 
> <first-name>Test🇺🇸</first-name>
> 
> Just like that. The bytes in the message for the flag character
> are:
> 
> f0  9f  87  ba  f0  9f  87  b8
> 
> When rendering that into XHTML, I'm getting this in the output:
> 
> Test&#55356;&#56826;&#55356;&#56824;
> 
> The American flag in Unicode reference can be found here:
> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
>
> You can see it broken down a bit better here for "Regional U":
> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
> 
> and "Regional S": 
> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
> 
> What's happening is that some component in Cocoon has decided to 
> generate HTML entities instead of just emitting the character.
> That's okay IMO. But what it does doesn't make sense for a UTF-8
> output encoding.
> 
> The first two entities "&#55356;&#56826;" are the decimal values of
> the UTF-16 surrogate pair for "Regional Indicator Symbol Letter U",
> and they are correct... for UTF-16. If I change the output encoding
> from UTF-8 to UTF-16, then the browser will render these correctly.
> Using UTF-8, they show as four of those ugly [?] characters on the
> screen.
> 
> I had originally just decided to throw up my hands and use UTF-16 
> encoding even though it's dumb. But it seems that MSIE cannot be 
> convinced to use UTF-16 no matter what, and I must continue to
> support MSIE. :(
> 
> So it's back to UTF-8 for me.
> 
> How can I get Cocoon to output that character (or "those
> characters") correctly?
> 
> It needs to be one of the following:
> 
> &#127482;&#127480;             (HTML decimal entities)
> &#x1f1fa;&#x1f1f8;             (HTML hex entities)
> f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)
> 
> Does anyone know how/where this conversion is being performed in
> Cocoon? Probably in an XHTML serializer (I'm using
> org.apache.cocoon.serialization.XMLSerializer). I'm using
> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
> for that serializer (the one named "xhtml"). I believe I've made
> very few changes from the default, if any.
> 
> I haven't yet figured out how to get from what Java sees (\uE50C
> for the "S" for example) to &#x1f1f8;, but knowing where the code
> is that is making that decision would be very helpful.
> 
> Any ideas?
> 
> -chris

I created a text file (UTF-8) containing only the flag, read it in
using Java, and printed all of the code points. There should be 2
"characters" in the file. Each one takes 4 bytes in UTF-8, so I
assumed I'd end up with 2 'char' primitives after reading the file,
but I ended up with more.

Here's the loop and the output:

        try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
            char[] chars = new char[10];

            int count = in.read(chars);

            for(int i=0; i<count; ++i)
                System.out.println("Code point at " + i + " is "
                    + Integer.toHexString(Character.codePointAt(chars, i)));

        } catch (Exception e) {
            e.printStackTrace();
        }

== output ==

Code point at 0 is 1f1fa
Code point at 1 is ddfa
Code point at 2 is 1f1f8
Code point at 3 is ddf8
Code point at 4 is a

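So Java thinks there are 4 things there, not 2 (index 4 is just the
trailing newline). That could be a part of the confusion: a Java
'char' is a UTF-16 code unit, and each of these characters needs two
of them (a surrogate pair). The code points shown for indexes 0 and 2
are the "correct" ones. Those at indexes 1 and 3 are the second halves
of the pairs and should actually be *skipped*.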
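
To make the relationship between the 'char' values and the code points
concrete, here is a minimal sketch using only java.lang.Character and
java.lang.String. The class and method names (SurrogateDemo,
toUtf16Units, decodeFlag) are mine and purely illustrative; this is
not Cocoon code. It decodes the raw UTF-8 bytes quoted above and shows
that the surrogate values are the same numbers that turned up in the
bad entities.

```java
final class SurrogateDemo {
    // Split a code point into the UTF-16 code units Java stores as 'char's.
    static int[] toUtf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] result = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            result[i] = units[i];
        }
        return result;
    }

    // Decode the raw UTF-8 bytes for the flag, as quoted in the message.
    static String decodeFlag() {
        byte[] utf8 = {
            (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xba,
            (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xb8
        };
        return new String(utf8, java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Regional U" as a surrogate pair, in decimal.
        int[] units = toUtf16Units(0x1F1FA);
        System.out.println(units[0] + " " + units[1]); // 55356 56826

        // Joining the pair gives back the real code point.
        int cp = Character.toCodePoint((char) units[0], (char) units[1]);
        System.out.println(Integer.toHexString(cp)); // 1f1fa

        String flag = decodeFlag();
        System.out.println(flag.length());                          // 4
        System.out.println(flag.codePointCount(0, flag.length()));  // 2
    }
}
```

So String.length() counts 'char's (4), while codePointCount() counts
what I was calling "characters" (2).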

So, to render this string as an HTML numeric entity, we'd do something
like this:

String str = ...; // this is the input

for(int i=0; i<str.length(); ++i) {
  // Read the full code point starting at this index of the String
  int cp = str.codePointAt(i);

  out.print("&#x");
  out.print(Integer.toHexString(cp));
  out.print(';');

  // Skip any trailing "characters" that are actually a part of this one
  i += Character.charCount(cp) - 1;
}

The above code is completely encoding-agnostic, because it writes out
the Unicode code point itself and not some set of bytes (or code
units) in a particular flavor of UTF-x.
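
For anyone who wants to try it, here is the same idea as a
self-contained sketch. EntityEncoder and toHexEntities are names I
made up for illustration; this is not the Cocoon serializer code, and
it deliberately escapes every code point (ASCII included) to keep the
loop short.

```java
final class EntityEncoder {
    // Render each Unicode code point of the input as an HTML hex entity.
    static String toHexEntities(String str) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            out.append("&#x").append(Integer.toHexString(cp)).append(';');
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The flag, written as its two UTF-16 surrogate pairs.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(toHexEntities(flag)); // &#x1f1fa;&#x1f1f8;
    }
}
```

Advancing the index by Character.charCount(cp) inside the loop does
the same job as the "skip" step above, just without the "- 1".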

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlvYhDgACgkQHPApP6U8
pFjPZRAAs9jgubhuIVMs52AmAEPXqVSuG8Y18t7RP7W2F5XouZ69SXqihUKmYODM
tQnOlyGghfUnXAkQ3uVNLjbx+dSKGwpuQbkb8987Po6AgzweL9stmqzowdn4Zcam
ow7aZSp8gmxa31YHbb7pphGPnjzVqr84Mz9MCCCcSMg/1ZkvayarJTWhYkBgeWip
wdxbR2nP7wYNLkEy+v4hLvIcWYI8IeuA2nWb5qNvb6zFVYkPLZZGhdOm19J06cHR
Rvxb83g+8X80ngP6Uztbg0p4/qa7vfJXlM46iCEqOM/7+eE0gMwOGk7Akbt+2Utd
sSNUChUPzgeRZkzSAbOZcnDhGLXCWodEM75GL1nDJED1N+gWtJwRDb4kfLdY337R
ghiVB9yupjFZFhho2BArWl58hx8WrQ9Lawsrn/OFOTjea9A+3/k9QYYCpMObpwJ9
rhTA1bQV9rQbbPC2CG1iajAlb5Moe7tWF1AmhJsqFXKPjMGiIwBlOKRAgcaIZxbr
rJRI4SKDkbIlCTWKOqe4cT/HgDQ/O9mBynZ353EcmSrr4Oye8k91e8SRjUh3UdLh
XfRnMcEKEwJfIzv+JZgJQK8kwERM4mxLrf3tdhvo9IUwN44Z5QKjZjQHbkYQaIT/
m58tqqNmApzH3gyWeyd6F7HqeTO8wlaRMCipBVX6/SW1Qop2Qno=
=YXAW
-----END PGP SIGNATURE-----

