Some additional information at the end.

On 10/30/18 11:58, Christopher Schultz wrote:
> All,
> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
> a servlet generating XML in UTF-8 encoding and I have a pipeline
> with a few transforms in it, ultimately serializing to XHTML.
> If I have a Unicode character in the XML which is outside of the
> BMP, such as this one: 🇺🇸  (that's an American flag, in case your
> mail reader doesn't render it correctly), then I end up getting a
> series of bytes coming from Cocoon after the transform that look
> like UTF-16.
> Here's what's in the XML:
> <first-name>Test🇺🇸</first-name>
> Just like that. The bytes in the message for the flag character
> are:
> f0  9f  87  ba  f0  9f  87  b8
> When rendering that into XHTML, I'm getting this in the output:
> Test&#55356;&#56826;&#55356;&#56824;
> The American flag in Unicode reference can be found here: 
> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%
>  You can see it broken down a bit better here for "Regional U": 
> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
> and "Regional S": 
> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
> What's happening is that some component in Cocoon has decided to 
> generate HTML entities instead of just emitting the character.
> That's okay IMO. But what it does doesn't make sense for a UTF-8
> output encoding.
> The first two entities "&#55356;&#56826;" are the decimal values of
> the two UTF-16 code units (the surrogate pair) for that "Regional
> Indicator Symbol Letter U" and they are correct... for UTF-16. If I
> change the output encoding from UTF-8 to UTF-16, then the browser
> will render these correctly. Using UTF-8, they show as four of those
> ugly [?] characters on the screen.
> I had originally just decided to throw up my hands and use UTF-16 
> encoding even though it's dumb. But it seems that MSIE cannot be 
> convinced to use UTF-16 no matter what, and I must continue to
> support MSIE. :(
> So it's back to UTF-8 for me.
> How can I get Cocoon to output that character (or "those
> characters") correctly?
> It needs to be one of the following:
> &#127482;&#127480;             (HTML decimal entities)
> &#x1f1fa;&#x1f1f8;             (HTML hex entities)
> f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)
> Does anyone know how/where this conversion is being performed in
> Cocoon? Probably in an XHTML serializer (I'm using
> org.apache.cocoon.serialization.XMLSerializer). I'm using
> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
> for that serializer (the one named "xhtml"). I believe I've made
> very few changes from the default, if any.
> I haven't yet figured out how to get from what Java sees (\uE50C
> for the "S" for example) to &#x1f1f8;, but knowing where the code
> is that is making that decision would be very helpful.
> Any ideas?
> -chris
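
To show how the four entities in the output relate to the two I
actually want, here is a small sketch that recombines each UTF-16
surrogate pair into its Unicode code point. The class and method names
are made up for illustration; Character.toCodePoint is the standard
JDK call doing the real work. This is not what Cocoon does internally,
just the arithmetic behind the numbers.

```java
// Illustrative sketch; SurrogatePairDemo and combine() are hypothetical names.
class SurrogatePairDemo {
    // Combine a UTF-16 high/low surrogate pair into one Unicode code point.
    static int combine(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        // The four entities from the XHTML output, as decimal UTF-16 code units.
        int[] units = { 55356, 56826, 55356, 56824 };

        // Each consecutive (high, low) pair describes a single code point.
        for (int i = 0; i < units.length; i += 2) {
            int cp = combine(units[i], units[i + 1]);
            System.out.println("&#" + units[i] + ";&#" + units[i + 1] + ";"
                + " -> &#x" + Integer.toHexString(cp) + "; (&#" + cp + ";)");
        }
    }
}
```

The first pair comes out as &#x1f1fa; (&#127482;) and the second as
&#x1f1f8; (&#127480;), which are exactly the entities listed above as
acceptable output.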

I created a text file (UTF-8) containing only the flag and read it in
using Java and printed all of the code points. There should be 2
"characters" in the file. Each one is 4 bytes in UTF-8, so I assumed
I'd end up with 2 'char' primitives in the array, but I ended up with
more.

Here's the loop and the output:

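        // Note: FileReader decodes with the platform default charset
        // (assumed to be UTF-8 here)
        try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
            char[] chars = new char[10];

            int count = in.read(chars);

            // Limit codePointAt to the chars that were actually read
            for(int i=0; i<count; ++i)
                System.out.println("Code point at " + i + " is " +
                    Integer.toHexString(Character.codePointAt(chars, i, count)));

        } catch (Exception e) {
            e.printStackTrace();
        }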

== output ==

Code point at 0 is 1f1fa
Code point at 1 is ddfa
Code point at 2 is 1f1f8
Code point at 3 is ddf8
Code point at 4 is a

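So Java thinks there are 4 things there, not 2, because a Java 'char'
is a UTF-16 code unit and each of these symbols needs a surrogate pair
(index 4 is just the trailing newline). That could be a part of the
confusion. The code points shown for indexes 0 and 2 are the "correct"
ones. Those at indexes 1 and 3 should actually be *skipped*.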
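
A minimal sketch of walking a string by code point rather than by
'char', so the low surrogates are never visited on their own. The
class and method names are hypothetical; String.codePointAt,
String.codePointCount and Character.charCount are standard JDK
methods.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; CodePointWalk and codePoints() are hypothetical names.
class CodePointWalk {
    // Collect the code points of s, advancing by charCount so that the
    // second half of a surrogate pair is never read as its own "character".
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return result;
    }

    public static void main(String[] args) {
        // The flag as Java sees it: two surrogate pairs, four chars.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";

        System.out.println("chars:       " + flag.length());
        System.out.println("code points: " + flag.codePointCount(0, flag.length()));

        for (int cp : codePoints(flag))
            System.out.println(Integer.toHexString(cp));
    }
}
```

Stepping the index by Character.charCount(cp) is the same skip as
described above, just folded into the loop increment.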

So, to render this string as an HTML numeric entity, we'd do something
like this:

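String str = "Test\uD83C\uDDFA\uD83C\uDDF8"; // this is the input
StringBuilder out = new StringBuilder();

for(int i=0; i<str.length(); ++i) {
  int cp = str.codePointAt(i);

  // Emit the code point itself as a decimal numeric entity
  out.append("&#").append(cp).append(';');

  // Skip any trailing "characters" that are actually a part of this one
  if(1 < Character.charCount(cp))
    i += Character.charCount(cp) - 1;
}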

Using the above code is completely encoding-agnostic, because it's
describing the Unicode code point and not some set of bytes in a
particular flavor of UTF-x.
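
To check the encoding-agnostic claim, here is a rough sketch that
decodes the same flag once from UTF-8 bytes and once from UTF-16BE
bytes and then writes hex entities for anything outside ASCII. The
class and method names are hypothetical, this is not Cocoon's
serializer code, and it deliberately ignores markup characters such as
& and <.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; NumericEntities and encode() are hypothetical names.
class NumericEntities {
    // Replace every non-ASCII code point with a hex numeric character
    // reference; ASCII passes through unchanged.
    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same flag, once as UTF-8 bytes and once as UTF-16BE bytes.
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x87, (byte) 0xBA,
                        (byte) 0xF0, (byte) 0x9F, (byte) 0x87, (byte) 0xB8 };
        byte[] utf16 = { (byte) 0xD8, 0x3C, (byte) 0xDD, (byte) 0xFA,
                         (byte) 0xD8, 0x3C, (byte) 0xDD, (byte) 0xF8 };

        String fromUtf8 = "Test" + new String(utf8, StandardCharsets.UTF_8);
        String fromUtf16 = "Test" + new String(utf16, StandardCharsets.UTF_16BE);

        // Both lines should be identical: the entities name code points,
        // not the bytes the text happened to arrive in.
        System.out.println(encode(fromUtf8));
        System.out.println(encode(fromUtf16));
    }
}
```

Both inputs produce Test&#x1f1fa;&#x1f1f8;, so the result does not
depend on which UTF flavor the text was originally stored in.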

-chris