Re: Getting UTF-16 encoding on dynamic content regardless of output content type

Christopher Schultz Thu, 31 Mar 2022 09:36:02 -0700

Greg,

On 3/31/22 12:13, Christopher Schultz wrote:

> On 3/29/22 13:37, gelo1234 wrote:
>> Hello Chris,
>> I think you will not get any icon-type character on output without using proper font rendering, like Emoji support? Emoji might not be supported by default in Cocoon.
>
> This isn't a font-rendering issue; it's just ... wrong. Either the raw character should be output, or the proper set of HTML entities should be output. Neither is happening. It's just mojibake somewhere in the pipeline.
>
>> So this might be the reason why you get HTML entities instead of Emoji icons. Also notice: https://www.mail-archive.com/[email protected]/msg61629.html
>
> I read that, and was hopeful that 2.1.13 would resolve this issue, but it hasn't.
>
> Hmm... strangely, the X-Cocoon-Version header still says 2.1.11. Perhaps I didn't upgrade properly...

Yeah, I had Cocoon 2.1.11 as a compile-time dependency which was dropping cocoon-2.1.11.jar into the web application along with all the other artifacts from the 2.1.13 build. Whoops.

I got that all fixed up, but the behavior is still the same. I was pretty hopeful that was the only thing missing.


-chris

On Tue, 29 Mar 2022 at 18:36, Christopher Schultz <[email protected]> wrote:


    Cédric,

    On 3/29/22 12:06, Cédric Damioli wrote:
     > Could you provide more details?
     > How is your XML processed before outputting the wrong UTF-8
     > sequence?

    It's somewhat straightforward:

    <map:match pattern="/foo">
        <map:generate src="https://source/" />

        <map:transform src="stuff-to-cincludes.xsl" />

        <map:transform src="other-stuff-to-cincludes.xsl" />

        <map:transform type="cinclude" />

        <map:transform src="my-big-transformer-to-xhtml.xsl" />

        <map:transform type="cinclude" /><!-- Yes, another one -->

        <map:transform type="i18n" />

        <map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->

        <map:serialize type="xhtml" />
    </map:match>

    The xhtml serializer is the default, with encoding set to UTF-8. The
    HTTP response has "Content-Type: text/html" and the document itself
    contains:

    <?xml version="1.0" encoding="UTF-8"?>

    and

    <meta content="text/html; charset=utf-8" http-equiv="content-type" />

    So I think everything is configured correctly; it's just that those
    characters are getting mangled by something. I can try to cut out some
    of those steps and see where it's happening.

    I seem to remember being able to give each pipeline step a "marker" or
    something where you can say "stop after step 3" or whatever instead of
    having to chop out configuration. Can you remind me of what that is
    again?

    Thanks,
    -chris

     > On 29/03/2022 at 17:48, Christopher Schultz wrote:
     >> All,
     >>
     >> I'm still struggling with this. I have upgraded to 2.1.13 which
     >> includes the fix for
     >> https://issues.apache.org/jira/browse/COCOON-2352
     >> but I'm still getting that American flag converted into those 4
     >> HTML entities:
     >>
     >> &#55356;&#56826;&#55356;&#56824;
     >>
     >> I would expect there to be a single (multibyte) character in the
     >> output with no HTML entities.
     >>
     >> I've double-checked, and the source XML contains the flag as a single
     >> multi-byte character, served as UTF-8.
     >>
     >> Any ideas for how to get this working? I'm sure I could put together a
     >> trivial test-case.
     >>
     >> Thanks,
     >> -chris
     >>
     >> On 10/30/18 12:18, Christopher Schultz wrote:
     >>> All,
     >>>
     >>> Some additional information at the end.
     >>>
     >>> On 10/30/18 11:58, Christopher Schultz wrote:
     >>>> All,
     >>>
     >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
     >>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
     >>>> with a few transforms in it, ultimately serializing to XHTML.
     >>>
     >>>> If I have a Unicode character in the XML which is outside of the
     >>>> BMP, such as this one: 🇺🇸  (that's an American flag, in case your
     >>>> mail reader doesn't render it correctly), then I end up getting a
     >>>> series of bytes coming from Cocoon after the transform that look
     >>>> like UTF-16.
     >>>
     >>>> Here's what's in the XML:
     >>>
     >>>> <first-name>Test🇺🇸</first-name>
     >>>
     >>>> Just like that. The bytes in the message for the flag character
     >>>> are:
     >>>
     >>>> f0  9f  87  ba  f0  9f  87  b8
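For anyone who wants to reproduce the byte sequence quoted above, here is a minimal sketch that encodes the two regional-indicator code points as UTF-8 and dumps them as hex. The class and method names (`Utf8Bytes`, `utf8Hex`) are made up for illustration and do not come from the thread or from Cocoon.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    /** Returns the UTF-8 encoding of s as space-separated lowercase hex bytes. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values format as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F1FA followed by U+1F1F8, written as UTF-16 escapes in the source.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(utf8Hex(flag)); // f0 9f 87 ba f0 9f 87 b8
    }
}
```

Each code point lies outside the BMP, so it takes four bytes in UTF-8, which accounts for the eight bytes seen on the wire.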
     >>>
     >>>> When rendering that into XHTML, I'm getting this in the output:
     >>>
     >>>> Test&#55356;&#56826;&#55356;&#56824;
     >>>
     >>>> The American flag in Unicode reference can be found here:
     >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
     >>>

     >>>> You can see it broken down a bit better here for "Regional U":
     >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
     >>>
     >>>> and "Regional S":
     >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
     >>>

     >>>> What's happening is that some component in Cocoon has decided to
     >>>> generate HTML entities instead of just emitting the character.
     >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
     >>>> output encoding.
     >>>

     >>>> The first two entities "&#55356;&#56826;" are the decimal values
     >>>> of the UTF-16 surrogate pair for that "Regional Indicator
     >>>> Symbol Letter U" and they are correct... for UTF-16. If I change
     >>>> the output encoding from UTF-8 to UTF-16, then the browser will
     >>>> render these correctly. Using UTF-8, they show as four of those
     >>>> ugly [?] characters on the screen.
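To see where those four numbers come from, the sketch below formats every UTF-16 code unit (Java `char`) of a code point as its own decimal entity. This is only an illustration of the suspected behavior, not the actual Cocoon serializer code, and the names `SurrogateEntities` and `perCharEntities` are hypothetical.

```java
public class SurrogateEntities {
    /**
     * Emits one decimal entity per UTF-16 code unit. For code points outside
     * the BMP this yields two entities (high and low surrogate), which is the
     * kind of output reported in this thread.
     */
    public static String perCharEntities(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append("&#").append((int) c).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Regional Indicator Symbol Letters U and S.
        String out = perCharEntities(0x1F1FA) + perCharEntities(0x1F1F8);
        System.out.println(out); // &#55356;&#56826;&#55356;&#56824;
    }
}
```

Surrogate values (0xD800 to 0xDFFF) are not valid numeric character references in HTML, which would explain the replacement glyphs in the browser.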
     >>>

     >>>> I had originally just decided to throw up my hands and use UTF-16
     >>>> encoding even though it's dumb. But it seems that MSIE cannot be
     >>>> convinced to use UTF-16 no matter what, and I must continue to
     >>>> support MSIE. :(
     >>>
     >>>> So it's back to UTF-8 for me.
     >>>
     >>>> How can I get Cocoon to output that character (or "those
     >>>> characters") correctly?
     >>>
     >>>> It needs to be one of the following:
     >>>
     >>>> &#127482;&#127480; (HTML decimal entities)
     >>>> &#x1f1fa;&#x1f1f8; (HTML hex entities)
     >>>> f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)
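As a small worked check of the decimal form listed above, this sketch recombines each high/low surrogate value from the broken output into a single code point and prints it as one decimal entity. The class and method names are invented for the example.

```java
public class RepairEntities {
    /** Combines a surrogate pair (given as ints) into one decimal entity. */
    public static String codePointEntity(int high, int low) {
        char h = (char) high;
        char l = (char) low;
        if (!Character.isSurrogatePair(h, l)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return "&#" + Character.toCodePoint(h, l) + ";";
    }

    public static void main(String[] args) {
        // The four values from the mangled output, taken two at a time.
        String fixed = codePointEntity(55356, 56826) + codePointEntity(55356, 56824);
        System.out.println(fixed); // &#127482;&#127480;
    }
}
```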
     >>>

     >>>> Does anyone know how/where this conversion is being performed in
     >>>> Cocoon? Probably in an XHTML serializer (I'm using
     >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
     >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
     >>>> for that serializer (the one named "xhtml"). I believe I've made
     >>>> very few changes from the default, if any.
     >>>

     >>>> I haven't yet figured out how to get from what Java sees (\uE50C
     >>>> for the "S" for example) to &#x1f1f8;, but knowing where the code
     >>>> is that is making that decision would be very helpful.
     >>>
     >>>> Any ideas?
     >>>
     >>>> -chris
     >>>
     >>> I created a text file (UTF-8) containing only the flag and read it in
     >>> using Java and printed all of the code points. There should be 2
     >>> "characters" in the file. It's 4 bytes per UTF-8 character so I
     >>> assumed I'd end up with 2 'char' primitives in the file, but I ended
     >>> up with more.
     >>>
     >>> Here's the loop and the output:
     >>>
     >>>          // FileReader decodes with the platform default charset (UTF-8 assumed here)
     >>>          try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
     >>>              char[] chars = new char[10];
     >>>
     >>>              int count = in.read(chars);
     >>>
     >>>              for(int i=0; i<count; ++i)
     >>>                  System.out.println("Code point at " + i + " is " +
     >>>                      Integer.toHexString(Character.codePointAt(chars, i)));
     >>>
     >>>          } catch (Exception e) {
     >>>              e.printStackTrace();
     >>>          }
     >>>
     >>> == output ==
     >>>
     >>> Code point at 0 is 1f1fa
     >>> Code point at 1 is ddfa
     >>> Code point at 2 is 1f1f8
     >>> Code point at 3 is ddf8
     >>> Code point at 4 is a
     >>>
     >>> So Java thinks there are 4 things there, not 2. That could be a part
     >>> of the confusion. The code points shown for indexes 0 and 2 are the
     >>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
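One way to avoid landing on the low surrogates at all is to iterate by code point rather than by `char`. The sketch below uses `String.codePoints()` (Java 8 and later) on an in-memory string standing in for the file contents; the class name `CodePointCount` and the sample string are assumptions made for the example.

```java
public class CodePointCount {
    /** Returns the code points of s, with each surrogate pair already combined. */
    public static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // The flag plus a trailing newline, matching the file read in the thread.
        String text = "\uD83C\uDDFA\uD83C\uDDF8\n";
        System.out.println("chars: " + text.length()); // chars: 5
        for (int cp : codePoints(text)) {
            System.out.println(Integer.toHexString(cp)); // 1f1fa, 1f1f8, a
        }
    }
}
```

The five `char` values collapse into three code points: the two regional indicators and the newline.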
     >>>
     >>> So, to render this string as an HTML numeric entity, we'd do something
     >>> like this:
     >>>
     >>> String str = // this is the input
     >>>
     >>> for(int i=0; i<str.length(); ++i) {
     >>>    int cp = Character.codePointAt(str, i);
     >>>
     >>>    out.print("&#x");
     >>>    out.print(Integer.toHexString(cp));
     >>>    out.println(';');
     >>>
     >>>    // Skip any trailing "characters" that are actually a part of this one
     >>>    if(1 < Character.charCount(cp))
     >>>      i += Character.charCount(cp) - 1;
     >>> }
     >>>

     >>> Using the above code is completely encoding-agnostic, because it's
     >>> describing the Unicode code point and not some set of bytes in a
     >>> particular flavor of UTF-x.
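A self-contained variant of the loop above might look like the following. It is a sketch under a few assumptions that are not in the original message: it returns a `String` instead of printing, it leaves plain ASCII untouched and escapes everything else, and it does not escape markup characters such as `<` or `&`. The names `HexEntityEscaper` and `toHexEntities` are hypothetical.

```java
public class HexEntityEscaper {
    /**
     * Replaces every non-ASCII code point with a hex numeric character
     * reference. Surrogate pairs are combined first, so each supplementary
     * character yields exactly one entity.
     */
    public static String toHexEntities(String str) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            if (cp < 0x80) {
                out.appendCodePoint(cp); // assumption: ASCII passes through as-is
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            // Advance by 1 or 2 chars depending on whether cp needed a surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexEntities("Test\uD83C\uDDFA\uD83C\uDDF8")); // Test&#x1f1fa;&#x1f1f8;
    }
}
```

Stepping the index by `Character.charCount(cp)` replaces the separate "skip the trailing char" adjustment, and emitting no newline between entities keeps the two regional indicators adjacent so they can still render as one flag.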
     >>>
     >>> -chris
     >>
     >>
     >> ---------------------------------------------------------------------
     >> To unsubscribe, e-mail: [email protected]
     >> For additional commands, e-mail: [email protected]
     >>
     >
