Greg,

On 3/31/22 12:13, Christopher Schultz wrote:
On 3/29/22 13:37, gelo1234 wrote:
Hello Chris,

I think you will not get any icon-type character on output without using proper font rendering - like Emoji support? Emoji might not be supported by default in Cocoon.

This isn't a font-rendering issue; it's just ... wrong. Either the raw character should be output, or the proper set of HTML entities should be output. Neither is happening. It's just mojibake somewhere in the pipeline.

So this might be the reason why you get HTML entities instead of Emoji-icons. Also notice: https://www.mail-archive.com/dev@cocoon.apache.org/msg61629.html

I read that, and was hopeful that 2.1.13 would resolve this issue, but it hasn't.

Hmm... strangely, the X-Cocoon-Version header still says 2.1.11. Perhaps I didn't upgrade properly...

Yeah, I had Cocoon 2.1.11 as a compile-time dependency which was dropping cocoon-2.1.11.jar into the web application along with all the other artifacts from the 2.1.13 build. Whoops.

I got that all fixed-up, but the behavior is still the same. I was pretty hopeful that was the only thing missing.

-chris

On Tue, 29 Mar 2022 at 18:36, Christopher Schultz <ch...@christopherschultz.net> wrote:

    Cédric,

    On 3/29/22 12:06, Cédric Damioli wrote:
     > Could you provide more details ?
     > How is your XML processed before outputting the wrong UTF-8 sequence?

    It's somewhat straightforward:

    <map:match pattern="/foo">
        <map:generate src="https://source/" />

        <map:transform src="stuff-to-cincludes.xsl" />

        <map:transform src="other-stuff-to-cincludes.xsl" />

        <map:transform type="cinclude" />

        <map:transform src="my-big-transformer-to-xhtml.xsl" />

        <map:transform type="cinclude" /><!-- Yes, another one -->

        <map:transform type="i18n" />

        <map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->

        <map:serialize type="xhtml" />
    </map:match>

    The xhtml serializer is the default, with encoding set to UTF-8. The
    HTTP response has "Content-Type: text/html" and the document itself
    contains:

    <?xml version="1.0" encoding="UTF-8"?>

    and

    <meta content="text/html; charset=utf-8" http-equiv="content-type" />

    So I think everything is configured correctly; it's just that those
    characters are getting mangled by something. I can try to cut-out some
    of those steps and see where it's happening.

    I seem to remember being able to give each pipeline step a "marker" or
    something where you can say "stop after step 3" or whatever instead of
    having to chop-out configuration. Can you remind me of what that is
    again?

    Thanks,
    -chris

     > On 29/03/2022 at 17:48, Christopher Schultz wrote:
     >> All,
     >>
     >> I'm still struggling with this. I have upgraded to 2.1.13 which
     >> includes the fix for
     >> https://issues.apache.org/jira/browse/COCOON-2352
     >> but I'm still getting that American flag converted into those 4 HTML
     >> entities:
     >>
     >> &#55356;&#56826;&#55356;&#56824;
     >>
     >> I would expect there to be a single (multibyte) character in the
     >> output with no HTML entities.
     >>
     >> I've double-checked, and the source XML contains the flag as a single
     >> multi-byte character, served as UTF-8.
     >>
     >> Any ideas for how to get this working? I'm sure I could put together a
     >> trivial test-case.
     >>
     >> Thanks,
     >> -chris
     >>
     >> On 10/30/18 12:18, Christopher Schultz wrote:
     >>> All,
     >>>
     >>> Some additional information at the end.
     >>>
     >>> On 10/30/18 11:58, Christopher Schultz wrote:
     >>>> All,
     >>>
     >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
     >>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
     >>>> with a few transforms in it, ultimately serializing to XHTML.
     >>>
     >>>> If I have a Unicode character in the XML which is outside of the
     >>>> BMP, such as this one: 🇺🇸  (that's an American flag, in case your
     >>>> mail reader doesn't render it correctly), then I end up getting a
     >>>> series of bytes coming from Cocoon after the transform that look
     >>>> like UTF-16.
     >>>
     >>>> Here's what's in the XML:
     >>>
     >>>> <first-name>Test🇺🇸</first-name>
     >>>
     >>>> Just like that. The bytes in the message for the flag character
     >>>> are:
     >>>
     >>>> f0  9f  87  ba  f0  9f  87  b8
     >>>
     >>>> When rendering that into XHTML, I'm getting this in the output:
     >>>
     >>>> Test&#55356;&#56826;&#55356;&#56824;
     >>>
     >>>> The American flag in Unicode reference can be found here:
     >>>>
     >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
     >>>
     >>>
     >>>>   You can see it broken down a bit better here for "Regional U":
     >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
     >>>
     >>>> and "Regional S":
     >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
     >>>
     >>>> What's happening is that some component in Cocoon has decided to
     >>>> generate HTML entities instead of just emitting the character.
     >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
     >>>> output encoding.
     >>>
     >>>> The first two entities "&#55356;&#56826;" are the decimal numbers
     >>>> that represent the UTF-16 character for that "Regional Indicator
     >>>> Symbol Letter U" and they are correct... for UTF-16. If I change
     >>>> the output encoding from UTF-8 to UTF-16, then the browser will
     >>>> render these correctly. Using UTF-8, they show as four of those
     >>>> ugly [?] characters on the screen.
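
To illustrate the point above: the four decimal values are UTF-16 code units, and each high/low pair recombines into a single code point. The following is a minimal Java sketch, not Cocoon code. The class and method names are hypothetical; only standard java.lang.Character calls are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Cocoon): recombines UTF-16 code units,
// such as the values seen in the decimal entities, into Unicode code points.
class SurrogateEntities {
    static List<Integer> toCodePoints(int[] units) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < units.length; i++) {
            char hi = (char) units[i];
            // A high surrogate followed by a low surrogate forms one code point.
            if (Character.isHighSurrogate(hi) && i + 1 < units.length
                    && Character.isLowSurrogate((char) units[i + 1])) {
                result.add(Character.toCodePoint(hi, (char) units[++i]));
            } else {
                // Anything else is passed through unchanged.
                result.add(units[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The four values from the entities in the XHTML output.
        int[] units = {55356, 56826, 55356, 56824};
        for (int cp : toCodePoints(units)) {
            System.out.println("&#" + cp + "; = U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Running main prints "&#127482; = U+1F1FA" and "&#127480; = U+1F1F8", i.e. the two Regional Indicator code points rather than four separate values.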
     >>>
     >>>> I had originally just decided to throw up my hands and use UTF-16
     >>>> encoding even though it's dumb. But it seems that MSIE cannot be
     >>>> convinced to use UTF-16 no matter what, and I must continue to
     >>>> support MSIE. :(
     >>>
     >>>> So it's back to UTF-8 for me.
     >>>
     >>>> How can I get Cocoon to output that character (or "those
     >>>> characters") correctly?
     >>>
     >>>> It needs to be one of the following:
     >>>
     >>>> &#127482;&#127480;             (HTML decimal entities)
     >>>> &#x1f1fa;&#x1f1f8;             (HTML hex entities)
     >>>> f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)
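
The three acceptable forms listed above can be derived from the same Java string. This is a rough sketch under the assumption that the flag is held as a normal Java String (two surrogate pairs). The class and method names are made up for illustration and say nothing about how Cocoon's serializer is structured.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: produces the three forms from one String.
class FlagForms {
    // Decimal numeric entities, one per code point (not per char).
    static String decimalEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    // Hexadecimal numeric entities, one per code point.
    static String hexEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#x").append(Integer.toHexString(cp)).append(';'));
        return sb.toString();
    }

    // The raw UTF-8 bytes, rendered as space-separated hex for inspection.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F1FA U+1F1F8 written as surrogate pairs so the source stays ASCII.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(decimalEntities(flag));
        System.out.println(hexEntities(flag));
        System.out.println(utf8Hex(flag));
    }
}
```

The key design point is iterating with codePoints() instead of over individual char values, so a surrogate pair is never split into two entities.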
     >>>
     >>>> Does anyone know how/where this conversion is being performed in
     >>>> Cocoon? Probably in an XHTML serializer (I'm using
     >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
     >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
     >>>> for that serializer (the one named "xhtml"). I believe I've made
     >>>> very few changes from the default, if any.
     >>>
     >>>> I haven't yet figured out how to get from what Java sees (\uE50C
     >>>> for the "S" for example) to &#x1f1f8;, but knowing where the code
     >>>> is that is making that decision would be very helpful.
     >>>
     >>>> Any ideas?
     >>>
     >>>> -chris
     >>>
     >>> I created a text file (UTF-8) containing only the flag and read it in
     >>> using Java and printed all of the code points. There should be 2
     >>> "characters" in the file. It's 4 bytes per UTF-8 character so I
     >>> assumed I'd end up with 2 'char' primitives in the file, but I ended
     >>> up with more.
     >>>
     >>> Here's the loop and the output:
     >>>
     >>>          try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
     >>>              char[] chars = new char[10];
     >>>
     >>>              int count = in.read(chars);
     >>>
     >>>              for(int i=0; i<count; ++i)
     >>>                  System.out.println("Code point at " + i + " is " + Integer.toHexString(Character.codePointAt(chars, i)));
     >>>
     >>>          } catch (Exception e) {
     >>>              e.printStackTrace();
     >>>          }
     >>>
     >>> == output ==
     >>>
     >>> Code point at 0 is 1f1fa
     >>> Code point at 1 is ddfa
     >>> Code point at 2 is 1f1f8
     >>> Code point at 3 is ddf8
     >>> Code point at 4 is a
     >>>
     >>> So Java thinks there are 4 things there, not 2. That could be a part
     >>> of the confusion. The code points shown for indexes 0 and 2 are the
     >>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
     >>>
     >>> So, to render this string as an HTML numeric entity, we'd do something
     >>> like this:
     >>>
     >>> String str = // this is the input
     >>>
     >>> for(int i=0; i<str.length(); ++i) {
     >>>    int cp = str.codePointAt(i);
     >>>
     >>>    out.print("&#x");
     >>>    out.print(Integer.toHexString(cp));
     >>>    out.print(';');
     >>>
     >>>    // Skip any trailing "characters" that are actually a part of this one
     >>>    if(1 < Character.charCount(cp))
     >>>      i += Character.charCount(cp) - 1;
     >>> }
     >>>
     >>> Using the above code is completely encoding-agnostic, because it's
     >>> describing the Unicode code point and not some set of bytes in a
     >>> particular flavor of UTF-x.
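
The encoding-agnostic claim above can be checked with a small experiment: decode the same flag from two different byte encodings and compare the resulting code points. This is only a sketch with hypothetical names, assuming the bytes are decoded with the charset they were written in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check: code points do not depend on the byte encoding used.
class EncodingAgnostic {
    // Decodes the bytes with the given charset and returns the code points.
    static int[] codePoints(byte[] bytes, Charset charset) {
        return new String(bytes, charset).codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+1F1FA U+1F1F8 written as surrogate pairs so the source stays ASCII.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";

        int[] fromUtf8 = codePoints(flag.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        int[] fromUtf16 = codePoints(flag.getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE);

        // Both arrays hold the same two code points, so entities built from
        // them are identical whatever the transport encoding was.
        System.out.println(Arrays.equals(fromUtf8, fromUtf16));
        for (int cp : fromUtf8) {
            System.out.println("&#x" + Integer.toHexString(cp) + ";");
        }
    }
}
```

The byte sequences differ (8 bytes in each case here, but with different values), while the code points, and therefore the numeric entities, stay the same.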
     >>>
     >>> -chris
     >>
     >>
    ---------------------------------------------------------------------
     >> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
    <mailto:users-unsubscr...@cocoon.apache.org>
     >> For additional commands, e-mail: users-h...@cocoon.apache.org
    <mailto:users-h...@cocoon.apache.org>
     >>
     >


