Re: Getting UTF-16 encoding on dynamic content regardless of output content type

gelo1234 Tue, 29 Mar 2022 10:38:13 -0700

Hello Chris,

I don't think you will get any icon-type character in the output without
proper font rendering, such as Emoji support. Emoji might not be supported
by default in Cocoon, so this might be the reason why you get HTML entities
instead of Emoji icons.
Also note:
https://www.mail-archive.com/[email protected]/msg61629.html


Greetings,
Greg



Tue, 29 Mar 2022 at 18:36 Christopher Schultz <[email protected]>
wrote:

> Cédric,
>
> On 3/29/22 12:06, Cédric Damioli wrote:
> > Could you provide more details ?
> > How is your XML processed before outputting the wrong UTF-8 sequence ?
>
> It's somewhat straightforward:
>
> <map:match pattern="/foo">
>    <map:generate src="https://source/" />
>
>    <map:transform src="stuff-to-cincludes.xsl" />
>
>    <map:transform src="other-stuff-to-cincludes.xsl" />
>
>    <map:transform type="cinclude" />
>
>    <map:transform src="my-big-transformer-to-xhtml.xsl" />
>
>    <map:transform type="cinclude" /><!-- Yes, another one -->
>
>    <map:transform type="i18n" />
>
>    <map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->
>
>    <map:serialize type="xhtml" />
> </map:match>
>
> The xhtml serializer is the default, with encoding set to UTF-8. The
> HTTP response has "Content-Type: text/html" and the document itself
> contains:
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> and
>
> <meta content="text/html; charset=utf-8" http-equiv="content-type" />
>
> So I think everything is configured correctly; it's just that those
> characters are getting mangled by something. I can try to cut-out some
> of those steps and see where it's happening.
>
> I seem to remember being able to give each pipeline step a "marker" or
> something where you can say "stop after step 3" or whatever instead of
> having to chop-out configuration. Can you remind me what that is again?
>
> Thanks,
> -chris
>
> > On 29/03/2022 at 17:48, Christopher Schultz wrote:
> >> All,
> >>
> >> I'm still struggling with this. I have upgraded to 2.1.13 which
> >> includes the fix for https://issues.apache.org/jira/browse/COCOON-2352
> >> but I'm still getting that American flag converted into those 4 HTML
> >> entities:
> >>
> >> &#55356;&#56826;&#55356;&#56824;
> >>
> >> I would expect there to be a single (multibyte) character in the
> >> output with no HTML entities.
> >>
> >> I've double-checked, and the source XML contains the flag as a single
> >> multi-byte character, served as UTF-8.
> >>
> >> Any ideas for how to get this working? I'm sure I could put together a
> >> trivial test-case.
> >>
> >> Thanks,
> >> -chris
> >>
> >> On 10/30/18 12:18, Christopher Schultz wrote:
> >>> All,
> >>>
> >>> Some additional information at the end.
> >>>
> >>> On 10/30/18 11:58, Christopher Schultz wrote:
> >>>> All,
> >>>
> >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
> >>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
> >>>> with a few transforms in it, ultimately serializing to XHTML.
> >>>
> >>>> If I have a Unicode character in the XML which is outside of the
> >>>> BMP, such as this one: 🇺🇸  (that's an American flag, in case your
> >>>> mail reader doesn't render it correctly), then I end up getting a
> >>>> series of bytes coming from Cocoon after the transform that look
> >>>> like UTF-16.
> >>>
> >>>> Here's what's in the XML:
> >>>
> >>>> <first-name>Test🇺🇸</first-name>
> >>>
> >>>> Just like that. The bytes in the message for the flag character
> >>>> are:
> >>>
> >>>> f0  9f  87  ba  f0  9f  87  b8
> >>>
> >>>> When rendering that into XHTML, I'm getting this in the output:
> >>>
> >>>> Test&#55356;&#56826;&#55356;&#56824;
> >>>
> >>>> A Unicode reference for the American flag can be found here:
> >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
> >>>
> >>>>   You can see it broken down a bit better here for "Regional U":
> >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
> >>>
> >>>> and "Regional S":
> >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
> >>>
> >>>> What's happening is that some component in Cocoon has decided to
> >>>> generate HTML entities instead of just emitting the character.
> >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
> >>>> output encoding.
> >>>
> >>>> The first two entities "&#55356;&#56826;" are the decimal values
> >>>> of the two UTF-16 code units (a surrogate pair) for "Regional
> >>>> Indicator Symbol Letter U" and they are correct... for UTF-16. If I
> >>>> change the output encoding from UTF-8 to UTF-16, then the browser
> >>>> will render these correctly. Using UTF-8, they show as four of those
> >>>> ugly [?] characters on the screen.
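
For reference, a minimal Java sketch of the arithmetic behind the paragraph above. It assumes nothing about Cocoon; the class and method names are made up for illustration. It shows how each pair of decimal values from the output combines into a single code point, which is the number a numeric character reference would normally carry.

```java
class SurrogateDemo {
    /** Combines a UTF-16 high/low surrogate pair into one Unicode code point. */
    static int combine(int high, int low) {
        // Same arithmetic as Character.toCodePoint, written out for clarity.
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        // The four decimal values seen in the XHTML output, taken two at a time.
        int u = combine(55356, 56826);
        int s = combine(55356, 56824);
        System.out.println(Integer.toHexString(u) + " = " + u); // 1f1fa = 127482
        System.out.println(Integer.toHexString(s) + " = " + s); // 1f1f8 = 127480
    }
}
```

The JDK's own Character.toCodePoint(char, char) gives the same result; the manual version is only there to make the bit layout visible.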
> >>>
> >>>> I had originally just decided to throw up my hands and use UTF-16
> >>>> encoding even though it's dumb. But it seems that MSIE cannot be
> >>>> convinced to use UTF-16 no matter what, and I must continue to
> >>>> support MSIE. :(
> >>>
> >>>> So it's back to UTF-8 for me.
> >>>
> >>>> How can I get Cocoon to output that character (or "those
> >>>> characters") correctly?
> >>>
> >>>> It needs to be one of the following:
> >>>
> >>>> &#127482;&#127480;             (HTML decimal entities)
> >>>> &#x1f1fa;&#x1f1f8;             (HTML hex entities)
> >>>> f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)
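
As a rough illustration of the three acceptable forms listed above, here is a small Java sketch (Java 8 or later assumed, for String.codePoints()). FlagForms and its methods are hypothetical helpers, not anything from Cocoon or its serializers; they only show what each form looks like when derived from the same Java string.

```java
import java.nio.charset.StandardCharsets;

class FlagForms {
    /** One decimal numeric character reference per code point. */
    static String decimalEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    /** One hexadecimal numeric character reference per code point. */
    static String hexEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#x").append(Integer.toHexString(cp)).append(';'));
        return sb.toString();
    }

    /** The UTF-8 encoding of the string as space-separated hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The flag as Java stores it: two surrogate pairs, four char values.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(decimalEntities(flag)); // &#127482;&#127480;
        System.out.println(hexEntities(flag));     // &#x1f1fa;&#x1f1f8;
        System.out.println(utf8Hex(flag));         // f0 9f 87 ba f0 9f 87 b8
    }
}
```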
> >>>
> >>>> Does anyone know how/where this conversion is being performed in
> >>>> Cocoon? Probably in an XHTML serializer (I'm using
> >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
> >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
> >>>> for that serializer (the one named "xhtml"). I believe I've made
> >>>> very few changes from the default, if any.
> >>>
> >>>> I haven't yet figured out how to get from what Java sees (\uE50C
> >>>> for the "S" for example) to &#x1f1f8;, but knowing where the code
> >>>> is that is making that decision would be very helpful.
> >>>
> >>>> Any ideas?
> >>>
> >>>> -chris
> >>>
> >>> I created a text file (UTF-8) containing only the flag and read it in
> >>> using Java and printed all of the code points. There should be 2
> >>> "characters" in the file. It's 4 bytes per UTF-8 character so I
> >>> assumed I'd end up with 2 'char' primitives in the file, but I ended
> >>> up with more.
> >>>
> >>> Here's the loop and the output:
> >>>
> >>>          try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
> >>>              char[] chars = new char[10];
> >>>
> >>>              int count = in.read(chars);
> >>>
> >>>              for(int i=0; i<count; ++i)
> >>>                  System.out.println("Code point at " + i + " is " +
> >>>                      Integer.toHexString(Character.codePointAt(chars, i)));
> >>>
> >>>          } catch (Exception e) {
> >>>              e.printStackTrace();
> >>>          }
> >>>
> >>> == output ==
> >>>
> >>> Code point at 0 is 1f1fa
> >>> Code point at 1 is ddfa
> >>> Code point at 2 is 1f1f8
> >>> Code point at 3 is ddf8
> >>> Code point at 4 is a
> >>>
> >>> So Java thinks there are 4 things there, not 2. That could be a part
> >>> of the confusion. The code points shown for indexes 0 and 2 are the
> >>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
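
A short Java sketch of why four values show up for two code points, assuming the file decoded to the same four char values as the literal used here. CodePointCount and its members are illustrative names only. Asking for the code point at an index that sits on a low surrogate simply returns that surrogate, which is where ddfa and ddf8 come from; advancing by Character.charCount avoids landing there.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointCount {
    // The flag as two surrogate pairs (four UTF-16 code units).
    static final String FLAG = "\uD83C\uDDFA\uD83C\uDDF8";

    /** Hex code points found by stepping one whole code point at a time. */
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            out.add(Integer.toHexString(s.codePointAt(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(FLAG.length());                          // 4 code units
        System.out.println(FLAG.codePointCount(0, FLAG.length()));  // 2 code points
        // Index 1 is the low surrogate of the first pair, so it comes back as-is.
        System.out.println(Integer.toHexString(FLAG.codePointAt(1))); // ddfa
        System.out.println(codePoints(FLAG));                       // [1f1fa, 1f1f8]
    }
}
```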
> >>>
> >>> So, to render this string as an HTML numeric entity, we'd do something
> >>> like this:
> >>>
> >>> String str = // this is the input
> >>>
> >>> for(int i=0; i<str.length(); ++i) {
> >>>    int cp = str.codePointAt(i);
> >>>
> >>>    out.print("&#x");
> >>>    out.print(Integer.toHexString(cp));
> >>>    out.print(';');
> >>>
> >>>    // Skip any trailing "characters" that are actually a part of this one
> >>>    if(1 < Character.charCount(cp))
> >>>      i += Character.charCount(cp) - 1;
> >>> }
> >>>
> >>> Using the above code is completely encoding-agnostic, because it's
> >>> describing the Unicode code point and not some set of bytes in a
> >>> particular flavor of UTF-x.
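
One way to check the "encoding-agnostic" point, sketched in Java under the assumption that the output encoding is ASCII-compatible (UTF-8, ISO-8859-1, US-ASCII). EntityBytes is a made-up name. Because a numeric entity consists only of ASCII characters, it serializes to the same bytes in each of those encodings, whereas the raw flag does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EntityBytes {
    /** True if the text encodes to identical bytes in UTF-8, US-ASCII and ISO-8859-1. */
    static boolean sameBytesEverywhere(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(utf8, s.getBytes(StandardCharsets.US_ASCII))
            && Arrays.equals(utf8, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String entities = "&#x1f1fa;&#x1f1f8;";
        String rawFlag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(sameBytesEverywhere(entities)); // true
        System.out.println(sameBytesEverywhere(rawFlag));  // false
    }
}
```

UTF-16 output would use different bytes for the same entity text, but it still decodes to the same characters, so the reference itself stays valid there too.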
> >>>
> >>> -chris
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
