Re: Getting UTF-16 encoding on dynamic content regardless of output content type

gelo1234 Tue, 29 Mar 2022 10:41:29 -0700

Chris,

Have you also tried HTMLT or XHTMLT Serializers?
The default HTMLSerializer cannot handle some Unicode characters:
https://issues.apache.org/jira/browse/SLING-5973?attachmentOrder=asc


Greetings,
Greg


Tue, 29 Mar 2022 at 19:37 gelo1234 <gelo1...@gmail.com> wrote:

> Hello Chris,
>
> I don't think you will get any icon-type character in the output without
> proper font rendering, such as emoji support. Emoji might not be supported
> by default in Cocoon, which may be why you get HTML entities instead of
> emoji icons.
> Also notice:
> https://www.mail-archive.com/dev@cocoon.apache.org/msg61629.html
>
> Greetings,
> Greg
>
>
>
> Tue, 29 Mar 2022 at 18:36 Christopher Schultz <ch...@christopherschultz.net>
> wrote:
>
>> Cédric,
>>
>> On 3/29/22 12:06, Cédric Damioli wrote:
>> > Could you provide more details ?
>> > How is your XML processed before outputting the wrong UTF-8 sequence ?
>>
>> It's somewhat straightforward:
>>
>> <map:match pattern="/foo">
>>    <map:generate src="https://source/" />
>>
>>    <map:transform src="stuff-to-cincludes.xsl" />
>>
>>    <map:transform src="other-stuff-to-cincludes.xsl" />
>>
>>    <map:transform type="cinclude" />
>>
>>    <map:transform src="my-big-transformer-to-xhtml.xsl" />
>>
>>    <map:transform type="cinclude" /><!-- Yes, another one -->
>>
>>    <map:transform type="i18n" />
>>
>>    <map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->
>>
>>    <map:serialize type="xhtml" />
>> </map:match>
>>
>> The xhtml serializer is the default, with encoding set to UTF-8. The
>> HTTP response has "Content-Type: text/html" and the document itself
>> contains:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>>
>> and
>>
>> <meta content="text/html; charset=utf-8" http-equiv="content-type" />
>>
>> So I think everything is configured correctly; it's just that those
>> characters are getting mangled by something. I can try to cut-out some
>> of those steps and see where it's happening.
>>
>> I seem to remember being able to give each pipeline step a "marker" or
>> something where you can say "stop after step 3" or whatever instead of
>> having to chop-out configuration. Can you remind me what that is again?
>>
>> Thanks,
>> -chris
>>
>> > On 29/03/2022 at 17:48, Christopher Schultz wrote:
>> >> All,
>> >>
>> >> I'm still struggling with this. I have upgraded to 2.1.13 which
>> >> includes the fix for https://issues.apache.org/jira/browse/COCOON-2352
>> >> but I'm still getting that American flag converted into those 4 HTML
>> >> entities:
>> >>
>> >> &#55356;&#56826;&#55356;&#56824;
>> >>
>> >> I would expect there to be a single (multibyte) character in the
>> >> output with no HTML entities.
>> >>
>> >> I've double-checked, and the source XML contains the flag as a single
>> >> multi-byte character, served as UTF-8.
>> >>
>> >> Any ideas for how to get this working? I'm sure I could put together a
>> >> trivial test-case.
>> >>
>> >> Thanks,
>> >> -chris
>> >>
>> >> On 10/30/18 12:18, Christopher Schultz wrote:
>> >>> All,
>> >>>
>> >>> Some additional information at the end.
>> >>>
>> >>> On 10/30/18 11:58, Christopher Schultz wrote:
>> >>>> All,
>> >>>
>> >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
>> >>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
>> >>>> with a few transforms in it, ultimately serializing to XHTML.
>> >>>
>> >>>> If I have a Unicode character in the XML which is outside of the
>> >>>> BMP, such as this one: 🇺🇸  (that's an American flag, in case your
>> >>>> mail reader doesn't render it correctly), then I end up getting a
>> >>>> series of bytes coming from Cocoon after the transform that look
>> >>>> like UTF-16.
>> >>>
>> >>>> Here's what's in the XML:
>> >>>
>> >>>> <first-name>Test🇺🇸</first-name>
>> >>>
>> >>>> Just like that. The bytes in the message for the flag character
>> >>>> are:
>> >>>
>> >>>> f0  9f  87  ba  f0  9f  87  b8
>> >>>
>> >>>> When rendering that into XHTML, I'm getting this in the output:
>> >>>
>> >>>> Test&#55356;&#56826;&#55356;&#56824;
>> >>>
>> >>>> The American flag in Unicode reference can be found here:
>> >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
>> >>>
>> >>>>   You can see it broken down a bit better here for "Regional U":
>> >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
>> >>>
>> >>>> and "Regional S":
>> >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
>> >>>
>> >>>> What's happening is that some component in Cocoon has decided to
>> >>>> generate HTML entities instead of just emitting the character.
>> >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
>> >>>> output encoding.
>> >>>
>> >>>> The first two entities "&#55356;&#56826;" are the decimal values of
>> >>>> the two UTF-16 code units (a surrogate pair) that encode "Regional
>> >>>> Indicator Symbol Letter U", and they are correct... for UTF-16. If I
>> >>>> change the output encoding from UTF-8 to UTF-16, then the browser will
>> >>>> render these correctly. Using UTF-8, they show as four of those
>> >>>> ugly [?] characters on the screen.
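To illustrate where those four numbers come from, here is a minimal sketch that reproduces them. It assumes the faulty step escapes each Java `char` (UTF-16 code unit) on its own; that is a guess about the behaviour, not Cocoon's actual code, and the class and method names are hypothetical.

```java
final class NaiveEntityEscaper {
    // Escapes every non-ASCII UTF-16 code unit separately.
    // ASSUMPTION: this mimics the suspected bug; it is not taken from Cocoon.
    static String escapePerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);                               // plain ASCII passes through
            } else {
                sb.append("&#").append((int) c).append(';'); // one entity per code unit
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F1FA and U+1F1F8 each become a surrogate pair inside a Java String.
        String flag = new String(Character.toChars(0x1F1FA))
                    + new String(Character.toChars(0x1F1F8));
        System.out.println(escapePerChar("Test" + flag));
        // prints Test&#55356;&#56826;&#55356;&#56824;

        // Recombining a pair gives back the real code point.
        int cp = Character.toCodePoint((char) 55356, (char) 56826);
        System.out.println(Integer.toHexString(cp)); // prints 1f1fa
    }
}
```

The point of the sketch is that 55356 (0xD83C) and 56826 (0xDDFA) are surrogate values, which are not valid characters on their own in HTML character references, whatever the output encoding.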
>> >>>
>> >>>> I had originally just decided to throw up my hands and use UTF-16
>> >>>> encoding even though it's dumb. But it seems that MSIE cannot be
>> >>>> convinced to use UTF-16 no matter what, and I must continue to
>> >>>> support MSIE. :(
>> >>>
>> >>>> So it's back to UTF-8 for me.
>> >>>
>> >>>> How can I get Cocoon to output that character (or "those
>> >>>> characters") correctly?
>> >>>
>> >>>> It needs to be one of the following:
>> >>>
>> >>>> &#127482;&#127480; (HTML decimal entities)
>> >>>> &#x1f1fa;&#x1f1f8; (HTML hex entities)
>> >>>> f0  9f  87  ba  f0  9f  87  b8 (raw UTF-8 bytes)
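As a rough sketch, the three acceptable forms listed above can each be derived from the same Java string by working on code points rather than `char` values. The class and method names here are hypothetical and only illustrate the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class FlagForms {
    // The flag as Java stores it: two surrogate pairs (U+1F1FA, U+1F1F8).
    static final String FLAG = "\uD83C\uDDFA\uD83C\uDDF8";

    // One decimal entity per Unicode code point.
    static String decimalEntities(String s) {
        return s.codePoints()
                .mapToObj(cp -> "&#" + cp + ";")
                .collect(Collectors.joining());
    }

    // One hexadecimal entity per Unicode code point.
    static String hexEntities(String s) {
        return s.codePoints()
                .mapToObj(cp -> "&#x" + Integer.toHexString(cp) + ";")
                .collect(Collectors.joining());
    }

    // The raw UTF-8 bytes, shown as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decimalEntities(FLAG)); // &#127482;&#127480;
        System.out.println(hexEntities(FLAG));     // &#x1f1fa;&#x1f1f8;
        System.out.println(utf8Hex(FLAG));         // f0 9f 87 ba f0 9f 87 b8
    }
}
```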
>> >>>
>> >>>> Does anyone know how/where this conversion is being performed in
>> >>>> Cocoon? Probably in an XHTML serializer (I'm using
>> >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
>> >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
>> >>>> for that serializer (the one named "xhtml"). I believe I've made
>> >>>> very few changes from the default, if any.
>> >>>
>> >>>> I haven't yet figured out how to get from what Java sees (\uD83C\uDDF8
>> >>>> for the "S" for example) to &#x1f1f8;, but knowing where the code
>> >>>> is that is making that decision would be very helpful.
>> >>>
>> >>>> Any ideas?
>> >>>
>> >>>> -chris
>> >>>
>> >>> I created a text file (UTF-8) containing only the flag and read it in
>> >>> using Java and printed all of the code points. There should be 2
>> >>> "characters" in the file. It's 4 bytes per UTF-8 character so I
>> >>> assumed I'd end up with 2 'char' primitives in the file, but I ended
>> >>> up with more.
>> >>>
>> >>> Here's the loop and the output:
>> >>>
>> >>>          try(java.io.FileReader in = new java.io.FileReader("file.txt")) {
>> >>>              char[] chars = new char[10];
>> >>>
>> >>>              int count = in.read(chars);
>> >>>
>> >>>              for(int i=0; i<count; ++i)
>> >>>                  System.out.println("Code point at " + i + " is " +
>> >>>                      Integer.toHexString(Character.codePointAt(chars, i)));
>> >>>
>> >>>          } catch (Exception e) {
>> >>>              e.printStackTrace();
>> >>>          }
>> >>>
>> >>> == output ==
>> >>>
>> >>> Code point at 0 is 1f1fa
>> >>> Code point at 1 is ddfa
>> >>> Code point at 2 is 1f1f8
>> >>> Code point at 3 is ddf8
>> >>> Code point at 4 is a
>> >>>
>> >>> So Java thinks there are 4 things there, not 2. That could be a part
>> >>> of the confusion. The code points shown for indexes 0 and 2 are the
>> >>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
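The extra values at indexes 1 and 3 appear because `Character.codePointAt(chars, i)` called on a low surrogate with no high surrogate at that index simply returns the surrogate itself (0xDDFA, 0xDDF8). A minimal sketch of iterating by code point instead is shown below. It decodes the bytes in memory rather than reading `file.txt`, and names the charset explicitly, since `FileReader(String)` relies on the platform default charset (UTF-8 by default only from Java 18 on). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CodePointWalk {
    // Decodes UTF-8 bytes and returns one int per Unicode code point,
    // so each surrogate pair is reported once rather than twice.
    static int[] codePointsOf(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePoints().toArray();
    }

    public static void main(String[] args) {
        // The flag's UTF-8 bytes followed by a newline, as in the test file.
        byte[] utf8 = {
            (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xba,
            (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xb8,
            0x0a
        };
        int[] cps = codePointsOf(utf8);
        for (int i = 0; i < cps.length; i++) {
            System.out.println("Code point " + i + " is " + Integer.toHexString(cps[i]));
        }
        // Code point 0 is 1f1fa
        // Code point 1 is 1f1f8
        // Code point 2 is a
    }
}
```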
>> >>>
>> >>> So, to render this string as an HTML numeric entity, we'd do something
>> >>> like this:
>> >>>
>> >>> String str = ...; // this is the input
>> >>>
>> >>> for(int i=0; i<str.length(); ++i) {
>> >>>    // Read the full code point starting at index i of the string
>> >>>    int cp = str.codePointAt(i);
>> >>>
>> >>>    out.print("&#x");
>> >>>    out.print(Integer.toHexString(cp));
>> >>>    out.print(';');
>> >>>
>> >>>    // Skip any trailing "characters" that are actually a part of this one
>> >>>    if(1 < Character.charCount(cp))
>> >>>      i += Character.charCount(cp) - 1;
>> >>> }
>> >>>
>> >>> Using the above code is completely encoding-agnostic, because it's
>> >>> describing the Unicode code point and not some set of bytes in a
>> >>> particular flavor of UTF-x.
>> >>>
>> >>> -chris
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
>> >> For additional commands, e-mail: users-h...@cocoon.apache.org
>> >>
>> >
