Hello Chris,

I don't think you will get any icon-type character in the output without proper font rendering, such as emoji support. Emoji might not be supported by default in Cocoon, which might be why you get HTML entities instead of emoji icons.

See also: https://www.mail-archive.com/dev@cocoon.apache.org/msg61629.html

Greetings,
Greg

On Tue, 29 Mar 2022 at 18:36, Christopher Schultz <ch...@christopherschultz.net> wrote:

> Cédric,
>
> On 3/29/22 12:06, Cédric Damioli wrote:
> > Could you provide more details?
> > How is your XML processed before outputting the wrong UTF-8 sequence?
>
> It's somewhat straightforward:
>
> <map:match pattern="/foo">
>   <map:generate src="https://source/" />
>
>   <map:transform src="stuff-to-cincludes.xsl" />
>
>   <map:transform src="other-stuff-to-cincludes.xsl" />
>
>   <map:transform type="cinclude" />
>
>   <map:transform src="my-big-transformer-to-xhtml.xsl" />
>
>   <map:transform type="cinclude" /><!-- Yes, another one -->
>
>   <map:transform type="i18n" />
>
>   <map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->
>
>   <map:serialize type="xhtml" />
> </map:match>
>
> The xhtml serializer is the default, with encoding set to UTF-8. The
> HTTP response has "Content-Type: text/html" and the document itself
> contains:
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> and
>
> <meta content="text/html; charset=utf-8" http-equiv="content-type" />
>
> So I think everything is configured correctly; it's just that those
> characters are getting mangled by something. I can try to cut out some
> of those steps and see where it's happening.
>
> I seem to remember being able to give each pipeline step a "marker" or
> something where you can say "stop after step 3" or whatever instead of
> having to chop out configuration. Can you remind me of what that is again?
>
> Thanks,
> -chris
>
> > On 29/03/2022 at 17:48, Christopher Schultz wrote:
> >> All,
> >>
> >> I'm still struggling with this. I have upgraded to 2.1.13 which
> >> includes the fix for https://issues.apache.org/jira/browse/COCOON-2352
> >> but I'm still getting that American flag converted into those 4 HTML
> >> entities:
> >>
> >> &#55356;&#56826;&#55356;&#56824;
> >>
> >> I would expect there to be a single (multibyte) character in the
> >> output with no HTML entities.
> >>
> >> I've double-checked, and the source XML contains the flag as a single
> >> multi-byte character, served as UTF-8.
> >>
> >> Any ideas for how to get this working? I'm sure I could put together a
> >> trivial test-case.
> >>
> >> Thanks,
> >> -chris
> >>
> >> On 10/30/18 12:18, Christopher Schultz wrote:
> >>> All,
> >>>
> >>> Some additional information at the end.
> >>>
> >>> On 10/30/18 11:58, Christopher Schultz wrote:
> >>>> All,
> >>>
> >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
> >>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
> >>>> with a few transforms in it, ultimately serializing to XHTML.
> >>>
> >>>> If I have a Unicode character in the XML which is outside of the
> >>>> BMP, such as this one: 🇺🇸 (that's an American flag, in case your
> >>>> mail reader doesn't render it correctly), then I end up getting a
> >>>> series of bytes coming from Cocoon after the transform that look
> >>>> like UTF-16.
> >>>
> >>>> Here's what's in the XML:
> >>>
> >>>> <first-name>Test🇺🇸</first-name>
> >>>
> >>>> Just like that. The bytes in the message for the flag character
> >>>> are:
> >>>
> >>>> f0 9f 87 ba f0 9f 87 b8
> >>>
> >>>> When rendering that into XHTML, I'm getting this in the output:
> >>>
> >>>> Test&#55356;&#56826;&#55356;&#56824;
> >>>
> >>>> The American flag in Unicode reference can be found here:
> >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
> >>>
> >>>> You can see it broken down a bit better here for "Regional U":
> >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
> >>>
> >>>> and "Regional S":
> >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
> >>>
> >>>> What's happening is that some component in Cocoon has decided to
> >>>> generate HTML entities instead of just emitting the character.
> >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
> >>>> output encoding.
> >>>
> >>>> The first two entities "&#55356;&#56826;" are the decimal numbers
> >>>> that represent the UTF-16 character for that "Regional Indicator
> >>>> Symbol Letter U" and they are correct... for UTF-16. If I change
> >>>> the output encoding from UTF-8 to UTF-16, then the browser will
> >>>> render these correctly. Using UTF-8, they show as four of those
> >>>> ugly [?] characters on the screen.
> >>>
> >>>> I had originally just decided to throw up my hands and use UTF-16
> >>>> encoding even though it's dumb. But it seems that MSIE cannot be
> >>>> convinced to use UTF-16 no matter what, and I must continue to
> >>>> support MSIE. :(
> >>>
> >>>> So it's back to UTF-8 for me.
> >>>
> >>>> How can I get Cocoon to output that character (or "those
> >>>> characters") correctly?
> >>>
> >>>> It needs to be one of the following:
> >>>
> >>>> &#127482;&#127480; (HTML decimal entities)
> >>>> &#x1f1fa;&#x1f1f8; (HTML hex entities)
> >>>> f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)
> >>>
> >>>> Does anyone know how/where this conversion is being performed in
> >>>> Cocoon? Probably in an XHTML serializer (I'm using
> >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
> >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
> >>>> for that serializer (the one named "xhtml"). I believe I've made
> >>>> very few changes from the default, if any.
> >>>
> >>>> I haven't yet figured out how to get from what Java sees (\uE50C
> >>>> for the "S" for example) to 🇸, but knowing where the code
> >>>> is that is making that decision would be very helpful.
> >>>
> >>>> Any ideas?
> >>>
> >>>> -chris
> >>>
> >>> I created a text file (UTF-8) containing only the flag and read it in
> >>> using Java and printed all of the code points. There should be 2
> >>> "characters" in the file. It's 4 bytes per UTF-8 character so I
> >>> assumed I'd end up with 2 'char' primitives in the file, but I ended
> >>> up with more.
> >>>
> >>> Here's the loop and the output:
> >>>
> >>> try(java.io.FileReader in = new java.io.FileReader("file.txt"))
> >>> {
> >>>     char[] chars = new char[10];
> >>>
> >>>     int count = in.read(chars);
> >>>
> >>>     for(int i=0; i<count; ++i)
> >>>         System.out.println("Code point at " + i + " is " +
> >>>             Integer.toHexString(Character.codePointAt(chars, i)));
> >>>
> >>> } catch (Exception e) {
> >>>     e.printStackTrace();
> >>> }
> >>>
> >>> == output ==
> >>>
> >>> Code point at 0 is 1f1fa
> >>> Code point at 1 is ddfa
> >>> Code point at 2 is 1f1f8
> >>> Code point at 3 is ddf8
> >>> Code point at 4 is a
> >>>
> >>> So Java thinks there are 4 things there, not 2. That could be a part
> >>> of the confusion. The code points shown for indexes 0 and 2 are the
> >>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
> >>>
> >>> So, to render this string as an HTML numeric entity, we'd do something
> >>> like this:
> >>>
> >>> String str = ...; // this is the input
> >>>
> >>> for(int i=0; i<str.length(); ++i) {
> >>>     int cp = str.codePointAt(i);
> >>>
> >>>     out.print("&#x");
> >>>     out.print(Integer.toHexString(cp));
> >>>     out.print(';');
> >>>
> >>>     // Skip any trailing "characters" that are actually a part of this one
> >>>     if(1 < Character.charCount(cp))
> >>>         i += Character.charCount(cp) - 1;
> >>> }
> >>>
> >>> Using the above code is completely encoding-agnostic, because it's
> >>> describing the Unicode code point and not some set of bytes in a
> >>> particular flavor of UTF-x.
> >>>
> >>> -chris
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
> >> For additional commands, e-mail: users-h...@cocoon.apache.org
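
For reference, the code-point loop quoted above can be written as a small self-contained Java program. This is only a sketch of the idea described in the thread, not Cocoon's serializer code: the class name NumericEntities and the method toHexEntities are hypothetical names chosen for illustration, and the method deliberately leaves ASCII untouched and does not escape &, < or >.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; it is not part of Cocoon.
class NumericEntities {

    // Replaces every non-ASCII code point with a hexadecimal numeric
    // character reference. ASCII is copied through unchanged, so this is
    // not a full XML escaper (it does not touch '&', '<' or '>').
    static String toHexEntities(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            // codePointAt combines a high/low surrogate pair into one code point
            int cp = input.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The eight UTF-8 bytes of the flag, as listed in the thread
        byte[] utf8 = {
            (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xba,
            (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xb8
        };
        String flag = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(flag.length());                         // 4 UTF-16 code units
        System.out.println(flag.codePointCount(0, flag.length())); // 2 code points
        System.out.println(toHexEntities("Test" + flag));          // Test&#x1f1fa;&#x1f1f8;
    }
}
```

Advancing the index by Character.charCount(cp) is what keeps the low surrogates (ddfa and ddf8 in the output quoted above) from being emitted as separate entities.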