Chris,

Have you also tried the HTMLT or XHTMLT serializers? The default HTMLSerializer cannot handle some Unicode characters:
https://issues.apache.org/jira/browse/SLING-5973?attachmentOrder=asc

Greetings,
Greg

On Tue, 29 Mar 2022 at 19:37, gelo1234 <gelo1...@gmail.com> wrote:

> Hello Chris,
>
> I think you will not get any icon-type character on output without using
> proper font rendering - like Emoji support? Emoji might not be supported by
> default in Cocoon.
> So this might be the reason why you get HTML entities instead of
> Emoji-icons.
> Also notice:
> https://www.mail-archive.com/dev@cocoon.apache.org/msg61629.html
>
> Greetings,
> Greg
>
>
>
> On Tue, 29 Mar 2022 at 18:36, Christopher Schultz <ch...@christopherschultz.net>
> wrote:
>
>> Cédric,
>>
>> On 3/29/22 12:06, Cédric Damioli wrote:
>> > Could you provide more details?
>> > How is your XML processed before outputting the wrong UTF-8 sequence?
>>
>> It's somewhat straightforward:
>>
>> <map:match pattern="/foo">
>>   <map:generate src="https://source/" />
>>
>>   <map:transform src="stuff-to-cincludes.xsl" />
>>
>>   <map:transform src="other-stuff-to-cincludes.xsl" />
>>
>>   <map:transform type="cinclude" />
>>
>>   <map:transform src="my-big-transformer-to-xhtml.xsl" />
>>
>>   <map:transform type="cinclude" /><!-- Yes, another one -->
>>
>>   <map:transform type="i18n" />
>>
>>   <map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->
>>
>>   <map:serialize type="xhtml" />
>> </map:match>
>>
>> The xhtml serializer is the default, with encoding set to UTF-8. The
>> HTTP response has "Content-Type: text/html" and the document itself
>> contains:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>>
>> and
>>
>> <meta content="text/html; charset=utf-8" http-equiv="content-type" />
>>
>> So I think everything is configured correctly; it's just that those
>> characters are getting mangled by something. I can try to cut out some
>> of those steps and see where it's happening.
>>
>> I seem to remember being able to give each pipeline step a "marker" or
>> something where you can say "stop after step 3" or whatever instead of
>> having to chop out configuration.
>> Can you remind me of what that is again?
>>
>> Thanks,
>> -chris
>>
>> > On 29/03/2022 at 17:48, Christopher Schultz wrote:
>> >> All,
>> >>
>> >> I'm still struggling with this. I have upgraded to 2.1.13 which
>> >> includes the fix for https://issues.apache.org/jira/browse/COCOON-2352
>> >> but I'm still getting that American flag converted into those 4 HTML
>> >> entities:
>> >>
>> >> &#55356;&#56826;&#55356;&#56824;
>> >>
>> >> I would expect there to be a single (multibyte) character in the
>> >> output with no HTML entities.
>> >>
>> >> I've double-checked, and the source XML contains the flag as a single
>> >> multi-byte character, served as UTF-8.
>> >>
>> >> Any ideas for how to get this working? I'm sure I could put together a
>> >> trivial test-case.
>> >>
>> >> Thanks,
>> >> -chris
>> >>
>> >> On 10/30/18 12:18, Christopher Schultz wrote:
>> >>> All,
>> >>>
>> >>> Some additional information at the end.
>> >>>
>> >>> On 10/30/18 11:58, Christopher Schultz wrote:
>> >>>> All,
>> >>>
>> >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
>> >>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
>> >>>> with a few transforms in it, ultimately serializing to XHTML.
>> >>>
>> >>>> If I have a Unicode character in the XML which is outside of the
>> >>>> BMP, such as this one: 🇺🇸 (that's an American flag, in case your
>> >>>> mail reader doesn't render it correctly), then I end up getting a
>> >>>> series of bytes coming from Cocoon after the transform that look
>> >>>> like UTF-16.
>> >>>
>> >>>> Here's what's in the XML:
>> >>>
>> >>>> <first-name>Test🇺🇸</first-name>
>> >>>
>> >>>> Just like that.
>> >>>> The bytes in the message for the flag character
>> >>>> are:
>> >>>
>> >>>> f0 9f 87 ba f0 9f 87 b8
>> >>>
>> >>>> When rendering that into XHTML, I'm getting this in the output:
>> >>>
>> >>>> Test&#55356;&#56826;&#55356;&#56824;
>> >>>
>> >>>> The American flag in Unicode reference can be found here:
>> >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
>> >>>
>> >>>> You can see it broken down a bit better here for "Regional U":
>> >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
>> >>>
>> >>>> and "Regional S":
>> >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
>> >>>
>> >>>> What's happening is that some component in Cocoon has decided to
>> >>>> generate HTML entities instead of just emitting the character.
>> >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
>> >>>> output encoding.
>> >>>
>> >>>> The first two entities "&#55356;&#56826;" are the decimal numbers
>> >>>> that represent the UTF-16 character for that "Regional Indicator
>> >>>> Symbol Letter U" and they are correct... for UTF-16. If I change
>> >>>> the output encoding from UTF-8 to UTF-16, then the browser will
>> >>>> render these correctly. Using UTF-8, they show as four of those
>> >>>> ugly [?] characters on the screen.
>> >>>
>> >>>> I had originally just decided to throw up my hands and use UTF-16
>> >>>> encoding even though it's dumb. But it seems that MSIE cannot be
>> >>>> convinced to use UTF-16 no matter what, and I must continue to
>> >>>> support MSIE. :(
>> >>>
>> >>>> So it's back to UTF-8 for me.
>> >>>
>> >>>> How can I get Cocoon to output that character (or "those
>> >>>> characters") correctly?
>> >>>
>> >>>> It needs to be one of the following:
>> >>>
>> >>>> &#127482;&#127480; (HTML decimal entities)
>> >>>> &#x1f1fa;&#x1f1f8; (HTML hex entities)
>> >>>> f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)
>> >>>
>> >>>> Does anyone know how/where this conversion is being performed in
>> >>>> Cocoon?
>> >>>> Probably in an XHTML serializer (I'm using
>> >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
>> >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
>> >>>> for that serializer (the one named "xhtml"). I believe I've made
>> >>>> very few changes from the default, if any.
>> >>>
>> >>>> I haven't yet figured out how to get from what Java sees (\uE50C
>> >>>> for the "S" for example) to 🇸, but knowing where the code
>> >>>> is that is making that decision would be very helpful.
>> >>>
>> >>>> Any ideas?
>> >>>
>> >>>> -chris
>> >>>
>> >>> I created a text file (UTF-8) containing only the flag and read it in
>> >>> using Java and printed all of the code points. There should be 2
>> >>> "characters" in the file. It's 4 bytes per UTF-8 character so I
>> >>> assumed I'd end up with 2 'char' primitives in the file, but I ended
>> >>> up with more.
>> >>>
>> >>> Here's the loop and the output:
>> >>>
>> >>> try(java.io.FileReader in = new java.io.FileReader("file.txt"))
>> >>> {
>> >>>     char[] chars = new char[10];
>> >>>
>> >>>     int count = in.read(chars);
>> >>>
>> >>>     for(int i=0; i<count; ++i)
>> >>>         System.out.println("Code point at " + i + " is " +
>> >>>             Integer.toHexString(Character.codePointAt(chars, i)));
>> >>>
>> >>> } catch (Exception e) {
>> >>>     e.printStackTrace();
>> >>> }
>> >>>
>> >>> == output ==
>> >>>
>> >>> Code point at 0 is 1f1fa
>> >>> Code point at 1 is ddfa
>> >>> Code point at 2 is 1f1f8
>> >>> Code point at 3 is ddf8
>> >>> Code point at 4 is a
>> >>>
>> >>> So Java thinks there are 4 things there, not 2. That could be a part
>> >>> of the confusion. The code points shown for indexes 0 and 2 are the
>> >>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
>> >>>
>> >>> So, to render this string as an HTML numeric entity, we'd do something
>> >>> like this:
>> >>>
>> >>> String str = // this is the input
>> >>>
>> >>> for(int i=0; i<str.length(); ++i) {
>> >>>     // Read the full code point starting at this index of the string
>> >>>     int cp = Character.codePointAt(str, i);
>> >>>
>> >>>     out.print("&#x");
>> >>>     out.print(Integer.toHexString(cp));
>> >>>     out.println(';');
>> >>>
>> >>>     // Skip any trailing "characters" that are actually a part of this one
>> >>>     if(1 < Character.charCount(cp))
>> >>>         i += Character.charCount(cp) - 1;
>> >>> }
>> >>>
>> >>> Using the above code is completely encoding-agnostic, because it's
>> >>> describing the Unicode code point and not some set of bytes in a
>> >>> particular flavor of UTF-x.
>> >>>
>> >>> -chris
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
>> >> For additional commands, e-mail: users-h...@cocoon.apache.org
>> >>
>> >
>>
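
For anyone who wants to reproduce the difference discussed in the quoted messages, here is a minimal Java sketch. It assumes nothing about Cocoon's actual serializer code: the class name NumericEntities and both method names are hypothetical. encodePerChar mimics an encoder that writes one decimal reference per UTF-16 char (which yields the surrogate values 0xd83c, 0xddfa and so on), while encode walks the string by code point, as the loop in the quoted message suggests. Markup characters such as & and < are deliberately not escaped, to keep the sketch short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon.
final class NumericEntities {

    // Writes every non-ASCII code point as one hexadecimal numeric reference.
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Writes one decimal reference per UTF-16 char, so a code point outside
    // the BMP comes out as two surrogate values.
    static String encodePerChar(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The eight UTF-8 bytes quoted in the thread: f0 9f 87 ba f0 9f 87 b8
        byte[] utf8 = {(byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xba,
                       (byte) 0xf0, (byte) 0x9f, (byte) 0x87, (byte) 0xb8};
        String name = "Test" + new String(utf8, StandardCharsets.UTF_8);

        System.out.println(encodePerChar(name)); // Test&#55356;&#56826;&#55356;&#56824;
        System.out.println(encode(name));        // Test&#x1f1fa;&#x1f1f8;
    }
}
```

The second form names the Unicode code points themselves, so it does not depend on the output encoding. The first form refers to surrogate values, which HTML parsers treat as errors when they appear as numeric character references.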