All,
Some additional information at the end.
On 10/30/18 11:58, Christopher Schultz wrote:
All,
I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
a servlet generating XML in UTF-8 encoding and I have a pipeline
with a few transforms in it, ultimately serializing to XHTML.
If I have a Unicode character in the XML which is outside of the
BMP, such as this one: 🇺🇸 (that's an American flag, in case your
mail reader doesn't render it correctly), then I end up getting a
series of bytes coming from Cocoon after the transform that look
like UTF-16.
Here's what's in the XML:
<first-name>Test🇺🇸</first-name>
Just like that. The bytes in the message for the flag character
are:
f0 9f 87 ba f0 9f 87 b8
When rendering that into XHTML, I'm getting this in the output:
Test&#55356;&#56826;&#55356;&#56824;
The American flag in Unicode reference can be found here:
https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
You can see it broken down a bit better here for "Regional U":
http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
and "Regional S":
http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
What's happening is that some component in Cocoon has decided to
generate HTML entities instead of just emitting the character.
That's okay IMO. But what it does doesn't make sense for a UTF-8
output encoding.
The first two entities "&#55356;&#56826;" are the decimal numbers
that represent the UTF-16 surrogate pair for that "Regional Indicator
Symbol Letter U" and they are correct... for UTF-16. If I change
the output encoding from UTF-8 to UTF-16, then the browser will
render these correctly. Using UTF-8, they show as four of those
ugly [?] characters on the screen.
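[Editor's aside, not part of the original message: the decimal numbers
in those entities can be checked directly in Java. This is a quick
sketch; the class name is invented.]

```java
import java.nio.charset.StandardCharsets;

// Sanity check: U+1F1FA ("Regional Indicator Symbol Letter U")
// becomes the surrogate pair D83C/DDFA in UTF-16 (decimal
// 55356/56826, the numbers seen in the entities) and the four
// bytes f0 9f 87 ba in UTF-8.
public class FlagCheck {
    public static void main(String[] args) {
        int cp = 0x1F1FA;
        char[] utf16 = Character.toChars(cp); // high and low surrogates
        System.out.printf("high=%04X (%d), low=%04X (%d)%n",
            (int) utf16[0], (int) utf16[0], (int) utf16[1], (int) utf16[1]);
        byte[] utf8 = new String(utf16).getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8)
            System.out.printf("%02x ", b & 0xff); // f0 9f 87 ba
        System.out.println();
    }
}
```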
I had originally just decided to throw up my hands and use UTF-16
encoding even though it's dumb. But it seems that MSIE cannot be
convinced to use UTF-16 no matter what, and I must continue to
support MSIE. :(
So it's back to UTF-8 for me.
How can I get Cocoon to output that character (or "those
characters") correctly?
It needs to be one of the following:
&#127482;&#127480; (HTML decimal entities)
&#x1f1fa;&#x1f1f8; (HTML hex entities)
f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)
Does anyone know how/where this conversion is being performed in
Cocoon? Probably in an XHTML serializer (I'm using
org.apache.cocoon.serialization.XMLSerializer). I'm using
mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
for that serializer (the one named "xhtml"). I believe I've made
very few changes from the default, if any.
I haven't yet figured out how to get from what Java sees (\uDDF8
for the "S", for example) to &#127480;, but knowing where the code
is that is making that decision would be very helpful.
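[Editor's aside, not part of the original message: the surrogate-to-
code-point step being asked about is exactly what java.lang.Character
provides. A minimal sketch, with an invented class name:]

```java
// Recombine the surrogate pair Java "sees" for the "S"
// (U+D83C, U+DDF8) into the single code point U+1F1F8.
public class SurrogateToCodePoint {
    public static void main(String[] args) {
        int cp = Character.toCodePoint('\uD83C', '\uDDF8');
        System.out.println(Integer.toHexString(cp)); // 1f1f8
        System.out.println(cp);                      // 127480, i.e. &#127480;
    }
}
```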
Any ideas?
-chris
I created a text file (UTF-8) containing only the flag and read
it in
using Java and printed all of the code points. There should be 2
"characters" in the file. It's 4 bytes per UTF-8 character so I
assumed I'd end up with 2 'char' primitives in the file, but I ended
up with more.
Here's the loop and the output:
try(java.io.Reader in = new java.io.InputStreamReader(
        new java.io.FileInputStream("file.txt"),
        java.nio.charset.StandardCharsets.UTF_8)) // decode as UTF-8 explicitly;
                                                  // plain FileReader uses the platform default charset
{
    char[] chars = new char[10];
    int count = in.read(chars);
    for(int i=0; i<count; ++i)
        System.out.println("Code point at " + i + " is "
            + Integer.toHexString(Character.codePointAt(chars, i)));
} catch (Exception e) {
    e.printStackTrace();
}
== output ==
Code point at 0 is 1f1fa
Code point at 1 is ddfa
Code point at 2 is 1f1f8
Code point at 3 is ddf8
Code point at 4 is a
So Java thinks there are 4 things there, not 2. That could be a part
of the confusion. The code points shown for indexes 0 and 2 are the
"correct" ones. Those at indexes 1 and 3 should actually be
*skipped*.
So, to render this string as an HTML numeric entity, we'd do
something
like this:
String str = ...; // this is the input
for(int i=0; i<str.length(); ++i) {
    int cp = str.codePointAt(i); // code point (not char) starting at index i
    out.print("&#x");
    out.print(Integer.toHexString(cp));
    out.println(';');
    // Skip any trailing "characters" (the low surrogate) that are
    // actually a part of this one
    if(1 < Character.charCount(cp))
        i += Character.charCount(cp) - 1;
}
The above code is completely encoding-agnostic, because it
describes the Unicode code point and not some set of bytes in a
particular flavor of UTF-x.
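[Editor's aside, not part of the original message: an equivalent
sketch using String.codePoints() (Java 8+), which iterates by code
point and so never yields a bare low surrogate — same idea, without
the manual index bookkeeping. Class name and input string are
invented for illustration.]

```java
public class EntityEncode {
    public static void main(String[] args) {
        // "Test" followed by the two regional-indicator code points
        String str = "Test\uD83C\uDDFA\uD83C\uDDF8";
        StringBuilder out = new StringBuilder();
        // codePoints() walks surrogate pairs as single values
        str.codePoints().forEach(cp ->
            out.append("&#x").append(Integer.toHexString(cp)).append(';'));
        System.out.println(out);
        // &#x54;&#x65;&#x73;&#x74;&#x1f1fa;&#x1f1f8;
    }
}
```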
-chris