Re: [2.1] Overzealous escaping of high Unicode code points
Hi Chris,

I suppose you cannot use two different encodings in one Serializer, so if you changed your Serializer config to UTF-16, you also have to use _external_ UTF-16 encoded CSS styles. Of course, you can define a different Serializer config for each pipeline.

By default, commons-lang/Cocoon work on Java's 2-byte chars (UTF-16 code units). A code point above U+FFFF, such as an emoji, takes 4 bytes either way: in UTF-8 it is a single sequence of four 8-bit code units, and in UTF-16 it is one surrogate pair of two 16-bit code units.

Greetings,
Greg

2017-06-20 22:14 GMT+02:00 Christopher Schultz:
> Interestingly enough, my emojis are now showing (though I don't totally
> understand why!) but it looks like my CSS isn't being loaded. That's
> a separate problem I'll have to figure out for myself.
> [...]
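The byte and code-unit counts discussed above can be checked with a minimal Java sketch. The class name `CodeUnitDemo` and the choice of U+1F600 as the sample code point are illustrative assumptions, not something taken from Cocoon or commons-lang.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses standard JDK calls.
class CodeUnitDemo {

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes when encoded as UTF-16 (big-endian, no byte-order mark). */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+1F600 is just an example of a code point above U+FFFF.
        String emoji = new String(Character.toChars(0x1F600));

        // A Java String stores UTF-16 code units, so one emoji is two chars.
        System.out.println("chars:        " + emoji.length());
        System.out.println("code points:  " + emoji.codePointCount(0, emoji.length()));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(emoji));
        System.out.println("UTF-16 bytes: " + utf16Bytes(emoji));
    }
}
```

The point of the sketch is that the 4-byte size is the same in both encodings; what differs is whether the code point is split into four 8-bit units or two 16-bit units.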
Re: [2.1] Overzealous escaping of high Unicode code points
Greg,

On 6/20/17 4:11 PM, Christopher Schultz wrote:
> I would have expected the Unicode code point to be converted into a
> single 4-byte UTF-8 sequence without any &-encoding at all. It looks
> like what I got was a pair of 2-byte characters with &-encoding.
>
> I'll try UTF-16 but my expectation is that it's going to get
> worse, not better.

Interestingly enough, my emojis are now showing (though I don't totally understand why!) but it looks like my CSS isn't being loaded. That's a separate problem I'll have to figure out for myself.

In my own application, switching from commons-lang to commons-lang3 HTML/XML escaping allowed me to use these 4-byte emojis and UTF-8 together. I'm surprised that Cocoon can't do the same thing. (I think it comes down to exactly how the character-escaper makes its decisions.)

Thanks,
-chris
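As a rough illustration of "how the character-escaper makes its decisions", here is a minimal Java sketch of an escaper that walks the string by code point instead of by char. It is not the commons-lang3 or Cocoon implementation; the class `CodePointEscaper`, the method `escape`, and the `asciiOnly` flag are all hypothetical names chosen for this example.

```java
// Hypothetical escaper, written only to illustrate code-point-aware escaping.
class CodePointEscaper {

    /**
     * Escapes markup characters. If asciiOnly is true, every non-ASCII code
     * point becomes ONE numeric character reference; otherwise it is passed
     * through unchanged so the output encoder (e.g. UTF-8) can write it.
     */
    static String escape(String s, boolean asciiOnly) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // codePointAt reads a complete surrogate pair as a single code point.
            int cp = s.codePointAt(i);
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '"') {
                out.append("&quot;");
            } else if (asciiOnly && cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // example code point
        String text = "a < " + emoji;
        System.out.println(escape(text, true));  // a &lt; &#128512;
        System.out.println(escape(text, false).equals("a &lt; " + emoji)); // true
    }
}
```

With `asciiOnly` set to false the emoji reaches the encoder intact, which is the "single 4-byte UTF-8 sequence without any &-encoding" outcome; with it set to true the emoji becomes one well-formed reference rather than two.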
Re: [2.1] Overzealous escaping of high Unicode code points
Greg,

On 6/8/17 2:17 PM, gelo1234 wrote:
> Chris,
>
> Even with C3 (Cocoon 3.0 beta), unless you specify the optional encoding
> in your Serializer config, you fall back to the default UTF-8:
> [...]

I would have expected the Unicode code point to be converted into a single 4-byte UTF-8 sequence without any &-encoding at all. It looks like what I got was a pair of 2-byte characters with &-encoding.

I'll try UTF-16 but my expectation is that it's going to get worse, not better.

Thanks,
-chris
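The "pair of 2-byte characters with &-encoding" symptom is what one would expect from an escaper that inspects one Java char at a time. The sketch below is an assumption about the kind of logic that produces it, not Cocoon's actual serializer code; `PerCharEscapeDemo` and `escapePerChar` are hypothetical names.

```java
// Hypothetical per-char escaper, written only to reproduce the symptom.
class PerCharEscapeDemo {

    /** Replaces every char above 0x7F with a decimal reference, one char at a time. */
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                // Each half of a surrogate pair is escaped on its own here.
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // example code point
        System.out.println(escapePerChar(emoji)); // &#55357;&#56832;
    }
}
```

The two references name the surrogates U+D83D and U+DE00, which are not valid characters in XML or HTML on their own, so a browser cannot reassemble them into the emoji.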
Re: [2.1] Overzealous escaping of high Unicode code points
Chris,

Even with C3 (Cocoon 3.0 beta), unless you specify the optional encoding in your Serializer config, you fall back to the default UTF-8:

org.apache.cocoon.optional.servlet.components.sax.serializers.util

    public class ConfigurationUtils {

        private ConfigurationUtils() {
        }

        // The Map's generic type parameters were lost in the archived copy.
        public static String getEncoding(Map configuration) {
            String encoding = (String) configuration.get("encoding");

            if (encoding == null || "".equals(encoding)) {
                encoding = "UTF-8";
            }

            return encoding;
        }
        ...

Greetings,
Greg

2017-06-08 20:11 GMT+02:00 gelo1234:
> By default XMLSerializer/HTMLSerializer uses UTF-8 encoding.
> [...]
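The fallback behaviour of the quoted method can be exercised in isolation. The sketch below restates the logic in a standalone class; `EncodingFallbackDemo` is a hypothetical name, and the `Map<String, Object>` parameter type is an assumption, since the original type parameters did not survive in the archive.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone restatement of the quoted logic, for illustration only.
class EncodingFallbackDemo {

    // Assumption: the configuration map is keyed by String.
    static String getEncoding(Map<String, Object> configuration) {
        String encoding = (String) configuration.get("encoding");

        // A missing or empty "encoding" entry falls back to UTF-8.
        if (encoding == null || "".equals(encoding)) {
            encoding = "UTF-8";
        }
        return encoding;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        System.out.println(getEncoding(config)); // UTF-8 (no entry)

        config.put("encoding", "");
        System.out.println(getEncoding(config)); // UTF-8 (empty entry)

        config.put("encoding", "UTF-16");
        System.out.println(getEncoding(config)); // UTF-16 (explicit entry)
    }
}
```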
Re: [2.1] Overzealous escaping of high Unicode code points
It depends on what type of Serializer you use and what kind of Serializer config you put into your sitemap.

By default XMLSerializer/HTMLSerializer uses UTF-8 encoding. So instead of 1 UTF-16 char you got 2 chars, UTF-8 encoded. Of course there might also be an issue with the emoji charset, but I would first try changing the encoding in the Serializer config (to UTF-16).

Greetings,
-Greg

2017-06-07 10:43 GMT+02:00 Flynn, Peter:
> I had a related problem with 3–4 CJK characters being converted to their
Re: [2.1] Overzealous escaping of high Unicode code points
I had a related problem with 3–4 CJK characters being converted to their