On Fri, Sep 1, 2017 at 3:05 PM Rob Landley <[email protected]> wrote:
>
> On 09/01/2017 10:45 AM, enh wrote:
> > http://www.unicode.org/faq/utf_bom.html#utf8-4
>
> Horrible windows thing that tries to store unicode in shorts instead of
> longs for backwards compatibility with the assumption there couldn't
> possibly be more than 65535 letters in the world. Yeah, I figured that.
> What I can't figure out is why we'd exclude "you leaked utf-16 to the
> outside world" from a Linux translation of utf8, and that... still
> doesn't explain it? From the linked page:
>
> > CESU-8 is... designed and recommended for use only within products
> > requiring this UTF-16 binary collation equivalence. It is not intended
> > nor recommended for open interchange.
>
> Oh well.
>
> Meanwhile, http://www.unicode.org/faq/utf_bom.html#utf16-6 says 1114111
> which is 0x10ffff, so there's the source of that limit. (Looks like the
> standards bodies swore a blood oath not to break windows. I'm assuming
> the check they cashed also had a carefully measured number of zeroes.)
>
> Meanwhile, glibc, musl, and bionic all translate stuff differently, and
> none of them quite agree with each other:
>
> Musl is matching my output except they cap at 0x11ffff instead of
> 0x10ffff. (I poked Rich and he said that's a bug and he'll fix it.)
>
> Bionic is A) going up to 0x1fffff, B) refusing to translate efbfbe as
> fffe and efbfbf as ffff (it says both are invalid sequences).

finally got around to fixing this:
https://android-review.googlesource.com/c/platform/bionic/+/714149

> Glibc is also capping the output at 0x1fffff, but on top of that it says
> sequences like fe808080 are -2 not -1. (The one in ubuntu 14.04 is
> anyway, who knows what the current version's doing...)
>
> Rob
>
> P.S. I can't be the first person to test this stuff, can I?
>
> P.P.S. would creating/parsing an intentionally overlong coding for the
> d800-dfff "blood oath to windows" space be cheating?
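
For anyone wanting to compare libc behaviour against a reference, here is a minimal sketch of the strict rules discussed in this thread: cap at 0x10ffff, reject the d800-dfff surrogate range and overlong forms, but still accept efbfbe and efbfbf as fffe and ffff. The function name `decode_utf8_strict` and the tuple return shape are made up for illustration. The -1 (invalid) and -2 (incomplete) values only mimic the mbrtowc() convention; this is not the code from toybox, musl, glibc, or bionic.

```python
def decode_utf8_strict(data: bytes):
    """Decode one code point from the start of data (hypothetical helper).

    Returns (code_point, length) on success, (-1, 0) for an invalid
    sequence, and (-2, 0) for a truncated but so-far-valid prefix.
    """
    if not data:
        return (-2, 0)
    first = data[0]
    if first < 0x80:
        return (first, 1)

    # (lead mask, lead pattern, total length, smallest non-overlong value)
    for mask, pattern, length, minimum in (
        (0xE0, 0xC0, 2, 0x80),
        (0xF0, 0xE0, 3, 0x800),
        (0xF8, 0xF0, 4, 0x10000),
    ):
        if (first & mask) == pattern:
            break
    else:
        # Stray continuation byte, or a lead byte in 0xF8..0xFF (e.g. 0xFE).
        return (-1, 0)

    cp = first & ~mask & 0xFF  # payload bits of the lead byte
    for byte in data[1:length]:
        if (byte & 0xC0) != 0x80:
            return (-1, 0)
        cp = (cp << 6) | (byte & 0x3F)
    if len(data) < length:
        return (-2, 0)  # every byte seen so far was fine, just not enough

    if cp < minimum:            # overlong encoding
        return (-1, 0)
    if 0xD800 <= cp <= 0xDFFF:  # UTF-16 surrogate range
        return (-1, 0)
    if cp > 0x10FFFF:           # 1114111, the UTF-16 reachable maximum
        return (-1, 0)
    return (cp, length)
```

Under these assumptions fe808080 is reported as invalid (-1) rather than incomplete (-2), because 0xfe can never start a sequence, and f4908080 (0x110000) is rejected while f48fbfbf (0x10ffff) is accepted.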
