http://www.unicode.org/faq/utf_bom.html#utf8-4
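
For what it's worth, the checks being argued about in the quoted message below can be sketched in a few lines of C. This is only an illustration under my own assumptions: strict_utf8_decode() is a hypothetical name and signature, not toybox's utf8towc() and not what glibc, musl, or bionic actually do. It rejects surrogates (so ed a0 80 is not 0xd800), overlong forms, and anything past 0x10ffff (so f5 80 80 80 and f4 90 80 80 both fail), while ef bf be still decodes to 0xfffe.

```c
#include <stddef.h>

// Hypothetical helper, not toybox's utf8towc(): decode one UTF-8 sequence
// from s (at most len bytes) into *wc. Returns the number of bytes consumed,
// -1 for an invalid sequence, or -2 if the input ends mid-sequence.
int strict_utf8_decode(unsigned *wc, const unsigned char *s, size_t len)
{
  // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
  static const unsigned min[] = {0, 0x80, 0x800, 0x10000};
  unsigned result;
  int i, extra;

  if (!len) return -2;
  if (s[0] < 0x80) {
    *wc = s[0];
    return 1;
  }

  // The lead byte selects the length. 80-bf are stray continuation bytes,
  // c0/c1 can only start overlong forms, f5-ff would exceed 0x10ffff.
  if (s[0] >= 0xc2 && s[0] <= 0xdf) extra = 1;
  else if ((s[0] & 0xf0) == 0xe0) extra = 2;
  else if (s[0] >= 0xf0 && s[0] <= 0xf4) extra = 3;
  else return -1;

  // Keep the payload bits of the lead byte, then add 6 bits per
  // continuation byte (each must look like 10xxxxxx).
  result = s[0] & (0x3f >> extra);
  for (i = 1; i <= extra; i++) {
    if ((size_t)i >= len) return -2;
    if ((s[i] & 0xc0) != 0x80) return -1;
    result = (result << 6) | (s[i] & 0x3f);
  }

  if (result < min[extra]) return -1;                  // overlong encoding
  if (result >= 0xd800 && result <= 0xdfff) return -1; // UTF-16 surrogate
  if (result > 0x10ffff) return -1;                    // past the last code point

  *wc = result;
  return extra + 1;
}
```

Comparing something like this against mbrtowc() over the whole input range should show exactly where each libc is more (or less) permissive than the table on the wikipedia page.
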
On Fri, Sep 1, 2017 at 1:08 AM, Rob Landley <[email protected]> wrote:
> On 08/31/2017 04:01 PM, enh wrote:
>>>> didn't you get in to utf8 because of my wc -m patch? :-)
>>>
>>> Working on it. It's one of those "I'd like to do what I consider the
>>> _proper_ fix" things that's honestly been a bit of a luxury these days.
>>>
>>> I wrote a for loop to go from 0 to UINT_MAX, and I'm comparing the
>>> mbrtowc(&wc, str, 4, &mb) results to my contextless utf8towc(&wc, str,
>>> len) output, and I'm fixing every deviation between the two. I'm
>>> currently trying to figure out why 0xeda080 _isn't_ 0xd800. (glibc
>>> translates wc 0xd800 as f8a08a83 but it's less than ffff so
>>> https://en.wikipedia.org/wiki/UTF-8 says it should be 3 bytes and I'm
>>> CONFUSED...)
>>
>> U+d800 is a surrogate, so shouldn't be valid in utf8.
>
> Still dunno what a surrogate is but I read more of the wikipedia page
> and while utf8 is simple, unicode is insane.
>
> Now I'm up to f5 80 80 80 parsing 4 bytes to produce 0x140000 (according
> to glibc) but wikipedia[citation needed] says the last code point is
> 0x10ffff _and_ that 245 (f5) is never the first byte in a valid sequence.
>
> Meanwhile over on musl, f4 90 80 80 is parsing to 0x110000 and again,
> that's > 0x10ffff. And I tried the bionic ndk I have lying around
> (/opt/android/AndroidVersion.txt says 3.8.275480) and efbfbe is failing
> to be fffe.
>
> Rob
>
> P.S. Ongoing terrible test program attached. (When you run it on a
> little endian system you have to reverse the bytes in the output when it
> reports an error...)
>
> P.P.S. The unindented lines are debug lines, makes 'em easy to strip
> back out again.

-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
