[Toybox] utf8 (was Re: musl intentionally broke chrt)

Rob Landley Fri, 01 Sep 2017 01:09:07 -0700

On 08/31/2017 04:01 PM, enh wrote:
>>> didn't you get in to utf8 because of my wc -m patch? :-)
>>
>> Working on it. It's one of those "I'd like to do what I consider the
>> _proper_ fix" things that's honestly been a bit of a luxury these days.
>>
>> I wrote a for loop to go from 0 to UINT_MAX, and I'm comparing the
>> mbrtowc(&wc, str, 4, &mb) results to my contextless utf8towc(&wc, str,
>> len) output, and I'm fixing every deviation between the two. I'm
>> currently trying to figure out why 0xeda080 _isn't_ 0xd800. (glibc
>> translates wc 0xd800 as f8a08a83 but it's less than ffff so
>> https://en.wikipedia.org/wiki/UTF-8 says it should be 3 bytes and I'm
>> CONFUSED...)
> 
> U+d800 is a surrogate, so shouldn't be valid in utf8.


Still dunno what a surrogate is but I read more of the wikipedia page
and while utf8 is simple, unicode is insane.
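
For reference, surrogates are a UTF-16 artifact: a code point above 0xffff doesn't fit in one 16-bit unit, so UTF-16 splits it across two units taken from the reserved range 0xd800-0xdfff. Nothing in that range is a character in its own right, which is why UTF-8 decoders are expected to reject it. A minimal sketch of the split follows; to_surrogates() is a made-up name for illustration, not something from toybox or libc.

```c
#include <stdint.h>

// Hypothetical helper: split a code point above 0xffff into a UTF-16
// surrogate pair. Returns 0 on success, -1 if cp is outside 0x10000-0x10ffff.
int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  if (cp < 0x10000 || cp > 0x10ffff) return -1;
  cp -= 0x10000;                 // leaves a 20 bit value
  *hi = 0xd800 | (cp >> 10);     // top 10 bits: high surrogate d800-dbff
  *lo = 0xdc00 | (cp & 0x3ff);   // bottom 10 bits: low surrogate dc00-dfff

  return 0;
}
```

Two 10-bit halves give 0x100000 values on top of the 0x10000 that fit in a single unit, which is also where the 0x10ffff ceiling comes from.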

Now I'm up to f5 80 80 80 parsing 4 bytes to produce 0x140000 (according
to glibc) but wikipedia[citation needed] says the last code point is
0x10ffff _and_ that 245 (f5) is never the first byte in a valid sequence.

Meanwhile over on musl, f4 90 80 80 is parsing to 0x110000 and again,
that's > 0x10ffff. And I tried the bionic ndk I have lying around
(/opt/android/AndroidVersion.txt says 3.8.275480) and efbfbe is failing
to be fffe.
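
The numbers above fall straight out of the bit layout: a 4-byte lead carries 3 payload bits and each continuation byte carries 6, so f5 80 80 80 is 5<<18 = 0x140000 and f4 90 80 80 is (4<<18)|(0x10<<12) = 0x110000. Both are well-formed bit patterns that land past 0x10ffff, whose own encoding is f4 8f bf bf. A sketch of that arithmetic with no validity checks at all (raw_decode() is a hypothetical name, and it assumes the caller supplies enough bytes):

```c
// Hypothetical helper: decode one UTF-8 sequence by bit layout alone, with
// no range, overlong, or surrogate checks. Returns the bytes consumed, or -1
// if the lead byte or a continuation byte doesn't fit the pattern.
int raw_decode(const unsigned char *s, unsigned *out)
{
  int i, len;
  unsigned result;

  if (s[0] < 0x80) len = 1, result = s[0];
  else if ((s[0] & 0xe0) == 0xc0) len = 2, result = s[0] & 0x1f;
  else if ((s[0] & 0xf0) == 0xe0) len = 3, result = s[0] & 0x0f;
  else if ((s[0] & 0xf8) == 0xf0) len = 4, result = s[0] & 0x07;
  else return -1;

  for (i = 1; i < len; i++) {
    if ((s[i] & 0xc0) != 0x80) return -1;
    result = (result << 6) | (s[i] & 0x3f);
  }
  *out = result;

  return len;
}
```

The same layout gives 0xd800 for ed a0 80 and 0xfffe for ef bf be, so any library that disagrees on those is applying a policy check on top of the raw decode, not decoding differently.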

Rob

P.S. Ongoing terrible test program attached. (When you run it on a
little endian system you have to reverse the bytes in the output when it
reports an error...)
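
The reversal happens because the loop counter doubles as the 4-byte input buffer: on a little endian machine the low byte of u is the first byte the parser sees, so the sequence ed a0 80 shows up as 80a0ed when u is printed with %x. A sketch of printing the bytes in parse order instead, assuming nothing beyond standard C (bytes_in_memory_order() is a hypothetical name, not part of the attached program):

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: format the bytes of u in memory order (the order a
// parser walking a pointer over &u sees them) as hex digits into out, which
// needs room for 2*sizeof(u)+1 chars.
void bytes_in_memory_order(unsigned u, char *out)
{
  unsigned char b[sizeof(u)];
  unsigned i;

  memcpy(b, &u, sizeof(u));
  for (i = 0; i < sizeof(u); i++) sprintf(out+2*i, "%02x", b[i]);
}
```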

P.P.S. The unindented lines are debug lines, makes 'em easy to strip
back out again.

#include <stdio.h>
#include <wchar.h>
#include <string.h>
#include <locale.h>


// Convert utf8 sequence to a unicode wide character
int utf8towc(wchar_t *wc, unsigned char *str, unsigned len, int debug)
{
  unsigned result, mask, first;
  unsigned char *s, c;

  // fast path ASCII
  if (len && *str<128) return !!(*wc = *str);

  result = first = *(s = str++);
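  for (mask = 6; (first&0xc0)==0xc0; mask += 5, first <<= 1) {
    // Don't read past the caller's buffer (lead bytes f8-ff ask for 5+ bytes).
    if ((unsigned)(str-s) >= len) return -1;
    c = *(str++);
    if ((c&0xc0) != 0x80) return -1;
    result = (result<<6)|(c&0x3f);
  }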
  result &= (1<<mask)-1;
  c = str-s;
if (debug) printf("mask=%d res=%x lim=%x\n", mask, result, c);
  if (mask==6 || mask>21) return -1;
if (debug) printf("result=%d gt=%d\n", result, 1<<(mask-4));
if (debug) printf("passed\n");
if (debug) printf("%d\n", (unsigned []){0x80,0x800,0x10000}[c-2]);

  // Reject overlong encodings: each sequence length has a minimum code point.
  if (result<(unsigned []){0x80,0x800,0x10000}[c-2]) return -1;
if (debug) printf("here\n");
  // Gratuitous arbitrary limitations of unicode.
  if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return -1;
  *wc = result;

  return str-s;
}

int main(int argc, char *argv[])
{
  mbstate_t mb;
  int len1, len2;
  unsigned u;
  char *str = (void *)&u;
  wchar_t wc1, wc2;

  setlocale(LC_ALL, "");
  memset(&mb, 0, sizeof(mb));

  for (u = 1; u; u++) {
if (!(u&0xffffff)) printf("%x\n", u);
    wc1 = wc2 = 0;
    len1 = mbrtowc(&wc1, str, 4, &mb);
    if (len1<1) memset(&mb, 0, sizeof(mb));
    len2 = utf8towc(&wc2, (unsigned char *)str, 4, 0);

    if (len1==len2 && wc1==wc2) continue;

utf8towc(&wc2, (unsigned char *)str, 4, 1);
    printf("%x %d %x %d %x\n", u, len1, (unsigned)wc1, len2, (unsigned)wc2);
return 1;
  }
}

