Heya, I've been looking more at the UTF-8 code in toybox. The first thing I spotted is that utf8towc() and wctoutf8() are both in lib.c instead of utf8.c. Why haven't they been moved yet? Is it easier to track code that way? Also, the documentation (header comment) should probably mention that they store characters as Unicode code points. I spent a while scratching my head at the fact that wide characters are 4-byte ints when the maximum length of a single UTF-8 character is 6 bytes.
Another thing I noticed is that if you pass a null byte into utf8towc(), it will assign it, but will not return the number of bytes read like it's supposed to: it returns 0 even though it read 1 byte. This is because we collapse the return value for ASCII characters down into 1 _or 0_ with !!(*a = *b), whereas "|| 1" would always collapse the value to 1. Suppose you have a function that turns a character string into an array of "wide characters". This is easily done with a while loop that keeps one index for the old character string and one for the new wide character string, so you should just be able to write "while (ai < len) ai += utf8towc(...". The problem? If you hit a null byte, the index never advances and the code goes into an infinite loop. This can be worked around with a ternary operator or some other check, but fixing utf8towc() to do the _right_ thing seems more sensible (we have read one byte and written it successfully).

- Oliver Webb <[email protected]>
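To make the stall concrete, here is a rough sketch of the loop described above. It is not toybox code: ascii_fast_path() and to_wide() are hypothetical names, the stand-in only models the ASCII fast path, and the "fixed" flag is just a way to compare the two return expressions side by side. Instead of actually spinning forever, the loop reports -2 when a call makes no progress.

```c
// Hypothetical stand-in for the ASCII fast path only; NOT toybox's utf8towc().
// fixed == 0 mimics "!!(*wc = *str)", fixed != 0 mimics "(*wc = *str) || 1".
static int ascii_fast_path(unsigned *wc, const char *str, unsigned len,
                           int fixed)
{
  if (len && (unsigned char)*str < 128)
    return fixed ? ((*wc = *str) || 1) : !!(*wc = *str);

  return -1; // multibyte decoding and error handling omitted in this sketch
}

// Convert len bytes of ASCII into wide characters.
// Returns the number of wide characters written, -1 on a non-ASCII byte,
// or -2 where a bare "ai += n" loop would have spun forever.
static int to_wide(unsigned *out, const char *in, unsigned len, int fixed)
{
  unsigned ai = 0, wi = 0;

  while (ai < len) {
    int n = ascii_fast_path(out + wi, in + ai, len - ai, fixed);

    if (n < 0) return -1;  // outside the scope of this sketch
    if (n == 0) return -2; // no bytes consumed: ai would never advance
    ai += n;
    wi++;
  }

  return wi;
}
```

Calling to_wide() on the 3 bytes "a\0b" with fixed set to 0 stops at the null byte with -2, while fixed set to 1 converts all 3 bytes, null included.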
From b8c0d9432b018fca692673732296a986bf1fc8f3 Mon Sep 17 00:00:00 2001
From: Oliver Webb <[email protected]>
Date: Sat, 6 Apr 2024 17:19:32 -0500
Subject: [PATCH] utf8towc(), return 1 on null byte instead of 0

---
 lib/lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/lib.c b/lib/lib.c
index 6a4a77dd..0d2e5442 100644
--- a/lib/lib.c
+++ b/lib/lib.c
@@ -380,7 +380,7 @@ int utf8towc(unsigned *wc, char *str, unsigned len)
   char *s, c;
 
   // fast path ASCII
-  if (len && *str<128) return !!(*wc = *str);
+  if (len && *str<128) return (*wc = *str) || 1;
 
   result = first = *(s = str++);
   if (result<0xc2 || result>0xf4) return -1;
-- 
2.44.0
