Heya, I've been looking more at the utf8 code in toybox. The first thing I
spotted is that utf8towc() and wctoutf8() are both in lib.c instead of utf8.c.
Why haven't they been moved yet? Is it easier to track code that way? Also, the
documentation (header comment) should probably mention that they store
characters as unicode codepoints. I spent a while scratching my head at the
fact that wide characters are 4 byte ints when the maximum length of a single
utf8 character is 6 bytes.

Another thing I noticed is that if you pass a null byte into utf8towc(), it will
assign it, but will not "return bytes read" like it's supposed to. Instead it
returns 0 even though it read 1 byte. This is because we collapse the return
value for ascii characters down into 1 _or 0_ with !!(*a = *b), when "|| 1"
would always collapse the value to 1.
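
To make the difference concrete, here is a minimal sketch of just the ASCII
fast path, not the real function from lib.c. The names fast_path_old() and
fast_path_new() are made up for this example.

```c
// Sketch of only the ASCII fast path from utf8towc(). The function names
// here are hypothetical and exist just to compare the two expressions.

// Current behavior: !! turns the stored value into 0 or 1, so a null byte
// is stored in *wc but reported as 0 bytes consumed.
int fast_path_old(unsigned *wc, char *str)
{
  return !!(*wc = *str);
}

// Patched behavior: the assignment still happens (it is the left operand
// of ||), and "|| 1" forces the result to 1 even when the value is zero.
int fast_path_new(unsigned *wc, char *str)
{
  return (*wc = *str) || 1;
}
```

Both versions store the byte; only the return value for a null byte differs.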

Suppose you have a function that turns a character string into an array of
"wide characters". This is easily done by a while loop keeping one index for
the old character string and one for the new wide character string. So you
should just be able to "while (ai < len) ai += utf8towc(...". The problem? If
you hit a null byte the code goes into an infinite loop. This can be solved by
a ternary operator or some other checking, but fixing utf8towc() to do the
_right_ thing seems more sensible (we have read one byte and written it
successfully).
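
For illustration, a rough sketch of that loop with the ternary workaround.
decode_ascii() is a hypothetical ASCII-only stand-in that mimics the current
return value of utf8towc() (0 for a null byte), and str_to_wide() is likewise
a made-up name, so treat this as a sketch rather than toybox code.

```c
// Hypothetical stand-in for utf8towc(): ASCII only, and it returns 0 for
// a null byte the way the unpatched fast path does. Anything else is -1.
int decode_ascii(unsigned *wc, char *str, unsigned len)
{
  if (len && (unsigned char)*str < 128) return !!(*wc = *str);
  return -1;
}

// Convert len bytes of str into out[], which must have room for len
// entries. Returns the number of wide characters written, or -1 on a
// decode error.
int str_to_wide(unsigned *out, char *str, unsigned len)
{
  unsigned ai = 0;
  int wi = 0;

  while (ai < len) {
    int n = decode_ascii(out+wi, str+ai, len-ai);

    if (n < 0) return -1;
    // Workaround: a plain "ai += n" would never advance past a null
    // byte, because n is 0 there. Treat 0 as one byte consumed.
    ai += n ? n : 1;
    wi++;
  }

  return wi;
}
```

With a decoder that returns 1 for the null byte, the ternary goes away and the
loop body is just "ai += n".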

-   Oliver Webb <[email protected]>
From b8c0d9432b018fca692673732296a986bf1fc8f3 Mon Sep 17 00:00:00 2001
From: Oliver Webb <[email protected]>
Date: Sat, 6 Apr 2024 17:19:32 -0500
Subject: [PATCH] utf8towc(), return 1 on null byte instead of 0

---
 lib/lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/lib.c b/lib/lib.c
index 6a4a77dd..0d2e5442 100644
--- a/lib/lib.c
+++ b/lib/lib.c
@@ -380,7 +380,7 @@ int utf8towc(unsigned *wc, char *str, unsigned len)
   char *s, c;
 
   // fast path ASCII
-  if (len && *str<128) return !!(*wc = *str);
+  if (len && *str<128) return (*wc = *str) || 1;
 
   result = first = *(s = str++);
   if (result<0xc2 || result>0xf4) return -1;
-- 
2.44.0

