On 4/6/24 17:48, Oliver Webb via Toybox wrote:
> Heya, looking more at the utf8 code in toybox. The first thing I spotted
> is that utf8towc() and wctoutf8() are both in lib.c instead of utf8.c,
> why haven't they been moved yet, is it easier to track code that way?

The "yet" seems a bit presumptuous and accusatory, but given the title of
the post I suppose that's a given.
I have no current plans to move xstrtol() from lib.c to xwrap.c, and
atolx() is only called that instead of xatol() because it does suffixes.
The reason it had to go in lib.c back in the day was explained in the
commit that moved it there:

https://github.com/landley/toybox/commit/6e766936396e

As for moving it again someday, unnecessarily moving code between files is
churn that makes the history harder to see, and lib/*.c has never been a
strict division (more "one giant file seems a bit much"). The basic
conversion to/from utf8 is a different job from caring about the
characteristics of unicode code points (which is what the rest of utf8.c
does), so having it in lib.c makes a certain amount of sense, and I'm not
strongly motivated to change it without a good reason. It might happen
eventually, because I'm still not happy with the general unicode handling
design "yet", but that's a larger story.

Way back when there was "interestingtimes.c" for all my not-curses code,
but it was too long to type and mixed together a couple different kinds of
things, so I split it into utf8.c and tty.c, both of which were shorter
and didn't screw up "ls" columnization. (I probably should have called it
unicode.c instead, but unicode is icky, the name is longer, and half the
unicode stuff is still in libc anyway.)

Unicode is icky because utf8 and unicode are not the same thing. Ken
Thompson came up with a very elegant utf8 design and microsoft crapped all
over it (cap the conversion range, don't add the base value covered by the
previous range so there are no duplicate encodings) for no apparent
reason, and then unicode just plain got nuts.
(You had an ENORMOUS encoding space, the bottom bit could totally have
been combining vs physical characters so we don't need a function to tell,
and combining characters should 100% have gone BEFORE the physical
characters rather than after to avoid the whole problem of FLUSHING them,
and higher bits could indicate 1 column vs 2 column or
upper/lower/numeric so you don't have to test with special functions like
that, just collate them into LARGE BLOCKS, which is LESS SILLY than the
whole "skipping 0xd800" or whatever that is for the legacy 16 bit windows
encoding space that microsoft CRAPPED INTO THE STANDARD... Ahem.)

But alas, microsoft bought control of the unicode committee, so you need
functions to say what each character is, and those functions are
unnecessarily complicated. In theory libc has code to do wide char
conversions already, but glibc refuses to enable it unless you've
installed and selected a utf8-aware locale (which is just nuts, but that's
glibc for you). I made some clean dependency-free functions to do the
simple stuff that doesn't care what locale you're in, but there's still
wcwidth() and friends that depend on libc's whims (hence the dance to try
to find a utf8 locale in main.c, and the repeated discussion on this list
between me and Elliott and Rich Felker about trying to come up with
portable fontmetrics code. Well, column-metrics. Elliott keeps trying to
dissuade me, but bionic's code for this still didn't work static linked
last I checked...)

Moving stuff around between files when I'm not entirely satisfied with the
design (partly depending on libc state and partly _not_ depending on it)
doesn't seem helpful.
> Also, the documentation (header comment) should probably mention that
> they store stuff as unicode codepoints,

Because I consistently attach comments before the function _body_
explaining what the function does, instead of putting long explanations in
the .h files included from every other file, which the compiler would have
to churn through repeatedly. In this case:

// Convert utf8 sequence to a unicode wide character
// returns bytes consumed, or -1 if err, or -2 if need more data.
int utf8towc(unsigned *wc, char *str, unsigned len)

> I spent a while scratching my head at the fact wide characters are 4 byte
> int's when the maximum utf8 single character length is 6 bytes.

Because Microsoft broke utf8 in multiple ways through the unicode
consortium, among other things making 4 bytes the max:

http://lists.landley.net/pipermail/toybox-landley.net/2017-September/017184.html

In addition to the mailing list threads, I thought I blogged about this
rather a lot at the time:

https://landley.net/notes-2017.html#29-08-2017
https://landley.net/notes-2017.html#01-09-2017
https://landley.net/notes-2017.html#19-10-2017

Which was contemporaneous with the above git commit that added the
function to lib/lib.c. I generally find that stuff by going "when did this
code show up and/or get changed" (in this case September 2017) and then
checking known data sources like "the mailing list" and "the developer's
blog". (Whether it's my code or somebody else's, "this happened 7 years
ago" tends to require some digging to get the details right.)

> Another thing I noticed is that if you pass a null byte into utf8towc(),
> it will assign, but will not "return bytes read" like it's supposed to,
> instead it will return 0 when it reads 1 byte.

The same way strlen() doesn't include the null terminator in the length
"like it's supposed to"?
Obviously stpcpy() is defective to write a null terminator and then return
a pointer to that null terminator, instead of returning the first byte it
didn't modify "like it's supposed to"... An assertion is not the same as a
question.

> Suppose you have a function that turns a character string into a array of
> "wide characters", this is easily done by a while loop keeping a index
> for the old character string and the new wide character string. So you
> should just be able to "while (ai < len) ai += utf8towc(...", the
> problem?

Again with the "should". This is in use in 9 commands outside pending,
obviously it CAN BE used. Returning length 0 means we hit a null
terminator, but returning <0 means there was a problem preventing
conversion. Both require handling, so "+=" without checking the return
value means missing error handling.

The reason calling utf8towc() with length 0 returns -2 (need more data)
instead of 0 is that 0 is ONLY returned when it hits a NUL terminator. If
you run out of data without hitting a null terminator, that's not the same
thing, and maybe you want to refill your buffer instead of prematurely
ending the string. (Yes, I thought about it.)

If you don't want to track the length and know you're working on a null
terminated string, feeding in length 4 works because that's the longest
utf8 sequence unicode allows, due to the maximum possible value being
truncated BY MICROSOFT so it doesn't outshine their horrible legacy
format:

https://stackoverflow.com/questions/27415935/does-unicode-have-a-defined-maximum-number-of-code-points

Knowing your string is NUL terminated means it can't read off the end,
because any value <128 will terminate a unicode sequence, with error as
necessary. (The first byte indicates how many following bytes to expect,
but all unicode sequences have the high bit set, and everything after the
first has 10 as the first two bits.
So an unexpected NUL terminator is an encoding error returning -1, and it
can't read past the end of a null terminated string no matter how long you
say the length is. This also allows you to traverse utf8 sequences
backwards by skipping anything with 10 in the top two bits: everything
else is a codepoint start or ascii value.)

Rob
