On Sun, Apr 7, 2024 at 7:43 AM Oliver Webb via Toybox <[email protected]> wrote:
>
> On Sunday, April 7th, 2024 at 03:54, Rob Landley <[email protected]> wrote:
>
> > As for moving it again someday, unnecessarily moving files is churn that
> > makes the history harder to see, and lib/*.c has never been a strict
> > division (more "one giant file seems a bit much"). The basic conversion
> > to/from utf8 is different from caring about the characteristics of
> > unicode code points (which the rest of utf8.c does), so having it in
> > lib.c makes a certain amount of sense, and I'm not strongly motivated to
> > change it without a good reason.
> >
> > It might happen eventually because I'm still not happy with the general
> > unicode handling design "yet", but that's a larger story.
>
> Eh, they're utf8 functions, utf8 functions being in the file named "utf8.c"
> makes more sense from my perspective.
>
> I was also planning on doing some form of a documentation write up in
> code.html about, among other things, the utf8 functions. That stopped when
> I realized that would mean documenting all of the eighty-something
> functions in lib.c.
>
> > (I probably should have called it unicode.c instead, but
> > unicode is icky, the name is longer, and half the unicode stuff is still
> > in libc anyway).
> >
> > Unicode is icky because utf8 and unicode are not the same thing.
>
> If it's handling unicode instead of utf8 and the 2 are noticeably different,
> I don't see why a file for unicode stuff should be called utf8.c.
>
> > Because Microsoft broke utf8 in multiple ways through the unicode
> > consortium, among other things making 4 bytes the max:
>
> I have to ask, if you disagree with the decision to cap utf8 to only a
> million codepoints, and not complying with that only means that anyone who
> wants to pass unicode codepoints over U+10FFFF to toybox code will be able to.
> Why have code make sure we comply with an insane microsoft decision when
> we don't (I don't think?) have to:
>
> // Limit unicode so it can't encode anything UTF-16 can't.
> if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return -1;
>
> > > Another thing I noticed is that if you pass a null byte into
> > > utf8towc(), it will assign, but will not "return bytes read" like it's
> > > supposed to, instead it will return 0 when it reads 1 byte.
> >
> > The same way strlen() doesn't include the null terminator in the length
> > "like it's supposed to"? Obviously stpcpy() is defective to write a null
> > terminator and then return a pointer to that null terminator, instead of
> > returning the first byte it didn't modify "like it's supposed to"...
> >
> > An assertion is not the same as a question.
>
> If I'm going by the comment over the function body ("This returns bytes read
> unless error"), then yes, that is what "it's supposed to do", we have read
> one byte of input, and written it successfully to our return destination.
> A special case for null bytes is fine, but to save me and any other person
> that debugging nightmare when they try to do utf8 processing on data with
> null bytes in it, I'd prefer if that was mentioned somewhere.
>
> A bug only becomes a feature when you declare it is, and "undocumented
> special case" is another way to say "landmine".
>
> > Returning length 0 means we hit a null terminator,
>
> Null bytes aren't always "terminators". You can embed null bytes into data
> and still want to do utf8 processing with it.

that's questionable ... the desire to have ASCII NUL in utf-8 sequences
(without breaking the "utf-8 sequences are usable as c strings" property) is
the main reason for the existence of "modified utf-8".

> > due to the maximum possible value being truncated BY MICROSOFT so it
> > doesn't outshine their horrible legacy format:
>
> "BY MICROSOFT", and by you.
> https://github.com/landley/toybox/blob/master/lib/lib.c#L189.
> Do we need to do that for any reason other than to comply to microsoft and
> the unicode committee?
> The linux kernel is agnostic to filenames having "good utf8". Should
> utf8towc (I don't think wctoutf8 has this restriction) be agnostic towards
> "good unicode" when it's utf8 we are processing, and delegate that job to
> the fontmetrics code? Again, it's utf8 we are handling with these, not
> unicode, even if the 2 are linked.
>
> "And even then it might be the wrong thing to disallow clever
> people from doing clever things. Encoding other information in filenames
> might be proper for a number of applications."
> - Linus Torvalds, https://yarchive.net/comp/linux/utf8.html
>
> - Oliver Webb <[email protected]>
>
> _______________________________________________
> Toybox mailing list
> [email protected]
> http://lists.landley.net/listinfo.cgi/toybox-landley.net
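
For anyone unfamiliar with the "modified utf-8" point above, here is a minimal sketch of the idea, not toybox code. The names mutf8_encode() and mutf8_encode_string() are made up for illustration. The only difference from plain utf-8 shown here is that U+0000 is written as the two-byte overlong form C0 80, so the output never contains a 0x00 byte and still works as a C string. Code points above U+FFFF are rejected in this sketch, since modified utf-8 represents those as surrogate pairs, which is left out to keep it short.

```c
#include <string.h>

// Hypothetical helper: encode one code point as modified utf-8.
// Returns bytes written (1 to 3), or -1 for values above U+FFFF
// (surrogate-pair encoding is not handled in this sketch).
int mutf8_encode(unsigned cp, unsigned char *out)
{
  // Plain ASCII except NUL: one byte, same as regular utf-8.
  if (cp && cp < 0x80) {
    out[0] = cp;
    return 1;
  }
  // U+0080..U+07FF, plus the special case U+0000 -> C0 80.
  if (cp < 0x800) {
    out[0] = 0xc0 | (cp >> 6);
    out[1] = 0x80 | (cp & 0x3f);
    return 2;
  }
  // U+0800..U+FFFF: three bytes, same as regular utf-8.
  if (cp < 0x10000) {
    out[0] = 0xe0 | (cp >> 12);
    out[1] = 0x80 | ((cp >> 6) & 0x3f);
    out[2] = 0x80 | (cp & 0x3f);
    return 3;
  }
  return -1;
}

// Hypothetical helper: encode count code points into out and add a
// terminating 0 byte. Returns the encoded length, or -1 on error.
// The caller must provide at least 3*count+1 bytes of space.
int mutf8_encode_string(const unsigned *cps, int count, unsigned char *out)
{
  int len = 0, i, n;

  for (i = 0; i < count; i++) {
    if ((n = mutf8_encode(cps[i], out + len)) < 0) return -1;
    len += n;
  }
  out[len] = 0;
  return len;
}
```

Encoding the three code points 'a', U+0000, 'b' this way gives the bytes 61 c0 80 62, and strlen() on the result reports 4, because the embedded NUL no longer looks like a terminator. A strict utf-8 decoder would reject C0 80 as overlong, which is why this is a separate encoding rather than plain utf-8.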
