Re: [Toybox] utf8towc(), stop being defective on null bytes

enh via Toybox Mon, 08 Apr 2024 09:02:05 -0700

On Sun, Apr 7, 2024 at 7:43 AM Oliver Webb via Toybox
<[email protected]> wrote:
>
> On Sunday, April 7th, 2024 at 03:54, Rob Landley <[email protected]> wrote:
>
> > As for moving it again someday, unnecessarily moving files is churn that
> > makes the history harder to see, and lib/*.c has never been a strict
> > division (more "one giant file seems a bit much"). The basic conversion
> > to/from utf8 is different from caring about the characteristics of unicode
> > code points (which the rest of utf8.c does), so having it in lib.c makes a
> > certain amount of sense, and I'm not strongly motivated to change it without
> > a good reason.
> >
> > It might happen eventually because I'm still not happy with the general
> > unicode handling design "yet", but that's a larger story.
>
> Eh, they're utf8 functions, utf8 functions being in the file named "utf8.c"
> makes more sense from my perspective.
>
> I was also planning on doing some form of a documentation write up in
> code.html about, among other things, the utf8 functions. That stopped when I
> realized that would mean documenting all of the eighty-something functions in
> lib.c.
>
> > (I probably should have called it unicode.c instead, but unicode is icky,
> > the name is longer, and half the unicode stuff is still in libc anyway).
> >
> > Unicode is icky because utf8 and unicode are not the same thing.
>
> If it's handling unicode instead of utf8 and the 2 are noticeably different,
> I don't see why a file for unicode stuff should be called utf8.c.
>
> > Because Microsoft broke utf8 in multiple ways through the unicode
> > consortium, among other things making 4 bytes the max:
>
> I have to ask: if you disagree with the decision to cap utf8 at just over a
> million codepoints, and not complying with it only means that anyone who
> wants to pass unicode codepoints over U+10FFFF to toybox code will be able
> to, why have code that makes sure we comply with an insane Microsoft decision
> when we don't (I don't think?) have to:
>
>   // Limit unicode so it can't encode anything UTF-16 can't.
>   if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return -1;
>
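
For reference, the check quoted above does two separate things: it rejects
values past U+10FFFF, and it rejects the UTF-16 surrogate range U+D800 to
U+DFFF. A minimal sketch of that predicate in isolation, assuming a
hypothetical helper name utf16_can_encode() (not a toybox function), with the
condition copied from the quoted line:

```c
// Hypothetical helper for illustration only, not part of toybox.
// Returns 1 if the decoded value passes the quoted check, 0 if the check
// would make the decoder return -1.
static int utf16_can_encode(unsigned result)
{
  // Same condition as the quoted line: too large for UTF-16, or inside the
  // range UTF-16 reserves for surrogate halves.
  if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return 0;

  return 1;
}
```

Dropping only the first half of the condition would let values above U+10FFFF
through while still refusing surrogates; dropping the whole line would accept
both.
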
> > > Another thing I noticed is that if you pass a null byte into utf8towc(),
> > > it will assign, but will not "return bytes read" like it's supposed to,
> > > instead it will return 0 when it reads 1 byte.
> >
> > The same way strlen() doesn't include the null terminator in the length
> > "like it's supposed to"? Obviously stpcpy() is defective to write a null
> > terminator and then return a pointer to that null terminator, instead of
> > returning the first byte it didn't modify "like it's supposed to"...
> >
> > An assertion is not the same as a question.
>
> If I'm going by the comment over the function body ("This returns bytes read
> unless error"), then yes, that is what "it's supposed to do": we have read
> one byte of input and written it successfully to our return destination. A
> special case for null bytes is fine, but to save me and anyone else that
> debugging nightmare when they try to do utf8 processing on data with null
> bytes in it, I'd prefer that it was mentioned somewhere.
>
> A bug only becomes a feature when you declare it is, and "undocumented
> special case" is another way to say "landmine".
>
> > Returning length 0 means we hit a null terminator,
>
> Null bytes aren't always "terminators". You can embed null bytes into data
> and still want to do utf8 processing with it.


that's questionable ... the desire to have ASCII NUL in utf-8
sequences (without breaking the "utf-8 sequences are usable as c
strings" property) is the main reason for the existence of "modified
utf-8".
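
A minimal sketch of what that looks like on the encoding side, assuming
hypothetical helper names mutf8_encode() and mutf8_encode_all() (neither is a
toybox function) and covering only code points up to U+FFFF. In that range the
one difference from standard utf-8 is that U+0000 is written as the two bytes
0xC0 0x80, so the output never contains a zero byte:

```c
#include <stddef.h>

// Hypothetical helper, not a toybox function: encode one code point up to
// U+FFFF as "modified utf-8". Returns bytes written (1 to 3), or -1 if wc is
// outside the range this sketch handles.
static int mutf8_encode(unsigned wc, char *out)
{
  // Plain ASCII, except NUL, is a single byte as usual.
  if (wc && wc < 0x80) {
    out[0] = (char)wc;
    return 1;
  }
  // Two-byte form. wc == 0 lands here on purpose and becomes 0xC0 0x80.
  if (wc < 0x800) {
    out[0] = (char)(0xc0 | (wc >> 6));
    out[1] = (char)(0x80 | (wc & 0x3f));
    return 2;
  }
  // Three-byte form for the rest of the 16-bit range.
  if (wc < 0x10000) {
    out[0] = (char)(0xe0 | (wc >> 12));
    out[1] = (char)(0x80 | ((wc >> 6) & 0x3f));
    out[2] = (char)(0x80 | (wc & 0x3f));
    return 3;
  }

  return -1;
}

// Encode n code points into buf, which must hold at least 3*n+1 bytes, and
// add a terminating zero byte. Returns the encoded length in bytes (not
// counting the terminator), or -1 if any value was out of range.
static int mutf8_encode_all(const unsigned *wc, size_t n, char *buf)
{
  int len = 0, got;
  size_t i;

  for (i = 0; i < n; i++) {
    if ((got = mutf8_encode(wc[i], buf + len)) < 0) return -1;
    len += got;
  }
  buf[len] = 0;

  return len;
}
```

Encoding the three code points 'a', U+0000, 'b' this way gives the four bytes
0x61 0xC0 0x80 0x62, and strlen() on the result sees all four. The catch is
that 0xC0 0x80 is an overlong encoding, which a strict utf-8 decoder is
expected to reject, so both ends have to agree they're speaking the modified
format.
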

> > due to the maximum possible value being truncated BY MICROSOFT so it 
> > doesn't outshine their horrible legacy format:
>
> "BY MICROSOFT", and by you:
> https://github.com/landley/toybox/blob/master/lib/lib.c#L189
> Do we need to do that for any reason other than to comply with Microsoft and
> the Unicode committee? The linux kernel is agnostic to filenames having "good
> utf8". Should utf8towc (I don't think wctoutf8 has this restriction) be
> agnostic towards "good unicode" when it's utf8 we are processing, and
> delegate that job to the fontmetrics code? Again, it's utf8 we are handling
> with these, not unicode, even if the 2 are linked.
>
> "And even then it might be the wrong thing to disallow clever
> people from doing clever things. Encoding other information in filenames
> might be proper for a number of applications."
> - Linus Torvalds, https://yarchive.net/comp/linux/utf8.html
>
> -   Oliver Webb <[email protected]>
>
> _______________________________________________
> Toybox mailing list
> [email protected]
> http://lists.landley.net/listinfo.cgi/toybox-landley.net
