Re: [Toybox] utf8towc(), stop being defective on null bytes

Oliver Webb via Toybox Sun, 07 Apr 2024 07:43:02 -0700

On Sunday, April 7th, 2024 at 03:54, Rob Landley <[email protected]> wrote:


> As for moving it again someday, unnecessarily moving files is churn that makes
> the history harder to see, and lib/*.c has never been a strict division (more
> "one giant file seems a bit much"). The basic conversion to/from utf8 is
> different from caring about the characteristics of unicode code points (which
> the rest of utf8.c does), so having it in lib.c makes a certain amount of
> sense, and I'm not strongly motivated to change it without a good reason.
> 
> It might happen eventually because I'm still not happy with the general
> unicode handling design "yet", but that's a larger story.

Eh, they're utf8 functions, and utf8 functions living in a file named "utf8.c"
makes more sense from my perspective.

I was also planning on doing some form of a documentation write up in code.html
about, among other things, the utf8 functions. That stopped when I realized
that would mean documenting all of the eighty-something functions in lib.c.

> (I probably should have called it unicode.c instead, but
> unicode is icky, the name is longer, and half the unicode stuff is still in
> libc anyway).
> 
> Unicode is icky because utf8 and unicode are not the same thing.

If it's handling unicode instead of utf8 and the 2 are noticeably different,
I don't see why a file for unicode stuff should be called utf8.c.

> Because Microsoft broke utf8 in multiple ways through the unicode consortium,
> among other things making 4 bytes the max:

I have to ask: if you disagree with the decision to cap utf8 to only a million
codepoints, and not complying with it only means that anyone who wants to pass
unicode codepoints over U+10FFFF to toybox code will be able to, why have code
that makes sure we comply with an insane Microsoft decision when we don't
(I don't think?) have to:

  // Limit unicode so it can't encode anything UTF-16 can't.
  if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return -1;

> > Another thing I noticed is that if you pass a null byte into utf8towc(), it
> > will assign, but will not "return bytes read" like it's supposed to, instead
> > it will return 0 when it reads 1 byte.
> 
> The same way strlen() doesn't include the null terminator in the length "like
> it's supposed to"? Obviously stpcpy() is defective to write a null terminator
> and then return a pointer to that null terminator, instead of returning the
> first byte it didn't modify "like it's supposed to"...
> 
> An assertion is not the same as a question.

If I'm going by the comment over the function body ("This returns bytes read
unless error"), then yes, that is what "it's supposed to do": we have read one
byte of input and written it successfully to our return destination. A special
case for null bytes is fine, but to save me and anyone else a debugging
nightmare when doing utf8 processing on data with null bytes in it, I'd prefer
that it be mentioned somewhere.

A bug only becomes a feature when you declare it one, and "undocumented special
case" is another way to say "landmine".

> Returning length 0 means we hit a null terminator,

Null bytes aren't always "terminators". You can embed null bytes into data and
still want to do utf8 processing with it.
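
To make the embedded-null case concrete, here is a minimal sketch (not toybox's
code). sketch_utf8towc() and count_codepoints() are hypothetical names; the
"return 0 on a null byte" convention is the one described in this thread, and
the -1 error value is an assumption. The decoder deliberately skips overlong
and range checks to stay short. The point is that a caller walking a
length-delimited buffer has to know that 0 still means "one byte consumed":

```c
#include <stddef.h>

// Hypothetical stand-in for a utf8towc()-style decoder, NOT toybox's code.
// Returns bytes consumed, 0 if the first byte is NUL (after storing 0 in
// *wc), or -1 on an invalid or truncated sequence. No overlong/range checks.
static int sketch_utf8towc(unsigned *wc, const char *str, size_t len)
{
  const unsigned char *s = (const unsigned char *)str;
  unsigned result;
  int i, extra;

  if (!len) return -1;
  if (*s < 0x80) {
    *wc = *s;
    return *s ? 1 : 0;
  }

  // Number of continuation bytes implied by the lead byte.
  if ((*s & 0xe0) == 0xc0) extra = 1;
  else if ((*s & 0xf0) == 0xe0) extra = 2;
  else if ((*s & 0xf8) == 0xf0) extra = 3;
  else return -1;
  if ((size_t)extra >= len) return -1;

  // Payload bits of the lead byte: 5, 4, or 3 depending on sequence length.
  result = *s & (0x3f >> extra);
  for (i = 1; i <= extra; i++) {
    if ((s[i] & 0xc0) != 0x80) return -1;
    result = (result << 6) | (s[i] & 0x3f);
  }
  *wc = result;

  return extra + 1;
}

// Count code points in a buffer that may contain NUL bytes.
// Returns -1 if any sequence is invalid.
static long count_codepoints(const char *buf, size_t len)
{
  long count = 0;
  size_t pos = 0;

  while (pos < len) {
    unsigned wc;
    int used = sketch_utf8towc(&wc, buf + pos, len - pos);

    if (used < 0) return -1;
    // A return of 0 means a NUL was decoded: it still occupied one byte.
    pos += used ? used : 1;
    count++;
  }

  return count;
}
```

Without the "used ? used : 1" line, the loop above would spin forever on the
first null byte, which is the landmine I mean.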

> due to the maximum possible value being truncated BY MICROSOFT so it doesn't
> outshine their horrible legacy format:

"BY MICROSOFT", and by you:
https://github.com/landley/toybox/blob/master/lib/lib.c#L189
Do we need to do that for any reason other than to comply with Microsoft and
the Unicode committee? The linux kernel is agnostic to filenames having "good
utf8". Should utf8towc (I don't think wctoutf8 has this restriction) be
agnostic towards "good unicode" when it's utf8 we are processing, and delegate
that job to the fontmetrics code? Again, it's utf8 we are handling with these,
not unicode, even if the two are linked.
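
As a rough sketch of what delegating could look like: the range test quoted
earlier in this message, pulled out of the decoder into a predicate that only
the callers who care would use. is_utf16_encodable() is a hypothetical name,
not something that exists in toybox:

```c
// Hypothetical helper, not part of toybox: the policy check from the decoder,
// separated out so callers can opt in to "good unicode" validation.
static int is_utf16_encodable(unsigned cp)
{
  // Reject values above U+10FFFF and the surrogate range U+D800..U+DFFF.
  return !(cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff));
}
```

The decoder would then just turn bytes into a number, and display code could
call this before deciding how to render the result.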

"And even then it might be the wrong thing to disallow clever
people from doing clever things. Encoding other information in filenames
might be proper for a number of applications."
- Linus Torvalds, https://yarchive.net/comp/linux/utf8.html

-   Oliver Webb <[email protected]>

_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
