Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

ropers Mon, 07 Jun 2021 18:28:14 -0700

Hiya,

@Ingo:
Sorry I have been out of touch.  I have arguably been out of sorts,
though hopefully not out of order in your book.

> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c     8 May 2020 14:30:42 -0000       1.87
> +++ emacs.c     13 May 2021 18:16:13 -0000
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
>                 return -1;
>         buf[off++] = c;
>
> -       if (c == 0xf4)
> +       /*
> +        * In the following, comments refer to violations of
> +        * the inequality tests at the ends of the lines.
> +        * See the utf8(7) manual page for details.
> +        */
> +
> +       if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */

This threw me at first.  I didn't initially understand why a check
whether this is "beyond Unicode" needed that "(c & 0xf8) == 0xf0 &&"
part at all.

But I now think I've got it:  I let the comment lead me astray.  In
truth, the test does not just check whether the fifth most significant
bit is zero; it also checks that the leftmost nibble is 0xF.  Seeing,
though, that the zero status of the fifth msb is checked twice (the
0x08 bit of the 0xf8 mask ensures it is 0, and so does the "c < 0xf5"
check), would it be clearer to check that only once?  Like thus:

 if ((c & 0xf0) == 0xf0 && c < 0xf5)  /* 4B leading byte, not beyond Unicode */

I'm NOT saying your way is wrong; I'm just throwing this out there.
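
For what it's worth, here is a minimal sketch (mine, not part of the
patch) that brute-forces all 256 byte values to confirm that the two
spellings accept exactly the same bytes.  The names lead4_patch,
lead4_alt and check_equivalence are made up for illustration.

```c
#include <assert.h>

/* The test as written in the patch. */
static int
lead4_patch(int c)
{
	return (c & 0xf8) == 0xf0 && c < 0xf5;
}

/* The variant with the wider 0xf0 mask. */
static int
lead4_alt(int c)
{
	return (c & 0xf0) == 0xf0 && c < 0xf5;
}

/* Compare both tests on every possible byte value. */
static void
check_equivalence(void)
{
	int c;

	for (c = 0; c <= 0xff; c++)
		assert(lead4_patch(c) == lead4_alt(c));
}
```

Both accept 0xf0 through 0xf4 and nothing else, so this is purely a
question of which spelling reads better.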

Also, suppose we get a byte here that IS beyond Unicode:  would any
further handling of that be needed once we arrive at "len = 1;" (the
final else) below?

>                 len = 4;
>         else if ((c & 0xf0) == 0xe0)
>                 len = 3;
> -       else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> +       else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
>                 len = 2;
>         else
>                 len = 1;

^Here.

The way I read this, that's still unhandled for now, isn't it?
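
To make the question concrete, here is a rough sketch that copies the
if/else chain from the hunk above into a standalone function.  The
name u8_lead_len is hypothetical, and nothing here models what the
caller does with the result afterwards.

```c
#include <assert.h>

/*
 * Hypothetical helper: expected sequence length for leading byte c,
 * using the same tests as the patched if/else chain.
 */
static int
u8_lead_len(int c)
{
	if ((c & 0xf8) == 0xf0 && c < 0xf5)
		return 4;
	else if ((c & 0xf0) == 0xe0)
		return 3;
	else if ((c & 0xe0) == 0xc0 && c > 0xc1)
		return 2;
	else
		return 1;
}

static void
demo(void)
{
	assert(u8_lead_len(0xf4) == 4);	/* last valid 4-byte leader */
	assert(u8_lead_len(0xf5) == 1);	/* beyond Unicode */
	assert(u8_lead_len(0xff) == 1);	/* never valid in UTF-8 */
	assert(u8_lead_len(0xc1) == 1);	/* overlong 2-byte leader */
	assert(u8_lead_len(0x80) == 1);	/* stray continuation byte */
	assert(u8_lead_len('A') == 1);	/* plain ASCII */
}
```

So in this sketch, 0xf5 through 0xff land in the same bucket as ASCII,
0xc0, 0xc1 and stray continuation bytes.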

> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.

Or does that refer to other LEGAL 4-byte UTF-8 sequences?

> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>                if (cc == -1)
>                        break;
>                if (isu8cont(cc) == 0 ||
> -                  (c == 0xe0 && len == 3 && cc < 0xa0) ||
> -                  (c == 0xed && len == 3 && cc & 0x20) ||
> -                  (c == 0xf4 && len == 4 && cc & 0x30)) {
> +                  (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
> +                  (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
> +                  (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
> +                  (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
>                        x_e_ungetc(cc);
>                        break;
>                }

Whatever you ultimately choose there, please DO include your comments,
i.e. these:

    /* use single byte */
    /* use 2 bytes */
    /* surrogates  */
    /* use 3 bytes */
    /* beyond Uni. */

...because those are actually helpful, especially for nincompoops like me.
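
Here is how I read those four tests, as a minimal sketch with the
conditions copied from the hunk.  The helpers is_cont and u8_second_ok
are hypothetical; is_cont is my assumed stand-in for isu8cont(), and
len is simply passed in, since how the loop updates it is outside the
quoted hunk.

```c
#include <assert.h>

/* Assumed stand-in for isu8cont(): true for bytes of the form 10xxxxxx. */
static int
is_cont(int cc)
{
	return (cc & 0xc0) == 0x80;
}

/*
 * Hypothetical helper: nonzero if cc may follow leading byte c in a
 * sequence of length len, using the same tests as the patched loop.
 */
static int
u8_second_ok(int c, int len, int cc)
{
	if (!is_cont(cc) ||
	    (c == 0xe0 && len == 3 && cc < 0xa0) ||	/* use 2 bytes */
	    (c == 0xed && len == 3 && cc > 0x9f) ||	/* surrogates  */
	    (c == 0xf0 && len == 4 && cc < 0x90) ||	/* use 3 bytes */
	    (c == 0xf4 && len == 4 && cc > 0x8f))	/* beyond Uni. */
		return 0;
	return 1;
}

static void
demo(void)
{
	assert(!u8_second_ok(0xe0, 3, 0x9f));	/* E0 9F xx fits in 2 bytes */
	assert(u8_second_ok(0xe0, 3, 0xa0));	/* E0 A0 80 is U+0800 */
	assert(!u8_second_ok(0xed, 3, 0xa0));	/* ED A0 80 is U+D800 */
	assert(u8_second_ok(0xed, 3, 0x9f));	/* ED 9F BF is U+D7FF */
	assert(!u8_second_ok(0xf0, 4, 0x8f));	/* F0 8F xx xx fits in 3 bytes */
	assert(u8_second_ok(0xf0, 4, 0x90));	/* F0 90 80 80 is U+10000 */
	assert(!u8_second_ok(0xf4, 4, 0x90));	/* F4 90 80 80 is U+110000 */
	assert(u8_second_ok(0xf4, 4, 0x8f));	/* F4 8F BF BF is U+10FFFF */
}
```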

On 07/06/2021, Sören Tempel <soe...@soeren-tempel.net> wrote:
>
> BTW: Is there any reason why ksh doesn't use editline for all its line
> editing needs? That would allow handling all these nitty-gritty details
> in a central place.
>
> Greetings,
> Sören

That might end up fixing a minor quality of life issue and might end
up obsoleting a long-delayed diff that I've let go stale because I've
been too noobish and daft to confidently complete it.  It worked in
principle last time I tried many moons ago, but I didn't fully
comprehend how and why, which is concerning.
The issue is that, unlike ksh vi mode, ksh emacs mode won't let you
return to the same line with arrow-down after you've gone back into
the past with arrow-up.  (It's a BTTF bug.)  Editline could take care
of all of that.  Whether that's a good reason to support your
suggestion is not for me to say.
(But wait, there's more: I did research and compare lotsa related
things there, which yielded an iffy diff, but mainly a VERY verbose
text file with my notes and findings.)
Should I even try to rediscover what I had and maybe share it with
somebody, perhaps off-list?  (Caveat emptor; it may not be worth your
time, but YMMV.)

Thanks and regards,
Ian

PS: (This part is purely for shits and giggles.)

I'd long thought that actually, octal was a fine way to deal with
UTF-8, because it fits the distribution of bits somewhat more cleanly
than hexadecimal.

Here's why I thought so.  Sorry for the possibly confusing ad-hoc notation:

1 byte : Octal works for code points U+0000–U+007F b/c "that's all she wrote":
    It's ok the left octal is 1 at most; it can't bleed into anything else.
    (binary: 0?|???|??? = octal maximum: 1|7|7)

2 bytes: Octal works for code points U+0080–U+07FF because of the 0 between
    the leftmost 11 and the remainder of the leading byte.
    (binary: 11|0??|??? 10|???|??? = octal maximum: 3|7 7|7)

3 bytes: Code points U+0800–U+FFFF are the exception and don't work as well
    at the high end, but it's not too bad.
    (binary: 11|{1}0?|??? 10|???|??? 10|???|??? = octal maximum: ⅘|7 7|7 7|7)
    *⅘ here meaning 4 or 5, so it's ok despite the {1} from the leftmost bits

4 bytes: Code points U+10000–U+10FFFF are a perfect fit.
    (bin: 11110|??? 10|???|??? 10|???|??? 10|???|??? = oct max: 7 7|7 7|7 7|7)
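
To make the notation above concrete, here is a rough sketch of a
decoder written with octal masks only.  The name u8_decode_octal is
made up, the caller is assumed to already know the sequence length n,
and no validation is attempted.

```c
#include <assert.h>

/*
 * Hypothetical decoder for one well-formed UTF-8 sequence of n bytes
 * (1 to 4), written with octal masks.  No validation is attempted.
 */
static unsigned long
u8_decode_octal(const unsigned char *s, int n)
{
	unsigned long cp;
	int i;

	switch (n) {
	case 1:
		cp = s[0] & 0177;	/* 0?|???|??? */
		break;
	case 2:
		cp = s[0] & 037;	/* 11|0??|??? */
		break;
	case 3:
		cp = s[0] & 017;	/* 11|10?|???: marker bit intrudes */
		break;
	default:
		cp = s[0] & 07;		/* 11|110|??? */
		break;
	}
	for (i = 1; i < n; i++)
		cp = (cp << 6) | (s[i] & 077);	/* 10|???|??? */
	return cp;
}

static void
demo(void)
{
	static const unsigned char e_acute[] = { 0303, 0251 };
	static const unsigned char euro[] = { 0342, 0202, 0254 };
	static const unsigned char grin[] = { 0360, 0237, 0230, 0200 };

	assert(u8_decode_octal(e_acute, 2) == 0351);	/* U+00E9 */
	assert(u8_decode_octal(euro, 3) == 020254);	/* U+20AC */
	assert(u8_decode_octal(grin, 4) == 0373000);	/* U+1F600 */
}
```

Each continuation byte contributes exactly its two low octal digits,
so the code point in octal can be read straight off the bytes.  Only
the 3-byte leader needs a mask (017) that cuts through an octal digit.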

However, your email made me want to try that putative usefulness in
practice, and after I had written this line, I kind of did see the
error of my ways:
                         if ((c & 0370) == 0360 && c < 0365)

So now I'm not sure anymore if octal will come in handy in dealing with UTF-8.
