Looks good to me, ok nicm
On Wed, Jun 02, 2021 at 09:00:16PM +0200, Ingo Schwarze wrote:
> Hi,
>
> feeling hesitant to commit into ksh without at least one proper OK,
> i'm resending this patch here, sorry if i missed private feedback.
>
> What the existing code does:
> It tries to make sure that multi-byte UTF-8 characters get passed on by
> the shell without fragmentation, not one byte at time. I heard people
> say that some software, for example tmux(1), may sometimes get confused
> when receiving a UTF-8 character in a piecemeal manner.
>
> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code. The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
>
>
> I'm also adding a few comments as suggested by jca@. Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.
>
> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readabilty - UTF-8-parsing is bad enough without
> needless micro-optimization, right?
>
>
> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3). But maybe it improves something in tmux? Not sure.
>
> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
>
> OK?
> Ingo
>
>
> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c 8 May 2020 14:30:42 -0000 1.87
> +++ emacs.c 13 May 2021 18:16:13 -0000
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
> return -1;
> buf[off++] = c;
>
> - if (c == 0xf4)
> + /*
> + * In the following, comments refer to violations of
> + * the inequality tests at the ends of the lines.
> + * See the utf8(7) manual page for details.
> + */
> +
> + if ((c & 0xf8) == 0xf0 && c < 0xf5) /* beyond Unicode */
> len = 4;
> else if ((c & 0xf0) == 0xe0)
> len = 3;
> - else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> + else if ((c & 0xe0) == 0xc0 && c > 0xc1) /* use single byte */
> len = 2;
> else
> len = 1;
> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
> if (cc == -1)
> break;
> if (isu8cont(cc) == 0 ||
> - (c == 0xe0 && len == 3 && cc < 0xa0) ||
> - (c == 0xed && len == 3 && cc & 0x20) ||
> - (c == 0xf4 && len == 4 && cc & 0x30)) {
> + (c == 0xe0 && len == 3 && cc < 0xa0) || /* use 2 bytes */
> + (c == 0xed && len == 3 && cc > 0x9f) || /* surrogates */
> + (c == 0xf0 && len == 4 && cc < 0x90) || /* use 3 bytes */
> + (c == 0xf4 && len == 4 && cc > 0x8f)) { /* beyond Uni. */
> x_e_ungetc(cc);
> break;
> }