Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

Nicholas Marriott Tue, 08 Jun 2021 05:40:04 -0700

Looks good to me, ok nicm



On Wed, Jun 02, 2021 at 09:00:16PM +0200, Ingo Schwarze wrote:
> Hi,
> 
> feeling hesitant to commit into ksh without at least one proper OK,
> i'm resending this patch here, sorry if i missed private feedback.
> 
> What the existing code does:
> It tries to make sure that multi-byte UTF-8 characters get passed on by
> the shell without fragmentation, not one byte at time.  I heard people
> say that some software, for example tmux(1), may sometimes get confused
> when receiving a UTF-8 character in a piecemeal manner.
> 
> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
> 
> 
> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.
> 
> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readabilty - UTF-8-parsing is bad enough without
> needless micro-optimization, right?
> 
> 
> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3).  But maybe it improves something in tmux?  Not sure.
> 
> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
> 
> OK?
>   Ingo
> 
> 
> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c   8 May 2020 14:30:42 -0000       1.87
> +++ emacs.c   13 May 2021 18:16:13 -0000
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
>               return -1;
>       buf[off++] = c;
>  
> -     if (c == 0xf4)
> +     /*
> +      * In the following, comments refer to violations of
> +      * the inequality tests at the ends of the lines.
> +      * See the utf8(7) manual page for details.
> +      */
> +
> +     if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */
>               len = 4;
>       else if ((c & 0xf0) == 0xe0)
>               len = 3;
> -     else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> +     else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
>               len = 2;
>       else
>               len = 1;
> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>               if (cc == -1)
>                       break;
>               if (isu8cont(cc) == 0 ||
> -                 (c == 0xe0 && len == 3 && cc < 0xa0) ||
> -                 (c == 0xed && len == 3 && cc & 0x20) ||
> -                 (c == 0xf4 && len == 4 && cc & 0x30)) {
> +                 (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
> +                 (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
> +                 (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
> +                 (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
>                       x_e_ungetc(cc);
>                       break;
>               }

Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

Reply via email to