Looks good to me, ok nicm
On Wed, Jun 02, 2021 at 09:00:16PM +0200, Ingo Schwarze wrote:
> Hi,
>
> feeling hesitant to commit into ksh without at least one proper OK,
> i'm resending this patch here, sorry if i missed private feedback.
>
> What the existing code does:
> It tries to make sure that multi-byte UTF-8 characters get passed on by
> the shell without fragmentation, not one byte at a time. I heard people
> say that some software, for example tmux(1), may sometimes get confused
> when receiving a UTF-8 character in a piecemeal manner.
>
> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code. The other four-byte UTF-8 sequences still get chopped
> up, resulting in individual bytes being passed on.
>
>
> I'm also adding a few comments as suggested by jca@. Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.
>
> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readability - UTF-8-parsing is bad enough without
> needless micro-optimization, right?
>
>
> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3). But maybe it improves something in tmux? Not sure.
>
> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
>
> OK?
> Ingo
>
>
> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c	8 May 2020 14:30:42 -0000	1.87
> +++ emacs.c	13 May 2021 18:16:13 -0000
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
>  		return -1;
>  	buf[off++] = c;
>  
> -	if (c == 0xf4)
> +	/*
> +	 * In the following, comments refer to violations of
> +	 * the inequality tests at the ends of the lines.
> +	 * See the utf8(7) manual page for details.
> +	 */
> +
> +	if ((c & 0xf8) == 0xf0 && c < 0xf5) /* beyond Unicode */
>  		len = 4;
>  	else if ((c & 0xf0) == 0xe0)
>  		len = 3;
> -	else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> +	else if ((c & 0xe0) == 0xc0 && c > 0xc1) /* use single byte */
>  		len = 2;
>  	else
>  		len = 1;
> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>  		if (cc == -1)
>  			break;
>  		if (isu8cont(cc) == 0 ||
> -		    (c == 0xe0 && len == 3 && cc < 0xa0) ||
> -		    (c == 0xed && len == 3 && cc & 0x20) ||
> -		    (c == 0xf4 && len == 4 && cc & 0x30)) {
> +		    (c == 0xe0 && len == 3 && cc < 0xa0) || /* use 2 bytes */
> +		    (c == 0xed && len == 3 && cc > 0x9f) || /* surrogates */
> +		    (c == 0xf0 && len == 4 && cc < 0x90) || /* use 3 bytes */
> +		    (c == 0xf4 && len == 4 && cc > 0x8f)) { /* beyond Uni. */
>  			x_e_ungetc(cc);
>  			break;
>  		}
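
For anyone who wants to poke at the new tests outside of ksh, here is a
minimal standalone sketch of the same classification. The helper names
u8_expected_len() and u8_second_ok() are hypothetical and not part of
emacs.c; the continuation-byte test is written out inline rather than
calling isu8cont(); and it assumes the "len == N" conditions in the patch
restrict the range checks to the second byte of a sequence, which the
quoted hunks suggest but do not show in full.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not from emacs.c: number of bytes to expect
 * for a sequence starting with lead byte c, using the same tests as
 * the patched code.
 */
static int
u8_expected_len(unsigned char c)
{
	if ((c & 0xf8) == 0xf0 && c < 0xf5)	/* 0xf0 to 0xf4 */
		return 4;
	if ((c & 0xf0) == 0xe0)			/* 0xe0 to 0xef */
		return 3;
	if ((c & 0xe0) == 0xc0 && c > 0xc1)	/* 0xc2 to 0xdf */
		return 2;
	return 1;				/* pass on as a single byte */
}

/*
 * Hypothetical helper, not from emacs.c: return 1 if cc is acceptable
 * as the second byte after lead byte c, 0 otherwise.
 */
static int
u8_second_ok(unsigned char c, unsigned char cc)
{
	if ((cc & 0xc0) != 0x80)	/* not a continuation byte */
		return 0;
	if (c == 0xe0 && cc < 0xa0)	/* would fit in 2 bytes */
		return 0;
	if (c == 0xed && cc > 0x9f)	/* surrogates */
		return 0;
	if (c == 0xf0 && cc < 0x90)	/* would fit in 3 bytes */
		return 0;
	if (c == 0xf4 && cc > 0x8f)	/* beyond Unicode */
		return 0;
	return 1;
}
```

With these tests, a lead byte of 0xf0 followed by 0x9f (the start of
U+1F421, encoded as f0 9f 90 a1) is treated as the start of a four-byte
sequence, whereas the old "c == 0xf4" test would have given it length 1.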