Hi,
feeling hesitant to commit into ksh without at least one proper OK,
i'm resending this patch here, sorry if i missed private feedback.
What the existing code does:
It tries to make sure that multi-byte UTF-8 characters get passed on by
the shell without fragmentation, not one byte at time. I heard people
say that some software, for example tmux(1), may sometimes get confused
when receiving a UTF-8 character in a piecemeal manner.
Which problem needs fixing:
Of the four-byte UTF-8 sequences, only a subset is identified by the
existing code. The other four-byte UTF-8 sequences still get chopped
up resulting in individual bytes being passed on.
I'm also adding a few comments as suggested by jca@. Parsing of UTF-8
is less trivial than one might think, witnessed once again by the fact
that i got this code wrong in the first place.
I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
for uniformity and readabilty - UTF-8-parsing is bad enough without
needless micro-optimization, right?
Note that even with the patch below, moving backward and forward
over a blowfish icon on the command line still does not work because
the character is width 2 and the ksh code intentionally does not
use wcwidth(3). But maybe it improves something in tmux? Not sure.
Either way, unless it causes regressions, this (or a further improved
version) should go in because what is there is clearly wrong.
OK?
Ingo
Index: emacs.c
===================================================================
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.87
diff -u -p -r1.87 emacs.c
--- emacs.c 8 May 2020 14:30:42 -0000 1.87
+++ emacs.c 13 May 2021 18:16:13 -0000
@@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
return -1;
buf[off++] = c;
- if (c == 0xf4)
+ /*
+ * In the following, comments refer to violations of
+ * the inequality tests at the ends of the lines.
+ * See the utf8(7) manual page for details.
+ */
+
+ if ((c & 0xf8) == 0xf0 && c < 0xf5) /* beyond Unicode */
len = 4;
else if ((c & 0xf0) == 0xe0)
len = 3;
- else if ((c & 0xe0) == 0xc0 && c > 0xc1)
+ else if ((c & 0xe0) == 0xc0 && c > 0xc1) /* use single byte */
len = 2;
else
len = 1;
@@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
if (cc == -1)
break;
if (isu8cont(cc) == 0 ||
- (c == 0xe0 && len == 3 && cc < 0xa0) ||
- (c == 0xed && len == 3 && cc & 0x20) ||
- (c == 0xf4 && len == 4 && cc & 0x30)) {
+ (c == 0xe0 && len == 3 && cc < 0xa0) || /* use 2 bytes */
+ (c == 0xed && len == 3 && cc > 0x9f) || /* surrogates */
+ (c == 0xf0 && len == 4 && cc < 0x90) || /* use 3 bytes */
+ (c == 0xf4 && len == 4 && cc > 0x8f)) { /* beyond Uni. */
x_e_ungetc(cc);
break;
}