Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

Sören Tempel Mon, 07 Jun 2021 10:04:48 -0700

Ingo Schwarze <schwa...@usta.de> wrote:
> Hi,

Hello,


> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
> 
> 
> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that I got this code wrong in the first place.
> 
> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readability - UTF-8-parsing is bad enough without
> needless micro-optimization, right?

Nice, I wasn't aware that you also had a patch ready. It sounds good to
me and also fixes the problem I originally experienced with 4-byte UTF-8
sequences.
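
For illustration, here is a minimal sketch of the range checks that
well-formed UTF-8 requires on the first two bytes of a sequence. It is
not the ksh code or the patch under discussion; the helpers is_cont()
and utf8_seq_len() are hypothetical names. The second-byte comparisons
have the same form as the "cc > 0x9f" and "cc > 0x8f" tests quoted
above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: is c a UTF-8 continuation byte (10xxxxxx)? */
static int
is_cont(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/*
 * Hypothetical helper, not taken from ksh: return the length in bytes
 * (1 to 4) of the well-formed UTF-8 sequence starting at s, or 0 if
 * the first n bytes do not start with a complete, well-formed sequence.
 */
static size_t
utf8_seq_len(const unsigned char *s, size_t n)
{
	size_t i, len;
	unsigned char lead, cc;

	assert(s != NULL);
	if (n == 0)
		return 0;
	lead = s[0];
	if (lead < 0x80)
		return 1;		/* ASCII */
	if (lead < 0xc2 || lead > 0xf4)
		return 0;		/* stray continuation, overlong, or > U+10FFFF */
	len = lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
	if (n < len)
		return 0;		/* incomplete sequence */
	cc = s[1];
	if (!is_cont(cc))
		return 0;
	/* The second byte has a narrower range after some lead bytes. */
	if (lead == 0xe0 && cc <= 0x9f)
		return 0;		/* overlong three-byte form */
	if (lead == 0xed && cc > 0x9f)
		return 0;		/* UTF-16 surrogates */
	if (lead == 0xf0 && cc <= 0x8f)
		return 0;		/* overlong four-byte form */
	if (lead == 0xf4 && cc > 0x8f)
		return 0;		/* beyond U+10FFFF */
	for (i = 2; i < len; i++)
		if (!is_cont(s[i]))
			return 0;
	return len;
}
```

With this sketch, the bytes f0 9f 90 a1 (U+1F421) are accepted as one
four-byte sequence, while f0 8f bf bf and f4 90 80 80 are rejected.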

> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3).  But maybe it improves something in tmux?  Not sure.

Character movement over emojis (e.g. U+1F421) is currently broken
because the ksh code doesn't correctly determine the number of columns
needed for a given character (i.e. what you would normally do with
wcwidth(3)). I tried fixing this, but without wchar.h doing so seemed
very cumbersome. Inputting emojis does work with your patch, though, and
was broken previously.
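
To make the width mismatch concrete, here is a small sketch. It assumes
a line editor that counts one column per character, which is how I read
the description above; naive_columns() is a hypothetical name and this
is not the ksh implementation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from ksh: count the characters in a
 * NUL-terminated UTF-8 string by skipping continuation bytes
 * (10xxxxxx).  An editor that assumes one column per character would
 * treat this count as the display width.
 */
static size_t
naive_columns(const char *s)
{
	size_t cols = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xc0) != 0x80)
			cols++;
	return cols;
}
```

For the blowfish, naive_columns("\xf0\x9f\x90\xa1") yields 1, whereas
the character is width 2 as noted in the quoted text, so the cursor
position computed by such an editor would be off by one column after
moving over it.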

> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
> 
> OK?

Your diff looks good to me.

BTW: Is there any reason why ksh doesn't use editline for all its line
editing needs? That would allow handling all these nitty-gritty details
in a central place.

Greetings,
Sören
