Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-27 Thread Sören Tempel
Ingo Schwarze  wrote:
> It would, and in principle, that would be an improvement.
> But i think editline(3) code quality is insufficent for use in a
> shell.  It's all quite messy and hastily and sloppily written.
> I tried to polish some of it in the past, but got distracted,
> so editline(3) code is still full of stuff that needs review
> and quality improvement.
> 
> Actually, i'm a bit scared that sftp(1) uses it.  Then again, i'm not
> aware that it caused any major vulnerabilities in the past, and the
> OpenSSH developers are not at all reckless people, so i am sure they
> know what they are doing.

Coincidentally, I actually found two non-critical out-of-bounds memory
accesses in the NetBSD version of editline two years ago [1] [2].
Quickly checking the OpenBSD code, it seems the associated fixes haven't
made their way into the OpenBSD editline version yet. Not sure if/how
this is normally done but maybe it makes sense to (partially) sync the
OpenBSD version with the NetBSD one? The latter has accumulated a few
general improvements over the past years. Either way, I do understand
your resentment towards using editline in ksh.

Greetings,
Sören

[1]: 
https://github.com/NetBSD/src/commit/e93ca0319937d29fabb0a98abfdbef65a55dba9a
[2]: 
https://github.com/NetBSD/src/commit/f6dff047a3ee27c332a60cef9c76355093a4e05e



Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-27 Thread Theo de Raadt
> But i think editline(3) code quality is insufficent for use in a
> shell.  It's all quite messy and hastily and sloppily written.

I also mistrust it.  It may be tempting to use a library, but the shell
is quite standalone code, and benefits from internal purpose-built code.



Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-27 Thread Ingo Schwarze
Hi Soeren,

Soeren Tempel wrote on Mon, Jun 07, 2021 at 07:02:25PM +0200:

> Nice, wasn't aware that you also had a patch ready.

Yeah, that was due to the fact that we, as developers, often use
the lists but sometimes also send comments and patches privately,
to reduce the mail volume for everybody.  Deciding where to send
comments and patches works out well enough overall, but it does
occasionally cause slight communication gaps, like in this case.
I don't think any general rule can be designed to make it better
(at least not more specific than "think about who needs to see this
and act accordingly"), it's a matter of case-by-case judgement.

> Sounds good to me and also fixes the problem I originally experienced
> with 4 byte UTF-8 sequences.

Great, thanks for reporting and testing, it's now finally fixed.

> BTW: Is there any reason why ksh doesn't use editline for all its line
> editing needs? That would allow handling all these nitty-gritty details
> in a central place.

It would, and in principle, that would be an improvement.
But i think editline(3) code quality is insufficent for use in a
shell.  It's all quite messy and hastily and sloppily written.
I tried to polish some of it in the past, but got distracted,
so editline(3) code is still full of stuff that needs review
and quality improvement.

Actually, i'm a bit scared that sftp(1) uses it.  Then again, i'm not
aware that it caused any major vulnerabilities in the past, and the
OpenSSH developers are not at all reckless people, so i am sure they
know what they are doing.

Yours,
  Ingo



Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-27 Thread Ingo Schwarze
Hi Jeremie,

Jeremie Courreges-Anglas wrote on Thu, Jun 03, 2021 at 11:17:08PM +0200:
> On Wed, Jun 02 2021, Ingo Schwarze  wrote:

>> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
>> is less trivial than one might think, witnessed once again by the fact
>> that i got this code wrong in the first place.

> Thanks for this, though I was mostly interested into a pointer to the
> relevant standard document that was used.

Standards tend to be huge, the manual page is very compact, contains
all the relevant information and already points to the standard for
more details.  So i considered refering to the manual more helpful.

If you see a way to improve the STANDARDS section in the manual page,
you are very welcome.

> Now that the checks match the
> Unicode standard, I'm less confused.  You're now pointing readers to the
> utf8(7) manpage.  Discussion for another diff: should this manpage list
> valid sequences instead of invalid ones?

The idea is that it does both: first it show valid sequences (in the
first table).  Then it explains the constraints, and which bytes are
consequently valid, in the following text.

Finally, it turns the matter around and summarizes by also listing
invalid start bytes and invalid initial pairs.

Showing invalid rather than valid bytes and pairs in the final two
tables seems a good choice to me because there are much fewer invalid
than valid start bytes and initial pairs.

Does that make sense to you, or can you propose a specific diff to
improve the page?

> ok jca@

Thanks for checking the ksh/emacs.c diff and sorry for the ridiculous
delay, it is now finally committed.

Yours,
  Ingo



Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-08 Thread Nicholas Marriott
Looks good to me, ok nicm



On Wed, Jun 02, 2021 at 09:00:16PM +0200, Ingo Schwarze wrote:
> Hi,
> 
> feeling hesitant to commit into ksh without at least one proper OK,
> i'm resending this patch here, sorry if i missed private feedback.
> 
> What the existing code does:
> It tries to make sure that multi-byte UTF-8 characters get passed on by
> the shell without fragmentation, not one byte at time.  I heard people
> say that some software, for example tmux(1), may sometimes get confused
> when receiving a UTF-8 character in a piecemeal manner.
> 
> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
> 
> 
> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.
> 
> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readabilty - UTF-8-parsing is bad enough without
> needless micro-optimization, right?
> 
> 
> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3).  But maybe it improves something in tmux?  Not sure.
> 
> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
> 
> OK?
>   Ingo
> 
> 
> Index: emacs.c
> ===
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c   8 May 2020 14:30:42 -   1.87
> +++ emacs.c   13 May 2021 18:16:13 -
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
>   return -1;
>   buf[off++] = c;
>  
> - if (c == 0xf4)
> + /*
> +  * In the following, comments refer to violations of
> +  * the inequality tests at the ends of the lines.
> +  * See the utf8(7) manual page for details.
> +  */
> +
> + if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */
>   len = 4;
>   else if ((c & 0xf0) == 0xe0)
>   len = 3;
> - else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> + else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
>   len = 2;
>   else
>   len = 1;
> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>   if (cc == -1)
>   break;
>   if (isu8cont(cc) == 0 ||
> - (c == 0xe0 && len == 3 && cc < 0xa0) ||
> - (c == 0xed && len == 3 && cc & 0x20) ||
> - (c == 0xf4 && len == 4 && cc & 0x30)) {
> + (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
> + (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
> + (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
> + (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
>   x_e_ungetc(cc);
>   break;
>   }



Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-07 Thread ropers
Hiya,

@Ingo:
Sorry I have been out of touch.  I have arguably been out of sorts,
though hopefully not out of order in your book.

> Index: emacs.c
> ===
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c 8 May 2020 14:30:42 -   1.87
> +++ emacs.c 13 May 2021 18:16:13 -
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
> return -1;
> buf[off++] = c;
>
> -   if (c == 0xf4)
> +   /*
> +* In the following, comments refer to violations of
> +* the inequality tests at the ends of the lines.
> +* See the utf8(7) manual page for details.
> +*/
> +
> +   if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */

This threw me at first.  I didn't initially understand why a check
whether this is "beyond Unicode" needed that "(c & 0xf8) == 0xf0 &&"
part at all.

But I now think I got it:  I think I let the comment lead me astray,
because in truth, it does not just check whether that fifth most
significant bit is zero, it also checks if the leftmost nibble is 0xF.
Seeing though that the zero status of the fifth msb is checked twice
(the 0x8 nibble ensures it is 0, and the "c < 0xf5" check ensures it
is 0), would it be clearer to check that only once?  Like thus:

 if ((c & 0xf0) == 0xf0 && c < 0xf5)  /* 4B leading byte, not beyond Unicode */

I'm NOT saying your way is wrong; I'm just throwing this out there.


Also, suppose we get a byte here that IS beyond Unicode, would any
further handling of that be needed once we arrive at "len = 1;" (the
final else) below?

> len = 4;
> else if ((c & 0xf0) == 0xe0)
> len = 3;
> -   else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> +   else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
> len = 2;
> else
> len = 1;

^Here.

The way I read this, that's still unhandled for now, is it?

> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.

Or does that refer to other LEGAL 4-byte UTF-8?


> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>if (cc == -1)
>break;
>if (isu8cont(cc) == 0 ||
> -  (c == 0xe0 && len == 3 && cc < 0xa0) ||
> -  (c == 0xed && len == 3 && cc & 0x20) ||
> -  (c == 0xf4 && len == 4 && cc & 0x30)) {
> +  (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
> +  (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
> +  (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
> +  (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
>x_e_ungetc(cc);
>break;
>}

Whatever you ultimately choose there, please DO include your comments,
i.e. these:

/* use single byte */
/* use 2 bytes */
/* surrogates  */
/* use 3 bytes */
/* beyond Uni. */

...because those are actually helpful, especially for nincompoops like me.



On 07/06/2021, Sören Tempel  wrote:
>
> BTW: Is there any reason why ksh doesn't use editline for all its line
> editing needs? That would allow handling all these nitty-gritty details
> in a central place.
>
> Greetings,
> Sören

That might end up fixing a minor quality of life issue and might end
up obsoleting a long-delayed diff that I've let go stale because I've
been too noobish and daft to confidently complete it.  It worked in
principle last time I tried many moons ago, but I didn't fully
comprehend how and why, which is concerning.
The issue is to do with the fact that unlike ksh vi mode, ksh emacs
mode won't let you return to the same line with arrow down after
you've gone back into the past with arrow-up. (It's a BTTF bug.)
Editline could take care of all of that.
Whether that's a good reason to support your suggestion is not for me to say.
(But wait, there's more: I did research and compare lotsa related
things there, which yielded an iffy diff, but mainly a VERY verbose
text file with my notes and findings.)
Should I even try to rediscover what I had and maybe share it with
somebody, perhaps off-list?  (Caveat emptor; it may not be worth your
time, but YMMV.)


Thanks and regards,
Ian


PS: (This part is purely for shits and giggles.)

I'd long thought that actually, octal was a fine way to deal with
UTF-8, because it fits the distribution of bits somewhat more cleanly
than hexadecimal.

Here's why I thought so.  Sorry for the possibly confusing ad-hoc notation:


1 byte : Octal works for code points U+–U+007F b/c "that's all she wrote":
It's ok the left octal is 1 at most; it can't bleed into 

Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-07 Thread Sören Tempel
Ingo Schwarze  wrote:
> Hi,

Hello,

> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
> 
> 
> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.
> 
> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readabilty - UTF-8-parsing is bad enough without
> needless micro-optimization, right?

Nice, wasn't aware that you also had a patch ready. Sounds good to me
and also fixes the problem I originally experienced with 4 byte UTF-8
sequences.

> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3).  But maybe it improves something in tmux?  Not sure.

Character movements over emojis (e.g. U+1F421) are currently broken
because the ksh code doesn't correctly determine the amount of columns
needed for a given character (i.e. what you would normally do with
wcwidth). I tried fixing this but without wchar.h doing so seemed very
cumbersome. Inputting emojis works with your patch though and was broken
previously.

> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
> 
> OK?

Your diff looks good to me.

BTW: Is there any reason why ksh doesn't use editline for all its line
editing needs? That would allow handling all these nitty-gritty details
in a central place.

Greetings,
Sören



Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

2021-06-03 Thread Jeremie Courreges-Anglas
On Wed, Jun 02 2021, Ingo Schwarze  wrote:
> Hi,
>
> feeling hesitant to commit into ksh without at least one proper OK,
> i'm resending this patch here, sorry if i missed private feedback.

I found two mails in my drafts folder, sorry, you didn't miss any
feedback from me.

> What the existing code does:
> It tries to make sure that multi-byte UTF-8 characters get passed on by
> the shell without fragmentation, not one byte at time.  I heard people
> say that some software, for example tmux(1), may sometimes get confused
> when receiving a UTF-8 character in a piecemeal manner.
>
> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
>
>
> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.

Thanks for this, though I was mostly interested into a pointer to the
relevant standard document that was used.  Now that the checks match the
Unicode standard, I'm less confused.  You're now pointing readers to the
utf8(7) manpage.  Discussion for another diff: should this manpage list
valid sequences instead of invalid ones?

> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readabilty - UTF-8-parsing is bad enough without
> needless micro-optimization, right?

I don't see a reason to optimize this.  IMO the checks should look just
like in your proposal, which matches the table in section "Unicode
Encoding Forms" in UnicodeStandard-13.0.pdf.

> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3).  But maybe it improves something in tmux?  Not sure.
>
> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
>
> OK?

ok jca@

>   Ingo
>
> Index: emacs.c
> ===
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c   8 May 2020 14:30:42 -   1.87
> +++ emacs.c   13 May 2021 18:16:13 -
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
>   return -1;
>   buf[off++] = c;
>  
> - if (c == 0xf4)
> + /*
> +  * In the following, comments refer to violations of
> +  * the inequality tests at the ends of the lines.
> +  * See the utf8(7) manual page for details.
> +  */
> +
> + if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */
>   len = 4;
>   else if ((c & 0xf0) == 0xe0)
>   len = 3;
> - else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> + else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
>   len = 2;
>   else
>   len = 1;
> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>   if (cc == -1)
>   break;
>   if (isu8cont(cc) == 0 ||
> - (c == 0xe0 && len == 3 && cc < 0xa0) ||
> - (c == 0xed && len == 3 && cc & 0x20) ||
> - (c == 0xf4 && len == 4 && cc & 0x30)) {
> + (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
> + (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
> + (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
> + (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
>   x_e_ungetc(cc);
>   break;
>   }
>

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE