Hi Anton,

Anton Lindqvist wrote on Sun, Jun 04, 2017 at 11:09:35AM +0200:

> Although this discussion hasn't settled,

True.  I think nicm@ has convinced me that the shell *can* try to be
nicer towards terminals, without risking hangs, if done very carefully.
That's probably worth doing; it makes the system more robust.

The goal would be to send invalid sequences to the terminal only if
the user actually typed invalid input, but never to send invalid
intermediate states to the terminal while what the user types is
correct.

> here's a new diff trying to
> address the previously raised issues:
> 
> - The new function x_e_getu8() tries to read a complete UTF-8 character.
>   When a continuation byte is expected but not received, it resets its
>   state and retries.
> 
>   The fix to u8len() from Boudewijn Dijkstra is incorporated in this
>    new function, thanks!

No, no, no.  You are thanking him for really bad advice.
That function is still outrageously wrong.

> - In x_emacs(), In order to distinguish between when the line buffer
>   contains more than one character

I assume you mean "byte" here.

>   due to a UTF-8 character being read
>   or caused by calling x_e_getu8() repeatedly the number of calls to
>   x_e_getu8() is stored ntries. This is of importance to preserve the
>   existing behavior where none matched escape sequences should not be
>   inserted but instead ring the bell.

I didn't do full testing and review of your changes yet, but nothing
springs to my eye as being obviously wrong, except the UTF-8 parsing.

> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.67
> diff -u -p -r1.67 emacs.c
> --- emacs.c   12 May 2017 14:37:52 -0000      1.67
> +++ emacs.c   4 Jun 2017 09:06:54 -0000
[...]
> @@ -1891,6 +1894,49 @@ x_e_getc(void)
>               c = x_getc();
>  
>       return c;
> +}
> +
> +static int
> +x_e_getu8(char *buf, int off)
> +{
> +     int     c, len;
> +
> +again:
> +     c = x_e_getc();
> +     if (c == -1)
> +             return -1;
> +     buf[off++] = c;
> +
> +     switch (c & 0xF0) {

That is very WRONG.  You cannot identify a UTF-8 start byte by only
looking at its first half.  At the very least, you must also make
sure that bit 5 is not set.

> +     case 0xF0:
> +             len = 4;
> +             break;

That is very WRONG.  The bytes 0xf8 to 0xff must *not* result in
len = 4.  They are not UTF-8 start bytes, and not UTF-8 at all, so
they must result in len = 1.

Also, we do know that the last three bits can be 100 at most: the
only valid four-byte start bytes are 0xf0 to 0xf4.  If bit 7 or 8
were set in addition to bit 6, i.e. for 0xf5 to 0xf7, that would
take us beyond U+10FFFF (specifically, U+140000 to U+1FFFFF, which
are invalid).  So why not check bits 6 to 8 right away, too?
Whether 0xf0 starts an overlong sequence and whether 0xf4 strays
beyond U+10FFFF can only be decided once the second byte is known.
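
For illustration, here is a quick throwaway program (mine, not part
of the diff, and untested) that prints the range of code points each
candidate start byte 0xf0 to 0xf7 can possibly reach, assuming
well-formed continuation bytes:

#include <stdio.h>

int
main(void)
{
        int c;

        for (c = 0xf0; c <= 0xf7; c++) {
                /* 11110xxx 10yyyyyy 10yyyyyy 10yyyyyy */
                unsigned long lo = (unsigned long)(c & 0x07) << 18;
                unsigned long hi = lo | 0x3ffff;

                printf("0x%02x: U+%05lX to U+%05lX%s\n", c, lo, hi,
                    lo > 0x10ffff ? "  entirely beyond U+10FFFF" :
                    hi > 0x10ffff ? "  partly beyond U+10FFFF" : "");
        }
        return 0;
}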

> +     case 0xE0:
> +             len = 3;
> +             break;

Hmmm.
Let me think whether we can already catch surrogates at this point.

Valid low three-byte:   U+0800  11100000 10100000 10000000  0xe0a080
                        U+D000  11101101 10000000 10000000  0xed8080
                        U+D7FF  11101101 10011111 10111111  0xed9fbf
Invalid surrogates:     U+D800  11101101 10100000 10000000  0xeda080
                        U+DFFF  11101101 10111111 10111111  0xedbfbf
Valid high three-byte:  U+E000  11101110 10000000 10000000  0xee8080
                        U+FFFF  11101111 10111111 10111111  0xefbfbf

So the start byte 0xed includes both the valid range U+D000 to U+D7FF
and the invalid surrogate range U+D800 to U+DFFF.  The decisive bit
is the third bit of the *second* byte, so surrogates cannot be
excluded while only the first byte is known.
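
Spelled out as a check on the first two bytes, that observation might
look like this (a standalone demo, untested; starts_surrogate() is
just a scratch name for illustration):

#include <stdio.h>

/* Return 1 iff start byte c and second byte cc open a surrogate. */
static int
starts_surrogate(int c, int cc)
{
        return c == 0xed && (cc & 0xc0) == 0x80 && (cc & 0x20) != 0;
}

int
main(void)
{
        printf("0xed 0x9f: %d (still below U+D800, valid)\n",
            starts_surrogate(0xed, 0x9f));
        printf("0xed 0xa0: %d (U+D800 and up, surrogate)\n",
            starts_surrogate(0xed, 0xa0));
        printf("0xee 0x80: %d (U+E000 and up, valid)\n",
            starts_surrogate(0xee, 0x80));
        return 0;
}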

> +     case 0xC0:
> +     case 0xD0:
> +             len = 2;
> +             break;

That is also (somewhat) wrong.  Arguably, it would be better
for 0xc0 and 0xc1 to result in len = 1, because we already know
that there is no valid continuation.
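
The arithmetic behind that, for the record: even with the largest
possible continuation byte, 0xc0 and 0xc1 cannot encode anything
above U+007F, so any sequence they start is necessarily overlong
(another quick demo, not for the tree):

#include <stdio.h>

int
main(void)
{
        /* A two-byte sequence 110xxxxx 10yyyyyy decodes to xxxxxyyyyyy. */
        int cp = ((0xc1 & 0x1f) << 6) | (0xbf & 0x3f);

        printf("0xc1 0xbf decodes to U+%04X, which fits in one byte\n", cp);
        return 0;
}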

So, here is a better start byte parser (only partially tested):

        if ((c & 0xfc) == 0xf0 || c == 0xf4)
                len = 4;
        else if ((c & 0xf0) == 0xe0)
                len = 3;
        else if ((c & 0xe0) == 0xc0 && c > 0xc1)
                len = 2;
        else
                len = 1;
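
To see which bytes fall through to len = 1, the classification can be
dropped into a trivial test program; u8_startlen() is only a scratch
name, and the body is the same as above:

#include <stdio.h>

/* Sequence length implied by the start byte c (0 to 255). */
static int
u8_startlen(int c)
{
        int len;

        if ((c & 0xfc) == 0xf0 || c == 0xf4)
                len = 4;
        else if ((c & 0xf0) == 0xe0)
                len = 3;
        else if ((c & 0xe0) == 0xc0 && c > 0xc1)
                len = 2;
        else
                len = 1;
        return len;
}

int
main(void)
{
        int c;

        /* Non-ASCII bytes that do not start a multibyte sequence. */
        for (c = 0x80; c <= 0xff; c++)
                if (u8_startlen(c) == 1)
                        printf("0x%02x ", c);
        putchar('\n');
        return 0;
}

That should print exactly the continuation bytes 0x80 to 0xbf, the
overlong starters 0xc0 and 0xc1, and 0xf5 to 0xff, which is what we
want.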

> +     default:
> +             len = 1;
> +     }
> +
> +     for (; len > 1; len--) {
> +             c = x_e_getc();

Better use a different variable, say cc, such that we retain the
start byte for subsequent validation.

> +             if (c == -1)
> +                     return -1;
> +             if (!isu8cont(c)) {
> +                     off = 0;
> +                     x_e_ungetc(c);
> +                     x_error(0);
> +                     goto again;

That looks wrong.  At "again:", the first thing you do is x_e_getc(),
undoing the effect of the (reasonable) x_e_ungetc() you just did: the
pushed-back byte is read right back and parsed as a new start byte.

Also, off = 0 and x_error() seem wrong because that discards a byte.

Instead, wouldn't you have to return successfully, to process the
invalid byte just read?

> +             }

At this point, we might consider validating the second byte: it decides
whether a sequence after 0xe0 or 0xf0 is overlong, whether 0xed starts
a surrogate, and whether 0xf4 strays beyond U+10FFFF.

Something like (only partially tested):

        if (isu8cont(cc) == 0 ||
            (c == 0xe0 && len == 3 && cc < 0xa0) ||
            (c == 0xed && len == 3 && (cc & 0x20)) ||
            (c == 0xf0 && len == 4 && !(cc & 0x30)) ||
            (c == 0xf4 && len == 4 && (cc & 0x30))) {
                x_e_ungetc(cc);
                break;
        }
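
Putting both pieces together, the middle of the function might end up
looking roughly as follows.  This is a sketch only, written against
the helpers from your diff (x_e_getc(), x_e_ungetc(), isu8cont()),
not even compile-tested, and the parts of the function that I did not
quote above are meant to stay as they are in your diff:

static int
x_e_getu8(char *buf, int off)
{
        int c, cc, len;

        c = x_e_getc();
        if (c == -1)
                return -1;
        buf[off++] = c;

        if ((c & 0xfc) == 0xf0 || c == 0xf4)
                len = 4;
        else if ((c & 0xf0) == 0xe0)
                len = 3;
        else if ((c & 0xe0) == 0xc0 && c > 0xc1)
                len = 2;
        else
                len = 1;

        for (; len > 1; len--) {
                cc = x_e_getc();
                if (cc == -1)
                        return -1;
                if (isu8cont(cc) == 0 ||
                    (c == 0xe0 && len == 3 && cc < 0xa0) ||
                    (c == 0xed && len == 3 && (cc & 0x20)) ||
                    (c == 0xf0 && len == 4 && !(cc & 0x30)) ||
                    (c == 0xf4 && len == 4 && (cc & 0x30))) {
                        x_e_ungetc(cc);
                        break;
                }
                buf[off++] = cc;
        }

        /* ... whatever your diff does after the loop ... */
}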

Yours,
  Ingo
