Hi Anton, Anton Lindqvist wrote on Sun, Jun 04, 2017 at 11:09:35AM +0200:
> Although this discussion hasn't settled, True. I think nicm@ has convinced me that the shell *can* try to be nicer towards terminals, without risking hangs if done very carefully. Probably that's worth doing, it makes the system more robust. The goal would be to only send invalid sequences to the terminal if the user actually typed invalid input, but to not send invalid intermediate states to the terminal if what the user types is correct. > here's a new diff trying to > address the previously raised issues: > > - The new function x_e_getu8() tries to read a complete UTF-8 character. > When a continuation byte is expected but not received, it resets its > state and retries. > > The fix to u8len() from Boudewijn Dijkstra is incorporated in this > new function, thanks! No, no, no. You are thanking him for really bad advice. That function is still outrageously wrong. > - In x_emacs(), In order to distinguish between when the line buffer > contains more than one character I assume you mean "byte" here. > due to a UTF-8 character being read > or caused by calling x_e_getu8() repeatedly the number of calls to > x_e_getu8() is stored ntries. This is of importance to preserve the > existing behavior where none matched escape sequences should not be > inserted but instead ring the bell. I didn't do full testing and review of your changes yet, but nothing springs to my eye as being obviously wrong, except the UTF-8 parsing. > Index: emacs.c > =================================================================== > RCS file: /cvs/src/bin/ksh/emacs.c,v > retrieving revision 1.67 > diff -u -p -r1.67 emacs.c > --- emacs.c 12 May 2017 14:37:52 -0000 1.67 > +++ emacs.c 4 Jun 2017 09:06:54 -0000 [...] > @@ -1891,6 +1894,49 @@ x_e_getc(void) > c = x_getc(); > > return c; > +} > + > +static int > +x_e_getu8(char *buf, int off) > +{ > + int c, len; > + > +again: > + c = x_e_getc(); > + if (c == -1) > + return -1; > + buf[off++] = c; > + > + switch (c & 0xF0) { That is very WRONG. You cannot identify a UTF-8 start byte by only looking at its first half. At the very least, you must also make sure that bit 5 is not set. > + case 0xF0: > + len = 4; > + break; That is very WRONG. The bytes 0xf8 to 0xff must *not* result in len = 4. They are not UTF-8 start bytes, and not UTF-8 at all, so they must result in len = 1. Also, we do know that the last three bits must be 100. If bit 6 was zero, the codepoint would have to be encoded in a three-byte sequence instead. If bit 7 or 8 would be set in addition to bit 6, that would take us beyond U+10FFFF (specifically, U+1C0000 to U+1FFFFF, which are invalid). So why not check bits 6 to 8 right away, too? > + case 0xE0: > + len = 3; > + break; Hmmm. Let me think whether we can already catch surrogates at this point. Valid low three-byte: U+0800 11100000 10100000 10000000 0xe0a080 U+D000 11101101 10000000 10000000 0xed8080 U+D7FF 11101101 10011111 10111111 0xed9fbf Invalid surrogates: U+D800 11101101 10100000 10000000 0xeda080 U+DFFF 11101101 10111111 10111111 0xedbfbf Valid high three-byte: U+E000 11101110 10000000 10000000 0xee8080 U+FFFF 11101111 10111111 10111111 0xefbfbf So the start byte 0xed includes both the valid range U+D000 to U+D7FF and the invalid surrogate range U+D800 to U+DFFF. The decisive bit is the third bit of the *second* byte, so surrogates cannot be excluded when only the first bit is known. > + case 0xC0: > + case 0xD0: > + len = 2; > + break; That is also (somewhat) wrong. Arguably, it would be better for 0xc0 and 0xc1 to result in len = 1, because we already know that there is no valid continuation. So, here is a better start byte parser (only partially tested): if (c == 0xf4) len = 4; else if ((c & 0xf0) == 0xe0) len = 3; else if ((c & 0xe0) == 0xc0 && c > 0xc1) len = 2; else len = 1; > + default: > + len = 1; > + } > + > + for (; len > 1; len--) { > + c = x_e_getc(); Better use a different variable, say cc, such that we retain the start byte for subsequent validation. > + if (c == -1) > + return -1; > + if (!isu8cont(c)) { > + off = 0; > + x_e_ungetc(c); > + x_error(0); > + goto again; That looks wrong. At "again:", the first thing you do is x_e_getc(), undoing the effect of the (reasonable) x_e_ungetc() you just did. That looks like an endless loop to me. Also, off = 0 and x_error() seem wrong because that discards a byte. Instead, wouldn't you have to return successfully, to process the invalid byte just read? > + } At this point, we might consider validating the second byte. Something like (only partially tested): if (isu8cont(cc) == 0 || (c == 0xe0 && len == 3 && cc < 0xa0) || (c == 0xed && len == 3 && cc & 0x20) || (c == 0xf4 && len == 4 && cc & 0x30)) { x_e_ungetc(cc); break; } Yours, Ingo