Re: ksh(1): don't output invalid UTF-8 characters
For reference, I just committed the fix, see message below. Thanks to all who helped out. > CVSROOT: /cvs > Module name: src > Changes by: an...@cvs.openbsd.org 2017/06/25 02:51:53 > > Modified files: > bin/ksh: emacs.c > > Log message: > Don't output partial UTF-8 characters in ksh emacs mode. Instead, try to read > a > complete UTF-8 character first. Fixes an issue while running ksh in tmux where > UTF-8 characters inserted in columns other than the last one are discarded. > > With help from nicm@ and schwarze@ who also wrote the UTF-8 validation, > thanks! > > ok schwarze@
Re: ksh(1): don't output invalid UTF-8 characters
On Mon, Jun 05, 2017 at 10:46:49PM +0200, Ingo Schwarze wrote: > Hi Walter, > > Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 09:21:31PM +0200: > > On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote: > >> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200: > > > But this time I don't think you need a capture of the sequence. > > Well *you* obviously know what problem you have, but *I* don't > understand it, so please don't tell me that i understand when i > don't, and don't tell me that i don't need the details i need. In > general, when reporting bugs, do not guess which information might > or might not be needed, but provide what people trying to fix it > ask for. First, you're who is assuming wrong. I don't contact free software developers to ask them to solve *my* problems. Second, I didn't provided you more details now than the first time. The information I provided was enough to understand and to reproduce the bug. Finally, I'm doing this for free too. If you're not happy with the way I explain myself feel free to ignore me. Greetings.
Re: ksh(1): don't output invalid UTF-8 characters
Hi Walter, Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 09:21:31PM +0200: > On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote: >> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200: > But this time I don't think you need a capture of the sequence. Well *you* obviously know what problem you have, but *I* don't understand it, so please don't tell me that i understand when i don't, and don't tell me that i don't need the details i need. In general, when reporting bugs, do not guess which information might or might not be needed, but provide what people trying to fix it ask for. > Just use *any* > latin-1 character whose hex value is smaller than \xc0. > > To facilitate you the test, in xterm after setting "setxkbmap de": > > AltGr + Shift + 1 > > prints me the opening exclamation mark (\xa1) we also use in Spanish. Yes, i can confirm that. > In console The console is not UTF-8 capable but expects latin-1 encoding, which ksh doesn't support at all. So at the console, you cannot do command line editing if you type in non-ASCII characters, or all bets are off. > or a C xterm, type that merged among random ascii characters, then > move the cursor from the first to the last column passing over that > character. Assuming you're running current, see what happens. Everything is just fine, i don't see any problem. Wenn reporting a bug, it is not sufficient to say "see what happens"; you need to describe what you see that happens and what you would expect too happen instead. Oh, after playing around a bit, i started an xterm with $ export LC_CTYPE=C $ xterm +u8 # DON'T DO THAT on OpenBSD, ever Now if i type echo abcdef in that situation and move around, the notions ksh and xterm have of the cursor position get out of sync, and the display gets garbled. Is that what you mean? That is expected. In that situation, the mark is represented as 0xa1, which ksh treats as an invalid zero-width byte, while xterm treats it as a printable character in that mode. The problem here is not that ksh does something wrong. The problem is that i forced xterm into a mode using a character encoding that is not supported and cannot be used on OpenBSD at all. > Anyway, to be honest, these bugs don't hurt, you can live with them. > What I'm trying to say with these reports is I'm not truly convinced > utf8 support in console is a good idea. Oh, having the console grok UTF-8 (just like xterm) would be great, but its not trivial because it has tentacles into the kernel. > Another test you can do, this time in a utf-8 xterm: if you activate the > bell and go with the cursor to the end of the line it'll beep. More precisely, in emacs mode, if the curser is after the last character and i type Ctrl-F, it beeps, yes. In vi command mode, if the cursor is on the last character and i type "l", it beeps. > Now type some utf-8 character at the end and do the same, > it won't beep, because the cursor is in the first byte of the > utf-8 character, *it can't reach the real end of the line*. I don't see such a problem in emacs mode, but i do see that in vi mode. That's an admittedly small, but real bug... I should probably have a look how to fix that. > Nobody will die because this issue or the other above. My point > is utf8 will always be a mess. Yes, it is not likely to ever be perfect, but we can make it better (and more usable for simple practical tasks). > KEN, DO YOU HEAR ME?, IT WAS YOUR OWN CHILD, KEN! :-) To be honest, i doubt that it is possible to design a character encoding that is substantially better than UTF-8. Of course, it is easy to design other encodings of the same quality of UTF-8 (colour of the bikeshed), but none of those are used in practice. UTF-8 is optimal in a large number of ways and very elegant, and while it is not trivial to handle, it is as simple as possible. By contrast, Unicode has large numbers of problems and is not only not trivial, but so large that is very hard to get an adequate understanding of it. Some of those problems (for example, surrogates and the arbitrary U+10 limit) extend into UTF-8's domain, but those are Unicode problems, not UTF-8 problems. >>$ ./obj/edit < input.txt | hexdump -C > I've been wondering how to work with this. Thanks! Maybe somebody should write /usr/src/regress/bin/ksh/edit/edit.1, like we have /usr/src/regress/usr.bin/mandoc/db/dbm_dump/dbm_dump.1. >Encodings using more bytes than required are invalid. In particular, >1100 and 1101 are not valid start bytes, the byte after >1110 must be at least 1010, and the byte after must >be at least 1001. > > I don't understand the 'at least' assumptions. Some examples in which > the byte after 1110 is *smaller* than 1010: > > Euro sign: > 11100010 1010 10101100 > > Em dash: > 11100010 1000 10010100 > > Double quotes: > 11100010 1000 10011100 >
Re: ksh(1): don't output invalid UTF-8 characters
In article <20170605192131.ga60...@server.roquesor.com> you wrote: > >Encodings using more bytes than required are invalid. In particular, >1100 and 1101 are not valid start bytes, the byte after >1110 must be at least 1010, and the byte after must >be at least 1001. > Someone off list explained me this is true just for the exact 1110 byte. I'd assumed the man page was referring to any 1110 sequence. Now I understand.
Re: ksh(1): don't output invalid UTF-8 characters
On Mon, Jun 05, 2017 at 09:21:31PM +0200, Walter Alejandro Iglesias wrote: > > I wonder how plan9 handle utf8. > In general, by getting rid of TTYs and character-addressed interfaces almost entirely. Probably not the best fit for OpenBSD. khm
Re: ksh(1): don't output invalid UTF-8 characters
On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote: > Hi Walter, > > Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200: > > > report (I'm on chapter 2 of K :-)). I wish with time I'll learn how > > to do it. > > IIRC, you said you saw some undesirable behaviour with ksh input. > > I assume you have a sequence of key presses on your keyboard that > demonstrate the undesirable behaviour. To capture the sequence, > I will *study* all the indications you gave me. But this time I don't think you need a capture of the sequence. Just use *any* latin-1 character whose hex value is smaller than \xc0. To facilitate you the test, in xterm after setting "setxkbmap de": AltGr + Shift + 1 prints me the opening exclamation mark (\xa1) we also use in Spanish. In console or a C xterm, type that merged among random ascii characters, then move the cursor from the first to the last column passing over that character. Assuming you're running current, see what happens. Anyway, to be honest, these bugs don't hurt, you can live with them. What I'm trying to say with these reports is I'm not truly convinced utf8 support in console is a good idea. Another test you can do, this time in a utf-8 xterm: if you activate the bell and go with the cursor to the end of the line it'll beep. Now type some utf-8 character at the end and do the same, it won't beep, because the cursor is in the first byte of the utf-8 character, *it can't reach the real end of the line*. Nobody will die because this issue or the other above. My point is utf8 will always be a mess. KEN, DO YOU HEAR ME?, IT WAS YOUR OWN CHILD, KEN! :-) I wonder how plan9 handle utf8. [...] > > > For testing, go to the regress directory: > >$ cd /usr/src/regress/bin/ksh >$ cvs up -dP >$ cd edit >$ make obj >$ make cleandir >$ make regress >$ ./obj/edit < input.txt | hexdump -C > 24 20 78 79 08 c3 a9 79 08 0a |$ xy...y..| > 000a I've been wondering how to work with this. Thanks! [...] > > By the way, something the last paragraph of the new utf8(7) man page > > isn't clear enough (I mentioned this to tedu@). > > Which paragraph exactly, and what is unclear? Maybe we can fix it > quickly. As I told you, the _last_ one: Encodings using more bytes than required are invalid. In particular, 1100 and 1101 are not valid start bytes, the byte after 1110 must be at least 1010, and the byte after must be at least 1001. I don't understand the 'at least' assumptions. Some examples in which the byte after 1110 is *smaller* than 1010: Euro sign: 11100010 1010 10101100 Em dash: 11100010 1000 10010100 Double quotes: 11100010 1000 10011100 11100010 1000 10011101 You can find examples where the byte after 1110 is *grater* than 1010 here: http://www.utf8-chartable.de/ Thank you for your advices I'll study your whole message carefuly. > > Yours, > Ingo
Re: ksh(1): don't output invalid UTF-8 characters
Hi Walter, Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200: > I'm still not skilled enough to make a proper patch or a clear bug > report (I'm on chapter 2 of K :-)). I wish with time I'll learn how > to do it. IIRC, you said you saw some undesirable behaviour with ksh input. I assume you have a sequence of key presses on your keyboard that demonstrate the undesirable behaviour. To capture the sequence, 1. Type printf ' (without hitting enter) 2. Now type the sequence of characters, BUT - before any control character, press Ctrl-V - before any single quote, press a backslash 3. Type ' > input.txt and hit enter. You can look at the sequence with hexdump -C input.txt There should be at least one byte for each key pressed, sometimes more. If some key presses do not show up or the order is mixed up, you likely forgot the Ctrl-V before it. In that case, retry. If you want to save people who look at your report some work, you can craft a printf(1) statement that generates the same output as the above, but using only ASCII letters, and send that instead of the output of hexdump -C input.txt. Here is a simple example. Test insertion of a non-ASCII character between two ASCII characters. What i type on my keyboard: x y Ctrl-B e-accent-aigu What i type to capture the input: p r i n t f blank ' x y Ctrl-V Ctrl-B e-accent-aigu ' > i n p u t . t x t This can be used to create the same input file: $ printf 'xy\x02\xc3\xa9' > input2.txt $ cmp input.txt input2.txt $ echo $? 0 For testing, go to the regress directory: $ cd /usr/src/regress/bin/ksh $ cvs up -dP $ cd edit $ make obj $ make cleandir $ make regress $ ./obj/edit < input.txt | hexdump -C 24 20 78 79 08 c3 a9 79 08 0a |$ xy...y..| 000a Note that the above output is already an improved (uncommitted) version containing a private patch. Here is what the -current ksh does: $ PATH=/obin ./obj/edit < input.txt | hexdump -C 24 20 78 79 08 c3 79 08 08 c3 a9 79 08 0a |$ xy..yy..| 000e Also describe in words why you are typing that exact sequence, what you would like to happen, and how the actual events differ from that. Should be feasible, right? > I came to the ksh utf8 discussion because I've been playing > with some mail mime encoder just to learn C and recognizing > valid utf-8 was the first challenge I ecountered. In a normal non-threaded application program, you should probably use mbtowc(3) rather than coding your own UTF-8 parser. > The code pasted below is what I got so far in recognizing valid utf-8. That code looks confusing. I'm not studying it in detail right now because we already have at least two working UTF-8 decoders in the tree. One is /usr/src/lib/libc/citrus/citrus_utf8.c, function _citrus_utf8_ctype_mbrtowc(). It is relatively long, but quite easy to understand. Another one is /usr/src/usr.bin/mandoc/preconv.c, function preconv_encode(). It is substantially shorter, but admittedly harder to understand. This is OpenBSD. Read the fantastic source to learn something. > I'm showing it to make my point, I realize it isn't easy; and from my > poor C I'm not able to figure out how you can do such test byte by byte > while the user is typing at command line. (Don't bother in explaining > me how, I know this is not the place to take C lessons.) Well, it seems likely the something doing just that will be committed to ksh/emacs.c during the next week or so - not a full parser, but something showing the incremental nature. Then you can inspect it. If you report actual bugs or propose fixes, then developers are often willing to help with C issues related to it as well, even though teaching C is indeed not the point of this list. > By the way, something the last paragraph of the new utf8(7) man page > isn't clear enough (I mentioned this to tedu@). Which paragraph exactly, and what is unclear? Maybe we can fix it quickly. > #define YES 1 > #define NO0 Either just use 1 and 0 in the code or use . [...] > } Don't forget checking ferror(3) after falling out of the main loop. > return 0; > } Yours, Ingo
Re: ksh(1): don't output invalid UTF-8 characters
Just to applogize to developers here, I'm still not skilled enough to make a proper patch or a clear bug report (I'm on chapter 2 of K :-)). I wish with time I'll learn how to do it. I came to the ksh utf8 discussion because I've been playing with some mail mime encoder just to learn C and recognizing valid utf-8 was the first challenge I ecountered. The code pasted below is what I got so far in recognizing valid utf-8. I'm showing it to make my point, I realize it isn't easy; and from my poor C I'm not able to figure out how you can do such test byte by byte while the user is typing at command line. (Don't bother in explaining me how, I know this is not the place to take C lessons.) By the way, something the last paragraph of the new utf8(7) man page isn't clear enough (I mentioned this to tedu@). Thanks to all of you for your work. Now I know how hard it is. #include #define ASCII 0x7f #define YES 1 #define NO 0 int main() { int c, ch, wd, ln, col, isutf8; ch = wd = ln = col = 1; isutf8 = YES; while ((c = getchar()) != EOF) { if (c > ASCII) { if ((ch == 1 && (c < 0xc2 || c > 0xf7)) || ((ch > 1 && c <= 4) && ch <= wd && (c < 0x80 || c > 0xbf))) isutf8 = NO; /* 110. */ else if (ch == 1 && c >= 0xc2 && c <= 0xdf) { wd = 2; ++ch; /* 1110 */ } else if (ch == 1 && c >= 0xe0 && c <= 0xef) { wd = 3; ++ch; /* 0... */ } else if (ch == 1 && c >= 0xf0 && c <= 0xf7) { wd = 4; ++ch; } else if (ch > 1 && c <= 4 && ch == wd && c >= 0x80 && c <= 0xbf) ch = 1; else if (ch > 1 && c <= 4 && ch < wd && c >= 0x80 && c <= 0xbf) ++ch; else ++ch; } else if (ch > 1 && ch <= 4 && ch <= wd) isutf8= NO; else ch = 1; if (isutf8 == NO) { printf("Invalid utf-8 character"); printf(" at line %d col %d.\n", ln, col); return 1; } if (c == '\n') { col = 1; ++ln; } else ++col; } return 0; }
Re: ksh(1): don't output invalid UTF-8 characters
Hi Anton, Anton Lindqvist wrote on Sun, Jun 04, 2017 at 11:09:35AM +0200: > Although this discussion hasn't settled, True. I think nicm@ has convinced me that the shell *can* try to be nicer towards terminals, without risking hangs if done very carefully. Probably that's worth doing, it makes the system more robust. The goal would be to only send invalid sequences to the terminal if the user actually typed invalid input, but to not send invalid intermediate states to the terminal if what the user types is correct. > here's a new diff trying to > address the previously raised issues: > > - The new function x_e_getu8() tries to read a complete UTF-8 character. > When a continuation byte is expected but not received, it resets its > state and retries. > > The fix to u8len() from Boudewijn Dijkstra is incorporated in this >new function, thanks! No, no, no. You are thanking him for really bad advice. That function is still outrageously wrong. > - In x_emacs(), In order to distinguish between when the line buffer > contains more than one character I assume you mean "byte" here. > due to a UTF-8 character being read > or caused by calling x_e_getu8() repeatedly the number of calls to > x_e_getu8() is stored ntries. This is of importance to preserve the > existing behavior where none matched escape sequences should not be > inserted but instead ring the bell. I didn't do full testing and review of your changes yet, but nothing springs to my eye as being obviously wrong, except the UTF-8 parsing. > Index: emacs.c > === > RCS file: /cvs/src/bin/ksh/emacs.c,v > retrieving revision 1.67 > diff -u -p -r1.67 emacs.c > --- emacs.c 12 May 2017 14:37:52 - 1.67 > +++ emacs.c 4 Jun 2017 09:06:54 - [...] > @@ -1891,6 +1894,49 @@ x_e_getc(void) > c = x_getc(); > > return c; > +} > + > +static int > +x_e_getu8(char *buf, int off) > +{ > + int c, len; > + > +again: > + c = x_e_getc(); > + if (c == -1) > + return -1; > + buf[off++] = c; > + > + switch (c & 0xF0) { That is very WRONG. You cannot identify a UTF-8 start byte by only looking at its first half. At the very least, you must also make sure that bit 5 is not set. > + case 0xF0: > + len = 4; > + break; That is very WRONG. The bytes 0xf8 to 0xff must *not* result in len = 4. They are not UTF-8 start bytes, and not UTF-8 at all, so they must result in len = 1. Also, we do know that the last three bits must be 100. If bit 6 was zero, the codepoint would have to be encoded in a three-byte sequence instead. If bit 7 or 8 would be set in addition to bit 6, that would take us beyond U+10 (specifically, U+1C to U+1F, which are invalid). So why not check bits 6 to 8 right away, too? > + case 0xE0: > + len = 3; > + break; Hmmm. Let me think whether we can already catch surrogates at this point. Valid low three-byte: U+0800 1110 1010 1000 0xe0a080 U+D000 11101101 1000 1000 0xed8080 U+D7FF 11101101 1001 1011 0xed9fbf Invalid surrogates: U+D800 11101101 1010 1000 0xeda080 U+DFFF 11101101 1011 1011 0xedbfbf Valid high three-byte: U+E000 11101110 1000 1000 0xee8080 U+ 1110 1011 1011 0xefbfbf So the start byte 0xed includes both the valid range U+D000 to U+D7FF and the invalid surrogate range U+D800 to U+DFFF. The decisive bit is the third bit of the *second* byte, so surrogates cannot be excluded when only the first bit is known. > + case 0xC0: > + case 0xD0: > + len = 2; > + break; That is also (somewhat) wrong. Arguably, it would be better for 0xc0 and 0xc1 to result in len = 1, because we already know that there is no valid continuation. So, here is a better start byte parser (only partially tested): if (c == 0xf4) len = 4; else if ((c & 0xf0) == 0xe0) len = 3; else if ((c & 0xe0) == 0xc0 && c > 0xc1) len = 2; else len = 1; > + default: > + len = 1; > + } > + > + for (; len > 1; len--) { > + c = x_e_getc(); Better use a different variable, say cc, such that we retain the start byte for subsequent validation. > + if (c == -1) > + return -1; > + if (!isu8cont(c)) { > + off = 0; > + x_e_ungetc(c); > + x_error(0); > + goto again; That looks wrong. At "again:", the first thing you do is x_e_getc(), undoing the effect of the (reasonable) x_e_ungetc() you just did. That looks like an endless loop to me. Also, off = 0 and x_error() seem wrong because that
Re: ksh(1): don't output invalid UTF-8 characters
Hi, Although this discussion hasn't settled, here's a new diff trying to address the previously raised issues: - The new function x_e_getu8() tries to read a complete UTF-8 character. When a continuation byte is expected but not received, it resets its state and retries. The fix to u8len() from Boudewijn Dijkstra is incorporated in this new function, thanks! - In x_emacs(), In order to distinguish between when the line buffer contains more than one character due to a UTF-8 character being read or caused by calling x_e_getu8() repeatedly the number of calls to x_e_getu8() is stored ntries. This is of importance to preserve the existing behavior where none matched escape sequences should not be inserted but instead ring the bell. Index: emacs.c === RCS file: /cvs/src/bin/ksh/emacs.c,v retrieving revision 1.67 diff -u -p -r1.67 emacs.c --- emacs.c 12 May 2017 14:37:52 - 1.67 +++ emacs.c 4 Jun 2017 09:06:54 - @@ -135,6 +135,7 @@ static void x_push(int); static voidx_adjust(void); static voidx_e_ungetc(int); static int x_e_getc(void); +static int x_e_getu8(char *, int); static voidx_e_putc(int); static voidx_e_puts(const char *); static int x_comment(int); @@ -277,7 +278,7 @@ x_emacs(char *buf, size_t len) { struct kb_entry *k, *kmatch = NULL; charline[LINE + 1]; - int at = 0, submatch, ret, c; + int at = 0, ntries = 0, submatch, ret; const char *p; xbp = xbuf = buf; xend = buf + len; @@ -318,11 +319,9 @@ x_emacs(char *buf, size_t len) x_last_command = NULL; while (1) { x_flush(); - if ((c = x_e_getc()) < 0) + if ((at = x_e_getu8(line, at)) < 0) return 0; - - line[at++] = c; - line[at] = '\0'; + ntries++; if (x_arg == -1) { x_arg = 1; @@ -360,14 +359,18 @@ x_emacs(char *buf, size_t len) macro_args = kmatch->args; ret = KSTD; } else - ret = kmatch->ftab->xf_func(c); + ret = kmatch->ftab->xf_func(line[at - 1]); } else { if (submatch) continue; - if (at == 1) - ret = x_insert(c); - else - ret = x_error(c); /* not matched meta sequence */ + if (ntries > 1) { + ret = x_error(0); /* not matched meta sequence */ + } else if (at > 1) { + x_ins(line); + ret = KSTD; + } else { + ret = x_insert(line[0]); + } } switch (ret) { @@ -392,7 +395,7 @@ x_emacs(char *buf, size_t len) } /* reset meta sequence */ - at = 0; + at = ntries = 0; line[0] = '\0'; if (x_arg_set) x_arg_set = 0; /* reset args next time around */ @@ -1891,6 +1894,49 @@ x_e_getc(void) c = x_getc(); return c; +} + +static int +x_e_getu8(char *buf, int off) +{ + int c, len; + +again: + c = x_e_getc(); + if (c == -1) + return -1; + buf[off++] = c; + + switch (c & 0xF0) { + case 0xF0: + len = 4; + break; + case 0xE0: + len = 3; + break; + case 0xC0: + case 0xD0: + len = 2; + break; + default: + len = 1; + } + + for (; len > 1; len--) { + c = x_e_getc(); + if (c == -1) + return -1; + if (!isu8cont(c)) { + off = 0; + x_e_ungetc(c); + x_error(0); + goto again; + } + buf[off++] = c; + } + buf[off] = '\0'; + + return off; } static void
Re: ksh(1): don't output invalid UTF-8 characters
Op Fri, 19 May 2017 15:17:55 +0200 schreef Anton Lindqvist: On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote: On 19/05/17 03:42, Anton Lindqvist wrote: > > +static int > +u8len(unsigned char c) > +{ > + switch (c & 0xF0) { > + case 0xF0: > + return 4; > + case 0xE0: > + return 3; > + case 0xC0: > + return 2; > + default: > + return 1; > + } > +} > + This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater than U+0400) would be interpreted as being 1 character long instead of 2. Thanks for the heads-up. Maybe a more reliable solution would be to call mbtowc(3) repeatedly as new input arrives until it returns successfully. Assuming the first read byte is a UTF-8 start byte. Not needed. Only case 0xD0 is missing. case 0xC0: case 0xD0: return 2; -- Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/
Re: ksh(1): don't output invalid UTF-8 characters
Having a look at ksh, I don't see how Anton's original diff is much different from x_emacs() looping around x_e_getc() until it finishes a long key input? It would be better to stop reading early if an invalid UTF-8 byte is input rather than always requiring exactly N bytes; he needs to fix his u8len(); and I would want to test the other calls to x_e_getc(), such as in x_search_char_forw() -- but I the general idea seems reasonable. On Sat, May 20, 2017 at 12:04:35AM +0100, Nicholas Marriott wrote: > On Fri, May 19, 2017 at 09:29:06PM +0200, Ingo Schwarze wrote: > > On a side note, i don't think gnome-terminal and konsole are relevant. > > I never installed them before and did so now for the first time for > > testing, but they installed so many libraries that i feel uncomfortable > > and unsafe using them even briefly and purely for testing purposes. > > I will certainly pkg_delete them, and the libraries they pulled in, in > > the near future. > > You might not use them, and I don't use them, but they are very popular, > as are half a dozen others that may well have different behaviours. > > > > FWIW bash seems to do the replacement to U+FFFD itself before it sends > > > it to the terminal, which means it is (more) predictable. I don't know > > > if this is a sensible option for ksh. > > > > That is not really an option. This matters most while the multibyte > > character is being typed, when the first byte is already being > > processed but the second one not yet. Replacing the byte with > > U+FFFD, then later substituting the actual bytes back when all have > > arrived, doesn't really make much sense to me. > > It would help because terminals would know how to deal with the > character. > > So ksh is doing this now - if we pretend A,B,C are UTF-8 characters: > > A,cursor left,B,cursor left,C > > So first A appears, then B overwrites A, then C overwrites B. > > Except A and B are invalid so the terminal may not do the right thing > with them. It may draw no characters and not move the cursor (like > tmux), or draw two and move the cursor by two (like rxvt-unicode appears > to). > > If you sent U+FFFD instead of the invalid characters, that is a known > character and terminals should be guaranteed to do what ksh wants (move > the cursor back one position to the right), until ksh overwrites it > again. > > > Hanging the shell until all expected bytes have arrived seems like > > a bad idea, too. You can see the misbehaviour that is causing in > > programs like uniname(1) from the misc/uniutils package: > > I don't really understand this. How does ksh handle cursor keys? What if > when I press Left, the three bytes (\033OA) are split across two > separate read()s? How does this avoid hanging ksh?
Re: ksh(1): don't output invalid UTF-8 characters
On Fri, May 19, 2017 at 09:29:06PM +0200, Ingo Schwarze wrote: > On a side note, i don't think gnome-terminal and konsole are relevant. > I never installed them before and did so now for the first time for > testing, but they installed so many libraries that i feel uncomfortable > and unsafe using them even briefly and purely for testing purposes. > I will certainly pkg_delete them, and the libraries they pulled in, in > the near future. You might not use them, and I don't use them, but they are very popular, as are half a dozen others that may well have different behaviours. > > FWIW bash seems to do the replacement to U+FFFD itself before it sends > > it to the terminal, which means it is (more) predictable. I don't know > > if this is a sensible option for ksh. > > That is not really an option. This matters most while the multibyte > character is being typed, when the first byte is already being > processed but the second one not yet. Replacing the byte with > U+FFFD, then later substituting the actual bytes back when all have > arrived, doesn't really make much sense to me. It would help because terminals would know how to deal with the character. So ksh is doing this now - if we pretend A,B,C are UTF-8 characters: A,cursor left,B,cursor left,C So first A appears, then B overwrites A, then C overwrites B. Except A and B are invalid so the terminal may not do the right thing with them. It may draw no characters and not move the cursor (like tmux), or draw two and move the cursor by two (like rxvt-unicode appears to). If you sent U+FFFD instead of the invalid characters, that is a known character and terminals should be guaranteed to do what ksh wants (move the cursor back one position to the right), until ksh overwrites it again. > Hanging the shell until all expected bytes have arrived seems like > a bad idea, too. You can see the misbehaviour that is causing in > programs like uniname(1) from the misc/uniutils package: I don't really understand this. How does ksh handle cursor keys? What if when I press Left, the three bytes (\033OA) are split across two separate read()s? How does this avoid hanging ksh?
Re: ksh(1): don't output invalid UTF-8 characters
Hi On Fri, May 19, 2017 at 10:23:08PM +0200, Ingo Schwarze wrote: > Hi Nicholas, > > Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100: > > > Perhaps I haven't understood what you are saying correctly, > > What matters most is that sending an incomplete character > followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense > that it doesn't change the line being edited and that it doesn't > change the display. All terminals you mentioned seem to conform > to that according to my testing, except tmux. They don't all do the same thing for me. I am doing this, the same as ksh does: printf 'a\010\342a\010\010\342\211a\010\010\342\211\240a\010\n' And the terminals behave differently. > > but I don't think it is possible to send control characters or > > any other invalid UTF-8 bytes inside UTF-8 characters and safely > > predict what the terminal will do. How about these examples: > > If the invalid bytes are still present by the time the line is sent > off for processing (like in your example printf '\343\203\n'), then > it is indeed hard to predict what random terminals will do, though I don't know what you mean by "sent off for processing". I think you might be under some misapprehension. \010 (ASCII BS) is just a cursor positioning sequence, all it does is move the cursor one position to the left. \n the same, it moves the cursor one line down. The problem here is that ksh assumes a partial UTF-8 character (whether one byte or two) will also move the cursor by one position, but that is not always the case. Not for tmux, and - for me - not for other terminals. Are you thinking of ICANON? That is a kernel mechanism, and when it is in use, the backspace will never reach the tmux. But ksh doesn't appear to use it. > i would argue that xterm's behaviour is correct (print one substitution > glyph for each incomplete character, or if bytes don't form even > incomplete characters, then one for each such invalid byte. So you think the correct behaviour is to replace invalid sequences by U+FEFF? tmux could do that, but it won't fix ksh for other terminals. ksh is making an assumption about how terminals will behave that is not correct. > urxvt is clearly broken: > > $ printf '\343\203x\n' > > prints U+00E3 x linefeed; i have no idea what it does to garble > 0xe383 into 0xc3a3. Maybe some naive misparsing, or spewing out > incomplete parsing state in some inconsistent way. > > gnome-terminal and konsole print one replacement character for each > invalid byte, even if bytes form an incomplete character. Maybe > not outright wrong, but arguably a bit confusing. This doesn't sound like all the terminals doing the same thing. > So yeah, if lines containing incomplete sequences *when they are > sent off* misformat with tmux and gnome-terminal or konsole, i > wouldn't call that tmux'es fault, and i agree there is little that > can be done about it. There is no "sent off". What you send is what you get, invalid characters, backspaces, everything. > > > Having tmux ignore the whole lot seems like a relatively sensible > > course. > > Well, what tmux currently does is making sure that everything gets > broken in the maximum possible way on every terminal, even if the > line that is finally sent off is completely correct. > > > The only other alternative would be to substitute U+FFFD. > > Why not just pass the bytes through? I don't think it's the job > of a terminal mutiplexer to mess with individual bytes. It's the > job of the final terminal doing the display to select glyphs and > place them, for printable characters, for non-printable characters, > for incomplete characters, and for invalid bytes. No no, it is the job of tmux to interpret what it receives from the application and make sure it shows the same no matter what terminal is outside. If terminals outside behave differently for the same output, tmux can't keep its own internal state correct, so the display will be mangled. tmux has to understand everything it receives, there is almost no case where we can just pass through. > > > But that seems iffy too - U+FFFD is width 1, but what if the > > application is expecting a width 2? > > By definition, incomplete characters and invalid bytes don't > have a width, so it doesn't matter what the application wanted > (for example, which character the user intended to type but > didn't finish). What matters is how wide the replacement > glyph will look on the final terminal. In that respect, we > cannot help making an assumption, and "incomplete sequences > and invalid bytes are displayed as U+FFFD (i.e. width 1) > seems about the best we can do. That may be slightly off > for gnome-terminal and konsole, but i don't see how that can > be helped. > > Anyway, these subtleties of invalid bytes that *remain* are not > the main inconvenience in practice. What matters more is that > tmux breaks even if incomplete characters are deleted again > with backspaces and
Re: ksh(1): don't output invalid UTF-8 characters
Hi Nicholas, Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100: > Perhaps I haven't understood what you are saying correctly, What matters most is that sending an incomplete character followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense that it doesn't change the line being edited and that it doesn't change the display. All terminals you mentioned seem to conform to that according to my testing, except tmux. > but I don't think it is possible to send control characters or > any other invalid UTF-8 bytes inside UTF-8 characters and safely > predict what the terminal will do. How about these examples: If the invalid bytes are still present by the time the line is sent off for processing (like in your example printf '\343\203\n'), then it is indeed hard to predict what random terminals will do, though i would argue that xterm's behaviour is correct (print one substitution glyph for each incomplete character, or if bytes don't form even incomplete characters, then one for each such invalid byte. urxvt is clearly broken: $ printf '\343\203x\n' prints U+00E3 x linefeed; i have no idea what it does to garble 0xe383 into 0xc3a3. Maybe some naive misparsing, or spewing out incomplete parsing state in some inconsistent way. gnome-terminal and konsole print one replacement character for each invalid byte, even if bytes form an incomplete character. Maybe not outright wrong, but arguably a bit confusing. So yeah, if lines containing incomplete sequences *when they are sent off* misformat with tmux and gnome-terminal or konsole, i wouldn't call that tmux'es fault, and i agree there is little that can be done about it. > Having tmux ignore the whole lot seems like a relatively sensible > course. Well, what tmux currently does is making sure that everything gets broken in the maximum possible way on every terminal, even if the line that is finally sent off is completely correct. > The only other alternative would be to substitute U+FFFD. Why not just pass the bytes through? I don't think it's the job of a terminal mutiplexer to mess with individual bytes. It's the job of the final terminal doing the display to select glyphs and place them, for printable characters, for non-printable characters, for incomplete characters, and for invalid bytes. > But that seems iffy too - U+FFFD is width 1, but what if the > application is expecting a width 2? By definition, incomplete characters and invalid bytes don't have a width, so it doesn't matter what the application wanted (for example, which character the user intended to type but didn't finish). What matters is how wide the replacement glyph will look on the final terminal. In that respect, we cannot help making an assumption, and "incomplete sequences and invalid bytes are displayed as U+FFFD (i.e. width 1) seems about the best we can do. That may be slightly off for gnome-terminal and konsole, but i don't see how that can be helped. Anyway, these subtleties of invalid bytes that *remain* are not the main inconvenience in practice. What matters more is that tmux breaks even if incomplete characters are deleted again with backspaces and never sent off. Yours, Ingo
Re: ksh(1): don't output invalid UTF-8 characters
Hi Nicholas, Nicholas Marriott wrote on Fri, May 19, 2017 at 07:27:36PM +0100: > ksh has problems for me with Anton's example in several terminals, > not just in tmux. Mostly the cursor seems to end up one character > off rather than in the prompt, which is less visibly incorrect > perhaps, but still wrong. Can't reproduce. At least when having LC_CTYPE=en_US.UTF-8 set when starting the terminal, there is no problem for me with Anton's example in xterm, urxvt, gnome-terminal, or konsole. If i type echo xx then left-arrow, a width-one non-ASCII character, a dot, and hit enter, the dot appears at the right place both on the command line and in the output. Moving around with arrow-left or arrow-right after typing the non-ASCII character to insert the dot at a different place doesn't break anything for me either. The output is correct even with tmux, all that tmux garbles is the command line display. On a side note, i don't think gnome-terminal and konsole are relevant. I never installed them before and did so now for the first time for testing, but they installed so many libraries that i feel uncomfortable and unsafe using them even briefly and purely for testing purposes. I will certainly pkg_delete them, and the libraries they pulled in, in the near future. I can't believe anyone in their right mind would use a terminal depending on webkit, json, javascript, geoclue, pulseaudio, polkit, libssh and the like for any serious work (and on top of that, konsole is spewing huge amounts of scary error messages to stderr and/or stdout). > FWIW bash seems to do the replacement to U+FFFD itself before it sends > it to the terminal, which means it is (more) predictable. I don't know > if this is a sensible option for ksh. That is not really an option. This matters most while the multibyte character is being typed, when the first byte is already being processed but the second one not yet. Replacing the byte with U+FFFD, then later substituting the actual bytes back when all have arrived, doesn't really make much sense to me. Hanging the shell until all expected bytes have arrived seems like a bad idea, too. You can see the misbehaviour that is causing in programs like uniname(1) from the misc/uniutils package: $ printf '\377turd' | uniname Now the program hangs indefinitely. I wouldn't want the shell to hang in such a manner. Yours, Ingo
Re: ksh(1): don't output invalid UTF-8 characters
ksh has problems for me with Anton's example in several terminals, not just in tmux. Mostly the cursor seems to end up one character off rather than in the prompt, which is less visibly incorrect perhaps, but still wrong. I don't know that ksh will be able to predict this reliably (not uncommon with shells). FWIW bash seems to do the replacement to U+FFFD itself before it sends it to the terminal, which means it is (more) predictable. I don't know if this is a sensible option for ksh. On Fri, May 19, 2017 at 07:04:53PM +0100, Nicholas Marriott wrote: > Hi > > Perhaps I haven't understood what you are saying correctly, but I don't > think it is possible to send control characters or any other invalid > UTF-8 bytes inside UTF-8 characters and safely predict what the terminal > will do. How about these examples: > > printf '\343\203\010\217a\n' > printf '\343\203\r\217b\n' > printf '\343\203\n\217c\n' > > Or with a width 1 character: > > printf '\342\211\010\240\n' > printf '\342\211\r\240b\n' > printf '\342\211\n\240c\n' > > I have four UTF-8 capable X terminals here and between them they do > three different things (xterm, rxvt-unicode, gnome-terminal & konsole). > > tmux can't forward this stuff to the terminal outside, because if it > can't predict what that terminal will do, tmux's idea of the display > will get out of sync with reality. > > Having tmux ignore the whole lot seems like a relatively sensible > course. > > The only other alternative would be to substitute U+FFFD. But that seems > iffy too - U+FFFD is width 1, but what if the application is expecting > a width 2? > > > > > On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote: > > Hi Anton, > > > > Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200: > > > > > 1. Run ksh under tmux. > > > > > > 2. Input the following characters, without spaces: > > > > > >a (any character) ^B (backward-char) ?? (any UTF-8 character) > > > > > > 3. At this point, the prompt gets overwritten. > > > > > > Since ksh read a single byte of input, it will display a partial UTF-8 > > > character before the whole character has been read. This is especially > > > troublesome when the cursor is not placed at the end of the line. In the > > > scenario above, after reading the first byte of '??' the following > > > sequence will be displayed: > > > > > > 0xc3 0x61 0x08 > > > > > > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux > > > does the right thing here, since 0xc3 is a valid UTF-8 start byte it > > > expects it to be followed by a UTF-8 continuation byte which is not the > > > case. The two first bytes (0xc3, 0x61) are discarded and the parser is > > > reset to its initial state > > > > I call that a bug in tmux. At least for UTF-8, tmux should never reset > > its parser. Depending on the keyboard configuration, it may be possible > > to enter single non-UTF-8 bytes. For example, at the console, i can do > > this: > > > > - type "printf x" > > - press Ctrl-V, Ctrl-Alt-Z > > - type "printf x | hexdump -C" > > > > The result is: > > > > 78 9a 78 |x.x| > > 0003 > > > > All of the shell, the console, and xterm more or less handle such > > stunts, even though admittedly, that is not the usual way of typing > > in a binary file. tmux is a terminal emulator, so it ought to cope, > > too, and not reset its state. > > > > Note that this is yet another instance where the concept of arbitrary > > locales is utterly broken and insecure. On some non-OpenBSD system > > supporting such insecure stuff, if the user has set an arbitrary, > > non-UTF-8, state-dependent locale and somehow manages to insert an > > invalid byte into the input stream, the state of the input stream > > becomes invalid, no further characters can be read from that terminal, > > and *there is no way to recover*. The only remaining half-secure > > option is for the shell to exit, which may also not be very secure, > > on a different level. > > > > But with UTF-8, there is no problem whatsoever dealing with invalid > > bytes, so tmux ought to cope. > > > > > causing the backspace to be accepted and the > > > first character in the prompt to be overwritten. > > > > That's part of the reason why tmux must not reset its state: > > The shell won't know about it, and the screen will become garbled. > > > > > Below is diff that make sure to read a whole UTF-8 character in > > > x_emacs() prior doing another iteration of the main-loop > > > > I don't think we can do that. What if there is no next byte? > > Then the shell will hang until, maybe considerably later, the next > > character is typed. Also, we cannot rely on parsing the UTF-8 start > > byte (even if done correctly). It tells the byte length of the > > character only for valid byte sequences, and the byte sequence need > > not be valid. > > > > Besides, i think patching the shell to work around terminal bugs > > (or terminal emulator bugs)
Re: ksh(1): don't output invalid UTF-8 characters
Hi Perhaps I haven't understood what you are saying correctly, but I don't think it is possible to send control characters or any other invalid UTF-8 bytes inside UTF-8 characters and safely predict what the terminal will do. How about these examples: printf '\343\203\010\217a\n' printf '\343\203\r\217b\n' printf '\343\203\n\217c\n' Or with a width 1 character: printf '\342\211\010\240\n' printf '\342\211\r\240b\n' printf '\342\211\n\240c\n' I have four UTF-8 capable X terminals here and between them they do three different things (xterm, rxvt-unicode, gnome-terminal & konsole). tmux can't forward this stuff to the terminal outside, because if it can't predict what that terminal will do, tmux's idea of the display will get out of sync with reality. Having tmux ignore the whole lot seems like a relatively sensible course. The only other alternative would be to substitute U+FFFD. But that seems iffy too - U+FFFD is width 1, but what if the application is expecting a width 2? On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote: > Hi Anton, > > Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200: > > > 1. Run ksh under tmux. > > > > 2. Input the following characters, without spaces: > > > >a (any character) ^B (backward-char) ?? (any UTF-8 character) > > > > 3. At this point, the prompt gets overwritten. > > > > Since ksh read a single byte of input, it will display a partial UTF-8 > > character before the whole character has been read. This is especially > > troublesome when the cursor is not placed at the end of the line. In the > > scenario above, after reading the first byte of '??' the following > > sequence will be displayed: > > > > 0xc3 0x61 0x08 > > > > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux > > does the right thing here, since 0xc3 is a valid UTF-8 start byte it > > expects it to be followed by a UTF-8 continuation byte which is not the > > case. The two first bytes (0xc3, 0x61) are discarded and the parser is > > reset to its initial state > > I call that a bug in tmux. At least for UTF-8, tmux should never reset > its parser. Depending on the keyboard configuration, it may be possible > to enter single non-UTF-8 bytes. For example, at the console, i can do > this: > > - type "printf x" > - press Ctrl-V, Ctrl-Alt-Z > - type "printf x | hexdump -C" > > The result is: > > 78 9a 78 |x.x| > 0003 > > All of the shell, the console, and xterm more or less handle such > stunts, even though admittedly, that is not the usual way of typing > in a binary file. tmux is a terminal emulator, so it ought to cope, > too, and not reset its state. > > Note that this is yet another instance where the concept of arbitrary > locales is utterly broken and insecure. On some non-OpenBSD system > supporting such insecure stuff, if the user has set an arbitrary, > non-UTF-8, state-dependent locale and somehow manages to insert an > invalid byte into the input stream, the state of the input stream > becomes invalid, no further characters can be read from that terminal, > and *there is no way to recover*. The only remaining half-secure > option is for the shell to exit, which may also not be very secure, > on a different level. > > But with UTF-8, there is no problem whatsoever dealing with invalid > bytes, so tmux ought to cope. > > > causing the backspace to be accepted and the > > first character in the prompt to be overwritten. > > That's part of the reason why tmux must not reset its state: > The shell won't know about it, and the screen will become garbled. > > > Below is diff that make sure to read a whole UTF-8 character in > > x_emacs() prior doing another iteration of the main-loop > > I don't think we can do that. What if there is no next byte? > Then the shell will hang until, maybe considerably later, the next > character is typed. Also, we cannot rely on parsing the UTF-8 start > byte (even if done correctly). It tells the byte length of the > character only for valid byte sequences, and the byte sequence need > not be valid. > > Besides, i think patching the shell to work around terminal bugs > (or terminal emulator bugs) is the wrong approach in the first place. > > Yours, > Ingo >
Re: ksh(1): don't output invalid UTF-8 characters
Hi Anton, Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200: > 1. Run ksh under tmux. > > 2. Input the following characters, without spaces: > >a (any character) ^B (backward-char) ö (any UTF-8 character) > > 3. At this point, the prompt gets overwritten. > > Since ksh read a single byte of input, it will display a partial UTF-8 > character before the whole character has been read. This is especially > troublesome when the cursor is not placed at the end of the line. In the > scenario above, after reading the first byte of 'ö' the following > sequence will be displayed: > > 0xc3 0x61 0x08 > > That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux > does the right thing here, since 0xc3 is a valid UTF-8 start byte it > expects it to be followed by a UTF-8 continuation byte which is not the > case. The two first bytes (0xc3, 0x61) are discarded and the parser is > reset to its initial state I call that a bug in tmux. At least for UTF-8, tmux should never reset its parser. Depending on the keyboard configuration, it may be possible to enter single non-UTF-8 bytes. For example, at the console, i can do this: - type "printf x" - press Ctrl-V, Ctrl-Alt-Z - type "printf x | hexdump -C" The result is: 78 9a 78 |x.x| 0003 All of the shell, the console, and xterm more or less handle such stunts, even though admittedly, that is not the usual way of typing in a binary file. tmux is a terminal emulator, so it ought to cope, too, and not reset its state. Note that this is yet another instance where the concept of arbitrary locales is utterly broken and insecure. On some non-OpenBSD system supporting such insecure stuff, if the user has set an arbitrary, non-UTF-8, state-dependent locale and somehow manages to insert an invalid byte into the input stream, the state of the input stream becomes invalid, no further characters can be read from that terminal, and *there is no way to recover*. The only remaining half-secure option is for the shell to exit, which may also not be very secure, on a different level. But with UTF-8, there is no problem whatsoever dealing with invalid bytes, so tmux ought to cope. > causing the backspace to be accepted and the > first character in the prompt to be overwritten. That's part of the reason why tmux must not reset its state: The shell won't know about it, and the screen will become garbled. > Below is diff that make sure to read a whole UTF-8 character in > x_emacs() prior doing another iteration of the main-loop I don't think we can do that. What if there is no next byte? Then the shell will hang until, maybe considerably later, the next character is typed. Also, we cannot rely on parsing the UTF-8 start byte (even if done correctly). It tells the byte length of the character only for valid byte sequences, and the byte sequence need not be valid. Besides, i think patching the shell to work around terminal bugs (or terminal emulator bugs) is the wrong approach in the first place. Yours, Ingo
Re: ksh(1): don't output invalid UTF-8 characters
On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote: > Hi, > > On 19/05/17 03:42, Anton Lindqvist wrote: > > Hi, > > I did submit this problem[1] earlier but with an incomplete analysis and > > fix. Here's a second attempt. > > > > This does only occur when running ksh with emacs mode under tmux. How to > > re-produce: > > > > 1. Run ksh under tmux. > > > > 2. Input the following characters, without spaces: > > > >a (any character) ^B (backward-char) ö (any UTF-8 character) > > > > 3. At this point, the prompt gets overwritten. > > > > Since ksh read a single byte of input, it will display a partial UTF-8 > > character before the whole character has been read. This is especially > > troublesome when the cursor is not placed at the end of the line. In the > > scenario above, after reading the first byte of 'ö' the following > > sequence will be displayed: > > > > 0xc3 0x61 0x08 > > > > That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux > > does the right thing here, since 0xc3 is a valid UTF-8 start byte it > > expects it to be followed by a UTF-8 continuation byte which is not the > > case. The two first bytes (0xc3, 0x61) are discarded and the parser is > > reset to its initial state causing the backspace to be accepted and the > > first character in the prompt to be overwritten. > > > > After the second byte of 'ö' (0xb6) is read by ksh, the following > > sequence will be displayed: > > > >0x08 0xc3 0xb6 0x61 0x08 > > > > That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since > > ksh assumes the cursor is correctly positioned it displays a leading > > backspace in order to move passed the first character. This is however > > not true causing another character in the prompt to be overwritten. > > > > Below is diff that make sure to read a whole UTF-8 character in > > x_emacs() prior doing another iteration of the main-loop which solves > > the problem. It does not validate UTF-8 input but instead assumes every > > such character is valid. > > > > Comments and feedback are much appreciated. > > > > [1] http://marc.info/?l=openbsd-misc=148509346310901=2 > > > > Index: emacs.c > > === > > RCS file: /cvs/src/bin/ksh/emacs.c,v > > retrieving revision 1.67 > > diff -u -p -r1.67 emacs.c > > --- emacs.c 12 May 2017 14:37:52 - 1.67 > > +++ emacs.c 14 May 2017 08:21:26 - > > @@ -98,6 +98,7 @@ static intx_col; > > static int x_displen; > > static int x_arg; /* general purpose arg */ > > static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */ > > +static int x_getc_again; > > > > static int xlp_valid; > > /* end from 4.9 edit.h } */ > > @@ -142,6 +143,7 @@ static int x_fold_case(int); > > static char*x_lastcp(void); > > static voiddo_complete(int, Comp_type); > > static int isu8cont(unsigned char); > > +static int u8len(unsigned char); > > > > /* proto's for keybindings */ > > static int x_abort(int); > > @@ -272,6 +274,21 @@ isu8cont(unsigned char c) > > return (c & (0x80 | 0x40)) == 0x80; > > } > > > > +static int > > +u8len(unsigned char c) > > +{ > > + switch (c & 0xF0) { > > + case 0xF0: > > + return 4; > > + case 0xE0: > > + return 3; > > + case 0xC0: > > + return 2; > > + default: > > + return 1; > > + } > > +} > > + > > This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater > than U+0400) would be interpreted as being 1 character long instead of 2. Thanks for the heads-up. Maybe a more reliable solution would be to call mbtowc(3) repeatedly as new input arrives until it returns successfully. Assuming the first read byte is a UTF-8 start byte.
Re: ksh(1): don't output invalid UTF-8 characters
Hi, On 19/05/17 03:42, Anton Lindqvist wrote: > Hi, > I did submit this problem[1] earlier but with an incomplete analysis and > fix. Here's a second attempt. > > This does only occur when running ksh with emacs mode under tmux. How to > re-produce: > > 1. Run ksh under tmux. > > 2. Input the following characters, without spaces: > >a (any character) ^B (backward-char) ö (any UTF-8 character) > > 3. At this point, the prompt gets overwritten. > > Since ksh read a single byte of input, it will display a partial UTF-8 > character before the whole character has been read. This is especially > troublesome when the cursor is not placed at the end of the line. In the > scenario above, after reading the first byte of 'ö' the following > sequence will be displayed: > > 0xc3 0x61 0x08 > > That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux > does the right thing here, since 0xc3 is a valid UTF-8 start byte it > expects it to be followed by a UTF-8 continuation byte which is not the > case. The two first bytes (0xc3, 0x61) are discarded and the parser is > reset to its initial state causing the backspace to be accepted and the > first character in the prompt to be overwritten. > > After the second byte of 'ö' (0xb6) is read by ksh, the following > sequence will be displayed: > >0x08 0xc3 0xb6 0x61 0x08 > > That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since > ksh assumes the cursor is correctly positioned it displays a leading > backspace in order to move passed the first character. This is however > not true causing another character in the prompt to be overwritten. > > Below is diff that make sure to read a whole UTF-8 character in > x_emacs() prior doing another iteration of the main-loop which solves > the problem. It does not validate UTF-8 input but instead assumes every > such character is valid. > > Comments and feedback are much appreciated. > > [1] http://marc.info/?l=openbsd-misc=148509346310901=2 > > Index: emacs.c > === > RCS file: /cvs/src/bin/ksh/emacs.c,v > retrieving revision 1.67 > diff -u -p -r1.67 emacs.c > --- emacs.c 12 May 2017 14:37:52 - 1.67 > +++ emacs.c 14 May 2017 08:21:26 - > @@ -98,6 +98,7 @@ static int x_col; > static int x_displen; > static int x_arg; /* general purpose arg */ > static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */ > +static int x_getc_again; > > static int xlp_valid; > /* end from 4.9 edit.h } */ > @@ -142,6 +143,7 @@ static intx_fold_case(int); > static char *x_lastcp(void); > static void do_complete(int, Comp_type); > static int isu8cont(unsigned char); > +static int u8len(unsigned char); > > /* proto's for keybindings */ > static int x_abort(int); > @@ -272,6 +274,21 @@ isu8cont(unsigned char c) > return (c & (0x80 | 0x40)) == 0x80; > } > > +static int > +u8len(unsigned char c) > +{ > + switch (c & 0xF0) { > + case 0xF0: > + return 4; > + case 0xE0: > + return 3; > + case 0xC0: > + return 2; > + default: > + return 1; > + } > +} > + This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater than U+0400) would be interpreted as being 1 character long instead of 2. > int > x_emacs(char *buf, size_t len) > { > @@ -318,10 +335,12 @@ x_emacs(char *buf, size_t len) > x_last_command = NULL; > while (1) { > x_flush(); > - if ((c = x_e_getc()) < 0) > - return 0; > + do { > + if ((c = x_e_getc()) < 0) > + return 0; > > - line[at++] = c; > + line[at++] = c; > + } while (x_getc_again > 0); > line[at] = '\0'; > > if (x_arg == -1) { > @@ -364,7 +383,10 @@ x_emacs(char *buf, size_t len) > } else { > if (submatch) > continue; > - if (at == 1) > + if (at > 1) { > + x_ins(line); > + ret = KSTD; > + } else if (at == 1) > ret = x_insert(c); > else > ret = x_error(c); /* not matched meta sequence > */ > @@ -1887,8 +1909,12 @@ x_e_getc(void) > macro_args = NULL; > c = x_getc(); > } > - } else > + } else { > c = x_getc(); > + if (x_getc_again == 0) > + x_getc_again = u8len(c); > + x_getc_again--; > + } > > return c; > } >