Re: ksh(1): don't output invalid UTF-8 characters

2017-06-25 Thread Anton Lindqvist
For reference, I just committed the fix, see message below. Thanks to
all who helped out.

> CVSROOT:  /cvs
> Module name:  src
> Changes by:   an...@cvs.openbsd.org   2017/06/25 02:51:53
>
> Modified files:
>   bin/ksh: emacs.c 
>
> Log message:
> Don't output partial UTF-8 characters in ksh emacs mode. Instead, try to read 
> a
> complete UTF-8 character first. Fixes an issue while running ksh in tmux where
> UTF-8 characters inserted in columns other than the last one are discarded.
>
> With help from nicm@ and schwarze@ who also wrote the UTF-8 validation, 
> thanks!
>
> ok schwarze@



Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Walter Alejandro Iglesias
On Mon, Jun 05, 2017 at 10:46:49PM +0200, Ingo Schwarze wrote:
> Hi Walter,
> 
> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 09:21:31PM +0200:
> > On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote:
> >> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:
> 
> > But this time I don't think you need a capture of the sequence.
> 
> Well *you* obviously know what problem you have, but *I* don't
> understand it, so please don't tell me that i understand when i
> don't, and don't tell me that i don't need the details i need.  In
> general, when reporting bugs, do not guess which information might
> or might not be needed, but provide what people trying to fix it
> ask for.

First, you're who is assuming wrong.  I don't contact free software
developers to ask them to solve *my* problems.

Second, I didn't provided you more details now than the first time.  The
information I provided was enough to understand and to reproduce the
bug.

Finally, I'm doing this for free too.  If you're not happy with the way I
explain myself feel free to ignore me.


Greetings.




Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Ingo Schwarze
Hi Walter,

Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 09:21:31PM +0200:
> On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote:
>> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:

> But this time I don't think you need a capture of the sequence.

Well *you* obviously know what problem you have, but *I* don't
understand it, so please don't tell me that i understand when i
don't, and don't tell me that i don't need the details i need.  In
general, when reporting bugs, do not guess which information might
or might not be needed, but provide what people trying to fix it
ask for.

> Just use *any*
> latin-1 character whose hex value is smaller than \xc0.
> 
> To facilitate you the test, in xterm after setting "setxkbmap de":
> 
>   AltGr + Shift + 1
> 
> prints me the opening exclamation mark (\xa1) we also use in Spanish.

Yes, i can confirm that.

> In console

The console is not UTF-8 capable but expects latin-1 encoding,
which ksh doesn't support at all.  So at the console, you cannot
do command line editing if you type in non-ASCII characters,
or all bets are off.

> or a C xterm, type that merged among random ascii characters, then
> move the cursor from the first to the last column passing over that
> character.  Assuming you're running current, see what happens.

Everything is just fine, i don't see any problem.  Wenn reporting a
bug, it is not sufficient to say "see what happens"; you need to
describe what you see that happens and what you would expect too
happen instead.

Oh, after playing around a bit, i started an xterm with

   $ export LC_CTYPE=C
   $ xterm +u8   # DON'T DO THAT on OpenBSD, ever

Now if i type

   echo abcdef

in that situation and move around, the notions ksh and xterm have
of the cursor position get out of sync, and the display gets
garbled.  Is that what you mean?  That is expected.

In that situation, the mark is represented as 0xa1, which ksh
treats as an invalid zero-width byte, while xterm treats it as
a printable character in that mode.

The problem here is not that ksh does something wrong.  The problem
is that i forced xterm into a mode using a character encoding that
is not supported and cannot be used on OpenBSD at all.

> Anyway, to be honest, these bugs don't hurt, you can live with them.
> What I'm trying to say with these reports is I'm not truly convinced
> utf8 support in console is a good idea.

Oh, having the console grok UTF-8 (just like xterm) would be great,
but its not trivial because it has tentacles into the kernel.

> Another test you can do, this time in a utf-8 xterm: if you activate the
> bell and go with the cursor to the end of the line it'll beep.

More precisely, in emacs mode, if the curser is after the last
character and i type Ctrl-F, it beeps, yes.  In vi command mode,
if the cursor is on the last character and i type "l", it beeps.

> Now type some utf-8 character at the end and do the same,
> it won't beep, because the cursor is in the first byte of the
> utf-8 character, *it can't reach the real end of the line*.

I don't see such a problem in emacs mode, but i do see that in
vi mode.  That's an admittedly small, but real bug...
I should probably have a look how to fix that.

> Nobody will die because this issue or the other above.  My point
> is utf8 will always be a mess.

Yes, it is not likely to ever be perfect, but we can make it
better (and more usable for simple practical tasks).

> KEN, DO YOU HEAR ME?, IT WAS YOUR OWN CHILD, KEN! :-)

To be honest, i doubt that it is possible to design a character
encoding that is substantially better than UTF-8.  Of course, it
is easy to design other encodings of the same quality of UTF-8
(colour of the bikeshed), but none of those are used in practice.
UTF-8 is optimal in a large number of ways and very elegant, and
while it is not trivial to handle, it is as simple as possible.

By contrast, Unicode has large numbers of problems and is not
only not trivial, but so large that is very hard to get an adequate
understanding of it.  Some of those problems (for example, surrogates
and the arbitrary U+10 limit) extend into UTF-8's domain, but
those are Unicode problems, not UTF-8 problems.

>>$ ./obj/edit < input.txt | hexdump -C

> I've been wondering how to work with this.  Thanks!

Maybe somebody should write /usr/src/regress/bin/ksh/edit/edit.1,
like we have /usr/src/regress/usr.bin/mandoc/db/dbm_dump/dbm_dump.1.

>Encodings using more bytes than required are invalid.  In particular,
>1100 and 1101 are not valid start bytes, the byte after
>1110 must be at least 1010, and the byte after  must
>be at least 1001.
> 
> I don't understand the 'at least' assumptions.  Some examples in which
> the byte after 1110 is *smaller* than 1010:
> 
> Euro sign:
> 11100010 1010 10101100
> 
> Em dash:
> 11100010 1000 10010100
> 
> Double quotes:
> 11100010 1000 10011100
> 

Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Walter Alejandro Iglesias
In article <20170605192131.ga60...@server.roquesor.com> you wrote:
> 
>Encodings using more bytes than required are invalid.  In particular,
>1100 and 1101 are not valid start bytes, the byte after
>1110 must be at least 1010, and the byte after  must
>be at least 1001.
> 

Someone off list explained me this is true just for the exact 1110
byte.  I'd assumed the man page was referring to any 1110 sequence.
Now I understand.



Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Kurt H Maier
On Mon, Jun 05, 2017 at 09:21:31PM +0200, Walter Alejandro Iglesias wrote:
> 
> I wonder how plan9 handle utf8.
> 

In general, by getting rid of TTYs and character-addressed interfaces
almost entirely.  Probably not the best fit for OpenBSD.

khm



Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Walter Alejandro Iglesias
On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote:
> Hi Walter,
> 
> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:
> 
> > report (I'm on chapter 2 of K :-)).  I wish with time I'll learn how
> > to do it.
> 
> IIRC, you said you saw some undesirable behaviour with ksh input.
> 
> I assume you have a sequence of key presses on your keyboard that
> demonstrate the undesirable behaviour.  To capture the sequence,
>

I will *study* all the indications you gave me.  But this time
I don't think you need a capture of the sequence.  Just use *any*
latin-1 character whose hex value is smaller than \xc0.

To facilitate you the test, in xterm after setting "setxkbmap de":

  AltGr + Shift + 1

prints me the opening exclamation mark (\xa1) we also use in Spanish.
In console or a C xterm, type that merged among random ascii characters,
then move the cursor from the first to the last column passing over that
character.  Assuming you're running current, see what happens.


Anyway, to be honest, these bugs don't hurt, you can live with them.
What I'm trying to say with these reports is I'm not truly convinced
utf8 support in console is a good idea.

Another test you can do, this time in a utf-8 xterm: if you activate the
bell and go with the cursor to the end of the line it'll beep.  Now type
some utf-8 character at the end and do the same, it won't beep, because
the cursor is in the first byte of the utf-8 character, *it can't reach
the real end of the line*.  Nobody will die because this issue or the
other above.  My point is utf8 will always be a mess.  KEN, DO YOU HEAR
ME?, IT WAS YOUR OWN CHILD, KEN! :-)

I wonder how plan9 handle utf8.


[...]

>
> 
> For testing, go to the regress directory:
> 
>$ cd /usr/src/regress/bin/ksh
>$ cvs up -dP
>$ cd edit
>$ make obj
>$ make cleandir
>$ make regress
>$ ./obj/edit < input.txt | hexdump -C
>     24 20 78 79 08 c3 a9 79  08 0a   |$ xy...y..|
>   000a


I've been wondering how to work with this.  Thanks!


[...]

> > By the way, something the last paragraph of the new utf8(7) man page
> > isn't clear enough (I mentioned this to tedu@).
> 
> Which paragraph exactly, and what is unclear?  Maybe we can fix it
> quickly.

As I told you, the _last_ one:

   Encodings using more bytes than required are invalid.  In particular,
   1100 and 1101 are not valid start bytes, the byte after
   1110 must be at least 1010, and the byte after  must
   be at least 1001.

I don't understand the 'at least' assumptions.  Some examples in which
the byte after 1110 is *smaller* than 1010:

Euro sign:
11100010 1010 10101100

Em dash:
11100010 1000 10010100

Double quotes:
11100010 1000 10011100
11100010 1000 10011101

You can find examples where the byte after 1110 is *grater* than
1010 here:

http://www.utf8-chartable.de/



Thank you for your advices I'll study your whole message carefuly.


> 
> Yours,
>   Ingo



Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Ingo Schwarze
Hi Walter,

Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:

> I'm still not skilled enough to make a proper patch or a clear bug
> report (I'm on chapter 2 of K :-)).  I wish with time I'll learn how
> to do it.

IIRC, you said you saw some undesirable behaviour with ksh input.

I assume you have a sequence of key presses on your keyboard that
demonstrate the undesirable behaviour.  To capture the sequence,

 1. Type

printf '

(without hitting enter)

 2. Now type the sequence of characters, BUT
 - before any control character, press Ctrl-V
 - before any single quote, press a backslash

 3. Type

' > input.txt

and hit enter.

You can look at the sequence with

  hexdump -C input.txt

There should be at least one byte for each key pressed, sometimes
more.  If some key presses do not show up or the order is mixed up,
you likely forgot the Ctrl-V before it.  In that case, retry.

If you want to save people who look at your report some work,
you can craft a printf(1) statement that generates the same
output as the above, but using only ASCII letters, and send
that instead of the output of hexdump -C input.txt.

Here is a simple example.
Test insertion of a non-ASCII character between two ASCII characters.

What i type on my keyboard:

  x y Ctrl-B e-accent-aigu

What i type to capture the input:

  p r i n t f blank ' x y Ctrl-V Ctrl-B e-accent-aigu ' > i n p u t . t x t

This can be used to create the same input file:

   $ printf 'xy\x02\xc3\xa9' > input2.txt
   $ cmp input.txt input2.txt 
   $ echo $?
  0

For testing, go to the regress directory:

   $ cd /usr/src/regress/bin/ksh
   $ cvs up -dP
   $ cd edit
   $ make obj
   $ make cleandir
   $ make regress
   $ ./obj/edit < input.txt | hexdump -C
    24 20 78 79 08 c3 a9 79  08 0a   |$ xy...y..|
  000a

Note that the above output is already an improved (uncommitted)
version containing a private patch.

Here is what the -current ksh does:

   $ PATH=/obin ./obj/edit < input.txt | hexdump -C
    24 20 78 79 08 c3 79 08  08 c3 a9 79 08 0a   |$ xy..yy..|
  000e

Also describe in words why you are typing that exact sequence,
what you would like to happen, and how the actual events
differ from that.

Should be feasible, right?


> I came to the ksh utf8 discussion because I've been playing
> with some mail mime encoder just to learn C and recognizing
> valid utf-8 was the first challenge I ecountered.

In a normal non-threaded application program, you should probably
use mbtowc(3) rather than coding your own UTF-8 parser.

> The code pasted below is what I got so far in recognizing valid utf-8.

That code looks confusing.  I'm not studying it in detail right now
because we already have at least two working UTF-8 decoders in the
tree.

One is /usr/src/lib/libc/citrus/citrus_utf8.c,
function _citrus_utf8_ctype_mbrtowc().  It is relatively long,
but quite easy to understand.

Another one is /usr/src/usr.bin/mandoc/preconv.c,
function preconv_encode().  It is substantially shorter,
but admittedly harder to understand.

This is OpenBSD.  Read the fantastic source to learn something.

> I'm showing it to make my point, I realize it isn't easy; and from my
> poor C I'm not able to figure out how you can do such test byte by byte
> while the user is typing at command line.  (Don't bother in explaining
> me how, I know this is not the place to take C lessons.)

Well, it seems likely the something doing just that will be committed
to ksh/emacs.c during the next week or so - not a full parser, but
something showing the incremental nature.  Then you can inspect it.

If you report actual bugs or propose fixes, then developers are often
willing to help with C issues related to it as well, even though
teaching C is indeed not the point of this list.

> By the way, something the last paragraph of the new utf8(7) man page
> isn't clear enough (I mentioned this to tedu@).

Which paragraph exactly, and what is unclear?  Maybe we can fix it
quickly.

> #define YES   1
> #define NO0

Either just use 1 and 0 in the code or use .

[...]
>   }

Don't forget checking ferror(3) after falling out of the main loop.

>   return 0;
> }

Yours,
  Ingo



Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Walter Alejandro Iglesias
Just to applogize to developers here,

I'm still not skilled enough to make a proper patch or a clear bug
report (I'm on chapter 2 of K :-)).  I wish with time I'll learn how
to do it.  I came to the ksh utf8 discussion because I've been playing
with some mail mime encoder just to learn C and recognizing valid utf-8
was the first challenge I ecountered.

The code pasted below is what I got so far in recognizing valid utf-8.
I'm showing it to make my point, I realize it isn't easy; and from my
poor C I'm not able to figure out how you can do such test byte by byte
while the user is typing at command line.  (Don't bother in explaining
me how, I know this is not the place to take C lessons.)

By the way, something the last paragraph of the new utf8(7) man page
isn't clear enough (I mentioned this to tedu@).

Thanks to all of you for your work.  Now I know how hard it is.


#include 

#define ASCII   0x7f
#define YES 1
#define NO  0

int
main()
{
int c, ch, wd, ln, col, isutf8;

ch = wd = ln = col = 1;
isutf8 = YES;

while ((c = getchar()) != EOF) {
if (c > ASCII) {
if ((ch == 1 && (c < 0xc2 || c > 0xf7)) ||
((ch > 1 && c <= 4) &&
ch <= wd && (c < 0x80 || c > 0xbf)))
isutf8 = NO;
/* 110. */
else if (ch == 1 && c >= 0xc2 && c <= 0xdf) {
wd = 2;
++ch;
/* 1110 */
} else if (ch == 1 && c >= 0xe0 && c <= 0xef) {
wd = 3;
++ch;
/* 0... */
} else if (ch == 1 && c >= 0xf0 && c <= 0xf7) {
wd = 4;
++ch;
} else if (ch > 1 && c <= 4 &&
ch == wd && c >= 0x80 && c <= 0xbf)
ch = 1;
else if (ch > 1 && c <= 4 &&
ch < wd && c >= 0x80 && c <= 0xbf)
++ch;
else
++ch;
} else if (ch > 1 && ch <= 4 && ch <= wd)
isutf8= NO;
else
ch = 1;

if (isutf8 == NO) {
printf("Invalid utf-8 character");
printf(" at line %d col %d.\n", ln, col);
return 1;
}
if (c == '\n') {
col = 1;
++ln;
} else
++col;
}

return 0;
}



Re: ksh(1): don't output invalid UTF-8 characters

2017-06-05 Thread Ingo Schwarze
Hi Anton,

Anton Lindqvist wrote on Sun, Jun 04, 2017 at 11:09:35AM +0200:

> Although this discussion hasn't settled,

True.  I think nicm@ has convinced me that the shell *can* try to be
nicer towards terminals, without risking hangs if done very carefully.
Probably that's worth doing, it makes the system more robust.

The goal would be to only send invalid sequences to the terminal if
the user actually typed invalid input, but to not send invalid
intermediate states to the terminal if what the user types is
correct.

> here's a new diff trying to
> address the previously raised issues:
> 
> - The new function x_e_getu8() tries to read a complete UTF-8 character.
>   When a continuation byte is expected but not received, it resets its
>   state and retries.
> 
>   The fix to u8len() from Boudewijn Dijkstra is incorporated in this
>new function, thanks!

No, no, no.  You are thanking him for really bad advice.
That function is still outrageously wrong.

> - In x_emacs(), In order to distinguish between when the line buffer
>   contains more than one character

I assume you mean "byte" here.

>   due to a UTF-8 character being read
>   or caused by calling x_e_getu8() repeatedly the number of calls to
>   x_e_getu8() is stored ntries. This is of importance to preserve the
>   existing behavior where none matched escape sequences should not be
>   inserted but instead ring the bell.

I didn't do full testing and review of your changes yet, but nothing
springs to my eye as being obviously wrong, except the UTF-8 parsing.

> Index: emacs.c
> ===
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.67
> diff -u -p -r1.67 emacs.c
> --- emacs.c   12 May 2017 14:37:52 -  1.67
> +++ emacs.c   4 Jun 2017 09:06:54 -
[...]
> @@ -1891,6 +1894,49 @@ x_e_getc(void)
>   c = x_getc();
>  
>   return c;
> +}
> +
> +static int
> +x_e_getu8(char *buf, int off)
> +{
> + int c, len;
> +
> +again:
> + c = x_e_getc();
> + if (c == -1)
> + return -1;
> + buf[off++] = c;
> +
> + switch (c & 0xF0) {

That is very WRONG.  You cannot identify a UTF-8 start byte by only
looking at its first half.  At the very least, you must also make
sure that bit 5 is not set.

> + case 0xF0:
> + len = 4;
> + break;

That is very WRONG.  The bytes 0xf8 to 0xff must *not* result in
len = 4.  They are not UTF-8 start bytes, and not UTF-8 at all, so
they must result in len = 1.

Also, we do know that the last three bits must be 100.  If bit 6
was zero, the codepoint would have to be encoded in a three-byte
sequence instead.  If bit 7 or 8 would be set in addition to bit
6, that would take us beyond U+10 (specifically, U+1C to
U+1F, which are invalid).  So why not check bits 6 to 8 right
away, too?

> + case 0xE0:
> + len = 3;
> + break;

Hmmm.
Let me think whether we can already catch surrogates at this point.

Valid low three-byte:   U+0800  1110 1010 1000  0xe0a080
U+D000  11101101 1000 1000  0xed8080
U+D7FF  11101101 1001 1011  0xed9fbf
Invalid surrogates: U+D800  11101101 1010 1000  0xeda080
U+DFFF  11101101 1011 1011  0xedbfbf
Valid high three-byte:  U+E000  11101110 1000 1000  0xee8080
U+  1110 1011 1011  0xefbfbf

So the start byte 0xed includes both the valid range U+D000 to U+D7FF
and the invalid surrogate range U+D800 to U+DFFF.  The decisive bit
is the third bit of the *second* byte, so surrogates cannot be
excluded when only the first bit is known.

> + case 0xC0:
> + case 0xD0:
> + len = 2;
> + break;

That is also (somewhat) wrong.  Arguably, it would be better
for 0xc0 and 0xc1 to result in len = 1, because we already know
that there is no valid continuation.

So, here is a better start byte parser (only partially tested):

if (c == 0xf4)
len = 4;
else if ((c & 0xf0) == 0xe0)
len = 3;
else if ((c & 0xe0) == 0xc0 && c > 0xc1)
len = 2;
else
len = 1;

> + default:
> + len = 1;
> + }
> +
> + for (; len > 1; len--) {
> + c = x_e_getc();

Better use a different variable, say cc, such that we retain the
start byte for subsequent validation.

> + if (c == -1)
> + return -1;
> + if (!isu8cont(c)) {
> + off = 0;
> + x_e_ungetc(c);
> + x_error(0);
> + goto again;

That looks wrong.  At "again:", the first thing you do is x_e_getc(),
undoing the effect of the (reasonable) x_e_ungetc() you just did.
That looks like an endless loop to me.

Also, off = 0 and x_error() seem wrong because that 

Re: ksh(1): don't output invalid UTF-8 characters

2017-06-04 Thread Anton Lindqvist
Hi,
Although this discussion hasn't settled, here's a new diff trying to
address the previously raised issues:

- The new function x_e_getu8() tries to read a complete UTF-8 character.
  When a continuation byte is expected but not received, it resets its
  state and retries.

  The fix to u8len() from Boudewijn Dijkstra is incorporated in this new
  function, thanks!

- In x_emacs(), In order to distinguish between when the line buffer
  contains more than one character due to a UTF-8 character being read
  or caused by calling x_e_getu8() repeatedly the number of calls to
  x_e_getu8() is stored ntries. This is of importance to preserve the
  existing behavior where none matched escape sequences should not be
  inserted but instead ring the bell.

Index: emacs.c
===
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.67
diff -u -p -r1.67 emacs.c
--- emacs.c 12 May 2017 14:37:52 -  1.67
+++ emacs.c 4 Jun 2017 09:06:54 -
@@ -135,6 +135,7 @@ static void x_push(int);
 static voidx_adjust(void);
 static voidx_e_ungetc(int);
 static int x_e_getc(void);
+static int x_e_getu8(char *, int);
 static voidx_e_putc(int);
 static voidx_e_puts(const char *);
 static int x_comment(int);
@@ -277,7 +278,7 @@ x_emacs(char *buf, size_t len)
 {
struct kb_entry *k, *kmatch = NULL;
charline[LINE + 1];
-   int at = 0, submatch, ret, c;
+   int at = 0, ntries = 0, submatch, ret;
const char  *p;
 
xbp = xbuf = buf; xend = buf + len;
@@ -318,11 +319,9 @@ x_emacs(char *buf, size_t len)
x_last_command = NULL;
while (1) {
x_flush();
-   if ((c = x_e_getc()) < 0)
+   if ((at = x_e_getu8(line, at)) < 0)
return 0;
-
-   line[at++] = c;
-   line[at] = '\0';
+   ntries++;
 
if (x_arg == -1) {
x_arg = 1;
@@ -360,14 +359,18 @@ x_emacs(char *buf, size_t len)
macro_args = kmatch->args;
ret = KSTD;
} else
-   ret = kmatch->ftab->xf_func(c);
+   ret = kmatch->ftab->xf_func(line[at - 1]);
} else {
if (submatch)
continue;
-   if (at == 1)
-   ret = x_insert(c);
-   else
-   ret = x_error(c); /* not matched meta sequence 
*/
+   if (ntries > 1) {
+   ret = x_error(0); /* not matched meta sequence 
*/
+   } else if (at > 1) {
+   x_ins(line);
+   ret = KSTD;
+   } else {
+   ret = x_insert(line[0]);
+   }
}
 
switch (ret) {
@@ -392,7 +395,7 @@ x_emacs(char *buf, size_t len)
}
 
/* reset meta sequence */
-   at = 0;
+   at = ntries = 0;
line[0] = '\0';
if (x_arg_set)
x_arg_set = 0; /* reset args next time around */
@@ -1891,6 +1894,49 @@ x_e_getc(void)
c = x_getc();
 
return c;
+}
+
+static int
+x_e_getu8(char *buf, int off)
+{
+   int c, len;
+
+again:
+   c = x_e_getc();
+   if (c == -1)
+   return -1;
+   buf[off++] = c;
+
+   switch (c & 0xF0) {
+   case 0xF0:
+   len = 4;
+   break;
+   case 0xE0:
+   len = 3;
+   break;
+   case 0xC0:
+   case 0xD0:
+   len = 2;
+   break;
+   default:
+   len = 1;
+   }
+
+   for (; len > 1; len--) {
+   c = x_e_getc();
+   if (c == -1)
+   return -1;
+   if (!isu8cont(c)) {
+   off = 0;
+   x_e_ungetc(c);
+   x_error(0);
+   goto again;
+   }
+   buf[off++] = c;
+   }
+   buf[off] = '\0';
+
+   return off;
 }
 
 static void



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-22 Thread Boudewijn Dijkstra
Op Fri, 19 May 2017 15:17:55 +0200 schreef Anton Lindqvist  
:

On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote:

On 19/05/17 03:42, Anton Lindqvist wrote:
>
> +static int
> +u8len(unsigned char c)
> +{
> +  switch (c & 0xF0) {
> +  case 0xF0:
> +  return 4;
> +  case 0xE0:
> +  return 3;
> +  case 0xC0:
> +  return 2;
> +  default:
> +  return 1;
> +  }
> +}
> +

This is wrong: most codepoints in the range U+0080-U+07ff (the ones  
greater than U+0400) would be interpreted as being 1 character long  
instead of 2.


Thanks for the heads-up. Maybe a more reliable solution would be to call
mbtowc(3) repeatedly as new input arrives until it returns successfully.
Assuming the first read byte is a UTF-8 start byte.


Not needed. Only case 0xD0 is missing.

case 0xC0: case 0xD0:
 return 2;



--
Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Nicholas Marriott
Having a look at ksh, I don't see how Anton's original diff is much
different from x_emacs() looping around x_e_getc() until it finishes a
long key input?

It would be better to stop reading early if an invalid UTF-8 byte is
input rather than always requiring exactly N bytes; he needs to fix his
u8len(); and I would want to test the other calls to x_e_getc(), such as
in x_search_char_forw() -- but I the general idea seems reasonable.



On Sat, May 20, 2017 at 12:04:35AM +0100, Nicholas Marriott wrote:
> On Fri, May 19, 2017 at 09:29:06PM +0200, Ingo Schwarze wrote:
> > On a side note, i don't think gnome-terminal and konsole are relevant.
> > I never installed them before and did so now for the first time for
> > testing, but they installed so many libraries that i feel uncomfortable
> > and unsafe using them even briefly and purely for testing purposes.
> > I will certainly pkg_delete them, and the libraries they pulled in, in
> > the near future.
> 
> You might not use them, and I don't use them, but they are very popular,
> as are half a dozen others that may well have different behaviours.
> 
> > > FWIW bash seems to do the replacement to U+FFFD itself before it sends
> > > it to the terminal, which means it is (more) predictable. I don't know
> > > if this is a sensible option for ksh.
> > 
> > That is not really an option.  This matters most while the multibyte
> > character is being typed, when the first byte is already being
> > processed but the second one not yet.  Replacing the byte with
> > U+FFFD, then later substituting the actual bytes back when all have
> > arrived, doesn't really make much sense to me.
> 
> It would help because terminals would know how to deal with the
> character.
> 
> So ksh is doing this now - if we pretend A,B,C are UTF-8 characters:
> 
> A,cursor left,B,cursor left,C
> 
> So first A appears, then B overwrites A, then C overwrites B.
> 
> Except A and B are invalid so the terminal may not do the right thing
> with them. It may draw no characters and not move the cursor (like
> tmux), or draw two and move the cursor by two (like rxvt-unicode appears
> to).
> 
> If you sent U+FFFD instead of the invalid characters, that is a known
> character and terminals should be guaranteed to do what ksh wants (move
> the cursor back one position to the right), until ksh overwrites it
> again.
> 
> > Hanging the shell until all expected bytes have arrived seems like
> > a bad idea, too.  You can see the misbehaviour that is causing in
> > programs like uniname(1) from the misc/uniutils package:
> 
> I don't really understand this. How does ksh handle cursor keys? What if
> when I press Left, the three bytes (\033OA) are split across two
> separate read()s? How does this avoid hanging ksh?



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Nicholas Marriott
On Fri, May 19, 2017 at 09:29:06PM +0200, Ingo Schwarze wrote:
> On a side note, i don't think gnome-terminal and konsole are relevant.
> I never installed them before and did so now for the first time for
> testing, but they installed so many libraries that i feel uncomfortable
> and unsafe using them even briefly and purely for testing purposes.
> I will certainly pkg_delete them, and the libraries they pulled in, in
> the near future.

You might not use them, and I don't use them, but they are very popular,
as are half a dozen others that may well have different behaviours.

> > FWIW bash seems to do the replacement to U+FFFD itself before it sends
> > it to the terminal, which means it is (more) predictable. I don't know
> > if this is a sensible option for ksh.
> 
> That is not really an option.  This matters most while the multibyte
> character is being typed, when the first byte is already being
> processed but the second one not yet.  Replacing the byte with
> U+FFFD, then later substituting the actual bytes back when all have
> arrived, doesn't really make much sense to me.

It would help because terminals would know how to deal with the
character.

So ksh is doing this now - if we pretend A,B,C are UTF-8 characters:

A,cursor left,B,cursor left,C

So first A appears, then B overwrites A, then C overwrites B.

Except A and B are invalid so the terminal may not do the right thing
with them. It may draw no characters and not move the cursor (like
tmux), or draw two and move the cursor by two (like rxvt-unicode appears
to).

If you sent U+FFFD instead of the invalid characters, that is a known
character and terminals should be guaranteed to do what ksh wants (move
the cursor back one position to the right), until ksh overwrites it
again.

> Hanging the shell until all expected bytes have arrived seems like
> a bad idea, too.  You can see the misbehaviour that is causing in
> programs like uniname(1) from the misc/uniutils package:

I don't really understand this. How does ksh handle cursor keys? What if
when I press Left, the three bytes (\033OA) are split across two
separate read()s? How does this avoid hanging ksh?



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Nicholas Marriott
Hi

On Fri, May 19, 2017 at 10:23:08PM +0200, Ingo Schwarze wrote:
> Hi Nicholas,
> 
> Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100:
> 
> > Perhaps I haven't understood what you are saying correctly,
> 
> What matters most is that sending an incomplete character
> followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
> that it doesn't change the line being edited and that it doesn't
> change the display.  All terminals you mentioned seem to conform
> to that according to my testing, except tmux.

They don't all do the same thing for me. I am doing this, the same as
ksh does:

printf 'a\010\342a\010\010\342\211a\010\010\342\211\240a\010\n'

And the terminals behave differently.

> > but I don't think it is possible to send control characters or
> > any other invalid UTF-8 bytes inside UTF-8 characters and safely
> > predict what the terminal will do. How about these examples:
> 
> If the invalid bytes are still present by the time the line is sent
> off for processing (like in your example printf '\343\203\n'), then
> it is indeed hard to predict what random terminals will do, though

I don't know what you mean by "sent off for processing".

I think you might be under some misapprehension.

\010 (ASCII BS) is just a cursor positioning sequence, all it does is
move the cursor one position to the left. \n the same, it moves the
cursor one line down.

The problem here is that ksh assumes a partial UTF-8 character (whether
one byte or two) will also move the cursor by one position, but that is
not always the case. Not for tmux, and - for me - not for other
terminals.

Are you thinking of ICANON? That is a kernel mechanism, and when it is
in use, the backspace will never reach the tmux. But ksh doesn't appear
to use it.

> i would argue that xterm's behaviour is correct (print one substitution
> glyph for each incomplete character, or if bytes don't form even
> incomplete characters, then one for each such invalid byte.

So you think the correct behaviour is to replace invalid sequences by
U+FEFF?

tmux could do that, but it won't fix ksh for other terminals. ksh is
making an assumption about how terminals will behave that is not
correct.

> urxvt is clearly broken:
> 
>  $ printf '\343\203x\n'
> 
> prints U+00E3 x linefeed; i have no idea what it does to garble
> 0xe383 into 0xc3a3.  Maybe some naive misparsing, or spewing out
> incomplete parsing state in some inconsistent way.
> 
> gnome-terminal and konsole print one replacement character for each
> invalid byte, even if bytes form an incomplete character.  Maybe
> not outright wrong, but arguably a bit confusing.

This doesn't sound like all the terminals doing the same thing.

> So yeah, if lines containing incomplete sequences *when they are
> sent off* misformat with tmux and gnome-terminal or konsole, i
> wouldn't call that tmux'es fault, and i agree there is little that
> can be done about it.

There is no "sent off". What you send is what you get, invalid
characters, backspaces, everything.

> 
> > Having tmux ignore the whole lot seems like a relatively sensible
> > course.
> 
> Well, what tmux currently does is making sure that everything gets
> broken in the maximum possible way on every terminal, even if the
> line that is finally sent off is completely correct.
> 
> > The only other alternative would be to substitute U+FFFD.
> 
> Why not just pass the bytes through?  I don't think it's the job
> of a terminal mutiplexer to mess with individual bytes.  It's the
> job of the final terminal doing the display to select glyphs and
> place them, for printable characters, for non-printable characters,
> for incomplete characters, and for invalid bytes.

No no, it is the job of tmux to interpret what it receives from the
application and make sure it shows the same no matter what terminal is
outside.

If terminals outside behave differently for the same output, tmux can't
keep its own internal state correct, so the display will be
mangled. tmux has to understand everything it receives, there is almost
no case where we can just pass through.

> 
> > But that seems iffy too - U+FFFD is width 1, but what if the
> > application is expecting a width 2?
> 
> By definition, incomplete characters and invalid bytes don't
> have a width, so it doesn't matter what the application wanted
> (for example, which character the user intended to type but
> didn't finish).  What matters is how wide the replacement
> glyph will look on the final terminal.  In that respect, we
> cannot help making an assumption, and "incomplete sequences
> and invalid bytes are displayed as U+FFFD (i.e. width 1)
> seems about the best we can do.  That may be slightly off
> for gnome-terminal and konsole, but i don't see how that can
> be helped.
> 
> Anyway, these subtleties of invalid bytes that *remain* are not
> the main inconvenience in practice.  What matters more is that
> tmux breaks even if incomplete characters are deleted again
> with backspaces and 

Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Ingo Schwarze
Hi Nicholas,

Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100:

> Perhaps I haven't understood what you are saying correctly,

What matters most is that sending an incomplete character
followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
that it doesn't change the line being edited and that it doesn't
change the display.  All terminals you mentioned seem to conform
to that according to my testing, except tmux.

> but I don't think it is possible to send control characters or
> any other invalid UTF-8 bytes inside UTF-8 characters and safely
> predict what the terminal will do. How about these examples:

If the invalid bytes are still present by the time the line is sent
off for processing (like in your example printf '\343\203\n'), then
it is indeed hard to predict what random terminals will do, though
i would argue that xterm's behaviour is correct (print one substitution
glyph for each incomplete character, or if bytes don't form even
incomplete characters, then one for each such invalid byte.

urxvt is clearly broken:

 $ printf '\343\203x\n'

prints U+00E3 x linefeed; i have no idea what it does to garble
0xe383 into 0xc3a3.  Maybe some naive misparsing, or spewing out
incomplete parsing state in some inconsistent way.

gnome-terminal and konsole print one replacement character for each
invalid byte, even if bytes form an incomplete character.  Maybe
not outright wrong, but arguably a bit confusing.

So yeah, if lines containing incomplete sequences *when they are
sent off* misformat with tmux and gnome-terminal or konsole, i
wouldn't call that tmux'es fault, and i agree there is little that
can be done about it.

> Having tmux ignore the whole lot seems like a relatively sensible
> course.

Well, what tmux currently does is making sure that everything gets
broken in the maximum possible way on every terminal, even if the
line that is finally sent off is completely correct.

> The only other alternative would be to substitute U+FFFD.

Why not just pass the bytes through?  I don't think it's the job
of a terminal mutiplexer to mess with individual bytes.  It's the
job of the final terminal doing the display to select glyphs and
place them, for printable characters, for non-printable characters,
for incomplete characters, and for invalid bytes.

> But that seems iffy too - U+FFFD is width 1, but what if the
> application is expecting a width 2?

By definition, incomplete characters and invalid bytes don't
have a width, so it doesn't matter what the application wanted
(for example, which character the user intended to type but
didn't finish).  What matters is how wide the replacement
glyph will look on the final terminal.  In that respect, we
cannot help making an assumption, and "incomplete sequences
and invalid bytes are displayed as U+FFFD (i.e. width 1)
seems about the best we can do.  That may be slightly off
for gnome-terminal and konsole, but i don't see how that can
be helped.

Anyway, these subtleties of invalid bytes that *remain* are not
the main inconvenience in practice.  What matters more is that
tmux breaks even if incomplete characters are deleted again
with backspaces and never sent off.

Yours,
  Ingo



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Ingo Schwarze
Hi Nicholas,

Nicholas Marriott wrote on Fri, May 19, 2017 at 07:27:36PM +0100:

> ksh has problems for me with Anton's example in several terminals,
> not just in tmux. Mostly the cursor seems to end up one character
> off rather than in the prompt, which is less visibly incorrect
> perhaps, but still wrong.

Can't reproduce.  At least when having LC_CTYPE=en_US.UTF-8 set
when starting the terminal, there is no problem for me with
Anton's example in xterm, urxvt, gnome-terminal, or konsole.

If i type

  echo xx

then left-arrow, a width-one non-ASCII character, a dot, and hit
enter, the dot appears at the right place both on the command line
and in the output.  Moving around with arrow-left or arrow-right
after typing the non-ASCII character to insert the dot at a different
place doesn't break anything for me either.

The output is correct even with tmux, all that tmux garbles is the
command line display.


On a side note, i don't think gnome-terminal and konsole are relevant.
I never installed them before and did so now for the first time for
testing, but they installed so many libraries that i feel uncomfortable
and unsafe using them even briefly and purely for testing purposes.
I will certainly pkg_delete them, and the libraries they pulled in, in
the near future.

I can't believe anyone in their right mind would use a terminal
depending on webkit, json, javascript, geoclue, pulseaudio, polkit,
libssh and the like for any serious work (and on top of that, konsole
is spewing huge amounts of scary error messages to stderr and/or
stdout).

> FWIW bash seems to do the replacement to U+FFFD itself before it sends
> it to the terminal, which means it is (more) predictable. I don't know
> if this is a sensible option for ksh.

That is not really an option.  This matters most while the multibyte
character is being typed, when the first byte is already being
processed but the second one not yet.  Replacing the byte with
U+FFFD, then later substituting the actual bytes back when all have
arrived, doesn't really make much sense to me.

Hanging the shell until all expected bytes have arrived seems like
a bad idea, too.  You can see the misbehaviour that is causing in
programs like uniname(1) from the misc/uniutils package:

 $ printf '\377turd' | uniname

Now the program hangs indefinitely.  I wouldn't want the shell to
hang in such a manner.

Yours,
  Ingo



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Nicholas Marriott
ksh has problems for me with Anton's example in several terminals, not
just in tmux. Mostly the cursor seems to end up one character off rather
than in the prompt, which is less visibly incorrect perhaps, but still
wrong.

I don't know that ksh will be able to predict this reliably (not
uncommon with shells).

FWIW bash seems to do the replacement to U+FFFD itself before it sends
it to the terminal, which means it is (more) predictable. I don't know
if this is a sensible option for ksh.




On Fri, May 19, 2017 at 07:04:53PM +0100, Nicholas Marriott wrote:
> Hi
> 
> Perhaps I haven't understood what you are saying correctly, but I don't
> think it is possible to send control characters or any other invalid
> UTF-8 bytes inside UTF-8 characters and safely predict what the terminal
> will do. How about these examples:
> 
> printf '\343\203\010\217a\n'
> printf '\343\203\r\217b\n'
> printf '\343\203\n\217c\n'
> 
> Or with a width 1 character:
> 
> printf '\342\211\010\240\n'
> printf '\342\211\r\240b\n'
> printf '\342\211\n\240c\n'
> 
> I have four UTF-8 capable X terminals here and between them they do
> three different things (xterm, rxvt-unicode, gnome-terminal & konsole).
> 
> tmux can't forward this stuff to the terminal outside, because if it
> can't predict what that terminal will do, tmux's idea of the display
> will get out of sync with reality.
> 
> Having tmux ignore the whole lot seems like a relatively sensible
> course.
> 
> The only other alternative would be to substitute U+FFFD. But that seems
> iffy too - U+FFFD is width 1, but what if the application is expecting
> a width 2?
> 
> 
> 
> 
> On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote:
> > Hi Anton,
> > 
> > Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:
> > 
> > > 1. Run ksh under tmux.
> > > 
> > > 2. Input the following characters, without spaces:
> > > 
> > >a (any character) ^B (backward-char) ?? (any UTF-8 character)
> > > 
> > > 3. At this point, the prompt gets overwritten.
> > > 
> > > Since ksh read a single byte of input, it will display a partial UTF-8
> > > character before the whole character has been read. This is especially
> > > troublesome when the cursor is not placed at the end of the line. In the
> > > scenario above, after reading the first byte of '??' the following
> > > sequence will be displayed:
> > > 
> > >   0xc3 0x61 0x08
> > > 
> > > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > > expects it to be followed by a UTF-8 continuation byte which is not the
> > > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > > reset to its initial state
> > 
> > I call that a bug in tmux.  At least for UTF-8, tmux should never reset
> > its parser.  Depending on the keyboard configuration, it may be possible
> > to enter single non-UTF-8 bytes.  For example, at the console, i can do
> > this:
> > 
> >  - type "printf x"
> >  - press Ctrl-V, Ctrl-Alt-Z
> >  - type "printf x | hexdump -C"
> > 
> > The result is:
> > 
> >     78 9a 78   |x.x|
> >   0003
> > 
> > All of the shell, the console, and xterm more or less handle such
> > stunts, even though admittedly, that is not the usual way of typing
> > in a binary file.  tmux is a terminal emulator, so it ought to cope,
> > too, and not reset its state.
> > 
> > Note that this is yet another instance where the concept of arbitrary
> > locales is utterly broken and insecure.  On some non-OpenBSD system
> > supporting such insecure stuff, if the user has set an arbitrary,
> > non-UTF-8, state-dependent locale and somehow manages to insert an
> > invalid byte into the input stream, the state of the input stream
> > becomes invalid, no further characters can be read from that terminal,
> > and *there is no way to recover*.  The only remaining half-secure
> > option is for the shell to exit, which may also not be very secure,
> > on a different level.
> > 
> > But with UTF-8, there is no problem whatsoever dealing with invalid
> > bytes, so tmux ought to cope.
> > 
> > > causing the backspace to be accepted and the
> > > first character in the prompt to be overwritten.
> > 
> > That's part of the reason why tmux must not reset its state:
> > The shell won't know about it, and the screen will become garbled.
> > 
> > > Below is diff that make sure to read a whole UTF-8 character in
> > > x_emacs() prior doing another iteration of the main-loop
> > 
> > I don't think we can do that.  What if there is no next byte?
> > Then the shell will hang until, maybe considerably later, the next
> > character is typed.  Also, we cannot rely on parsing the UTF-8 start
> > byte (even if done correctly).  It tells the byte length of the
> > character only for valid byte sequences, and the byte sequence need
> > not be valid.
> > 
> > Besides, i think patching the shell to work around terminal bugs
> > (or terminal emulator bugs) 

Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Nicholas Marriott
Hi

Perhaps I haven't understood what you are saying correctly, but I don't
think it is possible to send control characters or any other invalid
UTF-8 bytes inside UTF-8 characters and safely predict what the terminal
will do. How about these examples:

printf '\343\203\010\217a\n'
printf '\343\203\r\217b\n'
printf '\343\203\n\217c\n'

Or with a width 1 character:

printf '\342\211\010\240\n'
printf '\342\211\r\240b\n'
printf '\342\211\n\240c\n'

I have four UTF-8 capable X terminals here and between them they do
three different things (xterm, rxvt-unicode, gnome-terminal & konsole).

tmux can't forward this stuff to the terminal outside, because if it
can't predict what that terminal will do, tmux's idea of the display
will get out of sync with reality.

Having tmux ignore the whole lot seems like a relatively sensible
course.

The only other alternative would be to substitute U+FFFD. But that seems
iffy too - U+FFFD is width 1, but what if the application is expecting
a width 2?




On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote:
> Hi Anton,
> 
> Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:
> 
> > 1. Run ksh under tmux.
> > 
> > 2. Input the following characters, without spaces:
> > 
> >a (any character) ^B (backward-char) ?? (any UTF-8 character)
> > 
> > 3. At this point, the prompt gets overwritten.
> > 
> > Since ksh read a single byte of input, it will display a partial UTF-8
> > character before the whole character has been read. This is especially
> > troublesome when the cursor is not placed at the end of the line. In the
> > scenario above, after reading the first byte of '??' the following
> > sequence will be displayed:
> > 
> >   0xc3 0x61 0x08
> > 
> > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > expects it to be followed by a UTF-8 continuation byte which is not the
> > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > reset to its initial state
> 
> I call that a bug in tmux.  At least for UTF-8, tmux should never reset
> its parser.  Depending on the keyboard configuration, it may be possible
> to enter single non-UTF-8 bytes.  For example, at the console, i can do
> this:
> 
>  - type "printf x"
>  - press Ctrl-V, Ctrl-Alt-Z
>  - type "printf x | hexdump -C"
> 
> The result is:
> 
>     78 9a 78   |x.x|
>   0003
> 
> All of the shell, the console, and xterm more or less handle such
> stunts, even though admittedly, that is not the usual way of typing
> in a binary file.  tmux is a terminal emulator, so it ought to cope,
> too, and not reset its state.
> 
> Note that this is yet another instance where the concept of arbitrary
> locales is utterly broken and insecure.  On some non-OpenBSD system
> supporting such insecure stuff, if the user has set an arbitrary,
> non-UTF-8, state-dependent locale and somehow manages to insert an
> invalid byte into the input stream, the state of the input stream
> becomes invalid, no further characters can be read from that terminal,
> and *there is no way to recover*.  The only remaining half-secure
> option is for the shell to exit, which may also not be very secure,
> on a different level.
> 
> But with UTF-8, there is no problem whatsoever dealing with invalid
> bytes, so tmux ought to cope.
> 
> > causing the backspace to be accepted and the
> > first character in the prompt to be overwritten.
> 
> That's part of the reason why tmux must not reset its state:
> The shell won't know about it, and the screen will become garbled.
> 
> > Below is diff that make sure to read a whole UTF-8 character in
> > x_emacs() prior doing another iteration of the main-loop
> 
> I don't think we can do that.  What if there is no next byte?
> Then the shell will hang until, maybe considerably later, the next
> character is typed.  Also, we cannot rely on parsing the UTF-8 start
> byte (even if done correctly).  It tells the byte length of the
> character only for valid byte sequences, and the byte sequence need
> not be valid.
> 
> Besides, i think patching the shell to work around terminal bugs
> (or terminal emulator bugs) is the wrong approach in the first place.
> 
> Yours,
>   Ingo
> 



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Ingo Schwarze
Hi Anton,

Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:

> 1. Run ksh under tmux.
> 
> 2. Input the following characters, without spaces:
> 
>a (any character) ^B (backward-char) ö (any UTF-8 character)
> 
> 3. At this point, the prompt gets overwritten.
> 
> Since ksh read a single byte of input, it will display a partial UTF-8
> character before the whole character has been read. This is especially
> troublesome when the cursor is not placed at the end of the line. In the
> scenario above, after reading the first byte of 'ö' the following
> sequence will be displayed:
> 
>   0xc3 0x61 0x08
> 
> That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> expects it to be followed by a UTF-8 continuation byte which is not the
> case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> reset to its initial state

I call that a bug in tmux.  At least for UTF-8, tmux should never reset
its parser.  Depending on the keyboard configuration, it may be possible
to enter single non-UTF-8 bytes.  For example, at the console, i can do
this:

 - type "printf x"
 - press Ctrl-V, Ctrl-Alt-Z
 - type "printf x | hexdump -C"

The result is:

    78 9a 78   |x.x|
  0003

All of the shell, the console, and xterm more or less handle such
stunts, even though admittedly, that is not the usual way of typing
in a binary file.  tmux is a terminal emulator, so it ought to cope,
too, and not reset its state.

Note that this is yet another instance where the concept of arbitrary
locales is utterly broken and insecure.  On some non-OpenBSD system
supporting such insecure stuff, if the user has set an arbitrary,
non-UTF-8, state-dependent locale and somehow manages to insert an
invalid byte into the input stream, the state of the input stream
becomes invalid, no further characters can be read from that terminal,
and *there is no way to recover*.  The only remaining half-secure
option is for the shell to exit, which may also not be very secure,
on a different level.

But with UTF-8, there is no problem whatsoever dealing with invalid
bytes, so tmux ought to cope.

> causing the backspace to be accepted and the
> first character in the prompt to be overwritten.

That's part of the reason why tmux must not reset its state:
The shell won't know about it, and the screen will become garbled.

> Below is diff that make sure to read a whole UTF-8 character in
> x_emacs() prior doing another iteration of the main-loop

I don't think we can do that.  What if there is no next byte?
Then the shell will hang until, maybe considerably later, the next
character is typed.  Also, we cannot rely on parsing the UTF-8 start
byte (even if done correctly).  It tells the byte length of the
character only for valid byte sequences, and the byte sequence need
not be valid.

Besides, i think patching the shell to work around terminal bugs
(or terminal emulator bugs) is the wrong approach in the first place.

Yours,
  Ingo



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Anton Lindqvist
On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote:
> Hi,
> 
> On 19/05/17 03:42, Anton Lindqvist wrote:
> > Hi,
> > I did submit this problem[1] earlier but with an incomplete analysis and
> > fix. Here's a second attempt.
> > 
> > This does only occur when running ksh with emacs mode under tmux. How to
> > re-produce:
> > 
> > 1. Run ksh under tmux.
> > 
> > 2. Input the following characters, without spaces:
> > 
> >a (any character) ^B (backward-char) ö (any UTF-8 character)
> > 
> > 3. At this point, the prompt gets overwritten.
> > 
> > Since ksh read a single byte of input, it will display a partial UTF-8
> > character before the whole character has been read. This is especially
> > troublesome when the cursor is not placed at the end of the line. In the
> > scenario above, after reading the first byte of 'ö' the following
> > sequence will be displayed:
> > 
> >   0xc3 0x61 0x08
> > 
> > That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > expects it to be followed by a UTF-8 continuation byte which is not the
> > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > reset to its initial state causing the backspace to be accepted and the
> > first character in the prompt to be overwritten.
> > 
> > After the second byte of 'ö' (0xb6) is read by ksh, the following
> > sequence will be displayed:
> > 
> >0x08 0xc3 0xb6 0x61 0x08
> > 
> > That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since
> > ksh assumes the cursor is correctly positioned it displays a leading
> > backspace in order to move passed the first character. This is however
> > not true causing another character in the prompt to be overwritten.
> > 
> > Below is diff that make sure to read a whole UTF-8 character in
> > x_emacs() prior doing another iteration of the main-loop which solves
> > the problem. It does not validate UTF-8 input but instead assumes every
> > such character is valid.
> > 
> > Comments and feedback are much appreciated.
> > 
> > [1] http://marc.info/?l=openbsd-misc=148509346310901=2
> > 
> > Index: emacs.c
> > ===
> > RCS file: /cvs/src/bin/ksh/emacs.c,v
> > retrieving revision 1.67
> > diff -u -p -r1.67 emacs.c
> > --- emacs.c 12 May 2017 14:37:52 -  1.67
> > +++ emacs.c 14 May 2017 08:21:26 -
> > @@ -98,6 +98,7 @@ static intx_col;
> >  static int x_displen;
> >  static int x_arg;  /* general purpose arg */
> >  static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */
> > +static int x_getc_again;
> >  
> >  static int xlp_valid;
> >  /* end from 4.9 edit.h } */
> > @@ -142,6 +143,7 @@ static int  x_fold_case(int);
> >  static char*x_lastcp(void);
> >  static voiddo_complete(int, Comp_type);
> >  static int isu8cont(unsigned char);
> > +static int u8len(unsigned char);
> >  
> >  /* proto's for keybindings */
> >  static int x_abort(int);
> > @@ -272,6 +274,21 @@ isu8cont(unsigned char c)
> > return (c & (0x80 | 0x40)) == 0x80;
> >  }
> >  
> > +static int
> > +u8len(unsigned char c)
> > +{
> > +   switch (c & 0xF0) {
> > +   case 0xF0:
> > +   return 4;
> > +   case 0xE0:
> > +   return 3;
> > +   case 0xC0:
> > +   return 2;
> > +   default:
> > +   return 1;
> > +   }
> > +}
> > +
> 
> This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater 
> than U+0400) would be interpreted as being 1 character long instead of 2.

Thanks for the heads-up. Maybe a more reliable solution would be to call
mbtowc(3) repeatedly as new input arrives until it returns successfully.
Assuming the first read byte is a UTF-8 start byte.



Re: ksh(1): don't output invalid UTF-8 characters

2017-05-19 Thread Lucas Gabriel Vuotto
Hi,

On 19/05/17 03:42, Anton Lindqvist wrote:
> Hi,
> I did submit this problem[1] earlier but with an incomplete analysis and
> fix. Here's a second attempt.
> 
> This does only occur when running ksh with emacs mode under tmux. How to
> re-produce:
> 
> 1. Run ksh under tmux.
> 
> 2. Input the following characters, without spaces:
> 
>a (any character) ^B (backward-char) ö (any UTF-8 character)
> 
> 3. At this point, the prompt gets overwritten.
> 
> Since ksh read a single byte of input, it will display a partial UTF-8
> character before the whole character has been read. This is especially
> troublesome when the cursor is not placed at the end of the line. In the
> scenario above, after reading the first byte of 'ö' the following
> sequence will be displayed:
> 
>   0xc3 0x61 0x08
> 
> That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> expects it to be followed by a UTF-8 continuation byte which is not the
> case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> reset to its initial state causing the backspace to be accepted and the
> first character in the prompt to be overwritten.
> 
> After the second byte of 'ö' (0xb6) is read by ksh, the following
> sequence will be displayed:
> 
>0x08 0xc3 0xb6 0x61 0x08
> 
> That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since
> ksh assumes the cursor is correctly positioned it displays a leading
> backspace in order to move passed the first character. This is however
> not true causing another character in the prompt to be overwritten.
> 
> Below is diff that make sure to read a whole UTF-8 character in
> x_emacs() prior doing another iteration of the main-loop which solves
> the problem. It does not validate UTF-8 input but instead assumes every
> such character is valid.
> 
> Comments and feedback are much appreciated.
> 
> [1] http://marc.info/?l=openbsd-misc=148509346310901=2
> 
> Index: emacs.c
> ===
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.67
> diff -u -p -r1.67 emacs.c
> --- emacs.c   12 May 2017 14:37:52 -  1.67
> +++ emacs.c   14 May 2017 08:21:26 -
> @@ -98,6 +98,7 @@ static int  x_col;
>  static int   x_displen;
>  static int   x_arg;  /* general purpose arg */
>  static int   x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */
> +static int   x_getc_again;
>  
>  static int   xlp_valid;
>  /* end from 4.9 edit.h } */
> @@ -142,6 +143,7 @@ static intx_fold_case(int);
>  static char  *x_lastcp(void);
>  static void  do_complete(int, Comp_type);
>  static int   isu8cont(unsigned char);
> +static int   u8len(unsigned char);
>  
>  /* proto's for keybindings */
>  static int   x_abort(int);
> @@ -272,6 +274,21 @@ isu8cont(unsigned char c)
>   return (c & (0x80 | 0x40)) == 0x80;
>  }
>  
> +static int
> +u8len(unsigned char c)
> +{
> + switch (c & 0xF0) {
> + case 0xF0:
> + return 4;
> + case 0xE0:
> + return 3;
> + case 0xC0:
> + return 2;
> + default:
> + return 1;
> + }
> +}
> +

This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater 
than U+0400) would be interpreted as being 1 character long instead of 2.

>  int
>  x_emacs(char *buf, size_t len)
>  {
> @@ -318,10 +335,12 @@ x_emacs(char *buf, size_t len)
>   x_last_command = NULL;
>   while (1) {
>   x_flush();
> - if ((c = x_e_getc()) < 0)
> - return 0;
> + do {
> + if ((c = x_e_getc()) < 0)
> + return 0;
>  
> - line[at++] = c;
> + line[at++] = c;
> + } while (x_getc_again > 0);
>   line[at] = '\0';
>  
>   if (x_arg == -1) {
> @@ -364,7 +383,10 @@ x_emacs(char *buf, size_t len)
>   } else {
>   if (submatch)
>   continue;
> - if (at == 1)
> + if (at > 1) {
> + x_ins(line);
> + ret = KSTD;
> + } else if (at == 1)
>   ret = x_insert(c);
>   else
>   ret = x_error(c); /* not matched meta sequence 
> */
> @@ -1887,8 +1909,12 @@ x_e_getc(void)
>   macro_args = NULL;
>   c = x_getc();
>   }
> - } else
> + } else {
>   c = x_getc();
> + if (x_getc_again == 0)
> + x_getc_again = u8len(c);
> + x_getc_again--;
> + }
>  
>   return c;
>  }
>