Hi, Stefan Sperling wrote on Sat, Feb 23, 2019 at 04:19:02PM +0100:
> Your diff looks good to me. > And I can't see how it could make this situation any worse either. thanks for checking; i committed the first patch. To be able to continue with the less cleanup i started, i have to explain some theory first, and i'm kindly asking for advice which functionality is desired. Personally, i'd like to make less simpler and safer, removing some anachronistic, dangerous functionality, see below. Considerable complication in the less code results from support for https://en.wikipedia.org/wiki/ANSI_escape_code sequences. Note that i will *not* propose to delete that support completely, but only some more arcane and more dangerous parts of it, see at the very end of this mail. There are three ways to write an ANSI escape sequence; i'll have to explain all three in turn. First syntax: ESCAPE-BRACKET ---------------------------- $ printf "normal \x1b[43m yellow \x1b[0m normal\n" Each ANSI escape sequence is introduced by the two-byte sequence consisting of the ASCII escape character followed by an ASCII opening square bracket. This form is relatively benign because it is both valid ASCII and valid UTF-8 and looks identical in both encodings. In our default configuration of xterm(1), which is moderately secure without being paranoid, the above command works out of the box, showing you the word "yellow" on a yellow background. It doesn't matter what LC_CTYPE is, it always works. By default, less(1) does *not* interpret such escape sequences, but escapes them and shows them as follows: normal ESC[43m yellow ESC[0m normal where the two instances of the string "ESC" are highlighted in reverse video. But if you run less(1) with the -R option, it passes the escape sequences on to the terminal, and again you see the word "yellow" on a yellow background, inside less(1). Horizontal scrolling works correctly, preserving vertical alignment of display columns, with the two escape sequences taking up zero columns each in an xterm(1). Your mileage may vary on exotic terminals, though. I'm conivinced this behaviour ought to be preserved. Second syntax: conventional 8-bit CSI ------------------------------------- $ LC_CTYPE=C printf "normal \23343m yellow \2330m normal\n" Here, each ANSI escape sequence is introduced by the single byte 0x9b (CSI = control sequence introducer) instead of by the two-byte sequence ESC-[. Obviously, this implies that the second syntax is incompatible with UTF-8, so we always have to make sure we have LC_CTYPE=C set when doing anything with the second syntax. Note that i had to give the CSI bytes in octal in the printf command above; \x9b43m would not work because printf(1) would be unable to figure out where the hexadecimal character number ends. In our default xterm(1) configuration, this syntax is *NOT* supported. The above command prints the same as $ printf "normal \xef\xbf\27543m yellow \xef\xbf\2750m normal\n" where \xef\xbf\275 is the Unicode replacement character U+FFFD substituted for the (unsafe, ill-encoded) CSI. I you really want to, you can use the second syntax on OpenBSD, but that is *ABSOLUTELY NOT RECOMMENDED*. You would have to run xterm(1) in the unsafe, legacy, so called "conventional 8bit" mode as follows: $ LC_CTYPE=C xterm +lc # This is a really BAD idea. In such an xterm(1), the line $ LC_CTYPE=C printf "normal \23343m yellow \2330m normal\n" does show the word "yellow" in front of a yellow background. Reading the less(1) source code, i guess that it *intends* to support this, but it doesn't actually work for me: $ export LC_CTYPE=C $ printf "normal \23343m yellow \2330m normal\n" | less -R results in normal <9B>43m yellow <9B>0m normal for me, with the two strings "<9B>" highlighted in reverse video. Only with $ printf "normal \23343m yellow \2330m normal\n" | less -r do i see yellow background - but that is utterly useless because -r completely breaks column counting and wrapping in less(1), and besides it is irrelevent because -r doesn't need to inspect anything but simply passes *everything* through no matter what. I think it is better to completely remove support for the second syntax from less than to repair it. It is rarely used because it is incompatible with Unicode, it provides no advantages compared to the first syntax, but it painfully complicates the code. Third syntax: UTF-8 CSI ----------------------- $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" Here, each ANSI escape sequence is introduced by the Unicode character U+009B, represented in its two-byte UTF-8 encoding \xc2\x9b instead of by the two-byte sequence ESC-[. Obviously, this only mskes sense with a UTF-8 locale. However, the xterm documentation says in the file /usr/xenocara/app/xterm/ctlseqs.ms : It is not possible to use a C1 control obtained from decoding the UTF-8 text, because that would require reprocessing the data. Consequently, xterm(1) does not support the third syntax *at all*. You can't even reconfigure it to make it work, it just isn't supported at all. In an xterm, the output from the above command always looks like normal 43m yellow 0m normal with the two U+009B characters being invisible and zero width. By default, our less(1) does not pass these control characters through, but escapes them just like ESC characters: $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" | less normal <U+009B>43m yellow <U+009B>0m normal where the two strings "<U+009B>" are highlighted in reverse video. Again, with -r, they are passed through, but exactly as above, that is of no relevance (the following shows the invisble sequences): $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" | less -r Again, reading the source code, my impression is that less(1) *intends* to support them, but it doesn't actually work for me; the following shows <U+009B>43m just like without -R: $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" | less -R Again, rather than trying to fix less(1) in this respect, i'd prefer to completely remove support for the third syntax: it is rarely used because it doesn't work with xterm(1), it provides no advantage with respect to the first syntax, but it again painfully complicates the code. Summary ------- During the upcoming cleanup steps, let use retain full support for the first (ESC-[) syntax and lets us completely delete support for the second and third CSI syntaxes (single-byte CSI and UTF-8 single-character two-byte CSI). If you are OK with that plan, i'll send diffs implementing that. Thanks for any feedback, Ingo