Re: start cleaning up UTF-8 processing in less(1)

Ingo Schwarze Sun, 24 Feb 2019 02:40:53 -0800

Hi,

Stefan Sperling wrote on Sat, Feb 23, 2019 at 04:19:02PM +0100:


> Your diff looks good to me.
> And I can't see how it could make this situation any worse either.

thanks for checking; i committed the first patch.


To be able to continue with the less cleanup i started, i have to
explain some theory first, and i'm kindly asking for advice which
functionality is desired.  Personally, i'd like to make less simpler
and safer, removing some anachronistic, dangerous functionality,
see below.

Considerable complication in the less code results from support for
https://en.wikipedia.org/wiki/ANSI_escape_code sequences.  Note
that i will *not* propose to delete that support completely, but
only some more arcane and more dangerous parts of it, see at the
very end of this mail.

There are three ways to write an ANSI escape sequence; i'll have
to explain all three in turn.


First syntax: ESCAPE-BRACKET
----------------------------
  $ printf "normal \x1b[43m yellow \x1b[0m normal\n"

Each ANSI escape sequence is introduced by the two-byte sequence
consisting of the ASCII escape character followed by an ASCII opening
square bracket.  This form is relatively benign because it is both
valid ASCII and valid UTF-8 and looks identical in both encodings.
In our default configuration of xterm(1), which is moderately secure
without being paranoid, the above command works out of the box,
showing you the word "yellow" on a yellow background.  It doesn't
matter what LC_CTYPE is, it always works.

By default, less(1) does *not* interpret such escape sequences, but
escapes them and shows them as follows:

  normal ESC[43m yellow ESC[0m normal

where the two instances of the string "ESC" are highlighted in
reverse video.  But if you run less(1) with the -R option, it passes
the escape sequences on to the terminal, and again you see the word
"yellow" on a yellow background, inside less(1).  Horizontal scrolling
works correctly, preserving vertical alignment of display columns,
with the two escape sequences taking up zero columns each in an
xterm(1).  Your mileage may vary on exotic terminals, though.

I'm conivinced this behaviour ought to be preserved.


Second syntax: conventional 8-bit CSI
-------------------------------------
  $ LC_CTYPE=C printf "normal \23343m yellow \2330m normal\n"

Here, each ANSI escape sequence is introduced by the single
byte 0x9b (CSI = control sequence introducer) instead of by the
two-byte sequence ESC-[.  Obviously, this implies that the second
syntax is incompatible with UTF-8, so we always have to make sure
we have LC_CTYPE=C set when doing anything with the second syntax.
Note that i had to give the CSI bytes in octal in the printf command
above; \x9b43m would not work because printf(1) would be unable to
figure out where the hexadecimal character number ends.

In our default xterm(1) configuration, this syntax is *NOT*
supported.  The above command prints the same as

  $ printf "normal \xef\xbf\27543m yellow \xef\xbf\2750m normal\n"

where \xef\xbf\275 is the Unicode replacement character U+FFFD
substituted for the (unsafe, ill-encoded) CSI.

I you really want to, you can use the second syntax on OpenBSD,
but that is *ABSOLUTELY NOT RECOMMENDED*.  You would have to run
xterm(1) in the unsafe, legacy, so called "conventional 8bit" mode
as follows:

  $ LC_CTYPE=C xterm +lc  # This is a really BAD idea.

In such an xterm(1), the line

  $ LC_CTYPE=C printf "normal \23343m yellow \2330m normal\n"

does show the word "yellow" in front of a yellow background.

Reading the less(1) source code, i guess that it *intends* to support
this, but it doesn't actually work for me:

  $ export LC_CTYPE=C
  $ printf "normal \23343m yellow \2330m normal\n" | less -R

results in

  normal <9B>43m yellow <9B>0m normal

for me, with the two strings "<9B>" highlighted in reverse video.
Only with

  $ printf "normal \23343m yellow \2330m normal\n" | less -r

do i see yellow background - but that is utterly useless because -r
completely breaks column counting and wrapping in less(1), and besides
it is irrelevent because -r doesn't need to inspect anything but
simply passes *everything* through no matter what.

I think it is better to completely remove support for the second
syntax from less than to repair it.  It is rarely used because it
is incompatible with Unicode, it provides no advantages compared
to the first syntax, but it painfully complicates the code.


Third syntax: UTF-8 CSI
-----------------------
 $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n"

Here, each ANSI escape sequence is introduced by the Unicode
character U+009B, represented in its two-byte UTF-8 encoding \xc2\x9b
instead of by the two-byte sequence ESC-[.  Obviously, this only
mskes sense with a UTF-8 locale.

However, the xterm documentation says in the file
/usr/xenocara/app/xterm/ctlseqs.ms :

  It is not possible to use a C1 control obtained from decoding the
  UTF-8 text, because that would require reprocessing the data.

Consequently, xterm(1) does not support the third syntax *at all*.
You can't even reconfigure it to make it work, it just isn't
supported at all.  In an xterm, the output from the above command
always looks like

  normal 43m yellow 0m normal

with the two U+009B characters being invisible and zero width.

By default, our less(1) does not pass these control characters through,
but escapes them just like ESC characters:

  $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" | less
  normal <U+009B>43m yellow <U+009B>0m normal

where the two strings "<U+009B>" are highlighted in reverse video.
Again, with -r, they are passed through, but exactly as above, that
is of no relevance (the following shows the invisble sequences):

  $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" | less -r

Again, reading the source code, my impression is that less(1) *intends* to
support them, but it doesn't actually work for me; the following shows
<U+009B>43m just like without -R:

  $ printf "normal \xc2\23343m yellow \xc2\2330m normal\n" | less -R

Again, rather than trying to fix less(1) in this respect, i'd prefer
to completely remove support for the third syntax: it is rarely
used because it doesn't work with xterm(1), it provides no advantage
with respect to the first syntax, but it again painfully complicates
the code.


Summary
-------
During the upcoming cleanup steps, let use retain full support for
the first (ESC-[) syntax and lets us completely delete support for
the second and third CSI syntaxes (single-byte CSI and UTF-8
single-character two-byte CSI).

If you are OK with that plan, i'll send diffs implementing that.

Thanks for any feedback,
  Ingo

Re: start cleaning up UTF-8 processing in less(1)

Reply via email to