start cleaning up UTF-8 processing in less(1)

Ingo Schwarze Sat, 23 Feb 2019 06:46:27 -0800

Hi,

Evan Silberman, by posting a well-founded bug report to bugs@,
just drew my attention to the pigsty of having a hand-rolled
re-implemention of Unicode character property handling within
less(1).


The root of the evil is the existence of the custom definition

  typedef unsigned long LWCHAR;

in the file less.h, instead of simply using wchar_t where needed.
My first thought was "maybe they are doing that to store Unicode
codepoints portably on platforms where the values stored in wchar_t
are not Unicode codepoints" - but it turns out less(1) is *already*
broken on such platforms because the file cvt.c, function cvt_text(),
already calls iswupper(3) and towlower(3) on LWCHAR values, i.e. on
Unicode codepoints.  And some other functions are even worse than that:

 * search.c, is_ucase() calls isupper(3) - without *w*! - on an LWCHAR
   value, which makes no sense whatsoever.
 * charset.c, control_char(), calls iscntrl((unsigned char)c)
   on an LWCHAR value - which is absurd to the point of not even being
   funny any longer, rather making you cry.  When i first saw that,
   i was struck with incredulity, but it is really truncating valid
   codepoints to their lowest eight bits and then handing those
   meaningless bits to iscntrl(3)...

Of course it would be possible to root out all the insanity in one
bold step.  But it would be a very large diff, prone to introducing
bugs, hard to review, so i decided against that route.

Instead, for the moment, let us assume that wchar_t always stores
Unicode codepoints and is always contained within the "unsigned
long" range.  Both will always remain true on OpenBSD.  Making these
assumptions allows me to proceed step by step, with small, focussed
diffs that will be easy to argue about.

The end goal is to get rid of the LWCHAR data type and to delete
most of the charset.c file, maybe even all of it, without adding any
new code.

Once that goal is reached, i expect our less(1) will be even *more*
portable than it is now, because once that is reached, everything
that requires Unicode character properties will be done with wchar_t,
which is likely to just work even on platforms where wchar_t uses
some arcane format instead of simply storing Unicode codepoints.


Pursuing that goal, i first tried to proceed bottom-up, i.e. by
starting to first replace the lowest-level functions like get_wchar()
and step_char(), and then work upwards.  But that failed miserably.
What these low-level functions are doing is generally so wrong, so
absurd, that it is unreasonable to simply replace them: the internal
APIs they define are practically unusable by the callers.  Instead,
all call sites have to be inspected one by one, figuring out what
the code *should* do at the call sites and then implement that with
standard functions like mbtowc(3) and iswprint(3).

I combed through the less source code and identified 35 functions
in seven files that have to be partially rewritten, totally rewritten,
or deleted because what they are doing is wrong, in most cases
completely absurd.  I don't see any alternative to tackling them
one by one.

I decided to start with the file line.c because it is central to
the purpose of less: displaying lines of text to the screen.
To quickly get to the important function store_char(), which
intends to save a character into the line output buffer, i first
have to clean up the function pwidth() that it uses.  The function
pwidth() intends to measure the display width of a character, i.e.
how many columns it takes up on the screen: 0, 1, or 2.  So it has
to become a shallow wrapper around wcwidth(3).  While i doubt the
wisdom of the addition of display widths of ANSI escape sequences
that it is doing at the end - on any sane terminal, such sequences
should never take up any additional columns - i decided to stay
focussed on UTF-8 issues and just leave the code related to attributes
unchanged for now.

Here is the benfit of the following diff:  it gets rid of
 - two LWCHAR function arguments
 - two calls to is_wide_char() which uses an outdated table
 - one call to is_composing_char() which uses an outdated table
 - one call to is_combining_char() which uses an outdated table
 - one call to control_char() which is totally broken
 - one call to is_ascii_char() which is a trivial function
 - two tests of the ideosyncratic "utf_mode" variable

None of the horrific features listed above can be deleted just yet
because they are also in use at other places in less.

Of course, the diff *does* change behaviour: character display
widths (for the purpose at hand) get updated to the latest version
of Unicode in our tree, and the effects of any bugs contained in
the functions listed above silently vanish.  That should only make
things better and not cause regressions.  I didn't check whether
no longer calling control_char() here fixes some bugs; that is quite
possible, but depends on which arguments pwidth() is called with,
which i didn't try to research.  The new version simply does the
right thing for all arguments it could possibly be passed.

One final remark: the diff makes the arguments of pwidth() narrower
on some platforms.  But that is safe.  As far as i can see, wchar_t
is "int" on all our platforms, which is at least 32 bits wide.  Even
the bogus 6-byte sequences that less accepts - which will hopefully
get fixed at a later stage of the beginning rampage - are at most
31 bits wide: one bit in the start byte and six bits in each of the
five continuation bytes.  So even though they are not at all valid
UTF-8 and not even valid Unicode, they still fit safely into our
wchar_t.

OK?
  Ingo


Index: line.c
===================================================================
RCS file: /cvs/src/usr.bin/less/line.c,v
retrieving revision 1.22
diff -u -p -r1.22 line.c
--- line.c      2 Apr 2017 23:02:06 -0000       1.22
+++ line.c      23 Feb 2019 14:15:45 -0000
@@ -15,6 +15,8 @@
  * in preparation for output to the screen.
  */
 
+#include <wchar.h>
+
 #include "charset.h"
 #include "less.h"
 
@@ -373,50 +375,53 @@ attr_ewidth(int a)
  * attribute sequence to be inserted, so this must be taken into account.
  */
 static int
-pwidth(LWCHAR ch, int a, LWCHAR prev_ch)
+pwidth(wchar_t ch, int a, wchar_t prev_ch)
 {
        int w;
 
-       if (ch == '\b')
-               /*
-                * Backspace moves backwards one or two positions.
-                * XXX - Incorrect if several '\b' in a row.
-                */
-               return ((utf_mode && is_wide_char(prev_ch)) ? -2 : -1);
-
-       if (!utf_mode || is_ascii_char(ch)) {
-               if (control_char((char)ch)) {
-                       /*
-                        * Control characters do unpredictable things,
-                        * so we don't even try to guess; say it doesn't move.
-                        * This can only happen if the -r flag is in effect.
-                        */
-                       return (0);
-               }
-       } else {
-               if (is_composing_char(ch) || is_combining_char(prev_ch, ch)) {
-                       /*
-                        * Composing and combining chars take up no space.
-                        *
-                        * Some terminals, upon failure to compose a
-                        * composing character with the character(s) that
-                        * precede(s) it will actually take up one column
-                        * for the composing character; there isn't much
-                        * we could do short of testing the (complex)
-                        * composition process ourselves and printing
-                        * a binary representation when it fails.
-                        */
-                       return (0);
-               }
+       /*
+        * In case of a backspace, back up by the width of the previous
+        * character.  If that is non-printable (for example another
+        * backspace) or zero width (for example a combining accent),
+        * the terminal may actually back up to a character even further
+        * back, but we no longer know how wide that may have been.
+        * The best guess possible at this point is that it was
+        * hopefully width one.
+        */
+       if (ch == L'\b') {
+               w = wcwidth(prev_ch);
+               if (w <= 0)
+                       w = 1;
+               return (-w);
        }
 
+       w = wcwidth(ch);
+
+       /*
+        * Non-printable characters can get here if the -r flag is in
+        * effect, and possibly in some other situations (XXX check that!).
+        * Treat them as zero width.
+        * That may not always match their actual behaviour,
+        * but there is no reasonable way to be more exact.
+        */
+       if (w == -1)
+               w = 0;
+
+       /*
+        * Combining accents take up no space.
+        * Some terminals, upon failure to compose them with the
+        * characters that precede them, will actually take up one column
+        * for the combining accent; there isn't much we could do short
+        * of testing the (complex) composition process ourselves and
+        * printing a binary representation when it fails.
+        */
+       if (w == 0)
+               return (0);
+
        /*
         * Other characters take one or two columns,
         * plus the width of any attribute enter/exit sequence.
         */
-       w = 1;
-       if (is_wide_char(ch))
-               w++;
        if (curr > 0 && !is_at_equiv(attr[curr-1], a))
                w += attr_ewidth(attr[curr-1]);
        if ((apply_at_specials(a) != AT_NORMAL) &&

start cleaning up UTF-8 processing in less(1)

Reply via email to