less(1) UTF-8 cleanup: store_char()

Ingo Schwarze Tue, 26 Feb 2019 04:44:52 -0800

Hi,

Nicholas Marriott wrote on Tue, Feb 26, 2019 at 07:20:47AM +0000:


> Looks good, ok nicm

Thanks to all of you for checking.

Here is the next step, slowly coming closer to the meat of the
matter: Start cleaning up the function store_char(), which takes a
wide character, receiving both the Unicode codepoint "LWCHAR ch"
and the UTF-8 multibyte representation "char *rep" as (redundant)
input, behaves specially when ch is part of an ESC[...m ANSI sequence,
otherwise measures the display width of ch, checks that it still
fits on the screen, copies the multibyte representation to the line
buffer, and advances the global variables "curr" (byte position)
and "column" (visual display position).

This is the only function calling in_ansi_esc_seq(), and having that
function separate is actually counter-productive: if an escape sequence
is found, then store_char() currently does the same backward iteration
again if the sequence is incomplete and needs to be deleted.
Consequently, merging in_ansi_esc_seq() into store_char() simplifies
the code.

If ch neither continues an ANSI sequence begun earlier nor starts a
new one, the code needs to find the previous character in order to
measure the width of the current one with pwidth(): the current one
might be BACKSPACE.  Currently, this backward iteration is done
with the horrific, buggy function step_char(-1) which we aim to
delete.  Instead, iterate backwards until we find something that
is not a UTF-8 continuation byte (i.e. which is an ASCII byte or a
UTF-8 start byte) then decode the character with the standard
function mbtowc(3).

For extra safety, let's not only check that the previous character
is valid (mbtowc() != -1) but also that there are no stray continuation
bytes between the previous character and the character currently
being added (i + w == curr).  Maybe invalid UTF-8 can't actually
make it into the line buffer, but let's better play it safe.  If
there is no previous character on the line, if it is invalid, or
if there are stray continuation bytes after it, use a dummy blank,
which means that in case of doubt, we assume that a backspace backs
up by one visual column.

Note that the function still contains a call to the buggy function
utf_len() that we want to delete.  Ultimately, most instances of
utf_len() will be replaced with the standard function mblen(3), but
we can't do that right now in store_char().  At this point, the
function store_char() may still receive bogus characters incorrectly
parsed with get_wchar(), and handling these at least consistently
(even though not correctly) requires sticking with utf_len() for
now.

I tested that emboldening with "c\bc" and underlining with "c\b_"
still works, even when "c" is a width two East Asian character,
and that behaviour is sane with intervening stray continuation
bytes.

OK?
  Ingo


Index: line.c
===================================================================
RCS file: /cvs/src/usr.bin/less/line.c,v
retrieving revision 1.24
diff -u -p -r1.24 line.c
--- line.c      26 Feb 2019 11:01:54 -0000      1.24
+++ line.c      26 Feb 2019 12:26:42 -0000
@@ -458,27 +458,6 @@ backc(void)
 }
 
 /*
- * Are we currently within a recognized ANSI escape sequence?
- */
-static int
-in_ansi_esc_seq(void)
-{
-       int i;
-
-       /*
-        * Search backwards for either an ESC (which means we ARE in a seq);
-        * or an end char (which means we're NOT in a seq).
-        */
-       for (i = curr - 1; i >= 0; i--) {
-               if (linebuf[i] == ESC)
-                       return (1);
-               if (!is_ansi_middle(linebuf[i]))
-                       return (0);
-       }
-       return (0);
-}
-
-/*
  * Is a character the end of an ANSI escape sequence?
  */
 int
@@ -512,6 +491,7 @@ is_ansi_middle(LWCHAR ch)
 static int
 store_char(LWCHAR ch, char a, char *rep, off_t pos)
 {
+       int i;
        int w;
        int replen;
        char cs;
@@ -529,22 +509,43 @@ store_char(LWCHAR ch, char a, char *rep,
                }
        }
 
-       if (ctldisp == OPT_ONPLUS && in_ansi_esc_seq()) {
-               if (!is_ansi_end(ch) && !is_ansi_middle(ch)) {
+       w = -1;
+       if (ctldisp == OPT_ONPLUS) {
+               /*
+                * Set i to the beginning of an ANSI escape sequence
+                * that was begun and not yet ended, or to -1 otherwise.
+                */
+               for (i = curr - 1; i >= 0; i--) {
+                       if (linebuf[i] == ESC)
+                               break;
+                       if (!is_ansi_middle(linebuf[i]))
+                               i = 0;
+               }
+               if (i >= 0 && !is_ansi_end(ch) && !is_ansi_middle(ch)) {
                        /* Remove whole unrecognized sequence.  */
-                       do {
-                               curr--;
-                       } while (curr > 0 && linebuf[curr] != ESC);
+                       curr = i;
                        return (0);
                }
-               a = AT_ANSI;    /* Will force re-AT_'ing around it.  */
-               w = 0;
-       } else if (ctldisp == OPT_ONPLUS && ch == ESC) {
-               a = AT_ANSI;    /* Will force re-AT_'ing around it.  */
-               w = 0;
-       } else {
-               char *p = &linebuf[curr];
-               LWCHAR prev_ch = step_char(&p, -1, linebuf);
+               if (i >= 0 || ch == ESC) {
+                       a = AT_ANSI;  /* Will force re-AT_'ing around it. */
+                       w = 0;
+               }
+       }
+       if (w == -1) {
+               wchar_t prev_ch;
+
+               if (utf_mode) {
+                       for (i = curr - 1; i >= 0; i--)
+                               if (!IS_UTF8_TRAIL(linebuf[i]))
+                                       break;
+                       if (i >= 0) {
+                               w = mbtowc(&prev_ch, linebuf + i, curr - i);
+                               if (w == -1 || i + w < curr)
+                                       prev_ch = L' ';
+                       } else
+                               prev_ch = L' ';
+               } else
+                       prev_ch = curr > 0 ? linebuf[curr - 1] : L' ';
                w = pwidth(ch, a, prev_ch);
        }

less(1) UTF-8 cleanup: store_char()

Reply via email to