less(1) UTF-8 cleanup: backc()

Ingo Schwarze Sat, 02 Mar 2019 04:43:26 -0800

Hi,

the first step in cleaning up do_append() is the subroutine backc(),
which is only called from a single place, namely in do_append().
Its purpose is to back up over the previous positive-width character
during the application of backspace-encoding for "bold" or "underlined".


The task here is to get rid of the data type LWCHAR and to stop
calling the bad function step_char().

I tried to figure out whether the content of linebuf[] is guaranteed
to always be valid UTF-8, but i still don't feel completely convinced.
For example, the function forw_raw_line() in line.c looks suspicious,
as if it might potentially copy arbitrary, invalidly encoded data into
linebuf[].  Consequently, i consider it safer to perform complete
error handling in backc(), at least for now.

The new code for the two backward iterations in backc() looks slightly
repetive.  But since in the first iteration (for ch), failure means
there is no previous character whatsoever and we have to return failure,
whereas for the second iteration (for prev_ch), failure only means
that we need to call pwidth() without passing a previous character,
error handling is different, so trying to unify both backward iterations
would probably make the code more complicated rather than simpler.

Note that while prev_ch failure must not prevent the width measurement
of ch with pwidth(), if that returns 0 (i.e. when ch is a combining
accent), iterating to the preceding character only makes sense when
the preceding character is valid, so i'm adding "if (prev_ch == L'\0')
return (0);".  The lack of such a check in the old code may have
been a bug, but i'm not completely sure it could actually be triggered
in some way.

Finally, when mbtowc(3) fails, portability requires to reset the
internal state of mbtowc(3).  That's a no-op on OpenBSD, and it is
unlikely to actually be required for UTF-8 even on other systems
because internal state is only supposed to be useful for stateful
encodings, but i consider it good idiomatic coding to include the
reset anyway.  I'm also adding it to one place where i forgot
about it in an earlier patch.  Of course, it is only required for
len == -1, but it does no harm for the (less likely) case of stray
intervening continuation bytes (i + len < curr).

OK?
  Ingo


Index: line.c
===================================================================
RCS file: /cvs/src/usr.bin/less/line.c,v
retrieving revision 1.26
diff -u -p -r1.26 line.c
--- line.c      1 Mar 2019 16:41:37 -0000       1.26
+++ line.c      2 Mar 2019 11:54:41 -0000
@@ -437,23 +437,51 @@ pwidth(wchar_t ch, int a, wchar_t prev_c
 static int
 backc(void)
 {
-       LWCHAR prev_ch;
-       char *p = linebuf + curr;
-       LWCHAR ch = step_char(&p, -1, linebuf + lmargin);
-       int width;
+       wchar_t  ch, prev_ch;
+       int      i, len, width;
+
+       i = curr - 1;
+       if (utf_mode) {
+               while (i >= lmargin && IS_UTF8_TRAIL(linebuf[i]))
+                       i--;
+       }
+       if (i < lmargin)
+               return (0);
+       if (utf_mode) {
+               len = mbtowc(&ch, linebuf + i, curr - i);
+               if (len == -1 || i + len < curr) {
+                       (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+                       return (0);
+               }
+       } else
+               ch = linebuf[i];
 
        /* This assumes that there is no '\b' in linebuf.  */
        while (curr > lmargin && column > lmargin &&
            (!(attr[curr - 1] & (AT_ANSI|AT_BINARY)))) {
-               curr = p - linebuf;
-               prev_ch = step_char(&p, -1, linebuf + lmargin);
+               curr = i--;
+               if (utf_mode) {
+                       while (i >= lmargin && IS_UTF8_TRAIL(linebuf[i]))
+                               i--;
+               }
+               if (i < lmargin)
+                       prev_ch = L'\0';
+               else if (utf_mode) {
+                       len = mbtowc(&prev_ch, linebuf + i, curr - i);
+                       if (len == -1 || i + len < curr) {
+                               (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+                               prev_ch = L'\0';
+                       }
+               } else
+                       prev_ch = linebuf[i];
                width = pwidth(ch, attr[curr], prev_ch);
                column -= width;
                if (width > 0)
                        return (1);
+               if (prev_ch == L'\0')
+                       return (0);
                ch = prev_ch;
        }
-
        return (0);
 }
 
@@ -536,8 +564,10 @@ store_char(LWCHAR ch, char a, char *rep,
                                        break;
                        if (i >= 0) {
                                w = mbtowc(&prev_ch, linebuf + i, curr - i);
-                               if (w == -1 || i + w < curr)
+                               if (w == -1 || i + w < curr) {
+                                       (void)mbtowc(NULL, NULL, MB_CUR_MAX);
                                        prev_ch = L' ';
+                               }
                        } else
                                prev_ch = L' ';
                } else

less(1) UTF-8 cleanup: backc()

Reply via email to