Hi, the first step in cleaning up do_append() is the subroutine backc(), which is only called from a single place, namely in do_append(). Its purpose is to back up over the previous positive-width character during the application of backspace-encoding for "bold" or "underlined".
The task here is to get rid of the data type LWCHAR and to stop calling the bad function step_char(). I tried to figure out whether the content of linebuf[] is guaranteed to always be valid UTF-8, but i still don't feel completely convinced. For example, the function forw_raw_line() in line.c looks suspicious, as if it might potentially copy arbitrary, invalidly encoded data into linebuf[]. Consequently, i consider it safer to perform complete error handling in backc(), at least for now. The new code for the two backward iterations in backc() looks slightly repetive. But since in the first iteration (for ch), failure means there is no previous character whatsoever and we have to return failure, whereas for the second iteration (for prev_ch), failure only means that we need to call pwidth() without passing a previous character, error handling is different, so trying to unify both backward iterations would probably make the code more complicated rather than simpler. Note that while prev_ch failure must not prevent the width measurement of ch with pwidth(), if that returns 0 (i.e. when ch is a combining accent), iterating to the preceding character only makes sense when the preceding character is valid, so i'm adding "if (prev_ch == L'\0') return (0);". The lack of such a check in the old code may have been a bug, but i'm not completely sure it could actually be triggered in some way. Finally, when mbtowc(3) fails, portability requires to reset the internal state of mbtowc(3). That's a no-op on OpenBSD, and it is unlikely to actually be required for UTF-8 even on other systems because internal state is only supposed to be useful for stateful encodings, but i consider it good idiomatic coding to include the reset anyway. I'm also adding it to one place where i forgot about it in an earlier patch. Of course, it is only required for len == -1, but it does no harm for the (less likely) case of stray intervening continuation bytes (i + len < curr). OK? Ingo Index: line.c =================================================================== RCS file: /cvs/src/usr.bin/less/line.c,v retrieving revision 1.26 diff -u -p -r1.26 line.c --- line.c 1 Mar 2019 16:41:37 -0000 1.26 +++ line.c 2 Mar 2019 11:54:41 -0000 @@ -437,23 +437,51 @@ pwidth(wchar_t ch, int a, wchar_t prev_c static int backc(void) { - LWCHAR prev_ch; - char *p = linebuf + curr; - LWCHAR ch = step_char(&p, -1, linebuf + lmargin); - int width; + wchar_t ch, prev_ch; + int i, len, width; + + i = curr - 1; + if (utf_mode) { + while (i >= lmargin && IS_UTF8_TRAIL(linebuf[i])) + i--; + } + if (i < lmargin) + return (0); + if (utf_mode) { + len = mbtowc(&ch, linebuf + i, curr - i); + if (len == -1 || i + len < curr) { + (void)mbtowc(NULL, NULL, MB_CUR_MAX); + return (0); + } + } else + ch = linebuf[i]; /* This assumes that there is no '\b' in linebuf. */ while (curr > lmargin && column > lmargin && (!(attr[curr - 1] & (AT_ANSI|AT_BINARY)))) { - curr = p - linebuf; - prev_ch = step_char(&p, -1, linebuf + lmargin); + curr = i--; + if (utf_mode) { + while (i >= lmargin && IS_UTF8_TRAIL(linebuf[i])) + i--; + } + if (i < lmargin) + prev_ch = L'\0'; + else if (utf_mode) { + len = mbtowc(&prev_ch, linebuf + i, curr - i); + if (len == -1 || i + len < curr) { + (void)mbtowc(NULL, NULL, MB_CUR_MAX); + prev_ch = L'\0'; + } + } else + prev_ch = linebuf[i]; width = pwidth(ch, attr[curr], prev_ch); column -= width; if (width > 0) return (1); + if (prev_ch == L'\0') + return (0); ch = prev_ch; } - return (0); } @@ -536,8 +564,10 @@ store_char(LWCHAR ch, char a, char *rep, break; if (i >= 0) { w = mbtowc(&prev_ch, linebuf + i, curr - i); - if (w == -1 || i + w < curr) + if (w == -1 || i + w < curr) { + (void)mbtowc(NULL, NULL, MB_CUR_MAX); prev_ch = L' '; + } } else prev_ch = L' '; } else