Hi, Nicholas Marriott wrote on Tue, Feb 26, 2019 at 07:20:47AM +0000:
> Looks good, ok nicm Thanks to all of you for checking. Here is the next step, slowly coming closer to the meat of the matter: Start cleaning up the function store_char(), which takes a wide character, receiving both the Unicode codepoint "LWCHAR ch" and the UTF-8 multibyte representation "char *rep" as (redundant) input, behaves specially when ch is part of an ESC[...m ANSI sequence, otherwise measures the display width of ch, checks that it still fits on the screen, copies the multibyte representation to the line buffer, and advances the global variables "curr" (byte position) and "column" (visual display position). This is the only function calling in_ansi_esc_seq(), and having that function separate is actually counter-productive: if an escape sequence is found, then store_char() currently does the same backward iteration again if the sequence is incomplete and needs to be deleted. Consequently, merging in_ansi_esc_seq() into store_char() simplifies the code. If ch neither continues an ANSI sequence begun earlier nor starts a new one, the code needs to find the previous character in order to measure the width of the current one with pwidth(): the current one might be BACKSPACE. Currently, this backward iteration is done with the horrific, buggy function step_char(-1) which we aim to delete. Instead, iterate backwards until we find something that is not a UTF-8 continuation byte (i.e. which is an ASCII byte or a UTF-8 start byte) then decode the character with the standard function mbtowc(3). For extra safety, let's not only check that the previous character is valid (mbtowc() != -1) but also that there are no stray continuation bytes between the previous character and the character currently being added (i + w == curr). Maybe invalid UTF-8 can't actually make it into the line buffer, but let's better play it safe. If there is no previous character on the line, if it is invalid, or if there are stray continuation bytes after it, use a dummy blank, which means that in case of doubt, we assume that a backspace backs up by one visual column. Note that the function still contains a call to the buggy function utf_len() that we want to delete. Ultimately, most instances of utf_len() will be replaced with the standard function mblen(3), but we can't do that right now in store_char(). At this point, the function store_char() may still receive bogus characters incorrectly parsed with get_wchar(), and handling these at least consistently (even though not correctly) requires sticking with utf_len() for now. I tested that emboldening with "c\bc" and underlining with "c\b_" still works, even when "c" is a width two East Asian character, and that behaviour is sane with intervening stray continuation bytes. OK? Ingo Index: line.c =================================================================== RCS file: /cvs/src/usr.bin/less/line.c,v retrieving revision 1.24 diff -u -p -r1.24 line.c --- line.c 26 Feb 2019 11:01:54 -0000 1.24 +++ line.c 26 Feb 2019 12:26:42 -0000 @@ -458,27 +458,6 @@ backc(void) } /* - * Are we currently within a recognized ANSI escape sequence? - */ -static int -in_ansi_esc_seq(void) -{ - int i; - - /* - * Search backwards for either an ESC (which means we ARE in a seq); - * or an end char (which means we're NOT in a seq). - */ - for (i = curr - 1; i >= 0; i--) { - if (linebuf[i] == ESC) - return (1); - if (!is_ansi_middle(linebuf[i])) - return (0); - } - return (0); -} - -/* * Is a character the end of an ANSI escape sequence? */ int @@ -512,6 +491,7 @@ is_ansi_middle(LWCHAR ch) static int store_char(LWCHAR ch, char a, char *rep, off_t pos) { + int i; int w; int replen; char cs; @@ -529,22 +509,43 @@ store_char(LWCHAR ch, char a, char *rep, } } - if (ctldisp == OPT_ONPLUS && in_ansi_esc_seq()) { - if (!is_ansi_end(ch) && !is_ansi_middle(ch)) { + w = -1; + if (ctldisp == OPT_ONPLUS) { + /* + * Set i to the beginning of an ANSI escape sequence + * that was begun and not yet ended, or to -1 otherwise. + */ + for (i = curr - 1; i >= 0; i--) { + if (linebuf[i] == ESC) + break; + if (!is_ansi_middle(linebuf[i])) + i = 0; + } + if (i >= 0 && !is_ansi_end(ch) && !is_ansi_middle(ch)) { /* Remove whole unrecognized sequence. */ - do { - curr--; - } while (curr > 0 && linebuf[curr] != ESC); + curr = i; return (0); } - a = AT_ANSI; /* Will force re-AT_'ing around it. */ - w = 0; - } else if (ctldisp == OPT_ONPLUS && ch == ESC) { - a = AT_ANSI; /* Will force re-AT_'ing around it. */ - w = 0; - } else { - char *p = &linebuf[curr]; - LWCHAR prev_ch = step_char(&p, -1, linebuf); + if (i >= 0 || ch == ESC) { + a = AT_ANSI; /* Will force re-AT_'ing around it. */ + w = 0; + } + } + if (w == -1) { + wchar_t prev_ch; + + if (utf_mode) { + for (i = curr - 1; i >= 0; i--) + if (!IS_UTF8_TRAIL(linebuf[i])) + break; + if (i >= 0) { + w = mbtowc(&prev_ch, linebuf + i, curr - i); + if (w == -1 || i + w < curr) + prev_ch = L' '; + } else + prev_ch = L' '; + } else + prev_ch = curr > 0 ? linebuf[curr - 1] : L' '; w = pwidth(ch, a, prev_ch); }