Hi,

with the most important small base system utilities fixed with respect to the most important UTF-8 issues (or at least having patches on tech@), i still haven't encountered a single case where a function written for one utility could be reused in another. So before tackling larger beasts like shells and editors, i'll look at some less important utilities in the hope that patterns may finally emerge more clearly.

Unfortunately, less important doesn't imply "simpler", and doing something carelessly just because it's less important is not better than not doing it at all.

For example, colrm(1). The manual says it is intended to remove characters according to character positions relative to the beginning of each line. That is clear enough as far as it goes. However, the sentence "tab characters increment the column count to the next multiple of eight" casts some doubt. That would make more sense if the utility removed characters according to display columns rather than character positions.

POSIX doesn't help to resolve the ambiguity, since this is a non-standard, traditional 1BSD utility.

The situation in other operating systems is desolate.

FreeBSD is buggy in a large number of ways.

 1. The documentation and code contradict each other. The manual page says that colrm(1) counts characters, while the code actually counts display columns.
 2. If a character is valid but non-printable, it is silently treated as width 0 and not counted, even though the documentation talks about characters in general, without restricting that to printable characters.
 3. If the input contains an invalid byte, the program aborts with err(3) instead of doing something sensible.
 4. The backspace character (U+0008) backs up by one display position rather than by one character. That causes miscounting when backspace follows a zero-width or double-width character.

DragonFly and Darwin look similar to FreeBSD.
NetBSD, like OpenBSD, has no multibyte support in colrm(1).
Illumos doesn't appear to have colrm(1) at all.

As might be expected, util-linux is ridiculously complicated with about three times the amount of code of FreeBSD or us. If i read that code correctly, it shares all the FreeBSD bugs but adds at least one more: when finding an invalid input byte, it does exit(0), silently truncating the stream.
  https://github.com/karelzak/util-linux/

So the best we can do is implement what we consider most useful ourselves. I deem that the use case deciding usefulness is input that contains combining accent characters (width 0), for these reasons:

If all characters are width 1, nothing is ambiguous.

If the file is a mix of width 1 and width 2 characters, one could make arguments for either behaviour.

But in a file containing width 1 characters with and without following combining accents, cutting by character positions would be almost useless because an accented character wouldn't fit in a single column. Cutting by display columns, by contrast, seems useful.

As a nice side effect, this also makes tabs more useful.

So let's change the manual and cut by display columns, similar to what FreeBSD does (but does not document), but without all those bugs.

Now, FreeBSD bug #3 is almost unfixable with the approach used there because that code uses getwchar(3), and when that fails, there is no way to find out how many bytes were read or what they contained or even to put them back for re-reading - short of calling ftello(3) before each read operation or similar insanity. The function getwchar(3) is only useful when you want to weed out invalid bytes, and according to the documentation and the current implementation, this utility does not want to do that.

The easiest solution is to change the main loop to getline(3) and to use our familiar mbtowc(3)/wcwidth(3) iteration.

Also note that the check() function is not very useful. It's sufficient to just check for I/O errors once, at the end.

With the current code, various things are broken even with ASCII input. Let's fix those while here; it's not all that difficult:

 - Backspace characters are never deleted, not even if they follow characters in columns that are deleted.

   $ echo "ab^Hcde" | ocolrm 2 2 | hexdump -C
   00000000  61 08 64 65 0a  |a.de.|
   $ echo "ab^Hcde" | ocolrm 2 3 | hexdump -C
   00000000  61 08 65 0a  |a.e.|
   $ echo "ab^Hcde" | colrm 2 2 | hexdump -C
   00000000  61 64 65 0a  |ade.|
   $ echo "ab^Hcde" | colrm 2 3 | hexdump -C
   00000000  61 65 0a  |ae.|

 - Tabs later on the input line than the deletion are passed through, breaking alignment. Better expand them.

   $ echo "1234567 |\n1234\t|"
   1234567 |
   1234    |   # actually, "1234\t|"
   $ echo "1234567 |\n1234\t|" | ocolrm 1 4
   567 |
           |   # actually, "\t|"
   $ echo "1234567 |\n1234\t|" | colrm 1 4
   567 |
       |   # actually, four spaces and "|"

 - Tabs are always retained, even if they intersect the deletion, breaking alignment. Better expand them when they intersect the deletion such that the appropriate number of blanks can be deleted.

OK?
  Ingo


Index: colrm.1
===================================================================
RCS file: /cvs/src/usr.bin/colrm/colrm.1,v
retrieving revision 1.8
diff -u -p -r1.8 colrm.1
--- colrm.1	28 Dec 2011 22:27:18 -0000	1.8
+++ colrm.1	22 Dec 2015 23:11:13 -0000
@@ -42,7 +42,6 @@
 .Sh DESCRIPTION
 .Nm
 removes selected columns from the lines of a file.
-A column is defined as a single character in a line.
 Input is read from the standard input.
 Output is written to the standard output.
 .Pp
@@ -63,8 +62,39 @@ or greater than the
 column will be written.
 Column numbering starts with one, not zero.
 .Pp
-Tab characters increment the column count to the next multiple of eight.
-Backspace characters decrement the column count by one.
+Each character occupies the number of columns defined by
+.Xr wcwidth 3 .
+Zero-width characters belong to the previous column rather
+than to the following column.
+If deletion of half of a double-width character is requested,
+its remaining half is replaced by a blank character.
+Non-printable characters are treated as if they had width 1.
+Each invalid byte is regarded as a non-printable character.
+.Pp
+Tab characters increment the input column count to the next multiple
+of eight.
+If they intersect or follow a deletion, they are expanded to blank
+characters such that the original alignment is preserved.
+.Pp
+Backspace characters decrement the column count by the width of the
+previous character.
+If they follow a character that is completely or partially deleted,
+they are deleted together with that character.
+If they follow a character that is partially deleted,
+they also suppress printing of the replacement blank character.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character set
+.Xr locale 1 .
+It decides which sequences of bytes are treated as characters,
+and what their display width is.
+If unset or set to
+.Qq C ,
+.Qq POSIX ,
+or an unsupported value, each byte except tab and backspace is treated
+as a character of width 1.
+.El
 .Sh SEE ALSO
 .Xr awk 1 ,
 .Xr column 1 ,
@@ -80,3 +110,11 @@ utility first appeared in
 wrote the original version of
 .Nm
 in November 1974.
+.Sh BUGS
+If two characters of different widths are followed by two backspace
+characters in a row, the column count is decremented twice by the
+width of the second character rather than by the sum of both widths.
+This is hardly a practical problem because not even backspace
+encoding in
+.Xr roff 7
+style uses such double-backspace sequences.
Index: colrm.c
===================================================================
RCS file: /cvs/src/usr.bin/colrm/colrm.c,v
retrieving revision 1.11
diff -u -p -r1.11 colrm.c
--- colrm.c	9 Oct 2015 01:37:06 -0000	1.11
+++ colrm.c	22 Dec 2015 23:11:13 -0000
@@ -35,22 +35,27 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>

 #define TAB 8

-void check(FILE *);
 void usage(void);

 int
 main(int argc, char *argv[])
 {
-	u_long column, start, stop;
-	int ch;
-	char *p;
+	char *line, *p;
+	ssize_t linesz;
+	wchar_t wc;
+	u_long column, newcol, start, stop;
+	int ch, len, width;
+
+	setlocale(LC_ALL, "");

 	if (pledge("stdio", NULL) == -1)
 		err(1, "pledge");
@@ -85,39 +90,87 @@ main(int argc, char *argv[])
 	if (stop && start > stop)
 		err(1, "illegal start and stop columns");

-	for (column = 0;;) {
-		switch (ch = getchar()) {
-		case EOF:
-			check(stdin);
-			break;
-		case '\b':
-			if (column)
-				--column;
-			break;
-		case '\n':
-			column = 0;
-			break;
-		case '\t':
-			column = (column + TAB) & ~(TAB - 1);
-			break;
-		default:
-			++column;
-			break;
-		}
+	line = NULL;
+	while (getline(&line, &linesz, stdin) != -1) {
+		column = 0;
+		width = 0;
+		for (p = line; *p != '\0'; p += len) {
+			len = 1;
+			switch (*p) {
+			case '\n':
+				putchar('\n');
+				continue;
+			case '\b':
+				/*
+				 * Pass it through if the previous character
+				 * was in scope, still represented by the
+				 * current value of "column".
+				 */
+				if (start == 0 || column < start ||
+				    (stop > 0 && column > stop + (width > 1)))
+					putchar('\b');
+				column -= width;
+				continue;
+			case '\t':
+				newcol = (column + TAB) & ~(TAB - 1);
+				if (start == 0 || newcol < start) {
+					putchar('\t');
+					column = newcol;
+				} else
+					/*
+					 * Expand tabs that intersect or
+					 * follow deleted columns.
+					 */
+					while (column < newcol)
+						if (++column < start ||
+						    (stop > 0 &&
+						    column > stop))
+							putchar(' ');
+				continue;
+			default:
+				break;
+			}
+
+			/*
+			 * Handle the three cases of invalid bytes,
+			 * non-printable, and printable characters.
+			 */
+
+			if ((len = mbtowc(&wc, p, MB_CUR_MAX)) == -1) {
+				(void)mbtowc(NULL, NULL, MB_CUR_MAX);
+				len = 1;
+				width = 1;
+			} else if ((width = wcwidth(wc)) == -1)
+				width = 1;
+
+			/*
+			 * If the character completely fits before or
+			 * after the cut, keep it; otherwise, skip it.
+			 */
+
+			if ((start == 0 || column + width < start ||
+			    (stop > 0 && column + (width > 0) > stop)))
+				fwrite(p, 1, len, stdout);
+
+			/*
+			 * If the cut cuts the character in half
+			 * and no backspace follows,
+			 * print a blank for correct columnation.
+			 */
+
+			else if (width > 1 && p[len] != '\b' &&
+			    (start == 0 || column + 1 < start ||
+			    (stop > 0 && column + width > stop)))
+				putchar(' ');

-		if ((!start || column < start || (stop && column > stop)) &&
-		    putchar(ch) == EOF)
-			check(stdout);
+			column += width;
+		}
 	}
-}
-
-void
-check(FILE *stream)
-{
-	if (feof(stream))
-		exit(0);
-	if (ferror(stream))
-		err(1, "%s", stream == stdin ? "stdin" : "stdout");
+	if (ferror(stdin))
+		err(1, "stdin");
+	if (ferror(stdout))
+		err(1, "stdout");
+	return 0;
 }

 void
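
For readers who want to see the mbtowc(3)/wcwidth(3) iteration on its own, here is a minimal sketch that is not part of the patch. It only counts the display columns of one line, using the same rules as the patch for invalid bytes (one byte, width 1), non-printable characters (width 1), and tabs (next multiple of eight). The name display_columns() is a hypothetical helper chosen for illustration, and backspace handling is deliberately left out. The _XOPEN_SOURCE definition is an assumption for portability: some C libraries only declare wcwidth(3) when it is set.

```c
#define _XOPEN_SOURCE 700	/* assumption: needed on some systems for wcwidth(3) */
#include <stdlib.h>
#include <wchar.h>

#define TAB 8

/*
 * Hypothetical helper, not part of the patch: return the number of
 * display columns occupied by the line s, stopping at a newline or
 * at the terminating NUL.  Backspace is not treated specially here.
 */
size_t
display_columns(const char *s)
{
	wchar_t wc;
	size_t column = 0;
	int len, width;

	for (; *s != '\0' && *s != '\n'; s += len) {
		len = 1;
		if (*s == '\t') {
			/* Advance to the next multiple of TAB. */
			column = (column + TAB) & ~(size_t)(TAB - 1);
			continue;
		}
		if ((len = mbtowc(&wc, s, MB_CUR_MAX)) == -1) {
			/* Invalid byte: reset the state, count one column. */
			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
			len = 1;
			width = 1;
		} else if ((width = wcwidth(wc)) == -1)
			width = 1;	/* Non-printable character. */
		column += width;
	}
	return column;
}
```

A caller would normally run setlocale(LC_CTYPE, "") first so that multibyte sequences are decoded according to the user's locale; without that call the C locale stays in effect. With the tab example from above, display_columns("1234\t|") returns 9: four columns for the digits, the tab advancing to column 8, and one more for the bar.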
