UTF-8 support for colrm(1)

Ingo Schwarze Tue, 22 Dec 2015 15:21:00 -0800

Hi,

with the most important small base system utilities fixed with respect
to the most important UTF-8 issues (or at least having patches on
tech@), i still haven't encountered a single case where a function
written for one utility could be reused in another.  So before
tackling larger beasts like shells and editors, i'll look at some
less important utilities in the hope that patterns may finally
emerge more clearly.


Unfortunately, less important doesn't imply "simpler", and doing
something carelessly just because it's less important is not better
than not doing it at all.

For example, colrm(1).

The manual says it is intended to remove characters according to
character positions relative to the beginning of each line.  That
is clear enough as far as it goes.  However, the sentence "tab
characters increment the column count to the next multiple of eight"
casts some doubt.  That would make more sense if the utility
removed characters according to display columns rather than character
positions.  POSIX doesn't help to resolve the ambiguity, since this
is a non-standard, traditional 1BSD utility.

The situation in other operating systems is desolate.

FreeBSD is buggy in a large number of ways.
1. The documentation and code contradict each other.  The manual
   page says that colrm(1) counts characters, while the code actually
   counts display columns.
2. If a character is valid but non-printable, it is silently treated
   as width 0 and not counted, even though the documentation talks
   about characters in general, without restricting that to printable
   characters.
3. If the input contains an invalid byte, the program aborts with
   err(3) instead of doing something sensible.
4. The backspace character (U+0008) backs up by one display position
   rather than by one character.  That causes miscounting when
   backspace follows a zero-width or double-width character.
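
To illustrate item 4, here is a minimal sketch of the two ways a
backspace can adjust the column count.  The struct and function names
are hypothetical and appear in neither FreeBSD nor the patch below;
the underflow guard is also an addition made for the sketch.

```c
/* Hypothetical bookkeeping for one input line. */
struct colstate {
	unsigned long	column;	/* display columns consumed so far */
	int		width;	/* width of the most recent character */
};

/* Account for a character of the given display width. */
void
advance(struct colstate *s, int width)
{
	s->width = width;
	s->column += width;
}

/* Back up by one display position, regardless of the last width. */
void
backspace_by_one(struct colstate *s)
{
	if (s->column > 0)
		s->column--;
}

/* Back up by the width of the previous character. */
void
backspace_by_width(struct colstate *s)
{
	if (s->column >= (unsigned long)s->width)
		s->column -= s->width;
	else
		s->column = 0;
}
```

After a double-width character at the start of a line, the first
variant leaves the count at 1, the second at 0.  After a zero-width
character, the first variant decrements the count even though the
character did not advance it.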

DragonFly and Darwin look similar to FreeBSD.

NetBSD, like OpenBSD, has no multibyte support in colrm(1).

Illumos doesn't appear to have colrm(1) at all.

As might be expected, util-linux is ridiculously complicated
with about three times the amount of code of FreeBSD or us.
If i read that code correctly, it shares all the FreeBSD bugs but
adds at least one additional one: When finding an invalid input
byte, it does exit(0), silently truncating the stream.
https://github.com/karelzak/util-linux/


So the best we can do is implement what we consider most useful
ourselves.

I deem that the use case deciding usefulness is input that contains
combining accent characters (width 0), for these reasons:  If all
characters are width 1, nothing is ambiguous.  If the file is a mix
of width 1 and width 2 characters, one could make arguments for
either behaviour.  But in a file containing width 1 characters with
and without following combining accents, cutting by character
positions would be almost useless because an accented character
wouldn't fit in a single column.  Cutting by display columns, by
contrast, seems useful.  As a nice side effect, this also makes
tabs more useful.  So let's change the manual and cut by display
columns, similar to what FreeBSD does (but does not document), but
without all those bugs.
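
As a rough sketch of what cutting by display columns means for
characters of width 0, 1 and 2, the following predicate mirrors the
keep condition used in the patch below.  The function name keep_char
is hypothetical; column is the number of display columns already
consumed on the line, start and stop are the 1-based arguments, and
0 means "not given".

```c
#include <stdbool.h>

/*
 * Return true if a character of the given display width, starting
 * after "column" already consumed columns, lies completely before
 * or completely after the columns start..stop.
 * A zero-width character belongs to the previous column.
 */
bool
keep_char(unsigned long column, int width, unsigned long start,
    unsigned long stop)
{
	return start == 0 || column + width < start ||
	    (stop > 0 && column + (width > 0) > stop);
}
```

For the input "a", "e", combining acute accent, "b" and the arguments
2 2, this keeps "a" and "b" and drops both the "e" and its accent,
so the accented character is removed as one unit.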

Now, FreeBSD bug #3 is almost unfixable with the approach used there
because that code uses getwchar(3), and when that fails, there is
no way to find out how many bytes were read or what they contained
or even to put them back for re-reading - short of calling ftello(3)
before each read operation or similar insanity.  The function
getwchar(3) is only useful when you want to weed out invalid bytes,
and according to the documentation and the current implementation,
this utility does not want to do that.

The easiest solution is to change the main loop to getline(3)
and to use our familiar mbtowc(3)/wcwidth(3) iteration.
Also note that the check() function is not very useful.
It's sufficient to just check for I/O errors once, at the end.
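
For readers not familiar with that iteration, here is a minimal
sketch of it, reduced to measuring the display width of one line.
The helper name display_width is hypothetical.  Unlike the patch
below, it does not special-case tab, backspace and newline; they
simply count as non-printable characters of width 1.  The caller is
assumed to have called setlocale(LC_CTYPE, "") and to obtain the
line from getline(3).

```c
#define _XOPEN_SOURCE 700	/* wcwidth(3) needs this on some systems */
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Display width of a NUL-terminated line in the current locale.
 * Invalid bytes and non-printable characters count as width 1.
 */
unsigned long
display_width(const char *line)
{
	const char	*p;
	unsigned long	 column;
	wchar_t		 wc;
	int		 len, width;

	assert(line != NULL);
	column = 0;
	for (p = line; *p != '\0'; p += len) {
		if ((len = mbtowc(&wc, p, MB_CUR_MAX)) == -1) {
			/* Invalid byte: reset the state, skip one byte. */
			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
			len = 1;
			width = 1;
		} else if ((width = wcwidth(wc)) == -1)
			width = 1;	/* valid but non-printable */
		column += width;
	}
	return column;
}
```

Because a failed mbtowc(3) leaves the pointer p untouched, the
offending byte is still available and can be passed through or
counted, which is exactly what the getwchar(3) approach cannot do.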

With the current code, various things are broken even with ASCII
input.  Let's fix those while here; it's not all that difficult:
 - Backspace characters are never deleted, not even if they
   follow characters in columns that are deleted.
      $ echo "ab^Hcde" | ocolrm 2 2 | hexdump -C
     00000000  61 08 64 65 0a  |a.de.|
      $ echo "ab^Hcde" | ocolrm 2 3 | hexdump -C
     00000000  61 08 65 0a     |a.e.|
      $ echo "ab^Hcde" | colrm 2 2 | hexdump -C
     00000000  61 64 65 0a     |ade.|
      $ echo "ab^Hcde" | colrm 2 3 | hexdump -C
     00000000  61 65 0a        |ae.|
 - Tabs later on the input line than the deletion are passed
   through, breaking alignment.  Better expand them.
      $ echo "1234567 |\n1234\t|"
     1234567 |
     1234    |  # actually, "1234\t|"
      $ echo "1234567 |\n1234\t|" | ocolrm 1 4
     567 |
             |  # actually, "\t|"
      $ echo "1234567 |\n1234\t|" | colrm 1 4  
     567 |
         |      # actually, four spaces and "|"
 - Tabs are always retained, even if they intersect the deletion,
   breaking alignment.  Better expand them when they intersect
   the deletion such that the appropriate number of blanks can be
   deleted.
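
The tab handling described in the last two items can be sketched as
a standalone function.  The name expand_tab and the output buffer
are hypothetical; the patch below writes directly with putchar(3),
but the decision logic is meant to be the same.

```c
#include <stddef.h>

#define	TAB	8

/*
 * Process one tab encountered after "column" display columns.
 * Write what should be printed to "out" (at least TAB + 1 bytes)
 * and return the new column count.  start and stop are 1-based;
 * 0 means "not given".
 */
unsigned long
expand_tab(unsigned long column, unsigned long start,
    unsigned long stop, char *out)
{
	unsigned long	newcol;
	size_t		n;

	n = 0;
	newcol = (column + TAB) & ~(unsigned long)(TAB - 1);
	if (start == 0 || newcol < start) {
		/* Entirely before the deletion: keep the tab. */
		out[n++] = '\t';
		column = newcol;
	} else {
		/* Intersecting or following the deletion: expand. */
		while (column < newcol) {
			++column;
			if (column < start || (stop > 0 && column > stop))
				out[n++] = ' ';
		}
	}
	out[n] = '\0';
	return column;
}
```

For the second line of the example above, "1234\t|" with the
arguments 1 4, the tab starts after column 4 and ends at column 8,
so columns 5 to 8 are all outside the deletion and four blanks are
printed.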

OK?
  Ingo


Index: colrm.1
===================================================================
RCS file: /cvs/src/usr.bin/colrm/colrm.1,v
retrieving revision 1.8
diff -u -p -r1.8 colrm.1
--- colrm.1     28 Dec 2011 22:27:18 -0000      1.8
+++ colrm.1     22 Dec 2015 23:11:13 -0000
@@ -42,7 +42,6 @@
 .Sh DESCRIPTION
 .Nm
 removes selected columns from the lines of a file.
-A column is defined as a single character in a line.
 Input is read from the standard input.
 Output is written to the standard output.
 .Pp
@@ -63,8 +62,39 @@ or greater than the
 column will be written.
 Column numbering starts with one, not zero.
 .Pp
-Tab characters increment the column count to the next multiple of eight.
-Backspace characters decrement the column count by one.
+Each character occupies the number of columns defined by
+.Xr wcwidth 3 .
+Zero-width characters belong to the previous column rather
+than to the following column.
+If deletion of half of a double-width character is requested,
+its remaining half is replaced by a blank character.
+Non-printable characters are treated as if they had width 1.
+Each invalid byte is regarded as a non-printable character.
+.Pp
+Tab characters increment the input column count to the next multiple
+of eight.
+If they intersect or follow a deletion, they are expanded to blank
+characters such that the original alignment is preserved.
+.Pp
+Backspace characters decrement the column count by the width of the
+previous character.
+If they follow a character that is completely or partially deleted,
+they are deleted together with that character.
+If they follow a character that is partially deleted,
+they also suppress printing of the replacement blank character.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character set
+.Xr locale 1 .
+It decides which sequences of bytes are treated as characters,
+and what their display width is.
+If unset or set to
+.Qq C ,
+.Qq POSIX ,
+or an unsupported value, each byte except tab and backspace is treated
+as a character of width 1.
+.El
 .Sh SEE ALSO
 .Xr awk 1 ,
 .Xr column 1 ,
@@ -80,3 +110,11 @@ utility first appeared in
 wrote the original version of
 .Nm
 in November 1974.
+.Sh BUGS
+If two characters of different widths are followed by two backspace
+characters in a row, the column count is decremented twice by the
+width of the second character rather than by the sum of both widths.
+This is hardly a practical problem because not even backspace
+encoding in
+.Xr roff 7
+style uses such double-backspace sequences.
Index: colrm.c
===================================================================
RCS file: /cvs/src/usr.bin/colrm/colrm.c,v
retrieving revision 1.11
diff -u -p -r1.11 colrm.c
--- colrm.c     9 Oct 2015 01:37:06 -0000       1.11
+++ colrm.c     22 Dec 2015 23:11:13 -0000
@@ -35,22 +35,27 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
 
 #define        TAB     8
 
-void check(FILE *);
 void usage(void);
 
 int
 main(int argc, char *argv[])
 {
-       u_long column, start, stop;
-       int ch;
-       char *p;
+       char     *line, *p;
+       size_t    linesz;
+       wchar_t   wc;
+       u_long    column, newcol, start, stop;
+       int       ch, len, width;
+
+       setlocale(LC_ALL, "");
 
        if (pledge("stdio", NULL) == -1)
                err(1, "pledge");
@@ -85,39 +90,87 @@ main(int argc, char *argv[])
        if (stop && start > stop)
                err(1, "illegal start and stop columns");
 
-       for (column = 0;;) {
-               switch (ch = getchar()) {
-               case EOF:
-                       check(stdin);
-                       break;
-               case '\b':
-                       if (column)
-                               --column;
-                       break;
-               case '\n':
-                       column = 0;
-                       break;
-               case '\t':
-                       column = (column + TAB) & ~(TAB - 1);
-                       break;
-               default:
-                       ++column;
-                       break;
-               }
+       line = NULL;
+       while (getline(&line, &linesz, stdin) != -1) {
+               column = 0;
+               width = 0;
+               for (p = line; *p != '\0'; p += len) {
+                       len = 1;
+                       switch (*p) {
+                       case '\n':
+                               putchar('\n');
+                               continue;
+                       case '\b':
+                               /*
+                                * Pass it through if the previous character
+                                * was in scope, still represented by the
+                                * current value of "column".
+                                */
+                               if (start == 0 || column < start ||
+                                   (stop > 0 && column > stop + (width > 1)))
+                                       putchar('\b');
+                               column -= width;
+                               continue;
+                       case '\t':
+                               newcol = (column + TAB) & ~(TAB - 1);
+                               if (start == 0 || newcol < start) {
+                                       putchar('\t');
+                                       column = newcol;
+                               } else
+                                       /*
+                                        * Expand tabs that intersect or
+                                        * follow deleted columns.
+                                        */
+                                       while (column < newcol)
+                                               if (++column < start ||
+                                                   (stop > 0 &&
+                                                    column > stop))
+                                                       putchar(' ');
+                               continue;
+                       default:
+                               break;
+                       }
+
+                       /*
+                        * Handle the three cases of invalid bytes,
+                        * non-printable, and printable characters.
+                        */
+
+                       if ((len = mbtowc(&wc, p, MB_CUR_MAX)) == -1) {
+                               (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+                               len = 1;
+                               width = 1;
+                       } else if ((width = wcwidth(wc)) == -1)
+                               width = 1;
+
+                       /*
+                        * If the character completely fits before or
+                        * after the cut, keep it; otherwise, skip it.
+                        */
+
+                       if ((start == 0 || column + width < start ||
+                           (stop > 0 && column + (width > 0) > stop)))
+                               fwrite(p, 1, len, stdout);
+
+                       /*
+                        * If the cut cuts the character in half
+                        * and no backspace follows,
+                        * print a blank for correct columnation.
+                        */
+
+                       else if (width > 1 && p[len] != '\b' &&
+                           (start == 0 || column + 1 < start ||
+                           (stop > 0 && column + width > stop)))
+                               putchar(' ');
 
-               if ((!start || column < start || (stop && column > stop)) &&
-                   putchar(ch) == EOF)
-                       check(stdout);
+                       column += width;
+               }
        }
-}
-
-void
-check(FILE *stream)
-{
-       if (feof(stream))
-               exit(0);
-       if (ferror(stream))
-               err(1, "%s", stream == stdin ? "stdin" : "stdout");
+       if (ferror(stdin))
+               err(1, "stdin");
+       if (ferror(stdout))
+               err(1, "stdout");
+       return 0;
 }
 
 void
