Hi,

Ingo Schwarze wrote on Wed, Dec 23, 2015 at 07:44:05PM +0100:
> Steffen Nurpmeso wrote on Wed, Dec 23, 2015 at 11:45:36AM +0100:
>> Ingo Schwarze <[email protected]> wrote:

>>> For example, colrm(1).

[Regarding one of the various bugs in FreeBSD:]
>>> 4. The backspace character (U+0008) backs up by one display position
>>>    rather than by one character.  That causes miscounting when
>>>    backspace follows a zero-width or double-width character.

>> this however is unfortunately common behaviour for terminals, too.

> Sure, i noticed that in xterm(1) during testing.  I neither said
> that all other software is perfect, nor that mixing backspace
> encoding with UTF-8 is particularly robust this season.  I'm just
> trying to improve colrm(1) here.
> 
> Note that making backspace back up by one character is not making
> anything worse.  Sure, xterm(1) displays it badly, but some other
> programs, for example less(1), already implement the better semantics.
> So changing additional utilities to also use the better semantics
> makes the system better and more consistent overall.
> 
> Telling people to add TWO backspaces after a double-width character
> to please xterm(1) would not be reasonable.  It breaks less(1) in
> an even worse way than using one backspace breaks xterm(1).  And
> then we would have to change less(1) and mandoc(1) and groff(1) and
> probably more programs, people would have to change their files,
> and we would have a more awkward and more complicated semantics.
> 
> So, remember this rule:
> 
>  +----------------------------------------------------------------+
>  | Backspace removes the previous character, no matter its width. |
>  +--------++--------------------------------------------++--------+
>           ||                                            ||
>           ||                                            ||

I just noticed that the POSIX specification of fold(1) also requires
two backspace characters after double-width characters.  Yes, that's
very stupid, the POSIX committee apparently never looked at nroff.
I'm not planning to make all our utilities, not even fold(1), choke
on nroff and mandoc output just to please POSIX.

However, there is one simple measure that can be taken to at least
partially mitigate the unavoidable POSIX violation:  *If* a
double-width character is followed by two backspace characters as
required by POSIX (which is likely to occur rarely, i'm not aware
of any tools producing such output), treat the two backspaces just
like one.  That way, our tools will be compatible with POSIX tools
both ways:  Our tools will be able to process input prepared for
POSIX tools, and after processing such input, the output generated
by our tools will still be usable by POSIX tools.  Of course, that
won't magically make the (broken) POSIX tools able to handle nroff
output, which our tools will continue to handle just fine.
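
To illustrate the rule in isolation, here is a minimal sketch of the
column bookkeeping only.  It is not the code from the patch: the
token array, the BS marker and the function name final_column() are
hypothetical, widths are given directly instead of coming from
wcwidth(3), and clamping the column at zero is a choice made for
this sketch.

```c
#include <stddef.h>

/*
 * Hypothetical token encoding for this sketch: a value >= 0 is the
 * display width of one character, BS stands for one backspace byte.
 */
#define BS (-1)

/* Return the column count after processing n tokens. */
static long
final_column(const int *tok, size_t n)
{
	long	 column = 0;
	int	 width = 0;	/* width of the previous character */
	size_t	 i;

	for (i = 0; i < n; i++) {
		if (tok[i] != BS) {
			width = tok[i];
			column += width;
			continue;
		}
		/*
		 * A second backspace directly after a double-width
		 * character belongs to that character: consume it
		 * without decrementing the column a second time.
		 */
		if (width > 1 && i + 1 < n && tok[i + 1] == BS)
			i++;
		column -= width;
		if (column < 0)		/* clamping is this sketch's choice */
			column = 0;
	}
	return column;
}
```

With these assumptions, a narrow character, a double-width character,
one backspace and another narrow character end in the same column as
the same sequence written with two backspaces, which is the
compatibility property described above.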

I updated my colrm(1) patch in that spirit, changing nothing but
the "case '\b':" in the main processing loop and adding a sentence
to the manual page:

     For compatibility with IEEE Std 1003.1-2008 ("POSIX.1") fold(1),
     if a double-width character is followed by two backspace
     characters instead of the usual one, both are regarded as
     belonging to that character, and the second one does not
     decrement the column count.

The rest of the manual page is already OK jmc@.

Since there was no opposition on tech@ when i showed this patch
about two weeks ago, i'm planning to commit it in the next few days,
unless there are objections.  An OK is still welcome, of course.

The rest of the rationale is available here:

  http://marc.info/?l=openbsd-tech&m=145082694731970

Yours,
  Ingo


Index: colrm.1
===================================================================
RCS file: /cvs/src/usr.bin/colrm/colrm.1,v
retrieving revision 1.8
diff -u -p -r1.8 colrm.1
--- colrm.1     28 Dec 2011 22:27:18 -0000      1.8
+++ colrm.1     10 Jan 2016 15:48:24 -0000
@@ -42,7 +42,6 @@
 .Sh DESCRIPTION
 .Nm
 removes selected columns from the lines of a file.
-A column is defined as a single character in a line.
 Input is read from the standard input.
 Output is written to the standard output.
 .Pp
@@ -63,8 +62,46 @@ or greater than the
 column will be written.
 Column numbering starts with one, not zero.
 .Pp
-Tab characters increment the column count to the next multiple of eight.
-Backspace characters decrement the column count by one.
+Each character occupies the number of columns defined by
+.Xr wcwidth 3 .
+Zero-width characters belong to the previous column rather
+than to the following column.
+If deletion of half of a double-width character is requested,
+its remaining half is replaced by a blank character.
+Non-printable characters are treated as if they had width 1.
+Each invalid byte is regarded as a non-printable character.
+.Pp
+Tab characters increment the input column count to the next multiple
+of eight.
+If they intersect or follow a deletion, they are expanded to blank
+characters such that the original alignment is preserved.
+.Pp
+Backspace characters decrement the column count by the width of the
+previous character.
+If they follow a character that is completely or partially deleted,
+they are deleted together with that character.
+If they follow a character that is partially deleted,
+they also suppress printing of the replacement blank character.
+.Pp
+For compatibility with
+.St -p1003.1-2008
+.Xr fold 1 ,
+if a double-width character is followed by two backspace characters
+instead of the usual one, both are regarded as belonging to that
+character, and the second one does not decrement the column count.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character set
+.Xr locale 1 .
+It decides which sequences of bytes are treated as characters,
+and what their display width is.
+If unset or set to
+.Qq C ,
+.Qq POSIX ,
+or an unsupported value, each byte except tab and backspace is treated
+as a character of width 1.
+.El
 .Sh SEE ALSO
 .Xr awk 1 ,
 .Xr column 1 ,
@@ -80,3 +117,11 @@ utility first appeared in
 wrote the original version of
 .Nm
 in November 1974.
+.Sh BUGS
+If two characters of different widths are followed by two backspace
+characters in a row, the column count is decremented twice by the
+width of the second character rather than by the sum of both widths.
+This is hardly a practical problem because not even backspace
+encoding in
+.Xr roff 7
+style uses such double-backspace sequences.
Index: colrm.c
===================================================================
RCS file: /cvs/src/usr.bin/colrm/colrm.c,v
retrieving revision 1.11
diff -u -p -r1.11 colrm.c
--- colrm.c     9 Oct 2015 01:37:06 -0000       1.11
+++ colrm.c     10 Jan 2016 15:48:24 -0000
@@ -35,22 +35,27 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
 
 #define        TAB     8
 
-void check(FILE *);
 void usage(void);
 
 int
 main(int argc, char *argv[])
 {
-       u_long column, start, stop;
-       int ch;
-       char *p;
+       char     *line, *p;
+       size_t    linesz;
+       wchar_t   wc;
+       u_long    column, newcol, start, stop;
+       int       ch, len, width;
+
+       setlocale(LC_ALL, "");
 
        if (pledge("stdio", NULL) == -1)
                err(1, "pledge");
@@ -85,39 +90,95 @@ main(int argc, char *argv[])
        if (stop && start > stop)
                err(1, "illegal start and stop columns");
 
-       for (column = 0;;) {
-               switch (ch = getchar()) {
-               case EOF:
-                       check(stdin);
-                       break;
-               case '\b':
-                       if (column)
-                               --column;
-                       break;
-               case '\n':
-                       column = 0;
-                       break;
-               case '\t':
-                       column = (column + TAB) & ~(TAB - 1);
-                       break;
-               default:
-                       ++column;
-                       break;
-               }
+       line = NULL;
+       while (getline(&line, &linesz, stdin) != -1) {
+               column = 0;
+               width = 0;
+               for (p = line; *p != '\0'; p += len) {
+                       len = 1;
+                       switch (*p) {
+                       case '\n':
+                               putchar('\n');
+                               continue;
+                       case '\b':
+                               /*
+                                * Pass it through if the previous character
+                                * was in scope, still represented by the
+                                * current value of "column".
+                                * Allow an optional second backspace
+                                * after a double-width character.
+                                */
+                               if (start == 0 || column < start ||
+                                   (stop > 0 &&
+                                    column > stop + (width > 1))) {
+                                       putchar('\b');
+                                       if (width > 1 && p[1] == '\b')
+                                               putchar('\b');
+                               }
+                               if (width > 1 && p[1] == '\b')
+                                       p++;
+                               column -= width;
+                               continue;
+                       case '\t':
+                               newcol = (column + TAB) & ~(TAB - 1);
+                               if (start == 0 || newcol < start) {
+                                       putchar('\t');
+                                       column = newcol;
+                               } else
+                                       /*
+                                        * Expand tabs that intersect or
+                                        * follow deleted columns.
+                                        */
+                                       while (column < newcol)
+                                               if (++column < start ||
+                                                   (stop > 0 &&
+                                                    column > stop))
+                                                       putchar(' ');
+                               continue;
+                       default:
+                               break;
+                       }
+
+                       /*
+                        * Handle the three cases of invalid bytes,
+                        * non-printable, and printable characters.
+                        */
+
+                       if ((len = mbtowc(&wc, p, MB_CUR_MAX)) == -1) {
+                               (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+                               len = 1;
+                               width = 1;
+                       } else if ((width = wcwidth(wc)) == -1)
+                               width = 1;
+
+                       /*
+                        * If the character completely fits before or
+                        * after the cut, keep it; otherwise, skip it.
+                        */
+
+                       if ((start == 0 || column + width < start ||
+                           (stop > 0 && column + (width > 0) > stop)))
+                               fwrite(p, 1, len, stdout);
+
+                       /*
+                        * If the cut cuts the character in half
+                        * and no backspace follows,
+                        * print a blank for correct columnation.
+                        */
+
+                       else if (width > 1 && p[len] != '\b' &&
+                           (start == 0 || column + 1 < start ||
+                           (stop > 0 && column + width > stop)))
+                               putchar(' ');
 
-               if ((!start || column < start || (stop && column > stop)) &&
-                   putchar(ch) == EOF)
-                       check(stdout);
+                       column += width;
+               }
        }
-}
-
-void
-check(FILE *stream)
-{
-       if (feof(stream))
-               exit(0);
-       if (ferror(stream))
-               err(1, "%s", stream == stdin ? "stdin" : "stdout");
+       if (ferror(stdin))
+               err(1, "stdin");
+       if (ferror(stdout))
+               err(1, "stdout");
+       return 0;
 }
 
 void
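
The tab handling in the patch can be looked at on its own.  The
following is a minimal sketch that restates the arithmetic of the
"case '\t':" branch outside the main loop; the helper names
next_tab_stop() and tab_blanks() are hypothetical and do not exist
in colrm.c, and the functions count blanks instead of writing them.

```c
#define TAB 8

/* Next multiple of TAB strictly greater than column. */
static unsigned long
next_tab_stop(unsigned long column)
{
	return (column + TAB) & ~(unsigned long)(TAB - 1);
}

/*
 * Number of blanks that replace a tab found at "column" when the
 * columns start..stop are deleted (stop == 0 means "to the end of
 * the line").  Returns 0 when the tab ends before the deletion
 * and is therefore written unchanged.
 */
static unsigned long
tab_blanks(unsigned long column, unsigned long start, unsigned long stop)
{
	unsigned long	 newcol, blanks;

	newcol = next_tab_stop(column);
	if (start == 0 || newcol < start)
		return 0;
	blanks = 0;
	while (column < newcol)
		if (++column < start || (stop > 0 && column > stop))
			blanks++;
	return blanks;
}
```

For example, a tab at column 0 with columns 3 to 5 deleted covers
columns 1 to 8; three of those are cut, so five blanks remain and
the text after the tab keeps its alignment relative to the shortened
line.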
