Re: UTF-8 support for fmt(1)

2015-12-16 Thread Ingo Schwarze
Hi,

Ingo Schwarze wrote on Tue, Dec 08, 2015 at 10:37:29PM +0100:

> here is UTF-8 support for fmt(1).
> This does not include the -c case; the patch is already large enough.

Meanwhile, i committed that.
Here is a simple solution for the -c case.
The loop in center_stream() is designed to be similar
to the loop in process_stream(), but it's a bit simpler.

This patch implies two changes in behaviour, though.

First, i don't see why fmt -c should pass through invalid bytes,
given that fmt without -c weeds them out.  So handle them the same
way in both cases: replace them with ASCII question marks.
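
To illustrate the replacement rule, here is a minimal sketch, not the
code from the patch.  The function name sanitize_line() is hypothetical;
which bytes count as invalid depends on the locale selected with
setlocale(3).

```c
#include <stdlib.h>

/*
 * Hypothetical helper, not part of fmt.c: overwrite every byte that
 * does not start a valid character in the current locale with '?'.
 * Returns the number of bytes replaced.
 */
size_t
sanitize_line(char *line)
{
	wchar_t wc;
	char *cp;
	size_t replaced = 0;
	int len;

	for (cp = line; *cp != '\0'; cp += len) {
		len = mbtowc(&wc, cp, MB_CUR_MAX);
		if (len == -1) {
			/* Reset the conversion state after the error. */
			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
			*cp = '?';
			len = 1;
			replaced++;
		}
	}
	return replaced;
}
```

Valid characters, including multibyte ones, are skipped as a whole, so
only the offending byte is overwritten.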

Then, the concept of tabs inside lines that are to be centered makes
no sense in the first place.  In the past, the width of such a tab
depended on the leading whitespace on the line, even though that
whitespace was otherwise ignored.  Yet, tabs on subsequent lines
did not align because the leading space on output depends on the
width of the string following the tab.  None of that was useful.
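
The dependence on leading whitespace follows from how tab stops work.
As a sketch, assuming tab stops every 8 columns (the helper name
tab_width() is hypothetical and not taken from fmt.c):

```c
#include <stddef.h>

/*
 * Hypothetical helper: display width of a tab encountered at the
 * given output column, assuming tab stops every 8 columns.
 */
size_t
tab_width(size_t column)
{
	return 8 - column % 8;
}
```

A tab after "ab" at the start of a line begins in column 2 and is 6
columns wide; with three leading blanks before "ab" it begins in
column 5 and is only 3 columns wide.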

I see no way to define the meaning of a tab in a line that is to
be centered in a more useful way.  If we want tabs on subsequent
centered lines to align, the *number* of tabs needed will depend
on the width of the string *following* the last tab.  That is
completely opaque to people writing such files, and i see
no way to prepare such files correctly without experimentation.
Even then, the output positioning of the text preceding the tab
remains ill-defined.

So, i propose that in lines to be centered, we just replace each
tab with one single blank.  That is easy to understand, easy to
implement, and no less useful than any other solution i can think
of.
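
A rough sketch of the proposed behaviour, under the assumption that the
display width of the line has already been measured.  The names
tabs_to_blanks() and center_padding() are hypothetical; the padding
loop mirrors the existing one in center_stream().

```c
#include <stddef.h>

/* Hypothetical helpers illustrating the proposal; not taken from fmt.c. */

/* Replace each tab in the line with one single blank, in place. */
void
tabs_to_blanks(char *line)
{
	for (; *line != '\0'; line++)
		if (*line == '\t')
			*line = ' ';
}

/*
 * Number of leading blanks for a line of the given display width,
 * using the same loop as center_stream(): one blank for every two
 * columns still missing from goal_length.
 */
size_t
center_padding(size_t width, size_t goal_length)
{
	size_t pad = 0;

	while (width < goal_length) {
		pad++;
		width += 2;
	}
	return pad;
}
```

With this rule the padding depends only on the display width of the
line, not on where any tab happened to fall.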

OK?
  Ingo


Index: fmt.c
===
RCS file: /cvs/src/usr.bin/fmt/fmt.c,v
retrieving revision 1.34
diff -u -p -r1.34 fmt.c
--- fmt.c   15 Dec 2015 16:26:17 -  1.34
+++ fmt.c   16 Dec 2015 10:37:27 -
@@ -620,13 +620,29 @@ output_word(size_t indent0, size_t inden
 static void
 center_stream(FILE *stream, const char *name)
 {
-	char *line;
-	size_t l;
+	char *line, *cp;
+	wchar_t wc;
+	size_t l;	/* Display width of the line. */
+	int wcw;	/* Display width of one character. */
+	int wcl;	/* Length in bytes of one character. */
 
 	while ((line = get_line(stream)) != NULL) {
-		while (isspace((unsigned char)*line))
-			++line;
-		l = strlen(line);
+		l = 0;
+		for (cp = line; *cp != '\0'; cp += wcl) {
+			if (*cp == '\t')
+				*cp = ' ';
+			if ((wcl = mbtowc(&wc, cp, MB_CUR_MAX)) == -1) {
+				(void)mbtowc(NULL, NULL, MB_CUR_MAX);
+				*cp = '?';
+				wcl = 1;
+				wcw = 1;
+			} else if ((wcw = wcwidth(wc)) == -1)
+				wcw = 1;
+			if (l == 0 && iswspace(wc))
+				line += wcl;
+			else
+				l += wcw;
+		}
 		while (l < goal_length) {
 			putchar(' ');
 			l += 2;



UTF-8 support for fmt(1)

2015-12-08 Thread Ingo Schwarze
Hi,

here is UTF-8 support for fmt(1).
This does not include the -c case; the patch is already large enough.

Because tedu@ said he didn't see value in splitting the cut(1) diff,
i dare to send it as one big patch.  If anybody wants to have it
split into steps for easier review and a safer transition, please
just say so.  But i don't think changing this program is particularly
dangerous.

The main changes are in three areas:

1. get_line():
This function can no longer expand tabs up front because their width
depends on the display width of characters earlier on the line.
This change causes minor growth in indent_length().

While here, always NUL-terminate the input buffer.  It's safer and
simplifies the code, also reducing the number of arguments for two
functions.

Also delete the contorted spaces_pending logic in get_line(), simply
trim trailing whitespace at the end, and delete the pointless XMALLOC
macro.

2. process_stream():
It used to iterate over bytes; now it iterates over characters.  The
code becomes a bit longer, but because it uses mbtowc(3), wcwidth(3),
and iswblank(3) directly, it stays quite readable.

3. output_word():
It now needs both the length of the word in bytes and its width in
output positions.  The hand-rolled output_buffer complicated matters
for no gain.  Just let stdio do its work.  Simplifies new_paragraph,
too.  Also simplify calling of output_indent() by doing the 0 check
inside.
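
The distinction between byte length and display width in items 2 and 3
can be sketched as follows.  This is a simplified illustration, not the
code from the patch: scan_word() and emit_word() are hypothetical
names, and invalid bytes are simply counted as one column each.

```c
#define _XOPEN_SOURCE 700	/* for wcwidth(3) on some systems */
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: measure the word starting at s.  Stores its
 * length in bytes and its display width, and returns a pointer to
 * the first character after the word.
 */
const char *
scan_word(const char *s, int *length, int *width)
{
	const char *cp;
	wchar_t wc;
	int wcl, wcw;

	*width = 0;
	for (cp = s; *cp != '\0'; cp += wcl) {
		if ((wcl = mbtowc(&wc, cp, MB_CUR_MAX)) == -1) {
			/* Invalid byte: reset state, count one column. */
			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
			wcl = 1;
			wcw = 1;
		} else {
			if (iswblank(wc))
				break;
			/* Non-printable characters count as one column. */
			if ((wcw = wcwidth(wc)) == -1)
				wcw = 1;
		}
		*width += wcw;
	}
	*length = (int)(cp - s);
	return cp;
}

/* Write the word by byte length; advance the column by display width. */
size_t
emit_word(FILE *out, size_t column, const char *word, int length, int width)
{
	fprintf(out, "%.*s", length, word);
	return column + (size_t)width;
}
```

For pure ASCII input the two numbers agree.  In a UTF-8 locale they
differ as soon as a word contains a multibyte character: the byte
length tells stdio how much to write, while the width decides whether
the word still fits on the line.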


All told, the patch shortens the code by four lines.  Not bad
for adding functionality, right?  :-)

OK?
  Ingo


Index: fmt.c
===
RCS file: /cvs/src/usr.bin/fmt/fmt.c,v
retrieving revision 1.33
diff -u -p -r1.33 fmt.c
--- fmt.c   9 Oct 2015 01:37:07 -   1.33
+++ fmt.c   8 Dec 2015 21:15:15 -
@@ -176,6 +176,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /* Something that, we hope, will never be a genuine line length,
  * indentation etc.
@@ -222,7 +224,6 @@ static int grok_mail_headers = 0;   /* tr
 static int format_troff = 0;   /* Format troff? */
 
 static int n_errors = 0;   /* Number of failed files. */
-static char *output_buffer = NULL; /* Output line will be built here */
 static size_t x;   /* Horizontal position in output line */
 static size_t x0;  /* Ditto, ignoring leading whitespace */
 static size_t pending_spaces;  /* Spaces to add before next word */
@@ -232,17 +233,16 @@ static int output_in_paragraph = 0;   /* 
 
 static void	process_named_file(const char *);
 static void	process_stream(FILE *, const char *);
-static size_t	indent_length(const char *, size_t);
+static size_t	indent_length(const char *);
 static int	might_be_header(const char *);
-static void	new_paragraph(size_t, size_t);
-static void	output_word(size_t, size_t, const char *, size_t, size_t);
+static void	new_paragraph(size_t);
+static void	output_word(size_t, size_t, const char *, int, int, int);
 static void	output_indent(size_t);
 static void	center_stream(FILE *, const char *);
-static char	*get_line(FILE *, size_t *);
+static char	*get_line(FILE *);
 static void	*xrealloc(void *, size_t);
 void	usage(void);
 
-#define XMALLOC(x) xrealloc(0, x)
 #define ERRS(x) (x >= 127 ? 127 : ++x)
 
 /* Here is perhaps the right place to mention that this code is
@@ -332,7 +332,6 @@ main(int argc, char *argv[])
 		goal_length = 65;
 	if (max_length == 0)
 		max_length = goal_length+10;
-	output_buffer = XMALLOC(max_length+1);  /* really needn't be longer */
 
/* 2. Process files. */
 
@@ -381,25 +380,31 @@ typedef enum {
 static void
 process_stream(FILE *stream, const char *name)
 {
-	size_t n;
+	const char *wordp, *cp;
+	wchar_t wc;
 	size_t np;
 	size_t last_indent = SILLY; /* how many spaces in last indent? */
 	size_t para_line_number = 0;/* how many lines already read in this para? */
 	size_t first_indent = SILLY;/* indentation of line 0 of paragraph */
+	int wcl;	/* number of bytes in wide character */
+	int wcw;	/* display width of wide character */
+	int word_length;	/* number of bytes in word */
+	int word_width;	/* display width of word */
+	int space_width;	/* display width of space after word */
+	int line_width;	/* display width of line */
 	HdrType prev_header_type = hdr_ParagraphStart;
 	HdrType header_type;
 
 	/* ^-- header_type of previous line; -1 at para start */
 	const char *line;
-	size_t length;
 
 	if (centerP) {
 		center_stream(stream, name);
 		return;
 	}
 
-   while ((lin