Re: UTF-8 support for wc(1)

2015-12-04 Thread Ingo Schwarze
Hi Todd,

Todd C. Miller wrote on Thu, Dec 03, 2015 at 11:40:55AM -0700:
> On Sun, 29 Nov 2015 17:45:55 +0100, Ingo Schwarze wrote:

>> our wc(1) utility currently violates POSIX in two ways:
>> 
>>  1. The -m option counts bytes instead of characters.
>> The patch given below fixes that.
>> 
>>  2. Word counting with -w only treats ASCII whitespace as word
>> boundaries and regards two words joined by non-ASCII whitespace
>> as one single word.
>> 
>> The second issue is not related to UTF-8, but a matter of full
>> Unicode support.  It would not be hard to fix that by using
>> mbtowc(3) and iswblank(3) instead of mblen(3).  However, i don't
>> think we want to pollute our base system tools with functions
>> requiring full Unicode support, not even to the extent available
>> in our own C library.  So i consider iswblank(3) taboo for now.

> I'm a little surprised by this.  It doesn't seem like it would be
> any more complicated to use mbtowc(3) and iswblank(3) for the
> multibyte case.

Reconsidering, your argument makes sense to me.  Even if we implement
a simplified lookup table in the future, it doesn't complicate matters.
We already include data for iswprint(3) and wcwidth(3); iswspace(3)
is not more expensive and probably about as often needed.

So let's include iswblank(3) and iswspace(3) in the list of
functions that we are willing to use.  Of course, that still doesn't
mean that we can do full Unicode support (think of collations etc.).
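
For readers following the thread, the mbtowc(3) plus iswspace(3)
approach discussed here can be sketched roughly as follows.  This is
a minimal illustration, not the code from the patch: the names
struct counts and count_buf() are made up for the example, and
counting each undecodable byte as one non-space character is an
assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result type, not part of wc.c. */
struct counts {
	long chars;
	long words;
};

/*
 * Count characters and words in buf[0..len) according to the
 * current LC_CTYPE locale.  A word is a maximal run of characters
 * for which iswspace(3) is false.
 */
struct counts
count_buf(const char *buf, size_t len)
{
	struct counts c = { 0, 0 };
	const char *p = buf;
	const char *end = buf + len;
	wchar_t wc;
	int n;
	int gotsp = 1;		/* previous character was whitespace */

	assert(buf != NULL);
	(void)mbtowc(NULL, NULL, 0);	/* reset the conversion state */
	while (p < end) {
		n = mbtowc(&wc, p, (size_t)(end - p));
		if (n == -1) {
			/* Assumption: an invalid byte is one non-space char. */
			(void)mbtowc(NULL, NULL, 0);
			wc = L'?';
			n = 1;
		} else if (n == 0) {
			/* A NUL byte: mbtowc(3) returns 0, consume one byte. */
			n = 1;
		}
		c.chars++;
		if (iswspace((wint_t)wc)) {
			gotsp = 1;
		} else if (gotsp) {
			gotsp = 0;
			c.words++;
		}
		p += n;
	}
	return c;
}
```

The caller is expected to have run setlocale(LC_CTYPE, "") first.
For plain ASCII input every byte is one character, so
count_buf("hello  world\n", 13) reports 13 characters and 2 words.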

So, here is a patch for wc(1) getting both character and word
counting right.  I also improved the manual in various respects.

OK?
  Ingo


Index: wc.1
===
RCS file: /cvs/src/usr.bin/wc/wc.1,v
retrieving revision 1.25
diff -u -p -r1.25 wc.1
--- wc.1	21 Apr 2015 10:46:48 -  1.25
+++ wc.1	4 Dec 2015 12:54:26 -
@@ -72,9 +72,10 @@ using powers of 2 for sizes (K=1024, M=1
 The number of lines in each input file
 is written to the standard output.
 .It Fl m
-Intended to count characters instead of bytes;
-currently an alias for
-.Fl c .
+Count characters instead of bytes, and use
+.Xr iswspace 3
+instead of
+.Xr isspace 3 .
 .It Fl w
 The number of words in each input file
 is written to the standard output.
@@ -102,6 +103,20 @@ lines   words  bytes   file_name
 The counts for lines, words, and bytes
 .Pq or characters
 are integers separated by spaces.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character set
+.Xr locale 1 ,
+defining which byte sequences form characters.
+If unset or set to
+.Qq C ,
+.Qq POSIX ,
+or an unsupported value,
+.Fl m
+has the same effect as
+.Fl c .
+.El
 .Sh EXIT STATUS
 .Ex -std wc
 .Sh SEE ALSO
@@ -111,7 +126,7 @@ The
 .Nm
 utility is compliant with the
 .St -p1003.1-2008
-specification, except that it ignores the locale.
+specification.
 .Pp
 The flag
 .Op Fl h
@@ -121,7 +136,3 @@ A
 .Nm
 utility appeared in
 .At v1 .
-.Sh BUGS
-The
-.Fl m
-option counts bytes instead of characters.
Index: wc.c
===
RCS file: /cvs/src/usr.bin/wc/wc.c,v
retrieving revision 1.19
diff -u -p -r1.19 wc.c
--- wc.c	9 Oct 2015 01:37:09 -   1.19
+++ wc.c	4 Dec 2015 12:54:26 -
@@ -40,9 +40,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 int64_t	tlinect, twordct, tcharct;
-int	doline, doword, dochar, humanchar;
+int	doline, doword, dochar, humanchar, multibyte;
 int	rval;
 extern char *__progname;
 
@@ -55,7 +57,7 @@ main(int argc, char *argv[])
 {
int ch;
 
-   setlocale(LC_ALL, "");
+   setlocale(LC_CTYPE, "");
 
if (pledge("stdio rpath", NULL) == -1)
err(1, "pledge");
@@ -68,8 +70,11 @@ main(int argc, char *argv[])
case 'w':
doword = 1;
break;
-   case 'c':
case 'm':
+   if (MB_CUR_MAX > 1)
+   multibyte = 1;
+   /* FALLTHROUGH */
+   case 'c':
dochar = 1;
break;
case 'h':
@@ -112,15 +117,20 @@ main(int argc, char *argv[])
 void
 cnt(char *file)
 {
-   u_char *C;
+   static char *buf;
+   static ssize_t bufsz;
+
+   FILE *stream;
+   char *C;
+   wchar_t wc;
short gotsp;
-   int len;
+   ssize_t len;
int64_t linect, wordct, charct;
struct stat sbuf;
int fd;
-   u_char buf[MAXBSIZE];
 
linect = wordct = charct = 0;
+   stream = NULL;
if (file) {
if ((fd = open(file, O_RDONLY, 0)) < 0) {
warn("%s", file);
@@ -131,7 +141,10 @@ cnt(char *file)
fd = STDIN_FILENO;
}
 
-   if (!doword) {
+   if (!doword && !multibyte) {
+   if (bufsz < MAXBSIZE &&
+   (buf = realloc(b

Re: UTF-8 support for wc(1)

2015-12-03 Thread Todd C. Miller
On Sun, 29 Nov 2015 17:45:55 +0100, Ingo Schwarze wrote:

> our wc(1) utility currently violates POSIX in two ways:
> 
>  1. The -m option counts bytes instead of characters.
> The patch given below fixes that.
> 
>  2. Word counting with -w only treats ASCII whitespace as word
> boundaries and regards two words joined by non-ASCII whitespace
> as one single word.
> 
> The second issue is not related to UTF-8, but a matter of full
> Unicode support.  It would not be hard to fix that by using
> mbtowc(3) and iswblank(3) instead of mblen(3).  However, i don't
> think we want to pollute our base system tools with functions
> requiring full Unicode support, not even to the extent available
> in our own C library.  So i consider iswblank(3) taboo for now.

I'm a little surprised by this.  It doesn't seem like it would be
any more complicated to use mbtowc(3) and iswblank(3) for the
multibyte case.

If you want to revisit this later when we have better Unicode support
I suppose that is OK too.

 - todd



UTF-8 support for wc(1)

2015-11-29 Thread Ingo Schwarze
Hi,

our wc(1) utility currently violates POSIX in two ways:

 1. The -m option counts bytes instead of characters.
The patch given below fixes that.

 2. Word counting with -w only treats ASCII whitespace as word
boundaries and regards two words joined by non-ASCII whitespace
as one single word.

The second issue is not related to UTF-8, but a matter of full
Unicode support.  It would not be hard to fix that by using
mbtowc(3) and iswblank(3) instead of mblen(3).  However, i don't
think we want to pollute our base system tools with functions
requiring full Unicode support, not even to the extent available
in our own C library.  So i consider iswblank(3) taboo for now.

A few notes about the patch:

 * As usual, reduce the ridiculous setlocale(LC_ALL, "")
   to what is actually needed, setlocale(LC_CTYPE, "").

 * As usual, -m only differs from -c if LC_CTYPE is set
   to a multibyte encoding.

 * In the case  /* Do it the hard way... */,
   we need to switch from read(2) to getline(3)
   because read(2) might chop multibyte characters to pieces.
   That doesn't affect memory consumption of "wc -l" or "wc -c",
   not even for huge binary files without newline characters.
   It does increase memory consumption for files with very long
   lines when -w or -m is requested - but that's not a problem
   because both only make sense with real text, and real text
   does not have lines of a length that getline(3) is unable
   to handle.
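
The notes above can be illustrated with a rough sketch of the
getline(3) plus mblen(3) loop.  It is not the code from the patch:
count_chars() is a hypothetical helper, and treating each invalid
byte as one character is an assumption made for the example.
Reading whole lines means a multibyte character is never split
across two buffers, as could happen with fixed-size read(2) calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for getline(3) */
#endif

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper: count the characters on stream according to
 * the current LC_CTYPE locale.
 */
long long
count_chars(FILE *stream)
{
	char *line = NULL;
	char *p;
	size_t linesz = 0;
	ssize_t len;
	long long charct = 0;
	int n;

	assert(stream != NULL);
	while ((len = getline(&line, &linesz, stream)) > 0) {
		if (MB_CUR_MAX == 1) {
			/* Single-byte locale: characters equal bytes. */
			charct += len;
			continue;
		}
		(void)mblen(NULL, 0);	/* reset the conversion state */
		for (p = line; len > 0; p += n, len -= n) {
			n = mblen(p, (size_t)len);
			if (n <= 0) {
				/* Assumption: NUL or invalid byte is one char. */
				(void)mblen(NULL, 0);
				n = 1;
			}
			charct++;
		}
	}
	free(line);
	return charct;
}
```

The caller would run setlocale(LC_CTYPE, "") before the first call.
Without it the program stays in the "C" locale, MB_CUR_MAX is 1,
and the function simply counts bytes, which matches the note that
-m only differs from -c under a multibyte LC_CTYPE.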

OK?
  Ingo


Index: wc.1
===
RCS file: /cvs/src/usr.bin/wc/wc.1,v
retrieving revision 1.25
diff -u -p -r1.25 wc.1
--- wc.1	21 Apr 2015 10:46:48 -  1.25
+++ wc.1	29 Nov 2015 16:34:28 -
@@ -72,9 +72,7 @@ using powers of 2 for sizes (K=1024, M=1
 The number of lines in each input file
 is written to the standard output.
 .It Fl m
-Intended to count characters instead of bytes;
-currently an alias for
-.Fl c .
+Count characters instead of bytes.
 .It Fl w
 The number of words in each input file
 is written to the standard output.
@@ -111,7 +109,8 @@ The
 .Nm
 utility is compliant with the
 .St -p1003.1-2008
-specification, except that it ignores the locale.
+specification, except that it recognizes word boundaries only at ASCII
+whitespace.
 .Pp
 The flag
 .Op Fl h
@@ -121,7 +120,16 @@ A
 .Nm
 utility appeared in
 .At v1 .
-.Sh BUGS
+.Sh CAVEATS
 The
 .Fl m
-option counts bytes instead of characters.
+option depends on the character set
+.Xr locale 1 .
+If
+.Ev LC_CTYPE
+is set to
+.Qq C
+or
+.Qq POSIX ,
+it has the same effect as
+.Fl c .
Index: wc.c
===
RCS file: /cvs/src/usr.bin/wc/wc.c,v
retrieving revision 1.19
diff -u -p -r1.19 wc.c
--- wc.c	9 Oct 2015 01:37:09 -   1.19
+++ wc.c	29 Nov 2015 16:34:28 -
@@ -42,7 +42,7 @@
 #include 
 
 int64_t	tlinect, twordct, tcharct;
-int	doline, doword, dochar, humanchar;
+int	doline, doword, dochar, humanchar, multibyte;
 int	rval;
 extern char *__progname;
 
@@ -55,7 +55,7 @@ main(int argc, char *argv[])
 {
int ch;
 
-   setlocale(LC_ALL, "");
+   setlocale(LC_CTYPE, "");
 
if (pledge("stdio rpath", NULL) == -1)
err(1, "pledge");
@@ -68,8 +68,11 @@ main(int argc, char *argv[])
case 'w':
doword = 1;
break;
-   case 'c':
case 'm':
+   if (MB_CUR_MAX > 1)
+   multibyte = 1;
+   /* FALLTHROUGH */
+   case 'c':
dochar = 1;
break;
case 'h':
@@ -112,15 +115,19 @@ main(int argc, char *argv[])
 void
 cnt(char *file)
 {
+   static char *buf;
+   static ssize_t bufsz;
+
+   FILE *stream;
u_char *C;
short gotsp;
-   int len;
+   ssize_t len;
int64_t linect, wordct, charct;
struct stat sbuf;
int fd;
-   u_char buf[MAXBSIZE];
 
linect = wordct = charct = 0;
+   stream = NULL;
if (file) {
if ((fd = open(file, O_RDONLY, 0)) < 0) {
warn("%s", file);
@@ -131,7 +138,10 @@ cnt(char *file)
fd = STDIN_FILENO;
}
 
-   if (!doword) {
+   if (!doword && !multibyte) {
+   if (bufsz < MAXBSIZE &&
+   (buf = realloc(buf, MAXBSIZE)) == NULL)
+   err(1, NULL);
/*
 * Line counting is split out because it's a lot
 * faster to get lines than to get words, since
@@ -178,16 +188,25 @@ cnt(char *file)
}
}
} else {
+   if (file == NULL)
+   stream = stdin;
+   else if ((stream = fdopen(fd, "r")) == NULL) {
+   warn("%s", file);
+