Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-28 Thread Chih-Hsuan Yen
2018-07-28 3:05 GMT+08:00 Paul Eggert :
> Bruno Haible wrote:
>>
>> You can assume that mbrtowc returns
>> 0 if and only if the multibyte sequence is a NUL byte - but you had
>> chosen srcend in such a way that this would not happen in the loop.
>
>
> Thanks for the correction. I mistakenly thought that C allows multibyte
> encodings in which a null wide character's multibyte representation contains
> an all-bits-zero byte. I installed the attached to omit the unnecessary
> test.

Thanks you all for the efforts! I've installed commit
e5dae2c6b0bcd0e4ac6e5b212688d223e2e62f79 of coreutils, and `df` works
like a charm!

Cheers!

Chih-Hsuan Yen



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-27 Thread Paul Eggert

Bruno Haible wrote:

You can assume that mbrtowc returns
0 if and only if the multibyte sequence is a NUL byte - but you had
chosen srcend in such a way that this would not happen in the loop.


Thanks for the correction. I mistakenly thought that C allows multibyte 
encodings in which a null wide character's multibyte representation contains an 
all-bits-zero byte. I installed the attached to omit the unnecessary test.
From 34e261fa2768533f34cf35158489ea6a22115c17 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Fri, 27 Jul 2018 12:00:02 -0700
Subject: [PATCH] df: omit redundant comparison

Trivial inefficiency reported by Bruno Haible in:
http://lists.gnu.org/r/bug-gnulib/2018-07/msg00109.html
* src/df.c (hide_problematic_chars): Omit redundant test.
---
 src/df.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/df.c b/src/df.c
index 9b65872..5553221 100644
--- a/src/df.c
+++ b/src/df.c
@@ -288,7 +288,7 @@ hide_problematic_chars (char *cell)
   wchar_t wc;
   size_t srcbytes = srcend - src;
   n = mbrtowc (, src, srcbytes, );
-  bool ok = 0 < n && n <= srcbytes;
+  bool ok = n <= srcbytes;
 
   if (ok)
 ok = !iswcntrl (wc);
-- 
2.7.4



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-27 Thread Bruno Haible
Paul Eggert wrote:
> my earlier patch 
> neglected the possibility that mbrtowc can return 0

I wouldn't see this as a bug: You can assume that mbrtowc returns
0 if and only if the multibyte sequence is a NUL byte - but you had
chosen srcend in such a way that this would not happen in the loop.

> and it incorrectly assumed 
> wide control characters always have a single-byte representation.

Oops, you're right. My mistake as well.

The new patch looks good.

This will catch (and replace with '?') U+2028 and U+2029 on glibc systems.
On macOS, it will not do this, because iswcntrl(0x2028) and iswcntrl(0x2029)
is 0 on this system; this is consistent with the fact that the 'Terminal'
program displays these characters as simple spaces. So, no need to override
iswcntrl on macOS.

Bruno


2018-07-27  Bruno Haible  

iswcntrl: Mention minor problem on macOS.
* doc/posix-functions/iswcntrl.texi: Mention oddity on macOS.

diff --git a/doc/posix-functions/iswcntrl.texi 
b/doc/posix-functions/iswcntrl.texi
index 99eaa0e..44dd034 100644
--- a/doc/posix-functions/iswcntrl.texi
+++ b/doc/posix-functions/iswcntrl.texi
@@ -25,4 +25,8 @@ Portability problems not fixed by Gnulib:
 @item
 On AIX and Windows platforms, @code{wchar_t} is a 16-bit type and therefore 
cannot
 accommodate all Unicode characters.
+@item
+This function returns 0 for U+2028 (LINE SEPARATOR) and
+U+2029 (PARAGRAPH SEPARATOR) on some platforms:
+Mac OS X 10.13.
 @end itemize




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Paul Eggert

Pádraig Brady wrote:

I've pushed the c_iscntrl patch since it's simplest
and probably most appropriate patch for an existing release.


Yes, that makes sense for a quick patch. However, for the next release I think 
it'd be better to catch encoding errors and multibyte control characters, given 
the problems noted. I installed the attached further patch to try to do this. 
This fixes the problem that Bruno noted, along with two others; my earlier patch 
neglected the possibility that mbrtowc can return 0, and it incorrectly assumed 
wide control characters always have a single-byte representation.


Either way the original bug appears to be fix so I'm boldly closing the bug 
report.
From 2cf5d730690dad600f8b6d74d0b5fde522804e43 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 22 Jul 2018 09:50:20 -0700
Subject: [PATCH] df: avoid multibyte character corruption on macOS

This improves on the earlier fix for the problem reported by
Chih-Hsuan Yen (Bug#32236), by also looking for other control
characters and for encoding errors.
* src/df.c: Include wchar.h and wctype.h instead of c-ctype.h.
(hide_problematic_chars): Process the string as multibyte.
Use iswcntrl, not c_iscntrl.
---
 src/df.c | 43 ---
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/src/df.c b/src/df.c
index c851fcc..d27ba02 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,7 +23,8 @@
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -272,21 +273,41 @@ static struct option const long_options[] =
   {NULL, 0, NULL, 0}
 };
 
-/* Replace problematic chars with '?'.
-   Since only control characters are currently considered,
-   this should work in all encodings.  */
+/* Replace problematic chars with '?'.  */
 
-static char*
+static void
 hide_problematic_chars (char *cell)
 {
-  char *p = cell;
-  while (*p)
+  char *srcend = cell + strlen (cell);
+  char *dst = cell;
+  mbstate_t mbstate = { 0, };
+  size_t n;
+
+  for (char *src = cell; src != srcend; src += n)
 {
-  if (c_iscntrl (to_uchar (*p)))
-*p = '?';
-  p++;
+  wchar_t wc;
+  size_t srcbytes = srcend - src;
+  n = mbrtowc (, src, srcbytes, );
+  bool ok = 0 < n && n <= srcbytes;
+
+  if (ok)
+ok = !iswcntrl (wc);
+  else
+n = 1;
+
+  if (ok)
+{
+  memmove (dst, src, n);
+  dst += n;
+}
+  else
+{
+  *dst++ = '?';
+  memset (, 0, sizeof mbstate);
+}
 }
-  return cell;
+
+  *dst = '\0';
 }
 
 /* Dynamically allocate a row of pointers in TABLE, which
-- 
2.7.4



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Pádraig Brady
On 26/07/18 02:01, Paul Eggert wrote:
> Chih-Hsuan Yen wrote:
>> How about following the idea from Pádraig Brady
>> and filter \n only?
> 
> Given the later comments it seems better to filter out encoding errors and 
> control characters. Programs that parse the output already cannot trust the 
> strings to be exactly right, since newlines are gonna get replaced no matter 
> what. So there seems little benefit to copying the other garbage faithfully.
> 
> Revised proposed patch(es) attached.

This is better, though this means that mount points now
need to match the locale of df or they won't be displayed.
Theoretically that was the case previously, but only for control chars
and so wouldn't have have had a practical impact for mounts
encoded in another local, only for security/robustness reasons where
one might have \n etc.

I've pushed the c_iscntrl patch since it's simplest
and probably most appropriate patch for an existing release.

If you consider the matching encoding issue as a non issue,
then I'm OK with this.

cheers,
Pádraig



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Bruno Haible
Paul Eggert wrote:
> Revised proposed patch(es) attached.

Looks good to me, except for one little thing:

>   memcpy (dst, src, n);

src and dst may overlap. Therefore memmove should be used instead of memcpy.

Bruno




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Paul Eggert

Chih-Hsuan Yen wrote:

How about following the idea from Pádraig Brady
and filter \n only?


Given the later comments it seems better to filter out encoding errors and 
control characters. Programs that parse the output already cannot trust the 
strings to be exactly right, since newlines are gonna get replaced no matter 
what. So there seems little benefit to copying the other garbage faithfully.


Revised proposed patch(es) attached.
>From e4f2a5f97771c4a74e0bdef1c1e4a0d2735cef15 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 22 Jul 2018 09:50:20 -0700
Subject: [PATCH 1/2] df: avoid multibyte character corruption on macOS

Problem reported by Chih-Hsuan Yen (Bug#32236).
* NEWS: Mention the bug fix.
* src/df.c: Include wchar.h and wctype.h.
(hide_problematic_chars): Respect multibyte encodings
when replacing problematic characters or bytes with '?'.
---
 NEWS |  4 
 src/df.c | 35 +--
 2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/NEWS b/NEWS
index af1a990..aa3b4f9 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*-
 
 * Noteworthy changes in release ?.? (-??-??) [?]
 
+** Bug fixes
+
+  df no longer corrupts displayed multibyte characters on macOS.
+
 
 * Noteworthy changes in release 8.30 (2018-07-01) [stable]
 
diff --git a/src/df.c b/src/df.c
index 1178865..52be414b 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -271,21 +273,34 @@ static struct option const long_options[] =
   {NULL, 0, NULL, 0}
 };
 
-/* Replace problematic chars with '?'.
-   Since only control characters are currently considered,
-   this should work in all encodings.  */
+/* Replace problematic chars with '?'.  */
 
-static char*
+static void
 hide_problematic_chars (char *cell)
 {
-  char *p = cell;
-  while (*p)
+  char *srcend = cell + strlen (cell);
+  char *dst = cell;
+  mbstate_t mbstate = { 0, };
+  size_t n;
+
+  for (char *src = cell; src != srcend; src += n)
 {
-  if (iscntrl (to_uchar (*p)))
-*p = '?';
-  p++;
+  wchar_t wc;
+  size_t srcbytes = srcend - src;
+  n = mbrtowc (, src, srcbytes, );
+  if (n <= srcbytes && !iswcntrl (wc))
+{
+  memcpy (dst, src, n);
+  dst += n;
+}
+  else
+{
+  *dst++ = '?';
+  memset (, 0, sizeof mbstate);
+}
 }
-  return cell;
+
+  *dst = '\0';
 }
 
 /* Dynamically allocate a row of pointers in TABLE, which
-- 
2.7.4

>From e40ae8fb26fd0c8c0cb7e42598a2b8d9bd8551bb Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Thu, 26 Jul 2018 01:56:28 -0700
Subject: [PATCH 2/2] df: tune slightly

* src/df.c (get_header, get_dev):
Avoid calling mbswidth twice when once will do.
---
 src/df.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/df.c b/src/df.c
index 52be414b..f5bd26d 100644
--- a/src/df.c
+++ b/src/df.c
@@ -588,7 +588,8 @@ get_header (void)
 
   table[nrows - 1][col] = cell;
 
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = mbswidth (cell, 0);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
 }
 }
 
@@ -1198,7 +1199,8 @@ get_dev (char const *disk, char const *mount_point, char const* file,
 assert (!"empty cell");
 
   hide_problematic_chars (cell);
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = mbswidth (cell, 0);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
   table[nrows - 1][col] = cell;
 }
   free (dev_name);
-- 
2.7.4



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-25 Thread Chih-Hsuan Yen
2018-07-23 5:40 GMT+08:00 Bruno Haible :
> Pádraig Brady wrote:
>> > This patch is correct (because the characters that you test for in 
>> > c_iscntrl
>> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a 
>> > multibyte
>> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
>>
>> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
>> "Note this identifies all single byte control chars even in multibyte 
>> encodings".
>
> Only in the multibyte encodings that are currently in use. We never know what
> kinds of features or misfeatures new multibyte encodings will come up with:
> Before GB18030 was introduced, it was a common feature of all multibyte 
> encodings
> (including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
> second or later byte in a multibyte character. Well, GB18030 broke this 
> assumption.
>
> So, it is dangerous to rely on this property. Therefore I wouldn't like to
> document it in the c_iscntrl() documentation.
>
> Bruno
>

Hello any update on this? Discussions about encodings are beyond my
knowledge, yet I can feel that it's difficult to correctly filter
control characters. How about following the idea from Pádraig Brady
and filter \n only?



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Bruno Haible
Pádraig Brady wrote:
> > This patch is correct (because the characters that you test for in c_iscntrl
> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a 
> > multibyte
> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
> 
> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
> "Note this identifies all single byte control chars even in multibyte 
> encodings".

Only in the multibyte encodings that are currently in use. We never know what
kinds of features or misfeatures new multibyte encodings will come up with:
Before GB18030 was introduced, it was a common feature of all multibyte 
encodings
(including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
second or later byte in a multibyte character. Well, GB18030 broke this 
assumption.

So, it is dangerous to rely on this property. Therefore I wouldn't like to
document it in the c_iscntrl() documentation.

Bruno




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Bruno Haible
Pádraig Brady wrote:
> but I did want to only avoid \n etc. that might cause issues for
> programs that parsed output from df on a line by line basis.

The current code (which uses iscntrl) also catches escape sequences
that can cause weird output on the screen, in a terminal emulator.
This is good (because it can confuse a human reader as much as a '\n'
would confuse a line-by-line parser).

Now, this feature currently only works for escape sequence that
start with an ASCII escape U+001B. It would be useful also for
other control characters to be caught, at least:
  * escape characters U+009B.
  * other characters that cause a newline in a terminal emulator:
U+2028 and U+2029.
For example, in konsole, the escape sequence '\u009bf' repositions
the cursor. So the effects of

$ mkdir /tmp/`printf 'abc\u009bf'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop019860481986048  0  100% /tmp/abc�f

is that 'df' produces an U+FFFD. This is less useful than what
it produces for an ASCII escape:

$ mkdir /tmp/`printf 'abc\u001b[2J'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop0 692828 692828  0  100% /tmp/abc?[2J

Bruno




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Paul Eggert

Pádraig Brady wrote:

I did want to only avoid \n etc. that might cause issues for
programs that parsed output from df on a line by line basis.
This subset of control characters is safe to identify
It seems problematic to start eliding improperly encoded
mount points for example, rather than just outputting
what's there.


Yes, I suppose you're right, it's not df's job to police encodings.


Also just incrementing width++ per each wide character
doesn't seem right, though again I've not tested it.


True as well. OK, please ignore my patch.

I was prompted by worries about multibyte encodings that use bytes that could be 
misinterpreted as ASCII control characters, such as a locale that uses EBCDIC 
encoding. However, that's probably just a theoretical concern; no coreutils 
users use EBCDIC any more, right? Plus there are doubtless lots of other places 
in coreutils that assume '\n' is a newline in encoded text.




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Pádraig Brady
On 21/07/18 15:43, Bruno Haible wrote:
> Hi Pádraig,
> 
>> I've attached a gnulib patch to document for iscntrl at least.
> 
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
> 
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
> 
> POSIX [1] says
>   "The c argument is a type int, the value of which the application shall
>ensure is a character representable as an unsigned char or equal to the
>value of the macro EOF. If the argument has any other value, the behavior
>is undefined."
> 
> The term "character" is defined here [2]:
>   "A sequence of one or more bytes representing a single graphic symbol or
>control code."
> 
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
> 
> For invalid values "the behavior is undefined." You were expecting a value 0.
> 
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
>   - the behaviour for valid arguments is different on different platforms, or
>   - the boundary between valid and invalid arguments is fuzzy and depends on
> the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.


Thanks for all that info. I agree iscntrl() behavior on macOS is within spec,
though is still surprising, and different from other systems.
I agree docs should be as succinct as possible, though...

>> I've also attached an alternative patch for df (in your name).
> 
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).

... It might be worth mentioning this subtle point in the c_iscntrl() docs?
"Note this identifies all single byte control chars even in multibyte 
encodings".

> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().

I was mainly worried here about \n for scripts to robustly parse df output.

cheers,
Pádraig.



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Pádraig Brady
On 22/07/18 08:12, Paul Eggert wrote:
> Pádraig Brady wrote:
>> I've also attached an alternative patch for df (in your name).
> 
> That still has problems, since it can generate improperly-encoded strings in 
> UTF-8 locales (if the inputs are improperly encoded), and can replace parts 
> of 
> multibyte characters with '?' in non-UTF-8 locales. Please try the attached 
> patch instead, which attempts to address these issues. This is more along the 
> lines that Bruno suggested, except it doesn't use mbsiter as I figured it was 
> simpler overall just to use mbrtowc directly for this one thing.

I haven't time to review this now,
but I did want to only avoid \n etc. that might cause issues for
programs that parsed output from df on a line by line basis.
This subset of control characters is safe to identify
It seems problematic to start eliding improperly encoded
mount points for example, rather than just outputting
what's there.

Also just incrementing width++ per each wide character
doesn't seem right, though again I've not tested it.

cheers,
Pádraig



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Chih-Hsuan Yen
2018-07-22 23:12 GMT+08:00 Paul Eggert :
> Pádraig Brady wrote:
>>
>> I've also attached an alternative patch for df (in your name).
>
>
> That still has problems, since it can generate improperly-encoded strings in
> UTF-8 locales (if the inputs are improperly encoded), and can replace parts
> of multibyte characters with '?' in non-UTF-8 locales. Please try the
> attached patch instead, which attempts to address these issues. This is more
> along the lines that Bruno suggested, except it doesn't use mbsiter as I
> figured it was simpler overall just to use mbrtowc directly for this one
> thing.

Here's the result of df:

$ df
檔案系統容量  已用  可用 已用 掛載點
/dev/disk1s1234G  137G   95G  60% /
/dev/disk1s4234G  2.1G   95G   3% /private/var/vm
chyen.cc:25G   12G   12G  51% /private/tmp/abc def ghi

$ df | xxd
: e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020  
0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7  ..  
0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794  ..  .. .
0030: a820 e68e 9be8 bc89 e9bb 9e0a 2f64 6576  . ../dev
0040: 2f64 6973 6b31 7331 2020 2020 3233 3447  /disk1s1234G
0050: 2020 3133 3747 2020 2039 3547 2020 3630137G   95G  60
0060: 2520 2f0a 2f64 6576 2f64 6973 6b31 7334  % /./dev/disk1s4
0070: 2020 2020 3233 3447 2020 322e 3147 2020  234G  2.1G
0080: 2039 3547 2020 2033 2520 2f70 7269 7661   95G   3% /priva
0090: 7465 2f76 6172 2f76 6d0a 6368 7965 6e2e  te/var/vm.chyen.
00a0: 6363 3a20 2020 2020 2020 2032 3547 2020  cc:25G
00b0: 2031 3247 2020 2031 3247 2020 3531 2520   12G   12G  51%
00c0: 2f70 7269 7661 7465 2f74 6d70 2f61 6263  /private/tmp/abc
00d0: e280 a864 6566 e280 a967 6869 0a ...def...ghi.

Chinese header names are correct, and U+2028 and U+2029 are written
as-is. All tested with LANG=zh_TW.UTF-8 LC_COLLATE=C
LC_CTYPE=zh_TW.UTF-8.



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Paul Eggert

Pádraig Brady wrote:

I've also attached an alternative patch for df (in your name).


That still has problems, since it can generate improperly-encoded strings in 
UTF-8 locales (if the inputs are improperly encoded), and can replace parts of 
multibyte characters with '?' in non-UTF-8 locales. Please try the attached 
patch instead, which attempts to address these issues. This is more along the 
lines that Bruno suggested, except it doesn't use mbsiter as I figured it was 
simpler overall just to use mbrtowc directly for this one thing.
From 17a1a37549344cdfd95cc84b1848dafa256be5a0 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 22 Jul 2018 08:09:01 -0700
Subject: [PATCH] df: avoid multibyte character corruption on macOS

Problem reported by Chih-Hsuan Yen (Bug#32236).
* NEWS: Mention the bug fix.
* src/df.c: Include wchar.h and wctype.h instead of mbswidth.h.
(hide_problematic_chars): Return number of screen columns.
All callers changed.  Use iswcntrl, not iscntrl.
(get_header, get_dev): Rely on hide_problematic_chars width,
not mbswidth.  Scan the cell once, instead of two or three times.
---
 NEWS |  4 
 src/df.c | 46 +++---
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/NEWS b/NEWS
index af1a990..aa3b4f9 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*-
 
 * Noteworthy changes in release ?.? (-??-??) [?]
 
+** Bug fixes
+
+  df no longer corrupts displayed multibyte characters on macOS.
+
 
 * Noteworthy changes in release 8.30 (2018-07-01) [stable]
 
diff --git a/src/df.c b/src/df.c
index 1178865..664b88b 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -31,7 +33,6 @@
 #include "fsusage.h"
 #include "human.h"
 #include "mbsalign.h"
-#include "mbswidth.h"
 #include "mountlist.h"
 #include "quote.h"
 #include "find-mount-point.h"
@@ -272,20 +273,36 @@ static struct option const long_options[] =
 };
 
 /* Replace problematic chars with '?'.
-   Since only control characters are currently considered,
-   this should work in all encodings.  */
+   Return the number of screen columns.  */
 
-static char*
+static size_t
 hide_problematic_chars (char *cell)
 {
-  char *p = cell;
-  while (*p)
+  char *srcend = cell + strlen (cell);
+  char *dst = cell;
+  mbstate_t mbstate = { 0, };
+  size_t n;
+  size_t width = 0;
+
+  for (char *src = cell; src != srcend; src += n)
 {
-  if (iscntrl (to_uchar (*p)))
-*p = '?';
-  p++;
+  wchar_t wc;
+  n = mbrtowc (, src, srcend - src, );
+  if (n < (size_t) -2 && !iswcntrl (wc))
+{
+  memcpy (dst, src, n);
+  dst += n;
+}
+  else
+{
+  *dst++ = '?';
+  memset (, 0, sizeof mbstate);
+}
+  width++;
 }
-  return cell;
+
+  *dst = '\0';
+  return width;
 }
 
 /* Dynamically allocate a row of pointers in TABLE, which
@@ -569,11 +586,10 @@ get_header (void)
   if (!cell)
 xalloc_die ();
 
-  hide_problematic_chars (cell);
-
   table[nrows - 1][col] = cell;
 
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = hide_problematic_chars (cell);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
 }
 }
 
@@ -1182,8 +1198,8 @@ get_dev (char const *disk, char const *mount_point, char const* file,
   if (!cell)
 assert (!"empty cell");
 
-  hide_problematic_chars (cell);
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = hide_problematic_chars (cell);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
   table[nrows - 1][col] = cell;
 }
   free (dev_name);
-- 
2.7.4



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Chih-Hsuan Yen
2018-07-22 18:46 GMT+08:00 Bruno Haible :
> Chih-Hsuan Yen wrote:
>> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
>> if you want me to test other patches, thanks!
>
> You could test how it behaves with mount points that contain U+2028 or
> U+2029 characters. On Linux, I'd test it like this. Hope it's similar
> on macOS:
> $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'`
> $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
> $ df
> ...
> /dev/loop019860481986048  0  100% /tmp/abc�def�ghi
> $ ls -ld /tmp/abc*
> dr-xr-xr-x 4 root root 2048 Nov 19  2014 /tmp/abc?def?ghi
>
> Bruno
>

Hi Bruno,

With the c_iscntrl() patch, the result of ls and df are: (I use xxd as
GMail seems unable to handle U+2028 and U+2029 correctly)

$ ls -ld /tmp/abc
def
ghi | xxd
: 6472 7778 722d 7872 2d78 2031 2079 656e  drwxr-xr-x 1 yen
0010: 2073 7461  2034 3039 3620 3230 3138   staff 4096 2018
0020: 2f30 372f 3232 2030 323a 3136 3a34 3320  /07/22 02:16:43
0030: 2f74 6d70 2f61 6263 e280 a864 6566 e280  /tmp/abc...def..
0040: a967 6869 0a .ghi.

$ df | xxd
: e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020  
0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7  ..  
0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794  ..  .. .
0030: a825 20e6 8e9b e8bc 89e9 bb9e 0a2f 6465  .% ../de
0040: 762f 6469 736b 3173 3120 2020 2032 3334  v/disk1s1234
0050: 4720 2031 3337 4720 2020 3935 4720 2020  G  137G   95G
0060: 3630 2520 2f0a 2f64 6576 2f64 6973 6b31  60% /./dev/disk1
0070: 7334 2020 2020 3233 3447 2020 322e 3147  s4234G  2.1G
0080: 2020 2039 3547 2020 2020 3325 202f 7072 95G3% /pr
0090: 6976 6174 652f 7661 722f 766d 0a63 6879  ivate/var/vm.chy
00a0: 656e 2e63 633a 2020 2020 2020 2020 3235  en.cc:25
00b0: 4720 2020 3132 4720 2020 3132 4720 2020  G   12G   12G
00c0: 3532 2520 2f70 7269 7661 7465 2f74 6d70  52% /private/tmp
00d0: 2f61 6263 e280 a864 6566 e280 a967 6869  /abc...def...ghi
00e0: 0a   .

Without the c_iscntrl() patch (unmodified 8.30), ls behaves the same,
and the result of df is:

$ df | xxd
: e6aa 3fe6 a13f e7b3 bbe7 b5b1 20e5 aeb9  ..?..?.. ...
0010: e93f 3f20 e5b7 b2e7 3fa8 20e5 3faf e73f  .?? ?. .?..?
0020: a820 e5b7 b2e7 3fa8 2520 e63f 3fe8 bc3f  . ?.% .??..?
0030: e9bb 3f0a 2f64 6576 2f64 6973 6b31 7331  ..?./dev/disk1s1
0040: 2020 2020 3233 3447 2020 3133 3747 2020  234G  137G
0050: 2020 3935 4720 2020 2036 3025 202f 0a2f95G60% /./
0060: 6465 762f 6469 736b 3173 3420 2020 2032  dev/disk1s42
0070: 3334 4720 2032 2e31 4720 2020 2039 3547  34G  2.1G95G
0080: 2020 2020 2033 2520 2f70 7269 7661 7465   3% /private
0090: 2f76 6172 2f76 6d0a 6368 7965 6e2e 6363  /var/vm.chyen.cc
00a0: 3a20 2020 2020 2020 2032 3547 2020 2031  :25G   1
00b0: 3247 2020 2020 3132 4720 2020 2035 3125  2G12G51%
00c0: 202f 7072 6976 6174 652f 746d 702f 6162   /private/tmp/ab
00d0: 63e2 3fa8 6465 66e2 3fa9 6768 690a   c.?.def.?.ghi.

Hope those results are helpful!

Chih-Hsuan Yen



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Bruno Haible
Chih-Hsuan Yen wrote:
> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
> if you want me to test other patches, thanks!

You could test how it behaves with mount points that contain U+2028 or
U+2029 characters. On Linux, I'd test it like this. Hope it's similar
on macOS:
$ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop019860481986048  0  100% /tmp/abc�def�ghi
$ ls -ld /tmp/abc*
dr-xr-xr-x 4 root root 2048 Nov 19  2014 /tmp/abc?def?ghi

Bruno




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Chih-Hsuan Yen
2018-07-22 6:43 GMT+08:00 Bruno Haible :
> Hi Pádraig,
>
>> I've attached a gnulib patch to document for iscntrl at least.
>
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
>
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
>
> POSIX [1] says
>   "The c argument is a type int, the value of which the application shall
>ensure is a character representable as an unsigned char or equal to the
>value of the macro EOF. If the argument has any other value, the behavior
>is undefined."
>
> The term "character" is defined here [2]:
>   "A sequence of one or more bytes representing a single graphic symbol or
>control code."
>
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
>
> For invalid values "the behavior is undefined." You were expecting a value 0.
>
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
>   - the behaviour for valid arguments is different on different platforms, or
>   - the boundary between valid and invalid arguments is fuzzy and depends on
> the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.
>
>> I've also attached an alternative patch for df (in your name).
>
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
>
> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().
>
> Bruno
>
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
> [2] 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87

The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
if you want me to test other patches, thanks!

Cheers,

Chih-Hsuan Yen



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-21 Thread Bruno Haible
Hi Pádraig,

> I've attached a gnulib patch to document for iscntrl at least.

> +This function does not support arguments outside of the range of the
> +unsigned char type in locales with large character sets, on some platforms.
> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.

In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().

POSIX [1] says
  "The c argument is a type int, the value of which the application shall
   ensure is a character representable as an unsigned char or equal to the
   value of the macro EOF. If the argument has any other value, the behavior
   is undefined."

The term "character" is defined here [2]:
  "A sequence of one or more bytes representing a single graphic symbol or
   control code."

So, in a UTF-8 locale, a "character representable as an unsigned char"
is a byte sequence of length 1, where the single byte has a value in the
range 0x00..0x7F.

For invalid values "the behavior is undefined." You were expecting a value 0.

Now, in the gnulib documentations, what we mention as portability problems
are the cases where
  - the behaviour for valid arguments is different on different platforms, or
  - the boundary between valid and invalid arguments is fuzzy and depends on
the platform.
IMO there's no point in documenting that a function _really_ has undefined
behaviour when POSIX says that it has undefined behaviour.

> I've also attached an alternative patch for df (in your name).

This patch is correct (because the characters that you test for in c_iscntrl
are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).

But it does not catch control characters outside of the ASCII range. It would
make sense to catch these as well. If you want to do that,
'hide_problematic_chars' needs to be rewritten as a loop that iterates across
the multibyte characters. For example with the 'mbiter' module, in
combination with the mb_iscntrl function from the 'mbchar' module. Or
directly with mbrtowc() and iswcntrl().

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
[2] 
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-21 Thread Pádraig Brady
On 21/07/18 07:20, Chih-Hsuan Yen wrote:
> Hi coreutils developers,
> 
> I'm using coreutils on macOS High Sierra (10.13). I noticed that with
> `LANG=zh_TW.UTF-8`, `df` output is corrupted.
> 
> �?�?系統 容�?? 已�?� �?��?� 已�?�% �??�?�?
> /dev/disk1s1234G  151G81G65% /
> /dev/disk1s4234G  2.1G81G 3% /private/var/vm
> 
> (I'm not sure if other mail agents can display those characters
> correctly or not. See my blog post [1] for the exact output.)
> 
> Seems it's similar to bug#25630 [2], which is not resolved. I guess
> the reason of my issue is that iscntrl() is broken on macOS High
> Sierra, so in hide_problematic_chars(), some bytes in the Chinese
> header is replaced with a question mark. I managed to patch coreutils
> [3] to make `df` work. Could you have a look? Thanks!
> 
> Best,
> 
> Chih-Hsuan Yen
> 
> [1] https://blog.chyen.cc/posts/2018/06/23/mac-df-chinese.html
> [2] http://lists.gnu.org/archive/html/bug-coreutils/2017-02/msg8.html
> [3] 
> https://github.com/yan12125/macports-ports/blob/fix-coreutils-df-chinese/sysutils/coreutils/files/patch-df.diff

Wow. That's surprising. I do see the FreeBSD man pages say:

"The 4.4BSD extension of accepting arguments outside of the range of the 
unsigned char type
in locales with large character sets is considered obsolete and may not be 
supported in
future releases."

Now I think that might have been referring to >= 0xFF, but fair enough.

I've attached a gnulib patch to document for iscntrl at least.
It would be great if someone could test the other is*() classification
functions on macOS so that I might have a more complete documentation patch.

I've also attached an alternative patch for df (in your name).
Can you try that one?

thanks!
Pádraig
>From 6b7434fb222144af3ae9e2d4fd6b4c72eec25f5b Mon Sep 17 00:00:00 2001
From: Chih-Hsuan Yen 
Date: Sat, 21 Jul 2018 13:19:23 -0700
Subject: [PATCH] df: avoid multibyte character corruption on macOS

* src/df.c (hide_problematic_chars): Use c_iscntrl() as
passing 8 bit characters to iscntrl() is not supported on macOS.
* NEWS: Mention the bug fix.
Fixes https://bugs.gnu.org/32236
---
 NEWS | 4 
 src/df.c | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index af1a990..aa3b4f9 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*-
 
 * Noteworthy changes in release ?.? (-??-??) [?]
 
+** Bug fixes
+
+  df no longer corrupts displayed multibyte characters on macOS.
+
 
 * Noteworthy changes in release 8.30 (2018-07-01) [stable]
 
diff --git a/src/df.c b/src/df.c
index 1178865..c851fcc 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -281,7 +282,7 @@ hide_problematic_chars (char *cell)
   char *p = cell;
   while (*p)
 {
-  if (iscntrl (to_uchar (*p)))
+  if (c_iscntrl (to_uchar (*p)))
 *p = '?';
   p++;
 }
-- 
2.9.3

>From 816cc0d5fb92552a551c523f49c829261731dfe8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= 
Date: Sat, 21 Jul 2018 13:15:13 -0700
Subject: [PATCH] iscntrl: document that macOS returns true for >= 0x80

* doc/posix-functions/iscntrl.texi: Mention that support
for chars >= 0x80 is not standarized, and not supported
on OS X >= 10.5 at least
---
 doc/posix-functions/iscntrl.texi | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/doc/posix-functions/iscntrl.texi b/doc/posix-functions/iscntrl.texi
index 7e6813f..3c15708 100644
--- a/doc/posix-functions/iscntrl.texi
+++ b/doc/posix-functions/iscntrl.texi
@@ -16,6 +16,9 @@ OS X 10.8.
 
 Portability problems not fixed by Gnulib:
 @itemize
+This function does not support arguments outside of the range of the
+unsigned char type in locales with large character sets, on some platforms.
+OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
 @end itemize
 
 Note: This function's behaviour depends on the locale, but does not support
-- 
2.9.3