Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Chih-Hsuan Yen
2018-07-22 6:43 GMT+08:00 Bruno Haible :
> Hi Pádraig,
>
>> I've attached a gnulib patch to document for iscntrl at least.
>
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
>
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
>
> POSIX [1] says
>   "The c argument is a type int, the value of which the application shall
>ensure is a character representable as an unsigned char or equal to the
>value of the macro EOF. If the argument has any other value, the behavior
>is undefined."
>
> The term "character" is defined here [2]:
>   "A sequence of one or more bytes representing a single graphic symbol or
>control code."
>
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
>
> For invalid values "the behavior is undefined." You were expecting a value 0.
>
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
>   - the behaviour for valid arguments is different on different platforms, or
>   - the boundary between valid and invalid arguments is fuzzy and depends on
> the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.
>
>> I've also attached an alternative patch for df (in your name).
>
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
>
> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().
>
> Bruno
>
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
> [2] 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87

The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
if you want me to test other patches, thanks!

Cheers,

Chih-Hsuan Yen



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Paul Eggert

Pádraig Brady wrote:

I've also attached an alternative patch for df (in your name).


That still has problems, since it can generate improperly-encoded strings in 
UTF-8 locales (if the inputs are improperly encoded), and can replace parts of 
multibyte characters with '?' in non-UTF-8 locales. Please try the attached 
patch instead, which attempts to address these issues. This is more along the 
lines that Bruno suggested, except it doesn't use mbsiter as I figured it was 
simpler overall just to use mbrtowc directly for this one thing.
From 17a1a37549344cdfd95cc84b1848dafa256be5a0 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 22 Jul 2018 08:09:01 -0700
Subject: [PATCH] df: avoid multibyte character corruption on macOS

Problem reported by Chih-Hsuan Yen (Bug#32236).
* NEWS: Mention the bug fix.
* src/df.c: Include wchar.h and wctype.h instead of mbswidth.h.
(hide_problematic_chars): Return number of screen columns.
All callers changed.  Use iswcntrl, not iscntrl.
(get_header, get_dev): Rely on hide_problematic_chars width,
not mbswidth.  Scan the cell once, instead of two or three times.
---
 NEWS |  4 
 src/df.c | 46 +++---
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/NEWS b/NEWS
index af1a990..aa3b4f9 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*-
 
 * Noteworthy changes in release ?.? (-??-??) [?]
 
+** Bug fixes
+
+  df no longer corrupts displayed multibyte characters on macOS.
+
 
 * Noteworthy changes in release 8.30 (2018-07-01) [stable]
 
diff --git a/src/df.c b/src/df.c
index 1178865..664b88b 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -31,7 +33,6 @@
 #include "fsusage.h"
 #include "human.h"
 #include "mbsalign.h"
-#include "mbswidth.h"
 #include "mountlist.h"
 #include "quote.h"
 #include "find-mount-point.h"
@@ -272,20 +273,36 @@ static struct option const long_options[] =
 };
 
 /* Replace problematic chars with '?'.
-   Since only control characters are currently considered,
-   this should work in all encodings.  */
+   Return the number of screen columns.  */
 
-static char*
+static size_t
 hide_problematic_chars (char *cell)
 {
-  char *p = cell;
-  while (*p)
+  char *srcend = cell + strlen (cell);
+  char *dst = cell;
+  mbstate_t mbstate = { 0, };
+  size_t n;
+  size_t width = 0;
+
+  for (char *src = cell; src != srcend; src += n)
 {
-  if (iscntrl (to_uchar (*p)))
-*p = '?';
-  p++;
+  wchar_t wc;
+  n = mbrtowc (, src, srcend - src, );
+  if (n < (size_t) -2 && !iswcntrl (wc))
+{
+  memcpy (dst, src, n);
+  dst += n;
+}
+  else
+{
+  *dst++ = '?';
+  memset (, 0, sizeof mbstate);
+}
+  width++;
 }
-  return cell;
+
+  *dst = '\0';
+  return width;
 }
 
 /* Dynamically allocate a row of pointers in TABLE, which
@@ -569,11 +586,10 @@ get_header (void)
   if (!cell)
 xalloc_die ();
 
-  hide_problematic_chars (cell);
-
   table[nrows - 1][col] = cell;
 
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = hide_problematic_chars (cell);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
 }
 }
 
@@ -1182,8 +1198,8 @@ get_dev (char const *disk, char const *mount_point, char const* file,
   if (!cell)
 assert (!"empty cell");
 
-  hide_problematic_chars (cell);
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = hide_problematic_chars (cell);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
   table[nrows - 1][col] = cell;
 }
   free (dev_name);
-- 
2.7.4



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Pádraig Brady
On 22/07/18 08:12, Paul Eggert wrote:
> Pádraig Brady wrote:
>> I've also attached an alternative patch for df (in your name).
> 
> That still has problems, since it can generate improperly-encoded strings in 
> UTF-8 locales (if the inputs are improperly encoded), and can replace parts 
> of 
> multibyte characters with '?' in non-UTF-8 locales. Please try the attached 
> patch instead, which attempts to address these issues. This is more along the 
> lines that Bruno suggested, except it doesn't use mbsiter as I figured it was 
> simpler overall just to use mbrtowc directly for this one thing.

I haven't time to review this now,
but I did want to only avoid \n etc. that might cause issues for
programs that parsed output from df on a line by line basis.
This subset of control characters is safe to identify
It seems problematic to start eliding improperly encoded
mount points for example, rather than just outputting
what's there.

Also just incrementing width++ per each wide character
doesn't seem right, though again I've not tested it.

cheers,
Pádraig



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Chih-Hsuan Yen
2018-07-22 23:12 GMT+08:00 Paul Eggert :
> Pádraig Brady wrote:
>>
>> I've also attached an alternative patch for df (in your name).
>
>
> That still has problems, since it can generate improperly-encoded strings in
> UTF-8 locales (if the inputs are improperly encoded), and can replace parts
> of multibyte characters with '?' in non-UTF-8 locales. Please try the
> attached patch instead, which attempts to address these issues. This is more
> along the lines that Bruno suggested, except it doesn't use mbsiter as I
> figured it was simpler overall just to use mbrtowc directly for this one
> thing.

Here's the result of df:

$ df
檔案系統容量  已用  可用 已用 掛載點
/dev/disk1s1234G  137G   95G  60% /
/dev/disk1s4234G  2.1G   95G   3% /private/var/vm
chyen.cc:25G   12G   12G  51% /private/tmp/abc def ghi

$ df | xxd
: e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020  
0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7  ..  
0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794  ..  .. .
0030: a820 e68e 9be8 bc89 e9bb 9e0a 2f64 6576  . ../dev
0040: 2f64 6973 6b31 7331 2020 2020 3233 3447  /disk1s1234G
0050: 2020 3133 3747 2020 2039 3547 2020 3630137G   95G  60
0060: 2520 2f0a 2f64 6576 2f64 6973 6b31 7334  % /./dev/disk1s4
0070: 2020 2020 3233 3447 2020 322e 3147 2020  234G  2.1G
0080: 2039 3547 2020 2033 2520 2f70 7269 7661   95G   3% /priva
0090: 7465 2f76 6172 2f76 6d0a 6368 7965 6e2e  te/var/vm.chyen.
00a0: 6363 3a20 2020 2020 2020 2032 3547 2020  cc:25G
00b0: 2031 3247 2020 2031 3247 2020 3531 2520   12G   12G  51%
00c0: 2f70 7269 7661 7465 2f74 6d70 2f61 6263  /private/tmp/abc
00d0: e280 a864 6566 e280 a967 6869 0a ...def...ghi.

Chinese header names are correct, and U+2028 and U+2029 are written
as-is. All tested with LANG=zh_TW.UTF-8 LC_COLLATE=C
LC_CTYPE=zh_TW.UTF-8.



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Paul Eggert

Pádraig Brady wrote:

I did want to only avoid \n etc. that might cause issues for
programs that parsed output from df on a line by line basis.
This subset of control characters is safe to identify
It seems problematic to start eliding improperly encoded
mount points for example, rather than just outputting
what's there.


Yes, I suppose you're right, it's not df's job to police encodings.


Also just incrementing width++ per each wide character
doesn't seem right, though again I've not tested it.


True as well. OK, please ignore my patch.

I was prompted by worries about multibyte encodings that use bytes that could be 
misinterpreted as ASCII control characters, such as a locale that uses EBCDIC 
encoding. However, that's probably just a theoretical concern; no coreutils 
users use EBCDIC any more, right? Plus there are doubtless lots of other places 
in coreutils that assume '\n' is a newline in encoded text.




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Pádraig Brady
On 21/07/18 15:43, Bruno Haible wrote:
> Hi Pádraig,
> 
>> I've attached a gnulib patch to document for iscntrl at least.
> 
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
> 
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
> 
> POSIX [1] says
>   "The c argument is a type int, the value of which the application shall
>ensure is a character representable as an unsigned char or equal to the
>value of the macro EOF. If the argument has any other value, the behavior
>is undefined."
> 
> The term "character" is defined here [2]:
>   "A sequence of one or more bytes representing a single graphic symbol or
>control code."
> 
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
> 
> For invalid values "the behavior is undefined." You were expecting a value 0.
> 
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
>   - the behaviour for valid arguments is different on different platforms, or
>   - the boundary between valid and invalid arguments is fuzzy and depends on
> the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.


Thanks for all that info. I agree iscntrl() behavior on macOS is within spec,
though is still surprising, and different from other systems.
I agree docs should be as succinct as possible, though...

>> I've also attached an alternative patch for df (in your name).
> 
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).

... It might be worth mentioning this subtle point in the c_iscntrl() docs?
"Note this identifies all single byte control chars even in multibyte 
encodings".

> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().

I was mainly worried here about \n for scripts to robustly parse df output.

cheers,
Pádraig.



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Bruno Haible
Chih-Hsuan Yen wrote:
> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
> if you want me to test other patches, thanks!

You could test how it behaves with mount points that contain U+2028 or
U+2029 characters. On Linux, I'd test it like this. Hope it's similar
on macOS:
$ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop019860481986048  0  100% /tmp/abc�def�ghi
$ ls -ld /tmp/abc*
dr-xr-xr-x 4 root root 2048 Nov 19  2014 /tmp/abc?def?ghi

Bruno




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Chih-Hsuan Yen
2018-07-22 18:46 GMT+08:00 Bruno Haible :
> Chih-Hsuan Yen wrote:
>> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
>> if you want me to test other patches, thanks!
>
> You could test how it behaves with mount points that contain U+2028 or
> U+2029 characters. On Linux, I'd test it like this. Hope it's similar
> on macOS:
> $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'`
> $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
> $ df
> ...
> /dev/loop019860481986048  0  100% /tmp/abc�def�ghi
> $ ls -ld /tmp/abc*
> dr-xr-xr-x 4 root root 2048 Nov 19  2014 /tmp/abc?def?ghi
>
> Bruno
>

Hi Bruno,

With the c_iscntrl() patch, the result of ls and df are: (I use xxd as
GMail seems unable to handle U+2028 and U+2029 correctly)

$ ls -ld /tmp/abc
def
ghi | xxd
: 6472 7778 722d 7872 2d78 2031 2079 656e  drwxr-xr-x 1 yen
0010: 2073 7461  2034 3039 3620 3230 3138   staff 4096 2018
0020: 2f30 372f 3232 2030 323a 3136 3a34 3320  /07/22 02:16:43
0030: 2f74 6d70 2f61 6263 e280 a864 6566 e280  /tmp/abc...def..
0040: a967 6869 0a .ghi.

$ df | xxd
: e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020  
0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7  ..  
0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794  ..  .. .
0030: a825 20e6 8e9b e8bc 89e9 bb9e 0a2f 6465  .% ../de
0040: 762f 6469 736b 3173 3120 2020 2032 3334  v/disk1s1234
0050: 4720 2031 3337 4720 2020 3935 4720 2020  G  137G   95G
0060: 3630 2520 2f0a 2f64 6576 2f64 6973 6b31  60% /./dev/disk1
0070: 7334 2020 2020 3233 3447 2020 322e 3147  s4234G  2.1G
0080: 2020 2039 3547 2020 2020 3325 202f 7072 95G3% /pr
0090: 6976 6174 652f 7661 722f 766d 0a63 6879  ivate/var/vm.chy
00a0: 656e 2e63 633a 2020 2020 2020 2020 3235  en.cc:25
00b0: 4720 2020 3132 4720 2020 3132 4720 2020  G   12G   12G
00c0: 3532 2520 2f70 7269 7661 7465 2f74 6d70  52% /private/tmp
00d0: 2f61 6263 e280 a864 6566 e280 a967 6869  /abc...def...ghi
00e0: 0a   .

Without the c_iscntrl() patch (unmodified 8.30), ls behaves the same,
and the result of df is:

$ df | xxd
: e6aa 3fe6 a13f e7b3 bbe7 b5b1 20e5 aeb9  ..?..?.. ...
0010: e93f 3f20 e5b7 b2e7 3fa8 20e5 3faf e73f  .?? ?. .?..?
0020: a820 e5b7 b2e7 3fa8 2520 e63f 3fe8 bc3f  . ?.% .??..?
0030: e9bb 3f0a 2f64 6576 2f64 6973 6b31 7331  ..?./dev/disk1s1
0040: 2020 2020 3233 3447 2020 3133 3747 2020  234G  137G
0050: 2020 3935 4720 2020 2036 3025 202f 0a2f95G60% /./
0060: 6465 762f 6469 736b 3173 3420 2020 2032  dev/disk1s42
0070: 3334 4720 2032 2e31 4720 2020 2039 3547  34G  2.1G95G
0080: 2020 2020 2033 2520 2f70 7269 7661 7465   3% /private
0090: 2f76 6172 2f76 6d0a 6368 7965 6e2e 6363  /var/vm.chyen.cc
00a0: 3a20 2020 2020 2020 2032 3547 2020 2031  :25G   1
00b0: 3247 2020 2020 3132 4720 2020 2035 3125  2G12G51%
00c0: 202f 7072 6976 6174 652f 746d 702f 6162   /private/tmp/ab
00d0: 63e2 3fa8 6465 66e2 3fa9 6768 690a   c.?.def.?.ghi.

Hope those results are helpful!

Chih-Hsuan Yen



Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Bruno Haible
Pádraig Brady wrote:
> but I did want to only avoid \n etc. that might cause issues for
> programs that parsed output from df on a line by line basis.

The current code (which uses iscntrl) also catches escape sequences
that can cause weird output on the screen, in a terminal emulator.
This is good (because it can confuse a human reader as much as a '\n'
would confuse a line-by-line parser).

Now, this feature currently only works for escape sequence that
start with an ASCII escape U+001B. It would be useful also for
other control characters to be caught, at least:
  * escape characters U+009B.
  * other characters that cause a newline in a terminal emulator:
U+2028 and U+2029.
For example, in konsole, the escape sequence '\u009bf' repositions
the cursor. So the effects of

$ mkdir /tmp/`printf 'abc\u009bf'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop019860481986048  0  100% /tmp/abc�f

is that 'df' produces an U+FFFD. This is less useful than what
it produces for an ASCII escape:

$ mkdir /tmp/`printf 'abc\u001b[2J'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop0 692828 692828  0  100% /tmp/abc?[2J

Bruno




Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-22 Thread Bruno Haible
Pádraig Brady wrote:
> > This patch is correct (because the characters that you test for in c_iscntrl
> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a 
> > multibyte
> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
> 
> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
> "Note this identifies all single byte control chars even in multibyte 
> encodings".

Only in the multibyte encodings that are currently in use. We never know what
kinds of features or misfeatures new multibyte encodings will come up with:
Before GB18030 was introduced, it was a common feature of all multibyte 
encodings
(including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
second or later byte in a multibyte character. Well, GB18030 broke this 
assumption.

So, it is dangerous to rely on this property. Therefore I wouldn't like to
document it in the c_iscntrl() documentation.

Bruno