Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-22 6:43 GMT+08:00 Bruno Haible : > Hi Pádraig, > >> I've attached a gnulib patch to document for iscntrl at least. > >> +This function does not support arguments outside of the range of the >> +unsigned char type in locales with large character sets, on some platforms. >> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales. > > In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl(). > > POSIX [1] says > "The c argument is a type int, the value of which the application shall >ensure is a character representable as an unsigned char or equal to the >value of the macro EOF. If the argument has any other value, the behavior >is undefined." > > The term "character" is defined here [2]: > "A sequence of one or more bytes representing a single graphic symbol or >control code." > > So, in a UTF-8 locale, a "character representable as an unsigned char" > is a byte sequence of length 1, where the single byte has a value in the > range 0x00..0x7F. > > For invalid values "the behavior is undefined." You were expecting a value 0. > > Now, in the gnulib documentations, what we mention as portability problems > are the cases where > - the behaviour for valid arguments is different on different platforms, or > - the boundary between valid and invalid arguments is fuzzy and depends on > the platform. > IMO there's no point in documenting that a function _really_ has undefined > behaviour when POSIX says that it has undefined behaviour. > >> I've also attached an alternative patch for df (in your name). > > This patch is correct (because the characters that you test for in c_iscntrl > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). > > But it does not catch control characters outside of the ASCII range. It would > make sense to catch these as well. If you want to do that, > 'hide_problematic_chars' needs to be rewritten as a loop that iterates across > the multibyte characters. For example with the 'mbiter' module, in > combination with the mb_iscntrl function from the 'mbchar' module. Or > directly with mbrtowc() and iswcntrl(). > > Bruno > > [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html > [2] > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87 The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me if you want me to test other patches, thanks! Cheers, Chih-Hsuan Yen
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: I've also attached an alternative patch for df (in your name). That still has problems, since it can generate improperly-encoded strings in UTF-8 locales (if the inputs are improperly encoded), and can replace parts of multibyte characters with '?' in non-UTF-8 locales. Please try the attached patch instead, which attempts to address these issues. This is more along the lines that Bruno suggested, except it doesn't use mbsiter as I figured it was simpler overall just to use mbrtowc directly for this one thing. From 17a1a37549344cdfd95cc84b1848dafa256be5a0 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sun, 22 Jul 2018 08:09:01 -0700 Subject: [PATCH] df: avoid multibyte character corruption on macOS Problem reported by Chih-Hsuan Yen (Bug#32236). * NEWS: Mention the bug fix. * src/df.c: Include wchar.h and wctype.h instead of mbswidth.h. (hide_problematic_chars): Return number of screen columns. All callers changed. Use iswcntrl, not iscntrl. (get_header, get_dev): Rely on hide_problematic_chars width, not mbswidth. Scan the cell once, instead of two or three times. --- NEWS | 4 src/df.c | 46 +++--- 2 files changed, 35 insertions(+), 15 deletions(-) diff --git a/NEWS b/NEWS index af1a990..aa3b4f9 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*- * Noteworthy changes in release ?.? (-??-??) [?] +** Bug fixes + + df no longer corrupts displayed multibyte characters on macOS. + * Noteworthy changes in release 8.30 (2018-07-01) [stable] diff --git a/src/df.c b/src/df.c index 1178865..664b88b 100644 --- a/src/df.c +++ b/src/df.c @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include "system.h" #include "canonicalize.h" @@ -31,7 +33,6 @@ #include "fsusage.h" #include "human.h" #include "mbsalign.h" -#include "mbswidth.h" #include "mountlist.h" #include "quote.h" #include "find-mount-point.h" @@ -272,20 +273,36 @@ static struct option const long_options[] = }; /* Replace problematic chars with '?'. - Since only control characters are currently considered, - this should work in all encodings. */ + Return the number of screen columns. */ -static char* +static size_t hide_problematic_chars (char *cell) { - char *p = cell; - while (*p) + char *srcend = cell + strlen (cell); + char *dst = cell; + mbstate_t mbstate = { 0, }; + size_t n; + size_t width = 0; + + for (char *src = cell; src != srcend; src += n) { - if (iscntrl (to_uchar (*p))) -*p = '?'; - p++; + wchar_t wc; + n = mbrtowc (, src, srcend - src, ); + if (n < (size_t) -2 && !iswcntrl (wc)) +{ + memcpy (dst, src, n); + dst += n; +} + else +{ + *dst++ = '?'; + memset (, 0, sizeof mbstate); +} + width++; } - return cell; + + *dst = '\0'; + return width; } /* Dynamically allocate a row of pointers in TABLE, which @@ -569,11 +586,10 @@ get_header (void) if (!cell) xalloc_die (); - hide_problematic_chars (cell); - table[nrows - 1][col] = cell; - columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0)); + size_t cell_width = hide_problematic_chars (cell); + columns[col]->width = MAX (columns[col]->width, cell_width); } } @@ -1182,8 +1198,8 @@ get_dev (char const *disk, char const *mount_point, char const* file, if (!cell) assert (!"empty cell"); - hide_problematic_chars (cell); - columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0)); + size_t cell_width = hide_problematic_chars (cell); + columns[col]->width = MAX (columns[col]->width, cell_width); table[nrows - 1][col] = cell; } free (dev_name); -- 2.7.4
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
On 22/07/18 08:12, Paul Eggert wrote: > Pádraig Brady wrote: >> I've also attached an alternative patch for df (in your name). > > That still has problems, since it can generate improperly-encoded strings in > UTF-8 locales (if the inputs are improperly encoded), and can replace parts > of > multibyte characters with '?' in non-UTF-8 locales. Please try the attached > patch instead, which attempts to address these issues. This is more along the > lines that Bruno suggested, except it doesn't use mbsiter as I figured it was > simpler overall just to use mbrtowc directly for this one thing. I haven't time to review this now, but I did want to only avoid \n etc. that might cause issues for programs that parsed output from df on a line by line basis. This subset of control characters is safe to identify It seems problematic to start eliding improperly encoded mount points for example, rather than just outputting what's there. Also just incrementing width++ per each wide character doesn't seem right, though again I've not tested it. cheers, Pádraig
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-22 23:12 GMT+08:00 Paul Eggert : > Pádraig Brady wrote: >> >> I've also attached an alternative patch for df (in your name). > > > That still has problems, since it can generate improperly-encoded strings in > UTF-8 locales (if the inputs are improperly encoded), and can replace parts > of multibyte characters with '?' in non-UTF-8 locales. Please try the > attached patch instead, which attempts to address these issues. This is more > along the lines that Bruno suggested, except it doesn't use mbsiter as I > figured it was simpler overall just to use mbrtowc directly for this one > thing. Here's the result of df: $ df 檔案系統容量 已用 可用 已用 掛載點 /dev/disk1s1234G 137G 95G 60% / /dev/disk1s4234G 2.1G 95G 3% /private/var/vm chyen.cc:25G 12G 12G 51% /private/tmp/abc def ghi $ df | xxd : e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020 0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7 .. 0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794 .. .. . 0030: a820 e68e 9be8 bc89 e9bb 9e0a 2f64 6576 . ../dev 0040: 2f64 6973 6b31 7331 2020 2020 3233 3447 /disk1s1234G 0050: 2020 3133 3747 2020 2039 3547 2020 3630137G 95G 60 0060: 2520 2f0a 2f64 6576 2f64 6973 6b31 7334 % /./dev/disk1s4 0070: 2020 2020 3233 3447 2020 322e 3147 2020 234G 2.1G 0080: 2039 3547 2020 2033 2520 2f70 7269 7661 95G 3% /priva 0090: 7465 2f76 6172 2f76 6d0a 6368 7965 6e2e te/var/vm.chyen. 00a0: 6363 3a20 2020 2020 2020 2032 3547 2020 cc:25G 00b0: 2031 3247 2020 2031 3247 2020 3531 2520 12G 12G 51% 00c0: 2f70 7269 7661 7465 2f74 6d70 2f61 6263 /private/tmp/abc 00d0: e280 a864 6566 e280 a967 6869 0a ...def...ghi. Chinese header names are correct, and U+2028 and U+2029 are written as-is. All tested with LANG=zh_TW.UTF-8 LC_COLLATE=C LC_CTYPE=zh_TW.UTF-8.
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: I did want to only avoid \n etc. that might cause issues for programs that parsed output from df on a line by line basis. This subset of control characters is safe to identify It seems problematic to start eliding improperly encoded mount points for example, rather than just outputting what's there. Yes, I suppose you're right, it's not df's job to police encodings. Also just incrementing width++ per each wide character doesn't seem right, though again I've not tested it. True as well. OK, please ignore my patch. I was prompted by worries about multibyte encodings that use bytes that could be misinterpreted as ASCII control characters, such as a locale that uses EBCDIC encoding. However, that's probably just a theoretical concern; no coreutils users use EBCDIC any more, right? Plus there are doubtless lots of other places in coreutils that assume '\n' is a newline in encoded text.
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
On 21/07/18 15:43, Bruno Haible wrote: > Hi Pádraig, > >> I've attached a gnulib patch to document for iscntrl at least. > >> +This function does not support arguments outside of the range of the >> +unsigned char type in locales with large character sets, on some platforms. >> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales. > > In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl(). > > POSIX [1] says > "The c argument is a type int, the value of which the application shall >ensure is a character representable as an unsigned char or equal to the >value of the macro EOF. If the argument has any other value, the behavior >is undefined." > > The term "character" is defined here [2]: > "A sequence of one or more bytes representing a single graphic symbol or >control code." > > So, in a UTF-8 locale, a "character representable as an unsigned char" > is a byte sequence of length 1, where the single byte has a value in the > range 0x00..0x7F. > > For invalid values "the behavior is undefined." You were expecting a value 0. > > Now, in the gnulib documentations, what we mention as portability problems > are the cases where > - the behaviour for valid arguments is different on different platforms, or > - the boundary between valid and invalid arguments is fuzzy and depends on > the platform. > IMO there's no point in documenting that a function _really_ has undefined > behaviour when POSIX says that it has undefined behaviour. Thanks for all that info. I agree iscntrl() behavior on macOS is within spec, though is still surprising, and different from other systems. I agree docs should be as succinct as possible, though... >> I've also attached an alternative patch for df (in your name). > > This patch is correct (because the characters that you test for in c_iscntrl > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). ... It might be worth mentioning this subtle point in the c_iscntrl() docs? "Note this identifies all single byte control chars even in multibyte encodings". > But it does not catch control characters outside of the ASCII range. It would > make sense to catch these as well. If you want to do that, > 'hide_problematic_chars' needs to be rewritten as a loop that iterates across > the multibyte characters. For example with the 'mbiter' module, in > combination with the mb_iscntrl function from the 'mbchar' module. Or > directly with mbrtowc() and iswcntrl(). I was mainly worried here about \n for scripts to robustly parse df output. cheers, Pádraig.
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Chih-Hsuan Yen wrote: > The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me > if you want me to test other patches, thanks! You could test how it behaves with mount points that contain U+2028 or U+2029 characters. On Linux, I'd test it like this. Hope it's similar on macOS: $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'` $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* $ df ... /dev/loop019860481986048 0 100% /tmp/abc�def�ghi $ ls -ld /tmp/abc* dr-xr-xr-x 4 root root 2048 Nov 19 2014 /tmp/abc?def?ghi Bruno
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-22 18:46 GMT+08:00 Bruno Haible : > Chih-Hsuan Yen wrote: >> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me >> if you want me to test other patches, thanks! > > You could test how it behaves with mount points that contain U+2028 or > U+2029 characters. On Linux, I'd test it like this. Hope it's similar > on macOS: > $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'` > $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* > $ df > ... > /dev/loop019860481986048 0 100% /tmp/abc�def�ghi > $ ls -ld /tmp/abc* > dr-xr-xr-x 4 root root 2048 Nov 19 2014 /tmp/abc?def?ghi > > Bruno > Hi Bruno, With the c_iscntrl() patch, the result of ls and df are: (I use xxd as GMail seems unable to handle U+2028 and U+2029 correctly) $ ls -ld /tmp/abc def ghi | xxd : 6472 7778 722d 7872 2d78 2031 2079 656e drwxr-xr-x 1 yen 0010: 2073 7461 2034 3039 3620 3230 3138 staff 4096 2018 0020: 2f30 372f 3232 2030 323a 3136 3a34 3320 /07/22 02:16:43 0030: 2f74 6d70 2f61 6263 e280 a864 6566 e280 /tmp/abc...def.. 0040: a967 6869 0a .ghi. $ df | xxd : e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020 0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7 .. 0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794 .. .. . 0030: a825 20e6 8e9b e8bc 89e9 bb9e 0a2f 6465 .% ../de 0040: 762f 6469 736b 3173 3120 2020 2032 3334 v/disk1s1234 0050: 4720 2031 3337 4720 2020 3935 4720 2020 G 137G 95G 0060: 3630 2520 2f0a 2f64 6576 2f64 6973 6b31 60% /./dev/disk1 0070: 7334 2020 2020 3233 3447 2020 322e 3147 s4234G 2.1G 0080: 2020 2039 3547 2020 2020 3325 202f 7072 95G3% /pr 0090: 6976 6174 652f 7661 722f 766d 0a63 6879 ivate/var/vm.chy 00a0: 656e 2e63 633a 2020 2020 2020 2020 3235 en.cc:25 00b0: 4720 2020 3132 4720 2020 3132 4720 2020 G 12G 12G 00c0: 3532 2520 2f70 7269 7661 7465 2f74 6d70 52% /private/tmp 00d0: 2f61 6263 e280 a864 6566 e280 a967 6869 /abc...def...ghi 00e0: 0a . Without the c_iscntrl() patch (unmodified 8.30), ls behaves the same, and the result of df is: $ df | xxd : e6aa 3fe6 a13f e7b3 bbe7 b5b1 20e5 aeb9 ..?..?.. ... 0010: e93f 3f20 e5b7 b2e7 3fa8 20e5 3faf e73f .?? ?. .?..? 0020: a820 e5b7 b2e7 3fa8 2520 e63f 3fe8 bc3f . ?.% .??..? 0030: e9bb 3f0a 2f64 6576 2f64 6973 6b31 7331 ..?./dev/disk1s1 0040: 2020 2020 3233 3447 2020 3133 3747 2020 234G 137G 0050: 2020 3935 4720 2020 2036 3025 202f 0a2f95G60% /./ 0060: 6465 762f 6469 736b 3173 3420 2020 2032 dev/disk1s42 0070: 3334 4720 2032 2e31 4720 2020 2039 3547 34G 2.1G95G 0080: 2020 2020 2033 2520 2f70 7269 7661 7465 3% /private 0090: 2f76 6172 2f76 6d0a 6368 7965 6e2e 6363 /var/vm.chyen.cc 00a0: 3a20 2020 2020 2020 2032 3547 2020 2031 :25G 1 00b0: 3247 2020 2020 3132 4720 2020 2035 3125 2G12G51% 00c0: 202f 7072 6976 6174 652f 746d 702f 6162 /private/tmp/ab 00d0: 63e2 3fa8 6465 66e2 3fa9 6768 690a c.?.def.?.ghi. Hope those results are helpful! Chih-Hsuan Yen
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: > but I did want to only avoid \n etc. that might cause issues for > programs that parsed output from df on a line by line basis. The current code (which uses iscntrl) also catches escape sequences that can cause weird output on the screen, in a terminal emulator. This is good (because it can confuse a human reader as much as a '\n' would confuse a line-by-line parser). Now, this feature currently only works for escape sequence that start with an ASCII escape U+001B. It would be useful also for other control characters to be caught, at least: * escape characters U+009B. * other characters that cause a newline in a terminal emulator: U+2028 and U+2029. For example, in konsole, the escape sequence '\u009bf' repositions the cursor. So the effects of $ mkdir /tmp/`printf 'abc\u009bf'` $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* $ df ... /dev/loop019860481986048 0 100% /tmp/abc�f is that 'df' produces an U+FFFD. This is less useful than what it produces for an ASCII escape: $ mkdir /tmp/`printf 'abc\u001b[2J'` $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* $ df ... /dev/loop0 692828 692828 0 100% /tmp/abc?[2J Bruno
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: > > This patch is correct (because the characters that you test for in c_iscntrl > > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a > > multibyte > > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). > > ... It might be worth mentioning this subtle point in the c_iscntrl() docs? > "Note this identifies all single byte control chars even in multibyte > encodings". Only in the multibyte encodings that are currently in use. We never know what kinds of features or misfeatures new multibyte encodings will come up with: Before GB18030 was introduced, it was a common feature of all multibyte encodings (including SJIS) that ASCII characters in the range 0x00..0x3F never occur as second or later byte in a multibyte character. Well, GB18030 broke this assumption. So, it is dangerous to rely on this property. Therefore I wouldn't like to document it in the c_iscntrl() documentation. Bruno