Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-28 3:05 GMT+08:00 Paul Eggert : > Bruno Haible wrote: >> >> You can assume that mbrtowc returns >> 0 if and only if the multibyte sequence is a NUL byte - but you had >> chosen srcend in such a way that this would not happen in the loop. > > > Thanks for the correction. I mistakenly thought that C allows multibyte > encodings in which a null wide character's multibyte representation contains > an all-bits-zero byte. I installed the attached to omit the unnecessary > test. Thanks you all for the efforts! I've installed commit e5dae2c6b0bcd0e4ac6e5b212688d223e2e62f79 of coreutils, and `df` works like a charm! Cheers! Chih-Hsuan Yen
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Bruno Haible wrote: You can assume that mbrtowc returns 0 if and only if the multibyte sequence is a NUL byte - but you had chosen srcend in such a way that this would not happen in the loop. Thanks for the correction. I mistakenly thought that C allows multibyte encodings in which a null wide character's multibyte representation contains an all-bits-zero byte. I installed the attached to omit the unnecessary test. From 34e261fa2768533f34cf35158489ea6a22115c17 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 27 Jul 2018 12:00:02 -0700 Subject: [PATCH] df: omit redundant comparison Trivial inefficiency reported by Bruno Haible in: http://lists.gnu.org/r/bug-gnulib/2018-07/msg00109.html * src/df.c (hide_problematic_chars): Omit redundant test. --- src/df.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/df.c b/src/df.c index 9b65872..5553221 100644 --- a/src/df.c +++ b/src/df.c @@ -288,7 +288,7 @@ hide_problematic_chars (char *cell) wchar_t wc; size_t srcbytes = srcend - src; n = mbrtowc (, src, srcbytes, ); - bool ok = 0 < n && n <= srcbytes; + bool ok = n <= srcbytes; if (ok) ok = !iswcntrl (wc); -- 2.7.4
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Paul Eggert wrote: > my earlier patch > neglected the possibility that mbrtowc can return 0 I wouldn't see this as a bug: You can assume that mbrtowc returns 0 if and only if the multibyte sequence is a NUL byte - but you had chosen srcend in such a way that this would not happen in the loop. > and it incorrectly assumed > wide control characters always have a single-byte representation. Oops, you're right. My mistake as well. The new patch looks good. This will catch (and replace with '?') U+2028 and U+2029 on glibc systems. On macOS, it will not do this, because iswcntrl(0x2028) and iswcntrl(0x2029) is 0 on this system; this is consistent with the fact that the 'Terminal' program displays these characters as simple spaces. So, no need to override iswcntrl on macOS. Bruno 2018-07-27 Bruno Haible iswcntrl: Mention minor problem on macOS. * doc/posix-functions/iswcntrl.texi: Mention oddity on macOS. diff --git a/doc/posix-functions/iswcntrl.texi b/doc/posix-functions/iswcntrl.texi index 99eaa0e..44dd034 100644 --- a/doc/posix-functions/iswcntrl.texi +++ b/doc/posix-functions/iswcntrl.texi @@ -25,4 +25,8 @@ Portability problems not fixed by Gnulib: @item On AIX and Windows platforms, @code{wchar_t} is a 16-bit type and therefore cannot accommodate all Unicode characters. +@item +This function returns 0 for U+2028 (LINE SEPARATOR) and +U+2029 (PARAGRAPH SEPARATOR) on some platforms: +Mac OS X 10.13. @end itemize
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: I've pushed the c_iscntrl patch since it's simplest and probably most appropriate patch for an existing release. Yes, that makes sense for a quick patch. However, for the next release I think it'd be better to catch encoding errors and multibyte control characters, given the problems noted. I installed the attached further patch to try to do this. This fixes the problem that Bruno noted, along with two others; my earlier patch neglected the possibility that mbrtowc can return 0, and it incorrectly assumed wide control characters always have a single-byte representation. Either way the original bug appears to be fix so I'm boldly closing the bug report. From 2cf5d730690dad600f8b6d74d0b5fde522804e43 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sun, 22 Jul 2018 09:50:20 -0700 Subject: [PATCH] df: avoid multibyte character corruption on macOS This improves on the earlier fix for the problem reported by Chih-Hsuan Yen (Bug#32236), by also looking for other control characters and for encoding errors. * src/df.c: Include wchar.h and wctype.h instead of c-ctype.h. (hide_problematic_chars): Process the string as multibyte. Use iswcntrl, not c_iscntrl. --- src/df.c | 43 --- 1 file changed, 32 insertions(+), 11 deletions(-) diff --git a/src/df.c b/src/df.c index c851fcc..d27ba02 100644 --- a/src/df.c +++ b/src/df.c @@ -23,7 +23,8 @@ #include #include #include -#include +#include +#include #include "system.h" #include "canonicalize.h" @@ -272,21 +273,41 @@ static struct option const long_options[] = {NULL, 0, NULL, 0} }; -/* Replace problematic chars with '?'. - Since only control characters are currently considered, - this should work in all encodings. */ +/* Replace problematic chars with '?'. */ -static char* +static void hide_problematic_chars (char *cell) { - char *p = cell; - while (*p) + char *srcend = cell + strlen (cell); + char *dst = cell; + mbstate_t mbstate = { 0, }; + size_t n; + + for (char *src = cell; src != srcend; src += n) { - if (c_iscntrl (to_uchar (*p))) -*p = '?'; - p++; + wchar_t wc; + size_t srcbytes = srcend - src; + n = mbrtowc (, src, srcbytes, ); + bool ok = 0 < n && n <= srcbytes; + + if (ok) +ok = !iswcntrl (wc); + else +n = 1; + + if (ok) +{ + memmove (dst, src, n); + dst += n; +} + else +{ + *dst++ = '?'; + memset (, 0, sizeof mbstate); +} } - return cell; + + *dst = '\0'; } /* Dynamically allocate a row of pointers in TABLE, which -- 2.7.4
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
On 26/07/18 02:01, Paul Eggert wrote: > Chih-Hsuan Yen wrote: >> How about following the idea from Pádraig Brady >> and filter \n only? > > Given the later comments it seems better to filter out encoding errors and > control characters. Programs that parse the output already cannot trust the > strings to be exactly right, since newlines are gonna get replaced no matter > what. So there seems little benefit to copying the other garbage faithfully. > > Revised proposed patch(es) attached. This is better, though this means that mount points now need to match the locale of df or they won't be displayed. Theoretically that was the case previously, but only for control chars and so wouldn't have have had a practical impact for mounts encoded in another local, only for security/robustness reasons where one might have \n etc. I've pushed the c_iscntrl patch since it's simplest and probably most appropriate patch for an existing release. If you consider the matching encoding issue as a non issue, then I'm OK with this. cheers, Pádraig
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Paul Eggert wrote: > Revised proposed patch(es) attached. Looks good to me, except for one little thing: > memcpy (dst, src, n); src and dst may overlap. Therefore memmove should be used instead of memcpy. Bruno
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Chih-Hsuan Yen wrote: How about following the idea from Pádraig Brady and filter \n only? Given the later comments it seems better to filter out encoding errors and control characters. Programs that parse the output already cannot trust the strings to be exactly right, since newlines are gonna get replaced no matter what. So there seems little benefit to copying the other garbage faithfully. Revised proposed patch(es) attached. >From e4f2a5f97771c4a74e0bdef1c1e4a0d2735cef15 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sun, 22 Jul 2018 09:50:20 -0700 Subject: [PATCH 1/2] df: avoid multibyte character corruption on macOS Problem reported by Chih-Hsuan Yen (Bug#32236). * NEWS: Mention the bug fix. * src/df.c: Include wchar.h and wctype.h. (hide_problematic_chars): Respect multibyte encodings when replacing problematic characters or bytes with '?'. --- NEWS | 4 src/df.c | 35 +-- 2 files changed, 29 insertions(+), 10 deletions(-) diff --git a/NEWS b/NEWS index af1a990..aa3b4f9 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*- * Noteworthy changes in release ?.? (-??-??) [?] +** Bug fixes + + df no longer corrupts displayed multibyte characters on macOS. + * Noteworthy changes in release 8.30 (2018-07-01) [stable] diff --git a/src/df.c b/src/df.c index 1178865..52be414b 100644 --- a/src/df.c +++ b/src/df.c @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include "system.h" #include "canonicalize.h" @@ -271,21 +273,34 @@ static struct option const long_options[] = {NULL, 0, NULL, 0} }; -/* Replace problematic chars with '?'. - Since only control characters are currently considered, - this should work in all encodings. */ +/* Replace problematic chars with '?'. */ -static char* +static void hide_problematic_chars (char *cell) { - char *p = cell; - while (*p) + char *srcend = cell + strlen (cell); + char *dst = cell; + mbstate_t mbstate = { 0, }; + size_t n; + + for (char *src = cell; src != srcend; src += n) { - if (iscntrl (to_uchar (*p))) -*p = '?'; - p++; + wchar_t wc; + size_t srcbytes = srcend - src; + n = mbrtowc (, src, srcbytes, ); + if (n <= srcbytes && !iswcntrl (wc)) +{ + memcpy (dst, src, n); + dst += n; +} + else +{ + *dst++ = '?'; + memset (, 0, sizeof mbstate); +} } - return cell; + + *dst = '\0'; } /* Dynamically allocate a row of pointers in TABLE, which -- 2.7.4 >From e40ae8fb26fd0c8c0cb7e42598a2b8d9bd8551bb Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 26 Jul 2018 01:56:28 -0700 Subject: [PATCH 2/2] df: tune slightly * src/df.c (get_header, get_dev): Avoid calling mbswidth twice when once will do. --- src/df.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/df.c b/src/df.c index 52be414b..f5bd26d 100644 --- a/src/df.c +++ b/src/df.c @@ -588,7 +588,8 @@ get_header (void) table[nrows - 1][col] = cell; - columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0)); + size_t cell_width = mbswidth (cell, 0); + columns[col]->width = MAX (columns[col]->width, cell_width); } } @@ -1198,7 +1199,8 @@ get_dev (char const *disk, char const *mount_point, char const* file, assert (!"empty cell"); hide_problematic_chars (cell); - columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0)); + size_t cell_width = mbswidth (cell, 0); + columns[col]->width = MAX (columns[col]->width, cell_width); table[nrows - 1][col] = cell; } free (dev_name); -- 2.7.4
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-23 5:40 GMT+08:00 Bruno Haible : > Pádraig Brady wrote: >> > This patch is correct (because the characters that you test for in >> > c_iscntrl >> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a >> > multibyte >> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). >> >> ... It might be worth mentioning this subtle point in the c_iscntrl() docs? >> "Note this identifies all single byte control chars even in multibyte >> encodings". > > Only in the multibyte encodings that are currently in use. We never know what > kinds of features or misfeatures new multibyte encodings will come up with: > Before GB18030 was introduced, it was a common feature of all multibyte > encodings > (including SJIS) that ASCII characters in the range 0x00..0x3F never occur as > second or later byte in a multibyte character. Well, GB18030 broke this > assumption. > > So, it is dangerous to rely on this property. Therefore I wouldn't like to > document it in the c_iscntrl() documentation. > > Bruno > Hello any update on this? Discussions about encodings are beyond my knowledge, yet I can feel that it's difficult to correctly filter control characters. How about following the idea from Pádraig Brady and filter \n only?
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: > > This patch is correct (because the characters that you test for in c_iscntrl > > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a > > multibyte > > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). > > ... It might be worth mentioning this subtle point in the c_iscntrl() docs? > "Note this identifies all single byte control chars even in multibyte > encodings". Only in the multibyte encodings that are currently in use. We never know what kinds of features or misfeatures new multibyte encodings will come up with: Before GB18030 was introduced, it was a common feature of all multibyte encodings (including SJIS) that ASCII characters in the range 0x00..0x3F never occur as second or later byte in a multibyte character. Well, GB18030 broke this assumption. So, it is dangerous to rely on this property. Therefore I wouldn't like to document it in the c_iscntrl() documentation. Bruno
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: > but I did want to only avoid \n etc. that might cause issues for > programs that parsed output from df on a line by line basis. The current code (which uses iscntrl) also catches escape sequences that can cause weird output on the screen, in a terminal emulator. This is good (because it can confuse a human reader as much as a '\n' would confuse a line-by-line parser). Now, this feature currently only works for escape sequence that start with an ASCII escape U+001B. It would be useful also for other control characters to be caught, at least: * escape characters U+009B. * other characters that cause a newline in a terminal emulator: U+2028 and U+2029. For example, in konsole, the escape sequence '\u009bf' repositions the cursor. So the effects of $ mkdir /tmp/`printf 'abc\u009bf'` $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* $ df ... /dev/loop019860481986048 0 100% /tmp/abc�f is that 'df' produces an U+FFFD. This is less useful than what it produces for an ASCII escape: $ mkdir /tmp/`printf 'abc\u001b[2J'` $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* $ df ... /dev/loop0 692828 692828 0 100% /tmp/abc?[2J Bruno
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: I did want to only avoid \n etc. that might cause issues for programs that parsed output from df on a line by line basis. This subset of control characters is safe to identify It seems problematic to start eliding improperly encoded mount points for example, rather than just outputting what's there. Yes, I suppose you're right, it's not df's job to police encodings. Also just incrementing width++ per each wide character doesn't seem right, though again I've not tested it. True as well. OK, please ignore my patch. I was prompted by worries about multibyte encodings that use bytes that could be misinterpreted as ASCII control characters, such as a locale that uses EBCDIC encoding. However, that's probably just a theoretical concern; no coreutils users use EBCDIC any more, right? Plus there are doubtless lots of other places in coreutils that assume '\n' is a newline in encoded text.
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
On 21/07/18 15:43, Bruno Haible wrote: > Hi Pádraig, > >> I've attached a gnulib patch to document for iscntrl at least. > >> +This function does not support arguments outside of the range of the >> +unsigned char type in locales with large character sets, on some platforms. >> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales. > > In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl(). > > POSIX [1] says > "The c argument is a type int, the value of which the application shall >ensure is a character representable as an unsigned char or equal to the >value of the macro EOF. If the argument has any other value, the behavior >is undefined." > > The term "character" is defined here [2]: > "A sequence of one or more bytes representing a single graphic symbol or >control code." > > So, in a UTF-8 locale, a "character representable as an unsigned char" > is a byte sequence of length 1, where the single byte has a value in the > range 0x00..0x7F. > > For invalid values "the behavior is undefined." You were expecting a value 0. > > Now, in the gnulib documentations, what we mention as portability problems > are the cases where > - the behaviour for valid arguments is different on different platforms, or > - the boundary between valid and invalid arguments is fuzzy and depends on > the platform. > IMO there's no point in documenting that a function _really_ has undefined > behaviour when POSIX says that it has undefined behaviour. Thanks for all that info. I agree iscntrl() behavior on macOS is within spec, though is still surprising, and different from other systems. I agree docs should be as succinct as possible, though... >> I've also attached an alternative patch for df (in your name). > > This patch is correct (because the characters that you test for in c_iscntrl > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). ... It might be worth mentioning this subtle point in the c_iscntrl() docs? "Note this identifies all single byte control chars even in multibyte encodings". > But it does not catch control characters outside of the ASCII range. It would > make sense to catch these as well. If you want to do that, > 'hide_problematic_chars' needs to be rewritten as a loop that iterates across > the multibyte characters. For example with the 'mbiter' module, in > combination with the mb_iscntrl function from the 'mbchar' module. Or > directly with mbrtowc() and iswcntrl(). I was mainly worried here about \n for scripts to robustly parse df output. cheers, Pádraig.
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
On 22/07/18 08:12, Paul Eggert wrote: > Pádraig Brady wrote: >> I've also attached an alternative patch for df (in your name). > > That still has problems, since it can generate improperly-encoded strings in > UTF-8 locales (if the inputs are improperly encoded), and can replace parts > of > multibyte characters with '?' in non-UTF-8 locales. Please try the attached > patch instead, which attempts to address these issues. This is more along the > lines that Bruno suggested, except it doesn't use mbsiter as I figured it was > simpler overall just to use mbrtowc directly for this one thing. I haven't time to review this now, but I did want to only avoid \n etc. that might cause issues for programs that parsed output from df on a line by line basis. This subset of control characters is safe to identify It seems problematic to start eliding improperly encoded mount points for example, rather than just outputting what's there. Also just incrementing width++ per each wide character doesn't seem right, though again I've not tested it. cheers, Pádraig
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-22 23:12 GMT+08:00 Paul Eggert : > Pádraig Brady wrote: >> >> I've also attached an alternative patch for df (in your name). > > > That still has problems, since it can generate improperly-encoded strings in > UTF-8 locales (if the inputs are improperly encoded), and can replace parts > of multibyte characters with '?' in non-UTF-8 locales. Please try the > attached patch instead, which attempts to address these issues. This is more > along the lines that Bruno suggested, except it doesn't use mbsiter as I > figured it was simpler overall just to use mbrtowc directly for this one > thing. Here's the result of df: $ df 檔案系統容量 已用 可用 已用 掛載點 /dev/disk1s1234G 137G 95G 60% / /dev/disk1s4234G 2.1G 95G 3% /private/var/vm chyen.cc:25G 12G 12G 51% /private/tmp/abc def ghi $ df | xxd : e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020 0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7 .. 0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794 .. .. . 0030: a820 e68e 9be8 bc89 e9bb 9e0a 2f64 6576 . ../dev 0040: 2f64 6973 6b31 7331 2020 2020 3233 3447 /disk1s1234G 0050: 2020 3133 3747 2020 2039 3547 2020 3630137G 95G 60 0060: 2520 2f0a 2f64 6576 2f64 6973 6b31 7334 % /./dev/disk1s4 0070: 2020 2020 3233 3447 2020 322e 3147 2020 234G 2.1G 0080: 2039 3547 2020 2033 2520 2f70 7269 7661 95G 3% /priva 0090: 7465 2f76 6172 2f76 6d0a 6368 7965 6e2e te/var/vm.chyen. 00a0: 6363 3a20 2020 2020 2020 2032 3547 2020 cc:25G 00b0: 2031 3247 2020 2031 3247 2020 3531 2520 12G 12G 51% 00c0: 2f70 7269 7661 7465 2f74 6d70 2f61 6263 /private/tmp/abc 00d0: e280 a864 6566 e280 a967 6869 0a ...def...ghi. Chinese header names are correct, and U+2028 and U+2029 are written as-is. All tested with LANG=zh_TW.UTF-8 LC_COLLATE=C LC_CTYPE=zh_TW.UTF-8.
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Pádraig Brady wrote: I've also attached an alternative patch for df (in your name). That still has problems, since it can generate improperly-encoded strings in UTF-8 locales (if the inputs are improperly encoded), and can replace parts of multibyte characters with '?' in non-UTF-8 locales. Please try the attached patch instead, which attempts to address these issues. This is more along the lines that Bruno suggested, except it doesn't use mbsiter as I figured it was simpler overall just to use mbrtowc directly for this one thing. From 17a1a37549344cdfd95cc84b1848dafa256be5a0 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sun, 22 Jul 2018 08:09:01 -0700 Subject: [PATCH] df: avoid multibyte character corruption on macOS Problem reported by Chih-Hsuan Yen (Bug#32236). * NEWS: Mention the bug fix. * src/df.c: Include wchar.h and wctype.h instead of mbswidth.h. (hide_problematic_chars): Return number of screen columns. All callers changed. Use iswcntrl, not iscntrl. (get_header, get_dev): Rely on hide_problematic_chars width, not mbswidth. Scan the cell once, instead of two or three times. --- NEWS | 4 src/df.c | 46 +++--- 2 files changed, 35 insertions(+), 15 deletions(-) diff --git a/NEWS b/NEWS index af1a990..aa3b4f9 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*- * Noteworthy changes in release ?.? (-??-??) [?] +** Bug fixes + + df no longer corrupts displayed multibyte characters on macOS. + * Noteworthy changes in release 8.30 (2018-07-01) [stable] diff --git a/src/df.c b/src/df.c index 1178865..664b88b 100644 --- a/src/df.c +++ b/src/df.c @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include "system.h" #include "canonicalize.h" @@ -31,7 +33,6 @@ #include "fsusage.h" #include "human.h" #include "mbsalign.h" -#include "mbswidth.h" #include "mountlist.h" #include "quote.h" #include "find-mount-point.h" @@ -272,20 +273,36 @@ static struct option const long_options[] = }; /* Replace problematic chars with '?'. - Since only control characters are currently considered, - this should work in all encodings. */ + Return the number of screen columns. */ -static char* +static size_t hide_problematic_chars (char *cell) { - char *p = cell; - while (*p) + char *srcend = cell + strlen (cell); + char *dst = cell; + mbstate_t mbstate = { 0, }; + size_t n; + size_t width = 0; + + for (char *src = cell; src != srcend; src += n) { - if (iscntrl (to_uchar (*p))) -*p = '?'; - p++; + wchar_t wc; + n = mbrtowc (, src, srcend - src, ); + if (n < (size_t) -2 && !iswcntrl (wc)) +{ + memcpy (dst, src, n); + dst += n; +} + else +{ + *dst++ = '?'; + memset (, 0, sizeof mbstate); +} + width++; } - return cell; + + *dst = '\0'; + return width; } /* Dynamically allocate a row of pointers in TABLE, which @@ -569,11 +586,10 @@ get_header (void) if (!cell) xalloc_die (); - hide_problematic_chars (cell); - table[nrows - 1][col] = cell; - columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0)); + size_t cell_width = hide_problematic_chars (cell); + columns[col]->width = MAX (columns[col]->width, cell_width); } } @@ -1182,8 +1198,8 @@ get_dev (char const *disk, char const *mount_point, char const* file, if (!cell) assert (!"empty cell"); - hide_problematic_chars (cell); - columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0)); + size_t cell_width = hide_problematic_chars (cell); + columns[col]->width = MAX (columns[col]->width, cell_width); table[nrows - 1][col] = cell; } free (dev_name); -- 2.7.4
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-22 18:46 GMT+08:00 Bruno Haible : > Chih-Hsuan Yen wrote: >> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me >> if you want me to test other patches, thanks! > > You could test how it behaves with mount points that contain U+2028 or > U+2029 characters. On Linux, I'd test it like this. Hope it's similar > on macOS: > $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'` > $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* > $ df > ... > /dev/loop019860481986048 0 100% /tmp/abc�def�ghi > $ ls -ld /tmp/abc* > dr-xr-xr-x 4 root root 2048 Nov 19 2014 /tmp/abc?def?ghi > > Bruno > Hi Bruno, With the c_iscntrl() patch, the result of ls and df are: (I use xxd as GMail seems unable to handle U+2028 and U+2029 correctly) $ ls -ld /tmp/abc def ghi | xxd : 6472 7778 722d 7872 2d78 2031 2079 656e drwxr-xr-x 1 yen 0010: 2073 7461 2034 3039 3620 3230 3138 staff 4096 2018 0020: 2f30 372f 3232 2030 323a 3136 3a34 3320 /07/22 02:16:43 0030: 2f74 6d70 2f61 6263 e280 a864 6566 e280 /tmp/abc...def.. 0040: a967 6869 0a .ghi. $ df | xxd : e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020 0010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7 .. 0020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794 .. .. . 0030: a825 20e6 8e9b e8bc 89e9 bb9e 0a2f 6465 .% ../de 0040: 762f 6469 736b 3173 3120 2020 2032 3334 v/disk1s1234 0050: 4720 2031 3337 4720 2020 3935 4720 2020 G 137G 95G 0060: 3630 2520 2f0a 2f64 6576 2f64 6973 6b31 60% /./dev/disk1 0070: 7334 2020 2020 3233 3447 2020 322e 3147 s4234G 2.1G 0080: 2020 2039 3547 2020 2020 3325 202f 7072 95G3% /pr 0090: 6976 6174 652f 7661 722f 766d 0a63 6879 ivate/var/vm.chy 00a0: 656e 2e63 633a 2020 2020 2020 2020 3235 en.cc:25 00b0: 4720 2020 3132 4720 2020 3132 4720 2020 G 12G 12G 00c0: 3532 2520 2f70 7269 7661 7465 2f74 6d70 52% /private/tmp 00d0: 2f61 6263 e280 a864 6566 e280 a967 6869 /abc...def...ghi 00e0: 0a . Without the c_iscntrl() patch (unmodified 8.30), ls behaves the same, and the result of df is: $ df | xxd : e6aa 3fe6 a13f e7b3 bbe7 b5b1 20e5 aeb9 ..?..?.. ... 0010: e93f 3f20 e5b7 b2e7 3fa8 20e5 3faf e73f .?? ?. .?..? 0020: a820 e5b7 b2e7 3fa8 2520 e63f 3fe8 bc3f . ?.% .??..? 0030: e9bb 3f0a 2f64 6576 2f64 6973 6b31 7331 ..?./dev/disk1s1 0040: 2020 2020 3233 3447 2020 3133 3747 2020 234G 137G 0050: 2020 3935 4720 2020 2036 3025 202f 0a2f95G60% /./ 0060: 6465 762f 6469 736b 3173 3420 2020 2032 dev/disk1s42 0070: 3334 4720 2032 2e31 4720 2020 2039 3547 34G 2.1G95G 0080: 2020 2020 2033 2520 2f70 7269 7661 7465 3% /private 0090: 2f76 6172 2f76 6d0a 6368 7965 6e2e 6363 /var/vm.chyen.cc 00a0: 3a20 2020 2020 2020 2032 3547 2020 2031 :25G 1 00b0: 3247 2020 2020 3132 4720 2020 2035 3125 2G12G51% 00c0: 202f 7072 6976 6174 652f 746d 702f 6162 /private/tmp/ab 00d0: 63e2 3fa8 6465 66e2 3fa9 6768 690a c.?.def.?.ghi. Hope those results are helpful! Chih-Hsuan Yen
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Chih-Hsuan Yen wrote: > The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me > if you want me to test other patches, thanks! You could test how it behaves with mount points that contain U+2028 or U+2029 characters. On Linux, I'd test it like this. Hope it's similar on macOS: $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'` $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc* $ df ... /dev/loop019860481986048 0 100% /tmp/abc�def�ghi $ ls -ld /tmp/abc* dr-xr-xr-x 4 root root 2048 Nov 19 2014 /tmp/abc?def?ghi Bruno
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
2018-07-22 6:43 GMT+08:00 Bruno Haible : > Hi Pádraig, > >> I've attached a gnulib patch to document for iscntrl at least. > >> +This function does not support arguments outside of the range of the >> +unsigned char type in locales with large character sets, on some platforms. >> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales. > > In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl(). > > POSIX [1] says > "The c argument is a type int, the value of which the application shall >ensure is a character representable as an unsigned char or equal to the >value of the macro EOF. If the argument has any other value, the behavior >is undefined." > > The term "character" is defined here [2]: > "A sequence of one or more bytes representing a single graphic symbol or >control code." > > So, in a UTF-8 locale, a "character representable as an unsigned char" > is a byte sequence of length 1, where the single byte has a value in the > range 0x00..0x7F. > > For invalid values "the behavior is undefined." You were expecting a value 0. > > Now, in the gnulib documentations, what we mention as portability problems > are the cases where > - the behaviour for valid arguments is different on different platforms, or > - the boundary between valid and invalid arguments is fuzzy and depends on > the platform. > IMO there's no point in documenting that a function _really_ has undefined > behaviour when POSIX says that it has undefined behaviour. > >> I've also attached an alternative patch for df (in your name). > > This patch is correct (because the characters that you test for in c_iscntrl > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). > > But it does not catch control characters outside of the ASCII range. It would > make sense to catch these as well. If you want to do that, > 'hide_problematic_chars' needs to be rewritten as a loop that iterates across > the multibyte characters. For example with the 'mbiter' module, in > combination with the mb_iscntrl function from the 'mbchar' module. Or > directly with mbrtowc() and iswcntrl(). > > Bruno > > [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html > [2] > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87 The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me if you want me to test other patches, thanks! Cheers, Chih-Hsuan Yen
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Hi Pádraig, > I've attached a gnulib patch to document for iscntrl at least. > +This function does not support arguments outside of the range of the > +unsigned char type in locales with large character sets, on some platforms. > +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales. In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl(). POSIX [1] says "The c argument is a type int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined." The term "character" is defined here [2]: "A sequence of one or more bytes representing a single graphic symbol or control code." So, in a UTF-8 locale, a "character representable as an unsigned char" is a byte sequence of length 1, where the single byte has a value in the range 0x00..0x7F. For invalid values "the behavior is undefined." You were expecting a value 0. Now, in the gnulib documentations, what we mention as portability problems are the cases where - the behaviour for valid arguments is different on different platforms, or - the boundary between valid and invalid arguments is fuzzy and depends on the platform. IMO there's no point in documenting that a function _really_ has undefined behaviour when POSIX says that it has undefined behaviour. > I've also attached an alternative patch for df (in your name). This patch is correct (because the characters that you test for in c_iscntrl are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings). But it does not catch control characters outside of the ASCII range. It would make sense to catch these as well. If you want to do that, 'hide_problematic_chars' needs to be rewritten as a loop that iterates across the multibyte characters. For example with the 'mbiter' module, in combination with the mb_iscntrl function from the 'mbchar' module. Or directly with mbrtowc() and iswcntrl(). Bruno [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html [2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87
Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
On 21/07/18 07:20, Chih-Hsuan Yen wrote: > Hi coreutils developers, > > I'm using coreutils on macOS High Sierra (10.13). I noticed that with > `LANG=zh_TW.UTF-8`, `df` output is corrupted. > > �?�?系統 容�?? 已�?� �?��?� 已�?�% �??�?�? > /dev/disk1s1234G 151G81G65% / > /dev/disk1s4234G 2.1G81G 3% /private/var/vm > > (I'm not sure if other mail agents can display those characters > correctly or not. See my blog post [1] for the exact output.) > > Seems it's similar to bug#25630 [2], which is not resolved. I guess > the reason of my issue is that iscntrl() is broken on macOS High > Sierra, so in hide_problematic_chars(), some bytes in the Chinese > header is replaced with a question mark. I managed to patch coreutils > [3] to make `df` work. Could you have a look? Thanks! > > Best, > > Chih-Hsuan Yen > > [1] https://blog.chyen.cc/posts/2018/06/23/mac-df-chinese.html > [2] http://lists.gnu.org/archive/html/bug-coreutils/2017-02/msg8.html > [3] > https://github.com/yan12125/macports-ports/blob/fix-coreutils-df-chinese/sysutils/coreutils/files/patch-df.diff Wow. That's surprising. I do see the FreeBSD man pages say: "The 4.4BSD extension of accepting arguments outside of the range of the unsigned char type in locales with large character sets is considered obsolete and may not be supported in future releases." Now I think that might have been referring to >= 0xFF, but fair enough. I've attached a gnulib patch to document for iscntrl at least. It would be great if someone could test the other is*() classification functions on macOS so that I might have a more complete documentation patch. I've also attached an alternative patch for df (in your name). Can you try that one? thanks! Pádraig >From 6b7434fb222144af3ae9e2d4fd6b4c72eec25f5b Mon Sep 17 00:00:00 2001 From: Chih-Hsuan Yen Date: Sat, 21 Jul 2018 13:19:23 -0700 Subject: [PATCH] df: avoid multibyte character corruption on macOS * src/df.c (hide_problematic_chars): Use c_iscntrl() as passing 8 bit characters to iscntrl() is not supported on macOS. * NEWS: Mention the bug fix. Fixes https://bugs.gnu.org/32236 --- NEWS | 4 src/df.c | 3 ++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/NEWS b/NEWS index af1a990..aa3b4f9 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*- * Noteworthy changes in release ?.? (-??-??) [?] +** Bug fixes + + df no longer corrupts displayed multibyte characters on macOS. + * Noteworthy changes in release 8.30 (2018-07-01) [stable] diff --git a/src/df.c b/src/df.c index 1178865..c851fcc 100644 --- a/src/df.c +++ b/src/df.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "system.h" #include "canonicalize.h" @@ -281,7 +282,7 @@ hide_problematic_chars (char *cell) char *p = cell; while (*p) { - if (iscntrl (to_uchar (*p))) + if (c_iscntrl (to_uchar (*p))) *p = '?'; p++; } -- 2.9.3 >From 816cc0d5fb92552a551c523f49c829261731dfe8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= Date: Sat, 21 Jul 2018 13:15:13 -0700 Subject: [PATCH] iscntrl: document that macOS returns true for >= 0x80 * doc/posix-functions/iscntrl.texi: Mention that support for chars >= 0x80 is not standarized, and not supported on OS X >= 10.5 at least --- doc/posix-functions/iscntrl.texi | 3 +++ 1 file changed, 3 insertions(+) diff --git a/doc/posix-functions/iscntrl.texi b/doc/posix-functions/iscntrl.texi index 7e6813f..3c15708 100644 --- a/doc/posix-functions/iscntrl.texi +++ b/doc/posix-functions/iscntrl.texi @@ -16,6 +16,9 @@ OS X 10.8. Portability problems not fixed by Gnulib: @itemize +This function does not support arguments outside of the range of the +unsigned char type in locales with large character sets, on some platforms. +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales. @end itemize Note: This function's behaviour depends on the locale, but does not support -- 2.9.3