And now with the patch...

----- Forwarded message from Ingo Schwarze <schwa...@usta.de> -----

From: Ingo Schwarze <schwa...@usta.de>
Date: Sun, 17 Jan 2016 21:37:56 +0100
To: Stuart Henderson <st...@openbsd.org>, Ted Unangst <t...@tedunangst.com>
Cc: Martijn van Duren <openbsd+t...@list.imperialat.at>, tech@openbsd.org
Subject: Re: [patch] ls + utf-8 support

Hi,

Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +0000:
> On 2016/01/17 14:29, Ted Unangst wrote:
>> Ingo Schwarze wrote:

>>> The old ls(1) also weeded out non-printable bytes, in particular
>>> control codes.

>> The old ls only had this behavior for terminals however.
>> Redirecting to a file or pipe would always output the original bytes.

> I've used this a few times in the past, for example "ls | hexdump -C"
> or .."| vis", to find out what the characters used in some filename are.
> I'd find it surprising for this to not work.

Oops.

What we currently have in the tree is broken in that respect,
i broke it, including the -q option.

Current behaviour is:

 * SMALL: fully works, but no UTF-8 support
 * not SMALL:
    - LC_CTYPE=C on a tty or with -q: does '?', ok
    - LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok
    - LC_CTYPE=C neither tty nor -q: does '?', wrong
    - LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong

The following patch fixes the last two cases.

It is similar in spirit to what Martijn originally sent,
but fixes two issues with his patch:

 1) Do not invent a new global variable, use the existing f_nonprint.
 2) For valid, but non-printable codepoints, print all bytes of
    the codepoint's encoding rather than just the first byte.

Should i commit this?

Yours,
  Ingo

----- End forwarded message -----

Index: utf8.c
===================================================================
RCS file: /cvs/src/bin/ls/utf8.c,v
retrieving revision 1.1
diff -u -p -r1.1 utf8.c
--- utf8.c	1 Dec 2015 18:36:13 -0000	1.1
+++ utf8.c	17 Jan 2016 20:13:51 -0000
@@ -21,6 +21,8 @@
 #include <stdlib.h>
 #include <wchar.h>
 
+extern int f_nonprint;
+
 int
 mbsprint(const char *mbs, int print)
 {
@@ -33,12 +35,16 @@ mbsprint(const char *mbs, int print)
 		if ((len = mbtowc(&wc, mbs, MB_CUR_MAX)) == -1) {
 			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
 			if (print)
-				putchar('?');
+				putchar(f_nonprint ? '?' : *mbs);
 			total_width++;
 			len = 1;
 		} else if ((width = wcwidth(wc)) == -1) {
-			if (print)
-				putchar('?');
+			if (print) {
+				if (f_nonprint)
+					putchar('?');
+				else
+					fwrite(mbs, 1, len, stdout);
+			}
 			total_width++;
 		} else {
 			if (print)
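
For readers who want to try the patched logic outside of ls(1), here is
a minimal standalone sketch.  It is not part of the patch.  The function
name name_filter() and its buffer interface are hypothetical; the
"nonprint" argument stands in for the global f_nonprint, and the output
goes to a caller-supplied buffer instead of stdout so that the result is
easy to inspect.  The three branches are meant to mirror the ones in
mbsprint() above.

```c
#define _XOPEN_SOURCE 700	/* assumption: needed for wcwidth() on some systems */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of ls(1): copy the file name "mbs"
 * to "out", treating invalid bytes and non-printable characters the
 * way the patched mbsprint() does.  "nonprint" plays the role of
 * f_nonprint.  Returns the display width, or -1 if "out" is too small.
 */
int
name_filter(const char *mbs, int nonprint, char *out, size_t outsz)
{
	wchar_t	 wc;
	size_t	 pos = 0;
	int	 len, width, total_width = 0;

	assert(mbs != NULL && out != NULL);
	if (outsz == 0)
		return -1;

	for (; *mbs != '\0'; mbs += len) {
		len = mbtowc(&wc, mbs, MB_CUR_MAX);
		if (len == -1) {
			/* Invalid byte: reset the state, consume one byte. */
			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
			len = 1;
			width = -1;
		} else
			width = wcwidth(wc);

		if (width == -1 && nonprint) {
			/* Like -q or a tty: hide the character. */
			if (pos + 1 >= outsz)
				return -1;
			out[pos++] = '?';
			total_width++;
		} else {
			/* Pass through all bytes of the encoding. */
			if (pos + (size_t)len >= outsz)
				return -1;
			memcpy(out + pos, mbs, (size_t)len);
			pos += (size_t)len;
			total_width += width == -1 ? 1 : width;
		}
	}
	out[pos] = '\0';
	return total_width;
}
```

With nonprint set, a name like "a\tb" comes out as "a?b"; with nonprint
clear, the tab byte is copied unchanged, which is what "ls | hexdump -C"
relies on.  Both cases count the tab as one column, as the patch does.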