And now with the patch...

----- Forwarded message from Ingo Schwarze <schwa...@usta.de> -----

From: Ingo Schwarze <schwa...@usta.de>
Date: Sun, 17 Jan 2016 21:37:56 +0100
To: Stuart Henderson <st...@openbsd.org>, Ted Unangst <t...@tedunangst.com>
Cc: Martijn van Duren <openbsd+t...@list.imperialat.at>, tech@openbsd.org
Subject: Re: [patch] ls + utf-8 support

Hi,

Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +0000:
> On 2016/01/17 14:29, Ted Unangst wrote:
>> Ingo Schwarze wrote:

>>> The old ls(1) also weeded out non-printable bytes, in particular
>>> control codes.

>> The old ls only had this behavior for terminals however.
>> Redirecting to a file or pipe would always output the original bytes.

> I've used this a few times in the past, for example "ls | hexdump -C"
> or "... | vis", to find out what the characters used in some filename are.
> I'd find it surprising for this to not work.

Oops.  What we currently have in the tree is broken in that respect:
i broke it, and the -q option along with it.

Current behaviour is:

 * SMALL: fully works, but no UTF-8 support
 * not SMALL:
    - LC_CTYPE=C, on a tty or with -q: prints '?', ok
    - LC_CTYPE=en_US.UTF-8, on a tty or with -q: prints '?', ok
    - LC_CTYPE=C, neither tty nor -q: prints '?', wrong
      (should output the original bytes)
    - LC_CTYPE=en_US.UTF-8, neither tty nor -q: prints '?', wrong
      (should output the original bytes)

The following patch fixes the last two cases.
It is similar in spirit to what Martijn originally sent,
but fixes two issues with his patch:

 1) Do not invent a new global variable, use the existing f_nonprint.
 2) For valid, but non-printable codepoints, print all bytes of the
    codepoint's encoding rather than just the first byte.

Should i commit this?

Yours,
  Ingo

----- End forwarded message -----
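
For anyone who wants to try the behaviour outside of ls(1), here is a
minimal standalone sketch of the logic the patch below aims for.  It is
not the committed code: the function name sketch_print, the explicit
nonprint parameter (standing in for the global f_nonprint) and the
FILE * argument are assumptions made only to keep the example
self-contained.  The _XOPEN_SOURCE define is there because some systems
hide wcwidth() without it.

```c
#define _XOPEN_SOURCE 700	/* exposes wcwidth() on some systems */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of ls(1).
 * Write the file name mbs to out and return the number of display
 * columns counted.  If nonprint is non-zero, invalid bytes and
 * non-printable characters are replaced by '?'; otherwise the
 * original bytes are written unchanged.
 */
int
sketch_print(const char *mbs, int nonprint, FILE *out)
{
	wchar_t	  wc;
	int	  len;		/* bytes consumed by this character */
	int	  width;	/* display columns of this character */
	int	  total_width = 0;

	assert(mbs != NULL && out != NULL);

	while (*mbs != '\0') {
		if ((len = mbtowc(&wc, mbs, MB_CUR_MAX)) == -1) {
			/* Invalid byte: reset the conversion state. */
			(void)mbtowc(NULL, NULL, MB_CUR_MAX);
			fputc(nonprint ? '?' : *mbs, out);
			total_width++;
			len = 1;
		} else if ((width = wcwidth(wc)) == -1) {
			/* Valid character, but not printable. */
			if (nonprint)
				fputc('?', out);
			else
				fwrite(mbs, 1, len, out);  /* all bytes */
			total_width++;
		} else {
			/* Printable character: copy it as is. */
			fwrite(mbs, 1, len, out);
			total_width += width;
		}
		mbs += len;
	}
	return total_width;
}
```

With nonprint set, the name "a\001b" comes out as "a?b"; with it clear,
the three original bytes are written unchanged, which is what
"ls | hexdump -C" relies on.  In both cases three columns are counted.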

Index: utf8.c
===================================================================
RCS file: /cvs/src/bin/ls/utf8.c,v
retrieving revision 1.1
diff -u -p -r1.1 utf8.c
--- utf8.c      1 Dec 2015 18:36:13 -0000       1.1
+++ utf8.c      17 Jan 2016 20:13:51 -0000
@@ -21,6 +21,8 @@
 #include <stdlib.h>
 #include <wchar.h>
 
+extern int f_nonprint;
+
 int
 mbsprint(const char *mbs, int print)
 {
@@ -33,12 +35,16 @@ mbsprint(const char *mbs, int print)
                if ((len = mbtowc(&wc, mbs, MB_CUR_MAX)) == -1) {
                        (void)mbtowc(NULL, NULL, MB_CUR_MAX);
                        if (print)
-                               putchar('?');
+                               putchar(f_nonprint ? '?' : *mbs);
                        total_width++;
                        len = 1;
                } else if ((width = wcwidth(wc)) == -1) {
-                       if (print)
-                               putchar('?');
+                       if (print) {
+                               if (f_nonprint)
+                                       putchar('?');
+                               else
+                                       fwrite(mbs, 1, len, stdout);
+                       }
                        total_width++;
                } else {
                        if (print)
