Re: [patch] ls + utf-8 support

2016-01-17 Thread Michael McConville
Martin Natano wrote:
> I agree with Ingo, ls(1) shouldn't generate unsafe output. Regardless
> of whether the output is to a terminal or some other file.

While POSIX is not holy law, doing what you suggest would probably be a
violation. See the description of the -q option:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ls.html

That combined with Stuart's mentioned legitimate uses makes me think
that this could cause subtle and unexpected problems. I do see the
danger of allowing raw bytestring redirection, though.



Re: [patch] ls + utf-8 support

2016-01-17 Thread Ted Unangst
Ingo Schwarze wrote:
> The old ls(1) also weeded out non-printable bytes, in particular
> control codes.

The old ls only had this behavior for terminals however. Redirecting to a file
or pipe would always output the original bytes.



Re: [patch] ls + utf-8 support

2016-01-17 Thread Stuart Henderson
On 2016/01/17 14:29, Ted Unangst wrote:
> Ingo Schwarze wrote:
> > The old ls(1) also weeded out non-printable bytes, in particular
> > control codes.
> 
> The old ls only had this behavior for terminals however. Redirecting to a file
> or pipe would always output the original bytes.

I've used this a few times in the past, for example "ls | hexdump -C"
or .."| vis", to find out what the characters used in some filename are.
I'd find it surprising for this to not work.



Re: [patch] ls + utf-8 support

2016-01-17 Thread Martin Natano
I agree with Ingo, ls(1) shouldn't generate unsafe output. Regardless
of whether the output is to a terminal or some other file.

cheers,
natano



Re: [patch] ls + utf-8 support

2016-01-17 Thread Ingo Schwarze
And now with the patch...

- Forwarded message from Ingo Schwarze <schwa...@usta.de> -

From: Ingo Schwarze <schwa...@usta.de>
Date: Sun, 17 Jan 2016 21:37:56 +0100
To: Stuart Henderson <st...@openbsd.org>, Ted Unangst <t...@tedunangst.com>
Cc: Martijn van Duren <openbsd+t...@list.imperialat.at>, tech@openbsd.org
Subject: Re: [patch] ls + utf-8 support

Hi,

Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +:
> On 2016/01/17 14:29, Ted Unangst wrote:
>> Ingo Schwarze wrote:

>>> The old ls(1) also weeded out non-printable bytes, in particular
>>> control codes.

>> The old ls only had this behavior for terminals however.
>> Redirecting to a file or pipe would always output the original bytes.

> I've used this a few times in the past, for example "ls | hexdump -C"
> or .."| vis", to find out what the characters used in some filename are.
> I'd find it surprising for this to not work.

Oops.  What we currently have in the tree is broken in that respect,
i broke it, including the -q option.

Current behaviour is:

 * SMALL: fully works, but no UTF-8 support
 * not SMALL:
- LC_CTYPE=C on a tty or with -q: does '?', ok
- LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok
- LC_CTYPE=C neither tty nor -q: does '?', wrong
- LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong

The following patch fixes the last two cases.
It is similar in spirit to what Martijn originally sent,
but fixes two issues with his patch:

 1) Do not invent a new global variable, use the existing f_nonprint.
 2) For valid, but non-printable codepoints, print all bytes of the
codepoint's encoding rather than just the first byte.

Should i commit this?

Yours,
  Ingo

- End forwarded message -

Index: utf8.c
===
RCS file: /cvs/src/bin/ls/utf8.c,v
retrieving revision 1.1
diff -u -p -r1.1 utf8.c
--- utf8.c  1 Dec 2015 18:36:13 -   1.1
+++ utf8.c  17 Jan 2016 20:13:51 -
@@ -21,6 +21,8 @@
 #include 
 #include 
 
+extern int f_nonprint;
+
 int
 mbsprint(const char *mbs, int print)
 {
@@ -33,12 +35,16 @@ mbsprint(const char *mbs, int print)
if ((len = mbtowc(, mbs, MB_CUR_MAX)) == -1) {
(void)mbtowc(NULL, NULL, MB_CUR_MAX);
if (print)
-   putchar('?');
+   putchar(f_nonprint ? '?' : *mbs);
total_width++;
len = 1;
} else if ((width = wcwidth(wc)) == -1) {
-   if (print)
-   putchar('?');
+   if (print) {
+   if (f_nonprint)
+   putchar('?');
+   else
+   fwrite(mbs, 1, len, stdout);
+   }
total_width++;
} else {
if (print)



Re: [patch] ls + utf-8 support

2016-01-17 Thread Ingo Schwarze
Hi,

Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +:
> On 2016/01/17 14:29, Ted Unangst wrote:
>> Ingo Schwarze wrote:

>>> The old ls(1) also weeded out non-printable bytes, in particular
>>> control codes.

>> The old ls only had this behavior for terminals however.
>> Redirecting to a file or pipe would always output the original bytes.

> I've used this a few times in the past, for example "ls | hexdump -C"
> or .."| vis", to find out what the characters used in some filename are.
> I'd find it surprising for this to not work.

Oops.  What we currently have in the tree is broken in that respect,
i broke it, including the -q option.

Current behaviour is:

 * SMALL: fully works, but no UTF-8 support
 * not SMALL:
- LC_CTYPE=C on a tty or with -q: does '?', ok
- LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok
- LC_CTYPE=C neither tty nor -q: does '?', wrong
- LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong

The following patch fixes the last two cases.
It is similar in spirit to what Martijn originally sent,
but fixes two issues with his patch:

 1) Do not invent a new global variable, use the existing f_nonprint.
 2) For valid, but non-printable codepoints, print all bytes of the
codepoint's encoding rather than just the first byte.

Should i commit this?

Yours,
  Ingo



Re: [patch] ls + utf-8 support

2016-01-17 Thread Ingo Schwarze
Hi Martijn,

Martijn van Duren wrote on Sun, Jan 17, 2016 at 12:58:38PM +0100:

> I've come across a fair amount of malformed file names by all sorts
> of causes.  Be it malware or just human error.  When such a malformed
> character is in an inconvenient place and can't be auto-completed
> I usually fix this by something like the following:
>
> $ cd "`ls | tail -1`"
> ksh: cd: /home/martijn/Muziek/Motrhead/N?? Sleep at All - No such file
> or directory
> $ cd "`/usr/src/bin/ls/ls | tail -1`"

Why not just

 $ cd N*Sleep*

That seems simpler to me.
Sure, if you have more than one file in the same directory showing
this problem, and the names are too similar to be independently
globbed, and you want to keep them, it's a bit more work:

 $ i=1; for f in N*Sleep*; do mv $f tmp$i; i=$((i+1)); done

I don't see a real reason to use ls(1) in such situations.

> My patch maintains the question marks when stdout is a tty, but returns
> the original byte otherwise.  Afaik the only logical use for the length
> is when doing formatted output, which is only when printing to a tty.

The rationale for weeding out invalid bytes and non-printable
characters is not that we don't know how many display columns the
terminal might use to represent them.  The rationale is that they
might cause the terminal to change state, to interpret them as
in-band control codes.

> This doesn't solve the case when ls is run over ssh -t and the content
> is redirected client-side, but you can't win them all.

Indeed.  I worry that might result in security violations.

The old ls(1) also weeded out non-printable bytes, in particular
control codes.

Even though this question is only tangentially related to UTF-8:
Do we want to change that?  Should ls(1) sometimes pass through
bytes that might be control codes for some terminals?  Imposing
the responsibility that non-isatty(3) ls(1) output never ends up
on a terminal on the user?

I tend to think "no".  I wouldn't want my ls(1) to sometimes generate
output that isn't safe on a terminal.  I doubt i would get into the
habit of considering whether or not ls(1) is safe to run in a given
situation.  I'd rather have it just be safe, always.  And deal with
the occasional insane file name with different tools.

What do people think?
  Ingo


> Index: ls.c
> ===
> RCS file: /cvs/src/bin/ls/ls.c,v
> retrieving revision 1.44
> diff -u -p -r1.44 ls.c
> --- ls.c  1 Dec 2015 18:36:13 -   1.44
> +++ ls.c  17 Jan 2016 10:57:03 -
> @@ -94,6 +94,7 @@ int f_type; /* add type character for 
>  int f_typedir;   /* add type character for directories */
>  
>  int rval;
> +int istty = 0;
>  
>  int
>  ls_main(int argc, char *argv[])
> @@ -110,6 +111,7 @@ ls_main(int argc, char *argv[])
>  
>   /* Terminal defaults to -Cq, non-terminal defaults to -1. */
>   if (isatty(STDOUT_FILENO)) {
> + istty = 1;
>   if ((p = getenv("COLUMNS")) != NULL)
>   width = strtonum(p, 1, INT_MAX, NULL);
>   if (width == 0 &&
> Index: utf8.c
> ===
> RCS file: /cvs/src/bin/ls/utf8.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 utf8.c
> --- utf8.c1 Dec 2015 18:36:13 -   1.1
> +++ utf8.c17 Jan 2016 10:57:03 -
> @@ -21,6 +21,8 @@
>  #include 
>  #include 
>  
> +extern int istty;
> +
>  int
>  mbsprint(const char *mbs, int print)
>  {
> @@ -33,12 +35,12 @@ mbsprint(const char *mbs, int print)
>   if ((len = mbtowc(, mbs, MB_CUR_MAX)) == -1) {
>   (void)mbtowc(NULL, NULL, MB_CUR_MAX);
>   if (print)
> - putchar('?');
> + putchar(istty ? '?' : *mbs);
>   total_width++;
>   len = 1;
>   } else if ((width = wcwidth(wc)) == -1) {
>   if (print)
> - putchar('?');
> + putchar(istty ? '?' : *mbs);
>   total_width++;
>   } else {
>   if (print)