Re: [patch] ls + utf-8 support
Martin Natano wrote: > I agree with Ingo, ls(1) shouldn't generate unsafe output. Regardless > of whether the output is to a terminal or some other file. While POSIX is not holy law, doing what you suggest would probably be a violation. See the description of the -q option: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ls.html That combined with Stuart's mentioned legitimate uses makes me think that this could cause subtle and unexpected problems. I do see the danger of allowing raw bytestring redirection, though.
Re: [patch] ls + utf-8 support
Ingo Schwarze wrote: > The old ls(1) also weeded out non-printable bytes, in particular > control codes. The old ls only had this behavior for terminals however. Redirecting to a file or pipe would always output the original bytes.
Re: [patch] ls + utf-8 support
On 2016/01/17 14:29, Ted Unangst wrote: > Ingo Schwarze wrote: > > The old ls(1) also weeded out non-printable bytes, in particular > > control codes. > > The old ls only had this behavior for terminals however. Redirecting to a file > or pipe would always output the original bytes. I've used this a few times in the past, for example "ls | hexdump -C" or .."| vis", to find out what the characters used in some filename are. I'd find it surprising for this to not work.
Re: [patch] ls + utf-8 support
I agree with Ingo, ls(1) shouldn't generate unsafe output. Regardless of whether the output is to a terminal or some other file. cheers, natano
Re: [patch] ls + utf-8 support
And now with the patch... - Forwarded message from Ingo Schwarze <schwa...@usta.de> - From: Ingo Schwarze <schwa...@usta.de> Date: Sun, 17 Jan 2016 21:37:56 +0100 To: Stuart Henderson <st...@openbsd.org>, Ted Unangst <t...@tedunangst.com> Cc: Martijn van Duren <openbsd+t...@list.imperialat.at>, tech@openbsd.org Subject: Re: [patch] ls + utf-8 support Hi, Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +: > On 2016/01/17 14:29, Ted Unangst wrote: >> Ingo Schwarze wrote: >>> The old ls(1) also weeded out non-printable bytes, in particular >>> control codes. >> The old ls only had this behavior for terminals however. >> Redirecting to a file or pipe would always output the original bytes. > I've used this a few times in the past, for example "ls | hexdump -C" > or .."| vis", to find out what the characters used in some filename are. > I'd find it surprising for this to not work. Oops. What we currently have in the tree is broken in that respect, i broke it, including the -q option. Current behaviour is: * SMALL: fully works, but no UTF-8 support * not SMALL: - LC_CTYPE=C on a tty or with -q: does '?', ok - LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok - LC_CTYPE=C neither tty nor -q: does '?', wrong - LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong The following patch fixes the last two cases. It is similar in spirit to what Martijn originally sent, but fixes two issues with his patch: 1) Do not invent a new global variable, use the existing f_nonprint. 2) For valid, but non-printable codepoints, print all bytes of the codepoint's encoding rather than just the first byte. Should i commit this? Yours, Ingo - End forwarded message - Index: utf8.c === RCS file: /cvs/src/bin/ls/utf8.c,v retrieving revision 1.1 diff -u -p -r1.1 utf8.c --- utf8.c 1 Dec 2015 18:36:13 - 1.1 +++ utf8.c 17 Jan 2016 20:13:51 - @@ -21,6 +21,8 @@ #include #include +extern int f_nonprint; + int mbsprint(const char *mbs, int print) { @@ -33,12 +35,16 @@ mbsprint(const char *mbs, int print) if ((len = mbtowc(, mbs, MB_CUR_MAX)) == -1) { (void)mbtowc(NULL, NULL, MB_CUR_MAX); if (print) - putchar('?'); + putchar(f_nonprint ? '?' : *mbs); total_width++; len = 1; } else if ((width = wcwidth(wc)) == -1) { - if (print) - putchar('?'); + if (print) { + if (f_nonprint) + putchar('?'); + else + fwrite(mbs, 1, len, stdout); + } total_width++; } else { if (print)
Re: [patch] ls + utf-8 support
Hi, Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +: > On 2016/01/17 14:29, Ted Unangst wrote: >> Ingo Schwarze wrote: >>> The old ls(1) also weeded out non-printable bytes, in particular >>> control codes. >> The old ls only had this behavior for terminals however. >> Redirecting to a file or pipe would always output the original bytes. > I've used this a few times in the past, for example "ls | hexdump -C" > or .."| vis", to find out what the characters used in some filename are. > I'd find it surprising for this to not work. Oops. What we currently have in the tree is broken in that respect, i broke it, including the -q option. Current behaviour is: * SMALL: fully works, but no UTF-8 support * not SMALL: - LC_CTYPE=C on a tty or with -q: does '?', ok - LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok - LC_CTYPE=C neither tty nor -q: does '?', wrong - LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong The following patch fixes the last two cases. It is similar in spirit to what Martijn originally sent, but fixes two issues with his patch: 1) Do not invent a new global variable, use the existing f_nonprint. 2) For valid, but non-printable codepoints, print all bytes of the codepoint's encoding rather than just the first byte. Should i commit this? Yours, Ingo
Re: [patch] ls + utf-8 support
Hi Martijn, Martijn van Duren wrote on Sun, Jan 17, 2016 at 12:58:38PM +0100: > I've come across a fair amount of malformed file names by all sorts > of causes. Be it malware or just human error. When such a malformed > character is in an inconvenient place and can't be auto-completed > I usually fix this by something like the following: > > $ cd "`ls | tail -1`" > ksh: cd: /home/martijn/Muziek/Motrhead/N?? Sleep at All - No such file > or directory > $ cd "`/usr/src/bin/ls/ls | tail -1`" Why not just $ cd N*Sleep* That seems simpler to me. Sure, if you have more than one file in the same directory showing this problem, and the names are too similar to be independently globbed, and you want to keep them, it's a bit more work: $ i=1; for f in N*Sleep*; do mv $f tmp$i; i=$((i+1)); done I don't see a real reason to use ls(1) in such situations. > My patch maintains the question marks when stdout is a tty, but returns > the original byte otherwise. Afaik the only logical use for the length > is when doing formatted output, which is only when printing to a tty. The rationale for weeding out invalid bytes and non-printable characters is not that we don't know how many display columns the terminal might use to represent them. The rationale is that they might cause the terminal to change state, to interpret them as in-band control codes. > This doesn't solve the case when ls is run over ssh -t and the content > is redirected client-side, but you can't win them all. Indeed. I worry that might result in security violations. The old ls(1) also weeded out non-printable bytes, in particular control codes. Even though this question is only tangentially related to UTF-8: Do we want to change that? Should ls(1) sometimes pass through bytes that might be control codes for some terminals? Imposing the responsibility that non-isatty(3) ls(1) output never ends up on a terminal on the user? I tend to think "no". I wouldn't want my ls(1) to sometimes generate output that isn't safe on a terminal. I doubt i would get into the habit of considering whether or not ls(1) is safe to run in a given situation. I'd rather have it just be safe, always. And deal with the occasional insane file name with different tools. What do people think? Ingo > Index: ls.c > === > RCS file: /cvs/src/bin/ls/ls.c,v > retrieving revision 1.44 > diff -u -p -r1.44 ls.c > --- ls.c 1 Dec 2015 18:36:13 - 1.44 > +++ ls.c 17 Jan 2016 10:57:03 - > @@ -94,6 +94,7 @@ int f_type; /* add type character for > int f_typedir; /* add type character for directories */ > > int rval; > +int istty = 0; > > int > ls_main(int argc, char *argv[]) > @@ -110,6 +111,7 @@ ls_main(int argc, char *argv[]) > > /* Terminal defaults to -Cq, non-terminal defaults to -1. */ > if (isatty(STDOUT_FILENO)) { > + istty = 1; > if ((p = getenv("COLUMNS")) != NULL) > width = strtonum(p, 1, INT_MAX, NULL); > if (width == 0 && > Index: utf8.c > === > RCS file: /cvs/src/bin/ls/utf8.c,v > retrieving revision 1.1 > diff -u -p -r1.1 utf8.c > --- utf8.c1 Dec 2015 18:36:13 - 1.1 > +++ utf8.c17 Jan 2016 10:57:03 - > @@ -21,6 +21,8 @@ > #include > #include > > +extern int istty; > + > int > mbsprint(const char *mbs, int print) > { > @@ -33,12 +35,12 @@ mbsprint(const char *mbs, int print) > if ((len = mbtowc(, mbs, MB_CUR_MAX)) == -1) { > (void)mbtowc(NULL, NULL, MB_CUR_MAX); > if (print) > - putchar('?'); > + putchar(istty ? '?' : *mbs); > total_width++; > len = 1; > } else if ((width = wcwidth(wc)) == -1) { > if (print) > - putchar('?'); > + putchar(istty ? '?' : *mbs); > total_width++; > } else { > if (print)