Re: implement locale(1) charmap argument
On Fri, Apr 17, 2020 at 03:05:06PM +0200, Ingo Schwarze wrote:
> Naively, it does seem like it would make sense to have "locale -m"
> print a list of possible output values of "locale charmap", so i'm
> not opposed to adding "US-ASCII" to it.  But that doesn't appear to
> be how it works elsewhere, at least not everywhere.  I found no
> documentation stating clearly what it is supposed to do, POSIX feels
> murky at best.

Good grief! Well, we can leave good enough alone then, I suppose :)

Thank you for doing such elaborate research.
Re: implement locale(1) charmap argument
Hi Stefan and Todd,

Stefan Sperling wrote on Fri, Apr 17, 2020 at 08:55:29AM +0200:
> On Thu, Apr 16, 2020 at 09:35:18PM +0200, Ingo Schwarze wrote:

>>    $ locale -m
>>    UTF-8
>>    $ locale charmap
>>    UTF-8
>>    $ LC_ALL=C locale charmap
>>    US-ASCII
>>    $ LC_ALL=POSIX locale charmap
>>    US-ASCII

> I am OK with your diff,

Thanks to both of you for checking, i have put it in.

> and noticed a separate issue with -m which
> is exposed by this change:
>
> If US-ASCII is an available charmap, shouldn't locale -m list "US-ASCII"
> in addition to "UTF-8"?

I'm not completely sure what "available charmaps" is supposed to mean
in the POSIX standard.

Testing on an old Debian system, i see this:

   $ locale -m > charmaps.loc
   $ wc -l charmaps.loc
   235
   $ ls /usr/share/i18n/charmaps | sed 's/.gz$//' | sort > charmaps.ls
   $ diff -u charmaps.ls charmaps.loc | grep '^[+-][^+-]'
   +MAC_CENTRALEUROPE
   +NF_Z_62-010_(1973)
   +WIN-SAMI-2
   $ locale charmap
   UTF-8
   $ locale -m | grep UTF
   UTF-8
   $ LC_CTYPE=C locale charmap
   ANSI_X3.4-1968
   $ locale -m | grep 1968
   ANSI_X3.4-1968

So "locale -m" gives almost a directory listing, but not quite;
it produces a few additional entries that aren't in the directory.
The return values from "locale charmap" appear in "locale -m".

Then again, Linux is not a certified UNIX system.
So let's try with something certified:

   > uname -a
   SunOS unstable11s 5.11 11.3 sun4u sparc SUNW,SPARC-Enterprise
   > locale charmap
   UTF-8
   > LC_CTYPE=C locale charmap
   646
   > locale -m | wc
   0 0 0

It's a bit difficult because Solaris 11 does not provide locate(1),
but i failed to find any charmap files there.  Both UTF-8 and US-ASCII
work (i tested that by compiling and running mandoc) but still
"locale -m" returns nothing.

   > uname -a
   SunOS unstable10s 5.10 Generic_150400-17 sun4v sparc SUNW,SPARC-Enterprise-T5220
   > locale charmap
   646
   > LC_CTYPE=en_US.UTF-8 locale charmap
   UTF-8
   > locale -m
   iso_8859_1/charmap.src
   > ls -F /usr/lib/localedef/src/
   charmaps/     en_US.UTF-8/  extensions/   iso_8859_1/   locales/
   > ls -F /usr/lib/localedef/src/charmaps/
   charmap.ANSI1251.bz2     charmap.ISO8859-9.bz2    charmap.iso-8859-5.bz2
   charmap.ISO8859-1.bz2    charmap.KOI8-R.bz2       charmap.iso-8859-6.bz2
   charmap.ISO8859-13.bz2   charmap.UTF-8.bz2        charmap.iso-8859-7.bz2
   charmap.ISO8859-15.bz2   charmap.ansi-1251.bz2    charmap.iso-8859-8.bz2
   charmap.ISO8859-2.bz2    charmap.ar.bz2@          charmap.iso-8859-9.bz2
   charmap.ISO8859-4.bz2    charmap.he.bz2@          charmap.koi8-r.bz2
   charmap.ISO8859-5.bz2    charmap.iso-8859-1.bz2   charmap.utf-8.bz2
   charmap.ISO8859-6.bz2    charmap.iso-8859-13.bz2  charmap.utf8.bz2@
   charmap.ISO8859-7.bz2    charmap.iso-8859-15.bz2
   charmap.ISO8859-8.bz2    charmap.iso-8859-2.bz2

Same vendor, different version, different behaviour.
Again, both UTF-8 and US-ASCII work, and there are several charmap
files, but "locale -m" returns something that is neither a charmap
name nor a filename for any of the locales, nor a list of anything.

Frankly, i doubt the usefulness of "locale -m" in general, and even
more so on OpenBSD: if i understand correctly, it is supposed to be
used to determine valid input for the -f option of the localedef(1)
utility, which we don't even have.

Naively, it does seem like it would make sense to have "locale -m"
print a list of possible output values of "locale charmap", so i'm
not opposed to adding "US-ASCII" to it.  But that doesn't appear to
be how it works elsewhere, at least not everywhere.  I found no
documentation stating clearly what it is supposed to do, POSIX feels
murky at best.

Also, look at this:

   http://man.bsd.lv/FreeBSD-12.0/locale#BUGS
   http://man.bsd.lv/NetBSD-8.1/locale#BUGS

   "BUGS
    Since FreeBSD does not support charmaps in their POSIX meaning,
    locale emulates the -m option using the CODESETs listing of all
    available locales."

That does look somewhat similar to what you are suggesting,
but *they* call it a bug!

Feel free to add "US-ASCII\n" if you like, it does feel as if it
might add some minor clarity, but i hardly expect any real practical
benefit.

Yours,
  Ingo
Re: implement locale(1) charmap argument
On Thu, Apr 16, 2020 at 09:35:18PM +0200, Ingo Schwarze wrote:
>    $ locale -m
>    UTF-8
>    $ locale charmap
>    UTF-8
>    $ LC_ALL=C locale charmap
>    US-ASCII
>    $ LC_ALL=POSIX locale charmap
>    US-ASCII

I am OK with your diff, and noticed a separate issue with -m which
is exposed by this change:

If US-ASCII is an available charmap, shouldn't locale -m list "US-ASCII"
in addition to "UTF-8"?
Re: implement locale(1) charmap argument
Makes sense to me. OK millert@ - todd
implement locale(1) charmap argument
Hi,

our locale(1) implementation is intentionally simplistic and
implements only a subset of this POSIX specification:

   https://pubs.opengroup.org/onlinepubs/9699919799/utilities/locale.html

However, one feature is missing that is actually useful and arguably
also well-placed inside the locale(1) utility.  If you want to know
from within a C program which character encoding is actually being
used (as opposed to which one the user requested), you can use the
nl_langinfo(3) function.  But i'm not aware of a possibility to ask
the same from within a sh(1) program.

POSIX says that "locale charmap" should answer that question.

In the next release of textproc/groff, that feature of locale(1)
will be used in the test suite, and it seems reasonable to do so.

So, here is a very simple patch to support the "charmap" argument.

Testing:

   $ export LC_CTYPE=en_US.UTF-8
   $ locale
   LANG=
   LC_COLLATE="C"
   LC_CTYPE=en_US.UTF-8
   LC_MONETARY="C"
   LC_NUMERIC="C"
   LC_TIME="C"
   LC_MESSAGES="C"
   LC_ALL=
   $ locale -a | wc
   68 68 794
   $ locale -m
   UTF-8
   $ locale charmap
   UTF-8
   $ LC_ALL=C locale charmap
   US-ASCII
   $ LC_ALL=POSIX locale charmap
   US-ASCII
   $ LC_ALL=NonSense locale charmap
   US-ASCII
   $ locale -x
   locale: unknown option -- x
   usage: locale [-a | -m | charmap]
   $ locale nonsense
   usage: locale [-a | -m | charmap]
   $ locale -am
   usage: locale [-a | -m | charmap]
   $ locale -a charmap
   usage: locale [-a | -m | charmap]
   $ locale -m charmap
   usage: locale [-a | -m | charmap]
   $ locale charmap nonsense
   usage: locale [-a | -m | charmap]

OK?
  Ingo

P.S.
It would be trivial to also support the POSIX -k option, as in

   $ locale -k charmap
   charmap="UTF-8"

but that doesn't actually feel useful and i'm not aware of anything
that might want to use it, so KISS and let's proceed one step at a
time.  Supporting "name" arguments other than "charmap" would make
little sense on OpenBSD, nor would the -c option.

Index: locale.1
===================================================================
RCS file: /cvs/src/usr.bin/locale/locale.1,v
retrieving revision 1.7
diff -u -p -r1.7 locale.1
--- locale.1	26 Oct 2016 01:00:27 -0000	1.7
+++ locale.1	16 Apr 2020 19:04:25 -0000
@@ -1,6 +1,6 @@
 .\" $OpenBSD: locale.1,v 1.7 2016/10/26 01:00:27 schwarze Exp $
 .\"
-.\" Copyright 2016 Ingo Schwarze
+.\" Copyright 2016, 2020 Ingo Schwarze
 .\" Copyright 2013 Stefan Sperling
 .\"
 .\" Permission to use, copy, modify, and distribute this software for any
@@ -23,7 +23,7 @@
 .Nd character encoding and localization conventions
 .Sh SYNOPSIS
 .Nm locale
-.Op Fl a | Fl m
+.Op Fl a | Fl m | Cm charmap
 .Sh DESCRIPTION
 If the
 .Nm
@@ -31,7 +31,7 @@ utility is invoked without any arguments
 configuration is shown.
 .Pp
 The options are as follows:
-.Bl -tag -width Ds
+.Bl -tag -width charmap
 .It Fl a
 Display a list of supported locales.
 .It Fl m
@@ -39,6 +39,11 @@ Display a list of supported character en
 On
 .Ox ,
 this always returns UTF-8 only.
+.It Cm charmap
+Display the currently selected character encoding.
+On
+.Ox ,
+this returns either US-ASCII or UTF-8.
 .El
 .Pp
 A locale is a set of environment variables telling programs which
Index: locale.c
===================================================================
RCS file: /cvs/src/usr.bin/locale/locale.c,v
retrieving revision 1.12
diff -u -p -r1.12 locale.c
--- locale.c	5 Feb 2016 12:59:12 -0000	1.12
+++ locale.c	16 Apr 2020 19:04:25 -0000
@@ -16,6 +16,7 @@
  */
 
 #include
+#include <langinfo.h>
 #include
 #include
 #include
@@ -169,7 +170,7 @@ show_locales(void)
 static void
 usage(void)
 {
-	fprintf(stderr, "usage: %s [-a | -m]\n", __progname);
+	fprintf(stderr, "usage: %s [-a | -m | charmap]\n", __progname);
 	exit(1);
 }
 
@@ -203,12 +204,16 @@ main(int argc, char *argv[])
 	argc -= optind;
 	argv += optind;
 
-	if (argc != 0 || (aflag && mflag))
+	if (aflag + mflag + argc > 1)
 		usage();
 	else if (aflag)
 		show_locales();
 	else if (mflag)
 		printf("UTF-8\n");
+	else if (strcmp(*argv, "charmap") == 0)
+		printf("%s\n", nl_langinfo(CODESET));
+	else
+		usage();
 
 	return 0;
 }