Re: implement locale(1) charmap argument

2020-04-17 Thread Stefan Sperling
On Fri, Apr 17, 2020 at 03:05:06PM +0200, Ingo Schwarze wrote:
> Naively, it does seem like it would make sense to have "locale -m"
> print a list of possible output values of "locale charmap", so i'm
> not opposed to adding "US-ASCII" to it.  But that doesn't appear to
> be how it works elsewhere, at least not everywhere.  I found no
> documentation stating clearly what it is supposed to do, POSIX feels
> murky at best.

Good grief! Well, we can leave good enough alone then, I suppose :)

Thank you for doing such elaborate research.



Re: implement locale(1) charmap argument

2020-04-17 Thread Ingo Schwarze
Hi Stefan and Todd,

Stefan Sperling wrote on Fri, Apr 17, 2020 at 08:55:29AM +0200:
> On Thu, Apr 16, 2020 at 09:35:18PM +0200, Ingo Schwarze wrote:

>>$ locale -m
>>   UTF-8
>>$ locale charmap
>>   UTF-8
>>$ LC_ALL=C locale charmap
>>   US-ASCII
>>$ LC_ALL=POSIX locale charmap
>>   US-ASCII

> I am OK with your diff,

Thanks to both of you for checking, i have put it in.

> and noticed a separate issue with -m which
> is exposed by this change:
> 
> If US-ASCII is an available charmap, shouldn't locale -m list "US-ASCII"
> in addition to "UTF-8"?

I'm not completely sure what "available charmaps" is supposed to mean
in the POSIX standard.

Testing on an old Debian system, i see this:

   $ locale -m > charmaps.loc
   $ wc -l charmaps.loc
  235
   $ ls /usr/share/i18n/charmaps | sed 's/.gz$//' | sort > charmaps.ls 
   $ diff -u charmaps.ls charmaps.loc | grep '^[+-][^+-]'
  +MAC_CENTRALEUROPE
  +NF_Z_62-010_(1973)
  +WIN-SAMI-2
   $ locale charmap
  UTF-8
   $ locale -m | grep UTF
  UTF-8
   $ LC_CTYPE=C locale charmap
  ANSI_X3.4-1968
   $ locale -m | grep 1968
  ANSI_X3.4-1968

So "locale -m" gives almost a directory listing, but not quite;
it produces a few additional entries that aren't in the directory.
The return values from "locale charmap" appear in "locale -m".
Then again, Linux is not a certified UNIX system.  So let's try
with something certified:

   > uname -a
  SunOS unstable11s 5.11 11.3 sun4u sparc SUNW,SPARC-Enterprise
   > locale charmap
  UTF-8
   > LC_CTYPE=C locale charmap
  646
   > locale -m | wc
   0   0   0

It's a bit difficult because Solaris 11 does not provide locate(1),
but i failed to find any charmap files there.  Both UTF-8 and
US-ASCII work (i tested that by compiling and running mandoc)
but still "locale -m" returns nothing.

   > uname -a
  SunOS unstable10s 5.10 Generic_150400-17 sun4v sparc SUNW,SPARC-Enterprise-T5220
   > locale charmap
  646
   > LC_CTYPE=en_US.UTF-8 locale charmap
  UTF-8
   > locale -m
  iso_8859_1/charmap.src
   > ls -F /usr/lib/localedef/src/
  charmaps/ en_US.UTF-8/  extensions/   iso_8859_1/   locales/
   > ls -F /usr/lib/localedef/src/charmaps/
  charmap.ANSI1251.bz2    charmap.ISO8859-9.bz2     charmap.iso-8859-5.bz2
  charmap.ISO8859-1.bz2   charmap.KOI8-R.bz2        charmap.iso-8859-6.bz2
  charmap.ISO8859-13.bz2  charmap.UTF-8.bz2         charmap.iso-8859-7.bz2
  charmap.ISO8859-15.bz2  charmap.ansi-1251.bz2     charmap.iso-8859-8.bz2
  charmap.ISO8859-2.bz2   charmap.ar.bz2@           charmap.iso-8859-9.bz2
  charmap.ISO8859-4.bz2   charmap.he.bz2@           charmap.koi8-r.bz2
  charmap.ISO8859-5.bz2   charmap.iso-8859-1.bz2    charmap.utf-8.bz2
  charmap.ISO8859-6.bz2   charmap.iso-8859-13.bz2   charmap.utf8.bz2@
  charmap.ISO8859-7.bz2   charmap.iso-8859-15.bz2
  charmap.ISO8859-8.bz2   charmap.iso-8859-2.bz2

Same vendor, different version, different behaviour.  Again, both UTF-8
and US-ASCII work, and there are several charmap files, but "locale -m"
returns something that is neither a charmap name nor a filename for any
of the locales, nor a list of anything.
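
The transcripts in this thread show the same 7-bit encoding reported
under three names: US-ASCII on OpenBSD, ANSI_X3.4-1968 on Debian, and
646 on Solaris.  A script comparing "locale charmap" output across
systems would therefore need to normalize the name first.  A minimal
sketch; the function name normalize_charmap is made up for illustration,
and the alias list covers only the names seen in this thread:

```shell
# normalize_charmap NAME
# Map the charmap names observed in this thread to one spelling.
# The function name is hypothetical, and the alias list is limited
# to the names that appear in the transcripts of this thread.
normalize_charmap() {
	case "$1" in
	US-ASCII|ANSI_X3.4-1968|646)
		echo US-ASCII ;;
	UTF-8)
		echo UTF-8 ;;
	*)
		# Unknown names are passed through unchanged.
		echo "$1" ;;
	esac
}
```

It could be called as: normalize_charmap "$(locale charmap)"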

Frankly, i doubt the usefulness of "locale -m" in general, and even more
so on OpenBSD: if i understand correctly, it is supposed to be used to
determine valid input for the -f option of the localedef(1) utility,
which we don't even have.

Naively, it does seem like it would make sense to have "locale -m"
print a list of possible output values of "locale charmap", so i'm
not opposed to adding "US-ASCII" to it.  But that doesn't appear to
be how it works elsewhere, at least not everywhere.  I found no
documentation stating clearly what it is supposed to do, POSIX feels
murky at best.
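
The naive reading, that every value "locale charmap" can print should
also appear in "locale -m", can be checked mechanically.  A sketch that
assumes nothing beyond POSIX sh and grep; the function name
charmap_listed is invented here:

```shell
# charmap_listed NAME
# Read a list of charmap names, one per line, from standard input
# and succeed if NAME is among them (exact, whole-line match).
# The function name is hypothetical.
charmap_listed() {
	grep -Fqx -- "$1"
}
```

Usage would look like this:

   locale -m | charmap_listed "$(locale charmap)" || echo "not listed"

On a system where "locale -m" prints only UTF-8 and "LC_ALL=C locale
charmap" prints US-ASCII, as in the OpenBSD transcript quoted at the
top of this message, running that check under LC_ALL=C would report
"not listed".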

Also, look at this:

  http://man.bsd.lv/FreeBSD-12.0/locale#BUGS
  http://man.bsd.lv/NetBSD-8.1/locale#BUGS

  "BUGS
   Since FreeBSD does not support charmaps in their POSIX meaning,
   locale emulates the -m option using the CODESETs listing of all
   available locales."

That does look somewhat similar to what you are suggesting,
but *they* call it a bug!

Feel free to add "US-ASCII\n" if you like, it does feel as if it
might add some minor clarity, but i hardly expect any real practical
benefit.

Yours,
  Ingo



Re: implement locale(1) charmap argument

2020-04-17 Thread Stefan Sperling
On Thu, Apr 16, 2020 at 09:35:18PM +0200, Ingo Schwarze wrote:
>$ locale -m
>   UTF-8
>$ locale charmap
>   UTF-8
>$ LC_ALL=C locale charmap
>   US-ASCII
>$ LC_ALL=POSIX locale charmap
>   US-ASCII

I am OK with your diff, and noticed a separate issue with -m which
is exposed by this change:

If US-ASCII is an available charmap, shouldn't locale -m list "US-ASCII"
in addition to "UTF-8"?



Re: implement locale(1) charmap argument

2020-04-16 Thread Todd C . Miller
Makes sense to me.  OK millert@

 - todd



implement locale(1) charmap argument

2020-04-16 Thread Ingo Schwarze
Hi,

our locale(1) implementation is intentionally simplistic
and implements only a subset of this POSIX specification:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/locale.html

However, one feature is missing that is actually useful and arguably
also well-placed inside the locale(1) utility.  If you want to know
from within a C program which character encoding is actually being
used (as opposed to which one the user requested), you can use the
nl_langinfo(3) function.  But i'm not aware of a possibility to ask
the same from within a sh(1) program.

POSIX says that "locale charmap" should answer that question.
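
As a sketch of how a shell script might use this, assuming locale(1)
accepts the "charmap" argument as POSIX describes; the helper name
charmap_is_utf8 and the messages are hypothetical:

```shell
# charmap_is_utf8 [NAME]
# Succeed if NAME is UTF-8.  If NAME is omitted, ask locale(1) which
# encoding is actually in effect.  The helper name is hypothetical.
charmap_is_utf8() {
	if [ "$#" -gt 0 ]; then
		cm=$1
	else
		# An empty result (no "charmap" support) counts as not UTF-8.
		cm=$(locale charmap 2>/dev/null) || cm=
	fi
	[ "$cm" = "UTF-8" ]
}

if charmap_is_utf8; then
	echo "using UTF-8"
else
	echo "falling back to ASCII"
fi
```

The optional argument exists only so that the comparison can be
exercised without changing the environment.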

In the next release of textproc/groff, that feature of locale(1)
will be used in the test suite, and it seems reasonable to do so.

So, here is a very simple patch to support the "charmap" argument.

Testing:

   $ export LC_CTYPE=en_US.UTF-8
   $ locale
  LANG=
  LC_COLLATE="C"
  LC_CTYPE=en_US.UTF-8
  LC_MONETARY="C"
  LC_NUMERIC="C"
  LC_TIME="C"
  LC_MESSAGES="C"
  LC_ALL=

   $ locale -a | wc
  68  68 794
   $ locale -m
  UTF-8
   $ locale charmap
  UTF-8
   $ LC_ALL=C locale charmap
  US-ASCII
   $ LC_ALL=POSIX locale charmap
  US-ASCII

   $ LC_ALL=NonSense locale charmap
  US-ASCII
   $ locale -x
  locale: unknown option -- x
  usage: locale [-a | -m | charmap]
   $ locale nonsense
  usage: locale [-a | -m | charmap]
   $ locale -am 
  usage: locale [-a | -m | charmap]
   $ locale -a charmap
  usage: locale [-a | -m | charmap]
   $ locale -m charmap
  usage: locale [-a | -m | charmap]
   $ locale charmap nonsense
  usage: locale [-a | -m | charmap]

OK?
  Ingo


P.S.
It would be trivial to also support the POSIX -k option, as in
   $ locale -k charmap
  charmap="UTF-8"
but that doesn't actually feel useful and i'm not aware of anything
that might want to use it, so KISS and let's proceed one step at a time.
Supporting "name" arguments other than "charmap" would make little
sense on OpenBSD, nor would the -c option.
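
If a script did want the -k output format in the meantime, it could
build the line itself.  A sketch that only reproduces the
keyword="value" format shown above; the function name keyword_line is
hypothetical:

```shell
# keyword_line KEYWORD VALUE
# Print one line in the KEYWORD="VALUE" format shown for -k above.
# The function name is hypothetical.
keyword_line() {
	printf '%s="%s"\n' "$1" "$2"
}
```

For example: keyword_line charmap "$(locale charmap)"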


Index: locale.1
===
RCS file: /cvs/src/usr.bin/locale/locale.1,v
retrieving revision 1.7
diff -u -p -r1.7 locale.1
--- locale.1    26 Oct 2016 01:00:27 -  1.7
+++ locale.1    16 Apr 2020 19:04:25 -
@@ -1,6 +1,6 @@
 .\" $OpenBSD: locale.1,v 1.7 2016/10/26 01:00:27 schwarze Exp $
 .\"
-.\" Copyright 2016 Ingo Schwarze 
+.\" Copyright 2016, 2020 Ingo Schwarze 
 .\" Copyright 2013 Stefan Sperling 
 .\"
 .\" Permission to use, copy, modify, and distribute this software for any
@@ -23,7 +23,7 @@
 .Nd character encoding and localization conventions
 .Sh SYNOPSIS
 .Nm locale
-.Op Fl a | Fl m
+.Op Fl a | Fl m | Cm charmap
 .Sh DESCRIPTION
 If the
 .Nm
@@ -31,7 +31,7 @@ utility is invoked without any arguments
 configuration is shown.
 .Pp
 The options are as follows:
-.Bl -tag -width Ds
+.Bl -tag -width charmap
 .It Fl a
 Display a list of supported locales.
 .It Fl m
@@ -39,6 +39,11 @@ Display a list of supported character en
 On
 .Ox ,
 this always returns UTF-8 only.
+.It Cm charmap
+Display the currently selected character encoding.
+On
+.Ox ,
+this returns either US-ASCII or UTF-8.
 .El
 .Pp
 A locale is a set of environment variables telling programs which
Index: locale.c
===
RCS file: /cvs/src/usr.bin/locale/locale.c,v
retrieving revision 1.12
diff -u -p -r1.12 locale.c
--- locale.c    5 Feb 2016 12:59:12 -   1.12
+++ locale.c    16 Apr 2020 19:04:25 -
@@ -16,6 +16,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -169,7 +170,7 @@ show_locales(void)
 static void
 usage(void)
 {
-   fprintf(stderr, "usage: %s [-a | -m]\n", __progname);
+   fprintf(stderr, "usage: %s [-a | -m | charmap]\n", __progname);
    exit(1);
 }
 
@@ -203,12 +204,16 @@ main(int argc, char *argv[])
    argc -= optind;
    argv += optind;
 
-   if (argc != 0 || (aflag && mflag))
+   if (aflag + mflag + argc > 1)
        usage();
    else if (aflag)
        show_locales();
    else if (mflag)
        printf("UTF-8\n");
+   else if (strcmp(*argv, "charmap") == 0)
+       printf("%s\n", nl_langinfo(CODESET));
+   else
+       usage();
 
    return 0;
 }