On Thu, May 30, 2013 at 3:35 PM, Stefan Sperling <s...@openbsd.org> wrote:
> I've received several requests for adding new locale names,
> both on this list, and off-list, from several people.
>
> I've been trying to find a way to keep /usr/share/locale reasonably
> clean while also allowing people to use their preferred locale names.
>
> Currently, the list of supported locale names is represented by the
> list of directories in /usr/share/locale. I don't think we should
> continue to maintain a list of <language>_<country>.<encoding> names
> because such a list cannot be maintained properly.
>
> Some requests that have been made are non-functional changes.
> E.g. adding a <country> doesn't have a functional effect on OpenBSD.
> Still, some users would like to use names containing
> <theirlanguage>_<theircountry>, for whatever reason.
>
> There have also been requests for supporting locale names such
> as "C.UTF-8". I'm not sure what the use case is but as a side-effect
> of the proposal below such names would also be possible.

Adding C.UTF-8 and POSIX.UTF-8 to their respective standards has a
greater chance of acceptance than the almost functionally equivalent
en_US.UTF-8. If I define the first two in terms of en_US.UTF-8 as
implemented in the various BSDs and glibc, the only difference is
sorting order.

>
> POSIX doesn't specify how files in /usr/share/locale are stored.
> bluhm@ suggested to change the filesystem layout such that encoding
> and language are separated. libc will look up locale definition data at
> specific places depending on which of the LC_* categories is being set.
>
> LC_CTYPE support code needs to look at the character encoding only.
> It only cares about the encoding part of the locale name, which by
> convention is the substring after the last dot in the locale name.

Actually, the library routines make several assumptions about the file
layout, and your patch doesn't modify all of them.

>
> The suggested new layout looks like this:
>
>   /usr/share/locale/UTF-8/LC_CTYPE
>   /usr/share/locale/CP1251/LC_CTYPE
>   /usr/share/locale/ISO8859-1/LC_CTYPE
>   /usr/share/locale/ISO8859-15/LC_CTYPE
>   /usr/share/locale/ISO8859-2/LC_CTYPE
>   /usr/share/locale/ISO8859-7/LC_CTYPE
>   /usr/share/locale/ARMSCII-8/LC_CTYPE
>   /usr/share/locale/ISO8859-4/LC_CTYPE
>   /usr/share/locale/ISO8859-13/LC_CTYPE
>   /usr/share/locale/CP866/LC_CTYPE
>   /usr/share/locale/KOI8-R/LC_CTYPE
>   /usr/share/locale/ISO8859-5/LC_CTYPE
>   /usr/share/locale/KOI8-U/LC_CTYPE
>
> All other files and directories currently in /usr/share/locale
> can be removed.
>
> If we later add support for language- or country-specific features
> such as LC_COLLATE we can add directories for every language the
> collation code supports:
>
>   /usr/share/locale/en/LC_COLLATE
>   /usr/share/locale/es/LC_COLLATE
>   /usr/share/locale/de/LC_COLLATE
>
> Or even add country names, if necessary and supported by the
> hypothetical collation code:
>   /usr/share/locale/it_IT/LC_COLLATE
>   /usr/share/locale/it_CH/LC_COLLATE

I did some research a few months ago on the optimal layout for
collation, transliteration, and date/misc formatting:

Citrus-based systems don't seem to use CLDR, which covers quite a lot
and indexes certain data by <language>_<country> because that data is
not transferable between different countries that share the same
language. Since OpenBSD libc doesn't cover such localization, and it's
extremely unlikely that it will in the near future, it's safe to say
that the first hierarchy is an OK investment.

>
> Does anyone see problems with this plan?

As I mentioned, some code still expects the prior layout, which is
confusing.

In src/lib/libc/locale/setlocale.c, load_locale_sub():

        len = snprintf(name, sizeof(name), "%s/%s/%s",
                       _PATH_LOCALE, locname, categories[category]);
        if (len < 0 || len >= sizeof(name))
                return -1;

In src/lib/libc/locale/setrunelocale.c, _xpg4_setrunelocale():

        len = snprintf(path, sizeof(path),
            "%s/%s/LC_CTYPE", _PATH_LOCALE, encoding);
        if (len < 0 || len >= sizeof(path))
                return ENAMETOOLONG;
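
To make the mismatch concrete, here is a minimal sketch of the two
lookups side by side. The helper names path_by_locname() and
path_by_encoding() are hypothetical, and PATH_LOCALE is hard-coded on
the assumption that _PATH_LOCALE is /usr/share/locale. Neither helper
is actual libc code; they only mirror the snprintf() calls quoted above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: stands in for _PATH_LOCALE; hard-coded for illustration. */
#define PATH_LOCALE "/usr/share/locale"

/*
 * Hypothetical helper: builds the path the way load_locale_sub() does,
 * using the full locale name as the directory.
 */
static int
path_by_locname(char *buf, size_t size, const char *locname,
    const char *category)
{
	int len;

	assert(buf != NULL && locname != NULL && category != NULL);
	len = snprintf(buf, size, "%s/%s/%s", PATH_LOCALE, locname, category);
	return (len < 0 || (size_t)len >= size) ? -1 : 0;
}

/*
 * Hypothetical helper: builds the path the way the patched
 * _xpg4_setrunelocale() does, using only the part after the last dot.
 */
static int
path_by_encoding(char *buf, size_t size, const char *locname)
{
	const char *dot;
	int len;

	assert(buf != NULL && locname != NULL);
	dot = strrchr(locname, '.');
	if (dot == NULL)
		return -1;	/* no encoding part; the patch falls back to ASCII */
	len = snprintf(buf, size, "%s/%s/LC_CTYPE", PATH_LOCALE, dot + 1);
	return (len < 0 || (size_t)len >= size) ? -1 : 0;
}
```

For "en_US.UTF-8", the first helper produces
/usr/share/locale/en_US.UTF-8/LC_CTYPE, a directory the proposed layout
would no longer ship, while the second produces
/usr/share/locale/UTF-8/LC_CTYPE.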


>
> Index: share/locale/ctype/Makefile
> ===================================================================
> RCS file: /cvs/src/share/locale/ctype/Makefile,v
> retrieving revision 1.6
> diff -u -p -r1.6 Makefile

[...]

> Index: lib/libc/locale/setrunelocale.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/locale/setrunelocale.c,v
> retrieving revision 1.9
> diff -u -p -r1.9 setrunelocale.c
> --- lib/libc/locale/setrunelocale.c     30 May 2013 18:35:55 -0000      1.9
> +++ lib/libc/locale/setrunelocale.c     30 May 2013 19:23:16 -0000
> @@ -171,17 +171,27 @@ found:
>  }
>
>  int
> -_xpg4_setrunelocale(const char *encoding)
> +_xpg4_setrunelocale(const char *locname)
>  {
>         char path[PATH_MAX];
>         _RuneLocale *rl;
>         int error, len;
> +       const char *dot, *encoding;
>
> -       if (!strcmp(encoding, "C") || !strcmp(encoding, "POSIX")) {
> +       if (!strcmp(locname, "C") || !strcmp(locname, "POSIX")) {
>                 rl = &_DefaultRuneLocale;
>                 goto found;
>         }
>
> +       /* Assume "<whatever>.<encoding>" locale name. */

There should be some notion of syntax for <language>_<country>.<encoding>
names, even if it is only mentioned in comments.

E.g.,

ISO 3166-1 for country codes and BCP 47 for language tags.

glibc did not do this, and as a direct result its structure is a mess
to navigate.
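
As a rough illustration of what such a syntax could look like in code,
here is a sketch of a strict parser. Everything in it is an assumption
on my part rather than something from the patch: the struct and function
names are hypothetical, the language is limited to 2 or 3 lowercase
letters (a simplification of BCP 47), the country is an ISO 3166-1
alpha-2 code, and the encoding buffer size is arbitrary.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the parsed parts; empty string = absent. */
struct locname_parts {
	char lang[4];		/* 2 or 3 lowercase letters, e.g. "it" */
	char country[3];	/* 2 uppercase letters, e.g. "CH" */
	char encoding[32];	/* e.g. "UTF-8"; size chosen arbitrarily */
};

/*
 * Hypothetical parser for "<language>[_<country>][.<encoding>]".
 * Returns 0 on success, -1 if the name does not match the syntax.
 */
static int
parse_locname(const char *name, struct locname_parts *out)
{
	size_t i, n;

	assert(name != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	/* Language: 2 or 3 lowercase ASCII letters. */
	n = 0;
	while (name[n] >= 'a' && name[n] <= 'z')
		n++;
	if (n < 2 || n > 3)
		return -1;
	memcpy(out->lang, name, n);
	name += n;

	/* Optional country: '_' followed by exactly 2 uppercase letters. */
	if (*name == '_') {
		name++;
		for (i = 0; i < 2; i++)
			if (name[i] < 'A' || name[i] > 'Z')
				return -1;
		memcpy(out->country, name, 2);
		name += 2;
	}

	/* Optional encoding: '.' plus a non-empty name without '/'. */
	if (*name == '.') {
		name++;
		n = strlen(name);
		if (n == 0 || n >= sizeof(out->encoding) ||
		    strchr(name, '/') != NULL)
			return -1;
		memcpy(out->encoding, name, n);
		name += n;
	}

	/* Anything left over means the name did not match. */
	return *name == '\0' ? 0 : -1;
}
```

Note that "C", "POSIX" and "C.UTF-8" do not match this grammar and would
still need to be special-cased before parsing, as the patch already does
for the first two.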

> +       dot = strrchr(locname, '.');
> +       if (dot == NULL) {
> +               /* No encoding specified. Fall back to ASCII. */
> +               rl = &_DefaultRuneLocale;
> +               goto found;
> +       }
> +
> +       encoding = dot + 1;
>         len = snprintf(path, sizeof(path),
>             "%s/%s/LC_CTYPE", _PATH_LOCALE, encoding);
>         if (len < 0 || len >= sizeof(path))
>
