from:"Bruno Haible via austin-group-l at The Open Group"

Re: POSIX msgfmt and universal-character-name escape sequences

2022-06-28 Thread Bruno Haible via austin-group-l at The Open Group

Geoff Clare wrote:
> In today's teleconference we discussed this and formulated the following
> response...
> 
> If a C17 source file contains calls to gettext family functions
> that pass string literals containing \u sequences, xgettext will
> write those string literals to the .po file. It would be a useful
> future enhancement to msgfmt if it could support these sequences.
> We don't want POSIX to forbid this enhancement, as it is possible
> it will be requested by users during the lifetime of the next
> POSIX revision.

OK, it's not as bad as I thought. Since without support for the
universal-character-name escape sequences, a "\u" in a dot-po file
is invalid, there will not be two different valid interpretations of
the same input (like for ISO C trigraphs). Implementations can either
accept or reject a dot-po file that contains "\u".

GNU msgfmt currently gives an error "invalid control sequence" when
it encounters "\u"; that is sufficient for the moment.

Bruno

Re: POSIX xgettext and the initial domain directive

2022-06-28 Thread Bruno Haible via austin-group-l at The Open Group

Geoff Clare wrote:
> we struck out part about -d so that it reads:
> 
> The first directive in each created dot-po file shall be a domain
> directive giving the associated domain name, except that this
> directive is optional in the default output file.
> 
> This allows both the Solaris and GNU behaviours.

Perfect. This resolves the issue, better than my suggestion did. Thank you!

Bruno

Re: POSIX xgettext and the -s option

2022-06-28 Thread Bruno Haible via austin-group-l at The Open Group

Geoff Clare wrote:
> > Suggestion: Remove the '-s' option from the standard.
> 
> In today's teleconference we struck out the text relating to -s and
> added to RATIONALE explaining why it is being omitted.

Thank you!

Bruno

Re: POSIX gettext(): lifetime of returned values

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

Geoff Clare wrote:
> We believe that all of your comments have now been addressed.  ...  Once
> you have reviewed this last change, we plan to clean up the document

Thanks for the prompt. I have reviewed the specifications of msgfmt and
xgettext, and sent 7 comments about them.

Bruno

POSIX xgettext: -K option description

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 1202..1211

In line 1164, the argument to the -K option is called 'pattern'.

Issue: In lines 1202..1211 it is called 'keyword'.

Suggestion: Use the same term 'pattern' here as well, instead of 'keyword'.

Rationale: In the 1st, 3rd, and 4th case, it is a misnomer to call the
argument a "keyword".

POSIX xgettext example

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 1293

Issue: The list of -K options is incomplete, as they don't handle the
dgettext_l, dcgettext_l, dngettext_l, dcngettext_l function invocations.

Suggestion: Add these options:
-K gettext_l:1 -K dgettext_l:2 -K dcgettext_l:2 -K ngettext_l:1,2
-K dngettext_l:2,3 -K dcngettext_l:2,3

POSIX xgettext and the -s option

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 1164, 1166, 1187, 1221-1222

Issue: The option '-s' has been found to be counter-productive in practice,
and therefore has been deprecated in GNU gettext.
See https://savannah.gnu.org/bugs/?61249 .

Suggestion: Remove the '-s' option from the standard.

POSIX xgettext and the initial domain directive

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 1183

"The first directive in each created dot-po file shall be a domain directive
giving the associated domain name"

GNU gettext currently does not do this. Solaris gettext does it.
The msgfmt program allows the initial domain directive to be absent
(see lines 996-998).

What is the added value of this directive, since for msgfmt it is optional?
In 99% of the cases, xgettext is used as part of a build system for
a single domain. The author of that build system knows the domain perfectly
well. There is no need to additionally store it in the dot-po file.

Issue: Since this directive was not documented in
https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html
many PO file consumers will choke on this directive, once GNU xgettext
implements the POSIX specification.

Suggestion: Declare that it is implementation-dependent whether xgettext
writes out a domain directive, when the output contains only entries for a
single domain.

POSIX msgfmt and newlines in strings

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 1067

"Unlike shell command language strings, double-quoted strings in dot-po files
cannot contain a literal <newline> character."

Issue: This sentence should be part of the specification of the dot-po file 
format.

Suggestion: Move this sentence from the APPLICATION USAGE section to the
EXTENDED DESCRIPTION section.

POSIX msgfmt and escape sequences in msgid and msgid_plural strings

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 1031

"C-language escape sequences in message strings shall be processed as
specified for character string literals in the ISO C standard ..."

Issue: The way this is written, it is not possible to write, in a dot-po file:
msgid "Program terminated.\n"

Suggestion:
This sentence should be extended to hold for *all* string literals in a
dot-po file. So that C escape sequences can be used in particular
in message_identifier (line 901) and untranslated_string_plural (line 902).

Both GNU msgfmt and Solaris msgfmt do it like this.
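
To make the suggestion concrete, here is a minimal sketch of how escape sequences in a dot-po string literal such as the msgid above could be decoded. The function name po_unescape and its return convention are hypothetical, and only a small subset of the ISO C escapes is handled. Rejecting "\u" mirrors the GNU msgfmt behaviour reported elsewhere in these threads; it is not a statement about what the standard requires.

```c
#include <stddef.h>

/* Hypothetical helper: decode a subset of C escape sequences found in a
   dot-po string literal (the text between the double quotes).
   Returns the number of bytes stored (excluding the terminating NUL),
   or -1 if the escape is not handled here or the buffer is too small. */
long po_unescape(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    while (*in != '\0') {
        char c = *in++;
        if (c == '\\') {
            switch (*in++) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case 'r':  c = '\r'; break;
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            default:
                /* Covers "\u", "\U", a trailing backslash, and every
                   escape this sketch leaves out (octal, hex, ...). */
                return -1;
            }
        }
        if (n + 1 >= outsize)
            return -1;          /* no room for this byte plus the NUL */
        out[n++] = c;
    }
    if (outsize == 0)
        return -1;
    out[n] = '\0';
    return (long)n;
}
```

With this sketch, the literal "Program terminated.\n" decodes to 20 bytes ending in a newline, and a literal containing "\u20AC" is rejected.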

POSIX msgfmt and universal-character-name escape sequences

2022-06-23 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 1031

"except that universal-character-name escape sequences need not be supported."

Neither GNU msgfmt nor Solaris msgfmt treat universal-character-name
escape sequences specially. If an msgstr contains e.g. "\\u20AC", the
resulting string in the .mo file is
{ '\\', 'u', '2', '0', 'A', 'C', '\0' }.

Issue: Leaving it undefined whether \u escape sequences are recognized can
lead to mutual incompatibility of msgfmt implementations: Implementations
would differ in their interpretation of the dot-po file.

There is no good reason for leaving it undefined: There is already a
mechanism for specifying an encoding (charset=... in the header), and the
UTF-8 encoding has been in widespread use for more than 10 years.

Suggestion: Change
"except that universal-character-name escape sequences need not be supported."
to
"except that universal-character-name escape sequences are not supported."

Re: POSIX gettext(): lifetime of returned values

2022-06-22 Thread Bruno Haible via austin-group-l at The Open Group

Geoff Clare wrote:
> > I hope this explains it: how gettext() can be implemented in a reasonable
> > way, without limiting the use of uselocale().
> 
> In today's teleconference we changed the etherpad text to require that
> uselocale() does not invalidate the returned string.

That's great! Thank you.

Bruno

Re: POSIX gettext(): changes to the .mo file

2022-05-26 Thread Bruno Haible via austin-group-l at The Open Group

Robert Elz wrote:
> I would also guess that a side effect of the way it was described
> is that changes to the on disc backing store (the .mo file, or
> whatever) will not be detected while the application remains
> running, and that aside from execing itself to restart clean
> there is no way for an application designed to run forever
> to ever see updated data.

It is true that for some use-cases, it would be beneficial if the
gettext() implementation would detect changes to the .mo file and
even provide a notification system about which translations have
changed. However, this has not been implemented in GNU gettext in
more than 25 years. Therefore, IMO, this is not something that POSIX
needs to address.

Bruno

Re: POSIX gettext(): lifetime of returned values

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group

Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 357
> > ...
> > If temporarily switching a thread's locale through uselocale()
> > invalidates the gettext functions' results (even if only those from
> > the same thread), it effectively disallows uselocale() as a helper
> > function.
> 
> This was discussed in today's call, but we did not reach a conclusion.
> 
> Can you explain how glibc manages not to invalidate strings returned
> by gettext() when uselocale() is used to change the locale (without
> leaking memory - or does it leak memory?), in particular if codeset
> translation was needed.

First, let me clarify the term "memory leak". It means [1] that a piece
of memory is allocated and held for the rest of the runtime of the process.

IMO, it's useful to distinguish bounded and unbounded memory leaks:
  - A _bounded_ memory leak is one where the amount of leaked memory is
    bounded by an a-priori computable constant.
  - An _unbounded_ memory leak is one where such a bound does not exist.

Bounded memory leaks are noticeable when a program is run with memory
instrumentation, but do not make the program crash (assuming the bound
is smaller than the machine's available memory size).

Whereas unbounded memory leaks increase the memory size of the process,
typically linearly over time, and in the end make the process crash.

Bounded memory leaks already exist in a number of places in POSIX:
  - Most statically allocated caches are bounded memory leaks.
  - An application that calls setenv() a fixed number of times has a
    bounded memory leak.
  - An application that calls dlopen() a fixed number of times has a
    bounded memory leak.
  - An application that creates a fixed number of background threads
    (= threads which persist until exit()) has a bounded memory leak,
    because each thread consumes memory.

The gettext() implementation in glibc, when used with a fixed number of
domains, is a *bounded* memory leak. It's bounded, because there are only
a certain number of message catalogs (.mo files) that can be loaded, and
only a fixed number of possible locale encodings (UTF-8, ISO-8859-1, etc.).

Now to your question:

> Can you explain how glibc manages not to invalidate strings returned
> by gettext() when uselocale() is used to change the locale (without
> leaking memory - or does it leak memory?), in particular if codeset
> translation was needed.

Glibc uses a cache of loaded message objects:

  GettextCache = Map ( .mo file name --> LoadedMoFile )

  LoadedMoFile = {
    contents of .mo file;
    MapOfConvertedContents;
    other data
  }

  MapOfConvertedContents = Map ( encoding --> ConvertedContents )

  ConvertedContents = {
    iconv_t iconv_descriptor;
    hash table/map ( msgid --> converted msgstr );
  }

Since this GettextCache does not have thread-dependent elements,

  * Lookups made in one thread speed up also the lookups in other threads.
    (This is important for speed in multi-threaded applications.)

  * The result of gettext() in one thread can be used in other threads,
    with indefinite lifetime.

    For example, in a GUI application with a "main" thread and an
    event-handling thread, the main thread can prepare GUI elements
    with strings returned from gettext() (without copying them through
    strdup()), and the event-handling thread can access these GUI
    elements at any time.

  * uselocale() has no effect on the GettextCache. This has several
    consequences:

    + When the application does
        s1 = gettext (msgid);
        uselocale (...);
        s2 = gettext (msgid);
      in a way that the locale change also changes the locale encoding,
      s1 and s2 will be different (because looked up from different
      ConvertedContents objects from the same LoadedMoFile).

    + When the application does
        s1 = gettext (msgid);
        locale_t old_locale = uselocale (...);
        ... strtod() / sscanf() calls ...
        uselocale (old_locale);
        s2 = gettext (msgid);
      then, since the locale at the two gettext() calls is the same,
      s1 and s2 will be the same.

    + When an application's thread does
        s1 = gettext (msgid);
      and another thread does
        locale_t old_locale = uselocale (...);
        ... strtod() / sscanf() calls ...
        uselocale (old_locale);
      then the first thread can use s1 without caring about the second
      thread.

I hope this explains it: how gettext() can be implemented in a reasonable
way, without limiting the use of uselocale().
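
As an illustration only, the cache design described above can be modelled in a few lines of C. This is not glibc's code: the names cached_translation and MAX_ENTRIES are hypothetical, the real msgid lookup and iconv conversion are replaced by a placeholder string, and locking is omitted. The point is that entries are keyed by (.mo file name, encoding), are never freed, and their number is bounded, so returned pointers stay valid.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily simplified model: one entry per
   (.mo file name, encoding) pair.  Entries are never freed, so the
   memory held is bounded by MAX_ENTRIES and pointers stay valid. */
#define MAX_ENTRIES 64

struct cache_entry {
    char *mo_name;
    char *encoding;
    char *text;      /* stand-in for the msgid --> converted msgstr table */
};

static struct cache_entry cache[MAX_ENTRIES];
static size_t cache_used;

static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Returns a string whose address does not change for the rest of the
   process, or NULL if the cache is full or memory is exhausted. */
const char *cached_translation(const char *mo_name, const char *encoding)
{
    for (size_t i = 0; i < cache_used; i++)
        if (strcmp(cache[i].mo_name, mo_name) == 0
            && strcmp(cache[i].encoding, encoding) == 0)
            return cache[i].text;

    if (cache_used == MAX_ENTRIES)
        return NULL;

    char *name = dup_string(mo_name);
    char *enc = dup_string(encoding);
    size_t n = strlen(mo_name) + strlen(encoding) + 4; /* " [" + "]" + NUL */
    char *text = malloc(n);
    if (name == NULL || enc == NULL || text == NULL) {
        free(name);
        free(enc);
        free(text);
        return NULL;
    }
    /* Placeholder for "look up msgid and convert it with iconv". */
    snprintf(text, n, "%s [%s]", mo_name, encoding);

    cache[cache_used].mo_name = name;
    cache[cache_used].encoding = enc;
    cache[cache_used].text = text;
    cache_used++;
    return text;
}
```

In this model, two lookups with the same encoding return the same pointer even when a lookup with another encoding happens in between. That is the property that keeps s1 usable across a temporary uselocale() switch.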

Bruno

[1] https://en.wikipedia.org/wiki/Memory_leak

Re: POSIX gettext(): behaviour if iconv() produces a replacement character

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group

Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 350
>
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The new text achieves what I wanted; thank you.
There is a typo, though: a missing closing parenthesis after
``replacement-character''.

Bruno

Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group

Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 573
> 
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The change is good, from my POV. Thank you.

Bruno

Re: POSIX gettext with option -s: handling of \c escape sequence

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group

Thanks for the reply.

Geoff Clare wrote:
> > This is NOT entirely how the gettext program from GNU gettext behaves.
> > Namely, it also looks whether some of the strings contain a '\c' sequence,
> > in order to emulate what BSD 'echo' does:
> > 
> > $ gettext -s -e 'ab\c' | od -t c
> > 0000000   a   b
> > 0000002
> > 
> > Whereas on Solaris, \c is not interpreted:
> > 
> > $ gettext -s -e 'ab\c' | od -t c
> > 0000000   a   b   c  \n
> > 0000004
> > 
> > How to resolve this?
> 
> In today's call we made changes to allow this handling of \c (using "may",
> so it is an implementation option).  Please check the updated etherpad to
> see if the way it is described there matches how GNU gettext behaves.

The updated text is good. GNU gettext will need a small change, in order
to accommodate the specified behaviour for the characters that follow '\c',
but that is OK since it is rare for users to add more characters after '\c'.

Bruno

Re: POSIX gettext(): multithread-safe or not?

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group

Thank you for the reply.

https://posix.rhansen.org/p/gettext_draft
Line 357

Geoff Clare wrote:
> However, we have rearranged the wording in a way that
> we hope makes it clearer it is a requirement on implementations.

Thank you; it is clearer now. It would be even clearer if there was a
paragraph break between the "The application shall ensure …" sentence
and the "A subsequent call …" sentence.

Bruno

Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

Steffen Nurpmeso wrote:
>  ...
>  | [.] "UTF-7"."
> 
> That is overshoot.

No. UTF-7 is invalid here because it produces output that is not NUL
terminated. See:

$ printf 'ab\0' | iconv -t UTF-7 | od -t c
0000000   a   b   +   A   A   A   -
0000007

strlen() on such a return value makes invalid memory accesses.
You can convince yourself by running
$ OUTPUT_CHARSET=UTF-7 valgrind ls --help

Bruno

POSIX gettext: a typo

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 668

Typo: msgid_pural -> msgid_plural

POSIX gettext() and NLSPATH

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 130
"indicates that catopen( ) should look ..."
What does the gettext family of functions do when NLSPATH is set to this
value?

Line 136
"indicates that the gettext family of functions ..."
What does the catopen() function do when NLSPATH is set to this value?

There is some explanation in example 5. But IMO it's not immediately clear
how this applies to example 1 and example 2.

POSIX and restrict

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 163..230, 538..543

The 'restrict' keywords in these declarations are useless and - worse -
forbid some valid, useful calls. For example, there is nothing wrong
with
   dgettext("hello", "hello")
which will attempt to search for a translation of "hello" in a catalog
named hello.mo. There is also no imaginable optimization that can be done
in the implementation of dgettext() by assuming that the two arguments
were different.

'restrict' is meaningful when at least one of the parameters is a
writable pointer type. Here, all parameters are either non-pointers
or read-only pointers.

Suggestion: Remove every 'restrict' in these declarations.

POSIX gettext(): choosing the domain name

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 50
"often named after the application that provides the collection"

Issue: On my system, in /usr/share/locale/de/LC_MESSAGES/ there are
55 .mo files for libraries.

Suggestion: Change
"after the application"
->
"after the application or library"

POSIX gettext(): behaviour if iconv() produces a replacement character

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 350

"If a significant proportion of the converted message string would consist
 of characters resulting from non-identical conversions ..."

The term "significant proportion" is undefined.

Suggestion: Change
"If a significant proportion of the converted message string would consist
 of characters resulting from non-identical conversions that do not provide
 any information about the character they were converted from (for example,
 if the converted message string would be mostly  or
  characters)"
to
"If at least one of the non-identical conversions produces a fallback
 character (such as  or , depending
 on implementation)"

Rationale: There is no point in forcing gettext() to accept the converted
string when it has low quality.

POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 573

"The application shall ensure that the codeset argument, if non-empty, is a
 valid codeset name that can be used as the tocode argument of the iconv_open()
 function."

This is not the only requirement. We also need the requirement that the NUL
character of ASCII maps to a single NUL byte in the codeset. Otherwise the
iconv() processing inside gettext() is likely to malfunction.
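
The additional requirement can be checked mechanically. The following is a rough sketch, assuming an iconv implementation that accepts "ASCII" as a source encoding name; the function name nul_maps_to_single_nul_byte is hypothetical. It converts a single NUL character to the candidate codeset and checks that exactly one NUL byte comes out.

```c
#include <iconv.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check: does the NUL character convert to exactly one
   NUL byte in `codeset`?  Returns false also when the codeset is
   unknown to iconv_open(). */
bool nul_maps_to_single_nul_byte(const char *codeset)
{
    iconv_t cd = iconv_open(codeset, "ASCII");
    if (cd == (iconv_t)-1)
        return false;

    char in[1] = { '\0' };
    char out[16];
    char *inp = in;
    size_t inleft = sizeof in;
    char *outp = out;
    size_t outleft = sizeof out;

    /* Convert the NUL, then flush any pending shift sequence
       (stateful encodings such as UTF-7 may emit bytes on flush). */
    bool ok = iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1
              && iconv(cd, NULL, NULL, &outp, &outleft) != (size_t)-1;
    iconv_close(cd);

    size_t produced = sizeof out - outleft;
    return ok && produced == 1 && out[0] == '\0';
}
```

On a typical iconv implementation this accepts "UTF-8" and rejects "UTF-16", because the latter encodes NUL as more than one byte.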

Suggestion: Change
"... iconv_open() function."
to
"... iconv_open() function, and that the NUL character corresponds to a
 single NUL byte in codeset. So, the codeset may not be, for example,
 "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
 "UTF-32LE", "UTF-7"."

POSIX gettext with option -s: handling of \c escape sequence

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 699, 721

"if the -n option is not specified, a <newline> shall be written after the
 last message string"
"(if -n is not also specified) append a <newline> to the output."

This is NOT entirely how the gettext program from GNU gettext behaves. Namely,
it also looks whether some of the strings contain a '\c' sequence, in order to
emulate what BSD 'echo' does:

$ gettext -s -e 'ab\c' | od -t c
0000000   a   b
0000002

Whereas on Solaris, \c is not interpreted:

$ gettext -s -e 'ab\c' | od -t c
0000000   a   b   c  \n
0000004

How to resolve this?

POSIX msgfmt: effect of LC_CTYPE on PO file parsing

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 960

"Do we need to say this isn't used for message strings, only for parsing
 the .po file?"

The .po file format has a mechanism for specifying the codeset of the
PO file. See line 1009. Therefore LC_CTYPE is *not used* for the
interpretation of the input .po file, only for producing diagnostics
(in combination with the LC_MESSAGES category).

POSIX gettext(): multithread-safe or not?

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 357

"The returned string shall not be ... invalidated by a subsequent call
 to a gettext family function."

It is not clear whether this sentence is an assertion (regarding how the
gettext() implementation behaves) or a requirement/restriction w.r.t. the
application.

In the latter case, the consequences of this restriction would be:
  1) Multithreaded applications cannot use gettext, except during
     initialization when only one thread exists.
  2) Libraries cannot use gettext, otherwise multithreaded applications
     cannot make use of them. And *many* applications are multithreaded
     nowadays.

If the gettext() functions are designed like this, they would be as
problematic to use as the old non-reentrant, non-MT-safe functions
(like getpwnam), for which _r variants had to be designed.

We don't want to replace all calls to gettext() with calls to gettext_r()!

POSIX gettext(): lifetime of returned values

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 357

"The returned string may be invalidated by ... a subsequent call to
uselocale() in the same thread, except for calls that only query values."

As explained in my mail from 2021-05-04 [1]

   uselocale() is a helper function to implement *_l functions where
   the POSIX standard does not specify them or the system does not have
   them.
   For example, when a program wants to have a function to parse
   a number, recognizing only the ASCII digits and only '.' as decimal
   separator, a reliable way to implement such a function is by calling
   uselocale of the "C" locale, strtod(), and then uselocale() again
   to switch the thread back to the previous locale.

   If POSIX did not have uselocale(), it would need to provide many
   more *_l functions.

If temporarily switching a thread's locale through uselocale()
invalidates the gettext functions' results (even if only those from
the same thread), it effectively disallows uselocale() as a helper
function.
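
The helper pattern described in the quoted mail can be sketched as follows. The function name strtod_c is hypothetical; newlocale(), uselocale(), and freelocale() are the POSIX.1-2008 interfaces, and on some systems they are declared in <xlocale.h> rather than <locale.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number using only ASCII digits and '.'
   as the decimal separator, regardless of the thread's current locale. */
double strtod_c(const char *s, char **endp)
{
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0)
        return strtod(s, endp);     /* fallback: use the current locale */

    /* Temporarily switch this thread to the "C" locale ... */
    locale_t old_locale = uselocale(c_locale);
    double value = strtod(s, endp);
    /* ... and switch back to whatever was in effect before. */
    uselocale(old_locale);

    freelocale(c_locale);
    return value;
}
```

If such a temporary switch invalidated strings previously returned by gettext() in the same thread, any library function written this way would silently break its callers.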

[1] https://lists.gnu.org/archive/html/bug-gettext/2021-05/msg5.html

POSIX gettext(): Use of LANGUAGE in the POSIX locale

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Line 65
"The locale names in LANGUAGE shall take precedence over <...>"

Issue: If this is true in all cases, then

1) programs such as 'diff'
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/diff.html
- which are forced to produce a specific output in the POSIX locale -
will have to explicitly test for the POSIX locale, for example by
doing
  const char *fmt =
    (in_posix_locale () ? "Only in %s: %s\n" : gettext ("Only in %s: %s\n"));

2) for many languages, which use non-ASCII characters, the output
will contain many question marks, due to transliteration, because the
POSIX locale, on many systems, comes with the ASCII encoding.
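
For illustration, the in_posix_locale() test mentioned in point 1) could look roughly like this. The function is hypothetical (the message only names it), and comparing the LC_MESSAGES locale name against "C" and "POSIX" is an assumption about what such a test would check.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical implementation of the explicit test from point 1):
   query (without changing) the LC_MESSAGES locale and compare its
   name with the two spellings of the POSIX locale. */
bool in_posix_locale(void)
{
    const char *name = setlocale(LC_MESSAGES, NULL);
    return name == NULL
           || strcmp(name, "C") == 0
           || strcmp(name, "POSIX") == 0;
}
```

Every program with mandated POSIX-locale output would need a wrapper like this around each gettext() call, which is the burden the suggestion below avoids.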

Suggestion: Change
"over <...>"
to
"over <...>, if the latter is not the POSIX locale"

Remark: This is how GNU gettext has behaved for over 20 years.
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gettext-runtime/intl/dcigettext.c;h=e7cb9b962a9a6b8e9ccf4a4a249b41517f857f26;hb=HEAD#l1626

POSIX gettext(): messages catalog lookup when LANGUAGE is set

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 308..309
  "o attempt to locate a suitable messages object...
   o attempt to retrieve the string identified by msgid from the messages
     object"
and lines 342, 344
  "the pathname used to locate the messages object shall be
   dirname/localename/categoryname/textdomainname.mo, where:
   ...
   additional searches of locale names without .codeset (if present),
   without _territory (if present), and without @modifier (if present)
   may be performed"

This text is suggesting that once the first suitable messages object
has been found, the string identified by msgid will be looked up in
this ONE AND ONLY ONE messages object.

Lines 339, 340 on the other hand suggest that when the msgid is not
found in the messages object, the search may continue.


What GNU gettext does, is that if the "attempt to retrieve the string
identified by msgid from the message object" fails, the search continues
with the NEXT suitable messages object.

Why is this important?

The list of directories is important for people who live in communities
which often (but not always) have translations of their own but can read
translations for other locales. For example:

  * A user in Austria prefers translations for Austrian German, but can
also read German with no problem.

  * A user in Uzbekistan may prefer translations in Cyrillic but can also
read translations in Latin. [1]

To accommodate, say, the first of these examples:

  * The language identifier 'de' gives rise to the localename part
  de

  * The language identifier 'de_AT' gives rise to the localename parts
  de_AT
  de

The translator for the de_AT locale may choose to translate only messages
that contain words that translate differently to Austrian than to German,
e.g. "potato", "nonsense", "bag", "is possible", and leave the rest to
the German translator. (It would be a waste of human resources if the
Austrian and the German translator did double work on 98% of the messages.
And it would be a complicated workflow if the Austrian translator had to
update their translations each time the German translator sends new or
updated translations.)

The de_AT/LC_MESSAGES/textdomainname.mo file will thus contain a few
translations, and the desired behaviour is that for the msgids not
translated in this messages object, the next one,
de/LC_MESSAGES/textdomainname.mo, gets used.

Suggestion: Reword it so that
  "o attempt to locate a suitable messages object...
   o attempt to retrieve the string identified by msgid from the messages
     object"
are no longer separate steps, but such that backtracking occurs when the
messages object does not contain a translation for the given msgid.
This would resolve the apparent contradiction with lines 339, 340.
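
The backtracking behaviour asked for here can be sketched in a few lines. Everything below is illustrative: the struct and function names are hypothetical, the catalogs are in-memory arrays rather than .mo files, and the sample translations are made up for the de_AT/de example.

```c
#include <stddef.h>
#include <string.h>

struct message { const char *msgid; const char *msgstr; };

struct catalog {
    const char *localename;
    const struct message *messages;
    size_t count;
};

/* Illustrative data: the de_AT catalog translates only what differs. */
static const struct message de_at_messages[] = {
    { "potato", "Erdapfel" },
};
static const struct message de_messages[] = {
    { "potato", "Kartoffel" },
    { "file", "Datei" },
};
const struct catalog sample_search_list[] = {
    { "de_AT", de_at_messages, 1 },
    { "de", de_messages, 2 },
};

static const char *catalog_find(const struct catalog *c, const char *msgid)
{
    for (size_t i = 0; i < c->count; i++)
        if (strcmp(c->messages[i].msgid, msgid) == 0)
            return c->messages[i].msgstr;
    return NULL;
}

/* Try each catalog in order; a miss in one catalog moves on to the
   next instead of ending the search.  Falls back to msgid itself. */
const char *lookup_with_backtracking(const struct catalog *list, size_t n,
                                     const char *msgid)
{
    for (size_t i = 0; i < n; i++) {
        const char *s = catalog_find(&list[i], msgid);
        if (s != NULL)
            return s;
    }
    return msgid;
}
```

With the sample data, "potato" resolves from the de_AT catalog, "file" falls through to the de catalog, and an unknown msgid is returned unchanged.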

POSIX gettext(): messages catalog lookup when LANGUAGE is not set

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 335, 344

  "For portable applications, only the LANGUAGE search supports searches
   across multiple locale names."
  "For the LANGUAGE search, ... if a locale name has the format
   language[_territory][.codeset][@modifier], additional searches of locale
   names without .codeset (if present), without _territory (if present),
   and without @modifier (if present) may be performed; if .codeset is not
   present, additional searches of locale names with an added .codeset may
   be performed. For the single-locale search, the localename part is the
   name of the current locale, or the locale specified in an *_l() function
   call, for the category named by categoryname."

As explained in my mails from 2021-05-04 and 2022-01-16, it is important to
support people who live in communities which often (but not always) have
translations of their own but can read translations for other locales.
At the same time, it is important to allow a translator for, say, German
to produce a translation that is useful for users in Germany, Austria, and
Switzerland, if no other (more specific) translation is available.

So, while the user may be working in either of the locales
  de_DE.UTF-8
  de_AT.UTF-8
  de_CH.UTF-8
they SHOULD see the translations that have been installed at
  dirname/de/LC_MESSAGES/textdomainname.mo

This is true also if the LANGUAGE environment variable has not been set.
Most operating systems set the LANG or LC_ALL environment variable for the
user, but do not set LANGUAGE.

In this situation, the current text mandates(!) that for a user in the
de_DE.UTF-8 locale
  - dirname/de/LC_MESSAGES/textdomainname.mo always gets ignored, and
  - dirname/de_DE.UTF-8/LC_MESSAGES/textdomainname.mo gets used - but this
    messages object file almost never exists.

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) looking in different (and fewer) directories than they do today,
  causing major i18n dysfunctionality to users, until the users
  have set up lots of symbolic links between directories or set
  LANGUAGE to a (redundant) value, or
  (b) violating POSIX in this point.

I will vote for (b).

If the above text were adopted, it would have the following consequences:

  1) Users will have to set LANGUAGE.

       LANG=de_DE.UTF-8

     will not be sufficient; instead the user will have to set

       LANG=de_DE.UTF-8
       LANGUAGE=de

For those users who don't do this:

  2) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4
     is a system that implements gettext() as described in the above text,
     and it has the links shown below [1].

  3) Users who want to create a new locale (e.g. for English in Australia)
     will have to create a symlink
     /usr/share/locale/en_AU -> /usr/share/locale/en
     and so on for each custom locale.

  4) Users who install packages in non-privileged directories (for GNU
     programs, that's the --prefix=PREFIX option) will have to create the
     same amount of symbolic links in their PREFIX/share/locale/ directory.

This is BAD, BAD, BAD.

Suggestion:
In line 344, make the
   "if a locale name has the format language[_territory][.codeset][@modifier],
additional searches of locale names without .codeset (if present), without
_territory (if present), and without @modifier (if present) may be
performed; if .codeset is not present, additional searches of locale
names with an added .codeset may be performed."
text apply also to the single-locale case.
In line 335, remove the sentence "only the LANGUAGE search supports searches
across multiple locale names."
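
To show what the additional searches amount to, here is a small sketch that expands one locale name into the list of localename parts to try. The function name locale_candidates, the fixed buffer size, and the order in which parts are dropped (codeset, then territory, then modifier) are assumptions made for illustration; real implementations may try more combinations in a different order.

```c
#include <stdio.h>
#include <string.h>

#define LOCALE_NAME_LEN 64

/* Hypothetical helper: expand language[_territory][.codeset][@modifier]
   into progressively less specific names.  Returns the number of
   candidates stored in `out` (at most `max`), or 0 if the name is too
   long for the fixed-size buffers used here. */
size_t locale_candidates(const char *name, char out[][LOCALE_NAME_LEN],
                         size_t max)
{
    char lang[LOCALE_NAME_LEN];
    char territory[LOCALE_NAME_LEN] = "";
    char codeset[LOCALE_NAME_LEN] = "";
    char modifier[LOCALE_NAME_LEN] = "";
    char *p;

    if (strlen(name) >= sizeof lang)
        return 0;
    strcpy(lang, name);

    /* Split from the right; each part keeps its leading delimiter. */
    if ((p = strchr(lang, '@')) != NULL) { strcpy(modifier, p); *p = '\0'; }
    if ((p = strchr(lang, '.')) != NULL) { strcpy(codeset, p); *p = '\0'; }
    if ((p = strchr(lang, '_')) != NULL) { strcpy(territory, p); *p = '\0'; }

    size_t n = 0;
    int keep_codeset = 1, keep_territory = 1, keep_modifier = 1;
    for (int step = 0; step < 4 && n < max; step++) {
        if (step == 1) keep_codeset = 0;
        if (step == 2) keep_territory = 0;
        if (step == 3) keep_modifier = 0;
        snprintf(out[n], LOCALE_NAME_LEN, "%s%s%s%s", lang,
                 keep_territory ? territory : "",
                 keep_codeset ? codeset : "",
                 keep_modifier ? modifier : "");
        /* Skip a step that removed a part which was not present. */
        if (n == 0 || strcmp(out[n], out[n - 1]) != 0)
            n++;
    }
    return n;
}
```

For "de_DE.UTF-8" this yields de_DE.UTF-8, de_DE, and de, so a catalog installed under dirname/de/LC_MESSAGES/ is found without LANGUAGE being set and without symbolic links.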

Bruno

[1]
$ ls -l /usr/share/locale
total 102
drwxr-xr-x   3 root other  3 Oct 13  2018 C
drwxr-xr-x   3 root other  4 Oct 13  2018 de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.UTF-8 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.ISO8859-15 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 de.us-ascii
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.UTF-8 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 en
drwxr-xr-x   3 root other  3 Oct 13  2018 en_US
drwxr-xr-x   3 root other  3 Oct 13  2018 en@boldquot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@quot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@shaw
drwxr-xr-x   3 root other  4 Oct 13  2018 es
drwxr-xr-x   3 root other  3 Oct 13  2018 es_ES
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-1 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2

Re: POSIX msgfmt and duplicate msgids

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

Eric Blake wrote:
> In the msgfmt(1) utility, there is currently a difference between GNU
> and Illumos implementations on detecting duplicate msgid strings, and
> which command line switch(es) make detection of duplicates possible.
> The question is whether GNU msgfmt would be willing to use the current
> -c option (--check) have a mode for erroring out on duplicate msgid
> strings, or even adding a new command line option (-n appears to be
> available, for a mnemonic of 'no dupes') to have the duplicate
> detection available without requiring -c.

https://posix.rhansen.org/p/gettext_draft
Lines 925..926, 1140

"-n Do not allow duplicate msgid directives. Treat duplicate msgid
directives for the same message_identifier as errors instead of
ignoring the duplicates."

This does not deserve a specific option. *Of course* an input file with
duplicate msgids is abnormal; this is like a C file that defines two
functions with the same name. And that implies that when invoked with
-c and -v, the 'msgfmt' program must produce an error.

None of the following has this '-n' option:
  - The LI18NUX 2000 specification
  - GNU msgfmt
https://www.gnu.org/software/gettext/manual/html_node/msgfmt-Invocation.html
  - Solaris msgfmt
https://docs.oracle.com/cd/E88353_01/html/E37839/msgfmt-1.html

When '-c' and '-v' are *not* specified, I don't care whether the standard
requires msgfmt to diagnose this abnormality of the input. But it should
definitely not prohibit it.

Suggestion:
Remove these lines.

> The question is whether GNU msgfmt would be willing to use the current
> -c option (--check) have a mode for erroring out on duplicate msgid
> strings

It doesn't need a mode for that. GNU msgfmt already errors out on
duplicate msgids, even without '-c'.

Bruno

Re: POSIX xgettext and dgettext() calls

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_draft
Lines 1173..1179

> on Solaris, the resulting .po file is called "foobar.po" and contains the 
> msgid "test".

Confirmed; it's like this on OmniOS and OpenIndiana.

> Running it on GNU, the resulting .po file is called "messages.po" and there 
> is no indication that the msgid belongs to "foobar".

Confirmed as well. It is like this since at least version 0.10.40 from 2001.

> According to the L18nux specification, the Solaris behavior is intended.

Confirmed: LI18NUX 2000 says "msgid strings in dgettext() calls are written
to the output file domainname.po where domainname is the first parameter to
the dgettext() call."

> Why does GNU xgettext deviate?

I think there are three reasons:

(1) Premature standardization: At that time, there was no established
practice regarding how to deal with multiple domains.

The old Uniforum specification pushed for the idea of a multi-domain
PO file, with the 'domain' directives; this approach made it hard to
concatenate and manipulate the files.

The LI18NUX 2000 specification pushed for extracting a separate .po
file for dgettext() directives.

This did not attain wide use either, because the programmers want to
minimize the number of domains: ideally one domain per package. Then
it makes no sense to mention the domain name at hundreds of places in
the source code. The programmer would instead write
  #define _(msgid) dgettext("mydomain", msgid)
and use the _() macro throughout the source code.
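
A minimal sketch of that convention, for illustration only. The domain
name "mydomain" is a placeholder and greeting() is a hypothetical helper;
when no catalog for the domain is found, dgettext() returns the msgid
itself.

```c
#include <libintl.h>

/* Placeholder domain name, used only for illustration. */
#define _(msgid) dgettext("mydomain", msgid)

/* Returns the translation of the greeting, or the msgid itself when no
   catalog for "mydomain" is found. */
const char *greeting(void)
{
    return _("Hello, world!");
}
```

With this, the domain name appears once per package instead of at every
call site.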

(2) Integration into a build system.

The xgettext utility is, in 99% of the cases, used as part of a build
system. In a build system, a maintainer wants to have control over the
file names; that is, they don't want files with arbitrary names to
appear. For comparison, have you ever seen a C/C++ compiler create a
separate file for each function/class/template/whatever? No, because
the build systems people conceive for C/C++, with Makefile rules etc.,
don't like files with arbitrary file names in the current directory.

(3) Security: When your test program is changed to

#include <stdio.h>
#include <libintl.h>
int main(){
printf("%s\n",dgettext("../../../../../../tmp/foobar","test"));
}

it does indeed create a file /tmp/foobar.po. Similar things can be
done, to write into any writable directory on the disk. This is
nowadays considered a security issue, which is why e.g. GNU tar
prohibits extracting files outside of the current directory, since
version 1.30.
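
For illustration, an implementation that does write domainname.po files
could reject such names with a check along these lines. This is only a
sketch: domain_name_is_safe() is a hypothetical helper, not something any
existing xgettext is known to contain.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: accept a domain name as an output file stem only
   if it is non-empty, contains no '/', and is not "." or "..". */
bool domain_name_is_safe(const char *domain)
{
    if (domain == NULL || domain[0] == '\0')
        return false;
    if (strchr(domain, '/') != NULL)
        return false;
    if (strcmp(domain, ".") == 0 || strcmp(domain, "..") == 0)
        return false;
    return true;
}
```

The domain name from the modified test program above fails this check
because it contains '/'.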

Suggestion:
Mark this case as unspecified.

Rationale: I don't think the Li18nux + Solaris behaviour should be
standardized, because of the points (2), (3) above. And I don't think
it's worth standardizing any particular behaviour at all, because of
what I wrote in (1). It's a fringe case no one uses.

Bruno

Re: POSIX gettext() and uselocale()

2022-01-17 Thread Bruno Haible via austin-group-l at The Open Group

Geoff Clare wrote:
> The current draft says:
> 
> The returned string may be invalidated by a subsequent call to
> bind_textdomain_codeset(), bindtextdomain(), setlocale(), or
> textdomain() in the same process, or a subsequent call to
> uselocale() in the same thread, except for calls that only query
> values.
> 
> [...]
> 
> > I think that specifying gettext() to be so restricted is not useful.
> > It would make more sense to allow concurrent uselocale() calls.
> 
> The current draft text allows concurrent uselocale() calls.

This is better; thanks. Still, I don't think it is either sufficient or consistent.

OBJECTION 1:
  It requires applications to delegate some calls to separate threads.
  For example, take an application that regularly updates some UI and
  also occasionally writes a JSON file.

  For the UI updates, it will need to call gettext(). Let's assume that
  the UI caches the strings that the application passes it,
  e.g. for fast rerendering. This is the typical way a UI is built. E.g.
  Gtk+:   label1 = gtk_label_new (gettext ("Hello, world!"));
  Qt: label1 = new QLabel (gettext ("Hello, world!"), panel);

  For writing data in JSON format [1], it needs to convert
- strings to UTF-8 encoding,
- numbers to decimal representation, with '.' as decimal separator.
  For converting numbers to decimal, since the standard has strtod()
  but no strtod_l() [2], the most immediate implementation is to use
  uselocale() with a "C" locale argument, then call strtod(), then
  switch back to the previous locale using uselocale().

  With the current wording, converting a number to decimal like this
  will invalidate many of the strings that the UI is holding.

  Thus, the application will need to move its JSON file writing to a
  separate thread. This is a big architectural requirement.

OBJECTION 2:
  It is inconsistent with other parts of POSIX. For localeconv() [3]
  the wording is
"... might be overwritten by subsequent calls to setlocale() with the
 categories LC_ALL, LC_MONETARY, or LC_NUMERIC, or by calls to
 uselocale() which change the categories LC_MONETARY or LC_NUMERIC."

  To make things consistent, you would need to change the text for gettext
  from
"call to uselocale() in the same thread"
  to
"call to uselocale() in the same thread which changes the category
 LC_MESSAGES (for gettext(), gettext_l(), dgettext(), dgettext_l())
 respectively the locale passed to dcgettext(), dcgettext_l()"

Bruno

[1] https://datatracker.ietf.org/doc/html/rfc8259
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/strtod.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799/functions/localeconv.html

POSIX gettext() and uselocale()

2022-01-16 Thread Bruno Haible via austin-group-l at The Open Group

[First sent on 2021-05-03. Resending because it has not been handled.]

https://posix.rhansen.org/p/gettext_draft
says (line 358):

  "The returned string may be invalidated by a subsequent call to
   bind_textdomain_codeset(), bindtextdomain(), setlocale(),
   textdomain(), or uselocale()."

While in most programs setlocale(), textdomain(), bindtextdomain(),
bind_textdomain_codeset() are being called at the beginning of the
program execution, before any call to gettext(), the situation is
very different for uselocale().

1) uselocale() is meant to have effects ONLY on the thread in which it
   is called.

2) uselocale() is a helper function to implement *_l functions where
   the POSIX standard does not specify them or the system does not have
   them.
   For example, when a program wants to have a function to parse
   a number, recognizing only the ASCII digits and only '.' as decimal
   separator, a reliable way to implement such a function is by calling
   uselocale() with the "C" locale, then strtod(), and then uselocale() again
   to switch the thread back to the previous locale.

   If POSIX did not have uselocale(), it would need to provide many
   more *_l functions.
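
A minimal sketch of the pattern described in 2), assuming a system that
provides newlocale(), uselocale() and freelocale() in <locale.h>. The
name strtod_c() is hypothetical; it stands in for the strtod_l() that
POSIX does not specify.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <stdlib.h>

/* Hypothetical stand-in for the missing strtod_l(): parse S in the "C"
   locale, then restore the thread's previous locale. */
double strtod_c(const char *s, char **endptr)
{
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t) 0);
    if (c_locale == (locale_t) 0)
        return strtod(s, endptr);            /* fallback: current locale */
    locale_t previous = uselocale(c_locale); /* affects this thread only */
    double result = strtod(s, endptr);
    uselocale(previous);                     /* switch back */
    freelocale(c_locale);
    return result;
}
```

Every call to such a helper performs two uselocale() calls in the
calling thread, which is why the invalidation rule matters for ordinary
programs.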

If the gettext() result may be invalidated by a uselocale() call (in
any other thread!), this would mean that

  ** Programs can use gettext() or uselocale() but not both. **

and - more or less -

  ** Multithreaded programs that use libraries (that may use uselocale())
 cannot use gettext(). **

I think that specifying gettext() to be so restricted is not useful.
It would make more sense to allow concurrent uselocale() calls.

Proposed wording:

  "The returned string may be invalidated by a subsequent call to
   bind_textdomain_codeset(), bindtextdomain(), setlocale(),
   or textdomain()."

POSIX gettext() and the installation directories for .mo files

2022-01-16 Thread Bruno Haible via austin-group-l at The Open Group

[First sent on 2021-05-03. Resending because it has not been fully handled.]

https://posix.rhansen.org/p/gettext_draft
says (line 343..345)

  "For each locale name in LANGUAGE, or if LANGUAGE is not set or is
   empty, or no suitable messages object is found in processing LANGUAGE,
   the pathname used to locate the messages object shall be
   dirname/localename/categoryname/textdomainname.mo, where:
   ...
   For the LANGUAGE search, the localename part is each locale name from
   LANGUAGE in turn.  For the single-locale search, the localename part
   is the name of the current locale, or the locale specified in an *_l()
   function call, for the category named by categoryname."

This text is ambiguous. The first cited paragraph says that it looks in a
single directory; the second cited paragraph says that it tries locale names
"in turn". This is contradictory. Also when it says "in turn" it does not
say what the stopping condition is: does it loop
  - until an existing locale name is found?
  - until a file dirname/localename/categoryname/textdomainname.mo is found?
  - until a file dirname/localename/categoryname/textdomainname.mo is found
that contains a translation for the given msgid?

For most of the interpretations of this set of paragraphs, this is NOT how
GNU gettext behaves. If POSIX standardizes it like this, GNU libc and
GNU gettext will have the choice among
  (a) looking in different (and fewer) directories than they do today,
  causing major i18n dysfunctionality to users, until the users
  have set up lots of symbolic links between directories, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does is to look in SEVERAL (not ONE) directories
per LANGUAGE element. This is true for *both* the LANGUAGE search and the
single-locale search.

The localename parts of these directories are constructed from the language
identifier (element of LANGUAGE) or locale name. For example:

* The language identifier 'de' gives rise to the localename part
de

* The language identifier 'de_AT' gives rise to the localename parts
de_AT
de

* The locale name 'de_AT.UTF-8' gives rise to the localename parts
de_AT.UTF-8
de_AT.utf8
de_AT
de.UTF-8
de.utf8
de

* The locale name 'uz_UZ.UTF-8@cyrillic' gives rise to the localename parts
uz_UZ.UTF-8@cyrillic
uz_UZ.utf8@cyrillic
uz_UZ@cyrillic
uz.UTF-8@cyrillic
uz.utf8@cyrillic
uz@cyrillic
uz_UZ.UTF-8
uz_UZ.utf8
uz_UZ
uz.UTF-8
uz.utf8
uz
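
The lists above can be reproduced with a short sketch like the following.
It is not GNU gettext's actual code: expand_locale_name() and
normalize_codeset() are hypothetical helpers, and the normalization rule
(lowercase, alphanumeric characters only) is an assumption inferred from
the UTF-8 / utf8 pairs shown above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_NAMES 12   /* 2 (modifier) x 2 (territory) x 3 (codeset) */
#define NAME_LEN  64

/* Lowercase CODESET and drop non-alphanumerics ("UTF-8" -> "utf8"). */
static void normalize_codeset(const char *codeset, char *out, size_t size)
{
    size_t n = 0;
    for (; *codeset != '\0' && n + 1 < size; codeset++)
        if (isalnum((unsigned char) *codeset))
            out[n++] = (char) tolower((unsigned char) *codeset);
    out[n] = '\0';
}

/* Hypothetical helper: split NAME of the form
   language[_territory][.codeset][@modifier] and fill NAMES with the
   candidate localename parts, most specific first.  Returns the count. */
size_t expand_locale_name(const char *name, char names[MAX_NAMES][NAME_LEN])
{
    char buf[NAME_LEN], norm[NAME_LEN];
    const char *terr = "", *codeset = "", *mod = "";
    char *p;
    size_t count = 0;

    snprintf(buf, sizeof buf, "%s", name);   /* buf ends up holding the language */
    if ((p = strchr(buf, '@')) != NULL) { *p = '\0'; mod = p + 1; }
    if ((p = strchr(buf, '.')) != NULL) { *p = '\0'; codeset = p + 1; }
    if ((p = strchr(buf, '_')) != NULL) { *p = '\0'; terr = p + 1; }
    normalize_codeset(codeset, norm, sizeof norm);

    /* Codeset variants: as given, normalized (if different), omitted. */
    const char *cs_variants[3];
    size_t n_cs = 0;
    if (*codeset != '\0') {
        cs_variants[n_cs++] = codeset;
        if (norm[0] != '\0' && strcmp(norm, codeset) != 0)
            cs_variants[n_cs++] = norm;
    }
    cs_variants[n_cs++] = "";

    for (int with_mod = (*mod != '\0'); with_mod >= 0; with_mod--)
        for (int with_terr = (*terr != '\0'); with_terr >= 0; with_terr--)
            for (size_t i = 0; i < n_cs; i++) {
                const char *cs = cs_variants[i];
                snprintf(names[count++], NAME_LEN, "%s%s%s%s%s%s%s", buf,
                         with_terr ? "_" : "", with_terr ? terr : "",
                         *cs ? "." : "", cs,
                         with_mod ? "@" : "", with_mod ? mod : "");
            }
    return count;
}
```

A caller would then try dirname/NAMES[i]/categoryname/textdomainname.mo
for each candidate in order.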

This list of directories is important for people who live in communities
which often (but not always) have translations of their own but can read
translations for other locales. In the examples above:

  * A user in Austria prefers translations for Austrian German, but can
also read German with no problem.

  * A user in Uzbekistan may prefer translations in Cyrillic but can also
read translations in Latin. [1]

If the above text were adopted, it would have the consequences that

  1) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4
 is a system that implements gettext() as described in the above text,
 and it has the links shown below [2].

  2) Users who want to create a new locale (e.g. for English in Australia)
 will have to create a symlink
 /usr/share/locale/en_AU -> /usr/share/locale/en
 and so on for each custom locale.

  3) Users who install packages in non-privileged directories (for GNU
 programs, that's the --prefix=PREFIX option) will have to create the
 same number of symbolic links in their PREFIX/share/locale/ directory.

  4) Users will have to set fallback logic in their LANGUAGE environment
 variable

   LANGUAGE=de_AT:de_DE

 instead of having it built-in:

   LANGUAGE=de_AT

This is BAD, BAD, BAD.

Bruno

[1] https://en.wikipedia.org/wiki/Uzbek_alphabet
[2]
$ ls -l /usr/share/locale
total 102
drwxr-xr-x   3 root other  3 Oct 13  2018 C
drwxr-xr-x   3 root other  4 Oct 13  2018 de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.UTF-8 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.ISO8859-15 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 de.us-ascii
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.UTF-8 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 en
drwxr-xr-x   3 root other  3 Oct 13  2018 en_US
drwxr-xr-x   3 root other  3 Oct 13  2018 en@boldquot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@quot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@shaw
drwxr-xr-x   3 root other  4 Oct 13  2018 es
drwxr-xr-x   3 root other  3 Oct 13  2018 es_ES

Re: Question from Austin Group regarding standardization of msgfmt

2022-01-16 Thread Bruno Haible via austin-group-l at The Open Group

Hi,

Eric Blake wrote:
> The Austin Group (the standards body in charge of the POSIX document)
> is trying to standardize the gettext(3) family of functions, as well
> as command line tools such as gettext(1) and xgettext(1).  You can
> track the efforts here, and if you have comments, I'm happy to relay
> them back to the Austin Group:
> 
> https://posix.rhansen.org/p/gettext_draft

Thanks for this info.

> At the moment, there is a particular question about GNU msgfmt(1)
> behavior.  The Austin Group has noted the current documented
> behaviors, first with GNU xgettext

You mean GNU msgfmt. GNU xgettext has an option '-c' too, but it
has a completely different meaning.

> having two separate options, -c
> and -v, which are currently orthogonal:
> 
>    -c, --check
>           perform all the checks implied by --check-format,
>           --check-header, --check-domain
> 
>    -v, --verbose
>           increase verbosity level
> 
> and contrasting with Solaris msgfmt
> (https://docs.oracle.com/cd/E36784_01/html/E36870/msgfmt-1.html),
> which has no -c, but documents:
> 
> -v
> --verbose
> 
> Verbose. Lists duplicate message identifiers if Solaris message
> catalog files are processed. Message strings are not redefined.
> 
> If GNU-compatible message files are processed, this option detects
> and diagnoses input file anomalies which might represent
> translation errors. The msgid and msgstr strings are studied and
> compared. It is considered abnormal if one string starts or ends
> with a newline while the other does not. Also, if the string
> represents a format string used in a printf-like function, both
> strings should have the same number of % format specifiers, with
> matching types. If the flag c-format appears in the special
> comment '#' for this entry, a check is performed.
> 
> The question on the floor is whether GNU msgfmt would consider
> tweaking behavior so that -v implies -c (that is, turning on verbosity
> now also turns on format checking), so that there is one less option
> letter to standardize, and so that users can just rely on 'msgfmt -v'
> for message checking regardless of GNU or Solaris implementation.
> 
> Or put another way, the Austin Group would like to standardize only:
> 
>  -v  Verbose. If this option is specified, msgfmt shall detect and
> diagnose input file abnormalities which might represent
> translation errors. The msgid and msgstr strings shall be
> compared. It shall be considered abnormal if one string starts or
> ends with a <newline> while the other does not.  Also, if the flag
> c-format appears in a "#," comment for this entry, it shall be
> considered abnormal if the strings do not have the same number of
> '%' conversion specifiers, or if corresponding conversion
> specifiers take different argument types (see [xref to
> fprintf()]). If an abnormality is detected, the exit status shall
> be non-zero and a diagnostic message shall be output.
> 
> which would still leave -c as a GNU extension, but give users the
> ability to get format checking across both implementations with just
> -v.

I object, for three reasons:

OBJECTION 1:
   The text that you propose is incompatible with *both* GNU msgfmt
   and Solaris msgfmt.
   Namely,
 - In GNU msgfmt, the option '-v' increases verbosity without diagnosing
   abnormalities, and does *not* have an effect on the exit status.
 - In Solaris msgfmt, the option '-v' increases verbosity through
   diagnostics of abnormalities, and for most such abnormalities does
   *not* have an effect on the exit status either. Only for duplicate
   msgids does it have an effect on the exit status.

But of course, it is good to realize that presenting error-like diagnostics
with no influence on the exit status is not useful in practice. In fact,
both
  - desktop translation tools and
  - web-based translation services
use "msgfmt -c" to test whether the PO file is ready to submit/accept, by
looking at the exit code of this command.

OBJECTION 2:
  Not introducing a '-c' option is pointless, because (as just said)
  this is the main option for checking the validity / soundness of a PO file.
  It is widely used in practice. Use of 'msgfmt' without the option '-c'
  is neither useful nor frequent, because who wants a .mo file that is
  able to crash the application that opens and uses it?

  Suggestion: Add a '-c' option. Describe it in abstract terms. Don't
  describe it as "perform all the checks implied by --check-format,
  --check-header, --check-domain", because we want to be able to add
  different kinds of checks in the future (like accelerators).

OBJECTION 3:
  Making a '-v' option change the exit status of a utility would be a
  deviation from current practice for existing POSIX utilities.
  The POSIX utilities that have a '-v' option that increases verbosity are:

Re: Question regarding gettext behavior on iconv failure

2021-05-04 Thread Bruno Haible via austin-group-l at The Open Group

Carlos O'Donell wrote:
> > 1 Empfaenger Chinese (??,???,??)  ??
> >   * For the second line of output, in the first three cases, iconv()
> > did transliteration, and the result was always an ASCII string.
> > (The quality of glibc's transliteration of Hanzi characters to
> > question marks can be debated, though.)
> 
> Completely off-topic, but is there a "high quality" transliteration of
> Hanzi characters?
> Would you have expected a phenome to be spelled out in ASCII?
> I am not aware of any way to keep the meaning of the Hanzi characters in
> ASCII, therefore you see the locales "default_missing" character U+003F '?'.

Let's discuss this on libc-alpha. [1]

Bruno

[1] https://sourceware.org/pipermail/libc-alpha/

Re: Question regarding gettext behavior on iconv failure

2021-05-04 Thread Bruno Haible via austin-group-l at The Open Group

Eric Ackermann wrote:
> please find attached another test case (a shortened version of the
> example in the gettext proposal that Eric Blake linked). It uses the
> same mail.po and mail-utf8.po files that you provided earlier.
> When I compile and run it on Ubuntu 20.04 (Ubuntu GLIBC
> 2.31-0ubuntu9.2), for both .po files it prints "Empf?nger" in ASCII
> (converting the a-Umlaut into the question mark). This is probably
> related to the transliteration mechanism you described.

This demo.c example is not a good test case, because it does not
follow the advice to set at least the LC_MESSAGES and LC_CTYPE categories
of the locale. See

and  line 86.

What happens then is that the LC_CTYPE category of the locale is, by default,
set to "C", which implies "ASCII" encoding and no particular language or
territory. glibc's transliteration uses the language to determine the
transliteration to use. For example, it transliterates "å" to "aa" in a
Danish locale, but to "a" in an English locale. In the absence of a known
language, it falls back to "?" (like for the Chinese characters in my
previous mail).

> I conclude that the different sequence in which the
> gettext-functions are called causes this behavior which I would consider
> a bug.

No, there is no bug. The doc states that the LC_MESSAGES and LC_CTYPE
categories should be set, for gettext() to operate reasonably.

Bruno

POSIX gettext() and the locale category

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_split
says (line 217):

  "All of the functions in the gettext family of functions, except
   dcgettext(), search for messages objects only in the LC_MESSAGES
   category."

dcgettext_l, dcngettext, dcngettext_l also search in the specified
category.

Suggested wording:

  "All of the functions in the gettext family of functions, except
   for dcgettext(), dcgettext_l(), dcngettext(), and dcngettext_l()
   search for messages objects only in the LC_MESSAGES category."

Bruno

POSIX gettext() and chdir()

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_split
says (line 273):

  "The bindtextdomain() function shall not perform pathname resolution
   on dirname (that is done by the gettext family of functions)."

This is indeed how GNU gettext and GNU libc behave. However, this is
not optimal:

  1) If the dirname is not absolute, the application cannot use chdir()
 at various points and still expect gettext() to work.

  2) Storing a dirname over a long time and using it only later during
 the gettext() call opens the door to file system races.

A more modern approach would be to

  * have bindtextdomain() open() the given directory and store the
resulting file descriptor,
  * have gettext() use openat() instead of open() for locating and opening
the message catalogs.

Such changes have not been implemented in GNU gettext and glibc so far.
But it would be good to not preclude such an improved implementation.
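
A rough sketch of that approach, under the assumption that open() with
O_DIRECTORY and openat() are available. The names catalog_dir_open() and
catalog_open() are hypothetical; neither GNU gettext nor glibc contains
such functions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical: what a bindtextdomain() variant could do up front.
   Returns a directory file descriptor, or -1 on error. */
int catalog_dir_open(const char *dirname)
{
    return open(dirname, O_RDONLY | O_DIRECTORY);
}

/* Hypothetical: what gettext() could do later.  Opens
   localename/categoryname/domainname.mo relative to DIRFD, so that a
   chdir() in between does not change which file is found. */
int catalog_open(int dirfd, const char *localename,
                 const char *categoryname, const char *domainname)
{
    char relpath[1024];
    int n = snprintf(relpath, sizeof relpath, "%s/%s/%s.mo",
                     localename, categoryname, domainname);
    if (n < 0 || (size_t) n >= sizeof relpath)
        return -1;
    return openat(dirfd, relpath, O_RDONLY);
}
```

With this split, pathname resolution of dirname happens once, at binding
time, which is the behaviour the current draft wording would prohibit.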

Suggested wording:

  "It is unspecified whether the bindtextdomain() function performs
   pathname resolution on dirname, or whether that is done by the
   gettext family of functions."

Bruno

POSIX gettext() and uselocale()

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_split
says (line 92):

  "The returned string may be invalidated by a subsequent call to
   bind_textdomain_codeset(), bindtextdomain(), setlocale(),
   textdomain(), or uselocale()."

While in most programs setlocale(), textdomain(), bindtextdomain(),
bind_textdomain_codeset() are being called at the beginning of the
program execution, before any call to gettext(), the situation is
very different for uselocale().

1) uselocale() is meant to have effects ONLY on the thread in which it
   is called.

2) uselocale() is a helper function to implement *_l functions where
   the POSIX standard does not specify them or the system does not have
   them.
   For example, when a program wants to have a function to parse
   a number, recognizing only the ASCII digits and only '.' as decimal
   separator, a reliable way to implement such a function is by calling
   uselocale() with the "C" locale, then strtod(), and then uselocale() again
   to switch the thread back to the previous locale.

   If POSIX did not have uselocale(), it would need to provide many
   more *_l functions.

If the gettext() result may be invalidated by a uselocale() call (in
any other thread!), this would mean that

  ** Programs can use gettext() or uselocale() but not both. **

and - more or less -

  ** Multithreaded programs that use libraries (that may use uselocale())
 cannot use gettext(). **

I think that specifying gettext() to be so restricted is not useful.
It would make more sense to allow concurrent uselocale() calls.

Proposed wording:

  "The returned string may be invalidated by a subsequent call to
   bind_textdomain_codeset(), bindtextdomain(), setlocale(),
   or textdomain()."

Bruno

POSIX gettext() and the installation directories for .mo files

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_split
says (line 77..79)

  "For each locale name in LANGUAGE, or if LANGUAGE is not set or is
   empty, or no suitable messages object is found in processing LANGUAGE,
   the pathname used to locate the messages object shall be
   dirname/localename/categoryname/textdomainname.mo, where:
   ...
   For the LANGUAGE search, the localename part is each locale name from
   LANGUAGE in turn.  For the single-locale search, the localename part
   is the name of the current locale, or the locale specified in an *_l()
   function call, for the category named by categoryname."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) looking in different (and fewer) directories than they do today,
  causing major i18n dysfunctionality to users, until the users
  have set up lots of symbolic links between directories, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does is to look in SEVERAL (not ONE) directories
per LANGUAGE element.

The localename parts of these directories are constructed from the language
identifier (element of LANGUAGE) or locale name. For example:

* The language identifier 'de' gives rise to the localename part
de

* The language identifier 'de_AT' gives rise to the localename parts
de_AT
de

* The locale name 'de_AT.UTF-8' gives rise to the localename parts
de_AT.UTF-8
de_AT.utf8
de_AT
de.UTF-8
de.utf8
de

* The locale name 'uz_UZ.UTF-8@cyrillic' gives rise to the localename parts
uz_UZ.UTF-8@cyrillic
uz_UZ.utf8@cyrillic
uz_UZ@cyrillic
uz.UTF-8@cyrillic
uz.utf8@cyrillic
uz@cyrillic
uz_UZ.UTF-8
uz_UZ.utf8
uz_UZ
uz.UTF-8
uz.utf8
uz

This list of directories is important for people who live in communities
which often (but not always) have translations of their own but can read
translations for other locales. In the examples above:

  * A user in Austria prefers translations for Austrian German, but can
also read German with no problem.

  * A user in Uzbekistan may prefer translations in Cyrillic but can also
read translations in Latin. [1]

If the above text were adopted, it would have the consequences that

  1) Many symbolic links are needed in /usr/share/locale/. Solaris 11.4
 is a system that implements gettext() as described in the above text,
 and it has the links shown below [2].

  2) Users who want to create a new locale (e.g. for English in Australia)
 will have to create a symlink
 /usr/share/locale/en_AU -> /usr/share/locale/en
 and so on for each custom locale.

  3) Users who install packages in non-privileged directories (for GNU
 programs, that's the --prefix=PREFIX option) will have to create the
 same number of symbolic links in their PREFIX/share/locale/ directory.

  4) Users will have to set fallback logic in their LANGUAGE environment
 variable

   LANGUAGE=de_AT:de_DE

 instead of having it built-in:

   LANGUAGE=de_AT

This is BAD, BAD, BAD.

Bruno

[1] https://en.wikipedia.org/wiki/Uzbek_alphabet
[2]
$ ls -l /usr/share/locale
total 102
drwxr-xr-x   3 root other  3 Oct 13  2018 C
drwxr-xr-x   3 root other  4 Oct 13  2018 de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-1 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.ISO8859-15 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de_DE.UTF-8 -> de
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.ISO8859-15 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 de.us-ascii
lrwxrwxrwx   1 root root   2 Oct 13  2018 de.UTF-8 -> de
drwxr-xr-x   3 root other  3 Oct 13  2018 en
drwxr-xr-x   3 root other  3 Oct 13  2018 en_US
drwxr-xr-x   3 root other  3 Oct 13  2018 en@boldquot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@quot
drwxr-xr-x   3 root other  3 Oct 13  2018 en@shaw
drwxr-xr-x   3 root other  4 Oct 13  2018 es
drwxr-xr-x   3 root other  3 Oct 13  2018 es_ES
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-1 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es_ES.UTF-8 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es.ISO8859-15 -> es
lrwxrwxrwx   1 root root   2 Oct 13  2018 es.UTF-8 -> es
drwxr-xr-x   3 root other  4 Oct 13  2018 fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.ISO8859-1 -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.ISO8859-15 -> fr
lrwxrwxrwx   1 root root   2 Oct 13  2018 fr_FR.UTF-8 -> fr
lrwxrwxrwx   1 root root

POSIX gettext() and the LANGUAGE environment variable

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_split
says (line 72)

  "For the LANGUAGE search, the value of the LANGUAGE environment
   variable shall be a list of one or more locale names separated
   by a colon (':') character."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) forcing users to specify their preferences in a user-unfriendly way, or
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what gettext expects in the LANGUAGE environment variable is
documented in
https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html

In a modern glibc system, the locale names are essentially

  C
  POSIX
  en_US.UTF-8
  de_DE.UTF-8
  fr_FR.UTF-8
  pt_BR.UTF-8
  etc.

We do NOT want a user who wants to see messages in Arabic (1st preference)
or French (2nd preference) to have to set

  LANGUAGE=ar_EG.UTF-8:fr_FR.UTF-8

We want the user to merely have to write

  LANGUAGE=ar:fr
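
For illustration, walking such a value is a matter of splitting at the
colons. The sketch below is an assumption about how a caller might do
this, not GNU gettext's code; language_element() is a hypothetical
helper, and it treats an empty element as absent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the INDEX-th colon-separated element of a
   LANGUAGE value into OUT.  Returns 1 on success, 0 if there is no such
   element, it is empty, or it does not fit into OUT. */
int language_element(const char *value, size_t index, char *out, size_t size)
{
    const char *start = value;
    for (;;) {
        size_t len = strcspn(start, ":");   /* length up to ':' or end */
        if (index == 0) {
            if (len == 0 || len >= size)
                return 0;
            memcpy(out, start, len);
            out[len] = '\0';
            return 1;
        }
        if (start[len] == '\0')
            return 0;                       /* ran out of elements */
        start += len + 1;                   /* skip past the ':' */
        index--;
    }
}
```

Each element obtained this way ("ar", then "fr") is a language
identifier in the sense of the wording proposed below, not a full locale
name.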

Suggested wording change:

  "For the LANGUAGE search, the value of the LANGUAGE environment
   variable shall be a list of one or more language identifiers.
   A language identifier is a locale name with the '.codeset' part
   removed and optionally also the territory and/or the modifiers removed.
   In the simplest case, a language identifier consists of just an ISO 639-1
   code."

Bruno

POSIX gettext() and iconv_open()

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

https://posix.rhansen.org/p/gettext_split
says (line 85)

  "The conversion shall be performed as if by a call to iconv() using a
   conversion descriptor returned by iconv_open(,
   )."

This is NOT how GNU gettext behaves. If POSIX standardizes it like this,
GNU libc and GNU gettext will have the choice among
  (a) dropping the transliteration during charset conversion of the
  messages,
  (b) violating POSIX in this point.

I will vote for (b).

Namely, what GNU gettext does is allocate an iconv descriptor that
allows transliteration. For example, when converting from ISO-8859-1 or
UTF-8 to ASCII, it would transform

  "1 Empfänger" to "1 Empfaenger"   (glibc in German locale)
                or "1 Empf\"anger"  (GNU libiconv)
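
A minimal sketch of such a transliterating conversion follows. It assumes
the "//TRANSLIT" suffix on the target codeset name, which glibc and GNU
libiconv accept but which POSIX does not specify; the exact arguments GNU
gettext passes to iconv_open() are not stated here, so treat this as an
illustration only. The function name to_ascii_translit() is hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string IN from
   FROM_CODESET to ASCII, allowing transliteration.  Returns 0 and stores
   a NUL-terminated result in OUT on success, -1 on any failure (no
   converter available, character not convertible, or OUT too small).  */
int
to_ascii_translit (const char *from_codeset, const char *in,
                   char *out, size_t outsize)
{
  assert (from_codeset != NULL && in != NULL && out != NULL);

  if (outsize == 0)
    return -1;

  /* "//TRANSLIT" is an implementation extension, not a POSIX feature.  */
  iconv_t cd = iconv_open ("ASCII//TRANSLIT", from_codeset);
  if (cd == (iconv_t) -1)
    return -1;

  char *inptr = (char *) in;    /* iconv() does not modify the input bytes */
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft = outsize - 1; /* reserve room for the terminating NUL */

  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (rc != (size_t) -1)
    /* Flush any pending shift state (a no-op for ASCII).  */
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return -1;
  *outptr = '\0';
  return 0;
}
```

What the result looks like for a given input depends on the iconv
implementation and, on glibc, on the current locale, which is exactly the
"implementation-dependent conversion quality" at issue.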

Suggested wording change:

  "The conversion shall be performed as if by a call to iconv() using a
   conversion descriptor that converts from the returned 
   to the , with an implementation-dependent conversion
   quality."

Bruno

Re: Question regarding gettext behavior on iconv failure

2021-05-03 Thread Bruno Haible via austin-group-l at The Open Group

Hi Eric,

> The example in question set up several .po files and a specific
> environment to test various pluralization/transcoding fallbacks, and
> concludes with a snippet where a string with an encoding error in
> ISO-8859-1 is output in spite of an iconv failure, rather than the
> string passed in to ngettext():
> 
> 
> n_recipients = 1;
> // The following outputs "1 Empfänger" encoded in UTF-8:
> printf("%s\n", ngettext("recipient", "recipients", n_recipients));
> 
> bind_textdomain_codeset("mail", "ASCII");
> 
> n_recipients = 1;
> // The following outputs "recipient" with the same encoding as the
> // "recipient" argument to ngettext (remember, the system is assumed
> // to not support conversion from ISO/IEC 8859-1 to ASCII):
> printf("%s\n", ngettext("recipient", "recipients", n_recipients));
> // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
> no conversion is done). I think we already agreed on considering this
> behavior a bug,

I cannot reproduce this. Find attached my (complete) test case.

GNU gettext uses iconv_open() with arguments that indicate that a conversion
that is not 1:1 (e.g. transliteration) is better than a failure.

The result thus depends on the iconv implementation. For GNU gettext
the recommended iconv implementations are:
  - on glibc systems: GNU libc,
  - otherwise: GNU libiconv.
Therefore here are the results on GNU libc (2.32) and on some other OS
(FreeBSD 13) with GNU libiconv:

With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that contains also Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语)  你好
1 Empfaenger Chinese (??,???,??)  ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语)  你好
recipient

As you can see:

  * For the first line of output, since the output encoding is UTF-8,
    iconv() never needed transliteration and never failed.

  * For the second line of output, in the first three cases, iconv()
    did transliteration, and the result was always an ASCII string.
    (The quality of glibc's transliteration of Hanzi characters to
    question marks can be debated, though.)

  * In the last case, iconv() failed, and thus GNU gettext output
    the corresponding argument to ngettext() untranslated.
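
The fallback described in the last item can be sketched as follows. This is
only an illustration of the observable behaviour, not GNU gettext's actual
code; the function name pick_plural_message() and the convention of passing
NULL for a failed conversion are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: CONVERTED is the translated and converted message,
   or NULL if the conversion failed.  On failure, return the untranslated
   argument that ngettext() would return when no translation is available:
   MSGID for n == 1, MSGID_PLURAL otherwise.  */
const char *
pick_plural_message (const char *converted,
                     const char *msgid, const char *msgid_plural,
                     unsigned long n)
{
  assert (msgid != NULL && msgid_plural != NULL);

  if (converted != NULL)
    return converted;
  return (n == 1) ? msgid : msgid_plural;
}
```

In the last case above, the conversion to ASCII fails with GNU libiconv and
n_recipients is 1, so the string printed is "recipient".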

> This raises a few questions: does the GNU gettext team agree that this
> can be considered a bug

No. Please provide a reproducible test case that produces wrong results
on an interesting platform. NetBSD 3.0 or IRIX 6.5, for example, don't
count.

Bruno
/* Preparations:
- Install locale named 'de_DE.UTF-8' (using localedef).
- Find attached mail.po
- $ mkdir -p de/LC_MESSAGES
  $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail.po
  or
  $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail-utf8.po
- $ gcc -Wall foo.c
- $ LC_ALL=de_DE.UTF-8 ./a.out
*/

#include <locale.h>
#include <libintl.h>
#include <stdio.h>

int
main ()
{
  if (setlocale (LC_ALL, "") == NULL)
    return 1;
  textdomain ("mail");
  bindtextdomain ("mail", ".");

  unsigned int n_recipients;

  n_recipients = 1;
  // The following outputs "1 Empfänger" encoded in UTF-8:
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));

  bind_textdomain_codeset("mail", "ASCII");

  n_recipients = 1;
  // The following outputs "recipient" with the same encoding as the "recipient"
  // argument to ngettext (remember, the system is assumed to not support
  // conversion from ISO/IEC 8859-1 to ASCII):
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));
  // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e. no
  // conversion is done). I think we already agreed on considering this
  // behavior a bug,
}
/*
With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that contains also Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语)  你好
1 Empfaenger Chinese (??,???,??)  ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语)  你好
recipient

*/
msgid ""
msgstr ""
"Content-Type: text/plain; charset=ISO_8859-1\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2] "keine Empfänger"
msgstr[3] "mehr als 4 Empfänger"
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger Chinese (中文,普通话,汉语)  你好"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2]

Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Bruno Haible via austin-group-l at The Open Group

Eric Blake wrote:
> I can open a defect against POSIX if we decide that is needed, but want
> some consensus first on whether it is glibc's change that went too far,
> or POSIX's requirements that are too restrictive for what glibc wants to do.

Thanks for opening the discussion, Eric.

Here are a couple of questions, to understand the motivation and the possible
alternative solutions to the problem:

1) As far as I understand, the issue occurs with certain x86 or x86_64
   processors.

   1.1) What has been the value of MINSIGSTKSZ on x86 and x86_64 so far?
   1.2) What value of MINSIGSTKSZ is needed for AVX-512F support?
   1.3) Will the trend to larger MINSIGSTKSZ values continue for Intel
        processors?

2) Regarding the change of the macro MINSIGSTKSZ:
   Would it be possible to just change the value of MINSIGSTKSZ to a larger
   constant?

   If there is a fear regarding ABI compatibility between a library and a
   program: How likely is it that a library offers an interface that takes
   a char[MINSIGSTKSZ] as argument, or that defines a variable of type
   char[MINSIGSTKSZ]?

3) Regarding the change of the macro SIGSTKSZ:
   Likewise, would it be possible to just change the value of SIGSTKSZ to a
   larger constant?

4) Since SIGSTKSZ has other uses than MINSIGSTKSZ, has it been considered
   to make MINSIGSTKSZ non-constant but keep SIGSTKSZ constant?

5) POSIX:2018 [1] defines SIGSTKSZ as the stack size for "the usual case".
   So, it should be composed of MINSIGSTKSZ for the initial stack frame,
   plus a certain amount of stack, depending on CPU, ABI, and compiler,
   for doing what a "usual" signal handler would do.

   What is the reason, then, for the computation
     SIGSTKSZ >= 4 * MINSIGSTKSZ
   in [2]? Shouldn't it be something like
     SIGSTKSZ >= MINSIGSTKSZ + (64 KB on SPARC and powerpc,
                                8 KB on other processors)
   ?
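
For reference, a minimal sketch of what an application can do when the
minimum size is only known at run time: allocate the alternate stack
dynamically instead of declaring a char[MINSIGSTKSZ] array. It assumes that
sysconf(_SC_MINSIGSTKSZ) is available where the macro is defined (glibc 2.34
and later; it is not part of POSIX:2018) and falls back to the MINSIGSTKSZ
macro otherwise. The 8 KB headroom is the figure floated in question 5 for
processors other than SPARC and powerpc, not a value taken from any
standard. The function names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed headroom for a "usual" signal handler (see question 5).  */
#define HANDLER_HEADROOM (8 * 1024)

/* Hypothetical helper: size for an alternate signal stack, computed as
   the run-time minimum plus a fixed headroom.  */
size_t
alt_stack_size (void)
{
  long min = -1;
#ifdef _SC_MINSIGSTKSZ
  /* Run-time value, where the C library provides it.  */
  min = sysconf (_SC_MINSIGSTKSZ);
#endif
  if (min <= 0)
    min = MINSIGSTKSZ;
  assert (min > 0);
  return (size_t) min + HANDLER_HEADROOM;
}

/* Hypothetical helper: allocate and install the alternate signal stack.
   Returns 0 on success, -1 on failure.  The stack is intentionally never
   freed, since it stays in use for the lifetime of the process.  */
int
install_alt_stack (void)
{
  stack_t ss;

  ss.ss_size = alt_stack_size ();
  ss.ss_sp = malloc (ss.ss_size);
  if (ss.ss_sp == NULL)
    return -1;
  ss.ss_flags = 0;
  if (sigaltstack (&ss, NULL) != 0)
    {
      free (ss.ss_sp);
      return -1;
    }
  return 0;
}
```

Code written this way does not depend on whether MINSIGSTKSZ or SIGSTKSZ
expands to a compile-time constant.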

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/sigaltstack.html
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/sysconf-sigstksz.h
