Bug#1061103: locale charset not respected

Zefram Thu, 18 Jan 2024 03:45:22 -0800

Package: unicode
Version: 2.8-1.1
Severity: normal

unicode(1) documents that, by default, it produces output in the charset
nominated by the user's locale.  (This is in the documentation of the "-i"
option, which can be used to specify an output charset; it specifically
says that with a "properly set up locale" the option should not be
needed.)  In fact it does not reliably respect the environmental locale;
whether it uses it depends partly on which environment variables are being
used to specify it and partly on which locale is specified.  (I'm not
clear on what the logic actually is.)  If it doesn't use the locale then,
empirically, it outputs in UTF-8, which is not a safe default.
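
For reference, the charset part of the environmental locale is conventionally
resolved with LC_ALL taking precedence over LC_CTYPE, and LC_CTYPE taking
precedence over LANG.  The following is a minimal shell sketch of that lookup.
The function name effective_ctype_locale is hypothetical and this is not
unicode(1)'s own code.  The fallback to C when nothing is set is an assumption
based on glibc's behaviour, which matches what "env - locale charmap" reports
in this report.

```shell
# Hypothetical helper: print the name of the locale that governs LC_CTYPE
# (and hence the charset), using the precedence LC_ALL > LC_CTYPE > LANG.
# Empty values are treated the same as unset ones.
effective_ctype_locale() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: nothing set means the C locale, as on glibc.
        printf '%s\n' C
    fi
}
```

Under this rule LANG=C, LC_CTYPE=C and LC_ALL=C all name the same locale, so
a program that follows it would be expected to behave identically in all three
cases.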


If no locale-relevant environment variables are set, meaning that the
environmental locale is C, then unicode(1) doesn't respect the locale,
and outputs in UTF-8:

$ env - locale charmap
ANSI_X3.4-1968
$ env - unicode --brief -x 2603 | od -tx1
0000000 e2 98 83 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d
0000020 41 4e 0a
0000023

If the C locale is explicitly set in the LANG environment variable,
then unicode(1) doesn't respect the locale, and outputs in UTF-8:

$ env - LANG=C locale charmap
ANSI_X3.4-1968
$ env - LANG=C unicode --brief -x 2603 | od -tx1
0000000 e2 98 83 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d
0000020 41 4e 0a
0000023

But if a Latin-1 locale is set in the LANG environment variable, then
unicode(1) does respect it:

$ env - LANG=de_DE.iso88591 locale charmap
ISO-8859-1
$ env - LANG=de_DE.iso88591 unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021

Specifying the charset aspect of a locale using the LC_CTYPE environment
variable produces the same locale-dependent results as using LANG:

$ env - LC_CTYPE=C locale charmap
ANSI_X3.4-1968
$ env - LC_CTYPE=C unicode --brief -x 2603 | od -tx1
0000000 e2 98 83 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d
0000020 41 4e 0a
0000023
$ env - LC_CTYPE=de_DE.iso88591 locale charmap
ISO-8859-1
$ env - LC_CTYPE=de_DE.iso88591 unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021

If a locale is specified using the LC_ALL environment variable, however,
then it seems to always be respected:

$ env - LC_ALL=C locale charmap
ANSI_X3.4-1968
$ env - LC_ALL=C unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021
$ env - LC_ALL=de_DE.iso88591 locale charmap
ISO-8859-1
$ env - LC_ALL=de_DE.iso88591 unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021
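
The cases above can be rerun in one pass with a loop along these lines.
This is only a sketch: the helper name classify_snowman is hypothetical, it
judges the encoding solely from the leading bytes of the output, and the
loop is skipped where unicode(1) is not installed.

```shell
# Hypothetical helper: read the tool's output on stdin and report how
# U+2603 was emitted, judging only by the leading bytes.
classify_snowman() {
    case "$(od -An -tx1 | tr -d ' \n')" in
        e29883*) echo "utf-8" ;;
        3f*)     echo "replaced with ?" ;;
        *)       echo "other" ;;
    esac
}

# Run the matrix only where unicode(1) is available.
if command -v unicode >/dev/null 2>&1; then
    for setting in "" LANG=C LANG=de_DE.iso88591 \
                   LC_CTYPE=C LC_CTYPE=de_DE.iso88591 \
                   LC_ALL=C LC_ALL=de_DE.iso88591; do
        # $setting is deliberately unquoted so that the empty case
        # passes no argument at all to env.
        printf '%-28s %-16s %s\n' "${setting:-(nothing set)}" \
            "$(env - $setting locale charmap)" \
            "$(env - $setting unicode --brief -x 2603 | classify_snowman)"
    done
fi
```

Each output line pairs the charset that locale(1) reports with what
unicode(1) actually emitted, so any row showing ANSI_X3.4-1968 next to
"utf-8" is an instance of the bug.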

The impact of this bug is that for many reasonable setups, with the locale
correctly described in the environment and matching the actual output
device's capabilities, unicode(1) generates output containing spurious
control characters that not only fail to display what was intended but
also sometimes screw up the display of other parts of the output.
I'm specifically seeing that result with a Latin-1 terminal emulator
and the C locale.
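
To illustrate where the control characters come from: a Latin-1 terminal
treats each byte as one character, so the three UTF-8 bytes for U+2603
(e2 98 83) are read as U+00E2 followed by U+0098 and U+0083, and the latter
two fall in the C1 control range.  A small sketch, assuming iconv(1) is
available, makes that reinterpretation visible by converting the bytes as
if they were ISO-8859-1 text:

```shell
# Reinterpret the UTF-8 encoding of U+2603 as ISO-8859-1 text, then dump
# the resulting code points re-encoded as UTF-8.
printf '\342\230\203' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

The dump shows the byte pairs c3 a2, c2 98 and c2 83, which are the UTF-8
encodings of U+00E2, U+0098 and U+0083 respectively.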

-zefram
