bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales

2024-03-23 Thread Thomas Dreibholz

Hi,

indeed, the issue seems to be in libc. I can reproduce the problem with 
a simple C program:


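#include <locale.h>
#include <stdio.h>

int main(int argc, char** argv)
{
   setlocale (LC_ALL, "");

   struct lconv* loc = localeconv();
   printf("Thousands Separator: <%s>\n", loc->thousands_sep);

   /* Print 1, 10, ..., 10000000 both as double and as int */
   for(int i = 1; i <= 10000000; i *= 10) {
      const double f = i;
      const int    n = i;
      printf("double <%'10.0f>\tint <%'10d>\n", f, n);
   }
   return 0;
}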

Output with LC_NUMERIC=nb_NO.UTF-8:

Thousands Separator: < >
double <         1> int <         1>
double <        10> int <        10>
double <       100> int <       100>
double <     1 000> int <   1 000>
double <    10 000> int <  10 000>
double <   100 000> int < 100 000>
double < 1 000 000> int <1 000 000>
double <10 000 000> int <10 000 000>

So, for a float (%f), the output is as expected, while it is wrong for 
an integer (%d).


--
Best regards / Mit freundlichen Grüßen / Med vennlig hilsen

===
 Thomas Dreibholz

 Simula Metropolitan Centre for Digital Engineering
 Centre for Resilient Networks and Applications
 Pilestredet 52
 0167 Oslo, Norway
---
 E-Mail:dre...@simula.no
 Homepage:http://simula.no/people/dreibh
===





bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales

2024-03-23 Thread Thomas Dreibholz

Hi,

some further debugging, using a hexdump of the printf output:

#!/bin/bash
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>" $n | hexdump -C
   done
done

The output is:

...
LC_NUMERIC=nb_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010
LC_NUMERIC=nn_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010

printf seems to insert the 3-byte UTF-8 sequence 0xe2 0x80 0xaf as 
thousands separator. This is U+202F NARROW NO-BREAK SPACE -> 
https://www.fileformat.info/info/unicode/char/202f/index.htm . The 
field is apparently padded to 10 bytes rather than 10 characters, so 
each separator makes the output two columns too narrow. In addition, 
terminal output (tested with Konsole and XTerm) has fixed spacing, so 
the "narrow space" should probably be a regular space or a regular 
non-breaking space (0xc2 0xa0, HTML "&nbsp;")? Note that LibreOffice 
also cannot render U+202F NARROW NO-BREAK SPACE correctly, even with 
proportional fonts, when loading the output of the test script as a 
text file.
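
The byte-versus-character mismatch described above can be sketched in C. 
This is only a minimal illustration, not what libc or coreutils actually 
do: the helper names utf8_codepoints, pad_for_width and 
print_right_aligned are hypothetical, and the sketch assumes valid UTF-8 
input in which every code point occupies exactly one terminal column 
(true for digits and U+202F, but not in general).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count UTF-8 code points: every byte that is not a continuation
 * byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: number of padding spaces needed so that s
 * fills `width` code points (assumes one column per code point). */
size_t pad_for_width(const char *s, size_t width)
{
    size_t cps = utf8_codepoints(s);
    return cps < width ? width - cps : 0;
}

/* Hypothetical helper: print s right-aligned in a field of `width`
 * code points, enclosed in angle brackets like the test script. */
void print_right_aligned(const char *s, size_t width)
{
    printf("<%*s%s>\n", (int)pad_for_width(s, width), "", s);
}
```

For the string "1" + U+202F + "000", strlen() reports 7 bytes while 
utf8_codepoints() reports 5 code points, so a 10-column field needs 5 
padding spaces instead of the 3 seen in the hexdump.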


Screenshots for illustration:

 * Terminal output:
   
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758462/+files/Screenshot_20240322_213947.png
 * LibreOffice output:
   
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758464/+files/Screenshot_20240322_222052.png



bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales

2024-03-22 Thread Thomas Dreibholz

Hi,

I just discovered a printf bug for at least the nb_NO and nn_NO locales 
when printing numbers with thousands separator. To reproduce:


#!/bin/bash
for l in de_DE en_US nb_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" $n
   done
done

The expected output of "%'10d" is a right-aligned number string that is 
10 characters wide.


The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8 
and LC_NUMERIC=en_US.UTF-8:


LC_NUMERIC=de_DE.UTF-8
<         1>
<       100>
<     1.000>
<    10.000>
<   100.000>
< 1.000.000>
<10.000.000>
LC_NUMERIC=en_US.UTF-8
<         1>
<       100>
<     1,000>
<    10,000>
<   100,000>
< 1,000,000>
<10,000,000>

However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the 
formatting is wrong:


LC_NUMERIC=nb_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>

I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04) 
as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).


Under FreeBSD 14.0-RELEASE (coreutils-9.4_1), the output looks slightly 
better but is still wrong:


LC_NUMERIC=nb_NO.UTF-8
<         1>
<       100>
<    1 000>
<   10 000>
<  100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
<         1>
<       100>
<    1 000>
<   10 000>
<  100 000>
<1 000 000>
<10 000 000>

Maybe the issue is that the thousands separator for the Norwegian 
locales is a space " ", while it is "."/"," for the German/US English 
locales.
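
One way to check this guess would be to dump the bytes of the locale's 
thousands separator. The following is a minimal sketch, assuming a C99 
compiler; hex_bytes and report_thousands_sep are hypothetical helper 
names, not part of libc or coreutils.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s as space-separated hex
 * pairs into out (truncated if out is too small); returns the number
 * of bytes in s. */
size_t hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t len = strlen(s);
    size_t pos = 0;
    if (out_size > 0)
        out[0] = '\0';
    /* Each step writes at most 3 characters plus the terminating NUL. */
    for (size_t i = 0; i < len && pos + 4 <= out_size; i++) {
        pos += (size_t)snprintf(out + pos, out_size - pos, "%s%02x",
                                i ? " " : "", (unsigned char)s[i]);
    }
    return len;
}

/* Hypothetical helper: report the thousands separator of the locale
 * selected by the environment (e.g. LC_NUMERIC). */
void report_thousands_sep(void)
{
    char hex[64];
    setlocale(LC_ALL, "");
    const struct lconv *loc = localeconv();
    size_t len = hex_bytes(loc->thousands_sep, hex, sizeof hex);
    printf("thousands_sep: %zu byte(s): %s\n", len, hex);
}
```

Calling report_thousands_sep() under different LC_NUMERIC settings would 
show whether the separator is a single-byte character such as "." or 
",", or a multi-byte sequence.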

