bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales

2024-03-23 Thread Thomas Dreibholz

Hi,

indeed, the issue seems to be in libc. I can reproduce the problem with 
a simple C program:


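#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv)
{
   setlocale (LC_ALL, "");

   struct lconv* loc = localeconv();
   printf("Thousands Separator: <%s>\n", loc->thousands_sep);

   /* NOTE: the loop header, the declarations of f and n, and the start of
      the format string were lost in the archive. They are reconstructed
      here from the output below and are an assumption. The third #include
      of the original was lost as well; this version does not need it. */
   for(int i = 1; i <= 10000000; i *= 10) {
      double f = i;
      int    n = i;
      printf("double <%'10.0f>\tint <%'10d>\n", f, n);
   }
   return 0;
}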

Output with LC_NUMERIC=nb_NO.UTF-8:

Thousands Separator: < >
double <         1> int <         1>
double <        10> int <        10>
double <       100> int <       100>
double <     1 000> int <   1 000>
double <    10 000> int <  10 000>
double <   100 000> int < 100 000>
double < 1 000 000> int <1 000 000>
double <10 000 000> int <10 000 000>

So, for a float (%f), the output is as expected, while it is wrong for 
an integer (%d).


--
Best regards / Mit freundlichen Grüßen / Med vennlig hilsen

===
 Thomas Dreibholz

 Simula Metropolitan Centre for Digital Engineering
 Centre for Resilient Networks and Applications
 Pilestredet 52
 0167 Oslo, Norway
---
 E-Mail:dre...@simula.no
 Homepage:http://simula.no/people/dreibh
===





bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales

2024-03-23 Thread Thomas Dreibholz

Hi,

Some further debugging, using a hexdump of the printf output:

#!/bin/bash
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>" $n | hexdump -C
   done
done

The output is:

...
LC_NUMERIC=nb_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010
LC_NUMERIC=nn_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010

printf seems to insert the 3-byte UTF-8 sequence 0xe2 0x80 0xaf as the
thousands separator. This is U+202F NARROW NO-BREAK SPACE:
https://www.fileformat.info/info/unicode/char/202f/index.htm
But terminal output (tested with Konsole and XTerm) has fixed spacing, so
the "narrow space" should probably be a regular space or a regular
non-breaking space (0xc2 0xa0, HTML "&nbsp;")? Note also that LibreOffice
cannot render UTF-8 NARROW NO-BREAK SPACE correctly on screen, even with
proportional fonts, when the output of the test script is loaded as a
text file.


Screenshots for illustration:

 * Terminal output:
   
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758462/+files/Screenshot_20240322_213947.png
 * LibreOffice output:
   
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758464/+files/Screenshot_20240322_222052.png

--
Best regards / Mit freundlichen Grüßen / Med vennlig hilsen

 Thomas Dreibholz


bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales

2024-03-23 Thread Pádraig Brady

tag 69951 notabug
close 69951
stop

On 22/03/2024 20:22, Thomas Dreibholz wrote:

> Hi,
>
> I just discovered a printf bug for at least the nb_NO and nn_NO locales
> when printing numbers with thousands separator. To reproduce:
>
> #!/bin/bash
> for l in de_DE nb_NO ; do
>      echo "LC_NUMERIC=$l.UTF-8"
>      for n in 1 100 1000 10000 100000 1000000 10000000 ; do
>          LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" $n
>      done
> done
>
> The expected output of "%'10d" is a right-aligned number string with
> 10 characters.
>
> The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8
> and LC_NUMERIC=en_US.UTF-8:
>
> LC_NUMERIC=de_DE.UTF-8
> <         1>
> <       100>
> <     1.000>
> <    10.000>
> <   100.000>
> < 1.000.000>
> <10.000.000>
>
> However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the
> formatting is wrong:
>
> LC_NUMERIC=nb_NO.UTF-8
> <         1>
> <       100>
> <   1 000>
> <  10 000>
> < 100 000>
> <1 000 000>
> <10 000 000>
>
> I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04)
> as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).
>
> Under FreeBSD 14.0-RELEASE (coreutils-9.4_1), the output looks slightly
> better but is still wrong:
>
> LC_NUMERIC=nb_NO.UTF-8
> <         1>
> <       100>
> <    1 000>
> <   10 000>
> <  100 000>
> <1 000 000>
> <10 000 000>
> LC_NUMERIC=nn_NO.UTF-8
> <         1>
> <       100>
> <    1 000>
> <   10 000>
> <  100 000>
> <1 000 000>
> <10 000 000>
>
> Maybe the issue is that the thousands separator for the Norwegian
> locales is a space " ", while it is "."/"," for German/US English locales.


The issue looks to be that the thousands separator for Norwegian locales
is "NARROW NO-BREAK SPACE", or more problematically the _three_ byte
UTF-8 sequence E2 80 AF. So it looks like an issue with libc routines
counting bytes rather than characters in this case.

One suggestion is to do the alignment afterwards. For example:

$ export LC_NUMERIC=nb_NO.UTF-8
$ printf "%'.f\n" $(seq -f '1E%.f' 7) | column --table-right=1 -t
        10
       100
     1 000
    10 000
   100 000
 1 000 000
10 000 000

Actually I've just noticed that specifying the %'10.f format
does count characters and not bytes! So another solution is:

$ export LC_NUMERIC=nb_NO.UTF-8
$ printf "%'10.f\n" $(seq -f '1E%.f' 7)
        10
       100
     1 000
    10 000
   100 000
 1 000 000
10 000 000

The issue, if there is one, is in libc at least.
It would be worth checking existing glibc reports about this
and reporting if not mentioned.

cheers,
Pádraig.





bug#69532: mv's new -x option should be made orthogonal to -t/-T/default

2024-03-23 Thread Bernhard Voelker

On 3/22/24 11:22, Karel Zak wrote:
> On Wed, Mar 20, 2024 at 11:53:05PM +0100, Bernhard Voelker wrote:
>> On userland OTOH, we have broader choice.
>> Karel did his choice in util-linux for exch(1), and coreutils could expose
>> the same functionality.
>>
>> For other feature requests, we were much more reluctant in coreutils ... for
>> good reasons: feature bloat, maintainability, etc.
>>
>> So I'm asking myself what is different this time?
>> - The feature already exists -> util-linux.
>
> Note that we can move exch(1) from util-linux to coreutils if, at the
> end of the brainstorming session, the conclusion will be that mv(1) is
> a bad choice :-)

I'd be for that as well, because ...

>> I'm currently only 20:80 for adding it to mv(1).
>
> I think the functionality will be lost in the mv(1) for many users.

... that's a fair point.

The code for the functionality is in copy.c, so - as with mv.c/cp.c/install.c -
we could have an exch.c using just that part, and thus expose a clearer interface
to the users.

Have a nice day,
Berny





bug#69532: mv's new -x option should be made orthogonal to -t/-T/default

2024-03-23 Thread Bernhard Voelker

On 3/23/24 02:44, Paul Eggert wrote:

> I installed the attached patches to do the above. (Basically, the
> problem was that my earlier patches were too ambitious; these patches
> scale things back to avoid some optimizations so that mv --exchange is
> more like ordinary mv.)
>
> The first patch simplifies the code (and fixes a diagnostic to be more
> useful) without otherwise changing behavior; it's more of a refactoring.
> The second patch does the real work.


Thanks.

We should still put adding more tests on our TODO list.

Have a nice day,
Berny