bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Paul Eggert

On 10/8/21 7:32 PM, Pádraig Brady wrote:
it's not a thousands separator, rather a grouping character,
and groups can be in 2, 3, 4, and even 5.


Sure, but 'sort' could determine the group sizes from the locale, and 
reject digit strings that are formatted improperly according to the 
group-size rules. (Not that I plan to write the code to do that.)
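
A minimal sketch of the kind of check suggested above, assuming a single fixed group size and a single-byte separator. The helper name is_well_grouped is hypothetical and not part of coreutils. Real locales can specify varying group sizes through the LC_NUMERIC grouping setting, which this sketch does not handle.

```shell
# Hypothetical helper, not part of coreutils.
# Usage: is_well_grouped STRING SEP SIZE
# Succeeds if STRING is either plain digits, or digits with SEP placed
# before every SIZE-digit group (e.g. 1,000 or 12,345,678 for ',' and 3).
# Assumes SEP is one byte and is not '^', ']' or '-'.
is_well_grouped() {
  num=$1 sep=$2 size=$3
  # Either all digits, or a leading group of 1..SIZE digits followed by
  # zero or more "SEP + exactly SIZE digits".
  printf '%s\n' "$num" |
    grep -Eq "^([0-9]+|[0-9]{1,$size}([$sep][0-9]{$size})*)\$"
}

for s in 1,000 12,345,678 0,9 1,a; do
  if is_well_grouped "$s" , 3; then
    echo "$s: well-formed grouping"
  else
    echo "$s: not a grouped number"
  fi
done
```

Under a rule like this, 0,9 and 1,a would both be rejected as grouped numbers, so only the digits before the comma would count as the numeric key.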






bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Pádraig Brady

On 08/10/2021 21:48, Paul Eggert wrote:

On 10/8/21 6:37 AM, Pádraig Brady wrote:


The difference here is due to ',' being treated as a thousands sep,
not a decimal point.


Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even
though it's obviously not one (as it's not followed by 3 decimal
digits). I don't think POSIX requires this behavior; it's not clear to
me that POSIX even allows it.


Well, in general it's not a thousands separator, rather a grouping character,
and groups can be in 2, 3, 4, and even 5.  So I don't think we should
change the logic here.


This bug report suggests that we should alter the code so that 'sort -n'
acts more like common practice, and requires thousands separators to be
in the right places in order to treat nearby digits as part of the
number. Alternatively, we could document the existing behavior (even if
it's not clear that it conforms to POSIX).


What we can do is have --debug warn when there is an overlap
in --field-separator and the grouping and decimal characters
when using numeric keys.  I'll have a look at that tomorrow.
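
A rough sketch of the overlap check described above, written as a shell pre-flight test rather than as the eventual --debug code. The function name sep_overlap and the warning wording are hypothetical. The locale values are passed in as arguments so the check itself does not depend on which locales are installed.

```shell
# Hypothetical check, not the coreutils implementation.
# Usage: sep_overlap FIELD_SEP THOUSANDS_SEP DECIMAL_POINT
# Prints a warning and returns 1 if FIELD_SEP clashes with either character.
sep_overlap() {
  tab=$1 grp=$2 dec=$3
  if [ -n "$grp" ] && [ "$tab" = "$grp" ]; then
    echo "warning: field separator '$tab' is also the grouping character" >&2
    return 1
  fi
  if [ -n "$dec" ] && [ "$tab" = "$dec" ]; then
    echo "warning: field separator '$tab' is also the decimal point" >&2
    return 1
  fi
  return 0
}

# Query the current locale and check a comma separator, as in `sort -t,`.
sep_overlap , "$(locale thousands_sep)" "$(locale decimal_point)" ||
  echo "numeric keys may be misparsed"
```

Whether the last command warns depends on the active locale: an empty thousands_sep, as in the C locale, never clashes.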

cheers,
Pádraig





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Paul Eggert

On 10/8/21 6:37 AM, Pádraig Brady wrote:


The difference here is due to ',' being treated as a thousands sep,
not a decimal point.


Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even 
though it's obviously not one (as it's not followed by 3 decimal 
digits). I don't think POSIX requires this behavior; it's not clear to 
me that POSIX even allows it.


This bug report suggests that we should alter the code so that 'sort -n' 
acts more like common practice, and requires thousands separators to be 
in the right places in order to treat nearby digits as part of the 
number. Alternatively, we could document the existing behavior (even if 
it's not clear that it conforms to POSIX).






bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Pádraig Brady

On 04/10/2021 21:01, Paul Eggert wrote:

On 10/4/21 08:58, Pádraig Brady wrote:

The --debug option points out the issue:

    $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
    sort: key 1 is numeric and spans multiple fields
    1,a
    _
    ___
    0,9
    ___
    ___


As Juncheng points out, it is a bit odd that -n and -g disagree here,
even in locales where ',' is not a decimal point. For example:

$ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
0,9
_
___
1,a
_
___
$ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,a
_
___
0,9
___
___


The difference here is due to ',' being treated as a thousands sep,
not a decimal point. So Juncheng, to specifically answer your question,
0,9 is being interpreted as 9, which sorts after 1,a (interpreted as 1). For example, consider:

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1g --debug
0,900
_
1,a
_

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1n --debug
1,a
_
0,900
_


Given the various groupings possible (depending on locale
one can group in 2, 3, 4, 5 digits) we effectively just
ignore the grouping separator in numeric mode, hence the difference.
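
To make "ignore the grouping separator" concrete, here is a small sketch that mimics the effect on a key: delete every separator, then keep the leading run of digits. The helper numeric_key is hypothetical, and this is an illustration of the behavior described above rather than the algorithm in sort itself. Signs and decimal points are left out for brevity.

```shell
# Hypothetical illustration, not the sort.c algorithm verbatim.
# Usage: numeric_key STRING SEP
# Deletes every SEP, then keeps only the leading run of digits.
# Signs and decimal points are ignored for brevity.
numeric_key() {
  printf '%s\n' "$1" | tr -d "$2" | sed 's/^\([0-9]*\).*$/\1/'
}

numeric_key 0,900 ,   # prints 0900, i.e. 900
numeric_key 1,a ,     # prints 1
numeric_key 0,9 ,     # prints 09, i.e. 9
```

With the comma dropped, 0,900 and 0,9 both compare greater than 1,a, which matches the -n ordering shown in the examples.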

Note in locales where ',' is a decimal point we do get
consistent order between -g and -n as expected:

$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__
$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__

For completeness, we do have another issue with grouping separators,
where we don't support multi-byte separators appropriately.
For example, fr_FR.utf8 uses a "narrow no-break space" (U+202F) as the separator,
which we don't support:

$ sep=$(LC_ALL=fr_FR.utf8 locale thousands_sep)
$ printf '%s\n' 0800 "0${sep}900" | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0 900
_
0800
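
As a quick way to see why this separator falls into the multi-byte case, the sketch below counts the bytes in a separator string. The helper sep_bytes is hypothetical, and the separator is spelled out as the UTF-8 bytes of U+202F so the example does not rely on fr_FR.utf8 being installed.

```shell
# Hypothetical helper: count the bytes in a separator string.
sep_bytes() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# U+202F (narrow no-break space) written as its UTF-8 bytes in octal.
nnbsp=$(printf '\342\200\257')
sep_bytes ","        # prints 1
sep_bytes "$nnbsp"   # prints 3
```

On a system with the locale available, the same helper can be applied to the value of $sep from the session above. The result then depends on the installed locale data.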



cheers,
Pádraig