bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-09 Thread Paul Eggert

On 10/9/21 5:00 AM, Pádraig Brady wrote:

On 09/10/2021 04:48, Paul Eggert wrote:



'sort' could determine the group sizes from the locale, and
reject digit strings that are formatted improperly according to the
group-size rules. (Not that I plan to write the code to do that)


Yes I agree that would be better, but not worth it I think
as there would still be ambiguity in what was a grouping char
and what was a field separator. Also that ambiguity would
now vary across locales.


I don't see the ambiguity problem. The field separator is used to 
identify fields; once the fields are identified, the thousands 
separator, decimal point, etc. contribute to numeric comparison in the 
usual way. So it's OK (albeit confusing) for the field separator to be 
'.' or ',' or '-' or '0' or any other character that could be part of 
a number.


For example, with 'sort -t 0 -k 2,2n', the digit 0 is not part of the 
numeric field that is compared, and there's no ambiguity about that even 
though 0 is allowed in numbers. The same idea applies to 'sort -t , -k 
2,2n'.
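
A small sketch of that point, assuming GNU sort and forcing LC_ALL=C so the result does not depend on the reader's locale. The input lines and variable names are made up for illustration and do not come from the bug report:

```shell
# Field separator '0': "a05" splits into the fields "a" and "5", so key 2
# compares 5 against 3 and the separator never reaches the number.
zero_sep=$(printf '%s\n' a05 a03 | LC_ALL=C sort -t 0 -k 2,2n)
printf '%s\n' "$zero_sep"     # a03, then a05

# Field separator ',': key 2 is "10" or "9"; the comma is consumed by
# field splitting before any numeric comparison happens.
comma_sep=$(printf '%s\n' x,10 x,9 | LC_ALL=C sort -t , -k 2,2n)
printf '%s\n' "$comma_sep"    # x,9, then x,10
```

In both runs the key is limited to a single field (2,2), which is what keeps the separator out of the compared text.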






bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-09 Thread Pádraig Brady

On 09/10/2021 04:48, Paul Eggert wrote:

On 10/8/21 7:32 PM, Pádraig Brady wrote:

it's not a thousands separator, rather a grouping character,
and groups can be in 2, 3, 4, and even 5.


Sure, but 'sort' could determine the group sizes from the locale, and
reject digit strings that are formatted improperly according to the
group-size rules. (Not that I plan to write the code to do that)


Yes I agree that would be better, but not worth it I think
as there would still be ambiguity in what was a grouping char
and what was a field separator. Also that ambiguity would
now vary across locales.

Another possible change which I'd prefer TBH
would be to disable the grouping separator, or decimal point
if they overlapped with --field-separator.
Doing this would induce a warning from --debug also.
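
As a rough illustration of the overlap being discussed, the check could look something like the sketch below. check_field_sep is a hypothetical helper and not part of coreutils; it only assumes that the locale(1) utility is available and reports the thousands_sep and decimal_point keywords, as used elsewhere in this thread:

```shell
# Hypothetical helper, not sort's actual code: report whether a chosen
# field separator collides with the locale's numeric punctuation.
check_field_sep() {
    sep=$1
    grouping=$(locale thousands_sep)
    decimal=$(locale decimal_point)
    if [ -n "$grouping" ] && [ "$sep" = "$grouping" ]; then
        echo "warning: field separator '$sep' is also the grouping character"
    elif [ "$sep" = "$decimal" ]; then
        echo "warning: field separator '$sep' is also the decimal point"
    else
        echo "ok"
    fi
}

# The result depends on the current locale settings.
check_field_sep ,
```

In the C locale the grouping character is empty, so only the decimal point can collide there.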

cheers,
Pádraig





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Paul Eggert

On 10/8/21 7:32 PM, Pádraig Brady wrote:
it's not a thousands separator, rather a grouping character,
and groups can be in 2, 3, 4, and even 5.


Sure, but 'sort' could determine the group sizes from the locale, and 
reject digit strings that are formatted improperly according to the 
group-size rules. (Not that I plan to write the code to do that)
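
To make the idea concrete, here is a minimal sketch of such a check. It hard-codes ',' as the grouping character and a group size of three digits, which is an assumption for illustration only; the proposal above is to take the group sizes from the locale. is_well_grouped is a hypothetical name, not an existing function:

```shell
# Hypothetical check: accept either plain digits or digits grouped in
# threes with ','. Real locales may use other characters and group sizes.
is_well_grouped() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]+|[0-9]{1,3}(,[0-9]{3})+)$'
}

for n in 1,000 12,345,678 0,9 1234; do
    if is_well_grouped "$n"; then
        echo "$n: accepted"
    else
        echo "$n: rejected"
    fi
done
```

Under this rule "0,9" is rejected as a number, because the comma is not followed by a full group of three digits.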






bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Pádraig Brady

On 08/10/2021 21:48, Paul Eggert wrote:

On 10/8/21 6:37 AM, Pádraig Brady wrote:


The difference here is due to ',' being treated as a thousands sep,
not a decimal point.


Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even
though it's obviously not one (as it's not followed by 3 decimal
digits). I don't think POSIX requires this behavior; it's not clear to
me that POSIX even allows it.


Well in general it's not a thousands separator, rather a grouping character,
and groups can be in 2, 3, 4, and even 5.  So I don't think we should
change the logic here.


This bug report suggests that we should alter the code so that 'sort -n'
acts more like common practice, and requires thousands separators to be
in the right places in order to treat nearby digits as part of the
number. Alternatively, we could document the existing behavior (even if
it's not clear that it conforms to POSIX).


What we can do is have --debug warn when there is an overlap
in --field-separator and the grouping and decimal characters
when using numeric keys.  I'll have a look at that tomorrow.

cheers,
Pádraig





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Paul Eggert

On 10/8/21 6:37 AM, Pádraig Brady wrote:


The difference here is due to ',' being treated as a thousands sep,
not a decimal point.


Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even 
though it's obviously not one (as it's not followed by 3 decimal 
digits). I don't think POSIX requires this behavior; it's not clear to 
me that POSIX even allows it.


This bug report suggests that we should alter the code so that 'sort -n' 
acts more like common practice, and requires thousands separators to be 
in the right places in order to treat nearby digits as part of the 
number. Alternatively, we could document the existing behavior (even if 
it's not clear that it conforms to POSIX).






bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-08 Thread Pádraig Brady

On 04/10/2021 21:01, Paul Eggert wrote:

On 10/4/21 08:58, Pádraig Brady wrote:

The --debug option points out the issue:

    $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
    sort: key 1 is numeric and spans multiple fields
    1,a
    _
    ___
    0,9
    ___
    ___


As Juncheng points out, it is a bit odd that -n and -g disagree here,
even in locales where ',' is not a decimal point. For example:

$ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
0,9
_
___
1,a
_
___
$ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,a
_
___
0,9
___
___


The difference here is due to ',' being treated as a thousands sep,
not a decimal point. So Juncheng, to specifically answer your question,
0,9 is being interpreted as 9, which sorts after 1,a. For example, consider:

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1g --debug
0,900
_
1,a
_

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1n --debug
1,a
_
0,900
_


Given the various groupings possible (depending on locale
one can group in 2, 3, 4, 5 digits) we effectively just
ignore the grouping separator in numeric mode, hence the difference.
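
A rough model of that behaviour, for a locale where ',' is the grouping character. This is only a sketch and not sort's real code; numeric_prefix is a hypothetical name. It takes the leading run of digits and commas and then drops the commas:

```shell
# Rough model of what -n sees at the start of a key when ',' is the
# locale's grouping character: leading digits with the commas ignored.
numeric_prefix() {
    printf '%s\n' "$1" | sed -E 's/^([0-9,]*).*$/\1/' | tr -d ','
}

numeric_prefix 1,a      # prints 1
numeric_prefix 0,900    # prints 0900, i.e. 900
numeric_prefix 0,9      # prints 09, i.e. 9
```

Since 1 is less than both 9 and 900, "1,a" sorts first under -n in such a locale.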

Note in locales where , is a decimal point we do get
consistent order between -g and -n as expected:

$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__
$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__

For completeness we do have another issue with grouping separators,
where we don't support multi-byte separators appropriately.
For example, fr_FR.utf8 uses "narrow no-break space" (U+202F) as the separator,
which we don't support:

$ sep=$(LC_ALL=fr_FR.utf8 locale thousands_sep)
$ printf '%s\n' 0800 "0${sep}900" | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0 900
_
0800



cheers,
Pádraig





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-04 Thread Juncheng Yang
Thank you, Paul and Padraig! 
May I ask why, when it fails to sort numerically, 1,a comes before 0,9? I could 
not come up with an ordering in which 1,a is smaller. 


Best, 
Jason 


> On Oct 4, 2021, at 4:01 PM, Paul Eggert wrote:
> 
> On 10/4/21 08:58, Pádraig Brady wrote:
>> The --debug option points out the issue:
>>   $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>>   sort: key 1 is numeric and spans multiple fields
>>   1,a
>>   _
>>   ___
>>   0,9
>>   ___
>>   ___
> 
> As Juncheng points out, it is a bit odd that -n and -g disagree here, even in 
> locales where ',' is not a decimal point. For example:
> 
> $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 0,9
> _
> ___
> 1,a
> _
> ___
> $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,a
> _
> ___
> 0,9
> ___
> ___






bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-04 Thread Paul Eggert

On 10/4/21 08:58, Pádraig Brady wrote:

The --debug option points out the issue:

   $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
   sort: key 1 is numeric and spans multiple fields
   1,a
   _
   ___
   0,9
   ___
   ___


As Juncheng points out, it is a bit odd that -n and -g disagree here, 
even in locales where ',' is not a decimal point. For example:


$ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
0,9
_
___
1,a
_
___
$ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,a
_
___
0,9
___
___





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-04 Thread Pádraig Brady

tag 51011 notabug
close 51011
stop

On 04/10/2021 15:36, Juncheng Yang wrote:

Hi coreutils developers,
I have encountered a bug in GNU sort in which sort produces incorrect
results when sorting numerically with a delimiter. For example,
sort -nk1 -t , file
cannot sort a file with the following lines (sorting by the first column
numerically):
1,a
0,9

I have tried multiple versions, including the latest one; the problem still
exists.


The --debug option points out the issue:

  $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
  sort: key 1 is numeric and spans multiple fields
  1,a
  _
  ___
  0,9
  ___
  ___


So you want -k1,1n

cheers,
Pádraig





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-04 Thread Davide Brini
On Mon, 4 Oct 2021 10:36:52 -0400, Juncheng Yang wrote:

> Hi coreutils developers,
> I have encountered a bug in GNU sort in which sort produces incorrect
> results when sorting numerically with a delimiter. For example, sort -nk1 -t ,
> file cannot sort a file with the following lines (sorting by the first
> column numerically):
> 1,a
> 0,9
>
> I have tried multiple versions, including the latest one; the problem
> still exists.

Works for me with

sort -t, -k1,1n

Keep in mind that with just "-k1" you're effectively telling sort to
consider fields from the first up to the last (i.e. the whole line), not just
the first one.
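
A minimal sketch of the restricted key on the two lines from the report, assuming GNU sort. LC_ALL=C is forced here only so that the outcome does not depend on the reader's locale; the variable name is illustrative:

```shell
# -k 1,1n limits the numeric key to field 1, so the text after the first
# comma is never read as part of the number.
sorted=$(printf '%s\n' 1,a 0,9 | LC_ALL=C sort -t , -k 1,1n)
printf '%s\n' "$sorted"    # 0,9, then 1,a
```

With the key written as -k1 the comparison would instead start at field 1 and run to the end of the line.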


--
D.





bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)

2021-10-04 Thread Juncheng Yang
Hi coreutils developers, 
I have encountered a bug in GNU sort in which sort produces incorrect
results when sorting numerically with a delimiter. For example,
sort -nk1 -t , file
cannot sort a file with the following lines (sorting by the first column
numerically):
1,a
0,9

I have tried multiple versions, including the latest one; the problem still
exists.


Best, 
Juncheng