bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 10/9/21 5:00 AM, Pádraig Brady wrote:
> On 09/10/2021 04:48, Paul Eggert wrote:
>> 'sort' could determine the group sizes from the locale, and reject digit strings that are formatted improperly according to the group-size rules. (Not that I plan to write the code to do that)
>
> Yes I agree that would be better, but not worth it I think as there would still be ambiguity in what was a grouping char and what was a field separator. Also that ambiguity would now vary across locales.

I don't see the ambiguity problem. The field separator is used to identify fields; once the fields are identified, the thousands separator, decimal point, etc. contribute to numeric comparison in the usual way. So it's OK (albeit confusing) for the field separator to be '.' or ',' or '-' or '0' or any other character that could be part of a number.

For example, with 'sort -t 0 -k 2,2n', the digit 0 is not part of the numeric field that is compared, and there's no ambiguity about that even though 0 is allowed in numbers. The same idea applies to 'sort -t , -k 2,2n'.
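The 'sort -t 0 -k 2,2n' case can be tried directly. The following is a minimal sketch: the function name sort_field2 and the sample lines are made up for illustration, and LC_ALL=C is assumed only to keep the result independent of the caller's locale.

```shell
# Hypothetical wrapper: split fields on the digit 0, then compare
# field 2 numerically. LC_ALL=C is an assumption for reproducibility.
sort_field2() {
  LC_ALL=C sort -t 0 -k 2,2n
}

# "x011" splits into "x" and "11"; "x09" splits into "x" and "9".
# The separator 0 is not part of either key, so 9 sorts before 11.
printf '%s\n' x011 x09 | sort_field2
# prints:
# x09
# x011
```

A plain byte-wise sort of the same two lines would put x011 first, so the output shows that the key really is the number after the separator.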
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 09/10/2021 04:48, Paul Eggert wrote:
> On 10/8/21 7:32 PM, Pádraig Brady wrote:
>> it's not a thousands separator, rather a grouping character, and groups can be in 2, 3, 4, and even 5.
>
> Sure, but 'sort' could determine the group sizes from the locale, and reject digit strings that are formatted improperly according to the group-size rules. (Not that I plan to write the code to do that)

Yes I agree that would be better, but not worth it I think as there would still be ambiguity in what was a grouping char and what was a field separator. Also that ambiguity would now vary across locales.

Another possible change which I'd prefer TBH would be to disable the grouping separator, or decimal point if they overlapped with --field-separator. Doing this would induce a warning from --debug also.

cheers,
Pádraig
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 10/8/21 7:32 PM, Pádraig Brady wrote:
> it's not a thousands separator, rather a grouping character, and groups can be in 2, 3, 4, and even 5.

Sure, but 'sort' could determine the group sizes from the locale, and reject digit strings that are formatted improperly according to the group-size rules. (Not that I plan to write the code to do that)
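To make the group-size idea concrete, here is a rough sketch of such a check. It is not code from coreutils: the function name well_grouped is hypothetical, and it hard-codes ',' as the grouping character with groups of exactly 3 digits, whereas a real implementation would take both from the locale.

```shell
# Hypothetical check: accept a digit string only if every ',' is
# followed by exactly 3 digits (assumed group size, assumed separator).
well_grouped() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(,[0-9]{3})*$'
}

well_grouped 0,900 && echo "0,900 is well grouped"
well_grouped 0,9   || echo "0,9 is not well grouped"
```

Under this rule "0,9" would be rejected as a grouped number, so the comma could no longer be absorbed into the key.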
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 08/10/2021 21:48, Paul Eggert wrote:
> On 10/8/21 6:37 AM, Pádraig Brady wrote:
>> The difference here is due to ',' being treated as a thousands sep, not a decimal point.
>
> Oh, thanks. Of course! I should have figured that out myself.
>
> It is unfortunate that "," is treated as a thousands separator even though it's obviously not one (as it's not followed by 3 decimal digits). I don't think POSIX requires this behavior; it's not clear to me that POSIX even allows it.

Well in general it's not a thousands separator, rather a grouping character, and groups can be in 2, 3, 4, and even 5. So I don't think we should change the logic here.

> This bug report suggests that we should alter the code so that 'sort -n' acts more like common practice, and requires thousands separators to be in the right places in order to treat nearby digits as part of the number. Alternatively, we could document the existing behavior (even if it's not clear that it conforms to POSIX).

What we can do is have --debug warn when there is an overlap in --field-separator and the grouping and decimal characters when using numeric keys. I'll have a look at that tomorrow.

cheers,
Pádraig
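The overlap that --debug would warn about can be approximated from the shell. This is only a sketch of the condition, not the coreutils implementation: the helper name overlaps_numeric and its argument order are invented here, and the grouping and decimal characters are passed in explicitly rather than read from the locale.

```shell
# Hypothetical helper, not part of coreutils.
# usage: overlaps_numeric SEP GROUPING DECIMAL
# Succeeds when the field separator SEP equals the locale's grouping
# character or its decimal point. An empty SEP never overlaps.
overlaps_numeric() {
  [ -n "$1" ] && { [ "$1" = "$2" ] || [ "$1" = "$3" ]; }
}

# With en_US-style values (',' grouping, '.' decimal), -t , overlaps:
overlaps_numeric , , . && echo "warning: -t , overlaps a numeric character" >&2
```

On systems that provide the locale utility, the last two arguments could be filled in with "$(locale thousands_sep)" and "$(locale decimal_point)", the same query used elsewhere in this thread for fr_FR.utf8.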
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 10/8/21 6:37 AM, Pádraig Brady wrote:
> The difference here is due to ',' being treated as a thousands sep, not a decimal point.

Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even though it's obviously not one (as it's not followed by 3 decimal digits). I don't think POSIX requires this behavior; it's not clear to me that POSIX even allows it.

This bug report suggests that we should alter the code so that 'sort -n' acts more like common practice, and requires thousands separators to be in the right places in order to treat nearby digits as part of the number. Alternatively, we could document the existing behavior (even if it's not clear that it conforms to POSIX).
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 04/10/2021 21:01, Paul Eggert wrote:
> On 10/4/21 08:58, Pádraig Brady wrote:
>> The --debug option points out the issue:
>> $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>> sort: key 1 is numeric and spans multiple fields
>> 1,a
>> _
>> ___
>> 0,9
>> ___
>> ___
>
> As Juncheng points out, it is a bit odd that -n and -g disagree here, even in locales where ',' is not a decimal point. For example:
>
> $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 0,9
> _
> ___
> 1,a
> _
> ___
> $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,a
> _
> ___
> 0,9
> ___
> ___

The difference here is due to ',' being treated as a thousands sep, not a decimal point. So Juncheng, to specifically answer your question, 0,9 is being interpreted as 9, which sorts after 1,a. For example, consider:

  $ printf '%s\n' 1,a 0,900 | sort -s -k1,1g --debug
  0,900
  _
  1,a
  _
  $ printf '%s\n' 1,a 0,900 | sort -s -k1,1n --debug
  1,a
  _
  0,900
  _

Given the various groupings possible (depending on locale one can group in 2, 3, 4, 5 digits) we effectively just ignore the grouping separator in numeric mode, hence the difference.

Note in locales where , is a decimal point we do get consistent order between -g and -n as expected:

  $ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
  sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
  0,9
  ___
  1,a
  __
  $ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
  sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
  0,9
  ___
  1,a
  __

For completeness we do have another issue with grouping separators, where we don't support multi-byte separators appropriately. For example, fr_FR.utf8 uses "narrow non breaking space" as the separator, which we don't support:

  $ sep=$(LC_ALL=fr_FR.utf8 locale thousands_sep)
  $ printf '%s\n' 0800 "0${sep}900" | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
  sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
  0 900
  _
  0800

cheers,
Pádraig
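The effect of ignoring the grouping separator can be imitated without an en_US locale installed. The sketch below is an illustration only, not how sort is implemented: the function name sort_ignoring_commas is hypothetical, ',' is assumed to be the grouping character, and awk is used to compute the number that -n would effectively see.

```shell
# Hypothetical emulation: delete every ',' from the line, take the
# leading number of what remains, and sort on that value.
sort_ignoring_commas() {
  tab=$(printf '\t')
  # Prefix each line with its comma-free numeric value, then a tab.
  awk '{ k = $0; gsub(/,/, "", k); printf "%d\t%s\n", k + 0, $0 }' |
    LC_ALL=C sort -s -t "$tab" -k1,1n |
    cut -f2-
}

# "1,a" becomes "1a" (value 1); "0,9" becomes "09" (value 9).
printf '%s\n' 1,a 0,9 | sort_ignoring_commas
# prints:
# 1,a
# 0,9
```

This reproduces the order reported for -n in en_US.UTF-8: once the comma is dropped, 0,9 reads as 9 and lands after 1,a.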
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
Thank you, Paul and Padraig!

May I ask why, when it fails to sort numerically, 1,a comes before 0,9? I could not come up with an ordering in which 1,a is smaller.

Best,
Jason

> On Oct 4, 2021, at 4:01 PM, Paul Eggert wrote:
>
> On 10/4/21 08:58, Pádraig Brady wrote:
>> The --debug option points out the issue:
>> $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>> sort: key 1 is numeric and spans multiple fields
>> 1,a
>> _
>> ___
>> 0,9
>> ___
>> ___
>
> As Juncheng points out, it is a bit odd that -n and -g disagree here, even in locales where ',' is not a decimal point. For example:
>
> $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 0,9
> _
> ___
> 1,a
> _
> ___
> $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,a
> _
> ___
> 0,9
> ___
> ___
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 10/4/21 08:58, Pádraig Brady wrote:
> The --debug option points out the issue:
> $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
> sort: key 1 is numeric and spans multiple fields
> 1,a
> _
> ___
> 0,9
> ___
> ___

As Juncheng points out, it is a bit odd that -n and -g disagree here, even in locales where ',' is not a decimal point. For example:

  $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
  sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
  sort: key 1 is numeric and spans multiple fields
  0,9
  _
  ___
  1,a
  _
  ___
  $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
  sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
  sort: key 1 is numeric and spans multiple fields
  1,a
  _
  ___
  0,9
  ___
  ___
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
tag 51011 notabug
close 51011
stop

On 04/10/2021 15:36, Juncheng Yang wrote:
> Hi coreutils developers,
> I have encountered a bug in GNU sort in which sort produces incorrect results when performing a numerical sort with delimiters. For example, sort -nk1 -t , file cannot sort a file with the following lines (sort by the first column numerically)
> 1,a
> 0,9
>
> I have tried multiple versions including the latest version; this problem still exists.

The --debug option points out the issue:

  $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
  sort: key 1 is numeric and spans multiple fields
  1,a
  _
  ___
  0,9
  ___
  ___

So you want -k1,1n

cheers,
Pádraig
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On Mon, 4 Oct 2021 10:36:52 -0400, Juncheng Yang wrote:
> Hi coreutils developers,
> I have encountered a bug in GNU sort in which sort produces incorrect results when performing a numerical sort with delimiters. For example, sort -nk1 -t , file cannot sort a file with the following lines (sort by the first column numerically)
> 1,a
> 0,9
>
> I have tried multiple versions including the latest version; this problem still exists.

Works for me with

  sort -t, -k1,1n

Keep in mind that with just "-k1" you're effectively telling sort to consider fields from the first up to the last (i.e. the whole line), not just the first one.

--
D.
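The difference between "-k1" and "-k1,1" can be checked with the two lines from the report. A small sketch follows; the wrapper name first_field_numeric is made up for illustration.

```shell
# Hypothetical wrapper: the key starts AND ends at field 1, so only
# the text before the first ',' is compared numerically.
first_field_numeric() {
  sort -t, -k1,1n
}

printf '%s\n' 1,a 0,9 | first_field_numeric
# prints:
# 0,9
# 1,a
```

Because the key stops at the end of field 1, the comma never reaches the numeric comparison, and the keys are simply 1 and 0 regardless of whether the locale treats ',' as a grouping character.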
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
Hi coreutils developers,

I have encountered a bug in GNU sort in which sort produces incorrect results when performing a numerical sort with delimiters. For example, sort -nk1 -t , file cannot sort a file with the following lines (sort by the first column numerically)

1,a
0,9

I have tried multiple versions including the latest version; this problem still exists.

Best,
Juncheng