bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-31 Thread Paul Eggert

Thank you for working on this. Your points are well taken. One tiny comment:


+  if (basic_numeric_field)
+{
+  if (thousands_sep_ignored)


This might be better combined as "if (basic_numeric_field && 
thousands_sep_ignored)", so that it's more similar to the previous "if".






bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-31 Thread Pádraig Brady

On 11/10/2021 02:54, Pádraig Brady wrote:

On 11/10/2021 00:34, Paul Eggert wrote:

The warnings look good, except that this one:


     $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
     sort: numbers use ‘.’ as a decimal point in this locale
     0.9
     ___
     1.0
     ___


seems overkill if we're in the C locale.

Also, shouldn't similar diagnostics be generated if the field separator
is '-', or '+', or a digit in the current locale?


Yes this is more informational than a warning.
As Bernhard mentioned it may be useful to tag
--debug messages as informational or warnings.

In this case it would be info:
but would change to warn: if (tab == decimal_point).

The reason for the info message is that it may not
be obvious to users that numeric comparison
depends on locale just like text,
and we already provide the informational
text comparison message indicating the current locale.
We would only show this info: message if doing numeric sorting.


Addressing your '+' and '-' comment.
Yes they may also be used as field separators and
so are worth explicitly warning about.

Re warning about using digits in --field-separator,
that would be extremely edge case, and anyway
the --debug key marking should make it apparent
the extent of the numbers being compared.
The same argument can be made for other characters possible in numbers like;
  1E+4 nan, Infinity, 0xabcde.fp-3, etc.

As a related issue, I also thought it appropriate to warn
when we're ignoring multi-byte grouping chars in the locale.

The new warnings in this update are:

  $ LC_ALL=fr_FR.utf8 sort -n --debug /dev/null
  sort: the multi-byte number group separator in this locale is not supported

  $ sort --debug -t- -k1n /dev/null
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘-’ is treated as a minus sign in numbers

  $ sort --debug -t+ -k1g /dev/null
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘+’ is treated as a plus sign in numbers

I'll apply this later.

cheers,
Pádraig
>From 9e89c5f3e9c5652df2cfbf46b3e1f9608cb0092f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= 
Date: Sun, 10 Oct 2021 18:35:59 +0100
Subject: [PATCH] sort: --debug: add warnings about sign, radix, and grouping
 chars
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New warnings are added related to the handling
of thousands grouping characters, and decimal points.
Examples now diagnosed are:

$ printf '0,9\n1,a\n' | sort -nk1 --debug -t, -s
sort: key 1 is numeric and spans multiple fields
sort: field separator ‘,’ is treated as a group separator in numbers
1,a
_
0,9
___

$ printf '1,a\n0,9\n' | LC_ALL=fr_FR.utf8 sort -gk1 --debug -t, -s
sort: key 1 is numeric and spans multiple fields
sort: field separator ‘,’ is treated as a decimal point in numbers
0,9
___
1,a
__

$ printf '1.0\n0.9\n' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
sort: note numbers use ‘,’ as a decimal point in this locale
0.9
_
1.0
_

$ LC_ALL=fr_FR.utf8 sort -n --debug /dev/null
sort: text ordering performed using ‘fr_FR.utf8’ sorting rules
sort: note numbers use ‘,’ as a decimal point in this locale
sort: the multi-byte number group separator in this locale \
  is not supported

$ sort --debug -t- -k1n /dev/null
sort: key 1 is numeric and spans multiple fields
sort: field separator ‘-’ is treated as a minus sign in numbers
sort: note numbers use ‘.’ as a decimal point in this locale

$ sort --debug -t+ -k1g /dev/null
sort: key 1 is numeric and spans multiple fields
sort: field separator ‘+’ is treated as a plus sign in numbers
sort: note numbers use ‘.’ as a decimal point in this locale

* src/sort.c (key_warnings): Add the warnings above.
* tests/misc/sort-debug-warn.sh: Add test cases.
* NEWS: Mention the improvement.
Addresses https://bugs.gnu.org/51011
---
 NEWS  |  7 ++-
 src/sort.c| 90 ++-
 tests/misc/sort-debug-warn.sh | 51 +++-
 3 files changed, 144 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index 086da03ae..1097f0c32 100644
--- a/NEWS
+++ b/NEWS
@@ -11,10 +11,15 @@ GNU coreutils NEWS-*- outline -*-
 ** Changes in behavior
 
   timeout --foreground --kill-after=... will now exit with status 137
-  if the kill signal was sent, which is consistent with the behaviour
+  if the kill signal was sent, which is consistent with the behavior
   when the --foreground option is not specified.  This allows users to
   distinguish if the command was more forcefully terminated.
 
+** Improvements
+
+  sort --debug now diagnoses issues with --field-separator characters
+  that conflict with characters possibly used in numbers.
+
 
 * Noteworthy changes in release 9.0 (2021-09-24) [stable]
 
diff --git a/src/sort.c b/src/sort.c
index d32dcfb02..c75281661 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -156,6 +156,8 @@ static char decimal_point;
 
 /* 

bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-10 Thread Pádraig Brady

On 11/10/2021 00:34, Paul Eggert wrote:

The warnings look good, except that this one:


    $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
    sort: numbers use ‘.’ as a decimal point in this locale
    0.9
    ___
    1.0
    ___


seems overkill if we're in the C locale.

Also, shouldn't similar diagnostics be generated if the field separator
is '-', or '+', or a digit in the current locale?


Yes this is more informational than a warning.
As Bernhard mentioned it may be useful to tag
--debug messages as informational or warnings.

In this case it would be info:
but would change to warn: if (tab == decimal_point).

The reason for the info message is that it may not
be obvious to users that numeric comparison
depends on locale just like text,
and we already provide the informational
text comparison message indicating the current locale.
We would only show this info: message if doing numeric sorting.


+  if (numeric_field_span)
+{
+  char sep_string[2] = { 0, };
+  sep_string[0] = thousands_sep;
+  if ((tab == TAB_DEFAULT
+   && (isblank (to_uchar (thousands_sep
+  || tab == thousands_sep)
+{
+  error (0, 0,
+ _("field separator %s is treated as a "
+   "group separator in numbers"),
+ quote (sep_string));
+  number_locale_warned = true;
+}
+}


This code brought it to my attention that the GNU 'sort' has had a
longstanding bug (in code that I wrote long ago - sorry!) in that
thousands_sep is either -1 or an unsigned char converted to int, and
this doesn't work in some unusual cases. I installed the attached patch
to fix that bug, and I vaguely suspect that it fixes similar bugs in GNU
'test' and GNU 'expr'. Good thing you brought it to my attention.
(Sorry, I'm too lazy and/or time-pressed and/or overconfident to write
test cases)


I'd noted this and was going to follow up on it.
Thanks for sorting it out!


Anyway, with that patch installed, TAB and THOUSANDS_SEP can both be
CHAR_MAX + 1 so the above code needs to be twiddled. Also, we can assume
C99. So, something like following (pardon Thunderbird's line wrap):

if (numeric_field_span
&& (tab == TAB_DEFAULT
  ? thousands_char != NON_CHAR && isblank (to_uchar (thousands_sep))
  : tab == thousands_sep))
  {
error (0, 0,
 _("field separator %s is treated as a group separator in numbers"),
 quote (((char []) {thousands_sep, 0})));
number_locale_warned = true;
  }

with a similar replacement to the decimal_point code.


cheers,
Pádraig





bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-10 Thread Pádraig Brady

On 10/10/2021 22:20, Bernhard Voelker wrote:

On 10/10/21 19:57, Pádraig Brady wrote:

sort: numbers use ‘.’ as a decimal point in this locale


What about adding the hint to that message that this an "ambiguity warning"?

 sort: ambiguity warning: numbers use ‘.’ as a decimal point in this locale

(Likewise for the other cases, of course.)

Most other --debug messages usually state how sort processes the options / the
input, while this one tells why the processing potentially does not work as the
user expected.

+1 otherwise.


Yes it may be useful to tag messages as "informational" or "problematic".
I've mentioned this also in a reply to Paul.

thanks for the review,
Pádraig





bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-10 Thread Paul Eggert

The warnings look good, except that this one:


   $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
   sort: numbers use ‘.’ as a decimal point in this locale
   0.9
   ___
   1.0
   ___


seems overkill if we're in the C locale.

Also, shouldn't similar diagnostics be generated if the field separator 
is '-', or '+', or a digit in the current locale?




+  if (numeric_field_span)
+{
+  char sep_string[2] = { 0, };
+  sep_string[0] = thousands_sep;
+  if ((tab == TAB_DEFAULT
+   && (isblank (to_uchar (thousands_sep
+  || tab == thousands_sep)
+{
+  error (0, 0,
+ _("field separator %s is treated as a "
+   "group separator in numbers"),
+ quote (sep_string));
+  number_locale_warned = true;
+}
+}


This code brought it to my attention that the GNU 'sort' has had a 
longstanding bug (in code that I wrote long ago - sorry!) in that 
thousands_sep is either -1 or an unsigned char converted to int, and 
this doesn't work in some unusual cases. I installed the attached patch 
to fix that bug, and I vaguely suspect that it fixes similar bugs in GNU 
'test' and GNU 'expr'. Good thing you brought it to my attention. 
(Sorry, I'm too lazy and/or time-pressed and/or overconfident to write 
test cases)


Anyway, with that patch installed, TAB and THOUSANDS_SEP can both be 
CHAR_MAX + 1 so the above code needs to be twiddled. Also, we can assume 
C99. So, something like following (pardon Thunderbird's line wrap):


  if (numeric_field_span
  && (tab == TAB_DEFAULT
  ? thousands_char != NON_CHAR && isblank (to_uchar (thousands_sep))
  : tab == thousands_sep))
{
  error (0, 0,
 _("field separator %s is treated as a group separator in numbers"),
 quote (((char []) {thousands_sep, 0})));
  number_locale_warned = true;
}

with a similar replacement to the decimal_point code.
From 6cafb122fa214baa96a67fbbd2dab6c77a961e0a Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 10 Oct 2021 15:59:56 -0700
Subject: [PATCH] sort: fix unlikely bug when '\377' < 0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* gl/lib/strintcmp.c (strintcmp): Don’t assume that the input
cannot contain ((char) -1), as this equals '\377' when char is
signed (assuming 8-bit char).
* src/sort.c (decimal_point): Now char, to make it clear
that it’s always in char range now.
(NON_CHAR): New constant.
(traverse_raw_number): Return char not unsigned char;
this is simpler and could be faster.  All callers changed.
(main): Do not convert decimal_point and thousands_sep to
unsigned char, as this can mishandle comparisons on
machines where char is signed and the input data contains
((char) -1).  Use NON_CHAR, not -1, as an out-of-range value for
thousands_sep.
---
 gl/lib/strintcmp.c|  4 +++-
 gl/lib/strnumcmp-in.h |  4 ++--
 src/sort.c| 23 +--
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/gl/lib/strintcmp.c b/gl/lib/strintcmp.c
index 0ecf6d02c..215158627 100644
--- a/gl/lib/strintcmp.c
+++ b/gl/lib/strintcmp.c
@@ -21,6 +21,8 @@
 
 #include "strnumcmp-in.h"
 
+#include 
+
 /* Compare strings A and B as integers without explicitly converting
them to machine numbers, to avoid overflow problems and perhaps
improve performance.  */
@@ -28,5 +30,5 @@
 int
 strintcmp (char const *a, char const *b)
 {
-  return numcompare (a, b, -1, -1);
+  return numcompare (a, b, CHAR_MAX + 1, CHAR_MAX + 1);
 }
diff --git a/gl/lib/strnumcmp-in.h b/gl/lib/strnumcmp-in.h
index 7f2d4bbe6..f1b03689c 100644
--- a/gl/lib/strnumcmp-in.h
+++ b/gl/lib/strnumcmp-in.h
@@ -106,8 +106,8 @@ fraccompare (char const *a, char const *b, char decimal_point)
 /* Compare strings A and B as numbers without explicitly converting
them to machine numbers, to avoid overflow problems and perhaps
improve performance.  DECIMAL_POINT is the decimal point and
-   THOUSANDS_SEP the thousands separator.  A DECIMAL_POINT of -1
-   causes comparisons to act as if there is no decimal point
+   THOUSANDS_SEP the thousands separator.  A DECIMAL_POINT outside
+   'char' range causes comparisons to act as if there is no decimal point
character, and likewise for THOUSANDS_SEP.  */
 
 static inline int _GL_ATTRIBUTE_PURE
diff --git a/src/sort.c b/src/sort.c
index 2979400e7..d32dcfb02 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -152,9 +152,9 @@ enum
   };
 
 /* The representation of the decimal point in the current locale.  */
-static int decimal_point;
+static char decimal_point;
 
-/* Thousands separator; if -1, then there isn't one.  */
+/* Thousands separator; if outside char range, there is no separator.  */
 static int thousands_sep;
 
 /* Nonzero if the corresponding locales are hard.  */
@@ -338,6 +338,9 @@ static bool reverse;
they were read if all keys compare equal.  */
 static bool stable;
 
+/* An int value 

bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-10 Thread Paul Eggert

On 10/10/21 2:20 PM, Bernhard Voelker wrote:

What about adding the hint to that message that this an "ambiguity warning"?


I don't think it's ambiguous (merely confusing :-).





bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-10 Thread Bernhard Voelker
On 10/10/21 19:57, Pádraig Brady wrote:
>sort: numbers use ‘.’ as a decimal point in this locale

What about adding the hint to that message that this an "ambiguity warning"?

sort: ambiguity warning: numbers use ‘.’ as a decimal point in this locale

(Likewise for the other cases, of course.)

Most other --debug messages usually state how sort processes the options / the
input, while this one tells why the processing potentially does not work as the
user expected.

+1 otherwise.

Have a nice day,
Berny





bug#51011: [PATCH] sort: --debug: add warnings about radix and grouping chars

2021-10-10 Thread Pádraig Brady

On 09/10/2021 23:29, Paul Eggert wrote:

On 10/9/21 5:00 AM, Pádraig Brady wrote:

On 09/10/2021 04:48, Paul Eggert wrote:



'sort' could determine the group sizes from the locale, and
reject digit strings that are formatted improperly according to the
group-size rules. (Not that I plan to write the code to do that)


Yes I agree that would be better, but not worth it I think
as there would still be ambiguity in what was a grouping char
and what was a field separator. Also that ambiguity would
now vary across locales.


I don't see the ambiguity problem. The field separator is used to
identify fields; once the fields are identified, the thousands
separator, decimal point, etc. contribute to numeric comparison in the
usual way. So it's OK (albeit confusing) for the field separator to be
'.' or ',' or '-' or '0' or any another character that could be part of
a number.

For example, with 'sort -t 0 -k 2,2n', the digit 0 is not part of the
numeric field that is compared, and there's no ambiguity about that even
though 0 is allowed in numbers. The same idea applies to 'sort -t , -k
2,2n'.


Indeed. I dropped -t, from my later examples and confused myself.

Attached is the proposed change to add appropriate warnings in this area.
Examples now diagnosed are:

  $ printf '0,9\n1,a\n' | sort -nk1 --debug -t, -s
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘,’ is treated as a group separator in numbers
  1,a
  _
  0,9
  ___

  $ printf '1,a\n0,9\n' | LC_ALL=fr_FR.utf8 sort -gk1 --debug -t, -s
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘,’ is treated as a decimal point in numbers
  0,9
  ___
  1,a
  __

  $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
  sort: numbers use ‘.’ as a decimal point in this locale
  0.9
  ___
  1.0
  ___

  $ printf '1.0\n0.9\n' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
  sort: numbers use ‘,’ as a decimal point in this locale
  0.9
  _
  1.0
  _

cheers,
Pádraig
>From 06410ad77fcd80010859586cc068d8931bcf74e6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= 
Date: Sun, 10 Oct 2021 18:35:59 +0100
Subject: [PATCH] sort: --debug: add warnings about radix and grouping chars
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New warnings are added related to the handling
of thousands grouping characters, and decimal points.
Examples now diagnosed are:

$ printf '0,9\n1,a\n' | sort -nk1 --debug -t, -s
sort: key 1 is numeric and spans multiple fields
sort: field separator ‘,’ is treated as a group separator in numbers
1,a
_
0,9
___

$ printf '1,a\n0,9\n' | LC_ALL=fr_FR.utf8 sort -gk1 --debug -t, -s
sort: key 1 is numeric and spans multiple fields
sort: field separator ‘,’ is treated as a decimal point in numbers
0,9
___
1,a
__

$ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
sort: numbers use ‘.’ as a decimal point in this locale
0.9
___
1.0
___

$ printf '1.0\n0.9\n' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
sort: numbers use ‘,’ as a decimal point in this locale
0.9
_
1.0
_

* src/sort.c (key_warnings): Add the warnings above.
* tests/misc/sort-debug-warn.sh: Add test cases.
Addresses https://bugs.gnu.org/51011
---
 src/sort.c| 62 +--
 tests/misc/sort-debug-warn.sh | 33 ++-
 2 files changed, 92 insertions(+), 3 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 2979400e7..17bae2a57 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2422,9 +2422,15 @@ key_warnings (struct keyfield const *gkey, bool gkey_only)
   struct keyfield const *key;
   struct keyfield ugkey = *gkey;
   unsigned long keynum = 1;
+  bool numeric_field = false;
+  bool numeric_field_span = false;
+  bool general_numeric_field_span = false;
 
   for (key = keylist; key; key = key->next, keynum++)
 {
+  if (key_numeric (key))
+numeric_field = true;
+
   if (key->traditional_used)
 {
   size_t sword = key->sword;
@@ -2479,8 +2485,14 @@ key_warnings (struct keyfield const *gkey, bool gkey_only)
   if (!sword)
 sword++;
   if (!eword || sword < eword)
-error (0, 0, _("key %lu is numeric and spans multiple fields"),
-   keynum);
+{
+  error (0, 0, _("key %lu is numeric and spans multiple fields"),
+ keynum);
+  if (key->general_numeric)
+general_numeric_field_span = true;
+  else
+numeric_field_span = true;
+}
 }
 
   /* Flag global options not copied or specified in any key.  */
@@ -2499,6 +2511,52 @@ key_warnings (struct keyfield const *gkey, bool gkey_only)
   ugkey.reverse &= !key->reverse;
 }
 
+  /* Explicitly warn if field delimiters in this locale
+ don't constrain numbers.  */
+  bool number_locale_warned = false;
+  if (numeric_field_span)
+{
+  char sep_string[2] = { 0, };
+  sep_string[0] =