Re: drop unused locale junk from sort(1)
ping On Oct 11 15:37:03, schwa...@usta.de wrote: > Jan Stary wrote on Tue, Oct 11, 2016 at 11:26:50AM +0200: > > > Feeling encouraged by Ingo's ok to remove locale from cp/rm, > > here's a diff that removes the locale stuff we don't actually do > > from the code and documentation of sort(1). Leave just LC_CTYPE > > You removed that too, but that's OK because the code doesn't > currently use it. Right now, even though some wide character > code is there, it always operates in narrow mode. > > > which determines isblank() and case conversions. > > LC_CTYPE does NOT affect isblank(), all the wide-character-related > functions are in dead code as far as i can see, and OpenBSD does > not support wide character case conversions. > > > Your diff is related to more than one thing: > > 1. Kill LC_NUMERIC support. > I very strongly agree with that. > I think we should never support that anywhere. > It is an absolutely terrible idea no matter how you look at it. > Your removal was incomplete, though. > In the patch below, i delete some additional bits > that you left behind. > > 2. Remove LC_COLLATE support for now. > I also agree with that, even if not quite as strongly. > In the distant future, we might or might not want LC_COLLATE support. > In Belgrade, i talked to Baptiste Daroussin who just implemented > LC_COLLATE in FreeBSD libc and who was utterly scared by the complexity. > Knowing ourselves, we will be scared even more once we get there. > So it will definitely not happen quickly. > > If it ever does, the trivial bits that are present right now > are so tiny that they won't help at all. It might even be > easier to have clean earth to till when starting. > Who knows at this point. > > In any case, deleting the incomplete pieces scattered around > the current code makes maintenance easier until that time, > which seems worthwhile. > > 3. With step 2 above, some flags become explicitly const > that appear to be variable now (but actually aren't). > That would allow deleting large amounts of dead code, > but i didn't add that to the patch below. > > The grep multibyte code is absolutely horrible, with insanity > up to and including unions of char and wchar_t coming with > hosts of trivial non-standard wrapper functions and tons > of duplicate, in some case triplicate code. Another > example: Even though the code half-heartedly attempts to > isolate multibyte stuff in bwstring.{c,h} - superficially > reminding of our utf8.c technique, but half-hearted because > the respective functions are numerous and called all over the > place, such that isolation is effectively a failure - > the main program (!) at one place abuses btowc(3) (!) > on an ASCII (!) command line argument and stores the result > as wchar_t in the top level global state structure (!) - > that is obviously functionally completely futile, but very > effective for polluting *all* of the code with complicated > data types and headers. > > But cleaning up all the parts of item 3 would be way too intrusive > for this patch, so it leaves behind hundreds of lines of code > that is already both dead and duplicate right now. > > > Annotate a missed -z flag while there, > > Committed. > > > and change /var/tmp to /tmp. > > No, according to file.c, the program still writes to /var/tmp, > not to /tmp. Before changing that in the manual, we would have > to change it in the code. > > > The following version of the patch survives our test suite, > but it no doubt needs more testing. > > Yours, > Ingo > > > Index: bwstring.c > === > RCS file: /cvs/src/usr.bin/sort/bwstring.c,v > retrieving revision 1.7 > diff -u -p -r1.7 bwstring.c > --- bwstring.c1 Apr 2015 22:38:08 - 1.7 > +++ bwstring.c11 Oct 2016 13:20:59 - > @@ -40,8 +40,8 @@ > #include "bwstring.h" > #include "sort.h" > > -bool byte_sort; > -size_t sort_mb_cur_max = 1; > +static const bool byte_sort = true; > +const size_t sort_mb_cur_max = 1; > > static wchar_t **wmonths; > static char **cmonths; > Index: bwstring.h > === > RCS file: /cvs/src/usr.bin/sort/bwstring.h,v > retrieving revision 1.2 > diff -u -p -r1.2 bwstring.h > --- bwstring.h31 Dec 2015 16:09:31 - 1.2 > +++ bwstring.h11 Oct 2016 13:20:59 - > @@ -37,8 +37,7 @@ > > #include "mem.h" > > -extern bool byte_sort; > -extern size_t sort_mb_cur_max; > +extern const size_t sort_mb_cur_max; > > /* wchar_t is of 4 bytes: */ > #define SIZEOF_WCHAR_STRING(LEN) ((LEN)*sizeof(wchar_t)) > Index: coll.c > === > RCS file: /cvs/src/usr.bin/sort/coll.c,v > retrieving revision 1.11 > diff -u -p -r1.11 coll
Re: drop unused locale junk from sort(1)
On Oct 11 15:37:03, schwa...@usta.de wrote: > > and change /var/tmp to /tmp. > > No, according to file.c, the program still writes to /var/tmp, > not to /tmp. Before changing that in the manual, we would have > to change it in the code. I sent that in a separate email; this is a leftover, sorry. Jan
Re: drop unused locale junk from sort(1)
Hi, Jan Stary wrote on Tue, Oct 11, 2016 at 11:26:50AM +0200: > Feeling encouraged by Ingo's ok to remove locale from cp/rm, > here's a diff that removes the locale stuff we don't actually do > from the code and documentation of sort(1). Leave just LC_CTYPE You removed that too, but that's OK because the code doesn't currently use it. Right now, even though some wide character code is there, it always operates in narrow mode. > which determines isblank() and case conversions. LC_CTYPE does NOT affect isblank(), all the wide-character-related functions are in dead code as far as i can see, and OpenBSD does not support wide character case conversions. Your diff is related to more than one thing: 1. Kill LC_NUMERIC support. I very strongly agree with that. I think we should never support that anywhere. It is an absolutely terrible idea no matter how you look at it. Your removal was incomplete, though. In the patch below, i delete some additional bits that you left behind. 2. Remove LC_COLLATE support for now. I also agree with that, even if not quite as strongly. In the distant future, we might or might not want LC_COLLATE support. In Belgrade, i talked to Baptiste Daroussin who just implemented LC_COLLATE in FreeBSD libc and who was utterly scared by the complexity. Knowing ourselves, we will be scared even more once we get there. So it will definitely not happen quickly. If it ever does, the trivial bits that are present right now are so tiny that they won't help at all. It might even be easier to have clean earth to till when starting. Who knows at this point. In any case, deleting the incomplete pieces scattered around the current code makes maintenance easier until that time, which seems worthwhile. 3. With step 2 above, some flags become explicitly const that appear to be variable now (but actually aren't). That would allow deleting large amounts of dead code, but i didn't add that to the patch below. The grep multibyte code is absolutely horrible, with insanity up to and including unions of char and wchar_t coming with hosts of trivial non-standard wrapper functions and tons of duplicate, in some case triplicate code. Another example: Even though the code half-heartedly attempts to isolate multibyte stuff in bwstring.{c,h} - superficially reminding of our utf8.c technique, but half-hearted because the respective functions are numerous and called all over the place, such that isolation is effectively a failure - the main program (!) at one place abuses btowc(3) (!) on an ASCII (!) command line argument and stores the result as wchar_t in the top level global state structure (!) - that is obviously functionally completely futile, but very effective for polluting *all* of the code with complicated data types and headers. But cleaning up all the parts of item 3 would be way too intrusive for this patch, so it leaves behind hundreds of lines of code that is already both dead and duplicate right now. > Annotate a missed -z flag while there, Committed. > and change /var/tmp to /tmp. No, according to file.c, the program still writes to /var/tmp, not to /tmp. Before changing that in the manual, we would have to change it in the code. The following version of the patch survives our test suite, but it no doubt needs more testing. Yours, Ingo Index: bwstring.c === RCS file: /cvs/src/usr.bin/sort/bwstring.c,v retrieving revision 1.7 diff -u -p -r1.7 bwstring.c --- bwstring.c 1 Apr 2015 22:38:08 - 1.7 +++ bwstring.c 11 Oct 2016 13:20:59 - @@ -40,8 +40,8 @@ #include "bwstring.h" #include "sort.h" -bool byte_sort; -size_t sort_mb_cur_max = 1; +static const bool byte_sort = true; +const size_t sort_mb_cur_max = 1; static wchar_t **wmonths; static char **cmonths; Index: bwstring.h === RCS file: /cvs/src/usr.bin/sort/bwstring.h,v retrieving revision 1.2 diff -u -p -r1.2 bwstring.h --- bwstring.h 31 Dec 2015 16:09:31 - 1.2 +++ bwstring.h 11 Oct 2016 13:20:59 - @@ -37,8 +37,7 @@ #include "mem.h" -extern bool byte_sort; -extern size_t sort_mb_cur_max; +extern const size_t sort_mb_cur_max; /* wchar_t is of 4 bytes: */ #defineSIZEOF_WCHAR_STRING(LEN) ((LEN)*sizeof(wchar_t)) Index: coll.c === RCS file: /cvs/src/usr.bin/sort/coll.c,v retrieving revision 1.11 diff -u -p -r1.11 coll.c --- coll.c 11 Dec 2015 21:41:51 - 1.11 +++ coll.c 11 Oct 2016 13:20:59 - @@ -46,12 +46,6 @@ struct key_specs *keys; size_t keys_num = 0; -wint_t symbol_decimal_point = L'.'; -/* there is no default thousands separator in collate rules: */ -wint_t symbol_thousands_sep = 0; -wint_t symbol_neg
drop unused locale junk from sort(1)
Feeling encouraged by Ingo's ok to remove locale from cp/rm, here's a diff that removes the locale stuff we don't actually do from the code and documentation of sort(1). Leave just LC_CTYPE which determines isblank() and case conversions. Annotate a missed -z flag while there, and change /var/tmp to /tmp. Jan Index: sort.c === RCS file: /cvs/src/usr.bin/sort/sort.c,v retrieving revision 1.86 diff -u -p -r1.86 sort.c --- sort.c 14 Jul 2016 08:31:18 - 1.86 +++ sort.c 11 Oct 2016 09:21:45 - @@ -252,55 +252,6 @@ conv_mbtowc(wchar_t *wc, const char *c, } /* - * Set current locale symbols. - */ -static void -set_locale(void) -{ - struct lconv *lc; - const char *locale; - - setlocale(LC_ALL, ""); - - /* Obtain LC_NUMERIC info */ - lc = localeconv(); - - /* Convert to wide char form */ - conv_mbtowc(&symbol_decimal_point, lc->decimal_point, - symbol_decimal_point); - conv_mbtowc(&symbol_thousands_sep, lc->thousands_sep, - symbol_thousands_sep); - conv_mbtowc(&symbol_positive_sign, lc->positive_sign, - symbol_positive_sign); - conv_mbtowc(&symbol_negative_sign, lc->negative_sign, - symbol_negative_sign); - - if (getenv("GNUSORT_NUMERIC_COMPATIBILITY")) - gnusort_numeric_compatibility = true; - - locale = setlocale(LC_COLLATE, NULL); - if (locale != NULL) { - char *tmpl; - const char *byteclocale; - - tmpl = sort_strdup(locale); - byteclocale = setlocale(LC_COLLATE, "C"); - if (byteclocale && strcmp(byteclocale, tmpl) == 0) { - byte_sort = true; - } else { - byteclocale = setlocale(LC_COLLATE, "POSIX"); - if (byteclocale && strcmp(byteclocale, tmpl) == 0) - byte_sort = true; - else - setlocale(LC_COLLATE, tmpl); - } - sort_free(tmpl); - } - if (!byte_sort) - sort_mb_cur_max = MB_CUR_MAX; -} - -/* * Set directory temporary files. */ static void @@ -883,7 +834,6 @@ main(int argc, char *argv[]) atexit(clear_tmp_files); - set_locale(); set_tmpdir(); set_sort_opts(); @@ -1163,17 +1113,8 @@ main(int argc, char *argv[]) if (debug_sort) { printf("Memory to be used for sorting: %llu\n", available_free_memory); - printf("Using collate rules of %s locale\n", - setlocale(LC_COLLATE, NULL)); if (byte_sort) printf("Byte sort is used\n"); - if (print_symbols_on_debug) { - printf("Decimal Point: <%lc>\n", symbol_decimal_point); - if (symbol_thousands_sep) - printf("Thousands separator: <%lc>\n", - symbol_thousands_sep); - printf("Positive sign: <%lc>\n", symbol_positive_sign); - printf("Negative sign: <%lc>\n", symbol_negative_sign); } } Index: sort.1 === RCS file: /cvs/src/usr.bin/sort/sort.1,v retrieving revision 1.54 diff -u -p -r1.54 sort.1 --- sort.1 5 Apr 2015 14:20:22 - 1.54 +++ sort.1 11 Oct 2016 09:21:45 - @@ -52,13 +52,16 @@ The .Nm utility sorts text and binary files by lines. A line is a record separated from the subsequent record by a -newline (default) or NUL \'\\0\' character (-z option). +newline (default) or NUL \'\\0\' character +.Po +.Fl z +option +.Pc . A record can contain any printable or unprintable characters. Comparisons are based on one or more sort keys extracted from each line of input, and are performed lexicographically, -according to the current locale's collating rules and the -specified command-line options that can tune the actual -sorting behavior. +according to the specified command-line options +that can tune the actual sorting behavior. By default, if keys are not given, .Nm uses entire lines for comparison. @@ -110,7 +113,7 @@ Store temporary files in the directory The default path is the value of the environment variable .Ev TMPDIR or -.Pa /var/tmp +.Pa /tmp if .Ev TMPDIR is not defined. @@ -173,10 +176,6 @@ Unknown strings are considered smaller t .It Fl n , Fl Fl numeric-sort, Fl Fl sort=numeric An initial numeric string, consisting of optional blank space, optional minus sign, and zero or more digits (including decimal point) -.\" with -.\" optional radix character and thousands -.\" separator -.\" (as defined in the current locale), is sorted by arithmetic value. Leading blank characters are ignored. .It Fl R, Fl Fl random-sort