Re: uniq: add -i option
On Sun, Dec 24 2017, Theo Buehlerwrote: >> Obviously, the relevant condition should have been >> >> if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0) >> >> instead of awkwardly messing with logical AND and OR. > > Indeed, this is much better. I like your version, but perhaps others > might like this one more: > > if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2)) This would break -i mode. If people want to avoid the ternary operator (I see no reason here) then please write a proper if / else pattern. > Thanks for pointing it out. ok jca@ for the diff below > Index: uniq.c > === > RCS file: /var/cvs/src/usr.bin/uniq/uniq.c,v > retrieving revision 1.25 > diff -u -p -r1.25 uniq.c > --- uniq.c21 Dec 2017 10:05:59 - 1.25 > +++ uniq.c23 Dec 2017 23:07:16 - > @@ -152,7 +152,7 @@ main(int argc, char *argv[]) > } > > /* If different, print; set previous to new value. */ > - if ((!iflag && strcmp(t1, t2)) || strcasecmp(t1, t2)) { > + if ((iflag ? strcasecmp : strcmp)(t1, t2)) { > show(ofp, prevline); > t1 = prevline; > prevline = thisline; > -- jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
Re: uniq: add -i option
On Sun, Dec 24, 2017 at 12:01:17AM +, kshe wrote: > On Sat, 23 Dec 2017 23:13:22 +, Theo Buehler wrote: > > > Obviously, the relevant condition should have been > > > > > > if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0) > > > > > > instead of awkwardly messing with logical AND and OR. > > > > Indeed, this is much better. I like your version, but perhaps others > > might like this one more: > > > > if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2)) > > Well, this is even worse: this version not only has the same kind of > performance drawback as the current one, but it is also logically > incorrect, and will cause the `-i' flag to have no effect at all. Sometimes I'm not only stupid but really, really stupid. Thanks > > If you really want to do this without a ternary operator, the equivalent > form is > > if (iflag && strcasecmp(t1, t2) || !iflag && strcmp(t1, t2)) > > which looks very dumb indeed, but at least it is logically sound. > > Regards, > > kshe >
Re: uniq: add -i option
On Sat, 23 Dec 2017 23:13:22 +, Theo Buehler wrote: > > Obviously, the relevant condition should have been > > > > if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0) > > > > instead of awkwardly messing with logical AND and OR. > > Indeed, this is much better. I like your version, but perhaps others > might like this one more: > > if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2)) Well, this is even worse: this version not only has the same kind of performance drawback as the current one, but it is also logically incorrect, and will cause the `-i' flag to have no effect at all. If you really want to do this without a ternary operator, the equivalent form is if (iflag && strcasecmp(t1, t2) || !iflag && strcmp(t1, t2)) which looks very dumb indeed, but at least it is logically sound. Regards, kshe
Re: uniq: add -i option
> Obviously, the relevant condition should have been > > if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0) > > instead of awkwardly messing with logical AND and OR. Indeed, this is much better. I like your version, but perhaps others might like this one more: if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2)) Thanks for pointing it out. Index: uniq.c === RCS file: /var/cvs/src/usr.bin/uniq/uniq.c,v retrieving revision 1.25 diff -u -p -r1.25 uniq.c --- uniq.c 21 Dec 2017 10:05:59 - 1.25 +++ uniq.c 23 Dec 2017 23:07:16 - @@ -152,7 +152,7 @@ main(int argc, char *argv[]) } /* If different, print; set previous to new value. */ - if ((!iflag && strcmp(t1, t2)) || strcasecmp(t1, t2)) { + if ((iflag ? strcasecmp : strcmp)(t1, t2)) { show(ofp, prevline); t1 = prevline; prevline = thisline;
Re: uniq: add -i option
Hi, This change causes uniq(1) to compare equal lines twice when run without `-i', once in strcmp(3) and once again in strcasecmp(3). In the worst case, which is also one of the most common, the main loop spends about half of its time copying buffers and looking for newlines in fgets(3), and the other half actually comparing those buffers; hence, in practice, because of this commit, it has now become 25% slower than it was before. $ jot -b a -s a 4080 >tmp $ cat $(jot -b tmp 4096) >tmp2 $ cat $(jot -b tmp2 16) >lines $ time ./uniq /dev/null 0m01.60s real 0m00.80s user 0m00.75s system $ time uniq /dev/null 0m01.23s real 0m00.47s user 0m00.73s system Obviously, the relevant condition should have been if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0) instead of awkwardly messing with logical AND and OR. That said, it seems that this whole program was not really written with performance in mind in the first place, as in less than an hour I was able to write a new uniq(1) which, apart from handling arbitrarily long lines and NUL bytes correctly as well as consistently parsing its arguments (contrary to OpenBSD's version), has proved an order of magnitude faster than the latter in such worst cases, so I guess a 25% slowdown may not appear that important after all. But, still, I think the code makes little sense as it currently stands, if only from a logical point of view and regardless of those performance considerations. Regards, kshe
Re: uniq: add -i option
Hi Theo, Theo Buehler wrote on Thu, Dec 21, 2017 at 11:57:12PM +0100: > Ingo Schwarze wrote: >> So i really don't feel like adding a BUGS section, but instead i >> think documenting that -i is intended as an ASCII-only feature is >> the way to go. > Yes, sounds reasonable. Thanks for checking! Changes to DESCRIPTION and ENVIRONMENT committed. >> While here, profit from the opportunity to mention that uniq(1) is >> intended to work on the level of codepoints, not on the level of >> fully combined characters. > ok, I think that's an improvement. Thanks. I didn't commit the CAVEATS section because deraadt@ convincingly pointed out that it was quite hard to understand. Fully understanding a run-of-the-mill section 1 manual page must not require knowledge about Unicode terms like "normalization forms" and "canonical equivalence". Besides, the way our base system utilities define "string equality" for strings that may be either ASCII or UTF-8 is not specific to uniq(1). So i'll probably document that in another, central place, quite possibly in locale(1) because that's where LC_CTYPE is defined, so people are likely to look there for the gory details of UTF-8 handling, and also because using full Unicode terminology in that place is less disruptive than in an innocent manual line uniq(1). Yours, Ingo
Re: uniq: add -i option
> So i really don't feel like adding a BUGS section, but instead i > think documenting that -i is intended as an ASCII-only feature is > the way to go. Yes, sounds reasonable. > While here, profit from the opportunity to mention that uniq(1) is > intended to work on the level of codepoints, not on the level of > fully combined characters. > > OK? ok, I think that's an improvement. Thanks.
Re: uniq: add -i option
On Thu, Dec 21, 2017, Theo Buehler wrote: > I committed a minimally tweaked version of your diff: Thanks for the fixes and the commit, I will try to do better next time.
Re: uniq: add -i option
Hi Theo, Theo Buehler wrote on Thu, Dec 21, 2017 at 11:06:02AM +0100: > On Thu, Dec 21, 2017 at 01:50:37AM -0800, Claus Assmann wrote: >> On Fri, Dec 15, 2017, Todd C. Miller wrote: >>> On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote: I use uniq for some log file analysis and it contained "duplicate" lines which only differ in lower/upper case (user input). Hence I added an -i flag which also exists on FreeBSD at least. Maybe it's useful to add to OpenBSD? >>> Linux has this as well. It's OK by me. >> So would it be ok for you to commit it or does it have to be someone >> else (with the proper rights and some spare time) based on your "OK"? > I committed a minimally tweaked version of your diff: > * put the -c and -i flags together in the manual and sync usage() > * add a sentence that i is an extension of POSIX to the STANDARDS > section > * use alphabetic order of the globals iflag and uflag I don't object to what you committed even though it is rather half-baked. I see how it may be useful as it stands. Making the new feature fully UTF-8 aware would require major changes to the code, making it substantially more complicated, which is exactly what, as a rule, we want to avoid while maintaining POSIX utilities. In particular, non-standard features are usually expected to not cause major complications of standard utilities if that can be avoided. So i really don't feel like adding a BUGS section, but instead i think documenting that -i is intended as an ASCII-only feature is the way to go. While here, profit from the opportunity to mention that uniq(1) is intended to work on the level of codepoints, not on the level of fully combined characters. OK? Ingo Index: uniq.1 === RCS file: /cvs/src/usr.bin/uniq/uniq.1,v retrieving revision 1.20 diff -u -r1.20 uniq.1 --- uniq.1 21 Dec 2017 10:05:59 - 1.20 +++ uniq.1 21 Dec 2017 14:51:04 - @@ -74,7 +74,7 @@ by blanks, with blanks considered part of the following field. Field numbers are one based, i.e., the first field is field one. .It Fl i -Case insensitive comparison of lines. +Regard lower and upper case ASCII characters as identical. .It Fl s Ar chars Ignore the first .Ar chars @@ -128,6 +128,10 @@ .Qq POSIX , or an unsupported value, each byte is treated as a character, and only space and tab are considered blank. +.Pp +This variable is ignored for case comparisons. +Lower and upper case versions of non-ASCII characters are always +considered different. .El .Sh EXIT STATUS .Ex -std uniq @@ -155,3 +159,10 @@ and .Fl Ns Ar number options have been deprecated but are still supported in this implementation. +.Sh CAVEATS +The +.Nm +utility does no Unicode normalization. +For example, a character followed by a combining accent is considered +different from the canonically equivalent combined character, +and the order of combining accents is significant.
Re: uniq: add -i option
On Thu, 21 Dec 2017 11:06:02 +0100 Theo Buehler wrote: > I committed a minimally tweaked version of your diff... Thanks everybody.
Re: uniq: add -i option
On Thu, Dec 21, 2017 at 01:50:37AM -0800, Claus Assmann wrote: > On Fri, Dec 15, 2017, Todd C. Miller wrote: > > On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote: > > > > I use uniq for some log file analysis and it contained "duplicate" > > > lines which only differ in lower/upper case (user input). Hence I > > > added an -i flag which also exists on FreeBSD at least. > > > Maybe it's useful to add to OpenBSD? > > > Linux has this as well. It's OK by me. > > So would it be ok for you to commit it or does it have to be someone > else (with the proper rights and some spare time) based on your "OK"? > I committed a minimally tweaked version of your diff: * put the -c and -i flags together in the manual and sync usage() * add a sentence that i is an extension of POSIX to the STANDARDS section * use alphabetic order of the globals iflag and uflag
Re: uniq: add -i option
On Fri, Dec 15, 2017, Todd C. Miller wrote: > On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote: > > I use uniq for some log file analysis and it contained "duplicate" > > lines which only differ in lower/upper case (user input). Hence I > > added an -i flag which also exists on FreeBSD at least. > > Maybe it's useful to add to OpenBSD? > Linux has this as well. It's OK by me. So would it be ok for you to commit it or does it have to be someone else (with the proper rights and some spare time) based on your "OK"?
Re: uniq: add -i option
On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote: > I use uniq for some log file analysis and it contained "duplicate" > lines which only differ in lower/upper case (user input). Hence I > added an -i flag which also exists on FreeBSD at least. > Maybe it's useful to add to OpenBSD? Linux has this as well. It's OK by me. - todd
uniq: add -i option
I use uniq for some log file analysis and it contained "duplicate" lines which only differ in lower/upper case (user input). Hence I added an -i flag which also exists on FreeBSD at least. Maybe it's useful to add to OpenBSD? Index: uniq.1 === RCS file: /home/ca/OpenBSD/cvs/src/usr.bin/uniq/uniq.1,v retrieving revision 1.19 diff -u -r1.19 uniq.1 --- uniq.1 24 Oct 2016 13:46:58 - 1.19 +++ uniq.1 15 Dec 2017 11:35:53 - @@ -43,6 +43,7 @@ .Nm uniq .Op Fl c .Op Fl d | u +.Op Fl i .Op Fl f Ar fields .Op Fl s Ar chars .Oo @@ -73,6 +74,8 @@ A field is a string of non-blank characters separated from adjacent fields by blanks, with blanks considered part of the following field. Field numbers are one based, i.e., the first field is field one. +.It Fl i +Case insensitive comparison of lines. .It Fl s Ar chars Ignore the first .Ar chars Index: uniq.c === RCS file: /home/ca/OpenBSD/cvs/src/usr.bin/uniq/uniq.c,v retrieving revision 1.24 diff -u -r1.24 uniq.c --- uniq.c 19 Dec 2015 10:21:01 - 1.24 +++ uniq.c 15 Dec 2017 11:30:46 - @@ -47,7 +47,7 @@ #defineMAXLINELEN (8 * 1024) -int cflag, dflag, uflag; +int cflag, dflag, uflag, iflag; int numchars, numfields, repeats; FILE *file(char *, char *); @@ -70,7 +70,7 @@ err(1, "pledge"); obsolete(argv); - while ((ch = getopt(argc, argv, "cdf:s:u")) != -1) { + while ((ch = getopt(argc, argv, "cdf:is:u")) != -1) { const char *errstr; switch (ch) { @@ -87,6 +87,9 @@ errx(1, "field skip value is %s: %s", errstr, optarg); break; + case 'i': + iflag = 1; + break; case 's': numchars = (int)strtonum(optarg, 0, INT_MAX, ); @@ -149,7 +152,7 @@ } /* If different, print; set previous to new value. */ - if (strcmp(t1, t2)) { + if ((!iflag && strcmp(t1, t2)) || strcasecmp(t1, t2)) { show(ofp, prevline); t1 = prevline; prevline = thisline;