Re: uniq: add -i option

2017-12-23 Thread Jeremie Courreges-Anglas
On Sun, Dec 24 2017, Theo Buehler  wrote:
>> Obviously, the relevant condition should have been
>> 
>>  if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0)
>> 
>> instead of awkwardly messing with logical AND and OR.
>
> Indeed, this is much better. I like your version, but perhaps others
> might like this one more:
>
>   if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2))

This would break -i mode.  If people want to avoid the ternary operator
(I see no reason here) then please write a proper if / else pattern.

> Thanks for pointing it out.

ok jca@ for the diff below

> Index: uniq.c
> ===
> RCS file: /var/cvs/src/usr.bin/uniq/uniq.c,v
> retrieving revision 1.25
> diff -u -p -r1.25 uniq.c
> --- uniq.c21 Dec 2017 10:05:59 -  1.25
> +++ uniq.c23 Dec 2017 23:07:16 -
> @@ -152,7 +152,7 @@ main(int argc, char *argv[])
>   }
>  
>   /* If different, print; set previous to new value. */
> - if ((!iflag && strcmp(t1, t2)) || strcasecmp(t1, t2)) {
> + if ((iflag ? strcasecmp : strcmp)(t1, t2)) {
>   show(ofp, prevline);
>   t1 = prevline;
>   prevline = thisline;
>

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: uniq: add -i option

2017-12-23 Thread Theo Buehler
On Sun, Dec 24, 2017 at 12:01:17AM +, kshe wrote:
> On Sat, 23 Dec 2017 23:13:22 +, Theo Buehler wrote:
> > > Obviously, the relevant condition should have been
> > >
> > > if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0)
> > >
> > > instead of awkwardly messing with logical AND and OR.
> >
> > Indeed, this is much better. I like your version, but perhaps others
> > might like this one more:
> >
> > if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2))
> 
> Well, this is even worse: this version not only has the same kind of
> performance drawback as the current one, but it is also logically
> incorrect, and will cause the `-i' flag to have no effect at all.

Sometimes I'm not only stupid but really, really stupid.

Thanks

> 
> If you really want to do this without a ternary operator, the equivalent
> form is
> 
>   if (iflag && strcasecmp(t1, t2) || !iflag && strcmp(t1, t2))
> 
> which looks very dumb indeed, but at least it is logically sound.
> 
> Regards,
> 
> kshe
> 



Re: uniq: add -i option

2017-12-23 Thread kshe
On Sat, 23 Dec 2017 23:13:22 +, Theo Buehler wrote:
> > Obviously, the relevant condition should have been
> >
> > if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0)
> >
> > instead of awkwardly messing with logical AND and OR.
>
> Indeed, this is much better. I like your version, but perhaps others
> might like this one more:
>
> if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2))

Well, this is even worse: this version not only has the same kind of
performance drawback as the current one, but it is also logically
incorrect, and will cause the `-i' flag to have no effect at all.

If you really want to do this without a ternary operator, the equivalent
form is

if (iflag && strcasecmp(t1, t2) || !iflag && strcmp(t1, t2))

which looks very dumb indeed, but at least it is logically sound.

Regards,

kshe



Re: uniq: add -i option

2017-12-23 Thread Theo Buehler

> Obviously, the relevant condition should have been
> 
>   if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0)
> 
> instead of awkwardly messing with logical AND and OR.

Indeed, this is much better. I like your version, but perhaps others
might like this one more:

if ((iflag && strcasecmp(t1, t2)) || strcmp(t1, t2))

Thanks for pointing it out.

Index: uniq.c
===
RCS file: /var/cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.25
diff -u -p -r1.25 uniq.c
--- uniq.c  21 Dec 2017 10:05:59 -  1.25
+++ uniq.c  23 Dec 2017 23:07:16 -
@@ -152,7 +152,7 @@ main(int argc, char *argv[])
}
 
/* If different, print; set previous to new value. */
-   if ((!iflag && strcmp(t1, t2)) || strcasecmp(t1, t2)) {
+   if ((iflag ? strcasecmp : strcmp)(t1, t2)) {
show(ofp, prevline);
t1 = prevline;
prevline = thisline;



Re: uniq: add -i option

2017-12-23 Thread kshe
Hi,

This change causes uniq(1) to compare equal lines twice when run without
`-i', once in strcmp(3) and once again in strcasecmp(3).  In the worst
case, which is also one of the most common, the main loop spends about
half of its time copying buffers and looking for newlines in fgets(3),
and the other half actually comparing those buffers; hence, in practice,
because of this commit, it has now become 25% slower than it was before.

$ jot -b a -s a 4080 >tmp
$ cat $(jot -b tmp 4096) >tmp2
$ cat $(jot -b tmp2 16) >lines

$ time ./uniq /dev/null
0m01.60s real 0m00.80s user 0m00.75s system
$ time uniq /dev/null
0m01.23s real 0m00.47s user 0m00.73s system

Obviously, the relevant condition should have been

if ((iflag ? strcasecmp : strcmp)(t1, t2) != 0)

instead of awkwardly messing with logical AND and OR.

That said, it seems that this whole program was not really written with
performance in mind in the first place, as in less than an hour I was
able to write a new uniq(1) which, apart from handling arbitrarily long
lines and NUL bytes correctly as well as consistently parsing its
arguments (contrary to OpenBSD's version), has proved an order of
magnitude faster than the latter in such worst cases, so I guess a 25%
slowdown may not appear that important after all.  But, still, I think
the code makes little sense as it currently stands, if only from a
logical point of view and regardless of those performance
considerations.

Regards,

kshe



Re: uniq: add -i option

2017-12-22 Thread Ingo Schwarze
Hi Theo,

Theo Buehler wrote on Thu, Dec 21, 2017 at 11:57:12PM +0100:
> Ingo Schwarze wrote:

>> So i really don't feel like adding a BUGS section, but instead i
>> think documenting that -i is intended as an ASCII-only feature is
>> the way to go.

> Yes, sounds reasonable.

Thanks for checking!
Changes to DESCRIPTION and ENVIRONMENT committed.

>> While here, profit from the opportunity to mention that uniq(1) is
>> intended to work on the level of codepoints, not on the level of
>> fully combined characters.

> ok, I think that's an improvement. Thanks.

I didn't commit the CAVEATS section because deraadt@ convincingly
pointed out that it was quite hard to understand.  Fully understanding
a run-of-the-mill section 1 manual page must not require knowledge
about Unicode terms like "normalization forms" and "canonical equivalence".

Besides, the way our base system utilities define "string equality"
for strings that may be either ASCII or UTF-8 is not specific to
uniq(1).  So i'll probably document that in another, central place,
quite possibly in locale(1) because that's where LC_CTYPE is defined,
so people are likely to look there for the gory details of UTF-8
handling, and also because using full Unicode terminology in that
place is less disruptive than in an innocent manual line uniq(1).

Yours,
  Ingo



Re: uniq: add -i option

2017-12-21 Thread Theo Buehler
> So i really don't feel like adding a BUGS section, but instead i
> think documenting that -i is intended as an ASCII-only feature is
> the way to go.

Yes, sounds reasonable.

> While here, profit from the opportunity to mention that uniq(1) is
> intended to work on the level of codepoints, not on the level of
> fully combined characters.
> 
> OK?

ok, I think that's an improvement. Thanks.



Re: uniq: add -i option

2017-12-21 Thread Claus Assmann
On Thu, Dec 21, 2017, Theo Buehler wrote:

> I committed a minimally tweaked version of your diff:

Thanks for the fixes and the commit, I will try to do
better next time.



Re: uniq: add -i option

2017-12-21 Thread Ingo Schwarze
Hi Theo,

Theo Buehler wrote on Thu, Dec 21, 2017 at 11:06:02AM +0100:
> On Thu, Dec 21, 2017 at 01:50:37AM -0800, Claus Assmann wrote:
>> On Fri, Dec 15, 2017, Todd C. Miller wrote:
>>> On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote:

 I use uniq for some log file analysis and it contained "duplicate"
 lines which only differ in lower/upper case (user input). Hence I
 added an -i flag which also exists on FreeBSD at least.
 Maybe it's useful to add to OpenBSD?

>>> Linux has this as well.  It's OK by me.

>> So would it be ok for you to commit it or does it have to be someone
>> else (with the proper rights and some spare time) based on your "OK"?

> I committed a minimally tweaked version of your diff:
> * put the -c and -i flags together in the manual and sync usage()
> * add a sentence that i is an extension of POSIX to the STANDARDS
>   section
> * use alphabetic order of the globals iflag and uflag

I don't object to what you committed even though it is rather
half-baked.  I see how it may be useful as it stands.

Making the new feature fully UTF-8 aware would require major changes
to the code, making it substantially more complicated, which is exactly
what, as a rule, we want to avoid while maintaining POSIX utilities.
In particular, non-standard features are usually expected to not cause
major complications of standard utilities if that can be avoided.

So i really don't feel like adding a BUGS section, but instead i
think documenting that -i is intended as an ASCII-only feature is
the way to go.

While here, profit from the opportunity to mention that uniq(1) is
intended to work on the level of codepoints, not on the level of
fully combined characters.

OK?
  Ingo


Index: uniq.1
===
RCS file: /cvs/src/usr.bin/uniq/uniq.1,v
retrieving revision 1.20
diff -u -r1.20 uniq.1
--- uniq.1  21 Dec 2017 10:05:59 -  1.20
+++ uniq.1  21 Dec 2017 14:51:04 -
@@ -74,7 +74,7 @@
 by blanks, with blanks considered part of the following field.
 Field numbers are one based, i.e., the first field is field one.
 .It Fl i
-Case insensitive comparison of lines.
+Regard lower and upper case ASCII characters as identical.
 .It Fl s Ar chars
 Ignore the first
 .Ar chars
@@ -128,6 +128,10 @@
 .Qq POSIX ,
 or an unsupported value, each byte is treated as a character,
 and only space and tab are considered blank.
+.Pp
+This variable is ignored for case comparisons.
+Lower and upper case versions of non-ASCII characters are always
+considered different.
 .El
 .Sh EXIT STATUS
 .Ex -std uniq
@@ -155,3 +159,10 @@
 and
 .Fl Ns Ar number
 options have been deprecated but are still supported in this implementation.
+.Sh CAVEATS
+The
+.Nm
+utility does no Unicode normalization.
+For example, a character followed by a combining accent is considered
+different from the canonically equivalent combined character,
+and the order of combining accents is significant.



Re: uniq: add -i option

2017-12-21 Thread Craig Skinner
On Thu, 21 Dec 2017 11:06:02 +0100 Theo Buehler wrote:
> I committed a minimally tweaked version of your diff...

Thanks everybody.



Re: uniq: add -i option

2017-12-21 Thread Theo Buehler
On Thu, Dec 21, 2017 at 01:50:37AM -0800, Claus Assmann wrote:
> On Fri, Dec 15, 2017, Todd C. Miller wrote:
> > On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote:
> 
> > > I use uniq for some log file analysis and it contained "duplicate"
> > > lines which only differ in lower/upper case (user input). Hence I
> > > added an -i flag which also exists on FreeBSD at least.
> > > Maybe it's useful to add to OpenBSD?
> 
> > Linux has this as well.  It's OK by me.
> 
> So would it be ok for you to commit it or does it have to be someone
> else (with the proper rights and some spare time) based on your "OK"?
> 

I committed a minimally tweaked version of your diff:
* put the -c and -i flags together in the manual and sync usage()
* add a sentence that i is an extension of POSIX to the STANDARDS
  section
* use alphabetic order of the globals iflag and uflag



Re: uniq: add -i option

2017-12-21 Thread Claus Assmann
On Fri, Dec 15, 2017, Todd C. Miller wrote:
> On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote:

> > I use uniq for some log file analysis and it contained "duplicate"
> > lines which only differ in lower/upper case (user input). Hence I
> > added an -i flag which also exists on FreeBSD at least.
> > Maybe it's useful to add to OpenBSD?

> Linux has this as well.  It's OK by me.

So would it be ok for you to commit it or does it have to be someone
else (with the proper rights and some spare time) based on your "OK"?



Re: uniq: add -i option

2017-12-15 Thread Todd C. Miller
On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote:

> I use uniq for some log file analysis and it contained "duplicate"
> lines which only differ in lower/upper case (user input). Hence I
> added an -i flag which also exists on FreeBSD at least.
> Maybe it's useful to add to OpenBSD?

Linux has this as well.  It's OK by me.

 - todd



uniq: add -i option

2017-12-15 Thread Claus Assmann
I use uniq for some log file analysis and it contained "duplicate"
lines which only differ in lower/upper case (user input). Hence I
added an -i flag which also exists on FreeBSD at least.
Maybe it's useful to add to OpenBSD?

Index: uniq.1
===
RCS file: /home/ca/OpenBSD/cvs/src/usr.bin/uniq/uniq.1,v
retrieving revision 1.19
diff -u -r1.19 uniq.1
--- uniq.1  24 Oct 2016 13:46:58 -  1.19
+++ uniq.1  15 Dec 2017 11:35:53 -
@@ -43,6 +43,7 @@
 .Nm uniq
 .Op Fl c
 .Op Fl d | u
+.Op Fl i
 .Op Fl f Ar fields
 .Op Fl s Ar chars
 .Oo
@@ -73,6 +74,8 @@
 A field is a string of non-blank characters separated from adjacent fields
 by blanks, with blanks considered part of the following field.
 Field numbers are one based, i.e., the first field is field one.
+.It Fl i
+Case insensitive comparison of lines.
 .It Fl s Ar chars
 Ignore the first
 .Ar chars
Index: uniq.c
===
RCS file: /home/ca/OpenBSD/cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.24
diff -u -r1.24 uniq.c
--- uniq.c  19 Dec 2015 10:21:01 -  1.24
+++ uniq.c  15 Dec 2017 11:30:46 -
@@ -47,7 +47,7 @@
 
 #defineMAXLINELEN  (8 * 1024)
 
-int cflag, dflag, uflag;
+int cflag, dflag, uflag, iflag;
 int numchars, numfields, repeats;
 
 FILE   *file(char *, char *);
@@ -70,7 +70,7 @@
err(1, "pledge");
 
obsolete(argv);
-   while ((ch = getopt(argc, argv, "cdf:s:u")) != -1) {
+   while ((ch = getopt(argc, argv, "cdf:is:u")) != -1) {
const char *errstr;
 
switch (ch) {
@@ -87,6 +87,9 @@
errx(1, "field skip value is %s: %s",
errstr, optarg);
break;
+   case 'i':
+   iflag = 1;
+   break;
case 's':
numchars = (int)strtonum(optarg, 0, INT_MAX,
);
@@ -149,7 +152,7 @@
}
 
/* If different, print; set previous to new value. */
-   if (strcmp(t1, t2)) {
+   if ((!iflag && strcmp(t1, t2)) || strcasecmp(t1, t2)) {
show(ofp, prevline);
t1 = prevline;
prevline = thisline;