Hi Theo,

Theo Buehler wrote on Thu, Dec 21, 2017 at 11:06:02AM +0100:
> On Thu, Dec 21, 2017 at 01:50:37AM -0800, Claus Assmann wrote:
>> On Fri, Dec 15, 2017, Todd C. Miller wrote:
>>> On Fri, 15 Dec 2017 03:41:25 -0800, Claus Assmann wrote:

>>>> I use uniq for some log file analysis and it contained "duplicate"
>>>> lines which only differ in lower/upper case (user input). Hence I
>>>> added an -i flag which also exists on FreeBSD at least.
>>>> Maybe it's useful to add to OpenBSD?

>>> Linux has this as well.  It's OK by me.

>> So would it be ok for you to commit it or does it have to be someone
>> else (with the proper rights and some spare time) based on your "OK"?

> I committed a minimally tweaked version of your diff:
> * put the -c and -i flags together in the manual and sync usage()
> * add a sentence that i is an extension of POSIX to the STANDARDS
>   section
> * use alphabetic order of the globals iflag and uflag

I don't object to what you committed even though it is rather
half-baked.  I see how it may be useful as it stands.

Making the new feature fully UTF-8 aware would require major changes
to the code, making it substantially more complicated, which is exactly
what, as a rule, we want to avoid while maintaining POSIX utilities.
In particular, non-standard features are usually expected to not cause
major complications of standard utilities if that can be avoided.

So I would rather not add a BUGS section; instead, I think
documenting that -i is intended as an ASCII-only feature is
the way to go.
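
To illustrate what "ASCII-only" means here, a minimal sketch in Python
(not the uniq(1) code, which is C; the names ascii_fold and
same_line_ignoring_ascii_case are made up for this example).  It
assumes lines are compared as raw bytes and only the bytes A-Z are
folded, so the UTF-8 encodings of non-ASCII letters never match
across case:

```python
# Hypothetical helpers for illustration only; not taken from uniq.c.

def ascii_fold(line: bytes) -> bytes:
    """Map ASCII A-Z (0x41-0x5A) to a-z; leave every other byte alone."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in line)


def same_line_ignoring_ascii_case(a: bytes, b: bytes) -> bool:
    """Compare two lines after folding ASCII upper case only."""
    return ascii_fold(a) == ascii_fold(b)


if __name__ == "__main__":
    # ASCII letters differ only in case: treated as identical.
    print(same_line_ignoring_ascii_case(b"Error", b"ERROR"))
    # U+00C4 and U+00E4 encode as C3 84 and C3 A4 in UTF-8;
    # neither byte is in A-Z, so the lines stay different.
    print(same_line_ignoring_ascii_case("Ärger".encode("utf-8"),
                                        "ärger".encode("utf-8")))
```

The point of the sketch is that no locale or multibyte decoding is
involved at all, which is what keeps the feature simple.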

While here, take the opportunity to mention that uniq(1) is
intended to work on the level of code points, not on the level of
fully combined characters.
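
The difference between the two levels can be sketched as follows,
again in Python and again with made-up helper names (codepoint_equal,
canonically_equal).  The first function models what a utility without
normalization sees; the second shows the canonical equivalence that
uniq(1) deliberately does not consider:

```python
import unicodedata

# Hypothetical helpers for illustration only.

def codepoint_equal(a: str, b: str) -> bool:
    """Plain comparison, code point by code point, no normalization."""
    return a == b


def canonically_equal(a: str, b: str) -> bool:
    """Comparison after canonical decomposition and reordering (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    precomposed = "\u00e9"       # e with acute as a single code point
    decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT
    print(codepoint_equal(precomposed, decomposed))
    print(canonically_equal(precomposed, decomposed))

    # Same base letter, same two accents, different order.
    dot_then_acute = "a\u0323\u0301"
    acute_then_dot = "a\u0301\u0323"
    print(codepoint_equal(dot_then_acute, acute_then_dot))
    print(canonically_equal(dot_then_acute, acute_then_dot))
```

In both pairs the plain comparison reports a difference while the
normalized one does not, which is the behaviour the CAVEATS text
below describes.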

OK?
  Ingo


Index: uniq.1
===================================================================
RCS file: /cvs/src/usr.bin/uniq/uniq.1,v
retrieving revision 1.20
diff -u -r1.20 uniq.1
--- uniq.1      21 Dec 2017 10:05:59 -0000      1.20
+++ uniq.1      21 Dec 2017 14:51:04 -0000
@@ -74,7 +74,7 @@
 by blanks, with blanks considered part of the following field.
 Field numbers are one based, i.e., the first field is field one.
 .It Fl i
-Case insensitive comparison of lines.
+Regard lower and upper case ASCII characters as identical.
 .It Fl s Ar chars
 Ignore the first
 .Ar chars
@@ -128,6 +128,10 @@
 .Qq POSIX ,
 or an unsupported value, each byte is treated as a character,
 and only space and tab are considered blank.
+.Pp
+This variable is ignored for case comparisons.
+Lower and upper case versions of non-ASCII characters are always
+considered different.
 .El
 .Sh EXIT STATUS
 .Ex -std uniq
@@ -155,3 +159,10 @@
 and
 .Fl Ns Ar number
 options have been deprecated but are still supported in this implementation.
+.Sh CAVEATS
+The
+.Nm
+utility does no Unicode normalization.
+For example, a character followed by a combining accent is considered
+different from the canonically equivalent combined character,
+and the order of combining accents is significant.