Re: uniq(1): ignore newline when comparing lines?

2021-11-02 Thread Scott Cheloha
On Tue, Nov 02, 2021 at 08:12:00AM -0600, Todd C. Miller wrote:
> On Mon, 01 Nov 2021 21:04:54 -0500, Scott Cheloha wrote:
> 
> > Yes it would be simpler.  However I didn't want to start changing the
> > input -- which we currently don't do -- without discussing it.
> >
> > The standard says we should "write one copy of each input line on the
> > output." So, if we are being strict, we don't add a newline that isn't
> > there, because that isn't what we read.  Any other interpretation
> > requires handwaving about what an "input line" even is.
> 
> The System V version of uniq actually ignores the last line if it
> doesn't end in a newline.  For example:
> 
> $ printf "bar\nfoo\nfool" | uniq
> bar
> foo
> 
> AIX, Solaris, and HP-UX still exhibit this behavior.  What happens
> is that the gline() function (which reads the line) returns non-zero
> when it hits EOF, discarding any input in that line.  Interestingly,
> it does realloc the line buffer as needed to handle long lines.
> 
> So really, this is a corner case where you can't count on consistent
> behavior among implementations and we just need to do what we think
> is best.  If you prefer we retain the existing behavior wrt a final
> line without a newline that is OK with me.

NetBSD still does what we do.  So there are three behaviors in the
wild.

Sigh.

Let's go with the FreeBSD/DragonFly/GNU unconditional newline
behavior.  It's simpler to implement it this way if we're going to
adopt the POSIX.1-2008 change and start ignoring newlines when we
compare lines.

We can share the blame if it breaks something.

I'll wait a few days to let other people comment, as this is a
behavior change.

Here's an updated patch.  Note that we do not (yet) use the line
length for anything but stubbing out the newline so we only need
a single variable for this purpose.

OK?  Objections from people who are not millert@?

Index: uniq.c
===
RCS file: /cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.28
diff -u -p -r1.28 uniq.c
--- uniq.c  1 Nov 2021 23:20:35 -   1.28
+++ uniq.c  2 Nov 2021 15:34:48 -
@@ -60,6 +60,7 @@ main(int argc, char *argv[])
char *prevline, *t1, *t2, *thisline;
FILE *ifp = NULL, *ofp = NULL;
size_t prevsize, thissize, tmpsize;
+   ssize_t len;
int ch;
 
setlocale(LC_CTYPE, "");
@@ -133,16 +134,21 @@ main(int argc, char *argv[])
 
prevsize = 0;
prevline = NULL;
-   if (getline(, , ifp) == -1) {
+   if ((len = getline(, , ifp)) == -1) {
free(prevline);
if (ferror(ifp))
err(1, "getline");
exit(0);
}
+   if (prevline[len - 1] == '\n')
+   prevline[len - 1] = '\0';

thissize = 0;
thisline = NULL;
-   while (getline(, , ifp) != -1) {
+   while ((len = getline(, , ifp)) != -1) {
+   if (thisline[len - 1] == '\n')
+   thisline[len - 1] = '\0';
+
/* If requested get the chosen fields + character offsets. */
if (numfields || numchars) {
t1 = skip(thisline);
@@ -185,9 +191,9 @@ show(FILE *ofp, char *str)
 {
if ((dflag && repeats) || (uflag && !repeats)) {
if (cflag)
-   (void)fprintf(ofp, "%4d %s", repeats + 1, str);
+   fprintf(ofp, "%4d %s\n", repeats + 1, str);
else
-   (void)fprintf(ofp, "%s", str);
+   fprintf(ofp, "%s\n", str);
}
 }
 



Re: uniq(1): ignore newline when comparing lines?

2021-11-02 Thread Ingo Schwarze
Hi,

Todd C. Miller wrote on Tue, Nov 02, 2021 at 08:12:00AM -0600:
> On Mon, 01 Nov 2021 21:04:54 -0500, Scott Cheloha wrote:

>> Yes it would be simpler.  However I didn't want to start changing the
>> input -- which we currently don't do -- without discussing it.
>>
>> The standard says we should "write one copy of each input line on the
>> output." So, if we are being strict, we don't add a newline that isn't
>> there, because that isn't what we read.  Any other interpretation
>> requires handwaving about what an "input line" even is.

> The System V version of uniq actually ignores the last line if it
> doesn't end in a newline.  For example:
> 
> $ printf "bar\nfoo\nfool" | uniq
> bar
> foo
> 
> AIX, Solaris, and HP-UX still exhibit this behavior.  What happens
> is that the gline() function (which reads the line) returns non-zero
> when it hits EOF, discarding any input in that line.  Interestingly,
> it does realloc the line buffer as needed to handle long lines.
> 
> So really, this is a corner case where you can't count on consistent
> behavior among implementations and we just need to do what we think
> is best.  If you prefer we retain the existing behavior wrt a final
> line without a newline that is OK with me.

Just a quick comment without looking at the code:

POSIX says on
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html :

  INPUT FILES
The input file shall be a text file.

By definition, a file that does not end in a newline character
is *NOT* a text file.  Consequently, what uniq(1) does in this case
is unspecified.

My personal opinion is that in general, if utilities are specified to
process text files and the terminating newline is missing, it is best to
behave as if the mandatory terminating newline were present.  The reason
why i think so is that forgetting the terminating newline is a popular
user error, and silently assuming a terminating newline causes less
surprise with users than silently discarding the incomplete last line.

Most users are not even aware that a file without a terminating newline
is not a text file, so it definitely is less surprising for the vast
majority.  But even those who are aware of that quirk in the definition
of what a text file is (like myself) are regularly surprised by the
consequences...

Yours,
  Ingo



Re: uniq(1): ignore newline when comparing lines?

2021-11-02 Thread Todd C . Miller
Actually, the historic version of uniq used static 1000 byte buffers.
Solaris 2.6 includes a version that realloc()s the buffer but I
don't know when exactly that behavior was added.

 - todd



Re: uniq(1): ignore newline when comparing lines?

2021-11-02 Thread Todd C . Miller
On Mon, 01 Nov 2021 21:04:54 -0500, Scott Cheloha wrote:

> Yes it would be simpler.  However I didn't want to start changing the
> input -- which we currently don't do -- without discussing it.
>
> The standard says we should "write one copy of each input line on the
> output." So, if we are being strict, we don't add a newline that isn't
> there, because that isn't what we read.  Any other interpretation
> requires handwaving about what an "input line" even is.

The System V version of uniq actually ignores the last line if it
doesn't end in a newline.  For example:

$ printf "bar\nfoo\nfool" | uniq
bar
foo

AIX, Solaris, and HP-UX still exhibit this behavior.  What happens
is that the gline() function (which reads the line) returns non-zero
when it hits EOF, discarding any input in that line.  Interestingly,
it does realloc the line buffer as needed to handle long lines.

So really, this is a corner case where you can't count on consistent
behavior among implementations and we just need to do what we think
is best.  If you prefer we retain the existing behavior wrt a final
line without a newline that is OK with me.

 - todd



Re: uniq(1): ignore newline when comparing lines?

2021-11-01 Thread Scott Cheloha
On Mon, Nov 01, 2021 at 07:33:58PM -0600, Todd C. Miller wrote:
> On Mon, 01 Nov 2021 20:20:49 -0500, Scott Cheloha wrote:
> 
> > On the one hand, it is intuitive that two buffers are not literally
> > the same if one has a newline and the other does not.  strcmp(3)
> > agrees with this.
> >
> > ... On the other hand, it also seems intuitive that two records are
> > the same even if one record doesn't have a record delimiter.  You
> > could argue that this makes the utility more flexible in a small
> > corner case.
> 
> I think we do want it.  However, wouldn't it be simpler to always
> strip the trailing newline and emit a newline in show() unconditionally?
> I realize this would add a newline if one was missing but that
> doesn't seem like a bad thing to me and is what GNU and FreeBSD do.

Yes it would be simpler.  However I didn't want to start changing the
input -- which we currently don't do -- without discussing it.

The standard says we should "write one copy of each input line on the
output." So, if we are being strict, we don't add a newline that isn't
there, because that isn't what we read.  Any other interpretation
requires handwaving about what an "input line" even is.

So, I guess I'm unsure.  I love the simplicity of the unconditional
newline but modifying the input makes me nervous.

If additional people are fine with adding the maybe-missing newline I
can make it unconditional.  Aligning our behavior with other
implementations in this corner case is probably reasonable.



Re: uniq(1): ignore newline when comparing lines?

2021-11-01 Thread Todd C . Miller
On Mon, 01 Nov 2021 20:20:49 -0500, Scott Cheloha wrote:

> On the one hand, it is intuitive that two buffers are not literally
> the same if one has a newline and the other does not.  strcmp(3)
> agrees with this.
>
> ... On the other hand, it also seems intuitive that two records are
> the same even if one record doesn't have a record delimiter.  You
> could argue that this makes the utility more flexible in a small
> corner case.

I think we do want it.  However, wouldn't it be simpler to always
strip the trailing newline and emit a newline in show() unconditionally?
I realize this would add a newline if one was missing but that
doesn't seem like a bad thing to me and is what GNU and FreeBSD do.

 - todd



uniq(1): ignore newline when comparing lines?

2021-11-01 Thread Scott Cheloha
POSIX.1-2008 added a new sentence to the description of uniq:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html

> [...] The trailing  of each line in the input shall be
> ignored when doing comparisons.

It comes from this interpretation:

https://collaboration.opengroup.org/austin/interps/documents/14355/AI-133.txt

POSIX.1-2001 doesn't have this sentence:

https://pubs.opengroup.org/onlinepubs/009695399/utilities/uniq.html

This distinction changes behavior slightly.  My interpretation is that
the last line in the input can be a duplicate of the penultimate line
even if said last line is missing the terminating newline byte.

Consider a practical example.  Here is -current uniq(1):

$ printf "line\nline" | uniq -c
   1 line
   1 line$ 

(note that we do not append the missing newline to the output and my
shell prompt is printed on the same line.)

Here is uniq(1) with my patch:

$ printf "line\nline" | obj/uniq -c
   2 line

Do we want this?

On the one hand, it is intuitive that two buffers are not literally
the same if one has a newline and the other does not.  strcmp(3)
agrees with this.

... On the other hand, it also seems intuitive that two records are
the same even if one record doesn't have a record delimiter.  You
could argue that this makes the utility more flexible in a small
corner case.

Thoughts?

Of note is that we already claim conformance to POSIX.1-2008 in the
uniq.1 manpage even though we are missing this behavior.  This is the
only major behavior change inttroduced between POSIX.1-2001 and
POSIX.1-2008.  If we decide not to add this behavior I think we should
dial back the standard quoted in the manpage to POSIX.1-2001.

FWIW, GNU uniq implements this behavior:

$ printf "line\nline" | guniq
  2 line

As does FreeBSD, since 2002, although they don't mention it in their
manpage.  Here's the commit:

https://cgit.freebsd.org/src/commit/usr.bin/uniq/uniq.c?id=4e774f7fbe7c154f101f90c115b702780920ebcb

Index: uniq.c
===
RCS file: /cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.28
diff -u -p -r1.28 uniq.c
--- uniq.c  1 Nov 2021 23:20:35 -   1.28
+++ uniq.c  2 Nov 2021 01:16:52 -
@@ -49,7 +49,7 @@ int cflag, dflag, iflag, uflag;
 int numchars, numfields, repeats;
 
 FILE   *file(char *, char *);
-voidshow(FILE *, char *);
+voidshow(FILE *, char *, int);
 char   *skip(char *);
 voidobsolete(char *[]);
 __dead voidusage(void);
@@ -60,7 +60,8 @@ main(int argc, char *argv[])
char *prevline, *t1, *t2, *thisline;
FILE *ifp = NULL, *ofp = NULL;
size_t prevsize, thissize, tmpsize;
-   int ch;
+   ssize_t len;
+   int ch, prevnl, thisnl;
 
setlocale(LC_CTYPE, "");
 
@@ -133,16 +134,22 @@ main(int argc, char *argv[])
 
prevsize = 0;
prevline = NULL;
-   if (getline(, , ifp) == -1) {
+   if ((len = getline(, , ifp)) == -1) {
free(prevline);
if (ferror(ifp))
err(1, "getline");
exit(0);
}
+   prevnl = prevline[len - 1] == '\n';
+   if (prevnl)
+   prevline[len - 1] = '\0';

thissize = 0;
thisline = NULL;
-   while (getline(, , ifp) != -1) {
+   while ((len = getline(, , ifp)) != -1) {
+   thisnl = thisline[len - 1] == '\n';
+   if (thisnl)
+   thisline[len - 1] = '\0';
/* If requested get the chosen fields + character offsets. */
if (numfields || numchars) {
t1 = skip(thisline);
@@ -154,13 +161,14 @@ main(int argc, char *argv[])
 
/* If different, print; set previous to new value. */
if ((iflag ? strcasecmp : strcmp)(t1, t2)) {
-   show(ofp, prevline);
+   show(ofp, prevline, prevnl);
t1 = prevline;
prevline = thisline;
thisline = t1;
tmpsize = prevsize;
prevsize = thissize;
thissize = tmpsize;
+   prevnl = thisnl;
repeats = 0;
} else
++repeats;
@@ -169,7 +177,7 @@ main(int argc, char *argv[])
if (ferror(ifp))
err(1, "getline");
 
-   show(ofp, prevline);
+   show(ofp, prevline, prevnl);
free(prevline);
 
exit(0);
@@ -181,13 +189,15 @@ main(int argc, char *argv[])
  * of the line.
  */
 void
-show(FILE *ofp, char *str)
+show(FILE *ofp, char *str, int newline)
 {
if ((dflag && repeats) || (uflag && !repeats)) {
if (cflag)
(void)fprintf(ofp, "%4d %s", repeats + 1, str);
else
(void)fprintf(ofp, "%s", str);
+   if (newline)
+