On Mon, Nov 01, 2021 at 09:04:03AM +0000, Stuart Henderson wrote:
> On 2021/10/31 20:48, Scott Cheloha wrote:
> > In uniq(1), if we use getline(3) instead of fgets(3) we can support
> > arbitrarily long lines.
> 
> It works for me, and getting rid of the length restriction is nice.
> 
> I don't know how much of a concern it is, but it's about twice as slow:
> 
> $ wc -l /tmp/z
>  10000000 /tmp/z
> $ time (cat /tmp/z | uniq > /dev/null)
>     0m01.65s real     0m01.62s user     0m00.22s system
> $ time (cat /tmp/z | obj/uniq > /dev/null)
>     0m03.60s real     0m03.52s user     0m00.20s system
> $ time (cat /tmp/z | guniq > /dev/null)
>     0m01.33s real     0m01.28s user     0m00.20s system

How did you generate this input?  Is it just ten million lines with a
single 'z' character?  `jot -bz 10000000`?

My own testing here with pathological inputs didn't show that large of
a performance difference between fgets(3) and getline(3).  There was
a difference but it was closer to like 5-10%.

I don't know what GNU uniq is doing but maybe we should do it too.

> though in practice with large files I'd be more likely to sort | uniq -c
> and uniq's time is dwarfed by sort's.

If you don't need to count dupicates you can also just use sort's -u
flag to bypass the pipe.  Saves some copying.

--

*ahem*

Updated patch.

I screwed up.  We don't need to free(3) prevline when we print a line.
We can just swap the pointers (and now, sizes) as we did before and
let getline(3) handle reallocation.

Index: uniq.c
===================================================================
RCS file: /cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.27
diff -u -p -r1.27 uniq.c
--- uniq.c      31 Jul 2018 02:55:57 -0000      1.27
+++ uniq.c      1 Nov 2021 15:30:55 -0000
@@ -45,8 +45,6 @@
 #include <wchar.h>
 #include <wctype.h>
 
-#define        MAXLINELEN      (8 * 1024)
-
 int cflag, dflag, iflag, uflag;
 int numchars, numfields, repeats;
 
@@ -59,10 +57,10 @@ __dead void usage(void);
 int
 main(int argc, char *argv[])
 {
-       char *t1, *t2;
+       char *prevline, *t1, *t2, *thisline;
        FILE *ifp = NULL, *ofp = NULL;
+       size_t prevsize, thissize, tmpsize;
        int ch;
-       char *prevline, *thisline;
 
        setlocale(LC_CTYPE, "");
 
@@ -133,15 +131,18 @@ main(int argc, char *argv[])
        if (pledge("stdio", NULL) == -1)
                err(1, "pledge");
 
-       prevline = malloc(MAXLINELEN);
-       thisline = malloc(MAXLINELEN);
-       if (prevline == NULL || thisline == NULL)
-               err(1, "malloc");
-
-       if (fgets(prevline, MAXLINELEN, ifp) == NULL)
+       prevline = NULL;
+       prevsize = 0;
+       if (getline(&prevline, &prevsize, ifp) == -1) {
+               free(prevline);
+               if (ferror(ifp))
+                       err(1, "getline");
                exit(0);
-
-       while (fgets(thisline, MAXLINELEN, ifp)) {
+       }
+       
+       thisline = NULL;
+       thissize = 0;
+       while (getline(&thisline, &thissize, ifp) != -1) {
                /* If requested get the chosen fields + character offsets. */
                if (numfields || numchars) {
                        t1 = skip(thisline);
@@ -157,11 +158,20 @@ main(int argc, char *argv[])
                        t1 = prevline;
                        prevline = thisline;
                        thisline = t1;
+                       tmpsize = prevsize;
+                       prevsize = thissize;
+                       thissize = tmpsize;
                        repeats = 0;
                } else
                        ++repeats;
        }
+       free(thisline);
+       if (ferror(ifp))
+               err(1, "getline");
+
        show(ofp, prevline);
+       free(prevline);
+
        exit(0);
 }
 

Reply via email to