On Mon, Nov 01, 2021 at 09:04:03AM +0000, Stuart Henderson wrote: > On 2021/10/31 20:48, Scott Cheloha wrote: > > In uniq(1), if we use getline(3) instead of fgets(3) we can support > > arbitrarily long lines. > > It works for me, and getting rid of the length restriction is nice. > > I don't know how much of a concern it is, but it's about twice as slow: > > $ wc -l /tmp/z > 10000000 /tmp/z > $ time (cat /tmp/z | uniq > /dev/null) > 0m01.65s real 0m01.62s user 0m00.22s system > $ time (cat /tmp/z | obj/uniq > /dev/null) > 0m03.60s real 0m03.52s user 0m00.20s system > $ time (cat /tmp/z | guniq > /dev/null) > 0m01.33s real 0m01.28s user 0m00.20s system
How did you generate this input? Is it just ten million lines with a single 'z' character? `jot -bz 10000000`? My own testing here with pathological inputs didn't show that large of a performance difference between fgets(3) and getline(3). There was a difference but it was closer to like 5-10%. I don't know what GNU uniq is doing but maybe we should do it too. > though in practice with large files I'd be more likely to sort | uniq -c > and uniq's time is dwarfed by sort's. If you don't need to count dupicates you can also just use sort's -u flag to bypass the pipe. Saves some copying. -- *ahem* Updated patch. I screwed up. We don't need to free(3) prevline when we print a line. We can just swap the pointers (and now, sizes) as we did before and let getline(3) handle reallocation. Index: uniq.c =================================================================== RCS file: /cvs/src/usr.bin/uniq/uniq.c,v retrieving revision 1.27 diff -u -p -r1.27 uniq.c --- uniq.c 31 Jul 2018 02:55:57 -0000 1.27 +++ uniq.c 1 Nov 2021 15:30:55 -0000 @@ -45,8 +45,6 @@ #include <wchar.h> #include <wctype.h> -#define MAXLINELEN (8 * 1024) - int cflag, dflag, iflag, uflag; int numchars, numfields, repeats; @@ -59,10 +57,10 @@ __dead void usage(void); int main(int argc, char *argv[]) { - char *t1, *t2; + char *prevline, *t1, *t2, *thisline; FILE *ifp = NULL, *ofp = NULL; + size_t prevsize, thissize, tmpsize; int ch; - char *prevline, *thisline; setlocale(LC_CTYPE, ""); @@ -133,15 +131,18 @@ main(int argc, char *argv[]) if (pledge("stdio", NULL) == -1) err(1, "pledge"); - prevline = malloc(MAXLINELEN); - thisline = malloc(MAXLINELEN); - if (prevline == NULL || thisline == NULL) - err(1, "malloc"); - - if (fgets(prevline, MAXLINELEN, ifp) == NULL) + prevline = NULL; + prevsize = 0; + if (getline(&prevline, &prevsize, ifp) == -1) { + free(prevline); + if (ferror(ifp)) + err(1, "getline"); exit(0); - - while (fgets(thisline, MAXLINELEN, ifp)) { + } + + thisline = NULL; + thissize = 0; + while (getline(&thisline, &thissize, ifp) != -1) { /* If requested get the chosen fields + character offsets. */ if (numfields || numchars) { t1 = skip(thisline); @@ -157,11 +158,20 @@ main(int argc, char *argv[]) t1 = prevline; prevline = thisline; thisline = t1; + tmpsize = prevsize; + prevsize = thissize; + thissize = tmpsize; repeats = 0; } else ++repeats; } + free(thisline); + if (ferror(ifp)) + err(1, "getline"); + show(ofp, prevline); + free(prevline); + exit(0); }