Grep worries me because it selects a lot of names that aren't in the pattern file

No, it does not. You do not understand what it does and/or do not use it properly. Besides the missing -F option (as I explained in my previous post), every pattern should certainly end with a tab. For instance, line 297123 of SS.HN-GLU-MB-January2020-PTRs-Rndm.txt is "union". As a consequence, 'grep -f SS.HN-GLU-MB-January2020-PTRs-Rndm.txt' outputs every line that includes "union", i.e.:
$ grep union SS.IPv4-NLU-Joined-HN-GLU-January2020-slash24.PTRs_.Tally_.txt
unallocated.unioncom.net.ua     249
r-r-resale-dba-once-upon-a-child-9148-union.static.fuse.net     4
chaco-credit-union-10m-fuse.static.fuse.net     4
mail.unionbankph.com    3
gw.interunion.ru        2

Including line 297123, 128 lines in SS.HN-GLU-MB-January2020-PTRs-Rndm.txt include "union":
$ zgrep -c union SS.HN-GLU-MB-January2020-PTRs-Rndm.txt_0.gz
128
The presence of the 127 other lines makes no difference whatsoever in the output of 'grep -f SS.HN-GLU-MB-January2020-PTRs-Rndm.txt'.
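
To make the point concrete, here is a minimal sketch on made-up data (the file names tally.txt and patterns.txt and the counts in them are hypothetical, not taken from your files). It counts the lines selected by a bare "union" pattern, then the lines selected once a tab is appended to the pattern and -F is given:

```shell
# Hypothetical sample data: "name<TAB>count" lines, as in a tally file.
tmp=$(mktemp -d)
printf 'union\t7\nmail.unionbankph.com\t3\ngw.interunion.ru\t2\n' > "$tmp/tally.txt"
printf 'union\n' > "$tmp/patterns.txt"

# Without -F and without a trailing tab, "union" matches anywhere in the line.
loose=$(grep -c -f "$tmp/patterns.txt" "$tmp/tally.txt")

# Append a tab to every pattern, then match them as fixed strings.
awk '{ print $0 "\t" }' "$tmp/patterns.txt" > "$tmp/patterns_tab.txt"
strict=$(grep -c -F -f "$tmp/patterns_tab.txt" "$tmp/tally.txt")

echo "loose=$loose strict=$strict"
rm -rf "$tmp"
```

Note that a trailing tab only anchors the end of the name: a name such as "credit-union" followed by a tab would still be selected. Comparing the whole first field, as join does, avoids that.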

The join, sort, and comm-based scripts all were executed orders of magnitude faster than the grep script.

The overall run time of those scripts is dominated by sort's, which is linearithmic in the number of lines in the larger of the two files. Because grep must output the lines in the order of the (potentially infinite) file, its run time grows with the product of the number of lines in that file and the number of patterns (for each processed line, all patterns are enumerated): that is much worse if both files are large. Also, without -F, grep interprets the patterns as regular expressions: it is obviously more expensive to match a regular expression than a fixed string. Finally, grep searches for the pattern in the whole line, whereas join compares only one specific field.
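
For comparison, a minimal sketch of the sort-then-join approach on made-up data (the file names and contents are hypothetical, and this is not necessarily the exact script you ran). Both files are sorted on the name, then join keeps only the tally lines whose first field equals a listed name:

```shell
# Hypothetical sample data: a tally file and a list of names to select.
tmp=$(mktemp -d)
printf 'union\t7\nmail.unionbankph.com\t3\ngw.interunion.ru\t2\n' > "$tmp/tally.txt"
printf 'union\ngw.interunion.ru\nabsent.example.org\n' > "$tmp/names.txt"
tab=$(printf '\t')

# join needs both inputs sorted on the join field, in the same collation order.
LC_ALL=C sort -t "$tab" -k 1,1 "$tmp/tally.txt" > "$tmp/tally.sorted"
LC_ALL=C sort "$tmp/names.txt" > "$tmp/names.sorted"

# Only the first (tab-delimited) field is compared, and it is compared whole.
matched=$(LC_ALL=C join -t "$tab" "$tmp/names.sorted" "$tmp/tally.sorted")
printf '%s\n' "$matched"
rm -rf "$tmp"
```

Here "mail.unionbankph.com" is not selected, although it contains "union", because join compares entire fields and not substrings.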
