[Trisquel-users] Re : Can a path statement be too long or the file too big?

lcerf Thu, 09 Jul 2020 11:50:43 -0700

Grep worries me because it selects a lot of names that aren't in the pattern file

No, it does not. You do not understand what it does and/or do not use it properly. Besides the missing -F option (as I explained in my previous post), a tab should certainly end every pattern. For instance, line 297123 of SS.HN-GLU-MB-January2020-PTRs-Rndm.txt is "union". As a consequence, 'grep -f SS.HN-GLU-MB-January2020-PTRs-Rndm.txt' outputs every line including "union", i.e.:

$ grep union SS.IPv4-NLU-Joined-HN-GLU-January2020-slash24.PTRs_.Tally_.txt
unallocated.unioncom.net.ua     249
r-r-resale-dba-once-upon-a-child-9148-union.static.fuse.net     4
chaco-credit-union-10m-fuse.static.fuse.net     4
mail.unionbankph.com    3
gw.interunion.ru        2

Including line 297123, 128 lines in SS.HN-GLU-MB-January2020-PTRs-Rndm.txt include "union":

$ zgrep -c union SS.HN-GLU-MB-January2020-PTRs-Rndm.txt_0.gz
128

The presence of the 127 other lines makes no difference whatsoever in the output of 'grep -f SS.HN-GLU-MB-January2020-PTRs-Rndm.txt'.
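To make the point concrete, here is a minimal sketch on made-up miniature files. The file names patterns.txt, patterns_tab.txt and tally.txt, and their contents, are hypothetical stand-ins, not the real data. It compares plain 'grep -f' with 'grep -F -f' after a tab has been appended to every pattern:

```shell
# Hypothetical miniature stand-ins for the pattern file and the tally file.
dir=$(mktemp -d)
tab=$(printf '\t')
printf 'union\n' > "$dir/patterns.txt"
printf 'mail.unionbankph.com\t3\nunion\t7\ngw.interunion.ru\t2\n' > "$dir/tally.txt"

# Plain 'grep -f': every line that contains "union" anywhere is selected.
loose=$(grep -c -f "$dir/patterns.txt" "$dir/tally.txt")

# Append a tab to every pattern and match fixed strings (-F):
# only lines where "union" is immediately followed by the tab are selected.
sed "s/\$/$tab/" "$dir/patterns.txt" > "$dir/patterns_tab.txt"
strict=$(grep -c -F -f "$dir/patterns_tab.txt" "$dir/tally.txt")

echo "without tab: $loose line(s); with tab and -F: $strict line(s)"
rm -r "$dir"
```

Note that the trailing tab only anchors the end of the first field: a name such as "credit-union" followed by a tab would still be selected, which is one more reason to prefer a field-based tool.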

The join, sort, and comm-based scripts all were executed orders of magnitude faster than the grep script.

The overall run time of those scripts is dominated by sort's, which is linearithmic in the number of lines in the larger of the two files. Because grep must output the lines in the order of the (potentially infinite) file, its run time grows with the product of the number of lines in that file and the number of patterns (for each processed line, all patterns are enumerated): that is much worse if both files are large. Also, without -F, grep interprets the patterns as regular expressions: it is obviously more expensive to match a regular expression than a fixed string. Finally, grep searches for the pattern in the whole line and not only in one specific field, as join does.
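For comparison, a sketch of the join-based selection on the same kind of made-up data. The file names names.txt and tally.txt are again hypothetical, and this is not necessarily the exact script discussed earlier in the thread. Both inputs are sorted on the name, then join keeps only the lines whose first field is exactly one of the names:

```shell
# Sort and join must agree on the collation order, hence the C locale.
export LC_ALL=C
dir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical miniature inputs: a list of names and a tab-separated tally.
printf 'union\nexample.org\n' | sort > "$dir/names.txt"
printf 'mail.unionbankph.com\t3\nunion\t7\ngw.interunion.ru\t2\n' |
  sort -t "$tab" -k 1,1 > "$dir/tally.txt"

# Keep the tally lines whose first field equals one of the names.
joined=$(join -t "$tab" "$dir/names.txt" "$dir/tally.txt")

printf '%s\n' "$joined"
rm -r "$dir"
```

Here only the line whose whole first field is "union" is kept; "mail.unionbankph.com" and "gw.interunion.ru" are not, because join compares entire fields and not substrings.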
