Magic Banana comes to the rescue yet again!

I tried the two suggested scripts: The second takes half as much CPU time as the first.

I applied MB's first suggestion to a couple of multi-megabyte files, obfuscated to
protect the innocent:

awk '{print $1, $3, $2}' 'DataX-VpDCts.txt' > Temp01.txt ;
awk '{print $1, $3, $2}' 'DataY-DVCts.txt' > Temp02.txt ;
join -2 1 Temp01.txt Temp02.txt > DataZ-VpDCts-DVCts-Keyed.txt ;
rm Temp01.txt Temp02.txt

Forgive me for using crutches ... but the result has ~300,000 rows and four columns with no missing cells, and the script(s) took less than a second of real time for processing.

Here's the story behind the two subject files:
Each has three columns: IPv4, Count 1 or Count 2, Textfile name.

The join command is meant to produce IPv4, Count 1, Count 2, Textfile name in four columns, and it does so correctly.
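For the record, here is a minimal sketch of how join can emit exactly those four columns, using its -o option to pick fields from each input. The sample addresses, counts, and file names are made up for illustration; both inputs must be sorted on the IPv4 key in field 1:

```shell
# Hypothetical sample inputs: IPv4, Count, Textfile name, sorted on field 1
printf '1.2.3.4 5 pageA.txt\n5.6.7.8 2 pageB.txt\n' > a.txt
printf '1.2.3.4 9 pageA.txt\n5.6.7.8 1 pageB.txt\n' > b.txt

# Join on the default key (field 1 of each file) and select the output
# columns explicitly: IPv4, Count 1 (file 1), Count 2 (file 2), Textfile name
join -o 1.1,1.2,2.2,1.3 a.txt b.txt
```

This prints "1.2.3.4 5 9 pageA.txt" and "5.6.7.8 2 1 pageB.txt", i.e. the four-column layout described above.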

A few of the IPv4's visited more than one Textfile name; and some IPv4's visited a single Textfile name multiple times. Therefore, some IPv4's appear up to twenty or more times in successive rows when the resulting file is sorted on the IPv4 column.

A total of thirty-two sets of domain visitor data were examined; some IPv4's visited twenty-nine of the thirty-two domains. Over a thousand IPv4's visited a single domain over a thousand times. The maximum was over 100,000 visits in the month examined.
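Repeated-IPv4 tallies like those can be pulled out of the joined file with the usual sort | uniq -c pipeline. A minimal sketch, assuming a joined file in the four-column layout above (the file name and sample rows here are hypothetical):

```shell
# Hypothetical joined file: IPv4, Count 1, Count 2, Textfile name
printf '1.2.3.4 5 9 pageA.txt\n1.2.3.4 2 1 pageB.txt\n5.6.7.8 3 3 pageA.txt\n' > joined.txt

# How many rows each IPv4 accounts for, busiest addresses first;
# the trailing awk strips uniq's count padding
cut -d' ' -f1 joined.txt | sort | uniq -c | sort -rn | awk '{print $1, $2}'
```

This prints "2 1.2.3.4" then "1 5.6.7.8", making the multi-row addresses easy to spot.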

Looking at the resulting files, I note that my earlier generated domain visit counts are inaccurate, but that is not the result of an inaccurate join.

More to come, it appears.
George Langford
