Magic Banana comes to the rescue yet again!
I tried the two suggested scripts: the second takes half as much CPU time as
the first.
I applied MB's first suggestion to a couple of multi-megabyte files,
obfuscated to protect the innocent:
awk '{print $1, $3, $2}' 'DataX-VpDCts.txt' > Temp01.txt ;
awk '{print $1, $3, $2}' 'DataY-DVCts.txt' > Temp02.txt ;
join -1 1 -2 1 Temp01.txt Temp02.txt > Temp03.txt ;
awk '{print $1, $3, $5, $2}' Temp03.txt > DataZ-VpDCts-DVCts-Keyed.txt ;
rm Temp01.txt Temp02.txt Temp03.txt
Forgive me for using crutches ... but the result has ~300,000 rows, four
columns, and no missing cells, and the script(s) took less than a second of
real time to process.
Here's the story behind the two subject files:
Each has three columns: IPv4, Count 1 or Count 2, Textfile name.
The join command is meant to produce IPv4, Count 1, Count 2, Textfile name
in four columns, and it does so correctly.
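For anyone following along, here is the reorder-join-reorder sequence on toy data (the file names and rows below are invented for illustration; the real inputs are the obfuscated multi-megabyte files above):

```shell
# Two inputs, each: IPv4, a count, a text-file name, sorted on the IPv4 key.
cat > a.txt <<'EOF'
10.0.0.1 5 page1.html
10.0.0.2 7 page2.html
EOF
cat > b.txt <<'EOF'
10.0.0.1 3 page1.html
10.0.0.2 9 page2.html
EOF
# Move the file name next to the key, join on field 1 (the IPv4), then
# put the columns back as IPv4, Count 1, Count 2, Textfile name:
awk '{print $1, $3, $2}' a.txt > t1.txt
awk '{print $1, $3, $2}' b.txt > t2.txt
join t1.txt t2.txt | awk '{print $1, $3, $5, $2}' > out.txt
cat out.txt
# 10.0.0.1 5 3 page1.html
# 10.0.0.2 7 9 page2.html
rm t1.txt t2.txt
```

Note that join requires both inputs to be sorted on the join field, which the reordering awk passes leave untouched here.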
A few of the IPv4's visited more than one Textfile name, and some IPv4's
visited a single Textfile name multiple times. Consequently, some IPv4's
appear twenty or more times in successive rows when the resulting file is
sorted on the IPv4 column.
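A quick way to see how often each IPv4 recurs is a sort | uniq -c pass over the first column (the file name and rows here are placeholders, not the real data):

```shell
# Joined rows: IPv4, Count 1, Count 2, Textfile name (sample data).
cat > joined.txt <<'EOF'
10.0.0.1 5 3 page1.html
10.0.0.1 2 1 page2.html
10.0.0.2 7 9 page2.html
EOF
# Count rows per IPv4, busiest first (uniq -c right-pads the count):
awk '{print $1}' joined.txt | sort | uniq -c | sort -rn
#   2 10.0.0.1
#   1 10.0.0.2
```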
A total of thirty-two sets of domain visitor data were examined; some IPv4's
visited twenty-nine of the thirty-two domains. Over a thousand IPv4's
visited a single domain more than a thousand times; the maximum was over
100,000 visits in the month examined.
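Statistics like those can be re-derived from the joined file with an awk tally. This is a sketch assuming column 2 holds the visit count being summed (sample rows invented):

```shell
# Joined rows: IPv4, Count 1, Count 2, Textfile name (sample data).
cat > joined.txt <<'EOF'
10.0.0.1 5 3 page1.html
10.0.0.1 2 1 page2.html
10.0.0.2 9 4 page2.html
EOF
# Sum the Count 1 column per IPv4 and print the totals in key order:
awk '{tot[$1] += $2} END {for (ip in tot) print ip, tot[ip]}' joined.txt | sort
# 10.0.0.1 7
# 10.0.0.2 9
```

The trailing sort is there because awk's for-in iteration order over the array is unspecified.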
Looking at the resulting files, I note that my earlier-generated domain
visit counts are inaccurate, but that this is not the result of an
inaccurate join.
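One way to rule the join out as the culprit is join's -v option, which prints the lines whose key has no partner in the other file (file names and rows below are placeholders):

```shell
# Sorted inputs with one unpaired key in each (sample data):
cat > left.txt <<'EOF'
10.0.0.1 5 page1.html
10.0.0.3 4 page3.html
EOF
cat > right.txt <<'EOF'
10.0.0.1 3 page1.html
10.0.0.4 8 page4.html
EOF
# -v 1: unpairable lines from file 1; -v 2: unpairable lines from file 2.
join -v 1 left.txt right.txt    # 10.0.0.3 4 page3.html
join -v 2 left.txt right.txt    # 10.0.0.4 8 page4.html
# Empty output from both runs would mean every key found a partner.
```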
More to come, it appears.
George Langford