Magic Banana comes to the rescue yet again!

I tried the two suggested scripts: The second takes half as much CPU time as the first.

I applied MB's first suggestion to a couple of multi-megabyte files, obfuscated to
protect the innocent:

awk '{print $1, $3, $2}' 'DataX-VpDCts.txt' > Temp01.txt ;
awk '{print $1, $3, $2}' 'DataY-DVCts.txt' > Temp02.txt ;
join -2 1 Temp01.txt Temp02.txt > DataZ-VpDCts-DVCts-Keyed.txt ;
rm Temp01.txt Temp02.txt

Forgive me for using crutches ... but the result has ~300,000 rows and four columns with no missing cells, and the script(s) took less than a second of real time for processing.

Here's the story behind the two subject files:
Each has three columns: IPv4, Count 1 or Count 2, Textfile name.

The join command is meant to produce IPv4, Count 1, Count 2, Textfile name in four columns, and it does so correctly.
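For the record, here is a minimal sketch of how join can emit exactly those four columns, using its -o option to pick fields from each input. The sample addresses, counts, and file names are made up for illustration; both inputs must be sorted on the IPv4 key in field 1:

```shell
# Hypothetical sample inputs: IPv4, Count, Textfile name, sorted on field 1
printf '1.2.3.4 5 pageA.txt\n5.6.7.8 2 pageB.txt\n' > a.txt
printf '1.2.3.4 9 pageA.txt\n5.6.7.8 1 pageB.txt\n' > b.txt

# Join on the default key (field 1 of each file) and select the output
# columns explicitly: IPv4, Count 1 (file 1), Count 2 (file 2), Textfile name
join -o 1.1,1.2,2.2,1.3 a.txt b.txt
```

This prints "1.2.3.4 5 9 pageA.txt" and "5.6.7.8 2 1 pageB.txt", i.e. the four-column layout described above.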

A few of the IPv4's visited more than one Textfile name; and some IPv4's visited a single Textfile name multiple times. Therefore, some IPv4's appear up to twenty or more times in successive rows when the resulting file is sorted on the IPv4 column.

A total of thirty-two sets of domain visitor data were examined; some IPv4's visited twenty-nine of the thirty-two domains. Over a thousand IPv4's visited a single domain over a thousand times. The maximum was over 100,000 visits in the month examined.
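Repeated-IPv4 tallies like those can be pulled out of the joined file with the usual sort | uniq -c pipeline. A minimal sketch, assuming a joined file in the four-column layout above (the file name and sample rows here are hypothetical):

```shell
# Hypothetical joined file: IPv4, Count 1, Count 2, Textfile name
printf '1.2.3.4 5 9 pageA.txt\n1.2.3.4 2 1 pageB.txt\n5.6.7.8 3 3 pageA.txt\n' > joined.txt

# How many rows each IPv4 accounts for, busiest addresses first;
# the trailing awk strips uniq's count padding
cut -d' ' -f1 joined.txt | sort | uniq -c | sort -rn | awk '{print $1, $2}'
```

This prints "2 1.2.3.4" then "1 5.6.7.8", making the multi-row addresses easy to spot.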

Looking at the resulting files, I note that my earlier generated domain visit counts are inaccurate, but that is not the result of an inaccurate join.

More to come, it appears.
George Langford
