[EMAIL PROTECTED] wrote:

for all hits
cat access_log| awk '{ print $1 }' | sort | uniq -c | sort -gr | head


Disclaimer: this adds nothing of value to the actual conversation at hand.

Just a style preference, but I prefer cut to awk here, since it seems to be just the right tool for the particular job you're doing. In the example you give, the replacement cut command would be ... access_log | cut -d' ' -f1 | sort ... At first glance it also seems to me that awk, being a much more capable but correspondingly heavier tool, would probably be slower at the task.

I set up a somewhat artificial test to find out which one was more efficient. I took a mail log with about 9 million lines in it and cat'd it through each of the programs, throwing the output to /dev/null, and repeated the process three times to get a rough average.
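For anyone who wants to see the two variants side by side, here is a minimal sketch. The sample log lines and the function names (top_hosts_awk, top_hosts_cut) are made up for illustration, and it uses sort -nr rather than the sort -gr from the quoted command, since -n is the more portable flag.

```shell
# Build a tiny stand-in for access_log (hypothetical data, for illustration only).
sample_log=$(mktemp)
printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2005:00:00:01] "GET / HTTP/1.0" 200 512' \
  '10.0.0.2 - - [01/Jan/2005:00:00:02] "GET /a HTTP/1.0" 200 128' \
  '10.0.0.1 - - [01/Jan/2005:00:00:03] "GET /b HTTP/1.0" 404 64' \
  > "$sample_log"

# First field via awk, as in the quoted command.
top_hosts_awk() {
  awk '{ print $1 }' "$1" | sort | uniq -c | sort -nr | head
}

# First field via cut: -d' ' makes a single space the delimiter, -f1 keeps field 1.
top_hosts_cut() {
  cut -d' ' -f1 "$1" | sort | uniq -c | sort -nr | head
}

# Both calls should list the same hosts with the same counts.
top_hosts_awk "$sample_log"
top_hosts_cut "$sample_log"

# Remove the temporary file when you are done with it:
# rm -f "$sample_log"
```

One caveat: the two only agree when the first field is followed by a single space and lines do not start with whitespace. awk splits on runs of blanks and tabs, while cut splits on exactly the one delimiter character it is given.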

awk took about 54 seconds on average; cut took about 43. awk spent about 25.5 seconds processing in user space on each run; cut spent about 6.5. The 11-second difference in real time, and the larger difference in user time, suggest that awk is paying attention to the entire line when it reads it in, whereas cut shortcuts once it has achieved its goal of getting to the first space. The rest of the time is simply how slow the disks are. :) For comparison, it took an average of 38 seconds to do a "wc -l" of this file.
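A rough sketch of how a comparison like this could be rerun, assuming bash (whose time keyword can time shell functions). The function names are hypothetical, and the log path in the final comment is a placeholder, not the file used for the numbers above.

```shell
# Each helper cats the log through one tool and discards the output,
# mirroring the test described above.
extract_awk() { cat "$1" | awk '{ print $1 }' > /dev/null; }
extract_cut() { cat "$1" | cut -d' ' -f1 > /dev/null; }
count_lines() { wc -l < "$1" > /dev/null; }

# Run every helper three times against the same file and print timings.
# Assumes bash: `time` here is the shell keyword, which reports
# real, user, and sys time on stderr.
bench() {
  log=$1
  for run in 1 2 3; do
    for fn in extract_awk extract_cut count_lines; do
      echo "run $run: $fn" >&2
      time "$fn" "$log"
    done
  done
}

# Usage (placeholder path, substitute your own large log):
# bench /path/to/mail.log
```

Because the first pass also warms the disk cache, later runs may look faster than the first; averaging the three runs, as above, only partly smooths that out.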

So in short, even on really large inputs, it's not going to make more than 10-15 seconds' worth of difference. But if you're an efficiency nut, or dealing with ridiculous data sets, hopefully I've added one more tool to your bag of text-mangling tricks. :)

Aaron S. Joyner
--
TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
TriLUG Organizational FAQ  : http://trilug.org/faq/
TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
TriLUG PGP Keyring         : http://trilug.org/~chrish/trilug.asc
