Apache log munging

2008-10-08 Thread Joe Python
I have written a generator for an Apache log which returns two types of
information: the hostname and the filename requested.

The 'log' generator can be 'consumed' like this:

for r in log:
  print r['host'], r['filename']
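
For readers without the original code, a generator of this shape could be
sketched as below. This is only a guess at how 'log' might be built: the name
parse_log, the regular expression, and the sample lines are hypothetical, and
the pattern assumes the Apache Common Log Format. The sketch is written for
Python 3, unlike the Python 2 print statements used in the thread.

```python
import re

# Assumed pattern for the Apache Common Log Format:
# host ident user [time] "METHOD path protocol" status size
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*"')

def parse_log(lines):
    """Yield {'host': ..., 'filename': ...} for each line that matches."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # lines that do not match the assumed format are skipped
            yield {'host': m.group(1), 'filename': m.group(2)}

# Made-up sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [08/Oct/2008:13:55:00 -0400] "GET /a.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [08/Oct/2008:13:55:01 -0400] "GET /b.html HTTP/1.0" 200 128',
    'not a log line',
]

log = parse_log(sample)
for r in log:
    print(r['host'], r['filename'])
```

Because parse_log is a generator, it can be consumed only once, which is why
a single-pass solution matters here.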

I want to find the top 100 hosts (sorted in descending order of total
requests) as follows:

host  filename1  filename2 filename3 Total

hostA   6  9 45 110
hostC   4 4343  98
hostB   344 45  83

and so on.
Is there a fast way to do this without scanning the log file many times?
Thanks in advance.
- Jo
--
http://mail.python.org/mailman/listinfo/python-list


Re: Apache log munging

2008-10-08 Thread Joe Riopel
On Wed, Oct 8, 2008 at 1:55 PM, Joe Python [EMAIL PROTECTED] wrote:
> I want to find the top 100 hosts (sorted in descending order of total
> requests) as follows:
> Is there a fast way to do this without scanning the log file many times?

As you encounter a new host, add it to a dict (or another type of
collection), and if it is encountered again, use that host as the key to
retrieve the dict entry and increment its request count. You should
only have to read the file once.
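
The single-pass idea described above could look roughly like this. It is a
minimal sketch only: the sample records and the names sample_log and
requests_per_host are made up for illustration, and the code is written for
Python 3.

```python
# Hypothetical stand-in for the 'log' generator from the original post.
sample_log = [
    {'host': 'hostA', 'filename': 'a.html'},
    {'host': 'hostB', 'filename': 'b.html'},
    {'host': 'hostA', 'filename': 'b.html'},
    {'host': 'hostC', 'filename': 'a.html'},
    {'host': 'hostA', 'filename': 'a.html'},
    {'host': 'hostC', 'filename': 'c.html'},
]

requests_per_host = {}
for r in sample_log:
    host = r['host']
    if host in requests_per_host:
        requests_per_host[host] += 1   # seen before: increment its count
    else:
        requests_per_host[host] = 1    # first time this host appears

# Sort hosts by descending request count and keep at most the top 100.
top_hosts = sorted(requests_per_host, key=requests_per_host.get,
                   reverse=True)[:100]
for host in top_hosts:
    print(host, requests_per_host[host])
```

The log is read exactly once; the sort afterwards works on the per-host
totals, not on the log itself.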


Re: Apache log munging

2008-10-08 Thread Joe Python
I am currently using the following technique to get the info above:

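from collections import defaultdict

counts = defaultdict(int)    # (host, filename) -> number of requests
hosts = defaultdict(int)     # host -> total number of requests
filenames = set()            # every filename seen in the log

for r in log:
    counts[r['host'], r['filename']] += 1
    hosts[r['host']] += 1
    filenames.add(r['filename'])

# Walk the hosts in descending order of total requests
for host in sorted(hosts, key=hosts.get, reverse=True):
    for name in filenames:
        print host, counts[host, name]
    print hosts[host]
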
I was looking for a better option instead of building 'three' collections
to improve performance.
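
One possible alternative, sketched under assumptions, keeps a single nested
collection (one Counter per host) and picks the top hosts with
heapq.nlargest instead of sorting every host. The function name
top_hosts_table and the sample records are hypothetical, the code targets
Python 3, and whether it is actually faster than three flat dicts would need
to be measured on a real log.

```python
from collections import Counter, defaultdict
import heapq

# Hypothetical stand-in for the 'log' generator.
sample_log = [
    {'host': 'hostA', 'filename': 'a.html'},
    {'host': 'hostB', 'filename': 'b.html'},
    {'host': 'hostA', 'filename': 'b.html'},
    {'host': 'hostC', 'filename': 'a.html'},
    {'host': 'hostA', 'filename': 'a.html'},
    {'host': 'hostC', 'filename': 'c.html'},
]

def top_hosts_table(log, n=100):
    """Return (sorted filenames, top-n (host, Counter) pairs) in one pass."""
    per_host = defaultdict(Counter)   # host -> Counter of filename -> hits
    for r in log:
        per_host[r['host']][r['filename']] += 1
    # Rank hosts by their total request count, keeping only the n largest.
    ranked = heapq.nlargest(n, per_host.items(),
                            key=lambda item: sum(item[1].values()))
    # Column headers: every filename seen for any host.
    filenames = sorted(set().union(*per_host.values()))
    return filenames, ranked

filenames, ranked = top_hosts_table(sample_log)
print('host', *filenames, 'Total')
for host, hits in ranked:
    # A Counter returns 0 for filenames this host never requested.
    print(host, *(hits[f] for f in filenames), sum(hits.values()))
```

The per-host totals and the set of filenames are derived from the one
nested structure after the pass, so the loop over the log updates only a
single collection.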

- Jo

On Wed, Oct 8, 2008 at 2:15 PM, Joe Riopel [EMAIL PROTECTED] wrote:

> On Wed, Oct 8, 2008 at 1:55 PM, Joe Python [EMAIL PROTECTED] wrote:
>> I want to find the top 100 hosts (sorted in descending order of total
>> requests) as follows:
>> Is there a fast way to do this without scanning the log file many times?
>
> As you encounter a new host, add it to a dict (or another type of
> collection), and if it is encountered again, use that host as the key to
> retrieve the dict entry and increment its request count. You should
> only have to read the file once.
