Hello,

: An apache logfile entry looks like this:
:
: 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
: /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
: HTTP/1.1" 200 50 "-" "-"
:
: I want to extract 24 hrs of data based timestamps like this:
:
: [04/Nov/2009:04:02:10 +0000]
:
: I also need to do some filtering (eg I actually don't want
: anything with service.php), and I also have to do some
: substitutions - that's trivial other than not knowing the optimum
: place to do it? IE should I do multiple passes?

I wouldn't. With multiple passes, you spend decompression CPU, line
matching CPU and I/O several times over. I'd do it all at once.

: Or should I try to do all the work at once, only viewing each
: line once? Also what about reading from compressed files? The
: data comes in as 6 gzipped logfiles which expand to 6G in total.

There are standard modules for handling compressed data (gzip and
bz2).

I'd imagine that the other pythonistas on this list will give you
more detailed (and probably better) advice, but here's a sample of
how to use the gzip module, how to skip the lines containing the
'/service.php' string, and how to extract an epoch timestamp from
the datestamp field(s). You would pass the filenames to operate on
as arguments to this script.

  See optparse if you want fancier capabilities for option handling.

  See re if you want to match multiple patterns to ignore.

  See time (and datetime) for mangling time and date strings. Be
  forewarned, time zone issues will probably be a massive headache.
  Many others have been here before [0].

  Look up itertools (and be prepared for some study) if you want
  the output from the log files from your different servers sorted
  in the output.

Note that the snippet below is a toy and makes no attempt to trap
(try/except) any error conditions.

If you are looking for a weblog analytics package once you have
reassembled the files into a whole, perhaps you could just start
there (e.g. webalizer and analog are two old-school packages that
come to mind for processing logs that have been produced in a
Common Log Format).

I will echo Alan Gauld's sentiments of a few minutes ago and note
that there are probably many different Apache log parsers out there
which can accomplish what you hope to accomplish. On the other hand,
you may be using this as an excuse to learn a bit of Python.

Good luck,

-Martin

 [0] http://seehuhn.de/blog/52

Sample:

    import gzip
    import sys
    import time

    for filename in sys.argv[1:]:
        print("About to open %s" % filename, file=sys.stderr)
        # "rt" yields text lines rather than bytes
        with gzip.open(filename, "rt") as f:
            for line in f:
                # skip anything that mentions /service.php
                if '/service.php' in line:
                    continue
                fields = line.split()
                # -- ignoring time zone; you are logging in UTC, right?
                # tz = fields[4]
                # note: time.mktime() interprets the parsed time as local time
                d = int(time.mktime(time.strptime(fields[3], "[%d/%b/%Y:%H:%M:%S")))
                print(d, line, end="")

-- 
Martin A. Brown
http://linux-ip.net/
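
The sample above stops at printing an epoch timestamp; the original question also asked for a 24-hour extract. Here is one possible sketch of that step. It assumes every line carries a bracketed Common Log Format timestamp with a numeric offset (as in the quoted entry) and that month names are in English. The names parse_timestamp, wanted_lines and main, and the window start date, are made up for illustration and are not part of any library.

```python
import gzip
import sys
from datetime import datetime, timedelta, timezone

# Matches e.g. "04/Nov/2009:04:02:10 +0000"; %z keeps the UTC offset.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_timestamp(line):
    """Return a timezone-aware datetime for the first [...] field in line."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], LOG_TIME_FORMAT)


def wanted_lines(lines, window_start, hours=24, skip="/service.php"):
    """Yield lines without `skip` whose timestamp falls in the window.

    The window is half-open: window_start <= timestamp < window_start + hours.
    """
    window_end = window_start + timedelta(hours=hours)
    for line in lines:
        if skip in line:
            continue
        if window_start <= parse_timestamp(line) < window_end:
            yield line


def main(filenames):
    # Hypothetical window: the 24 hours starting 04 Nov 2009 00:00 UTC.
    window_start = datetime(2009, 11, 4, tzinfo=timezone.utc)
    for name in filenames:
        # One pass per file: decompress, filter and write as we go.
        with gzip.open(name, "rt", errors="replace") as f:
            sys.stdout.writelines(wanted_lines(f, window_start))
```

Calling main(sys.argv[1:]) would run it over the gzipped files named on the command line. Because the parsed datetimes carry their own offset, the comparison does not depend on the local time zone of the machine doing the processing. Like the sample, it makes no attempt to trap malformed lines.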