[PHP] Parsing through an Apache Log File?

2006-01-04 Thread Jay Paulson (CE CEN)
Hello everyone!  I've been given the responsibility of coding an Apache 
access_log parser.  My task is to return the number of hits for certain file 
extensions on certain dates from specific IP addresses.

As of now I'm only going back 7 days in the log looking for this information 
and I'm only looking for 5 file types (.doc, .pdf, .html, .php, and .flv).  I'm 
using the fgets() function so I can read the file line by line, do the 
matches I need, and increment the counters as needed.  Right now I have 3 
nested loops looking for everything, which doesn't seem like the best way of 
doing this.  I've also found that a line may contain the file extension I want 
only because that file is the source (referer) of a request for another file.  
(see below for example)

Log file example:
I want the first line but not the second line.  The second line is a request 
for a .css file that was used by the .html file, so I don't want this line.  I 
do want the first line, where the only file mentioned is the .html file itself.

10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] GET /home.html HTTP/1.1 200 8220 
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] GET /styles/redesign.css 
HTTP/1.1 200 2381 http://wfmu.wfm.pvt/home.html; Mozilla/4.0 (compatible; 
MSIE 6.0; Windows NT 5.1; SV1)

At any rate, here's some of my pseudo code/code for what I'm trying to 
accomplish.  I know there has to be a better way to do this and I'm looking for 
suggestions!


//path to log file
$path = "./";
//name of log file
$log_filename = "access_log";

if (!file_exists($path.$log_filename)) {
    echo "file does not exist!";
    die;
}

//open log file
if (!$handle = fopen($path.$log_filename, "r")) {
    echo "error in loading file!";
    die;
}

//get date range from past 7 days put into array for comparison of log file
$dates = array();
$days = 7;
for ($i = 1; $i <= $days; $i++) {
    $dates[] = date("d/M/Y", strtotime("-$i day"));
}

//get document types that need to match
$docs = array();
$docs[] = ".doc";
$docs[] = ".pdf";
$docs[] = ".html";
$docs[] = ".htm";
$docs[] = ".php";
$docs[] = ".flv";

$contents = "";
while (!feof($handle)) {
    $line = fgets($handle);
    //look to see if the line has a date we are looking for
    foreach ($dates as $date) {
        //if date is in the line look for the doc type we want
        //(strpos returns 0 for a match at the start, so compare with false)
        if (strpos($line, $date) !== false) {
            //look to see if the line has the doc type we want
            foreach ($docs as $doc) {
                //if the line has the doc type we want then grab the region
                //and increment the counter for page hit
                //make sure to break out of the loops once found
                //need to add functionality for lines that have file extensions
                //that are not wanted but also have file extensions that are wanted
                if (strpos($line, $doc) !== false) {
                    //TODO: grab the region and increment the counter here
                    break;
                } //end if
            } //end foreach ($docs as $doc)
            break;
        } //end if
    } //end foreach ($dates as $date)
}


//close log file
fclose($handle);

Thanks!
Jay
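
One way to avoid both the nested loops and the referer false positives is to pull the IP, date, and requested path out of each line with a single regular expression, then test only the extension of the requested path. The following is a minimal sketch, assuming lines shaped like the two samples above (the optional quote in the pattern also allows for a quoted request field). The names parse_hit and count_hits are hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: extract IP, date and requested-file extension
// from one access log line. Returns null if the line does not match.
function parse_hit(string $line): ?array
{
    // IP, two ignored fields, [date:time zone], optional quote, method, path
    $pattern = '#^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}):[^\]]+\] "?\S+ (\S+)#';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // Drop any query string, then take the extension of the requested path only,
    // so a wanted extension in the referer field is never looked at.
    $path = (string) parse_url($m[3], PHP_URL_PATH);
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return ['ip' => $m[1], 'date' => $m[2], 'ext' => $ext];
}

// Hypothetical helper: count hits keyed by "date ip .ext" in a single pass.
function count_hits(iterable $lines, array $dates, array $extensions): array
{
    // array_flip turns the lists into lookup tables, replacing the inner loops
    $wantedDates = array_flip($dates);
    $wantedExts = array_flip($extensions);
    $counts = [];
    foreach ($lines as $line) {
        $hit = parse_hit($line);
        if ($hit === null
            || !isset($wantedDates[$hit['date']])
            || !isset($wantedExts[$hit['ext']])) {
            continue;
        }
        $key = $hit['date'] . ' ' . $hit['ip'] . ' .' . $hit['ext'];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    return $counts;
}

$sample = [
    '10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] GET /home.html HTTP/1.1 200 8220 - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
    '10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] GET /styles/redesign.css HTTP/1.1 200 2381 http://wfmu.wfm.pvt/home.html; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
];
$counts = count_hits($sample, ['01/Jan/2006'], ['doc', 'pdf', 'html', 'htm', 'php', 'flv']);
print_r($counts);
```

For the two sample lines this counts the /home.html request and skips the .css one, even though home.html appears in its referer. To cover the last 7 days, the $dates array built in the code above could be passed as the second argument, with lines fed in from fgets().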


Re: [PHP] Parsing through an Apache Log File?

2006-01-04 Thread John Nichel

Save yourself a ton of work.  Dump the raw logs into a db, and you can 
do all the queries on the db.  Something like this...


CREATE TABLE `rawLogs` (
  `ipAddress` int(15) NOT NULL default '0',
  `rfcIdentity` varchar(32) NOT NULL default '',
  `apacheUser` varchar(32) NOT NULL default '',
  `date` int(15) NOT NULL default '0',
  `request` longtext NOT NULL,
  `statusCode` varchar(32) NOT NULL default '',
  `sizeBytes` int(11) NOT NULL default '0',
  `referer` longtext NOT NULL,
  `userAgent` longtext NOT NULL,
  KEY `ipAddress` (`ipAddress`),
  FULLTEXT KEY `search` (`request`,`referer`,`userAgent`)
) ENGINE=MyISAM;

--
John C. Nichel IV
Programmer/System Admin (ÜberGeek)
Dot Com Holdings of Buffalo
716.856.9675
[EMAIL PROTECTED]

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



RE: [PHP] Parsing through an Apache Log File?

2006-01-04 Thread Jay Paulson (CE CEN)
A few questions about this approach.  I can see the advantages of putting the 
raw log file into a database, but I would still need to parse the file and 
pull out the information for each column.  I'm also not quite sure what some 
of your fields are for.  What is 'rfcIdentity'?  Why would I need an 
'apacheUser' as well?  And I'm not sure how I would get this information in 
easily, given the massive number of inserts I would have to do for a 10 MB 
log file.

jay


RE: [PHP] Parsing through an Apache Log File?

2006-01-04 Thread Jay Paulson (CE CEN)
I took your idea and did a search on Google and found that this has already 
been done for me!  Check it out!

http://www.php-scripts.com/php_diary/012103.php3

Very cool :)

jay


Re: [PHP] Parsing through an Apache Log File?

2006-01-04 Thread John Nichel

Jay Paulson (CE CEN) wrote:

> A few questions with this train of thought.  I can see the advantages of 
> putting the raw log file into a database but I would still need to parse the 
> file and get the information out of it for each column.


Correct, but putting it into a db, you only have to parse the file once 
instead of every time you want to sort your data.



> I'm also not quite sure what some of your fields are for 'rfcIdentity'??  What 
> is that?  Why would I need an 'apacheUser' also?


In the output example of your logs, it looks as if you're using the format 
of Apache logs which contains this data (the two dashes after the IP). 
Most of the time, that's what they will be: dashes, no data.  Look here:


http://httpd.apache.org/docs/1.3/logs.html


> Anyway, not too sure how I would get this information in an easy way for the 
> massive amounts of inserts I would have to do on a 10 meg log file.


Script it.  Just like you're parsing each line right now, but split the 
line on the tab (I assume that's your separator), and you'll have an 
array of the values in that line.  Use that array to insert your values. 
I do this with daily logs on our sites (some of the files are over 
100 MB).  I also convert the IP and date into integers for easier 
searching before inserting them into the db.  YMMV.
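
The split-and-convert step described above might look like the following. This is a sketch only: it assumes a tab-separated line with nine fields in the same order as the rawLogs columns, which the thread does not confirm, and log_line_to_row is a hypothetical helper name. ip2long() and DateTime::createFromFormat() are standard PHP functions.

```php
<?php
// Hypothetical helper: turn one tab-separated log line into a row for rawLogs.
// The field order is an assumption, not something the thread specifies.
function log_line_to_row(string $line): ?array
{
    $fields = explode("\t", rtrim($line, "\r\n"));
    if (count($fields) !== 9) {
        return null; // not the layout this sketch expects
    }
    [$ip, $identity, $user, $date, $request, $status, $bytes, $referer, $agent] = $fields;

    // Dotted-quad IP to integer, e.g. for an indexed integer column
    $ipInt = ip2long($ip);
    // Apache writes dates like [01/Jan/2006:07:33:18 -0600]; trim the brackets
    $when = DateTime::createFromFormat('d/M/Y:H:i:s O', trim($date, '[]'));
    if ($ipInt === false || $when === false) {
        return null;
    }

    return [
        'ipAddress'   => $ipInt,
        'rfcIdentity' => $identity,
        'apacheUser'  => $user,
        'date'        => $when->getTimestamp(), // Unix timestamp
        'request'     => $request,
        'statusCode'  => $status,
        'sizeBytes'   => $bytes === '-' ? 0 : (int) $bytes,
        'referer'     => $referer,
        'userAgent'   => $agent,
    ];
}

$line = implode("\t", [
    '10.25.40.64', '-', '-', '[01/Jan/2006:07:33:18 -0600]',
    'GET /home.html HTTP/1.1', '200', '8220', '-', 'Mozilla/4.0',
]);
$row = log_line_to_row($line);
var_dump($row['ipAddress'], $row['date']);
```

Each returned array maps column names to values, so it can be bound directly to a prepared INSERT statement. Lines that do not fit the assumed layout come back as null and can be skipped or logged.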


Once you have them in the db, it's easy to run your queries on that 
table (or break the data up into other tables for different search 
criteria).  On our system, I dump the raw log table every month (because 
it's already been broken down to other tables and better normalized), as 
trying to put two months of data into it would put it beyond the 4 GB 
limit on our system.
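
As an illustration of the kind of query this allows, the sketch below builds a parameterized statement that counts hits per IP for a set of file extensions since a given time. It assumes the rawLogs table from earlier in the thread, a Unix timestamp in the date column, and request strings shaped like "GET /home.html HTTP/1.1"; build_hits_query is a hypothetical helper.

```php
<?php
// Hypothetical helper: build SQL plus bound parameters for counting hits
// per IP address. Column and table names follow the rawLogs schema above.
function build_hits_query(array $extensions, int $since): array
{
    if ($extensions === []) {
        throw new InvalidArgumentException('At least one extension is required');
    }
    $likes = [];
    $params = [$since];
    foreach ($extensions as $ext) {
        $likes[] = '`request` LIKE ?';
        // Matches a requested path ending in the extension, followed by the protocol
        $params[] = '%.' . $ext . ' HTTP/%';
    }
    $sql = 'SELECT `ipAddress`, COUNT(*) AS hits FROM `rawLogs`'
         . ' WHERE `date` >= ? AND (' . implode(' OR ', $likes) . ')'
         . ' GROUP BY `ipAddress`';
    return [$sql, $params];
}

[$sql, $params] = build_hits_query(['pdf', 'html'], strtotime('-7 days'));
echo $sql, "\n";
```

The returned pair can be handed to PDO::prepare() and PDOStatement::execute(). Requests that carry a query string would not match this LIKE pattern, so treat it as a starting point rather than a complete filter.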


If this is just a one time thing you're looking to do, all of this may 
be over the top.  However, if the bosses are going to want to review 
this data month in and month out, I think the time spent doing something 
like this will be worth it.


--
John C. Nichel IV
Programmer/System Admin (ÜberGeek)
Dot Com Holdings of Buffalo
716.856.9675
[EMAIL PROTECTED]




Re: [PHP] Parsing through an Apache Log File?

2006-01-04 Thread John Nichel

Jay Paulson (CE CEN) wrote:

> I took your idea and did a search on Google and found that this has already 
> been done for me!  Check it out!
>
> http://www.php-scripts.com/php_diary/012103.php3
>
> Very cool :)


This is the script I wrote when we first started this project a few 
months ago to parse the 2+ years of log files, and initially get them 
into the db.  If you want to use parts of it, feel free.


http://john.nichel.net/parse.phps

--
John C. Nichel IV
Programmer/System Admin (ÜberGeek)
Dot Com Holdings of Buffalo
716.856.9675
[EMAIL PROTECTED]




RE: [PHP] Parsing through an Apache Log File?

2006-01-04 Thread Jay Paulson (CE CEN)
> If this is just a one time thing you're looking to do, all of this may be 
> over the top.  However, if the bosses are going to want to review this data 
> month in and month out, I think the time spent doing something like this will 
> be worth it.

As of now I've got it working and inserting the data into the database!  I did 
see your code, and since you are being so generous as to let me use it, I'll 
probably tweak it (a very little bit!) as we are going to be using this script 
once a week to read the log files.  We are using it to get some numbers so we 
can make our own custom stats, which are based on a lot more numbers; this is 
just one part of the number gathering.

Thanks so much for your help!

jay