[PHP-DB] Re: Real Killer App!

2003-03-12 Thread Nelson Goforth
Do you lock out the URLs that have already been indexed?  I'm 
wondering if your system is going into an endless loop?



Re: [PHP-DB] Re: Real Killer App!

2003-03-12 Thread Nicholas Fitzgerald
Well, I'm not locking them out exactly, but for good reason. When a URL 
is first submitted it goes into the database with a checksum value of 0 
and a date of 0000-00-00. If the checksum is 0, the spider will process 
that URL and update the record with the proper info. If the checksum is 
not 0, it checks the date. If the date is past the date for 
reindexing, it goes ahead and updates the record; it also checks 
against the checksum to see whether the URL has changed, in which case 
it updates.
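
Roughly, the logic per record looks like this (a simplified sketch, not 
the actual code; the table layout and the index_page() helper are 
illustrative stand-ins):

function maybe_index($row)
{
    // $row would come from something like:
    //   SELECT url, checksum, next_index FROM pages
    if ($row['checksum'] == 0) {
        // Fresh submission: checksum 0 and date 0000-00-00 mean the
        // URL has never been indexed.
        index_page($row['url']); // hypothetical indexing helper
        return;
    }
    if ($row['next_index'] <= date('Y-m-d')) {
        // Reindex date has passed: re-fetch and compare checksums to
        // see whether the page actually changed.
        $content = file_get_contents($row['url']); // PHP >= 4.3
        if (crc32($content) != $row['checksum']) {
            index_page($row['url']); // page changed, update the record
        }
    }
}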

It does look like it's going into an endless loop, but the strange thing 
is that it goes through the loop successfully a couple of times first. 
That's what's got me confused.

Nick

Nelson Goforth wrote:

Do you lock out the URLs that have already been indexed?  I'm 
wondering if your system is going into an endless loop?


RE: [PHP-DB] Re: Real Killer App!

2003-03-12 Thread Matthew Moldvan
Even if the system works correctly the first couple of times, any program
can still fall into an endless loop if the termination conditions aren't
specified correctly ...

I am very curious about this project ... is it open source?  If so, I'd be
interested in taking a look at how you implemented it.

Thanks,
Matthew Moldvan.

System Administrator,
Trilogy International, Inc.
http://www.trilogyintl.com/

-----Original Message-----
From: Nicholas Fitzgerald [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 12, 2003 7:58 AM
To: [EMAIL PROTECTED]
Subject: Re: [PHP-DB] Re: Real Killer App!





Re: [PHP-DB] Re: Real Killer App!

2003-03-12 Thread Nicholas Fitzgerald
I think at this point, as I've been working on it for the past few hours, 
I can safely say that it is NOT going into an endless loop. It's just 
dying at a certain point. Interestingly, it always seems to die at the 
same point. When indexing a particular site, I noticed that it was dying 
on a certain URL. I also noticed that it was trying to index some .jpg 
and .gif files. I put in a filter so it wouldn't do anything with those 
binary files, and then re-ran the app. Now, with a lot fewer files to go 
through and hence a lot less work to do, it still died on the same URL 
as before. I then cleared the database, put in that specific URL, and 
started there. It indexed that fine and did what it was supposed to do. 
No matter what, though, whether I start at the root of the site or 
anywhere else, it dies right there. I've also noticed that when it gets 
to the point where it's dying, the hard drive does a massive write. 
Interestingly, this happens on the boot drive, not the RAID array 
where the database, PHP, and the webserver live. I've got a fairly 
sizable swapfile on both drives, and 1.5 GB of memory. I can't imagine 
it's a memory problem, but you never know.
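
The filter itself is just an extension check; in sketch form (the 
extension list is illustrative, not the actual code, and the memory 
logging assumes a PHP build with --enable-memory-limit, which not every 
4.3 build has):

// Sketch of an extension filter for skipping binary files.
function is_binary_url($url)
{
    $skip = array('jpg', 'jpeg', 'gif', 'png', 'zip', 'exe', 'pdf');
    $dot  = strrpos($url, '.');
    if ($dot === false) {
        return false; // no extension at all; let the spider try it
    }
    $ext = strtolower(substr($url, $dot + 1));
    return in_array($ext, $skip);
}

// Inside the crawl loop, logging memory once per pass would test the
// "never know" theory: a steady climb toward memory_limit (8M by
// default in PHP 4) would explain the die-off and the swapping.
if (function_exists('memory_get_usage')) {
    error_log('mem: ' . memory_get_usage() . " bytes at $url");
}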

I believe I have the right conditions specified, but I do plan to go and 
review all of that both in the app and in the server environment. As for 
the status of the code, I'm not sure yet. I need to make something from 
this, and I haven't quite figured out yet how people make money from 
open source without charging a fortune for support. I'd rather charge 
less up front and support it for free, but we'll see what happens.

Nick

Matthew Moldvan wrote:

Even if the system works correctly the first couple of times, any program
can still fall into an endless loop if the termination conditions aren't
specified correctly ...
I am very curious about this project ... is it open source?  If so, I'd be
interested in taking a look at how you implemented it.


[PHP-DB] Re: Real Killer App!

2003-03-12 Thread Benjamin Walling
Are you doing some kind of recursion and getting stuck or overflowing the
stack?

If you create something like:

function Factorial($x)
{
    // Base case: stop recursing. Testing <= 1 also guards against zero
    // or negative input, which would otherwise recurse forever.
    if ($x <= 1) {
        return 1;
    }
    // Each recursive call adds a stack frame until the base case is hit.
    return $x * Factorial($x - 1);
}


With a sufficiently high value for $x, you can overflow the call stack.
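
For a crawler, the usual way around that is to make the loop iterative: 
keep a work queue and push newly found links onto it instead of 
recursing. A minimal sketch (the link-extraction regex stands in for a 
real parser, and index_page() is a hypothetical placeholder):

// Iterative crawl: the queue grows instead of the call stack, so depth
// is bounded by memory rather than by recursion limits.
function crawl($startUrl)
{
    $queue = array($startUrl);
    $seen  = array($startUrl => true); // queued once, never re-queued

    while (count($queue) > 0) {
        $url  = array_shift($queue);
        $html = @file_get_contents($url); // PHP >= 4.3
        if ($html === false) {
            continue; // fetch failed; move on
        }
        // index_page($url, $html); // hypothetical: store page info

        // Crude href extraction; a stand-in for a real HTML parser.
        if (preg_match_all('/href="([^"]+)"/i', $html, $m)) {
            foreach ($m[1] as $link) {
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[]     = $link; // enqueue instead of recursing
                }
            }
        }
    }
}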

Nicholas Fitzgerald [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 I'm having a heck of a time trying to write a little web crawler for my
 intranet. I've got everything functionally working, it seems, but
 there is a very strange problem I can't nail down. If I put in an entry
 and start the crawler, it goes great through the first loop. It gets the
 URL, gets the page info, puts it in the database, and then parses all of
 the links out and puts them raw into the database. On the second loop it
 picks up all the new stuff and does the same thing. By the time the
 second loop is completed I'll have just over 300 items in the database.
 The third loop is where the problem starts. Once it gets into the
 third loop, it starts to slow down a lot. Then, after a while, if I'm
 running from the command line, it just drops back to a command prompt.
 If I'm running in a browser, it returns a "document contains no data"
 error. This is with PHP 4.3.1 on a Win2000 server. I haven't tried it
 on a Linux box yet, but I'd rather run it on the Windows server since
 it's bigger and has plenty of CPU, memory, and RAID space. It's almost
 like the thing is getting confused once it has more than 300 entries in
 the database. Any ideas out there as to what would cause this kind of
 problem?

 Nick





-- 
PHP Database Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php