[PHP-DB] Re: Real Killer App!
Do you lock out the URLs that have already been indexed? I'm wondering if your system is going into an endless loop?

-- PHP Database Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DB] Re: Real Killer App!
Well, I'm not locking them out exactly, but for good reason. When a URL is first submitted it goes into the database with a checksum value of 0 and a date of 0000-00-00. If the checksum is 0, the spider processes that URL and updates the record with the proper info. If the checksum is not 0, it checks the date. If the date is past the reindex date, it goes ahead and updates the record; it also compares against the stored checksum to see if the URL's content has changed, in which case it updates as well. It does look like it's going into an endless loop, but the strange thing is that it goes through the loop successfully a couple of times first. That's what's got me confused.

Nick

Nelson Goforth wrote: Do you lock out the URLs that have already been indexed? I'm wondering if your system is going into an endless loop?
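The decision logic described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the poster's actual code; the function name and the date-string arguments are assumptions.

```php
<?php
// Sketch of the reindex decision described in the message above.
// A checksum of 0 marks a never-indexed URL (stored with date 0000-00-00);
// otherwise the stored reindex date decides whether the spider revisits it.
function shouldIndex($checksum, $reindexDate, $today)
{
    if ($checksum == 0) {
        // Freshly submitted URL: never been processed.
        return true;
    }
    // ISO dates (YYYY-MM-DD) compare correctly as strings.
    if ($reindexDate <= $today) {
        // Reindex date has passed: process it again.
        return true;
    }
    // Already indexed and not yet due: skip.
    return false;
}
```

On a revisit the spider would also compare the freshly computed checksum against the stored one to detect a changed page, as the message describes.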
RE: [PHP-DB] Re: Real Killer App!
Even if the system works correctly the first couple of times, any program can fall into an endless loop if the exit conditions aren't specified correctly. I am very curious about this project ... is it open source? If so, I'd be interested in taking a look at how you implemented it.

Thanks, Matthew Moldvan. System Administrator, Trilogy International, Inc. http://www.trilogyintl.com/

-Original Message- From: Nicholas Fitzgerald [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 12, 2003 7:58 AM To: [EMAIL PROTECTED] Subject: Re: [PHP-DB] Re: Real Killer App!
Re: [PHP-DB] Re: Real Killer App!
I think at this point, as I've been working on it the past few hours, I can safely say that it is NOT going into an endless loop. It's just dying at a certain point. Interestingly, it always seems to die at the same point. When indexing a particular site, I noticed that it was dying on a certain URL. I also noticed that it was trying to index some .jpg and .gif files. I put in a filter so it wouldn't do anything with those binary files, and then re-ran the app. Now, with a lot fewer files to go through and hence a lot less work to do, it still died on the same URL as before. I then cleared the database, put in that specific URL, and started there. It indexed that fine and did what it was supposed to do. No matter what, though, if I start at the root of the site or anywhere else, it dies right there.

I've also noticed that when it gets to the point where it's dying, the hard drive does a massive write. Interestingly, this happens on the boot drive, not the RAID array where the database, PHP, and the web server live. I've got a fairly sizable swapfile on both drives and 1.5 GB of memory. I can't imagine it's a memory problem, but you never know. I believe I have the right conditions specified, but I do plan to review all of that, both in the app and in the server environment.

As for the status of the code, I'm not sure yet. I need to make something from this, and I haven't quite figured out how people make money from open source without charging a fortune for support. I'd rather charge less up front and support it for free, but we'll see what happens.

Nick

Matthew Moldvan wrote: Even if the system is working correctly the first couple times, it may go into an endless loop if you do not specify the right conditions ...
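The binary-file filter mentioned above could look something like this. A minimal sketch: the function name and the exact list of skipped extensions are assumptions, not the poster's actual implementation.

```php
<?php
// Skip URLs whose path ends in a binary/image extension, so the spider
// only fetches and parses pages it can actually index as text.
function isBinaryUrl($url)
{
    // Extensions to skip; extend as needed (assumed list).
    $skip = array('jpg', 'jpeg', 'gif', 'png', 'zip', 'exe', 'pdf');

    $parts = parse_url($url);
    if (!isset($parts['path'])) {
        return false; // bare host, e.g. http://example.com
    }
    $ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
    return in_array($ext, $skip);
}
```

The spider would call this before fetching each candidate link and simply discard the URL when it returns true.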
[PHP-DB] Re: Real Killer App!
Are you doing some kind of recursion and getting stuck or overflowing the stack? If you create something like:

    function Factorial($x) {
        if ($x == 1) {
            return $x;
        } else {
            return $x * Factorial($x - 1);
        }
    }

you can run into a problem with overflowing the call stack for a sufficiently high value of $x.

Nicholas Fitzgerald [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

I'm having a heck of a time trying to write a little web crawler for my intranet. I've got everything functionally working, it seems, but there is a very strange problem I can't nail down. If I put in an entry and start the crawler, it goes great through the first loop: it gets the URL, gets the page info, puts it in the database, and then parses all of the links out and puts them raw into the database. On the second loop it picks up all the new stuff and does the same thing. By the time the second loop is completed I'll have just over 300 items in the database. The third loop is where the problem starts. Once it gets into the third loop, it starts to slow down a lot. Then, after a while, if I'm running from the command line, it'll just drop back to a command prompt. If I'm running in a browser, it returns a "document contains no data" error. This is with PHP 4.3.1 on a Win2000 server. I haven't tried it on a Linux box yet, but I'd rather run it on the Windows server since it's bigger and has plenty of CPU, memory, and RAID space. It's almost like the thing is getting confused when it starts to get more than 300 entries in the database. Any ideas out there as to what would cause this kind of problem?

Nick
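For comparison, the stack-overflow risk in the recursive example above goes away if the same computation is written as a loop. A minimal sketch (the function name is made up for illustration):

```php
<?php
// Iterative factorial: constant call-stack depth regardless of $x,
// so it cannot overflow the call stack the way deep recursion can.
// (The numeric result can still exceed PHP's integer range for large $x.)
function FactorialIter($x)
{
    $result = 1;
    for ($i = 2; $i <= $x; $i++) {
        $result *= $i;
    }
    return $result;
}
```

The same idea applies to a crawler: processing a work queue of pending URLs in a loop, rather than recursing into each discovered link, keeps the call stack shallow no matter how many pages are found.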