Hi, we've implemented a number of crawlers ourselves, and the benefit of using PHP is that they are easier to maintain when they are built in a language a wider range of people can use.
To solve your specific problems, build a queue system. Create a list table of the URLs you want to scrape; this may include details on how to log in to them and which method (POST or GET) to use. Then create a daemon by doing the following:

- create a cron job that runs every minute
- set_time_limit(58 + $time_of_curl_request + $a_bit);
- get the next URL from the list
- get its contents
- scrape out the data you need; this may include generating new list URLs
- remove the URL from the list table
- measure the time; if 58 seconds have passed, terminate

(A rough PHP sketch of this loop is at the bottom of this message.)

If you have to manage load on the target site and meet certain turnaround times for scraped data, you may also need a scheduler, which decides which URL gets scheduled for scraping, and when.

On Thu, Sep 2, 2010 at 2:20 PM, Dennis <[email protected]> wrote:
> Also, I listened to a conversation with Justin Wage and someone from
> Digg. They use daemons for various things (kind of what you need to
> do). They kept a counter for each daemon, and when the counter reached
> some magic number of tasks the daemon had been assigned and completed,
> it was killed and a new one started to replace it. This is more of a
> 'top level' garbage collection scheme IN PRODUCTION right now.
>
> On Sep 1, 11:17 am, pghoratiu <[email protected]> wrote:
>> Hi!
>>
>> My suggestion is to use PHP 5.3.x; it has improved garbage collection
>> and should help with reclaiming unused memory. You should also group
>> the code that is leaking into separate function(s); this way the PHP
>> runtime knows that it can release the memory for variables within the
>> scope.
>>
>> gabriel
>>
>> On Sep 1, 12:11 pm, "PieR." <[email protected]> wrote:
>>
>> > Hi,
>>
>> > I have a sfTask in CLI which uses a lot of foreach loops and
>> > preg_match calls, and unfortunately PHP returns an "Allowed memory
>> > size..." error within a few minutes.
>>
>> > I read that PHP clears the memory when a script ends, so I tried to
>> > run tasks inside the main task, but the problem remains.
>>
>> > How do I manage this memory issue? Clear the memory, or launch
>> > tasks in separate processes?
>>
>> > The final aim is to build a web crawler which runs many hours per
>> > day.
>>
>> > Thanks in advance for your help,
>>
>> > Regards,
>>
>> > Pierre
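As promised above, here is a rough sketch of the one-minute worker. It is only a sketch: the url_queue table, its columns (id, url, method), the link-extraction regex, and the PDO connection details are assumptions you would replace with your own schema and credentials.

    <?php
    // One run of the cron-driven worker: pull URLs from an assumed
    // url_queue(id, url, method) table until the 58-second budget is used.
    set_time_limit(0);                    // we enforce the budget ourselves
    $pdo   = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $start = time();

    while (time() - $start < 58) {
        // get the next URL from the list
        $stmt = $pdo->query('SELECT id, url, method FROM url_queue ORDER BY id LIMIT 1');
        $row  = $stmt ? $stmt->fetch(PDO::FETCH_ASSOC) : false;
        if (!$row) {
            break;                        // queue is empty, stop early
        }

        // get the contents
        $ch = curl_init($row['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        if ($row['method'] === 'POST') {
            curl_setopt($ch, CURLOPT_POST, true);
        }
        $html = curl_exec($ch);
        curl_close($ch);

        // scrape out the data you need; any new URLs found on the page
        // go back into the queue
        if ($html !== false && preg_match_all('#href="(https?://[^"]+)"#', $html, $m)) {
            $insert = $pdo->prepare("INSERT INTO url_queue (url, method) VALUES (?, 'GET')");
            foreach ($m[1] as $newUrl) {
                $insert->execute(array($newUrl));
            }
        }

        // remove the processed URL from the list table
        $pdo->prepare('DELETE FROM url_queue WHERE id = ?')->execute(array($row['id']));
    }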
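Dennis's counter trick bolts on easily. Below is a sketch of a worker that retires itself after a fixed number of tasks; get_next_task() and process_task() are hypothetical stand-ins for the queue logic above, and 500 is an arbitrary "magic number".

    <?php
    // Counter-based recycling, as Dennis describes: the worker exits after
    // a fixed number of completed tasks so the OS reclaims any leaked
    // memory, and the supervisor (cron here) starts a fresh process.
    define('TASKS_BEFORE_RESTART', 500);   // the "magic number" is arbitrary

    function get_next_task()
    {
        // hypothetical stand-in: would SELECT the next row from the queue table
        return null;
    }

    function process_task($task)
    {
        // hypothetical stand-in: would fetch, scrape, and delete the queue row
    }

    $completed = 0;
    while ($completed < TASKS_BEFORE_RESTART) {
        $task = get_next_task();
        if (!$task) {
            break;                         // queue empty (or sleep(1) and retry)
        }
        process_task($task);
        $completed++;
    }
    // reaching this point ends the daemon; the next cron run replaces it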
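And to illustrate pghoratiu's point about scoping: keeping the heavy foreach/preg_match work inside a function lets PHP release its locals on return, and on 5.3+ you can also trigger the cycle collector by hand. scrape_titles() and the URL list are invented for the example.

    <?php
    // pghoratiu's advice in practice: the leaky foreach/preg_match work
    // lives in its own function, so $html, $line and $m go out of scope
    // (and become collectable) as soon as it returns.
    function scrape_titles($html)
    {
        $titles = array();
        foreach (explode("\n", $html) as $line) {
            if (preg_match('#<title>(.*?)</title>#i', $line, $m)) {
                $titles[] = $m[1];
            }
        }
        return $titles;                    // locals are released here
    }

    $urls = array('http://www.example.com/');  // would come from the queue
    foreach ($urls as $url) {
        $titles = scrape_titles(file_get_contents($url));
        print_r($titles);
        unset($titles);                    // drop our own reference
        gc_collect_cycles();               // PHP 5.3+: free circular references
    }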
