Hi,

We've implemented a number of crawlers ourselves, and the benefit of
using PHP is that crawlers are easier to maintain when they are built
in a language a larger range of people can use.

To solve your specific problems, build a queue system.

Create a list table containing the URLs you want to scrape. This may
include details on how to log in to each site and which HTTP method
(POST, GET) to use.
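
A minimal sketch of such a table, assuming MySQL via PDO (the table
and column names are illustrative, not a required schema):

    <?php
    // Hypothetical queue table; adjust the columns to whatever your
    // target sites need.
    $pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS scrape_queue (
            id           INT AUTO_INCREMENT PRIMARY KEY,
            url          VARCHAR(2048) NOT NULL,
            http_method  ENUM('GET','POST') NOT NULL DEFAULT 'GET',
            login_params TEXT NULL,     -- serialized credentials/POST fields
            next_run_at  DATETIME NULL  -- optional hook for a scheduler
        )
    ");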

Then create a daemon by doing the following:

- create a cron job that runs every minute; on each run the script should:
  - set_time_limit(58 + $time_of_curl_request + $a_bit);
  - get the next URL from the list
  - get its contents
  - scrape out the data you need - this may include generating new list URLs
  - remove the URL from the list table
  - measure elapsed time; if 58 seconds have passed, terminate (see the
    sketch below)
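
A rough sketch of that worker in PHP (the scrape_queue table matches
the sketch above, and scrapePage() is a placeholder for your own
extraction code):

    <?php
    // Worker invoked by cron every minute; processes queue entries until
    // ~58 seconds have elapsed, then exits so the next cron run takes over.
    $pdo   = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $start = time();

    while (time() - $start < 58) {
        set_time_limit(58 + 30 + 5); // 58s budget + cURL timeout + a bit

        $row = $pdo->query('SELECT id, url FROM scrape_queue ORDER BY id LIMIT 1')
                   ->fetch(PDO::FETCH_ASSOC);
        if (!$row) {
            break; // queue is empty
        }

        $ch = curl_init($row['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html !== false) {
            // Scrape out the data you need; this may also insert newly
            // discovered URLs back into scrape_queue.
            scrapePage($pdo, $row['url'], $html);
        }

        $stmt = $pdo->prepare('DELETE FROM scrape_queue WHERE id = ?');
        $stmt->execute(array($row['id']));
    }

Because the process exits after each run, any memory it leaked is
returned to the OS, which also sidesteps the leak problem from the
original question.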


If you have to manage load on the target site and meet certain
turnaround times for scraped data, you may also need a scheduler,
which decides when each URL gets scheduled for scraping.

On Thu, Sep 2, 2010 at 2:20 PM, Dennis <[email protected]> wrote:
> Also, I listened to a conversation with Justin Wage and someone from
> Digg. They use daemons for various things (kind of what you need to
> do). They kept a counter for each daemon, and when the counter reached
> some magic number of times the daemon had been assigned and completed
> a task, it was killed and a new one was started to replace it. This is
> more of a 'top level' garbage collection scheme, IN PRODUCTION right
> now.
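
A sketch of the recycling pattern Dennis describes (the limit of 500
and the fetchNextTask()/handleTask() helpers are made-up placeholders,
not what Digg actually runs):

    <?php
    // The daemon exits after a fixed number of completed tasks, and a
    // supervisor (cron, a shell loop, etc.) starts a fresh process with
    // a clean memory footprint.
    $completed = 0;
    $maxTasks  = 500; // made-up "magic number"; tune for your workload

    while ($completed < $maxTasks) {
        $task = fetchNextTask(); // placeholder: pull work from your queue
        if ($task === null) {
            sleep(1);            // queue empty; wait and retry
            continue;
        }
        handleTask($task);       // placeholder: do the actual work
        $completed++;
    }
    exit(0); // dying here is the garbage collection step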
>
> On Sep 1, 11:17 am, pghoratiu <[email protected]> wrote:
>> Hi!
>>
>> My suggestion is to use PHP 5.3.x: it has improved garbage collection,
>> which should help with reclaiming unused memory. You should also
>> group the code that is leaking inside separate functions; this way
>> the PHP runtime knows that it can release the memory for the
>> variables within that scope.
>>
>>     gabriel
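
To illustrate gabriel's suggestion, something along these lines (the
extractLinks() function and the regex are invented for the example):

    <?php
    // Keep the heavy foreach/preg_match work inside a function so that
    // its local variables go out of scope, and can be freed, on every
    // return.
    function extractLinks($html)
    {
        $links = array();
        // Example scraping work; the pattern is illustrative only.
        if (preg_match_all('/<a href="([^"]+)"/', $html, $matches)) {
            $links = $matches[1];
        }
        return $links; // $matches is released when the function returns
    }

    // $pages stands in for however you obtain your fetched documents.
    foreach ($pages as $html) {
        $links = extractLinks($html);
        // ... store $links ...
        unset($links);
        gc_collect_cycles(); // PHP 5.3+: explicitly collect reference cycles
    }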
>>
>> On Sep 1, 12:11 pm, "PieR." <[email protected]> wrote:
>>
>> > Hi,
>>
>> > I have an sfTask run from the CLI which uses a lot of foreach loops
>> > and preg_match calls, and unfortunately PHP returns an "Allowed
>> > memory size...." error within a few minutes.
>>
>> > I read that PHP clears the memory when a script ends, so I tried to
>> > run tasks inside the main task, but the problem still remains.
>>
>> > How should I manage this memory issue? Clear the memory, or launch
>> > the tasks in separate processes?
>>
>> > The final aim is to build a web crawler which runs for many hours
>> > per day.
>>
>> > Thanks in advance for your help,
>>
>> > Regards,
>>
>> > Pierre
