Re: [MASSMAIL]Crawling focused only over seed file

Roannel Fernández Hernández Wed, 18 Nov 2015 06:28:56 -0800

Hi Andrés,

In your nutch-site.xml, set the property db.ignore.external.links to true.
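
As a minimal sketch, the entry would go inside the existing <configuration>
element of nutch-site.xml (normally under conf/, assuming a default Nutch
layout):

```xml
<!-- Sketch only: add inside the existing <configuration> element -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```

Note that, as far as I know, this only discards outlinks that point to a
different host than the page they were found on, so links within the seed
hosts can still be added to the crawldb.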


Regards

----- Original Message -----
> From: "Andrés Rincón Pacheco" <[email protected]>
> To: [email protected]
> Sent: Saturday, 14 November 2015 19:51:54
> Subject: [MASSMAIL]Crawling focused only over seed file
> 
> Hi,
> 
> I need to run Nutch focused only on the seed file, with no new URLs added
> in each cycle.
> 
> I am running Nutch in the following scenarios:
> 
> 1. Invoking the crawl script without the updatedb job: each cycle takes
> 15 minutes to run, but the URLs processed in every cycle are the same.
> The total Nutch run time is around 16 hours.
> Why are the URLs in every cycle the same?
> 
> 2. Normal crawling (using updatedb): if I use the updatedb job, how can I
> make Nutch fetch only the URLs in the seed file without adding new URLs to
> the crawldb?
> 
> I am trying to run Nutch using the updatedb job with -noAdditions. Is that
> what this option is for? I read the Nutch wiki, but the behavior of the
> -noAdditions option is not clear.
> 
> Conditions for every case: each cycle is configured to process 360 URLs.
> The seed file contains around 25000 URLs (the limit parameter in the crawl
> bash script is 25000 and sizeFetchlist is 360).
> 
> Thanks,
> 
> Andres
> 