Hi Andrés,

In your nutch-site.xml, set the property db.ignore.external.links to true.
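
For reference, here is a minimal sketch of what that entry could look like. It assumes the usual Hadoop-style property layout that nutch-default.xml uses and that your override file lives under conf/ (the path and the comments are my assumptions, not taken from the Nutch docs):

```xml
<!-- conf/nutch-site.xml (assumed location): overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <!-- true: outlinks that lead to a different host than the page they were found on are ignored -->
    <value>true</value>
  </property>
</configuration>
```

As far as I know, this only filters links that point to other hosts, so links found on the same hosts as your seed URLs can still be added to the crawldb.
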
Regards

----- Original message -----
> From: "Andrés Rincón Pacheco" <[email protected]>
> To: [email protected]
> Sent: Saturday, 14 November 2015 19:51:54
> Subject: [MASSMAIL]Crawling focused only over seed file
>
> Hi,
>
> I need to run Nutch focused only on the seed file, with no new URLs added
> in each cycle.
>
> I am running Nutch in the following scenarios:
>
> 1. Invoking the crawl script without the updatedb job: the execution time
> for every cycle is 15 minutes, but the URLs processed are the same in
> every cycle. The total time for the Nutch run is around 16 hours.
> Why are the URLs the same in every cycle?
>
> 2. Normal crawling (using updatedb): if I use the updatedb job, how can I
> make Nutch fetch only the URLs in the seed file without adding new URLs
> to the crawldb?
>
> I am trying to run Nutch using the updatedb job with -noAdditions. Is that
> what this option is for? I was reading the Nutch wiki, but the behavior of
> the -noAdditions option is not clear.
>
> Conditions for every case: the configuration processes 360 URLs in every
> cycle. The seed file contains around 25000 URLs.
> (The limit parameter in the crawl bash script is 25000 and sizeFetchlist
> is 360.)
>
> Thanks,
>
> Andres

