Hi Roannel,

I had that property configured already, but it did not solve the problem.
How can I avoid adding any newly discovered URLs during the fetch process? I
want Nutch to process only the URLs in the seed file.
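
For reference, this is roughly what I have in nutch-site.xml now, together
with the property that, as far as I can tell from nutch-default.xml, should
stop updatedb from adding new URLs at all (a sketch of my configuration):

  <!-- do not follow outlinks that point to other hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

  <!-- if false, updatedb only updates existing crawldb entries and
       does not add newly discovered URLs -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>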

Thanks.


2015-11-18 9:22 GMT-05:00 Roannel Fernández Hernández <[email protected]>:

> Hi Andrés,
>
> In your nutch-site.xml, set the property db.ignore.external.links to
> true.
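>
> For example, a minimal nutch-site.xml snippet (adjust to your setup):
>
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
>   </property>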
>
> Regards
>
> ----- Original Message -----
> > From: "Andrés Rincón Pacheco" <[email protected]>
> > To: [email protected]
> > Sent: Saturday, November 14, 2015 19:51:54
> > Subject: [MASSMAIL]Crawling focused only over seed file
> >
> > Hi,
> >
> > I need Nutch to stay focused on the seed file, with no new URLs added in
> > each cycle.
> >
> > I am executing nutch with the following scenarios:
> >
> > 1. Invoking the crawl script without the updatedb job: each cycle takes
> > 15 minutes, but every cycle processes the same URLs. The total execution
> > time for Nutch is around 16 hours. Why are the URLs the same in every
> > cycle?
> >
> > 2. Normal crawling (using updatedb): if I use the updatedb job, how can
> > Nutch fetch only the URLs in the seed file without adding new URLs to the
> > crawldb?
> >
> > I am trying to run Nutch with the updatedb job and the -noAdditions
> > flag. Is this option meant for that? I was reading the Nutch wiki, but
> > the behavior of the -noAdditions option is not clear to me.
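> >
> > This is roughly how I am invoking it (the crawldb and segment paths are
> > placeholders for my layout):
> >
> >   # update the crawldb from a fetched segment without adding
> >   # any newly discovered URLs
> >   bin/nutch updatedb crawl/crawldb crawl/segments/<segment> -noAdditions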
> >
> > Conditions for both cases: the configuration processes 360 URLs in each
> > cycle, and the seed file contains around 25000 URLs (the limit parameter
> > in the crawl bash script is 25000 and sizeFetchlist is 360).
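> >
> > Concretely, I launch something like this (a sketch of my setup; the third
> > argument is the limit, i.e. the maximum number of rounds):
> >
> >   # seed dir, crawl dir, limit (number of rounds)
> >   bin/crawl urls/seed crawl/ 25000
> >
> > and inside the crawl script the fetch list size is set with:
> >
> >   sizeFetchlist=360   # passed to the generate job as -topN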
> >
> > Thanks,
> >
> > Andres
> >
>
