Hi Roannel, the new URLs aren't from other domains; they are in the same
domain. We want the updatedb command to avoid updating the crawldb with new
URLs from the same site.
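
As far as we can tell, the -noAdditions option discussed earlier in this
thread corresponds to the db.update.additions.allowed property in
nutch-default.xml. A sketch of the override we would expect to put in
nutch-site.xml, assuming that property exists and behaves this way in our
Nutch 1.x version (the description text is our own wording):

```xml
<!-- nutch-site.xml (sketch): assumes db.update.additions.allowed from
     nutch-default.xml is available in the Nutch 1.x version in use -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb only updates URLs already in the crawldb
  and does not add newly discovered ones.</description>
</property>
```

If that assumption holds, the crawldb should stay limited to the injected
seed URLs, whether the outlinks point to the same site or to another one.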

Thanks,

Paul

2015-11-18 21:57 GMT-05:00 Andrés Rincón Pacheco <[email protected]>:

> Hi Roannel,
>
> I had the parameter configured previously, but this did not solve the problem.
> How can I avoid adding any newly discovered URLs during the fetch process? I
> want Nutch to process only the URLs in the seed file.
>
> Thanks.
>
>
> 2015-11-18 9:22 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
>
> > Hi Andrés,
> >
> > In your nutch-site.xml, change the property db.ignore.external.links to
> > true.
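> >
> > A minimal sketch of that entry, assuming the usual <property> layout of
> > nutch-site.xml (the description text is only illustrative wording):
> >
> > ```xml
> > <!-- nutch-site.xml (sketch): overrides the default from nutch-default.xml -->
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>true</value>
> >   <description>Ignore outlinks that lead to external hosts.</description>
> > </property>
> > ```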
> >
> > Regards
> >
> > ----- Original message -----
> > > From: "Andrés Rincón Pacheco" <[email protected]>
> > > To: [email protected]
> > > Sent: Saturday, 14 November 2015 19:51:54
> > > Subject: [MASSMAIL]Crawling focused only over seed file
> > >
> > > Hi,
> > >
> > > I need to run Nutch focused only on the seed file, with no more URLs
> > > added in each cycle.
> > >
> > > I am running Nutch in the following scenarios:
> > >
> > > 1. Invoking the crawl script without the updatedb job: the execution
> > > time for every cycle is 15 minutes, but the URLs processed in every
> > > cycle are the same. The total time for the Nutch execution is around
> > > 16 hours. Why are the URLs the same in every cycle?
> > >
> > > 2. Normal crawling (using updatedb): if I use the updatedb job, how can
> > > Nutch fetch only the URLs in the seed file without adding new URLs to
> > > the crawldb?
> > >
> > > I am trying to run Nutch using the updatedb job with -noAdditions. Is
> > > that what this option is for? I have read the Nutch wiki, but it is not
> > > clear how the -noAdditions option behaves.
> > >
> > > Conditions for every case: the configuration processes 360 URLs in
> > > every cycle. The seed file contains around 25000 URLs (the limit
> > > parameter in the crawl bash script is 25000 and sizeFetchlist is 360).
> > >
> > > Thanks,
> > >
> > > Andres
> > >
> >
>



-- 
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404
