Hi Paul

Add the scoring-depth plugin to your plugin.includes property and set the
scoring.depth.max property to 1.

See: https://issues.apache.org/jira/browse/NUTCH-1331 for more information.
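For reference, a sketch of the nutch-site.xml fragment that combines the suggestions from this thread. The plugin.includes value shown is only illustrative; keep your existing plugin list and simply append scoring-depth to it.

```xml
<!-- Sketch only: plugin list is illustrative, keep your own plugins
     and append scoring-depth to the existing plugin.includes value. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|scoring-depth</value>
</property>

<!-- Depth 1 restricts the crawl to the injected seed URLs (NUTCH-1331). -->
<property>
  <name>scoring.depth.max</name>
  <value>1</value>
</property>

<!-- Discards outlinks to other hosts; on its own this does not stop
     new same-domain URLs, which is why it did not help earlier. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```

Running updatedb with the -noAdditions flag is another way to keep newly discovered URLs out of the crawldb, as discussed below in the thread.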

Regards.

----- Original message -----
> From: "Paul Escobar" <[email protected]>
> To: [email protected]
> Sent: Wednesday, November 18, 2015 22:33:50
> Subject: Re: [MASSMAIL]Crawling focused only over seed file
> 
> Hi Roannel, the new URLs aren't from other domains; they are in the same
> domain. We want the updatedb command to avoid updating the crawldb with new
> URLs from the same site.
> 
> Thanks,
> 
> Paul
> 
> 2015-11-18 21:57 GMT-05:00 Andrés Rincón Pacheco <[email protected]>:
> 
> > Hi Roannel,
> >
> > I had that parameter configured previously, but it did not solve the
> > problem. How can I avoid adding any newly discovered URLs during the fetch
> > process? I want Nutch to process only the URLs in the seed file.
> >
> > Thanks.
> >
> >
> > 2015-11-18 9:22 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
> >
> > > Hi Andrés,
> > >
> > > Change the db.ignore.external.links property in your nutch-site.xml to
> > > true.
> > >
> > > Regards
> > >
> > > ----- Original message -----
> > > > From: "Andrés Rincón Pacheco" <[email protected]>
> > > > To: [email protected]
> > > > Sent: Saturday, November 14, 2015 19:51:54
> > > > Subject: [MASSMAIL]Crawling focused only over seed file
> > > >
> > > > Hi,
> > > >
> > > > I need Nutch to crawl only the URLs in the seed file, with no new URLs
> > > > added in each cycle.
> > > >
> > > > I am running Nutch in the following scenarios:
> > > >
> > > > 1. Invoking the crawl script without the updatedb job: each cycle takes
> > > > 15 minutes, but the URLs processed in every cycle are the same. The
> > > > total Nutch execution time is around 16 hours. Why are the URLs the
> > > > same in every cycle?
> > > >
> > > > 2. Crawling normally (using updatedb): if I use the updatedb job, how
> > > > can Nutch fetch only the URLs in the seed file without adding new URLs
> > > > to the crawldb?
> > > >
> > > > I am trying to run Nutch using the updatedb job with -noAdditions; is
> > > > this option suitable for this purpose? I was reading the Nutch wiki,
> > > > but the behavior of the -noAdditions option is not clear.
> > > >
> > > > Conditions for every case: the configuration processes 360 URLs in
> > > > every cycle. The seed file contains around 25000 URLs (the limit
> > > > parameter in the crawl bash script is 25000 and sizeFetchlist is 360).
> > > >
> > > > Thanks,
> > > >
> > > > Andres
> > > >
> > > November 13-14: 2015 Caribbean Finals of the ACM-ICPC Programming Contest
> > > https://icpc.baylor.edu/regionals/finder/cf-2015
> > >
> >
> 
> 
> 
> --
> Paul Escobar Mossos
> skype: paulescom
> phone: +57 1 3006815404
> 