Re: [MASSMAIL]Crawling focused only over seed file

Paul Escobar Fri, 20 Nov 2015 15:07:26 -0800

Hi Roannel, we will try it.

Thanks,


Paul

2015-11-19 10:08 GMT-05:00 Roannel Fernández Hernández <[email protected]>:

> Hi Paul
>
> Add the scoring-depth plugin to your plugin.includes property and set
> the scoring.depth.max property to 1.
>
> See: https://issues.apache.org/jira/browse/NUTCH-1331 for more
> information.
>
> Regards.
>
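
As a rough illustration of the suggestion above, the nutch-site.xml entries
might look like the sketch below. The plugin.includes value is an assumed
example (the actual list depends on the Nutch version and setup and is not
given in this thread); the only point is that scoring-depth is added to
whatever plugin list is already in use. Both properties go inside the
<configuration> element.

```xml
<!-- Sketch only: this plugin list is an assumption, not the configuration
     used in this thread. Keep your existing plugins and add scoring-depth. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-(opic|depth)|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- Maximum link depth, set to 1 as suggested in the reply above. -->
<property>
  <name>scoring.depth.max</name>
  <value>1</value>
</property>
```
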
> ----- Original message -----
> > From: "Paul Escobar" <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, 18 November 2015 22:33:50
> > Subject: Re: [MASSMAIL]Crawling focused only over seed file
> >
> > Hi Roannel, the new URLs aren't from other domains; they are in the same
> > domain. We want the updatedb command to avoid updating the crawldb with
> > new URLs from the same site.
> >
> > Thanks,
> >
> > Paul
> >
> > 2015-11-18 21:57 GMT-05:00 Andrés Rincón Pacheco <[email protected]>:
> >
> > > Hi Roannel,
> > >
> > > I had that parameter configured previously, but it did not solve the
> > > problem. How can I avoid adding any newly discovered URLs during the
> > > fetch process? I want Nutch to process only the URLs in the seed file.
> > >
> > > Thanks.
> > >
> > >
> > > 2015-11-18 9:22 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
> > >
> > > > Hi Andrés,
> > > >
> > > > In your nutch-site.xml, set the property db.ignore.external.links to
> > > > true.
> > > >
> > > > Regards
> > > >
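
For reference, a minimal sketch of that setting as it might appear inside the
<configuration> element of nutch-site.xml. As the follow-up messages in this
thread note, it only affects links to other hosts, so new URLs from the same
site are still added to the crawldb.

```xml
<!-- Ignore outlinks that point to external hosts. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```
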
> > > > ----- Original message -----
> > > > > From: "Andrés Rincón Pacheco" <[email protected]>
> > > > > To: [email protected]
> > > > > Sent: Saturday, 14 November 2015 19:51:54
> > > > > Subject: [MASSMAIL]Crawling focused only over seed file
> > > > >
> > > > > Hi,
> > > > >
> > > > > > I need Nutch to crawl only the URLs in the seed file, with no
> > > > > > more URLs added in each cycle.
> > > > >
> > > > > > I am running Nutch in the following scenarios:
> > > > > >
> > > > > > 1. Invoking the crawl script without the updatedb job: each cycle
> > > > > > takes 15 minutes to execute, but the URLs processed are the same
> > > > > > in every cycle. The total time for the Nutch execution is around
> > > > > > 16 hours. Why are the URLs the same in every cycle?
> > > > >
> > > > > > 2. Normal crawling (using updatedb): if I use the updatedb job,
> > > > > > how can I make Nutch fetch only the URLs in the seed file without
> > > > > > adding new URLs to the crawldb?
> > > > >
> > > > > > I am trying to run Nutch using the updatedb job with -noAdditions.
> > > > > > Is that what this option is for? I was reading the Nutch wiki, but
> > > > > > the behavior of the -noAdditions option is not clear.
> > > > >
> > > > > > Conditions for every case: the configuration processes 360 URLs
> > > > > > in every cycle. The seed file contains around 25000 URLs (the
> > > > > > limit parameter in the crawl bash script is 25000 and
> > > > > > sizeFetchlist is 360).
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andres
> > > > >
> > > > November 13-14: 2015 Caribbean Finals of the ACM-ICPC Programming Contest
> > > > https://icpc.baylor.edu/regionals/finder/cf-2015
> > > >
> > >
> >
> >
> >
> > --
> > Paul Escobar Mossos
> > skype: paulescom
> > phone: +57 1 3006815404
> >
>



-- 
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404
