Re: [MASSMAIL]Crawling focused only over seed file

Julien Nioche Fri, 27 Nov 2015 01:03:08 -0800

db.ignore.external.links filters the outlinks, keeping only the ones
from the same host (and now the same domain, see
https://issues.apache.org/jira/browse/NUTCH-2069). The one you probably
want is db.update.additions.allowed, set to false (the default, quoted
below, is true):


<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
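
To stop updatedb from adding new URLs, the override in nutch-site.xml could
look like the following. This is only a minimal sketch: it assumes
nutch-site.xml is where you keep local overrides of nutch-default.xml, and
the description text is my own wording rather than the shipped one.

<?xml version="1.0"?>
<configuration>
  <!-- Override the default (true) so updatedb only updates known URLs. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
    <description>Only URLs already in the CrawlDb (the injected seeds) are
    updated; newly discovered URLs are not added.
    </description>
  </property>
</configuration>

As far as I know, this has the same effect as passing -noAdditions to
updatedb, the option mentioned in the original question.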

On 20 November 2015 at 23:06, Paul Escobar <[email protected]>
wrote:

> Hi Roannel, we'll try it.
>
> Thanks,
>
> Paul
>
> 2015-11-19 10:08 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
>
> > Hi Paul
> >
> > Include in your plugin.includes property the scoring-depth plugin and set
> > the value 1 in your property scoring.depth.max.
> >
> > See: https://issues.apache.org/jira/browse/NUTCH-1331 for more
> > information.
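> >
> > A minimal sketch of those two settings in nutch-site.xml. The plugin list
> > shown here is only an assumed example; keep your own plugin.includes value
> > and make sure it also matches scoring-depth.
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <!-- assumed example list; scoring-(opic|depth) enables both scoring plugins -->
> >   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-(opic|depth)|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > <property>
> >   <name>scoring.depth.max</name>
> >   <value>1</value>
> > </property>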
> >
> > Regards.
> >
> > ----- Original Message -----
> > > From: "Paul Escobar" <[email protected]>
> > > To: [email protected]
> > > Sent: Wednesday, 18 November 2015 22:33:50
> > > Subject: Re: [MASSMAIL]Crawling focused only over seed file
> > >
> > > Hi Roannel, the new URLs aren't from other domains; they are in the same
> > > domain. We want the updatedb command to stop adding new URLs from the
> > > same site to the crawldb.
> > >
> > > Thanks,
> > >
> > > Paul
> > >
> > > 2015-11-18 21:57 GMT-05:00 Andrés Rincón Pacheco <[email protected]>:
> > >
> > > > Hi Roannel,
> > > >
> > > > I had that parameter configured already, but it did not solve the
> > > > problem. How can I avoid adding any newly discovered URLs during the
> > > > fetch process? I want Nutch to process only the URLs in the seed file.
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > 2015-11-18 9:22 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
> > > >
> > > > > Hi Andrés,
> > > > >
> > > > > In your nutch-site.xml, set the property db.ignore.external.links to
> > > > > true.
> > > > >
> > > > > Regards
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Andrés Rincón Pacheco" <[email protected]>
> > > > > > To: [email protected]
> > > > > > Sent: Saturday, 14 November 2015 19:51:54
> > > > > > Subject: [MASSMAIL]Crawling focused only over seed file
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I need to run Nutch over the seed file only, with no more URLs added
> > > > > > in each cycle.
> > > > > >
> > > > > > I am running Nutch in the following scenarios:
> > > > > >
> > > > > > 1. Invoking the crawl script without the updatedb job: each cycle
> > > > > > takes 15 minutes, but the URLs processed are the same in every cycle.
> > > > > > The total time for the Nutch run is around 16 hours.
> > > > > > Why are the URLs the same in every cycle?
> > > > > >
> > > > > > 2. Crawling normally (using updatedb): if I use the updatedb job, how
> > > > > > can I make Nutch fetch only the URLs in the seed file without adding
> > > > > > new URLs to the crawldb?
> > > > > >
> > > > > > I am trying to run Nutch using the updatedb job with -noAdditions. Is
> > > > > > that what this option is for? I have read the Nutch wiki, but the
> > > > > > behaviour of the -noAdditions option is not clear to me.
> > > > > >
> > > > > > Conditions in both cases: each cycle is configured to process 360
> > > > > > URLs. The seed file contains around 25000 URLs (the limit parameter
> > > > > > in the crawl bash script is 25000 and sizeFetchlist is 360).
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Andres
> > > > > >
> > > > > November 13-14: 2015 Caribbean Finals of the ACM-ICPC Programming
> > > > > Contest
> > > > > https://icpc.baylor.edu/regionals/finder/cf-2015
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Paul Escobar Mossos
> > > skype: paulescom
> > > telefono: +57 1 3006815404
> > >
> >
>
>
>
> --
> Paul Escobar Mossos
> skype: paulescom
> telefono: +57 1 3006815404
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
