db.ignore.external.links filters the outlinks and keeps only the ones from the same host (or, since https://issues.apache.org/jira/browse/NUTCH-2069, the same domain). The one you probably want is the following, set to false in your nutch-site.xml (shown here with its default value):
<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

On 20 November 2015 at 23:06, Paul Escobar <[email protected]> wrote:

> Hi Roannel, we will try it.
>
> Thanks,
>
> Paul
>
> 2015-11-19 10:08 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
>
> > Hi Paul
> >
> > Include the scoring-depth plugin in your plugin.includes property and
> > set the scoring.depth.max property to 1.
> >
> > See https://issues.apache.org/jira/browse/NUTCH-1331 for more
> > information.
> >
> > Regards.
> >
> > ----- Original message -----
> > > From: "Paul Escobar" <[email protected]>
> > > To: [email protected]
> > > Sent: Wednesday, 18 November 2015 22:33:50
> > > Subject: Re: [MASSMAIL]Crawling focused only over seed file
> > >
> > > Hi Roannel, the new URLs aren't from other domains, they are in the
> > > same domain. We want the updatedb command to avoid updating the
> > > crawldb with new URLs from the same site.
> > >
> > > Thanks,
> > >
> > > Paul
> > >
> > > 2015-11-18 21:57 GMT-05:00 Andrés Rincón Pacheco <[email protected]>:
> > >
> > > > Hi Roannel,
> > > >
> > > > I had the parameter configured previously, but this did not solve
> > > > the problem. How can I avoid adding any newly discovered URLs
> > > > during the fetch process? I want Nutch to process only the URLs
> > > > of the seed file.
> > > >
> > > > Thanks.
> > > >
> > > > 2015-11-18 9:22 GMT-05:00 Roannel Fernández Hernández <[email protected]>:
> > > >
> > > > > Hi Andrés,
> > > > >
> > > > > In your nutch-site.xml, change the property
> > > > > db.ignore.external.links to true.
> > > > >
> > > > > Regards
> > > > >
> > > > > ----- Original message -----
> > > > > > From: "Andrés Rincón Pacheco" <[email protected]>
> > > > > > To: [email protected]
> > > > > > Sent: Saturday, 14 November 2015 19:51:54
> > > > > > Subject: [MASSMAIL]Crawling focused only over seed file
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I need to run Nutch focused only on the seed file, with no
> > > > > > more URLs added in each cycle.
> > > > > >
> > > > > > I am running Nutch in the following scenarios:
> > > > > >
> > > > > > 1. Invoking the crawl script without the updatedb job: the
> > > > > > execution time for every cycle is 15 minutes, but the URLs
> > > > > > processed in every cycle are the same. The total time for
> > > > > > the Nutch execution is around 16 hours.
> > > > > > Why are the URLs in every cycle the same?
> > > > > >
> > > > > > 2. Crawling normally (using updatedb): if I am using the
> > > > > > updatedb job, how can I make Nutch fetch only the URLs of the
> > > > > > seed file without adding new URLs to the crawldb?
> > > > > >
> > > > > > I am trying to run Nutch using the updatedb job with
> > > > > > -noAdditions; is that what this option is for? I was reading
> > > > > > the Nutch wiki, but the behavior of the -noAdditions option
> > > > > > is not clear.
> > > > > >
> > > > > > Conditions for every case: the configuration used for
> > > > > > processing is 360 URLs in every cycle. The seed file contains
> > > > > > around 25000 URLs.
> > > > > > (The limit parameter in the crawl bash script is 25000 and
> > > > > > sizeFetchlist is 360.)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Andres
> > > > > >
> > > > > November 13-14: 2015 Caribbean Final of the ACM-ICPC Programming Contest
> > > > > https://icpc.baylor.edu/regionals/finder/cf-2015
> > >
> > > --
> > > Paul Escobar Mossos
> > > skype: paulescom
> > > phone: +57 1 3006815404

--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
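P.S. For reference, here is a minimal sketch of the override for a seed-only crawl. It assumes a standard nutch-site.xml with a <configuration> root. The only change from the default shown earlier is the value, and the description is omitted for brevity. As far as I know, passing -noAdditions to updatedb has the same effect for that single run, but treat that as an assumption and check it against your Nutch version.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml (sketch): overrides the default from nutch-default.xml -->
<configuration>
  <property>
    <name>db.update.additions.allowed</name>
    <!-- false: updatedb only updates URLs already in the CrawlDb,
         so the crawl stays limited to the injected seed URLs -->
    <value>false</value>
  </property>
</configuration>
```

With this in place, the updatedb step should still record fetch status for the seed URLs, while outlinks discovered during parsing are not added to the CrawlDb.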

