Hi Roannel,

After reviewing the URL filter configuration and the log, I found the following evidence in the log file:
crawl.Injector - Injector: Total number of urls rejected by filters: 1413
crawl.Injector - Injector: Total number of urls after normalization: 25146
crawl.Generator - Generator: topN: 26559

With these values, it is not possible to infer that the problem is related to the URL filters. Is there any other solution for this problem?

Thanks for your help.

2015-10-09 9:34 GMT-05:00 Roannel Fernández Hernández <[email protected]>:

> Hi Andres,
>
> Check your rules in the URL filters.
>
> Roannel
>
> ----- Original message -----
> > From: "Andrés Rincón Pacheco" <[email protected]>
> > To: [email protected]
> > Sent: Thursday, 8 October 2015 9:26:11
> > Subject: [MASSMAIL]Nutch only fetch and parse the third part of urls
> >
> > Hi,
> >
> > I am using Nutch 1.9. After reviewing the URLs added by the Injector, the
> > total number of URLs is 25146.
> > (Log evidence)
> > crawl.Injector - Injector: Total number of urls after normalization: 25146
> >
> > When I checked the log file, only 7003 URLs were fetched and 6727 URLs
> > were parsed.
> >
> > These are the statistics:
> >
> > CrawlDb statistics start: ../crawlInfo/crawldb
> > Statistics for CrawlDb: ../crawlInfo/crawldb
> > TOTAL urls: 30914
> > retry 0: 30913
> > retry 1: 1
> > min score: 0.0
> > avg score: 0.4359605
> > max score: 100.002
> > status 1 (db_unfetched): 23912
> > status 2 (db_fetched): 6727
> > status 3 (db_gone): 8
> > status 4 (db_redir_temp): 266
> > status 5 (db_redir_perm): 1
> > CrawlDb statistics: done
> >
> > Why are only about a third of the URLs fetched and parsed?
> >
> > Thanks.
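
As a rough cross-check of the figures quoted in this thread, the arithmetic can be sketched in a few lines of Python. All numbers are copied from the log lines and CrawlDb statistics above; the variable names are illustrative only and are not part of Nutch. The sketch only verifies that the counts are consistent with each other; it does not explain why the remaining URLs were not fetched.

```python
# Figures copied from the log lines and CrawlDb statistics quoted in this thread.
# Variable names are hypothetical; nothing here calls Nutch itself.
injected_after_normalization = 25146
rejected_by_filters = 1413
generator_top_n = 26559

status_counts = {
    "db_unfetched": 23912,
    "db_fetched": 6727,
    "db_gone": 8,
    "db_redir_temp": 266,
    "db_redir_perm": 1,
}
total_urls = 30914

# URLs seen by the Injector before filtering: accepted plus rejected.
seed_lines = injected_after_normalization + rejected_by_filters

# The per-status counts should add up to the reported TOTAL urls.
status_sum = sum(status_counts.values())

# Share of URLs rejected by the filters, and share actually fetched.
rejected_share = rejected_by_filters / seed_lines
fetched_share_of_injected = status_counts["db_fetched"] / injected_after_normalization
fetched_share_of_crawldb = status_counts["db_fetched"] / total_urls

print("accepted + rejected equals topN:", seed_lines == generator_top_n)
print("status counts add up to TOTAL urls:", status_sum == total_urls)
print(f"rejected by filters: {rejected_share:.1%}")
print(f"fetched, relative to injected URLs: {fetched_share_of_injected:.1%}")
print(f"fetched, relative to all CrawlDb URLs: {fetched_share_of_crawldb:.1%}")
```

The filters rejected only a small share of the seed URLs, which is consistent with the point made above that the filters alone do not account for the unfetched URLs. That the accepted and rejected counts add up exactly to the topN value may simply mean topN was set to the size of the seed list; that is an assumption, not something confirmed in the thread.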

