Hello Andrés,

Your situation can have many causes; please share your log with us so we can see the details. I suggest checking your URL normalizer, because it can skip URLs that have problems. Also check the lines below in your Nutch script and increase the parameter (I have 1000), because it sets the total number of URLs fetched in every round of the crawl:

    # number of urls to fetch in one iteration
    # 250K per task?
    sizeFetchlist=`expr $numSlaves \* 1000`

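To make the effect of that parameter concrete, here is a rough sketch of the arithmetic. It reuses the same multiplication as the script and estimates how many rounds would be needed before every injected URL could be selected once. The values for numSlaves and perSlave are assumptions for illustration (a single node, and my setting of 1000), and the variable names perSlave, totalUrls and rounds are mine, not from the script. The result is only a lower bound, since newly discovered URLs also compete for a place in each fetch list.

```shell
#!/bin/sh
# Hypothetical values for illustration; adjust them to your own setup.
numSlaves=1        # assumed: a single fetcher node
perSlave=1000      # multiplier from the script lines quoted above
totalUrls=25146    # injected URLs reported by the Injector log

# Same arithmetic as the script: URLs selected per round.
sizeFetchlist=$((numSlaves * perSlave))

# Ceiling division: minimum number of rounds to cover all injected URLs.
rounds=$(( (totalUrls + sizeFetchlist - 1) / sizeFetchlist ))

echo "sizeFetchlist=$sizeFetchlist rounds=$rounds"
```

If the crawl runs fewer rounds than this estimate, a large share of the URLs will stay in db_unfetched, which would be consistent with the statistics you posted.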
Tell me if this helps.

Regards.

----- Original message -----
From: "Andrés Rincón Pacheco" <[email protected]>
To: [email protected]
Sent: Thursday, October 8, 2015 9:26:11
Subject: [MASSMAIL]Nutch only fetch and parse the third part of urls

Hi,

I am using Nutch 1.9. After reviewing the URLs added by the Injector, the total number of URLs is 25146.

(Log evidence)
crawl.Injector - Injector: Total number of urls after normalization: 25146

When I checked the log file, only 7003 URLs had been fetched and 6727 URLs had been parsed.

These are the statistics:

CrawlDb statistics start: ../crawlInfo/crawldb
Statistics for CrawlDb: ../crawlInfo/crawldb
TOTAL urls: 30914
retry 0: 30913
retry 1: 1
min score: 0.0
avg score: 0.4359605
max score: 100.002
status 1 (db_unfetched): 23912
status 2 (db_fetched): 6727
status 3 (db_gone): 8
status 4 (db_redir_temp): 266
status 5 (db_redir_perm): 1
CrawlDb statistics: done

Why is only about a third of the URLs fetched and parsed?

Thanks.

