Hello Andrés,
This can happen for several reasons, so please share your log with us so we 
can see the details. I suggest checking your URL normalizer, because it can 
skip URLs that have problems. Also check the lines below in your Nutch crawl 
script and increase the multiplier (I have 1000), because this sets the total 
number of URLs fetched in each round of the crawl. 
# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 1000`
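
As a rough illustration of why this value matters, the sketch below estimates
the minimum number of rounds needed to work through the injected URLs. It
assumes a single-node setup (numSlaves=1 is an assumption, not something taken
from your configuration) and uses the 25146 injected URLs from your message.
The names perSlave, totalUrls and rounds are made up for this example and are
not variables in the Nutch script.

```shell
#!/bin/sh
# Illustrative values only; adjust them to your own setup.
numSlaves=1        # assumption: single-node crawl
perSlave=1000      # the multiplier used in the crawl script excerpt
totalUrls=25146    # injected URLs reported by the Injector log

# Same expression as in the crawl script
sizeFetchlist=`expr $numSlaves \* $perSlave`

# Ceiling division: rounds needed if every round fetched a full list
rounds=`expr \( $totalUrls + $sizeFetchlist - 1 \) / $sizeFetchlist`

echo "sizeFetchlist=$sizeFetchlist rounds=$rounds"
```

This is only a lower bound: links discovered during the crawl, failed fetches
and filtered URLs all change the real number. If the crawl runs fewer rounds
than this estimate, a large share of the URLs will stay in db_unfetched.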

Let me know if this helps.
Regards. 




----- Original Message -----
From: "Andrés Rincón Pacheco" <[email protected]>
To: [email protected]
Sent: Thursday, October 8, 2015 9:26:11
Subject: [MASSMAIL]Nutch only fetch and parse the third part of urls

Hi,

I am using Nutch 1.9. After reviewing the URLs added by the Injector, the
total number of URLs is 25146.
(Log evidence)
crawl.Injector - Injector: Total number of urls after normalization: 25146

When I checked the log file, only 7003 URLs were fetched and 6727 URLs
were parsed.

And these are the statistics:

CrawlDb statistics start: ../crawlInfo/crawldb
Statistics for CrawlDb: ../crawlInfo/crawldb
TOTAL urls:     30914
retry 0:        30913
retry 1:        1
min score:      0.0
avg score:      0.4359605
max score:      100.002
status 1 (db_unfetched):        23912
status 2 (db_fetched):  6727
status 3 (db_gone):     8
status 4 (db_redir_temp):       266
status 5 (db_redir_perm):       1
CrawlDb statistics: done

Why is only about a third of the URLs fetched and parsed?

Thanks.