Hello, I'm Jose. I have a question and I hope you can help me.

I have nutch-1.4 and I'm crawling the web of one country (mx), so I
changed regex-urlfilter to add the appropriate regex. The second thing I
changed, in the nutch script, was the Java heap size, because of an
out-of-memory error. My question: I am crawling with depth 2 from two seed
sites, but very few URLs get fetched. The result of readdb is below:
TOTAL urls:     653
retry 0:        653
min score:      0.0
avg score:      0.0077212863
max score:      1.028
status 1 (db_unfetched):        504
status 2 (db_fetched):  139
status 3 (db_gone):     4
status 4 (db_redir_temp):       4
status 5 (db_redir_perm):       2
CrawlDb statistics: done
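
To put these numbers in proportion, here is a minimal Python sketch (not part of Nutch; the function name parse_status_counts is just my own choice) that parses the status lines of the output above and prints each status as a share of the total. It assumes the output has exactly the layout pasted here.

```python
import re

# The readdb output pasted above, copied verbatim.
STATS = """\
TOTAL urls:     653
retry 0:        653
min score:      0.0
avg score:      0.0077212863
max score:      1.028
status 1 (db_unfetched):        504
status 2 (db_fetched):  139
status 3 (db_gone):     4
status 4 (db_redir_temp):       4
status 5 (db_redir_perm):       2
CrawlDb statistics: done
"""


def parse_status_counts(text):
    """Return {status_name: count} for every 'status N (name): count' line."""
    pattern = re.compile(r"^status \d+ \((\w+)\):\s+(\d+)$", re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(text)}


counts = parse_status_counts(STATS)
total = sum(counts.values())  # should equal the "TOTAL urls" line

for name, count in counts.items():
    print(f"{name}: {count} ({100 * count / total:.1f}%)")
```

The status counts add up to the 653 total, and only 139 of them (roughly a fifth) are db_fetched, while 504 are still db_unfetched.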

In some other posts I saw that people changed "protocol-httpclient" to
"protocol-http" in nutch-site.xml, but the result is the same with both
protocols. I dumped the crawldb with -dump and manually checked some of the
db_unfetched URLs to see whether they are unavailable, but they are
reachable and have content, and the servers have no robots.txt. What must I
do to get more URLs fetched?


Sorry for my English. Thank you.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3785938.html
Sent from the Nutch - User mailing list archive at Nabble.com.
