Hi Jose,

We get this question very often, and the short answer, as far as the 'stats' printout is concerned, is that everything is probably fine. For a more complete answer, please search the mailing list archives or Google.
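A large db_unfetched count after a shallow crawl usually just means those URLs were discovered in the last round but never scheduled; running additional generate/fetch/parse/updatedb rounds normally brings them in. A rough sketch of one extra round with the Nutch 1.4 CLI (the `crawl/crawldb` and `crawl/segments` paths and the `-topN` value are assumptions; adjust them to your own layout):

```shell
# One additional crawl round against an existing crawldb (Nutch 1.4).
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=$(ls -d crawl/segments/* | tail -1)    # pick the newest segment
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"                     # needed when fetcher.parse=false
bin/nutch updatedb crawl/crawldb "$SEGMENT"
bin/nutch readdb crawl/crawldb -stats          # db_fetched should grow each round
```

If memory is tight during these steps, note that the bin/nutch script also honors the NUTCH_HEAPSIZE environment variable (value in MB), which avoids editing the script itself.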
BTW, how did you change the heap size? I get an IOException when the TopN is too high.

Remi

On Wednesday, February 29, 2012, pepe3059 <pepe3...@gmail.com> wrote:
> Hello, I'm Jose. I have a question and I hope you can help me.
>
> I am running Nutch 1.4 and crawling the web of one country (mx), so I
> changed regex-urlfilter to add the appropriate regex. The second thing I
> changed, in the nutch script, was the Java heap size, because of an
> out-of-memory error. My question: I am crawling two seed sites with
> depth 2, but very few pages get fetched. The readdb output is below:
>
> TOTAL urls: 653
> retry 0: 653
> min score: 0.0
> avg score: 0.0077212863
> max score: 1.028
> status 1 (db_unfetched): 504
> status 2 (db_fetched): 139
> status 3 (db_gone): 4
> status 4 (db_redir_temp): 4
> status 5 (db_redir_perm): 2
> CrawlDb statistics: done
>
> In some other posts I saw that people changed "protocol-httpclient" to
> "protocol-http" in nutch-site.xml, but the result is the same with both
> protocols. I did a -dump of the crawldb and manually checked some
> db_unfetched URLs to see whether they were unavailable, but they are
> reachable and have content, and no robots.txt is present on the servers.
> What should I do to get more URLs fetched?
>
> Sorry for my English. Thank you.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3785938.html
> Sent from the Nutch - User mailing list archive at Nabble.com.