Re: too few db_fetched
Short answer: continue crawling! If you are going to crawl a large number of records, I wouldn't encourage you to use the crawl command. It's better to build a small shell script that repeats the crawl cycle over and over. Remember, the depth parameter means nothing more than the crawl cycle executed that many times, and with depth 2 you run just two cycles. You'll never get far with two cycles.

On Wed, 29 Feb 2012 05:12:08 +0200, remi tassing tassingr...@gmail.com wrote:

Hi Jose,

We get this question very often, and the short answer, as far as the 'stats' printout is concerned, is that everything is probably fine. For a more complete answer, please search the mailing list or Google.

BTW, how did you change the heap size? I get an IOException when the topN is 'too' high.

Remi

On Wednesday, February 29, 2012, pepe3059 pepe3...@gmail.com wrote:

Hello, I'm Jose. I have a question and I hope you can help me.

I have Nutch 1.4 and I'm crawling the web of one country (mx), so I changed regex-urlfilter to add the appropriate regex. The second parameter I changed, in the nutch script, was the Java heap size, because of an out-of-memory error.

My question: I am crawling two sites (seeds) with depth 2, but I get very few sites fetched. The result of readdb is below:

TOTAL urls: 653
retry 0:    653
min score:  0.0
avg score:  0.0077212863
max score:  1.028
status 1 (db_unfetched): 504
status 2 (db_fetched):   139
status 3 (db_gone):        4
status 4 (db_redir_temp):  4
status 5 (db_redir_perm):  2
CrawlDb statistics: done

In some other posts I saw that people replaced protocol-httpclient with protocol-http in nutch-site.xml, but the result is the same with both protocols. I did a -dump of the crawldb and manually checked some db_unfetched URLs to see if they were unavailable, but they are correct and have content, and no robots.txt is present on the servers.

What must I do to get more URLs fetched?
Sorry for my English. Thank you.

--
View this message in context: http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3785938.html
Sent from the Nutch - User mailing list archive at Nabble.com.
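The crawl-cycle script suggested above could look something like the following. This is a minimal sketch, not a definitive implementation: it assumes a Nutch 1.4 installation run from its root directory, a urls/ directory holding the seed list, and illustrative crawl/crawldb and crawl/segments paths.

```shell
#!/bin/sh
# Minimal Nutch 1.4 crawl driver: repeat the generate/fetch/parse/updatedb
# cycle instead of relying on the one-shot crawl command's -depth option.
# Paths and counts below are illustrative; adjust them to your installation.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
ROUNDS=10      # number of crawl cycles; depth 2 would mean only two of these
TOPN=1000      # URLs generated per cycle

# Seed the crawldb once from the urls/ directory.
bin/nutch inject "$CRAWLDB" urls

for ROUND in $(seq 1 "$ROUNDS"); do
  # Generate a fetch list; stop when no more URLs are due for fetching.
  bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || break
  SEGMENT=$(ls -d "$SEGMENTS"/* | tail -1)   # newest segment directory
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  # Fold the fetched results back into the crawldb so the next round
  # can generate from the newly discovered links.
  bin/nutch updatedb "$CRAWLDB" "$SEGMENT"
done
```

Raising ROUNDS is what actually grows the crawl; each round promotes some of the db_unfetched URLs discovered in earlier rounds.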
Re: too few db_fetched
Thank you for your answers.

remi tassing: you can increase the Java heap used by Nutch by modifying the variable JAVA_HEAP_MAX=-Xmx1000m in the bin/nutch script; 1 GB is currently assigned.

Another question about my problem: I know mapred is used by default, and I read in one post that map and reduce tasks can interfere with the fetch process. Is that correct? Also, where can I find information about the status codes and the other values dumped by readdb? I got this for one URL:

http://cca.inegi.org.mx/en-contacto/foro-del-cca
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Feb 28 17:11:55 CST 2012
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.030734694
Signature: null
Metadata:

Thank you.

--
View this message in context: http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3788086.html
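For reference, the per-URL record above and the earlier statistics both come from Nutch's readdb tool (a front end to the crawldb reader). A quick sketch of the invocations, with illustrative paths:

```shell
# Summary statistics for the whole crawldb (the printout quoted earlier
# in the thread, with the per-status URL counts).
bin/nutch readdb crawl/crawldb -stats

# Dump every record (status, fetch time, retry interval, score, ...)
# as plain text into an output directory for inspection.
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Look up the record for a single URL, as in the example above.
bin/nutch readdb crawl/crawldb -url http://cca.inegi.org.mx/en-contacto/foro-del-cca
```

The numeric status codes shown by -stats (1 db_unfetched, 2 db_fetched, 3 db_gone, 4 db_redir_temp, 5 db_redir_perm) are defined in the CrawlDatum class in the Nutch source, which is the authoritative place to check their meaning.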