Hi, I have been running a crawl for the last 5 days. Until yesterday there were no problems during the process: fetching and updating were working fine. When I started the crawl yesterday with a conf of topN: 80000, the process got stuck in fetching; it has been almost 16 hours and it still has not completed depth 1. I am using a cluster of 4 systems, each with 4GB of memory; three of the machines have 2TB of hard disk space and the fourth has 500GB. I checked the space on all the nodes in the cluster, and every system has a minimum of 50GB free. I went through the mailing list and found a few suggestions, such as reducing the maximum pages per URL, but before doing that I would like a few more suggestions on the problem. Can anyone please help me with their suggestions?
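For reference, the change I found suggested on the list would look something like the following in conf/nutch-site.xml. I am assuming "maximum pages per URL" refers to the generate.max.per.host property, and the value of 100 is just an example; I have not applied this yet:

```xml
<!-- conf/nutch-site.xml (sketch, not yet applied) -->
<property>
  <!-- assumption: this is the "maximum pages" setting suggested on the list -->
  <name>generate.max.per.host</name>
  <!-- caps how many URLs from a single host are selected into one fetch list;
       -1 (the default) means no limit. 100 here is an illustrative value. -->
  <value>100</value>
</property>
```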
The Hadoop log shows the following:

2010-05-24 19:33:14,755 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2010-05-24 19:33:14,756 INFO crawl.Generator - Generator: starting
2010-05-24 19:33:14,756 INFO crawl.Generator - Generator: segment: crawled/segments/20100524193314
2010-05-24 19:33:14,756 INFO crawl.Generator - Generator: filtering: true
2010-05-24 19:33:14,756 INFO crawl.Generator - Generator: topN: 80000
2010-05-24 20:18:05,558 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2010-05-24 20:19:36,187 INFO crawl.Generator - Generator: done.
2010-05-24 20:19:37,672 INFO fetcher.Fetcher - Fetcher: starting
2010-05-24 20:19:37,672 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20100524193314
2010-05-24 20:19:37,842 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

It is stuck here.

--
View this message in context: http://lucene.472066.n3.nabble.com/Fetching-is-slow-after-few-crawls-tp841366p841366.html
Sent from the Nutch - User mailing list archive at Nabble.com.

