Hi All, I have set up a pseudo-distributed cluster using Hadoop 2.3 and am running Nutch 1.7 on it as a MapReduce job. I use the following command to submit the job:
/mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN 30000

I notice that the crawl is still running after 72 hours, even though I am only crawling 4 websites and have disabled outlinks to external domains. Most of the pages are fetched in the first few hours, but the crawl keeps running and only a handful of pages are fetched in those extended crawl rounds. Is my high topN value causing this seemingly never-ending crawl? How can I track its status (from the Hadoop console or otherwise)? Thanks.
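
In case it helps, this is roughly what I had in mind for checking progress, assuming the standard Hadoop 2.x and Nutch 1.x command-line tools (the CrawlDbReader class is my guess at the right entry point for reading the crawldb from the job jar):

  # list running MapReduce jobs with their state and map/reduce progress
  /mnt/hadoop-2.3.0/bin/mapred job -list

  # dump crawldb statistics (fetched vs. unfetched URL counts) using the same job jar
  /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

I believe the YARN ResourceManager web UI (port 8088 by default on Hadoop 2.x) also shows per-job progress, but I am not sure how to relate those jobs back to individual crawl rounds.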

