Hi All, I have set up a pseudo-distributed cluster using Hadoop 2.3 and am running Nutch 1.7 on it as a MapReduce job. I use the following command to submit the job:
/mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN 30000

I notice that the crawl is still running after 72 hours, even though I am only crawling 4 websites and have disabled outlinks to external domains. Most of the pages are fetched in the first few hours, but the crawl keeps running and only a handful of pages are fetched in those extended crawl rounds. Is my high topN value causing this seemingly never-ending crawl? How can I track its status (from the Hadoop console or otherwise)? Thanks.
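
In case it helps, this is roughly what I had in mind for checking progress, assuming the standard Hadoop 2.x and Nutch 1.x command-line tools (the CrawlDbReader class is my guess at the right entry point for reading the crawldb from the job jar):

  # list running MapReduce jobs with their state and map/reduce progress
  /mnt/hadoop-2.3.0/bin/mapred job -list

  # dump crawldb statistics (fetched vs. unfetched URL counts) using the same job jar
  /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

I believe the YARN ResourceManager web UI (port 8088 by default on Hadoop 2.x) also shows per-job progress, but I am not sure how to relate those jobs back to individual crawl rounds.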

