Yes, the console shows you what it is doing, and stdout does as well.
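
For a quick progress check you can also read the crawldb stats with the
CrawlDbReader class from the job jar (paths taken from your command below;
adjust if your crawldb lives elsewhere), and watch the individual MapReduce
jobs in the Hadoop ResourceManager web UI, which listens on port 8088 by
default on Hadoop 2.x. Something along these lines should work:

/mnt/hadoop-2.3.0/bin/hadoop jar \
    /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

The db_fetched vs. db_unfetched counts in the output tell you how much of the
crawl frontier is still waiting.
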
In your case it is the depth that makes it take so long: it does 30,000 crawl
cycles. We do cycles of around 1,000-2,000 URLs, which take between 10 and 15
minutes each, and we skip the indexing job (we index in the Fetcher). In the
end we do around 90-110 cycles every day, so 30,000 would take us almost a
year! :)
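
For only 4 websites with external outlinks disabled, much smaller values are
usually plenty. As a rough sketch (reusing the paths from your command below;
the exact numbers are a guess you will want to tune):

/mnt/hadoop-2.3.0/bin/hadoop jar \
    /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 5 -topN 1000

You can simply re-run it, or raise -depth a little, if the crawldb stats still
show unfetched URLs afterwards.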

If your crawler does not finish all its records before the default or
adaptive fetch interval expires, it won't stop for a long time! :)
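
If you want to see when the records become due again, you can dump the crawldb
and look at the per-URL fetch time and retry interval (the default interval
comes from db.fetch.interval.default, 30 days out of the box if I remember
correctly, the adaptive one from the AdaptiveFetchSchedule settings). A sketch,
again reusing your paths; the dump directory name is just an example:

/mnt/hadoop-2.3.0/bin/hadoop jar \
    /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -dump crawldb_dump

/mnt/hadoop-2.3.0/bin/hadoop fs -cat crawldb_dump/part-* | head -40

Each record shows its status, fetch time and retry interval, which is what the
Generator uses to decide whether to pick it up again in the next cycle.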
 
-----Original message-----
> From: S.L <[email protected]>
> Sent: Tuesday 4th March 2014 8:09
> To: [email protected]
> Subject: When can the Nutch MapReduce job be considered complete?
> 
> Hi All,
> 
> I have set up a pseudo-distributed cluster using Hadoop 2.3 and I am running
> Nutch 1.7 on it as a MapReduce job. I use the following command to
> submit the job.
> 
> /mnt/hadoop-2.3.0/bin/hadoop jar \
>     /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
>     org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN 30000
> 
> I notice that the crawl is still running even after 72 hours. I am only
> crawling 4 websites and have disabled outlinks to external domains. Most
> of the pages are crawled in the first few hours, but then the crawl keeps
> on running and very few pages are crawled in those extended crawl
> sessions. Is my high topN value causing this seemingly never-ending crawl?
> 
> How can I track the status (from the Hadoop console or otherwise)?
> 
> Thanks.
> 
