Hi vivekvl

I see that if a tasktracker node fails, you can use the resume option of
bin/nutch fetch to resume the interrupted job [0]. Also, Nutch 2.x does not
store its crawl data on HDFS (it goes to the storage backend, HBase in your
case), so a datanode failure will not affect the crawl.

[0] http://wiki.apache.org/nutch/bin/nutch%20fetch
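
Something like this might work as a starting point (a rough sketch only;
the <batchId> placeholder and the -resume/-crawlId options are assumptions
based on the FetcherJob usage in 2.x, so please check the bin/nutch fetch
help output or the wiki page [0] for the exact syntax in your version):

  # hypothetical: re-run the fetch for the interrupted batch, skipping
  # URLs that were already fetched in the earlier attempt
  bin/nutch fetch <batchId> -crawlId <yourCrawlId> -resume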


On Tue, May 14, 2013 at 6:19 PM, vivekvl <[email protected]> wrote:

> I am in the process of setting up a production-ready environment for the
> Nutch crawler.
> I am trying to make the environment fault tolerant to Hadoop node failures,
> typically a tasktracker and datanode failing together due to a network
> issue or an OS crash.
>
> I tried simulating the scenario by stopping one node during a crawl.
> I stopped the node that was running a fetch reducer task in the 5th cycle.
> The task completed after hanging for a few minutes, and the Namenode UI
> and MapReduce admin UI started showing a reduced number of nodes. The
> crawl continued for the configured 6 cycles and ended. However, the total
> number of URLs crawled was lower than in previous runs, so I suspect the
> interrupted fetch task was never retried.
>
> I want to understand this behavior and find a solution for node failures
> during a crawl. I welcome suggestions on this.
>
> I am using Nutch 2.1 with HBase 0.90.6 and Hadoop-0.20.2.
>
> Thanks,
> Raja
>
>
>



-- 
Don't Grow Old, Grow Up... :-)