Hi vivekvl,

If a tasktracker node fails, you can use the resume option of bin/nutch fetch to resume the interrupted job [0]. Also, Nutch 2.x does not use HDFS to store its crawl data (in your setup it lives in HBase), so a datanode failure will not affect the crawl.
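For example, resuming an interrupted fetch could look like this (a sketch only: the batch id and crawl id below are placeholders, and the exact options can vary between 2.x releases, so check the usage string printed by bin/nutch fetch, or the wiki page in [0]):

    # Re-run the fetch phase for the interrupted batch. With -resume,
    # URLs already fetched before the node failed are skipped.
    bin/nutch fetch 1368465540-1234 -crawlId my_crawl -threads 10 -resume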
[0] http://wiki.apache.org/nutch/bin/nutch%20fetch

On Tue, May 14, 2013 at 6:19 PM, vivekvl <[email protected]> wrote:
> I am in the process of setting up a production-ready environment for the
> Nutch crawler, and I am trying to make the environment fault tolerant to
> Hadoop node failure, typically a tasktracker and datanode failing
> together due to a network issue or a crashing OS.
>
> I tried simulating the scenario by stopping one node during a crawl.
> I stopped the node that was running a fetch reducer task in the 5th
> cycle. The task completed after hanging for a few minutes. The Namenode
> UI and the MapReduce admin UI started showing a reduced number of nodes.
> The crawl continued for the configured 6 cycles and ended. However, the
> total number of URLs crawled was lower than in previous runs. I suspect
> the interrupted fetch task was never retried.
>
> I want to understand this behavior and find a solution for node failure
> during a crawl. I welcome suggestions on this.
>
> I am using Nutch 2.1 with HBase 0.90.6 and Hadoop 0.20.2.
>
> Thanks,
> Raja
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/What-would-happen-when-Hadoop-tasktracker-and-data-node-fails-during-Nutch-Crawl-tp4063189.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)

