Hi All,

I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN and redirecting its output to a log file, as shown below.
/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1 &

The issue I am facing is that, at random, the script loses track of a running job's progress updates (for example, it shows "Map 80% Reduce 67%" and gets stuck there). In the meantime the job completes successfully, but the script keeps waiting for further updates. As a result, the loop of generate-fetch-update jobs gets terminated prematurely.

This is so random that I have not been able to find a pattern to it, and I end up restarting the script every so often. Sometimes it happens in a job as short as the inject phase of Nutch.

Just wondering if anyone has faced this issue? Is the fact that I am redirecting the output to a log file playing a part in this? What are the best practices for running a long-running script like bin/crawl?

I am using CentOS 7.x.

Thanks.

