bin/crawl script losing status updates from the MR job.

Meraj A. Khan Thu, 30 Oct 2014 15:02:54 -0700

Hi All,

I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN,
with its output redirected to a log file as shown below.


/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1 &

The issue I am facing is that, at random, while the script is running a job
it loses track of the progress updates (such as Map 80% Reduce 67%) and gets
stuck there. In the meantime the job completes successfully, but the script
keeps waiting for further updates. As a result, the loop of
generate-fetch-update jobs is terminated prematurely.

This is so random that I have not been able to find any pattern to the
issue, and I end up restarting the script every so often. Sometimes it
happens in a job as short as the inject phase of Nutch.

Has anyone else faced this issue? Is the fact that I am redirecting the
output to a log file playing a part in it? What are the best practices for
running a long-running script like bin/crawl? I am using CentOS 7.x.
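
For concreteness, one common way to run a long-lived script detached from
the terminal is sketched below. It is only an assumption that this is
relevant to the stalled status updates, not a confirmed fix. run_detached
is a hypothetical helper name, not part of Nutch or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): start a command in the background,
# detached from the terminal, with all output sent to a log file.
# Usage: run_detached LOG_FILE COMMAND [ARGS...]
run_detached() {
    log_file=$1
    shift
    # nohup makes the command ignore SIGHUP when the login session ends.
    # stdin is taken from /dev/null so the command never blocks on terminal input.
    nohup "$@" > "$log_file" 2>&1 < /dev/null &
    # Print the PID of the background process so the caller can monitor it.
    echo $!
}
```

With the paths from the command above, the call would be:
run_detached /tmp/nutch.log /opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000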

Thanks.
