bin/crawl : incorrect handling of nutch errors?

Bouchard Mathieu (DGTT) Tue, 19 Aug 2014 05:17:28 -0700

Hi,

We are using Solr with Nutch to provide a complete search engine for our 
website.


I created a cron job that uses Nutch to crawl and update the Solr index 
each night. The job tries to automatically correct some errors that could 
result in a corrupt crawldb. However, it seems that the bin/crawl command 
doesn't correctly propagate errors coming from bin/nutch.

Here is an example from the bin/crawl script:
    $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR

    if [ $? -ne 0 ]
      then exit $?
    fi

Even if the nutch inject command fails, the crawl script always returns 0. 
As I understand it, by the time "exit $?" runs, $? holds the result of the 
shell test ([ ... ]) and not the result of the nutch inject command.
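
The behaviour can be reproduced outside of Nutch. The following is only a 
minimal sketch: fail_with_3 is a hypothetical stand-in for a failing 
bin/nutch call, and "return" is used instead of "exit" so that both variants 
can run in the same shell.

```shell
#!/bin/sh
# Hypothetical stand-in for a failing "bin/nutch inject" call.
fail_with_3() { return 3; }

# Same pattern as in bin/crawl: $? is read a second time, after the test ran.
buggy() {
  fail_with_3
  if [ $? -ne 0 ]
    then return $?   # $? is now the status of "[", which succeeded, so 0
  fi
}

# Same pattern, but the status is saved before anything else runs.
fixed() {
  fail_with_3
  RETCODE=$?
  if [ $RETCODE -ne 0 ]
    then return $RETCODE
  fi
}

buggy; echo "buggy returned $?"   # prints "buggy returned 0"
fixed; echo "fixed returned $?"   # prints "fixed returned 3"
```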

To correct this, we would need to modify the script with something like:
    $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
    RETCODE=$?

    if [ $RETCODE -ne 0 ]
      then exit $RETCODE
    fi
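
Since the same check follows several bin/nutch calls in the script, the 
pattern could also be wrapped in a small function. This is only a sketch: 
run_or_exit is a hypothetical helper name and is not part of bin/crawl.

```shell
# Hypothetical helper (not part of bin/crawl): run a command and, if it
# fails, stop the script with that command's own exit status.
run_or_exit() {
  "$@"
  RETCODE=$?          # save the status before any other command overwrites $?
  if [ $RETCODE -ne 0 ]
    then
      echo "Error running: $* (exit code $RETCODE)" >&2
      exit $RETCODE
  fi
}

# Possible usage in the script:
#   run_or_exit $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
```

Because the function calls exit, a failure ends the whole script, which 
matches what the original check was meant to do.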

I also have a problem with the bin/nutch generate command. It returns the 
same exit code whether there is an error or simply no new segment to process, 
so there is no way to tell whether the failure is real.

I'm thinking of opening a ticket for these issues, but I'm wondering whether 
there was a reason the script was written this way.

Thanks,
