Hi

Maybe you can use something like this to check the return code from the
previous command:

$NUTCH_HOME/bin/nutch crawl urls -dir $crawldb -solr $solrurl -depth $depth

  # Capture the exit status immediately; inside the "then" branch, $? would
  # already hold the exit status of the [ test itself, not of the crawl.
  rc=$?
  if [ $rc -ne 0 ]
  then exit $rc
  fi

$NUTCH_HOME/bin/nutch solrindex $solrurl $crawldb/crawldb/ -linkdb \
    $crawldb/linkdb -dir $crawldb/segments/ -deleteGone

Also, the bin/nutch crawl command is DEPRECATED; please use the crawl script
instead:

gxl@gxl-desktop:~/workspace/java/nutch-svn/bin$ ./crawl
Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
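As for continuing past a half-written segment: one option (just a sketch, not
anything the crawl script does for you) is to prune any segment directory that
is missing its crawl_fetch subdirectory before running solrindex. The function
name and the idea of keying on crawl_fetch are assumptions modelled on the
paths in your error message.

```shell
#!/bin/sh
# Sketch: remove segment directories left incomplete by an aborted crawl,
# so that a later "nutch solrindex ... -dir .../segments/" does not fail
# with "Input path does not exist".
# prune_incomplete_segments is a hypothetical helper, not part of Nutch.
prune_incomplete_segments() {
    for seg in "$1"/*; do
        # A fully fetched and parsed segment contains crawl_fetch,
        # crawl_parse, parse_data and parse_text; an aborted run can leave
        # a segment directory without them.
        if [ -d "$seg" ] && [ ! -d "$seg/crawl_fetch" ]; then
            echo "Removing incomplete segment: $seg"
            rm -r "$seg"
        fi
    done
}

# Example call (path taken from the script quoted below):
# prune_incomplete_segments "$crawldb/segments"
```

Run this between the crawl step and the solrindex step; complete segments are
left untouched.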




On Thu, Mar 14, 2013 at 12:17 PM, David Philip
<[email protected]> wrote:

> Hi,
>
> While running the crawl command, the error below occurred, and so indexing
> of the other URLs that were fetched successfully failed.
> Can you please tell me if there is any way, in the crawl script [below], to
> continue crawling even when such an error occurs?
>
> I think this error occurred because, some time back, a crawl was stopped
> abruptly, so it created a segment folder without its respective subfolders.
> The next time the crawl command was re-run, it gave the error below.
> What is the best way to handle this error so that the crawl continues?
>
>
> *Error:*
> SolrIndexer: starting at 2013-03-13 23:21:30
> SolrIndexer: deleting gone documents: true
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>
> file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/crawl_fetch
> Input path does not exist:
>
> file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/crawl_parse
> Input path does not exist:
>
> file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/parse_data
> Input path does not exist:
>
> file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/parse_text
> FINISHED: Crawl completed
>
> *Script I am using:*
> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> export NUTCH_HOME=/home/ubuntu/Downloads/apache-nutch-1.6
> depth=1
> solrurl=http://xx.xx.xx.xx:8080/solrnutch
> crawldb=$NUTCH_HOME/crawlService
>
> $NUTCH_HOME/bin/nutch crawl urls -dir $crawldb -solr $solrurl -depth $depth
>
> $NUTCH_HOME/bin/nutch solrindex $solrurl $crawldb/crawldb/ -linkdb
> $crawldb/linkdb -dir $crawldb/segments/ *-deleteGone*
>
> echo "FINISHED: Crawl completed!"
>
> *Note:* I know that writing a script that calls the commands individually
> is best, but I started with the crawl command, so I was working with it
> only. If a script using the individual commands can handle this exception,
> let me know.
>
>
> Thanks - David
>



-- 
Don't Grow Old, Grow Up... :-)
