Hi,

  Thank you for the quick reply. I will try the return-code check you
suggested in a script that runs the individual commands instead of the
single crawl command.
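For reference, here is a rough sketch of the kind of wrapper I have in mind. The helper name run_step, the cleanup loop, and the commented step invocations are my own naming, not anything from the Nutch docs; paths mirror the script quoted below and would need adjusting.

```shell
#!/bin/bash
# Sketch of a crawl driver that runs each Nutch step individually and
# stops at the first failure. run_step and the step order are my own
# naming; adjust paths and options to your install.

NUTCH_HOME=/home/ubuntu/Downloads/apache-nutch-1.6
crawldb=$NUTCH_HOME/crawlService

run_step() {
  echo "Running: $*"
  "$@"
  rc=$?                # capture right away; a later [ ] test would reset $?
  if [ $rc -ne 0 ]; then
    echo "Step failed (exit code $rc): $*" >&2
    exit $rc
  fi
}

# Optional cleanup before crawling: drop half-written segments left by an
# aborted crawl, i.e. segment dirs that never got a crawl_fetch subfolder.
for seg in "$crawldb"/segments/*/; do
  if [ -d "$seg" ] && [ ! -d "$seg/crawl_fetch" ]; then
    echo "Removing incomplete segment: $seg"
    rm -r "$seg"
  fi
done

# Example step invocations (commented out here; exact arguments are an
# assumption on my part, to be checked against bin/nutch usage output):
# run_step "$NUTCH_HOME/bin/nutch" inject "$crawldb/crawldb" urls
# run_step "$NUTCH_HOME/bin/nutch" generate "$crawldb/crawldb" "$crawldb/segments"
```

The point of run_step is that it copies `$?` into rc immediately, before any test can overwrite it, and aborts the whole script with that same code.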

Thanks- David

On Thu, Mar 14, 2013 at 10:01 AM, feng lu <[email protected]> wrote:

> Hi
>
> Maybe you can use this command to check the return code from the previous
> command.
>
> $NUTCH_HOME/bin/nutch crawl urls -dir $crawldb -solr $solrurl -depth $depth
>
>   rc=$?   # capture before the [ ] test resets $?
>   if [ $rc -ne 0 ]
>   then exit $rc
>   fi
>
> $NUTCH_HOME/bin/nutch solrindex $solrurl $crawldb/crawldb/ -linkdb
>
> Also, the bin/nutch crawl command is DEPRECATED; please use the crawl
> script instead.
>
> gxl@gxl-desktop:~/workspace/java/nutch-svn/bin$ ./crawl
> Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
>
>
>
>
> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
> <[email protected]>wrote:
>
> > Hi,
> >
> > While running the crawl command, the error below occurred, and so
> > indexing of the other URLs that were fetched successfully failed.
> > Can you please tell me if there is any way to tell the crawl
> > script [below] to continue crawling even when such an error occurs?
> >
> > I think this error occurred because some time back the crawl that was
> > initiated got stopped abruptly, so it created a segment folder without
> > its respective subfolders. The next time the crawl command was re-run,
> > it gave the error below. What is the best way to handle this error so
> > that the crawl continues?
> >
> >
> > *Error:*
> > SolrIndexer: starting at 2013-03-13 23:21:30
> > SolrIndexer: deleting gone documents: true
> > SolrIndexer: URL filtering: false
> > SolrIndexer: URL normalizing: false
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/crawl_fetch
> > Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/crawl_parse
> > Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/parse_data
> > Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/parse_text
> > FINISHED: Crawl completed
> >
> > *Script I am using:*
> > export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> > export NUTCH_HOME=/home/ubuntu/Downloads/apache-nutch-1.6
> > depth=1
> > solrurl=http://xx.xx.xx.xx:8080/solrnutch
> > crawldb=$NUTCH_HOME/crawlService
> >
> > $NUTCH_HOME/bin/nutch crawl urls -dir $crawldb -solr $solrurl -depth $depth
> >
> > $NUTCH_HOME/bin/nutch solrindex $solrurl $crawldb/crawldb/ -linkdb
> > $crawldb/linkdb -dir $crawldb/segments/ -deleteGone
> >
> > echo "FINISHED: Crawl completed!"
> >
> > *Note:* I know that writing a script to call the commands individually
> > is best, but I started with the crawl command, so I was working with it
> > only. If a script using the individual commands can handle this
> > exception, let me know.
> >
> >
> > Thanks - David
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
