Hi,

Thank you for the quick reply. I will check by adding the piece of script below to a crawl script that runs the individual commands.
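A minimal sketch of that per-command check (variable names taken from the thread; `run_step` is a hypothetical helper, and the nutch invocations are shown only as comments since they assume a local Nutch install). One caveat with the snippet as quoted: `$?` has to be saved into a variable before it is tested, because the `[ ... ]` test itself overwrites `$?`, so a plain `exit $?` after the test would exit with the test's status, not the command's:

```shell
#!/bin/bash
# Hypothetical helper: run one crawl step and abort the script if it fails.
run_step() {
    "$@"
    rc=$?                      # save $? immediately; the next test resets it
    if [ "$rc" -ne 0 ]; then
        echo "Step failed (exit code $rc): $*" >&2
        exit "$rc"
    fi
}

# Intended usage per the thread (not executed here; assumes NUTCH_HOME,
# crawldb, solrurl and depth are set as in the original script):
# run_step "$NUTCH_HOME/bin/nutch" crawl urls -dir "$crawldb" -solr "$solrurl" -depth "$depth"
# run_step "$NUTCH_HOME/bin/nutch" solrindex "$solrurl" "$crawldb/crawldb/" \
#     -linkdb "$crawldb/linkdb" -dir "$crawldb/segments/" -deleteGone
```

This way each step either succeeds or stops the script with the failing step's own exit code, instead of silently running the indexer against broken state.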
Thanks - David

On Thu, Mar 14, 2013 at 10:01 AM, feng lu <[email protected]> wrote:

> Hi
>
> Maybe you can use this command to check the return code from the previous
> command.
>
> $NUTCH_HOME/bin/nutch crawl urls -dir $crawldb -solr $solrurl -depth $depth
>
> if [ $? -ne 0 ]
> then exit $?
> fi
>
> $NUTCH_HOME/bin/nutch solrindex $solrurl $crawldb/crawldb/ -linkdb
>
> And the bin/nutch crawl command is DEPRECATED. Please use the crawl script
> instead.
>
> gxl@gxl-desktop:~/workspace/java/nutch-svn/bin$ ./crawl
> Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
>
> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
> <[email protected]> wrote:
>
> > Hi,
> >
> > While running the crawl command, the error below occurred, and so
> > indexing of the other URLs that were fetched successfully also failed.
> > Can you please tell me if there is any way to specify in the crawl
> > script [below] that crawling should continue even when such an error
> > occurs?
> >
> > I think this error occurred because a crawl started some time back was
> > stopped abruptly, so it created a segment folder without its respective
> > subfolders. When the crawl command was re-run, it gave the error below.
> > What is the best way to handle this error so that the crawl continues?
> >
> > *Error:*
> > SolrIndexer: starting at 2013-03-13 23:21:30
> > SolrIndexer: deleting gone documents: true
> > SolrIndexer: URL filtering: false
> > SolrIndexer: URL normalizing: false
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/crawl_fetch
> > Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/crawl_parse
> > Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/parse_data
> > Input path does not exist:
> > file:/home/ubuntu/Downloads/apache-nutch-1.6/crawlService/segments/20130313140839/parse_text
> > FINISHED: Crawl completed
> >
> > *Script I am using:*
> > export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> > export NUTCH_HOME=/home/ubuntu/Downloads/apache-nutch-1.6
> > depth=1
> > solrurl=http://xx\.xx\.xx\.xx:8080/solrnutch
> > crawldb=$NUTCH_HOME/crawlService
> >
> > $NUTCH_HOME/bin/nutch crawl urls -dir $crawldb -solr $solrurl -depth $depth
> >
> > $NUTCH_HOME/bin/nutch solrindex $solrurl $crawldb/crawldb/ -linkdb
> > $crawldb/linkdb -dir $crawldb/segments/ *-deleteGone*
> >
> > echo "FINISHED: Crawl completed!"
> >
> > *Note:* I know that writing a script that calls the commands individually
> > is best, but I started with the crawl command, so I was working with it
> > only. If a script using the individual commands can handle this
> > exception, let me know.
> >
> > Thanks - David
>
> --
> Don't Grow Old, Grow Up... :-)
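As a possible workaround for the incomplete-segment failure in the error log above, the indexing step could skip any segment directory that is missing the subdirectories SolrIndexer tries to read. A rough sketch (`collect_complete_segments` is a hypothetical helper; the subdirectory names are taken from the error messages, and the segments layout is assumed from the script in the thread):

```shell
#!/bin/bash
# Hypothetical helper: print only the segment directories under the given
# segments dir that contain all four subdirectories SolrIndexer reads
# (crawl_fetch, crawl_parse, parse_data, parse_text), skipping any segment
# left incomplete by an aborted crawl.
collect_complete_segments() {
    segdir="$1"
    for seg in "$segdir"/*/; do
        if [ -d "${seg}crawl_fetch" ] && [ -d "${seg}crawl_parse" ] &&
           [ -d "${seg}parse_data" ] && [ -d "${seg}parse_text" ]; then
            echo "${seg%/}"
        fi
    done
}
```

Each segment this prints could then be passed to bin/nutch solrindex individually; alternatively, simply deleting the incomplete segment directory before re-running the crawl should clear the error.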

