Hi Chris,

Please check out NUTCH-1545. We'll hopefully be committing this today(ish), and it should be included in the 2.2 RC which I am about to cut. Your feedback would be great.

Thanks
On Wednesday, May 29, 2013, Christopher Gross <[email protected]> wrote:
> I did make some modifications -- but that was at the top & to hard code
> some params to make it easier for me to make this a cron job. I didn't
> change anything for the real functionality.
>
> I was under the assumption that the provided script would be correct and
> work out of the box, like the old runbot.sh script did.
>
> Are you aware of the right thing to do to fix this?
>
> -- Chris
>
>
> On Tue, May 28, 2013 at 4:01 PM, <[email protected]> wrote:
>
>> Hi,
>>
>> I have seen this script. I thought you had modified it. It will not run
>> even if you remove crawlId, because it does not capture the batchId from
>> the generate command.
>>
>> Alex.
>>
>>
>> -----Original Message-----
>> From: Christopher Gross <[email protected]>
>> To: user <[email protected]>
>> Sent: Tue, May 28, 2013 5:20 am
>> Subject: Re: error crawling
>>
>>
>> Local mode.
>>
>> Script:
>>
>> #!/bin/bash
>> #
>> # Licensed to the Apache Software Foundation (ASF) under one or more
>> # contributor license agreements.  See the NOTICE file distributed with
>> # this work for additional information regarding copyright ownership.
>> # The ASF licenses this file to You under the Apache License, Version 2.0
>> # (the "License"); you may not use this file except in compliance with
>> # the License.  You may obtain a copy of the License at
>> #
>> #     http://www.apache.org/licenses/LICENSE-2.0
>> #
>> # Unless required by applicable law or agreed to in writing, software
>> # distributed under the License is distributed on an "AS IS" BASIS,
>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> # See the License for the specific language governing permissions and
>> # limitations under the License.
>> #
>> # The Crawl command script : crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
>> #
>> #
>> # UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION AND
>> # INDEXING FOR EACH SEGMENT
>>
>> # set common env. variables -- ex $JAVA_HOME, $NUTCH_HOME, etc.
>> . /proj/common/setenv.sh
>>
>> SEEDDIR=$NUTCH_HOME/urls
>> CRAWLDIR=$NUTCH_HOME/crawl/
>> CRAWL_ID=crawl
>> SOLRURL=http://localhost/nutchsolr/
>> LIMIT=3
>>
>> #############################################
>> # MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
>> #############################################
>>
>> # set the number of slave nodes
>> numSlaves=1
>>
>> # and the total number of available tasks
>> # sets Hadoop parameter "mapred.reduce.tasks"
>> numTasks=`expr $numSlaves \* 2`
>>
>> # number of urls to fetch in one iteration
>> # 250K per task?
>> #sizeFetchlist=`expr $numSlaves \* 5000`
>> sizeFetchlist=`expr $numSlaves \* 20`
>>
>> # time limit for fetching
>> timeLimitFetch=180
>>
>> #############################################
>>
>> bin=`dirname "$0"`
>> bin=`cd "$bin"; pwd`
>>
>> # note that some of the options listed here could be set in the
>> # corresponding hadoop site xml param file
>> commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
>>
>> # initial injection
>> $bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
>>
>> if [ $? -ne 0 ]
>>   then exit $?
>> fi
>>
>> # main loop : rounds of generate - fetch - parse - update
>> for ((a=1; a <= LIMIT ; a++))
>> do
>>   if [ -e ".STOP" ]
>>   then
>>     echo "STOP file found - escaping loop"
>>     break
>>   fi
>>
>>   echo `date` ": Iteration $a of $LIMIT"
>>
>>   echo "Generating a new fetchlist"
>>   $bin/nutch generate $commonOptions -crawlId $CRAWL_ID -force -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
>>
>>   if [ $? -ne 0 ]
>>   then exit $?
>>   fi
>>
>>   # TODO capture the batchID
>>   echo "Fetching : "
>>   $bin/nutch fetch $commonOptions -

-- 
*Lewis*
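[Editor's note] The "TODO capture the batchID" comment is the bug Alex points at: in Nutch 2.x the generate step creates a batch, and fetch/parse/update need that batch's id, which the posted script never captures. A minimal sketch of one way to do it, assuming the GeneratorJob prints a line containing "generated batch id: <id>" (the exact log message, and the fetch invocation in the trailing comment, are assumptions to verify against your Nutch version):

```shell
#!/bin/sh
# Stand-in for the real output of: $bin/nutch generate ... 2>&1 | tee "$logfile"
# (hypothetical id value; in the script you would capture the command's output instead)
generate_output='GeneratorJob: generated batch id: 1369992993-1071180087'

# Extract the batch id by pattern-matching the generate output.
batchId=$(printf '%s\n' "$generate_output" | sed -n 's/.*generated batch id: *\([^ ]*\).*/\1/p')

if [ -z "$batchId" ]; then
  echo "Could not capture batchId from generate output" >&2
  exit 1
fi

echo "$batchId"
# The captured id would then be passed to the later stages, e.g. (assumed syntax):
#   $bin/nutch fetch "$batchId" -crawlId $CRAWL_ID -threads 50
```

This is only a parsing sketch; the robust fix that landed upstream (see NUTCH-1545 in Lewis's reply) is the authoritative version.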

