Hi, I have seen this script. I thought you had modified it. It will not run even if you remove crawlId, because it does not capture the batchId from the generate command.
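The newer revisions of the stock bin/crawl in the 2.x branch get around that by generating a batch id up front and handing it to each job in place of -all. Inside the loop of the script below that would look roughly like this (a sketch only; it assumes your build's generate already accepts a -batchId option, which is what the current 2.x script relies on, and the error checks are left out):

# sketch: create one batch id per round and pass it to every job in that round
batchId=`date +%s`-$RANDOM

$bin/nutch generate $commonOptions -topN $sizeFetchlist -noFilter -crawlId $CRAWL_ID -batchId $batchId
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 10
$bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId $CRAWL_ID -force
# whether updatedb takes a batch id depends on the 2.x version; drop it if yours does not
$bin/nutch updatedb $commonOptions $batchId -crawlId $CRAWL_ID

If your generate does not take -batchId, then the id it reports when it finishes has to be captured from its output instead and fed to fetch and parse, which is exactly the part the script skips at its "# TODO capture the batchID" comment.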
Alex.

-----Original Message-----
From: Christopher Gross <[email protected]>
To: user <[email protected]>
Sent: Tue, May 28, 2013 5:20 am
Subject: Re: error crawling

Local mode.

Script:

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# The Crawl command script : crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
#
#
# UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION AND
# INDEXING FOR EACH SEGMENT

#set common env. variables -- ex $JAVA_HOME, $NUTCH_HOME, etc.
. /proj/common/setenv.sh

SEEDDIR=$NUTCH_HOME/urls
CRAWLDIR=$NUTCH_HOME/crawl/
CRAWL_ID=crawl
SOLRURL=http://localhost/nutchsolr/
LIMIT=3

#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################

# set the number of slaves nodes
numSlaves=1

# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`

# number of urls to fetch in one iteration
# 250K per task?
#sizeFetchlist=`expr $numSlaves \* 5000`
sizeFetchlist=`expr $numSlaves \* 20`

# time limit for fetching
timeLimitFetch=180

#############################################

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

# note that some of the options listed here could be set in the
# corresponding hadoop site xml param file
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"

# initial injection
$bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID

if [ $? -ne 0 ]
then
  exit $?
fi

# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
  if [ -e ".STOP" ]
  then
    echo "STOP file found - escaping loop"
    break
  fi

  echo `date` ": Iteration $a of $LIMIT"

  echo "Generating a new fetchlist"
  $bin/nutch generate $commonOptions -crawlId $CRAWL_ID -force -topN $sizeFetchlist -numFetchers $numSlaves -noFilter

  if [ $? -ne 0 ]
  then
    exit $?
  fi

  # TODO capture the batchID
  echo "Fetching : "
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch -all -crawlId $CRAWL_ID -threads 10

  if [ $? -ne 0 ]
  then
    exit $?
  fi

  # parsing the segment
  echo "Parsing : "
  # enable the skipping of records for the parsing so that a dodgy document
  # does not fail the full task
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
  $bin/nutch parse $commonOptions $skipRecordsOptions -all -crawlId $CRAWL_ID -force

  if [ $? -ne 0 ]
  then
    exit $?
  fi

  # updatedb with this segment
  echo "CrawlDB update"
  $bin/nutch updatedb $commonOptions

  if [ $? -ne 0 ]
  then
    exit $?
  fi

  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID

  if [ $? -ne 0 ]
  then
    exit $?
  fi

  #echo "SOLR dedup -> $SOLRURL"
  #$bin/nutch solrdedup $commonOptions $SOLRURL

  if [ $? -ne 0 ]
  then
    exit $?
  fi

done

exit 0

--
Chris


On Fri, May 24, 2013 at 2:51 PM, <[email protected]> wrote:

> Can you send the script? Also, are you running it in deploy or local mode?
>
> -----Original Message-----
> From: Christopher Gross <[email protected]>
> To: user <[email protected]>
> Sent: Fri, May 24, 2013 9:43 am
> Subject: Re: error crawling
>
> Right. "runbot" is the old one. They don't package something like that
> with Nutch anymore. Through digging on the web I found something.
>
> I took this script:
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
>
> I made small changes -- rather than passing in args I hard-coded them (to
> make it easier to run via cron), and since my user doesn't have the right
> stuff set up in the PATH, I have an environment loader. I also commented
> out the dedup line since it doesn't work.
>
> From that file:
>
> # initial injection
> $bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
>
> Even taking out the CRAWL_ID part I still get the crawl_webpage error
> message. So I'm still not able to do the crawling correctly. I still
> cannot find documentation saying what I need to do to make the Keyclass
> and nameclass match correctly. That's what I'm trying to get answered. I
> tried hacking at it a bit but things got uglier, so I'm looking here for
> guidance.
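For reference on the crawl_webpage part of that error: in 2.x the store name handed to Gora is the crawlId plus "_webpage", so with CRAWL_ID=crawl the jobs ask for crawl_webpage while the stock Gora mapping files only declare the plain webpage table, which is likely where the table-name mismatch comes from (note that every step in the script passes -crawlId, so removing it only from the inject line does not change this). One workaround that gets suggested is to duplicate the WebPage entry in the mapping under the prefixed name. The sketch below assumes the HBase backend and gora-hbase-mapping.xml; whether it is actually required, and whether a matching <table name="crawl_webpage"> block is needed as well, depends on the backend and Gora version, and the field mappings (elided here) are simply copied from the existing webpage entry:

<class table="crawl_webpage"
       keyClass="java.lang.String"
       name="org.apache.nutch.storage.WebPage">
  <!-- same <field .../> entries as the stock "webpage" class mapping -->
</class>

The simpler route, if you do not need a named crawl, is to drop -crawlId from every command in the script so that everything goes through the default webpage store.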

