Hi Chris,
Please check out NUTCH-1545
We'll hopefully be committing this today(ish), and it should be included in
the 2.2 RC which I am about to cut.
Your feedback would be great.
Thanks


On Wednesday, May 29, 2013, Christopher Gross <[email protected]> wrote:
> I did make some modifications -- but only at the top, to hard-code some
> params so it would be easier to run as a cron job.  I didn't change any of
> the real functionality.
>
> I was under the assumption that the provided script would be correct and
> work out of the box, like the old runbot.sh script did.
>
> Do you know the right way to fix this?
>
> -- Chris
>
>
> On Tue, May 28, 2013 at 4:01 PM, <[email protected]> wrote:
>
>> Hi,
>>
>> I have seen this script; I thought you had modified it. It will not run
>> even if you remove crawlId, because it does not capture the batchId from
>> the generate command.
>>
>> Alex.
>>
>>
>> -----Original Message-----
>> From: Christopher Gross <[email protected]>
>> To: user <[email protected]>
>> Sent: Tue, May 28, 2013 5:20 am
>> Subject: Re: error crawling
>>
>>
>> Local mode.
>>
>> Script:
>>
>> #!/bin/bash
>> #
>> # Licensed to the Apache Software Foundation (ASF) under one or more
>> # contributor license agreements.  See the NOTICE file distributed with
>> # this work for additional information regarding copyright ownership.
>> # The ASF licenses this file to You under the Apache License, Version 2.0
>> # (the "License"); you may not use this file except in compliance with
>> # the License.  You may obtain a copy of the License at
>> #
>> #     http://www.apache.org/licenses/LICENSE-2.0
>> #
>> # Unless required by applicable law or agreed to in writing, software
>> # distributed under the License is distributed on an "AS IS" BASIS,
>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> # See the License for the specific language governing permissions and
>> # limitations under the License.
>> #
>> # The Crawl command script : crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
>> #
>> #
>> # UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION AND
>> # INDEXING FOR EACH SEGMENT
>>
>> #set common env. variables -- ex $JAVA_HOME, $NUTCH_HOME, etc.
>> . /proj/common/setenv.sh
>>
>> SEEDDIR=$NUTCH_HOME/urls
>> CRAWLDIR=$NUTCH_HOME/crawl/
>> CRAWL_ID=crawl
>> SOLRURL=http://localhost/nutchsolr/
>> LIMIT=3
>>
>> #############################################
>> # MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
>> #############################################
>>
>> # set the number of slave nodes
>> numSlaves=1
>>
>> # and the total number of available tasks
>> # sets Hadoop parameter "mapred.reduce.tasks"
>> numTasks=`expr $numSlaves \* 2`
>>
>> # number of urls to fetch in one iteration
>> # 250K per task?
>> #sizeFetchlist=`expr $numSlaves \* 5000`
>> sizeFetchlist=`expr $numSlaves \* 20`
>>
>> # time limit for fetching
>> timeLimitFetch=180
>>
>> #############################################
>>
>> bin=`dirname "$0"`
>> bin=`cd "$bin"; pwd`
>>
>> # note that some of the options listed here could be set in the
>> # corresponding hadoop site xml param file
>> commonOptions="-D mapred.reduce.tasks=$numTasks \
>>   -D mapred.child.java.opts=-Xmx1000m \
>>   -D mapred.reduce.tasks.speculative.execution=false \
>>   -D mapred.map.tasks.speculative.execution=false \
>>   -D mapred.compress.map.output=true"
>>
>> # initial injection
>> $bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
>>
>> if [ $? -ne 0 ]
>>   then exit $?
>> fi
>>
>> # main loop : rounds of generate - fetch - parse - update
>> for ((a=1; a <= LIMIT ; a++))
>> do
>>   if [ -e ".STOP" ]
>>   then
>>    echo "STOP file found - escaping loop"
>>    break
>>   fi
>>
>>   echo `date` ": Iteration $a of $LIMIT"
>>
>>   echo "Generating a new fetchlist"
>>   $bin/nutch generate $commonOptions -crawlId $CRAWL_ID -force \
>>     -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
>>
>>   if [ $? -ne 0 ]
>>   then exit $?
>>   fi
>>
>>   # TODO capture the batchID
>>   echo "Fetching : "
>>   $bin/nutch fetch $commonOptions -
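[Editor's note: the truncated fetch call above is exactly where Alex's point and the script's own "TODO capture the batchID" bite: fetch needs the batch id that generate created, and the script never obtains it. A minimal sketch of one possible fix is below. It mints the batch id in the script and passes it to both commands rather than scraping generate's output; the -batchId flag on generate and the positional batch id on fetch are my assumptions about the Nutch 2.x CLI (check "bin/nutch generate" / "bin/nutch fetch" usage for your build), and the echo lines stand in for the real nutch invocations. It also shows a second latent bug in the script: after "if [ $? -ne 0 ]; then exit $?", $? is the exit status of the [ test itself, so the script exits 0 even on failure.]

```shell
#!/bin/bash
# Sketch only, NOT the committed NUTCH-1545 patch: mint a batch id per
# iteration and hand it to both generate and fetch, instead of trying to
# recover it from generate's console output.

a=1                          # iteration counter, set by the surrounding for-loop
batchId="$a-$(date +%s)"     # unique per generate/fetch round

# Placeholder for: $bin/nutch generate $commonOptions -crawlId $CRAWL_ID \
#                    -batchId $batchId -topN $sizeFetchlist -noFilter
# (-batchId is assumed here; verify it against your Nutch 2.x build)
echo "generate -batchId $batchId"
rc=$?                        # save the status BEFORE testing it: in
                             # "if [ $? -ne 0 ]; then exit $?", the second $?
                             # is the exit status of [ itself, so the original
                             # script would exit 0 even after a failure
if [ $rc -ne 0 ]; then exit $rc; fi

# Placeholder for: $bin/nutch fetch $batchId -crawlId $CRAWL_ID -threads 50
# (fetch takes the batch id as a positional argument in Nutch 2.x)
echo "fetch $batchId"
```

The same rc-capture pattern would also fix the other `if [ $? -ne 0 ]; then exit $?` checks in the script above.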

-- 
*Lewis*
