I did make some modifications -- but those were at the top, to hard code some params so it would be easier for me to run this as a cron job. I didn't change anything in the real functionality.
I was under the assumption that the provided script would be correct and
work out of the box, like the old runbot.sh script did. Are you aware of
the right thing to do to fix this?

-- Chris


On Tue, May 28, 2013 at 4:01 PM, <[email protected]> wrote:

> Hi,
>
> I have seen this script. I thought you had modified it. It will not run
> even if you remove crawlId, because it does not capture the batchId from
> the generate command.
>
> Alex.
>
>
> -----Original Message-----
> From: Christopher Gross <[email protected]>
> To: user <[email protected]>
> Sent: Tue, May 28, 2013 5:20 am
> Subject: Re: error crawling
>
>
> Local mode.
>
> Script:
>
> #!/bin/bash
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements. See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License. You may obtain a copy of the License at
> #
> #     http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
> # The Crawl command script : crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
> #
> #
> # UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION AND
> # INDEXING FOR EACH SEGMENT
>
> # set common env. variables -- ex $JAVA_HOME, $NUTCH_HOME, etc.
> . /proj/common/setenv.sh
>
> SEEDDIR=$NUTCH_HOME/urls
> CRAWLDIR=$NUTCH_HOME/crawl/
> CRAWL_ID=crawl
> SOLRURL=http://localhost/nutchsolr/
> LIMIT=3
>
> #############################################
> # MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
> #############################################
>
> # set the number of slave nodes
> numSlaves=1
>
> # and the total number of available tasks
> # sets Hadoop parameter "mapred.reduce.tasks"
> numTasks=`expr $numSlaves \* 2`
>
> # number of urls to fetch in one iteration
> # 250K per task?
> #sizeFetchlist=`expr $numSlaves \* 5000`
> sizeFetchlist=`expr $numSlaves \* 20`
>
> # time limit for fetching
> timeLimitFetch=180
>
> #############################################
>
> bin=`dirname "$0"`
> bin=`cd "$bin"; pwd`
>
> # note that some of the options listed here could be set in the
> # corresponding hadoop site xml param file
> commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
>
> # initial injection
> $bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
>
> if [ $? -ne 0 ]
> then exit $?
> fi
>
> # main loop : rounds of generate - fetch - parse - update
> for ((a=1; a <= LIMIT ; a++))
> do
>   if [ -e ".STOP" ]
>   then
>     echo "STOP file found - escaping loop"
>     break
>   fi
>
>   echo `date` ": Iteration $a of $LIMIT"
>
>   echo "Generating a new fetchlist"
>   $bin/nutch generate $commonOptions -crawlId $CRAWL_ID -force -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
>
>   if [ $? -ne 0 ]
>   then exit $?
>   fi
>
>   # TODO capture the batchID
>   echo "Fetching : "
>   $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch -all -crawlId $CRAWL_ID -threads 10
>
>   if [ $? -ne 0 ]
>   then exit $?
>   fi
>
>   # parsing the segment
>   echo "Parsing : "
>   # enable the skipping of records for the parsing so that a dodgy document
>   # does not fail the full task
>   skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
>   $bin/nutch parse $commonOptions $skipRecordsOptions -all -crawlId $CRAWL_ID -force
>
>   if [ $? -ne 0 ]
>   then exit $?
>   fi
>
>   # updatedb with this segment
>   echo "CrawlDB update"
>   $bin/nutch updatedb $commonOptions
>
>   if [ $? -ne 0 ]
>   then exit $?
>   fi
>
>   echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
>   $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
>
>   if [ $? -ne 0 ]
>   then exit $?
>   fi
>
>   #echo "SOLR dedup -> $SOLRURL"
>   #$bin/nutch solrdedup $commonOptions $SOLRURL
>
>   if [ $? -ne 0 ]
>   then exit $?
>   fi
>
> done
>
> exit 0
>
>
> -- Chris
>
>
> On Fri, May 24, 2013 at 2:51 PM, <[email protected]> wrote:
>
> > Can you send the script? Also, are you running it in deploy or local mode?
> >
> >
> > -----Original Message-----
> > From: Christopher Gross <[email protected]>
> > To: user <[email protected]>
> > Sent: Fri, May 24, 2013 9:43 am
> > Subject: Re: error crawling
> >
> >
> > Right. "runbot" is the old one. They don't package anything like that
> > with nutch anymore. Through digging on the web I found something.
> >
> > I took this script:
> > http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
> >
> > I made small changes -- rather than passing in args I hard coded them (to
> > make it easier to run via cron), and since my user doesn't have the right
> > stuff set up in the PATH, I have an environment loader. I also commented
> > out the dedup line since it doesn't work.
> >
> > From that file:
> >
> > # initial injection
> > $bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
> >
> > Even taking out the CRAWL_ID part I still get the crawl_webpage error
> > message. So I'm still not able to do the crawling correctly. I still
> > cannot find documentation saying what I need to do to make the Keyclass
> > and nameclass match correctly. That's what I'm trying to get answered. I
> > tried hacking at it a bit but things got uglier, so I'm looking here for
> > guidance.
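
[Editor's note on Alex's point that the script never captures a batch id from the generate step (the "# TODO capture the batchID" comment in the loop): below is a minimal sketch of one way the loop could tie the steps together with an explicit batch id instead of -all. It reuses the variables already defined in the script above. The -batchId flag on generate, and passing $batchId (plus -crawlId) to fetch, parse, and updatedb, are assumptions about the Nutch 2.x build in use -- newer 2.x builds and later revisions of the 2.x crawl script in SVN appear to work this way, but verify against the usage output of each command in your own installation.]

  # inside the main loop: create one batch id per round and hand it to
  # every step, instead of relying on -all
  # NOTE: -batchId on generate and the $batchId argument on fetch/parse/updatedb
  # are assumptions -- check "bin/nutch generate", "bin/nutch fetch", etc. usage
  batchId=`date +%s`-$RANDOM

  echo "Generating a new fetchlist (batch $batchId)"
  $bin/nutch generate $commonOptions -crawlId $CRAWL_ID -batchId $batchId -force -topN $sizeFetchlist -numFetchers $numSlaves -noFilter || exit 1

  echo "Fetching : $batchId"
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 10 || exit 1

  echo "Parsing : $batchId"
  $bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId $CRAWL_ID -force || exit 1

  echo "CrawlDB update for batch $batchId"
  $bin/nutch updatedb $commonOptions $batchId -crawlId $CRAWL_ID || exit 1

[If the generate command in your build does not accept -batchId, an alternative is to grep the generated batch id out of the generate job's log output, but the exact log line varies between versions, so creating the id up front is the simpler route.]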

