Local mode.

Script:

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# The Crawl command script : crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
#
#
# UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION
# AND INDEXING FOR EACH SEGMENT

# set common env. variables -- e.g. $JAVA_HOME, $NUTCH_HOME, etc.
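# (a minimal sketch of what this loader needs to provide follows the script)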
. /proj/common/setenv.sh

SEEDDIR=$NUTCH_HOME/urls
CRAWLDIR=$NUTCH_HOME/crawl/
CRAWL_ID=crawl
SOLRURL=http://localhost/nutchsolr/
LIMIT=3

#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################

# set the number of slave nodes
numSlaves=1

# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`

# number of urls to fetch in one iteration
# 250K per task?
#sizeFetchlist=`expr $numSlaves \* 5000`
sizeFetchlist=`expr $numSlaves \* 20`

# time limit for fetching, in minutes
timeLimitFetch=180

#############################################

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

# note that some of the options listed here could be set in the
# corresponding hadoop site xml param file
commonOptions="-D mapred.reduce.tasks=$numTasks -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true"
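
# for example, the compression option above could instead be set once in
# mapred-site.xml; roughly:
#   <property>
#     <name>mapred.compress.map.output</name>
#     <value>true</value>
#   </property>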

# initial injection
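# (in 2.x the crawlId gets prefixed onto the Gora schema name, so this run
#  ends up reading and writing a store/table called crawl_webpage)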
$bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID

# capture the exit code before the test resets $?
rc=$?
if [ $rc -ne 0 ]
  then exit $rc
fi

# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
  if [ -e ".STOP" ]
  then
   echo "STOP file found - escaping loop"
   break
  fi
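  # (touching a file named .STOP in the directory the script is started from
  #  is enough to stop cleanly after the current round)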

  echo `date` ": Iteration $a of $LIMIT"

  echo "Generating a new fetchlist"
  $bin/nutch generate $commonOptions -crawlId $CRAWL_ID -force -topN $sizeFetchlist -numFetchers $numSlaves -noFilter

  rc=$?
  if [ $rc -ne 0 ]
  then exit $rc
  fi

  # TODO capture the batchID
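  # (until that's done, the fetch and parse steps below just run with -all)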
  echo "Fetching : "
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch -all -crawlId $CRAWL_ID -threads 10

  rc=$?
  if [ $rc -ne 0 ]
  then exit $rc
  fi

  # parsing the segment
  echo "Parsing : "
  # enable record skipping during parsing so that a single dodgy document
  # does not fail the whole task
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
  $bin/nutch parse $commonOptions $skipRecordsOptions -all -crawlId $CRAWL_ID -force

  rc=$?
  if [ $rc -ne 0 ]
  then exit $rc
  fi

  # updatedb with this segment
  echo "CrawlDB update"
  $bin/nutch updatedb $commonOptions

  rc=$?
  if [ $rc -ne 0 ]
  then exit $rc
  fi

  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID

  rc=$?
  if [ $rc -ne 0 ]
   then exit $rc
  fi

  #echo "SOLR dedup -> $SOLRURL"
  #$bin/nutch solrdedup $commonOptions $SOLRURL

  if [ $? -ne 0 ]
   then exit $?
  fi

done

exit 0
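
The setenv.sh sourced at the top is just the environment loader I mention
below; a minimal sketch of what it needs to provide (paths here are
placeholders, not my real ones):

  export JAVA_HOME=/usr/lib/jvm/default-java
  export NUTCH_HOME=/opt/apache-nutch-2.x/runtime/local
  export PATH=$JAVA_HOME/bin:$NUTCH_HOME/bin:$PATH

The cron entry is just a single line along these lines (schedule and log
path made up for the example):

  0 2 * * * /path/to/this/script >> /var/log/nutch-crawl.log 2>&1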


-- Chris


On Fri, May 24, 2013 at 2:51 PM, <[email protected]> wrote:

> Can you send the script? Also, are you running it in deploy or local mode?
>
> -----Original Message-----
> From: Christopher Gross <[email protected]>
> To: user <[email protected]>
> Sent: Fri, May 24, 2013 9:43 am
> Subject: Re: error crawling
>
>
> Right.  "runbot" is the old one; they don't package a script like that with
> Nutch anymore.  Digging around on the web, I found a replacement.
>
> I took this script.
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
>
> I made small changes -- rather than passing in args I hard coded them (to
> make it easier to run via cron), and since my user doesn't have the right
> stuff set up in the PATH, I have an environment loader.  I also commented
> out the dedup line since it doesn't work.
>
> From that file:
>
> # initial injection
> $bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
>
> Even taking out the CRAWL_ID part I still get the crawl_webpage error
> message.  So I'm still not able to do the crawling correctly.  I still
> cannot find documentation saying what I need to do to make the Keyclass and
> nameclass match correctly.  That's what I'm trying to get answered.  I
> tried hacking at it a bit but things got uglier, so I'm looking here for
> guidance.
>
