Hi,

I am trying to use Nutch 1.1 to build a complete index of our corporate web 
sites. I am using a script based on this one:

http://wiki.apache.org/nutch/Crawl

The text of my script is included below. I set the crawl depth to 100 to make 
sure that everything gets indexed, expecting the process to stop on its own 
after about 20 iterations. The problem is that the Generator keeps 
re-scheduling failed URLs for fetching, so the loop only stopped when the 
maximum number of iterations (the depth) was reached. After a few iterations, 
only failing URLs were being repeatedly fetched. I checked the log for one of 
them: it failed with HTTP code 403 and was re-scheduled for fetching 94 times. 
This does not seem right.

Another problem I have noticed is that the Generator sometimes schedules only 
one URL for fetching in an iteration. This has happened once and I have not 
tried to reproduce it, but it does not seem right either.
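In case it is relevant, this is how I have been checking how many URLs each 
segment actually contains (paths are the same as in the script below):

  # Per-segment counts of generated, fetched and parsed entries
  $NUTCH_HOME/bin/nutch readseg -list -dir $NUTCH_HOME/crawl/segments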

Here is my script text:

-----------------------------------------------
#!/bin/bash

depth=100
threads=10
adddays=5
topN=50000

NUTCH_HOME=/data/HORUS_1/nutch1.1
NUTCH_HEAPSIZE=8000
nutch=$NUTCH_HOME
JAVA_HOME=/usr/lib/jvm/java-6-sun
export NUTCH_HOME
export JAVA_HOME
export NUTCH_HEAPSIZE

steps=6
echo "----- Inject (Step 1 of $steps) -----"
$nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments -topN $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d $nutch/crawl/segments/* | tail -1`

  $nutch/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi

#  echo "--- Parsing Segment $segment ---"
#  $nutch/bin/nutch parse $segment

  $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
done

echo "----- Invert Links (Step 3 of $steps) -----"
$nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*

echo "----- Index (Step 4 of $steps) -----"
$nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb $nutch/crawl/linkdb $nutch/crawl/segments/*

echo "----- Dedup (Step 5 of $steps) -----"
$nutch/bin/nutch dedup $nutch/crawl/preIndex

echo "----- Merge Indexes (Step 6 of $steps) -----"
$nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex

# in nutch-site, hadoop.tmp.dir points to crawl/tmp
rm -rf $nutch/crawl/tmp/*
-----------------------------------------------

Is anyone else experiencing the same problems? Is there anything wrong with 
what I am doing?

Regards,

Arkadi




