Hi, I am trying to use Nutch 1.1 to build a complete index of our corporate web sites. I am using a script based on this one:
http://wiki.apache.org/nutch/Crawl

The text of my script is included below. I set the crawl depth to 100 to make sure that everything gets indexed, expecting the process to stop after about 20 iterations. The problem is that the Generator keeps re-scheduling the fetching of failed URLs, so the process only stopped because the maximum number of iterations (the depth) was reached. After a few iterations, only failing URLs were being fetched, over and over. I checked the log for one of them: it failed with code 403 and was re-scheduled for fetching 94 times. This does not seem right.

Another problem I have noticed is that the Generator sometimes schedules the fetching of just one URL per iteration. This happened once and I have not tried to reproduce it, but it does not seem right either.

Here is my script:

-----------------------------------------------
#!/bin/bash

depth=100
threads=10
adddays=5
topN=50000
NUTCH_HOME=/data/HORUS_1/nutch1.1
NUTCH_HEAPSIZE=8000
nutch=$NUTCH_HOME
JAVA_HOME=/usr/lib/jvm/java-6-sun

export NUTCH_HOME
export JAVA_HOME
export NUTCH_HEAPSIZE

steps=6

echo "----- Inject (Step 1 of $steps) -----"
$nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d $nutch/crawl/segments/* | tail -1`
  $nutch/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi
#  echo "--- Parsing Segment $segment ---"
#  $nutch/bin/nutch parse $segment
  $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
done

echo "----- Invert Links (Step 3 of $steps) -----"
$nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*

echo "----- Index (Step 4 of $steps) -----"
$nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb $nutch/crawl/linkdb $nutch/crawl/segments/*

echo "----- Dedup (Step 5 of $steps) -----"
$nutch/bin/nutch dedup $nutch/crawl/preIndex

echo "----- Merge Indexes (Step 6 of $steps) -----"
$nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex

# in nutch-site, hadoop.tmp.dir points to crawl/tmp
rm -rf $nutch/crawl/tmp/*
-----------------------------------------------

Is anyone else experiencing the same problems? Is there anything wrong with what I am doing?

Regards,
Arkadi
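P.S. In case it helps with diagnosis: as far as I understand, the crawldb record for a single URL (status, retry count, next fetch time) can be dumped with the readdb tool. This is roughly what I run to look at the repeatedly re-scheduled 403 URL; the crawldb path matches the script above, and the URL is just a placeholder for the real one:

-----------------------------------------------
# Dump the crawldb record for one URL (status, retries, fetch interval).
# The URL below is a placeholder, not the actual failing page.
$NUTCH_HOME/bin/nutch readdb $NUTCH_HOME/crawl/crawldb \
  -url http://intranet.example.com/private/page.html
-----------------------------------------------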

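For the second symptom (only one URL generated per iteration), I believe the per-status counts for the whole crawldb can be printed after each iteration, which should show how many URLs are still unfetched versus gone or failed:

-----------------------------------------------
# Print overall crawldb statistics (counts per status such as
# db_unfetched, db_fetched, db_gone).
$NUTCH_HOME/bin/nutch readdb $NUTCH_HOME/crawl/crawldb -stats
-----------------------------------------------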
