Whenever I crawl a large enough list of URLs, the website will sometimes
return 503 errors for some pages:
fetch of ... failed with: Http code=503

These pages are not down and can still be accessed.  However, even if I run
more rounds of crawling, Nutch does not seem to retry fetching these pages.
As a result I can only crawl a fraction of the total pages, e.g.
32,000/37,000.

I am using Nutch 2 with protocol-httpclient.

Looking at the code, I see these pages should be marked for retry and their
status changed to "unfetched".  I checked the database and the status is
indeed changed to unfetched, but they are still not re-fetched in subsequent
iterations.

What am I missing that prevents these pages from being re-fetched?

I have this loop:

DEPTH=3

for ((a=1; a <= DEPTH ; a++))
do

  echo `date` ": Iteration $a of $DEPTH"

  echo "Generating a new fetchlist"
  $NUTCH_BIN/nutch generate -crawlId $CRAWL_ID
 
  echo `date` ": Fetching : "
  $NUTCH_BIN/nutch fetch -all -crawlId $CRAWL_ID -threads 50

  echo `date` ": Parsing : "
  $NUTCH_BIN/nutch parse -all -crawlId $CRAWL_ID

  echo `date` ": Updating Database"
  $NUTCH_BIN/nutch updatedb -crawlId $CRAWL_ID

done
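
One thing I am wondering: even when a page is marked unfetched for retry,
the generator may only pick it up again once its scheduled fetch time has
passed, so within a short 3-iteration loop the retried pages might simply
not be due yet.  If Nutch 2's GeneratorJob accepts the same -adddays option
as the 1.x Generator (I have not verified this, so treat it as an
assumption), something like this could pull those URLs forward; the 7 days
is an arbitrary value for illustration:

  # Hypothetical: generate as if the current time were 7 days later, so
  # retry-scheduled pages become eligible for fetching now
  $NUTCH_BIN/nutch generate -adddays 7 -crawlId $CRAWL_ID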




