Whenever I crawl a large enough list, the website sometimes returns 503 errors for some pages: fetch of ... failed with: Http code=503
These pages are not down and can still be accessed. However, even if I run more rounds of crawling, Nutch does not seem to retry fetching them. As a result I can only crawl a fraction of the total pages, e.g. 32,000 out of 37,000. I am using Nutch 2 with protocol-httpclient.

Looking at the code, these pages should be marked for retry and their status changed to "unfetched". I checked the database and the status is indeed changed to unfetched, but they are still not re-fetched in subsequent iterations. What am I missing that prevents the pages from being retried?

This is my crawl loop:

DEPTH=3
for ((a=1; a <= DEPTH ; a++))
do
  echo `date` ": Iteration $a of $DEPTH"
  echo "Generating a new fetchlist"
  $NUTCH_BIN/nutch generate -crawlId $CRAWL_ID
  echo `date` ": Fetching : "
  $NUTCH_BIN/nutch fetch -all -crawlId $CRAWL_ID -threads 50
  echo `date` ": Parsing : "
  $NUTCH_BIN/nutch parse -all -crawlId $CRAWL_ID
  echo `date` ": Updating Database"
  $NUTCH_BIN/nutch updatedb -crawlId $CRAWL_ID
done
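For reference, the kind of database check I mean is along these lines (readdb here is the Nutch 2 WebTableReader; the URL is just a placeholder for one of the 503'd pages, and the flags are from my local install, so they may differ):

# Per-URL record in the web table (status, fetch time, retry counter, ...);
# "http://example.com/some-page" stands in for one of the failing URLs.
$NUTCH_BIN/nutch readdb -crawlId $CRAWL_ID -url http://example.com/some-page

# Aggregate status counts for the whole crawl, to see how many pages
# are still sitting in the unfetched state.
$NUTCH_BIN/nutch readdb -crawlId $CRAWL_ID -stats

The per-URL dump is where I saw the status flip back to unfetched, yet the next generate/fetch iteration never picks those pages up again.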

