Hi Brian,

What does your outgoing network topology look like? This error may be caused by your firewall, a proxy, or something similar sitting between your crawler and the site.
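One quick check is to request one of the failing pages directly from the machine that runs the crawl and compare the response with what you see from a browser elsewhere. A rough sketch (the URL and user agent below are only placeholders, not taken from your crawl):

  # Fetch a failing page and print only the status line and headers.
  # A 503 carrying a Retry-After header usually means the site is
  # rate-limiting you; a Via/proxy-looking Server header points at an
  # intermediate box on your own network path.
  curl -s -o /dev/null -D - "http://www.example.com/some-failing-page" \
    -A "my-nutch-crawler"   # use the same User-Agent your crawler sends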
Talat

2014-03-13 9:12 GMT+02:00 brian4 <[email protected]>:

> Whenever I try to crawl a large enough list, the website will sometimes
> return 503 errors for pages:
>
>   fetch of ... failed with: Http code=503
>
> These pages are not down and can still be accessed. However, even if I do
> more rounds of crawling, it seems nutch does not attempt to retry fetching
> these pages. This results in only being able to crawl a fraction of the
> total pages, e.g. 32,000/37,000.
>
> I am using nutch 2 and protocol-httpclient.
>
> Looking at the code, I see they should be marked for retry and their
> status changed to "unfetched". I checked the database and found the status
> is changed to unfetched, but nevertheless they are not re-fetched in
> subsequent iterations.
>
> What am I missing that it's not re-trying to fetch the page?
>
> I have this loop:
>
>   DEPTH=3
>
>   for ((a=1; a <= DEPTH ; a++))
>   do
>
>     echo `date` ": Iteration $a of $DEPTH"
>
>     echo "Generating a new fetchlist"
>     $NUTCH_BIN/nutch generate -crawlId $CRAWL_ID
>
>     echo `date` ": Fetching : "
>     $NUTCH_BIN/nutch fetch -all -crawlId $CRAWL_ID -threads 50
>
>     echo `date` ": Parsing : "
>     $NUTCH_BIN/nutch parse -all -crawlId $CRAWL_ID
>
>     echo `date` ": Updating Database"
>     $NUTCH_BIN/nutch updatedb -crawlId $CRAWL_ID
>
>   done
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-have-nutch-2-retry-503-errors-tp4123311.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
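On the retry side: the number of times a URL that failed with a recoverable error (such as a 503) is re-generated for fetch is controlled by db.fetch.retry.max, and, if I remember correctly, the default fetch schedule also pushes a retried page's next fetch time roughly a day into the future, so back-to-back generate/fetch rounds in the same run will not select it again. Worth verifying against AbstractFetchSchedule in your Nutch version. A minimal nutch-site.xml sketch; the values are only examples, not recommendations:

  <!-- Allow more retries for URLs that failed with recoverable errors. -->
  <property>
    <name>db.fetch.retry.max</name>
    <value>5</value>
  </property>

  <!-- Seconds between successive requests to the same server; raising
       this can reduce 503s caused by the site rate-limiting the crawler. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>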

