If a fetch fails due to a transient error, the retry count is incremented and
the record is retried 24 hours later, by default. Errors like these happen all
the time: your browser can access the page right now, but at that moment Nutch
could not.
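
You can raise the retry ceiling with db.fetch.retry.max in nutch-site.xml.
A minimal sketch, assuming the stock property (the shipped default is 3, if
I recall correctly; the value below is only an example):

  <property>
    <name>db.fetch.retry.max</name>
    <!-- example value; raise it if the site returns 503 often -->
    <value>5</value>
    <description>Maximum number of times a URL that hit a recoverable
    error (such as HTTP 503) is re-generated for fetch before it is
    marked as gone.</description>
  </property>

Also keep in mind that generate only selects records whose scheduled fetch
time has passed, so pages rescheduled after a 503 will not show up again in
iterations run within the same day.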
 
-----Original message-----
> From:brian4 <[email protected]>
> Sent: Thursday 13th March 2014 8:12
> To: [email protected]
> Subject: How to have nutch 2 retry 503 errors
> 
> Whenever I try to crawl a large enough list, the website will sometimes
> return 503 errors for pages:
> fetch of ... failed with: Http code=503
> 
> These pages are not down and can still be accessed.  However even if I do
> more rounds of crawling, it seems nutch does not attempt to retry fetching
> these pages.  This results in only being able to crawl a fraction of the
> total pages, e.g. 32,000/37,000.
> 
> I am using nutch 2 and protocol-httpclient
> 
> Looking at the code I see they should be marked for retry and their status
> changed to "unfetched".  I checked the database and found the status is
> changed to unfetched, but nevertheless they are not re-fetched in subsequent
> iterations.
> 
> What am I missing that it's not re-trying to fetch the page?
> 
> I have this loop:
> 
> DEPTH=3
> 
> for ((a=1; a <= DEPTH ; a++))
> do
> 
>   echo `date` ": Iteration $a of $DEPTH"
> 
>   echo "Generating a new fetchlist"
>   $NUTCH_BIN/nutch generate -crawlId $CRAWL_ID
>  
>   echo `date` ": Fetching : "
>   $NUTCH_BIN/nutch fetch -all -crawlId $CRAWL_ID -threads 50
> 
>   echo `date` ": Parsing : "
>   $NUTCH_BIN/nutch parse -all -crawlId $CRAWL_ID
> 
>   echo `date` ": Updating Database"
>   $NUTCH_BIN/nutch updatedb -crawlId $CRAWL_ID
> 
> done
> 
