Hi Brian,

What does your outgoing network topology look like? The 503s may be caused
by your firewall, a proxy, or similar rate limiting between you and the
server.
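
For example, from the machine that runs the fetcher you could fetch one of
the failing URLs directly (the URL below is only a placeholder):

  curl -sI 'http://www.example.com/some-failing-page' | head -n 1

If curl gets a 200 while Nutch gets a 503, the server may be rate limiting
your crawler specifically.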

Talat


2014-03-13 9:12 GMT+02:00 brian4 <[email protected]>:

> Whenever I try to crawl a large enough list, the website will sometimes
> return 503 errors for some pages:
> fetch of ... failed with: Http code=503
>
> These pages are not down and can still be accessed. However, even if I do
> more rounds of crawling, Nutch does not seem to retry fetching these pages.
> As a result, I can only crawl a fraction of the total pages, e.g.
> 32,000/37,000.
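>
> If the retry behaviour is controlled from nutch-site.xml, I would expect
> the relevant properties to be something like this (the values are what I
> believe to be the defaults, not a verified fix):
>
>   <property>
>     <name>db.fetch.retry.max</name>
>     <!-- how many times a transiently failed page may be retried -->
>     <value>3</value>
>   </property>
>   <property>
>     <name>fetcher.server.delay</name>
>     <!-- seconds to wait between requests to the same host -->
>     <value>5.0</value>
>   </property>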
>
> I am using Nutch 2 with protocol-httpclient.
>
> Looking at the code, I see such pages should be marked for retry and their
> status changed to "unfetched". I checked the database and confirmed the
> status is changed to unfetched, but they are still not re-fetched in
> subsequent iterations.
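>
> (I inspected the entries roughly like this, assuming the 2.x readdb
> syntax; the URL is a placeholder for one of the failing pages:
>
>   $NUTCH_BIN/nutch readdb -crawlId $CRAWL_ID -url http://www.example.com/some-failing-page
>
> which prints the entry's status, retry counter and fetch time.)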
>
> What am I missing that keeps these pages from being retried?
>
> I have this loop:
>
> DEPTH=3
>
> for ((a=1; a <= DEPTH ; a++))
> do
>
>   echo `date` ": Iteration $a of $DEPTH"
>
>   echo "Generating a new fetchlist"
>   $NUTCH_BIN/nutch generate -crawlId $CRAWL_ID
>
>   echo `date` ": Fetching : "
>   $NUTCH_BIN/nutch fetch -all -crawlId $CRAWL_ID -threads 50
>
>   echo `date` ": Parsing : "
>   $NUTCH_BIN/nutch parse -all -crawlId $CRAWL_ID
>
>   echo `date` ": Updating Database"
>   $NUTCH_BIN/nutch updatedb -crawlId $CRAWL_ID
>
> done
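>
> One thing I am wondering about: if the failed pages are rescheduled with a
> fetch time in the future, generate would skip them within the same run. If
> the 2.x generate job supports -adddays the way 1.x does (I have not
> verified this), something like
>
>   $NUTCH_BIN/nutch generate -crawlId $CRAWL_ID -adddays 7
>
> might pick up URLs whose fetch time is up to 7 days ahead.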
>
>
>
>
>
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
