Re: What is the Nutch page-update mechanism after recrawl

feng lu Tue, 21 Aug 2012 06:45:59 -0700

I think Nutch uses the fetch-related status to update the db-related
status. If the recrawled URL is gone (404), the fetch-related status is
STATUS_FETCH_GONE, and the former page's entry is updated to STATUS_DB_GONE.
If the URL fails temporarily, Nutch retries it on later crawls until the
number of retries reaches db.fetch.retry.max.
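
To make the mapping concrete, here is a rough sketch of how I understand the
update step. It is a toy model, not Nutch's actual code: next_db_status and
its parameters are names made up for illustration, and STATUS_FETCH_SUCCESS,
STATUS_FETCH_RETRY, STATUS_DB_FETCHED and STATUS_DB_UNFETCHED are CrawlDatum
status names that I did not mention above. The default of 3 for the retry
limit and the exact way the counter is compared are assumptions too.

```python
# Toy model of the fetch-status -> db-status update; NOT Nutch source code.

def next_db_status(fetch_status, retries_so_far, retry_max=3):
    """Return the db status an entry would get after one recrawl attempt.

    fetch_status   -- outcome of the latest fetch (CrawlDatum-style name)
    retries_so_far -- temporary failures counted so far, including this one
    retry_max      -- stands in for db.fetch.retry.max (3 is an assumption)
    """
    if fetch_status == "STATUS_FETCH_SUCCESS":
        return "STATUS_DB_FETCHED"
    if fetch_status == "STATUS_FETCH_GONE":
        # A 404 overrides the earlier successful state.
        return "STATUS_DB_GONE"
    if fetch_status == "STATUS_FETCH_RETRY":
        # Temporary failure: keep the URL eligible for fetching until the
        # retry budget is used up, then give up on it.
        if retries_so_far < retry_max:
            return "STATUS_DB_UNFETCHED"
        return "STATUS_DB_GONE"
    raise ValueError("status not covered by this sketch: " + fetch_status)
```

So in your example, a page that was fetched successfully and then returns 404
on a recrawl would come out of this model as STATUS_DB_GONE, while a page
that only fails temporarily stays eligible for another attempt until the
retry limit is reached.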


On Tue, Aug 21, 2012 at 6:18 PM, weishenyun <[email protected]> wrote:

> Hi IT_ailen:
>        I know what 404 means, and I also know about the adaptive fetch
> schedule. But I want to know what Nutch does when it hits exceptions during
> a recrawl. As an example, say a page was fetched successfully and then
> recrawled three times, and all three recrawls returned 404 or some other
> exception. Will Nutch use the failed fetch results to update the record of
> the formerly successful page?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/What-is-the-Nutch-page-update-mechanism-after-recrawl-tp4002366p4002373.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)
