I think Nutch uses the fetch-related status to update the db-related status. If the recrawled URL is gone (404), the fetch status is STATUS_FETCH_GONE, and the former page's entry is updated to STATUS_DB_GONE. If the failure is temporary, Nutch retries the URL on later crawls until the retry count reaches db.fetch.retry.max.
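To make that transition concrete, here is a rough sketch in plain Java. It is only a simplified model of the behavior as I understand it, not Nutch's actual update code: the class RecrawlUpdateSketch, its enums, and the Entry type are hypothetical stand-ins for the CrawlDatum status constants, and the exact retry comparison is an assumption.

```java
// Simplified model of the fetch-status -> db-status update described above.
// NOT Nutch's real code: all names below are hypothetical stand-ins.
final class RecrawlUpdateSketch {

    // Stand-ins for the STATUS_FETCH_* and STATUS_DB_* constants.
    enum FetchStatus { SUCCESS, GONE, RETRY }
    enum DbStatus { UNFETCHED, FETCHED, GONE }

    // Minimal db entry: current db status plus the count of
    // consecutive temporary failures since the last good fetch.
    static final class Entry {
        final DbStatus status;
        final int retries;

        Entry(DbStatus status, int retries) {
            this.status = status;
            this.retries = retries;
        }
    }

    // Computes the new db entry from the old entry and the latest fetch result.
    // retryMax plays the role of db.fetch.retry.max (assumed semantics).
    static Entry update(Entry old, FetchStatus fetch, int retryMax) {
        switch (fetch) {
            case SUCCESS:
                // A good fetch resets the retry counter.
                return new Entry(DbStatus.FETCHED, 0);
            case GONE:
                // A 404 overwrites the formerly successful status.
                return new Entry(DbStatus.GONE, old.retries);
            case RETRY: {
                // Temporary failure: schedule another try until the limit is hit.
                int retries = old.retries + 1;
                DbStatus next = retries < retryMax ? DbStatus.UNFETCHED : DbStatus.GONE;
                return new Entry(next, retries);
            }
            default:
                throw new IllegalArgumentException("unknown fetch status: " + fetch);
        }
    }

    public static void main(String[] args) {
        // A page fetched successfully once, then failing temporarily on three recrawls.
        Entry entry = new Entry(DbStatus.FETCHED, 0);
        for (int i = 1; i <= 3; i++) {
            entry = update(entry, FetchStatus.RETRY, 3);
            System.out.println("recrawl " + i + ": " + entry.status + ", retries=" + entry.retries);
        }
    }
}
```

In this model the previously successful status is replaced either way: immediately on a 404, or once the temporary failures use up the retry limit.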
On Tue, Aug 21, 2012 at 6:18 PM, weishenyun <[email protected]> wrote:

> Hi IT_ailen:
> I know what 404 means and I also know adaptive fetch schedule. But I
> want to know what Nutch will do when it meet some exceptions by recrawl.
> Still an example, a same page was fetched successfully and recrawled for
> three times. In all three times of recrawl, it returns 404 or other
> exceptions. Will Nutch uses exception page info to update the former
> successful page?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/What-is-the-Nutch-page-update-mechanism-after-recrawl-tp4002366p4002373.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)

