Nutch 1.14: I am looking at the FetcherThread code. The 404 URL does get flagged with ProtocolStatus.NOTFOUND, but the broken link never makes it into the crawldb. It does, however, get into the linkdb. Please tell me how I can collect these 404 URLs.
Any help would be appreciated,
...bob

    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      // broken link is getting here
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_GONE);
      break;

On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavi...@gmail.com> wrote:
> Hi again, and thank you in advance for your kind help.
>
> I'm using Nutch 1.14.
>
> I'm trying to use Nutch to find broken links (404s) on a site. I
> followed the instructions:
>
>   bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
>
> but the dump only shows 200 and 301 statuses. There is no sign of any
> broken link. When I enter just one broken link in the seed file, the
> crawldb is empty.
>
> Please advise how I can inspect broken links with Nutch 1.14.
>
> Thank you!
> ...bob
>
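For what it's worth, the FETCH_GONE status written by FetcherThread lives only in the fetch segment until `bin/nutch updatedb` folds it back into the crawldb, where it shows up as db_gone. A sketch of a minimal cycle that should surface the 404s (directory names like `crawl/crawldb` and `urls/` are placeholders for your own layout, and the `-status` filter on readdb is assumed to be available in 1.14):

```shell
# Inject seeds, generate a fetch list, and pick the newest segment.
bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/* | tail -1)

# Fetch and parse the segment; a 404 becomes STATUS_FETCH_GONE here,
# but only inside the segment data.
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"

# Without this step the 404 never reaches the crawldb.
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# Dump only the broken links (crawldb status db_gone).
bin/nutch readdb crawl/crawldb -dump gone_dump -status db_gone
```

Note that a URL discovered only as an outlink enters the crawldb as db_unfetched on the first updatedb and is marked db_gone only after it has itself been fetched in a later round, so finding broken *links* on a site generally takes at least two generate/fetch/updatedb cycles.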