Nutch 1.14: I am looking at the FetcherThread code. The 404 URL does get flagged with ProtocolStatus.NOTFOUND, but the broken link never makes it into the crawldb. It does, however, get into the linkdb. Please tell me how I can collect these 404 URLs.
Any help would be appreciated,
...bob

    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      // broken link is getting here
      output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_GONE);
      break;

On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavi...@gmail.com> wrote:
> Hi again, and thank you in advance for your kind help.
>
> I'm using Nutch 1.14.
>
> I'm trying to use Nutch to find broken links (404s) on a site. I
> followed the instructions:
>
>   bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
>
> but the dump only shows 200 and 301 statuses. There is no sign of any
> broken link. When I enter just one broken link in the seed file, the
> crawldb is empty.
>
> Please advise how I can inspect broken links with Nutch 1.14.
>
> Thank you!
> ...bob
>
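For what it's worth, the FETCH_GONE status written by FetcherThread lives only in the fetch segment until `bin/nutch updatedb` folds it back into the crawldb, where it shows up as db_gone. A sketch of a minimal cycle that should surface the 404s (directory names like `crawl/crawldb` and `urls/` are placeholders for your own layout, and the `-status` filter on readdb is assumed to be available in 1.14):

```shell
# Inject seeds, generate a fetch list, and pick the newest segment.
bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/* | tail -1)

# Fetch and parse the segment; a 404 becomes STATUS_FETCH_GONE here,
# but only inside the segment data.
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"

# Without this step the 404 never reaches the crawldb.
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# Dump only the broken links (crawldb status db_gone).
bin/nutch readdb crawl/crawldb -dump gone_dump -status db_gone
```

Note that a URL discovered only as an outlink enters the crawldb as db_unfetched on the first updatedb and is marked db_gone only after it has itself been fetched in a later round, so finding broken *links* on a site generally takes at least two generate/fetch/updatedb cycles.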