Hi Robert,

404s are recorded in the CrawlDb only after the "updatedb" tool has been
run on the fetched segment. Could you share the commands you're running?
Please also have a look at the log files (especially hadoop.log): every
fetch is logged there, including fetches that failed. If you cannot find
a log message for the broken links, the URLs might be filtered out. In
that case, please also share your configuration (if it differs from the
default).
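
For reference, here is a minimal sketch of one full crawl cycle (the
"crawl" directory and "urls" seed folder are just example names - adjust
them to your setup); the updatedb step is what copies the fetch status,
including db_gone for 404s, from the segment into the CrawlDb:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  segment=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch readdb crawl/crawldb -dump myDump

With the default dump format each record carries a "Status:" line, so
the broken links should then show up as db_gone and can be pulled out
with something like: grep -B2 db_gone myDump/part-*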

Best,
Sebastian

On 3/2/20 11:11 PM, Robert Scavilla wrote:
> Nutch 1.14:
> I am looking at the FetcherThread code. The 404 url does get flagged with
> a ProtocolStatus.NOTFOUND, but the broken link never gets to the crawldb.
> It does, however, get into the linkdb. Please tell me how I can collect
> these 404 URLs.
> 
> Any help would be appreciated,
> ...bob
> 
>             case ProtocolStatus.NOTFOUND:
>             case ProtocolStatus.GONE: // gone
>             case ProtocolStatus.ACCESS_DENIED:
>             case ProtocolStatus.ROBOTS_DENIED:
>               output(fit.url, fit.datum, null, status,
>                   CrawlDatum.STATUS_FETCH_GONE); // broken link is getting here
>               break;
> 
> On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavi...@gmail.com>
> wrote:
> 
>> Hi again, and thank you in advance for your kind help.
>>
>> I'm using Nutch 1.14
>>
>> I'm trying to use nutch to find broken links (404s) on a site. I
>> followed the instructions:
>> bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
>>
>> but the dump only shows 200 and 301 statuses. There is no sign of any
>> broken link. When I enter just one broken link in the seed file, the
>> crawldb is empty.
>>
>> Please advise how I can inspect broken links with Nutch 1.14.
>>
>> Thank you!
>> ...bob
>>
> 
