Hi,

it looks like this was overlooked when
https://issues.apache.org/jira/browse/NUTCH-2069 was
implemented. Please open an issue on
  https://issues.apache.org/jira/browse/NUTCH
to report the problem.
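
For the report, here is a rough sketch of what a mode-aware redirect
check could look like. It is not the actual Nutch code or a patch:
the class RedirectCheck, the helper naiveDomain() and the byDomain
flag are made up for illustration, and the "last two labels" rule is
a simplification that is wrong for IP addresses and for suffixes such
as co.uk. A real fix would presumably use Nutch's own domain handling
(e.g. URLUtil.getDomainName) rather than this helper.

```java
import java.net.URI;
import java.util.Locale;

class RedirectCheck {

  // Hypothetical helper: approximates the domain as the last two host labels.
  // Wrong for IP addresses and multi-part suffixes such as "co.uk".
  public static String naiveDomain(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String[] labels = h.split("\\.");
    int n = labels.length;
    return n <= 2 ? h : labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true if the redirect target should count as external.
  // byDomain = true compares domains, byDomain = false compares full hosts
  // (the host comparison is what the quoted FetcherThread code does).
  public static boolean isExternalRedirect(String fromUrl, String toUrl,
      boolean byDomain) {
    String origHost = URI.create(fromUrl).getHost().toLowerCase(Locale.ROOT);
    String newHost = URI.create(toUrl).getHost().toLowerCase(Locale.ROOT);
    if (byDomain) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }
}
```

With byDomain set, the redirect from www.mercenarytrader.com to
members.mercenarytrader.com counts as internal; with a plain host
comparison it is rejected as external, which matches what you see.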

As a temporary work-around, try to set
  http.redirect.max = 0
Redirects are then treated the same as links.

Thanks,
Sebastian

On 03/08/2017 04:20 PM, [email protected] wrote:
> I came across an issue where the main page of a site redirects to a
> subdomain which doesn't get followed during the crawl. The URL
> http://www.mercenarytrader.com redirects to
> https://members.mercenarytrader.com which doesn't get followed.
> In the nutch-site.xml I have db.ignore.external.links set to 'true'
> and db.ignore.external.links.mode set to 'byDomain' since I only want
> to crawl within the domain including subdomains.
> 
> I came across this redirect code in FetcherThread which causes the
> issue. Instead of comparing the domains, the hosts get compared and
> don't match up, i.e. members.mercenarytrader.com doesn't match up with
> www.mercenarytrader.com. Is there an existing issue that has been
> logged for this?
> 
>       String origHost = new URL(urlString).getHost().toLowerCase();
>       String newHost = new URL(newUrl).getHost().toLowerCase();
>       if (ignoreExternalLinks) {
>         if (!origHost.equals(newHost)) {
>           if (LOG.isDebugEnabled()) {
>             LOG.debug(" - ignoring redirect " + redirType + " from "
>                 + urlString + " to " + newUrl
>                 + " because external links are ignored");
>           }
>           return null;
>         }
>       }
> 
