Hi, it looks like this was overlooked when https://issues.apache.org/jira/browse/NUTCH-2069 was implemented. Please open an issue at https://issues.apache.org/jira/browse/NUTCH to report it.
As a temporary workaround, try setting http.redirect.max = 0.
Redirects are then treated the same as links.

Thanks,
Sebastian

On 03/08/2017 04:20 PM, [email protected] wrote:
> I came across an issue where the main page of a site redirects to a
> subdomain which doesn't get followed during the crawl. The URL
> http://www.mercenarytrader.com redirects to
> https://members.mercenarytrader.com which doesn't get followed.
>
> In the nutch-site.xml I have db.ignore.external.links set to 'true'
> and db.ignore.external.links.mode set to 'byDomain' since I only want
> to crawl within the domain including subdomains.
>
> I came across this redirect code in FetcherThread which causes the
> issue. Instead of comparing the domains, the hosts get compared and
> don't match up, i.e. members.mercenarytrader.com doesn't match up with
> mercenarytrader.com. Is there an existing issue that has been logged
> for this?
>
>   String origHost = new URL(urlString).getHost().toLowerCase();
>   String newHost = new URL(newUrl).getHost().toLowerCase();
>   if (ignoreExternalLinks) {
>     if (!origHost.equals(newHost)) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug(" - ignoring redirect " + redirType + " from "
>             + urlString + " to " + newUrl
>             + " because external links are ignored");
>       }
>       return null;
>     }
>   }
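To illustrate the mismatch described in the quoted message, here is a rough, standalone sketch of comparing domains rather than hosts. It is not the actual Nutch fix. The class name RedirectDomainCheck and the helper naiveDomain are made up for this example, and naiveDomain simply keeps the last two host labels, which is wrong for suffixes such as co.uk. Inside Nutch, a real change would presumably rely on the existing domain handling (I believe URLUtil.getDomainName covers this) instead of such a shortcut.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical standalone example; not part of Nutch.
class RedirectDomainCheck {

  // Naive assumption: the registered domain is the last two host labels.
  // This breaks for multi-part public suffixes (e.g. example.co.uk).
  static String naiveDomain(String host) {
    String lower = host.toLowerCase(Locale.ROOT);
    String[] labels = lower.split("\\.");
    if (labels.length <= 2) {
      return lower;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  // Returns true if both URLs share the same (naively derived) domain,
  // which is the comparison 'byDomain' mode would be expected to make.
  static boolean sameDomain(String origUrl, String newUrl)
      throws MalformedURLException {
    String origDomain = naiveDomain(new URL(origUrl).getHost());
    String newDomain = naiveDomain(new URL(newUrl).getHost());
    return origDomain.equals(newDomain);
  }

  public static void main(String[] args) throws MalformedURLException {
    // Hosts differ, domains match: the redirect would be followed.
    System.out.println(sameDomain("http://www.mercenarytrader.com",
        "https://members.mercenarytrader.com")); // prints "true"
    // Different domain: the redirect would still be ignored.
    System.out.println(sameDomain("http://www.mercenarytrader.com",
        "https://example.org/")); // prints "false"
  }
}
```

With a host-only comparison, the first case fails because www.mercenarytrader.com and members.mercenarytrader.com are different strings, which is the behaviour reported above.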

