Thanks. Unfortunately the workaround won't work for us, since we don't want to follow more than 5 redirects. I opened an issue: NUTCH-2365.
On Thu, 2017-03-09 at 10:03 +0100, Sebastian Nagel wrote:
> Hi,
>
> looks like this has been overlooked when
> https://issues.apache.org/jira/browse/NUTCH-2069 was implemented.
> Please open an issue on https://issues.apache.org/jira/browse/NUTCH
> to report your issue.
>
> As a temporary work-around, try to set
>   http.redirect.max = 0
> Redirects are then treated the same as links.
>
> Thanks,
> Sebastian
>
> On 03/08/2017 04:20 PM, [email protected] wrote:
> > I came across an issue where the main page of a site redirects to a
> > subdomain which doesn't get followed during the crawl. The URL
> > http://www.mercenarytrader.com redirects to
> > https://members.mercenarytrader.com which doesn't get followed.
> > In the nutch-site.xml I have db.ignore.external.links set to 'true'
> > and db.ignore.external.links.mode set to 'byDomain' since I only
> > want to crawl within the domain including subdomains.
> >
> > I came across this redirect code in FetcherThread which causes the
> > issue. Instead of comparing the domains, the hosts get compared and
> > don't match up, i.e. members.mercenarytrader.com doesn't match up
> > with mercenarytrader.com. Is there an existing issue that has been
> > logged for this?
> >
> > String origHost = new URL(urlString).getHost().toLowerCase();
> > String newHost = new URL(newUrl).getHost().toLowerCase();
> > if (ignoreExternalLinks) {
> >   if (!origHost.equals(newHost)) {
> >     if (LOG.isDebugEnabled()) {
> >       LOG.debug(" - ignoring redirect " + redirType + " from "
> >           + urlString + " to " + newUrl
> >           + " because external links are ignored");
> >     }
> >     return null;
> >   }
> > }
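
For illustration, here is a rough, standalone sketch of the mode-aware comparison described in the quoted message: compare domains when the mode is 'byDomain', and hosts otherwise. It is not the Nutch code and not necessarily how the issue would be fixed. The class name, the isExternalRedirect method and the naiveDomain helper are all hypothetical. naiveDomain simply keeps the last two host labels, which is wrong for suffixes such as co.uk, so a real check would need a public-suffix-aware lookup. It also uses java.net.URI instead of java.net.URL to avoid the checked exception.

```java
import java.net.URI;
import java.util.Locale;

public class RedirectDomainCheck {

  // Hypothetical helper: guess the domain by keeping the last two host labels.
  // This is an assumption for the sketch; it mishandles suffixes like "co.uk".
  static String naiveDomain(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String[] labels = h.split("\\.");
    if (labels.length <= 2) {
      return h;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  // Returns true if the redirect target should be treated as external.
  // mode "byDomain" compares guessed domains; any other value compares hosts.
  static boolean isExternalRedirect(String fromUrl, String toUrl, String mode) {
    String origHost = URI.create(fromUrl).getHost().toLowerCase(Locale.ROOT);
    String newHost = URI.create(toUrl).getHost().toLowerCase(Locale.ROOT);
    if ("byDomain".equals(mode)) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    // Host comparison rejects the redirect, domain comparison accepts it.
    System.out.println("byHost external:   " + isExternalRedirect(from, to, "byHost"));
    System.out.println("byDomain external: " + isExternalRedirect(from, to, "byDomain"));
  }
}
```

With the two URLs from the report, the host comparison flags the redirect as external while the domain comparison does not, which matches the behaviour described above.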

