I came across an issue where the main page of a site redirects to a
subdomain and the redirect doesn't get followed during the crawl. The URL
http://www.mercenarytrader.com redirects to
https://members.mercenarytrader.com, which doesn't get followed.
In nutch-site.xml I have db.ignore.external.links set to 'true'
and db.ignore.external.links.mode set to 'byDomain', since I only want
to crawl within the domain, including subdomains.
I came across this redirect code in FetcherThread, which seems to cause
the issue. Instead of comparing the domains, it compares the hosts, and
they don't match, i.e. members.mercenarytrader.com doesn't match
www.mercenarytrader.com. Is there an existing issue that has been logged
for this? The code in question:

    String origHost = new URL(urlString).getHost().toLowerCase();
    String newHost = new URL(newUrl).getHost().toLowerCase();
    if (ignoreExternalLinks) {
      if (!origHost.equals(newHost)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(" - ignoring redirect " + redirType + " from "
              + urlString + " to " + newUrl
              + " because external links are ignored");
        }
        return null;
      }
    }
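
For illustration, here is a rough sketch of what a domain-level comparison could look like in place of the host comparison above. This is not the actual Nutch code or a proposed patch: the class RedirectDomainCheck and the helpers naiveDomain and sameDomain are hypothetical names, and the "last two labels" rule is a simplifying assumption that gets suffixes like co.uk and bare IP addresses wrong. I believe Nutch has its own URLUtil.getDomainName helper that handles public suffixes properly, which would be the better choice inside FetcherThread.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical standalone sketch, not part of Nutch.
final class RedirectDomainCheck {

  // Naive domain extraction: keep only the last two host labels.
  // Assumption for illustration only; wrong for suffixes such as "co.uk"
  // and for IP addresses.
  static String naiveDomain(String host) {
    String lower = host.toLowerCase();
    String[] labels = lower.split("\\.");
    int n = labels.length;
    if (n <= 2) {
      return lower;
    }
    return labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true when both URLs share the same (naively extracted) domain.
  // Unparsable URLs are treated as not matching.
  static boolean sameDomain(String fromUrl, String toUrl) {
    try {
      String origDomain = naiveDomain(new URL(fromUrl).getHost());
      String newDomain = naiveDomain(new URL(toUrl).getHost());
      return origDomain.equals(newDomain);
    } catch (MalformedURLException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The redirect from the report: the hosts differ, the domain is shared.
    System.out.println(sameDomain("http://www.mercenarytrader.com",
        "https://members.mercenarytrader.com"));
  }
}
```

With a check along these lines, the redirect would only be ignored when sameDomain returns false, and presumably only when db.ignore.external.links.mode is 'byDomain', so that the host comparison stays in place for the other mode.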