Thanks. Unfortunately the workaround won't work for us, since we don't
want to follow more than 5 redirects. I opened an issue: NUTCH-2365.
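
For illustration, here is a minimal sketch of how the redirect check quoted
below might distinguish a host comparison from a domain comparison. It is
not the Nutch code or the NUTCH-2365 patch: the class DomainRedirectCheck,
the helper naiveDomain and the mode value "byHost" are assumptions made for
this example, and the "last two labels" rule is a crude stand-in for a real
public-suffix lookup. It uses java.net.URI rather than java.net.URL only to
keep the example free of checked exceptions.

```java
import java.net.URI;
import java.util.Locale;

public class DomainRedirectCheck {

  // Hypothetical helper: keeps only the last two labels of a host name.
  // Assumption: good enough for "example.com"-style names, wrong for
  // suffixes such as "co.uk", where a public-suffix list is needed.
  public static String naiveDomain(String host) {
    String lower = host.toLowerCase(Locale.ROOT);
    String[] labels = lower.split("\\.");
    if (labels.length <= 2) {
      return lower;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  // Returns true if a redirect from fromUrl to toUrl would count as
  // external. In "byDomain" mode the (naive) domains are compared;
  // in any other mode the full host names are compared.
  public static boolean isExternalRedirect(String fromUrl, String toUrl,
      String mode) {
    String origHost = URI.create(fromUrl).getHost().toLowerCase(Locale.ROOT);
    String newHost = URI.create(toUrl).getHost().toLowerCase(Locale.ROOT);
    if ("byDomain".equals(mode)) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    // Host comparison: the hosts differ, so the redirect is ignored.
    System.out.println(isExternalRedirect(from, to, "byHost"));   // true
    // Domain comparison: both map to mercenarytrader.com.
    System.out.println(isExternalRedirect(from, to, "byDomain")); // false
  }
}
```

With a host comparison the redirect from the report is classified as
external; with the domain comparison it is kept.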

On Thu, 2017-03-09 at 10:03 +0100, Sebastian Nagel wrote:
> Hi,
> 
> looks like this was overlooked when
>   https://issues.apache.org/jira/browse/NUTCH-2069
> was implemented. Please open an issue at
>   https://issues.apache.org/jira/browse/NUTCH
> to report it.
> 
> As a temporary workaround, try setting
>   http.redirect.max = 0
> Redirects are then treated the same as links.
> 
> Thanks,
> Sebastian
> 
> On 03/08/2017 04:20 PM, [email protected] wrote:
> > I came across an issue where the main page of a site redirects to a
> > subdomain, and the redirect doesn't get followed during the crawl.
> > The URL http://www.mercenarytrader.com redirects to
> > https://members.mercenarytrader.com, which doesn't get followed.
> > In nutch-site.xml I have db.ignore.external.links set to 'true'
> > and db.ignore.external.links.mode set to 'byDomain', since I only
> > want to crawl within the domain, including subdomains.
> > 
> > I came across this redirect code in FetcherThread, which causes the
> > issue. Instead of comparing the domains, the hosts get compared and
> > don't match, i.e. members.mercenarytrader.com doesn't match
> > mercenarytrader.com. Is there an existing issue that has been logged
> > for this?
> > 
> >       String origHost = new URL(urlString).getHost().toLowerCase();
> >       String newHost = new URL(newUrl).getHost().toLowerCase();
> >       if (ignoreExternalLinks) {
> >         if (!origHost.equals(newHost)) {
> >           if (LOG.isDebugEnabled()) {
> >             LOG.debug(" - ignoring redirect " + redirType + " from "
> >                 + urlString + " to " + newUrl
> >                 + " because external links are ignored");
> >           }
> >           return null;
> >         }
> >       }
> > 
> 
> 
