Thanks for the information. However, the wiki page
http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
content about Nutch redirects.

I found that even with http.redirect.max=2 and
db.ignore.external.links=false, the crawler still doesn't fetch redirected
pages. Digging further, I found that the lib-http plugin (in Nutch 1.1)
contains the following code:

Java file: org.apache.nutch.protocol.http.api.HttpBase

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
......
        response = getResponse(u, datum, false); // make a request
......
  }

  protected abstract Response getResponse(URL url,
                                          CrawlDatum datum,
                                          boolean followRedirects)
    throws ProtocolException, IOException;

After I changed the call to getResponse(u, datum, true) and recompiled
the plugin, the crawler fetches redirected pages as expected.

So is this a bug in the lib-http library, or have I misunderstood how
redirect handling works?
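
To make my expectation concrete, here is a minimal standalone sketch of how
I understood a redirect limit such as http.redirect.max to behave: keep
following redirect targets until a non-redirecting URL is reached or the
limit is used up. This is not Nutch code. RedirectSketch, resolve, and the
in-memory redirect table are hypothetical names I made up for illustration,
and the real fetcher may well handle the limit differently.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration only; none of these names exist in Nutch.
class RedirectSketch {

    /**
     * Follows entries in a hypothetical redirect table, starting at url,
     * for at most maxRedirects hops. Returns the final URL, or null if the
     * limit is exhausted while the current URL still redirects.
     */
    static String resolve(Map<String, String> redirects, String url, int maxRedirects) {
        String current = url;
        int hops = 0;
        while (redirects.containsKey(current)) {
            if (hops >= maxRedirects) {
                return null; // limit reached, give up on this URL
            }
            current = redirects.get(current);
            hops++;
        }
        return current;
    }

    public static void main(String[] args) {
        // Assumed example data: /a redirects to /b, which redirects to /c.
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/a", "http://example.com/b");
        redirects.put("http://example.com/b", "http://example.com/c");

        // A limit of 2 is enough to reach /c; a limit of 1 is not.
        System.out.println(resolve(redirects, "http://example.com/a", 2));
        System.out.println(resolve(redirects, "http://example.com/a", 1));
    }
}
```

With a limit of 2 I would expect the two-hop chain above to resolve to the
final page, which is what I hoped the crawler would do with my settings.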

Thanks!

lewis john mcgibbney wrote
> 
> Hi Rafael,
> 
> The page we are talking about will be added on the link below.
> 
> http://wiki.apache.org/nutch/InternalDocumentation
> 
> and will be available here
> 
> http://wiki.apache.org/nutch/RedirectHandling
> 
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
Sent from the Nutch - User mailing list archive at Nabble.com.
