Thanks for the information. However, the wiki page
http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
content about how Nutch handles redirects.
I found that even with http.redirect.max=2 and
db.ignore.external.links=false, the crawler still doesn't fetch redirected
pages. Digging further, I found that the lib-http plugin (in Nutch 1.1)
contains the following code:
Java file: org.apache.nutch.protocol.http.api.HttpBase

    public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
        ......
        response = getResponse(u, datum, false); // make a request
        ......
    }

    protected abstract Response getResponse(URL url,
                                            CrawlDatum datum,
                                            boolean followRedirects)
        throws ProtocolException, IOException;

After I changed the call to getResponse(u, datum, true) and recompiled
the plugin, the crawler fetched redirected pages as expected.
So is this a bug in the lib-http library, or have I misunderstood how
redirects are supposed to work?
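
To make the question concrete, here is a minimal sketch of caller-side redirect handling, which is one possible reason a protocol layer would pass followRedirects=false: the protocol call only reports the redirect, and the caller decides how many hops to follow. This is not Nutch code. The class RedirectSketch, its methods, and the example.com URLs are all hypothetical, and whether Nutch's fetcher actually applies http.redirect.max this way is an assumption on my part.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectSketch {

    /** Outcome of one fetch attempt (hypothetical type, not a Nutch class). */
    static final class Result {
        final String url;      // URL that served content, or the redirect target left unfetched
        final boolean fetched; // true if content was actually obtained
        Result(String url, boolean fetched) {
            this.url = url;
            this.fetched = fetched;
        }
    }

    // Hypothetical stand-in for the web: source URL -> redirect target.
    static final Map<String, String> REDIRECTS = new HashMap<>();
    static {
        REDIRECTS.put("http://example.com/a", "http://example.com/b");
        REDIRECTS.put("http://example.com/b", "http://example.com/c");
    }

    // Plays the role of a protocol call made with followRedirects=false:
    // it reports the redirect target instead of following it.
    // Returns null when the URL serves content directly.
    static String redirectTargetOf(String url) {
        return REDIRECTS.get(url);
    }

    // Caller-side loop: follow at most maxRedirects hops, then stop.
    static Result fetch(String url, int maxRedirects) {
        String current = url;
        int followed = 0;
        while (true) {
            String target = redirectTargetOf(current);
            if (target == null) {
                return new Result(current, true);  // content reached
            }
            if (followed >= maxRedirects) {
                return new Result(target, false);  // hop limit hit, target not fetched
            }
            current = target;
            followed++;
        }
    }

    public static void main(String[] args) {
        for (int max = 0; max <= 2; max++) {
            Result r = fetch("http://example.com/a", max);
            System.out.println("max=" + max + " fetched=" + r.fetched + " url=" + r.url);
        }
    }
}
```

In this sketch, a limit of 2 is enough to reach http://example.com/c through two hops, while a limit of 0 or 1 stops early and leaves the redirect target unfetched. If Nutch is designed along these lines, the hard-coded false would be intentional, and the problem would lie in how the redirect target is handled afterwards rather than in lib-http itself.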
Thanks!
lewis john mcgibbney wrote
>
> Hi Rafael,
>
> The page we are talking about will be added on the link below.
>
> http://wiki.apache.org/nutch/InternalDocumentation
>
> and will be available here
>
> http://wiki.apache.org/nutch/RedirectHandling
>
>
--
View this message in context:
http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
Sent from the Nutch - User mailing list archive at Nabble.com.