Would you give Nutch 1.4 a try? Maybe this bug has already been fixed there.

Remi
On Thursday, February 23, 2012, xuyuanme <xuyua...@gmail.com> wrote:
> Thanks for the information. But I found that the wiki page
> http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
> content about Nutch redirects.
>
> I found that even if I set http.redirect.max=2 and
> db.ignore.external.links=false, the crawler still can't get redirected pages.
> Digging further, I found that the plugin lib-http (in Nutch 1.1)
> contains the following code:
>
> Java file: org.apache.nutch.protocol.http.api.HttpBase
>
> public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
>   ......
>   response = getResponse(u, datum, false); // make a request
>   ......
> }
>
> protected abstract Response getResponse(URL url,
>                                         CrawlDatum datum,
>                                         boolean followRedirects)
>   throws ProtocolException, IOException;
>
> After I changed the call to getResponse(u, datum, true) and recompiled
> the plugin, the crawler fetched redirected pages as expected.
>
> So is this a bug in the lib-http library, or have I misunderstood how
> redirects work?
>
> Thanks!
>
> lewis john mcgibbney wrote
>>
>> Hi Rafael,
>>
>> The page we are talking about will be added on the link below.
>>
>> http://wiki.apache.org/nutch/InternalDocumentation
>>
>> and will be available here
>>
>> http://wiki.apache.org/nutch/RedirectHandling
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
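
For reference, here is a minimal sketch of what a redirect limit such as http.redirect.max=2 would be expected to mean: follow at most that many hops and give up otherwise. This is not Nutch code. The class RedirectSketch, the method resolve, and the in-memory redirect map are all hypothetical names made up for illustration, and the real fetcher may handle the limit differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
class RedirectSketch {

    /**
     * Follows redirects starting at `start`, taking at most `maxRedirects` hops.
     * `redirects` maps a URL to its redirect target; a URL with no entry is
     * treated as a final (non-redirecting) page.
     * Returns the final URL, or null if the hop limit is reached first.
     */
    static String resolve(Map<String, String> redirects, String start, int maxRedirects) {
        String current = start;
        int hops = 0;
        while (redirects.containsKey(current)) {
            if (hops >= maxRedirects) {
                return null; // limit reached before arriving at a final page
            }
            current = redirects.get(current);
            hops++;
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated redirect chain: /a -> /b -> /c
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/a", "http://example.com/b");
        redirects.put("http://example.com/b", "http://example.com/c");

        // Two hops allowed: the chain resolves to /c.
        System.out.println(resolve(redirects, "http://example.com/a", 2));
        // One hop allowed: the limit is hit at /b, so the result is null.
        System.out.println(resolve(redirects, "http://example.com/a", 1));
    }
}
```

The hop counter also guards against redirect loops, since a URL that redirects to itself simply exhausts the limit.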