Hi,

Can you post your nutch-site.xml? I will give it a spin.
Thank you

Lewis

On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme <xuyua...@gmail.com> wrote:

> Just checked the latest code in 1.4 but it's the same. See code line 138 in
> the link below:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>
> The method just calls getResponse() and sets the followRedirects parameter
> to *false*.
>
> So I guess the http.redirect.max setting has no effect on it?
>
>
> remi tassing wrote
> >
> > Would you give Nutch-1.4 a try? Maybe this bug is already solved?
> >
> > Remi
> >
> > On Thursday, February 23, 2012, xuyuanme <xuyuanme@> wrote:
> >> Thanks for the information. But I found the wiki page
> >> http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
> >> content about Nutch redirects.
> >>
> >> I found that even if I set http.redirect.max=2 and
> >> db.ignore.external.links=false, the crawler still can't get redirected
> >> pages.
> >> And with further digging, I found that the plugin lib-http (in Nutch 1.1)
> >> contains the following code:
> >>
> >> Java file: org.apache.nutch.protocol.http.api.HttpBase
> >>
> >> public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
> >>   ......
> >>   response = getResponse(u, datum, false); // make a request
> >>   ......
> >> }
> >>
> >> protected abstract Response getResponse(URL url,
> >>                                         CrawlDatum datum,
> >>                                         boolean followRedirects)
> >>   throws ProtocolException, IOException;
> >>
> >> After I changed the call to getResponse(u, datum, true) and recompiled
> >> the plugin, the crawler fetched redirected pages as expected.
> >>
> >> So is this a bug in the lib-http library, or have I misunderstood how
> >> redirects work?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768744.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
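
For anyone following this thread, here is a minimal Java sketch of what a redirect cap such as http.redirect.max is generally expected to do: follow Location targets up to a fixed number of hops, then give up. This is not Nutch's code. RedirectSketch, Response, and resolve are hypothetical names, and the "fetch" step is just a map lookup standing in for a real HTTP request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: none of these names come from Nutch.
class RedirectSketch {

    // Hypothetical response: an HTTP status plus the Location target (null if none).
    static final class Response {
        final int status;
        final String location;

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }

        boolean isRedirect() {
            return status >= 300 && status < 400 && location != null;
        }
    }

    // Follows redirects starting at startUrl, making at most maxRedirects hops.
    // Returns the final non-redirect URL, or null if the cap is reached first.
    static String resolve(String startUrl, int maxRedirects,
                          Function<String, Response> fetch) {
        String url = startUrl;
        for (int hops = 0; ; hops++) {
            Response response = fetch.apply(url);
            if (!response.isRedirect()) {
                return url;               // reached real content
            }
            if (hops >= maxRedirects) {
                return null;              // cap reached, stop following
            }
            url = response.location;      // take one more hop
        }
    }

    public static void main(String[] args) {
        // Fake site: a -> b -> c, where only c returns content.
        Map<String, Response> site = new HashMap<>();
        site.put("http://example.com/a", new Response(301, "http://example.com/b"));
        site.put("http://example.com/b", new Response(302, "http://example.com/c"));
        site.put("http://example.com/c", new Response(200, null));

        System.out.println(resolve("http://example.com/a", 2, site::get));
        System.out.println(resolve("http://example.com/a", 1, site::get));
    }
}
```

With a cap of 2 the two-hop chain resolves to the final page, while a cap of 1 stops early and returns null. Whether a given Nutch version applies its cap inside the protocol plugin or elsewhere in the crawl is exactly the open question in this thread, so the sketch takes no position on that.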