Re: http.redirect.max

xuyuanme Wed, 22 Feb 2012 21:08:20 -0800

Just checked the latest code in 1.4 and it's the same. See line 138 at the
link below:


http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
 

The method just calls getResponse() with the followRedirects parameter set to
*false*.

So I guess the http.redirect.max setting has no effect on this call?
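
To illustrate what a redirect limit would mean when the protocol-level request
itself never follows redirects, here is a minimal sketch. It is purely
illustrative and is not Nutch code: the class RedirectLimitSketch, the methods
fetchOnce() and resolve(), and the in-memory redirect table are all
hypothetical stand-ins, and no claim is made about where Nutch actually reads
http.redirect.max.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectLimitSketch {

    // Hypothetical stand-in for a single request made with followRedirects=false:
    // returns the redirect target for a URL, or null if the URL serves content directly.
    static String fetchOnce(Map<String, String> redirects, String url) {
        return redirects.get(url);
    }

    // Follows at most maxRedirects hops on the caller's side.
    // Returns the final URL, or null if the limit is exceeded.
    static String resolve(Map<String, String> redirects, String start, int maxRedirects) {
        String current = start;
        int hops = 0;
        while (true) {
            String target = fetchOnce(redirects, current);
            if (target == null) {
                return current;      // no redirect: this is the final page
            }
            if (hops >= maxRedirects) {
                return null;         // limit reached: give up on this URL
            }
            current = target;
            hops++;
        }
    }

    public static void main(String[] args) {
        // Assumed example data: a -> b -> c
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/a", "http://example.com/b");
        redirects.put("http://example.com/b", "http://example.com/c");

        System.out.println(resolve(redirects, "http://example.com/a", 2));
        System.out.println(resolve(redirects, "http://example.com/a", 1));
    }
}
```

In this sketch a limit of 2 reaches http://example.com/c, while a limit of 1
gives up and returns null, because the second hop would exceed the limit.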


remi tassing wrote
> 
> Would you give Nutch-1.4 a try? Maybe this bug is already solved?
> 
> Remi
> 
> On Thursday, February 23, 2012, xuyuanme <xuyuanme@> wrote:
>> Thanks for the information. But I found the wiki page
>> http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
>> content about Nutch redirects.
>>
>> I found that even if I set http.redirect.max=2 and
>> db.ignore.external.links=false, the crawler still can't get redirected
>> pages.
>> And with further digging, I found the plugin lib-http (in Nutch 1.1)
>> contains the following code:
>>
>> Java file: org.apache.nutch.protocol.http.api.HttpBase
>>
>>  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
>> ......
>>        response = getResponse(u, datum, false); // make a request
>> ......
>>  }
>>
>>  protected abstract Response getResponse(URL url,
>>                                          CrawlDatum datum,
>>                                          boolean followRedirects)
>>    throws ProtocolException, IOException;
>>
>> After I changed the call to getResponse(u, datum, true) and recompiled
>> the plugin, the crawler fetched redirected pages as expected.
>>
>> So is this a bug in the lib-http library, or have I misunderstood how
>> redirects work?
> 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768744.html
Sent from the Nutch - User mailing list archive at Nabble.com.
