Would you give Nutch 1.4 a try? Maybe this bug has already been fixed there.

Remi

On Thursday, February 23, 2012, xuyuanme <xuyua...@gmail.com> wrote:
> Thanks for the information. But I found the wiki page
> http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
> content about Nutch redirects.
>
> I found that even if I set http.redirect.max=2 and
> db.ignore.external.links=false, the crawler still can't get redirected
> pages. After further digging, I found that the plugin lib-http (in
> Nutch 1.1) contains the following code:
>
> Java file: org.apache.nutch.protocol.http.api.HttpBase
>
>  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
> ......
>        response = getResponse(u, datum, false); // make a request
> ......
>  }
>
>  protected abstract Response getResponse(URL url,
>                                          CrawlDatum datum,
>                                          boolean followRedirects)
>    throws ProtocolException, IOException;
>
> After I changed the call to getResponse(u, datum, true) and recompiled
> the plugin, the crawler fetched redirected pages as expected.
>
> So is this a bug in the lib-http library, or have I misunderstood how
> redirects work?
>
> Thanks!
>
> lewis john mcgibbney wrote
>>
>> Hi Rafael,
>>
>> The page we are talking about will be added on the link below.
>>
>> http://wiki.apache.org/nutch/InternalDocumentation
>>
>> and will be available here
>>
>> http://wiki.apache.org/nutch/RedirectHandling
>>
>>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
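
For anyone reading this thread later: the setting http.redirect.max=2 quoted above describes a hop limit on redirects. The sketch below only illustrates that idea and is not Nutch code. The class RedirectSketch, the method follow, and the example URLs are all hypothetical, and the real fetcher may treat an exceeded limit differently (this sketch simply gives up and returns null).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a bounded redirect chain; not part of Nutch.
public class RedirectSketch {

    /**
     * Follows redirects recorded in a map (source URL -> target URL),
     * taking at most maxRedirects hops.
     * Returns the final URL, or null if the chain is longer than the limit.
     */
    public static String follow(Map<String, String> redirects,
                                String url,
                                int maxRedirects) {
        String current = url;
        int hops = 0;
        while (redirects.containsKey(current)) {
            if (hops >= maxRedirects) {
                return null; // limit reached before the chain ended
            }
            current = redirects.get(current);
            hops++;
        }
        return current;
    }

    public static void main(String[] args) {
        // Assumed example data: a -> b -> c
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a.example/", "http://b.example/");
        redirects.put("http://b.example/", "http://c.example/");

        // Two hops are needed, so a limit of 2 reaches the final page
        System.out.println(follow(redirects, "http://a.example/", 2));
        // A limit of 1 is too small for this chain
        System.out.println(follow(redirects, "http://a.example/", 1));
    }
}
```

With a limit of 2, a chain of two redirects resolves; with a limit of 1, the same chain is abandoned, which is one reason a crawl can miss redirected pages.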
