Gerard Bouchar created NUTCH-2557:
-------------------------------------

             Summary: protocol-http fails to follow redirections when an HTTP 
response body is invalid
                 Key: NUTCH-2557
                 URL: https://issues.apache.org/jira/browse/NUTCH-2557
             Project: Nutch
          Issue Type: Sub-task
            Reporter: Gerard Bouchar


If a server sends a redirection (3XX status code, with a Location header), 
protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
occurs while decoding the body, the redirection is not followed and the 
information is lost. Browsers follow the redirection and close the socket as 
soon as they can.
 * Example: this page redirects to its HTTPS version, with an HTTP body 
containing invalid gzip-encoded content. Browsers follow the redirection, 
but Nutch throws an error:

 ** [http://www.webarcelona.net/es/blog?page=2]

 

The HttpResponse::getContent method can already return null. I think it should 
at least return null when parsing the HTTP response body fails.
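
A minimal sketch of that fallback, using only java.util.zip. The class 
LenientBody and its methods are hypothetical names for illustration, not 
existing Nutch code: a body that cannot be gunzipped yields null instead of an 
exception, so the caller can still act on the status line and headers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of Nutch. */
final class LenientBody {

    /** Returns the decompressed bytes, or null if the input is not valid gzip. */
    static byte[] gunzipOrNull(byte[] compressed) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // ZipException (bad format) and EOFException (truncated data)
            // are both IOExceptions: report "no content" instead of failing.
            return null;
        }
    }

    /** Compresses data; only here so the sketch can be exercised end to end. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With this shape, a redirect response whose body is corrupt would still carry 
its status code and Location header to the code that follows redirections.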

Ideally, we would adopt the same behavior as browsers, and not even try parsing 
the body when the headers indicate a redirection.
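
The browser-like behavior could look roughly like the sketch below. 
RedirectCheck, isRedirect and contentFor are hypothetical names, and the plain 
Map of headers is an assumption made to keep the example self-contained; 
protocol-http uses its own types. The body reader is only invoked when the 
response is not a redirection.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Nutch. */
final class RedirectCheck {

    /** True for a 3XX status code accompanied by a non-empty Location header. */
    static boolean isRedirect(int statusCode, Map<String, String> headers) {
        if (statusCode < 300 || statusCode >= 400) {
            return false;
        }
        for (Map.Entry<String, String> h : headers.entrySet()) {
            // Header field names are case-insensitive.
            if ("Location".equalsIgnoreCase(h.getKey())
                    && h.getValue() != null
                    && !h.getValue().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /** Skips the body entirely (returns null) when the response is a redirect. */
    static byte[] contentFor(int statusCode, Map<String, String> headers,
                             Supplier<byte[]> bodyReader) {
        return isRedirect(statusCode, headers) ? null : bodyReader.get();
    }
}
```

Because bodyReader is never called for a redirection, a malformed body cannot 
prevent the Location header from being used.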



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
