[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509871#comment-16509871
 ] 

Hudson commented on NUTCH-2557:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2557 protocol-http fails to follow redirections when HTTP response 
(snagel: 
[https://github.com/apache/nutch/commit/d163512d5d2e345dfe6c816a29dc93a108dfd254])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java


> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
>                 Key: NUTCH-2557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2557
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  * Example: this page redirects to its HTTPS version, with an HTTP 
> body containing invalid gzip-encoded content. Browsers follow the 
> redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-12 Thread Omkar Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509469#comment-16509469
 ] 

Omkar Reddy commented on NUTCH-2557:


A simple and wise solution. Thanks. 



[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-11 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508019#comment-16508019
 ] 

Sebastian Nagel commented on NUTCH-2557:


Hi [~omkar20895], hi [~gbouchar], [PR 
#347|https://github.com/apache/nutch/pull/347] contains Gerard's solution for 
this issue, see [commit 
d163512|https://github.com/apache/nutch/pull/347/commits/d163512d5d2e345dfe6c816a29dc93a108dfd254].
 It does not skip reading payload content for redirects and other non-200 
responses. But if reading the payload throws an exception, the exception is 
caught and ignored. Since it only affects responses which would fail otherwise, 
I've decided not to introduce a new property. Let me know whether this is ok. 
Thanks!
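
The catch-and-ignore behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the class name LenientBodySketch and the methods gunzip and readContent are hypothetical, and rethrowing only for status 200 is an assumption based on the remark that the change affects "redirects and other non-200 responses".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical stand-in for the body handling in protocol-http.
class LenientBodySketch {

  /** Decodes a gzip-encoded body; throws an IOException on invalid input. */
  static byte[] gunzip(byte[] raw) throws IOException {
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(raw));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  /**
   * Still tries to read the body for every status code. If decoding fails,
   * the exception is only propagated for 200 OK (assumption); for redirects
   * and other non-200 responses the content is simply left empty (null), so
   * the status code and headers such as Location remain usable.
   */
  static byte[] readContent(int status, byte[] rawBody) throws IOException {
    try {
      return gunzip(rawBody);
    } catch (IOException e) {
      if (status == 200) {
        throw e;
      }
      return null;
    }
  }
}
```

With this shape, a 301 response carrying a broken gzip body yields null content instead of a failed fetch, while a 200 response with the same body still fails as before.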



[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-05-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490789#comment-16490789
 ] 

Sebastian Nagel commented on NUTCH-2557:


The name is arbitrary. But it's always hard to find one which is descriptive 
but not too specific. What about {{http.content.store.always}}? This would 
include redirects, 404, not modified and further HTTP status codes. But it's 
your decision to select a suitable name. Thanks!
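
For illustration only, an opt-in flag of this kind might gate content storage roughly as sketched below. Everything here is hypothetical: the property name is just the suggestion from this comment, not an existing Nutch property, a plain Map stands in for the real configuration object, and the class and method names are made up. The default of not storing non-200 content is likewise an assumption.

```java
import java.util.Map;

// Hypothetical sketch; not part of Nutch.
class ContentStoreFlagSketch {

  // Property name as suggested in this thread (assumption, never a real property).
  static final String STORE_ALWAYS = "http.content.store.always";

  /**
   * Content of 200 OK responses is always stored. For any other status
   * (redirects, 404, not modified, ...) it is stored only if the flag is
   * explicitly set to "true"; a missing flag counts as "false".
   */
  static boolean shouldStoreContent(Map<String, String> conf, int status) {
    if (status == 200) {
      return true;
    }
    return Boolean.parseBoolean(conf.getOrDefault(STORE_ALWAYS, "false"));
  }
}
```

A single boolean keeps the name independent of any particular list of status codes.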



[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-05-25 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490581#comment-16490581
 ] 

Omkar Reddy commented on NUTCH-2557:


I agree. Sometimes the HTTP body of error responses and redirects contains 
diagnostic information that might be helpful to the user, so we should 
store it optionally. 

Can we name the property http.content.store.3XX.404? Or is that too 
complicated a name for a property?



[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488985#comment-16488985
 ] 

Sebastian Nagel commented on NUTCH-2557:


See [comments in 
NUTCH-2549|https://issues.apache.org/jira/browse/NUTCH-2549?focusedCommentId=16430591&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16430591]:
 please try to make this the default, but allow the content to be fetched and 
stored optionally.
