[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509864#comment-16509864
 ] 

Hudson commented on NUTCH-2549:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See [https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2549 protocol-http does not behave the same as browsers - add unit (snagel: [https://github.com/apache/nutch/commit/4cf96820553c7137236e52da0551b084814670f2])
* (edit) src/plugin/protocol-http/src/test/conf/nutch-site-test.xml
* (add) src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
NUTCH-2549 protocol-http does not behave the same as browsers - be (snagel: [https://github.com/apache/nutch/commit/2e485cfbdf46461a733cd21e9129f6fa5989f288])
* (edit) src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> protocol-http does not behave the same as browsers
> --
>
> Key: NUTCH-2549
> URL: https://issues.apache.org/jira/browse/NUTCH-2549
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.14
> Reporter: Gerard Bouchar
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
> Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing 
> the HTTP protocol):
>  * It fails if a URL's path does not start with '/'.
>  ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers 
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries 
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
>  * It advertises its requests as HTTP/1.0 but sends an 
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This 
> confuses some web servers.
>  ** Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
>  * If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  ** Example: [http://www.webarcelona.net/es/blog?page=2]
>  * Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that, protocol-http doesn't.
>  ** Example: [https://app.unitymedia.de/]
>  * Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.
>  * Some servers invalidly send headers that span multiple lines. In that 
> case, browsers simply ignore the subsequent lines, but protocol-http throws 
> an error, thus preventing us from fetching the contents of the page.
>  * There is no limit on the size of the HTTP headers it reads. A bogus 
> server could send an infinite stream of different HTTP headers and cause the 
> fetcher to run out of memory, or send the same HTTP header repeatedly and 
> cause the fetcher to time out.
>  * The same goes for the HTTP status line: no check is made on its size.
>  * While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
>
> Additionally (and this concerns protocol-httpclient as well), 
> when reading HTTP headers, for each header, the SpellCheckedMetadata class 
> computes a Levenshtein distance between it and every known header in the 
> HttpHeaders interface. Not only is that slow, non-standard, and inconsistent 
> with browsers' behavior, but it also causes bugs and prevents us from accessing 
> the real headers sent by the HTTP server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
>  
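
Two of the points in the description above (a stray colon after the status code, and no limit on the status line size) can be illustrated with a small sketch. This is not the code in HttpResponse.java; the class name, the regular expression and the 1024-character cap are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only, not the parser used by protocol-http. */
public class StatusLineSketch {

  /** Hypothetical cap on the status line length (not a Nutch setting). */
  static final int MAX_STATUS_LINE_LENGTH = 1024;

  // "HTTP/<version> <3-digit code>", optionally followed by a stray colon
  // and by a reason phrase.
  private static final Pattern STATUS_LINE =
      Pattern.compile("^HTTP/\\d+(?:\\.\\d+)?\\s+(\\d{3}):?(?:\\s.*)?$");

  /**
   * Returns the status code, or -1 if the line is missing, longer than the
   * cap, or does not look like a status line (e.g. a body sent without
   * headers).
   */
  public static int parseStatusCode(String line) {
    if (line == null || line.length() > MAX_STATUS_LINE_LENGTH) {
      return -1;
    }
    Matcher m = STATUS_LINE.matcher(line.trim());
    return m.matches() ? Integer.parseInt(m.group(1)) : -1;
  }

  public static void main(String[] args) {
    System.out.println(parseStatusCode("HTTP/1.1 404: Not found"));
    System.out.println(parseStatusCode("HTTP/1.1 404 Not found"));
    System.out.println(parseStatusCode("<html>body without headers</html>"));
  }
}
```

A caller that gets -1 could then decide whether to treat the bytes read so far as body content, as browsers reportedly do; that decision is left out of the sketch.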



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509804#comment-16509804
 ] 

ASF GitHub Bot commented on NUTCH-2549:
---

sebastian-nagel closed pull request #347: NUTCH-2549  protocol-http does not 
behave the same as browsers
URL: https://github.com/apache/nutch/pull/347
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/build.xml b/build.xml
index 1d680d0bd..d4836a4f2 100644
--- a/build.xml
+++ b/build.xml
@@ -215,6 +215,7 @@
   
   
   
+  
   
   
   
@@ -673,6 +674,7 @@
   
   
   
+  
   
   
   
@@ -1107,6 +1109,8 @@
 
 
 
+
+
 
 
 
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 37f73b8cd..cb2d2df50 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
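@@ -277,6 +277,15 @@
   </description>
 </property>
 
+<property>
+  <name>http.proxy.type</name>
+  <value>HTTP</value>
+  <description>
+    Proxy type: HTTP or SOCKS (cf. java.net.Proxy.Type).
+    Note: supported by protocol-okhttp.
+  </description>
+</property>
+
 <property>
   <name>http.proxy.exception.list</name>
   <value></value>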
@@ -301,9 +310,22 @@
 
 <property>
   <name>http.useHttp11</name>
+  <value>true</value>
+  <description>
+    If true, use HTTP 1.1, if false use HTTP 1.0 .
+  </description>
+</property>
+
+<property>
+  <name>http.useHttp2</name>
   <value>false</value>
-  <description>NOTE: at the moment this works only for protocol-httpclient.
-  If true, use HTTP 1.1, if false use HTTP 1.0 .
+  <description>
+    If true try HTTP/2 and fall-back to HTTP/1.1 if HTTP/2 not
+    supported, if false use always HTTP/1.1.
+
+    NOTE: HTTP/2 is currently only supported by protocol-okhttp and
+    requires at runtime Java 9 or a modified Java 8 with support for
+    ALPN (Application Layer Protocol Negotiation).
   </description>
 </property>
 
diff --git a/src/java/org/apache/nutch/metadata/HttpHeaders.java b/src/java/org/apache/nutch/metadata/HttpHeaders.java
index 71a66f66c..b7700e5d3 100644
--- a/src/java/org/apache/nutch/metadata/HttpHeaders.java
+++ b/src/java/org/apache/nutch/metadata/HttpHeaders.java
@@ -28,6 +28,8 @@
 
   public static final String TRANSFER_ENCODING = "Transfer-Encoding";
 
+  public static final String CLIENT_TRANSFER_ENCODING = "Client-Transfer-Encoding";
+
   public static final String CONTENT_ENCODING = "Content-Encoding";
 
   public static final String CONTENT_LANGUAGE = "Content-Language";
@@ -48,4 +50,8 @@
 
   public static final String LOCATION = "Location";
 
+  public static final String IF_MODIFIED_SINCE = "If-Modified-Since";
+
+  public static final String USER_AGENT = "User-Agent";
+
 }
diff --git a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
index 9434cab60..fdbf1b62c 100644
--- a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
+++ b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
@@ -32,9 +32,10 @@
 public class SpellCheckedMetadata extends Metadata {
 
   /**
-   * Treshold divider.
+   * Threshold divider to calculate max. Levenshtein distance for misspelled
+   * header field names:
    * 
-   * threshold = searched.length() / TRESHOLD_DIVIDER;
+   * threshold = Math.min(3, searched.length() / TRESHOLD_DIVIDER);
    */
   private static final int TRESHOLD_DIVIDER = 3;
 
@@ -112,7 +113,7 @@ public static String getNormalizedName(final String name) {
     String value = NAMES_IDX.get(searched);
 
     if ((value == null) && (normalized != null)) {
-      int threshold = searched.length() / TRESHOLD_DIVIDER;
+      int threshold = Math.min(3, searched.length() / TRESHOLD_DIVIDER);
       for (int i = 0; i < normalized.length && value == null; i++) {
         if (StringUtils.getLevenshteinDistance(searched, normalized[i]) < threshold) {
           value = NAMES_IDX.get(normalized[i]);
diff --git a/src/java/org/apache/nutch/net/protocols/Response.java b/src/java/org/apache/nutch/net/protocols/Response.java
index c9139bd6c..7096c934d 100644
--- a/src/java/org/apache/nutch/net/protocols/Response.java
+++ b/src/java/org/apache/nutch/net/protocols/Response.java
@@ -26,6 +26,32 @@
  */
 public interface Response extends HttpHeaders {
 
+  /** Key to hold the HTTP request if store.http.request is true */
+  public static final String REQUEST = "_request_";
+
+  /**
+   * Key to hold the HTTP response header if store.http.headers is
+   * true
+   */
+  public static final String RESPONSE_HEADERS = "_response.headers_";
+
+  /**
+   * Key to hold the IP address the request is sent to if
+   * store.ip.address is true
+   */
+  public static final String IP_ADDRESS = "_ip_";
+
+  /**
+   * Key to hold the time when the page has been fetched
+   */
+  public static final String FETCH_TIME = "nutch.fetch.time";
+
+  /**
+   * Key to hold boolean whether content has been trimmed because it 
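
The effect of the threshold change in SpellCheckedMetadata (Math.min(3, ...) in the diff above) can be checked with a small stand-alone sketch. It does not use the Nutch classes: the normalize() helper below is an assumption (lower-case, letters only) and the Levenshtein distance is a plain textbook implementation rather than the StringUtils call used in the patch.

```java
/** Illustrative sketch only, independent of the Nutch SpellCheckedMetadata class. */
public class SpellCheckThresholdSketch {

  /** Same divider value as in the diff above. */
  static final int TRESHOLD_DIVIDER = 3;

  /** Assumed normalization: keep letters only and lower-case them. */
  static String normalize(String name) {
    StringBuilder sb = new StringBuilder();
    for (char c : name.toCharArray()) {
      if (Character.isLetter(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  /** Textbook dynamic-programming Levenshtein distance using two rows. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /** Threshold before the patch. */
  static int oldThreshold(String searched) {
    return searched.length() / TRESHOLD_DIVIDER;
  }

  /** Threshold after the patch: never larger than 3. */
  static int newThreshold(String searched) {
    return Math.min(3, searched.length() / TRESHOLD_DIVIDER);
  }

  public static void main(String[] args) {
    String searched = normalize("Client-Transfer-Encoding");
    String known = normalize("Transfer-Encoding");
    int distance = levenshtein(searched, known);
    System.out.println("distance=" + distance
        + " old=" + (distance < oldThreshold(searched))
        + " new=" + (distance < newThreshold(searched)));
  }
}
```

Under these assumptions the two names differ by the six letters of "client", so the uncapped threshold (22 / 3 = 7) accepts the match while the capped threshold (3) rejects it, which is consistent with the taz.de example in the issue description.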

[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-11 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508006#comment-16508006
 ] 

Sebastian Nagel commented on NUTCH-2549:


Hi [~gbouchar], a PR is open to fix all sub-tasks. I've also taken your 
evilserver.py (attached to NUTCH-2561) as inspiration for unit tests. When 
testing Nutch 1.14 these fail (as expected):
{noformat}
% grep -A1 -i testcase build/protocol-http/test/TEST-org.apache.nutch.protocol.http.TestBadServerResponses.txt
Testcase: testBadHttpServer took 0.257 sec
Testcase: testNoStatusLine took 0.091 sec
Caused an ERROR
--
Testcase: testOverlongHeader took 0.48 sec
FAILED
--
Testcase: testContentLengthNotANumber took 0.075 sec
Caused an ERROR
--
Testcase: testHeaderSpellChecking took 0.065 sec
Caused an ERROR
--
Testcase: testMultiLineHeader took 0.066 sec
Testcase: testHeaderWithColon took 0.098 sec
Caused an ERROR
--
Testcase: testChunkedContent took 0.088 sec
FAILED
--
Testcase: testRequestNotStartingWithSlash took 0.094 sec
FAILED
--
Testcase: testIgnoreErrorInRedirectPayload took 0.065 sec
Caused an ERROR
{noformat}
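
One of the cases listed above, testOverlongHeader, relates to the missing limit on header size. The following is a minimal sketch of bounded header reading, not the code of the PR; the class name and both limits are hypothetical values chosen for illustration, and folded (multi-line) headers are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only, not the reader used by protocol-http. */
public class BoundedHeaderReaderSketch {

  /** Hypothetical limits (not Nutch settings). */
  static final int MAX_LINE_BYTES = 8 * 1024;
  static final int MAX_HEADER_BYTES = 64 * 1024;

  /**
   * Reads header lines up to the first empty line. Throws IOException as
   * soon as a single line or the whole header section exceeds its limit.
   */
  static List<String> readHeaders(InputStream in) throws IOException {
    List<String> lines = new ArrayList<>();
    StringBuilder line = new StringBuilder();
    int total = 0;
    int b;
    while ((b = in.read()) != -1) {
      if (++total > MAX_HEADER_BYTES) {
        throw new IOException("header section too large");
      }
      if (b == '\n') {
        if (line.length() == 0) {
          return lines; // empty line: end of headers
        }
        lines.add(line.toString());
        line.setLength(0);
      } else if (b != '\r') {
        if (line.length() >= MAX_LINE_BYTES) {
          throw new IOException("header line too long");
        }
        line.append((char) b); // bytes are mapped 1:1 to chars (ISO-8859-1 style)
      }
    }
    if (line.length() > 0) {
      lines.add(line.toString()); // stream ended without a final newline
    }
    return lines;
  }

  /** Convenience wrapper: number of header lines, or -1 if a limit was hit. */
  static int countHeaders(String raw) {
    try {
      return readHeaders(
          new ByteArrayInputStream(raw.getBytes(StandardCharsets.ISO_8859_1))).size();
    } catch (IOException e) {
      return -1;
    }
  }
}
```

Counting bytes while reading means a server that streams headers forever is cut off after a fixed amount of data instead of exhausting memory or the fetch timeout.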



[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507998#comment-16507998
 ] 

ASF GitHub Bot commented on NUTCH-2549:
---

sebastian-nagel opened a new pull request #347: NUTCH-2549  protocol-http does 
not behave the same as browsers
URL: https://github.com/apache/nutch/pull/347
 
 
   - integrates patch provided by Gerard Bouchar
   - fixes sub-tasks (see commit messages)
   - adds unit tests to verify that issues are solved
   
   Note: to avoid future merge conflicts this branch/PR includes code 
refactorings made for NUTCH-2576.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488947#comment-16488947
 ] 

Sebastian Nagel commented on NUTCH-2549:


Thanks, [~gbouchar]! Could you split the patch and address each sub-issue 
separately? There are also PRs open for review, see NUTCH-2562 and NUTCH-2576 
(tested but not ready yet).



[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-05-24 Thread Gerard Bouchar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488894#comment-16488894
 ] 

Gerard Bouchar commented on NUTCH-2549:
---

 [^NUTCH-2549.patch] 



[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-04-09 Thread Gerard Bouchar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430591#comment-16430591
 ] 

Gerard Bouchar commented on NUTCH-2549:
---

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here 
are the ones I could find, but I am sure there are more subtle bugs. HTTP is not 
as simple a protocol as one might think, and mixing low-level socket-related 
concerns with higher-level fetch-logic concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors; 
I was talking about redirects only. I do not see a case where the contents of a 
redirection page could be of interest, but your idea of adding a setting 
(disabled by default) for parsing it anyway should satisfy everyone.
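
The redirect handling discussed in this comment could look roughly like the sketch below: the body is skipped only for a 3XX response that carries a Location header, unless an opt-in flag asks for it. The class, the method names and the storeRedirectContent flag are hypothetical and do not refer to an existing Nutch property.

```java
/** Illustrative sketch only; names are hypothetical, not Nutch API. */
public class RedirectBodySketch {

  /** A redirect here means: 3XX status code together with a Location header. */
  static boolean isRedirect(int statusCode, String locationHeader) {
    return statusCode >= 300 && statusCode < 400
        && locationHeader != null && !locationHeader.isEmpty();
  }

  /**
   * Decides whether the response body should be read and decoded.
   * storeRedirectContent stands for the optional setting (off by default)
   * mentioned in the discussion.
   */
  static boolean shouldReadBody(int statusCode, String locationHeader,
      boolean storeRedirectContent) {
    return !isRedirect(statusCode, locationHeader) || storeRedirectContent;
  }

  public static void main(String[] args) {
    System.out.println(shouldReadBody(301, "http://example.org/", false));
    System.out.println(shouldReadBody(404, null, false));
  }
}
```

With this split, an error while decoding a redirect body can no longer hide the Location header, because the body is never touched in the default configuration.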

 



[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-04-09 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430360#comment-16430360
 ] 

Sebastian Nagel commented on NUTCH-2549:


Thanks, [~gbouchar], for the long list of issues affecting protocol-http 
(and partially also protocol-httpclient). I know it was surely hard work to 
prepare this list by digging into fetcher logs.

I would recommend splitting this issue into multiple sub-issues. It's easier to 
discuss every point in a single issue and address it in a single commit. Could 
you open the sub-issues? Or let us know if we should do it, thanks. In general, 
the length of the list raises the question whether it wouldn't be better to rewrite 
the protocol plugin from scratch using a library. Suggestions are welcome!

Two notes:
 - the Protocol interface does not support modifying URLs; this should be handled 
by URL normalizers (e.g. urlnormalizer-basic). URLs are used as keys in CrawlDb 
and segments and should therefore correspond to the URLs actually fetched.
 - it's a good idea to skip the content in case of redirects, 404s etc. But 
this should be optional, as some crawlers may want to store this content.
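
The first note suggests fixing URLs with an empty path at normalization time rather than in the protocol plugin. A minimal sketch of such a step with java.net.URI follows; it is not the urlnormalizer-basic implementation, and the class and method names are hypothetical.

```java
import java.net.URI;

/** Illustrative sketch only, not the urlnormalizer-basic plugin. */
public class EmptyPathSketch {

  /**
   * Inserts "/" as path when a hierarchical URL has an authority but an
   * empty path, e.g. http://news.fx678.com?171. Other URLs are returned
   * unchanged. Throws IllegalArgumentException for syntactically invalid input.
   */
  static String ensureRootPath(String url) {
    URI u = URI.create(url);
    String path = u.getRawPath();
    if (u.getScheme() == null || u.getRawAuthority() == null
        || (path != null && !path.isEmpty())) {
      return url; // relative or opaque URI, or a path is already present
    }
    StringBuilder sb = new StringBuilder();
    sb.append(u.getScheme()).append("://").append(u.getRawAuthority()).append('/');
    if (u.getRawQuery() != null) {
      sb.append('?').append(u.getRawQuery());
    }
    if (u.getRawFragment() != null) {
      sb.append('#').append(u.getRawFragment());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(ensureRootPath("http://news.fx678.com?171"));
  }
}
```

Doing this before the URL enters the CrawlDb keeps the stored key identical to the URL that is actually requested, which is the concern raised in the note.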
