[jira] [Comment Edited] (NUTCH-2549) protocol-http does not behave the same as browsers
    [ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430591#comment-16430591 ]

Gerard Bouchar edited comment on NUTCH-2549 at 4/9/18 2:57 PM:
---------------------------------------------------------------

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs. If we keep a custom implementation of HTTP, it should at least have a lot of tests.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a 3XX redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.


was (Author: gbouchar):

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a 3XX redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.

I also think that if we keep a custom implementation of HTTP, it should have a lot of tests.

> protocol-http does not behave the same as browsers
> --------------------------------------------------
>
>                 Key: NUTCH-2549
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2549
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> We identified the following issues in protocol-http (a plugin implementing
> the HTTP protocol):
> * It fails if a URL's path does not start with '/'.
> ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
> * It advertises its requests as being HTTP/1.0, but sends an
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This
> confuses some web servers.
> ** Example:
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
> * If a server sends a redirection (3XX status code, with a Location header),
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error
> occurs while decoding the body, the redirection is not followed and the
> information is lost. Browsers follow the redirection and close the socket
> as soon as they can.
> ** Example: [http://www.webarcelona.net/es/blog?page=2]
> * Some servers invalidly send an HTTP body directly without a status line or
> headers. Browsers handle that, protocol-http doesn't:
> ** Example: [https://app.unitymedia.de/]
> * Some servers invalidly add colons after the HTTP status code in the status
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not
> found_ for instance). Browsers can handle that.
> * Some servers invalidly send headers that span multiple lines. In that
> case, browsers simply ignore the subsequent lines, but protocol-http throws
> an error, thus preventing us from fetching the contents of the page.
> * There is no limit on the size of the HTTP headers it reads. A bogus
> server could send an infinite stream of different HTTP headers and cause the
> fetcher to run out of memory, or send the same HTTP header repeatedly and
> cause the fetcher to time out.
> * The same goes for the HTTP status line: no check is made concerning its
> size.
> * While reading chunked content, if the content size becomes larger than
> http.getMaxContent(), instead of just stopping, it
> tries to read a new chunk before having read the previous one completely,
> resulting in a 'bad chunk length' error.
>
> Additionally (and that concerns protocol-httpclient as well),
> when reading HTTP headers, for each header, the SpellCheckedMetadata class
> computes a Levenshtein distance between it and every known header in the
> HttpHeaders interface. Not only is that slow, non-standard, and inconsistent
> with browsers' behavior, but it also causes bugs and prevents us from accessing
> the real headers sent by the HTTP server.

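Two of the parsing issues quoted above (a URL whose path does not start with '/', and a stray colon after the status code) can be illustrated with a minimal Java sketch. This is not Nutch code: the class `LenientHttpLines` and its methods `requestTarget` and `parseStatusCode` are hypothetical names chosen for illustration, and the regular expression is only one possible way of tolerating the colon.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper; class and method names are hypothetical, not Nutch code. */
final class LenientHttpLines {

  // Accepts "HTTP/1.1 404 Not found" as well as the invalid
  // "HTTP/1.1 404: Not found" (optional colon right after the code).
  private static final Pattern STATUS_LINE =
      Pattern.compile("^HTTP/\\d+\\.\\d+\\s+(\\d{3}):?(?:\\s+(.*))?$");

  /** Builds the request target, substituting "/" for an empty path. */
  static String requestTarget(URI uri) {
    String path = uri.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/"; // "http://host?171" has an empty path; send "/?171"
    }
    String query = uri.getRawQuery();
    return query == null ? path : path + "?" + query;
  }

  /** Returns the status code, or -1 if the line does not look like a status line. */
  static int parseStatusCode(String line) {
    Matcher m = STATUS_LINE.matcher(line.trim());
    return m.matches() ? Integer.parseInt(m.group(1)) : -1;
  }

  public static void main(String[] args) {
    URI uri = URI.create("http://news.fx678.com?171");
    System.out.println("GET " + requestTarget(uri) + " HTTP/1.0");
    System.out.println(parseStatusCode("HTTP/1.1 404: Not found"));
    System.out.println(parseStatusCode("HTTP/1.1 404 Not found"));
  }
}
```

Returning -1 rather than throwing leaves the caller free to decide what to do with a response that has no status line at all, which is the situation described for servers that send a body directly.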
[jira] [Comment Edited] (NUTCH-2549) protocol-http does not behave the same as browsers
    [ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430591#comment-16430591 ]

Gerard Bouchar edited comment on NUTCH-2549 at 4/9/18 2:57 PM:
---------------------------------------------------------------

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a 3XX redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.

I also think that if we keep a custom implementation of HTTP, it should have a lot of tests.


was (Author: gbouchar):

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a 3XX redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.

[jira] [Comment Edited] (NUTCH-2549) protocol-http does not behave the same as browsers
    [ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430591#comment-16430591 ]

Gerard Bouchar edited comment on NUTCH-2549 at 4/9/18 2:24 PM:
---------------------------------------------------------------

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a 3XX redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.


was (Author: gbouchar):

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.

[jira] [Comment Edited] (NUTCH-2549) protocol-http does not behave the same as browsers
    [ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430591#comment-16430591 ]

Gerard Bouchar edited comment on NUTCH-2549 at 4/9/18 2:22 PM:
---------------------------------------------------------------

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the ones I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.


was (Author: gbouchar):

Hello,

OK, I am going to open sub-tasks.

As for the rewrite, I think it is very much needed. The bugs I reported here are the one I could find, but I am sure there are more subtle bugs. HTTP is not as simple a protocol as one might think, and mixing low-level socket-related concerns with higher-level fetch logic related concerns can only lead to bugs.

I do not think the content should be skipped in case of 404 or other errors, I was talking about redirects only. I do not see a case where the contents of a redirection page could be of interest, but your idea of adding a setting (disabled by default) for parsing it anyway should satisfy everyone.

> protocol-http does not behave the same as browsers
> --------------------------------------------------
>
>                 Key: NUTCH-2549
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2549
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> We identified the following issues in protocol-http (a plugin implementing
> the HTTP protocol):
> * It fails if a URL's path does not start with '/'.
> ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
> * It advertises its requests as being HTTP/1.0, but sends an
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This
> confuses some web servers.
> ** Example:
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
> * If a server sends a redirection (3XX status code, with a Location header),
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error
> occurs while decoding the body, the redirection is not followed and the
> information is lost. Browsers follow the redirection and close the socket
> as soon as they can.
> ** Example: [http://www.webarcelona.net/es/blog?page=2]
> * Some servers invalidly send an HTTP body directly without a status line or
> headers. Browsers handle that, protocol-http doesn't:
> ** Example: [https://app.unitymedia.de/]
> * Some servers invalidly add colons after the HTTP status code in the status
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not
> found_ for instance). Browsers can handle that.
> * Some servers invalidly send headers that span multiple lines. In that
> case, browsers simply ignore the subsequent lines, but protocol-http throws
> an error, thus preventing us from fetching the contents of the page.
> * There is no limit on the size of the HTTP headers it reads. A bogus
> server could send an infinite stream of different HTTP headers and cause the
> fetcher to run out of memory, or send the same HTTP header repeatedly and
> cause the fetcher to time out.
> * The same goes for the HTTP status line: no check is made concerning its
> size.
> * While reading chunked content, if the content size becomes larger than
> http.getMaxContent(), instead of just stopping, it
> tries to read a new chunk before having read the previous one completely,
> resulting in a 'bad chunk length' error.
>
> Additionally (and that concerns protocol-httpclient as well),
> when reading HTTP headers, for each header, the SpellCheckedMetadata class
> computes a Levenshtein distance between it and every known header in the
> HttpHeaders interface. Not only is that slow, non-standard, and inconsistent
> with browsers' behavior, but it also causes bugs and prevents us from accessing
> the real headers sent by the HTTP server.
> * Example: [http://www.taz.de/!443358/]. The server sends a
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects
> it to *Transfer-Encoding: chunked*. Then, Htt
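
The header-name concern in the quoted description can be made concrete with a short Java sketch. It computes the plain Levenshtein distance between the two header names from the example and contrasts fuzzy matching with a case-insensitive exact lookup. The class `HeaderNameMatching` and its methods are hypothetical and written only for illustration; the sketch does not reproduce SpellCheckedMetadata, whose normalization and acceptance threshold are not described in this message.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only; not the SpellCheckedMetadata implementation. */
final class HeaderNameMatching {

  /** Classic dynamic-programming Levenshtein distance, keeping two rows. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // distance from the empty prefix of a
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
        curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /** Header map with case-insensitive but otherwise exact name matching. */
  static Map<String, String> newHeaderMap() {
    return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
  }

  public static void main(String[] args) {
    // The two names differ only by the "Client-" prefix.
    System.out.println(levenshtein("Client-Transfer-Encoding", "Transfer-Encoding"));

    Map<String, String> headers = newHeaderMap();
    headers.put("Client-Transfer-Encoding", "chunked");
    // Exact lookup keeps the two headers distinct.
    System.out.println(headers.get("transfer-encoding"));
    System.out.println(headers.get("client-transfer-encoding"));
  }
}
```

With an exact lookup the server's header stays under the name it was sent with, so a body that is not actually chunked is not handed to a chunked decoder.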