[ https://issues.apache.org/jira/browse/NUTCH-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811949#comment-16811949 ]
Sebastian Nagel commented on NUTCH-2707:
----------------------------------------

Turns out that there are a few more servers which do not conform to the standard and answer a request with {{Accept-Encoding: gzip}} with something other than {{Content-Encoding: gzip}} or {{Content-Encoding: identity}}. We should at least try to handle most of these cases. Further examples:
- same as the initial problem:
{noformat}
% nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -Dstore.http.headers=true -Dstore.http.request=true https://saudibusiness.directory/%D8%A7%D8%AB%D8%A7%D8%AB%D9%83%D9%88%D9%85-5741.html
...
contentType: application/gzip
...
Content Metadata: ... _request_=GET /%D8%A7%D8%AB%D8%A7%D8%AB%D9%83%D9%88%D9%85-5741.html HTTP/1.1
...
Accept-Encoding: gzip
_response.headers_=HTTP/1.1 200 OK
Date: Sun, 07 Apr 2019 14:19:15 GMT
Server: Apache
X-Powered-By: PHP/5.6.30
Content-Encoding: zlib,gzip,deflate
...
{noformat}
- response uses "deflate" although "gzip" is requested:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -Dstore.http.headers=true -Dstore.http.request=true https://de.wantedly.com/all/japan/designer/businessmodel
...
contentType: application/zlib
...
Content Metadata: ... _request_=GET /all/japan/designer/businessmodel HTTP/1.1
...
Accept-Encoding: gzip
...
_response.headers_=HTTP/1.1 200 OK
Date: Fri, 05 Apr 2019 15:56:01 GMT
...
Server: nginx
...
Content-Encoding: deflate
...
{noformat}

> protocol-okhttp fails to decompress content if Content-Encoding header is wrong
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2707
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2707
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> The plugin protocol-okhttp does not decompress the returned gzipped content for some rare pages. It looks like this happens because the HTTP response header does not specify {{Content-Encoding: gzip}} but {{zlib,gzip,deflate}}.
> {noformat}
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>     -Dstore.http.headers=true -Dstore.http.request=true \
>     http://24310.gr/afroditi-42426.html
> fetching: http://24310.gr/afroditi-42426.html
> ...
> contentType: application/gzip
> ...
> Content Metadata: Transfer-Encoding=chunked ... Content-Encoding=zlib,gzip,deflate ... _request_=GET /afroditi-42426.html HTTP/1.1
> ...
> Accept-Encoding: gzip
> _response.headers_=HTTP/1.1 200 OK
> ...
> Content-Encoding: zlib,gzip,deflate
> ...
> Transfer-Encoding: chunked
> Connection: keep-alive
> {noformat}
> The plugin protocol-http requests {{Accept-Encoding: x-gzip, gzip, deflate}} and gets the correct response header:
> {noformat}
> % bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
>     -Dstore.http.headers=true -Dstore.http.request=true \
>     http://24310.gr/afroditi-42426.html
> ...
> contentType: application/xhtml+xml
> ...
> Content Metadata: ... Content-Encoding=gzip ... _request_=GET /afroditi-42426.html HTTP/1.1
> Host: 24310.gr
> Accept-Encoding: x-gzip, gzip, deflate
> ...
> {noformat}
> The same holds for Firefox, which sends {{Accept-Encoding: gzip, deflate}}.
> I will report the issue to upstream okhttp. But it would also be possible to handle the content encoding in the protocol implementation: if the Accept-Encoding header is set, the okhttp library will not decompress the content and expects that it is handled in the calling code.
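
One way to handle the content encoding in the calling code, as suggested in the issue description, is to stop trusting the exact {{Content-Encoding}} value and look at the first bytes of the body instead. The following is only a minimal sketch under that assumption and not the actual Nutch or okhttp code: the class {{LenientContentDecoder}} and its methods are hypothetical names, and only {{java.util.zip}} from the JDK is used. It would cover both cases reported above, {{zlib,gzip,deflate}} with a gzipped body and {{deflate}} with a zlib-wrapped body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper, not part of Nutch: decodes a body by sniffing its first bytes. */
final class LenientContentDecoder {

  /** True if the body starts with the gzip magic bytes 0x1f 0x8b (RFC 1952). */
  static boolean looksLikeGzip(byte[] body) {
    return body.length >= 2
        && (body[0] & 0xff) == 0x1f
        && (body[1] & 0xff) == 0x8b;
  }

  /** True if the first two bytes form a valid zlib header (RFC 1950). */
  static boolean looksLikeZlib(byte[] body) {
    if (body.length < 2) {
      return false;
    }
    int cmf = body[0] & 0xff;
    int flg = body[1] & 0xff;
    // compression method 8 (deflate), window size <= 32k, header checksum
    return (cmf & 0x0f) == 8 && (cmf >> 4) <= 7 && ((cmf << 8) | flg) % 31 == 0;
  }

  /**
   * Returns the decoded body. The Content-Encoding header only decides
   * whether decoding is attempted at all; the actual format is detected
   * from the body, so a value such as "zlib,gzip,deflate" is tolerated.
   */
  static byte[] decode(byte[] body, String contentEncoding) throws IOException {
    String enc = contentEncoding == null
        ? "" : contentEncoding.trim().toLowerCase(Locale.ROOT);
    if (enc.isEmpty() || enc.equals("identity")) {
      return body; // server claims no encoding: leave the body untouched
    }
    if (looksLikeGzip(body)) {
      return readAll(new GZIPInputStream(new ByteArrayInputStream(body)));
    }
    if (looksLikeZlib(body)) {
      return readAll(new InflaterInputStream(new ByteArrayInputStream(body)));
    }
    return body; // unknown or mislabeled encoding: pass through unchanged
  }

  /** Reads the stream to the end (loop instead of readAllBytes to stay Java 8 compatible). */
  private static byte[] readAll(InputStream in) throws IOException {
    try (InputStream stream = in) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int n;
      while ((n = stream.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }
}
```

A caller would pass the raw response bytes together with the received header value, e.g. {{LenientContentDecoder.decode(bytes, "zlib,gzip,deflate")}}. The sketch deliberately skips decoding when no {{Content-Encoding}} is sent, so that a genuine {{.gz}} download is not unpacked by accident, and it does not handle servers that send raw deflate data without the zlib wrapper.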