[jira] [Commented] (NUTCH-2548) Compressed content skipped. Content of size 78 was truncated to 74
[ https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430359#comment-16430359 ]

Hudson commented on NUTCH-2548:
---
SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1604 (See [https://builds.apache.org/job/Nutch-nutchgora/1604/])
fix for NUTCH-2548 contributed by rustyx (me: [https://github.com/apache/nutch/commit/3289fdcddaae14cc6f692afb368039f152cf7996])
* (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
NUTCH-2548 Compressed content skipped, contributed by Rustam - do not (snagel: [https://github.com/apache/nutch/commit/7f0fe0fc718cf1caf4bb2ad3c0d4d2e01d92e571])
* (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

> Compressed content skipped. Content of size 78 was truncated to 74
> --
>
> Key: NUTCH-2548
> URL: https://issues.apache.org/jira/browse/NUTCH-2548
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.4
> Reporter: Rustam
> Priority: Major
> Fix For: 2.4
>
> Attachments: nutch-content-truncated.patch
>
>
> gzip or deflate compressed content fails to parse with a message like:
> {{WARN parse.ParserJob - https://rustyx.org/temp/index%20bbb skipped. Content of size 78 was truncated to 74}}
> The root cause is that the original (compressed) Content-Length is stored in
> the headers, while the content is stored uncompressed. Subsequently the
> Content-Length doesn't match the stored content size.
> See attached patch that fixed the issue by removing Content-Length from the
> headers if it contains compressed value.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
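For illustration, the mismatch described in the issue can be reproduced outside Nutch. The sketch below is not Nutch code: the class TruncationCheckSketch and its helpers gzip and looksTruncated are hypothetical, and the check is only assumed to resemble what the parser does (comparing the Content-Length header against the number of bytes actually stored). For a very small body, gzip framing makes the compressed form longer than the original, which is consistent with the "size 78 ... truncated to 74" message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not taken from the Nutch sources.
class TruncationCheckSketch {

  // Compress a body the way a server would before sending it with
  // "Content-Encoding: gzip".
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  // Assumed shape of the truncation check: the header promises more
  // bytes than were stored. A missing or unparsable header is ignored.
  static boolean looksTruncated(String contentLengthHeader, byte[] stored) {
    if (contentLengthHeader == null) {
      return false;
    }
    try {
      return Integer.parseInt(contentLengthHeader.trim()) > stored.length;
    } catch (NumberFormatException e) {
      return false;
    }
  }
}
```

If the header still carries the compressed length while the stored bytes are the decompressed body, looksTruncated reports true even though nothing was lost; with the header absent it reports false.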
[jira] [Commented] (NUTCH-2548) Compressed content skipped. Content of size 78 was truncated to 74
[ https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430331#comment-16430331 ]

ASF GitHub Bot commented on NUTCH-2548:
---
sebastian-nagel closed pull request #308: fix for NUTCH-2548 contributed by rustyx
URL: https://github.com/apache/nutch/pull/308

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
index 989e4e53d..3140bee78 100644
--- a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
+++ b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
@@ -169,10 +169,12 @@
           content = http.processGzipEncoded(content, url);
           if (Http.LOG.isTraceEnabled())
             fetchTrace.append("; extracted to " + content.length + " bytes");
+          headers.remove(Response.CONTENT_LENGTH);
         } else if ("deflate".equals(contentEncoding)) {
           content = http.processDeflateEncoded(content, url);
           if (Http.LOG.isTraceEnabled())
             fetchTrace.append("; extracted to " + content.length + " bytes");
+          headers.remove(Response.CONTENT_LENGTH);
         }
       }

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
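The pattern in the diff (decode the body, then drop the now-stale Content-Length) can be sketched in isolation as follows. This is a minimal stand-alone sketch under stated assumptions, not the Nutch implementation: the class DropStaleLengthSketch and the methods gzip, gunzip and decodeBody are hypothetical, a plain java.util.Map stands in for Nutch's header metadata, the header names are written as string literals, and only the gzip branch is shown (the deflate branch in the diff does the same removal).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the "remove Content-Length after decoding" idea.
class DropStaleLengthSketch {

  // Helper used only to build test input: gzip-compress a body.
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  // Decompress a gzip-encoded body into a byte array.
  static byte[] gunzip(byte[] compressed) {
    try (GZIPInputStream in =
        new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // If the body was gzip-encoded, store it decoded and remove the
  // Content-Length header, which described the compressed bytes only.
  // Bodies without that encoding are returned unchanged, headers intact.
  static byte[] decodeBody(Map<String, String> headers, byte[] body) {
    if ("gzip".equals(headers.get("Content-Encoding"))) {
      byte[] decoded = gunzip(body);
      headers.remove("Content-Length");
      return decoded;
    }
    return body;
  }
}
```

Removing the header rather than rewriting it to the decoded length mirrors the diff; any later length comparison then has nothing stale to compare against.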
[jira] [Commented] (NUTCH-2548) Compressed content skipped. Content of size 78 was truncated to 74
[ https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430302#comment-16430302 ]

Sebastian Nagel commented on NUTCH-2548:

Thanks, [~rustyx]! Confirmed for 2.x (using parsechecker); 1.x does not seem to be affected.
[jira] [Commented] (NUTCH-2548) Compressed content skipped. Content of size 78 was truncated to 74
[ https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424381#comment-16424381 ]

ASF GitHub Bot commented on NUTCH-2548:
---
rustyx opened a new pull request #308: fix for NUTCH-2548 contributed by rustyx
URL: https://github.com/apache/nutch/pull/308