[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Reviewed-on: http://gerrit.cloudera.org:8080/8936 Reviewed-by: Tim Armstrong Tested-by: Impala Public Jenkins --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 259 insertions(+), 72 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 21 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 20: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 20 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Fri, 18 May 2018 05:05:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 20: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2503/ -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 20 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Fri, 18 May 2018 01:55:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 20: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 20 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Fri, 18 May 2018 01:55:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 19: Code-Review+2 Will wait for the other prerequisite to be merged. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 19 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Thu, 10 May 2018 00:48:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#19). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 259 insertions(+), 72 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/19 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 19 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 18: (1 comment) http://gerrit.cloudera.org:8080/#/c/8936/18/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/8936/18/be/src/runtime/timestamp-parse-util.cc@370 PS18, Line 370: DCHECK_GT(len, 0); > Please remove this, it's not valid. E.g. cast('' as timestamp). Sure was an oversight should have removed it, thanks for pointing out. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 18 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 08 May 2018 22:31:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 18: (2 comments) Looks good but we'll want to merge after the IMPALA-6995 fix so we don't hit that DCHECK in the meantime. http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc@370 PS15, Line 370: DCHECK_GT(len, 0); > Introduced a DCHECK_GT(len, 0) and a check in the caller. The checks added in this patch don't really fix the problem (and actually make it worse) because it has other callsites. Filed IMPALA-6995 to fix this properly. http://gerrit.cloudera.org:8080/#/c/8936/18/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/8936/18/be/src/runtime/timestamp-parse-util.cc@370 PS18, Line 370: DCHECK_GT(len, 0); Please remove this, it's not valid. E.g. cast('' as timestamp). We can wait for the fix for IMPALA-6995 to go in before merging this change. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 18 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 08 May 2018 22:05:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#18). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/runtime/timestamp-parse-util.cc M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 10 files changed, 260 insertions(+), 72 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/18 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 18 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#17). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/runtime/timestamp-parse-util.cc M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 10 files changed, 258 insertions(+), 72 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/17 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 17 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@588 PS15, Line 588: } > I think we should catch the invalid column metadata when we read it in GetC Since the end of column is calculated here I am checking it here. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 16 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Thu, 03 May 2018 18:55:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#16). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/runtime/timestamp-parse-util.cc M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 10 files changed, 259 insertions(+), 73 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/16 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 16 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 15: (4 comments) http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@91 PS15, Line 91: excceeded > typo: exceeded Done http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@545 PS15, Line 545: if (max_tuples < 0) { > It seems like this is catching the problem too late. How does it get into t Done http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc@370 PS15, Line 370: if (UNLIKELY(str == NULL || len <= 0)) { > We're already checking len here but the problem is that we're trimming whit Introduced a DCHECK_GT(len, 0) and a check in the caller. http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc@477 PS15, Line 477: if (lazy_len > 0 && ParseFormatTokensByStr(&lazy_ctx)) { > Looks like you hit upon a different bug here - I can trigger that DCHECK wi I have removed the check -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 15 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Thu, 03 May 2018 18:44:11 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 15: (6 comments) http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@91 PS15, Line 91: excceeded typo: exceeded http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@91 PS15, Line 91: 8 million limit" Let's use the constant instead of hardcoding it in the error message so that they can't get out of sync. http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@545 PS15, Line 545: if (max_tuples < 0) { It seems like this is catching the problem too late. How does it get into this invalid state? The problem is in our own code so we shoudl fix it at the source rather than trying to catch it here. http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/exec/hdfs-rcfile-scanner.cc@588 PS15, Line 588: if (col_end > row_group_end || column.start_offset < 0 || column.buffer_pos < 0 I think we should catch the invalid column metadata when we read it in GetCurrentKeyBuffer(), rather than letting things get into an invalid state and trying to catch it later. http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc@370 PS15, Line 370: if (UNLIKELY(str == NULL || len <= 0)) { We're already checking len here but the problem is that we're trimming whitespace below. I think the right fix is to check if (len <= 0) immediately after stripping leading and trailing whitespace. That will be less error-prone than trying to catch it later http://gerrit.cloudera.org:8080/#/c/8936/15/be/src/runtime/timestamp-parse-util.cc@477 PS15, Line 477: if (lazy_len > 0 && ParseFormatTokensByStr(&lazy_ctx)) { Looks like you hit upon a different bug here - I can trigger that DCHECK with another query - select cast(' ' as timestamp); I don't think it should affect release builds, but can you file a separate bug to fix that? I think we should fix that separately and add a specific regression text. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 15 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Fri, 27 Apr 2018 23:27:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#15). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/runtime/timestamp-parse-util.cc M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 10 files changed, 264 insertions(+), 73 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/15 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 15 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#14). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 200 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/runtime/timestamp-parse-util.cc M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 10 files changed, 258 insertions(+), 73 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/14 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 14 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 13: (6 comments) http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG@14 PS13, Line 14: with 100 iteration without failure. > We should run it for more than 100 iterations before committing, I think t I ran the the test for around 200 iterations to see if it fails. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@86 PS13, Line 86: const int max_ncols = 800; > Let's make this consistent with other constant - should be all caps and def Done http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@90 PS13, Line 90: ss << stream_->filename() << " Invalid file-header-metadata, number of columns " > I think this error message is confusing. The metadata isn't invalid, the pr Done http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@363 PS13, Line 363: ss << stream_->filename() << " Bad rowcolumn index: " << col_idx; > Is there meant to be a space between row and column? It seems like we shoul Done http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc@56 PS13, Line 56: if (codegen == nullptr) { > This DCHECK should be valid - why did this need to be changed? If we're hit Done http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py@198 PS13, Line 198: elif 'corrupt' in str(e).lower(): > In theory we should just be skipping over corrupt files with abort_on_error Done -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 13 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Wed, 25 Apr 2018 16:59:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 13: (9 comments) http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG@14 PS13, Line 14: with 100 iteration without failure. We should run it for more than 100 iterations before committing, I think there's still some chance of a rare failure slipping through otherwise. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@86 PS13, Line 86: const int max_ncols = 800; Let's make this consistent with other constant - should be all caps and defined up the top of the file. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@90 PS13, Line 90: ss << stream_->filename() << " Invalid file-header-metadata, number of columns " I think this error message is confusing. The metadata isn't invalid, the problem is that we've imposed a limit on the maximum number of columns that Impala can handle. I think the error message needs to be rewritten to explain the limit. It should also include the value of the limit. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@363 PS13, Line 363: ss << stream_->filename() << " Bad rowcolumn index: " << col_idx; Is there meant to be a space between row and column? It seems like we should also say something relating to corruption so that it's less cryptic. E.g. "Corrupt column at index: " http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc@56 PS13, Line 56: if (codegen == nullptr) { This DCHECK should be valid - why did this need to be changed? If we're hitting the DCHECK it indicates a bug with how we set up codegen and we need to fix the bug rather than remove the DCHECK. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h@198 PS11, Line 198: // Buffer access out of bounds. > Shouldn't this be >=? Since buf[size] is one past the end of the buffer? Looks like you missed this comment. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc File be/src/util/decompress.cc: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@452 PS11, Line 452: if (input_len < 0) { > Shouldn't we be checking if it's <= sizeof(uint32_t), since that's the amou Looks like you missed this comment. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@454 PS11, Line 454: ss << " Corruption snappy decomp input_len " << input_len; > Can we make this consistent with the other error handling in this function Looks like you missed this comment. http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py@198 PS13, Line 198: elif 'corrupt' in str(e).lower(): In theory we should just be skipping over corrupt files with abort_on_error=1. This doesn't always happen for a lot of file formats. We should just add rc and sequence file to the list of formats below on l207 that don't implement this skipping behaviour, rather than trying to guess based on the error message. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 13 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Wed, 11 Apr 2018 23:27:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#13). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Made changes to fuzz test for RCFile/Sequence file, ran fuzz test in a loop with 100 iteration without failure. b) Ran exhaustive test on the changes without failure. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 249 insertions(+), 71 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/13 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 13 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 12: (11 comments) http://gerrit.cloudera.org:8080/#/c/8936/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/11//COMMIT_MSG@13 PS11, Line 13: a) Ran the fuzz test on a private build without failures/crash. > It still fails sometimes - let's be clear in the commit message that this d Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@176 PS11, Line 176: > extra indentation Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@176 PS11, Line 176: > extra indentation Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@218 PS11, Line 218: > formatting is a little off (leftmost << should be aligned with each other). Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@331 PS11, Line 331: ss << stream_->filename() << " RCFile corrupt," << "out of bounds error num_rows_: " > Error message will not have a space after the comma. (here and at least one Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@345 PS11, Line 345: key_buf_ptr > I don't think it's necessarily safe to print uint8_t* pointers with C++ str no more printing uint8_t* pointers, and removed Impala specific message. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@433 PS11, Line 433: col_info.uncompressed_buffer_len: 1; > Why is it safe for this to be 1 if the length of the buffer is <= 0? Won't replaced this with col_info.buffer_len which sets an upper bound that should be checked against. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@581 PS11, Line 581: row_group_sz > row_group_end? _sz makes me think this is a numeric size, which it isn't. Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@583 PS11, Line 583: col_sz > col_end? Done http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@587 PS11, Line 587: << col_sz << " greater than row_group_sz "<< row_group_sz; > Printing col_sz and row_group_sz is really dangerous here, since it's going removed printing of char* and replaced the Impala specific error, the error is more generic now w.r.t RCFIle http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc@362 PS12, Line 362: if (remain_len <= 0) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error" : " GetCurrentKeyBuffer col_idx: " << col_idx << " remaining length: " << remain_len : << " col_info.buffer_len: " << col_info.buffer_len; : return Status(ss.str()); : } : : DCHECK_GT(remain_len, 0); : int bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, &col_info.buffer_len, remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error" : " GetCurrentKeyBuffer col_idx: " << col_idx << " remaining length: " << remain_len : << " col_info.buffer_len: " << col_info.buffer_len; : return Status(ss.str()); : } : *key_buf_ptr += bytes_read; : remain_len -= bytes_read; : DCHECK_GT(remain_len, 0); : : bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, &col_info.uncompressed_buffer_len, : remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error key buffer " : "col_idx: " << col_idx << " remaining length: "<< remain_len; : return Status(ss.str()); : } : *key_buf_ptr += bytes_read; : remain_len -= bytes_read; : DCHECK_GT(remain_len, 0); : : int col_key_buf_len; : bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr , &col_key_buf_len, remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error key " : "buffer col_i
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc@362 PS12, Line 362: if (remain_len <= 0) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error" : " GetCurrentKeyBuffer col_idx: " << col_idx << " remaining length: " << remain_len : << " col_info.buffer_len: " << col_info.buffer_len; : return Status(ss.str()); : } : : DCHECK_GT(remain_len, 0); : int bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, &col_info.buffer_len, remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error" : " GetCurrentKeyBuffer col_idx: " << col_idx << " remaining length: " << remain_len : << " col_info.buffer_len: " << col_info.buffer_len; : return Status(ss.str()); : } : *key_buf_ptr += bytes_read; : remain_len -= bytes_read; : DCHECK_GT(remain_len, 0); : : bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, &col_info.uncompressed_buffer_len, : remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error key buffer " : "col_idx: " << col_idx << " remaining length: "<< remain_len; : return Status(ss.str()); : } : *key_buf_ptr += bytes_read; : remain_len -= bytes_read; : DCHECK_GT(remain_len, 0); : : int col_key_buf_len; : bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr , &col_key_buf_len, remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error key " : "buffer col_idx: " << col_idx << " remaining length: "<< remain_len; : return Status(ss.str()); : } these error messages seem very tailored to Impala's internal bookkeeping. Should we rephrase them to be more in terms of the RC file format, so that they make sense without looking at Impala code? i.e. so they could be interpreted by whoever is supporting the tool that's producing these files? Also, should we be dedup'ing these message (i.e. assigning an error code) or do we not expect that to be worthwhile? (Also, using Substitute() may make this more readable, but you can decide which you prefer.) http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-sequence-scanner.cc@233 PS12, Line 233: if (i > record_locations_.size()) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted, record_locations_.size() " : << record_locations_.size() << " is less than curr_idx: " << i; : return Status(ss.str()); : } : int bytes_read = ReadWriteUtil::GetVLong( : next_record_in_compressed_block_, &record_locations_[i].len, : next_record_in_compressed_block_len_); : if (UNLIKELY(bytes_read == -1)) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted, Invalid record sizes in " : "compressed block "<< " rec_len: " << record_locations_[i].len : << " next_record_in_compressed_block_len_ " : << next_record_in_compressed_block_len_; : return Status(ss.str()); : } : next_record_in_compressed_block_ += bytes_read; : next_record_in_compressed_block_len_ -= bytes_read; : if (next_record_in_compressed_block_len_ <= 0) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted," : " next_record_in_compressed_block_len_ le 0" << " comp_blk_len: " : << next_record_in_compressed_block_len_; : return Status(ss
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 12: > Uploaded patch set 12. Tim please ignore these changes. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 12 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 13 Mar 2018 23:35:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#12). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 242 insertions(+), 61 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/12 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 12 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 11: I also tried running exhaustive tests on the patch. I'm seeing at least one failure: test_hdfs_rcfile_scan_node_errors -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 11 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 13 Mar 2018 20:49:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 11: (12 comments) http://gerrit.cloudera.org:8080/#/c/8936/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/11//COMMIT_MSG@13 PS11, Line 13: a) Ran the fuzz test on a private build without failures/crash. It still fails sometimes - let's be clear in the commit message that this doesn't fix all the problems. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@176 PS11, Line 176: extra indentation http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@218 PS11, Line 218: formatting is a little off (leftmost << should be aligned with each other). There are a bunch of minor formatting issues. I'm not going to go through a point them all out individually. It's probably easiest to run the patch through clang-format-diff (see https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65868536) and visually check that it didn't do anything weird. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@331 PS11, Line 331: ss << stream_->filename() << " RCFile corrupt," << "out of bounds error num_rows_: " Error message will not have a space after the comma. (here and at least one more place below). http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@345 PS11, Line 345: key_buf_ptr I don't think it's necessarily safe to print uint8_t* pointers with C++ stream formatting - I believe in some cases it's treated as a null-terminated string. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@433 PS11, Line 433: col_info.uncompressed_buffer_len: 1; Why is it safe for this to be 1 if the length of the buffer is <= 0? Won't that still overrun the buffer by 1 byte? http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@581 PS11, Line 581: row_group_sz row_group_end? _sz makes me think this is a numeric size, which it isn't. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@583 PS11, Line 583: col_sz col_end? http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/hdfs-rcfile-scanner.cc@587 PS11, Line 587: << col_sz << " greater than row_group_sz "<< row_group_sz; Printing col_sz and row_group_sz is really dangerous here, since it's going to be interpreted as a null-terminated string. Maybe instead print (col_end - row_group_buffer_) and row_group_length_. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h@198 PS11, Line 198: if (offset > size) return -1; Shouldn't this be >=? Since buf[size] is one past the end of the buffer? http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc File be/src/util/decompress.cc: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@452 PS11, Line 452: if (input_len < 0) { Shouldn't we be checking if it's <= sizeof(uint32_t), since that's the amount we read unconditionally. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@454 PS11, Line 454: ss << " Corruption snappy decomp input_len "<< input_len; Can we make this consistent with the other error handling in this function by using a TErrorCode. Maybe something like SNAPPY_DECOMPRESS_TRUNCATED? -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 11 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 13 Mar 2018 20:33:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#11). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Introduced new error message when scanning a corrupt Sequence or RCFile. Added new checks to detect buffer overrun while handling Sequence or RCFile. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc 8 files changed, 233 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/11 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 11 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/8936/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/10//COMMIT_MSG@9 PS10, Line 9: Enabled the fuzz test for Sequence and RCFiles and added new checks to return > Commit message is stale. Fixed it. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 13 Mar 2018 15:38:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 10: I need to take another pass over this just to convince myself I didn't miss any bugs first time over. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 13 Mar 2018 01:02:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/8936/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/10//COMMIT_MSG@9 PS10, Line 9: Enabled the fuzz test for Sequence and RCFiles and added new checks to return Commit message is stale. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 13 Mar 2018 01:02:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#10). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc 8 files changed, 233 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/10 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 10 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#9). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc 8 files changed, 234 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/9 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 9 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#8). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 234 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/8 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 8 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 7: > I'm still seeing crashes when I loop this locally: > F0306 16:34:04.306205 29911 mem-pool.h:245] Check failed: size >= 0 > (-9 vs. 0) > *** Check failure stack trace: *** > @ 0x3c32cbd google::LogMessage::Fail() > @ 0x3c34562 google::LogMessage::SendToLog() > @ 0x3c32697 google::LogMessage::Flush() > @ 0x3c35c5e google::LogMessageFatal::~LogMessageFatal() > @ 0x184eeb4 impala::MemPool::TryAllocate() > @ 0x1be5d08 impala::HdfsRCFileScanner::StartRowGroup() > @ 0x1be8032 impala::HdfsRCFileScanner::ProcessRange() > @ 0x296931a impala::BaseSequenceScanner::GetNextInternal() > @ 0x1bc71b2 impala::HdfsScanner::ProcessSplit() > @ 0x1b9f48c impala::HdfsScanNode::ProcessSplit() > @ 0x1b9e8b7 impala::HdfsScanNode::ScannerThread() > @ 0x1b9dd3f > _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv > @ 0x1b9fcdd > _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE > @ 0x17f4ce0 boost::function0<>::operator()() > @ 0x1af63ff impala::Thread::SuperviseThread() > @ 0x1aff3cf boost::_bi::list5<>::operator()<>() > @ 0x1aff2f3 boost::_bi::bind_t<>::operator()() > @ 0x1aff2b6 boost::detail::thread_data<>::run() > @ 0x2dbc1aa thread_proxy > @ 0x7f5af8f926ba start_thread > @ 0x7f5af8cc841d clone > Picked up JAVA_TOOL_OPTIONS: > -agentlib:jdwp=transport=dt_socket,address=3,server=y,suspend=n > Wrote minidump to > /home/tarmstrong/Impala/incubator-impala/logs/cluster/minidumps/impalad/31a8ae22-6a2a-4118-4ff4e1b6-be18ef7c.dmp > > How about we leave the test disabled and move forward with the > improvements, then we can finish up the remaining fixes needed > later? Shall I revert my changes to tests/query_test/test_scanners_fuzz.py ? -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 7 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Wed, 07 Mar 2018 18:56:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 7: I'm still seeing crashes when I loop this locally: F0306 16:34:04.306205 29911 mem-pool.h:245] Check failed: size >= 0 (-9 vs. 0) *** Check failure stack trace: *** @ 0x3c32cbd google::LogMessage::Fail() @ 0x3c34562 google::LogMessage::SendToLog() @ 0x3c32697 google::LogMessage::Flush() @ 0x3c35c5e google::LogMessageFatal::~LogMessageFatal() @ 0x184eeb4 impala::MemPool::TryAllocate() @ 0x1be5d08 impala::HdfsRCFileScanner::StartRowGroup() @ 0x1be8032 impala::HdfsRCFileScanner::ProcessRange() @ 0x296931a impala::BaseSequenceScanner::GetNextInternal() @ 0x1bc71b2 impala::HdfsScanner::ProcessSplit() @ 0x1b9f48c impala::HdfsScanNode::ProcessSplit() @ 0x1b9e8b7 impala::HdfsScanNode::ScannerThread() @ 0x1b9dd3f _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv @ 0x1b9fcdd _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE @ 0x17f4ce0 boost::function0<>::operator()() @ 0x1af63ff impala::Thread::SuperviseThread() @ 0x1aff3cf boost::_bi::list5<>::operator()<>() @ 0x1aff2f3 boost::_bi::bind_t<>::operator()() @ 0x1aff2b6 boost::detail::thread_data<>::run() @ 0x2dbc1aa thread_proxy @ 0x7f5af8f926ba start_thread @ 0x7f5af8cc841d clone Picked up JAVA_TOOL_OPTIONS: -agentlib:jdwp=transport=dt_socket,address=3,server=y,suspend=n Wrote minidump to /home/tarmstrong/Impala/incubator-impala/logs/cluster/minidumps/impalad/31a8ae22-6a2a-4118-4ff4e1b6-be18ef7c.dmp How about we leave the test disabled and move forward with the improvements, then we can finish up the remaining fixes needed later? -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 7 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Wed, 07 Mar 2018 01:24:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc@119 PS5, Line 119: ss << stream_->filename() << " Invalid RCFILE_VERSION_HEADER: '" > I meant the filename of the file being decoded, i.e. stream_->filename(). Done http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc@119 PS5, Line 119: ss << stream_->filename() << " Invalid RCFILE_VERSION_HEADER: '" > It seems like the message text should be enough to distinguish, right? Done http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/scanner-context.inline.h File be/src/exec/scanner-context.inline.h: http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/scanner-context.inline.h@96 PS5, Line 96: } > Is this possible? The intent of the above code is that it shouldn't go nega Done http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/scanner-context.inline.h@97 PS5, Line 97: > This will crash since the template for the error message in common/thrift/g This is not needed as you've suggested above I have introduced the DCHECK in the beginning to check that length is never negative -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 6 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Sat, 10 Feb 2018 01:01:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#6). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 242 insertions(+), 61 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/6 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 6 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 5: (1 comment) I'm still seeing some crashes if I run this in a loop locally: I=0; while impala-py.test tests/query_test/test_scanners_fuzz.py -k "rc or seq" -n 4 --verbose; do echo ITER $I; I=$((I + 1)); done It seem to be causing a crash about every 10 iterations. http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc@119 PS5, Line 119: ss << __FILE__ << ":" << __LINE__<< " Invalid RCFILE_VERSION_HEADER: '" > I was wondering how do we differentiate two similar messages coming from di It seems like the message text should be enough to distinguish, right? -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 5 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Thu, 01 Feb 2018 00:12:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc@119 PS5, Line 119: ss << __FILE__ << ":" << __LINE__<< " Invalid RCFILE_VERSION_HEADER: '" > I meant the filename of the file being decoded, i.e. stream_->filename(). I was wondering how do we differentiate two similar messages coming from different places say in this case Line#119 and Line#133 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 5 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Wed, 31 Jan 2018 05:36:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 5: (4 comments) http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/hdfs-rcfile-scanner.cc@119 PS5, Line 119: ss << __FILE__ << ":" << __LINE__<< " Invalid RCFILE_VERSION_HEADER: '" I meant the filename of the file being decoded, i.e. stream_->filename(). http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/scanner-context.inline.h File be/src/exec/scanner-context.inline.h: http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/scanner-context.inline.h@96 PS5, Line 96: if (bytes_left < 0) { Is this possible? The intent of the above code is that it shouldn't go negative if 'length' is non-negative. Callers shouldn't be passing in a negative length, if it's from the caller passing in a negative length, we should add a DCHECK_GE(length, 0) at the start of the function and fix the caller. http://gerrit.cloudera.org:8080/#/c/8936/5/be/src/exec/scanner-context.inline.h@97 PS5, Line 97: SCANNER_INCOMPLETE_READ This will crash since the template for the error message in common/thrift/generate_error_codes.py requires a couple of parameters. It's unfortunate we don't have compile-time checks for this. http://gerrit.cloudera.org:8080/#/c/8936/4/be/src/exec/scanner-context.inline.h File be/src/exec/scanner-context.inline.h: http://gerrit.cloudera.org:8080/#/c/8936/4/be/src/exec/scanner-context.inline.h@97 PS4, Line 97: *status = Status(TErrorCode::SCANNER_INCOMPLETE_READ); The SCANNER_INCOMPLETE_READ template in common/thrift/generate_error_codes.py requires two arguments - it will crash if it doesn't get the expected number. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 5 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Wed, 31 Jan 2018 01:25:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#5). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: a) Ran the fuzz test on a private build without failures/crash. b) Ran end to end test, back end tests with new changes. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 221 insertions(+), 61 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/5 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 5 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 4: (11 comments) http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@329 PS3, Line 329: int bytes_read = ReadWriteUtil::GetVInt(key_buf_ptr, &num_rows_, key_length_); > Maybe also include the file name in the messages? So that if this happens i Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@334 PS3, Line 334: << " key_length_: " << key_length_; > It doesn't look like moving the col_idx declaration out of the loop is need Yes it was left over from a previous change http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@337 PS3, Line 337: > Passing key_length_ seems wrong, since it measures the number of bytes in k Corrected it to have remaining length which gets adjusted as the bytes are read in to the buffer. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@464 PS3, Line 464: } > Can you add a brief comment explaining what this is checking for? E.g. "Ens Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h File be/src/exec/hdfs-sequence-scanner.h: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h@257 PS3, Line 257: next_record_in_compressed_block_len > Needs an underscore in the end of the variable name. Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc@235 PS3, Line 235: ss << __FILE__ << ":" << __LINE__ << "SeqFile corrupted, record_locations_.size() " > Would be good to include file names so it's easier to root cause in the fie Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc File be/src/exec/read-write-util-test.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc@115 PS3, Line 115: TestPutGetZeroCompressedLong(val); > How about we add a few unit tests to make sure that passing in a too-short Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@58 PS3, Line 58: /// Returns the length of the long/int > Can you document the size argument? Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@72 PS3, Line 72: /// byte offset and the buffer passed is of length size, accessing beyond the > Can you document the size argument? Done http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@198 PS3, Line 198: > I think we can check len vs size here and save doing it in the loop below. Done http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py@203 PS3, Line 203: # E.g. corrupt Parquet footer (IMPALA-3773) or a corrupt LZO index file > The previous two aren't needed, right, since 'corrupt' is a substring of th Agreed removed the above two conditional statements -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 4 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Sat, 20 Jan 2018 06:26:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#4). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: Ran the fuzz test on a private build without failures/crash. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/exec/scanner-context.inline.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 9 files changed, 218 insertions(+), 61 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/4 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 4 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 3: I hit a DCHECK after running this in a loop for a couple minutes. F0118 11:53:48.126408 11019 scanner-context.inline.h:97] Check failed: bytes_left > 0 (-197 vs. 0) *** Check failure stack trace: *** @ 0x3c0541d google::LogMessage::Fail() @ 0x3c06cc2 google::LogMessage::SendToLog() @ 0x3c04df7 google::LogMessage::Flush() @ 0x3c083be google::LogMessageFatal::~LogMessageFatal() @ 0x1bc5c8b impala::ScannerContext::Stream::SkipBytes() @ 0x1bc320a impala::HdfsRCFileScanner::ReadColumnBuffers() @ 0x1bc1a71 impala::HdfsRCFileScanner::StartRowGroup() @ 0x1bc38ba impala::HdfsRCFileScanner::ProcessRange() @ 0x293ecaa impala::BaseSequenceScanner::GetNextInternal() @ 0x1ba2ede impala::HdfsScanner::ProcessSplit() @ 0x1b7b1b4 impala::HdfsScanNode::ProcessSplit() @ 0x1b7a5df impala::HdfsScanNode::ScannerThread() @ 0x1b79a62 _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv @ 0x1b7ba05 _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE @ 0x17d6bda boost::function0<>::operator()() @ 0x1ad0bf7 impala::Thread::SuperviseThread() @ 0x1ad990c boost::_bi::list4<>::operator()<>() @ 0x1ad984f boost::_bi::bind_t<>::operator()() @ 0x1ad9812 boost::detail::thread_data<>::run() @ 0x2d8edba thread_proxy @ 0x7f4c344426ba start_thread @ 0x7f4c341783dd clone Picked up JAVA_TOOL_OPTIONS: -agentlib:jdwp=transport=dt_socket,address=3,server=y,suspend=n Wrote minidump to /home/tarmstrong/Impala/incubator-impala/logs/cluster/minidumps/impalad/aadeb6a5-be5b-4a86-c5991d87-fa93553b.dmp -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Thu, 18 Jan 2018 19:55:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 3: (11 comments) http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@329 PS3, Line 329: << " key_length_: " << key_length_; Maybe also include the file name in the messages? So that if this happens in the wild we can figure out which file is corrupt. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@334 PS3, Line 334: int col_idx = 0; It doesn't look like moving the col_idx declaration out of the loop is needed, right? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@337 PS3, Line 337: key_length_ Passing key_length_ seems wrong, since it measures the number of bytes in key_buffer, not the bytes remaining after key_buf_ptr. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@464 PS3, Line 464: if (column.uncompressed_buffer_len + column.start_offset > row_group_length_) { Can you add a brief comment explaining what this is checking for? E.g. "Ensure there is enough space in 'row_group_buffer_' for the uncompressed data". http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h File be/src/exec/hdfs-sequence-scanner.h: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h@257 PS3, Line 257: next_record_in_compressed_block_len Needs an underscore in the end of the variable name. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc@235 PS3, Line 235: ss << "SeqFile corrupted, record_locations_.size() " << record_locations_.size() Would be good to include file names so it's easier to root cause in the field. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc File be/src/exec/read-write-util-test.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc@115 PS3, Line 115: } How about we add a few unit tests to make sure that passing in a too-short buffer results in an error? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@58 PS3, Line 58: /// If the size byte is corrupted then return -1; Can you document the size argument? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@72 PS3, Line 72: static int GetVLong(uint8_t* buf, int64_t offset, int64_t* vlong, int32_t size); Can you document the size argument? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@198 PS3, Line 198: if (len > MAX_VINT_LEN) return -1; I think we can check len vs size here and save doing it in the loop below. Would be clearer to have the two length checks in the same place. http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py@203 PS3, Line 203: elif 'corrupt' in str(e).lower(): The previous two aren't needed, right, since 'corrupt' is a substring of those? -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Thu, 18 Jan 2018 19:52:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Hello anujphadke, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8936 to look at the new patch set (#3). Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: Ran the fuzz test on a private build without failures/crash. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-rcfile-scanner.h M be/src/exec/hdfs-sequence-scanner.cc M be/src/exec/hdfs-sequence-scanner.h M be/src/exec/read-write-util-test.cc M be/src/exec/read-write-util.h M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 8 files changed, 160 insertions(+), 35 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/3 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@177 PS2, Line 177: ss << "Codec bad, corrupted "; > Can you include a bit more detail, i.e. mention that it's RCFile and includ Done http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@337 PS2, Line 337: ss << "Invalid bytes read col_idx: " << col_idx; > This could also do with a bit more detail. Added more detail to describe the error http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@344 PS2, Line 344: void HdfsRCFileScanner::GetCurrentKeyBuffer(int col_idx, bool skip_col_data, > How does this avoid buffer overflows if we don't pass in the length of the Added length of the buffer and DCHECK to prevent reading beyond the buffer size. http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@348 PS2, Line 348: GetVInt > These GetVInt() and GetVLong() interfaces seems fundamentally unsafe - they Changed all callers of GetVInt/GetVLong to pass length of the buffer which can be used to check out of bound access. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 2 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 16 Jan 2018 21:10:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@177 PS2, Line 177: ss << "Codec bad, corrupted "; Can you include a bit more detail, i.e. mention that it's RCFile and include the name of the codec? http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@337 PS2, Line 337: ss << "Invalid bytes read col_idx: " << col_idx; This could also do with a bit more detail. http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@344 PS2, Line 344: void HdfsRCFileScanner::GetCurrentKeyBuffer(int col_idx, bool skip_col_data, How does this avoid buffer overflows if we don't pass in the length of the 'key_buf_ptr' buffer? I think this should be defensively checking if it reads past the end of the buffer. http://gerrit.cloudera.org:8080/#/c/8936/2/be/src/exec/hdfs-rcfile-scanner.cc@348 PS2, Line 348: GetVInt These GetVInt() and GetVLong() interfaces seems fundamentally unsafe - they don't take the length of the buffer as input! I think we might need to change them so that the length of the buffer is passed in and they check the bounds. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 2 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Tue, 09 Jan 2018 01:47:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 2: It's ok if we don't have a reproducible test case for bugs caught by the fuzzer, unless the fuzzing revealed a really big hole in test coverage. It's good if we have a test case but for cases where we weren't handling corrupt data correctly there are so many ways files could be corrupted it wasn't feasible to have a standalone test for each case. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 2 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Mon, 08 Jan 2018 17:41:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 2: > Can we test all the cases fixed this with 100% reproducible cases? > For Ex: > I commented out the change in decompress.cc and ran the fuzz test > and it passed. > Can we test all the cases fixed this with 100% reproducible cases? > For Ex: > I commented out the change in decompress.cc and ran the fuzz test > and it passed. If you run a couple of times without the fix the assertion will be hit. I'll see if I can change the test so that we hit the assert every time -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 2 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Mon, 08 Jan 2018 16:59:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
anujphadke has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 ) Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. Patch Set 2: Can we test all the cases fixed this with 100% reproducible cases? For Ex: I commented out the change in decompress.cc and ran the fuzz test and it passed. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 2 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: anujphadke Gerrit-Comment-Date: Fri, 05 Jan 2018 20:45:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners
Pranay Singh has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8936 Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners .. IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners Enabled the fuzz test for Sequence and RCFiles and added new checks to return when failure is encountered while reading an RC file or sequence file. Testing: Ran the fuzz test on a private build without failures/crash. Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a --- M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-sequence-scanner.cc M be/src/util/decompress.cc M tests/query_test/test_scanners_fuzz.py 4 files changed, 25 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/8936/1 -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 1 Gerrit-Owner: Pranay Singh