[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8030 ) Change subject: IMPALA-5250: Unify decompressor output_length semantics .. IMPALA-5250: Unify decompressor output_length semantics This patch makes the semantics of the output_length parameter in Codec::ProcessBlock to be the same across all codecs. In existing code different decompressor treats output_length differently: 1. SnappyDecompressor needs output_length to be greater than or equal to the actual decompressed length, but it does not set it to the actual decompressed length after decompression. 2. SnappyBlockDecompressor and Lz4Decompressor require output_length to be exactly the same as the actual decompressed length, otherwise decompression fails. 3. Other decompressors need output_length to be greater than or equal to the actual decompressed length and will set it to actual decompressed length if oversized. This inconsistency leads to a bug where the error message is undeterministic when the compressed block is corrupted. This patch makes all decompressor behave like a modified version of 3: Output_length should be greater than or equal to the actual decompressed length and it will be set to actual decompressed length if oversized. A decompression failure sets it to 0. Lz4Decompressor will use the "safe" instead of the "fast" decompression function, for the latter is insecure with corrupted data and requires the decompressed length to be known. Testing: A testcase is added checking that the decompressors can handle an oversized output buffer correctly. A regression test for the exact case described in IMPALA-5250 is also added. A benchmark is run on a 16-node cluster testing the performance impact of the LZ4Decompressor change and no performance regression is found. Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Reviewed-on: http://gerrit.cloudera.org:8080/8030 Reviewed-by: Alex BehmTested-by: Impala Public Jenkins --- M be/src/util/codec.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 74 insertions(+), 45 deletions(-) Approvals: Alex Behm: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-Change-Number: 8030 Gerrit-PatchSet: 6 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 ) Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 5: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1255/ -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-Change-Number: 8030 Gerrit-PatchSet: 5 Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Sep 2017 18:19:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 ) Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-Change-Number: 8030 Gerrit-PatchSet: 5 Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Sep 2017 18:18:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 ) Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-Change-Number: 8030 Gerrit-PatchSet: 4 Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Sep 2017 17:43:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tianyi Wang has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/8030/3//COMMIT_MSG Commit Message: Line 33: > Add a separate section for testing. Mention the tests you added and the per Done http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 196: int64_t* output_len, uint8_t** output, bool output_preallocated) { > Your interpretation is correct. A test cases is added. http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 426: > Agree this in/out pattern is not great and we should consider your proposal All output_length will be set to 0 if the decompression fails now. -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8030 to look at the new patch set (#4). Change subject: IMPALA-5250: Unify decompressor output_length semantics .. IMPALA-5250: Unify decompressor output_length semantics This patch makes the semantics of the output_length parameter in Codec::ProcessBlock to be the same across all codecs. In existing code different decompressor treats output_length differently: 1. SnappyDecompressor needs output_length to be greater than or equal to the actual decompressed length, but it does not set it to the actual decompressed length after decompression. 2. SnappyBlockDecompressor and Lz4Decompressor require output_length to be exactly the same as the actual decompressed length, otherwise decompression fails. 3. Other decompressors need output_length to be greater than or equal to the actual decompressed length and will set it to actual decompressed length if oversized. This inconsistency leads to a bug where the error message is undeterministic when the compressed block is corrupted. This patch makes all decompressor behave like a modified version of 3: Output_length should be greater than or equal to the actual decompressed length and it will be set to actual decompressed length if oversized. A decompression failure sets it to 0. Lz4Decompressor will use the "safe" instead of the "fast" decompression function, for the latter is insecure with corrupted data and requires the decompressed length to be known. Testing: A testcase is added checking that the decompressors can handle an oversized output buffer correctly. A regression test for the exact case described in IMPALA-5250 is also added. A benchmark is run on a 16-node cluster testing the performance impact of the LZ4Decompressor change and no performance regression is found. Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 --- M be/src/util/codec.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 74 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/4 -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Alex Behm has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 3: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/8030/3//COMMIT_MSG Commit Message: Line 33: Add a separate section for testing. Mention the tests you added and the perf validation you did. -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tianyi Wang has uploaded a new patch set (#3). Change subject: IMPALA-5250: Unify decompressor output_length semantics .. IMPALA-5250: Unify decompressor output_length semantics This patch makes the semantics of the output_length parameter in Codec::ProcessBlock to be the same across all codecs. In existing code different decompressor treats output_length differently: 1. SnappyDecompressor needs output_length to be greater than or equal to the actual decompressed length, but it does not set it to the actual decompressed length after decompression. 2. SnappyBlockDecompressor and Lz4Decompressor require output_length to be exactly the same as the actual decompressed length, otherwise decompression fails. 3. Other decompressors need output_length to be greater than or equal to the actual decompressed length and will set it to actual decompressed length if oversized. This inconsistency leads to a bug where the error message is undeterministic when the compressed block is corrupted. This patch makes all decompressor behave like a modified version of 3: Output_length should be greater than or equal to the actual decompressed length and it will be set to actual decompressed length if oversized. A decompression failure sets it to 0. A testcase is added checking that decompressors can handle an oversized output buffer correctly. Lz4Decompressor will use the "safe" instead of the "fast" decompression function, for the latter is insecure with corrupted data and requires the decompressed length to be known. A benchmark is run on a 16-node cluster and no performance impact is found. Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 --- M be/src/util/codec.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 74 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/3 -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tianyi Wang has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG Commit Message: Line 26: instead of the "fast" version, for the latter is insecure with corrupted > Not sure how useful compression-test.cc is. I was thinking more along the l Running single_node_perf_run.py locally shows that overall performance is 1.78% slower after this change. Given LZ4 is not used in parquet, we may leave LZ4 inconsistent with others for now. -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Alex Behm has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG Commit Message: Line 26: instead of the "fast" version, for the latter is insecure with corrupted > Is there an individual benchmark? Not sure how useful compression-test.cc is. I was thinking more along the lines of end-to-end tests that have heavy data exchanges. http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 196: } > I'm not sure if my interpretation of JIRA is correct. In my understanding: Your interpretation is correct. I'm just referring to a specific broken file I've seen where I believe the output_len should be 0. Can you please make sure that your change fixes the bug for that specific broken file and check whether these tests cover that scenario? http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is > Done. It is set to 0 in this case, while other decompressor may leave it un Agree this in/out pattern is not great and we should consider your proposal (but not in this patch). I'm ok with setting to 0 everywhere, but it might be easier to just leave output_len untouched. I saw in the code that in some places it is set to 0, but not in others. My main ask is for consistency and documenting the behavior in a comment. -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tianyi Wang Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tianyi Wang has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 1: (18 comments) http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG Commit Message: Line 9: This patch makes the semantics output_length parameter in > the semantics of the output_length Done Line 10: Codec::ProcessBlock to be the same. In existing code different > Codec::ProcessBlock() to be the same across all codecs. In the existing cod Done Line 13:to actual decompressed length, but it does not set it to actual > to the actual decompressed length Done Line 16:be exactly the same as actual decompressed length, otherwise > as the actual decompressed length Done Line 19:actual decompressed length and will set it to actual decompressed > the actual decompressed length Done Line 21: This inconsistency leads to a bug where the error message is > This inconsistency lead to a bug where Do you mean changing it to plural? Line 24: decompressors can handle oversized output buffer correctly. > can handle an oversized output buffer correctly. Done Line 25: Lz4Decompressor will use the "safe" version of decompression function > will use the "safe" instead of the "fast" decompression function Done Line 26: instead of the "fast" version, for the latter is insecure with corrupted > We use LZ4 for compressing exchange data, so we check the perf impact of th Is there an individual benchmark? There is one for compression in be/src/experiments/compression-test.cc. Line 27: data and requires decompressed length to be known. > and requires the decompressed length to be known Done http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 153: // Test the behavior when the decompressor is given too little / too much space > "." at end of sentence Done Line 155: // correct output size when the space is enough, and does not write beyond the output > the correct output size Done Line 157: void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec *decompressor, > Out style is "Codec* compressor" Done Line 196: } > I think we should test more directly the scenario mentioned in the JIRA whe I'm not sure if my interpretation of JIRA is correct. In my understanding: 1. The parquet file is corrupted so either current_page_header_.uncompressed_page_size is greater or current_page_header_.compressed_page_size is less than what it should be. 2. Either way we allocated a larger buffer and the decompressor decompressed the (probably incomplete) data successfully. 3. The part of the buffer not written by the decompressor is uninitilized. Because the decompressor returned OK, somehow we read this uninitilized part later and trigger undeterministic errors. The problem here is in 2: The decompressor should report how much data it decompressed even if the decompression is successful. Then ReadDataPage() can check whether it is the same as current_page_header_.uncompressed_page_size. So I think we should test whether output_len is set properly with an successful decompression. I don't understand "set to 0" part. When should it be set to 0? http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is > Mention what output_len is set to if a non-OK status is returned. Done. It is set to 0 in this case, while other decompressor may leave it unmodified or set it to an arbitrary intermediate value. I think the complication here is caused by the design that 'output_len' and 'output' are output parameters. When a parameter is both input and output, and could be passed as reference further into callees of callee, there could be all kinds of bugs forgetting to set the correct value here and there. I think if output and output_len is in the return value, creating such bugs is much more difficult, and the code will be much easier to follow since the data flow is clear and explicit. I think we should have a Status type for return values. It represents either a value or an error status, and then we can avoid output parameters in new codes. Line 427: // also updated with actual output size. > the actual output size Done Line 575: // LZ4_decompress_fast will cause a segmentation fault if passed a nullptr output. > Comment still relevant? Removed. Line 579: if (ret < 0) { > single line Done -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tianyi Wang has uploaded a new patch set (#2). Change subject: IMPALA-5250: Unify decompressor output_length semantics .. IMPALA-5250: Unify decompressor output_length semantics This patch makes the semantics of the output_length parameter in Codec::ProcessBlock to be the same across all codecs. In existing code different decompressor treats output_length differently: 1. SnappyDecompressor needs output_length to be greater than or equal to the actual decompressed length, but it does not set it to the actual decompressed length after decompression. 2. SnappyBlockDecompressor and Lz4Decompressor require output_length to be exactly the same as the actual decompressed length, otherwise decompression fails. 3. Other decompressors need output_length to be greater than or equal to the actual decompressed length and will set it to actual decompressed length if oversized. This inconsistency leads to a bug where the error message is undeterministic when the compressed block is corrupted. This patch makes all decompressor behave as 3 and adds a testcase checking that decompressors can handle an oversized output buffer correctly. Lz4Decompressor will use the "safe" instead of the "fast" decompression function, for the latter is insecure with corrupted data and requires the decompressed length to be known. Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 --- M be/src/util/decompress-test.cc M be/src/util/decompress.cc 2 files changed, 31 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/2 -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Alex Behm has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics .. Patch Set 1: (18 comments) http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG Commit Message: Line 9: This patch makes the semantics output_length parameter in the semantics of the output_length Line 10: Codec::ProcessBlock to be the same. In existing code different Codec::ProcessBlock() to be the same across all codecs. In the existing code different decompressors treat output_length differently. Line 13:to actual decompressed length, but it does not set it to actual to the actual decompressed length Line 16:be exactly the same as actual decompressed length, otherwise as the actual decompressed length Line 19:actual decompressed length and will set it to actual decompressed the actual decompressed length Line 21: This inconsistency leads to a bug where the error message is This inconsistency lead to a bug where Line 24: decompressors can handle oversized output buffer correctly. can handle an oversized output buffer correctly. Line 25: Lz4Decompressor will use the "safe" version of decompression function will use the "safe" instead of the "fast" decompression function Line 26: instead of the "fast" version, for the latter is insecure with corrupted We use LZ4 for compressing exchange data, so we check the perf impact of these changes. I can help you with the details. Line 27: data and requires decompressed length to be known. and requires the decompressed length to be known http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 153: // Test the behavior when the decompressor is given too little / too much space "." at end of sentence Line 155: // correct output size when the space is enough, and does not write beyond the output the correct output size Line 157: void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec *decompressor, Out style is "Codec* compressor" Line 196: } I think we should test more directly the scenario mentioned in the JIRA where a corrupt file can lead to the output_len not being set to 0 properly. http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is Mention what output_len is set to if a non-OK status is returned. Line 427: // also updated with actual output size. the actual output size Line 575: // LZ4_decompress_fast will cause a segmentation fault if passed a nullptr output. Comment still relevant? Line 579: if (ret < 0) { single line -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi WangGerrit-Reviewer: Alex Behm Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Tianyi Wang has uploaded a new change for review. http://gerrit.cloudera.org:8080/8030 Change subject: IMPALA-5250: Unify decompressor output_length semantics .. IMPALA-5250: Unify decompressor output_length semantics This patch makes the semantics output_length parameter in Codec::ProcessBlock to be the same. In existing code different decompressor treats output_length differently: 1. SnappyDecompressor needs output_length to be greater than or equal to actual decompressed length, but it does not set it to actual decompressed length after decompression. 2. SnappyBlockDecompressor and Lz4Decompressor require output_length to be exactly the same as actual decompressed length, otherwise decompression fails. 3. Other decompressors need output_length to be greater than or equal to actual decompressed length and will set it to actual decompressed length if oversized. This inconsistency leads to a bug where the error message is undeterministic when the compressed block is corrupted. This patch makes all decompressor behave as 3 and adds a testcase checking that decompressors can handle oversized output buffer correctly. Lz4Decompressor will use the "safe" version of decompression function instead of the "fast" version, for the latter is insecure with corrupted data and requires decompressed length to be known. Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 --- M be/src/util/decompress-test.cc M be/src/util/decompress.cc 2 files changed, 31 insertions(+), 24 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/1 -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi Wang