[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 13: I've gotten into a similar situation before and git commit --amend --reset-author has gotten me out of it. Anyway, not a big deal, I just don't want to get credit (or blame!) for your work :) -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 23 Mar 2018 15:10:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 13: I think that it is related to an incident during my commit: - by mistake, I committed first with "commit --amend", which modified the last commit in my repo (which was yours) - I searched for a solution to redo this change without loosing my modification, and did the one I found (I don't remember the exact commands, I can look it up if necessary) - everything looked ok to me, my changes were in separate commit, so I pushed the change to gerrit My guess is that gerrit was tricked somehow by this maneuver, and the Author field was inherited from the last commit. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 23 Mar 2018 12:17:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 13: I just noticed that I'm listed as the author on this patch - weird - wonder what happened... -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 23 Mar 2018 00:09:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 12: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 22 Mar 2018 02:47:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop". Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Reviewed-on: http://gerrit.cloudera.org:8080/9403 Reviewed-by: Tim ArmstrongTested-by: Impala Public Jenkins --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 331 insertions(+), 53 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2150/ -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 21 Mar 2018 23:01:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 12: Code-Review+2 There was a minor clang-tidy issue. Just fixed it myself. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 21 Mar 2018 23:01:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has uploaded a new patch set (#11) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop". Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 331 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/11 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 10: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2143/ -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 21 Mar 2018 20:37:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 10: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 21 Mar 2018 16:31:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 10: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2143/ -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 21 Mar 2018 16:31:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 9: Patch set 9 is a rebase + conflict resolution. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 21 Mar 2018 12:17:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Lars Volker, Gabor Kaszab, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#9). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop". Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 331 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/9 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 8: I can merge once the merge conflict is fixed. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 20 Mar 2018 22:30:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 20 Mar 2018 22:00:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190 PS7, Line 190: DCHECK_EQ(encoding_, parquet::Encoding::RLE); > I think it's better to omit the "not for 2.x" so it doesn't confuse people Thanks for the answer! As the commit message does not contain "not for 2.x", no modification is needed related to this. http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649 PS2, Line 649: "testdata/data/", file_name) > For me a refactor you mentioned makes sense. Do you know the proper way of Sorry for the late answer. I would prefer to do that refactor in a separate patch, so I have created Jira for it: https://issues.apache.org/jira/browse/IMPALA-6709 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 20 Mar 2018 14:50:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190 PS7, Line 190: DCHECK_EQ(encoding_, parquet::Encoding::RLE); > Sure, I can do the conflict resolution. Does this mean that I should add "C I think it's better to omit the "not for 2.x" so it doesn't confuse people looking at the commit message in the future. Once the change is merged you can do the backport and if the cherry-pick job breaks in the meantime it'll continue once the backport gets merged. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 19 Mar 2018 17:08:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190 PS7, Line 190: DCHECK_EQ(encoding_, parquet::Encoding::RLE); > We'll need to resolve a conflict here on the 2.x branch since that also sup Sure, I can do the conflict resolution. Does this mean that I should add "Cherry-picks: not for 2.x." to the commit message, and do the cherry picking locally? http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h@687 PS7, Line 687: DCHECK(false); > Can't this DCHECK be triggered by invalid data? GetLiteralValues() could fa I have removed the DCHECK, and created a Jira (IMPALA-6700) to improve error handling, as currently it is a bit hard to understand it in my opinion. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 19 Mar 2018 16:54:43 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Lars Volker, Gabor Kaszab, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#8). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop". Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 331 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/8 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 7: (2 comments) I think this is good to go once we remove that DCHECK. http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190 PS7, Line 190: DCHECK_EQ(encoding_, parquet::Encoding::RLE); We'll need to resolve a conflict here on the 2.x branch since that also supports BIT_PACKED levels. Once we've merged this to master, can you cherry-pick it to the 2.x branch and resolve the conflict? http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h@687 PS7, Line 687: DCHECK(false); Can't this DCHECK be triggered by invalid data? GetLiteralValues() could fail if the input was truncated. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 14 Mar 2018 21:08:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 6: "I'm curious what was wrong with the decoding - I didn't see an obvious bugfix between PS3 and this patchset." It was fixed by skipping the first 4 bytes, which contained the length of the data, and caused some extra values to be inserted. see https://gerrit.cloudera.org/#/c/9403/3..4/be/src/exec/parquet-column-readers.cc As the inserted values were 0s, and the values shifted out at the end of were also 0s, the count of 0s and 1s remained correct, so the original tests didn't catch this issue. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 14 Mar 2018 15:18:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG@17 PS6, Line 17: filled. As the cache size is 128, I considered this to be outside the > I agree that this overhead seems minimal - I'm not concerned about a predic Done http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc File be/src/benchmarks/rle-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39 PS4, Line 39: // for loop / run length: 100.4 0.4 0.4 0.487X 0.487X 0.486X : // memset / run length: 10 0.396 0.4 0.4 0.482X 0.487X 0.486X > I agree that it would make sense to pick the denser encoding in cases where I have created a Jira for the size issue: https://issues.apache.org/jira/browse/IMPALA-6658 http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@60 PS4, Line 60: std:: > nit: we generally prefer "using" declarations in .cc files instead of using Done http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h@611 PS6, Line 611: literal_count_ = (indicator_value >> 1) * 8; > I remember you pointed out on another patchset that, according to the gramm I am unsure whether it is valid or not. There are tests related to 0 counters, but I think that these test only the case when the data starts with 0, which cause the RleBatchDecoder (or its original copy) to return early and consume 0 elements, which is treated as error in some places. Meanwhile if the 0 comes later, Impala may ignore it silently. the tests were added here: https://github.com/apache/impala/commit/025fd3bd7f3b7f4bc3dcb6b21e3598124b19b577 I would prefer if RleBatchDecoder would be stricter, and treat 0 counters always as error, but this change could lead to errors in tables that can be read by Impala at the moment. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 14 Mar 2018 15:10:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Lars Volker, Gabor Kaszab, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#7). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop". Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 332 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/7 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 6: (4 comments) Only very minor comments. I'm curious what was wrong with the decoding - I didn't see an obvious bugfix between PS3 and this patchset. http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG@17 PS6, Line 17: filled. As the cache size is 128, I considered this to be outside the I agree that this overhead seems minimal - I'm not concerned about a predictable branch per 128 values. http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc File be/src/benchmarks/rle-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39 PS4, Line 39: // for loop / run length: 100.4 0.4 0.4 0.487X 0.487X 0.486X : // memset / run length: 10 0.396 0.4 0.4 0.482X 0.487X 0.486X > Some thoughts about this performance degradation compared to the run_length I agree that it would make sense to pick the denser encoding in cases where the literal run is actually smaller. Please feel free to file a JIRA and fix it. I'm not sure that the performance difference will hold up permanently, since we can optimise handling of repeated runs better in the caller and avoid unpacking the N values. http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@60 PS4, Line 60: std:: nit: we generally prefer "using" declarations in .cc files instead of using the namespace everywhere. http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h@611 PS6, Line 611: literal_count_ = (indicator_value >> 1) * 8; I remember you pointed out on another patchset that, according to the grammar, this could technically be 0 and still be a valid literal run. Should we leave a comment explaining that while we're here? https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 13 Mar 2018 18:55:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc File be/src/benchmarks/rle-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39 PS4, Line 39: // for loop / run length: 100.4 0.4 0.4 0.487X 0.487X 0.486X : // memset / run length: 10 0.396 0.4 0.4 0.482X 0.487X 0.486X Some thoughts about this performance degradation compared to the run_length=1 case: If bit_width=1, then 8 values encoded as a repeated run can use more space than if they were encoded as a literal run. RleEncoder currently always uses repeated runs if it finds 8 repeated values - it may be good to change this to a higher number (16?) if bit_width=1. The performance seems to be similar if bit_width>1, so the space inefficiency above is probably not the real cause. I suspect that the problem with short repeated runs is that BatchedBitReader is optimized for 32*N batches, and shorter literal runs are buffered by RleBatchDecoder. It would be possible to avoid buffering in case of 8*N batches too, which could improve the performance of shorter runs. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 4 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 13 Mar 2018 15:14:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@673 PS3, Line 673: switch (page_encoding_) { > nit: space before ( Done http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@1304 PS3, Line 1304: } > Convert to using Substitute() while you're here - we generally prefer using Done http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@138 PS3, Line 138: /// Returns the number of consumed values or 0 if an error occurred. > Can you add some tests to be/src/util/rle-test.cc for the new interface? Th Done http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674 PS3, Line 674:for (int i = 0; i < num_repeats_to_set; ++i) { > Yes, please do add the benchmark. The old benchmark was comparing two old i Done http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README@160 PS3, Line 160: Parquet v1 file with RLE encoded boolean column "b" and int column "i". > How was it created? Parquet-mr? Yes, it was created with parquet-mr (parquet-cli's convert-csv command). I had to modify parquet-mr to force RLE encoding for booleans, because by default it will use bit packed for Parquet v1. Parquet v2 uses RLE by default, but Impala does not read v2 data pages at the moment. http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test File testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test: http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test@4 PS3, Line 4: select count(*) from rle_encoded_bool where b; > Thanks for the tip, I have created such a table, and something is actually The new test will catch if rle encoded booleans are returned in the wrong order. http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py@656 PS3, Line 656: self.client.execute(("""CREATE TABLE {0}.rle_encoded_bool (b boolean, i int) > Nit: trailing """ can go on same line. Done -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 13 Mar 2018 13:59:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Lars Volker, Gabor Kaszab, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#6). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop", but some more performance measurement may be needed to validate this. Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 328 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/6 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Lars Volker, Gabor Kaszab, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#4). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop", but some performance measurement may be needed to validate this. Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/rle-benchmark.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M be/src/util/rle-test.cc M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 10 files changed, 306 insertions(+), 66 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/4 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 4 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674 PS3, Line 674:memset(values + num_consumed, repeated_value, num_repeats_to_set); > I have created a benchmark and it turned out that memset and the for loop Yes, please do add the benchmark. The old benchmark was comparing two old implementations that were no longer relevant. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 09 Mar 2018 17:34:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 3: (2 comments) There is something wrong with the decoding, so I have no patch ready for review, but I have a question related to this commit in the comment for rle-encoding.h. http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674 PS3, Line 674:memset(values + num_consumed, repeated_value, num_repeats_to_set); > We could also be lazy and not implement it for sizeof(T) > 1 until we actua I have created a benchmark and it turned out that memset and the for loop are actually different, memset seems to be a few percent faster for short runs (<=32), and the loop is a few percent faster for long runs (>=128). These results are surprising to me, I have expected the opposite. Should I add the benchmark (rle-benchmark.cc) to this commit? I saw that you have removed a file with similar name in https://gerrit.cloudera.org/#/c/8267/ http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test File testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test: http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test@4 PS3, Line 4: select count(*) from rle_encoded_bool where b; > This wouldn't detect if the boolean values were returned in the wrong order Thanks for the tip, I have created such a table, and something is actually wrong with the decoded values. I am still investigating the issue. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 09 Mar 2018 17:25:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674 PS3, Line 674:memset(values + num_consumed, repeated_value, num_repeats_to_set); > It's not valid to use memset() here in general since sizeof(T) may be > 1. We could also be lazy and not implement it for sizeof(T) > 1 until we actually need it, but I think we should just fix it properly while we're here. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 06 Mar 2018 23:21:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 3: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649 PS2, Line 649: "testdata/data/", file_name) > I have copy-pasted the template of this functions from test_def_levels, whi For me a refactor you mentioned makes sense. Do you know the proper way of doing this? Open a new Jira, or proactively include this to a random patch that touches this area (like this patch)? -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 06 Mar 2018 16:15:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Lars Volker, Gabor Kaszab, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#3). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop", but some performance measurement may be needed to validate this. Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 7 files changed, 102 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/3 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 ) Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. Patch Set 2: (10 comments) http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG@24 PS2, Line 24: query into it nit: query from it? http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@185 PS2, Line 185: return num_cached_levels > 0; instead? return num_cached_levels != NULL num_cached_levels > 0 doesn't indicate for me that num_cached_levels is a pointer. http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@665 PS2, Line 665: switch(page_encoding_) { Would it have any impact if you did a reset on both bool_values_ and rle_decoder_ here unconditionally? If this has to stay, could you enhance the function's comment to reflect? As I see you branch on page_encoding_ anyway where you use them. http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@714 PS2, Line 714: num_unpacked_values_ = page_encoding_ == parquet::Encoding::PLAIN nit: I find a simple if more readable than the ternary operator. Might be just my preference, though. http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@139 PS2, Line 139: int32_t GetValues(int32_t num_values_to_consume, T* values); As I see the return value here serves 2 different purposes: 1) As a general return value to indicate success or failure 2) An out parameter to get the num_values_to_consume. I find this a bit confusing. Can't you have a return value for the purpose of 1) and an out param for 2) ? What do you think? http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@666 PS2, Line 666: int32_t num_unpacked_values = 0; The names of num_unpacked_values and num_values_to_consume are not in line in my opinion. Either the first one should be num_consumed_values or the second one num_values_to_unpack. http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@152 PS2, Line 152: rle_encoded_bool.parquet: According to parquet-rle-encoded-bool.test there are 140 True values in column b. Could you please mention in the comment this? (and the total row count would also seem reasonable for future use). http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@154 PS2, Line 154: as I found no other way to force RLE : encoding of booleans in Parquet v1 nit: This part of the sentence doesn't provide extra information for the reader in my opinion. http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@155 PS2, Line 155: It became the default encoding in Parquet v2, but : Impala is not able to read Parquet v2 files yet. nit: I'm not sure this information is required here either. (what if someone reads this after Parquet v2 reading is implemented in Impala?) http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649 PS2, Line 649: "refresh {0}.rle_encoded_bool".format(unique_database)); For my understanding: Why is this refresh needed here? I checked other tests that include .parquet files (e.g. test_zero_rows) but those doesn't do this refresh step. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 05 Mar 2018 16:04:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9403 to look at the new patch set (#2). Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop", but some performance measurement may be needed to validate this. Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query into it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 7 files changed, 101 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/2 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Csaba Ringhofer has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9403 Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner .. IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. There might be a small performance impact on PLAIN encoded booleans, because of the addittional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop", but some performance measurement may be needed to validate this. Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table and that uses this file, and tries to query into it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/rle-encoding.h M testdata/data/README A testdata/data/rle_encoded_bool.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test M tests/query_test/test_scanners.py 7 files changed, 101 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/1 -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 1 Gerrit-Owner: Csaba RinghoferGerrit-Reviewer: Tim Armstrong