[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 4: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 07 Oct 2020 13:41:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. IMPALA-9952: Fix page index filtering for empty pages As IMPALA-4371 and IMPALA-10186 points out, Impala might write empty data pages. It usually does that when it has to write a bigger page than the current page size. If we really need to write empty data pages is a different question, but we need to handle them correctly as there are already such files out there. The corresponding Parquet offset index entries to empty data pages are invalid PageLocation objects with 'compressed_page_size' = 0. Before this commit Impala didn't ignore the empty page locations, but generated a warning. Since invalid page index doesn't fail a scan by default, Impala continued scanning the file with semi-initialized page filtering. This resulted in 'Top level rows aren't in sync' error, or a crash in DEBUG builds. With this commit Impala ignores empty data pages and still able to filter the rest of the pages. Also, if the page index is corrupt for some other reason, Impala correctly resets the page filtering logic and falls back to regular scanning. Testing: * Added unit test for empty data pages * Added e2e test for empty data pages * Added e2e test for invalid page index Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Reviewed-on: http://gerrit.cloudera.org:8080/16503 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-common-test.cc M be/src/exec/parquet/parquet-common.cc M testdata/data/README A testdata/data/alltypes_empty_pages.parquet A testdata/data/alltypes_invalid_pages.parquet M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test M tests/query_test/test_parquet_stats.py 9 files changed, 195 insertions(+), 40 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 07 Oct 2020 08:19:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 4: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6525/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 07 Oct 2020 08:19:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 06 Oct 2020 18:44:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7368/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 06 Oct 2020 15:21:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@824 PS2, Line 824: if (!ComputeCandidatePages(page_locations, candidate_ranges_, row_group.num_rows, > this is not needed, as we set it to false in ResetPageFiltering() anyway Done http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172 PS1, Line 172: while (next_page_idx < page_locations.size() && :page_locations[next_page_idx].compressed_page_size == 0) { : ++next_page_idx; : } > I don't think that copying a vector with the size of the number of pages wo Yeah, for now I'd rather leave it that way. -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 06 Oct 2020 15:03:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16503 to look at the new patch set (#3). Change subject: IMPALA-9952: Fix page index filtering for empty pages .. IMPALA-9952: Fix page index filtering for empty pages As IMPALA-4371 and IMPALA-10186 points out, Impala might write empty data pages. It usually does that when it has to write a bigger page than the current page size. If we really need to write empty data pages is a different question, but we need to handle them correctly as there are already such files out there. The corresponding Parquet offset index entries to empty data pages are invalid PageLocation objects with 'compressed_page_size' = 0. Before this commit Impala didn't ignore the empty page locations, but generated a warning. Since invalid page index doesn't fail a scan by default, Impala continued scanning the file with semi-initialized page filtering. This resulted in 'Top level rows aren't in sync' error, or a crash in DEBUG builds. With this commit Impala ignores empty data pages and still able to filter the rest of the pages. Also, if the page index is corrupt for some other reason, Impala correctly resets the page filtering logic and falls back to regular scanning. Testing: * Added unit test for empty data pages * Added e2e test for empty data pages * Added e2e test for invalid page index Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 --- M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-common-test.cc M be/src/exec/parquet/parquet-common.cc M testdata/data/README A testdata/data/alltypes_empty_pages.parquet A testdata/data/alltypes_invalid_pages.parquet M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test M tests/query_test/test_parquet_stats.py 9 files changed, 195 insertions(+), 40 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/3 -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 2: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@824 PS2, Line 824: filter_pages_ = false; this is not needed, as we set it to false in ResetPageFiltering() anyway http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172 PS1, Line 172: int next_page_idx = page_idx + 1; : while (next_page_idx < page_locations.size() && :page_locations[next_page_idx].compressed_page_size == 0) { : > Yeah it'd also work, but I didn't want to dynamically allocate a new vector I don't think that copying a vector with the size of the number of pages would matter, but I am ok with the current solution if you prefer it. -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 06 Oct 2020 14:46:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7366/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 06 Oct 2020 13:59:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 2: (5 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG@11 PS1, Line 11: data > typo: data Done http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@637 PS1, Line 637: if (filter_pages_ && candidate_ranges_.empty()) { > nit: extra line Done http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@724 PS1, Line 724: ETURN_IF_ERR > Maybe make filter_pages a member, and reset it in ResetPageFiltering? Done http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@142 PS1, Line 142: auto& last_valid_page = page_locations[last_valid_idx]; : > can you add some comment in this block? e.g. first_row_index must have prog Done http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172 PS1, Line 172: int next_page_idx = page_idx + 1; : while (next_page_idx < page_locations.size() && :page_locations[next_page_idx].compressed_page_size == 0) { : > I may have missed something, but wouldn't it be simpler to create a copy of Yeah it'd also work, but I didn't want to dynamically allocate a new vector and copy integers. Though I didn't measure what would be the actual overhead. -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 06 Oct 2020 13:40:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16503 to look at the new patch set (#2). Change subject: IMPALA-9952: Fix page index filtering for empty pages .. IMPALA-9952: Fix page index filtering for empty pages As IMPALA-4371 and IMPALA-10186 points out, Impala might write empty data pages. It usually does that when it has to write a bigger page than the current page size. If we really need to write empty data pages is a different question, but we need to handle them correctly as there are already such files out there. The corresponding Parquet offset index entries to empty data pages are invalid PageLocation objects with 'compressed_page_size' = 0. Before this commit Impala didn't ignore the empty page locations, but generated a warning. Since invalid page index doesn't fail a scan by default, Impala continued scanning the file with semi-initialized page filtering. This resulted in 'Top level rows aren't in sync' error, or a crash in DEBUG builds. With this commit Impala ignores empty data pages and still able to filter the rest of the pages. Also, if the page index is corrupt for some other reason, Impala correctly resets the page filtering logic and falls back to regular scanning. Testing: * Added unit test for empty data pages * Added e2e test for empty data pages * Added e2e test for invalid page index Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 --- M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-common-test.cc M be/src/exec/parquet/parquet-common.cc M testdata/data/README A testdata/data/alltypes_empty_pages.parquet A testdata/data/alltypes_invalid_pages.parquet M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test M tests/query_test/test_parquet_stats.py 9 files changed, 196 insertions(+), 40 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/2 -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 1: Code-Review+1 (5 comments) http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG@11 PS1, Line 11: date typo: data http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@637 PS1, Line 637: nit: extra line http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@724 PS1, Line 724: filter_pages Maybe make filter_pages a member, and reset it in ResetPageFiltering? http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@142 PS1, Line 142: auto& last_valid_page = page_locations[last_valid_idx]; : can you add some comment in this block? e.g. first_row_index must have progressed in a non-empty page http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172 PS1, Line 172: while (next_page_idx < page_locations.size() && :page_locations[next_page_idx].compressed_page_size == 0) { : ++next_page_idx; : } I may have missed something, but wouldn't it be simpler to create a copy of the page_location vector and omit empty pages during ValidatePageLocations? -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 28 Sep 2020 15:45:10 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7273/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 24 Sep 2020 14:08:09 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16503 ) Change subject: IMPALA-9952: Fix page index filtering for empty pages .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/16503/1/tests/query_test/test_parquet_stats.py File tests/query_test/test_parquet_stats.py: http://gerrit.cloudera.org:8080/#/c/16503/1/tests/query_test/test_parquet_stats.py@99 PS1, Line 99: t flake8: E501 line too long (94 > 90 characters) http://gerrit.cloudera.org:8080/#/c/16503/1/tests/query_test/test_parquet_stats.py@104 PS1, Line 104: ) flake8: E501 line too long (91 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 24 Sep 2020 13:48:17 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages
Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16503 Change subject: IMPALA-9952: Fix page index filtering for empty pages .. IMPALA-9952: Fix page index filtering for empty pages As IMPALA-4371 and IMPALA-10186 points out, Impala might write empty data pages. It usually does that when it has to write a bigger page than the current page size. If we really need to write empty date pages is a different question, but we need to handle them correctly as there are already such files out there. The corresponding Parquet offset index entries to empty data pages are invalid PageLocation objects with 'compressed_page_size' = 0. Before this commit Impala didn't ignore the empty page locations, but generated a warning. Since invalid page index doesn't fail a scan by default, Impala continued scanning the file with semi-initialized page filtering. This resulted in 'Top level rows aren't in sync' error, or a crash in DEBUG builds. With this commit Impala ignores empty data pages and still able to filter the rest of the pages. Also, if the page index is corrupt for some other reason, Impala correctly resets the page filtering logic and falls back to regular scanning. Testing: * Added unit test for empty data pages * Added e2e test for empty data pages * Added e2e test for invalid page index Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 --- M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-common-test.cc M be/src/exec/parquet/parquet-common.cc M testdata/data/README A testdata/data/alltypes_empty_pages.parquet A testdata/data/alltypes_invalid_pages.parquet M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test M tests/query_test/test_parquet_stats.py 9 files changed, 174 insertions(+), 23 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/1 -- To view, visit http://gerrit.cloudera.org:8080/16503 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0 Gerrit-Change-Number: 16503 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy