[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 4: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 07 Oct 2020 13:41:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..

IMPALA-9952: Fix page index filtering for empty pages

As IMPALA-4371 and IMPALA-10186 points out, Impala might write
empty data pages. It usually does that when it has to write a bigger
page than the current page size. If we really need to write empty data
pages is a different question, but we need to handle them correctly
as there are already such files out there.

The corresponding Parquet offset index entries to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit Impala didn't ignore the empty page locations, but
generated a warning. Since invalid page index doesn't fail a scan
by default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.

With this commit Impala ignores empty data pages and still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.

Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index

Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Reviewed-on: http://gerrit.cloudera.org:8080/16503
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M testdata/data/README
A testdata/data/alltypes_empty_pages.parquet
A testdata/data/alltypes_invalid_pages.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
9 files changed, 195 insertions(+), 40 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 07 Oct 2020 08:19:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6525/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 07 Oct 2020 08:19:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 3: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Oct 2020 18:44:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7368/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Oct 2020 15:21:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@824
PS2, Line 824: if (!ComputeCandidatePages(page_locations, 
candidate_ranges_, row_group.num_rows,
> this is not needed, as we set it to false in ResetPageFiltering() anyway
Done


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172
PS1, Line 172: while (next_page_idx < page_locations.size() &&
 :page_locations[next_page_idx].compressed_page_size == 
0) {
 :   ++next_page_idx;
 : }
> I don't think that copying a vector with the size of the number of pages wo
Yeah, for now I'd rather leave it that way.



--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Oct 2020 15:03:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Zoltan Borok-Nagy (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16503

to look at the new patch set (#3).

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..

IMPALA-9952: Fix page index filtering for empty pages

As IMPALA-4371 and IMPALA-10186 points out, Impala might write
empty data pages. It usually does that when it has to write a bigger
page than the current page size. If we really need to write empty data
pages is a different question, but we need to handle them correctly
as there are already such files out there.

The corresponding Parquet offset index entries to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit Impala didn't ignore the empty page locations, but
generated a warning. Since invalid page index doesn't fail a scan
by default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.

With this commit Impala ignores empty data pages and still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.

Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index

Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M testdata/data/README
A testdata/data/alltypes_empty_pages.parquet
A testdata/data/alltypes_invalid_pages.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
9 files changed, 195 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/3
--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 2: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16503/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@824
PS2, Line 824:   filter_pages_ = false;
this is not needed, as we set it to false in ResetPageFiltering() anyway


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172
PS1, Line 172: int next_page_idx = page_idx + 1;
 : while (next_page_idx < page_locations.size() &&
 :page_locations[next_page_idx].compressed_page_size == 
0) {
 :
> Yeah it'd also work, but I didn't want to dynamically allocate a new vector
I don't think that copying a vector with the size of the number of pages would 
matter, but I am ok with the current solution if you prefer it.



--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Oct 2020 14:46:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7366/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Oct 2020 13:59:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 2:

(5 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG@11
PS1, Line 11: data
> typo: data
Done


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@637
PS1, Line 637:   if (filter_pages_ && candidate_ranges_.empty()) {
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@724
PS1, Line 724: ETURN_IF_ERR
> Maybe make filter_pages a member, and reset it in ResetPageFiltering?
Done


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@142
PS1, Line 142: auto& last_valid_page = page_locations[last_valid_idx];
 :
> can you add some comment in this block? e.g. first_row_index must have prog
Done


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172
PS1, Line 172: int next_page_idx = page_idx + 1;
 : while (next_page_idx < page_locations.size() &&
 :page_locations[next_page_idx].compressed_page_size == 
0) {
 :
> I may have missed something, but wouldn't it be simpler to create a copy of
Yeah it'd also work, but I didn't want to dynamically allocate a new vector and 
copy integers. Though I didn't measure what would be the actual overhead.



--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Oct 2020 13:40:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-10-06 Thread Zoltan Borok-Nagy (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16503

to look at the new patch set (#2).

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..

IMPALA-9952: Fix page index filtering for empty pages

As IMPALA-4371 and IMPALA-10186 points out, Impala might write
empty data pages. It usually does that when it has to write a bigger
page than the current page size. If we really need to write empty data
pages is a different question, but we need to handle them correctly
as there are already such files out there.

The corresponding Parquet offset index entries to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit Impala didn't ignore the empty page locations, but
generated a warning. Since invalid page index doesn't fail a scan
by default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.

With this commit Impala ignores empty data pages and still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.

Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index

Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M testdata/data/README
A testdata/data/alltypes_empty_pages.parquet
A testdata/data/alltypes_invalid_pages.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
9 files changed, 196 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/2
--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-09-28 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 1: Code-Review+1

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16503/1//COMMIT_MSG@11
PS1, Line 11: date
typo: data


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@637
PS1, Line 637:
nit: extra line


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@724
PS1, Line 724: filter_pages
Maybe make filter_pages a member, and reset it in ResetPageFiltering?


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@142
PS1, Line 142: auto& last_valid_page = page_locations[last_valid_idx];
 :
can you add some comment in this block? e.g. first_row_index must have 
progressed in a non-empty page


http://gerrit.cloudera.org:8080/#/c/16503/1/be/src/exec/parquet/parquet-common.cc@172
PS1, Line 172: while (next_page_idx < page_locations.size() &&
 :page_locations[next_page_idx].compressed_page_size == 
0) {
 :   ++next_page_idx;
 : }
I may have missed something, but wouldn't it be simpler to create a copy of the 
page_location vector and omit empty pages during ValidatePageLocations?



--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 28 Sep 2020 15:45:10 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-09-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7273/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 24 Sep 2020 14:08:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-09-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16503/1/tests/query_test/test_parquet_stats.py
File tests/query_test/test_parquet_stats.py:

http://gerrit.cloudera.org:8080/#/c/16503/1/tests/query_test/test_parquet_stats.py@99
PS1, Line 99: t
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/16503/1/tests/query_test/test_parquet_stats.py@104
PS1, Line 104: )
flake8: E501 line too long (91 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 24 Sep 2020 13:48:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9952: Fix page index filtering for empty pages

2020-09-24 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16503


Change subject: IMPALA-9952: Fix page index filtering for empty pages
..

IMPALA-9952: Fix page index filtering for empty pages

As IMPALA-4371 and IMPALA-10186 points out, Impala might write
empty data pages. It usually does that when it has to write a bigger
page than the current page size. If we really need to write empty date
pages is a different question, but we need to handle them correctly
as there are already such files out there.

The corresponding Parquet offset index entries to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit Impala didn't ignore the empty page locations, but
generated a warning. Since invalid page index doesn't fail a scan
by default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.

With this commit Impala ignores empty data pages and still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.

Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index

Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M testdata/data/README
A testdata/data/alltypes_empty_pages.parquet
A testdata/data/alltypes_invalid_pages.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
9 files changed, 174 insertions(+), 23 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/1
--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy