[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 9: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 22 Feb 2018 00:57:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
Reviewed-on: http://gerrit.cloudera.org:8080/9358
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins
---
M be/src/exec/parquet-column-stats.cc
M testdata/data/README
A testdata/data/min_max_is_nan.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-invalid-minmax-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/query_test/test_parquet_stats.py
6 files changed, 137 insertions(+), 2 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1976/


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 21:17:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 9: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 21:16:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 8: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 21:16:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/6/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9358/6/testdata/data/README@153
PS6, Line 153: hacking
wasn't this the default behavior before this fix?



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 19:33:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Gabor Kaszab, Zoltan Ivanfi, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#7).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/data/README
A testdata/data/min_max_is_nan.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-invalid-minmax-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/query_test/test_parquet_stats.py
6 files changed, 137 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/7
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Gabor Kaszab, Zoltan Ivanfi, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#8).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/data/README
A testdata/data/min_max_is_nan.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-invalid-minmax-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/query_test/test_parquet_stats.py
6 files changed, 137 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/8
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 7: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/7/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9358/7/testdata/data/README@153
PS7, Line 153: Generated by Impala's Parquet writer before the fix for 
IMPALA-6527.
Nit: Maybe also add the first 8 characters of the git hash (current HEAD) to 
make it easy to repro.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 20:23:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/6/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9358/6/testdata/data/README@153
PS6, Line 153: hacking
> wasn't this the default behavior before this fix?
Yes, I just saw that almost every Parquet file has this description that are 
generated by Impala.
Improved the comment.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 19:51:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/7/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9358/7/testdata/data/README@153
PS7, Line 153: Generated by Impala's Parquet writer before the fix for 
IMPALA-6527. Git hash: 3a049a53
> Nit: Maybe also add the first 8 characters of the git hash (current HEAD) t
Done



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 20:37:06 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 20:35:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/5/testdata/data/min_max_is_nan.parquet
File testdata/data/min_max_is_nan.parquet:

PS5:
> Please also update the README file in the same folder.
Updated the README file.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 19:32:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Gabor Kaszab, Zoltan Ivanfi, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#6).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/data/README
A testdata/data/min_max_is_nan.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-invalid-minmax-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/query_test/test_parquet_stats.py
6 files changed, 137 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/6
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/5/testdata/data/min_max_is_nan.parquet
File testdata/data/min_max_is_nan.parquet:

PS5:
Please also update the README file in the same folder.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 18:26:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 5: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 16:45:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 5: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 15:12:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 4:

(2 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/9358/4/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/9358/4/be/src/exec/parquet-column-stats.cc@99
PS4, Line 99:   col_chunk.meta_data.type)) return false;
> nit: Not sure if it makes it more readable, but you could write this as
I like it, re-structured the code.


http://gerrit.cloudera.org:8080/#/c/9358/4/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/4/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@494
PS4, Line 494:  QUERY
> This is a good point. We should consider checking in a file written with th
Thanks for the suggestions. I added min_max_is_nan.parquet, and created a new 
test in test_parquet_stats.py.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 21 Feb 2018 14:27:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-21 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Gabor Kaszab, Zoltan Ivanfi, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#5).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
A testdata/data/min_max_is_nan.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-invalid-minmax-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/query_test/test_parquet_stats.py
5 files changed, 130 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/5
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-20 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9358/4/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/9358/4/be/src/exec/parquet-column-stats.cc@99
PS4, Line 99:   col_chunk.meta_data.type)) return false;
nit: Not sure if it makes it more readable, but you could write this as

  return (DecodePlainValue(..) && !isnan(..));


http://gerrit.cloudera.org:8080/#/c/9358/4/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/4/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@494
PS4, Line 494:  QUERY
> This is a good test in general, but it does not specifically test for the f
This is a good point. We should consider checking in a file written with the 
current code. You can find examples in testdata/data.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 20 Feb 2018 18:00:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-20 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 4: Code-Review+1

(1 comment)

This workaround in the read path seems to be a good quick-fix, but I think the 
write path should also have a quick fix to make the written stats independent 
of the data order, i.e. it should not matter whether a NaN is the first value 
or not.

http://gerrit.cloudera.org:8080/#/c/9358/4/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/4/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@494
PS4, Line 494:  QUERY
This is a good test in general, but it does not specifically test for the fix. 
Both the read path and the write path can be modified to make this test pass 
and once we will have fixes in both paths, this test won't notice if one of 
them has a regression.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 20 Feb 2018 15:41:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-20 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#4).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
2 files changed, 86 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/4
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-20 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 3:

(6 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc@29
PS2, Line 29: const ColumnType& col_type, const parquet::ColumnOrder* 
col_order,
> 1: Can this be part of ColumnStatsBase?
Removed this function.


http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc@109
PS2, Line 109:   col_chunk.meta_data.type);
> +1 for returning false here.
I added comments here. The function is general for all the datatypes and I 
think the current comment does a good job explaining its behavior.

I'm not sure if this edge case that only affects one type should be mentioned 
there.


http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc@109
PS2, Line 109:   col_chunk.meta_data.type);
> I think we can just return false here if a value is NaN. That way the calle
Good idea, thanks.


http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@496
PS2, Line 496: # When the first value is NaN in a column chunk, Impala chooses 
it as min_value and
> Could you add a description here why we test this? (to check that when the
Added some extra details.


http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@497
PS2, Line 497: # max_value for statistics. In this case the min/max filter 
should be ignored.
> Can you also add tests where NaN is inserted in the middle of two values? F
Added couple of tests for the case when NaN is in the middle.


http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@520
PS2, Line 520: # test predicate that is true for NaN
> Maybe also test !=?
Added tests for !=



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 20 Feb 2018 11:18:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-20 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#3).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If max is NaN, then Impala
will interpret it as infinity. If min is NaN, then -infinity
will be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
2 files changed, 86 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/3
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-20 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc@109
PS2, Line 109:   ChangeNaNToInf(slot, stats_field);
> I think we can just return false here if a value is NaN. That way the calle
+1 for returning false here.
Please update the function comment in the header file accordingly.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 20 Feb 2018 08:40:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-19 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc@109
PS2, Line 109:   ChangeNaNToInf(slot, stats_field);
I think we can just return false here if a value is NaN. That way the caller 
will know that it shouldn't use the stats (see comment of the ReadFromThrift()).


http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@497
PS2, Line 497: insert into test_nan values (cast('NaN' as double)), (42);
Can you also add tests where NaN is inserted in the middle of two values? For 
those, "where not val >= 0" should be covered, too.



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 19 Feb 2018 22:17:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@520
PS2, Line 520: select * from test_nan where not val >= 0
Maybe also test !=?



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 19 Feb 2018 20:59:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-19 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9358 )

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..


Patch Set 2:

(2 comments)

Seems fine in general. I have some minor comments.

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/9358/2/be/src/exec/parquet-column-stats.cc@29
PS2, Line 29: void ChangeNaNToInf(void *slot, ColumnStatsBase::StatsField 
stats_field) {
1: Can this be part of ColumnStatsBase?
2: Could you add some description to the function? e.g. can we expect changes 
on slot parameter?


http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/9358/2/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@496
PS2, Line 496: create table test_nan(val double) stored as parquet;
Could you add a description here why we test this? (to check that when the 
first item in the row group is NaN then it doesn't ruin min/max stats and as a 
result it doesn't rule out the whole row group)



--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 19 Feb 2018 15:40:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-19 Thread Zoltan Borok-Nagy (Code Review)
Hello Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9358

to look at the new patch set (#2).

Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If max is NaN, then Impala
will interpret it as infinity. If min is NaN, then -infinity
will be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
2 files changed, 48 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/2
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-19 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9358


Change subject: IMPALA-6538: Fix read path when Parquet min/max statistics 
contain NaN
..

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If max is NaN, then Impala
will interpret it as infinity. If min is NaN, then -infinity
will be used.

I added some tests to QueryTest/parqet-stats.test

Change-Id: If3897fc1426541239223670812f59e2bed32f455
---
M be/src/exec/parquet-column-stats.cc
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
2 files changed, 48 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/9358/1
--
To view, visit http://gerrit.cloudera.org:8080/9358
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If3897fc1426541239223670812f59e2bed32f455
Gerrit-Change-Number: 9358
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy