[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-09-20 Thread Sahil Takiar (Code Review)
Sahil Takiar has abandoned this change. ( http://gerrit.cloudera.org:8080/12163 
)

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Abandoned

Abandoning for now as I am no longer working on this / don't plan on picking it
up for a while. Will publish a new review if I decide otherwise.
--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-04-04 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 3:

(6 comments)

Yeah, sorry I forgot about this change request.

http://gerrit.cloudera.org:8080/#/c/12163/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12163/3//COMMIT_MSG@24
PS3, Line 24: that cannot be converted to
: the higher scale without overflow
Shouldn't we only allow conversions to decimals that have at least as many
digits before and after the decimal point? This way no overflow should occur,
right?
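
For illustration, that condition can be checked from the file and table
metadata alone. A minimal C++ sketch (the parameter names are placeholders,
not Impala's actual fields):

  // A decimal(p, s) has p - s digits before the decimal point and s after.
  // Widening is guaranteed lossless iff the target type has at least as many
  // digits on both sides of the decimal point as the source type.
  bool WideningCannotOverflow(int file_precision, int file_scale,
      int table_precision, int table_scale) {
    return file_scale <= table_scale &&
        file_precision - file_scale <= table_precision - table_scale;
  }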


http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@237
PS3, Line 237:   inline ALWAYS_INLINE bool ConvertDecimalPrecisionAndScale(
nit: please add comment


http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@248
PS3, Line 248:   inline ALWAYS_INLINE void ConvertDecimalPrecision(const 
InternalType* src, void* slot);
nit: Please add comment. Explain why it takes an InternalType while
ConvertDecimalScale() takes a Decimal*Value.


http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@978
PS3, Line 978: dst_dec, slot
'dst_dec' and 'slot' point to the same memory location; it's strange to pass
both.


http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@986
PS3, Line 986: dst_dec4, slot
'dst_dec4' and 'slot' point to the same memory location; why do we need to
pass both?


http://gerrit.cloudera.org:8080/#/c/12163/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/12163/3/tests/query_test/test_scanners.py@948
PS3, Line 948:
I don't see a test case for the error case, i.e. where the table metadata scale
is lower than the file metadata scale.



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Apr 2019 14:44:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-04-03 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 3:

Zoltan, any plans to take another look at this one?


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Apr 2019 01:05:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-02-12 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 3:

(3 comments)

Just noticed the comments about the README, but I plan to take a deeper look 
soon.

http://gerrit.cloudera.org:8080/#/c/12163/3/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12163/3/testdata/data/README@183
PS3, Line 183: decimal_stored_as_int32.parquet:
 : Parquet file generated by Spark 2.3.1 that contains decimals 
stored as int32.
 : Impala needs to be able to read such values (IMPALA-5542)
 :
 : decimal_stored_as_int64.parquet:
 : Parquet file generated by Spark 2.3.1 that contains decimals 
stored as int64.
 : Impala needs to be able to read such values (IMPALA-5542)
Here.


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842
PS1, Line 842: pressed_page_size_summaries:
> Thanks for taking care of the other files too!
The decimal_stored_as_* files were in the README already.


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848
PS1, Line 848:  " limit 10")
> I do not see this file in the patch and it is not mentioned in the README.
This was already in the README.



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 12 Feb 2019 17:51:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-29 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test:

http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85
PS1, Line 85: 
> Can you point me to the stat filtering code you are referring to?
The stats are read here: 
https://github.com/apache/impala/blob/fa672909c868f76aa50e9fb756114c32daaf6d9b/be/src/exec/parquet/hdfs-parquet-scanner.cc#L538

With a few hops it arrives at the function I had to specialize for timestamps:
https://github.com/apache/impala/blob/fa672909c868f76aa50e9fb756114c32daaf6d9b/be/src/exec/parquet/parquet-column-stats.inline.h#L103

They are evaluated a few lines later with the assumption that they have the 
same scale and precision as the SQL type.
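
To make the hazard concrete, here is a hypothetical sketch (none of these names
are Impala's real ones): if row-group stats written at the file's scale are
compared at the table's scale without rescaling, a row group can be skipped
incorrectly.

  #include <cstdint>

  // Hypothetical min/max check for a 'col = pred' predicate against row-group
  // stats. 'min_unscaled'/'max_unscaled' come from the file at 'file_scale';
  // 'pred_unscaled' is the constant encoded at 'table_scale'. The stats must
  // be rescaled first: unscaled 15 at scale 0 means 15, but read naively at
  // scale 1 it would mean 1.5, and a row group containing 15 would be dropped.
  bool CanSkipRowGroup(int64_t min_unscaled, int64_t max_unscaled,
      int file_scale, int table_scale, int64_t pred_unscaled) {
    for (int i = file_scale; i < table_scale; ++i) {
      min_unscaled *= 10;  // Safe if the widening condition already holds.
      max_unscaled *= 10;
    }
    return pred_unscaled < min_unscaled || pred_unscaled > max_unscaled;
  }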


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842
PS1, Line 842: pressed_page_size_summaries:
> It was an existing file. I added a section to the README describing its con
Thanks for taking care of the other files too!



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Tue, 29 Jan 2019 17:04:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1910/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 28 Jan 2019 21:27:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1909/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 28 Jan 2019 21:06:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-28 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@243
PS1, Line 243: // TODO: we could allow a mismatch and do a conversion at this 
step.
> This TODO could be probably updated.
Done


http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@255
PS1, Line 255: // TODO: we could allow a mismatch and do a conversion at this 
step.
> This TODO could be probably updated.
Done


http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test:

http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85
PS1, Line 85: 
> Please add some tests with 'where col=const' clauses to exercise the dictio
Can you point me to the stat filtering code you are referring to?


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842
PS1, Line 842: decimal_stored_as_int32.parquet
> I do not see this file in the patch and it is not mentioned in the README.
It was an existing file. I added a section to the README describing its 
contents. Same for decimal_stored_as_int64.parquet and 
binary_decimal_no_dictionary.parquet


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848
PS1, Line 848: decimal_stored_as_int64.parquet
> I do not see this file in the patch and it is not mentioned in the README.
Done


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@856
PS1, Line 856: binary_decimal_no_dictionary.parquet
> I do not see this file in the patch and it is not mentioned in the README.
Done



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 28 Jan 2019 20:15:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-28 Thread Sahil Takiar (Code Review)
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12163

to look at the new patch set (#3).

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..

IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table 
metadata

Allows Impala to read Parquet files that have been written with a lower
precision / scale compared to the precision / scale specified during
table creation. Currently, if the precision / scale in the table does
not exactly match the precision / scale in the Parquet files, Impala
will throw an error and the table will be unqueryable.

This patch only allows reading Parquet files with a *lower* precision /
scale compared to the table metadata. Users still get an error if they
try to read higher precision / scale data into a lower precision /
scale.

The ability to read lower precision / scale data is allowed by the SQL
Standard, and several other engines, such as Hive, MySQL, and Postgres,
allow for this behavior.

If the underlying Parquet files contain rows that cannot be converted to
the higher scale without overflow, the decimal slot is set to NULL and a
warning is printed in the logs. This is consistent with what other engines,
such as Hive, do. If abort_on_error is set to true, then instead of setting
the slot to NULL, the query fails.
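
As a rough illustration of this behavior (a sketch only, not the patch's actual
code; the names and the __int128 storage are assumptions), the scale widening
with an overflow check might look like:

  using int128_t = __int128;

  // Rescale 'unscaled' from the file's scale to the (higher) table scale.
  // 'max_unscaled' is the largest unscaled value representable at the table's
  // precision, i.e. 10^precision - 1. Returns false on overflow; the caller
  // then sets the slot to NULL, or fails the query when abort_on_error is set.
  bool RescaleDecimal(int128_t unscaled, int file_scale, int table_scale,
      int128_t max_unscaled, int128_t* out) {
    for (int i = file_scale; i < table_scale; ++i) {
      // Check before multiplying so the intermediate value cannot overflow.
      if (unscaled > max_unscaled / 10 || unscaled < -(max_unscaled / 10)) {
        return false;
      }
      unscaled *= 10;
    }
    if (unscaled > max_unscaled || unscaled < -max_unscaled) return false;
    *out = unscaled;
    return true;
  }

For example, unscaled 99 at scale 1 (9.9 as decimal(2,1)) rescales to 990 and
fits decimal(3,2), while unscaled 2 at scale 0 fails to fit decimal(1,1).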

Testing:
* Ran core tests
* Added new tests in test_scanners.py

Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening-abort-on-error.test
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/common/test_result_verifier.py
M tests/query_test/test_scanners.py
10 files changed, 423 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/3
--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc@237
PS2, Line 237:   inline ALWAYS_INLINE bool 
ConvertDecimalPrecisionAndScale(const InternalType* src, void* slot);
line too long (97 > 90)


http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc@1027
PS2, Line 1027: << filename() << " column " << 
slot_desc()->type().DebugString();
tab used for whitespace



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 28 Jan 2019 20:12:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata

2019-01-28 Thread Sahil Takiar (Code Review)
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12163

to look at the new patch set (#2).

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale than table metadata
..

IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table 
metadata

Allows Impala to read Parquet files that have been written with a lower
precision / scale compared to the precision / scale specified during
table creation. Currently, if the precision / scale in the table does
not exactly match the precision / scale in the Parquet files, Impala
will throw an error and the table will be unqueryable.

This patch only allows reading Parquet files with a *lower* precision /
scale compared to the table metadata. Users still get an error if they
try to read higher precision / scale data into a lower precision /
scale.

The ability to read lower precision / scale data is allowed by the SQL
Standard, and several other engines, such as Hive, MySQL, and Postgres,
allow for this behavior.

If the underlying Parquet files contain rows that cannot be converted to
the higher scale without overflow, the decimal slot is set to NULL and a
warning is printed in the logs. This is consistent with what other engines,
such as Hive, do. If abort_on_error is set to true, then instead of setting
the slot to NULL, the query fails.

Testing:
* Ran core tests
* Added new tests in test_scanners.py

Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening-abort-on-error.test
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/common/test_result_verifier.py
M tests/query_test/test_scanners.py
10 files changed, 422 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/2
--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-24 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

> > Starting a new thread, since the last one was getting long.
 > >
 > > I started working on getting this patch in line with Hive's
 > > behavior - e.g. if an overflow occurs, set the row to NULL and
 > emit
 > > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot.
 > > The ConvertSlot method returns a bool: if true, execution
 > > continues; if false, the query fails. There is no clean way for
 > > ConvertSlot to set the tuple to NULL and continue processing.
 > > ScalarColumnReader::ValidateValue, on the other hand, does allow for
 > > this. If ValidateValue returns false and the parse_status is
 > > ok(),
 > > then the tuple is set to NULL and execution continues. If it
 > > returns false and the parse_status() is an error, it aborts the
 > > query. If it returns true, execution continues.
 > >
 > > I checked and all current implementations of ConvertSlot always
 > > return true, no matter what. So I propose changing ConvertSlot's
 > > return semantics to match those of ValidateValue.
 > >
 > > Any objections?
 >
 > I would still prefer the "exclude the whole file with warning if
 > overflow is possible" approach, as it would:
 > - give a useful error message
 > - probably make conversion faster, as a simple multiplication would
 > be enough
 > - be simpler and need less testing
 >
 > If this is not an option (because enabling the potentially
 > overflowing conversion is really useful in some use cases), then I
 > agree with changing ConvertSlot() as you described.

A few points to consider on the "exclude the whole file with warning if
overflow is possible" approach:

1: I don't think we should exclude entire files with just a warning only
because an overflow might be possible. That could significantly throw off query
results. I think a more fine-grained approach where we essentially exclude
individual rows by setting them to NULL would be better.

2: Allowing for overflows only gives the user more flexibility; if they don't
want overflows, they can set abort_on_error and the query will fail instead of
setting rows to NULL.

3: Setting overflowed slots to NULL is consistent with how Impala handles 
similar issues elsewhere; most of the ValidateValue methods follow this 
approach.

4: Yes, there is a performance overhead to checking overflows. I haven't
checked how much it is, but consider that many decimal operations done by
Impala expressions invoke the overflow check as well (e.g. decimal
multiplication, addition, etc.).

5: Currently, if an Impala query multiplies two slots and the result overflows, 
the entire query fails. So I'm also ok with keeping the current behavior of 
this patch since it matches how Impala currently deals with overflows. However, 
I would still prefer to follow Hive's behavior because (1) consistency with 
Hive is useful, and (2) users can still set abort_on_error=true if they want.

If discussing this over Hangout is easier, happy to set up a call.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 24 Jan 2019 15:47:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-10 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

> Starting a new thread, since the last one was getting long.
 >
 > I started working on getting this patch in line with Hive's
 > behavior - e.g. if an overflow occurs, set the row to NULL and emit
 > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot.
 > The ConvertSlot method returns a bool: if true, execution
 > continues; if false, the query fails. There is no clean way for
 > ConvertSlot to set the tuple to NULL and continue processing.
 > ScalarColumnReader::ValidateValue, on the other hand, does allow for
 > this. If ValidateValue returns false and the parse_status is ok(),
 > then the tuple is set to NULL and execution continues. If it
 > returns false and the parse_status() is an error, it aborts the
 > query. If it returns true, execution continues.
 >
 > I checked and all current implementations of ConvertSlot always
 > return true, no matter what. So I propose changing ConvertSlot's
 > return semantics to match those of ValidateValue.
 >
 > Any objections?

I would still prefer the "exclude the whole file with warning if overflow is 
possible" approach, as it would:
- give a useful error message
- probably make conversion faster, as a simple multiplication would be enough
- be simpler and need less testing

If this is not an option (because enabling the potentially overflowing 
conversion is really useful in some use cases), then I agree with changing 
ConvertSlot() as you described.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 10 Jan 2019 22:11:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-10 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

Starting a new thread, since the last one was getting long.

I started working on getting this patch in line with Hive's behavior - e.g. if 
an overflow occurs, set the row to NULL and emit a warning. I hit an issue in 
the design of ScalarColumnReader::ConvertSlot. The ConvertSlot method returns a
bool: if true, execution continues; if false, the query fails. There is no
clean way for ConvertSlot to set the tuple to NULL and continue processing.
ScalarColumnReader::ValidateValue, on the other hand, does allow for this. If
ValidateValue returns false and the parse_status is ok(), then the tuple is set
to NULL and execution continues. If it returns false and the parse_status() is
an error, it aborts the query. If it returns true, execution continues.

I checked and all current implementations of ConvertSlot always return true, no 
matter what. So I propose changing ConvertSlot's return semantics to match 
those of ValidateValue.

Any objections?
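
To illustrate the proposal, roughly (a sketch only; the member names and the
surrounding class are simplified, not copied from the code):

  // Proposed ConvertSlot() contract, mirroring ValidateValue():
  //   returns true                         -> value converted, keep scanning
  //   returns false, parse_status_ ok      -> set the slot to NULL, continue
  //   returns false, parse_status_ error   -> abort the query
  bool ConvertSlot(const InternalType* src, void* slot) {
    if (!ConvertDecimalPrecisionAndScale(src, slot)) {
      if (abort_on_error_) {
        parse_status_ = Status("Decimal conversion overflow");
        return false;  // Error status: the query fails.
      }
      return false;  // OK status: the slot becomes NULL, scanning continues.
    }
    return true;
  }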


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 10 Jan 2019 20:42:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-07 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

> > (1 comment)
 > >
 > > Thanks for taking a look @Csaba! I've only responded to your
 > > comment on overflow handling so far, as I want to get that
 > behavior
 > > locked down first.
 >
 > Are you sure that the query will be aborted - isn't it only one
 > warning / file? If there are "normal" files processed before the
 > one with overflows, then some rows will be returned before running
 > into the error. In the current implementation it is also possible
 > that some rows are returned from the file before reaching the
 > overflowing value.
 >
 > I am also not completely sure about the best approach.
 >
 > I would prefer the error to be non-fatal, so to print one
 > warning/file and continue the query.
 >
 > If no important use case is lost this way, then I would prefer to
 > use the formula and skip files based on metadata instead of
 > checking every row.
 >
 > This would make it possible to print a useful error message, so
 > that the user will know the precision needed to process the file,
 > and can call ALTER TABLE to increase precision / decrease scale and
 > make the file usable. If the error is printed based on the first row,
 > then this can become a long iterative process.

That's a fair point. It is throwing an error: the last test in
parquet-decimal-precision-and-scale-widening.test has a `CATCH` section, so if
I understand the logic in ImpalaTestSuite#run_test_case, that means the query
threw an exception. I agree that it's possible for some rows to be returned
before this error is thrown, but I guess that's an issue with many runtime
errors in Impala.

I'm good with printing a non-fatal error and setting the row to NULL. This is 
what Hive does, so we will be consistent with Hive.

Using the formula will prevent overflows from happening in the first place, so
it is somewhat orthogonal to the discussion of error handling. The only
drawback I can see to using the formula you suggested is that it is not exactly
SQL Standard compliant. It's entirely possible that the formula prevents a
query from running, even though none of the values in the table would cause an
overflow. The standard says that values should be rounded or truncated to match
the target precision / scale, but I guess it doesn't explicitly mention
overflows, so it's hard to say.

I think that following Hive's behavior might be best here. I've added @Lars to 
the review to see what he thinks.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 07 Jan 2019 21:32:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-07 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

> (1 comment)
 >
 > Thanks for taking a look @Csaba! I've only responded to your
 > comment on overflow handling so far, as I want to get that behavior
 > locked down first.

Are you sure that the query will be aborted - isn't it only one warning / file? 
If there are "normal" files processed before the one with overflows, then some 
rows will be returned before running into the error. In the current
implementation it is also possible that some rows are returned from the file 
before reaching the overflowing value.

I am also not completely sure about the best approach.

I would prefer the error to be non-fatal, so to print one warning/file and 
continue the query.

If no important use case is lost this way, then I would prefer to use the 
formula and skip files based on metadata instead of checking every row.

This would make it possible to print a useful error message, so that the user 
will know the precision needed to process the file, and can call ALTER TABLE to 
increase precision / decrease scale and make the file usable. If the error is
printed based on the first row, then this can become a long iterative process.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 07 Jan 2019 18:22:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-07 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

(1 comment)

Thanks for taking a look @Csaba! I've only responded to your comment on 
overflow handling so far, as I want to get that behavior locked down first.

http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG@25
PS1, Line 25: If the underlying Parquet files contain rows that cannot be 
converted to
: the higher scale without overflow, an error is thrown and the 
query
: fails. This is different from other engines such as Hive which 
set the
: row to NULL and allow the query to run to completion without 
error.
> I do not know the use case, so I may be wrong, but I dislike the idea of fa
Yes, the example you provided would trigger an overflow. The formula you 
provided sounds like it should work.

I couldn't find much info in the SQL standard on how to handle these types of 
overflows, so maybe the behavior is ambiguous. For relational databases, you 
get an error when you try to insert a row that causes an overflow. However, the 
error is upon row insertion, so the table is still queryable after a failed 
attempt at inserting the row.

As mentioned in the comments, Hive sets the row to NULL, and the query runs to 
completion.

So sounds like there are three options: (1) the current behavior in the patch, 
(2) use the formula you provided to prevent overflows in which case no runtime 
errors or warnings are necessary, or (3) set the row to NULL and print out a 
warning (I think we do this for other types of Parquet scanning errors?).

FWIW, the current behavior is exactly what you described. If you put a Parquet 
file into a table whose precision / scale is lower than the file's, the table will be unqueryable
until the files are removed. So if we went with option (2), any file inside the 
table that doesn't match the formula will make the table unqueryable. The main 
difference with this patch is that the table will be unqueryable only if 
certain rows that overflow are present.

I don't feel strongly about any of the approaches, and I'm not super familiar 
with how Impala handles these types of issues, so I'm open to suggestions.



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 07 Jan 2019 16:26:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-07 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG@25
PS1, Line 25: If the underlying Parquet files contain rows that cannot be 
converted to
: the higher scale without overflow, an error is thrown and the 
query
: fails. This is different from other engines such as Hive which 
set the
: row to NULL and allow the query to run to completion without 
error.
I do not know the use case, so I may be wrong, but I dislike the idea of 
failing the query if the Parquet file is not corrupt and only the data is "too 
large". If a file with such data is added to a table, the table can become 
unusable and Impala offers no tool to get rid of the problematic file.

Would it make sense to be stricter during scale + precision checking to exclude
the possibility of overflow? If I understand correctly, overflow can only occur
if the scale in the file is lower than the expected scale: e.g. value 2 in
decimal(1,0) would overflow when converted to decimal(1,1). If precision_file
- scale_file <= precision_expected - scale_expected, then the conversion should
happen without overflow, shouldn't it?
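
Working through that example under the formula (illustrative only):

  // decimal(1,0) value 2 has unscaled value 2. Widening to decimal(1,1)
  // multiplies by 10^(1-0): 2 * 10 = 20, which needs two digits while
  // precision 1 allows at most 9 -> overflow. The proposed condition rejects
  // this up front, since 1 - 0 <= 1 - 1 is false:
  static_assert(!(1 - 0 <= 1 - 1), "decimal(1,0) -> decimal(1,1) may overflow");
  // decimal(2,1) -> decimal(3,2) keeps one digit before and gains one after
  // the decimal point, so 2 - 1 <= 3 - 2 holds and no value can overflow.
  static_assert(2 - 1 <= 3 - 2, "decimal(2,1) -> decimal(3,2) is safe");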


http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@243
PS1, Line 243: // TODO: we could allow a mismatch and do a conversion at this 
step.
This TODO could be probably updated.


http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@255
PS1, Line 255: // TODO: we could allow a mismatch and do a conversion at this 
step.
This TODO could be probably updated.


http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test:

http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85
PS1, Line 85: 
Please add some tests with 'where col=const' clauses to exercise the dictionary 
and stat filtering code paths.

Dictionary filtering is disabled if 'needs_conversion_' is true, but it may be
enabled in the future if conversion is moved to dictionary construction.

I think that stat filtering won't work correctly with the current 
implementation - the stat value will be interpreted as if it had the same 
precision/scale as the SQL column, which can lead to incorrectly dropping row 
groups.

I had to deal with somewhat similar problems in 
https://gerrit.cloudera.org/#/c/11057/


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842
PS1, Line 842: decimal_stored_as_int32.parquet
I do not see this file in the patch and it is not mentioned in the README.


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848
PS1, Line 848: decimal_stored_as_int64.parquet
I do not see this file in the patch and it is not mentioned in the README.


http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@856
PS1, Line 856: binary_decimal_no_dictionary.parquet
I do not see this file in the patch and it is not mentioned in the README.



--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Mon, 07 Jan 2019 15:14:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1715/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Fri, 04 Jan 2019 22:20:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

2019-01-04 Thread Sahil Takiar (Code Review)
Sahil Takiar has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/12163


Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
..

IMPALA-7087: Read Parquet decimal columns with lower precision/scale

Allows Impala to read Parquet files that have been written with a lower
precision / scale compared to the precision / scale specified during
table creation. Currently, if the precision / scale in the table does
not exactly match the precision / scale in the underlying Parquet files,
Impala will throw an error during any attempt to read data from the
table.

This patch only allows reading Parquet files with a *lower* precision /
scale compared to the table metadata. Users still get an error if they
try to read higher precision / scale data into a lower precision /
scale.

The ability to read lower precision / scale data is allowed by the SQL
Standard, and several other engines, such as Hive, MySQL, and Postgres,
allow for this behavior.

If the underlying Parquet files contain rows that cannot be converted to
the higher scale without overflow, an error is thrown and the query
fails. This is different from other engines such as Hive which set the
row to NULL and allow the query to run to completion without error.

Testing:
* Ran core tests
* Added new tests in test_scanners.py

Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
7 files changed, 357 insertions(+), 9 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/1
--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar