[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Sahil Takiar has abandoned this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Abandoned Abandoning for now as I am no longer working on this / don't plan on picking it up for a while. Will publish a new review if decide otherwise. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: (6 comments) Yeah, sorry I forgot about this change request. http://gerrit.cloudera.org:8080/#/c/12163/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12163/3//COMMIT_MSG@24 PS3, Line 24: that cannot be converted to : the higher scale without overflow Shouldn't we only allow conversions to decimals that have more digits before and after the decimal point? This way no overflow should occur, right? http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@237 PS3, Line 237: inline ALWAYS_INLINE bool ConvertDecimalPrecisionAndScale( nit: please add comment http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@248 PS3, Line 248: inline ALWAYS_INLINE void ConvertDecimalPrecision(const InternalType* src, void* slot); nit: Please add comment. Explain why it takes an InternalType while the CovertDecimalScale() takes a Decimal*Value. http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@978 PS3, Line 978: dst_dec, slot 'dst_dec' and 'slot' points to the same memory location, it's strange to pass both. http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@986 PS3, Line 986: dst_dec4, slot 'dst_dec4' and 'slot' points to the same memory location, why do we need to pass both? http://gerrit.cloudera.org:8080/#/c/12163/3/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/3/tests/query_test/test_scanners.py@948 PS3, Line 948: I don't see a test case for the error case, i.e. table metadata scale is lower than file metadata scale. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 04 Apr 2019 14:44:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: Zoltan, any plans to take another look at this one? -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 04 Apr 2019 01:05:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: (3 comments) Just noticed the comments about the README, but I plan to take a deeper look soon. http://gerrit.cloudera.org:8080/#/c/12163/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12163/3/testdata/data/README@183 PS3, Line 183: decimal_stored_as_int32.parquet: : Parquet file generated by Spark 2.3.1 that contains decimals stored as int32. : Impala needs to be able to read such values (IMPALA-5542) : : decimal_stored_as_int64.parquet: : Parquet file generated by Spark 2.3.1 that contains decimals stored as int64. : Impala needs to be able to read such values (IMPALA-5542) Here. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: pressed_page_size_summaries: > Thanks for taking care of the other files too! The decimal_stored_as_* files were in the README already. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848 PS1, Line 848: " limit 10") > I do not see this file in the patch and it is not mentioned in the README. This was already in the README. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 12 Feb 2019 17:51:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test: http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85 PS1, Line 85: > Can you point me to the stat filtering code you are referring to? The stats are read here: https://github.com/apache/impala/blob/fa672909c868f76aa50e9fb756114c32daaf6d9b/be/src/exec/parquet/hdfs-parquet-scanner.cc#L538 With a few hops it arrives to the function I had to specialize for timestamps: https://github.com/apache/impala/blob/fa672909c868f76aa50e9fb756114c32daaf6d9b/be/src/exec/parquet/parquet-column-stats.inline.h#L103 They are evaluated a few lines later with the assumption that they have the same scale and precision as the SQL type. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: pressed_page_size_summaries: > It was an existing file. I added a section to the README describing its con Thanks for taking care of the other files too! -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Tue, 29 Jan 2019 17:04:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1910/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 21:27:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1909/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 2 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 21:06:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@243 PS1, Line 243: // TODO: we could allow a mismatch and do a conversion at this step. > This TODO could be probably updated. Done http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@255 PS1, Line 255: // TODO: we could allow a mismatch and do a conversion at this step. > This TODO could be probably updated. Done http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test: http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85 PS1, Line 85: > Please add some tests with 'where col=const' clauses to exercise the dictio Can you point me to the stat filtering code you are referring to? http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: decimal_stored_as_int32.parquet > I do not see this file in the patch and it is not mentioned in the README. It was an existing file. I added a section to the README describing its contents. Same for decimal_stored_as_int64.parquet and binary_decimal_no_dictionary.parquet http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848 PS1, Line 848: decimal_stored_as_int64.parquet > I do not see this file in the patch and it is not mentioned in the README. Done http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@856 PS1, Line 856: binary_decimal_no_dictionary.parquet > I do not see this file in the patch and it is not mentioned in the README. Done -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 20:15:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12163 to look at the new patch set (#3). Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata Allows Impala to read Parquet files that have been written with a lower precision / scale compared to the precision / scale specified during table creation. Currently, if the precision / scale in the table does not exactly match the precision / scale in the Parquet files, Impala will throw an error and the table will be unqueryable. This patch only allows reading Parquet files with a *lower* precision / scale compared to the table metadata. Users still get an error if they try to read higher precision / scale data into a lower precision / scale. The ability to read lower precision / scale data is allowed by the SQL Standard and several other engines, such as Hive, MySQL, and Postgres, allow for this behavior. If the underlying Parquet files contain rows that cannot be converted to the higher scale without overflow, the decimal slot is set to NULL and a warning is printed in the logs. This is consistent with what other engines, such as Hive do. If abort_on_error is set to true, then instead of setting the slot to NULL, the query fails. Testing: * Ran core tests * Added new tests in test_scanners.py Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 --- M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/binary_decimal_precision_and_scale_widening.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test M tests/common/test_result_verifier.py M tests/query_test/test_scanners.py 10 files changed, 423 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/3 -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc@237 PS2, Line 237: inline ALWAYS_INLINE bool ConvertDecimalPrecisionAndScale(const InternalType* src, void* slot); line too long (97 > 90) http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc@1027 PS2, Line 1027: << filename() << " column " << slot_desc()->type().DebugString(); tab used for whitespace -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 2 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 20:12:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12163 to look at the new patch set (#2). Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata Allows Impala to read Parquet files that have been written with a lower precision / scale compared to the precision / scale specified during table creation. Currently, if the precision / scale in the table does not exactly match the precision / scale in the Parquet files, Impala will throw an error and the table will be unqueryable. This patch only allows reading Parquet files with a *lower* precision / scale compared to the table metadata. Users still get an error if they try to read higher precision / scale data into a lower precision / scale. The ability to read lower precision / scale data is allowed by the SQL Standard and several other engines, such as Hive, MySQL, and Postgres, allow for this behavior. If the underlying Parquet files contain rows that cannot be converted to the higher scale without overflow, the decimal slot is set to NULL and a warning is printed in the logs. This is consistent with what other engines, such as Hive do. If abort_on_error is set to true, then instead of setting the slot to NULL, the query fails. Testing: * Ran core tests * Added new tests in test_scanners.py Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 --- M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/binary_decimal_precision_and_scale_widening.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test M tests/common/test_result_verifier.py M tests/query_test/test_scanners.py 10 files changed, 422 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/2 -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 2 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar