[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Sahil Takiar has abandoned this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Abandoned Abandoning for now as I am no longer working on this / don't plan on picking it up for a while. Will publish a new review if decide otherwise. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: (6 comments) Yeah, sorry I forgot about this change request. http://gerrit.cloudera.org:8080/#/c/12163/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12163/3//COMMIT_MSG@24 PS3, Line 24: that cannot be converted to : the higher scale without overflow Shouldn't we only allow conversions to decimals that have more digits before and after the decimal point? This way no overflow should occur, right? http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@237 PS3, Line 237: inline ALWAYS_INLINE bool ConvertDecimalPrecisionAndScale( nit: please add comment http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@248 PS3, Line 248: inline ALWAYS_INLINE void ConvertDecimalPrecision(const InternalType* src, void* slot); nit: Please add comment. Explain why it takes an InternalType while the CovertDecimalScale() takes a Decimal*Value. http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@978 PS3, Line 978: dst_dec, slot 'dst_dec' and 'slot' points to the same memory location, it's strange to pass both. http://gerrit.cloudera.org:8080/#/c/12163/3/be/src/exec/parquet/parquet-column-readers.cc@986 PS3, Line 986: dst_dec4, slot 'dst_dec4' and 'slot' points to the same memory location, why do we need to pass both? http://gerrit.cloudera.org:8080/#/c/12163/3/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/3/tests/query_test/test_scanners.py@948 PS3, Line 948: I don't see a test case for the error case, i.e. table metadata scale is lower than file metadata scale. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 04 Apr 2019 14:44:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: Zoltan, any plans to take another look at this one? -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 04 Apr 2019 01:05:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: (3 comments) Just noticed the comments about the README, but I plan to take a deeper look soon. http://gerrit.cloudera.org:8080/#/c/12163/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12163/3/testdata/data/README@183 PS3, Line 183: decimal_stored_as_int32.parquet: : Parquet file generated by Spark 2.3.1 that contains decimals stored as int32. : Impala needs to be able to read such values (IMPALA-5542) : : decimal_stored_as_int64.parquet: : Parquet file generated by Spark 2.3.1 that contains decimals stored as int64. : Impala needs to be able to read such values (IMPALA-5542) Here. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: pressed_page_size_summaries: > Thanks for taking care of the other files too! The decimal_stored_as_* files were in the README already. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848 PS1, Line 848: " limit 10") > I do not see this file in the patch and it is not mentioned in the README. This was already in the README. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 12 Feb 2019 17:51:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test: http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85 PS1, Line 85: > Can you point me to the stat filtering code you are referring to? The stats are read here: https://github.com/apache/impala/blob/fa672909c868f76aa50e9fb756114c32daaf6d9b/be/src/exec/parquet/hdfs-parquet-scanner.cc#L538 With a few hops it arrives to the function I had to specialize for timestamps: https://github.com/apache/impala/blob/fa672909c868f76aa50e9fb756114c32daaf6d9b/be/src/exec/parquet/parquet-column-stats.inline.h#L103 They are evaluated a few lines later with the assumption that they have the same scale and precision as the SQL type. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: pressed_page_size_summaries: > It was an existing file. I added a section to the README describing its con Thanks for taking care of the other files too! -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Tue, 29 Jan 2019 17:04:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1910/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 21:27:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1909/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 2 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 21:06:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@243 PS1, Line 243: // TODO: we could allow a mismatch and do a conversion at this step. > This TODO could be probably updated. Done http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@255 PS1, Line 255: // TODO: we could allow a mismatch and do a conversion at this step. > This TODO could be probably updated. Done http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test: http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85 PS1, Line 85: > Please add some tests with 'where col=const' clauses to exercise the dictio Can you point me to the stat filtering code you are referring to? http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: decimal_stored_as_int32.parquet > I do not see this file in the patch and it is not mentioned in the README. It was an existing file. I added a section to the README describing its contents. Same for decimal_stored_as_int64.parquet and binary_decimal_no_dictionary.parquet http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848 PS1, Line 848: decimal_stored_as_int64.parquet > I do not see this file in the patch and it is not mentioned in the README. Done http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@856 PS1, Line 856: binary_decimal_no_dictionary.parquet > I do not see this file in the patch and it is not mentioned in the README. Done -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 20:15:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12163 to look at the new patch set (#3). Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata Allows Impala to read Parquet files that have been written with a lower precision / scale compared to the precision / scale specified during table creation. Currently, if the precision / scale in the table does not exactly match the precision / scale in the Parquet files, Impala will throw an error and the table will be unqueryable. This patch only allows reading Parquet files with a *lower* precision / scale compared to the table metadata. Users still get an error if they try to read higher precision / scale data into a lower precision / scale. The ability to read lower precision / scale data is allowed by the SQL Standard and several other engines, such as Hive, MySQL, and Postgres, allow for this behavior. If the underlying Parquet files contain rows that cannot be converted to the higher scale without overflow, the decimal slot is set to NULL and a warning is printed in the logs. This is consistent with what other engines, such as Hive do. If abort_on_error is set to true, then instead of setting the slot to NULL, the query fails. Testing: * Ran core tests * Added new tests in test_scanners.py Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 --- M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/binary_decimal_precision_and_scale_widening.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test M tests/common/test_result_verifier.py M tests/query_test/test_scanners.py 10 files changed, 423 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/3 -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc@237 PS2, Line 237: inline ALWAYS_INLINE bool ConvertDecimalPrecisionAndScale(const InternalType* src, void* slot); line too long (97 > 90) http://gerrit.cloudera.org:8080/#/c/12163/2/be/src/exec/parquet/parquet-column-readers.cc@1027 PS2, Line 1027: << filename() << " column " << slot_desc()->type().DebugString(); tab used for whitespace -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 2 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 28 Jan 2019 20:12:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12163 to look at the new patch set (#2). Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata .. IMPALA-7087: Read Parquet decimal columns with lower precision/scale than table metadata Allows Impala to read Parquet files that have been written with a lower precision / scale compared to the precision / scale specified during table creation. Currently, if the precision / scale in the table does not exactly match the precision / scale in the Parquet files, Impala will throw an error and the table will be unqueryable. This patch only allows reading Parquet files with a *lower* precision / scale compared to the table metadata. Users still get an error if they try to read higher precision / scale data into a lower precision / scale. The ability to read lower precision / scale data is allowed by the SQL Standard and several other engines, such as Hive, MySQL, and Postgres, allow for this behavior. If the underlying Parquet files contain rows that cannot be converted to the higher scale without overflow, the decimal slot is set to NULL and a warning is printed in the logs. This is consistent with what other engines, such as Hive do. If abort_on_error is set to true, then instead of setting the slot to NULL, the query fails. Testing: * Ran core tests * Added new tests in test_scanners.py Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 --- M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/binary_decimal_precision_and_scale_widening.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test M tests/common/test_result_verifier.py M tests/query_test/test_scanners.py 10 files changed, 422 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/2 -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 2 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: > > Starting a new thread, since the last one was getting long. > > > > I started working on getting this patch in line with Hive's > > behavior - e.g. if an overflow occurs, set the row to NULL and > emit > > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot. > > The ConvertSlot method returns a bool, if true the execution > > continues, if false the query fails. There is no clean way for > > ConvertSlot to set the tuple to NULL and continue processing. > > ScalarColumnReader::ValidateValue on the other had does allow for > > this. If ValidateValue returns false, and the parse_status is > ok() > > then the tuple is set to NULL and execution continues. If it > > returns false and the parse_status() is an error, it aborts the > > query. If it returns true, execution continues. > > > > I checked and all current implementations of ConvertSlot always > > return true, no matter what. So I propose changing ConvertSlot's > > return semantics to match those of ValidateValue. > > > > Any objections? > > I would still prefer the "exclude the whole file with warning if > overflow is possible" approach, as it would: > - give a useful error message > - probably make conversion faster, as a simple multiplication would > be enough > - would be simpler and need less testing > > If this is not an option (because enabling the potentially > overflowing conversion is really useful in some use cases), then I > agree with changing ConvertSlot() as you described. A few points to consider on the "exclude the whole file with warning if overflow is possible" approach 1: I don't think we should exclude entire files with just a warning only because an overflow might be possible. That could significantly throw off query results. I think a more fine grained approach where we essentially exclude individual rows by setting them to NULL would be better. 2: Allowing for overflows only gives the user more flexibility; if they don't want overflows, they can set abort_on_error on the query will fail instead of setting rows to NULL. 3: Setting overflowed slots to NULL is consistent with how Impala handles similar issues elsewhere; most of the ValidateValue methods follow this approach. 4: Yes, there is a performance overhead to checking overflows. I haven't checked how much it is, but consider that many decimal operations done by Impala expressions invokes the overflow check as well (e.g. decimal multiplication, addition, etc). 5: Currently, if an Impala query multiplies two slots and the result overflows, the entire query fails. So I'm also ok with keeping the current behavior of this patch since it matches how Impala currently deals with overflows. However, I would still prefer to follow Hive's behavior because (1) consistency with Hive is useful, and (2) users can still set abort_on_error=true if they want. If discussing this over Hangout is easier, happy to setup a call. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 24 Jan 2019 15:47:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: > Starting a new thread, since the last one was getting long. > > I started working on getting this patch in line with Hive's > behavior - e.g. if an overflow occurs, set the row to NULL and emit > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot. > The ConvertSlot method returns a bool, if true the execution > continues, if false the query fails. There is no clean way for > ConvertSlot to set the tuple to NULL and continue processing. > ScalarColumnReader::ValidateValue on the other had does allow for > this. If ValidateValue returns false, and the parse_status is ok() > then the tuple is set to NULL and execution continues. If it > returns false and the parse_status() is an error, it aborts the > query. If it returns true, execution continues. > > I checked and all current implementations of ConvertSlot always > return true, no matter what. So I propose changing ConvertSlot's > return semantics to match those of ValidateValue. > > Any objections? > Starting a new thread, since the last one was getting long. > > I started working on getting this patch in line with Hive's > behavior - e.g. if an overflow occurs, set the row to NULL and emit > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot. > The ConvertSlot method returns a bool, if true the execution > continues, if false the query fails. There is no clean way for > ConvertSlot to set the tuple to NULL and continue processing. > ScalarColumnReader::ValidateValue on the other had does allow for > this. If ValidateValue returns false, and the parse_status is ok() > then the tuple is set to NULL and execution continues. If it > returns false and the parse_status() is an error, it aborts the > query. If it returns true, execution continues. > > I checked and all current implementations of ConvertSlot always > return true, no matter what. So I propose changing ConvertSlot's > return semantics to match those of ValidateValue. > > Any objections? > Starting a new thread, since the last one was getting long. > > I started working on getting this patch in line with Hive's > behavior - e.g. if an overflow occurs, set the row to NULL and emit > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot. > The ConvertSlot method returns a bool, if true the execution > continues, if false the query fails. There is no clean way for > ConvertSlot to set the tuple to NULL and continue processing. > ScalarColumnReader::ValidateValue on the other had does allow for > this. If ValidateValue returns false, and the parse_status is ok() > then the tuple is set to NULL and execution continues. If it > returns false and the parse_status() is an error, it aborts the > query. If it returns true, execution continues. > > I checked and all current implementations of ConvertSlot always > return true, no matter what. So I propose changing ConvertSlot's > return semantics to match those of ValidateValue. > > Any objections? I would still prefer the "exclude the whole file with warning if overflow is possible" approach, as it would: - give a useful error message - probably make conversion faster, as a simple multiplication would be enough - would be simpler and need less testing If this is not an option (because enabling the potentially overflowing conversion is really useful in some use cases), then I agree with changing ConvertSlot() as you described. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 10 Jan 2019 22:11:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: Starting a new thread, since the last one was getting long. I started working on getting this patch in line with Hive's behavior - e.g. if an overflow occurs, set the row to NULL and emit a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot. The ConvertSlot method returns a bool, if true the execution continues, if false the query fails. There is no clean way for ConvertSlot to set the tuple to NULL and continue processing. ScalarColumnReader::ValidateValue on the other had does allow for this. If ValidateValue returns false, and the parse_status is ok() then the tuple is set to NULL and execution continues. If it returns false and the parse_status() is an error, it aborts the query. If it returns true, execution continues. I checked and all current implementations of ConvertSlot always return true, no matter what. So I propose changing ConvertSlot's return semantics to match those of ValidateValue. Any objections? -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 10 Jan 2019 20:42:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: > > (1 comment) > > > > Thanks for taking a look @Csaba! I've only responded to your > > comment on overflow handling so far, as I want to get that > behavior > > locked down first. > > Are you sure that the query will be aborted - isn't it only one > warning / file? If there are "normal" files processed before the > one with overflows, then some rows will be returned before running > to the error. In the current implementation it is also possible > that some rows are returned from the file before reaching the > overflowing value. > > I am also not completely sure about the best approach. > > I would prefer the error to be non-fatal, so to print one > warning/file and continue the query. > > If no important use case is lost this way, then I would prefer to > use the formula and skip files based on metadata instead of > checking every row. > > This would make it possible to print a useful error message, so > that the user will know the precision needed to process the file, > and can call ALTER TABLE to increase precision / decrease scale and > make the file usable. If the error is printed based the first row, > then this can become a long iterative process. Thats a fair point. It is throwing an error. The last test in parquet-decimal-precision-and-scale-widening.test has a `CATCH` section, so if I understand the logic in ImpalaTestSuite#run_test_case then that means the query threw an exception. I agree that its possible for some rows to be returned before this error is throw, but I guess thats an issue with many runtime errors in Impala. I'm good with printing a non-fatal error and setting the row to NULL. This is what Hive does, so we will be consistent with Hive. Using the formula will prevent overflows from happening in the first place, so it somewhat orthogonal to the discussion of error handling. The only drawback I can see to using the formula you suggested is that it is not exactly SQL Standard compliant. It's entirely possible that the formula prevents a query from running, even though none of the values in the table would cause an overflow. The standard says that values should be rounded or truncated to the match target precision / scale, but I guess it doesn't explicitly mention overflows so its hard to say. I think that following Hive's behavior might be best here. I've added @Lars to the review to see what he thinks. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 07 Jan 2019 21:32:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: > (1 comment) > > Thanks for taking a look @Csaba! I've only responded to your > comment on overflow handling so far, as I want to get that behavior > locked down first. Are you sure that the query will be aborted - isn't it only one warning / file? If there are "normal" files processed before the one with overflows, then some rows will be returned before running to the error. In the current implementation it is also possible that some rows are returned from the file before reaching the overflowing value. I am also not completely sure about the best approach. I would prefer the error to be non-fatal, so to print one warning/file and continue the query. If no important use case is lost this way, then I would prefer to use the formula and skip files based on metadata instead of checking every row. This would make it possible to print a useful error message, so that the user will know the precision needed to process the file, and can call ALTER TABLE to increase precision / decrease scale and make the file usable. If the error is printed based the first row, then this can become a long iterative process. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 07 Jan 2019 18:22:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: (1 comment) Thanks for taking a look @Csaba! I've only responded to your comment on overflow handling so far, as I want to get that behavior locked down first. http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG@25 PS1, Line 25: If the underlying Parquet files contain rows that cannot be converted to : the higher scale without overflow, an error is thrown and the query : fails. This is different from other engines such as Hive which set the : row to NULL and allow the query to run to completion without error. > I do not know the use case, so I may be wrong, but I dislike the idea of fa Yes, the example you provided would trigger an overflow. The formula you provided sounds like it should work. I couldn't find much info in the SQL standard on how to handle these types of overflows, so maybe the behavior is ambiguous. For relational databases, you get an error when you try to insert a row that causes an overflow. However, the error is upon row insertion, so the table is still queryable after a failed attempt at inserting the row. As mentioned in the comments, Hive sets the row to NULL, and the query runs to completion. So sounds like there are three options: (1) the current behavior in the patch, (2) use the formula you provided to prevent overflows in which case no runtime errors or warnings are necessary, or (3) set the row to NULL and print out a warning (I think we do this for other types of Parquet scanning errors?). FWIW, the current behavior is exactly what you described. If you put a Parquet file into a table with a lower precision / scale, the table will be unqueryable until the files are removed. So if we went with option (2), any file inside the table that doesn't match the formula will make the table unqueryable. The main difference with this patch is that the table will be unqueryable only if certain rows that overflow are present. I don't feel strongly about any of the approaches, and I'm not super familiar with how Impala handles these types of issues, so open to suggestions. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Mon, 07 Jan 2019 16:26:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: (7 comments) http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG@25 PS1, Line 25: If the underlying Parquet files contain rows that cannot be converted to : the higher scale without overflow, an error is thrown and the query : fails. This is different from other engines such as Hive which set the : row to NULL and allow the query to run to completion without error. I do not know the use case, so I may be wrong, but I dislike the idea of failing the query if the Parquet file is not corrupt and only the data is "too large". If a file with such data is added to a table, the table can become unusable and Impala offers no tool to get rid of the problematic file. Would it make sense to be stricter during scale + precision checking to exclude the possibility of overflow? If I understand correctly, overflow can only occur if the scale in the file is lower then the expected scale: e.g value 2 in decimal (1,0) would overflow when converted to decimal(1, 1). If precision_file - scale_file <= precision_expected - scale_expected, then the conversion should happen without overflow, shouldn't it? http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@243 PS1, Line 243: // TODO: we could allow a mismatch and do a conversion at this step. This TODO could be probably updated. http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@255 PS1, Line 255: // TODO: we could allow a mismatch and do a conversion at this step. This TODO could be probably updated. http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test: http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85 PS1, Line 85: Please add some tests with 'where col=const' clauses to exercise the dictionary and stat filtering code paths. Dictionary filtering is disabled if 'needs_conversion_' is true, but it may be enabled in the future if conversion will be moved to dictionary construction. I think that stat filtering won't work correctly with the current implementation - the stat value will be interpreted as if it had the same precision/scale as the SQL column, which can lead to incorrectly dropping row groups. I had to deal with somewhat similar problems in https://gerrit.cloudera.org/#/c/11057/ http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: decimal_stored_as_int32.parquet I do not see this file in the patch and it is not mentioned in the README. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848 PS1, Line 848: decimal_stored_as_int64.parquet I do not see this file in the patch and it is not mentioned in the README. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@856 PS1, Line 856: binary_decimal_no_dictionary.parquet I do not see this file in the patch and it is not mentioned in the README. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Comment-Date: Mon, 07 Jan 2019 15:14:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 ) Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1715/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Comment-Date: Fri, 04 Jan 2019 22:20:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale
Sahil Takiar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12163 Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale .. IMPALA-7087: Read Parquet decimal columns with lower precision/scale Allows Impala to read Parquet files that have been written with a lower precision / scale compared to the precision / scale specified during table creation. Currently, if the precision / scale in the table does not exactly match the precision / scale in the underlying Parquet files, Impala will throw an error during any attempt to read data from the table. This patch only allows reading Parquet files with a *lower* precision / scale compared to the table metadata. Users still get an error if they try to read higher precision / scale data into a lower precision / scale. The ability to read lower precision / scale data is allowed by the SQL Standard and several other engines, such as Hive, MySQL, and Postgres, allow for this behavior. If the underlying Parquet files contain rows that cannot be converted to the higher scale without overflow, an error is thrown and the query fails. This is different from other engines such as Hive which set the row to NULL and allow the query to run to completion without error. Testing: * Ran core tests * Added new tests in test_scanners.py Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 --- M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-metadata-utils.cc M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/binary_decimal_precision_and_scale_widening.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test M tests/query_test/test_scanners.py 7 files changed, 357 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/12163/1 -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar