[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 18: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 09 Apr 2019 08:53:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Reviewed-on: http://gerrit.cloudera.org:8080/12247 Tested-by: Impala Public Jenkins Reviewed-by: Csaba Ringhofer --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 611 insertions(+), 117 deletions(-) Approvals: Impala Public Jenkins: Verified Csaba Ringhofer: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 19 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 18: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 09 Apr 2019 05:27:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 17: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2694/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 17 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 09 Apr 2019 00:51:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 18: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3998/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 09 Apr 2019 00:28:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#17). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 611 insertions(+), 117 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/17 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 17 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 16: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3994/ -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 16 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 08 Apr 2019 21:53:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 16: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 16 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 08 Apr 2019 16:46:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 16: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3994/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 16 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 08 Apr 2019 16:46:07 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2672/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 15 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 08 Apr 2019 15:36:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 15: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 15 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 08 Apr 2019 15:25:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#15). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 611 insertions(+), 117 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/15 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 15 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 15: Patch set 15 fixes a copy-paste error. I guess that I ran timestamp-test without rebuilding before uploading that last change, as it was broken before this fix. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 15 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 08 Apr 2019 14:58:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2614/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 02 Apr 2019 19:15:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 14: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2612/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 02 Apr 2019 15:53:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 14: > Build Failed > > https://jenkins.impala.io/job/gerrit-code-review-checks/2609/ : > Initial code review checks failed. See linked job for details on > the failure. The error during code review check seems to be unrelated: E: Failed to fetch http://us-west-2.ec2.archive.ubuntu.com/ubuntu/pool/universe/o/objenesis/libobjenesis-java_2.2-1_all.deb 504 Gateway Time-out [IP: 34.210.25.51 80] -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 02 Apr 2019 15:16:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 14: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2609/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 02 Apr 2019 13:00:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 13: (3 comments) http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740 PS9, Line 740: EQ(MI > I feel I'm slightly in favor of adding proper distinct variables instead of All the names I came up with were awkward, so I reused the same name but put all the tests to different blocks. http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc@743 PS13, Line 743: Add 250ns > The comments should now reflect that you're setting it explicitly instead o Done http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h@196 PS13, Line 196: 1'000'000'000LL > We have NANOS_PER_SEC in timestamp-value-inline.h, use here, too, and elsew NANOS_PER_DAY comes from walltime.h, which is included in most places, but I did not want to include it in a frequently used header like this. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 02 Apr 2019 12:18:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#14). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 611 insertions(+), 117 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/14 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 13: Code-Review+2 (3 comments) Only had some nits, otherwise this looks good to me. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740 PS9, Line 740: EQ(MI > I have rewritten the code to always reset min_date. I think that it is easi I feel I'm slightly in favor of adding proper distinct variables instead of re-using min_date. http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc@743 PS13, Line 743: Add 250ns The comments should now reflect that you're setting it explicitly instead of adding. Also check the one for max* below http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h@196 PS13, Line 196: 1'000'000'000LL We have NANOS_PER_SEC in timestamp-value-inline.h, use here, too, and elsewhere in this file? -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Fri, 29 Mar 2019 23:37:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2573/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 28 Mar 2019 17:46:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#13). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 570 insertions(+), 107 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/13 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 12: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2572/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 28 Mar 2019 14:22:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 12: Patch 12 fixes the clang tidy error in timestamp-test + rebases and resolves conflicts related to query options. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 28 Mar 2019 13:57:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#12). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 570 insertions(+), 107 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/12 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 11: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2559/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 27 Mar 2019 15:30:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 10: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2558/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 27 Mar 2019 15:10:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 11: (19 comments) http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@17 PS9, Line 17: INT96_NANOS > I think the NANOS suffix is somewhat misleading as it suggests that this is I am open to a new name, but I want to consult with Hive / Parquet people first. This is a development query option at the moment, so I think that it is not a big issue to change the names in the future. About discouraging people from using INT96_NANOS: currently this is the only timestamp representation that can be read by other Hadoop components. I would wait for a Parquet-MR release with logical type support + some cross-component testing / benchmarking before advising anyone to use the new types. http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@42 PS9, Line 42: not > typo Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h File be/src/exec/parquet/parquet-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h@61 PS9, Line 61: that the inpu > nit: that the input type.. Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc@150 PS9, Line 150: parquet::TimeUnit time_unit; > nit: declare these before using them? Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@713 PS9, Line 713: // Test padding on double digits > I'd declare that when you use it Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@721 PS9, Line 721: ASSERT_TRUE(ParseFormatTokens(_ctx)) << "TC: " << i; > Maybe prefix each of these comments with a comment or assert of the current I have rewritten min_date assignments to use explicit timestamps. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@737 PS9, Line 737: TimestampValue min_date = TimestampValue::Parse("1400-01-01"); > This test does not correspond to the comment above, does it? I.e., it is ro I think that "floor" in the name should make it clear that it rounds toward -. Note that ToUnixTime() also rounds towards -. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740 PS9, Line 740: EQ(MI > The switch between us and ns confused me for a bit, not sure if there's a g I have rewritten the code to always reset min_date. I think that it is easier to understand now, but I can create new variables if you think that it would further improve readability. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@239 PS9, Line 239: 2262 > The famous Impala Year-2262 problem being conceived and I get to be part of :D http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@375 PS9, Line 375: > nit: 1970-01-01 Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@98 PS9, Line 98: SECONDS_PER_ > While you're here, this could probably be a constant, too, given it's used Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@133 PS9, Line 133: return true; > Nit: I think some of these newlines could go. Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@156 PS9, Line 156: > nit: indent 4 spaces Done http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@160 PS9, Line 160: } > nit: If it doesn't fit on one line, it should have braces {} Done http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py@848 PS9, Line 848: insert_stmt = """insert into {0} values > Can you add a test for the case that values don't fit into 64bit nanos? Done http://gerrit.cloudera.org:8080/#/c/12247/10/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/12247/10/tests/query_test/test_insert_parquet.py@824 PS10, Line 824: ("1969-12-31 23:59:59.9"), This was broken when rounding towards nearest was
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#11). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 584 insertions(+), 109 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/11 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py@180 PS10, Line 180: o flake8: E225 missing whitespace around operator -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 27 Mar 2019 14:28:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#10). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 585 insertions(+), 109 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/10 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 9: (17 comments) http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@17 PS9, Line 17: INT96_NANOS I think the NANOS suffix is somewhat misleading as it suggests that this is just ns precision with a larger range. Since we also want to discourage use of this I suggest to name it INT96_DEPRECATED (or INT96_IMPALA since this is where it came from). http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@42 PS9, Line 42: net typo http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h File be/src/exec/parquet/parquet-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h@61 PS9, Line 61: that the type nit: that the input type.. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc@150 PS9, Line 150: parquet::TimestampType timestamp_type; nit: declare these before using them? http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@713 PS9, Line 713: int64_t tm_min_micros, tm_min_millis; I'd declare that when you use it http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@721 PS9, Line 721: // Add 250ns and check the value is rounded down Maybe prefix each of these comments with a comment or assert of the current value of min_time. Otherwise one has to count how many times we added 250ns all the time. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@737 PS9, Line 737: EXPECT_TRUE(min_date.FloorUtcToUnixTimeMicros(_min_micros)); This test does not correspond to the comment above, does it? I.e., it is rounding towards -? http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740 PS9, Line 740: 250us The switch between us and ns confused me for a bit, not sure if there's a good way to make it stand out more. Also, some of these blocks add to the current value of min_date, while other blocks reset it. Maybe you can make more clear by not writing to it and using new variables instead (here and in the other blocks) http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@239 PS9, Line 239: 2262 The famous Impala Year-2262 problem being conceived and I get to be part of it. :) http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@375 PS9, Line 375: 1970.01.01 nit: 1970-01-01 http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@98 PS9, Line 98: 24 * 60 * 60 While you're here, this could probably be a constant, too, given it's used in other places. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@133 PS9, Line 133: Nit: I think some of these newlines could go. http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@156 PS9, Line 156: static_cast(unixtime_seconds) * NANOS_PER_SEC nit: indent 4 spaces http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@160 PS9, Line 160: || nanos128 > std::numeric_limits::max()) return false; nit: If it doesn't fit on one line, it should have braces {} http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py@848 PS9, Line 848: ColumnStats('ts', -1001001, 1001001, 0) Can you add a test for the case that values don't fit into 64bit nanos? http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180 PS9, Line 180: normpath Why is "normpath" needed here? http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180 PS9, Line 180: table_dir = tmp_dir + "/" + os.path.basename(os.path.normpath(hdfs_path)) use os.path.join -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType:
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 9: Code-Review+1 (5 comments) http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39 PS8, Line 39: without conversion to UTC > The main problem with normalizing to UTC from Impala's standpoint is the pe OK, thanks for the explanation for both of you! http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579 PS8, Line 579: result_ > result_ shouldn't be a local variable, as we return a pointer to it. This i Ah, I completely missed that. http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142 PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC > That's true, but it has a metadata field to tell the caller the desired sem Sorry, I was really thinking about Hive and Spark. But it doesn't really matter, since now we have this 'isAdjustedToUTC' field. And Zoltan also said in one of his comments that "'pure' TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalized to UTC" http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154 PS8, Line 154: // TODO: consider optimizing this (IMPALA-8268) : kudu::int128_t nanos128 = : static_cast(unixtime_seconds) * NANOS_PER_SEC : + time_.fractional_seconds(); : : if (nanos128 < std::numeric_limits::min() > I created a ticket for benchmarking and optimizing these new functions: IMP OK, thanks! http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test: http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102 PS8, Line 102: QUERY : create table int96_nanos (ts timestamp) stored as parquet; : : QUERY : # Insert edge values as "normal" int96 timestamps that can represent all values. : set parquet_timestamp_type=INT96_NANOS; : insert into int96_nanos values > Thanks for the tip! You're welcome! -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Fri, 01 Mar 2019 15:43:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142 PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC > As far as I know Parquet-MR does not do any timezone conversion and leaves That's true, but it has a metadata field to tell the caller the desired semantics, which can be either UTC-normalized or not. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Fri, 01 Mar 2019 14:11:43 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2316/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Fri, 01 Mar 2019 13:46:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: (7 comments) http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39 PS8, Line 39: without conversion to UTC > No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalize The main problem with normalizing to UTC from Impala's standpoint is the performance cost of the UTC->localtime conversion when reading it back. Doing the timezone conversion for all timestamps in all rows is very costly compared to other tasks during Parquet scanning. http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39 PS8, Line 39: tha > nit: the Done http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579 PS8, Line 579: result_ > What about deleting the member 'result_', and only have it here as a local result_ shouldn't be a local variable, as we return a pointer to it. This is the expected interface by BaseColumnWriter::AppendRow() - it has no template for the type of the column, so it handles values with a void* that points to the value that should be inserted to the current row, and is expected to live until we step to the next row. http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h File be/src/exec/parquet/parquet-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60 PS8, Line 60: Return > nit: Returns Done http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142 PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC > Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka As far as I know Parquet-MR does not do any timezone conversion and leaves this task to the caller, e.g. Hive. http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154 PS8, Line 154: kudu::int128_t nanos128 = : static_cast(unixtime_seconds) * NANOS_PER_SEC : + time_.fractional_seconds(); : : if (nanos128 < std::numeric_limits::min() : || nanos128 > std::numeric_limits::max()) return false; > I think we can still avoid using int128_t. I created a ticket for benchmarking and optimizing these new functions: IMPALA-8268 Performance is not too important at the moment as parquet_timestamp_type is a development query option and should be mainly used to test whether the new timestamps can be read by other Hadoop components. I think that the most costly things in this function are time_.fractional_seconds() and UtcToUnixTime()'s time_.total_seconds(), as these need int64 integer division. These could be avoided by not using UtcToUnixTime and converting to nanoseconds from day_ + time_ directly, but this has to be done carefully near the edge values. http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test: http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102 PS8, Line 102: QUERY : create table int96_nanos (ts timestamp) stored as parquet; : : QUERY : # Insert edge values as "normal" int96 timestamps that can represent all values. : set parquet_timestamp_type=INT96_NANOS; : insert into int96_nanos values > nit: you dont't need to start a new QUERY block for each query when you don Thanks for the tip! -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Fri, 01 Mar 2019 13:35:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#9). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 537 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/9 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142 PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC > Does Parquet-MR also write INT64 timestamps un-normalized? Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka Instant and LocalDateTime semantics. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 28 Feb 2019 17:17:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39 PS8, Line 39: without conversion to UTC > Whouldn't it be better to convert to UTC? You write later that old readers No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalized to UTC, only the TIMESTAMP WITH LOCAL TIME ZONE type shall be. Please see this document: https://goo.gl/VV88c5 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 28 Feb 2019 17:16:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: (7 comments) http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39 PS8, Line 39: without conversion to UTC Whouldn't it be better to convert to UTC? You write later that old readers also assume that the data is written in UTC. http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39 PS8, Line 39: tha nit: the http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579 PS8, Line 579: result_ What about deleting the member 'result_', and only have it here as a local variable? Now it looks strange that we invoke a member function here and pass a member variable to it. Also, this way we will have some aliasing inside ConvertTimestamp() that might prevent some compiler optimizations, but I'm not really sure honestly. http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h File be/src/exec/parquet/parquet-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60 PS8, Line 60: Return nit: Returns http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142 PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC Does Parquet-MR also write INT64 timestamps un-normalized? http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154 PS8, Line 154: kudu::int128_t nanos128 = : static_cast(unixtime_seconds) * NANOS_PER_SEC : + time_.fractional_seconds(); : : if (nanos128 < std::numeric_limits::min() : || nanos128 > std::numeric_limits::max()) return false; I think we can still avoid using int128_t. http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test: http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102 PS8, Line 102: QUERY : create table int96_nanos (ts timestamp) stored as parquet; : : QUERY : # Insert edge values as "normal" int96 timestamps that can represent all values. : set parquet_timestamp_type=INT96_NANOS; : insert into int96_nanos values nit: you dont't need to start a new QUERY block for each query when you don't check the results. E.g. it could be: QUERY create table... set ... insert into ... create table ... -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 28 Feb 2019 15:35:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2101/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 13 Feb 2019 19:18:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 8: (1 comment) About the last 3 patch sets: patch set 6 changes nanos->millis/micros conversions to round towards minus infinity instead of the nearest integer patch set 7 fixes a comment by Zoltan Borok-Nagy and a copy-paste error in patch 6 patch set 8 is a rebase + conflict resolution. Rebasing was necessary because of Impala-lzo changes broke the build. About checking the written Parquet files with Parquet tool: thanks for the tool - I used it to check the written files manually, but I would prefer to integrate it into Impala in a separate commit. A jira to track this: IMPALA-8197 http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583 PS3, Line 583: successful and return true. If the timestamp is invalid > It seemed premature optimization to me, so I did some measurements and look Done -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 13 Feb 2019 19:18:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#8). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (tha way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 542 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/8 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 7: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2100/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 13 Feb 2019 18:32:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#7). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (tha way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 540 insertions(+), 67 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/7 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 6: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2097/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 13 Feb 2019 18:01:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#6). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (tha way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 539 insertions(+), 67 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/6 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 5: Some update: There are two design decisions under discussion at the moment, one is the type of rounding to use when some precision is lost, and the other is whether to convert out-of-range int64 nanos to NULL silently or to abort with an error in this case. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 5 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 30 Jan 2019 15:20:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583 PS3, Line 583: Overrides of this function are expected to set 'result_' > I prefer the current solution - my guess is that adding an additional argum It seemed premature optimization to me, so I did some measurements and looked at the generated assembly. Both 'this' and the output parameter are passed in via registers. In the current case you need less registers, but you need to access 'result_' via 'this' plus an offset. In the other case, you need one more register to pass the output parameter, but you don't need to add an offset to 'this'. I couldn't measure any performance difference between the two approaches. So if performance was the only reason to do this, I'd rather choose the conventional way. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 24 Jan 2019 17:13:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 5: > About 'parquet-tools dump': it would be great to use it during > Impala EE tests, but adding it as a dependency would also mean +1 > way Impala builds can break, so I am not sure at the moment. I am very confident that we should have this extra check. Adding parquet-tools to the build toolchain is not that difficult as it does not have to be installed. I created a small Maven project that runs it without installing it (in the conventional sense at least). The whole project consists of a single POM file and nothing else. Feel free to add it to Implala and modify it to fit your needs: https://github.com/zivanfi/parquet-tools-executor -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 5 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Thu, 24 Jan 2019 14:04:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1862/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 5 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 23 Jan 2019 21:18:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (7 comments) About 'parquet-tools dump': it would be great to use it during Impala EE tests, but adding it as a dependency would also mean +1 way Impala builds can break, so I am not sure at the moment. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400 PS3, Line 400: != > Shouldn't it be '=='? It is intentionally !=. -1 means variable length, and plain_encoded_value_size_ should remain -1 in that case. Actually this code led to DCHECK error during the writing of some decimals, which was not caught by the tests I ran locally. After running to issues on jenkins, I moved this logic Int64TimestampColumnWriterBase. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583 PS3, Line 583: Overrides of this function are expected to set 'result_' > I think an output parameter would be more conventional. I prefer the current solution - my guess is that adding an additional argument to a non-inline functions will make it slower, while writing to result_ is cheaper as 'this' is used during the function call anyway. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158 PS3, Line 158: // converted_type is not set because older readers that do not use logical types > Ah, yes, good point, I haven't noticed we are only dealing with LocalDateTi Done http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@135 PS3, Line 135: (NANOS_PER_MILLI / 2) > Might be worth to add a short comment about that we do the rounding here. The whole rounding logic is under discussion - Impala uses rounding to microsec when writing Kudu, but I think that truncating towards negative infinity would be better. I sent an email to d...@impala.apache.org about this topic, I plan to wait for an agreement there before changing these functions. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@151 PS3, Line 151: kudu::int128_t nanos128 = : static_cast(unixtime_seconds) * NANOS_PER_SEC : + time_.fractional_seconds(); : : if (nanos128 < std::numeric_limits::min() : || nanos128 > std::numeric_limits::max()) return false; > I think we can avoid using int128_t here. We now what are the min and max u I agree, but I am waiting for a decision in rounding vs truncating, see line 135. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728 PS3, Line 728: if (iequals(value, "INT96_NANO")) { > You are right, I think that I messed something up like running the tests wi Done http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift@1 PS3, Line 1: > nit: accidental new line Done -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 23 Jan 2019 20:50:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#5). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded to the nearest ms/us during writing. Values that would be rounded to year 1 are rounded down to avoid converting valid values to invalid ones. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (as Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 515 insertions(+), 66 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/5 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 5 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1853/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 4 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Wed, 23 Jan 2019 15:20:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#4). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded to the nearest ms/us during writing. Values that would be rounded to year 1 are rounded down to avoid converting valid values to invalid ones. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (as Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 516 insertions(+), 66 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/4 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 4 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728 PS3, Line 728: if (iequals(value, "INT96_NANO")) { > Shouldn't this typo have been caught by a test? You are right, I think that I messed something up like running the tests with an older version of impala running (my last change was adding "S" to the end of query options). I will return soon when the tests pass. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 22 Jan 2019 16:30:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158 PS3, Line 158: // converted_type is not set because older readers that do not use logical types > I mean I don't see that this function sets the converted type anywhere. Ah, yes, good point, I haven't noticed we are only dealing with LocalDateTime semantics here and regardless of the precision, there is no converted type for that. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 22 Jan 2019 16:01:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158 PS3, Line 158: // converted_type is not set because older readers that do not use logical types > Personally I'm fine with the current name, it sets the converted type when I mean I don't see that this function sets the converted type anywhere. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 22 Jan 2019 15:56:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158 PS3, Line 158: // converted_type is not set because older readers that do not use logical types > Then I think you should change the name of the function. Personally I'm fine with the current name, it sets the converted type when it makes sense. I don't know how this could be described without making the function name overly long. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728 PS3, Line 728: if (iequals(value, "INT96_NANO")) { > It should be "INT96_NANOS" Shouldn't this typo have been caught by a test? -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 22 Jan 2019 15:44:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400 PS3, Line 400: != Shouldn't it be '=='? http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583 PS3, Line 583: Overrides of this function are expected to set 'result_' I think an output parameter would be more conventional. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158 PS3, Line 158: // converted_type is not set because older readers that do not use logical types Then I think you should change the name of the function. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@135 PS3, Line 135: (NANOS_PER_MILLI / 2) Might be worth to add a short comment about that we do the rounding here. http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@151 PS3, Line 151: kudu::int128_t nanos128 = : static_cast(unixtime_seconds) * NANOS_PER_SEC : + time_.fractional_seconds(); : : if (nanos128 < std::numeric_limits::min() : || nanos128 > std::numeric_limits::max()) return false; I think we can avoid using int128_t here. We now what are the min and max unix_timeseconds that can be represented in the limited range. And if we are at the "boundary" (e.g. 2262-04-11 23:47:16), we now what are the limits of the fractional_seconds (e.g. 854775807). http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728 PS3, Line 728: if (iequals(value, "INT96_NANO")) { It should be "INT96_NANOS" http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift@1 PS3, Line 1: nit: accidental new line -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 22 Jan 2019 15:00:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: Would it be possible to run `parquet-tools dump` as a part of the tests to check whether the timestamp values can be read back correctly by parquet-mr? -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Tue, 22 Jan 2019 13:00:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1837/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 21 Jan 2019 18:45:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12247 to look at the new patch set (#3). Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded to the nearest ms/us during writing. Values that would be rounded to year 1 are rounded down to avoid converting valid values to invalid ones. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (as Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 512 insertions(+), 66 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/3 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 ) Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. Patch Set 2: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/1836/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 21 Jan 2019 17:54:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet
Csaba Ringhofer has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12247 Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet .. IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..1) range handled by Impala at the cost of reduced precision. Values are rounded to the nearest ms/us during writing. Values that would be rounded to year 1 are rounded down to avoid converting valid values to invalid ones. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the fronted by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I choose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (as Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is net set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64. Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/hdfs-parquet-table-writer.h M be/src/exec/parquet/parquet-common.cc M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/debug-util.cc M be/src/util/debug-util.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test M tests/query_test/test_insert_parquet.py M tests/util/get_parquet_metadata.py 18 files changed, 512 insertions(+), 66 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/2 -- To view, visit http://gerrit.cloudera.org:8080/12247 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Gerrit-Change-Number: 12247 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba Ringhofer