[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Reviewed-on: http://gerrit.cloudera.org:8080/5939 Reviewed-by: Dan HechtTested-by: Impala Public Jenkins --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) Approvals: Impala Public Jenkins: Verified Dan Hecht: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 11: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 11: Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/526/ -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 11: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 11: > Uploaded patch set 11. Fixed WARN_UNUSED_RESULT warning that failed clang-tidy job. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#11). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/11 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Hello Impala Public Jenkins, Alex Behm, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5939 to look at the new patch set (#11). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/11 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 10: Verified-1 Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/521/ -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 10: Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/521/ -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 10: > Uploaded patch set 10. This is the rebased change. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#10). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/10 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Hello Alex Behm, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5939 to look at the new patch set (#10). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/10 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: PS8, Line 124: local_date_time lt(temp, timezone); > It says an exception can be thrown by the constructor, but I guess what's i Yes, exactly. According to the documentation local_date_time(posix_time::ptime, timezone_ptr) constructor won't throw an exception. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 9: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: PS8, Line 124: local_date_time lt(temp, timezone); > I couldn't find any indication of an exception in the code or in the docume It says an exception can be thrown by the constructor, but I guess what's it's saying is that can only happen for the constructors that takes a bool or flag? i.e. the one that takes UTC won't throw. Is that what you concluded? Just want to be super safe here since we've had problems with unhandled exceptions from boost. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 9: (5 comments) http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS8, Line 601: > ... in 'scanner_status'. Done PS8, Line 604: const TimestampValue* > please pass this as a 'Status* scanner_status', which is what we usually do Done http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: PS8, Line 124: local_date_time lt(temp, timezone); > can this throw an exception? I couldn't find any indication of an exception in the code or in the documentation: http://www.boost.org/doc/libs/1_57_0/doc/html/date_time/local_time.html#date_time.local_time.local_date_time PS8, Line 125: *this = lt.local_time(); > or that? Same here. http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/service/impala-server.cc File be/src/service/impala-server.cc: PS8, Line 122: tabl > it's the same for other FileSystem interfaced things, like S3. So maybe rem Done -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#9). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/9 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5939 to look at the new patch set (#9). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 850 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/9 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS8, Line 601: . ... in 'scanner_status'. PS8, Line 604: Status& scanner_status please pass this as a 'Status* scanner_status', which is what we usually do for out parameters to make it more obvious at the callsite that it's modified. And nice to put out parameters last. http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: PS8, Line 124: local_date_time lt(temp, timezone); can this throw an exception? PS8, Line 125: *this = lt.local_time(); or that? http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/service/impala-server.cc File be/src/service/impala-server.cc: PS8, Line 122: HDFS it's the same for other FileSystem interfaced things, like S3. So maybe remove "HDFS". -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 8: > (1 comment) Yes, I think we should do that. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 123: local_date_time lt(temp, timezone); > I've investigated the history of UtcToLocal(): Great, thanks for checking that. So then should we consider rewriting UtcToLocal() in terms of FromUtc()? Not as part of this change and something we can sort out later. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 123: ToPtime(); > I seem to remember there also being a lock in the boost code related to loc I've investigated the history of UtcToLocal(): - UtcToLocal() was first introduced in IMPALA-1658 (87b9fac2adda). This version of UtcToLocal() already used localtime_r() for UTC -> local time conversion. It also used couple boost functions to convert back and forth between 'ptime' and 'struct tm' types (to_tm(), ptime_from_tm(), nanoseconds()) - The first version of UtcToLocal() made querying tables with timestamps a lot slower (up to 30x times slower). The root cause was a global lock in localtime_r(). This is discussed in IMPALA-3316 and IMPALA-2125. Both JIRAs are still open. - UtcToLocal() was made 50% faster by 4ddce7f97880. This change got rid of the boost functions (to_tm(), ptime_from_tm(), nanoseconds()) and added the comment at L77. The change kept localtime_r() though, so it didn't do anything to solve the global lock problem. I wrote a simple benchmark program to confirm that UtcToLocal()'s performance degrades if the number of threads running it in parallel is increased. The benchmark program also confirms that FromUtc() does not have this problem. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 123: local_date_time lt(temp, timezone); > I've added a micro-benchmark to compare the two. UtcToLocal() is about 2.5 I seem to remember there also being a lock in the boost code related to locales that's avoided by using the standard C function, so the regression grows with concurrency. I might not be remembering correctly, though. Can you try to dig up the JIRA that introduced the use of localtime_r() to see if there's more details? -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#8). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 806 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/8 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 8: (20 comments) http://gerrit.cloudera.org:8080/#/c/5939/7//COMMIT_MSG Commit Message: PS7, Line 22: New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE : LIKE will set the table property to UTC if the global flag : --set_parque > I don't follow this given the next two paragraphs. This isn't true for the Correct. Changed the commit message. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 305: // parquet-mr. If conversion should not occur, this is set to an empty string. Otherwise > from the code, it looks like this is set to the empty string if conversion Correct, added a comment to document the behavior. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 234: FLAGS_convert_legacy_hive_parquet_utc_timestamps > given that parquet_mr_write_zone() will take precedence over the flag, I th Done PS7, Line 246: DCHECK > nit: DCHECK_EQ() Done PS7, Line 609: status = scanner_state->LogOrReturnError(msg); > since this code is very performance critical, it would be best to order thi Correct. Reordered the conditions in the if-elif-else statement PS7, Line 611: status = status; > these could all be UNLIKELY since we want to optimize for the non-error cas Done PS7, Line 625: LY(dst->HasDateAndTime() > i.e. here too and below. Done PS7, Line 639: if (!SetTimestampConversionError(parent_->scan_node_, parent_->state_, : parent_->parse_status_, src, parent_->scan_node_->parquet_mr_write_zone())) { : return false; : } : } : } else { : DCHECK(parent_->scan_node_->parquet_mr_write_zone().empty()); : DCH > seems worth factoring this block (which occurs 3 times) into a non-inlined Done. Added a new non-class member function and factored out the block into it. (Adding a new member function instead to ScalarColumnReader template class is probably not a good idea since it is only needed in the ScalarColumnReaderinstance) http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 193: > could you add a comment explaining these two lookup techniques? Done http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: PS7, Line 45: hen IsTimestampDependen > how about drawing the connection with the next function directly: Done Line 49: > comment this. Done http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 101: Status s("Failed to convert timestamp to local time."); > this might produce a ton of logging. instead, why not return the Status ob Done Line 109: bool TimestampValue::FromUtc(const std::string& timezone_str) { > how do we guarantee that this is true? did the caller already check that? Yes, it did. PS7, Line 111: _zone_ptr timezo > UNLIKELY (since perf critical) Done Line 123: ToPtime(); > see the comment at line 77. i think we were avoiding this for perf reasons. I've added a micro-benchmark to compare the two. UtcToLocal() is about 2.5 times faster than FromUtc() when input data size is 1,000 and 1.5 times faster when input data size is 10,000. Rewriting FromUtc() not to use boost function calls is not that straightforward. The above version of UtcToLocal() uses localtime_r() standard C function to perform the UTC->local timezone conversion. FromUtc() on the other hand should convert timestamps to a given time zone. I don't think there is a standard C function to do this. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: PS7, Line 197: > below you say "local time in the given timezone", which I guess is technica Changed the comment below for 'FromUtc' PS7, Line 203: > it's not clear that "convert" means that *this is modified in-place, so I t Done http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/service/impala-server.cc File be/src/service/impala-server.cc: PS7, Line 122: new > is that always the case? Fixed the description PS7, Line 122: bles created " : "using CREATE TABLE and CREATE TABLE LIKE . You can find details in the " : "documentation."); : : DEFINE_int64(max_result_cache_size, 1 > I'm not sure that this makes sense on it's own. I think you kind of have to Removed that part
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5939 to look at the new patch set (#8). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 28 files changed, 806 insertions(+), 74 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/8 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 7: (20 comments) http://gerrit.cloudera.org:8080/#/c/5939/7//COMMIT_MSG Commit Message: PS7, Line 22: New tables created by Impala will set the table property to UTC if the : global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is : set to true. I don't follow this given the next two paragraphs. This isn't true for the "LIKE " case, right? http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 305: // parquet-mr. FE guarantees that this is a valid time zone. from the code, it looks like this is set to the empty string if conversion should not occur? if that's true, please document that case. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 234: !parent_->scan_node_->parquet_mr_write_zone().empty() given that parquet_mr_write_zone() will take precedence over the flag, I think it would be clearer to reverse the order of these two conditionals. i.e. we don't care what the value of the flag is if the timestamp is set, right? (I realize it makes no difference functionally, but seems more consistent with the intent). PS7, Line 246: DCHECK nit: DCHECK_EQ() PS7, Line 609: parent_->scan_node_->parquet_mr_write_zone().empty() since this code is very performance critical, it would be best to order this in the order we want to optimize for. i.e. first check timezone_, then check is_timestamp_dependent_timezone_ and then only if neither is set, check parquet_mr_write_zone().empty() (which is also probably the most expensive check). i.e. move this condition to the end of the if-else-if chain. And then both of parquet_mr_write_zone().empty() and FLAGS_convert_legacy_hive_parquet_utc_timestamps must be true, right? (or else we wouldn't have needs_conversion_ set). And so you can just DCHECK() them. PS7, Line 611: !dst->UtcToLocal()) these could all be UNLIKELY since we want to optimize for the non-error case. PS7, Line 625: !dst->FromUtc(timezone_) i.e. here too and below. PS7, Line 639: ErrorMsg msg(TErrorCode::PARQUET_MR_TIMESTAMP_CONVERSION_FAILED, : src->DebugString(), parent_->scan_node_->parquet_mr_write_zone(), : parent_->scan_node_->hdfs_table()->fully_qualified_name()); : Status status = parent_->state_->LogOrReturnError(msg); : if (!status.ok()) { : parent_->parse_status_ = status; : return false; : } seems worth factoring this block (which occurs 3 times) into a non-inlined subroutine, which will help avoid I-cache pollution (and also make the code more readable). http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 193: could you add a comment explaining these two lookup techniques? http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: PS7, Line 45: for the Moscow timezone how about drawing the connection with the next function directly: (May not work correctly when IsTimestampDependentTimezone(tz), e.g. Moscow timezone). Line 49: static bool IsTimestampDependentTimezone(const std::string& tz); comment this. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 101: LOG(ERROR) << from_boost.what(); this might produce a ton of logging. instead, why not return the Status object from these, and include the what() in the details? then the error message deduping code in LogError() will take care of throttling it. Line 109: DCHECK(HasDateAndTime()); how do we guarantee that this is true? did the caller already check that? PS7, Line 111: timezone == NULL UNLIKELY (since perf critical) Line 123: local_date_time lt(temp, timezone); see the comment at line 77. i think we were avoiding this for perf reasons. did you measure the perf of this compared to UtcToLocal()? http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: PS7, Line 197: local time below you say "local time in the given timezone", which I guess is technically correct but then makes this statement confusing (becuase here the comment meant "local timezone"). So, how about either removing the words "local time in the" below, or change this comment to say "local time in the system timezone" PS7, Line 203: local time in the given timezone it's not clear that "convert" means that *this is modified in-place, so I think you should add that back in like the
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#7). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 26 files changed, 665 insertions(+), 73 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/7 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 7: (9 comments) http://gerrit.cloudera.org:8080/#/c/5939/6/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 113: "Invalid time zone in the '%s' table property: %s", > double 'the' Done http://gerrit.cloudera.org:8080/#/c/5939/6/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java: Line 665: > Let's move all the CREATE/ALTER tests into a separate TestParquetMrInt96Wri Done Line 1882: "' '", "URI path cannot be empty."); > easier to read single quotes Done Line 1904: type == PrimitiveType.VARCHAR) { > easier to read single quotes Done http://gerrit.cloudera.org:8080/#/c/5939/6/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py: Line 105: parquet_path = get_fs_path( > What does "fn" stand for? I'm thinking "file name", but this is not just a Renamed it to 'parquet_path' Line 123: ON i.id = h.id AND i.day = h.day -- serves as a unique key > easier to read with the alias 'i' next to the table Done Line 125: (h.timestamp_col IS NULL) != (i.timestamp_col IS NULL) > simplify the first two conditions with: Done. Had to put round brackets around col IS NULL expressions to avoid analysis errors. http://gerrit.cloudera.org:8080/#/c/5939/6/tests/query_test/test_parquet_timestamp_compatibility.py File tests/query_test/test_parquet_timestamp_compatibility.py: Line 78: def test_invalid_parquet_mr_write_zone(self, vector, unique_database): > test_invalid_parquet_mr_write_zone Done Line 118: # tz_name conversion on the timestamp values. > extra space after "triggers" Done -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#7). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 26 files changed, 665 insertions(+), 73 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/7 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#6). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 26 files changed, 657 insertions(+), 63 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#6). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 26 files changed, 657 insertions(+), 63 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py File tests/custom_cluster/test_parquet_timestamp_compatibility.py: Line 71: @pytest.mark.execute_serially > Moved tests that use startup options to ./tests/custom_cluster/test_hive_pa The rest of the tests were moved to ./tests/query_test/test_parquet_timestamp_compatibility.py -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#6). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M tests/common/impala_test_suite.py M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py A tests/query_test/test_parquet_timestamp_compatibility.py 26 files changed, 657 insertions(+), 63 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 5: (26 comments) Thanks for the review. http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 305: // parquet-mr. > Comment that the FE guarantees that this is a valid timezone. Done http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 246: DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) || > simplify condition: Done Line 549: /// Used to cache the timezone object corresponding to "parquet.mr.int96.write.zone" > the "parquet.mr.int96.write.zone" table property Done Line 550: /// table property to avoid repeated calls to 'TimezoneDatabase::FindTimezone()'. > no need to single-quote here Done http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 190: // See if they specified a zone id > Who is "they"? Suggest rephrasing Done Line 202: return time_zone_ptr(); > return NULL? Done http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/BackendGflags.thrift File common/thrift/BackendGflags.thrift: Line 58: // If true, all newly created Parquet tables will have the parquet.mr.int96.write.zone > ... all newly created HDFS tables regardless of format ... Done http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 218: 9: optional string parquet_mr_write_zone; > remove trailing semicolon for consistency Done http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: Line 169: private static void analyzeParquetMrWriteZone(Table table, > These cases need tests in AnalyzeDDL Done. I also changed the exception message for consistency. http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 112: throw new AnalysisException("Invalid Time Zone: " + timezone); > Mention that the timezone is in the table property 'parquet.mr.int96.write. Done http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 183: if (getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE)) { > These cases need tests in AnalyzeDDL Done Line 186: "Only HDFS tables can use '%s' table property.", > the '%s' table property Done. I also changed the exception message for consistency. http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 54: import org.apache.impala.common.InternalException; > not needed Done http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py File tests/custom_cluster/test_parquet_timestamp_compatibility.py: Line 25: class TestParquetTimestampCompatibility(CustomClusterTestSuite): > Test needs a comment. In particular, what is covered here and what is cover I've added a comment that describes tests in this file. What do you mean by "integration tests not part of Impala tests"? I don't think we have any integration tests upstream. Line 26: TEST_DB_NAME = 'timezone_test_db' > Why not use the unique_database fixture? That's the best practice. Switched to unique_database. Line 39: self.client = impalad.service.create_beeswax_client() > I think we already have a self.client from ImpalaTestSuite. Removed creating a client from here. Line 49: "/test-warehouse/alltypesagg_hive_13_1_parquet/alltypesagg_hive_13_1.parquet" > We need the appropriate filesystem prefix here. This will break on local fs Done Line 56: self.client.execute('USE ' + self.TEST_DB_NAME) > Why? Better to avoid changing the session state Removed it. Line 59: self.client.execute('INVALIDATE METADATA') > why? Removed it, it's not needed. Line 61: def _set_impala_table_timezone(self, timezone): > simplify to pass table name as param Done Line 71: @pytest.mark.execute_serially > How long does this test take? I'm thinking we should only have minimal test Moved tests that use startup options to ./tests/custom_cluster/test_hive_parquet_timestamp_conversion.py Line 81: statement = '''ALTER TABLE hive_table > Should be an analyzer test in AnalyzeDDL Removed it from here. Line 104: select_from_impala_table = '''SELECT timestamp_col FROM impala_table WHERE id = 1''' > Can you
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 5: (16 comments) http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: Line 169: private static void analyzeParquetMrWriteZone(Table table, These cases need tests in AnalyzeDDL http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 183: if (getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE)) { These cases need tests in AnalyzeDDL http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 54: import org.apache.impala.common.InternalException; not needed http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py File tests/custom_cluster/test_parquet_timestamp_compatibility.py: Line 25: class TestParquetTimestampCompatibility(CustomClusterTestSuite): Test needs a comment. In particular, what is covered here and what is covered in integration tests not part of Impala tests. Line 26: TEST_DB_NAME = 'timezone_test_db' Why not use the unique_database fixture? That's the best practice. Line 39: self.client = impalad.service.create_beeswax_client() I think we already have a self.client from ImpalaTestSuite. Line 49: "/test-warehouse/alltypesagg_hive_13_1_parquet/alltypesagg_hive_13_1.parquet" We need the appropriate filesystem prefix here. This will break on local fs or S3 builds. Take a look at: filesystem_utils.py get_fs_path() and how that is used elsewhere Line 56: self.client.execute('USE ' + self.TEST_DB_NAME) Why? Better to avoid changing the session state Line 59: self.client.execute('INVALIDATE METADATA') why? Line 61: def _set_impala_table_timezone(self, timezone): simplify to pass table name as param Line 71: @pytest.mark.execute_serially How long does this test take? I'm thinking we should only have minimal tests for the impalad startup option in a custom cluster test, probably in the existing test for convert_legacy_hive_parquet_utc_timestamps. The other tests should go into a regular test (subclass of ImpalaTestSuite) and run in parallel. Line 81: statement = '''ALTER TABLE hive_table Should be an analyzer test in AnalyzeDDL Line 104: select_from_impala_table = '''SELECT timestamp_col FROM impala_table WHERE id = 1''' Can you think of a way to validate all values in the table in bulk e.g. using our existing timestamp conversion functions? Having a few example values is still good, but we should also test that we are internally consistent with out conversion functions. Line 113: self._set_impala_table_timezone('EST') Add some comments about the behavior here. We are getting different values for the same timezone because the underlying Parquet files were written by Hive/Impala. Line 141: statement = '''ALTER TABLE hive_table analyzer test is more appropriate (also fix elsewhere) Line 185: def test_ddl(self, vector): We need a test where Hive sets a garbage timezone and Impala throws an error when analyzing a query against that table. We also need a test that shows what happens when a table property timezone is set and the convert_legacy_hive_parquet_utc_timestamps option is used. Probably good to consolidate with the existing custom cluster test. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 5: (10 comments) I'm pretty happy with the code! Moving on to the tests. http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 305: // parquet-mr. Comment that the FE guarantees that this is a valid timezone. http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 246: DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) || simplify condition: is_timestamp_dependent_timezone_ == (timezone_ == NULL) Line 549: /// Used to cache the timezone object corresponding to "parquet.mr.int96.write.zone" the "parquet.mr.int96.write.zone" table property Line 550: /// table property to avoid repeated calls to 'TimezoneDatabase::FindTimezone()'. no need to single-quote here http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 190: // See if they specified a zone id Who is "they"? Suggest rephrasing Line 202: return time_zone_ptr(); return NULL? http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/BackendGflags.thrift File common/thrift/BackendGflags.thrift: Line 58: // If true, all newly created Parquet tables will have the parquet.mr.int96.write.zone ... all newly created HDFS tables regardless of format ... http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 218: 9: optional string parquet_mr_write_zone; remove trailing semicolon for consistency http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 112: throw new AnalysisException("Invalid Time Zone: " + timezone); Mention that the timezone is in the table property 'parquet.mr.int96.write.zone'. Otherwise, the user might have no clue where this timezone comes from. http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 186: "Only HDFS tables can use '%s' table property.", the '%s' table property -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 5: (15 comments) http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 304: // Time zone for adjusting timestamp values read from Parquet files written by > single line That would make the line 92 characters long. http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 246: DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) || > Even better, we can check this at the scan node level, e.g., in HdfsScanNod Simplified the code. The timezone check is done in the FE, the BE is guaranteed to have a valid timezone string. Line 653: } > As mentioned above, we should terminate the query early when in such a bad Added DCHECK(false) http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 194: for (const string& tz_region: tz_region_list_) { > No need to use auto here. const string& is clearer Done http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: Line 40: /// If 'tz' is not found in the database, nullptr is returned. > null pointer -> nullptr Done Line 46: /// If 'tz' is not found in the database, nullptr is returned. > null pointer -> nullptr Done http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 199: /// conversion was successfull and false otherwise. If conversion failed *this is set to > Comment on what the state of *this is when false is returned. Done Line 202: > Comment on what the state of *this is when false is returned. Done http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 357: const char *tz = env->GetStringUTFChars(timezone, NULL); > const char* Done http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 122: "the parquet.mr.int96.write.zone table property to UTC for new tables. This changes " > consistent placing of space (at line end or at line beginning) Done http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 67: analyzeJoin(analyzer); > If we do the validation here, then there is no need to check in each HdfsSc Done http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 184: if (getFileFormat() == THdfsFileFormat.KUDU) { > You can check getFileFormat() here. The only non-HDFS formats we have are H Done Line 189: String timezone = getTblProperties().get(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE); > We should only do this for HDFS tables right? Done http://gerrit.cloudera.org:8080/#/c/5939/4/fe/src/main/java/org/apache/impala/service/FeSupport.java File fe/src/main/java/org/apache/impala/service/FeSupport.java: Line 87: // Returns true if timezone String is valid according to the BE timezone database, false > valid according to the BE timezone database Done http://gerrit.cloudera.org:8080/#/c/5939/4/tests/custom_cluster/test_parquet_timestamp_compatibility.py File tests/custom_cluster/test_parquet_timestamp_compatibility.py: Line 1: # Licensed to the Apache Software Foundation (ASF) under one > Use new Apache license header. Do we still have the old header in some exis My bad, I created this test from an old test script. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#5). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py 24 files changed, 569 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/5 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 67: analyzeParquetMRTimeZone(); > If we follow the pattern set for "skip.header.linecount", then probably we If we do the validation here, then there is no need to check in each HdfsScanNode. Going from BaseTableRef -> ScanNode is a 1:1 translation. Even for "skip.header.linecount" there is no need to check in the HdfsScanNode again. Might want to remove that check. Your BE simplifications make sense to me. After the check here, the BE is guarantee to see a valid timezone string. http://gerrit.cloudera.org:8080/#/c/5939/4/tests/custom_cluster/test_parquet_timestamp_compatibility.py File tests/custom_cluster/test_parquet_timestamp_compatibility.py: Line 1: # Copyright (c) 2015 Cloudera, Inc. All rights reserved. Use new Apache license header. Do we still have the old header in some existing files? -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 67: analyzeParquetMRTimeZone(); > My apologies, but I was wrong on this one. I think it's unfortunate, but be If we follow the pattern set for "skip.header.linecount", then probably we should check the "parquet.mr.int96.write.zone" property in the HdfsScanNode constructor as well ("skip.header.linecount" is checked there too). If we do that, we can also simplify the BE code. The BE HdfsScanNode/HdfsScanner/ParquetColumnReader classes could safely assume that the timezone string is valid without doing any additional validation. What do you think? -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 246: parent_->state_->LogError(ErrorMsg(TErrorCode::PARQUET_MR_WRITE_ZONE_INVALID, > I don't think logging an error is the right thing to do here. We will scan Even better, we can check this at the scan node level, e.g., in HdfsScanNode::Open(). Maybe the members should go into the scan node as well, you can consider it. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 4: (40 comments) http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG Commit Message: Line 14: values from Parquet files written by Hive. > I don't think the "vice versa" part is correct. My understanding is that Hi Removed it. Line 17: Impala reads Parquet MR timestamp data and adjusts values using a time > typo: adjusts Done Line 20: applied to data written by Impala. > Mention interaction with --convert_legacy_hive_parquet_utc_timestamps Done Line 23: global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is > Not sure if the name of this flag was already decided upon, but imo, it's n Changed the name of the flag to set_parquet_mr_int96_write_zone_to_utc_on_new_tables Line 23: global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is > It is also important that it is only set on new tables and that it is set t Done http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 242: if (!is_timestamp_dependent_timezone_) { > What if timezone_ is NULL here? Added error logging. Line 244: parent_->scan_node_->parquet_mr_write_zone()); > Does FindTimezone() really throw? We're not handling that in other places. You are right, it does not throw an exception. The documentation I found on boost::local_time::tz_database::time_zone_from_region() was incorrect: http://marc.info/?l=boost-bugs=132589759305973 Line 249: } > DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set If parquet_mr_time_zone() returns an invalid timezone string then (!timestamp_dependent_timezone_ && timezone_ == 0) will be true. We check for this error condition in ScalarColumnreader::ConvertSlot() and log error there as well if it is true. On the other hand it is true that is_timestamp_dependent_timezone_ and timezone_ cannot be set at the same time. Added a DCHECK for that. Line 554: /// table property to avoid repeated calls to 'TimezoneDatabase::FindTimezone()'. > initialize these new members in the constructor Done Line 608: const TimestampValue* src, TimestampValue* dst, MemPool* pool) { > DCHECK_NE Done Line 611: DCHECK_NE(parent_->scan_node_->parquet_mr_write_zone(), "UTC"); > Avoid creating/destroying this status even if no error happens (a possible Done Line 628: // Not a timestamp specific timezone. Convert timestamp to the timezone object > swap this case with the else if above Done Line 636: parent_->parse_status_ = status; > else { As explained above, it is possible that is_timestamp_dependent_timezone_ is false and timezone_ is NULL. Added logging an error message. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 194: for (const auto& tz_region: tz_region_list_) { > please convert to for-each while you are here Done http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: Line 37: /// Converts the name of a timezone to a boost timezone object. > Not your change, but comment on what happens when the timezone was not foun Done http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 111: if (timezone == NULL) { > Can timezone be NULL? Removed the try-catch and added code to return false when timezone_ == NULL Line 113: return false; > Which function can throw? I was under the impression that FimezoneDatabase::FindTimezone() can throw, but it can't. Removed the try-catch and added check for timezone_ == NULL Line 125: return true; > I have a feeling boost might throw here if we give it funny input. What do you mean? The documentation doesn't say anything about an exception thrown from local_date_time(posix_time::ptime, timezone_ptr) constructor. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 199: /// conversion was successfull and false otherwise. > comment on return values (and functions below) Done http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 354: JNIEXPORT jboolean JNICALL > Why even pass and return thrift structs? You can just take a string and ret Done Line 365: static JNINativeMethod native_methods[] = { > Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime You are correct, I've removed the timezone db initialization from here. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 123: "the behavior of recent versions of Hive and Spark as well. The
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#4). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py 23 files changed, 558 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#4). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py 23 files changed, 558 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#4). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py M tests/metadata/test_ddl_base.py 23 files changed, 558 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#4). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. This change also alters the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py 23 files changed, 597 insertions(+), 32 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 123: "Impala, Hive, and Spark to not apply timestamp timezone adjustments for parquet " > Hive always writes in UTC (that's the problem). It makes Hive write as if i You are right. Important distinction. Your proposal sounds good to me. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG Commit Message: Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true. > Not sure if the name of this flag was already decided upon, but imo, it's n It is also important that it is only set on new tables and that it is set to UTC. I would suggest --set_parquet_mr_int96_write_zone_to_utc_on_new_tables http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 123: "Impala, Hive, and Spark to not apply timestamp timezone adjustments for parquet " > Mention that this property also makes Hive write in UTC. Hive always writes in UTC (that's the problem). It makes Hive write as if it was located in UTC instead of the actual local timezone, which makes the written value match the original value. I think this effect is so complex that we can not describe it in a few lines. I would replace the last sentence with "This changes the behavior of recent versions of Hive and Spark as well. The writing of timestamps is affected in Hive and Spark but not in Impala. The reading of timestamps that were written by Hive, Spark, or any other parquet-mr writer is affected in Hive, Spark and Impala. You can find details in the documentation." -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 3: (39 comments) I've gone through the code. Still need to go through the tests. http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG Commit Message: Line 14: values from Parquet files written by Hive, and vice versa. I don't think the "vice versa" part is correct. My understanding is that Hive does not adjust data written by Impala, so it reads the data correctly. Line 17: Impala reads Parquet MR timestamp data and adjust values using a time typo: adjusts Line 20: applied to data written by Impala. Mention interaction with --convert_legacy_hive_parquet_utc_timestamps Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true. Not sure if the name of this flag was already decided upon, but imo, it's not very clear what it does from the description. As an alternative, I'd propose something like: --set_parquet_mr_int96_write_zone It's a complex issue, so I'd prefer to be more explicit about what the flag does as opposed to which behavior the flag is supposed to induce across components. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 242: timezone_ = TimezoneDatabase::FindTimezone( What if timezone_ is NULL here? Line 244: } catch (std::exception& from_boost) { Does FindTimezone() really throw? We're not handling that in other places. Line 249: } DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set Line 554: boost::local_time::time_zone_ptr timezone_; initialize these new members in the constructor Line 608: DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC"); DCHECK_NE Line 611: Status status; Avoid creating/destroying this status even if no error happens (a possible perf issue). I'd suggest you move the parent_->parse_status_ = status; directly into the error branches and return false from them directly. Line 628: } else if (LIKELY(timezone_ != NULL)) { swap this case with the else if above Line 636: } else { DCHECK(false) << add error msg here } http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 194: for (vector::const_iterator iter = tz_region_list_.begin(); please convert to for-each while you are here http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: Line 37: /// Converts the name of a timezone to a boost timezone object. Not your change, but comment on what happens when the timezone was not found in the database. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 111: time_zone_ptr timezone = TimezoneDatabase::FindTimezone(timezone_str, *this); Can timezone be NULL? Line 113: } catch (std::exception& from_boost) { Which function can throw? Line 125: local_date_time lt(temp, timezone); I have a feeling boost might throw here if we give it funny input. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 199: bool UtcToLocal(); comment on return values (and functions below) http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 354: JNIEXPORT jbyteArray JNICALL Why even pass and return thrift structs? You can just take a string and return a boolean. Line 365: ABORT_IF_ERROR(TimezoneDatabase::Initialize()); Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime() which initializes the TZ DB. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 123: "Impala, Hive, and Spark to not apply timestamp timezone adjustments for parquet " Mention that this property also makes Hive write in UTC. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/util/backend-gflag-util.cc File be/src/util/backend-gflag-util.cc: Line 55: cfg.__set_assign_utc_to_parquet_mr_timezone_prop( let's keep the name consistent across various uses http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 215: // Specifies a time zone for adjusting timestamp values read from Parquet MR files. This ... from Parquet files written by parquet-mr. Line 216: // is used for a Hive compatibilty fix. Mention that this comes from the table property http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 316: ("PARQUET_MR_TIMEZONE_INVALID", 102, "Invalid timezone '$0' specified in " Why even mention the file name? Seems better to mention
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 2: (11 comments) http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 237: timezone_ = TimezoneDatabase::FindTimezone( > We should first check for TimezoneDatabase::IsTimestampDependentTimezone, o Done. Also added slot_desc_->type().type == TYPE_TIMESTAMP to the condition. Line 240: timezone_ = NULL; > This exception should not be swallowed. If we can not convert to the desire Added an error message. Even though the program execution should never get to L240, (we deal with invalid timezones in the frontend in the analyze phase), I feel it is important to be defensive here and display errors. Line 543: /// If not NULL, TIMESTAMP column readers require conversion to this time zone. > Although technically correct, this comment is a bit misleading. One may eas Done Line 593: if (LIKELY(dst->HasDateAndTime())) { > DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC") Done Line 597: dst->FromUtc(parent_->scan_node_->parquet_mr_time_zone()); > This is somewhat confusing, as it seems as if the first two cases did the s Done. I also refactored the if statements to make the logic in 'ConvertSlot()' easier to follow. Error handling was added as well. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 189: time_zone_ptr TimezoneDatabase::FindTimezone(const string& tz) { > This should be a private helper function with two callers: 'FindTimezone(const string& tz)' is also called from the Frontend to check if a timezone string is valid (see 'Java_org_apache_impala_service_FeSupport_NativeCheckIsValidTimeZone()' in fe-support.cc). In this case it is not guaranteed that the timezone is not timestamp dependent. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 109: // If a table has an invalid timezone, it should not have been loaded. > How does this comment apply here? Was it supposed to be before the catch bl Removed the comment as it doesn't make much sense here. Line 112: *this = ptime(not_a_date_time); > Again, swallowing this exception can be a problem as end-users may not noti Added code to display exception message. Added error-handling code to the caller function as well. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 201: /// Converts from UTC to local time in the given timezone. > Please document the difference that one can be timestamp-dependent while th Both versions of 'FromUtc()' will work both with timestamp-independent and timestamp-dependent timezones. On the other hand, there is a difference between the two versions of 'TimezoneDatabase::FindTimezone()' which is documented in timezone_db.h http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 352: extern "C" > Please document the purpose of this call (allowing the Java code to check w Done http://gerrit.cloudera.org:8080/#/c/5939/2/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: Line 182: "'parquet.mr.int96.write.zone' is only supported for HDFS tables.")); > Why? Parquet files should be handled this way regardless of using HDFS or n I'm not 100% sure about this, but I think HdfsTable instances are used to represent tables stored in HDFS or S3 or the local file system. This condition was meant to exclude Kudu/HBase tables. Let's leave it like this for now and see what other reviewers think. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#3). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive, and vice versa. After this change: Impala reads Parquet MR timestamp data and adjust values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --prevent_parquet_mr_zone_adjustment is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py 24 files changed, 637 insertions(+), 31 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/3 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#3). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive, and vice versa. After this change: Impala reads Parquet MR timestamp data and adjust values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --prevent_parquet_mr_zone_adjustment is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py 24 files changed, 637 insertions(+), 31 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/3 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila JegesGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#2). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive, and vice versa. After this change: Impala reads Parquet MR timestamp data and adjust values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --prevent_parquet_mr_zone_adjustment is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py 23 files changed, 586 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/2 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new patch set (#2). Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive, and vice versa. After this change: Impala reads Parquet MR timestamp data and adjust values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --prevent_parquet_mr_zone_adjustment is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py 23 files changed, 586 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/2 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Attila Jeges has uploaded a new change for review. http://gerrit.cloudera.org:8080/5939 Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive, and vice versa. After this change: Impala reads Parquet MR timestamp data and adjust values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New tables created by Impala will set the table property to UTC if the global flag --prevent_parquet_mr_zone_adjustment is set to true. Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the table property unless the global flag is set to true. Tables created using CREATE TABLE LIKE will copy the property of the table that is copied. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/service/fe-support.cc M be/src/service/impala-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/FeSupport.java M tests/common/impala_test_suite.py A tests/custom_cluster/test_parquet_timestamp_compatibility.py M tests/metadata/test_ddl.py 23 files changed, 588 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/1 -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges