[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Reviewed-on: http://gerrit.cloudera.org:8080/5939
Reviewed-by: Dan Hecht 
Tested-by: Impala Public Jenkins
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 11: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 11:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/526/

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 11: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 11:

> Uploaded patch set 11.

Fixed WARN_UNUSED_RESULT warning that failed clang-tidy job.

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#11).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/11
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-02 Thread Attila Jeges (Code Review)
Hello Impala Public Jenkins, Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5939

to look at the new patch set (#11).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/11
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 10: Verified-1

Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/521/

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-05-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 10:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/521/

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 10:

> Uploaded patch set 10.

This is the rebased change.

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#10).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/10
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Hello Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5939

to look at the new patch set (#10).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/10
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

PS8, Line 124: local_date_time lt(temp, timezone);
> It says an exception can be thrown by the constructor, but I guess what's i
Yes, exactly. According to the documentation 
local_date_time(posix_time::ptime, timezone_ptr) constructor won't throw an 
exception.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 9: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

PS8, Line 124: local_date_time lt(temp, timezone);
> I couldn't find any indication of an exception in the code or in the docume
It says an exception can be thrown by the constructor, but I guess what's it's 
saying is that can only happen for the constructors that takes a bool or flag? 
i.e. the one that takes UTC won't throw. Is that what you concluded? Just want 
to be super safe here since we've had problems with unhandled exceptions from 
boost.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 601:  
> ... in 'scanner_status'.
Done


PS8, Line 604:  const TimestampValue* 
> please pass this as a 'Status* scanner_status', which is what we usually do
Done


http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

PS8, Line 124: local_date_time lt(temp, timezone);
> can this throw an exception?
I couldn't find any indication of an exception in the code or in the 
documentation:
http://www.boost.org/doc/libs/1_57_0/doc/html/date_time/local_time.html#date_time.local_time.local_date_time


PS8, Line 125: *this = lt.local_time();
> or that?
Same here.


http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS8, Line 122: tabl
> it's the same for other FileSystem interfaced things, like S3. So maybe rem
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#9).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-27 Thread Attila Jeges (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5939

to look at the new patch set (#9).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 850 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-26 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 601: .
... in 'scanner_status'.


PS8, Line 604:  Status& scanner_status
please pass this as a 'Status* scanner_status', which is what we usually do for 
out parameters to make it more obvious at the callsite that it's modified. And 
nice to put out parameters last.


http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

PS8, Line 124: local_date_time lt(temp, timezone);
can this throw an exception?


PS8, Line 125: *this = lt.local_time();
or that?


http://gerrit.cloudera.org:8080/#/c/5939/8/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS8, Line 122: HDFS
it's the same for other FileSystem interfaced things, like S3. So maybe remove 
"HDFS".


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-26 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 8:

> (1 comment)

Yes, I think we should do that.

-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-26 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 123:   local_date_time lt(temp, timezone);
> I've investigated the history of UtcToLocal():
Great, thanks for checking that.  So then should we consider rewriting 
UtcToLocal() in terms of FromUtc()?  Not as part of this change and something 
we can sort out later.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-26 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 123:   ToPtime();
> I seem to remember there also being a lock in the boost code related to loc
I've investigated the history of UtcToLocal():

- UtcToLocal() was first introduced in IMPALA-1658 (87b9fac2adda). This version 
of UtcToLocal() already used localtime_r() for UTC -> local time conversion. It 
also used couple boost functions to convert back and forth between 'ptime' and 
'struct tm' types (to_tm(), ptime_from_tm(), nanoseconds())

- The first version of UtcToLocal() made querying tables with timestamps a lot 
slower (up to 30x times slower). The root cause was a global lock in 
localtime_r(). This is discussed in IMPALA-3316 and IMPALA-2125. Both JIRAs are 
still open.

- UtcToLocal() was made 50% faster by 4ddce7f97880. This change got rid of the 
boost functions (to_tm(), ptime_from_tm(), nanoseconds()) and added the comment 
at L77. The change kept localtime_r() though, so it didn't do anything to solve 
the global lock problem.

I wrote a simple benchmark program to confirm that UtcToLocal()'s performance 
degrades if the number of threads running it in parallel is increased. The 
benchmark program also confirms that FromUtc() does not have this problem.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-25 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 123:   local_date_time lt(temp, timezone);
> I've added a micro-benchmark to compare the two. UtcToLocal() is about 2.5 
I seem to remember there also being a lock in the boost code related to locales 
that's avoided by using the standard C function, so the regression grows with 
concurrency.  I might not be remembering correctly, though. Can you try to dig 
up the JIRA that introduced the use of localtime_r() to see if there's more 
details?


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-24 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#8).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 806 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-24 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 8:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/5939/7//COMMIT_MSG
Commit Message:

PS7, Line 22: New HDFS tables created by Impala using CREATE TABLE and CREATE 
TABLE
: LIKE  will set the table property to UTC if the global flag
: --set_parque
> I don't follow this given the next two paragraphs. This isn't true for the 
Correct. Changed the commit message.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 305:   // parquet-mr. If conversion should not occur, this is set to an 
empty string. Otherwise
> from the code, it looks like this is set to the empty string if conversion 
Correct, added a comment to document the behavior.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 234: FLAGS_convert_legacy_hive_parquet_utc_timestamps
> given that parquet_mr_write_zone() will take precedence over the flag, I th
Done


PS7, Line 246: DCHECK
> nit: DCHECK_EQ()
Done


PS7, Line 609:  status = scanner_state->LogOrReturnError(msg);
> since this code is very performance critical, it would be best to order thi
Correct. Reordered the conditions in the if-elif-else statement


PS7, Line 611: status = status;
> these could all be UNLIKELY since we want to optimize for the non-error cas
Done


PS7, Line 625: LY(dst->HasDateAndTime()
> i.e. here too and below.
Done


PS7, Line 639: if (!SetTimestampConversionError(parent_->scan_node_, 
parent_->state_,
 : parent_->parse_status_, src, 
parent_->scan_node_->parquet_mr_write_zone())) {
 :   return false;
 : }
 :   }
 : } else {
 :   
DCHECK(parent_->scan_node_->parquet_mr_write_zone().empty());
 :   DCH
> seems worth factoring this block (which occurs 3 times) into a non-inlined 
Done. Added a new non-class member function and factored out the block into it.

(Adding a new member function instead to ScalarColumnReader template class is 
probably not a good idea since it is only needed in the 
ScalarColumnReader instance)


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 193: 
> could you add a comment explaining these two lookup techniques?
Done


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

PS7, Line 45: hen IsTimestampDependen
> how about drawing the connection with the next function directly:
Done


Line 49: 
> comment this.
Done


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 101: Status s("Failed to convert timestamp to local time.");
> this might produce a ton of logging.  instead, why not return the Status ob
Done


Line 109: bool TimestampValue::FromUtc(const std::string& timezone_str) {
> how do we guarantee that this is true? did the caller already check that?
Yes, it did.


PS7, Line 111: _zone_ptr timezo
> UNLIKELY (since perf critical)
Done


Line 123:   ToPtime();
> see the comment at line 77. i think we were avoiding this for perf reasons.
I've added a micro-benchmark to compare the two. UtcToLocal() is about 2.5 
times faster than FromUtc() when input data size is 1,000 and 1.5 times faster 
when input data size is 10,000.

Rewriting FromUtc() not to use boost function calls is not that 
straightforward. The above version of UtcToLocal() uses localtime_r() standard 
C function to perform the UTC->local timezone conversion. FromUtc() on the 
other hand should convert timestamps to a given time zone. I don't think there 
is a standard C function to do this.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

PS7, Line 197: 
> below you say "local time in the given timezone", which I guess is technica
Changed the comment below for 'FromUtc'


PS7, Line 203: 
> it's not clear that "convert" means that *this is modified in-place, so I t
Done


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS7, Line 122: new
> is that always the case?
Fixed the description


PS7, Line 122: bles created "
 : "using CREATE TABLE and CREATE TABLE LIKE . You can 
find details in the "
 : "documentation.");
 : 
 : DEFINE_int64(max_result_cache_size, 1
> I'm not sure that this makes sense on it's own. I think you kind of have to
Removed that part



[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-24 Thread Attila Jeges (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5939

to look at the new patch set (#8).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE  will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.

HDFS tables created by Impala using CREATE TABLE LIKE 
will copy the property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
28 files changed, 806 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-19 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 7:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/5939/7//COMMIT_MSG
Commit Message:

PS7, Line 22: New tables created by Impala will set the table property to UTC 
if the
: global flag 
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
: set to true.
I don't follow this given the next two paragraphs. This isn't true for the 
"LIKE " case, right?


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 305:   // parquet-mr. FE guarantees that this is a valid time zone.
from the code, it looks like this is set to the empty string if conversion 
should not occur? if that's true, please document that case.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 234: !parent_->scan_node_->parquet_mr_write_zone().empty()
given that parquet_mr_write_zone() will take precedence over the flag, I think 
it would be clearer to reverse the order of these two conditionals. i.e. we 
don't care what the value of the flag is if the timestamp is set, right? (I 
realize it makes no difference functionally, but seems more consistent with the 
intent).


PS7, Line 246: DCHECK
nit: DCHECK_EQ()


PS7, Line 609: parent_->scan_node_->parquet_mr_write_zone().empty()
since this code is very performance critical, it would be best to order this in 
the order we want to optimize for. i.e. first check timezone_, then check 
is_timestamp_dependent_timezone_ and then only if neither is set, check 
parquet_mr_write_zone().empty() (which is also probably the most expensive 
check). i.e. move this condition to the end of the if-else-if chain.  

And then both of parquet_mr_write_zone().empty() and 
FLAGS_convert_legacy_hive_parquet_utc_timestamps must be true, right? (or else 
we wouldn't have needs_conversion_ set). And so you can just DCHECK() them.


PS7, Line 611: !dst->UtcToLocal())
these could all be UNLIKELY since we want to optimize for the non-error case.


PS7, Line 625: !dst->FromUtc(timezone_)
i.e. here too and below.


PS7, Line 639: ErrorMsg 
msg(TErrorCode::PARQUET_MR_TIMESTAMP_CONVERSION_FAILED,
 : src->DebugString(), 
parent_->scan_node_->parquet_mr_write_zone(),
 : 
parent_->scan_node_->hdfs_table()->fully_qualified_name());
 : Status status = parent_->state_->LogOrReturnError(msg);
 : if (!status.ok()) {
 :   parent_->parse_status_ = status;
 :   return false;
 : }
seems worth factoring this block (which occurs 3 times) into a non-inlined 
subroutine, which will help avoid I-cache pollution (and also make the code 
more readable).


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 193: 
could you add a comment explaining these two lookup techniques?


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

PS7, Line 45: for the Moscow timezone
how about drawing the connection with the next function directly:

(May not work correctly when IsTimestampDependentTimezone(tz), e.g. Moscow 
timezone).


Line 49:   static bool IsTimestampDependentTimezone(const std::string& tz);
comment this.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 101: LOG(ERROR) << from_boost.what();
this might produce a ton of logging.  instead, why not return the Status object 
from these, and include the what() in the details? then the error message 
deduping code in LogError() will take care of throttling it.


Line 109:   DCHECK(HasDateAndTime());
how do we guarantee that this is true? did the caller already check that?


PS7, Line 111: timezone == NULL
UNLIKELY (since perf critical)


Line 123:   local_date_time lt(temp, timezone);
see the comment at line 77. i think we were avoiding this for perf reasons. did 
you measure the perf of this compared to UtcToLocal()?


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

PS7, Line 197: local time
below you say "local time in the given timezone", which I guess is technically 
correct but then makes this statement confusing (becuase here the comment meant 
"local timezone").  So, how about either removing the words "local time in the" 
below, or change this comment to say "local time in the system timezone"


PS7, Line 203: local time in the given timezone
it's not clear that "convert" means that *this is modified in-place, so I think 
you should add that back in like the 

[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-12 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#7).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
26 files changed, 665 insertions(+), 73 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-12 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5939/6/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 113:   "Invalid time zone in the '%s' table property: %s",
> double 'the'
Done


http://gerrit.cloudera.org:8080/#/c/5939/6/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java:

Line 665: 
> Let's move all the CREATE/ALTER tests into a separate TestParquetMrInt96Wri
Done


Line 1882: "'  '", "URI path cannot be empty.");
> easier to read single quotes
Done


Line 1904:   type == PrimitiveType.VARCHAR) {
> easier to read single quotes
Done


http://gerrit.cloudera.org:8080/#/c/5939/6/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

Line 105: parquet_path = get_fs_path(
> What does "fn" stand for? I'm thinking "file name", but this is not just a 
Renamed it to 'parquet_path'


Line 123:   ON i.id = h.id AND i.day = h.day  -- serves as a unique key
> easier to read with the alias 'i' next to the table
Done


Line 125:   (h.timestamp_col IS NULL) != (i.timestamp_col IS NULL)
> simplify the first two conditions with:
Done. Had to put round brackets around col IS NULL expressions to avoid 
analysis errors.


http://gerrit.cloudera.org:8080/#/c/5939/6/tests/query_test/test_parquet_timestamp_compatibility.py
File tests/query_test/test_parquet_timestamp_compatibility.py:

Line 78:   def test_invalid_parquet_mr_write_zone(self, vector, 
unique_database):
> test_invalid_parquet_mr_write_zone
Done


Line 118:   # tz_name conversion on the timestamp values.
> extra space after "triggers"
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-12 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#7).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
26 files changed, 665 insertions(+), 73 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#6).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
26 files changed, 657 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#6).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
26 files changed, 657 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 71:   @pytest.mark.execute_serially
> Moved tests that use startup options to ./tests/custom_cluster/test_hive_pa
The rest of the tests were moved to 
./tests/query_test/test_parquet_timestamp_compatibility.py


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#6).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
26 files changed, 657 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(26 comments)

Thanks for the review.

http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 305:   // parquet-mr.
> Comment that the FE guarantees that this is a valid timezone.
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 246:   DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) 
||
> simplify condition:
Done


Line 549:   /// Used to cache the timezone object corresponding to 
"parquet.mr.int96.write.zone"
> the "parquet.mr.int96.write.zone" table property
Done


Line 550:   /// table property to avoid repeated calls to 
'TimezoneDatabase::FindTimezone()'.
> no need to single-quote here
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 190:   // See if they specified a zone id
> Who is "they"? Suggest rephrasing
Done


Line 202:   return time_zone_ptr();
> return NULL?
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/BackendGflags.thrift
File common/thrift/BackendGflags.thrift:

Line 58:   // If true, all newly created Parquet tables will have the 
parquet.mr.int96.write.zone
> ... all newly created HDFS tables regardless of format ...
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 218:   9: optional string parquet_mr_write_zone;
> remove trailing semicolon for consistency
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 169:   private static void analyzeParquetMrWriteZone(Table table,
> These cases need tests in AnalyzeDDL
Done. I also changed the exception message for consistency.


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 112:   throw new AnalysisException("Invalid Time Zone: " + timezone);
> Mention that the timezone is in the table property 'parquet.mr.int96.write.
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 183: if 
(getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE)) {
> These cases need tests in AnalyzeDDL
Done


Line 186: "Only HDFS tables can use '%s' table property.",
> the '%s' table property
Done. I also changed the exception message for consistency.


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 54: import org.apache.impala.common.InternalException;
> not needed
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 25: class TestParquetTimestampCompatibility(CustomClusterTestSuite):
> Test needs a comment. In particular, what is covered here and what is cover
I've added a comment that describes tests in this file.

What do you mean by "integration tests not part of Impala tests"? I don't think 
we have any integration tests upstream.


Line 26:   TEST_DB_NAME = 'timezone_test_db'
> Why not use the unique_database fixture? That's the best practice.
Switched to unique_database.


Line 39: self.client = impalad.service.create_beeswax_client()
> I think we already have a self.client from ImpalaTestSuite.
Removed creating a client from here.


Line 49: 
"/test-warehouse/alltypesagg_hive_13_1_parquet/alltypesagg_hive_13_1.parquet"
> We need the appropriate filesystem prefix here. This will break on local fs
Done


Line 56: self.client.execute('USE ' + self.TEST_DB_NAME)
> Why? Better to avoid changing the session state
Removed it.


Line 59: self.client.execute('INVALIDATE METADATA')
> why?
Removed it, it's not needed.


Line 61:   def _set_impala_table_timezone(self, timezone):
> simplify to pass table name as param
Done


Line 71:   @pytest.mark.execute_serially
> How long does this test take? I'm thinking we should only have minimal test
Moved tests that use startup options to 
./tests/custom_cluster/test_hive_parquet_timestamp_conversion.py


Line 81: statement = '''ALTER TABLE hive_table
> Should be an analyzer test in AnalyzeDDL
Removed it from here.


Line 104: select_from_impala_table = '''SELECT timestamp_col FROM 
impala_table WHERE id = 1'''
> Can you 

[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-04 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 169:   private static void analyzeParquetMrWriteZone(Table table,
These cases need tests in AnalyzeDDL


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 183: if 
(getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE)) {
These cases need tests in AnalyzeDDL


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 54: import org.apache.impala.common.InternalException;
not needed


http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 25: class TestParquetTimestampCompatibility(CustomClusterTestSuite):
Test needs a comment. In particular, what is covered here and what is covered 
in integration tests not part of Impala tests.


Line 26:   TEST_DB_NAME = 'timezone_test_db'
Why not use the unique_database fixture? That's the best practice.


Line 39: self.client = impalad.service.create_beeswax_client()
I think we already have a self.client from ImpalaTestSuite.


Line 49: 
"/test-warehouse/alltypesagg_hive_13_1_parquet/alltypesagg_hive_13_1.parquet"
We need the appropriate filesystem prefix here. This will break on local fs or 
S3 builds. Take a look at:

filesystem_utils.py get_fs_path()

and how that is used elsewhere


Line 56: self.client.execute('USE ' + self.TEST_DB_NAME)
Why? Better to avoid changing the session state


Line 59: self.client.execute('INVALIDATE METADATA')
why?


Line 61:   def _set_impala_table_timezone(self, timezone):
simplify to pass table name as param


Line 71:   @pytest.mark.execute_serially
How long does this test take? I'm thinking we should only have minimal tests 
for the impalad startup option in a custom cluster test, probably in the 
existing test for convert_legacy_hive_parquet_utc_timestamps. 

The other tests should go into a regular test (subclass of ImpalaTestSuite) and 
run in parallel.


Line 81: statement = '''ALTER TABLE hive_table
Should be an analyzer test in AnalyzeDDL


Line 104: select_from_impala_table = '''SELECT timestamp_col FROM 
impala_table WHERE id = 1'''
Can you think of a way to validate all values in the table in bulk e.g. using 
our existing timestamp conversion functions?
Having a few example values is still good, but we should also test that we are 
internally consistent with out conversion functions.


Line 113: self._set_impala_table_timezone('EST')
Add some comments about the behavior here. We are getting different values for 
the same timezone because the underlying Parquet files were written by 
Hive/Impala.


Line 141: statement = '''ALTER TABLE hive_table
analyzer test is more appropriate (also fix elsewhere)


Line 185:   def test_ddl(self, vector):
We need a test where Hive sets a garbage timezone and Impala throws an error 
when analyzing a query against that table.

We also need a test that shows what happens when a table property timezone is 
set and the convert_legacy_hive_parquet_utc_timestamps option is used. Probably 
good to consolidate with the existing custom cluster test.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-03 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(10 comments)

I'm pretty happy with the code! Moving on to the tests.

http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 305:   // parquet-mr.
Comment that the FE guarantees that this is a valid timezone.


http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 246:   DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) 
||
simplify condition:

is_timestamp_dependent_timezone_ == (timezone_ == NULL)


Line 549:   /// Used to cache the timezone object corresponding to 
"parquet.mr.int96.write.zone"
the "parquet.mr.int96.write.zone" table property


Line 550:   /// table property to avoid repeated calls to 
'TimezoneDatabase::FindTimezone()'.
no need to single-quote here


http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 190:   // See if they specified a zone id
Who is "they"? Suggest rephrasing


Line 202:   return time_zone_ptr();
return NULL?


http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/BackendGflags.thrift
File common/thrift/BackendGflags.thrift:

Line 58:   // If true, all newly created Parquet tables will have the 
parquet.mr.int96.write.zone
... all newly created HDFS tables regardless of format ...


http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 218:   9: optional string parquet_mr_write_zone;
remove trailing semicolon for consistency


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 112:   throw new AnalysisException("Invalid Time Zone: " + timezone);
Mention that the timezone is in the table property 
'parquet.mr.int96.write.zone'. Otherwise, the user might have no clue where 
this timezone comes from.


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 186: "Only HDFS tables can use '%s' table property.",
the '%s' table property


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-23 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 304:   // Time zone for adjusting timestamp values read from Parquet files 
written by
> single line
That would make the line 92 characters long.


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 246:   DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) 
||
> Even better, we can check this at the scan node level, e.g., in HdfsScanNod
Simplified the code. The timezone check is done in the FE, the BE is guaranteed 
to have a valid timezone string.


Line 653:   }
> As mentioned above, we should terminate the query early when in such a bad 
Added DCHECK(false)


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 194:   for (const string& tz_region: tz_region_list_) {
> No need to use auto here. const string& is clearer
Done


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

Line 40:   /// If 'tz' is not found in the database, nullptr is returned.
> null pointer -> nullptr
Done


Line 46:   /// If 'tz' is not found in the database, nullptr is returned.
> null pointer -> nullptr
Done


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 199:   /// conversion was successfull and false otherwise. If conversion 
failed *this is set to
> Comment on what the state of *this is when false is returned.
Done


Line 202: 
> Comment on what the state of *this is when false is returned.
Done


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 357:   const char *tz = env->GetStringUTFChars(timezone, NULL);
> const char*
Done


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 122: "the parquet.mr.int96.write.zone table property to UTC for new 
tables. This changes "
> consistent placing of space (at line end or at line beginning)
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 67: analyzeJoin(analyzer);
> If we do the validation here, then there is no need to check in each HdfsSc
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 184:   if (getFileFormat() == THdfsFileFormat.KUDU) {
> You can check getFileFormat() here. The only non-HDFS formats we have are H
Done


Line 189:   String timezone = 
getTblProperties().get(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE);
> We should only do this for HDFS tables right?
Done


http://gerrit.cloudera.org:8080/#/c/5939/4/fe/src/main/java/org/apache/impala/service/FeSupport.java
File fe/src/main/java/org/apache/impala/service/FeSupport.java:

Line 87:   // Returns true if timezone String is valid according to the BE 
timezone database, false
> valid according to the BE timezone database
Done


http://gerrit.cloudera.org:8080/#/c/5939/4/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 1: # Licensed to the Apache Software Foundation (ASF) under one
> Use new Apache license header. Do we still have the old header in some exis
My bad, I created this test from an old test script.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-23 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#5).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
24 files changed, 569 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-13 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 67: analyzeParquetMRTimeZone();
> If we follow the pattern set for "skip.header.linecount", then probably we 
If we do the validation here, then there is no need to check in each 
HdfsScanNode. Going from BaseTableRef -> ScanNode is a 1:1 translation.

Even for "skip.header.linecount" there is no need to check in the HdfsScanNode 
again. Might want to remove that check.

Your BE simplifications make sense to me. After the check here, the BE is 
guarantee to see a valid timezone string.


http://gerrit.cloudera.org:8080/#/c/5939/4/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 1: # Copyright (c) 2015 Cloudera, Inc. All rights reserved.
Use new Apache license header. Do we still have the old header in some existing 
files?


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-13 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 67: analyzeParquetMRTimeZone();
> My apologies, but I was wrong on this one. I think it's unfortunate, but be
If we follow the pattern set for "skip.header.linecount", then probably we 
should check the "parquet.mr.int96.write.zone" property in the HdfsScanNode 
constructor as well ("skip.header.linecount" is checked there too).

If we do that, we can also simplify the BE code. The BE 
HdfsScanNode/HdfsScanner/ParquetColumnReader classes could safely assume that 
the timezone string is valid without doing any additional validation. What do 
you think?


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-10 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 246:   
parent_->state_->LogError(ErrorMsg(TErrorCode::PARQUET_MR_WRITE_ZONE_INVALID,
> I don't think logging an error is the right thing to do here. We will scan 
Even better, we can check this at the scan node level, e.g., in 
HdfsScanNode::Open(). Maybe the members should go into the scan node as well, 
you can consider it.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-07 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 4:

(40 comments)

http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG
Commit Message:

Line 14: values from Parquet files written by Hive.
> I don't think the "vice versa" part is correct. My understanding is that Hi
Removed it.


Line 17: Impala reads Parquet MR timestamp data and adjusts values using a time
> typo: adjusts
Done


Line 20: applied to data written by Impala.
> Mention interaction with --convert_legacy_hive_parquet_utc_timestamps
Done


Line 23: global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
> Not sure if the name of this flag was already decided upon, but imo, it's n
Changed the name of the flag to 
set_parquet_mr_int96_write_zone_to_utc_on_new_tables


Line 23: global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
> It is also important that it is only set on new tables and that it is set t
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 242:   if (!is_timestamp_dependent_timezone_) {
> What if timezone_ is NULL here?
Added error logging.


Line 244: parent_->scan_node_->parquet_mr_write_zone());
> Does FindTimezone() really throw? We're not handling that in other places.
You are right, it does not throw an exception. The documentation I found on 
boost::local_time::tz_database::time_zone_from_region() was incorrect: 
http://marc.info/?l=boost-bugs=132589759305973


Line 249: }
> DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set
If parquet_mr_time_zone() returns an invalid timezone string then
(!timestamp_dependent_timezone_ && timezone_ == 0) will be true.

We check for this error condition in ScalarColumnreader::ConvertSlot() and log 
error there as well if it is true.

On the other hand it is true that is_timestamp_dependent_timezone_ and 
timezone_ cannot be set at the same time. Added a DCHECK for that.


Line 554:   /// table property to avoid repeated calls to 
'TimezoneDatabase::FindTimezone()'.
> initialize these new members in the constructor
Done


Line 608: const TimestampValue* src, TimestampValue* dst, MemPool* pool) {
> DCHECK_NE
Done


Line 611:   DCHECK_NE(parent_->scan_node_->parquet_mr_write_zone(), "UTC");
> Avoid creating/destroying this status even if no error happens (a possible 
Done


Line 628:   // Not a timestamp specific timezone. Convert timestamp to the 
timezone object
> swap this case with the else if above
Done


Line 636:   parent_->parse_status_ = status;
> else {
As explained above, it is possible that is_timestamp_dependent_timezone_ is 
false and timezone_ is NULL. Added logging an error message.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 194:   for (const auto& tz_region: tz_region_list_) {
> please convert to for-each while you are here
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

Line 37:   /// Converts the name of a timezone to a boost timezone object.
> Not your change, but comment on what happens when the timezone was not foun
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 111:   if (timezone == NULL) {
> Can timezone be NULL?
Removed the try-catch and added code to return false when timezone_ == NULL


Line 113: return false;
> Which function can throw?
I was under the impression that FimezoneDatabase::FindTimezone() can throw, but 
it can't.
Removed the try-catch and added check for timezone_ == NULL


Line 125:   return true;
> I have a feeling boost might throw here if we give it funny input.
What do you mean? The documentation doesn't say anything about an exception 
thrown from local_date_time(posix_time::ptime, timezone_ptr) constructor.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 199:   /// conversion was successfull and false otherwise.
> comment on return values (and functions below)
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 354: JNIEXPORT jboolean JNICALL
> Why even pass and return thrift structs? You can just take a string and ret
Done


Line 365: static JNINativeMethod native_methods[] = {
> Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime
You are correct, I've removed the timezone db initialization from here.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 123: "the behavior of recent versions of Hive and Spark as well. The 

[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-07 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#4).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
23 files changed, 558 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-07 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#4).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
23 files changed, 558 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-07 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#4).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
23 files changed, 558 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-07 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#4).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

This change also alters the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
23 files changed, 597 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-02 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 123: "Impala, Hive, and Spark to not apply timestamp timezone 
adjustments for parquet "
> Hive always writes in UTC (that's the problem). It makes Hive write as if i
You are right. Important distinction. Your proposal sounds good to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-02 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG
Commit Message:

Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true.
> Not sure if the name of this flag was already decided upon, but imo, it's n
It is also important that it is only set on new tables and that it is set to 
UTC. I would suggest --set_parquet_mr_int96_write_zone_to_utc_on_new_tables


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 123: "Impala, Hive, and Spark to not apply timestamp timezone 
adjustments for parquet "
> Mention that this property also makes Hive write in UTC.
Hive always writes in UTC (that's the problem). It makes Hive write as if it 
was located in UTC instead of the actual local timezone, which makes the 
written value match the original value.

I think this effect is so complex that we can not describe it in a few lines. I 
would replace the last sentence with "This changes the behavior of recent 
versions of Hive and Spark as well. The writing of timestamps is affected in 
Hive and Spark but not in Impala. The reading of timestamps that were written 
by Hive, Spark, or any other parquet-mr writer is affected in Hive, Spark and 
Impala. You can find details in the documentation."


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-03-01 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 3:

(39 comments)

I've gone through the code. Still need to go through the tests.

http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG
Commit Message:

Line 14: values from Parquet files written by Hive, and vice versa.
I don't think the "vice versa" part is correct. My understanding is that Hive 
does not adjust data written by Impala, so it reads the data correctly.


Line 17: Impala reads Parquet MR timestamp data and adjust values using a time
typo: adjusts


Line 20: applied to data written by Impala.
Mention interaction with --convert_legacy_hive_parquet_utc_timestamps


Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true.
Not sure if the name of this flag was already decided upon, but imo, it's not 
very clear what it does from the description.

As an alternative, I'd propose something like:
--set_parquet_mr_int96_write_zone

It's a complex issue, so I'd prefer to be more explicit about what the flag 
does as opposed to which behavior the flag is supposed to induce across 
components.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 242:   timezone_ = TimezoneDatabase::FindTimezone(
What if timezone_ is NULL here?


Line 244: } catch (std::exception& from_boost) {
Does FindTimezone() really throw? We're not handling that in other places.


Line 249: }
DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set


Line 554:   boost::local_time::time_zone_ptr timezone_;
initialize these new members in the constructor


Line 608:   DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC");
DCHECK_NE


Line 611: Status status;
Avoid creating/destroying this status even if no error happens (a possible perf 
issue). I'd suggest you move the parent_->parse_status_ = status; directly into 
the error branches and return false from them directly.


Line 628: } else if (LIKELY(timezone_ != NULL)) {
swap this case with the else if above


Line 636: }
else {
  DCHECK(false) << add error msg here
}


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 194:   for (vector::const_iterator iter = tz_region_list_.begin();
please convert to for-each while you are here


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

Line 37:   /// Converts the name of a timezone to a boost timezone object.
Not your change, but comment on what happens when the timezone was not found in 
the database.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 111: time_zone_ptr timezone = 
TimezoneDatabase::FindTimezone(timezone_str, *this);
Can timezone be NULL?


Line 113:   } catch (std::exception& from_boost) {
Which function can throw?


Line 125:   local_date_time lt(temp, timezone);
I have a feeling boost might throw here if we give it funny input.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 199:   bool UtcToLocal();
comment on return values (and functions below)


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 354: JNIEXPORT jbyteArray JNICALL
Why even pass and return thrift structs? You can just take a string and return 
a boolean.


Line 365: ABORT_IF_ERROR(TimezoneDatabase::Initialize());
Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime() 
which initializes the TZ DB.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 123: "Impala, Hive, and Spark to not apply timestamp timezone 
adjustments for parquet "
Mention that this property also makes Hive write in UTC.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/util/backend-gflag-util.cc
File be/src/util/backend-gflag-util.cc:

Line 55:   cfg.__set_assign_utc_to_parquet_mr_timezone_prop(
let's keep the name consistent across various uses


http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 215:   // Specifies a time zone for adjusting timestamp values read from 
Parquet MR files. This
... from Parquet files written by parquet-mr.


Line 216:   // is used for a Hive compatibilty fix.
Mention that this comes from the table property


http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

Line 316:   ("PARQUET_MR_TIMEZONE_INVALID", 102, "Invalid timezone '$0' 
specified in "
Why even mention the file name? Seems better to mention 

[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-02-27 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 2:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 237: timezone_ = TimezoneDatabase::FindTimezone(
> We should first check for TimezoneDatabase::IsTimestampDependentTimezone, o
Done. Also added slot_desc_->type().type == TYPE_TIMESTAMP to the condition.


Line 240: timezone_ = NULL;
> This exception should not be swallowed. If we can not convert to the desire
Added an error message. 

Even though the program execution should never get to L240, (we deal with 
invalid timezones in the frontend in the analyze phase), I feel it is important 
to be defensive here and display errors.


Line 543:   /// If not NULL, TIMESTAMP column readers require conversion to 
this time zone.
> Although technically correct, this comment is a bit misleading. One may eas
Done


Line 593:   if (LIKELY(dst->HasDateAndTime())) {
> DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC")
Done


Line 597:   dst->FromUtc(parent_->scan_node_->parquet_mr_time_zone());
> This is somewhat confusing, as it seems as if the first two cases did the s
Done. I also refactored the if statements to make the logic in 'ConvertSlot()' 
easier to follow. Error handling was added as well.


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 189: time_zone_ptr TimezoneDatabase::FindTimezone(const string& tz) {
> This should be a private helper function with two callers:
'FindTimezone(const string& tz)' is also called from the Frontend to check if a 
timezone string is valid (see 
'Java_org_apache_impala_service_FeSupport_NativeCheckIsValidTimeZone()' in 
fe-support.cc). 

In this case it is not guaranteed that the timezone is not timestamp dependent.


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 109: // If a table has an invalid timezone, it should not have been 
loaded.
> How does this comment apply here? Was it supposed to be before the catch bl
Removed the comment as it doesn't make much sense here.


Line 112: *this = ptime(not_a_date_time);
> Again, swallowing this exception can be a problem as end-users may not noti
Added code to display exception message. Added error-handling code to the 
caller function as well.


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 201:   /// Converts from UTC to local time in the given timezone.
> Please document the difference that one can be timestamp-dependent while th
Both versions of 'FromUtc()' will work both with timestamp-independent and 
timestamp-dependent timezones.

On the other hand, there is a difference between the two versions of 
'TimezoneDatabase::FindTimezone()' which is documented in timezone_db.h


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 352: extern "C"
> Please document the purpose of this call (allowing the Java code to check w
Done


http://gerrit.cloudera.org:8080/#/c/5939/2/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 182: "'parquet.mr.int96.write.zone' is only supported for HDFS 
tables."));
> Why? Parquet files should be handled this way regardless of using HDFS or n
I'm not 100% sure about this, but I think HdfsTable instances are used to 
represent tables stored in HDFS or S3 or the local file system. This condition 
was meant to exclude Kudu/HBase tables.

Let's leave it like this for now and see what other reviewers think.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-02-27 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#3).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive, and vice versa.

After this change:
Impala reads Parquet MR timestamp data and adjust values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --prevent_parquet_mr_zone_adjustment is set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
24 files changed, 637 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-02-27 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#3).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive, and vice versa.

After this change:
Impala reads Parquet MR timestamp data and adjust values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --prevent_parquet_mr_zone_adjustment is set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
24 files changed, 637 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-02-08 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#2).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive, and vice versa.

After this change:
Impala reads Parquet MR timestamp data and adjust values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --prevent_parquet_mr_zone_adjustment is set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
23 files changed, 586 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-02-08 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#2).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive, and vice versa.

After this change:
Impala reads Parquet MR timestamp data and adjust values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --prevent_parquet_mr_zone_adjustment is set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
23 files changed, 586 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-02-08 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5939

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive, and vice versa.

After this change:
Impala reads Parquet MR timestamp data and adjust values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --prevent_parquet_mr_zone_adjustment is set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE  will copy the
property of the table that is copied.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_parquet_timestamp_compatibility.py
M tests/metadata/test_ddl.py
23 files changed, 588 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges