[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-09 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 18: Code-Review+2

Carry +2


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 09 Apr 2019 08:53:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-09 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Reviewed-on: http://gerrit.cloudera.org:8080/12247
Tested-by: Impala Public Jenkins 
Reviewed-by: Csaba Ringhofer 
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Csaba Ringhofer: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 18: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 09 Apr 2019 05:27:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 17:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2694/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 09 Apr 2019 00:51:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 18:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3998/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 09 Apr 2019 00:28:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#17).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/17
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 16: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3994/


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Mon, 08 Apr 2019 21:53:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 16: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Mon, 08 Apr 2019 16:46:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3994/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Mon, 08 Apr 2019 16:46:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2672/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Mon, 08 Apr 2019 15:36:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 15: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Mon, 08 Apr 2019 15:25:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#15).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/15
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-08 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 15:

Patch set 15 fixes a copy-paste error. I guess that I ran timestamp-test 
without rebuilding before uploading that last change, as it was broken before 
this fix.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Mon, 08 Apr 2019 14:58:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2614/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 02 Apr 2019 19:15:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 14:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2612/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 02 Apr 2019 15:53:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-02 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 14:

> Build Failed
 >
 > https://jenkins.impala.io/job/gerrit-code-review-checks/2609/ :
 > Initial code review checks failed. See linked job for details on
 > the failure.

The error during code review check seems to be unrelated:
E: Failed to fetch 
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/pool/universe/o/objenesis/libobjenesis-java_2.2-1_all.deb
  504  Gateway Time-out [IP: 34.210.25.51 80]


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 02 Apr 2019 15:16:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 14:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2609/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 02 Apr 2019 13:00:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-02 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: EQ(MI
> I feel I'm slightly in favor of adding proper distinct variables instead of
All the names I came up with were awkward, so I reused the same name but put 
all the tests to different blocks.


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc@743
PS13, Line 743: Add 250ns
> The comments should now reflect that you're setting it explicitly instead o
Done


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h@196
PS13, Line 196: 1'000'000'000LL
> We have NANOS_PER_SEC in timestamp-value-inline.h, use here, too, and elsew
NANOS_PER_DAY comes from walltime.h, which is included in most places, but I 
did not want to include it in a frequently used header like this.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 02 Apr 2019 12:18:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-04-02 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#14).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/14
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-29 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 13: Code-Review+2

(3 comments)

Only had some nits, otherwise this looks good to me.

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: EQ(MI
> I have rewritten the code to always reset min_date. I think that it is easi
I feel I'm slightly in favor of adding proper distinct variables instead of 
re-using min_date.


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc@743
PS13, Line 743: Add 250ns
The comments should now reflect that you're setting it explicitly instead of 
adding. Also check the one for max* below


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h@196
PS13, Line 196: 1'000'000'000LL
We have NANOS_PER_SEC in timestamp-value-inline.h, use here, too, and elsewhere 
in this file?



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Fri, 29 Mar 2019 23:37:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2573/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 28 Mar 2019 17:46:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-28 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#13).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 570 insertions(+), 107 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/13
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 12:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2572/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 28 Mar 2019 14:22:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-28 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 12:

Patch 12 fixes the clang tidy error in timestamp-test + rebases and resolves 
conflicts related to query options.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 28 Mar 2019 13:57:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-28 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#12).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 570 insertions(+), 107 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/12
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 11:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2559/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 27 Mar 2019 15:30:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 10:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2558/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 27 Mar 2019 15:10:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-27 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 11:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@17
PS9, Line 17: INT96_NANOS
> I think the NANOS suffix is somewhat misleading as it suggests that this is
I am open to a new name, but I want to consult with Hive / Parquet people 
first. This is a development query option at the moment, so I think that it is 
not a big issue to change the names in the future.

About discouraging people from using INT96_NANOS: currently this is the only 
timestamp representation that can be read by other Hadoop components. I would 
wait for a Parquet-MR release with logical type support + some cross-component 
testing / benchmarking before advising anyone to use the new types.


http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@42
PS9, Line 42: not
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h@61
PS9, Line 61: that the inpu
> nit: that the input type..
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc@150
PS9, Line 150:   parquet::TimeUnit time_unit;
> nit: declare these before using them?
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@713
PS9, Line 713: // Test padding on double digits
> I'd declare that when you use it
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@721
PS9, Line 721: ASSERT_TRUE(ParseFormatTokens(_ctx))  << "TC: " << i;
> Maybe prefix each of these comments with a comment or assert of the current
I have rewritten min_date assignments to use explicit timestamps.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@737
PS9, Line 737:   TimestampValue min_date = TimestampValue::Parse("1400-01-01");
> This test does not correspond to the comment above, does it? I.e., it is ro
I think that "floor" in the name should make it clear that it rounds toward -. 
Note that ToUnixTime() also rounds towards -.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: EQ(MI
> The switch between us and ns confused me for a bit, not sure if there's a g
I have rewritten the code to always reset min_date. I think that it is easier 
to understand now, but I can create new variables if you think that it would 
further improve readability.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@239
PS9, Line 239: 2262
> The famous Impala Year-2262 problem being conceived and I get to be part of
:D


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@375
PS9, Line 375:
> nit: 1970-01-01
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@98
PS9, Line 98: SECONDS_PER_
> While you're here, this could probably be a constant, too, given it's used
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@133
PS9, Line 133:   return true;
> Nit: I think some of these newlines could go.
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@156
PS9, Line 156:
> nit: indent 4 spaces
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@160
PS9, Line 160:   }
> nit: If it doesn't fit on one line, it should have braces {}
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py@848
PS9, Line 848: insert_stmt = """insert into {0} values
> Can you add a test for the case that values don't fit into 64bit nanos?
Done


http://gerrit.cloudera.org:8080/#/c/12247/10/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/query_test/test_insert_parquet.py@824
PS10, Line 824: ("1969-12-31 23:59:59.9"),
This was broken when rounding towards nearest was 

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-27 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#11).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 584 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/11
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py@180
PS10, Line 180: o
flake8: E225 missing whitespace around operator



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 27 Mar 2019 14:28:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-27 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#10).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 585 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/10
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-20 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 9:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@17
PS9, Line 17: INT96_NANOS
I think the NANOS suffix is somewhat misleading as it suggests that this is 
just ns precision with a larger range. Since we also want to discourage use of 
this I suggest to name it INT96_DEPRECATED (or INT96_IMPALA since this is where 
it came from).


http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@42
PS9, Line 42: net
typo


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h@61
PS9, Line 61: that the type
nit: that the input type..


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc@150
PS9, Line 150:   parquet::TimestampType timestamp_type;
nit: declare these before using them?


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@713
PS9, Line 713:   int64_t tm_min_micros, tm_min_millis;
I'd declare that when you use it


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@721
PS9, Line 721:   // Add 250ns and check the value is rounded down
Maybe prefix each of these comments with a comment or assert of the current 
value of min_time. Otherwise one has to count how many times we added 250ns all 
the time.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@737
PS9, Line 737:   EXPECT_TRUE(min_date.FloorUtcToUnixTimeMicros(_min_micros));
This test does not correspond to the comment above, does it? I.e., it is 
rounding towards -?


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: 250us
The switch between us and ns confused me for a bit, not sure if there's a good 
way to make it stand out more.

Also, some of these blocks add to the current value of min_date, while other 
blocks reset it. Maybe you can make more clear by not writing to it and using 
new variables instead (here and in the other blocks)


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@239
PS9, Line 239: 2262
The famous Impala Year-2262 problem being conceived and I get to be part of it. 
:)


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@375
PS9, Line 375: 1970.01.01
nit: 1970-01-01


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@98
PS9, Line 98: 24 * 60 * 60
While you're here, this could probably be a constant, too, given it's used in 
other places.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@133
PS9, Line 133:
Nit: I think some of these newlines could go.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@156
PS9, Line 156: static_cast(unixtime_seconds) * NANOS_PER_SEC
nit: indent 4 spaces


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@160
PS9, Line 160:   || nanos128 >  std::numeric_limits::max()) return 
false;
nit: If it doesn't fit on one line, it should have braces {}


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py@848
PS9, Line 848:   ColumnStats('ts', -1001001, 1001001, 0)
Can you add a test for the case that values don't fit into 64bit nanos?


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180
PS9, Line 180: normpath
Why is "normpath" needed here?


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180
PS9, Line 180:   table_dir = tmp_dir + "/" + 
os.path.basename(os.path.normpath(hdfs_path))
use os.path.join



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: 

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-01 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 9: Code-Review+1

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> The main problem with normalizing to UTC from Impala's standpoint is the pe
OK, thanks for the explanation for both of you!


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
> result_ shouldn't be a local variable, as we return a pointer to it. This i
Ah, I completely missed that.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes 
timestamps without UTC
> That's true, but it has a metadata field to tell the caller the desired sem
Sorry, I was really thinking about Hive and Spark.

But it doesn't really matter, since now we have this 'isAdjustedToUTC' field. 
And Zoltan also said in one of his comments that "'pure' TIMESTAMP and 
TIMESTAMP WITHOUT TIME ZONE shall not be normalized to UTC"


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: // TODO: consider optimizing this (IMPALA-8268)
 :   kudu::int128_t nanos128 =
 : static_cast(unixtime_seconds) * NANOS_PER_SEC
 : + time_.fractional_seconds();
 :
 :   if (nanos128 <  std::numeric_limits::min()
> I created a ticket for benchmarking and optimizing these new functions: IMP
OK, thanks!


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102:  QUERY
 : create table int96_nanos (ts timestamp) stored as parquet;
 : 
 :  QUERY
 : # Insert edge values as "normal" int96 timestamps that can 
represent all values.
 : set parquet_timestamp_type=INT96_NANOS;
 : insert into int96_nanos values
> Thanks for the tip!
You're welcome!



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Fri, 01 Mar 2019 15:43:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-01 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes 
timestamps without UTC
> As far as I know Parquet-MR does not do any timezone conversion and leaves
That's true, but it has a metadata field to tell the caller the desired 
semantics, which can be either UTC-normalized or not.



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Fri, 01 Mar 2019 14:11:43 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2316/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Fri, 01 Mar 2019 13:46:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-01 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalize
The main problem with normalizing to UTC from Impala's standpoint is the 
performance cost of the UTC->localtime conversion when reading it back. Doing 
the timezone conversion for all timestamps in all rows is very costly compared 
to other tasks during Parquet scanning.


http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: tha
> nit: the
Done


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
> What about deleting the member 'result_', and only have it here as a local
result_ shouldn't be a local variable, as we return a pointer to it. This is 
the expected interface by BaseColumnWriter::AppendRow() - it has no template 
for the type of the column, so it handles values with a void* that points to 
the value that should be inserted to the current row, and is expected to live 
until we step to the next row.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60
PS8, Line 60: Return
> nit: Returns
Done


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes 
timestamps without UTC
> Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka
As far as I know Parquet-MR does not do any timezone conversion and leaves this 
task to the caller, e.g. Hive.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: kudu::int128_t nanos128 =
 : static_cast(unixtime_seconds) * NANOS_PER_SEC
 : + time_.fractional_seconds();
 :
 :   if (nanos128 <  std::numeric_limits::min()
 :   || nanos128 >  std::numeric_limits::max()) return 
false;
> I think we can still avoid using int128_t.
I created a ticket for benchmarking and optimizing these new functions: 
IMPALA-8268  Performance is not too important at the moment as 
parquet_timestamp_type is a development query option and should be mainly used 
to test whether the new timestamps can be read by other Hadoop components.

I think that the most costly things in this function are 
time_.fractional_seconds() and UtcToUnixTime()'s time_.total_seconds(), as 
these need int64 integer division. These could be avoided by not using 
UtcToUnixTime and converting to nanoseconds from day_ + time_ directly, but 
this has to be done carefully near the edge values.


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102:  QUERY
 : create table int96_nanos (ts timestamp) stored as parquet;
 : 
 :  QUERY
 : # Insert edge values as "normal" int96 timestamps that can 
represent all values.
 : set parquet_timestamp_type=INT96_NANOS;
 : insert into int96_nanos values
> nit: you dont't need to start a new QUERY block for each query when you don
Thanks for the tip!



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Fri, 01 Mar 2019 13:35:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-03-01 Thread Csaba Ringhofer (Code Review)
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#9).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 537 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/9
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-28 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes 
timestamps without UTC
> Does Parquet-MR also write INT64 timestamps un-normalized?
Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka 
Instant and LocalDateTime semantics.



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 28 Feb 2019 17:17:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-28 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> Whouldn't it be better to convert to UTC? You write later that old readers
No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalized to 
UTC, only the TIMESTAMP WITH LOCAL TIME ZONE type shall be. Please see this 
document: https://goo.gl/VV88c5



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 28 Feb 2019 17:16:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-28 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
Whouldn't it be better to convert to UTC? You write later that old readers also 
assume that the data is written in UTC.


http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: tha
nit: the


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
What about deleting the member 'result_', and only have it here as a local 
variable?

Now it looks strange that we invoke a member function here and pass a member 
variable to it. Also, this way we will have some aliasing inside 
ConvertTimestamp() that might prevent some compiler optimizations, but I'm not 
really sure honestly.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60
PS8, Line 60: Return
nit: Returns


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes 
timestamps without UTC
Does Parquet-MR also write INT64 timestamps un-normalized?


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: kudu::int128_t nanos128 =
 : static_cast(unixtime_seconds) * NANOS_PER_SEC
 : + time_.fractional_seconds();
 :
 :   if (nanos128 <  std::numeric_limits::min()
 :   || nanos128 >  std::numeric_limits::max()) return 
false;
I think we can still avoid using int128_t.


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102:  QUERY
 : create table int96_nanos (ts timestamp) stored as parquet;
 : 
 :  QUERY
 : # Insert edge values as "normal" int96 timestamps that can 
represent all values.
 : set parquet_timestamp_type=INT96_NANOS;
 : insert into int96_nanos values
nit: you dont't need to start a new QUERY block for each query when you don't 
check the results.
E.g. it could be:

 
  QUERY
 create table...
 set ...
 insert into ...
 create table ...
 



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 28 Feb 2019 15:35:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2101/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 13 Feb 2019 19:18:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 8:

(1 comment)

About the last 3 patch sets:

patch set 6 changes nanos->millis/micros conversions to round towards minus 
infinity instead of the nearest integer

patch set 7 fixes a comment by Zoltan Borok-Nagy and a copy-paste error in 
patch 6

patch set 8 is a rebase + conflict resolution. Rebasing was necessary because 
of Impala-lzo changes broke the build.


About checking the written Parquet files with Parquet tool:
thanks for the tool - I used it to check the written files manually, but I 
would prefer to integrate it into Impala in a separate commit. A jira to track 
this: IMPALA-8197

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: successful and return true. If the timestamp is invalid
> It seemed premature optimization to me, so I did some measurements and look
Done



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 13 Feb 2019 19:18:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Csaba Ringhofer (Code Review)
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#8).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (tha way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 542 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/8
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 7:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2100/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 13 Feb 2019 18:32:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Csaba Ringhofer (Code Review)
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#7).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (tha way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 540 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/7
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 6:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2097/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 13 Feb 2019 18:01:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-02-13 Thread Csaba Ringhofer (Code Review)
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#6).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (tha way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 539 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/6
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-30 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 5:

Some update:
There are two design decisions under discussion at the moment, one is the type 
of rounding to use when some precision is lost, and the other is whether to 
convert out-of-range int64 nanos to NULL silently or to abort with an error in 
this case.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 30 Jan 2019 15:20:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-24 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: Overrides of this function are expected to set 'result_'
> I prefer the current solution - my guess is that adding an additional argum
It seemed premature optimization to me, so I did some measurements and looked 
at the generated assembly.

Both 'this' and the output parameter are passed in via registers.

In the current case you need less registers, but you need to access 'result_' 
via 'this' plus an offset.

In the other case, you need one more register to pass the output parameter, but 
you don't need to add an offset to 'this'.

I couldn't measure any performance difference between the two approaches.

So if performance was the only reason to do this, I'd rather choose the 
conventional way.



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 24 Jan 2019 17:13:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-24 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 5:

> About 'parquet-tools dump': it would be great to use it during
 > Impala EE tests, but adding it as a dependency would also mean +1
 > way Impala builds can break, so I am not sure at the moment.

I am very confident that we should have this extra check.

Adding parquet-tools to the build toolchain is not that difficult as it does 
not have to be installed. I created a small Maven project that runs it without 
installing it (in the conventional sense at least). The whole project consists 
of a single POM file and nothing else. Feel free to add it to Implala and 
modify it to fit your needs: https://github.com/zivanfi/parquet-tools-executor


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Thu, 24 Jan 2019 14:04:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1862/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 23 Jan 2019 21:18:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-23 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(7 comments)

About 'parquet-tools dump': it would be great to use it during Impala EE tests, 
but adding it as a dependency would also mean +1 way Impala builds can break, 
so I am not sure at the moment.

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400
PS3, Line 400: !=
> Shouldn't it be '=='?
It is intentionally !=. -1 means variable length, and plain_encoded_value_size_ 
should remain -1 in that case.

Actually this code led to DCHECK error during the writing of some decimals, 
which was not caught by the tests I ran locally. After running to issues on 
jenkins, I moved this logic Int64TimestampColumnWriterBase.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: Overrides of this function are expected to set 'result_'
> I think an output parameter would be more conventional.
I prefer the current solution - my guess is that adding an additional argument 
to a non-inline functions will make it slower, while writing to result_ is 
cheaper as 'this' is used during the function call anyway.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not 
use logical types
> Ah, yes, good point, I haven't noticed we are only dealing with LocalDateTi
Done


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@135
PS3, Line 135: (NANOS_PER_MILLI / 2)
> Might be worth to add a short comment about that we do the rounding here.
The whole rounding logic is under discussion - Impala uses rounding to microsec 
when writing Kudu, but I think that truncating towards negative infinity would 
be better. I sent an email to d...@impala.apache.org about this topic, I plan 
to wait for an agreement there before changing these functions.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@151
PS3, Line 151:   kudu::int128_t nanos128 =
 : static_cast(unixtime_seconds) * NANOS_PER_SEC
 : + time_.fractional_seconds();
 :
 :   if (nanos128 <  std::numeric_limits::min()
 :   || nanos128 >  std::numeric_limits::max()) return 
false;
> I think we can avoid using int128_t here. We now what are the min and max u
I agree, but I am waiting for a decision in rounding vs truncating, see line 
135.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728: if (iequals(value, "INT96_NANO")) {
> You are right, I think that I messed something up like running the tests wi
Done


http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift@1
PS3, Line 1:
> nit: accidental new line
Done



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 23 Jan 2019 20:50:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-23 Thread Csaba Ringhofer (Code Review)
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#5).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 1 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 515 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/5
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1853/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Wed, 23 Jan 2019 15:20:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-23 Thread Csaba Ringhofer (Code Review)
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#4).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 1 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 516 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/4
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-22 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728: if (iequals(value, "INT96_NANO")) {
> Shouldn't this typo have been caught by a test?
You are right, I think that I messed something up like running the tests with 
an older version of impala running (my last change was adding "S" to the end of 
query options). I will return soon when the tests pass.



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 22 Jan 2019 16:30:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-22 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not 
use logical types
> I mean I don't see that this function sets the converted type anywhere.
Ah, yes, good point, I haven't noticed we are only dealing with LocalDateTime 
semantics here and regardless of the precision, there is no converted type for 
that.



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 22 Jan 2019 16:01:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-22 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not 
use logical types
> Personally I'm fine with the current name, it sets the converted type when
I mean I don't see that this function sets the converted type anywhere.



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 22 Jan 2019 15:56:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-22 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not 
use logical types
> Then I think you should change the name of the function.
Personally I'm fine with the current name, it sets the converted type when it 
makes sense. I don't know how this could be described without making the 
function name overly long.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728: if (iequals(value, "INT96_NANO")) {
> It should be "INT96_NANOS"
Shouldn't this typo have been caught by a test?



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 22 Jan 2019 15:44:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-22 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400
PS3, Line 400: !=
Shouldn't it be '=='?


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: Overrides of this function are expected to set 'result_'
I think an output parameter would be more conventional.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not 
use logical types
Then I think you should change the name of the function.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@135
PS3, Line 135: (NANOS_PER_MILLI / 2)
Might be worth to add a short comment about that we do the rounding here.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@151
PS3, Line 151:   kudu::int128_t nanos128 =
 : static_cast(unixtime_seconds) * NANOS_PER_SEC
 : + time_.fractional_seconds();
 :
 :   if (nanos128 <  std::numeric_limits::min()
 :   || nanos128 >  std::numeric_limits::max()) return 
false;
I think we can avoid using int128_t here. We now what are the min and max 
unix_timeseconds that can be represented in the limited range.

And if we are at the "boundary" (e.g. 2262-04-11 23:47:16), we now what are the 
limits of the fractional_seconds (e.g. 854775807).


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728: if (iequals(value, "INT96_NANO")) {
It should be "INT96_NANOS"


http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift@1
PS3, Line 1:
nit: accidental new line



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 22 Jan 2019 15:00:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-22 Thread Zoltan Ivanfi (Code Review)
Zoltan Ivanfi has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

Would it be possible to run `parquet-tools dump` as a part of the tests to 
check whether the timestamp values can be read back correctly by parquet-mr?


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Comment-Date: Tue, 22 Jan 2019 13:00:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1837/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 21 Jan 2019 18:45:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-21 Thread Csaba Ringhofer (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#3).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 1 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 512 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/3
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..


Patch Set 2:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/1836/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 21 Jan 2019 17:54:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

2019-01-21 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/12247


Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
..

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..1) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 1 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M 
testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 512 insertions(+), 66 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/2
--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer