[jira] [Updated] (HIVE-21002) TIMESTAMP - Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21002:
-
Target Version/s: 3.2.0, 3.1.2

> TIMESTAMP - Backwards incompatible change: Hive 3.1 reads back Avro and 
> Parquet timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Result in Hive 2.x reader|Result in Hive 3.1 reader|
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.
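For a concrete run of the session above, substitute parquet for ‹format›; the sketch below assumes a Hive 2.x writer and a Hive 3.1 reader, both in America/Los_Angeles, with the expected outputs taken from the table:
{code:sql}
-- written with Hive 2.x in America/Los_Angeles
hive> create table ts_parquet (ts timestamp) stored as parquet;
hive> insert into ts_parquet values ('2018-01-01 00:00:00.000');

-- read back in the same time zone
hive> select * from ts_parquet;
-- Hive 2.x reader: 2018-01-01 00:00:00.0
-- Hive 3.1 reader: 2018-01-01 08:00:00.0  (shifted by 8 hours)
{code}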



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21361) ORC support for TIMESTAMP WITHOUT TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21361:
-
Description: Once ORC adds support for distinguishing between LocalDateTime 
and Instant semantics, Hive should add ORC support for the TIMESTAMP WITHOUT 
TIME ZONE type.  (was: Once Avro adds support for distinguishing between 
LocalDateTime and Instant semantics, Hive should add Avro support for the 
TIMESTAMP WITHOUT TIME ZONE type.)

> ORC support for TIMESTAMP WITHOUT TIME ZONE
> ---
>
> Key: HIVE-21361
> URL: https://issues.apache.org/jira/browse/HIVE-21361
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Once ORC adds support for distinguishing between LocalDateTime and Instant 
> semantics, Hive should add ORC support for the TIMESTAMP WITHOUT TIME ZONE 
> type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21360) Avro support for TIMESTAMP WITHOUT TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21360:
-
Description: Once Avro adds support for distinguishing between 
LocalDateTime and Instant semantics, Hive should add Avro support for the 
TIMESTAMP WITHOUT TIME ZONE type.  (was: Version 1.11.0 of parquet-mr will 
support distinguishing between LocalDateTime and Instant semantics. Once it is 
released, Hive can start using it to save TIMESTAMP WITHOUT TIME ZONE values to 
Parquet files.)

> Avro support for TIMESTAMP WITHOUT TIME ZONE
> 
>
> Key: HIVE-21360
> URL: https://issues.apache.org/jira/browse/HIVE-21360
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Once Avro adds support for distinguishing between LocalDateTime and Instant 
> semantics, Hive should add Avro support for the TIMESTAMP WITHOUT TIME ZONE 
> type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21358) ORC support for TIMESTAMP WITH LOCAL TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21358:
-
Description: Once ORC adds support for differentiating between 
LocalDateTime and Instant semantics, Hive should add ORC support for the 
TIMESTAMP WITH LOCAL TIME ZONE type.  (was: Once Avro adds support for 
differentiating between LocalDateTime and Instant semantics, Hive should add 
Avro support for the TIMESTAMP WITH LOCAL TIME ZONE type.)

> ORC support for TIMESTAMP WITH LOCAL TIME ZONE
> --
>
> Key: HIVE-21358
> URL: https://issues.apache.org/jira/browse/HIVE-21358
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Once ORC adds support for differentiating between LocalDateTime and Instant 
> semantics, Hive should add ORC support for the TIMESTAMP WITH LOCAL TIME 
> ZONE type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21357) Avro support for TIMESTAMP WITH LOCAL TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21357:
-
Description: Once Avro adds support for distinguishing between 
LocalDateTime and Instant semantics, Hive should add Avro support for the 
TIMESTAMP WITH LOCAL TIME ZONE type.  (was: Once Avro adds support for 
differentiating between LocalDateTime and Instant semantics, Hive should add 
Avro support for the TIMESTAMP WITH LOCAL TIME ZONE type.)

> Avro support for TIMESTAMP WITH LOCAL TIME ZONE
> ---
>
> Key: HIVE-21357
> URL: https://issues.apache.org/jira/browse/HIVE-21357
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Once Avro adds support for distinguishing between LocalDateTime and Instant 
> semantics, Hive should add Avro support for the TIMESTAMP WITH LOCAL TIME 
> ZONE type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21355) Parquet support for TIMESTAMP WITH LOCAL TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21355:
-
Description: Version 1.11.0 of parquet-mr will support distinguishing 
between LocalDateTime and Instant semantics. Once it is released, Hive can 
start using it to save TIMESTAMP WITH LOCAL TIME ZONE values to Parquet files.  
(was: Version 1.11.0 of parquet-mr will support differentiating between 
LocalDateTime and Instant semantics. Once it is released, Hive can start using 
it to save TIMESTAMP WITH LOCAL TIME ZONE values to Parquet files.)

> Parquet support for TIMESTAMP WITH LOCAL TIME ZONE
> --
>
> Key: HIVE-21355
> URL: https://issues.apache.org/jira/browse/HIVE-21355
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Version 1.11.0 of parquet-mr will support distinguishing between 
> LocalDateTime and Instant semantics. Once it is released, Hive can start 
> using it to save TIMESTAMP WITH LOCAL TIME ZONE values to Parquet files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21355) Parquet support for TIMESTAMP WITH LOCAL TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21355:
-
Description: Version 1.11.0 of parquet-mr will support differentiating 
between LocalDateTime and Instant semantics. Once it is released, Hive can 
start using it to save TIMESTAMP WITH LOCAL TIME ZONE values to Parquet files.  
(was: Version 1.11.0 of parquet-mr will support the timestamp semantics 
necessary for storing TIMESTAMP WITH LOCAL TIME ZONE values. Once it is 
released, Hive can start using it and save TIMESTAMP WITH LOCAL TIME ZONE 
values to Parquet files.)

> Parquet support for TIMESTAMP WITH LOCAL TIME ZONE
> --
>
> Key: HIVE-21355
> URL: https://issues.apache.org/jira/browse/HIVE-21355
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Version 1.11.0 of parquet-mr will support differentiating between 
> LocalDateTime and Instant semantics. Once it is released, Hive can start 
> using it to save TIMESTAMP WITH LOCAL TIME ZONE values to Parquet files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21349) TIMESTAMP WITHOUT TIME ZONE

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21349:
-
Description: 
As specified in the [design doc for TIMESTAMP 
types|https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types],
 the TIMESTAMP WITHOUT TIME ZONE type shall behave like the 
[LocalDateTime|https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html]
 class of Java, i.e., each value is a recording of what can be seen on a 
calendar and a clock hanging on the wall, for example "1969-07-20 16:17:39". It 
can be decomposed into year, month, day, hour, minute and seconds fields, but 
with no time zone information available, it does not correspond to any specific 
point in time.

This behaviour is consistent with the SQL standard (revisions 2003 and higher).

  was:
As specified in the [design doc for TIMESTAMP 
types|https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types],
 the TIMESTAMP WITHOUT TIME ZONE type shall behave like the LocalDateTime class 
of Java, i.e., each value is a recording of what can be seen on a calendar and 
a clock hanging on the wall, for example "1969-07-20 16:17:39". It can be 
decomposed into year, month, day, hour, minute and seconds fields, but with no 
time zone information available, it does not correspond to any specific point 
in time.

This behaviour is consistent with the SQL standard (revisions 2003 and higher).


> TIMESTAMP WITHOUT TIME ZONE
> ---
>
> Key: HIVE-21349
> URL: https://issues.apache.org/jira/browse/HIVE-21349
> Project: Hive
>  Issue Type: Task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> As specified in the [design doc for TIMESTAMP 
> types|https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types],
>  the TIMESTAMP WITHOUT TIME ZONE type shall behave like the 
> [LocalDateTime|https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html]
>  class of Java, i.e., each value is a recording of what can be seen on a 
> calendar and a clock hanging on the wall, for example "1969-07-20 16:17:39". 
> It can be decomposed into year, month, day, hour, minute and seconds fields, 
> but with no time zone information available, it does not correspond to any 
> specific point in time.
> This behaviour is consistent with the SQL standard (revisions 2003 and 
> higher).
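A sketch of what LocalDateTime semantics look like from SQL; the table name ts_table and the use of the Hive 3.x session property hive.local.time.zone are assumptions for illustration, not part of this issue:
{code:sql}
-- the stored wall-clock value must not shift with the observer's time zone
hive> set hive.local.time.zone=America/Los_Angeles;
hive> select ts from ts_table;   -- 1969-07-20 16:17:39
hive> set hive.local.time.zone=Europe/Paris;
hive> select ts from ts_table;   -- 1969-07-20 16:17:39 (unchanged)
{code}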



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21002) TIMESTAMP - Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-02-28 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21002:
-
Summary: TIMESTAMP - Backwards incompatible change: Hive 3.1 reads back 
Avro and Parquet timestamps written by Hive 2.x incorrectly  (was: Backwards 
incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by 
Hive 2.x incorrectly)

> TIMESTAMP - Backwards incompatible change: Hive 3.1 reads back Avro and 
> Parquet timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Result in Hive 2.x reader|Result in Hive 3.1 reader|
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21002) Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-02-19 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772078#comment-16772078
 ] 

Zoltan Ivanfi commented on HIVE-21002:
--

As we discussed on the Hive mailing list, I modified the sub-tasks of this JIRA 
to reflect the new solution we agreed upon: The historical (backwards- and 
forwards-compatible) way of handling timestamps should be restored while 
keeping the new semantics at the same time. The details can be found in the 
descriptions of the sub-tasks.

> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Result in Hive 2.x reader|Result in Hive 3.1 reader|
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21291) Restore historical way of handling timestamps in Avro while keeping the new semantics at the same time

2019-02-19 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21291:
-
Description: 
This sub-task is for implementing the Avro-specific parts of the following plan:

h1. Problem

Historically, the semantics of the TIMESTAMP type in Hive depended on the file 
format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
_Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a text 
SerDe had _LocalDateTime_ semantics.

The Hive community wanted to get rid of this inconsistency and have 
_LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
leads to the desired new semantics, it also leads to incorrect results when new 
Hive versions read timestamps written by old Hive versions or when old Hive 
versions or any other component not aware of this change (including legacy 
Impala and Spark versions) read timestamps written by new Hive versions.

h1. Solution

To work around this issue, Hive *should restore the practice of normalizing to 
UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary SerDe. 
In itself, this would restore the historical _Instant_ semantics, which is 
undesirable. In order to achieve the desired _LocalDateTime_ semantics in spite 
of normalizing to UTC, newer Hive versions should record the session-local 
time zone in the file metadata fields intended for arbitrary key-value 
storage.

When reading back files with this time zone metadata, newer Hive versions (or 
any other new component aware of this extra metadata) can achieve 
_LocalDateTime_ semantics by *converting from UTC to the saved time zone 
(instead of to the local time zone)*. Legacy components that are unaware of the 
new metadata can read the files without any problem and the timestamps will 
show the historical Instant behaviour to them.

  was:
This sub-task is for implementing the Avro-specific parts of the following plan:

h1. Problem

Historically, the semantics of the TIMESTAMP type in Hive depended on the file 
format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
_Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a text 
SerDe had _LocalDateTime_ semantics.

The Hive community wanted to get rid of this inconsistency and have 
_LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
leads to the desired new semantics, it also leads to incorrect results when new 
Hive versions read timestamps written by old Hive versions or when old Hive 
versions or any other component not aware of this change (including legacy 
Impala and Spark versions) read timestamps written by new Hive versions.

h1. Solution

To work around this issue, Hive *should restore the practice of normalizing to 
UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary SerDe. 
In itself, this would restore the historical _Instant_ semantics, which is 
undesirable. In order to achieve the desired _LocalDateTime_ semantics in spite 
of normalizing to UTC, newer Hive versions should record the session-local 
time zone in the file metadata fields intended for arbitrary key-value 
storage.

 When reading back files with this time zone metadata, newer Hive versions (or 
any other new component aware of this extra metadata) can achieve 
_LocalDateTime_ semantics by *converting from UTC to the saved time zone 
(instead of to the local time zone)*. Legacy components that are unaware of the 
new metadata can read the files without any problem and the timestamps will 
show the historical Instant behaviour to them.


> Restore historical way of handling timestamps in Avro while keeping the new 
> semantics at the same time
> --
>
> Key: HIVE-21291
> URL: https://issues.apache.org/jira/browse/HIVE-21291
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> This sub-task is for implementing the Avro-specific parts of the following 
> plan:
> h1. Problem
> Historically, the semantics of the TIMESTAMP type in Hive depended on the 
> file format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
> _Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a 
> text SerDe had _LocalDateTime_ semantics.
> The Hive community wanted to get rid of this inconsistency and have 
> _LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
> well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
> leads to the desired new semantics, it also leads to incorrect results when 
> new Hive versions read timestamps written by old Hive versions or when old 
> Hive versions or any other component not aware of this change (including 
> legacy Impala and Spark versions) read timestamps written by new Hive 
> versions.

[jira] [Updated] (HIVE-21291) Restore historical way of handling timestamps in Avro while keeping the new semantics at the same time

2019-02-19 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21291:
-
Description: 
This sub-task is for implementing the Avro-specific parts of the following plan:

h1. Problem

Historically, the semantics of the TIMESTAMP type in Hive depended on the file 
format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
_Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a text 
SerDe had _LocalDateTime_ semantics.

The Hive community wanted to get rid of this inconsistency and have 
_LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
leads to the desired new semantics, it also leads to incorrect results when new 
Hive versions read timestamps written by old Hive versions or when old Hive 
versions or any other component not aware of this change (including legacy 
Impala and Spark versions) read timestamps written by new Hive versions.

h1. Solution

To work around this issue, Hive *should restore the practice of normalizing to 
UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary SerDe. 
In itself, this would restore the historical _Instant_ semantics, which is 
undesirable. In order to achieve the desired _LocalDateTime_ semantics in spite 
of normalizing to UTC, newer Hive versions should record the session-local 
time zone in the file metadata fields intended for arbitrary key-value 
storage.

 When reading back files with this time zone metadata, newer Hive versions (or 
any other new component aware of this extra metadata) can achieve 
_LocalDateTime_ semantics by *converting from UTC to the saved time zone 
(instead of to the local time zone)*. Legacy components that are unaware of the 
new metadata can read the files without any problem and the timestamps will 
show the historical Instant behaviour to them.

  was:
This sub-task is for implementing the Avro-specific parts of the following plan:

h1. Problem

Historically, the semantics of the TIMESTAMP type in Hive depended on the file 
format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
_Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a text 
SerDe had _LocalDateTime_ semantics.

The Hive community wanted to get rid of this inconsistency and have 
_LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
leads to the desired new semantics, it also leads to incorrect results when new 
Hive versions read timestamps written by old Hive versions or when old Hive 
versions or any other component not aware of this change (including legacy 
Impala and Spark versions) read timestamps written by new Hive versions.

h1. Solution

To work around this issue, Hive *should restore the practice of normalizing to 
UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary SerDe. 
In itself, this would restore the historical _Instant_ semantics, which is 
undesirable. In order to achieve the desired _LocalDateTime_ semantics in spite 
of normalizing to UTC, newer Hive versions should record the session-local 
time zone in the file metadata fields intended for arbitrary key-value 
storage.



> Restore historical way of handling timestamps in Avro while keeping the new 
> semantics at the same time
> --
>
> Key: HIVE-21291
> URL: https://issues.apache.org/jira/browse/HIVE-21291
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> This sub-task is for implementing the Avro-specific parts of the following 
> plan:
> h1. Problem
> Historically, the semantics of the TIMESTAMP type in Hive depended on the 
> file format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
> _Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a 
> text SerDe had _LocalDateTime_ semantics.
> The Hive community wanted to get rid of this inconsistency and have 
> _LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
> well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
> leads to the desired new semantics, it also leads to incorrect results when 
> new Hive versions read timestamps written by old Hive versions or when old 
> Hive versions or any other component not aware of this change (including 
> legacy Impala and Spark versions) read timestamps written by new Hive 
> versions.
> h1. Solution
> To work around this issue, Hive *should restore the practice of normalizing 
> to UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary 
> SerDe.
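The read-side conversion of this plan corresponds to what Hive's existing from_utc_timestamp() UDF computes; a minimal sketch, assuming the writer time zone America/Los_Angeles was recovered from the file's key-value metadata and the stored UTC value is 2018-01-01 08:00:00:
{code:sql}
hive> select from_utc_timestamp('2018-01-01 08:00:00', 'America/Los_Angeles');
-- 2018-01-01 00:00:00  (the wall-clock value the writer saw)
{code}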

[jira] [Resolved] (HIVE-21003) Reinstate Avro timestamp conversion between HS2 time zone and UTC

2019-02-19 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi resolved HIVE-21003.
--
Resolution: Won't Do

We agreed upon a different solution, therefore closing this sub-task as "won't 
do".

> Reinstate Avro timestamp conversion between HS2 time zone and UTC
> -
>
> Key: HIVE-21003
> URL: https://issues.apache.org/jira/browse/HIVE-21003
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20980) Reinstate Parquet timestamp conversion between HS2 time zone and UTC

2019-02-19 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-20980:
-
Resolution: Won't Do
Status: Resolved  (was: Patch Available)

We agreed upon a different solution, therefore closing this sub-task as "won't 
do".

> Reinstate Parquet timestamp conversion between HS2 time zone and UTC
> 
>
> Key: HIVE-20980
> URL: https://issues.apache.org/jira/browse/HIVE-20980
> Project: Hive
>  Issue Type: Sub-task
>  Components: File Formats
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
> Attachments: HIVE-20980.1.patch, HIVE-20980.2.patch, 
> HIVE-20980.2.patch
>
>
> With HIVE-20007, Parquet timestamps became timezone-agnostic. This means that 
> timestamps written after the change are read exactly as they were written; 
> but timestamps stored before this change are effectively converted from the 
> writing HS2 server time zone to GMT time zone. This patch reinstates the 
> original behavior: timestamps are converted to UTC before write and from UTC 
> before read.
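The reinstated conversions behave like Hive's to_utc_timestamp()/from_utc_timestamp() UDFs; a standalone round-trip sketch (not the patch's actual code path):
{code:sql}
hive> select to_utc_timestamp('2018-01-01 00:00:00', 'America/Los_Angeles');
-- 2018-01-01 08:00:00  (write-side normalization to UTC)
hive> select from_utc_timestamp('2018-01-01 08:00:00', 'America/Los_Angeles');
-- 2018-01-01 00:00:00  (read-side conversion back)
{code}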



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21002) Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-01-14 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21002:
-
Description: 
Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
incorrectly. To demonstrate this problem, create an example dataset using
Hive 2.x in the America/Los_Angeles time zone:
{code:sql}
hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
{code}
Querying this table by issuing
{code:sql}
hive> select * from ts_‹format›;
{code}
from different time zones using different versions of Hive and different
storage formats gives the following results:
|‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Hive 2.x reader|Hive 3.1 reader|
|Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
|Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
|Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
|Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|

*Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
the ORC reader was modified to adjust timestamps to retain backwards
compatibility. Textfile behaviour has not changed, because its processing
involves parsing and formatting instead of proper serializing and
deserializing, so textfile timestamps inherently had LocalDateTime semantics
even in Hive 2.x.

  was:
Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
incorrectly. To demonstrate this problem, create an example dataset using
Hive 2.x in the America/Los_Angeles time zone:
{code:sql}
hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
{code}
Querying this table by issuing
{code:sql}
hive> select * from ts_‹format›;
{code}
from different time zones using different versions of Hive and different
storage formats gives the following results:
|‹format›|Writer time zone|Reader time zone|Hive 2.x|Hive 3.1|
|Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
|Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
|Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
|Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|

*Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
the ORC reader was modified to adjust timestamps to retain backwards
compatibility. Textfile behaviour has not changed, because its processing
involves parsing and formatting instead of proper serializing and
deserializing, so textfile timestamps inherently had LocalDateTime semantics
even in Hive 2.x.


> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Hive 2.x reader|Hive 3.1 reader|
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.

[jira] [Commented] (HIVE-21002) Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-01-14 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742257#comment-16742257
 ] 

Zoltan Ivanfi commented on HIVE-21002:
--

Hive 3.1 does return "2018-01-01 *00*:00:00.0" for new files, because it writes 
and reads without normalizing to UTC, which is different from what Hive 2.x 
did. This is exactly what causes Hive 3.1 to return "2018-01-01 *08*:00:00.0" 
for a file written by Hive 2.x, because that version normalized the timestamp 
to UTC before writing it. Since there already exist huge amounts of data 
written using Hive 2.x, Hive 3.x should remain capable of reading that existing 
data back correctly.

Even if it were possible to detect the version of Hive that wrote a file, 
adding another workaround based on it would not solve the interoperability 
problem. Users may move data between older and newer Hive versions or have 
other legacy components that read timestamps from Parquet. These older 
applications do not contain the necessary logic to deal with Hive 3.1 
semantics; they build on the assumption that timestamps written by any version 
of Hive are normalized to UTC.
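The 8-hour shift follows directly from this asymmetry; a worked trace, reusing the ts_parquet table from the example session:
{code:sql}
-- Hive 2.x write path, session in America/Los_Angeles (UTC-8 in January):
--   wall clock '2018-01-01 00:00:00' -> normalized to UTC -> stored as 08:00:00
-- Hive 3.1 read path: no reverse conversion, the stored value is shown as-is
hive> select ts from ts_parquet;
-- 2018-01-01 08:00:00.0
{code}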

> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Time zone|Hive 2.x|Hive 3.1|
> |Avro and Parquet|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21002) Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-01-14 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21002:
-
Description: 
Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
incorrectly. To demonstrate this problem, create an example dataset using
Hive 2.x in the America/Los_Angeles time zone:
{code:sql}
hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
{code}
Querying this table by issuing
{code:sql}
hive> select * from ts_‹format›;
{code}
from different time zones using different versions of Hive and different
storage formats gives the following results:
|‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Result in Hive 2.x reader|Result in Hive 3.1 reader|
|Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
|Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
|Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
|Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|

*Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
the ORC reader was modified to adjust timestamps to retain backwards
compatibility. Textfile behaviour has not changed, because its processing
involves parsing and formatting instead of proper serializing and
deserializing, so textfile timestamps inherently had LocalDateTime semantics
even in Hive 2.x.

  was:
Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
incorrectly. To demonstrate this problem, create an example dataset using
Hive 2.x in the America/Los_Angeles time zone:
{code:sql}
hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
{code}
Querying this table by issuing
{code:sql}
hive> select * from ts_‹format›;
{code}
from different time zones using different versions of Hive and different
storage formats gives the following results:
|‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Hive 2.x reader|Hive 3.1 reader|
|Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
|Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
|Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
|Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|

*Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
the ORC reader was modified to adjust timestamps to retain backwards
compatibility. Textfile behaviour has not changed, because its processing
involves parsing and formatting instead of proper serializing and
deserializing, so textfile timestamps inherently had LocalDateTime semantics
even in Hive 2.x.


> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Writer time zone (in Hive 2.x)|Reader time zone|Result in Hive 2.x reader|Result in Hive 3.1 reader|
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.

[jira] [Updated] (HIVE-21002) Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly

2019-01-14 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21002:
-
Description: 
Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
incorrectly. To demonstrate this problem, create an example dataset using
Hive 2.x in the America/Los_Angeles time zone:
{code:sql}
hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
{code}
Querying this table by issuing
{code:sql}
hive> select * from ts_‹format›;
{code}
from different time zones using different versions of Hive and different
storage formats gives the following results:
|‹format›|Writer time zone|Reader time zone|Hive 2.x|Hive 3.1|
|Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
|Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
|Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
|Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|

*Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
the ORC reader was modified to adjust timestamps to retain backwards
compatibility. Textfile behaviour has not changed, because its processing
involves parsing and formatting instead of proper serializing and
deserializing, so textfile timestamps inherently had LocalDateTime semantics
even in Hive 2.x.

  was:
Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
incorrectly. To demonstrate this problem, create an example dataset using
Hive 2.x in the America/Los_Angeles time zone:
{code:sql}
hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
{code}
Querying this table by issuing
{code:sql}
hive> select * from ts_‹format›;
{code}
from different time zones using different versions of Hive and different
storage formats gives the following results:
|‹format›|Time zone|Hive 2.x|Hive 3.1|
|Avro and Parquet|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
|Avro and Parquet|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
|Textfile and ORC|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
|Textfile and ORC|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|

*Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
the ORC reader was modified to adjust timestamps to retain backwards
compatibility. Textfile behaviour has not changed, because its processing
involves parsing and formatting instead of proper serializing and
deserializing, so textfile timestamps inherently had LocalDateTime semantics
even in Hive 2.x.


> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --
>
> Key: HIVE-21002
> URL: https://issues.apache.org/jira/browse/HIVE-21002
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x
> incorrectly. To demonstrate this problem, create an example dataset using
> Hive 2.x in the America/Los_Angeles time zone:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different
> storage formats gives the following results:
> |‹format›|Writer time zone|Reader time zone|Hive 2.x|Hive 3.1|
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because
> the ORC reader was modified to adjust timestamps to retain backwards
> compatibility. Textfile behaviour has not changed, because its processing
> involves parsing and formatting instead of proper serializing and
> deserializing, so textfile timestamps inherently had LocalDateTime semantics
> even in Hive 2.x.

[jira] [Comment Edited] (HIVE-20980) Reinstate Parquet timestamp conversion between HS2 time zone and UTC

2019-01-14 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742233#comment-16742233
 ] 

Zoltan Ivanfi edited comment on HIVE-20980 at 1/14/19 3:54 PM:
---

[~jcamachorodriguez] The addition of session-local time zones was orthogonal to 
the semantics change and it seemed to make sense to restore the timezone-aware 
semantics based on the session-local time zone rather than the server time 
zone. That being said, I do not have a strong preference towards either one, so 
if you prefer one over the other, we are fine with your choice.

There is an isAdjustedToUTC parameter in parquet-format indeed, which will be 
made available in the upcoming parquet-mr 1.11.0 release. It is also one of the 
reasons why I would prefer the TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE types 
to behave differently for Parquet. The isAdjustedToUTC parameter annotates int64 
timestamps, while previously we used int96 timestamps. Writing int64 timestamps 
is a breaking change in itself, so it should only be done at the user's 
explicit request. However, a configuration switch would not suffice for this 
purpose, because the necessity of writing backwards-compatible int96 timestamps 
for any single table would prevent every other table from using the new int64 
timestamps as well.

At the same time, introducing new semantics for timestamps breaks the existing 
rule that an int96 written by Impala is LocalDateTime but an int96 written by 
Hive or Spark is Instant. To prevent further confusion, the new semantics 
should never be written into int96 timestamps, only int64 ones, because the 
latter allow saving semantics metadata in the isAdjustedToUTC type parameter.

Having the old TIMESTAMP type behave in the legacy way and writing only int64 
timestamps with the new TIMESTAMP WITH LOCAL TIME ZONE type resolves these two 
problems in a nice way. (Please see [this 
appendix|https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.gonr2yqv3e77]
 of the proposal for details.) It is true that TIMESTAMP will behave 
differently between different file formats again, but that inconsistency has 
historically been a part of Hive and fixing that would be a breaking change.


was (Author: zi):
[~jcamachorodriguez] The addition of session-local time zones was orthogonal to 
the semantics change and it seemed to make sense to restore the timezone-aware 
semantics based on the session-local time zone rather than the server time 
zone. That being said, I do not have a strong preference towards either one, so 
if you prefer one over the other, we are fine with your choice.

There is an isAdjustedToUTC parameter in parquet-format indeed, which will be 
made available in the upcoming parquet-mr 1.11.0 release. It is also one of the 
reasons why I would prefer the TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE types 
to behave differently for Parquet. The isAdjustedToUTC parameter annotates int64 
timestamps, while previously we used int96 timestamps. Writing int64 timestamps 
is a breaking change in itself, so it should only be done at the user's 
explicit request. However, a configuration switch would not suffice for this 
purpose, because the necessity of writing backwards-compatible int96 timestamps 
for any single table would prevent every other table from using the new int64 
timestamps as well.

At the same time, introducing new semantics for timestamps breaks the existing 
rule that an int96 written by Impala is LocalDateTime but an int96 written by 
Hive or Spark is Instant. To prevent further confusion, the new semantics 
should never be written into int96 timestamps, only int64 ones, because the 
latter allow saving semantics metadata in the isAdjustedToUTC type parameter.

Handling the old TIMESTAMP type behave in the legacy way and writing only int64 
timestamps with new TIMESTAMP WITH LOCAL TIME ZONE type resolves these two 
problems in a nice way. (Please see [this 
appendix|https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.gonr2yqv3e77]
 of the proposal for details.) It is true that TIMESTAMP will behave 
differently between different file formats again, but that inconsistency has 
historically been a part of Hive and fixing that would be a breaking change.

> Reinstate Parquet timestamp conversion between HS2 time zone and UTC
> 
>
> Key: HIVE-20980
> URL: https://issues.apache.org/jira/browse/HIVE-20980
> Project: Hive
>  Issue Type: Sub-task
>  Components: File Formats
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
> Attachments: HIVE-20980.1.patch, HIVE-20980.2.patch, 
> HIVE-20980.2.patch
>
>
> With HIVE-20007, Parquet timestamps became timezone-agnostic. This means that 
> timestamps written after the change are read exactly as they were written; 
> but timestamps stored before this change are effectively converted from the 
> writing HS2 server time zone to GMT time zone. This patch reinstates the 
> original behavior: timestamps are converted to UTC before write and from UTC 
> before read.

[jira] [Commented] (HIVE-20980) Reinstate Parquet timestamp conversion between HS2 time zone and UTC

2019-01-14 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742233#comment-16742233
 ] 

Zoltan Ivanfi commented on HIVE-20980:
--

[~jcamachorodriguez] The addition of session-local time zones was orthogonal to 
the semantics change and it seemed to make sense to restore the timezone-aware 
semantics based on the session-local time zone rather than the server time 
zone. That being said, I do not have a strong preference towards either one, so 
if you prefer one over the other, we are fine with your choice.

There is an isAdjustedToUTC parameter in parquet-format indeed, which will be 
made available in the upcoming parquet-mr 1.11.0 release. It is also one of the 
reasons why I would prefer the TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE types 
to behave differently for Parquet. The isAdjustedToUTC parameter annotates int64 
timestamps, while previously we used int96 timestamps. Writing int64 timestamps 
is a breaking change in itself, so it should only be done at the user's 
explicit request. However, a configuration switch would not suffice for this 
purpose, because the necessity of writing backwards-compatible int96 timestamps 
for any single table would prevent every other table from using the new int64 
timestamps as well.

At the same time, introducing new semantics for timestamps breaks the existing 
rule that an int96 written by Impala is LocalDateTime but an int96 written by 
Hive or Spark is Instant. To prevent further confusion, the new semantics 
should never be written into int96 timestamps, only int64 ones, because the 
latter allow saving semantics metadata in the isAdjustedToUTC type parameter.

Having the old TIMESTAMP type behave in the legacy way and writing only int64 
timestamps with the new TIMESTAMP WITH LOCAL TIME ZONE type resolves these two 
problems in a nice way. (Please see [this 
appendix|https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.gonr2yqv3e77]
 of the proposal for details.) It is true that TIMESTAMP will behave 
differently between different file formats again, but that inconsistency has 
historically been a part of Hive and fixing that would be a breaking change.
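Schematically, the two physical encodings differ as follows (a sketch in the schema notation used elsewhere in these issues; field names are illustrative and exact rendering varies by tool):
{noformat}
optional int96 ts_legacy                                             -- no semantics metadata
optional int64 ts_instant (TIMESTAMP(MICROS,isAdjustedToUTC=true))   -- Instant
optional int64 ts_local   (TIMESTAMP(MICROS,isAdjustedToUTC=false))  -- LocalDateTime
{noformat}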

> Reinstate Parquet timestamp conversion between HS2 time zone and UTC
> 
>
> Key: HIVE-20980
> URL: https://issues.apache.org/jira/browse/HIVE-20980
> Project: Hive
>  Issue Type: Sub-task
>  Components: File Formats
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
> Attachments: HIVE-20980.1.patch, HIVE-20980.2.patch, 
> HIVE-20980.2.patch
>
>
> With HIVE-20007, Parquet timestamps became timezone-agnostic. This means that 
> timestamps written after the change are read exactly as they were written; 
> but timestamps stored before this change are effectively converted from the 
> writing HS2 server time zone to GMT time zone. This patch reinstates the 
> original behavior: timestamps are converted to UTC before write and from UTC 
> before read.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21117) A day may belong to a different year than the week it is a part of

2019-01-11 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-21117:
-
Description: 
When using the year() and weekofyear() functions in a query, their results are 
2018 and 1, respectively, for the day '2018-12-31'.

The year() function returns 2018 for the input '2018-12-31', because that day 
belongs to the year 2018.

The weekofyear() function returns 1 for the input '2018-12-31', because that 
day belongs to the first week of 2019.

Both functions provide sensible results on their own, but when combined, the 
result is wrong, because '2018-12-31' does not belong to week 1 of 2018.

I suggest adding a new function yearofweek() that would return 2019 for the 
input '2018-12-31' and adding a warning to the documentation of the 
weekofyear() function about this problem.

  was:
When using the year() and weekofyear() functions in a query, their results are 
2018 and 1, respectively, for the day '2018-12-31'.

The year() function returns 2018 for the input '2018-12-31', because that day 
belongs to the year 2018.

The weekofyear() function returns 1 for the input '2018-12-31', because that 
day belongs to the first week of 2019.

Both functions provide sensible results on their own, but when combined, the 
result is wrong, because '2018-12-31' does not belong to week 1 of 2018.

I suggest adding a new function yearofweek() and adding a warning to the 
documentation of the weekofyear() function about this problem.


> A day may belong to a different year than the week it is a part of
> --
>
> Key: HIVE-21117
> URL: https://issues.apache.org/jira/browse/HIVE-21117
> Project: Hive
>  Issue Type: New Feature
>Affects Versions: 2.3.4
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> When using the year() and weekofyear() functions in a query, their results are 
> 2018 and 1, respectively, for the day '2018-12-31'.
> The year() function returns 2018 for the input '2018-12-31', because that day 
> belongs to the year 2018.
> The weekofyear() function returns 1 for the input '2018-12-31', because that 
> day belongs to the first week of 2019.
> Both functions provide sensible results on their own, but when combined, the 
> result is wrong, because '2018-12-31' does not belong to week 1 of 2018.
> I suggest adding a new function yearofweek() that would return 2019 for the 
> input '2018-12-31' and adding a warning to the documentation of the 
> weekofyear() function about this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-17843) UINT32 Parquet columns are handled as signed INT32-s, silently reading incorrect data

2018-02-14 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364282#comment-16364282
 ] 

Zoltan Ivanfi edited comment on HIVE-17843 at 2/14/18 3:34 PM:
---

Sorry for the late answer. The simplest query suffices, e.g., a SELECT * on a 
table that contains a single column and a single row. But the Parquet file has 
to have an unsigned integer in it and Hive does not write unsigned ints. 
[~gszadovszky] could you provide an example Parquet file with an unsigned int 
that has its first bit set? Thanks!


was (Author: zi):
Sorry for the late answer. The simplest query suffices, e.g., a SELECT * on a 
table that contains a single column and a single row. But the Parquet file has 
to have an unsigned integer in it and Hive does not write unsigned ints. 
[~gszadovszky] could you supply an example Parquet file with an unsigned int 
that has its first bit set? Thanks!

> UINT32 Parquet columns are handled as signed INT32-s, silently reading 
> incorrect data
> -
>
> Key: HIVE-17843
> URL: https://issues.apache.org/jira/browse/HIVE-17843
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Janaki Lahorani
>Priority: Major
>
> An unsigned 32 bit Parquet column, such as
> {noformat}
> optional int32 uint_32_col (UINT_32)
> {noformat}
> is read by Hive as if it were signed, leading to incorrect results.
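
To make the failure mode concrete, here is a sketch (with a made-up value) of 
how the two's-complement reinterpretation corrupts such data:

{noformat}
stored UINT_32 value:      3000000000 (0xB2D05E00, most significant bit set)
read back as signed int32: -1294967296 (i.e. 3000000000 - 2^32)
{noformat}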



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-17843) UINT32 Parquet columns are handled as signed INT32-s, silently reading incorrect data

2018-02-14 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364282#comment-16364282
 ] 

Zoltan Ivanfi edited comment on HIVE-17843 at 2/14/18 3:34 PM:
---

Sorry for the late answer. The simplest query suffices, e.g., a SELECT * on a 
table that contains a single column and a single row. But the Parquet file has 
to have an unsigned integer in it and Hive does not write unsigned ints. 
[~gszadovszky] could you provide an example Parquet file with an unsigned int 
that has its most significant bit set? Thanks!


was (Author: zi):
Sorry for the late answer. The simplest query suffices, e.g., a SELECT * on a 
table that contains a single column and a single row. But the Parquet file has 
to have an unsigned integer in it and Hive does not write unsigned ints. 
[~gszadovszky] could you provide an example Parquet file with an unsigned int 
that has its first bit set? Thanks!

> UINT32 Parquet columns are handled as signed INT32-s, silently reading 
> incorrect data
> -
>
> Key: HIVE-17843
> URL: https://issues.apache.org/jira/browse/HIVE-17843
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Janaki Lahorani
>Priority: Major
>
> An unsigned 32 bit Parquet column, such as
> {noformat}
> optional int32 uint_32_col (UINT_32)
> {noformat}
> is read by Hive as if it were signed, leading to incorrect results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17843) UINT32 Parquet columns are handled as signed INT32-s, silently reading incorrect data

2018-02-14 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364282#comment-16364282
 ] 

Zoltan Ivanfi commented on HIVE-17843:
--

Sorry for the late answer. The simplest query suffices, e.g., a SELECT * on a 
table that contains a single column and a single row. But the Parquet file has 
to have an unsigned integer in it and Hive does not write unsigned ints. 
[~gszadovszky] could you supply an example Parquet file with an unsigned int 
that has its first bit set? Thanks!

> UINT32 Parquet columns are handled as signed INT32-s, silently reading 
> incorrect data
> -
>
> Key: HIVE-17843
> URL: https://issues.apache.org/jira/browse/HIVE-17843
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Janaki Lahorani
>Priority: Major
>
> An unsigned 32 bit Parquet column, such as
> {noformat}
> optional int32 uint_32_col (UINT_32)
> {noformat}
> is read by Hive as if it were signed, leading to incorrect results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-12767) Implement table property to address Parquet int96 timestamp bug

2018-02-02 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi resolved HIVE-12767.
--
Resolution: Won't Fix

Hive already has a workaround based on the writer metadata. This issue was 
about a more sophisticated and complicated solution based on table properties. 
But since the Spark community decided to implement a workaround similar to the 
one that already exists in Hive (based on the writer metadata), the solution 
using table properties is no longer needed.

> Implement table property to address Parquet int96 timestamp bug
> ---
>
> Key: HIVE-12767
> URL: https://issues.apache.org/jira/browse/HIVE-12767
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Sergio Peña
>Assignee: Barna Zsombor Klara
>Priority: Major
> Attachments: HIVE-12767.10.patch, HIVE-12767.11.patch, 
> HIVE-12767.3.patch, HIVE-12767.4.patch, HIVE-12767.5.patch, 
> HIVE-12767.6.patch, HIVE-12767.7.patch, HIVE-12767.8.patch, 
> HIVE-12767.9.patch, TestNanoTimeUtils.java
>
>
> Parquet timestamps using INT96 are not compatible with other tools, like 
> Impala, because Hive adjusts time zone values differently than Impala does.
> To address such issues, a new table property (parquet.mr.int96.write.zone) 
> must be used in Hive that determines what time zone to use when writing and 
> reading timestamps from Parquet.
> The following is the exit criteria for the fix:
> * Hive will read Parquet MR int96 timestamp data and adjust values using a 
> time zone from a table property, if set, or using the local time zone if it 
> is absent. No adjustment will be applied to data written by Impala.
> * Hive will write Parquet int96 timestamps using a time zone adjustment from 
> the same table property, if set, or using the local time zone if it is 
> absent. This keeps the data in the table consistent.
> * New tables created by Hive will set the table property to UTC if the global 
> option to set the property for new tables is enabled.
> ** Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set 
> the property unless the global setting to do so is enabled.
> ** Tables created using CREATE TABLE LIKE <existing table> will copy the 
> property of the table that is copied.
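
Had it been implemented, usage of the proposed property might have looked like 
the following sketch (illustrative only, since the issue was resolved as Won't 
Fix):

{code:sql}
hive> create table parquet_ts (ts timestamp) stored as parquet
    > tblproperties ('parquet.mr.int96.write.zone'='UTC');
{code}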



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17844) Hive JDBC driver converts floats to doubles, changing values slightly

2017-10-20 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213072#comment-16213072
 ] 

Zoltan Ivanfi commented on HIVE-17844:
--

[~akolb], yes, that's true, but it doesn't represent the initial user-supplied 
number faithfully any more.

If the user supplies the value 125.32, it cannot be represented precisely as a 
float, but the rounding rules applied when displaying that float will still 
show 125.32. However, when that float is converted to a double, different, 
more precise rounding rules start to apply, resulting in 125.31999969482422 
getting displayed instead of 125.32. It is true that the double value is 
precisely the same as the float value, and neither of them is precisely the 
same as the user-supplied value, yet the float gets displayed as the original 
value while the double does not.

If 125.32 is stored directly as a double, it is approximated more precisely 
and will display correctly. When it is stored as a float and then converted to 
a double, this is no longer true.
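
The effect should be reproducible directly in a Hive session (hypothetical 
output; the second column shows Java's shortest-round-trip rendering of the 
double):

{code:sql}
hive> select cast(125.32 as float), cast(cast(125.32 as float) as double);
125.32	125.31999969482422
{code}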

> Hive JDBC driver converts floats to doubles, changing values slightly
> -
>
> Key: HIVE-17844
> URL: https://issues.apache.org/jira/browse/HIVE-17844
> Project: Hive
>  Issue Type: Bug
>  Components: JDBC
>Reporter: Zoltan Ivanfi
>
> When querying data through Hive's JDBC driver, float values are automatically 
> converted to doubles and will slightly differ from values seen when directly 
> querying the data using a textual SQL interface.
> Please note that by slight difference I don't only mean additional zeroes at 
> the end, but an actual numeric difference in the displayed value. This is the 
> result of the different precision used when displaying double values, as 
> discussed 
> [here|https://stackoverflow.com/questions/17504833/why-converting-from-float-to-double-changes-the-value],
>  for example.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17844) Hive JDBC driver converts floats to doubles, changing values slightly

2017-10-19 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-17844:
-
Description: 
When querying data through Hive's JDBC driver, float values are automatically 
converted to doubles and will slightly differ from values seen when directly 
querying the data using a textual SQL interface.

Please note that by slight difference I don't only mean additional zeroes at 
the end, but an actual numeric difference in the displayed value. This is the 
result of the different precision used when displaying double values, as 
discussed 
[here|https://stackoverflow.com/questions/17504833/why-converting-from-float-to-double-changes-the-value],
 for example.

  was:
When querying data through Hive's JDBC driver, float values are automatically 
converted to doubles and will slightly differ from values seen when directly 
querying the data using a textual SQL interface.

Please note that by slight difference I don't only mean additional zeroes at 
the end, but an actual difference in the value itself. This is a result of the 
float to double conversion, as discussed 
[here|https://stackoverflow.com/questions/17504833/why-converting-from-float-to-double-changes-the-value],
 for example.


> Hive JDBC driver converts floats to doubles, changing values slightly
> -
>
> Key: HIVE-17844
> URL: https://issues.apache.org/jira/browse/HIVE-17844
> Project: Hive
>  Issue Type: Bug
>  Components: JDBC
>Reporter: Zoltan Ivanfi
>
> When querying data through Hive's JDBC driver, float values are automatically 
> converted to doubles and will slightly differ from values seen when directly 
> querying the data using a textual SQL interface.
> Please note that by slight difference I don't only mean additional zeroes at 
> the end, but an actual numeric difference in the displayed value. This is the 
> result of the different precision used when displaying double values, as 
> discussed 
> [here|https://stackoverflow.com/questions/17504833/why-converting-from-float-to-double-changes-the-value],
>  for example.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-6394) Implement Timestamp in ParquetSerde

2017-01-05 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi reassigned HIVE-6394:
---

Assignee: Zoltan Ivanfi  (was: Szehon Ho)

> Implement Timestamp in ParquetSerde
> ---
>
> Key: HIVE-6394
> URL: https://issues.apache.org/jira/browse/HIVE-6394
> Project: Hive
>  Issue Type: Sub-task
>  Components: Serializers/Deserializers
>Reporter: Jarek Jarcec Cecho
>Assignee: Zoltan Ivanfi
>  Labels: Parquet
> Fix For: 0.14.0
>
> Attachments: HIVE-6394.2.patch, HIVE-6394.3.patch, HIVE-6394.4.patch, 
> HIVE-6394.5.patch, HIVE-6394.6.patch, HIVE-6394.6.patch, HIVE-6394.7.patch, 
> HIVE-6394.patch
>
>
> This JIRA is to implement timestamp support in Parquet SerDe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14846) Char encoding does not apply to newline chars

2016-09-27 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-14846:
-
Description: 
I created and populated a table with utf-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format 
delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
00000010  00 2d 00 30 00 31 00 20  00 30 00 30 00 3a 00 30  |.-.0.1. .0.0.:.0|
00000020  00 30 00 3a 00 30 00 30  00 2c 00 68 01 51 00 73  |.0.:.0.0.,.h.Q.s|
00000030  00 e9 00 67 0a                                    |...g.|
00000035
{noformat}

The newline character is represented as 0a instead of the expected 00 0a.

If I do it the other way around and put correct UTF-16 files into HDFS and try 
to query them from Hive, I get unknown unicode chars in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�
{noformat}
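
For reference, the expected byte sequence follows from the UTF-16BE encoding 
of the line feed character (big-endian matches the fe ff BOM above):

{noformat}
'\n' (U+000A) in UTF-16BE: 00 0a   -- Hive writes a raw 0a instead
{noformat}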


  was:
I created and populated a table with utf-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format 
delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
00000010  00 2d 00 30 00 34 00 20  00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
00000020  00 30 00 3a 00 30 00 30  00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
00000030  01 51 0a                                          |.Q.|
00000033
{noformat}

The newline character is represented as 0a instead of the expected 00 0a.

If I do it the other way around and put correct UTF-16 files into HDFS and try 
to query them from Hive, I get unknown unicode chars in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�
{noformat}



> Char encoding does not apply to newline chars
> -
>
> Key: HIVE-14846
> URL: https://issues.apache.org/jira/browse/HIVE-14846
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Zoltan Ivanfi
>Priority: Minor
>
> I created and populated a table with utf-16 encoding:
> {noformat}
> hive> create external table utf16 (col1 timestamp, col2 string) row format 
> delimited fields terminated by "," location '/tmp/utf16';
> hive> alter table utf16 set serdeproperties 
> ('serialization.encoding'='UTF-16');
> hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
> {noformat}
> Then I checked the contents of the file:
> {noformat}
> $ hadoop fs -cat /tmp/utf16/00_0 | hd
> 00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
> 00000010  00 2d 00 30 00 31 00 20  00 30 00 30 00 3a 00 30  |.-.0.1. .0.0.:.0|
> 00000020  00 30 00 3a 00 30 00 30  00 2c 00 68 01 51 00 73  |.0.:.0.0.,.h.Q.s|
> 00000030  00 e9 00 67 0a                                    |...g.|
> 00000035
> {noformat}
> The newline character is represented as 0a instead of the expected 00 0a.
> If I do it the other way around and put correct UTF-16 files into HDFS and 
> try to query them from Hive, I get unknown unicode chars in the output:
> {noformat}
> hive> select * from utf16;
> 2010-01-01 00:00:00   hőség�
> 2010-01-02 00:00:00   város�
> 2010-01-03 00:00:00   füzet�
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14846) Char encoding does not apply to newline chars

2016-09-27 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated HIVE-14846:
-
Description: 
I created and populated a table with utf-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format 
delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
00000010  00 2d 00 30 00 34 00 20  00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
00000020  00 30 00 3a 00 30 00 30  00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
00000030  01 51 0a                                          |.Q.|
00000033
{noformat}

The newline character is represented as 0a instead of the expected 00 0a.

If I do it the other way around and put correct UTF-16 files into HDFS and try 
to query them from Hive, I get unknown unicode chars in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�
{noformat}


  was:
I created and populated a table with utf-16 encoding:

hive> create external table utf16 (col1 timestamp, col2 string) row format 
delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties 
('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');

Then I checked the contents of the file:

$ hadoop fs -cat /tmp/utf16/00_0 | hd
00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
00000010  00 2d 00 30 00 34 00 20  00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
00000020  00 30 00 3a 00 30 00 30  00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
00000030  01 51 0a                                          |.Q.|
00000033

The newline character is represented as 0a instead of the expected 00 0a.

If I do it the other way around and put correct UTF-16 files into HDFS and try 
to query them from Hive, I get unknown unicode chars in the output:

hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�



> Char encoding does not apply to newline chars
> -
>
> Key: HIVE-14846
> URL: https://issues.apache.org/jira/browse/HIVE-14846
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Zoltan Ivanfi
>Priority: Minor
>
> I created and populated a table with utf-16 encoding:
> {noformat}
> hive> create external table utf16 (col1 timestamp, col2 string) row format 
> delimited fields terminated by "," location '/tmp/utf16';
> hive> alter table utf16 set serdeproperties 
> ('serialization.encoding'='UTF-16');
> hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
> {noformat}
> Then I checked the contents of the file:
> {noformat}
> $ hadoop fs -cat /tmp/utf16/00_0 | hd
> 00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
> 00000010  00 2d 00 30 00 34 00 20  00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
> 00000020  00 30 00 3a 00 30 00 30  00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
> 00000030  01 51 0a                                          |.Q.|
> 00000033
> {noformat}
> The newline character is represented as 0a instead of the expected 00 0a.
> If I do it the other way around and put correct UTF-16 files into HDFS and 
> try to query them from Hive, I get unknown unicode chars in the output:
> {noformat}
> hive> select * from utf16;
> 2010-01-01 00:00:00   hőség�
> 2010-01-02 00:00:00   város�
> 2010-01-03 00:00:00   füzet�
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)