I think this document covers most of the challenges around timestamp definition and management. It's a long read, but it has the details behind much of what you mention. https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit?usp=sharing
On Wed, Dec 19, 2018 at 11:33 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
> Hello,
>
> I am trying to understand the reasons behind this decision by the Impala devs.
>
> From the Impala docs:
> http://impala.apache.org/docs/build/html/topics/impala_timestamp.html
>
> By default, Impala stores and interprets TIMESTAMP values in the UTC time zone
> when writing to data files, reading from data files, or converting to and
> from system time values through functions.
>
> And there are two switches to change this behavior:
>
> use_local_tz_for_unix_timestamp_conversions
> convert_legacy_hive_parquet_utc_timestamps (a performance killer that has just
> been fixed in the latest Impala release, which has not made it into CDH yet)
>
> My question is: what were the thought process and reasons for doing this
> conversion from UTC in the first place and having Impala "assume" that a
> timestamp is always UTC?
>
> This is not how Hive or Spark or anything else I've seen before does it. It is
> really unusual and causes tons of confusion if you try to use the same data
> set from Hive, Spark, and Impala; that is, whenever Impala is not the only
> thing on a cluster.
>
> And second, why is there no option NOT to convert the time in the first place
> and just use the value as it was stored? If I stored 2015-01-01 12:12:00, in
> whatever time zone, I still want to see that exact time in Impala, Hive, and
> Spark, and I do not need Impala converting this time to my local cluster time.
>
> I am sure there is a reason for this; I am just struggling to understand it...
>
> Thanks,
> Boris
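To make the mismatch Boris describes concrete, here is a minimal sketch in plain Python (no Impala or Hive involved): one reader treats the stored bytes as a wall-clock time and shows them as-is (the Hive/Spark-style behavior), while the other treats them as UTC and converts to a cluster-local zone on read (the Impala-style behavior when convert_legacy_hive_parquet_utc_timestamps is enabled). The choice of America/New_York as the cluster zone is purely an assumption for illustration.

```python
# Sketch of the two interpretations of the same stored timestamp.
# Assumption: cluster local zone is America/New_York (illustrative only).
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The value the user intended to store, with no time zone attached.
stored = datetime(2015, 1, 1, 12, 12, 0)

# Reader A (Hive/Spark-style): the value is a wall-clock time; show as-is.
hive_view = stored

# Reader B (Impala-style): the value is UTC; convert to the cluster's
# local zone when reading.
cluster_tz = ZoneInfo("America/New_York")
impala_view = stored.replace(tzinfo=timezone.utc).astimezone(cluster_tz)

print(hive_view.strftime("%Y-%m-%d %H:%M:%S"))    # 2015-01-01 12:12:00
print(impala_view.strftime("%Y-%m-%d %H:%M:%S"))  # 2015-01-01 07:12:00
```

The same five-hour drift (in January, New York is UTC-5) is what shows up when a Parquet file written by Hive is read back by Impala with conversion enabled, which is exactly why mixing the two engines on one data set causes confusion.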