It is amazing how much complexity is behind something as simple as time :) Thanks
so much for the detailed write-up, Csaba. It really helps and makes sense now. I
do wish there were an option similar to use_local_tz_for_unix_timestamp_conversions,
but one that would not attempt to convert the timestamp to the server time zone;
granted, performance would be worse because of the 2-step conversion.
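What I am imagining is something like the sketch below (hypothetical C++ only;
no such flag exists, and none of these names are real): leave the stored
day/nanosecond value untouched when reading, and shift it only at display time,
to a zone the user chooses rather than the server's. A fixed offset stands in
for a real time zone database here:

    #include <cstdint>

    // Hypothetical "convert only for display": the stored day/nanosecond
    // pair is decoded as-is, and the user's chosen offset is applied right
    // before formatting. A real version would need a tz database (DST!);
    // the fixed offset below only marks where the conversion would happen.
    void ShiftForDisplay(int32_t& days, int64_t& nanos_of_day,
                         int64_t display_offset_seconds) {
      const int64_t kNanosPerDay = 86400LL * 1000000000LL;
      int64_t nanos = nanos_of_day + display_offset_seconds * 1000000000LL;
      days += static_cast<int32_t>(nanos / kNanosPerDay);  // carry whole days
      nanos %= kNanosPerDay;
      if (nanos < 0) {  // keep the time-of-day part in [0, kNanosPerDay)
        nanos += kNanosPerDay;
        days -= 1;
      }
      nanos_of_day = nanos;
    }

A real time zone lookup would replace the fixed offset, which is where the extra
cost of the 2-step conversion would come from.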
In our case, we have 100s of tables with thousands of timestamps. Data is coming
from 3 different time zones, and our requirement is to store these values in the
local time of those source systems (so users can see the exact local time as it
was in their local system). We used to keep time in UTC, but it caused tons of
issues and confusion for users (while we liked the simplicity of having
everything in UTC). Based on what you said, it looks like we have two options:
1) upgrade to CDH 6.1 and use convert_legacy_hive_parquet_utc_timestamps (which
before was not usable for us due to the huge performance impact), or 2) convert
and store time as we need it, but using the string type rather than timestamp.
Luckily, Hive, Impala, and Spark can accept string values as arguments to
date/time functions.

Boris

On Thu, Dec 20, 2018 at 10:28 AM Csaba Ringhofer <csringho...@cloudera.com>
wrote:

> > My question is what are the thought process and reasons to do this
> > conversion in the first place from UTC and having Impala "assume" that
> > timestamp is always UTC?
>
> tl;dr: "timezone agnostic" is the fastest and simplest way to go,
> especially if there is no good time zone library available.
>
> I was not there when it was designed, but I think that the main reason
> behind Impala's behavior is performance, and the secondary reason is that
> finding and integrating a good time zone library is not trivial in C++.
> Note that there are some nice time libraries, but most are not designed
> for processing billions of records.
>
> Impala uses a compact in-memory format for timestamps (32-bit days since
> the epoch and 64-bit nanoseconds since midnight). Storing timestamps as
> strings or as a more complex structure (e.g. separate numbers for year,
> month, day, ...) would need more memory and make some operations, like
> comparison, slower.
>
> During a query like "SELECT timestamp_col FROM parquet_table", the
> timestamps may need time zone conversion during two steps:
> 1. converting the timestamp from the way it is encoded in Parquet to
> Impala's internal format
> 2. converting Impala's internal format to a string like "2018-12-24
> 20:00:00", as the result is returned as a string
>
> By default Impala avoids time zone conversion in both steps by treating
> timestamps as if they were in UTC:
> 1. Parquet uses the same binary format as Impala, so decoding can be as
> simple as copying 12 bytes in memory
> 2. formatting as a string needs some tricky calculations due to leap
> years, but does not need time zone conversion
>
> convert_legacy_hive_parquet_utc_timestamps adds time zone conversion to
> step 1 (if the writer is parquet-mr), which makes reading slower even
> with recent improvements. There is no option to do time zone conversion
> during step 2.
>
> Hive and Spark assume the stored timestamp to be in UTC but display it in
> local time. I do not know how these systems represent timestamps in
> memory, but I am sure that they need to do time zone conversion in step 1
> or step 2. This overhead may be less visible than in Impala, because most
> operations are already slower, and the time library used is probably
> better (especially compared to pre-3.1 Impala).
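(If I follow, the no-conversion path in step 1 is conceptually just the copy
below; this is my own illustrative C++ with assumed field names and ordering,
not Impala's actual code.)

    #include <cstdint>
    #include <cstring>

    // The compact 96-bit value from the description above: 32-bit days
    // plus 64-bit nanoseconds since midnight. Packed so the struct is
    // exactly the 12 bytes that a Parquet INT96 value occupies.
    #pragma pack(push, 1)
    struct CompactTimestamp {
      int64_t nanos_of_day;
      int32_t days;
    };
    #pragma pack(pop)
    static_assert(sizeof(CompactTimestamp) == 12, "must stay 12 bytes");

    // Step 1 with no time zone conversion: since the file format and the
    // in-memory format match, "decoding" is a plain 12-byte copy.
    CompactTimestamp DecodeInt96(const uint8_t* encoded) {
      CompactTimestamp ts;
      std::memcpy(&ts, encoded, sizeof(ts));
      return ts;
    }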
> > (performance killer that has just been fixed in the latest Impala
> > release, which has not made it to CDH yet)
>
> Note that CDH 6.1 was released recently, and time zone database changes
> are included in the release, see:
>
> https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_610_new_features.html#impala_new_610
> https://blog.cloudera.com/blog/2018/12/cloudera-enterprise-6-1-0-is-now-available/
>
> On Wed, Dec 19, 2018 at 9:47 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> wow, thanks for sharing this doc, a wealth of info there. Thanks Greg!
>>
>> On Wed, Dec 19, 2018 at 3:17 PM Greg Rahn <greg.r...@gmail.com> wrote:
>>
>>> I think this document contains most of the challenges around timestamp
>>> definition and management. It's a long read but has the details behind
>>> much of what you have mentioned.
>>>
>>> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit?usp=sharing
>>>
>>> On Wed, Dec 19, 2018 at 11:33 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>> >
>>> > Hello,
>>> >
>>> > I am trying to understand the reasons behind this decision by the
>>> > Impala devs.
>>> >
>>> > From the Impala docs:
>>> > http://impala.apache.org/docs/build/html/topics/impala_timestamp.html
>>> >
>>> > By default, Impala stores and interprets TIMESTAMP values in UTC
>>> > time zone when writing to data files, reading from data files, or
>>> > converting to and from system time values through functions.
>>> >
>>> > And there are two switches to change this behavior:
>>> >
>>> > use_local_tz_for_unix_timestamp_conversions
>>> > convert_legacy_hive_parquet_utc_timestamps (a performance killer that
>>> > has just been fixed in the latest Impala release, which has not made
>>> > it to CDH yet)
>>> >
>>> > My question is what are the thought process and reasons to do this
>>> > conversion in the first place from UTC and having Impala "assume"
>>> > that a timestamp is always UTC?
>>> >
>>> > This is not how Hive or Spark or anything else I have seen before
>>> > does it. It is really unusual and causes tons of confusion if you try
>>> > to use the same data set from Hive, Spark, and Impala, i.e. when
>>> > Impala is not the only thing on a cluster.
>>> >
>>> > And the second question: why is there no option NOT to convert the
>>> > time in the first place and just use the one that was intended to be
>>> > stored? If I stored 2015-01-01 12:12:00, whatever time zone that is,
>>> > I still want to see that exact time in Impala, Hive, and Spark, and I
>>> > do not need Impala converting this time to my local cluster time.
>>> >
>>> > I am sure there is a reason for that, just struggling to understand
>>> > it...
>>> >
>>> > Thanks,
>>> > Boris
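P.S. To check my understanding of step 2: the "tricky calculations due to leap
years" can be done with pure date arithmetic, and no time zone lookup is needed
anywhere. A self-contained sketch (again my own illustrative C++, not Impala's
code), using Howard Hinnant's well-known civil-from-days algorithm:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Days-since-1970 -> year/month/day via Howard Hinnant's public
    // civil_from_days algorithm. Leap years are handled with pure
    // integer math; no time zone database is involved.
    void CivilFromDays(int64_t z, int64_t* y, unsigned* m, unsigned* d) {
      z += 719468;  // shift epoch from 1970-01-01 to 0000-03-01
      const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
      const unsigned doe = static_cast<unsigned>(z - era * 146097);
      const unsigned yoe =
          (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
      const int64_t yr = static_cast<int64_t>(yoe) + era * 400;
      const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
      const unsigned mp = (5 * doy + 2) / 153;
      *d = doy - (153 * mp + 2) / 5 + 1;
      *m = mp < 10 ? mp + 3 : mp - 9;
      *y = yr + (*m <= 2);
    }

    // Step 2 with no time zone conversion: render the stored value
    // exactly as it is, "2018-12-24 20:00:00" style.
    void FormatTimestamp(int32_t days, int64_t nanos_of_day,
                         char* buf, size_t n) {
      int64_t y; unsigned m, d;
      CivilFromDays(days, &y, &m, &d);
      const int64_t secs = nanos_of_day / 1000000000LL;
      std::snprintf(buf, n, "%04lld-%02u-%02u %02lld:%02lld:%02lld",
                    (long long)y, m, d,
                    (long long)(secs / 3600), (long long)(secs / 60 % 60),
                    (long long)(secs % 60));
    }

    int main() {
      char buf[32];
      // day 17889 since the epoch, 72000 s after midnight
      FormatTimestamp(17889, 72000LL * 1000000000LL, buf, sizeof(buf));
      std::printf("%s\n", buf);  // prints: 2018-12-24 20:00:00
    }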