> My question is: what were the thought process and reasons to do this
> conversion from UTC in the first place, and to have Impala "assume" that
> the timestamp is always UTC?
tl;dr: "timezone agnostic" is the fastest and simplest way to go,
especially if there is no good time zone library available

I was not there when it was designed, but I think that the main reason
behind Impala's behavior is performance, and the secondary reason is that
finding and integrating a good timezone library is not trivial in C++. Note
that there are some nice time libraries, but most are not designed for
processing billions of records.

Impala uses a compact in-memory format for timestamps (a 32-bit day
count since the epoch plus 64 bits of nanoseconds since midnight, 12
bytes in total). Storing timestamps as strings or as a more complex
structure (e.g. separate numbers for year, month, day, ...) would need
more memory and would make operations like comparison slower.
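For illustration, a struct with that layout might look like the sketch
below. The names are mine, not Impala's actual classes (as far as I
know the real TimestampValue is built on boost date/time types), but
the storage adds up to the same 12 bytes, and it shows why comparisons
stay cheap:

    #include <cstdint>

    // Illustrative only -- not Impala's actual class.
    struct CompactTimestamp {
      int64_t nanos_of_day;  // nanoseconds since midnight
      int32_t days;          // day count relative to a fixed epoch
    };

    // Comparison never has to materialize year/month/day:
    inline bool operator<(const CompactTimestamp& a,
                          const CompactTimestamp& b) {
      return a.days != b.days ? a.days < b.days
                              : a.nanos_of_day < b.nanos_of_day;
    }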

During a query like "SELECT timestamp_col FROM parquet_table" the
timestamps may need time zone conversion during two steps:
1. converting the timestamp from the way it is encoded in Parquet to
Impala's internal format
2. converting Impala's internal format to a string like 2018-12-24
20:00:00, as the result is returned as a string

By default Impala avoids time zone conversion in both steps by treating
timestamps as if they were in UTC:
1. Parquet uses the same binary format as Impala, so decoding can be as
simple as copying 12 bytes in memory (see the first sketch below)
2. formatting as a string needs some tricky calculations due to leap
years, but does not need time zone conversion (see the second sketch
below)
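To make step 1 concrete, here is a rough sketch of what the "no
conversion" decode could look like, assuming the 12-byte layout above
(8 bytes of nanoseconds-of-day followed by a 4-byte day count, which is
how Parquet lays out its legacy INT96 timestamps). This is a
simplification, not Impala's actual reader code:

    #include <cstdint>
    #include <cstring>

    // Decode one Parquet INT96 timestamp with no time zone math at
    // all -- just byte copies. Assumes a little-endian host, as
    // Parquet stores the bytes little-endian.
    void DecodeInt96Timestamp(const uint8_t* parquet_bytes,
                              int64_t* nanos_of_day, int32_t* days) {
      std::memcpy(nanos_of_day, parquet_bytes, sizeof(*nanos_of_day));
      std::memcpy(days, parquet_bytes + sizeof(*nanos_of_day),
                  sizeof(*days));
    }

For step 2, the "tricky calculations" are of this kind: turning a flat
day count back into year/month/day across leap years. The function
below is a well-known algorithm (along the lines of Howard Hinnant's
date algorithms), written here as if the day count were relative to
1970-01-01; Impala's own formatting code differs, but it has to solve
the same problem, and note that no time zone appears anywhere:

    #include <cstdint>

    // Convert days since 1970-01-01 to year/month/day, handling leap
    // years via the 400-year Gregorian cycle (146097 days per era).
    void CivilFromDays(int32_t z, int* y, unsigned* m, unsigned* d) {
      z += 719468;  // shift epoch from 1970-01-01 to 0000-03-01
      const int era = (z >= 0 ? z : z - 146096) / 146097;
      const unsigned doe = static_cast<unsigned>(z - era * 146097);
      const unsigned yoe =
          (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
      const int yr = static_cast<int>(yoe) + era * 400;
      const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
      const unsigned mp = (5 * doy + 2) / 153;  // month, March-based
      *d = doy - (153 * mp + 2) / 5 + 1;        // [1, 31]
      *m = mp < 10 ? mp + 3 : mp - 9;           // [1, 12]
      *y = yr + (*m <= 2);
    }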

convert_legacy_hive_parquet_utc_timestamps adds time zone conversion to
step 1 (if the writer is parquet-mr), which makes reading slower even
with recent improvements. There is no option to do time zone conversion
during step 2.
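For completeness: it is a startup flag on the impalad daemon rather
than a query option, so enabling it looks roughly like this (the exact
steps depend on how the cluster is managed, e.g. Cloudera Manager sets
impalad flags through its configuration snippets):

    impalad --convert_legacy_hive_parquet_utc_timestamps=true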

Hive and Spark assume the stored timestamp to be in UTC, but display it
in local time. I do not know how these systems represent timestamps in
memory, but I am sure that they need to do time zone conversion in step
1 or 2. This overhead may be less visible than in Impala, because most
operations are already slower, and the time library used is probably
better (especially compared to pre-3.1 Impala).

> (performance killer that has just been fixed in the latest Impala
> release which has not made it to CDH yet)
Note that CDH 6.1 was released recently, and the timezone database
changes are included in the release; see:
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_610_new_features.html#impala_new_610
https://blog.cloudera.com/blog/2018/12/cloudera-enterprise-6-1-0-is-now-available/

On Wed, Dec 19, 2018 at 9:47 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> wow thanks for sharing this doc, a wealth of info there. Thanks Greg!
>
> On Wed, Dec 19, 2018 at 3:17 PM Greg Rahn <greg.r...@gmail.com> wrote:
>
>> I think this document contains most of the challenges around timestamp
>> definition and management.  It's a long read but has the details
>> behind much of what you have mentioned.
>>
>> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit?usp=sharing
>>
>>
>>
>> On Wed, Dec 19, 2018 at 11:33 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > I am trying to understand the reasons behind this decision by Impala
>> devs.
>> >
>> > From Impala docs:
>> > http://impala.apache.org/docs/build/html/topics/impala_timestamp.html
>> >
>> > By default, Impala stores and interprets TIMESTAMP values in UTC time
>> zone when writing to data files, reading from data files, or converting to
>> and from system time values through functions.
>> >
>> > And there are two switches to change this behavior:
>> >
>> > use_local_tz_for_unix_timestamp_conversions
>> > convert_legacy_hive_parquet_utc_timestamps (performance killer that has
>> just been fixed in the latest Impala release which has not made it to CDH
>> yet)
>> >
>> > My question is: what were the thought process and reasons to do this
>> conversion from UTC in the first place, and to have Impala "assume" that
>> the timestamp is always UTC?
>> >
>> > This is not how Hive or Spark or anything else I've seen before does
>> it. This is really unusual and causes tons of confusion if you try to use
>> the same data set from Hive, Spark and Impala, i.e. when Impala is not
>> the only thing on a cluster.
>> >
>> > And the second question: why is there no option NOT to convert the time
>> in the first place and just use the one which was intended to be stored?
>> So if I stored 2015-01-01 12:12:00, whatever the time zone is, I still
>> want to see that exact time in Impala, Hive and Spark, and I do not need
>> Impala converting this time to my local cluster time.
>> >
>> > I am sure there is a reason for that; I am just struggling to
>> understand it...
>> >
>> > Thanks,
>> > Boris
>>
>
