Franco - are there still outstanding questions or issues with this? I'm just returning from holiday, so I wanted to check in.
Agreed that timestamp w/ and w/o time zone are different beasts and come with their own challenges.

Cheers,

On Tue, Dec 19, 2017 at 6:23 AM, Franco Venturi <[email protected]> wrote:
> Thanks for your input, Balazs.
> I am still pondering this issue, and I found it very useful to follow your advice to see what's already in JIRA and to do some research on what others are doing.
>
> I didn't find anything that exactly describes my problem in the Impala JIRA (I might have missed it), however I did find a closely related discussion that is taking place in the Hive JIRA: I am referring to HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305), HIVE-14412 (https://issues.apache.org/jira/browse/HIVE-14412), and HIVE-16418 (https://issues.apache.org/jira/browse/HIVE-16418).
>
> I also found it useful to read how two of the most widely used RDBMSs deal with timestamps:
> - Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html)
> - PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-datetime.html)
>
> Finally, I am going to refer to the Java 8 API as described here: https://docs.oracle.com/javase/8/docs/api/overview-summary.html; the reason is that a lot of Impala users (me included) connect to Impala via Java/JDBC, and in my opinion it is very important that the Impala data types map to the correct Java classes to avoid confusion.
>
>
> After this long preamble, here are some notes that hopefully will help in this discussion.
>
> - At the center of this issue are two significantly different data types that we commonly refer to with the word 'timestamp':
>     - timestamp without time zone (most commonly referred to as just 'timestamp')
>     - timestamp with time zone
>
> - These two types are like apples and oranges in my opinion; getting confused between the two (like I did) can cause a lot of frustration (my first post in this thread shows that pretty well).
>
> - The first data type ('timestamp without time zone', or just 'timestamp') is best described by this sentence in section 4.2.1.6 (Choosing a TIMESTAMP Data Type) of the Oracle document:
>
>       "Use the TIMESTAMP data type when you need a datetime value to record the time of an event without the time zone. For example, you can store information about the times when workers punch a time card in and out of their assembly line workstations. Because this is always a local time it is then not needed to store the timezone part"
>
> - I think this is the kind of timestamp that is currently implemented in Impala (please correct me if I am wrong), and in my opinion it should be mapped to something like the Java type java.time.LocalDateTime; the Java 8 API for java.time.LocalDateTime has this sentence that I think provides good insight into the meaning of this data type:
>
>       "This class does not store or represent a time-zone. Instead, it is a description of the date, as used for birthdays, combined with the local time as seen on a wall clock. It cannot represent an instant on the time-line without additional information such as an offset or time-zone."
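To make that distinction concrete, here's a quick Java 8 sketch (purely illustrative, not a statement about what Impala does today): a LocalDateTime by itself does not identify an instant; it only does so once a zone or offset is attached, and different zones yield different instants.

    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneId;

    public class LocalVsInstant {
        public static void main(String[] args) {
            // "timestamp without time zone": a wall-clock value with no zone attached.
            LocalDateTime wallClock = LocalDateTime.of(2017, 11, 5, 1, 30);

            // It only becomes an instant once a zone is supplied, and the result
            // depends on which zone is chosen.
            Instant inNewYork = wallClock.atZone(ZoneId.of("America/New_York")).toInstant();
            Instant inUtc = wallClock.atZone(ZoneId.of("UTC")).toInstant();

            // Prints 2017-11-05T05:30:00Z (atZone resolves this ambiguous fall-back
            // local time to the earlier, EDT, offset).
            System.out.println(inNewYork);
            // Prints 2017-11-05T01:30:00Z.
            System.out.println(inUtc);
        }
    }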
>
> - Also, this timestamp (again, in my opinion) should not be mapped to the Kudu type UNIXTIME_MICROS, because their meanings are different (this is what triggered my initial confusion, and I suspect I am not the only one out there who misunderstood this difference).
>
>
> - For the second data type ('timestamp with time zone') I couldn't find a good definition in the Oracle document, however this sentence from the Java API for java.util.Date (of which java.sql.Timestamp is a subclass) offers good insight:
>
>       "The class Date represents a specific instant in time"
>
> - This is also indirectly implied by the following sentence in the Oracle document:
>
>       "Two TIMESTAMP WITH TIME ZONE values are considered identical if they represent the same instant in UTC, regardless of the TIME ZONE offsets stored in the data."
>
> - The 'timestamp with time zone' type is what I think should be mapped to the Java type java.sql.Timestamp, and it also corresponds to the actual meaning of the Kudu type UNIXTIME_MICROS.
>
> - This is the kind of timestamp type that is most useful to businesses (financial, healthcare, security logs) because it captures the "absolute" moment in time a transaction or an event occurred, regardless of the time zone in which it occurred or whether daylight saving time was in effect.
>
>
> In conclusion, the point I am at right now is that to really fix this issue in Impala, two things should happen:
>
> - a new data type ('timestamp with time zone'), with the features mentioned above, should be implemented - this change would hopefully not break any existing applications
>
> - the current 'timestamp' type should be mapped to the Java class java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely break existing applications, and therefore should be postponed to Impala 3.0
>
>
> Regards,
> Franco
>
>
> ------------------------------
> From: "Jeszy" <[email protected]>
> To: [email protected]
> Sent: Monday, December 18, 2017 7:49:52 AM
> Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
>
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would work well for you? Have you found any jiras on issues.apache.org that describe what you're asking for? Commenting on a jira will help the team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <[email protected]> wrote:
> > Please note that the discussion below refers to the following versions:
> > - Impala: v2.10.0-cdh5.13.0
> > - Kudu: 1.5.0-cdh5.13.0
> > - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables directly from Oracle to Kudu (to be queried later on by our developers and BI tools using Impala). Most of these tables have columns that are of type "timestamp" (to be exact, they come in as instances of class oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of this discussion I'll assume we only deal with objects of java.sql.Timestamp, to make things simple).
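Going back to the "same instant in UTC" sentence quoted above: a small sketch (again just illustrative) showing that two values written with different offsets compare equal as instants, which is also the semantics that java.sql.Timestamp (milliseconds since the epoch) captures.

    import java.time.Instant;
    import java.time.OffsetDateTime;
    import java.time.ZoneOffset;

    public class SameInstantInUtc {
        public static void main(String[] args) {
            // Two "timestamp with time zone" style values written with different offsets...
            OffsetDateTime easternDaylight =
                    OffsetDateTime.of(2017, 11, 5, 1, 30, 0, 0, ZoneOffset.ofHours(-4));
            OffsetDateTime utc =
                    OffsetDateTime.of(2017, 11, 5, 5, 30, 0, 0, ZoneOffset.UTC);

            // ...denote the same instant, so they are "identical" in the
            // TIMESTAMP WITH TIME ZONE sense.
            System.out.println(easternDaylight.toInstant().equals(utc.toInstant())); // true
            System.out.println(easternDaylight.toInstant().toEpochMilli());          // 1509859800000
        }
    }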
> >
> > As you probably know, Kudu, starting I think with version 1.3.1, has a type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to the Impala TIMESTAMP data type (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the 'Apache Kudu Schema Design' document (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >     unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran the first few tests, I noticed that the timestamp values returned by Impala (I created an 'external' table 'stored as kudu' in Impala) were off by several hours compared to the values returned by the original table in Oracle (our servers, both the Oracle ones and the Impala/Kudu ones, are all configured in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with just two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a timestamp. I ran a few inserts and selects over this table in Impala and figured out that Impala stores a value that is more or less defined as follows:
> >
> >     number of microseconds since the Unix epoch (i.e. what I was expecting originally)
> >     + offset of the timestamp I inserted with respect to GMT (in my case this offset is the offset for EST or EDT, depending on whether that timestamp was during EST (winter) or EDT (summer))
> >
> >
> > This is how Impala achieves what is described as:
> >
> >     Impala does not store timestamps using the local timezone, to avoid undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, which sent a shiver down my spine:
> >
> >     If that value was written to a data file, and shipped off to a distant server to be analyzed alongside other data from far-flung locations, the dates and times would not match up precisely because of time zone differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala timestamps" to store, say, financial or health services (or security) events, they'll find some nasty "surprises" (even if they don't plan to ever move their servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and 2am EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT to EST) - there's no unambiguous way to store the timestamps for these events in Kudu via Impala.
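The fall-back hour mentioned above is easy to demonstrate in plain Java (illustrative only): two instants one hour apart render as the same wall-clock value in America/New_York, so a zone-less local timestamp cannot tell them apart.

    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneId;

    public class FallBackAmbiguity {
        public static void main(String[] args) {
            ZoneId newYork = ZoneId.of("America/New_York");

            // Two distinct instants, one hour apart, straddling the 2017-11-05 fall-back.
            Instant beforeFallBack = Instant.ofEpochMilli(1509859800000L); // 01:30 EDT
            Instant afterFallBack = Instant.ofEpochMilli(1509863400000L);  // 01:30 EST

            // Both print 2017-11-05T01:30, so the local value alone is ambiguous.
            System.out.println(LocalDateTime.ofInstant(beforeFallBack, newYork));
            System.out.println(LocalDateTime.ofInstant(afterFallBack, newYork));
        }
    }

(Those two epoch values are exactly the "in" and "out" numbers in the JDBC test below.)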
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8 and only well-documented, non-deprecated classes and methods) to do just an insert and a select, via Impala JDBC, of a timestamp row in the simple table that I mentioned above (primary key + timestamp column):
> >
> >     // run insert
> >     long primaryKey = 1L;
> >     PreparedStatement insert = connection.prepareStatement("insert into " + table + " values (?, ?)");
> >     insert.setLong(1, primaryKey);
> >     Timestamp timestampIn = new Timestamp(1509859800000L);
> >     System.out.println("TimestampIn: " + timestampIn + " - getTime(): " + timestampIn.getTime());
> >     insert.setTimestamp(2, timestampIn);
> >     insert.executeUpdate();
> >     insert.close();
> >
> >     // run select
> >     PreparedStatement select = connection.prepareStatement("select " + timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
> >     select.setLong(1, primaryKey);
> >     ResultSet resultSet = select.executeQuery();
> >     while (resultSet.next()) {
> >         Timestamp timestampOut = resultSet.getTimestamp(1);
> >         System.out.println("TimestampOut: " + timestampOut + " - getTime(): " + timestampOut.getTime());
> >     }
> >     resultSet.close();
> >     select.close();
> >
> > and this is the output:
> >
> >     TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >     TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >
> >
> > If this kind of timestamp had been designed, say, in 1972, I would probably have been OK with it, but in 2017, when people and companies do business 24/7, I find this approach (and sentences like "dates and times would not match up precisely") unacceptable.
> >
> >
> > In its defense, the Impala TIMESTAMP web page mentioned above spends several paragraphs on how to deal with "UTC timestamps" and problems like the ones shown above, but it basically has the user wrap everything in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> > This proposed solution - besides the performance impact of, say, having a query like this:
> >
> >     select ... where to_utc_timestamp(timestamp_column_a, 'America/New_York') < to_utc_timestamp(timestamp_column_b, 'America/New_York')
> >
> > - also makes it much harder to write queries involving timestamp fields (which are very common in my workplace) or to use them with automated BI tools.
> >
> >
> > It is definitely not what one would expect from an analytic MPP database that aspires to be used at the enterprise level in 2017.
> >
> >
> > Regards,
> > Franco Venturi
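Franco, one thing that might be worth trying in the meantime is passing an explicit UTC Calendar to the JDBC setTimestamp/getTimestamp overloads, so the Timestamp conversion is not done in the JVM's default zone. This is only a sketch: I have not verified that the Impala JDBC driver honors the Calendar argument, and the column names 'ts' and 'pk' are placeholders for your two-column test table.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;
    import java.util.Calendar;
    import java.util.TimeZone;

    public class UtcCalendarRoundTrip {
        // Sketch only: ask the driver to convert between java.sql.Timestamp and the
        // stored value using UTC instead of the JVM default time zone.
        static void roundTrip(Connection connection, String table, long primaryKey,
                              Timestamp timestampIn) throws Exception {
            Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));

            try (PreparedStatement insert =
                     connection.prepareStatement("insert into " + table + " values (?, ?)")) {
                insert.setLong(1, primaryKey);
                insert.setTimestamp(2, timestampIn, utc); // convert using the UTC calendar
                insert.executeUpdate();
            }

            // 'ts' and 'pk' are placeholder column names for the test table.
            try (PreparedStatement select =
                     connection.prepareStatement("select ts from " + table + " where pk = ?")) {
                select.setLong(1, primaryKey);
                try (ResultSet resultSet = select.executeQuery()) {
                    while (resultSet.next()) {
                        Timestamp timestampOut = resultSet.getTimestamp(1, utc);
                        System.out.println("TimestampOut: " + timestampOut
                                + " - getTime(): " + timestampOut.getTime());
                    }
                }
            }
        }
    }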
