Franco - are there still outstanding questions or issues with this? I'm just returning from holiday, so I wanted to check in.
Agreed that timestamp w/ and w/o time zone are different beasts and come with their own challenges.

Cheers,

On Tue, Dec 19, 2017 at 6:23 AM, Franco Venturi <[email protected]> wrote:
> Thanks for your input, Balazs.
> I am still pondering this issue, and I found it very useful to follow your advice to see what's already in JIRA and to do some research on what others are doing.
>
> I didn't find anything that exactly describes my problem in the Impala JIRA (I might have missed it), however I did find a closely related discussion that is taking place in the Hive JIRA: I am referring to HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305), HIVE-14412 (https://issues.apache.org/jira/browse/HIVE-14412), and HIVE-16418 (https://issues.apache.org/jira/browse/HIVE-16418).
>
> I also found it useful to read how two of the most widely used RDBMSs deal with timestamps:
> - Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html)
> - PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-datetime.html)
>
> Finally, I am going to refer to the Java 8 API as described here: https://docs.oracle.com/javase/8/docs/api/overview-summary.html; the reason is that a lot of Impala users (me included) connect to Impala via Java/JDBC, and in my opinion it is very important that the Impala data types map to the correct Java classes to avoid confusion.
>
>
> After this long preamble, here are some notes that hopefully will help in this discussion.
>
> - At the center of this issue are two significantly different data types that we commonly refer to with the word 'timestamp':
>     - timestamp without time zone (most commonly referred to as just 'timestamp')
>     - timestamp with time zone
>
> - These two types are like apples and oranges in my opinion; getting confused between the two (like I did) can cause a lot of frustration (my first post in this thread shows that pretty well).
>
> - The first data type ('timestamp without time zone', or just 'timestamp') is best described by this sentence in section 4.2.1.6 (Choosing a TIMESTAMP Data Type) of the Oracle document:
>
>       "Use the TIMESTAMP data type when you need a datetime value to record the time of an event without the time zone. For example, you can store information about the times when workers punch a time card in and out of their assembly line workstations. Because this is always a local time it is then not needed to store the timezone part"
>
> - I think this is the kind of timestamp that is currently implemented in Impala (please correct me if I am wrong), and in my opinion it should be mapped to something like the Java type java.time.LocalDateTime; the Java 8 API for java.time.LocalDateTime has this sentence that I think provides good insight into the meaning of this data type:
>
>       "This class does not store or represent a time-zone. Instead, it is a description of the date, as used for birthdays, combined with the local time as seen on a wall clock. It cannot represent an instant on the time-line without additional information such as an offset or time-zone."
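To make that distinction concrete, here's a quick Java 8 sketch (purely illustrative, not a statement about what Impala does today): a LocalDateTime by itself does not identify an instant; it only does so once a zone or offset is attached, and different zones yield different instants.

    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneId;

    public class LocalVsInstant {
        public static void main(String[] args) {
            // "timestamp without time zone": a wall-clock value with no zone attached.
            LocalDateTime wallClock = LocalDateTime.of(2017, 11, 5, 1, 30);

            // It only becomes an instant once a zone is supplied, and the result
            // depends on which zone is chosen.
            Instant inNewYork = wallClock.atZone(ZoneId.of("America/New_York")).toInstant();
            Instant inUtc = wallClock.atZone(ZoneId.of("UTC")).toInstant();

            // Prints 2017-11-05T05:30:00Z (atZone resolves this ambiguous fall-back
            // local time to the earlier, EDT, offset).
            System.out.println(inNewYork);
            // Prints 2017-11-05T01:30:00Z.
            System.out.println(inUtc);
        }
    }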
>
> - Also, this timestamp (again, in my opinion) should not be mapped to the Kudu type UNIXTIME_MICROS, because their meanings are different (this is what triggered my initial confusion, and I suspect I am not the only one out there who misunderstood this difference).
>
>
> - For the second data type ('timestamp with time zone') I couldn't find a good definition in the Oracle document, however this sentence from the Java API for java.util.Date (of which java.sql.Timestamp is a subclass) offers good insight:
>
>       "The class Date represents a specific instant in time"
>
> - This is also indirectly implied by the following sentence in the Oracle document:
>
>       "Two TIMESTAMP WITH TIME ZONE values are considered identical if they represent the same instant in UTC, regardless of the TIME ZONE offsets stored in the data."
>
> - The 'timestamp with time zone' type is what I think should be mapped to the Java type java.sql.Timestamp, and it also corresponds to the actual meaning of the Kudu type UNIXTIME_MICROS.
>
> - This is the kind of timestamp type that is most useful to businesses (financial, healthcare, security logs) because it captures the "absolute" moment in time a transaction or an event occurred, regardless of the time zone in which it occurred or whether daylight saving time was in effect.
>
>
> In conclusion, the point I am at right now is that to really fix this issue in Impala, two things should happen:
>
> - a new data type ('timestamp with time zone'), with the features mentioned above, should be implemented - this change would hopefully not break any existing applications
>
> - the current 'timestamp' type should be mapped to the Java class java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely break existing applications, and therefore should be postponed to Impala 3.0
>
>
> Regards,
> Franco
>
>
> ------------------------------
> From: "Jeszy" <[email protected]>
> To: [email protected]
> Sent: Monday, December 18, 2017 7:49:52 AM
> Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
>
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would work well for you? Have you found any jiras on issues.apache.org that describe what you're asking for? Commenting on a jira will help the team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <[email protected]> wrote:
> > Please note that the discussion below refers to the following versions:
> > - Impala: v2.10.0-cdh5.13.0
> > - Kudu: 1.5.0-cdh5.13.0
> > - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables directly from Oracle to Kudu (to be queried later on by our developers and BI tools using Impala). Most of these tables have columns that are of type "timestamp" (to be exact, they come in as instances of class oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of this discussion I'll assume we only deal with objects of java.sql.Timestamp, to make things simple).
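Going back to the "same instant in UTC" sentence quoted above: a small sketch (again just illustrative) showing that two values written with different offsets compare equal as instants, which is also the semantics that java.sql.Timestamp (milliseconds since the epoch) captures.

    import java.time.Instant;
    import java.time.OffsetDateTime;
    import java.time.ZoneOffset;

    public class SameInstantInUtc {
        public static void main(String[] args) {
            // Two "timestamp with time zone" style values written with different offsets...
            OffsetDateTime easternDaylight =
                    OffsetDateTime.of(2017, 11, 5, 1, 30, 0, 0, ZoneOffset.ofHours(-4));
            OffsetDateTime utc =
                    OffsetDateTime.of(2017, 11, 5, 5, 30, 0, 0, ZoneOffset.UTC);

            // ...denote the same instant, so they are "identical" in the
            // TIMESTAMP WITH TIME ZONE sense.
            System.out.println(easternDaylight.toInstant().equals(utc.toInstant())); // true
            System.out.println(easternDaylight.toInstant().toEpochMilli());          // 1509859800000
        }
    }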
> >
> > As you probably know, Kudu, starting I think with version 1.3.1, has a type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to the Impala TIMESTAMP data type (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the 'Apache Kudu Schema Design' document (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >     unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran the first few tests, I noticed that the timestamp values returned by Impala (I created an 'external' table 'stored as kudu' in Impala) were off by several hours compared to the values returned by the original table in Oracle (our servers, both the Oracle ones and the Impala/Kudu ones, are all configured in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with just two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a timestamp. I ran a few inserts and selects over this table in Impala and figured out that Impala stores a value that is more or less defined as follows:
> >
> >     number of microseconds since the Unix epoch (i.e. what I was expecting originally)
> >     + offset of the timestamp I inserted with respect to GMT (in my case this offset is the offset for EST or EDT, depending on whether that timestamp was during EST (winter) or EDT (summer))
> >
> >
> > This is how Impala achieves what is described as:
> >
> >     Impala does not store timestamps using the local timezone, to avoid undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, which sent a shiver down my spine:
> >
> >     If that value was written to a data file, and shipped off to a distant server to be analyzed alongside other data from far-flung locations, the dates and times would not match up precisely because of time zone differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala timestamps" to store, say, financial or health services (or security) events, they'll find some nasty "surprises" (even if they don't plan to ever move their servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and 2am EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT to EST) - there's no unambiguous way to store the timestamps for these events in Kudu via Impala.
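The fall-back hour mentioned above is easy to demonstrate in plain Java (illustrative only): two instants one hour apart render as the same wall-clock value in America/New_York, so a zone-less local timestamp cannot tell them apart.

    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneId;

    public class FallBackAmbiguity {
        public static void main(String[] args) {
            ZoneId newYork = ZoneId.of("America/New_York");

            // Two distinct instants, one hour apart, straddling the 2017-11-05 fall-back.
            Instant beforeFallBack = Instant.ofEpochMilli(1509859800000L); // 01:30 EDT
            Instant afterFallBack = Instant.ofEpochMilli(1509863400000L);  // 01:30 EST

            // Both print 2017-11-05T01:30, so the local value alone is ambiguous.
            System.out.println(LocalDateTime.ofInstant(beforeFallBack, newYork));
            System.out.println(LocalDateTime.ofInstant(afterFallBack, newYork));
        }
    }

(Those two epoch values are exactly the "in" and "out" numbers in the JDBC test below.)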
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8 and only well-documented, non-deprecated classes and methods) to do just an insert and a select, via Impala JDBC, of a timestamp row in the simple table that I mentioned above (primary key + timestamp column):
> >
> >     // run insert
> >     long primaryKey = 1L;
> >     PreparedStatement insert = connection.prepareStatement("insert into " + table + " values (?, ?)");
> >     insert.setLong(1, primaryKey);
> >     Timestamp timestampIn = new Timestamp(1509859800000L);
> >     System.out.println("TimestampIn: " + timestampIn + " - getTime(): " + timestampIn.getTime());
> >     insert.setTimestamp(2, timestampIn);
> >     insert.executeUpdate();
> >     insert.close();
> >
> >     // run select
> >     PreparedStatement select = connection.prepareStatement("select " + timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
> >     select.setLong(1, primaryKey);
> >     ResultSet resultSet = select.executeQuery();
> >     while (resultSet.next()) {
> >         Timestamp timestampOut = resultSet.getTimestamp(1);
> >         System.out.println("TimestampOut: " + timestampOut + " - getTime(): " + timestampOut.getTime());
> >     }
> >     resultSet.close();
> >     select.close();
> >
> > and this is the output:
> >
> >     TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >     TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >
> >
> > If this kind of timestamp had been designed, say, in 1972, I would probably have been OK with it, but in 2017, when people and companies do business 24/7, I find this approach (and sentences like "dates and times would not match up precisely") unacceptable.
> >
> >
> > In its defense, the Impala TIMESTAMP web page mentioned above spends several paragraphs on how to deal with "UTC timestamps" and problems like the ones shown above, but it basically has the user wrap everything in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> > This proposed solution - besides the performance impact of, say, having a query like this:
> >
> >     select ... where to_utc_timestamp(timestamp_column_a, 'America/New_York') < to_utc_timestamp(timestamp_column_b, 'America/New_York')
> >
> > - also makes it much harder to write queries involving timestamp fields (which are very common in my workplace) or to use them with automated BI tools.
> >
> >
> > It is definitely not what one would expect from an analytic MPP database that aspires to be used at the enterprise level in 2017.
> >
> >
> > Regards,
> > Franco Venturi
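Franco, one thing that might be worth trying in the meantime is passing an explicit UTC Calendar to the JDBC setTimestamp/getTimestamp overloads, so the Timestamp conversion is not done in the JVM's default zone. This is only a sketch: I have not verified that the Impala JDBC driver honors the Calendar argument, and the column names 'ts' and 'pk' are placeholders for your two-column test table.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;
    import java.util.Calendar;
    import java.util.TimeZone;

    public class UtcCalendarRoundTrip {
        // Sketch only: ask the driver to convert between java.sql.Timestamp and the
        // stored value using UTC instead of the JVM default time zone.
        static void roundTrip(Connection connection, String table, long primaryKey,
                              Timestamp timestampIn) throws Exception {
            Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));

            try (PreparedStatement insert =
                     connection.prepareStatement("insert into " + table + " values (?, ?)")) {
                insert.setLong(1, primaryKey);
                insert.setTimestamp(2, timestampIn, utc); // convert using the UTC calendar
                insert.executeUpdate();
            }

            // 'ts' and 'pk' are placeholder column names for the test table.
            try (PreparedStatement select =
                     connection.prepareStatement("select ts from " + table + " where pk = ?")) {
                select.setLong(1, primaryKey);
                try (ResultSet resultSet = select.executeQuery()) {
                    while (resultSet.next()) {
                        Timestamp timestampOut = resultSet.getTimestamp(1, utc);
                        System.out.println("TimestampOut: " + timestampOut
                                + " - getTime(): " + timestampOut.getTime());
                    }
                }
            }
        }
    }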
