Greg, if you are somewhere on the West Coast (i.e. if your servers are configured to run in the 'America/Los_Angeles' timezone), do you mind changing this line from:
    Timestamp timestampIn = new Timestamp(1509859800000L);

to:

    Timestamp timestampIn = new Timestamp(1509870600000L);

and see if it makes any difference? Also I see you are running a different version of Impalad (2.9.0-cdh5.12.1), while I am on v2.10.0-cdh5.13.0.

In the meantime, I'll give it a try with the latest JDBC driver provided by Cloudera and see how it goes.

Franco

----- Original Message -----
From: "greg rahn" <[email protected]>
To: [email protected]
Sent: Monday, December 18, 2017 6:46:44 PM
Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

It seems to work as expected using the latest version of the Impala JDBC driver. Might you give that a try and see if it solves your issue?

    create table t (pk int primary key, ts timestamp)
    partition by hash(pk) partitions 2
    stored as kudu
    tblproperties('kudu.num_tablet_replicas' = '1');

*** Database product version: 2.9.0-cdh5.12.1
*** JDBC Driver version: 02.05.41.1061
TimestampIn:  2017-11-04 22:30:00.0 - getTime(): 1509859800000
TimestampOut: 2017-11-04 22:30:00.0 - getTime(): 1509859800000

On 2017-12-18 04:49, Jeszy <[email protected]> wrote:
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <[email protected]> wrote:
> > Please note that the discussion below refers to the following versions:
> > - Impala: v2.10.0-cdh5.13.0
> > - Kudu: 1.5.0-cdh5.13.0
> > - Everything runs on a standard Cloudera 5.13 installation
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers
> > and BI tools using Impala). Most of these tables have columns that are
> > of type "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest
> > of this discussion I'll assume we only deal with objects of
> > java.sql.Timestamp, to keep things simple).
> >
> > As you probably know, Kudu, starting I think with version 1.3.1, has a
> > type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to
> > the "Impala TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >     unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
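> >
> > To make that arithmetic concrete, here is a minimal java.time sketch
> > (plain Java 8, nothing Impala- or Kudu-specific; the epoch value is the
> > one used in the test program further down) showing how a single instant
> > reads as different wall-clock times in different zones:
> >
> >     import java.time.Instant;
> >     import java.time.ZoneId;
> >
> >     public class EpochDemo {
> >         public static void main(String[] args) {
> >             // 1509859800000 ms after 1970-01-01 00:00:00 GMT
> >             Instant instant = Instant.ofEpochMilli(1509859800000L);
> >
> >             // one instant, three renderings:
> >             System.out.println(instant);
> >             // 2017-11-05T05:30:00Z
> >             System.out.println(instant.atZone(ZoneId.of("America/New_York")));
> >             // 2017-11-05T01:30-04:00[America/New_York]  (still EDT)
> >             System.out.println(instant.atZone(ZoneId.of("America/Los_Angeles")));
> >             // 2017-11-04T22:30-07:00[America/Los_Angeles]  (still PDT)
> >         }
> >     }
> >
> > This is consistent with the output above showing '2017-11-04 22:30:00.0'
> > for the same getTime() value, if those servers are indeed on Pacific time.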
> >
> > With this understanding I went ahead and wrote my Java code; when I ran
> > the first few tests, I noticed that the timestamp values returned by
> > Impala (I created in Impala an 'external' table 'stored as kudu') were
> > off by several hours compared to the values returned by the original
> > table in Oracle (our servers, both the Oracle ones and the Impala/Kudu
> > ones, are all configured in the 'America/New_York' timezone).
> >
> > To investigate this difference, I created a simple table in Kudu with
> > just two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is more or less defined as
> > follows:
> >
> >     number of microseconds since the Unix epoch (i.e. what I was
> >     expecting originally)
> >   + the offset with respect to GMT of the timestamp I inserted (in my
> >     case the offset for EST or EDT, depending on whether that timestamp
> >     falls in EST (winter) or EDT (summer))
> >
> > This is how Impala achieves what is described as:
> >
> >     Impala does not store timestamps using the local timezone, to avoid
> >     undesired results from unexpected time zone issues
> >
> > That same page has caveats like the following, which sent a shiver down
> > my spine:
> >
> >     If that value was written to a data file, and shipped off to a
> >     distant server to be analyzed alongside other data from far-flung
> >     locations, the dates and times would not match up precisely because
> >     of time zone differences
> >
> > This means that if anyone is using (or even thinking about using)
> > "Impala timestamps" to store, say, financial or health services (or
> > security) events, they'll find some nasty "surprises" (even if they
> > don't plan to ever move their servers and only do business in one
> > timezone).
> >
> > Consider for instance anything that occurred between 1am and 2am EDT on
> > 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT to
> > EST) - there is no way to store and retrieve the timestamps for these
> > events faithfully in Kudu via Impala, because their wall-clock
> > representation is ambiguous.
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8
> > and only well-documented, non-deprecated classes and methods) to do just
> > an insert and a select, via Impala JDBC, of a timestamp row in the
> > simple table that I mentioned above (primary key + timestamp column):
> >
> >     // run insert
> >     long primaryKey = 1L;
> >     PreparedStatement insert = connection.prepareStatement(
> >         "insert into " + table + " values (?, ?)");
> >     insert.setLong(1, primaryKey);
> >     // 2017-11-05 01:30:00 EDT, i.e. inside the repeated hour
> >     Timestamp timestampIn = new Timestamp(1509859800000L);
> >     System.out.println("TimestampIn: " + timestampIn +
> >         " - getTime(): " + timestampIn.getTime());
> >     insert.setTimestamp(2, timestampIn);
> >     insert.executeUpdate();
> >     insert.close();
> >
> >     // run select
> >     PreparedStatement select = connection.prepareStatement(
> >         "select " + timestampColumn + " from " + table +
> >         " where " + primaryKeyColumn + " = ?");
> >     select.setLong(1, primaryKey);
> >     ResultSet resultSet = select.executeQuery();
> >     while (resultSet.next()) {
> >         Timestamp timestampOut = resultSet.getTimestamp(1);
> >         System.out.println("TimestampOut: " + timestampOut +
> >             " - getTime(): " + timestampOut.getTime());
> >     }
> >     resultSet.close();
> >     select.close();
> >
> > and this is the output:
> >
> >     TimestampIn:  2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >     TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
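> >
> > Here is a small self-contained java.time sketch (no Impala or Kudu
> > involved) of why that round trip cannot be lossless: the instant I
> > inserted and the instant I got back are a full hour apart, yet they have
> > exactly the same wall-clock rendering in 'America/New_York':
> >
> >     import java.time.Instant;
> >     import java.time.LocalDateTime;
> >     import java.time.ZoneId;
> >
> >     public class AmbiguousHour {
> >         public static void main(String[] args) {
> >             ZoneId newYork = ZoneId.of("America/New_York");
> >
> >             // the value inserted above: 2017-11-05 01:30:00 EDT (UTC-4)
> >             Instant in  = Instant.ofEpochMilli(1509859800000L);
> >             // the value that came back: 2017-11-05 01:30:00 EST (UTC-5)
> >             Instant out = Instant.ofEpochMilli(1509863400000L);
> >
> >             // both instants have the same local wall-clock form ...
> >             System.out.println(LocalDateTime.ofInstant(in, newYork));
> >             // 2017-11-05T01:30
> >             System.out.println(LocalDateTime.ofInstant(out, newYork));
> >             // 2017-11-05T01:30
> >
> >             // ... even though they are an hour apart on the time line
> >             System.out.println(out.toEpochMilli() - in.toEpochMilli());
> >             // 3600000
> >         }
> >     }
> >
> > Any storage scheme that keeps only the local wall-clock value has to
> > pick one of the two instants when reading '2017-11-05 01:30:00' back,
> > and here it picked the wrong one for my data.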
> >
> > If this kind of timestamp had been designed, say, in 1972, I would
> > probably have been OK with it, but in 2017, when people and companies do
> > business 24/7, I find this approach (and sentences like "dates and times
> > would not match up precisely") unacceptable.
> >
> > To its defense, the Impala TIMESTAMP web page mentioned above spends
> > several paragraphs on how to deal with "UTC timestamps" and problems
> > like the ones shown above, but it basically has the user wrap everything
> > in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> > This proposed solution - besides the performance impact of, say, a query
> > like this:
> >
> >     select ... where to_utc_timestamp(timestamp_column_a, 'America/New_York') <
> >                      to_utc_timestamp(timestamp_column_b, 'America/New_York')
> >
> > - also makes it much harder to write queries involving timestamp fields
> > (which are very common in my workplace) or to use them with automated BI
> > tools.
> >
> > It is definitely not what one would expect from an analytic MPP database
> > that aspires to be used at the Enterprise level in 2017.
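> >
> > For what it's worth, the only workable convention I see on the
> > application side is to normalize everything to UTC wall-clock before it
> > ever reaches the driver. A minimal sketch of that idea (plain Java 8; it
> > assumes, as on our servers, that the JVM default timezone is
> > 'America/New_York', and it is only an illustration of the "store UTC"
> > convention from the docs, not a verified fix for the behavior above):
> >
> >     import java.sql.Timestamp;
> >     import java.time.LocalDateTime;
> >     import java.time.ZoneOffset;
> >
> >     public class UtcNormalize {
> >         public static void main(String[] args) {
> >             // 2017-11-05 01:30:00 EDT, the problematic instant from above
> >             Timestamp local = new Timestamp(1509859800000L);
> >
> >             // render the instant as UTC wall-clock and re-wrap it, so
> >             // the wall-clock value handed to the driver is unambiguous
> >             LocalDateTime utcWallClock =
> >                 LocalDateTime.ofInstant(local.toInstant(), ZoneOffset.UTC);
> >             Timestamp utc = Timestamp.valueOf(utcWallClock);
> >
> >             System.out.println("local: " + local); // local: 2017-11-05 01:30:00.0
> >             System.out.println("utc:   " + utc);   // utc:   2017-11-05 05:30:00.0
> >         }
> >     }
> >
> > The price is that every query then needs from_utc_timestamp(ts,
> > 'America/New_York') to show local times again, which is exactly the
> > usability cost described above.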
> >
> > Regards,
> > Franco Venturi
>