It seems to work as expected with the latest version of the Impala JDBC
driver.
Could you give that a try and see if it solves your issue?
create table t (pk int primary key, ts timestamp)
partition by hash(pk) partitions 2
stored as kudu
tblproperties('kudu.num_tablet_replicas' = '1');
*** Database product version: 2.9.0-cdh5.12.1
*** JDBC Driver version: 02.05.41.1061
TimestampIn: 2017-11-04 22:30:00.0 - getTime(): 1509859800000
TimestampOut: 2017-11-04 22:30:00.0 - getTime(): 1509859800000
On 2017-12-18 04:49, Jeszy <[email protected]> wrote:
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <[email protected]> wrote:
> > Please note that the discussion below refers to the following versions:
> > - Impala: v2.10.0-cdh5.13.0
> > - Kudu: 1.5.0-cdh5.13.0
> > - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers and
> > BI tools using Impala). Most of these tables have columns that are of type
> > "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of
> > this discussion I'll assume we only deal with objects of java.sql.Timestamp,
> > to make things simple).
> > As you probably know, Kudu (starting, I think, with version 1.3.1) has a
> > type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to the
> > "Impala TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >
> > unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
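> >
> > In other words, a unixtime_micros value is just a count of microseconds
> > from that epoch; a quick illustration in plain Java 8 (my own example,
> > nothing Kudu-specific):

```java
import java.time.Instant;

public class EpochDemo {
    public static void main(String[] args) {
        // A unixtime_micros value: microseconds since 1970-01-01 00:00:00 GMT
        long micros = 1509859800000000L;
        // Split into whole seconds plus nanoseconds to build an Instant:
        Instant instant = Instant.ofEpochSecond(
                micros / 1_000_000L, (micros % 1_000_000L) * 1_000L);
        // An Instant is an unambiguous point on the UTC timeline,
        // independent of any server's local timezone:
        System.out.println(instant); // 2017-11-05T05:30:00Z
    }
}
```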
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran the
> > first few tests, I noticed that the timestamp values returned by Impala (I
> > created in Impala an 'external' table 'stored as kudu') were off by several
> > hours compared to the values returned by the original table in Oracle (our
> > servers, both the Oracle ones and the Impala/Kudu ones, are all configured
> > in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with just
> > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is, more or less, defined as
> > follows:
> >
> >
> > number of microseconds since the Unix epoch (i.e. what I was expecting
> > originally)
> > + offset of the timestamp I inserted with respect to GMT (in my case,
> > the offset for EST or EDT, depending on whether that timestamp fell
> > during EST (winter) or EDT (summer))
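> >
> > In code form, my reading of the observed behavior (a sketch only; this is
> > not Impala's actual implementation) is:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class StoredValueSketch {
    public static void main(String[] args) {
        // 2017-11-05 01:30 in America/New_York; java.time resolves this
        // ambiguous wall-clock time to the earlier offset, EDT (UTC-4).
        ZonedDateTime local = ZonedDateTime.of(2017, 11, 5, 1, 30, 0, 0,
                ZoneId.of("America/New_York"));
        long epochMillis = local.toInstant().toEpochMilli();             // 1509859800000
        long offsetMillis = local.getOffset().getTotalSeconds() * 1000L; // -4h = -14400000
        // Sketch of the observed behavior: the value written to Kudu is the
        // local wall-clock time re-labelled as UTC, i.e. the true epoch
        // value shifted by the local UTC offset:
        long storedMillis = epochMillis + offsetMillis;
        System.out.println(storedMillis); // 1509845400000, i.e. "2017-11-05 01:30 UTC"
    }
}
```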
> >
> >
> > This is how Impala achieves what is described as:
> >
> >
> > Impala does not store timestamps using the local timezone, to avoid
> > undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, which sent a shiver down my
> > spine:
> >
> >
> > If that value was written to a data file, and shipped off to a distant
> > server to be analyzed alongside other data from far-flung locations, the
> > dates and times would not match up precisely because of time zone
> > differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala
> > timestamps" to store, say, financial, health-services, or security events,
> > they'll find some nasty "surprises" (even if they never plan to move their
> > servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and 2am
> > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> > to EST) - there's no way to store the timestamps for these events in Kudu
> > via Impala.
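> >
> > A quick java.time illustration of the ambiguity (nothing Impala-specific
> > here): two instants a full hour apart render as the same wall-clock time
> > in America/New_York:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class DstAmbiguity {
    public static void main(String[] args) {
        ZoneId ny = ZoneId.of("America/New_York");
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        // Two instants one hour apart on the real timeline...
        Instant edt = Instant.ofEpochMilli(1509859800000L); // 2017-11-05 05:30 UTC
        Instant est = Instant.ofEpochMilli(1509863400000L); // 2017-11-05 06:30 UTC
        // ...both print as the same local wall-clock time in New York,
        // because the clocks were set back from 02:00 EDT to 01:00 EST:
        System.out.println(fmt.format(edt.atZone(ny))); // 2017-11-05 01:30:00 (EDT)
        System.out.println(fmt.format(est.atZone(ny))); // 2017-11-05 01:30:00 (EST)
    }
}
```

A local wall-clock representation simply cannot distinguish these two events.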
> >
> > To prove this, I wrote this simple piece of Java code (which uses Java 8
> > and only well-documented, non-deprecated classes and methods) to do just an
> > insert and a select, via Impala JDBC, of a timestamp row in the simple
> > table that I mentioned above (primary key + timestamp column):
> >
> >
> >
> > // run insert
> > long primaryKey = 1L;
> > PreparedStatement insert = connection.prepareStatement(
> >     "insert into " + table + " values (?, ?)");
> > insert.setLong(1, primaryKey);
> > Timestamp timestampIn = new Timestamp(1509859800000L);
> > System.out.println("TimestampIn: " + timestampIn
> >     + " - getTime(): " + timestampIn.getTime());
> > insert.setTimestamp(2, timestampIn);
> > insert.executeUpdate();
> > insert.close();
> >
> >
> > // run select
> > PreparedStatement select = connection.prepareStatement("select "
> >     + timestampColumn + " from " + table
> >     + " where " + primaryKeyColumn + "=?");
> > select.setLong(1, primaryKey);
> > ResultSet resultSet = select.executeQuery();
> > while (resultSet.next()) {
> >     Timestamp timestampOut = resultSet.getTimestamp(1);
> >     System.out.println("TimestampOut: " + timestampOut
> >         + " - getTime(): " + timestampOut.getTime());
> > }
> > resultSet.close();
> > select.close();
> >
> >
> >
> > and this is the output:
> >
> >
> > TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> > TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
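> >
> > The round trip drifts by exactly the hour lost in the EDT-to-EST
> > transition; a quick check of the two getTime() values above:

```java
public class RoundTripDrift {
    public static void main(String[] args) {
        long in  = 1509859800000L; // inserted   (2017-11-05 05:30 UTC = 01:30 EDT)
        long out = 1509863400000L; // read back  (2017-11-05 06:30 UTC = 01:30 EST)
        // Identical toString() output, yet the underlying instants differ
        // by exactly one hour, the EDT/EST offset change:
        System.out.println(out - in); // 3600000 ms = 1 hour
    }
}
```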
> >
> >
> >
> > If this kind of timestamp had been designed, say, in 1972, I would
> > probably have been OK with it, but in 2017, when people and companies do
> > business 24/7, I find this approach (and sentences like "dates and times
> > would not match up precisely") unacceptable.
> >
> >
> >
> > In its defense, the Impala TIMESTAMP web page mentioned above spends
> > several paragraphs on how to deal with "UTC timestamps" and problems like
> > the ones shown above, but it basically has the user wrap everything in
> > Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> >
> > This proposed solution - besides the performance impact of, say, having a
> > query like this:
> >
> >
> > select ... where to_utc_timestamp(timestamp_column_a) <
> > to_utc_timestamp(timestamp_column_b)
> >
> >
> > - also makes it much harder to write queries involving timestamp fields
> > (which are very common in my workplace) or to use them with automated BI
> > tools.
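> >
> > For what it's worth, here is a JDBC-side sketch of a possible mitigation:
> > the setTimestamp()/getTimestamp() overloads that take an explicit Calendar
> > are standard JDBC, though I have not verified that the Impala driver
> > honors them:

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.TimeZone;

public class UtcJdbcSketch {
    // Sketch only: whether the Impala JDBC driver respects the Calendar
    // argument is driver-dependent and not verified here.
    static final Calendar UTC = Calendar.getInstance(TimeZone.getTimeZone("UTC"));

    static void insertUtc(PreparedStatement insert, long pk, Timestamp ts)
            throws SQLException {
        insert.setLong(1, pk);
        // Ask the driver to render the timestamp in UTC rather than the
        // JVM's default zone, sidestepping the local-zone round trip:
        insert.setTimestamp(2, ts, UTC);
        insert.executeUpdate();
    }

    static Timestamp readUtc(ResultSet rs) throws SQLException {
        // Symmetric read: interpret the returned value as UTC as well.
        return rs.getTimestamp(1, UTC);
    }
}
```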
> >
> >
> >
> > It is definitely not what one would expect from an analytic MPP database
> > that aspires to be used at the enterprise level in 2017.
> >
> >
> >
> > Regards,
> > Franco Venturi
>