It seems to work as expected with the latest version of the Impala JDBC
driver.
Could you give that a try and see if it solves your issue?
create table t (pk int primary key, ts timestamp)
partition by hash(pk) partitions 2
stored as kudu
tblproperties('kudu.num_tablet_replicas' = '1');
*** Database product version: 2.9.0-cdh5.12.1
*** JDBC Driver version: 02.05.41.1061
TimestampIn: 2017-11-04 22:30:00.0 - getTime(): 1509859800000
TimestampOut: 2017-11-04 22:30:00.0 - getTime(): 1509859800000
On 2017-12-18 04:49, Jeszy <[email protected]> wrote:
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <[email protected]> wrote:
> > Please note that the discussion below refers to the following versions:
> > - Impala: v2.10.0-cdh5.13.0
> > - Kudu: 1.5.0-cdh5.13.0
> > - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers and
> > BI tools using Impala). Most of these tables have columns that are of type
> > "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of
> > this discussion I'll assume we only deal with objects of java.sql.Timestamp,
> > to make things simple).
> > As you probably know, Kudu (starting, I think, with version 1.3.1) has a
> > type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to the
> > "Impala TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >
> > unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
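> >
> > In other words, a unixtime_micros value is just a count of microseconds
> > from that epoch; a quick illustration in plain Java 8 (my own example,
> > nothing Kudu-specific):

```java
import java.time.Instant;

public class EpochDemo {
    public static void main(String[] args) {
        // A unixtime_micros value: microseconds since 1970-01-01 00:00:00 GMT
        long micros = 1509859800000000L;
        // Split into whole seconds plus nanoseconds to build an Instant:
        Instant instant = Instant.ofEpochSecond(
                micros / 1_000_000L, (micros % 1_000_000L) * 1_000L);
        // An Instant is an unambiguous point on the UTC timeline,
        // independent of any server's local timezone:
        System.out.println(instant); // 2017-11-05T05:30:00Z
    }
}
```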
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran the
> > first few tests, I noticed that the timestamp values returned by Impala (I
> > created in Impala an 'external' table 'stored as kudu') were off by several
> > hours compared to the values returned by the original table in Oracle (our
> > servers, both the Oracle ones and the Impala/Kudu ones, are all configured
> > in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with just
> > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is, more or less, defined as
> > follows:
> >
> >
> > number of microseconds since the Unix epoch (i.e. what I was expecting
> > originally)
> > + offset of the timestamp I inserted with respect to GMT (in my case,
> > the offset for EST or EDT, depending on whether that timestamp fell
> > during EST (winter) or EDT (summer))
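> >
> > In code form, my reading of the observed behavior (a sketch only; this is
> > not Impala's actual implementation) is:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class StoredValueSketch {
    public static void main(String[] args) {
        // 2017-11-05 01:30 in America/New_York; java.time resolves this
        // ambiguous wall-clock time to the earlier offset, EDT (UTC-4).
        ZonedDateTime local = ZonedDateTime.of(2017, 11, 5, 1, 30, 0, 0,
                ZoneId.of("America/New_York"));
        long epochMillis = local.toInstant().toEpochMilli();             // 1509859800000
        long offsetMillis = local.getOffset().getTotalSeconds() * 1000L; // -4h = -14400000
        // Sketch of the observed behavior: the value written to Kudu is the
        // local wall-clock time re-labelled as UTC, i.e. the true epoch
        // value shifted by the local UTC offset:
        long storedMillis = epochMillis + offsetMillis;
        System.out.println(storedMillis); // 1509845400000, i.e. "2017-11-05 01:30 UTC"
    }
}
```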
> >
> >
> > This is how Impala achieves what is described as:
> >
> >
> > Impala does not store timestamps using the local timezone, to avoid
> > undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, which sent a shiver down my
> > spine:
> >
> >
> > If that value was written to a data file, and shipped off to a distant
> > server to be analyzed alongside other data from far-flung locations, the
> > dates and times would not match up precisely because of time zone
> > differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala
> > timestamps" to store, say, financial, health-services, or security events,
> > they'll find some nasty "surprises" (even if they never plan to move their
> > servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and 2am
> > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> > to EST) - there's no way to store the timestamps for these events in Kudu
> > via Impala.
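> >
> > A quick java.time illustration of the ambiguity (nothing Impala-specific
> > here): two instants a full hour apart render as the same wall-clock time
> > in America/New_York:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class DstAmbiguity {
    public static void main(String[] args) {
        ZoneId ny = ZoneId.of("America/New_York");
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        // Two instants one hour apart on the real timeline...
        Instant edt = Instant.ofEpochMilli(1509859800000L); // 2017-11-05 05:30 UTC
        Instant est = Instant.ofEpochMilli(1509863400000L); // 2017-11-05 06:30 UTC
        // ...both print as the same local wall-clock time in New York,
        // because the clocks were set back from 02:00 EDT to 01:00 EST:
        System.out.println(fmt.format(edt.atZone(ny))); // 2017-11-05 01:30:00 (EDT)
        System.out.println(fmt.format(est.atZone(ny))); // 2017-11-05 01:30:00 (EST)
    }
}
```

A local wall-clock representation simply cannot distinguish these two events.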
> >
> > To prove this, I wrote this simple piece of Java code (which uses Java 8
> > and only well-documented, non-deprecated classes and methods) to do just an
> > insert and a select, via Impala JDBC, of a timestamp row in the simple
> > table that I mentioned above (primary key + timestamp column):
> >
> >
> >
> > // run insert
> > long primaryKey = 1L;
> > PreparedStatement insert = connection.prepareStatement(
> >     "insert into " + table + " values (?, ?)");
> > insert.setLong(1, primaryKey);
> > Timestamp timestampIn = new Timestamp(1509859800000L);
> > System.out.println("TimestampIn: " + timestampIn
> >     + " - getTime(): " + timestampIn.getTime());
> > insert.setTimestamp(2, timestampIn);
> > insert.executeUpdate();
> > insert.close();
> >
> >
> > // run select
> > PreparedStatement select = connection.prepareStatement("select "
> >     + timestampColumn + " from " + table
> >     + " where " + primaryKeyColumn + "=?");
> > select.setLong(1, primaryKey);
> > ResultSet resultSet = select.executeQuery();
> > while (resultSet.next()) {
> >     Timestamp timestampOut = resultSet.getTimestamp(1);
> >     System.out.println("TimestampOut: " + timestampOut
> >         + " - getTime(): " + timestampOut.getTime());
> > }
> > resultSet.close();
> > select.close();
> >
> >
> >
> > and this is the output:
> >
> >
> > TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> > TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
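> >
> > The round trip drifts by exactly the hour lost in the EDT-to-EST
> > transition; a quick check of the two getTime() values above:

```java
public class RoundTripDrift {
    public static void main(String[] args) {
        long in  = 1509859800000L; // inserted   (2017-11-05 05:30 UTC = 01:30 EDT)
        long out = 1509863400000L; // read back  (2017-11-05 06:30 UTC = 01:30 EST)
        // Identical toString() output, yet the underlying instants differ
        // by exactly one hour, the EDT/EST offset change:
        System.out.println(out - in); // 3600000 ms = 1 hour
    }
}
```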
> >
> >
> >
> > If this kind of timestamp had been designed, say, in 1972, I would
> > probably have been OK with it, but in 2017, when people and companies do
> > business 24/7, I find this approach (and sentences like "dates and times
> > would not match up precisely") unacceptable.
> >
> >
> >
> > In its defense, the Impala TIMESTAMP web page mentioned above spends
> > several paragraphs on how to deal with "UTC timestamps" and problems like
> > the ones shown above, but it basically has the user wrap everything in
> > Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> >
> > This proposed solution - besides the performance impact of, say, having a
> > query like this:
> >
> >
> > select ... where to_utc_timestamp(timestamp_column_a) <
> > to_utc_timestamp(timestamp_column_b)
> >
> >
> > - also makes it much harder to write queries involving timestamp fields
> > (which are very common in my workplace) or to use them with automated BI
> > tools.
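> >
> > For what it's worth, here is a JDBC-side sketch of a possible mitigation:
> > the setTimestamp()/getTimestamp() overloads that take an explicit Calendar
> > are standard JDBC, though I have not verified that the Impala driver
> > honors them:

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.TimeZone;

public class UtcJdbcSketch {
    // Sketch only: whether the Impala JDBC driver respects the Calendar
    // argument is driver-dependent and not verified here.
    static final Calendar UTC = Calendar.getInstance(TimeZone.getTimeZone("UTC"));

    static void insertUtc(PreparedStatement insert, long pk, Timestamp ts)
            throws SQLException {
        insert.setLong(1, pk);
        // Ask the driver to render the timestamp in UTC rather than the
        // JVM's default zone, sidestepping the local-zone round trip:
        insert.setTimestamp(2, ts, UTC);
        insert.executeUpdate();
    }

    static Timestamp readUtc(ResultSet rs) throws SQLException {
        // Symmetric read: interpret the returned value as UTC as well.
        return rs.getTimestamp(1, UTC);
    }
}
```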
> >
> >
> >
> > It is definitely not what one would expect from an analytic MPP database
> > that aspires to be used at the enterprise level in 2017.
> >
> >
> >
> > Regards,
> > Franco Venturi
>