Greg, if you are somewhere on the West Coast (i.e. if your servers are configured to run in the 'America/Los_Angeles' timezone), do you mind changing this line from:
    Timestamp timestampIn = new Timestamp(1509859800000L);

to:

    Timestamp timestampIn = new Timestamp(1509870600000L);

and see if it makes any difference? Also I see you are running a different version of Impalad (2.9.0-cdh5.12.1), while I am on v2.10.0-cdh5.13.0.

In the meantime, I'll give it a try with the latest JDBC driver provided by Cloudera and see how it goes.

Franco

----- Original Message -----
From: "greg rahn" <[email protected]>
To: [email protected]
Sent: Monday, December 18, 2017 6:46:44 PM
Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

It seems to work as expected using the latest version of the Impala JDBC driver. Might you give that a try and see if it solves your issue?

    create table t (pk int primary key, ts timestamp)
    partition by hash(pk) partitions 2
    stored as kudu
    tblproperties('kudu.num_tablet_replicas' = '1');

*** Database product version: 2.9.0-cdh5.12.1
*** JDBC Driver version: 02.05.41.1061
TimestampIn:  2017-11-04 22:30:00.0 - getTime(): 1509859800000
TimestampOut: 2017-11-04 22:30:00.0 - getTime(): 1509859800000

On 2017-12-18 04:49, Jeszy <[email protected]> wrote:
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <[email protected]> wrote:
> > Please note that the discussion below refers to the following versions:
> > - Impala: v2.10.0-cdh5.13.0
> > - Kudu: 1.5.0-cdh5.13.0
> > - Everything runs on a standard Cloudera 5.13 installation
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers
> > and BI tools using Impala). Most of these tables have columns that are
> > of type "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest
> > of this discussion I'll assume we only deal with objects of
> > java.sql.Timestamp, to keep things simple).
> >
> > As you probably know, Kudu, starting I think with version 1.3.1, has a
> > type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to
> > the "Impala TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >     unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
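> >
> > To make that arithmetic concrete, here is a minimal java.time sketch
> > (plain Java 8, nothing Impala- or Kudu-specific; the epoch value is the
> > one used in the test program further down) showing how a single instant
> > reads as different wall-clock times in different zones:
> >
> >     import java.time.Instant;
> >     import java.time.ZoneId;
> >
> >     public class EpochDemo {
> >         public static void main(String[] args) {
> >             // 1509859800000 ms after 1970-01-01 00:00:00 GMT
> >             Instant instant = Instant.ofEpochMilli(1509859800000L);
> >
> >             // one instant, three renderings:
> >             System.out.println(instant);
> >             // 2017-11-05T05:30:00Z
> >             System.out.println(instant.atZone(ZoneId.of("America/New_York")));
> >             // 2017-11-05T01:30-04:00[America/New_York]  (still EDT)
> >             System.out.println(instant.atZone(ZoneId.of("America/Los_Angeles")));
> >             // 2017-11-04T22:30-07:00[America/Los_Angeles]  (still PDT)
> >         }
> >     }
> >
> > This is consistent with the output above showing '2017-11-04 22:30:00.0'
> > for the same getTime() value, if those servers are indeed on Pacific time.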
> >
> > With this understanding I went ahead and wrote my Java code; when I ran
> > the first few tests, I noticed that the timestamp values returned by
> > Impala (I created in Impala an 'external' table 'stored as kudu') were
> > off by several hours compared to the values returned by the original
> > table in Oracle (our servers, both the Oracle ones and the Impala/Kudu
> > ones, are all configured in the 'America/New_York' timezone).
> >
> > To investigate this difference, I created a simple table in Kudu with
> > just two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is more or less defined as
> > follows:
> >
> >     number of microseconds since the Unix epoch (i.e. what I was
> >     expecting originally)
> >   + the offset with respect to GMT of the timestamp I inserted (in my
> >     case the offset for EST or EDT, depending on whether that timestamp
> >     falls in EST (winter) or EDT (summer))
> >
> > This is how Impala achieves what is described as:
> >
> >     Impala does not store timestamps using the local timezone, to avoid
> >     undesired results from unexpected time zone issues
> >
> > That same page has caveats like the following, which sent a shiver down
> > my spine:
> >
> >     If that value was written to a data file, and shipped off to a
> >     distant server to be analyzed alongside other data from far-flung
> >     locations, the dates and times would not match up precisely because
> >     of time zone differences
> >
> > This means that if anyone is using (or even thinking about using)
> > "Impala timestamps" to store, say, financial or health services (or
> > security) events, they'll find some nasty "surprises" (even if they
> > don't plan to ever move their servers and only do business in one
> > timezone).
> >
> > Consider for instance anything that occurred between 1am and 2am EDT on
> > 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT to
> > EST) - there is no way to store and retrieve the timestamps for these
> > events faithfully in Kudu via Impala, because their wall-clock
> > representation is ambiguous.
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8
> > and only well-documented, non-deprecated classes and methods) to do just
> > an insert and a select, via Impala JDBC, of a timestamp row in the
> > simple table that I mentioned above (primary key + timestamp column):
> >
> >     // run insert
> >     long primaryKey = 1L;
> >     PreparedStatement insert = connection.prepareStatement(
> >         "insert into " + table + " values (?, ?)");
> >     insert.setLong(1, primaryKey);
> >     // 2017-11-05 01:30:00 EDT, i.e. inside the repeated hour
> >     Timestamp timestampIn = new Timestamp(1509859800000L);
> >     System.out.println("TimestampIn: " + timestampIn +
> >         " - getTime(): " + timestampIn.getTime());
> >     insert.setTimestamp(2, timestampIn);
> >     insert.executeUpdate();
> >     insert.close();
> >
> >     // run select
> >     PreparedStatement select = connection.prepareStatement(
> >         "select " + timestampColumn + " from " + table +
> >         " where " + primaryKeyColumn + " = ?");
> >     select.setLong(1, primaryKey);
> >     ResultSet resultSet = select.executeQuery();
> >     while (resultSet.next()) {
> >         Timestamp timestampOut = resultSet.getTimestamp(1);
> >         System.out.println("TimestampOut: " + timestampOut +
> >             " - getTime(): " + timestampOut.getTime());
> >     }
> >     resultSet.close();
> >     select.close();
> >
> > and this is the output:
> >
> >     TimestampIn:  2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >     TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
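> >
> > Here is a small self-contained java.time sketch (no Impala or Kudu
> > involved) of why that round trip cannot be lossless: the instant I
> > inserted and the instant I got back are a full hour apart, yet they have
> > exactly the same wall-clock rendering in 'America/New_York':
> >
> >     import java.time.Instant;
> >     import java.time.LocalDateTime;
> >     import java.time.ZoneId;
> >
> >     public class AmbiguousHour {
> >         public static void main(String[] args) {
> >             ZoneId newYork = ZoneId.of("America/New_York");
> >
> >             // the value inserted above: 2017-11-05 01:30:00 EDT (UTC-4)
> >             Instant in  = Instant.ofEpochMilli(1509859800000L);
> >             // the value that came back: 2017-11-05 01:30:00 EST (UTC-5)
> >             Instant out = Instant.ofEpochMilli(1509863400000L);
> >
> >             // both instants have the same local wall-clock form ...
> >             System.out.println(LocalDateTime.ofInstant(in, newYork));
> >             // 2017-11-05T01:30
> >             System.out.println(LocalDateTime.ofInstant(out, newYork));
> >             // 2017-11-05T01:30
> >
> >             // ... even though they are an hour apart on the time line
> >             System.out.println(out.toEpochMilli() - in.toEpochMilli());
> >             // 3600000
> >         }
> >     }
> >
> > Any storage scheme that keeps only the local wall-clock value has to
> > pick one of the two instants when reading '2017-11-05 01:30:00' back,
> > and here it picked the wrong one for my data.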
> >
> > If this kind of timestamp had been designed, say, in 1972, I would
> > probably have been OK with it, but in 2017, when people and companies do
> > business 24/7, I find this approach (and sentences like "dates and times
> > would not match up precisely") unacceptable.
> >
> > To its defense, the Impala TIMESTAMP web page mentioned above spends
> > several paragraphs on how to deal with "UTC timestamps" and problems
> > like the ones shown above, but it basically has the user wrap everything
> > in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> > This proposed solution - besides the performance impact of, say, a query
> > like this:
> >
> >     select ... where to_utc_timestamp(timestamp_column_a, 'America/New_York') <
> >                      to_utc_timestamp(timestamp_column_b, 'America/New_York')
> >
> > - also makes it much harder to write queries involving timestamp fields
> > (which are very common in my workplace) or to use them with automated BI
> > tools.
> >
> > It is definitely not what one would expect from an analytic MPP database
> > that aspires to be used at the Enterprise level in 2017.
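> >
> > For what it's worth, the only workable convention I see on the
> > application side is to normalize everything to UTC wall-clock before it
> > ever reaches the driver. A minimal sketch of that idea (plain Java 8; it
> > assumes, as on our servers, that the JVM default timezone is
> > 'America/New_York', and it is only an illustration of the "store UTC"
> > convention from the docs, not a verified fix for the behavior above):
> >
> >     import java.sql.Timestamp;
> >     import java.time.LocalDateTime;
> >     import java.time.ZoneOffset;
> >
> >     public class UtcNormalize {
> >         public static void main(String[] args) {
> >             // 2017-11-05 01:30:00 EDT, the problematic instant from above
> >             Timestamp local = new Timestamp(1509859800000L);
> >
> >             // render the instant as UTC wall-clock and re-wrap it, so
> >             // the wall-clock value handed to the driver is unambiguous
> >             LocalDateTime utcWallClock =
> >                 LocalDateTime.ofInstant(local.toInstant(), ZoneOffset.UTC);
> >             Timestamp utc = Timestamp.valueOf(utcWallClock);
> >
> >             System.out.println("local: " + local); // local: 2017-11-05 01:30:00.0
> >             System.out.println("utc:   " + utc);   // utc:   2017-11-05 05:30:00.0
> >         }
> >     }
> >
> > The price is that every query then needs from_utc_timestamp(ts,
> > 'America/New_York') to show local times again, which is exactly the
> > usability cost described above.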
> >
> > Regards,
> > Franco Venturi
>