Here is a nice analysis of the issue from the Cassandra mailing list. (DataStax
is the Databricks for Cassandra.)

Should I file a bug?

Kind regards

Andy

And this one:
http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date

On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer <russ...@datastax.com>
wrote:
> Unfortunately this is part of Spark SQL. They have based their type on
> java.sql.Timestamp (and Date), which adjusts to the client timezone when
> displaying and storing.
> See the discussion at
> http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
> and the code at
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> 
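> A minimal illustration of that display behavior, in plain Python rather than
> on the JVM (a sketch; the epoch value corresponds to the 2016-03-12 00:30:00
> UTC bound used further down the thread): the same stored instant is rendered
> through the local timezone, which is what java.sql.Timestamp does when Spark
> SQL displays a TimestampType value.
> 
> import datetime
> 
> epoch = 1457742600  # 2016-03-12 00:30:00 UTC
> print(datetime.datetime.utcfromtimestamp(epoch))  # 2016-03-12 00:30:00
> print(datetime.datetime.fromtimestamp(epoch))     # 2016-03-11 16:30:00 on a PST machine
> 
> A common workaround on Spark 1.x is to pin the driver and executor JVMs to
> UTC, e.g. -Duser.timezone=UTC via spark.driver.extraJavaOptions and
> spark.executor.extraJavaOptions, so the displayed values match the cluster.
> 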

From:  Andrew Davidson <a...@santacruzintegration.com>
Date:  Thursday, March 17, 2016 at 3:25 PM
To:  Andrew Davidson <a...@santacruzintegration.com>, "user @spark"
<user@spark.apache.org>
Subject:  Re: sql timestamp timezone bug

> 
> For completeness: clearly Spark SQL returned a different data set.
> 
> In [4]:
> rawDF.selectExpr("count(row_key) as num_samples",
>                     "sum(count) as total",
>                     "max(count) as max ").show()
> +-----------+-----+---+
> |num_samples|total|max|
> +-----------+-----+---+
> |       2037| 3867| 67|
> +-----------+-----+---+
> 
> 
> From:  Andrew Davidson <a...@santacruzintegration.com>
> Date:  Thursday, March 17, 2016 at 3:02 PM
> To:  "user @spark" <user@spark.apache.org>
> Subject:  sql timestamp timezone bug
> 
>> I am using pyspark 1.6.0 and
>> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
>> data
>> 
>> The data is originally captured by a spark streaming app and written to
>> Cassandra. The value of the timestamp comes from
>> 
>> Rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>()
>> ...});
>> 
>> I am confident the timestamp is stored correctly in Cassandra and that
>> the clocks on the machines in my cluster are set correctly.
>> 
>> I noticed that if I use Cassandra cqlsh to select a data set between two
>> points in time, the row count does not match the row count I get when I do
>> the same select in Spark using SQL. It appears that Spark SQL assumes all
>> timestamp strings are in the local time zone.
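>> 
>> A quick way to see this from pyspark (a minimal sketch, assuming the same
>> Spark 1.6 sqlContext used below): unix_timestamp() parses the literal with
>> the driver JVM's default timezone, so the same string produces a different
>> epoch depending on where the driver runs.
>> 
>> sqlContext.sql(
>>     "select unix_timestamp('2016-03-12 00:30:00') as epoch").show()
>> # on a UTC server:  1457742600
>> # on a PST laptop:  1457771400  (eight hours later)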
>> 
>> 
>> Here is what I expect. (this is what is returned by CQLSH)
>> cqlsh> select
>>    ...     count(row_key) as num_samples, sum(count) as total, max(count)
>> as max
>>    ... from
>>    ...     notification.json_timeseries
>>    ... where
>>    ...     row_key in ('red', 'blue')
>>    ...     and created > '2016-03-12 00:30:00+0000'
>>    ...     and created <= '2016-03-12 04:30:00+0000'
>>    ... allow filtering;
>> 
>>  num_samples | total | max
>> -------------+-------+-----
>>         3242 | 11277 |  17
>> 
>> 
>> Here is my pyspark select statement. Notice the timestamp strings for the
>> 'created' column encode the timezone. I am running this on my local Mac (in
>> the PST timezone) and connecting to my data center (which runs on UTC) over a VPN.
>> 
>> rawDF = sqlContext.read \
>>     .format("org.apache.spark.sql.cassandra") \
>>     .options(table="json_timeseries", keyspace="notification") \
>>     .load()
>> 
>> 
>> tmpTableName = "rawTable"
>> rawDF.registerTempTable(tmpTableName)
>> 
>> 
>> 
>> stmnt = "select \
>> row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
>> unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
>> to_utc_timestamp(created, 'gmt') as gmt \
>> from \
>> rawTable \
>> where \
>> (created > '{0}') and (created <= '{1}') \
>> and \
>> (row_key = 'red' or row_key = 'blue') \
>> ".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
>> 
>> rawDF = sqlContext.sql(stmnt).cache()
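>> 
>> A sketch of a timezone-independent version of the same filter (it assumes the
>> temp table is registered as rawTable, as above; the utc_epoch helper is just
>> for illustration): compare epoch seconds instead of letting the driver
>> re-parse timestamp strings in whatever zone it happens to run in.
>> 
>> import calendar, datetime
>> 
>> def utc_epoch(s):
>>     # parse 'yyyy-MM-dd HH:mm:ss' as UTC and return epoch seconds
>>     return calendar.timegm(
>>         datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').timetuple())
>> 
>> lo = utc_epoch('2016-03-12 00:30:00')   # 1457742600
>> hi = utc_epoch('2016-03-12 04:30:00')   # 1457757000
>> 
>> utcStmnt = "select row_key, created, count \
>> from rawTable \
>> where cast(created as bigint) > {0} \
>> and cast(created as bigint) <= {1} \
>> and (row_key = 'red' or row_key = 'blue')".format(lo, hi)
>> 
>> utcDF = sqlContext.sql(utcStmnt).cache()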
>> 
>> 
>> 
>> 
>> I get different values for row count, max, etc.
>> 
>> If I convert the UTC timestamp strings to my local timezone, the row count
>> matches the count returned by cqlsh.
>> 
>> # pst works, matches cassandra cqlsh
>> # .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
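>> 
>> A sketch of that manual shift (it assumes the Python process and the driver
>> JVM share the same local timezone; utc_to_local_str is illustrative only):
>> 
>> import calendar, datetime
>> 
>> def utc_to_local_str(s):
>>     # parse the UTC bound and re-render it as local wall-clock time,
>>     # which is how Spark SQL ends up interpreting the literal
>>     dt = datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S+0000')
>>     return datetime.datetime.fromtimestamp(
>>         calendar.timegm(dt.timetuple())).strftime('%Y-%m-%d %H:%M:%S')
>> 
>> print(utc_to_local_str('2016-03-12 00:30:00+0000'))
>> # '2016-03-11 16:30:00' on a machine in the PST timezone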
>> 
>> Am I doing something wrong in my pyspark code?
>> 
>> 
>> Kind regards
>> 
>> Andy
>> 
>> 
>> 
>> 

