Here is a nice analysis of the issue from the Cassandra mailing list. (DataStax is to Cassandra roughly what Databricks is to Spark.)
Should I file a bug?

Kind regards,

Andy

See http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date and this one:

On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer <russ...@datastax.com> wrote:

> Unfortunately this is part of Spark SQL. They have based their type on
> java.sql.Timestamp (and Date), which adjust to the client timezone when
> displaying and storing.
>
> See the discussion at
> http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
> and the code at
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
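To make Russell's point concrete, here is a minimal sketch of the behavior (my own example, not from the thread; the one-row "demo" table and its value are made up). The same bare timestamp string yields a different epoch value depending on the default timezone of the JVM that parses it, e.g. when launching pyspark with TZ=UTC versus TZ=America/Los_Angeles:

    # Sketch, assuming pyspark 1.6-era APIs (SQLContext, registerTempTable).
    # Because Spark SQL models timestamps with java.sql.Timestamp, the bare
    # string below is parsed in the JVM's default timezone.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="tz-demo")
    sqlContext = SQLContext(sc)

    df = sqlContext.createDataFrame([("2016-03-12 00:30:00",)], ["created"])
    df.registerTempTable("demo")

    # Roughly 1457742600 with the JVM in UTC, but 1457771400 in
    # America/Los_Angeles (UTC-8 on that date): the same 8-hour shift
    # Andy sees between his Mac and his data center.
    sqlContext.sql("select created, unix_timestamp(created) as epoch from demo").show()

The earlier messages in the thread, with Andy's original report, follow.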
From: Andrew Davidson <a...@santacruzintegration.com>
Date: Thursday, March 17, 2016 at 3:25 PM
To: Andrew Davidson <a...@santacruzintegration.com>, "user @spark" <user@spark.apache.org>
Subject: Re: sql timestamp timezone bug

For completeness: clearly Spark SQL returned a different data set.

In [4]:
rawDF.selectExpr("count(row_key) as num_samples",
                 "sum(count) as total",
                 "max(count) as max").show()

+-----------+-----+---+
|num_samples|total|max|
+-----------+-----+---+
|       2037| 3867| 67|
+-----------+-----+---+

From: Andrew Davidson <a...@santacruzintegration.com>
Date: Thursday, March 17, 2016 at 3:02 PM
To: "user @spark" <user@spark.apache.org>
Subject: sql timestamp timezone bug

I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10
to analyze time series data.

The data is originally captured by a Spark Streaming app and written to
Cassandra. The value of the timestamp comes from

    rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() { … });

I am confident the timestamp is stored correctly in Cassandra and that the
clocks on the machines in my cluster are set correctly.

I noticed that if I use Cassandra CQLSH to select a data set between two
points in time, the row count does not match the row count I get when I run
the same select in Spark SQL. It appears that Spark SQL assumes all timestamp
strings are in the local time zone.

Here is what I expect (this is what is returned by CQLSH):

cqlsh> select
   ...     count(row_key) as num_samples, sum(count) as total, max(count) as max
   ... from
   ...     notification.json_timeseries
   ... where
   ...     row_key in ('red', 'blue')
   ...     and created > '2016-03-12 00:30:00+0000'
   ...     and created <= '2016-03-12 04:30:00+0000'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17

Here is my pyspark select statement. Notice the 'created' column encodes the
timezone. I am running this on my local Mac (in the PST timezone) and
connecting to my data center (which runs on UTC) over a VPN.

rawDF = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()

rawDF.registerTempTable("rawTable")

stmnt = "select \
            row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
            unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
            to_utc_timestamp(created, 'gmt') as gmt \
        from \
            rawTable \
        where \
            (created > '{0}') and (created <= '{1}') \
            and \
            (row_key = 'red' or row_key = 'blue')" \
        .format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')

rawDF = sqlContext.sql(stmnt).cache()

I get different values for the row count, max, …

If I convert the UTC timestamp strings to my local timezone, the row count
matches the count returned by CQLSH:

# PST works, matches the Cassandra CQLSH result
# .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')

Am I doing something wrong in my pyspark code?

Kind regards,

Andy
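Andy's workaround of hand-translating the bounds into PST works, but it hard-codes his laptop's offset. Here is a sketch of doing the same conversion programmatically (my own, not from the thread; the helper name utc_to_local is made up), so the bounds match what CQLSH returns no matter which timezone the driver runs in:

    from datetime import datetime
    import calendar
    import time

    def utc_to_local(utc_str, fmt="%Y-%m-%d %H:%M:%S"):
        """Reinterpret a UTC wall-clock string as the equivalent local time."""
        epoch = calendar.timegm(time.strptime(utc_str, fmt))  # parse as UTC
        return datetime.fromtimestamp(epoch).strftime(fmt)    # render in local tz

    start = utc_to_local("2016-03-12 00:30:00")  # "2016-03-11 16:30:00" in PST
    stop = utc_to_local("2016-03-12 04:30:00")   # "2016-03-11 20:30:00" in PST

    stmnt = "select row_key, created, count \
            from rawTable \
            where (created > '{0}') and (created <= '{1}') \
            and (row_key = 'red' or row_key = 'blue')".format(start, stop)

Alternatively, pinning both the driver and executor JVMs to UTC (for example with spark.driver.extraJavaOptions=-Duser.timezone=UTC and the matching spark.executor.extraJavaOptions setting) makes Spark SQL parse bare timestamp strings as UTC everywhere, so no conversion is needed; later Spark releases added spark.sql.session.timeZone for the same purpose.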