I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data.
The data is originally captured by a Spark Streaming app and written to Cassandra. The value of the timestamp comes from the Time argument of the foreachRDD() callback:
dstream.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
    public void call(JavaRDD<String> rdd, Time time) { ... }
});
I am confident the timestamp is stored correctly in Cassandra and that the clocks on the machines in my cluster are set correctly.
I noticed that if I used Cassandra cqlsh to select a data set between two points in time, the row count did not match the row count I got when I did the same select in Spark using SQL. It appears that Spark SQL assumes all timestamp strings are in the local time zone.
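A minimal way to see this (just a sketch, using the same sqlContext as in the code below): in Spark 1.6, unix_timestamp() parses a timestamp string with the driver JVM's default time zone, so the same literal resolves to a different epoch value on my PST laptop than it would on a UTC machine.
sqlContext.sql("select unix_timestamp('2016-03-12 00:30:00') as epoch_seconds").show()
# on a PST driver this is the epoch for 00:30 PST, i.e. 8 hours later than 00:30 UTC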
Here is what I expect (this is what is returned by cqlsh):
cqlsh> select
   ...     count(row_key) as num_samples, sum(count) as total, max(count) as max
   ... from
   ...     notification.json_timeseries
   ... where
   ...     row_key in ('red', 'blue')
   ...     and created > '2016-03-12 00:30:00+0000'
   ...     and created <= '2016-03-12 04:30:00+0000'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17
Here is my pyspark select statement. Notice the created column values encode the time zone. I am running this on my local Mac (in the PST time zone) and connecting to my data center (which runs on UTC) over a VPN.
rawDF = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()
rawDF.registerTempTable("rawTable")
stmnt = "select \
    row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
    unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
    to_utc_timestamp(created, 'gmt') as gmt \
from \
    rawTable \
where \
    (created > '{0}') and (created <= '{1}') \
    and \
    (row_key = 'red' or row_key = 'blue')".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
rawDF = sqlContext.sql(stmnt).cache()
I get different values for the row count and max.
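As a sanity check (again just a sketch, reusing the DataFrame from above), looking at the earliest and latest created values the Spark query actually returned should show whether the window is shifted by the PST offset:
rawDF.selectExpr("min(created) as first_created", "max(created) as last_created").show(truncate=False)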
If I convert the UTC timestamp strings to my local time zone, the row count matches the count returned by cqlsh:
# pst works, matches cassandra cqlsh
# .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
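Here is a rough sketch of that conversion (the fixed -8 hour offset is my assumption for PST and ignores daylight saving):
from datetime import datetime, timedelta
pst_offset = timedelta(hours=-8)  # PST is UTC-8
start_local = (datetime(2016, 3, 12, 0, 30) + pst_offset).strftime('%Y-%m-%d %H:%M:%S')  # '2016-03-11 16:30:00'
end_local = (datetime(2016, 3, 12, 4, 30) + pst_offset).strftime('%Y-%m-%d %H:%M:%S')    # '2016-03-11 20:30:00'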
Am I doing something wrong in my pyspark code?
Kind regards
Andy