On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson
<[email protected]> wrote:
> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
>
> The data is originally captured by a spark streaming app and written to
> Cassandra. The value of the timestamp comes from
>
> Rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>()
> { ... });
>
> I am confident the timestamp is stored correctly in Cassandra and that
> the clocks on the machines in my cluster are set correctly.
>
> I noticed that if I used Cassandra CQLSH to select a data set between two
> points in time, the row count did not match the row count I got when I did
> the same select in Spark using SQL. It appears that Spark SQL assumes all
> timestamp strings are in the local time zone.
>
>
> Here is what I expect. (this is what is returned by CQLSH)
> cqlsh> select
> ... count(row_key) as num_samples, sum(count) as total, max(count)
> as max
> ... from
> ... notification.json_timeseries
> ... where
> ... row_key in ('red', 'blue')
> ... and created > '2016-03-12 00:30:00+0000'
> ... and created <= '2016-03-12 04:30:00+0000'
> ... allow filtering;
>
>  num_samples | total | max
> -------------+-------+-----
>         3242 | 11277 |  17
>
>
> Here is my pyspark select statement. Notice the 'created' column encodes
> the timezone. I am running this on my local Mac (in the PST timezone) and
> connecting to my data center (which runs on UTC) over a VPN.
>
> rawDF = sqlContext.read\
> .format("org.apache.spark.sql.cassandra")\
> .options(table="json_timeseries", keyspace="notification")\
> .load()
>
>
> rawDF.registerTempTable(tmpTableName)
>
>
>
> stmnt = "select \
> row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
> unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
> to_utc_timestamp(created, 'gmt') as gmt \
> from \
> rawTable \
> where \
> (created > '{0}') and (created <= '{1}') \
> and \
> (row_key = 'red' or row_key = 'blue') \
> ".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
>
> rawDF = sqlCtx.sql(stmnt).cache()
What's the type of `created`? TimestampType?
If yes, then when `created` is compared to a string it gets cast to a
string and compared as a string, so the predicate becomes
cast(created as string) > '2016-03-12 00:30:00+0000'
Could you try this:
sqlCtx.sql("select created, cast(created as string) from rawTable").show()
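
If it does turn out to be a string-vs-string comparison, one workaround worth
trying is to cast the boundary literals to timestamp explicitly, so the filter
is evaluated on TimestampType instead of on strings. A rough, untested sketch
that reuses the temp table and column names from your query (note the bare
literals are still interpreted in the JVM's default timezone, so this only
does what you expect once that zone is UTC; see the note further down):

    # Untested sketch: push the comparison onto TimestampType by casting the
    # boundary literals, instead of letting Spark compare strings.
    # 'rawTable', 'created', 'count' and 'row_key' are the names used above.
    stmnt = """
        select row_key, created, count
        from rawTable
        where created >  cast('2016-03-12 00:30:00' as timestamp)
          and created <= cast('2016-03-12 04:30:00' as timestamp)
          and row_key in ('red', 'blue')
    """
    rawDF = sqlCtx.sql(stmnt).cache()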
>
>
>
> I get different values for the row count, max, ...
>
> If I convert the UTC timestamp string to my local timezone, the row count
> matches the count returned by cqlsh
>
> # pst works, matches cassandra cqlsh
> # .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
>
> Am I doing something wrong in my pyspark code?
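
Probably nothing is wrong with the pyspark code itself; what you describe is
consistent with the driver JVM interpreting the timestamp strings in your
Mac's local (PST) timezone. Rather than shifting the literals by hand, one
thing worth trying is to pin the JVM default timezone to UTC on both the
driver and the executors. A sketch, assuming you can restart the application
(in client mode the driver option normally has to be supplied at launch, e.g.
via spark-submit or spark-defaults.conf, not from an already-running driver):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Sketch: run driver and executors with UTC as the JVM default timezone so
    # string <-> timestamp conversions stop depending on the local (PST) clock.
    conf = (SparkConf()
            .set("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
            .set("spark.executor.extraJavaOptions", "-Duser.timezone=UTC"))

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)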
>
>
> Kind regards
>
> Andy
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]