Hello,

For the past few days I have been trying to use Spark to process and analyse
a Cassandra eventLog table similar to the one shown below.
Basically, what I want to calculate is the delta time (the difference between
the epoch timestamps) between consecutive events for each device id in the
table. It currently works as expected, but I am wondering whether there is a
better or more efficient way of doing this kind of calculation in Spark.

Note that, to simplify the example, I have removed all the Cassandra-specific
code and just use a CSV file.

*eventLog.txt:*
dev_id,event_type,event_ts
-----------------
1,loging,2015-01-03 01:15:00
1,activated,2015-01-03 01:10:00
1,register,2015-01-03 01:00:00
2,get_data,2015-01-02 01:00:10
2,loging,2015-01-02 01:00:00
3,update_data,2015-01-01 01:15:00
3,get_data,2015-01-01 01:10:00
3,loging,2015-01-01 01:00:00
-----------------

*Spark Code:*
-----------------
import java.sql.Timestamp

// Millisecond difference between two "yyyy-MM-dd HH:mm:ss" timestamps (d2 - d1)
def getDateDiff(d1: String, d2: String): Long =
  Timestamp.valueOf(d2).getTime() - Timestamp.valueOf(d1).getTime()

// Parse the CSV into (dev_id, event_type, event_ts), skipping the header line
val rawEvents = sc.textFile("eventLog.txt")
  .filter(!_.startsWith("dev_id"))
  .map(_.split(","))
  .map(e => (e(0).trim.toInt, e(1).trim, e(2).trim))

// Pair each event with the one on the following line: index every event,
// shift the indices by one and join, so that row i meets row i+1
val indexed = rawEvents.zipWithIndex.map(_.swap)
val shifted = indexed.map { case (k, v) => (k - 1, v) }
val joined  = indexed.join(shifted)

// Filter out the pairs whose dev_id's don't match (pairs spanning two devices)
val cleaned = joined.filter { case (_, (v1, v2)) => v1._1 == v2._1 }

// (dev_id, "later_event -> earlier_event", delta in milliseconds)
val eventDuration = cleaned.map { case (_, (v1, v2)) =>
  (v1._1, s"${v1._2} -> ${v2._2}", getDateDiff(v2._3, v1._3))
}
eventDuration.collect.foreach(println)
-----------------

*Output:*
-----------------
(1,loging -> activated,300000)
(3,get_data -> loging,600000)
(1,activated -> register,600000)
(2,get_data -> loging,10000)
(3,update_data -> get_data,300000)
-----------------

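As a point of comparison, one alternative I have been playing with is to drop
the index/shift/join entirely and instead group the events of each device,
sort them, and slide over consecutive pairs. This is only a rough sketch (it
reuses rawEvents and getDateDiff from above, and it assumes that all events
of a single device fit in the memory of one executor, since groupByKey
gathers them on a single node):

-----------------
// Alternative: group all events per device, sort them newest-first and pair
// up consecutive events with sliding(2)
val byDevice = rawEvents
  .keyBy(_._1)
  .groupByKey()
  .flatMapValues { events =>
    events.toList
      .sortBy(e => Timestamp.valueOf(e._3).getTime)(Ordering[Long].reverse)
      .sliding(2)
      .collect { case List(later, earlier) =>
        (s"${later._2} -> ${earlier._2}", getDateDiff(earlier._3, later._3))
      }
  }
byDevice.collect.foreach(println) // e.g. (1,(loging -> activated,300000))
-----------------

This replaces the two-sided join with a single groupByKey shuffle and no
longer depends on the events of a device being contiguous and pre-sorted in
the input file, but since groupByKey concentrates the whole history of a
device in one place I am not sure it is actually better for large histories.
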
This code was inspired by the following posts:
http://stackoverflow.com/questions/26560292/apache-spark-distance-between-two-points-using-squareddistance
http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-distance-calculation-on-moving-objects-RDD-td20729.html
http://stackoverflow.com/questions/28236347/functional-approach-in-sequential-rdd-processing-apache-spark


Best regards and thanks in advance for any suggestions,
Sebastian 


