Hello,

For the past few days I have been trying to process and analyse a Cassandra eventLog table with Spark, similar to the one shown below. What I want to calculate is the delta time (in epoch milliseconds) between consecutive events of each device id in the table. The code below works as expected, but I am wondering if there is a better or more optimal way of achieving this kind of calculation in Spark.
Note that to simplify the example I have removed all the Cassandra specifics and just use a CSV file.

*eventLog.txt:*
dev_id,event_type,event_ts
-----------------
1,loging,2015-01-03 01:15:00
1,activated,2015-01-03 01:10:00
1,register,2015-01-03 01:00:00
2,get_data,2015-01-02 01:00:10
2,loging,2015-01-02 01:00:00
3,update_data,2015-01-01 01:15:00
3,get_data,2015-01-01 01:10:00
3,loging,2015-01-01 01:00:00
-----------------

*Spark Code:*
-----------------
import java.sql.Timestamp

// Difference in milliseconds between two "yyyy-MM-dd HH:mm:ss" timestamps
def getDateDiff(d1: String, d2: String): Long = {
  Timestamp.valueOf(d2).getTime() - Timestamp.valueOf(d1).getTime()
}

// Parse each line into a (dev_id, event_type, event_ts) tuple
val rawEvents = sc.textFile("eventLog.txt")
  .map(_.split(","))
  .map(e => (e(0).trim.toInt, e(1).trim, e(2).trim))

// Index every row, then shift the index by one so that joining
// pairs each row with the row that follows it in the file
val indexed = rawEvents.zipWithIndex.map(_.swap)
val shifted = indexed.map { case (k, v) => (k - 1, v) }
val joined  = indexed.join(shifted)

// Keep only pairs whose dev_id's match
val cleaned = joined.filter(x => x._2._1._1 == x._2._2._1)

// Emit (dev_id, "later event -> earlier event", delta in ms);
// v1 is the newer event, v2 the one before it in time
val eventDuration = cleaned.map { case (i, (v1, v2)) =>
  (v1._1, s"${v1._2} -> ${v2._2}", getDateDiff(v2._3, v1._3))
}

eventDuration.collect.foreach(println)
-----------------

*Output:*
-----------------
(1,loging -> activated,300000)
(3,get_data -> loging,600000)
(1,activated -> register,600000)
(2,get_data -> loging,10000)
(3,update_data -> get_data,300000)
-----------------

This code was inspired by the following posts:
http://stackoverflow.com/questions/26560292/apache-spark-distance-between-two-points-using-squareddistance
http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-distance-calculation-on-moving-objects-RDD-td20729.html
http://stackoverflow.com/questions/28236347/functional-approach-in-sequential-rdd-processing-apache-spark

Best regards and thanks in advance for any suggestions,
Sebastian
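PS: One alternative I have been considering, since the indexed self-join shuffles the whole log twice: group the events per device with groupByKey and pair consecutive events with sliding(2). This is only a rough sketch; it assumes all events of a single device fit in memory on one executor, and it sorts the events explicitly instead of relying on the file order. It reuses getDateDiff and rawEvents from above.
-----------------
// Rough sketch of a per-device alternative (assumes the events of one
// device fit in memory on an executor)
val deltas = rawEvents
  .map { case (id, ev, ts) => (id, (ev, ts)) }
  .groupByKey()
  .flatMap { case (id, events) =>
    events.toSeq
      // newest event first, the same ordering the file relies on
      .sortBy(e => Timestamp.valueOf(e._2).getTime)(Ordering[Long].reverse)
      .sliding(2)  // consecutive event pairs per device
      .collect { case Seq((ev1, ts1), (ev2, ts2)) =>
        // delta = newer minus older, in ms
        (id, s"$ev1 -> $ev2", getDateDiff(ts2, ts1))
      }
  }

deltas.collect.foreach(println)
-----------------
For the sample file this should produce the same five tuples (ordering aside), but I am not sure how it compares shuffle-wise on a large table, so any input on that would also be welcome.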