The problem is simple: I want to stream data 24/7, do some calculations, and save the result in a CSV/JSON file so that I can use it for visualization with dc.js/d3.js.
I opted for Spark Streaming on a YARN cluster with Kafka and tried running it 24/7, using groupByKey and updateStateByKey to keep the computed historical data. Initially streaming works fine, but after a few hours I get:

14/10/30 23:48:49 ERROR TaskSetManager: Task 2485162.0:3 failed 4 times; aborting job
14/10/30 23:48:50 ERROR JobScheduler: Error running job streaming job 1414692270000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2485162.0:3 failed 4 times, most recent failure: Exception failure in TID 478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)

I guess it is due to groupByKey and updateStateByKey. I tried groupByKey(100) to increase the number of partitions.

Also, the state keeps growing. For example, at the 10th second 1,000 records are in state; at the 100th second 20,000 records are in state, of which 19,000 are no longer being updated. How do I remove them from state? With updateStateByKey returning None: how and when should I do that? How do I know when to return None, and how do I save the data before returning None?

I also tried not sending any data for a few hours, but when I checked the web UI the application was shown as FINISHED:

app-20141030203943-0000 NewApp 0 6.0 GB 2014/10/30 20:39:43 hadoop FINISHED 4.2 h

This confuses me. The code calls awaitTermination, and I did not stop the application myself. Will streaming stop if no data is received for a significant amount of time? Is there any documentation on how long Spark will keep running when no data is streamed?
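To make the state-removal question concrete, here is a minimal sketch in plain Python of the kind of update function I have in mind. It is not taken from my job and does not import Spark. It relies on updateStateByKey calling the update function for every key already in state on each batch (with an empty list of new values when nothing arrived) and on a returned None dropping the key. The names KeyState, IDLE_TIMEOUT_S and update_state are hypothetical, and passing batch_time as an argument is an assumption made so the sketch can be run on its own; in a real job the time would have to come from the wall clock or from timestamps carried in the records. Idle keys are flagged as expired for one batch, so that an output step could save them, and are dropped on the following batch.

```python
from typing import List, NamedTuple, Optional


class KeyState(NamedTuple):
    """Hypothetical per-key state kept between batches."""
    total: int       # running aggregate for the key
    last_seen: int   # batch time (seconds) of the most recent update
    expired: bool    # True for one batch before the key is dropped


IDLE_TIMEOUT_S = 60  # hypothetical: drop keys idle for this many seconds


def update_state(new_values: List[int],
                 state: Optional[KeyState],
                 batch_time: int,
                 idle_timeout: int = IDLE_TIMEOUT_S) -> Optional[KeyState]:
    """Return the next state for one key, or None to remove the key."""
    if new_values:
        # Fresh data: update the aggregate and clear any expiry flag.
        previous_total = state.total if state is not None else 0
        return KeyState(previous_total + sum(new_values), batch_time, False)
    if state is None:
        # No data and no prior state: nothing to keep.
        return None
    if state.expired:
        # Flagged in the previous batch, so it could be saved; drop it now.
        return None
    if batch_time - state.last_seen >= idle_timeout:
        # Idle too long: keep it one more batch, marked as expired.
        return state._replace(expired=True)
    return state


if __name__ == "__main__":
    # Simulate one key across four batches at t = 0, 30, 60, 70 seconds.
    state = None
    for batch_time, values in [(0, [1, 2]), (30, []), (60, []), (70, [])]:
        state = update_state(values, state, batch_time)
        print(batch_time, state)
```

With this shape, the step that writes the CSV/JSON could filter the state for entries with expired set to True and save them during the one batch in which they are still present. Whether that is the right way to do it in Spark is exactly what I am asking.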