The problem is simple

I want a to stream data 24/7 do some calculations and save the result in a
csv/json file so that i could use it for visualization using dc.js/d3.js

I opted for spark streaming on yarn cluster with kafka tried running it for
24/7

Using GroupByKey and updateStateByKey to have the computed historical data

Initially streaming is working fine.. but after few hours i am getting

14/10/30 23:48:49 ERROR TaskSetManager: Task 2485162.0:3 failed 4 times;
aborting job
14/10/30 23:48:50 ERROR JobScheduler: Error running job streaming job
1414692270000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task
2485162.0:3 failed 4 times, most recent failure: Exception failure in TID
478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
    at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
    at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
    at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
I guess its due to the GroupByKey and updateStateByKey, i tried
GroupByKey(100) increased partition

Also when data is in state say for eg 10th sec 1000 records are in state,
100th sec 20,000 records are in state out of which 19,000 records are not
updated how to remove them from state.. UpdateStateByKey(none) how and when
to do that, how we will know when to send none, and save the data before
setting none?

I also tried not sending any data a few hours but check the web ui i am
getting task FINISHED

app-20141030203943-0000 NewApp  0       6.0 GB  2014/10/30 20:39:43     hadoop  
FINISHED
4.2 h

This makes me confused.. In the code it says awaitTermination, but did not
terminate the task.. will streaming stop if no data is received for a
significant amount of time? Is there any doc available on how much time
spark will run when no data is streamed? Any Doc available



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Issue-not-running-24-7-tp17791.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Reply via email to