I have a simple code snippet for the shell. I'm running 0.8.0, and this
happens with both the standalone Spark master and Mesos. Basically, I'm
just reading a local file from the login node, broadcasting its contents
as a set, and then filtering token lists embedded in an RDD. However, the
stage fails every time I run it. I'm looking for advice about the problem.
I suspect a misconfiguration on the cluster, but I have no idea where to
start. Any suggestions are appreciated. Code snippet and log messages
follow.

// read the token lookup file from the login node's local disk
val wordLookup = scala.io.Source.fromFile("/data/share/rnaTokLookup", "latin1").getLines().toList

// the first tab-separated field of each line is the token we care about
val rnaToks = wordLookup.map(ss => { val chunks = ss.split("\\t"); chunks(0) }).toSet

val rnaToksB = sc.broadcast(rnaToks)

val text = sc.textFile("hdfs://sanji-03/user/tom/rnaHuge/docVecs")

// each input line is "docId<TAB>space-separated tokens"
val ngrams = text.map(tt => { val blobs = tt.split("\\t"); (blobs(0), blobs(1).split(" ")) })
// ngrams: RDD[(String, Array[String])]

// keep only the tokens present in the broadcast set
val ngramsLight = ngrams.map(tt => (tt._1, tt._2.filter(rnaToksB.value.contains(_))))

ngramsLight.map(tt => tt._1 + "\t" + tt._2.mkString(" ")).saveAsTextFile("hdfs://sanji-03/user/tom/rnaHuge/docVecsLight")

It runs just fine until it hits:
13/10/22 13:55:55 INFO client.Client$ClientActor: Executor updated: app-20131022135234-0033/5 is now FAILED (Command exited with code 1)
13/10/22 13:55:55 INFO cluster.SparkDeploySchedulerBackend: Executor app-20131022135234-0033/5 removed: Command exited with code 1
13/10/22 13:56:05 INFO client.Client$ClientActor: Connecting to master spark://sanji-03:7077
13/10/22 13:56:05 ERROR client.Client$ClientActor: Error notifying standalone scheduler's driver actor
org.apache.spark.SparkException: Error notifying standalone scheduler's driver actor
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.removeExecutor(StandaloneSchedulerBackend.scala:192)
    at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.executorRemoved(SparkDeploySchedulerBackend.scala:90)
    at org.apache.spark.deploy.client.Client$ClientActor$$anonfun$receive$1.apply(Client.scala:92)
    at org.apache.spark.deploy.client.Client$ClientActor$$anonfun$receive$1.apply(Client.scala:72)
    at akka.actor.Actor$class.apply(Actor.scala:318)
    at org.apache.spark.deploy.client.Client$ClientActor.apply(Client.scala:51)
    at akka.actor.ActorCell.invoke(ActorCell.scala:626)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
    at akka.dispatch.Mailbox.run(Mailbox.scala:179)
    at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
    at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
    at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
    at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000] milliseconds
    at akka.dispatch.DefaultPromise.ready(Future.scala:870)
    at akka.dispatch.DefaultPromise.result(Future.scala:874)
    at akka.dispatch.Await$.result(Future.scala:74)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.removeExecutor(StandaloneSchedulerBackend.scala:189)
    ... 13 more
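
The root cause is a 10-second future timeout, which makes me wonder whether the driver actor is just slow to respond under load rather than gone. If I'm reading the 0.8 source right, that timeout comes from spark.akka.askTimeout (in seconds, default 10), so one experiment would be to raise it via the same -D mechanism the launch command below already uses, e.g. in conf/spark-env.sh:

SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.askTimeout=60"

(Treat the property name as my best guess from the source, not something I've confirmed fixes this.)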

After this, the node that failed is brought back and assigned a new task,
and a similar train of messages is sprayed over and over again, generally
involving different nodes. The worker log for this instance shows a number
of restarts, which follow this pattern:
13/10/22 13:32:33 INFO worker.Worker: Asked to launch executor app-20131022133233-0024/5 for Spark shell
13/10/22 13:32:33 INFO worker.ExecutorRunner: Launch command: "java" "-cp" "/data/scala/scala-0.8.0-incubating/lib/scala-library.jar:/data/spark/spark-0.8.0-incubating/conf:/data/spark/spark-0.8.0-incubating/assembly/target/scala-2.9.3/spark-assembly-0.8.0-incubating-hadoop1.2.1.jar" "-Dspark.local.dir=/data/spark/tmp-0.8.0" "-Dspark.worker.timeout=120000" "-Dspark.local.dir=/data/spark/tmp-0.8.0" "-Dspark.worker.timeout=120000" "-Xms73728M" "-Xmx73728M" "org.apache.spark.executor.StandaloneExecutorBackend" "akka://[email protected]:36446/user/StandaloneScheduler" "5" "sanji-08.int.westgroup.com" "22"
13/10/22 13:37:38 INFO worker.Worker: Asked to kill executor app-20131022133233-0024/5
13/10/22 13:37:38 INFO worker.ExecutorRunner: Killing process!
13/10/22 13:37:38 INFO worker.ExecutorRunner: Runner thread for executor app-20131022133233-0024/5 interrupted
13/10/22 13:37:38 INFO worker.ExecutorRunner: Redirection to /data/spark/spark-0.8.0-incubating/work/app-20131022133233-0024/5/stderr closed: Stream closed
13/10/22 13:37:38 INFO worker.Worker: Executor app-20131022133233-0024/5 finished with state KILLED
