I have a Spark program that works in local mode but throws an error in
yarn-client mode on a cluster. On the edge node, in my home directory, I
have an output directory (called transout) that is ready to receive files.
The Spark job is supposed to write a few hundred files into that
directory, one for each iteration of a foreach over an RDD. This works in
local mode, and my only guess as to why it fails in yarn-client mode is
that the RDD is distributed across many nodes, so the PrintWriter ends up
running on the datanodes, where the output directory doesn't exist. Is
this what's happening? Any proposed solution? (I sketched one idea below
the error trace, but I'm not sure it's the right approach.)

Abbreviated code:

import java.io.PrintWriter
...
rdd.foreach { id =>
  // one output file per RDD element
  val outFile = new PrintWriter("transoutput/output.%s".format(id))
  outFile.println("test")
  outFile.close()
}

Error:

14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26)
14/09/10 16:57:09 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: transoutput/input.598718 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
    at java.io.PrintWriter.<init>(PrintWriter.java:146)
    at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98)
    at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
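
If the issue really is that each task runs on an executor and tries to
open transoutput/ on that node's local filesystem, would writing the
files into HDFS from inside the foreach be the right fix? Here is a rough
sketch of what I have in mind; the HDFS path is just a placeholder, and
I'm assuming FileSystem.get(new Configuration()) picks up the cluster's
Hadoop config on the executors:

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

rdd.foreach { id =>
  // Write to HDFS so it doesn't matter which node runs the task.
  // Placeholder path; in my case it would be under my HDFS home dir.
  val fs = FileSystem.get(new Configuration())
  val out = new PrintWriter(fs.create(new Path("/user/myuser/transoutput/output.%s".format(id))))
  try {
    out.println("test")
  } finally {
    out.close()
  }
}

Or would it be better to restructure this as a transformation and use
saveAsTextFile, accepting one file per partition instead of one per
element?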
