I have a Spark program that works in local mode but throws an error in yarn-client mode on a cluster. On the edge node, in my home directory, I have an output directory (called transoutput) that is ready to receive files. The Spark job I'm running is supposed to write a few hundred files into that directory, one for each iteration of a foreach call. This works in local mode, and my only guess as to why it fails in yarn-client mode is that the RDD is distributed across many nodes, so the program ends up calling the PrintWriter on the data nodes, where the output directory doesn't exist. Is this what's happening? Any proposed solution?
Abbreviation of the code:

    import java.io.PrintWriter
    ...
    rdd.foreach { id =>
      val outFile = new PrintWriter("transoutput/output.%s".format(id))
      outFile.println("test")
      outFile.close()
    }

Error:

    14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26)
    14/09/10 16:57:09 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
    java.io.FileNotFoundException: transoutput/input.598718 (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
        at java.io.PrintWriter.<init>(PrintWriter.java:146)
        at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98)
        at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
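One workaround I'm considering, if my guess about the executors is right, is to pull the results back to the driver and do the file writing on the edge node, where the transoutput directory actually exists. A rough sketch, assuming the data is small enough to collect into driver memory and that each record just carries the id used in the filename:

    import java.io.PrintWriter

    // collect() brings the elements back to the driver (the edge node),
    // so PrintWriter opens files on the local filesystem that has transoutput/
    rdd.collect().foreach { id =>
      val outFile = new PrintWriter("transoutput/output.%s".format(id))
      outFile.println("test")
      outFile.close()
    }

If the data is too big to collect, I suppose the alternative is to have the executors write to a shared filesystem like HDFS instead of their local disks, but I'd like to confirm the diagnosis first.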