Re: PrintWriter error in foreach
Ok, so I don't think the workers on the data nodes will be able to see my output directory on the edge node. I don't think stdout will work either, so I'll write to HDFS via rdd.saveAsTextFile(...).

On Wed, Sep 10, 2014 at 3:51 PM, Daniil Osipov wrote:

> Try providing full path to the file you want to write, and make sure the
> directory exists and is writable by the Spark process.
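The plan above (skip local files and stdout, write the RDD to HDFS instead) can be sketched roughly like this. The record shape, the id field, and the output path are hypothetical stand-ins for the real job, and note that saveAsTextFile writes one part-NNNNN file per partition, not one file per record:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WriteToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WriteToHdfsSketch"))

    // Hypothetical records: (id, payload) pairs standing in for the real data.
    val rdd = sc.parallelize(Seq((1, "test"), (2, "test")))

    // Instead of opening a local PrintWriter on each worker, format each
    // record as a text line and let Spark write the whole RDD to HDFS.
    // The target path must not already exist, or the save will fail.
    rdd.map { case (id, payload) => s"$id\t$payload" }
      .saveAsTextFile("hdfs:///user/hypothetical/transoutput")

    sc.stop()
  }
}
```

If one file per id is really required, a common follow-up is to key the records by id and repartition before saving, but for a few hundred small outputs a single saveAsTextFile plus a downstream split is usually simpler.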
Re: PrintWriter error in foreach
Try providing full path to the file you want to write, and make sure the directory exists and is writable by the Spark process.

On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra wrote:

> I have a spark program that worked in local mode, but throws an error in
> yarn-client mode on a cluster.
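Spelled out, the suggestion looks roughly like this inside the foreach (the absolute path is a hypothetical example). One caveat: when this runs inside a Spark task, the directory has to exist and be writable on whichever worker node executes the task, not just on the edge node:

```scala
import java.io.{File, PrintWriter}

rdd.foreach { id =>
  // Hypothetical absolute path; in yarn-client mode this must exist and be
  // writable on the worker node where the task actually runs.
  val outDir = new File("/home/arun/transoutput")
  outDir.mkdirs() // create the directory if it is missing

  val outFile = new PrintWriter(new File(outDir, "output.%s".format(id)))
  outFile.println("test")
  outFile.close()
}
```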
PrintWriter error in foreach
I have a Spark program that worked in local mode, but throws an error in yarn-client mode on a cluster. On the edge node in my home directory, I have an output directory (called transout) which is ready to receive files. The Spark job I'm running is supposed to write a few hundred files into that directory, one for each iteration of a foreach function. This works in local mode, and my only guess as to why this would fail in yarn-client mode is that the RDD is distributed across many nodes and the program is trying to use the PrintWriter on the data nodes, where the output directory doesn't exist. Is this what's happening? Any proposed solution?

Abbreviation of the code:

    import java.io.PrintWriter
    ...
    rdd.foreach { id =>
      val outFile = new PrintWriter("transoutput/output.%s".format(id))
      outFile.println("test")
      outFile.close()
    }

Error:

    14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26)
    14/09/10 16:57:09 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
    java.io.FileNotFoundException: transoutput/input.598718 (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
        at java.io.PrintWriter.<init>(PrintWriter.java:146)
        at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98)
        at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)