For testing, maybe try using .collect() and comparing expected and actual in memory rather than on disk?
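Something like this, roughly (untested sketch; I'm substituting sc.parallelize with sample data for your initMyRdd, and assuming the dataset is small enough to collect on the driver):

  import org.apache.spark.SparkContext

  object RddCompareTest {
    def main(args: Array[String]) {
      // local mode is enough for a test, and nothing touches the filesystem
      val sc = new SparkContext("local", "rdd-compare-test")

      // stand-in for initMyRdd(): a small, known dataset
      val myRdd = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))

      // pull everything back to the driver and compare in memory;
      // only safe when the test data is small
      val actual = myRdd.collect().map(_.mkString(", ")).toSeq
      val expected = Seq("a, b", "c, d")
      assert(actual == expected, "expected " + expected + " but got " + actual)

      sc.stop()
    }
  }

That sidesteps saveAsTextFile (and its _temporary directory handling) entirely for test assertions.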

On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren <[email protected]> wrote:

> I just tried your suggestion and get the same results with the _temporary
> directory. Thanks though.
>
>
> On 1/2/2014 10:28 AM, Andrew Ash wrote:
>
> You want to write it to a local file on the machine? Try using
> "file:///path/to/target/mydir/" instead.
>
> I'm not sure what the behavior would be if you did this on a multi-machine
> cluster though -- you may get a bit of data on each machine in that local
> directory.
>
>
> On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren <[email protected]> wrote:
>
>> I have a very simple Spark application that looks like the following:
>>
>>   var myRdd: RDD[Array[String]] = initMyRdd()
>>   println(myRdd.first.mkString(", "))
>>   println(myRdd.count)
>>
>>   myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
>>   myRdd.saveAsTextFile("target/mydir/")
>>
>> The println statements work as expected. The first saveAsTextFile
>> statement also works as expected. The second saveAsTextFile statement
>> does not (even if the first is commented out). I get the exception pasted
>> below. If I inspect "target/mydir" I see that there is a directory called
>> _temporary/0/_temporary/attempt_201401020953_0000_m_000000_1, which
>> contains an empty part-00000 file. It's curious because this code worked
>> before with Spark 0.8.0, and now I am running on Spark 0.8.1. I happen to
>> be running this on Windows in "local" mode at the moment. Perhaps I
>> should try running it on my Linux box.
>>
>> Thanks,
>> Philip
>>
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
>> Task 2.0:0 failed more than 0 times; aborting job
>> java.lang.NullPointerException
>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
>>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
>>   at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
>>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
>>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
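For reference, the file:/// form suggested upthread would look roughly like this (a sketch only, using the placeholder path from the earlier message; whether it avoids the _temporary problem on Windows is untested):

  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")  // works, per above
  // explicit local-filesystem URI instead of a bare relative path;
  // on a multi-machine cluster each worker would write its own
  // partitions to its own local disk under this directory
  myRdd.saveAsTextFile("file:///path/to/target/mydir/")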
