Yep - that works great and is what I normally do.
Perhaps I should have framed my email as a bug report. The
documentation for saveAsTextFile says you can write results out to a
local file, but it doesn't behave as described for me. It also worked
in Spark 0.8.0 and no longer does in 0.8.1, so it seems like a bug.
Should I file a Jira issue? I haven't done that for this project yet,
but I would be happy to.
Thanks,
Philip
On 1/2/2014 11:23 AM, Andrew Ash wrote:
For testing, maybe try using .collect() and comparing expected and
actual in memory rather than on disk?
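Something along these lines, say (the expected value below is made up
just for illustration):

    val expected = Array(Array("a", "b"), Array("c", "d"))  // made-up expected rows
    val actual = myRdd.collect()                            // pull results back to the driver
    assert(expected.deep == actual.deep)                    // deep comparison of nested arrays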
On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren <[email protected]> wrote:
I just tried your suggestion and get the same results with the
_temporary directory. Thanks though.
On 1/2/2014 10:28 AM, Andrew Ash wrote:
You want to write it to a local file on the machine? Try using
"file:///path/to/target/mydir/" instead
I'm not sure what the behavior would be if you did this on a
multi-machine cluster, though -- you may get a bit of data on each
machine in that local directory.
On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren <[email protected]> wrote:
I have a very simple Spark application that looks like the
following:
import org.apache.spark.rdd.RDD

val myRdd: RDD[Array[String]] = initMyRdd()  // initMyRdd() elided for brevity
println(myRdd.first.mkString(", "))          // sanity check: first element
println(myRdd.count)                         // sanity check: element count
myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")  // write to HDFS
myRdd.saveAsTextFile("target/mydir/")               // write to a local relative path
The println statements work as expected. The first
saveAsTextFile statement also works as expected. The second
saveAsTextFile statement does not (even if the first is
commented out). I get the exception pasted below. If I
inspect "target/mydir" I see that there is a directory called
_temporary/0/_temporary/attempt_201401020953_0000_m_000000_1
which contains an empty part-00000 file. It's curious
because this code worked with Spark 0.8.0, and now it fails
on Spark 0.8.1. I happen to be running this on Windows in
"local" mode at the moment. Perhaps I should try running it
on my Linux box.
Thanks,
Philip
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)