Yep - that works great and is what I normally do.
Perhaps I should have framed my email as a bug report. The
documentation for saveAsTextFile says you can write results out to a
local file, but it doesn't behave as described for me. It also worked
in Spark 0.8.0 and no longer does in 0.8.1, so it seems like a bug.
Should I file a Jira issue? I haven't done that for this project yet,
but I would be happy to.
Thanks,
Philip
On 1/2/2014 11:23 AM, Andrew Ash wrote:
For testing, maybe try using .collect() and comparing expected and
actual in memory rather than on disk?
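Something along these lines, say (the expected value below is made up
just for illustration):

    val expected = Array(Array("a", "b"), Array("c", "d"))  // made-up expected rows
    val actual = myRdd.collect()                            // pull results back to the driver
    assert(expected.deep == actual.deep)                    // deep comparison of nested arrays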
On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren <[email protected]> wrote:
I just tried your suggestion and get the same results with the
_temporary directory. Thanks though.
On 1/2/2014 10:28 AM, Andrew Ash wrote:
You want to write it to a local file on the machine? Try using
"file:///path/to/target/mydir/" instead
I'm not sure what the behavior would be if you did this on a
multi-machine cluster, though -- you may get a bit of data on each
machine in that local directory.
On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren <[email protected]> wrote:
I have a very simple Spark application that looks like the
following:
import org.apache.spark.rdd.RDD

val myRdd: RDD[Array[String]] = initMyRdd()  // initMyRdd() elided for brevity
println(myRdd.first.mkString(", "))          // sanity check: first element
println(myRdd.count)                         // sanity check: element count
myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")  // write to HDFS
myRdd.saveAsTextFile("target/mydir/")               // write to a local relative path
The println statements work as expected. The first
saveAsTextFile statement also works as expected. The second
saveAsTextFile statement does not (even if the first is
commented out). I get the exception pasted below. If I
inspect "target/mydir" I see that there is a directory called
_temporary/0/_temporary/attempt_201401020953_0000_m_000000_1
which contains an empty part-00000 file. It's curious
because this code worked with Spark 0.8.0, and now it fails
on Spark 0.8.1. I happen to be running this on Windows in
"local" mode at the moment. Perhaps I should try running it
on my Linux box.
Thanks,
Philip
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)