We are trying out Spark 0.8.0 on a Windows box. We can get it to run all
of the examples that don't write results to disk, but we can't get it to
write output.

Has anyone been able to write out to a local file on a single-node Windows
install without using HDFS?

Here is our test code:

import org.apache.spark.SparkContext

object FileWritingTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[1]", "File Writing Test", null, null, null, null)
    // Generate some work to do.
    val res = sc.parallelize(Range(0, 10), 10).flatMap(p => "%d".format(p * 10))
    // Save the results out to a file.
    res.saveAsTextFile("file:///c:/somepath")
  }
}

This works as expected on a Unix-based system. However, when running from
a Windows cmd shell I get the following errors:

[WARN] 11 Dec 2013 12:00:33 - org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Saving as hadoop file of type (NullWritable, Text)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Starting job: saveAsTextFile at Test.scala:19
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Got job 0 (saveAsTextFile at Test.scala:19) with 10 output partitions (allowLocal=false)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Final stage: Stage 0 (saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Parents of final stage: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Missing parents: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Submitting Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19), which has no missing parents
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Submitting 10 missing tasks from Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Size of task 0 is 5966 bytes
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Running 0
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Loss was due to org.apache.hadoop.util.Shell$ExitCodeException
org.apache.hadoop.util.Shell$ExitCodeException: chmod: getting attributes of `/cygdrive/c/somepath/_temporary/_attempt_201312111200_0000_m_000000_0/part-00000': No such file or directory
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:261)
        at org.apache.hadoop.util.Shell.run(Shell.java:188)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:467)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:450)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:593)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:584)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:427)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:465)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:781)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:118)
        at org.apache.hadoop.mapred.SparkHadoopWriter.open(SparkHadoopWriter.scala:86)
        at org.apache.spark.rdd.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:667)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
        at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
        at org.apache.spark.scheduler.local.LocalScheduler.runTask(LocalScheduler.scala:198)
        at org.apache.spark.scheduler.local.LocalActor$$anonfun$launchTask$1$$anon$1.run(LocalScheduler.scala:68)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Remove TaskSet 0.0 from pool
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Failed to run saveAsTextFile at Test.scala:19
Exception in thread "main" org.apache.spark.SparkException: Job failed: Task 0.0:0 failed more than 4 times; aborting job
org.apache.hadoop.util.Shell$ExitCodeException: chmod: getting attributes of `/cygdrive/c/somepath/_temporary/_attempt_201312111200_0000_m_000000_0/part-00000': No such file or directory
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)

The fact that it's using a Cygwin path
(/cygdrive/c/somepath/_temporary/_attempt_201312111200_0000_m_000000_0/part-00000)
seems suspect, since I'm running from a cmd shell; presumably a Cygwin
chmod on the PATH is being picked up when Hadoop's RawLocalFileSystem
shells out to set file permissions. Running from a Cygwin shell leads to
other errors.

Has anyone been able to get simple file output to work from either a
Cygwin shell or the Windows cmd shell?
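
For what it's worth, for small outputs it should be possible to sidestep
the failure by collecting the results to the driver and writing them with
plain java.io, which bypasses Hadoop's RawLocalFileSystem (and its chmod
shell-out) entirely. A sketch, assuming the result fits in driver memory
and reusing res from the test code above:

import java.io.{File, PrintWriter}

// collect() pulls the whole RDD back to the driver, so this is only
// viable for small results.
val writer = new PrintWriter(new File("c:/somepath/output.txt"))
try {
  res.collect().foreach(writer.println)
} finally {
  writer.close()
}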

Does anyone know if it is Spark or Hadoop that is transforming the path?
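
To help narrow that down, here is a minimal check (the class name and
output path are just placeholders) that exercises the same Hadoop code
path the stack trace shows (ChecksumFileSystem.create() calling
RawLocalFileSystem.setPermission()) without Spark involved. If it fails
with the same chmod error, the transformation is happening in Hadoop, or
in the Cygwin chmod it shells out to, rather than in Spark:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LocalFsCheck {
  def main(args: Array[String]): Unit = {
    // FileSystem.getLocal returns the checksummed local filesystem,
    // the same wrapper saveAsTextFile writes through.
    val fs = FileSystem.getLocal(new Configuration())
    // create() triggers the setPermission()/chmod call that fails in
    // the Spark job above.
    val out = fs.create(new Path("file:///c:/somepath/check/part-00000"))
    out.writeBytes("test\n")
    out.close()
  }
}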




-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  [email protected]
