You might try a more standard Windows path. I typically write to a
local directory such as "target/spark-output".
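Something along these lines has worked for me (a sketch only, not
tested against your setup; "target/spark-output" is just an example
relative directory):

    // Relative path with no file:/// URI scheme; it resolves against
    // the working directory of the driver process.
    res.saveAsTextFile("target/spark-output")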
On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote:
We are trying to test out running Spark 0.8.0 on a Windows box; while
we can run all the examples that don't write results to disk, we
can't get it to write any output.
Has anyone been able to write out to a local file on a single-node
Windows install without using HDFS?
Here is our test code:
import org.apache.spark.SparkContext

object FileWritingTest {
  def main(args: Array[String]): Unit = {
    // Local mode, one worker thread; the trailing nulls fill in the
    // optional constructor arguments.
    val sc = new SparkContext("local[1]", "File Writing Test", null, null, null, null)
    // Generate some work to do.
    val res = sc.parallelize(Range(0, 10), 10).flatMap(p => "%d".format(p * 10))
    // Save the results out to a file.
    res.saveAsTextFile("file:///c:/somepath")
  }
}
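A quick way to sanity-check the results after the save (sketch only;
it assumes the save above succeeded, and just reads the part files
back in):

    // Read the saved output directory back and print its contents.
    sc.textFile("file:///c:/somepath").collect().foreach(println)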
This works as expected on a Unix-based system. However, when running
it from a Windows cmd shell, I get the following errors:
[WARN] 11 Dec 2013 12:00:33 - org.apache.hadoop.util.NativeCodeLoader
- Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Saving
as hadoop file of type (NullWritable, Text)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class -
Starting job: saveAsTextFile at Test.scala:19
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Got job
0 (saveAsTextFile at Test.scala:19) with 10 output partitions
(allowLocal=false)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Final
stage: Stage 0 (saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Parents
of final stage: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Missing
parents: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class -
Submitting Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19),
which has no missing parents
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class -
Submitting 10 missing tasks from Stage 0 (MappedRDD[2] at
saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Size of
task 0 is 5966 bytes
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Running 0
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Loss
was due to org.apache.hadoop.util.Shell$ExitCodeException
org.apache.hadoop.util.Shell$ExitCodeException: chmod: getting attributes of `/cygdrive/c/somepath/_temporary/_attempt_201312111200_0000_m_000000_0/part-00000': No such file or directory
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:261)
    at org.apache.hadoop.util.Shell.run(Shell.java:188)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:467)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:450)
    at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:593)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:584)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:427)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:465)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:781)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:118)
    at org.apache.hadoop.mapred.SparkHadoopWriter.open(SparkHadoopWriter.scala:86)
    at org.apache.spark.rdd.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:667)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
    at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
    at org.apache.spark.scheduler.local.LocalScheduler.runTask(LocalScheduler.scala:198)
    at org.apache.spark.scheduler.local.LocalActor$$anonfun$launchTask$1$$anon$1.run(LocalScheduler.scala:68)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Remove
TaskSet 0.0 from pool
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Failed
to run saveAsTextFile at Test.scala:19
Exception in thread "main" org.apache.spark.SparkException: Job failed: Task 0.0:0 failed more than 4 times; aborting job
org.apache.hadoop.util.Shell$ExitCodeException: chmod: getting attributes of `/cygdrive/c/somepath/_temporary/_attempt_201312111200_0000_m_000000_0/part-00000': No such file or directory
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
The fact that it's using a Cygwin path
(/cygdrive/c/somepath/_temporary/_attempt_201312111200_0000_m_000000_0/part-00000)
seems suspect, since I'm running from a cmd shell. Running from a
Cygwin shell leads to other errors.
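For what it's worth, the trace shows the chmod call coming from
Hadoop's RawLocalFileSystem via org.apache.hadoop.util.Shell, i.e.
Hadoop shells out to whatever chmod is first on the PATH. Here is a
minimal probe of that kind of shell-out (a sketch; it assumes some
chmod, e.g. Cygwin's, is on the PATH, and "c:/somepath/nonexistent"
is just an example path):

    import scala.sys.process._

    object ChmodProbe {
      def main(args: Array[String]): Unit = {
        // Run the same kind of command Hadoop's Shell class runs and
        // see how the chmod on the PATH reports the path back.
        val exitCode = Seq("chmod", "644", "c:/somepath/nonexistent").!
        println("chmod exit code: " + exitCode)
      }
    }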
Has anyone been able to get simple file output to work from either a
Cygwin shell or the Windows cmd shell?
Does anyone know if it is Spark or Hadoop that is transforming the path?
--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: [email protected]