Hi, I'm trying to use SparkContext.addFile() to propagate a file to worker nodes in a standalone cluster (2 nodes: 1 master, 1 worker connected to the master). I don't have HDFS or any other distributed file system; I'm just playing with the basics. Here's the code run in my driver (actually spark-shell, running on the master node). The current directory contains the file spam.data. The following commands are taken from the book http://www.packtpub.com/fast-data-processing-with-spark/book , page 44.
scala> sc.addFile("spam.data")
14/04/07 14:03:48 INFO Utils: Copying /home/thierry/dev/spark-samples/packt-book/LoadSaveExample/spam.data to /tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data
14/04/07 14:03:49 INFO SparkContext: Added file spam.data at http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972

scala> import org.apache.spark.SparkFiles
import org.apache.spark.SparkFiles

scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))
14/04/07 14:05:00 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=311387750
14/04/07 14:05:00 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 296.8 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:13

Now trigger some action to make the worker do some work:

scala> inFile.count()

In the app's stderr.log on the worker:

14/04/07 14:05:33 INFO Executor: Fetching http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972
14/04/07 14:05:33 INFO Utils: Fetching http://192.168.1.51:59008/files/spam.data to /tmp/fetchFileTemp435286457200696761.tmp

So apparently the file was successfully downloaded from the driver to the worker. The application jar is also downloaded successfully.
But a bit later, in the same stderr.log:

14/04/07 14:05:34 INFO HttpBroadcast: Reading broadcast variable 0 took 0.352334273 s
14/04/07 14:05:34 INFO HadoopRDD: Input split: file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data:0+349170
14/04/07 14:05:34 ERROR Executor: Exception in task ID 0
java.io.FileNotFoundException: File file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

So it looks like the file is being looked for at /tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data, which is the temp location on the master node where the driver is running, whereas on the worker node it was downloaded to /tmp/fetchFileTemp435286457200696761.tmp.

I also see Hadoop-related classes in the stack trace. Does that mean HDFS is being used? If so, is it because I'm using the precompiled spark-0.9.0-incubating-bin-hadoop2?

I couldn't find an answer, neither on the Spark user list, nor by googling, nor in the Spark guides (sorry for what is probably a very basic question).

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-addFile-and-FileNotFoundException-tp3844.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
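P.S. Here's an untested sketch of what I'm *guessing* the intended pattern might be: calling SparkFiles.get() inside the task closure, so that the path is resolved on the executor rather than on the driver. (The single-element parallelize is just my way of forcing the read to happen inside a task; that part is my own assumption, not from the book.)

```scala
import org.apache.spark.SparkFiles

// Distribute the local file to every node that will run tasks.
sc.addFile("spam.data")

// Resolve the path *inside* the closure: SparkFiles.get() then runs on
// the executor and returns that node's local copy of the fetched file.
val lines = sc.parallelize(Seq(0), 1).flatMap { _ =>
  scala.io.Source.fromFile(SparkFiles.get("spam.data")).getLines()
}
println(lines.count())
```

My (possibly wrong) understanding of the failure above is that sc.textFile(SparkFiles.get("spam.data")) evaluates SparkFiles.get() on the driver, bakes the driver-local /tmp/spark-.../spam.data path into the HadoopRDD, and the executor then fails to find that path on its own file system.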