Hi,
I'm trying to use SparkContext.addFile() to propagate a file to the worker
nodes of a standalone cluster (2 nodes: 1 master and 1 worker connected to
the master). I don't have HDFS or any other distributed file system; I'm
just playing with the basics.
Here's the code in my driver (actually spark-shell, running on the master
node). In the current directory I have the file spam.data.
The following commands are taken from the book
http://www.packtpub.com/fast-data-processing-with-spark/book , page 44


*scala> sc.addFile("spam.data")*

14/04/07 14:03:48 INFO Utils: Copying
/home/thierry/dev/spark-samples/packt-book/LoadSaveExample/spam.data to
/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data
14/04/07 14:03:49 INFO SparkContext: Added file spam.data at
http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972

*scala> import org.apache.spark.SparkFiles*
import org.apache.spark.SparkFiles

*scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))*

14/04/07 14:05:00 INFO MemoryStore: ensureFreeSpace(138763) called with
curMem=0, maxMem=311387750
14/04/07 14:05:00 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 135.5 KB, free 296.8 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:13
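
Side note: if I understand correctly, textFile itself is lazy, but
SparkFiles.get("spam.data") has already been evaluated at this point, in
the shell on the driver. A quick check (my own assumption, not from the
book):

*scala> println(SparkFiles.get("spam.data"))*

I'd expect this to print the driver-local temp path, i.e. the same
/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data that Utils
reported copying to above, and that path is what ends up inside the RDD.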


Now I trigger an action so the worker actually does some work.

*scala> inFile.count()*


In the application's stderr.log on the worker:

14/04/07 14:05:33 INFO Executor: Fetching
http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972
14/04/07 14:05:33 INFO Utils: Fetching
http://192.168.1.51:59008/files/spam.data to
/tmp/fetchFileTemp435286457200696761.tmp

So apparently the file was successfully downloaded from the driver to the
worker; the application jar is downloaded successfully as well.
But a bit later, in the same stderr.log:

14/04/07 14:05:34 INFO HttpBroadcast: Reading broadcast variable 0 took
0.352334273 s
14/04/07 14:05:34 INFO HadoopRDD: Input split:
file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data:0+349170
14/04/07 14:05:34 ERROR Executor: Exception in task ID 0
java.io.FileNotFoundException: File
file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data does not
exist
        at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
        at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
        at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
        at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
        at
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
        at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

It looks like the file is being looked for at:

/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data

which is the temp location on the master node, where the driver is running,
whereas on the worker node it was downloaded to
/tmp/fetchFileTemp435286457200696761.tmp
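
If that's right, the path returned by SparkFiles.get on the driver means
nothing on the worker. Here's a sketch of what I'd guess is the intended
pattern, resolving the path inside the task so that each executor uses its
own fetched copy (my assumption, I haven't verified it):

import org.apache.spark.SparkFiles

sc.addFile("spam.data")

// Resolve the worker-local path inside the task, not on the driver,
// and read the file with plain Scala I/O instead of sc.textFile.
val lineCount = sc.parallelize(Seq(1)).map { _ =>
  val localPath = SparkFiles.get("spam.data")  // each executor's own copy
  scala.io.Source.fromFile(localPath).getLines().size
}.first()

Is that the intended usage of addFile, or am I missing something?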

I see Hadoop-related classes in the stack trace. Does that mean HDFS is
being used (though the trace mentions RawLocalFileSystem, so maybe that's
just Hadoop's wrapper around the local file system)? If so, is it because
I'm using the precompiled spark-0.9.0-incubating-bin-hadoop2?

I couldn't find an answer, either in the Spark user list archives, by
googling, or in the Spark guides (sorry if this is a very basic question).



