I have a standalone Spark 3.2.0 cluster with two workers, started on PC_A, and
I want to run a PySpark job from PC_B. The job needs to load a text file, but I
keep getting "file not found" errors when I execute the job. The file
"/home/bddev/parrot/words.txt" exists on PC_B but not on PC_A.
try 1:
>>> df = spark.read.text("/home/bddev/parrot/words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
22/03/14 14:14:44 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11)
(pca executor 0): java.io.FileNotFoundException:
File file:/home/bddev/parrot/words.txt does not exist
...
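My guess at what happens in try 1: spark.read.text() with a plain path makes
each executor on PC_A open /home/bddev/parrot/words.txt on its own local
filesystem, where the file doesn't exist. So presumably copying the file to
the same path on PC_A and reading it with an explicit file:// URI would work
(untested):

>>> df = spark.read.text("file:///home/bddev/parrot/words.txt")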
try 2:
>>> from pyspark import SparkFiles
>>> sc.addFile("/home/bddev/parrot/words.txt")
>>> SparkFiles.get("words.txt")
'/tmp/spark-43bf6d61-45a5-463f-adb9-ad4240743010/userFiles-261ec611-2655-4e05-a76c-681122bd22f1/words.txt'
>>> df = spark.read.text("words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
[Stage 1:> (0 + 16) / 16]
[lots of network activity, looks like the file is being copied over from PC_B
to PC_A]
22/03/14 14:19:21 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 72)
(pca executor 1): java.io.FileNotFoundException:
File file:/home/bddev/parrot/words.txt does not exist
...
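If I understand the docs correctly, try 2 fails because sc.addFile() downloads
the file into a per-node temp directory, while spark.read.text("words.txt")
still resolves the name against the driver's working directory, not that temp
directory. SparkFiles.get() has to be called inside the task, where it returns
that node's own local copy. An untested sketch of what I mean:

>>> from pyspark import SparkFiles
>>> sc.addFile("/home/bddev/parrot/words.txt")
>>> def read_words(_):
...     # runs on an executor; get() resolves to that node's copy of the file
...     with open(SparkFiles.get("words.txt")) as f:
...         return f.read().splitlines()
...
>>> sc.parallelize([0], 1).flatMap(read_words) \
...     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).take(5)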
How can I work with files that are not local to my machine? For example, can I
put the file on the cluster machine locally and access it from PySpark
somehow? Is it required that the cluster and the client "see" the file at the
same path?
This is just for playing around; longer term I plan on a somewhat different
setup, with the data sitting on a remote network-attached storage machine.
Thanks!
//hinko
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]