Hey All,

We are running PySpark spark-submit from a client outside the cluster. The
client has network connectivity only to the YARN master, not to the HDFS
DataNodes. How can we submit jobs in this setup? The idea would be to preload
all the dependencies (job code, libraries, etc.) to HDFS ahead of time and
just submit the job from the client.
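
Concretely, the preload step we have in mind would look something like this,
run from a host inside the cluster that can reach the DataNodes (the paths
here are placeholders, not our real layout):

    # Upload the PySpark runtime archive and our job artifacts to HDFS ahead of time
    # (hypothetical paths, for illustration only)
    hdfs dfs -mkdir -p /apps/spark /apps/myjob
    hdfs dfs -put $SPARK_HOME/python/lib/pyspark.zip /apps/spark/
    hdfs dfs -put foo.py deps.zip /apps/myjob/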

We tried something like this:

    PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn \
      --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py
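
To clarify the intent: --py-files is meant to point at a zip of our Python
dependencies that has already been uploaded to HDFS, and the final argument is
the main script, also on HDFS. With placeholder paths spelled out (these are
not our real paths), the command would be roughly:

    # hypothetical paths, for illustration only
    PYSPARK_ARCHIVES_PATH=hdfs:///apps/spark/pyspark.zip \
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --py-files hdfs:///apps/myjob/deps.zip \
      hdfs:///apps/myjob/foo.py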

The error we are getting is:

    org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while
    waiting for channel to be ready for connect. ch :
    java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]

    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
    /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip
    could only be written to 0 of the 1 minReplication nodes. There are 2
    datanode(s) running and 2 node(s) are excluded in this operation.

A few questions:
1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf files?
Why would the client send them to the cluster? (The cluster already has all
that info - this would make sense in client mode, but not in cluster mode.)
2) Is it possible to use spark-submit without HDFS access from the client?
3) How would we fix this?

Cheers,
Eugene

-- 

Eugene Miretsky
Managing Partner |  Badal.io | Book a meeting /w me!
<http://calendly.com/eugene-badal>
mobile:  416-568-9245
email:     eug...@badal.io <zb...@badal.io>
