Hi, how are you submitting your Spark job from your client?
Your files can either be on HDFS or on an HCFS such as gs:// or s3://. With reference to '--py-files hdfs://yarn-master-url hdfs://foo.py', I assume you want your spark-submit to look something like this:

spark-submit --verbose \
    --deploy-mode cluster \
    --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
    --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
    --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
    --py-files $CODE_DIRECTORY_CLOUD/dataproc_on_gke.zip \
    --conf spark.driver.memory=4G \
    --conf spark.executor.memory=4G \
    --conf spark.executor.instances=4 \
    --conf spark.executor.cores=2 \
    $CODE_DIRECTORY_CLOUD/${APPLICATION}

(Note: spark.executor.instances is the property name Spark recognises for the number of executors; "spark.num.executors" is not a valid configuration key.)

In my case I define $CODE_DIRECTORY_CLOUD as below, on Google Cloud Storage:

CODE_DIRECTORY="/home/hduser/dba/bin/python/"
CODE_DIRECTORY_CLOUD="gs://${PROJECT}-spark-on-k8s/codes"
cd $CODE_DIRECTORY
[ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
echo `date` ", ===> creating source zip directory from ${source_code}"
# zip needs to be done at the root directory of the code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD
gsutil cp /home/hduser/dba/bin/python/${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD

So in summary, I create a zip file of my project, copy it across to cloud storage, put the application (.py file) there as well, and reference both in spark-submit.

I trust this answers your question. HTH

Mich Talebzadeh,
Technologist, Solutions Architect & Engineer
London, United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
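For clarity on why the zip in --py-files works: Spark ships each --py-files entry to the driver and executors and puts it on sys.path, and Python can import modules directly out of a zip archive. The sketch below (not the author's code; the package and function names are made up for illustration) builds a toy zip the way `zip -rq ${source_code}.zip ${source_code}` would, then imports from it the same way a PySpark job would import from dataproc_on_gke.zip:

```python
# Minimal sketch, assuming a hypothetical package "dataproc_on_gke" with a
# module "utils". Spark's --py-files does the sys.path step for you on every
# node; here we do it by hand to show the mechanism.
import os
import sys
import tempfile
import zipfile

# Build a toy zip mimicking: zip -rq source_code.zip source_code
tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "dataproc_on_gke.zip")
with zipfile.ZipFile(zpath, "w") as zf:
    # The package directory must sit at the root of the zip,
    # which is why the zip is created from the code's parent directory.
    zf.writestr("dataproc_on_gke/__init__.py", "")
    zf.writestr("dataproc_on_gke/utils.py", "def double(x):\n    return 2 * x\n")

# Equivalent of what Spark does for each --py-files entry
sys.path.insert(0, zpath)

from dataproc_on_gke.utils import double
print(double(21))  # prints 42
```

This is also why the zip must be created at the root directory of the code: if the zip contained src/dataproc_on_gke/... instead, the import path inside the job would not resolve.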
On Wed, 15 Nov 2023 at 21:33, Eugene Miretsky <eug...@badal.io.invalid> wrote:

> Hey All,
>
> We are running Pyspark spark-submit from a client outside the cluster. The
> client has network connectivity only to the Yarn Master, not the HDFS
> Datanodes. How can we submit the jobs? The idea would be to preload all the
> dependencies (job code, libraries, etc) to HDFS, and just submit the job
> from the client.
>
> We tried something like this:
> 'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master
> yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'
>
> The error we are getting is:
>
> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
>
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip
> could only be written to 0 of the 1 minReplication nodes. There are 2
> datanode(s) running and 2 node(s) are excluded in this operation.
>
> A few questions:
> 1) What are the spark_conf.zip files? Is it the hive-site/yarn-site conf
> files? Why would the client send them to the cluster? (The cluster already
> has all that info - this would make sense in client mode, but not cluster
> mode.)
> 2) Is it possible to use spark-submit without HDFS access?
> 3) How would we fix this?
>
> Cheers,
> Eugene
>
> --
>
> Eugene Miretsky
> Managing Partner | Badal.io | Book a meeting /w me!
> <http://calendly.com/eugene-badal>
> mobile: 416-568-9245
> email: eug...@badal.io <zb...@badal.io>