Hey! Thanks for the response.
We are getting the error because there is no network connectivity to the data
nodes - that's expected. What I am trying to find out is WHY we need access to
the data nodes, and whether there is a way to submit a job without it.

Cheers,
Eugene

On Wed, Nov 15, 2023 at 7:32 PM eab...@163.com <eab...@163.com> wrote:

> Hi Eugene,
>
> I think you should check whether the HDFS service is running properly. From
> the logs, it appears that there are two datanodes in HDFS, but none of them
> are healthy. Please investigate why the datanodes are not functioning
> properly. It seems that the issue might be due to insufficient disk space.
>
> ------------------------------
> eabour
>
> *From:* Eugene Miretsky <eug...@badal.io.INVALID>
> *Date:* 2023-11-16 05:31
> *To:* user <user@spark.apache.org>
> *Subject:* Spark-submit without access to HDFS
>
> Hey All,
>
> We are running Pyspark spark-submit from a client outside the cluster. The
> client has network connectivity only to the YARN master, not the HDFS
> datanodes. How can we submit jobs? The idea would be to preload all the
> dependencies (job code, libraries, etc.) to HDFS and just submit the job
> from the client.
>
> We tried something like this:
>
> PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master
> yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py
>
> The error we are getting is:
>
> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
>
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip
> could only be written to 0 of the 1 minReplication nodes. There are 2
> datanode(s) running and 2 node(s) are excluded in this operation.
>
> A few questions:
> 1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf
> files? Why would the client send them to the cluster? (The cluster already
> has all that info - this would make sense in client mode, but not cluster
> mode.)
> 2) Is it possible to use spark-submit without HDFS access?
> 3) How would we fix this?
>
> Cheers,
> Eugene

--
*Eugene Miretsky*
Managing Partner | Badal.io | Book a meeting /w me!
<http://calendly.com/eugene-badal>
mobile: 416-568-9245
email: eug...@badal.io <zb...@badal.io>
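P.S. For context on why datanode access comes up at all: as far as I understand, in YARN cluster mode the spark-submit client itself generates and uploads a conf archive (the spark_conf.zip in the error) to the .sparkStaging directory, and HDFS writes go from the client directly to a datanode's transfer port (9866, which matches the timeout in the log). So preloading dependencies alone may not avoid that one upload. Below is a minimal sketch of what we are attempting with everything preloaded on HDFS; every hdfs:// path and file name here is a placeholder assumption, not a real path from our cluster, and the script only prints the command rather than running it:

```shell
#!/bin/sh
# Sketch: submit with all dependencies preloaded on HDFS.
# All hdfs:// paths below are placeholders for illustration only.

PYSPARK_ZIP="hdfs:///apps/spark/pyspark.zip"   # placeholder path
PY_FILES="hdfs:///apps/jobs/deps.zip"          # placeholder path
APP_FILE="hdfs:///apps/jobs/foo.py"            # placeholder path

# spark.yarn.archive points YARN at a preloaded archive of Spark jars,
# so the client does not have to upload them at submit time.
CMD="spark-submit --master yarn --deploy-mode cluster \
--conf spark.yarn.archive=hdfs:///apps/spark/spark-libs.zip \
--py-files ${PY_FILES} ${APP_FILE}"

# Print instead of executing, since this is only a sketch.
echo "PYSPARK_ARCHIVES_PATH=${PYSPARK_ZIP} ${CMD}"
```

Even with this, the conf archive upload from the client would presumably still need a writable path to a datanode, which is the crux of the question.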