Hey! Thanks for the response.
We are getting the error because there is no network connectivity to the data
nodes - that's expected. What I am trying to find out is WHY we need access to
the data nodes, and whether there is a way to submit a job without it.

Cheers,
Eugene

On Wed, Nov 15, 2023 at 7:32 PM eab...@163.com <eab...@163.com> wrote:

> Hi Eugene,
>
> I think you should check whether the HDFS service is running properly. From
> the logs, it appears that there are two datanodes in HDFS, but none of them
> are healthy. Please investigate why the datanodes are not functioning
> properly. It seems that the issue might be due to insufficient disk space.
>
> ------------------------------
> eabour
>
> *From:* Eugene Miretsky <eug...@badal.io.INVALID>
> *Date:* 2023-11-16 05:31
> *To:* user <user@spark.apache.org>
> *Subject:* Spark-submit without access to HDFS
>
> Hey All,
>
> We are running Pyspark spark-submit from a client outside the cluster. The
> client has network connectivity only to the YARN master, not the HDFS
> datanodes. How can we submit jobs? The idea would be to preload all the
> dependencies (job code, libraries, etc.) to HDFS and just submit the job
> from the client.
>
> We tried something like this:
>
> PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master
> yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py
>
> The error we are getting is:
>
> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
>
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip
> could only be written to 0 of the 1 minReplication nodes. There are 2
> datanode(s) running and 2 node(s) are excluded in this operation.
>
> A few questions:
> 1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf
> files? Why would the client send them to the cluster? (The cluster already
> has all that info - this would make sense in client mode, but not cluster
> mode.)
> 2) Is it possible to use spark-submit without HDFS access?
> 3) How would we fix this?
>
> Cheers,
> Eugene

--
*Eugene Miretsky*
Managing Partner | Badal.io | Book a meeting /w me!
<http://calendly.com/eugene-badal>
mobile: 416-568-9245
email: eug...@badal.io <zb...@badal.io>
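P.S. For context on why datanode access comes up at all: as far as I understand, in YARN cluster mode the spark-submit client itself generates and uploads a conf archive (the spark_conf.zip in the error) to the .sparkStaging directory, and HDFS writes go from the client directly to a datanode's transfer port (9866, which matches the timeout in the log). So preloading dependencies alone may not avoid that one upload. Below is a minimal sketch of what we are attempting with everything preloaded on HDFS; every hdfs:// path and file name here is a placeholder assumption, not a real path from our cluster, and the script only prints the command rather than running it:

```shell
#!/bin/sh
# Sketch: submit with all dependencies preloaded on HDFS.
# All hdfs:// paths below are placeholders for illustration only.

PYSPARK_ZIP="hdfs:///apps/spark/pyspark.zip"   # placeholder path
PY_FILES="hdfs:///apps/jobs/deps.zip"          # placeholder path
APP_FILE="hdfs:///apps/jobs/foo.py"            # placeholder path

# spark.yarn.archive points YARN at a preloaded archive of Spark jars,
# so the client does not have to upload them at submit time.
CMD="spark-submit --master yarn --deploy-mode cluster \
--conf spark.yarn.archive=hdfs:///apps/spark/spark-libs.zip \
--py-files ${PY_FILES} ${APP_FILE}"

# Print instead of executing, since this is only a sketch.
echo "PYSPARK_ARCHIVES_PATH=${PYSPARK_ZIP} ${CMD}"
```

Even with this, the conf archive upload from the client would presumably still need a writable path to a datanode, which is the crux of the question.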