Setting fs.s3a.aws.credentials.provider through a connect server.

2023-11-17 Thread Leandro Martelli
Hi all!

Has anyone been through this already?

I have Spark Docker images that are used in 2 different environments, and
each one requires a different credentials provider for s3a. That parameter
is the only difference between them.

When passing via --conf, it works as expected.

When --conf is omitted and we pass it via .option() in the SparkSession
builder, it gets ignored and we see just the default providers.
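For reference, a minimal sketch of how this is usually wired through the builder (the provider class names and the environment switch below are assumptions for illustration, not your actual setup): Hadoop/S3A options need the spark.hadoop. prefix to reach the Hadoop Configuration, and they must be set with .config() before the session is first created, otherwise they are silently ignored.

```python
# Hypothetical mapping from deployment environment to S3A credentials
# provider; the class names here are standard providers, but which one
# each environment needs is an assumption.
PROVIDERS = {
    "prod": "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    "dev": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
}

def s3a_conf(env: str) -> dict:
    """Return builder config for the given environment.

    Note the spark.hadoop. prefix: without it the key never reaches
    the Hadoop Configuration that S3A reads.
    """
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider": PROVIDERS[env],
    }

# Usage with a real session (requires pyspark, sketched here only):
#   import os
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("s3a-example")
#   for k, v in s3a_conf(os.environ.get("DEPLOY_ENV", "dev")).items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
```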

Any hint is appreciated.

Thanks a lot!

Leandro


Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi,

How are you submitting your spark job from your client?

Your files can either be on HDFS or HCFS such as gs, s3 etc.

With reference to --py-files hdfs://yarn-master-url hdfs://foo.py, I assume
you want something like:

spark-submit --verbose \
   --deploy-mode cluster \
   --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
   --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
   --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
   --py-files $CODE_DIRECTORY_CLOUD/dataproc_on_gke.zip \
   --conf "spark.driver.memory=4G" \
   --conf "spark.executor.memory=4G" \
   --conf "spark.executor.instances=4" \
   --conf "spark.executor.cores=2" \
   $CODE_DIRECTORY_CLOUD/${APPLICATION}
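As a side note, repeated --conf flags like these can also be kept in a properties file passed with --properties-file (the file path here is just an assumption for illustration):

```shell
# Write the common settings to a properties file (hypothetical path);
# spark-submit reads one key and value per line.
cat > /tmp/my-spark.conf <<'EOF'
spark.driver.memory        4g
spark.executor.memory      4g
spark.executor.instances   4
spark.executor.cores       2
EOF

# Then: spark-submit --properties-file /tmp/my-spark.conf ...
cat /tmp/my-spark.conf
```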

In my case I define $CODE_DIRECTORY_CLOUD as below, on Google Cloud Storage:

CODE_DIRECTORY="/home/hduser/dba/bin/python/"
CODE_DIRECTORY_CLOUD="gs://${PROJECT}-spark-on-k8s/codes"
cd $CODE_DIRECTORY
[ -f ${source_code}.zip ] && rm -f ${source_code}.zip
echo `date` ", ===> creating source zip from ${source_code}"
# zip needs to be done at root directory of code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD
gsutil cp /home/hduser/dba/bin/python/${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD

So in summary I create a zip file of my project and copy it across to the
cloud storage, then put the application (.py file) there as well, and use
them in spark-submit.
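On the original question of avoiding uploads from the client: one approach (a sketch only; the HDFS paths are assumptions, and the archive still has to be staged once from a host that can reach the datanodes) is to point spark.yarn.archive at a pre-staged zip of the Spark jars, so spark-submit references it instead of uploading it on every run:

```shell
# Hypothetical paths; staging must be done from a host with HDFS access.
SPARK_ARCHIVE="hdfs:///apps/spark/spark-libs.zip"
APP="hdfs:///apps/myjob/main.py"

# Stage once (run where HDFS is reachable):
#   hdfs dfs -mkdir -p /apps/spark
#   hdfs dfs -put spark-libs.zip "$SPARK_ARCHIVE"

# Submit referencing the pre-staged archive instead of uploading it.
# Built as a string here only to show the pieces; running it requires
# a YARN cluster.
CMD="spark-submit --master yarn --deploy-mode cluster \
--conf spark.yarn.archive=${SPARK_ARCHIVE} \
${APP}"
echo "$CMD"
```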

I trust this answers your question.

HTH



Mich Talebzadeh,
Technologist, Solutions Architect & Engineer
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Nov 2023 at 21:33, Eugene Miretsky wrote:

> Hey All,
>
> We are running Pyspark spark-submit from a client outside the cluster. The
> client has network connectivity only to the Yarn Master, not the HDFS
> Datanodes. How can we submit the jobs? The idea would be to preload all the
> dependencies (job code, libraries, etc) to HDFS, and just submit the job
> from the client.
>
> We tried something like this
> 'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master
> yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'
>
> The error we are getting is
> "
>
> org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/
> 10.117.110.19:9866]
>
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /user/users/.sparkStaging/application_1698216436656_0104/*spark_conf.zip*
> could only be written to 0 of the 1 minReplication nodes. There are 2
> datanode(s) running and 2 node(s) are excluded in this operation.
> "
>
> A few questions:
> 1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf
> files? Why would the client send them to the cluster? (The cluster already
> has all that info - this would make sense in client mode, but not cluster
> mode.)
> 2) Is it possible to use spark-submit without HDFS access?
> 3) How would we fix this?
>
> Cheers,
> Eugene
>
> --
>
> *Eugene Miretsky*
> Managing Partner |  Badal.io | Book a meeting /w me!
> 
> mobile:  416-568-9245
> email: eug...@badal.io 
>