Sorry, I forgot. The below is catered for YARN mode.

If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit:
spark-submit --verbose \
    --master yarn \
    --deploy-mode cluster \
    --name $APPNAME \
    --driver-memory 1g \
    --executor-memory 1g \
    --num-executors 2 \
    --py-files ${build_directory}/source_code.zip \
    $CODE_DIRECTORY_CLOUD/my_application_entry_point.py

Adjust the memory settings and the number of executors as needed. The last argument is the path to your main application script.

For application code with a separate virtual environment:

If your application code has specific dependencies that you manage in a separate virtual environment, you can leverage the --conf spark.yarn.dist.archives argument instead.

spark-submit --verbose \
    --master yarn \
    --deploy-mode cluster \
    --name $APPNAME \
    --driver-memory 1g \
    --executor-memory 1g \
    --num-executors 2 \
    --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv \
    $CODE_DIRECTORY_CLOUD/my_application_entry_point.py

Explanation:

- --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv configures Spark to distribute your virtual environment archive (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part defines the symbolic link name under which the archive is unpacked inside each container.
- You do not need --py-files here, because the virtual environment archive already contains all the necessary dependencies.

Choosing the best approach:

The choice depends on your project setup:

- No separate virtual environment: use --py-files if your application code consists mainly of Python files and doesn't require a separate virtual environment.
- Separate virtual environment: use --conf spark.yarn.dist.archives if you manage dependencies in a separate virtual environment archive; a sketch of building such an archive follows below.
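For the venv route, a minimal sketch of how the archive can be built with venv-pack and how the job can be pointed at the packed interpreter. The archive name matches the command above, but the pip package list and the two PYSPARK_PYTHON settings are illustrative assumptions, not something the command above requires:

# Build and pack the virtual environment (on a machine whose OS matches the cluster)
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack pyspark    # plus your application's own dependencies
venv-pack -o pyspark_venv.tar.gz

# In cluster mode, make the driver (application master) and the executors
# use the interpreter unpacked under the symbolic link name after the '#'
spark-submit ... \
    --conf spark.yarn.dist.archives=pyspark_venv.tar.gz#pyspark_venv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
    my_application_entry_point.py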
HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Tue, 5 Mar 2024 at 17:28, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I use a zip file personally and pass the application name (in your case
> main.py) as the last input line, like below.
>
> APPLICATION is your main.py. It does not need to be called main.py. It
> could be anything, like testpython.py.
>
> CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"  ## replace gs with s3
> # zip needs to be done at root directory of code
> zip -rq ${source_code}.zip ${source_code}
> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD  ## replace gsutil with aws s3
> gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
>
> Your spark job:
>
> spark-submit --verbose \
>     --properties-file ${property_file} \
>     --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>     --deploy-mode cluster \
>     --name $APPNAME \
>     --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
>     --conf spark.kubernetes.namespace=$NAMESPACE \
>     --conf spark.network.timeout=300 \
>     --conf spark.kubernetes.allocation.batch.size=3 \
>     --conf spark.kubernetes.allocation.batch.delay=1 \
>     --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
>     --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
>     --conf spark.kubernetes.driver.pod.name=$APPNAME \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>     --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>     --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>     --conf spark.dynamicAllocation.enabled=true \
>     --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
>     --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
>     --conf spark.dynamicAllocation.executorIdleTimeout=30s \
>     --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
>     --conf spark.dynamicAllocation.minExecutors=0 \
>     --conf spark.dynamicAllocation.maxExecutors=20 \
>     --conf spark.driver.cores=3 \
>     --conf spark.executor.cores=3 \
>     --conf spark.driver.memory=1024m \
>     --conf spark.executor.memory=1024m \
>     $CODE_DIRECTORY_CLOUD/${APPLICATION}
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
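For reference, a minimal sketch of the source layout the quoted zip flow assumes, using the spark_on_gke name from the --py-files line above; the module and function names below are purely illustrative:

# Layout expected by "zip -rq ${source_code}.zip ${source_code}", run from
# the parent of the package directory:
#
#   spark_on_gke/
#       __init__.py
#       src/
#           __init__.py
#           testpython.py    <- $APPLICATION, also copied to cloud storage on its own
#           utils.py
#
# With the zip on --py-files, testpython.py can then import siblings as e.g.
#   from spark_on_gke.src.utils import some_helper
zip -rq spark_on_gke.zip spark_on_gke
gsutil cp spark_on_gke.zip $CODE_DIRECTORY_CLOUD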
On Tue, 5 Mar 2024 at 16:15, Pedro, Chuck <cpe...@travelers.com.invalid> wrote:

>> Hi all,
>>
>> I am working in Databricks. When I submit a spark job with the --py-files
>> argument, it seems the first two are read in but the third is ignored.
>>
>> "--py-files",
>> "s3://some_path/appl_src.py",
>> "s3://some_path/main.py",
>> "s3://a_different_path/common.py",
>>
>> I can see the first two acknowledged in the Log4j but not the third.
>>
>> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
>> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...
>>
>> As a result, the job fails because appl_src.py is importing from
>> common.py but can't find it.
>>
>> I posted to both Databricks community here
>> <https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62361#M31953>
>> and Stack Overflow here
>> <https://stackoverflow.com/questions/78077822/databricks-spark-submit-getting-error-with-py-files>
>> but did not get a response.
>>
>> I'm aware that we could use a .zip file, so I tried zipping the first two
>> arguments but then got a totally different error:
>>
>> "Exception in thread "main" org.apache.spark.SparkException: Failed to
>> get main class in JAR with error 'null'. Please specify one with --class."
>>
>> Basically I just want the application code in one s3 path and a "common"
>> utilities package in another path. Thanks for your help.
>>
>> Kind regards,
>> Chuck Pedro
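The "Failed to get main class in JAR" error quoted above is what spark-submit typically reports when the zip itself is passed as the application: a primary resource that is not a .py (or .R) file is treated as a JAR, so Spark looks for a main class in it. A hedged sketch of the pattern suggested in this thread, keeping the entry point as a plain .py and shipping the other modules in one zip; the local file copies and the bundle name appl_bundle.zip are assumptions:

# Zip the supporting modules (assuming local copies of the two files)
zip -j appl_bundle.zip appl_src.py common.py
aws s3 cp appl_bundle.zip s3://some_path/appl_bundle.zip

# The last argument to spark-submit must remain the .py entry point;
# everything it imports travels via --py-files
spark-submit \
    --py-files s3://some_path/appl_bundle.zip \
    s3://some_path/main.py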