Sorry, I forgot. The below is catered for YARN mode.

If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit:
spark-submit --verbose \
    --master yarn \
    --deploy-mode cluster \
    --name $APPNAME \
    --driver-memory 1g \
    --executor-memory 1g \
    --num-executors 2 \
    --py-files ${build_directory}/source_code.zip \
    $CODE_DIRECTORY_CLOUD/my_application_entry_point.py

Adjust the memory settings and the number of executors as needed. The last argument is the path to your main application script.

For application code with a separate virtual environment:

If your application code has specific dependencies that you manage in a separate virtual environment, you can leverage the --conf spark.yarn.dist.archives argument instead.

spark-submit --verbose \
    --master yarn \
    --deploy-mode cluster \
    --name $APPNAME \
    --driver-memory 1g \
    --executor-memory 1g \
    --num-executors 2 \
    --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv \
    $CODE_DIRECTORY_CLOUD/my_application_entry_point.py

Explanation:

- --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv configures Spark to distribute your virtual environment archive (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part defines the symbolic link name under which the archive is unpacked inside each container.
- You do not need --py-files here, because the virtual environment archive already contains all the necessary dependencies.

Choosing the best approach:

The choice depends on your project setup:

- No separate virtual environment: use --py-files if your application code consists mainly of Python files and doesn't require a separate virtual environment.
- Separate virtual environment: use --conf spark.yarn.dist.archives if you manage dependencies in a separate virtual environment archive; a sketch of building such an archive follows below.
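For the venv route, a minimal sketch of how the archive can be built with venv-pack and how the job can be pointed at the packed interpreter. The archive name matches the command above, but the pip package list and the two PYSPARK_PYTHON settings are illustrative assumptions, not something the command above requires:

# Build and pack the virtual environment (on a machine whose OS matches the cluster)
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack pyspark    # plus your application's own dependencies
venv-pack -o pyspark_venv.tar.gz

# In cluster mode, make the driver (application master) and the executors
# use the interpreter unpacked under the symbolic link name after the '#'
spark-submit ... \
    --conf spark.yarn.dist.archives=pyspark_venv.tar.gz#pyspark_venv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
    my_application_entry_point.py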
HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Tue, 5 Mar 2024 at 17:28, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I use a zip file personally and pass the application name (in your case
> main.py) as the last input line, like below.
>
> APPLICATION is your main.py. It does not need to be called main.py. It
> could be anything, like testpython.py.
>
> CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"  ## replace gs with s3
> # zip needs to be done at root directory of code
> zip -rq ${source_code}.zip ${source_code}
> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD  ## replace gsutil with aws s3
> gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
>
> Your spark job:
>
> spark-submit --verbose \
>     --properties-file ${property_file} \
>     --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>     --deploy-mode cluster \
>     --name $APPNAME \
>     --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
>     --conf spark.kubernetes.namespace=$NAMESPACE \
>     --conf spark.network.timeout=300 \
>     --conf spark.kubernetes.allocation.batch.size=3 \
>     --conf spark.kubernetes.allocation.batch.delay=1 \
>     --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
>     --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
>     --conf spark.kubernetes.driver.pod.name=$APPNAME \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>     --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>     --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>     --conf spark.dynamicAllocation.enabled=true \
>     --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
>     --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
>     --conf spark.dynamicAllocation.executorIdleTimeout=30s \
>     --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
>     --conf spark.dynamicAllocation.minExecutors=0 \
>     --conf spark.dynamicAllocation.maxExecutors=20 \
>     --conf spark.driver.cores=3 \
>     --conf spark.executor.cores=3 \
>     --conf spark.driver.memory=1024m \
>     --conf spark.executor.memory=1024m \
>     $CODE_DIRECTORY_CLOUD/${APPLICATION}
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
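For reference, a minimal sketch of the source layout the quoted zip flow assumes, using the spark_on_gke name from the --py-files line above; the module and function names below are purely illustrative:

# Layout expected by "zip -rq ${source_code}.zip ${source_code}", run from
# the parent of the package directory:
#
#   spark_on_gke/
#       __init__.py
#       src/
#           __init__.py
#           testpython.py    <- $APPLICATION, also copied to cloud storage on its own
#           utils.py
#
# With the zip on --py-files, testpython.py can then import siblings as e.g.
#   from spark_on_gke.src.utils import some_helper
zip -rq spark_on_gke.zip spark_on_gke
gsutil cp spark_on_gke.zip $CODE_DIRECTORY_CLOUD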
On Tue, 5 Mar 2024 at 16:15, Pedro, Chuck <cpe...@travelers.com.invalid> wrote:

>> Hi all,
>>
>> I am working in Databricks. When I submit a spark job with the --py-files
>> argument, it seems the first two are read in but the third is ignored.
>>
>> "--py-files",
>> "s3://some_path/appl_src.py",
>> "s3://some_path/main.py",
>> "s3://a_different_path/common.py",
>>
>> I can see the first two acknowledged in the Log4j but not the third.
>>
>> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
>> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...
>>
>> As a result, the job fails because appl_src.py is importing from
>> common.py but can't find it.
>>
>> I posted to both Databricks community here
>> <https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62361#M31953>
>> and Stack Overflow here
>> <https://stackoverflow.com/questions/78077822/databricks-spark-submit-getting-error-with-py-files>
>> but did not get a response.
>>
>> I'm aware that we could use a .zip file, so I tried zipping the first two
>> arguments but then got a totally different error:
>>
>> "Exception in thread "main" org.apache.spark.SparkException: Failed to
>> get main class in JAR with error 'null'. Please specify one with --class."
>>
>> Basically I just want the application code in one s3 path and a "common"
>> utilities package in another path. Thanks for your help.
>>
>> Kind regards,
>> Chuck Pedro
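The "Failed to get main class in JAR" error quoted above is what spark-submit typically reports when the zip itself is passed as the application: a primary resource that is not a .py (or .R) file is treated as a JAR, so Spark looks for a main class in it. A hedged sketch of the pattern suggested in this thread, keeping the entry point as a plain .py and shipping the other modules in one zip; the local file copies and the bundle name appl_bundle.zip are assumptions:

# Zip the supporting modules (assuming local copies of the two files)
zip -j appl_bundle.zip appl_src.py common.py
aws s3 cp appl_bundle.zip s3://some_path/appl_bundle.zip

# The last argument to spark-submit must remain the .py entry point;
# everything it imports travels via --py-files
spark-submit \
    --py-files s3://some_path/appl_bundle.zip \
    s3://some_path/main.py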