antonipp opened a new pull request, #39670:
URL: https://github.com/apache/spark/pull/39670

   ### What changes were proposed in this pull request?
   Backport https://github.com/apache/spark/pull/38376 to `branch-3.2`
   
   You can find a detailed description of the issue and an example reproduction 
on the Jira card: https://issues.apache.org/jira/browse/SPARK-40817
   
   The fix updates the logic that uploads user-specified files (via `spark.jars`, `spark.files`, etc.) to `spark.kubernetes.file.upload.path`. Previously, after uploading the local files, this logic overwrote the initial list of URIs passed by the user and thus erased any remote URIs that were specified there (see the sketch after the example below).
   
   A small example of this behaviour:
   1. The user sets `spark.jars` to `s3a://some-bucket/my-application.jar,/tmp/some-local-jar.jar` when running `spark-submit` in cluster mode
   2. `BasicDriverFeatureStep.getAdditionalPodSystemProperties()` is called at one point while running `spark-submit`
   3. This function sets `spark.jars` to a new value of `${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/some-local-jar.jar`. Note that `s3a://some-bucket/my-application.jar` has been discarded.
   
   With the logic proposed in this PR, the new value of `spark.jars` would be `s3a://some-bucket/my-application.jar,${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/some-local-jar.jar`; in other words, remote URIs are no longer discarded.
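   
   To make the intended behaviour concrete, here is a minimal sketch of the approach in Scala. It is illustrative only: `isLocalUri`, `uploadToUploadPath`, and `resolveFileUris` are hypothetical stand-ins for Spark's internal helpers, and the upload destination is shown with the same placeholders as above:
   ```scala
   import java.net.URI

   // Hypothetical helper: no scheme (a bare path) or the file: scheme means
   // the file lives on the submitting machine and must be uploaded.
   def isLocalUri(uri: String): Boolean = {
     val scheme = Option(new URI(uri).getScheme).getOrElse("file")
     scheme == "file"
   }

   // Hypothetical stand-in for the upload step, which copies a local file to
   // spark.kubernetes.file.upload.path and returns its new remote location.
   def uploadToUploadPath(uri: String): String = {
     val fileName = new URI(uri).getPath.split('/').last
     "${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/" + fileName
   }

   def resolveFileUris(uris: Seq[String]): Seq[String] = {
     val (localUris, remoteUris) = uris.partition(isLocalUri)
     // Before the fix, remoteUris were dropped at this point; the fix keeps
     // them and appends the uploaded copies of the local files.
     remoteUris ++ localUris.map(uploadToUploadPath)
   }

   // With the example above:
   // resolveFileUris(Seq("s3a://some-bucket/my-application.jar", "/tmp/some-local-jar.jar"))
   // returns the remote jar unchanged plus the uploaded copy of the local jar.
   ```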
   
   ### Why are the changes needed?
   We encountered this issue in production when trying to launch Spark-on-Kubernetes jobs in cluster mode with a mix of local and remote dependencies.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, see the description of the new behaviour above.
   
   ### How was this patch tested?
   - Added a unit test for the new behaviour (the property it checks is sketched at the end of this section)
   - Added an integration test for the new behaviour
   - Tried this patch in our Kubernetes environment with `SparkPi`:
   ```
   spark-submit \
     --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
     --deploy-mode cluster \
     --name=spark-submit-test \
     --class org.apache.spark.examples.SparkPi \
     --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
     --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
     [...]
     /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
   ```
   Before applying the patch, `s3a://$BUCKET_NAME/my-remote-jar.jar` was 
discarded from the final value of `spark.jars`. After applying the patch and 
launching the job again, I confirmed that 
`s3a://$BUCKET_NAME/my-remote-jar.jar` was no longer discarded by looking at 
the Spark config for the running job.
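   
   For reference, the property the new unit test checks can be sketched with the same hypothetical `resolveFileUris` helper from the snippet above (the actual test in the patch exercises `BasicDriverFeatureStep` through Spark's test harness):
   ```scala
   // Remote URIs must survive the upload step unchanged, and local URIs must
   // be rewritten to point at their uploaded copies.
   val resolved = resolveFileUris(Seq(
     "s3a://some-bucket/my-application.jar",
     "/tmp/some-local-jar.jar"))

   assert(resolved.contains("s3a://some-bucket/my-application.jar"))
   assert(!resolved.contains("/tmp/some-local-jar.jar"))
   assert(resolved.exists(_.endsWith("/some-local-jar.jar")))
   ```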
   

