Anton Ippolitov created SPARK-40817:
---------------------------------------

             Summary: Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
                 Key: SPARK-40817
                 URL: https://issues.apache.org/jira/browse/SPARK-40817
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Submit
    Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.0.0
         Environment: Spark 3.1.3

Kubernetes 1.21

Ubuntu 20.04.1
            Reporter: Anton Ippolitov
         Attachments: image-2022-10-17-10-44-46-862.png

I discovered that remote URIs in {{spark.jars}} get discarded when launching 
Spark on Kubernetes in cluster mode via spark-submit.
h1. Reproduction

Here is an example reproduction using S3 for remote JAR storage:

I first created 2 JARs:
 * {{/opt/my-local-jar.jar}} on the host where I'm running spark-submit
 * {{s3a://$BUCKET_NAME/my-remote-jar.jar}} in an S3 bucket I own

I then ran the following spark-submit command with {{spark.jars}} pointing to 
both the local JAR and the remote JAR:
{code:java}
 spark-submit \
  --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
  --deploy-mode cluster \
  --name=spark-submit-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
  --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
  [...]
  /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
{code}
Once the driver and the executors started, I confirmed that there was no trace 
of {{my-remote-jar.jar}} anymore. For example, looking at the Spark History 
Server, I could see that {{spark.jars}} got transformed into this:

!image-2022-10-17-10-17-03-697.png|width=744,height=60!

There was no mention of {{my-remote-jar.jar}} on the classpath or anywhere else.
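
Based on the upload logic described below, the resolved value should look roughly like the following (hypothetical path; the {{spark-upload-<uuid>}} directory name is generated by Spark at upload time):
{code:java}
spark.jars=s3a://$BUCKET_NAME/my-upload-path/spark-upload-<uuid>/my-local-jar.jar
{code}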

Note that I ran all tests with Spark 3.1.3; however, the code that handles these 
dependencies appears to be the same in more recent versions of Spark as well.
h1. Root cause description

I believe the issue comes from [this 
logic|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L163-L186]
 in {{BasicDriverFeatureStep.getAdditionalPodSystemProperties()}}.

Specifically, this logic takes all URIs in {{spark.jars}}, [keeps only the local 
URIs|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L165],
 
[uploads|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L173]
 those local files to {{spark.kubernetes.file.upload.path}} and then 
[*replaces*|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L182]
 the value of {{spark.jars}} with those newly uploaded JARs. Because the previous 
value of {{spark.jars}} is overwritten, any remote JARs that were specified there 
are lost.

Consequently, when the Spark driver starts, it only downloads the JARs from 
{{spark.kubernetes.file.upload.path}}.
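
For reference, here is a simplified sketch of the current logic, condensed from the linked code (the {{ARCHIVES}} fragment handling is left out for brevity). The filter on {{isLocalAndResolvable}} is where the remote URIs get dropped:
{code:java}
// Simplified sketch of BasicDriverFeatureStep.getAdditionalPodSystemProperties()
Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  // Only local URIs survive this filter; remote URIs (e.g. s3a://...) are discarded here
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // The local files are uploaded to spark.kubernetes.file.upload.path...
  val resolved = KubernetesUtils.uploadAndTransformFileUris(uris, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    // ...and the property (e.g. spark.jars) is overwritten with only the uploaded URIs,
    // so the remote URIs filtered out above never make it back in
    additionalProps.put(key.key, resolved.mkString(","))
  }
}
{code}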
h1. Possible solution

I think a possible fix would be to not fully overwrite the value of 
{{spark.jars}} but to keep the remote URIs there as well.

The new logic would look something like this:
{code:java}
Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // Save remote URIs
  val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
  val value = {
    if (key == ARCHIVES) {
      uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
    } else {
      uris
    }
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // Don't forget to add the remote URIs back
    additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
  }
}
{code}
I tested it out in my environment and it worked: 
{{s3a://$BUCKET_NAME/my-remote-jar.jar}} was kept in {{spark.jars}} and the 
Spark driver was able to download it.
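
As a side note on the implementation (just a sketch, not part of the tested change): the two {{filter}} calls could be collapsed into a single {{partition}} call. {{KubernetesUtils.isLocalAndResolvable}} is {{private[spark]}}, so the snippet below assumes it is called from code inside the {{org.apache.spark}} namespace:
{code:java}
import org.apache.spark.deploy.k8s.KubernetesUtils

// Hypothetical values mirroring the reproduction above
val jars = Seq("/opt/my-local-jar.jar", "s3a://my-bucket/my-remote-jar.jar")

// partition splits the URIs in one pass:
//   localUris  -> uploaded to spark.kubernetes.file.upload.path
//   remoteUris -> must be carried over into the rewritten spark.jars
val (localUris, remoteUris) = jars.partition(KubernetesUtils.isLocalAndResolvable)
{code}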

I don't know the codebase well enough, though, to assess whether I am missing 
something else or whether this is enough to fix the issue.
