Abhishek Modi created SPARK-29472:
-------------------------------------

             Summary: Mechanism for Excluding Jars at Launch for YARN
                 Key: SPARK-29472
                 URL: https://issues.apache.org/jira/browse/SPARK-29472
             Project: Spark
          Issue Type: New Feature
          Components: YARN
    Affects Versions: 2.4.4
            Reporter: Abhishek Modi


*Summary*

It would be convenient to have an easy way to exclude jars from Spark’s 
classpath at launch time. This would complement the existing ability to add 
jars to the classpath using {{extraClassPath}}.
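
For comparison, adding a jar at launch is already a one-line configuration per process; a minimal example (paths, jar names, and the application class are illustrative):

{code}
# Prepend an extra jar to the driver and executor classpaths. The jar must
# already be present at this path on the machines where each process runs.
spark-submit \
  --master yarn \
  --conf spark.driver.extraClassPath=/opt/libs/parquet-column-1.11.0.jar \
  --conf spark.executor.extraClassPath=/opt/libs/parquet-column-1.11.0.jar \
  --class com.example.MyApp \
  my-app.jar
{code}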


*Context*

The Spark build contains its dependency jars in its {{jars/}} directory. These 
jars become part of the driver’s and executors’ classpaths. By default on YARN, 
these jars are packaged and distributed to containers at launch 
({{spark-submit}}) time.


While developing Spark applications, customers sometimes need to debug with 
different versions of dependencies. This becomes difficult when the dependency 
(e.g. Parquet 1.11.0) is one that Spark already ships in {{jars/}} (e.g. 
Parquet 1.10.1 in Spark 2.4), because the version bundled with Spark is loaded 
preferentially.


Configurations such as {{userClassPathFirst}} are available, but they often 
come with side effects of their own. For example, if the customer’s build 
includes Avro, they will likely see an error like:

{noformat}
Caused by: java.lang.LinkageError: loader constraint violation: when resolving
method "org.apache.spark.SparkConf.registerAvroSchemas(Lscala/collection/Seq;)Lorg/apache/spark/SparkConf;"
the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader)
of the current class, com/uber/marmaray/common/spark/SparkFactory, and the
class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's
defining class, org/apache/spark/SparkConf, have different Class objects for
the type scala/collection/Seq used in the signature
{noformat}

Resolving such issues often takes many hours.
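
For reference, the existing knob is enabled at launch like this (the application class and jar names are illustrative):

{code}
# Give the user's jars precedence over Spark's bundled jars. This is the
# child-first class loading that can trigger the LinkageError above.
spark-submit \
  --master yarn \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars parquet-column-1.11.0.jar,parquet-hadoop-1.11.0.jar \
  --class com.example.MyApp \
  my-app.jar
{code}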


To deal with these issues, customers often download the Spark build, remove 
the conflicting jars, and then run {{spark-submit}}. Other times customers 
cannot run {{spark-submit}} directly because it is gated behind a Spark Job 
Server; in that case they may try downloading the build, removing the jars, 
and then using configurations such as {{spark.yarn.dist.jars}} or 
{{spark.yarn.dist.archives}}. Both options are undesirable: they are 
operationally heavy, error-prone, and often leave the customer’s Spark builds 
out of sync with the authoritative build.
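
A sketch of one variant of this workaround, here using {{spark.yarn.archive}} rather than the {{dist}} configurations (the download URL, paths, and versions are illustrative):

{code}
# 1. Fetch the Spark build and strip the conflicting jars.
wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar xzf spark-2.4.4-bin-hadoop2.7.tgz
rm spark-2.4.4-bin-hadoop2.7/jars/parquet-*.jar

# 2. Re-archive the remaining jars and publish them to HDFS.
(cd spark-2.4.4-bin-hadoop2.7/jars && zip -q -r ../../spark-jars.zip .)
hdfs dfs -put spark-jars.zip /tmp/spark-jars.zip

# 3. Point every subsequent job at the modified jar set.
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs:///tmp/spark-jars.zip \
  --class com.example.MyApp \
  my-app.jar
{code}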


*Solution*

I’d like to propose adding a {{spark.yarn.jars.exclusionRegex}} configuration. 
Customers could provide a regex such as {{.\*parquet.\*}}, and jar files 
matching it would be excluded from the driver and executor classpaths.
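
Launch-time usage might look like the following; the configuration name and semantics are the ones proposed above, and the application and jar names are illustrative:

{code}
# Exclude Spark's bundled Parquet jars and ship the desired versions instead.
spark-submit \
  --master yarn \
  --conf spark.yarn.jars.exclusionRegex=".*parquet.*" \
  --jars parquet-column-1.11.0.jar,parquet-hadoop-1.11.0.jar \
  --class com.example.MyApp \
  my-app.jar
{code}

The filtering would apply where the YARN client assembles the Spark jars for distribution, so excluded jars would never reach the container classpaths.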


