Bumping. All inputs appreciated

Regards
Praneet
_____________________________________________
From: Sharma, Praneet
Sent: Friday, August 23, 2019 11:24 AM
To: 'user@spark.apache.org' <user@spark.apache.org>
Subject: Classloading issues when using connectors with Uber jars with improper 
Shading in single Spark job


Hi Guys

When using connectors packaged as uber jars, we are hitting classloading issues in 
Spark 2.3.0. Upon investigation we found that these issues are caused by improper 
shading of certain classes in the uber jars. The aim of this email is to start a 
discussion on whether such issues can be mitigated or avoided from Spark core, and 
if so, how.

Issue Summary

We have built a Spark job using Spark version 2.3.0 which reads from Azure 
Cosmos DB. To read from Cosmos DB, we rely on an uber jar provided by Azure - 
azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar<https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/jar>.
To add this jar to the Spark driver and executor classpaths, we rely on the 
properties spark.driver.extraClassPath and spark.executor.extraClassPath, 
respectively. When this Spark job is run, we hit the following issue:

ERROR ApplicationMaster: User class threw exception: java.lang.VerifyError: Bad 
type on operand stack
Exception Details:
  Location:    
org/apache/spark/metrics/sink/MetricsServlet.<init>(Ljava/util/Properties;Lcom/codahale/metrics/MetricRegistry;Lorg/apache/spark/SecurityManager;)V
 @116: invokevirtual
  Reason:    Type 'com/codahale/metrics/json/MetricsModule' (current frame, 
stack[2]) is not assignable to 'com/fasterxml/jackson/databind/Module'

We have done some analysis of this issue on our side, which we detail in the 
section below.

Issue Details

We have a Spark 2.3.0 setup working against a Cloudera cluster in yarn-cluster 
mode. By default, Spark's driver and executor classpaths contain, among others, 
the following jars:
*       jackson-databind-2.6.5.jar
*       metrics-json-3.1.2.jar
*       spark-core_2.11-2.3.1.jar

To work with Cosmos DB, the jar we have explicitly added to the classpaths is 
azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar. This is an uber jar shaded with 
the prefix "cosmosdb_connector_shaded", but the shading is incomplete, meaning 
some classes are left with their original fully qualified names - this is the 
origin of the classloading issue mentioned above. The table below lists the 
classes of interest present in each of the above-mentioned jars and the order in 
which Spark's MutableURLClassLoader might attempt to load them (a small 
diagnostic sketch follows the table):

Order   Jar                                              Classes of Interest
1       azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar   com.codahale.metrics.json.MetricsModule,
                                                         cosmosdb_connector_shaded.com.fasterxml.jackson.databind.Module
2       jackson-databind-2.6.5.jar                       com.fasterxml.jackson.databind.Module
3       metrics-json-3.1.2.jar                           com.codahale.metrics.json.MetricsModule
4       spark-core_2.11-2.3.1.jar                        org.apache.spark.metrics.sink.MetricsServlet
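
As a quick way to confirm which jar actually serves each of these classes, a 
small diagnostic along the following lines (Scala, illustrative only - the 
output obviously depends on the actual classpath) can be run on the driver:

    import scala.util.Try

    // Print the code source (jar) from which the current classloader resolves
    // each class of interest from the table above.
    def codeSource(className: String): String =
      Try(Class.forName(className).getProtectionDomain.getCodeSource.getLocation.toString)
        .getOrElse("<not found / no code source>")

    Seq(
      "com.codahale.metrics.json.MetricsModule",
      "com.fasterxml.jackson.databind.Module",
      "org.apache.spark.metrics.sink.MetricsServlet"
    ).foreach(name => println(s"$name -> ${codeSource(name)}"))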

Please note that classes from the cosmosdb-spark TPL will get loaded first 
because this jar sits at the top of MutableURLClassLoader's URL classpath pile. 
To reiterate, the error we see is: "Type 'com/codahale/metrics/json/MetricsModule' 
(current frame, stack[2]) is not assignable to 
'com/fasterxml/jackson/databind/Module'". According to our analysis, this happens 
for the following reasons:

*       When the Spark driver comes up, it attempts to register an instance of 
the MetricsModule class. MetricsModule is a concrete implementation of the Module 
class, which means the driver classloader will attempt to load both classes.
*       The Module class is loaded from jackson-databind-2.6.5.jar: although the 
same class is present in the uber jar, there it is shaded, so its fully qualified 
name differs from the one being loaded.
*       The MetricsModule class, on the other hand, is loaded from the uber jar. 
This is because the shading left MetricsModule unshaded, and because the uber jar 
is at the top of the classpath pile.
*       This results in the above exception, because the MetricsModule class in 
the uber jar (presumably compiled against the shaded Module) is not compatible 
with the Module class present in jackson-databind-2.6.5.jar; the check sketched 
after this list makes the mismatch concrete.
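
A minimal sketch (Scala, illustrative) of the check the verifier is effectively 
enforcing - whether the MetricsModule that the loader resolves actually extends 
the unshaded jackson Module:

    import com.fasterxml.jackson.databind.Module

    // If MetricsModule comes from the uber jar and was compiled against the
    // shaded Module, this prints false; with a properly shaded uber jar it
    // prints true, since MetricsModule is then served by metrics-json.
    val metricsModule = Thread.currentThread().getContextClassLoader
      .loadClass("com.codahale.metrics.json.MetricsModule")
    println(classOf[Module].isAssignableFrom(metricsModule))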

This is one example where the issue occurs. We have seen similar issues when 
attempting to use both the Cosmos DB and Snowflake connectors in a single Spark 
job. Please note that we only use TPLs, and versions thereof, that the vendors 
officially announce as supported for a particular Spark version.

How could the above error have been avoided?

*       If the uber jar had been properly shaded, the MetricsModule class would 
have been loaded from metrics-json-3.1.2.jar, which is compatible with 
jackson-databind-2.6.5.jar containing the abstract class Module (a relocation 
sketch of what "properly shaded" would mean follows this list).
o       From what we understand, total shading is not possible for a variety of 
reasons.
*       Alternatively, if the dependencies embedded in the uber jar had been 
binary-compatible with the same dependencies present elsewhere on the classpath, 
then even with an improperly shaded uber jar it would not have mattered which jar 
the duplicate dependency class was loaded from.
o       But this is in the vendors' control, not ours. Connectors from different 
vendors can each work in Spark on their own, but using them together in a single 
job can get problematic. We see this when attempting to use both Cosmos DB and 
Snowflake in a single Spark job.
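
For illustration only, a hedged sketch of the kind of relocation that "properly 
shaded" would imply, written here as sbt-assembly shade rules (the connector is 
not necessarily built with sbt; the prefix simply mirrors the one already present 
in the uber jar):

    // build.sbt fragment, assuming the sbt-assembly plugin. Relocate both the
    // metrics and jackson packages so that neither an unshaded MetricsModule nor
    // an unshaded Module leaks out of the uber jar.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.codahale.metrics.**" -> "cosmosdb_connector_shaded.com.codahale.metrics.@1").inAll,
      ShadeRule.rename("com.fasterxml.jackson.**" -> "cosmosdb_connector_shaded.com.fasterxml.jackson.@1").inAll
    )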

Child-first Classloader as a solution for wrapping connectors in Spark?

We are able to resolve these issues on the Spark driver by wrapping load and 
save DataFrame calls in their own child-first classloaders and by ensuring the 
uber TPLs are only present in these child-first classloaders. For example, the 
child-first classloader (a child of MutableURLClassLoader) wrapping Cosmos DB's 
load method has only azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar on its 
classpath (a sketch of such a loader follows below). But we are not able to 
achieve the same with Spark executors because we don't have much control there. 
To achieve a similar thing with the Cosmos DB executor-side code, we would have 
to modify its open-source code and wrap the lambda inside mapPartitionsWithIndex 
with a child-first classloader. That would mean doing this individually for each 
connector we want to work with, which is not viable.
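
For reference, a minimal sketch (Scala) of the child-first loader we use on the 
driver side. Names and paths are illustrative; the child-before-parent lookup is 
the essential part:

    import java.net.{URL, URLClassLoader}

    // Child-first (parent-last) classloader: look in our own URLs (the connector
    // uber jar) before delegating to the parent classloader.
    class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
        extends URLClassLoader(urls, parent) {

      override protected def loadClass(name: String, resolve: Boolean): Class[_] =
        getClassLoadingLock(name).synchronized {
          val already = findLoadedClass(name)
          val clazz =
            if (already != null) already
            else {
              try findClass(name)                  // child (uber jar) first
              catch {
                case _: ClassNotFoundException =>
                  super.loadClass(name, resolve)   // then the parent chain
              }
            }
          if (resolve) resolveClass(clazz)
          clazz
        }
    }

    // Illustrative driver-side use: make the loader current while invoking the
    // connector's load/save path (the jar path is a placeholder).
    val uberJar  = new URL("file:/path/to/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar")
    val loader   = new ChildFirstClassLoader(Array(uberJar), getClass.getClassLoader)
    val previous = Thread.currentThread().getContextClassLoader
    Thread.currentThread().setContextClassLoader(loader)
    try {
      // e.g. spark.read.format("com.microsoft.azure.cosmosdb.spark").options(cfg).load()
    } finally {
      Thread.currentThread().setContextClassLoader(previous)
    }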

So we are reaching out to you to figure out how such issues can be tackled. Is 
there a way to achieve isolated classloading for the connector read and write 
portions of a Spark job? If yes, how can that be achieved? If not, what other 
choices do we have? What does the Spark community recommend we do to support the 
use cases mentioned here?

Regards
Praneet

