Hi Andrew,

Thanks for pointing me to that example. My understanding of the JobServer
(based on watching a demo of its UI) is that it maintains a set of SparkContexts
and allows people to add jars to them, but doesn't allow unloading
or reloading jars within a spark context. The code in JobCache appears to be
a performance enhancement to speed up retrieval of jars that are used
frequently -- the classloader change is purely on the driver side, so that
the driver can serialize the job instance. I'm looking for a classloader
change on the executor-side, so that different jars can be uploaded to the
same SparkContext even if they contain some of the same classes.
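
To make that concrete, here is roughly the kind of executor-side isolation I have
in mind. This is purely a hypothetical sketch -- none of these names exist in
Spark, and the jar paths are made up:

import java.net.{URL, URLClassLoader}

// One child classloader per "sub-context", parented to the executor's main
// loader. Two sub-contexts can then each load their own version of the same
// class name, as long as the parent loader doesn't already define it.
def subContextLoader(jarUrls: Seq[URL], executorLoader: ClassLoader): ClassLoader =
  new URLClassLoader(jarUrls.toArray, executorLoader)

// Tasks from one sub-context would resolve classes through loaderV1, tasks from
// another through loaderV2, even if both jars define the same class:
//   val loaderV1 = subContextLoader(Seq(new URL("file:/tmp/extras-v1.jar")), getClass.getClassLoader)
//   val loaderV2 = subContextLoader(Seq(new URL("file:/tmp/extras-v2.jar")), getClass.getClassLoader)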

Punya

From:  Andrew Ash <and...@andrewash.com>
Reply-To:  "user@spark.apache.org" <user@spark.apache.org>
Date:  Wednesday, March 19, 2014 at 2:03 AM
To:  "user@spark.apache.org" <user@spark.apache.org>
Subject:  Re: Separating classloader management from SparkContexts

Hi Punya, 

This seems like a problem that the recently-announced job-server would
likely have run into at one point.  I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes.  Does the server correctly segregate each job's classes
from other concurrently-running jobs?

From my reading of the code I think it may not work the way I'd want it to,
though there are a few classloader tricks going on.

https://github.com/ooyala/spark-jobserver/blob/master/job-server/src/spark.jobserver/JobCache.scala

On line 29 the jar is added to the SparkContext, and on line 30 it is added to
the job-server's local classloader.
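
Paraphrasing the pattern as I read it (this is my own rough sketch, not the
actual JobCache code; sc and jarFilePath are placeholders):

import java.io.File
import java.net.URLClassLoader
import org.apache.spark.SparkContext

def loadJobJar(sc: SparkContext, jarFilePath: String): ClassLoader = {
  sc.addJar(jarFilePath)                       // ship the jar for tasks on the executors
  new URLClassLoader(                          // and separately expose its classes to
    Array(new File(jarFilePath).toURI.toURL),  // the job-server's own (driver-side) JVM
    getClass.getClassLoader)
}

i.e. one step that distributes the jar for tasks and one that only changes what
the job-server's local JVM can see.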

Note also this PR related to classloaders -
https://github.com/apache/spark/pull/119

Andrew



On Tue, Mar 18, 2014 at 9:24 AM, Punya Biswal <pbis...@palantir.com> wrote:
> Hi Spark people,
> 
> Sorry to bug everyone again about this, but do people have any thoughts on
> whether sub-contexts would be a good way to solve this problem? I'm thinking
> of something like
> 
> class SparkContext {
>   // ... stuff ...
>   def inSubContext[T](fn: SparkContext => T): T
> }
> 
> this way, I could do something like
> 
> val sc = /* get myself a spark context somehow */;
> val rdd = sc.textFile("/stuff.txt")
> sc.inSubContext { sc1 =>
>   sc1.addJar("extras-v1.jar")
>   print(rdd.filter(/* fn that depends on jar */).count)
> }
> sc.inSubContext { sc2 =>
>   sc2.addJar("extras-v2.jar")
>   print(rdd.filter(/* fn that depends on jar */).count)
> }
> 
> ... even if classes in extras-v1.jar and extras-v2.jar have name collisions.
> 
> Punya
> 
> From: Punya Biswal <pbis...@palantir.com>
> Reply-To: <user@spark.apache.org>
> Date: Sunday, March 16, 2014 at 11:09 AM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Separating classloader management from SparkContexts
> 
> Hi all,
> 
> I'm trying to use Spark to support users who are interactively refining the
> code that processes their data. As a concrete example, I might create an
> RDD[String] and then write several versions of a function to map over the RDD
> until I'm satisfied with the transformation. Right now, once I do addJar() to
> add one version of the jar to the SparkContext, there's no way to add a new
> version of the jar unless I rename the classes and functions involved, or lose
> my current work by re-creating the SparkContext. Is there a better way to do
> this?
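> 
> Concretely (jar and class names made up), the sequence that doesn't work for me
> today looks roughly like this:
> 
> sc.addJar("transform-v1.jar")  // contains com.example.Transform
> // ... edit the function, rebuild the jar ...
> sc.addJar("transform-v2.jar")  // same class names as v1; as far as I can tell,
>                                // tasks keep resolving the v1 definitions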
> 
> One idea that comes to mind is that we could add APIs to create "sub-contexts"
> from within a SparkContext. Jars added to a sub-context would get added to a
> child classloader on the executor, so that different sub-contexts could use
> classes with the same name while still being able to access on-heap objects
> for RDDs. If this makes sense conceptually, I'd like to work on a PR to add
> such functionality to Spark.
> 
> Punya
> 


