Hi DB,
I found it a little hard to implement the solution I mentioned:
> Do not send the primary jar and secondary jars to executors'
> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
> and serve them via HTTP by calling sc.addJar in SparkContext.
If you look at Application
@Xiangrui
How about we send the primary jar and secondary jars into the distributed
cache without adding them to the system classloader of the executors. Then
we add them using a custom classloader so we don't need to call the
secondary jars through reflection in the primary jar. This will be
consistent with what we
Hey, I just looked at the fix here:
https://github.com/apache/spark/pull/848
Given that this is quite simple, maybe it's best to just go with this
and explain that we don't support adding jars dynamically in YARN
in Spark 1.0. That seems like a reasonable thing to do.
On Wed, May 21, 2014 at
Of these two solutions, I'd definitely prefer 2 in the short term. I'd
imagine the fix is very straightforward (it would mostly just be
removing code), and we'd be making this more consistent with
standalone mode, which makes things way easier to reason about.
In the long term we'll definitely wan
DB Tsai, I do not think userClassPathFirst is working, unless the classes
you load don't reference any classes already loaded by the parent
classloader (a mostly hypothetical situation)... I filed a JIRA for this
here:
https://issues.apache.org/jira/browse/SPARK-1863
On Tue, May 20, 2014 at 1:04
That's a good example. If we really want to cover that case, there are
two solutions:
1. Follow DB's patch, adding jars to the system classloader. Then we
cannot put a user class in front of an existing class.
2. Do not send the primary jar and secondary jars to executors'
distributed cache. Instead, add them to "spark.jars" in SparkSubmit
and serve them via HTTP by calling sc.addJar in SparkContext.
Is that an assumption we can make? I think we'd run into an issue in this
situation:
*In primary jar:*
def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
*In app code:*
sc.addJar("dynamicjar.jar")
...
rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
It might
How about the jars added dynamically? Those will be in customLoader's
classpath but not in the system one. Unfortunately, when we reference
those jars added dynamically in the primary jar, the default classloader
will be the system one, not the custom one.
It works in standalone mode since the prima
I think adding jars dynamically should work as long as the primary jar
and the secondary jars do not depend on dynamically added jars, which
should be the correct logic. -Xiangrui
On Wed, May 21, 2014 at 1:40 PM, DB Tsai wrote:
> This will be another separate story.
>
> Since in the yarn deployme
This will be another separate story.
Since in the YARN deployment, as Sandy said, the app.jar will always be in
the system classloader, which means any object instantiated in app.jar will
have the system classloader as its parent loader instead of the custom one.
As a result, the custom classloader in YARN will
This will solve the issue for jars added upon application submission, but,
on top of this, we need to make sure that anything dynamically added
through sc.addJar works as well.
To do so, we need to make sure that any jars retrieved via the driver's
HTTP server are loaded by the same classloader th
Talked with Sandy and DB offline. I think the best solution is sending
the secondary jars to the distributed cache of all containers rather
than just the master, and setting the classpath to include the Spark jar,
the primary app jar, and the secondary jars before the executor starts. In
this way, the user only needs to s
Since 1.0 has a new option, spark.files.userClassPathFirst, for users to
choose which classloader has higher priority, I decided to submit the
PR for 0.9 first. We use this patch in our lab, and we can use the jars
added by sc.addJar without reflection.
https://github.com/apache/spark/pull/8
Good summary! We fixed it in branch 0.9 since our production is still in
0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for 1.0
tonight.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/
It just hit me why this problem is showing up on YARN and not on standalone.
The relevant difference between YARN and standalone is that, on YARN, the
app jar is loaded by the system classloader instead of Spark's custom URL
classloader.
On YARN, the system classloader knows about [the classes in
Having a user define a custom class inside an added jar and
instantiate it directly inside an executor is definitely supported
in Spark and has been for a really long time (several years). This is
something we do all the time in Spark.
DB - I'd hold off on re-architecting this until
I don't think a custom classloader is necessary.
Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
all run custom user code that creates new user objects without
reflection. I should go see how that's done. Maybe it's totally valid
to set the thread's context classloader for
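Roughly this pattern, sketched out (customLoader here stands for a
hypothetical loader over the added jars, not an actual Spark class):

val previous = Thread.currentThread.getContextClassLoader
try {
  // make the loader that knows about the added jars the context loader
  Thread.currentThread.setContextClassLoader(customLoader)
  // ... run the user code; anything that consults the context
  // classloader now sees the dynamically added classes
} finally {
  Thread.currentThread.setContextClassLoader(previous)
}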
Sounds like the problem is that classloaders always look in their parents
before themselves, and Spark users want executors to pick up classes from
their custom code before the ones in Spark plus its dependencies.
Would a custom classloader that delegates to the parent after first
checking itself
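Something along these lines, perhaps (a sketch only, not Spark's actual
implementation):

import java.net.{URL, URLClassLoader}

class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    // check whether this loader has already defined the class
    var c: Class[_] = findLoadedClass(name)
    if (c == null) {
      c = try {
        findClass(name)                  // look in our own jars first
      } catch {
        case _: ClassNotFoundException =>
          super.loadClass(name, resolve) // only then delegate to the parent
      }
    }
    if (resolve) resolveClass(c)
    c
  }
}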
Hi Sean,
It's true that the issue here is the classloader, and due to the classloader
delegation model, users have to use reflection in the executors to pick up
the classloader in order to use the classes added by the sc.addJar API.
However, it's very inconvenient for users, and not documented in Spark's
documentation.
I might be stating the obvious for everyone, but the issue here is not
reflection or the source of the JAR, but the ClassLoader. The basic
rules are these:
"new Foo" will use the ClassLoader that defines Foo. This is usually
the ClassLoader that loaded whatever it is that first referenced Foo
and c
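For instance (com.example.Foo is a made-up class name):

// `new Foo` is resolved against the loader of the class containing the
// call; you cannot point it at a different loader. Class.forName, by
// contrast, takes a loader explicitly:
val loader = Thread.currentThread.getContextClassLoader
val obj = Class.forName("com.example.Foo", true, loader).newInstance()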
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
-- Forwarded message --
From: Sandy Ryza
Date: Sun, May 18, 2014 at 4:49 PM
Subject: Re: Calling external classes added by sc.addJar needs to be
through reflection
To: "dev@spark.apache.org"
Hey Xi
The jars are included in my driver, and I can successfully use them in the
driver. I'm working on a patch, and it's almost working. Will submit a PR
soon.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com
The reflection actually works. But you need to get the loader by `val
loader = Thread.currentThread.getContextClassLoader`, which is set by the
Spark executor. Our team verified this and uses it as a workaround.
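Sketched out, the workaround inside an executor closure looks something
like this (using the CSVRecordParser class mentioned elsewhere in the
thread; the package name is made up):

rdd.map { line =>
  // the executor sets the context classloader to one that includes
  // jars shipped via sc.addJar
  val loader = Thread.currentThread.getContextClassLoader
  val parser = Class.forName("com.example.CSVRecordParser", true, loader)
    .newInstance()
  // without a shared interface in the primary jar, the instance can
  // only be driven reflectively
  parser.toString
}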
Sincerely,
DB Tsai
---
My Blog: https
Hey Xiangrui,
If the jars are placed in the distributed cache and loaded statically, as
the primary app jar is in YARN, then it shouldn't be an issue. Other jars,
however, including additional jars that are sc.addJar'd and jars specified
with the spark-submit --jars argument, are loaded dynamical
Hi Sandy,
It is hard to imagine that a user needs to create an object in that
way. Since the jars are already in the distributed cache before the
executor starts, is there any reason we cannot add the locally cached
jars to the classpath directly?
Best,
Xiangrui
On Sun, May 18, 2014 at 4:00 PM, Sandy Ry
Hi Patrick,
If spark-submit works correctly, a user only needs to specify runtime
jars via `--jars` instead of using `sc.addJar`. Is that correct? I
checked SparkSubmit and yarn.Client but didn't find any code to handle
`args.jars` for YARN mode. So I don't know where in the code the jars
in the distr
I spoke with DB offline about this a little while ago and he confirmed that
he was able to access the jar from the driver.
The issue appears to be a general Java issue: you can't directly
instantiate a class from a dynamically loaded jar.
I reproduced it locally outside of Spark with:
---
URL
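A sketch of that kind of reproduction (paths and class names made up,
reusing the example from earlier in the thread):

import java.net.{URL, URLClassLoader}

// load a class from a jar the JVM did not start with
val loader = new URLClassLoader(
  Array(new URL("file:///tmp/dynamicjar.jar")), getClass.getClassLoader)

// this succeeds, because we hand Class.forName the right loader:
val clazz = Class.forName("some.class.from.DynamicJar", true, loader)

// but instantiating the class with `new` from code whose own loader is
// the system classloader fails with ClassNotFoundException, since `new`
// always resolves through the calling class's loader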
@xiangrui - we don't expect these to be present on the system
classpath, because they get dynamically added by Spark (e.g. your
application can call sc.addJar well after the JVMs have started).
@db - I'm pretty surprised to see that behavior. It's definitely not
intended that users need reflectio
@db - it's possible that you aren't including the jar in the classpath
of your driver program (I think this is what mridul was suggesting).
It would be helpful to see the stack trace of the CNFE.
- Patrick
On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell wrote:
> @xiangrui - we don't expect the
Btw, I tried
rdd.map { i =>
  System.getProperty("java.class.path")
}.collect()
but didn't see the jars added via "--jars" on the executor classpath.
-Xiangrui
On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng wrote:
> I can reproduce the error with Spark 1.0-RC and YARN (CDH-5). The
> reflecti
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
DB, could you add more info to that JIRA? Thanks!
-Xiangrui
On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng wrote:
> Btw, I tried
>
> rdd.map { i =>
> System.getProperty("java.class.path")
> }.collect()
>
> but didn't see the j
I can reproduce the error with Spark 1.0-RC and YARN (CDH-5). The
reflection approach mentioned by DB didn't work either. I checked the
distributed cache on a worker node and found the jar there. It is also
in the Environment tab of the WebUI. The workaround is making an
assembly jar.
DB, could y
Can you try moving your mapPartitions to another class/object which is
referenced only after sc.addJar?
I would suspect the CNFE is coming while loading the class containing
mapPartitions, before addJar is executed.
In general though, dynamic loading of classes means you use reflection to
instantia
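A rough sketch of that restructuring (jar path and names are made up;
sc and rdd are assumed to be in scope):

sc.addJar("/path/to/utility.jar")

// the closure lives in an object that is referenced for the first time
// only after sc.addJar has run, so its class is loaded afterwards
object LateBound {
  def run(rdd: org.apache.spark.rdd.RDD[String]): Array[String] =
    rdd.mapPartitions { iter =>
      iter.map(_.trim) // stand-in for code that uses the added jar
    }.collect()
}

val result = LateBound.run(rdd)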
Finally found a way out of the ClassLoader maze! It took me some time to
understand how it works; I think it's worth documenting in a separate
thread.
We're trying to add an external utility.jar which contains CSVRecordParser,
and we added the jar to the executors through the sc.addJar API.
If the insta