I don't know why I don't see my last message in the thread here:
https://lists.apache.org/thread/5wgdqp746nj4f6ovdl42rt82wc8ltkcn
I also don't receive messages from Artemis in my mail; I can only see
them in the thread web UI, which is very confusing.
On top of that, when I click "reply via your own email client" in
the web UI, I get: Bad Request Error 400.
Anyway, to answer your last comment, Artemis:
> I guess there are several misconceptions here:
There's no confusion on my side; all of that makes sense. When I said
"worker" in that comment I meant the scheduler worker, not the Spark
worker, which in the Spark realm would be the client.
Everything else you said is undoubtedly correct, but unrelated to the
issue at hand.
Sean, Artemis - I appreciate your feedback about the infra setup, but
it's beside the point of this issue.
Let me describe a simpler setup with the same problem, say:
1. I have a Jupyter notebook
2. I use local (driver-only) Spark mode
3. I start the driver, process some data, and store it in a pandas DataFrame
4. Now say I want to add a package to the Spark driver (or increase the
JVM memory, etc.)
There's currently no way to do step 4 without restarting the
notebook process, which holds the "reference" to the Spark driver/JVM.
If I restart the Jupyter notebook I lose all the data in memory
(e.g. the pandas data); of course I can save that data to disk, but
that's beside the point.
I understand you don't want to provide this functionality in Spark,
nor warn users about changes to the Spark configuration that won't
actually take effect - as a user I wish I could at least get a warning
in that case, but I respect your decision. It seems like the workaround
of shutting down the JVM works in this case; I would much appreciate
your feedback about **that specific workaround**. Any reason not to use it?
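For reference, the hard-reset workaround from the issue could be wrapped in a single helper. This is only a sketch: it relies on pyspark private attributes (`_sc`, `_gateway`, `proc`, `_jvm`), which are implementation details and may change between versions, and the stand-in objects at the bottom exist purely so the sketch runs without Spark installed.

```python
from types import SimpleNamespace

def hard_reset(session, spark_context_cls):
    """Tear down a SparkSession and its gateway JVM so that the next
    SparkSession.builder.getOrCreate() launches a fresh JVM, picking
    up new settings such as spark.jars.packages or driver memory.
    """
    session.stop()                           # stop the SparkContext
    session._sc._gateway.shutdown()          # close the Py4J gateway
    session._sc._gateway.proc.stdin.close()  # let the JVM process exit
    # Clear cached handles so pyspark relaunches the gateway lazily:
    spark_context_cls._gateway = None
    spark_context_cls._jvm = None

# --- stand-ins so the sketch runs without Spark installed ---
calls = []
gateway = SimpleNamespace(
    shutdown=lambda: calls.append("gateway.shutdown"),
    proc=SimpleNamespace(
        stdin=SimpleNamespace(close=lambda: calls.append("stdin.close"))),
)
session = SimpleNamespace(
    stop=lambda: calls.append("session.stop"),
    _sc=SimpleNamespace(_gateway=gateway),
)

class FakeSparkContext:
    _gateway = gateway
    _jvm = object()

hard_reset(session, FakeSparkContext)
```

In real use, `session` would be the live SparkSession and `spark_context_cls` would be `pyspark.SparkContext`; any objects still referencing the old JVM become invalid after the reset.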
Cheers - Rafal
On Thu, 10 Mar 2022 at 18:50, Rafał Wojdyła <ravwojd...@gmail.com> wrote:
If you have a long running python orchestrator worker (e.g. Luigi
worker), and say it gets a DAG of A -> B -> C, and say the worker
first creates a spark driver for A (which doesn't need extra
jars/packages), then it gets B which is also a spark job but it
needs an extra package, it won't be able to create a new spark
driver with extra packages since it's "not possible" to create a
new driver JVM. I would argue it's the same scenario if you have
multiple spark jobs that need different amounts of memory or
anything that requires a JVM restart. Of course I can use the
workaround to shut down the driver/JVM - do you have any feedback
about that workaround (see my previous comment or the issue)?
On Thu, 10 Mar 2022 at 18:12, Sean Owen <sro...@gmail.com> wrote:
Wouldn't these be separately submitted jobs for separate
workloads? You can of course dynamically change each job
submitted to have whatever packages you like, from whatever is
orchestrating. A single job doing everything doesn't sound right.
On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła
<ravwojd...@gmail.com> wrote:
Because I can't (and should not) know ahead of time which
jobs will be executed, that's the job of the orchestration
layer (and can be dynamic). I know I can specify multiple
packages. Also not worried about memory.
On Thu, 10 Mar 2022 at 13:54, Artemis User
<arte...@dtechspace.com> wrote:
If changing packages or jars isn't your concern, why
not just specify ALL packages that you would need for
the Spark environment? You know you can define
multiple packages under the packages option. This
shouldn't cause memory issues since JVM uses dynamic
class loading...
On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
Hi Artemis,
Thanks for your input, to answer your questions:
> You may want to ask yourself why it is necessary to
change the jar packages during runtime.
I have a long running orchestrator process, which
executes multiple spark jobs, currently on a single
VM/driver, some of those jobs might require extra
packages/jars (please see example in the issue).
> Changing package doesn't mean to reload the classes.
AFAIU this is unrelated
> There is no way to reload the same class unless you
customize the classloader of Spark.
AFAIU this is an implementation detail.
> I also don't think it is necessary to implement a
warning or error message when changing the
configuration since it doesn't do any harm
To reiterate: right now the API allows changing the
configuration of the context without that
configuration taking effect. See examples of confused
users here:
*
https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
*
https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
I'm curious if you have any opinion about the
"hard-reset" workaround, copy-pasting from the issue:
```
s: SparkSession = ...
# Hard reset: stop the session and kill the gateway JVM so
# that a new SparkSession launches a fresh JVM.
s.stop()                           # stop the SparkContext
s._sc._gateway.shutdown()          # close the Py4J gateway
s._sc._gateway.proc.stdin.close()  # let the JVM process exit
# Clear cached handles so pyspark relaunches them lazily:
SparkContext._gateway = None
SparkContext._jvm = None
```
Cheers - Rafal
On 2022/03/09 15:39:58 Artemis User wrote:
> This is indeed a JVM issue, not a Spark issue. You
may want to ask
> yourself why it is necessary to change the jar
packages during runtime.
> Changing package doesn't mean to reload the
classes. There is no way to
> reload the same class unless you customize the
classloader of Spark. I
> also don't think it is necessary to implement a
warning or error message
> when changing the configuration since it doesn't do
any harm. Spark
> uses lazy binding so you can do a lot of such
"unharmful" things.
> Developers will have to understand the behaviors of each API before
> using them.
>
>
> On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
> > Sean,
> > I understand you might be sceptical about adding
this functionality
> > into (py)spark, I'm curious:
> > * would error/warning on update in configuration
that is currently
> > effectively impossible (requires restart of JVM)
be reasonable?
> > * what do you think about the workaround in the
issue?
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 14:24, Sean Owen
<sr...@gmail.com> wrote:
> >
> > Unfortunately this opens a lot more questions
and problems than it
> > solves. What if you take something off the
classpath, for example?
> > change a class?
> >
> > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
> > <ra...@gmail.com> wrote:
> >
> > Thanks Sean,
> > To be clear, if you prefer to change the
label on this issue
> > from bug to sth else, feel free to do so,
no strong opinions
> > on my end. What happens to the classpath,
whether spark uses
> > some classloader magic, is probably an
implementation detail.
> > That said, it's definitely not intuitive
that you can change
> > the configuration and get the context
(with the updated
> > config) without any warnings/errors. Also
what would you
> > recommend as a workaround or solution to
this problem? Any
> > comments about the workaround in the
issue? Keep in mind that
> > I can't restart the long running
orchestration process (python
> > process if that matters).
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 13:15, Sean Owen
<sr...@gmail.com> wrote:
> >
> > That isn't a bug - you can't change
the classpath once the
> > JVM is executing.
> >
> > On Wed, Mar 9, 2022 at 7:11 AM Rafał
Wojdyła
> > <ra...@gmail.com> wrote:
> >
> > Hi,
> > My use case is that I have a
long running process
> > (orchestrator) with multiple
tasks, some tasks might
> > require extra spark dependencies.
It seems once the
> > spark context is started it's not
possible to update
> > `spark.jars.packages`? I have reported an issue at
> > https://issues.apache.org/jira/browse/SPARK-38438,
> > together with a workaround ("hard
reset of the
> > cluster"). I wonder if anyone has
a solution for this?
> > Cheers - Rafal
> >
>