Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-11 Thread Artemis User
OK, I see the confusion in terminology.  However, what was suggested 
should still work.  A Luigi worker in this case would function like a 
Spark client, responsible for submitting a Spark application (or "job" in 
Luigi's terms).  In other words, you just define all the jars needed by 
all of your jobs in your SparkContext (or, to make things easier, define 
them in the spark-defaults.conf file or simply place them in Spark's jars 
directory).  This should work, especially when you don't know which 
"job" (more precisely, application or task) needs which jars in advance.


For other questions unrelated to this discussion, I'd suggest starting a 
new thread to make things clear. Thanks!


On 3/11/22 1:09 PM, Rafał Wojdyła wrote:
I don't know why I don't see my last message in the thread here: 
https://lists.apache.org/thread/5wgdqp746nj4f6ovdl42rt82wc8ltkcn
I also don't get Artemis's messages in my mail; I can only see them 
in the thread web UI, which is very confusing.
On top of that, when I click on "reply via your own email client" in 
the web UI, I get: Bad Request Error 400


Anyway, to answer your last comment, Artemis:

> I guess there are several misconceptions here:

There's no confusion on my side; all of that makes sense. When I said 
"worker" in that comment I meant the scheduler worker, not the Spark 
worker, which in Spark terms would be the client.
Everything else you said is undoubtedly correct, but unrelated to the 
issue/problem at hand.


Sean, Artemis - I appreciate your feedback about the infra setup, but 
it's beside the point of this issue.


Let me describe a simpler setup/example with the same problem, say:
 1. I have a Jupyter notebook
 2. I use local (driver-only) Spark mode
 3. I start the driver, process some data, and store it in a pandas DataFrame
 4. now say I want to add a package to the Spark driver (or increase the 
JVM memory, etc.)


There's currently no way to do step 4 without restarting the 
notebook process, which holds the "reference" to the Spark driver/JVM. 
If I restart the Jupyter notebook I lose all the data in memory 
(e.g. the pandas data); of course I can save that data to disk, but 
that's beside the point.


I understand you don't want to provide this functionality in Spark, 
nor warn users about changes to the Spark configuration that won't 
actually take effect - as a user I wish I could at least get a warning 
in that case, but I respect your decision. It seems like the workaround 
of shutting down the JVM works in this case; I would much appreciate 
your feedback about **that specific workaround**. Any reason not to use it?

Cheers - Rafal

On Thu, 10 Mar 2022 at 18:50, Rafał Wojdyła  wrote:

If you have a long-running Python orchestrator worker (e.g. a Luigi
worker), and say it gets a DAG of A -> B -> C, and say the worker
first creates a Spark driver for A (which doesn't need extra
jars/packages), then it gets B, which is also a Spark job but
needs an extra package; it won't be able to create a new Spark
driver with extra packages, since it's "not possible" to create a
new driver JVM. I would argue it's the same scenario if you have
multiple Spark jobs that need different amounts of memory or
anything else that requires a JVM restart. Of course I can use the
workaround to shut down the driver/JVM - do you have any feedback
about that workaround (see my previous comment or the issue)?

On Thu, 10 Mar 2022 at 18:12, Sean Owen  wrote:

Wouldn't these be separately submitted jobs for separate
workloads? You can of course dynamically change each job
submitted to have whatever packages you like, from whatever is
orchestrating. A single job doing everything sound right.

On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła
 wrote:

Because I can't (and should not) know ahead of time which
jobs will be executed, that's the job of the orchestration
layer (and can be dynamic). I know I can specify multiple
packages. Also not worried about memory.

On Thu, 10 Mar 2022 at 13:54, Artemis User
 wrote:

If changing packages or jars isn't your concern, why
not just specify ALL packages that you would need for
the Spark environment?  You know you can define
multiple packages under the packages option.  This
shouldn't cause memory issues since JVM uses dynamic
class loading...

On 3/9/22 10:03 PM, Rafał Wojdyła wrote:

Hi Artemis,
Thanks for your input, to answer your questions:

> You may want to ask yourself why it is necessary to
change the jar packages during runtime.

I have a long running orchestrator process, which
executes multiple spark jobs, currently on a single
VM/driver, some of those jobs might require 

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-11 Thread Rafał Wojdyła
I don't know why I don't see my last message in the thread here:
https://lists.apache.org/thread/5wgdqp746nj4f6ovdl42rt82wc8ltkcn
I also don't get Artemis's messages in my mail; I can only see them in the
thread web UI, which is very confusing.
On top of that, when I click on "reply via your own email client" in the web
UI, I get: Bad Request Error 400

Anyway, to answer your last comment, Artemis:

> I guess there are several misconceptions here:

There's no confusion on my side; all of that makes sense. When I said "worker"
in that comment I meant the scheduler worker, not the Spark worker, which in
Spark terms would be the client.
Everything else you said is undoubtedly correct, but unrelated to the
issue/problem at hand.

Sean, Artemis - I appreciate your feedback about the infra setup, but it's
beside the point of this issue.

Let me describe a simpler setup/example with the same problem, say:
 1. I have a Jupyter notebook
 2. I use local (driver-only) Spark mode
 3. I start the driver, process some data, and store it in a pandas DataFrame
 4. now say I want to add a package to the Spark driver (or increase the JVM
memory, etc.)

There's currently no way to do step 4 without restarting the notebook
process, which holds the "reference" to the Spark driver/JVM. If I restart
the Jupyter notebook I lose all the data in memory (e.g. the pandas data);
of course I can save that data to disk, but that's beside the point.
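
For concreteness, a minimal sketch of step 4 combined with the hard-reset
workaround from the issue (the package coordinate is illustrative, and
`_sc`/`_gateway`/`_jvm` are private PySpark internals, so this is not
guaranteed to keep working across versions):

```
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Steps 1-3: the notebook already has a local driver and some pandas data.
spark = SparkSession.builder.master("local[*]").getOrCreate()
pdf = spark.range(10).toPandas()  # stand-in for data I don't want to lose

# Step 4: adding a package to the running driver has no effect, so instead
# tear down the JVM behind the session ("hard reset", private internals):
spark.stop()
spark._sc._gateway.shutdown()
spark._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None

# ...then start a fresh driver JVM with the extra package; pdf survives
# because it lives in the Python process, not in the JVM.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
    .getOrCreate()
)
```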

I understand you don't want to provide this functionality in Spark, nor
warn users about changes to the Spark configuration that won't actually take
effect - as a user I wish I could at least get a warning in that case, but I
respect your decision. It seems like the workaround of shutting down the JVM
works in this case; I would much appreciate your feedback about **that
specific workaround**. Any reason not to use it?
Cheers - Rafal

On Thu, 10 Mar 2022 at 18:50, Rafał Wojdyła  wrote:

> If you have a long running python orchestrator worker (e.g. Luigi worker),
> and say it's gets a DAG of A -> B ->C, and say the worker first creates a
> spark driver for A (which doesn't need extra jars/packages), then it gets B
> which is also a spark job but it needs an extra package, it won't be able
> to create a new spark driver with extra packages since it's "not possible"
> to create a new driver JVM. I would argue it's the same scenario if you
> have multiple spark jobs that need different amounts of memory or anything
> that requires JVM restart. Of course I can use the workaround to shut down
> the driver/JVM, do you have any feedback about that workaround (see my
> previous comment or the issue).
>
> On Thu, 10 Mar 2022 at 18:12, Sean Owen  wrote:
>
>> Wouldn't these be separately submitted jobs for separate workloads? You
>> can of course dynamically change each job submitted to have whatever
>> packages you like, from whatever is orchestrating. A single job doing
>> everything sound right.
>>
>> On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła 
>> wrote:
>>
>>> Because I can't (and should not) know ahead of time which jobs will be
>>> executed, that's the job of the orchestration layer (and can be dynamic). I
>>> know I can specify multiple packages. Also not worried about memory.
>>>
>>> On Thu, 10 Mar 2022 at 13:54, Artemis User 
>>> wrote:
>>>
 If changing packages or jars isn't your concern, why not just specify
 ALL packages that you would need for the Spark environment?  You know you
 can define multiple packages under the packages option.  This shouldn't
 cause memory issues since JVM uses dynamic class loading...

 On 3/9/22 10:03 PM, Rafał Wojdyła wrote:

 Hi Artemis,
 Thanks for your input, to answer your questions:

 > You may want to ask yourself why it is necessary to change the jar
 packages during runtime.

 I have a long running orchestrator process, which executes multiple
 spark jobs, currently on a single VM/driver, some of those jobs might
 require extra packages/jars (please see example in the issue).

 > Changing package doesn't mean to reload the classes.

 AFAIU this is unrelated

 > There is no way to reload the same class unless you customize the
 classloader of Spark.

 AFAIU this is an implementation detail.

 > I also don't think it is necessary to implement a warning or error
 message when changing the configuration since it doesn't do any harm

 To reiterate right now the API allows to change configuration of the
 context, without that configuration taking effect. See example of confused
 users here:
  *
 https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
  *
 https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1

 I'm curious if you have any opinion about the "hard-reset" workaround,
 copy-pasting from the issue:

 ```
 s: 

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Artemis User

I guess there are several misconceptions here:

1. The worker doesn't create the driver; the client does.
2. Regardless of job scheduling, all jobs of the same task/application
   are under the same SparkContext, which is created by the driver.
   Therefore, you need to specify ALL dependency jars for ALL jobs when
   the single SparkContext is initialized (see the sketch after this list).
3. The SparkContext object lives on, and won't change, for as long as the
   application is alive.
4. Please see this Spark doc page as a reference:
   https://spark.apache.org/docs/latest/cluster-overview.html
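
For illustration, a minimal sketch of point 2 in PySpark (the package
coordinates are placeholders; whatever set of jars the jobs might need
would be listed once, up front):

```
from pyspark.sql import SparkSession

# Declare every dependency any job may need, once, before the context starts.
# The coordinates below are placeholders, not a recommendation.
ALL_PACKAGES = ",".join([
    "org.apache.spark:spark-avro_2.12:3.2.1",
    "io.delta:delta-core_2.12:1.1.0",
])

spark = (
    SparkSession.builder
    .appName("orchestrated-jobs")
    .config("spark.jars.packages", ALL_PACKAGES)
    .getOrCreate()
)

# Jobs A, B, C all run against this one SparkContext/SparkSession;
# none of them can add new packages to the running JVM later.
```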


On 3/10/22 1:50 PM, Rafał Wojdyła wrote:
If you have a long running python orchestrator worker (e.g. Luigi 
worker), and say it's gets a DAG of A -> B ->C, and say the worker 
first creates a spark driver for A (which doesn't need extra 
jars/packages), then it gets B which is also a spark job but it needs 
an extra package, it won't be able to create a new spark driver with 
extra packages since it's "not possible" to create a new driver JVM. I 
would argue it's the same scenario if you have multiple spark jobs 
that need different amounts of memory or anything that requires JVM 
restart. Of course I can use the workaround to shut down the 
driver/JVM, do you have any feedback about that workaround (see my 
previous comment or the issue).


On Thu, 10 Mar 2022 at 18:12, Sean Owen  wrote:

Wouldn't these be separately submitted jobs for separate
workloads? You can of course dynamically change each job submitted
to have whatever packages you like, from whatever is
orchestrating. A single job doing everything sound right.

On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła
 wrote:

Because I can't (and should not) know ahead of time which jobs
will be executed, that's the job of the orchestration layer
(and can be dynamic). I know I can specify multiple packages.
Also not worried about memory.

On Thu, 10 Mar 2022 at 13:54, Artemis User
 wrote:

If changing packages or jars isn't your concern, why not
just specify ALL packages that you would need for the
Spark environment?  You know you can define multiple
packages under the packages option. This shouldn't cause
memory issues since JVM uses dynamic class loading...

On 3/9/22 10:03 PM, Rafał Wojdyła wrote:

Hi Artemis,
Thanks for your input, to answer your questions:

> You may want to ask yourself why it is necessary to
change the jar packages during runtime.

I have a long running orchestrator process, which
executes multiple spark jobs, currently on a single
VM/driver, some of those jobs might require extra
packages/jars (please see example in the issue).

> Changing package doesn't mean to reload the classes.

AFAIU this is unrelated

> There is no way to reload the same class unless you
customize the classloader of Spark.

AFAIU this is an implementation detail.

> I also don't think it is necessary to implement a
warning or error message when changing the configuration
since it doesn't do any harm

To reiterate right now the API allows to change
configuration of the context, without that configuration
taking effect. See example of confused users here:
 *

https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
 *

https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1

I'm curious if you have any opinion about the
"hard-reset" workaround, copy-pasting from the issue:

```
s: SparkSession = ...

# Hard reset:
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
```

Cheers - Rafal

On 2022/03/09 15:39:58 Artemis User wrote:
> This is indeed a JVM issue, not a Spark issue.  You may
want to ask
> yourself why it is necessary to change the jar packages
during runtime.
> Changing package doesn't mean to reload the classes.
There is no way to
> reload the same class unless you customize the
classloader of Spark.  I
> also don't think it is necessary to implement a warning
or error message
> when changing the configuration since it doesn't do any
harm.  Spark
> uses lazy binding so you can do a lot of such
"unharmful" things.
> Developers will have to understand the behaviors of
each API before when
> 

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Sean Owen
Wouldn't these be separately submitted jobs for separate workloads? You can
of course dynamically change each job submitted to have whatever packages
you like, from whatever is orchestrating. A single job doing everything
doesn't sound right.
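
For example, a sketch of what this could look like from the orchestrator's
side (script names and package coordinates are illustrative): each task
becomes its own spark-submit, so the --packages list can differ per job.

```
import subprocess
from typing import List, Optional

def submit(script: str, packages: Optional[List[str]] = None) -> None:
    # Submit one task as its own Spark application, with its own packages.
    cmd = ["spark-submit", "--master", "local[*]"]
    if packages:
        cmd += ["--packages", ",".join(packages)]
    cmd.append(script)
    subprocess.run(cmd, check=True)

submit("job_a.py")  # no extra dependencies
submit("job_b.py", packages=["org.apache.spark:spark-avro_2.12:3.2.1"])  # needs Avro
```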

On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła  wrote:

> Because I can't (and should not) know ahead of time which jobs will be
> executed, that's the job of the orchestration layer (and can be dynamic). I
> know I can specify multiple packages. Also not worried about memory.
>
> On Thu, 10 Mar 2022 at 13:54, Artemis User  wrote:
>
>> If changing packages or jars isn't your concern, why not just specify ALL
>> packages that you would need for the Spark environment?  You know you can
>> define multiple packages under the packages option.  This shouldn't cause
>> memory issues since JVM uses dynamic class loading...
>>
>> On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
>>
>> Hi Artemis,
>> Thanks for your input, to answer your questions:
>>
>> > You may want to ask yourself why it is necessary to change the jar
>> packages during runtime.
>>
>> I have a long running orchestrator process, which executes multiple spark
>> jobs, currently on a single VM/driver, some of those jobs might
>> require extra packages/jars (please see example in the issue).
>>
>> > Changing package doesn't mean to reload the classes.
>>
>> AFAIU this is unrelated
>>
>> > There is no way to reload the same class unless you customize the
>> classloader of Spark.
>>
>> AFAIU this is an implementation detail.
>>
>> > I also don't think it is necessary to implement a warning or error
>> message when changing the configuration since it doesn't do any harm
>>
>> To reiterate right now the API allows to change configuration of the
>> context, without that configuration taking effect. See example of confused
>> users here:
>>  *
>> https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
>>  *
>> https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
>>
>> I'm curious if you have any opinion about the "hard-reset" workaround,
>> copy-pasting from the issue:
>>
>> ```
>> s: SparkSession = ...
>>
>> # Hard reset:
>> s.stop()
>> s._sc._gateway.shutdown()
>> s._sc._gateway.proc.stdin.close()
>> SparkContext._gateway = None
>> SparkContext._jvm = None
>> ```
>>
>> Cheers - Rafal
>>
>> On 2022/03/09 15:39:58 Artemis User wrote:
>> > This is indeed a JVM issue, not a Spark issue.  You may want to ask
>> > yourself why it is necessary to change the jar packages during
>> runtime.
>> > Changing package doesn't mean to reload the classes. There is no way to
>> > reload the same class unless you customize the classloader of Spark.  I
>> > also don't think it is necessary to implement a warning or error
>> message
>> > when changing the configuration since it doesn't do any harm.  Spark
>> > uses lazy binding so you can do a lot of such "unharmful" things.
>> > Developers will have to understand the behaviors of each API before
>> when
>> > using them..
>> >
>> >
>> > On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
>> > >  Sean,
>> > > I understand you might be sceptical about adding this functionality
>> > > into (py)spark, I'm curious:
>> > > * would error/warning on update in configuration that is currently
>> > > effectively impossible (requires restart of JVM) be reasonable?
>> > > * what do you think about the workaround in the issue?
>> > > Cheers - Rafal
>> > >
>> > > On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:
>> > >
>> > > Unfortunately this opens a lot more questions and problems than it
>> > > solves. What if you take something off the classpath, for example?
>> > > change a class?
>> > >
>> > > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
>> > >  wrote:
>> > >
>> > > Thanks Sean,
>> > > To be clear, if you prefer to change the label on this issue
>> > > from bug to sth else, feel free to do so, no strong opinions
>> > > on my end. What happens to the classpath, whether spark uses
>> > > some classloader magic, is probably an implementation detail.
>> > > That said, it's definitely not intuitive that you can change
>> > > the configuration and get the context (with the updated
>> > > config) without any warnings/errors. Also what would you
>> > > recommend as a workaround or solution to this problem? Any
>> > > comments about the workaround in the issue? Keep in mind that
>> > > I can't restart the long running orchestration process (python
>> > > process if that matters).
>> > > Cheers - Rafal
>> > >
>> > > On Wed, 9 Mar 2022 at 13:15, Sean Owen 
>> wrote:
>> > >
>> > > That isn't a bug - you can't change the classpath once the
>> > > JVM is executing.
>> > >
>> > > On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
>> > >  wrote:
>> > >
>> > > Hi,

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Rafał Wojdyła
Because I can't (and shouldn't) know ahead of time which jobs will be
executed - that's the job of the orchestration layer (and it can be dynamic).
I know I can specify multiple packages. I'm also not worried about memory.

On Thu, 10 Mar 2022 at 13:54, Artemis User  wrote:

> If changing packages or jars isn't your concern, why not just specify ALL
> packages that you would need for the Spark environment?  You know you can
> define multiple packages under the packages option.  This shouldn't cause
> memory issues since JVM uses dynamic class loading...
>
> On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
>
> Hi Artemis,
> Thanks for your input, to answer your questions:
>
> > You may want to ask yourself why it is necessary to change the jar
> packages during runtime.
>
> I have a long running orchestrator process, which executes multiple spark
> jobs, currently on a single VM/driver, some of those jobs might
> require extra packages/jars (please see example in the issue).
>
> > Changing package doesn't mean to reload the classes.
>
> AFAIU this is unrelated
>
> > There is no way to reload the same class unless you customize the
> classloader of Spark.
>
> AFAIU this is an implementation detail.
>
> > I also don't think it is necessary to implement a warning or error
> message when changing the configuration since it doesn't do any harm
>
> To reiterate right now the API allows to change configuration of the
> context, without that configuration taking effect. See example of confused
> users here:
>  *
> https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
>  *
> https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
>
> I'm curious if you have any opinion about the "hard-reset" workaround,
> copy-pasting from the issue:
>
> ```
> s: SparkSession = ...
>
> # Hard reset:
> s.stop()
> s._sc._gateway.shutdown()
> s._sc._gateway.proc.stdin.close()
> SparkContext._gateway = None
> SparkContext._jvm = None
> ```
>
> Cheers - Rafal
>
> On 2022/03/09 15:39:58 Artemis User wrote:
> > This is indeed a JVM issue, not a Spark issue.  You may want to ask
> > yourself why it is necessary to change the jar packages during runtime.
> > Changing package doesn't mean to reload the classes. There is no way to
> > reload the same class unless you customize the classloader of Spark.  I
> > also don't think it is necessary to implement a warning or error message
> > when changing the configuration since it doesn't do any harm.  Spark
> > uses lazy binding so you can do a lot of such "unharmful" things.
> > Developers will have to understand the behaviors of each API before when
> > using them..
> >
> >
> > On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
> > >  Sean,
> > > I understand you might be sceptical about adding this functionality
> > > into (py)spark, I'm curious:
> > > * would error/warning on update in configuration that is currently
> > > effectively impossible (requires restart of JVM) be reasonable?
> > > * what do you think about the workaround in the issue?
> > > Cheers - Rafal
> > >
> > > On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:
> > >
> > > Unfortunately this opens a lot more questions and problems than it
> > > solves. What if you take something off the classpath, for example?
> > > change a class?
> > >
> > > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
> > >  wrote:
> > >
> > > Thanks Sean,
> > > To be clear, if you prefer to change the label on this issue
> > > from bug to sth else, feel free to do so, no strong opinions
> > > on my end. What happens to the classpath, whether spark uses
> > > some classloader magic, is probably an implementation detail.
> > > That said, it's definitely not intuitive that you can change
> > > the configuration and get the context (with the updated
> > > config) without any warnings/errors. Also what would you
> > > recommend as a workaround or solution to this problem? Any
> > > comments about the workaround in the issue? Keep in mind that
> > > I can't restart the long running orchestration process (python
> > > process if that matters).
> > > Cheers - Rafal
> > >
> > > On Wed, 9 Mar 2022 at 13:15, Sean Owen 
> wrote:
> > >
> > > That isn't a bug - you can't change the classpath once the
> > > JVM is executing.
> > >
> > > On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
> > >  wrote:
> > >
> > > Hi,
> > > My use case is that, I have a long running process
> > > (orchestrator) with multiple tasks, some tasks might
> > > require extra spark dependencies. It seems once the
> > > spark context is started it's not possible to update
> > > `spark.jars.packages`? I have reported an issue at
> > > 

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Artemis User
If changing packages or jars isn't your concern, why not just specify 
ALL the packages that you would need for the Spark environment? You know you 
can define multiple packages under the packages option.  This shouldn't 
cause memory issues, since the JVM uses dynamic class loading...
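
For reference, a sketch of the "multiple packages" option (coordinates are
illustrative); `spark.jars.packages` takes a comma-separated list of Maven
coordinates:

```
from pyspark.sql import SparkSession

# One comma-separated list declares every package any of the jobs will need.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-avro_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0",
    )
    .getOrCreate()
)
```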


On 3/9/22 10:03 PM, Rafał Wojdyła wrote:

Hi Artemis,
Thanks for your input, to answer your questions:

> You may want to ask yourself why it is necessary to change the jar 
packages during runtime.


I have a long running orchestrator process, which executes multiple 
spark jobs, currently on a single VM/driver, some of those jobs might 
require extra packages/jars (please see example in the issue).


> Changing package doesn't mean to reload the classes.

AFAIU this is unrelated

> There is no way to reload the same class unless you customize the 
classloader of Spark.


AFAIU this is an implementation detail.

> I also don't think it is necessary to implement a warning or error 
message when changing the configuration since it doesn't do any harm


To reiterate right now the API allows to change configuration of the 
context, without that configuration taking effect. See example of 
confused users here:
 * 
https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
 * 
https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1


I'm curious if you have any opinion about the "hard-reset" workaround, 
copy-pasting from the issue:


```
s: SparkSession = ...

# Hard reset:
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
```

Cheers - Rafal

On 2022/03/09 15:39:58 Artemis User wrote:
> This is indeed a JVM issue, not a Spark issue.  You may want to ask
> yourself why it is necessary to change the jar packages during runtime.
> Changing package doesn't mean to reload the classes. There is no way to
> reload the same class unless you customize the classloader of Spark.  I
> also don't think it is necessary to implement a warning or error message
> when changing the configuration since it doesn't do any harm.  Spark
> uses lazy binding so you can do a lot of such "unharmful" things.
> Developers will have to understand the behaviors of each API before
> using them..
>
>
> On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
> >  Sean,
> > I understand you might be sceptical about adding this functionality
> > into (py)spark, I'm curious:
> > * would error/warning on update in configuration that is currently
> > effectively impossible (requires restart of JVM) be reasonable?
> > * what do you think about the workaround in the issue?
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:
> >
> >     Unfortunately this opens a lot more questions and problems than it
> >     solves. What if you take something off the classpath, for example?
> >     change a class?
> >
> >     On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
> >      wrote:
> >
> >         Thanks Sean,
> >         To be clear, if you prefer to change the label on this issue
> >         from bug to sth else, feel free to do so, no strong opinions
> >         on my end. What happens to the classpath, whether spark uses
> >         some classloader magic, is probably an implementation detail.
> >         That said, it's definitely not intuitive that you can change
> >         the configuration and get the context (with the updated
> >         config) without any warnings/errors. Also what would you
> >         recommend as a workaround or solution to this problem? Any
> >         comments about the workaround in the issue? Keep in mind that
> >         I can't restart the long running orchestration process (python
> >         process if that matters).
> >         Cheers - Rafal
> >
> >         On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
> >
> >             That isn't a bug - you can't change the classpath once the
> >             JVM is executing.
> >
> >             On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
> >              wrote:
> >
> >                 Hi,
> >                 My use case is that, I have a long running process
> >                 (orchestrator) with multiple tasks, some tasks might
> >                 require extra spark dependencies. It seems once the
> >                 spark context is started it's not possible to update
> >                 `spark.jars.packages`? I have reported an issue at
> > https://issues.apache.org/jira/browse/SPARK-38438,
> >                 together with a workaround ("hard reset of the
> >                 cluster"). I wonder if anyone has a solution for this?
> >                 Cheers - Rafal
> >
>



Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Hi Artemis,
Thanks for your input, to answer your questions:

> You may want to ask yourself why it is necessary to change the jar
packages during runtime.

I have a long-running orchestrator process which executes multiple Spark
jobs, currently on a single VM/driver; some of those jobs might
require extra packages/jars (please see the example in the issue).

> Changing package doesn't mean to reload the classes.

AFAIU this is unrelated

> There is no way to reload the same class unless you customize the
classloader of Spark.

AFAIU this is an implementation detail.

> I also don't think it is necessary to implement a warning or error
message when changing the configuration since it doesn't do any harm

To reiterate: right now the API allows changing the configuration of the
context without that configuration taking effect. See examples of confused
users here:
 *
https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
 *
https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
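
For illustration, a minimal sketch of the silent behaviour being described
(driver memory is just one example of a startup-only setting; newer PySpark
versions may log a warning here, so treat this as an approximation):

```
from pyspark.sql import SparkSession

# A driver JVM is already running behind an existing session:
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Asking for more driver memory afterwards silently returns the *same*
# session; the already-started JVM keeps its original heap size no matter
# what the configuration object now reports.
spark2 = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")  # a startup-only setting, as an example
    .getOrCreate()
)
assert spark2 is spark
```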

I'm curious if you have any opinion about the "hard-reset" workaround,
adapted from the issue:

```
from pyspark import SparkContext
from pyspark.sql import SparkSession

s: SparkSession = ...

# Hard reset: stop the session, shut down the Py4J gateway (and the driver
# JVM behind it), then clear the cached gateway/JVM handles so that the
# next SparkContext starts a brand-new JVM.
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
```
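
For completeness - after the reset, the next session can be started with the
updated configuration (the package coordinate below is illustrative); since
`SparkContext._gateway` was cleared, `getOrCreate()` launches a fresh driver
JVM. This continues the snippet above:

```
s = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
    .getOrCreate()
)
```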

Cheers - Rafal

On 2022/03/09 15:39:58 Artemis User wrote:
> This is indeed a JVM issue, not a Spark issue.  You may want to ask
> yourself why it is necessary to change the jar packages during runtime.
> Changing package doesn't mean to reload the classes. There is no way to
> reload the same class unless you customize the classloader of Spark.  I
> also don't think it is necessary to implement a warning or error message
> when changing the configuration since it doesn't do any harm.  Spark
> uses lazy binding so you can do a lot of such "unharmful" things.
> Developers will have to understand the behaviors of each API before when
> using them..
>
>
> On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
> >  Sean,
> > I understand you might be sceptical about adding this functionality
> > into (py)spark, I'm curious:
> > * would error/warning on update in configuration that is currently
> > effectively impossible (requires restart of JVM) be reasonable?
> > * what do you think about the workaround in the issue?
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:
> >
> > Unfortunately this opens a lot more questions and problems than it
> > solves. What if you take something off the classpath, for example?
> > change a class?
> >
> > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
> >  wrote:
> >
> > Thanks Sean,
> > To be clear, if you prefer to change the label on this issue
> > from bug to sth else, feel free to do so, no strong opinions
> > on my end. What happens to the classpath, whether spark uses
> > some classloader magic, is probably an implementation detail.
> > That said, it's definitely not intuitive that you can change
> > the configuration and get the context (with the updated
> > config) without any warnings/errors. Also what would you
> > recommend as a workaround or solution to this problem? Any
> > comments about the workaround in the issue? Keep in mind that
> > I can't restart the long running orchestration process (python
> > process if that matters).
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
> >
> > That isn't a bug - you can't change the classpath once the
> > JVM is executing.
> >
> > On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
> >  wrote:
> >
> > Hi,
> > My use case is that, I have a long running process
> > (orchestrator) with multiple tasks, some tasks might
> > require extra spark dependencies. It seems once the
> > spark context is started it's not possible to update
> > `spark.jars.packages`? I have reported an issue at
> > https://issues.apache.org/jira/browse/SPARK-38438,
> > together with a workaround ("hard reset of the
> > cluster"). I wonder if anyone has a solution for this?
> > Cheers - Rafal
> >
>

>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Artemis User
This is indeed a JVM issue, not a Spark issue.  You may want to ask
yourself why it is necessary to change the jar packages at runtime.
Changing the packages doesn't mean the classes are reloaded; there is no
way to reload the same class unless you customize Spark's classloader.  I
also don't think it is necessary to implement a warning or error message
when changing the configuration, since it doesn't do any harm.  Spark
uses lazy binding, so you can do a lot of such "unharmful" things.
Developers will have to understand the behavior of each API before
using it.



On 3/9/22 9:31 AM, Rafał Wojdyła wrote:

 Sean,
I understand you might be sceptical about adding this functionality 
into (py)spark, I'm curious:
* would error/warning on update in configuration that is currently 
effectively impossible (requires restart of JVM) be reasonable?

* what do you think about the workaround in the issue?
Cheers - Rafal

On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:

Unfortunately this opens a lot more questions and problems than it
solves. What if you take something off the classpath, for example?
change a class?

On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
 wrote:

Thanks Sean,
To be clear, if you prefer to change the label on this issue
from bug to sth else, feel free to do so, no strong opinions
on my end. What happens to the classpath, whether spark uses
some classloader magic, is probably an implementation detail.
That said, it's definitely not intuitive that you can change
the configuration and get the context (with the updated
config) without any warnings/errors. Also what would you
recommend as a workaround or solution to this problem? Any
comments about the workaround in the issue? Keep in mind that
I can't restart the long running orchestration process (python
process if that matters).
Cheers - Rafal

On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:

That isn't a bug - you can't change the classpath once the
JVM is executing.

On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
 wrote:

Hi,
My use case is that, I have a long running process
(orchestrator) with multiple tasks, some tasks might
require extra spark dependencies. It seems once the
spark context is started it's not possible to update
`spark.jars.packages`? I have reported an issue at
https://issues.apache.org/jira/browse/SPARK-38438,
together with a workaround ("hard reset of the
cluster"). I wonder if anyone has a solution for this?
Cheers - Rafal



Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Sean,
I understand you might be sceptical about adding this functionality to
(py)spark; I'm curious:
* would an error/warning on updates to configuration that currently cannot
take effect (because they require a JVM restart) be reasonable? (a sketch of
what I mean follows below)
* what do you think about the workaround in the issue?
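
Purely for illustration, a user-side sketch of the kind of check I mean
(`STARTUP_ONLY` and `warn_if_ineffective` are hypothetical names, not
existing Spark API):

```
import warnings
from pyspark import SparkContext

# Settings that only apply when the driver JVM is launched (illustrative list).
STARTUP_ONLY = {"spark.jars.packages", "spark.driver.memory", "spark.driver.extraClassPath"}

def warn_if_ineffective(key, value):
    # _active_spark_context is a private attribute, used here only for the sketch.
    if SparkContext._active_spark_context is not None and key in STARTUP_ONLY:
        warnings.warn(
            f"Setting {key!r}={value!r} after the driver JVM has started has no "
            "effect; a JVM restart is required for it to apply."
        )
```
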
Cheers - Rafal

On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:

> Unfortunately this opens a lot more questions and problems than it solves.
> What if you take something off the classpath, for example? change a class?
>
> On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła  wrote:
>
>> Thanks Sean,
>> To be clear, if you prefer to change the label on this issue from bug to
>> sth else, feel free to do so, no strong opinions on my end. What happens to
>> the classpath, whether spark uses some classloader magic, is probably an
>> implementation detail. That said, it's definitely not intuitive that you
>> can change the configuration and get the context (with the updated config)
>> without any warnings/errors. Also what would you recommend as a workaround
>> or solution to this problem? Any comments about the workaround in the
>> issue? Keep in mind that I can't restart the long running orchestration
>> process (python process if that matters).
>> Cheers - Rafal
>>
>> On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
>>
>>> That isn't a bug - you can't change the classpath once the JVM is
>>> executing.
>>>
>>> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła 
>>> wrote:
>>>
 Hi,
 My use case is that, I have a long running process (orchestrator) with
 multiple tasks, some tasks might require extra spark dependencies. It seems
 once the spark context is started it's not possible to update
 `spark.jars.packages`? I have reported an issue at
 https://issues.apache.org/jira/browse/SPARK-38438, together with a
 workaround ("hard reset of the cluster"). I wonder if anyone has a solution
 for this?
 Cheers - Rafal

>>>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Sean Owen
Unfortunately this opens a lot more questions and problems than it solves.
What if you take something off the classpath, for example, or change a class?

On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła  wrote:

> Thanks Sean,
> To be clear, if you prefer to change the label on this issue from bug to
> sth else, feel free to do so, no strong opinions on my end. What happens to
> the classpath, whether spark uses some classloader magic, is probably an
> implementation detail. That said, it's definitely not intuitive that you
> can change the configuration and get the context (with the updated config)
> without any warnings/errors. Also what would you recommend as a workaround
> or solution to this problem? Any comments about the workaround in the
> issue? Keep in mind that I can't restart the long running orchestration
> process (python process if that matters).
> Cheers - Rafal
>
> On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
>
>> That isn't a bug - you can't change the classpath once the JVM is
>> executing.
>>
>> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła 
>> wrote:
>>
>>> Hi,
>>> My use case is that, I have a long running process (orchestrator) with
>>> multiple tasks, some tasks might require extra spark dependencies. It seems
>>> once the spark context is started it's not possible to update
>>> `spark.jars.packages`? I have reported an issue at
>>> https://issues.apache.org/jira/browse/SPARK-38438, together with a
>>> workaround ("hard reset of the cluster"). I wonder if anyone has a solution
>>> for this?
>>> Cheers - Rafal
>>>
>>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Thanks Sean,
To be clear, if you prefer to change the label on this issue from bug to
something else, feel free to do so; no strong opinions on my end. What
happens to the classpath, and whether Spark uses some classloader magic, is
probably an implementation detail. That said, it's definitely not intuitive
that you can change the configuration and get the context back (with the
updated config) without any warnings/errors. Also, what would you recommend
as a workaround or solution to this problem? Any comments on the workaround
in the issue? Keep in mind that I can't restart the long-running
orchestration process (a Python process, if that matters).
Cheers - Rafal

On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:

> That isn't a bug - you can't change the classpath once the JVM is
> executing.
>
> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła  wrote:
>
>> Hi,
>> My use case is that, I have a long running process (orchestrator) with
>> multiple tasks, some tasks might require extra spark dependencies. It seems
>> once the spark context is started it's not possible to update
>> `spark.jars.packages`? I have reported an issue at
>> https://issues.apache.org/jira/browse/SPARK-38438, together with a
>> workaround ("hard reset of the cluster"). I wonder if anyone has a solution
>> for this?
>> Cheers - Rafal
>>
>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Sean Owen
That isn't a bug - you can't change the classpath once the JVM is executing.

On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła  wrote:

> Hi,
> My use case is that, I have a long running process (orchestrator) with
> multiple tasks, some tasks might require extra spark dependencies. It seems
> once the spark context is started it's not possible to update
> `spark.jars.packages`? I have reported an issue at
> https://issues.apache.org/jira/browse/SPARK-38438, together with a
> workaround ("hard reset of the cluster"). I wonder if anyone has a solution
> for this?
> Cheers - Rafal
>