[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-04-20 Thread Michael Schmeißer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976902#comment-15976902
 ] 

Michael Schmeißer edited comment on SPARK-650 at 4/20/17 3:34 PM:
--

In a nutshell, we have our own class "MySerializer" which is derived from 
{{org.apache.spark.serializer.JavaSerializer}} and performs our custom 
initialization in {{MySerializer#newInstance}} before calling the super method 
{{org.apache.spark.serializer.JavaSerializer#newInstance}}. Then, when building 
the SparkConf for initialization of the SparkContext, we add 
{{pSparkConf.set("spark.closure.serializer", 
MySerializer.class.getCanonicalName());}}.
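
In Java, a minimal sketch of this setup looks roughly as follows ({{ExecutorSetup.initializeOnce()}} is a placeholder for whatever initialization you need, not part of Spark; the approach relies on {{spark.closure.serializer}} being configurable, as it is in Spark 1.x):

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.JavaSerializer;
import org.apache.spark.serializer.SerializerInstance;

// Sketch only: ExecutorSetup is a hypothetical application class.
public class MySerializer extends JavaSerializer {

    // Spark instantiates the configured serializer reflectively, passing the SparkConf.
    public MySerializer(SparkConf conf) {
        super(conf);
    }

    @Override
    public SerializerInstance newInstance() {
        // Runs wherever the closure serializer is used, in particular on each
        // executor before task closures are deserialized there. newInstance()
        // can be called many times per JVM, so the setup must be idempotent.
        ExecutorSetup.initializeOnce();
        return super.newInstance();
    }
}
{code}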

We package this with our application JAR and it works, so I think you have to 
look at your classpath configuration, [~mboes]. In our case, the JAR which 
contains the closure serializer is listed in the following properties (we use 
Spark 1.5.0 on YARN in cluster mode):
* driver.extraClassPath
* executor.extraClassPath
* yarn.secondary.jars
* spark.yarn.secondary.jars
* spark.driver.extraClassPath
* spark.executor.extraClassPath

If I recall correctly, the variants without the "spark." prefix are produced 
by us: we prefix all of our properties with "spark." to transfer them via 
Oozie and unmask them again later. So you should only need the properties 
with the "spark." prefix.

Regarding [~riteshtijoriwala]'s questions: 1) Please see the related issue 
SPARK-1107. 2) You can add a TaskCompletionListener with 
{{org.apache.spark.TaskContext#addTaskCompletionListener(org.apache.spark.util.TaskCompletionListener)}}. 
To get the current TaskContext on the executor, just use 
{{org.apache.spark.TaskContext#get}}. We have some functionality to log the 
progress of a function at fixed intervals (e.g. every 1,000 records). To do 
this, you can use mapPartitions with a custom iterator; a sketch follows below.
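
A rough Java sketch of such an iterator ({{ProgressIterator}} is just an illustrative name, not a Spark class). You return it from the function passed to {{mapPartitions}} (Iterator in and out in Scala; the Spark 1.x Java API expects an {{Iterable}}, so wrap it accordingly):

{code:java}
import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.util.TaskCompletionListener;

// Illustrative helper: wraps a partition's iterator, logs every `interval`
// records and reports when the task finishes.
public final class ProgressIterator<T> implements Iterator<T> {

    private final Iterator<T> delegate;
    private final long interval;
    private long count = 0L;

    public ProgressIterator(Iterator<T> delegate, long interval) {
        this.delegate = delegate;
        this.interval = interval;
        // TaskContext.get() is only valid on an executor, inside a running task.
        TaskContext.get().addTaskCompletionListener(new TaskCompletionListener() {
            @Override
            public void onTaskCompletion(TaskContext context) {
                System.out.println("Partition " + context.partitionId()
                    + " finished after " + count + " records");
            }
        });
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        T record = delegate.next();
        if (++count % interval == 0) {
            // In real code this would go to a logging framework.
            System.out.println("Partition " + TaskContext.get().partitionId()
                + ": processed " + count + " records");
        }
        return record;
    }
}
{code}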


> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-04-10 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304
 ] 

Ryan Williams edited comment on SPARK-650 at 4/10/17 6:42 PM:
--

Both suggested workarounds here are lacking or broken / actively harmful, 
afaict, and the use case is real and valid.

The ADAM project struggled for >2 years with this problem:

- [a 3rd-party {{OutputFormat}} required this field to be 
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver, and needs to somehow be 
sent to and set in each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} 
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
 actually added [pernicious, non-deterministic 
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].
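
(Roughly, the hack looks like the Java sketch below; {{StaticHeaderHolder}} and {{MyHeader}} are placeholder names for the 3rd-party static field and the driver-computed value, not the real ADAM/Hadoop-BAM classes.)

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;

public final class DummyInitJob {

    // Sketch of the "dummy job" initialization hack. MyHeader must be
    // Serializable because it is captured by the closure.
    public static void run(JavaSparkContext sc, final MyHeader header) {
        // Oversubscribe and hope that every executor receives at least one partition.
        int numPartitions = sc.defaultParallelism() * 4;
        List<Integer> dummy = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            dummy.add(i);
        }
        sc.parallelize(dummy, numPartitions).foreachPartition(partition -> {
            // Runs only on the executors that happen to be assigned a partition;
            // Spark does not guarantee that every executor gets one.
            StaticHeaderHolder.set(header);
        });
    }
}
{code}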

In general Spark seems to provide no guarantees that ≥1 tasks will get 
scheduled on each executor in such a situation:

- in the above, node locality resulted in some executors being missed
- dynamic-allocation also offers chances for executors to come online later and 
never be initialized

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to 
each executor? Maybe I've missed this in the discussion above.
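
(For reference, the pattern that usually seems to be meant is something like the Java sketch below: broadcast the driver-computed value and initialize a static holder lazily inside each task. {{HeaderHolder}} and {{MyHeader}} are made-up names, and the approach only helps if every closure remembers to call it, which is exactly the problem in cases like the one under "Another use case" below.)

{code:java}
import org.apache.spark.broadcast.Broadcast;

// Illustrative lazy "singleton" holder, initialized at most once per executor JVM.
public final class HeaderHolder {

    private static volatile MyHeader header;

    // Every closure must call this before using get().
    public static void ensureInitialized(Broadcast<MyHeader> broadcast) {
        if (header == null) {
            synchronized (HeaderHolder.class) {
                if (header == null) {
                    header = broadcast.value();
                }
            }
        }
    }

    public static MyHeader get() {
        return header;
    }
}
{code}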

In the end, ADAM decided to write the object to a file and route that file's 
path to the {{OutputFormat}} via a hadoop configuration value, which is pretty 
inelegant.

h4. Another use case

I have another need for this atm where regular lazy-object-initialization is 
also insufficient: [due to a rough edge in Scala programs' classloader 
configuration, {{FileSystemProvider}}s in user JARs are not loaded 
properly|https://github.com/scala/bug/issues/10247].

[A workaround discussed in the 1st post on that issue fixes the 
problem|https://github.com/hammerlab/spark-commands/blob/1.0.3/src/main/scala/org/hammerlab/commands/FileSystems.scala#L8-L20],
 but needs to be run before {{FileSystemProvider.installedProviders}} is first 
called on the JVM, which can be triggered by numerous {{java.nio.file}} 
operations.

I don't see a clear way to work in code that will always lazily call my 
{{FileSystems.load}} function on each executor, let alone ensure that it 
happens before any code in the JAR calls e.g. {{Paths.get}}.


[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578914#comment-15578914
 ] 

Lars Francke edited comment on SPARK-650 at 10/15/16 11:22 PM:
---

I can only come up with three reasons at the moment. I hope they all make sense.

1) Singletons/static initialisers run once per ClassLoader in which the class is 
loaded/used. I haven't actually seen this being a problem (and it might 
actually be desired behaviour in this case), but making the init step explicit 
would prevent it from ever becoming one.
2) I'd like to fail fast for some things and not upon first access (which might 
be behind a conditional somewhere).
3) It is hard enough to reason about where some piece of code is running as it 
is (driver or task/executor); adding singletons to the mix makes it even more 
confusing.


Thank you for reopening!


[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-10 Thread Michael Schmeißer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561958#comment-15561958
 ] 

Michael Schmeißer edited comment on SPARK-650 at 10/10/16 10:50 AM:


To me, the two seem related, but not exact duplicates. SPARK-636 seems to aim 
for a more generic mechanism.

