[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976902#comment-15976902 ] Michael Schmeißer edited comment on SPARK-650 at 4/20/17 3:34 PM:
--
In a nutshell, we have our own class {{MySerializer}}, which is derived from {{org.apache.spark.serializer.JavaSerializer}} and performs our custom initialization in {{MySerializer#newInstance}} before calling the super method {{org.apache.spark.serializer.JavaSerializer#newInstance}}. Then, when building the SparkConf to initialize the SparkContext, we add {{pSparkConf.set("spark.closure.serializer", MySerializer.class.getCanonicalName());}}. We package this with our application JAR and it works. So I think you have to look at your classpath configuration, [~mboes]. In our case, the JAR which contains the closure serializer is listed in the following properties (we use Spark 1.5.0 on YARN in cluster mode):
* driver.extraClassPath
* executor.extraClassPath
* yarn.secondary.jars
* spark.yarn.secondary.jars
* spark.driver.extraClassPath
* spark.executor.extraClassPath

If I recall correctly, the variants without the "spark." prefix are produced by us: we prefix all of our properties with "spark." to transfer them via Oozie and unmask them again later, so you should only need the properties with the "spark." prefix.

Regarding the questions of [~riteshtijoriwala]:
1) Please see the related issue SPARK-1107.
2) You can add a TaskCompletionListener with {{org.apache.spark.TaskContext#addTaskCompletionListener(org.apache.spark.util.TaskCompletionListener)}}. To get the current TaskContext on the executor, just use {{org.apache.spark.TaskContext#get}}. We also have some functionality to log the progress of a function at fixed intervals (e.g. every 1,000 records); to do this, you can use mapPartitions with a custom iterator.
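For illustration, a minimal sketch of this approach (assuming the Spark 1.x API, where {{JavaSerializer}} takes a {{SparkConf}} and {{newInstance}} can be overridden; {{MySerializer}} and {{initializeExecutor}} are placeholder names, not our actual code):

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.JavaSerializer;
import org.apache.spark.serializer.SerializerInstance;

// Sketch: a closure-serializer subclass that runs one-time setup in each
// JVM before delegating to the standard Java serialization.
public class MySerializer extends JavaSerializer {

    private static volatile boolean initialized = false;

    public MySerializer(SparkConf conf) {
        super(conf);
    }

    @Override
    public SerializerInstance newInstance() {
        // Double-checked locking so the setup runs exactly once per JVM,
        // i.e. once on each executor (and once on the driver).
        if (!initialized) {
            synchronized (MySerializer.class) {
                if (!initialized) {
                    initializeExecutor();
                    initialized = true;
                }
            }
        }
        return super.newInstance();
    }

    private static void initializeExecutor() {
        // placeholder: configure logging/MDC, metrics, or similar per-JVM state
    }
}
{code}

Registration when building the SparkConf:

{code:java}
SparkConf pSparkConf = new SparkConf()
    .set("spark.closure.serializer", MySerializer.class.getCanonicalName());
{code}

And for 2), a hedged sketch of the progress-logging iterator combined with a task-completion hook (again illustrative names only; it must be constructed inside a task, e.g. within mapPartitions, so that {{TaskContext.get()}} is non-null):

{code:java}
import java.util.Iterator;
import org.apache.spark.TaskContext;
import org.apache.spark.util.TaskCompletionListener;

// Wraps the partition iterator handed to mapPartitions, logging progress
// every 1,000 records; a completion listener handles per-task teardown.
public class ProgressIterator<T> implements Iterator<T> {

    private final Iterator<T> delegate;
    private long count = 0;

    public ProgressIterator(Iterator<T> delegate) {
        this.delegate = delegate;
        // Register teardown for when the surrounding task finishes.
        TaskContext.get().addTaskCompletionListener(new TaskCompletionListener() {
            @Override
            public void onTaskCompletion(TaskContext context) {
                System.out.println("Task finished after " + count + " records");
            }
        });
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        T element = delegate.next();
        if (++count % 1000 == 0) {
            System.out.println("Processed " + count + " records"); // or use a logger
        }
        return element;
    }
}
{code}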
> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Matei Zaharia
> Priority: Minor
>
> Would be useful to configure things like reporting libraries
[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304 ] Ryan Williams edited comment on SPARK-650 at 4/10/17 6:42 PM:
--
Both suggested workarounds here are lacking or broken / actively harmful, afaict, and the use case is real and valid. The ADAM project struggled for >2 years with this problem:
- [a 3rd-party {{OutputFormat}} required this field to be set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver and needs to somehow be sent to, and set in, each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146] actually added [pernicious, non-deterministic bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677]. In general, Spark seems to provide no guarantee that ≥1 tasks will get scheduled on each executor in such a situation:
- in the above, node locality resulted in some executors being missed
- dynamic allocation also allows executors to come online later and never be initialized

(A sketch of this fragile pattern follows after this comment.)

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to each executor? Maybe I've missed this in the discussion above. In the end, ADAM decided to write the object to a file and route that file's path to the {{OutputFormat}} via a Hadoop configuration value, which is pretty inelegant.

h4. Another use case

I have another need for this atm where regular lazy object initialization is also insufficient: [due to a rough edge in Scala programs' classloader configuration, {{FileSystemProvider}}s in user JARs are not loaded properly|https://github.com/scala/bug/issues/10247]. [A workaround discussed in the 1st post on that issue fixes the problem|https://github.com/hammerlab/spark-commands/blob/1.0.3/src/main/scala/org/hammerlab/commands/FileSystems.scala#L8-L20], but it needs to run before {{FileSystemProvider.installedProviders}} is first called on the JVM, which can be triggered by numerous {{java.nio.file}} operations. I don't see a clear way to work in code that will always lazily call my {{FileSystems.load}} function on each executor, let alone ensure that it happens before any code in the JAR calls e.g. {{Paths.get}}.
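To make the pitfall concrete, here is a hypothetical sketch of the dummy-job pattern (using {{foreachPartition}}; {{SomeOutputFormat}} stands in for the 3rd-party class with the static field, and all names are illustrative, not ADAM's actual code). It is shown to illustrate why the pattern is unreliable, not as a fix:

{code:java}
import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Stand-in for the 3rd-party class whose static field must be set per JVM.
class SomeOutputFormat {
    static volatile String header;
    static void setHeader(String h) { header = h; }
}

public class DummyInitJob {
    public static void unreliableInit(JavaSparkContext sc, String headerValue) {
        Broadcast<String> header = sc.broadcast(headerValue);
        // There is no reliable way to enumerate executors, so over-provision
        // partitions and hope every executor runs at least one of them.
        int guessedExecutors = 10;
        sc.parallelize(Collections.nCopies(guessedExecutors * 4, 0), guessedExecutors * 4)
          .foreachPartition((Iterator<Integer> it) -> {
              // Hoped-for effect: runs at least once on every executor JVM.
              // In practice, locality-aware scheduling may skip executors, and
              // dynamic allocation can add executors after this job finishes.
              SomeOutputFormat.setHeader(header.value());
          });
    }
}
{code}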
[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578914#comment-15578914 ] Lars Francke edited comment on SPARK-650 at 10/15/16 11:22 PM:
---
I can only come up with three reasons at the moment. I hope they all make sense.
1) Singletons/static initialisers run once per classloader in which the class is loaded/used. I haven't actually seen this be a problem (and it might actually be desired behaviour in this case), but making the init step explicit would prevent it from ever becoming one.
2) I'd like to fail fast for some things, not upon first access (which might be behind a conditional somewhere).
3) It is hard enough to reason about where some piece of code runs as it is (driver or task/executor); adding singletons to the mix makes it even more confusing.
Thank you for reopening!
[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561958#comment-15561958 ] Michael Schmeißer edited comment on SPARK-650 at 10/10/16 10:50 AM:
--
To me, the two seem related, but not exact duplicates. SPARK-636 seems to aim for a more generic mechanism.