[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578856#comment-16578856 ] Imran Rashid commented on SPARK-650:
------------------------------------

Folks may be interested in SPARK-24918. Perhaps one should be closed as a duplicate of the other, but for now there is discussion on both, so I'll leave them open for the time being.

> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
>          Key: SPARK-650
>          URL: https://issues.apache.org/jira/browse/SPARK-650
>      Project: Spark
>   Issue Type: New Feature
>   Components: Spark Core
>     Reporter: Matei Zaharia
>     Priority: Minor
>
> Would be useful to configure things like reporting libraries

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517398#comment-16517398 ] Avi minsky commented on SPARK-650:
----------------------------------

We encountered an issue with the combination of lazy static loading and speculation. Because speculation kills tasks, a task may be killed while it is lazily loading static classes, which leaves those classes unusable; the whole application can later fail with a NoClassDefFoundError.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511033#comment-16511033 ] Sina Madani commented on SPARK-650:
-----------------------------------

I too have this problem. Apache Flink solves it quite nicely by having "RichFunction" variants of operations like map, filter, reduce etc. A RichFunction, such as RichMapFunction, provides open(Configuration parameters) and close() methods which can be used to run setup and teardown code once per worker, and also to initialise the worker from primitive key-value pairs.
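For reference, the Flink pattern described above looks roughly like this (a sketch against Flink's RichMapFunction API; `initReportingLib` and `shutdownReportingLib` are hypothetical per-worker setup and teardown routines):

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class MyMapper extends RichMapFunction[String, Int] {

  // Called once per parallel task instance, before any element is processed.
  override def open(parameters: Configuration): Unit = {
    initReportingLib(parameters)  // hypothetical setup routine
  }

  // Called once per parallel task instance, after the last element.
  override def close(): Unit = {
    shutdownReportingLib()  // hypothetical teardown routine
  }

  override def map(value: String): Int = value.length
}
```

The point of comparison is that the lifecycle methods are part of the operator API itself, so no discipline about calling an init method first is required of the function author.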
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239991#comment-16239991 ] quang nguyen commented on SPARK-650:
------------------------------------

Hi, we have an application that runs on a Spark cluster against a secured (Kerberos) HDFS. Because Spark did not yet support Kerberos in this setup, it would be convenient for us if Spark supported a setup hook to log in as a user on each executor. Can you suggest another solution for us? (Running in YARN mode isn't an option.)
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158202#comment-16158202 ] yiming.xu commented on SPARK-650:
---------------------------------

I need a hook too. In some cases we need to initialize something, much like a Spring init bean. :(
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117114#comment-16117114 ] Sean Owen commented on SPARK-650:
---------------------------------

I can also imagine cases involving legacy code that make this approach hard to implement. Still, it's possible with enough 'discipline', but this is true of wrangling any legacy code.

I don't think the question of semantics is fully appreciated here. Is killing the app's other tasks on the same executor reasonable behavior? How many failures are allowed by default by this new mechanism? What do you do if init never returns, and for how long do you wait? Are you willing to reschedule the task on another executor? How does it interact with locality? I know, any change raises questions, but this one raises a lot. It's a conceptual change in Spark and I'm just sure it's not going to happen 3 years in. Tasks have never had special status or lifecycle w.r.t. executors, and that's a positive thing, really.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117090#comment-16117090 ] Louis Bergelson commented on SPARK-650:
---------------------------------------

[~srowen] Thanks for the reply and the example. Unfortunately, I still believe that the singleton approach doesn't work well for our use case. We don't have a single resource which needs initialization and can always be wrapped in a singleton. We have a sprawl of legacy dependencies that need to be initialized in certain ways before use, and that can then be called into from literally hundreds of entry points. One of the things that needs initializing is the set of FileSystemProviders that [~rdub] mentioned above. This has to be done before potentially any file access in our dependencies. It's implausible to wrap all of our library code in singleton objects, and it's difficult to always call initResources() before every library call. It requires a lot of discipline on the part of the developers. Since we develop a framework for biologists to use to write tools, anything that has to be enforced by convention isn't ideal and is likely to cause problems. People will forget to start their work by calling initResources(), or worse, they'll remember to call initResources() but only at the start of the first stage. Then they'll run into issues when executors die and are replaced during a later stage and the initialization doesn't run on the new executor.

For something that could be cleanly wrapped in a singleton I agree that the semantics are obvious, but for the case where you're calling init() before running your code, the semantics are confusing and error-prone.

I'm sure there are complications from introducing a setup hook, but the one you mention seems simple enough to me. If a setup fails, that executor is killed and can't schedule tasks. There would probably have to be a mechanism for timing out after a certain number of failed executor starts, but I suspect that something like that already exists for other sorts of failures.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110506#comment-16110506 ] Sean Owen commented on SPARK-650:
---------------------------------

Are you looking for an example of how it works? Something like this, for what I assume is the common case of initializing a connection to an external resource:

{code}
val config = ...

df.mapPartitions { it =>
  MyResource.initIfNeeded(config)
  it.map(...)
}

...

object MyResource {
  private var initted = false

  def initIfNeeded(config: Config): Unit = this.synchronized {
    if (!initted) {
      initializeResource(config)
      initted = true
    }
  }
}
{code}

If config is big, or tricky to pass around, it too can be read directly from a location, or wrapped up in some object in your code. It can actually be:

{code}
df.mapPartitions { it =>
  MyResource.initIfNeeded()
  it.map(...)
}

...

object MyResource {
  private var initted = false

  def initIfNeeded(): Unit = this.synchronized {
    if (!initted) {
      val config = getConf()
      initializeResource(config)
      initted = true
    }
  }
}
{code}

You get the idea. This is not a special technique, not even really singletons -- just a method that executes the first time it's called and then does nothing afterwards. If you don't like having to call initResource, call it in whatever code produces the resource connection.

We can imagine objections and answers like this all day, I'm sure. I think it covers all the use cases a setup hook does, so the question is just: is it easy enough? You're saying it's unusably hard, and proposing a hack on the serializer that sounds much more error-prone. I just cannot agree with this. This is much simpler than the other solutions people are arguing against here, which I also think are too complex. Was it just a misunderstanding of the proposal?

[~lou...@broadinstitute.org] have you considered the implications of the semantics of a setup hook? For example, if setup fails on an executor, can you schedule a task that needed it? How do you track that? Here, the semantics are obvious.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110006#comment-16110006 ] Michael Schmeißer commented on SPARK-650:
-----------------------------------------

Please see my comment from 05/Dec/16 12:39 and the following discussion - we are kind of going in circles here. I tried to explain, as well as I can, the (real) problems we were facing, which solution we applied to them, and why other solutions were dismissed. The fact is: there are numerous people here who seem to have the same issues and are glad to apply the workaround, because "using the singleton" doesn't seem to provide a solution to them either. Perhaps we all don't understand how to do this, but then again there seems to be something missing - at least documentation, doesn't it? What I can tell you in addition is that we have put experienced developers on the topic, developers who have used quite a few singletons.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109311#comment-16109311 ] Sean Owen commented on SPARK-650:
---------------------------------

I still don't see an argument against my primary suggestion: the singleton. The last comment on it just asked how to do it - and it's quite possible. Nothing to do with the serializer.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109179#comment-16109179 ] Louis Bergelson commented on SPARK-650:
---------------------------------------

I can't understand how people are dismissing this as a non-issue. There are many cases where you need to initialize something on an executor, and many of them need input from the driver. All of the given workarounds are terrible hacks that at best force bad design and at worst introduce confusing, non-deterministic bugs. Any time the recommended solution to a common problem is to abuse the Serializer in order to trick it into executing non-serialization code, it seems obvious that there's a missing capability in the system. The fact that executors can come online and go offline at any time during the run makes it especially essential that we have a robust way of initializing them. I just really don't understand the opposition to adding an initialization hook; it would solve so many problems in a clean way and doesn't seem like it would be particularly problematic on its own.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054162#comment-16054162 ] Michael Schmeißer commented on SPARK-650:
-----------------------------------------

[~riteshtijoriwala] - Sorry, but I am not familiar with Spark 2.0.0 yet. What I can say is that we have raised a Cloudera support case to address this issue, so maybe we can expect some help from that side.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053745#comment-16053745 ] Ritesh Tijoriwala commented on SPARK-650:
-----------------------------------------

[~Skamandros] - Any similar tricks for Spark 2.0.0? I see the config option to set the closure serializer has been removed - https://issues.apache.org/jira/browse/SPARK-12414. Currently we do a set of different things to ensure our classes are loaded/instantiated before Spark starts execution of its stages. It would be nice to consolidate this in one place/hook.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976902#comment-15976902 ] Michael Schmeißer commented on SPARK-650:
-----------------------------------------

In a nutshell, we have our own class `MySerializer` which is derived from `org.apache.spark.serializer.JavaSerializer` and performs our custom initialization in `MySerializer#newInstance` before calling the super method `JavaSerializer#newInstance`. Then, when building the SparkConf for initialization of the SparkContext, we add `pSparkConf.set("spark.closure.serializer", MySerializer.class.getCanonicalName());`. We package this with our application JAR and it works. So I think you have to look at your classpath configuration [~mboes]. In our case, the JAR which contains the closure serializer is listed in the following properties:

* driver.extraClassPath
* executor.extraClassPath
* yarn.secondary.jars
* spark.yarn.secondary.jars
* spark.driver.extraClassPath
* spark.executor.extraClassPath

If I recall correctly, the variants without the "spark." prefix are produced by us, because we prefix all of our properties with "spark." to transfer them via Oozie and unmask them again later - so you should only need the properties with the "spark." prefix.

Regarding the questions of [~riteshtijoriwala]:
1) Please see the related issue SPARK-1107.
2) You can add a TaskCompletionListener with `org.apache.spark.TaskContext#addTaskCompletionListener(org.apache.spark.util.TaskCompletionListener)`. To get the current TaskContext on the executor, just use `org.apache.spark.TaskContext#get`. We have some functionality to log the progress of a function at fixed intervals (e.g. every 1,000 records). To do this, you can use mapPartitions with a custom iterator.
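A minimal sketch of that serializer hook, as I understand the description (illustrative only, not the poster's actual code; `MyInit.ensureInitialized` is a hypothetical idempotent setup routine, and note that the `spark.closure.serializer` setting this relies on was removed in Spark 2.0 by SPARK-12414):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{JavaSerializer, SerializerInstance}

// A JavaSerializer subclass whose newInstance() runs executor-side
// initialization before delegating to the normal serializer. Spark
// instantiates the closure serializer on each executor, so the hook
// fires there as a side effect of deserializing the first closure.
class MySerializer(conf: SparkConf) extends JavaSerializer(conf) {
  override def newInstance(): SerializerInstance = {
    MyInit.ensureInitialized()  // hypothetical; must be idempotent
    super.newInstance()
  }
}

// Registered when building the SparkConf (pre-2.0 only):
//   sparkConf.set("spark.closure.serializer", classOf[MySerializer].getCanonicalName)
```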
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969422#comment-15969422 ] Ritesh Tijoriwala commented on SPARK-650:
-----------------------------------------

[~Skamandros] - I would also like to know about hooking 'JavaSerializer'. I have a similar use case where I need to initialize a set of objects/resources on each executor. I would also like to know if anybody has a way to hook into some "clean up" on each executor 1) when the executor shuts down and 2) when a batch finishes, before the next batch starts.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967554#comment-15967554 ] Mathieu Boespflug commented on SPARK-650:
-----------------------------------------

[~Skamandros] how did you manage to hook `JavaSerializer`? I tried doing so myself, by defining a new subclass, but then I need to make sure that the new class is installed on all executors - meaning I have to copy a .jar to all my nodes manually. For some reason Spark won't look for the serializer inside my application JAR.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304 ] Ryan Williams commented on SPARK-650:
-------------------------------------

Both suggested workarounds here are lacking or broken / actively harmful, afaict, and the use case is real and valid. The ADAM project struggled for >2 years with this problem:

- [a 3rd-party {{OutputFormat}} required this field to be set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver, and needs to somehow be sent to and set in each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146] actually added [pernicious, non-deterministic bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677]. In general, Spark seems to provide no guarantee that ≥1 task will get scheduled on each executor in such a situation:

- in the above, node locality resulted in some executors being missed
- dynamic allocation also offers chances for executors to come online later and never be initialized

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to each executor? Maybe I've missed this in the discussion above. In the end, ADAM decided to write the object to a file and route that file's path to the {{OutputFormat}} via a Hadoop configuration value, which is pretty inelegant.
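For concreteness, the dummy-job pattern criticized above reads roughly like this (an illustrative sketch, not ADAM's actual code; `setStaticField` stands in for the hypothetical executor-side setter and `driverComputedValue` for the driver-side value):

```scala
// The fragile "run once per executor" pattern -- shown to illustrate the
// problem, not as a recommendation.
val valueBc = sc.broadcast(driverComputedValue)

// Guess the executor count and hope one partition lands on each executor.
val numExecutors = sc.getExecutorMemoryStatus.size - 1
sc.parallelize(0 until numExecutors, numExecutors).foreachPartition { _ =>
  setStaticField(valueBc.value)  // hypothetical static setter
}
// Node locality can leave some executors with no partition at all, and
// dynamic allocation can add executors later that never run this job.
```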
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731628#comment-15731628 ] Michael Schmeißer commented on SPARK-650:
-----------------------------------------

No, it's not just about propagating information - some code actually needs to be run. We have some static utilities which need to be initialized, but they don't know anything about Spark; they are provided by external libraries. Thus, we need to actually trigger the initialization on all executors. The only other way that I see is to wrap all access to those external utilities in something on our side that is Spark-aware and initializes them if needed. But compared to that, I think our current solution is better.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725790#comment-15725790 ] Herman van Hovell commented on SPARK-650:
-----------------------------------------

A creatively applied broadcast variable might also do the trick, BTW.
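One way to read that suggestion (a sketch, with `Config` and `initializeResource` as assumed names): broadcast the driver-computed configuration once, and guard the real work behind an init-once singleton so it runs at most once per executor JVM, whenever the first task touches it.

```scala
object ExecutorInit {
  @volatile private var done = false

  // Idempotent, thread-safe init; the double-check keeps the fast path lock-free.
  def ensure(config: Config): Unit = if (!done) synchronized {
    if (!done) {
      initializeResource(config)  // hypothetical one-time setup
      done = true
    }
  }
}

// Driver side: ship the config to every executor once.
val configBc = sc.broadcast(config)

rdd.mapPartitions { it =>
  // Executor side: the first task to arrive performs the initialization.
  ExecutorInit.ensure(configBc.value)
  it
}
```

This still shares the limitation discussed elsewhere in the thread: an executor is only initialized once a task actually runs on it.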
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725784#comment-15725784 ] Herman van Hovell commented on SPARK-650:
-----------------------------------------

If you are only trying to propagate information, then you can use SparkContext.localProperties and the TaskContext on the executor side. They provide the machinery to do this.
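A sketch of that mechanism (the property key and usage are illustrative): local properties set on the driver thread travel with each task and can be read back from the TaskContext on the executor.

```scala
import org.apache.spark.TaskContext

// Driver side: attach a property to all jobs submitted from this thread.
sc.setLocalProperty("myapp.reporting.endpoint", "collector.example.com:9999")

rdd.foreachPartition { _ =>
  // Executor side: read the property from the currently running task.
  val endpoint = TaskContext.get().getLocalProperty("myapp.reporting.endpoint")
  // ... configure the (hypothetical) reporting library with `endpoint` ...
}
```

As with the broadcast approach, this only propagates data; it does not by itself run initialization code on executors that never receive a task.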
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725542#comment-15725542 ] Michael Schmeißer commented on SPARK-650:
-----------------------------------------

Sure, it can be included in the closure, and this was also our first solution to the problem. But if the application has many layers and you often need the resource which requires info X to initialize, it soon gets very inconvenient, because you have to pass X around a lot and pollute your APIs.

Our next solution was therefore to create a base function class which takes X in its constructor and makes sure that the resource is initialized on the executor side if it wasn't before. The drawback of this solution is that a function developer can forget to extend the function base class, and then he may or may not be able to access the resource, depending on whether a function which performed the initialization has run before on that executor. This is really error-prone (it actually led to errors) and, even if done correctly, prevents lambdas from being used for functions.

As a result, we now use the "empty RDD" approach or piggy-back on the Spark JavaSerializer. Both work fine and initialize the executor-side resource properly on all executors. From a function developer's point of view that's nice, but overall the solution relies on Spark internals to work, which is why I would rather have an explicit mechanism to perform such an initialization.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723170#comment-15723170 ] Sean Owen commented on SPARK-650:
---------------------------------

Why? Info X can be included in the closure, and the executor can call "singleton.getInstance(X)" to pass this info. Init happens only once in any event.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722178#comment-15722178 ] Michael Schmeißer commented on SPARK-650:
-----------------------------------------

Thanks [~robert.neumann]! I am ready to help if I can.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722170#comment-15722170 ] Michael Schmeißer commented on SPARK-650: - A singleton is not really feasible if additional information is required which is known (or determined) by the driver and thus needs to be sent to the executors for the initialization to happen. In this case, the options are 1) use some side-channel that is "magically" inferred by the executor, 2) use an empty RDD, repartition it to the number of executors and run mapPartitions on it, 3) piggy-back the JavaSerializer to run the initialization before any function is called or 4) require every function which may need the resource to initialize it on its own. Each of these options has significant drawbacks in my opinion. While 4 sounds good for most cases, it has some cons which I've described earlier (my comment from Oct 16) and which make it unfeasible for our use-case. Option 1 might be possible, but the data flow wouldn't be all that obvious. Right now, we go with a mix of option 2 and 3 (try to determine the number of executors and if you can't, hijack the serializer), but really, this is hacky and might break in future releases of Spark.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714957#comment-15714957 ] Robert Neumann commented on SPARK-650: -- OK. Will do.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714819#comment-15714819 ] Herman van Hovell commented on SPARK-650: - [~lars_francke][~Skamandros][~rneumann] If you think that this is an important feature, then write a design doc and open a PR.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714762#comment-15714762 ] Robert Neumann commented on SPARK-650: -- Sean, I agree this is the essential question in this thread. If we get this sorted out, then we are good and can achieve consensus on what to do with this ticket. A singleton "works" indeed. However, from a software engineering point of view it is not nice. There exists a class of Spark Streaming jobs that requires "setup -> do -> cleanup" semantics. The framework (in this case Spark Streaming) should explicitly support these semantics through appropriate API hooks. A singleton instead would hide these semantics and you would need to implement some lazy code to check whether an HBase connection was already set up or not; the singleton would need to do this for every write operation to HBase. I do not think that application logic (the Singleton within the Spark Streaming job) is the right place to wire in the "setup -> do -> cleanup" pattern. It is a generic pattern and there exists a class of Spark Streaming jobs (not only one specific Streaming job) that are based on this pattern.
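A minimal sketch of the "setup -> do -> cleanup" boilerplate this comment objects to, in plain Java. `HBaseLikeConnection` is a hypothetical stand-in, not HBase's actual client API; cleanup is approximated with a JVM shutdown hook since no executor-lifecycle hook exists:

```java
// Hypothetical stand-in for a real connection; only the holder pattern is shown.
class HBaseLikeConnection {
    boolean open = true;
    void put(String row) { /* a write would happen here */ }
    void close() { open = false; }
}

class ConnectionHolder {
    private static volatile HBaseLikeConnection conn;

    // "setup": runs lazily, once per JVM (i.e. once per executor)
    static HBaseLikeConnection get() {
        if (conn == null) {
            synchronized (ConnectionHolder.class) {
                if (conn == null) {
                    HBaseLikeConnection c = new HBaseLikeConnection();
                    // "cleanup": best-effort close when the executor JVM exits
                    Runtime.getRuntime().addShutdownHook(new Thread(c::close));
                    conn = c;
                }
            }
        }
        return conn;
    }
}

public class SetupDoCleanupDemo {
    public static void main(String[] args) {
        // "do": every write re-checks the lazily created connection -- exactly
        // the per-operation check the comment says the framework should absorb.
        ConnectionHolder.get().put("row-1");
        ConnectionHolder.get().put("row-2");
        System.out.println(ConnectionHolder.get().open); // true
    }
}
```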
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714725#comment-15714725 ] Sean Owen commented on SPARK-650: - Why would a singleton not work? This is really the essential question in this thread.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714611#comment-15714611 ] Robert Neumann commented on SPARK-650: -- I am supporting Olivier Armand. We need a way in our Streaming job to setup an HBase connection per executor (and not per partition). A Singleton is not something we are looking at for this purpose.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581982#comment-15581982 ] Sean Owen commented on SPARK-650: - Yep, if you must pass some configuration, it generally can't happen magically at class-loading time. You can provide an "initIfNeeded(conf)" method that must be explicitly called in key places, but, that's simple and canonical Java practice. In your example, there's no need to do anything. Just use the info in the function the executor runs. It's passed in the closure. This is entirely normal Spark.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581971#comment-15581971 ] Michael Schmeißer commented on SPARK-650: - I agree that static initialization would solve the problem for cases where everything is known or can be loaded at class-loading time, e.g. from property files in the artifact itself. For situations like RecordReaders, it might also work, because they have an initialize method where they get contextual information that could have been enriched with the required values from the driver. However, we also have other cases, where information from the driver is needed. Imagine the following case: We have a temporary directory in HDFS which is determined by the Oozie workflow instance ID. The driver knows this information, because it is provided by Oozie via main method arguments. The executor needs this information as well, e.g. to load some data that is required to initialize a static context. Then, the question arises: How does the information get to the executor? Either with the function instance which would mean that the developer of the function needs to know that he has to call an initialization method in every function or at least in every first function on an RDD (which he probably doesn't know, because he received the RDD from a different part of the application). Or with an explicit mechanism which is executed before the developer functions run on any executor. Which would lead me again to the "empty RDD" workaround. 
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581774#comment-15581774 ] Sean Owen commented on SPARK-650: - BTW I am not suggesting an "empty RDD" for your case. That was specific to the streaming scenario. For this, again, why not just access some initialization method during class init of some class that is referenced wherever you want, including a custom InputFormat? This can be made to happen once per JVM (class loader), from any code, at class init time before anything else can happen. It's just a standard Java mechanism. If you mean it requires some configuration not available at class-loading time you can still make such an init take place wherever, as soon as, such configuration is available. Even in an InputFormat. Although I can imagine corner cases where this becomes hard, I think it's over-thinking this to imagine a whole new lifecycle method to accomplish what basic JVM mechanisms allow.
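The "standard Java mechanism" referred to here can be sketched as follows; the class and method names are illustrative only. A static initializer runs exactly once per class loader, at the first reference to the class, before any other use of the class can proceed:

```java
// Static initializer: runs once per JVM/class loader, on first reference.
class ExecutorSideInit {
    static int initRuns = 0;
    static {
        initRuns++; // one-time setup (logging, context, ...) would go here
    }
    static void touch() { } // any reference triggers class init first
}

public class StaticInitDemo {
    public static void main(String[] args) {
        ExecutorSideInit.touch(); // first reference: static block runs now
        ExecutorSideInit.touch(); // subsequent references: nothing re-runs
        System.out.println(ExecutorSideInit.initRuns); // 1
    }
}
```

The JVM also guarantees this is thread-safe: concurrent first references block until class initialization completes, which is why no explicit locking is needed for this variant.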
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580502#comment-15580502 ] Michael Schmeißer commented on SPARK-650: - What if I have a Hadoop InputFormat? Then, certain things happen before the first RDD exists, don't they? I'll give the solution with the empty RDD a shot next week, this sounds a little bit better than what we have right now, but it still relies on certain internals of Spark which are most likely undocumented and might change in future? I've had the feeling that Spark basically has a functional approach with the RDDs and executing anything on an empty RDD could be optimized to just do nothing?
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580135#comment-15580135 ] Sean Owen commented on SPARK-650: - But, why do you need to do it before you have an RDD? You can easily make this a library function. Or, just some static init that happens on demand whenever a certain class is loaded. The nice thing about that is that it's transparent, just like with any singleton / static init in the JVM. If you really want, you can make an empty RDD and repartition it and use that as a dummy, but it only serves to do some initialization early that would happen transparently anyway.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580069#comment-15580069 ] Michael Schmeißer commented on SPARK-650: - But I'll need to have an RDD to do this, I can't just do it during the SparkContext setup - right now, we have multiple sources of RDDs and every developer would still need to know that they have to run this code after creating an RDD, won't they? Or is there some way to use a "pseudo-RDD" right after creation of the SparkContext to execute the init code on the executors?
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580062#comment-15580062 ] Sean Owen commented on SPARK-650: - This is still easy to do with mapPartitions, which can call {{initWithTheseParamsIfNotAlreadyInitialized(...)}} once per partition, which should guarantee it happens once per JVM before anything else proceeds. I don't think you need to bury it in serialization logic. I can see there are hard ways to implement this, but I believe an easy way is still readily available within the existing API mechanisms.
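The guard behind the suggested once-per-partition call can be sketched in plain Java. The Spark plumbing is omitted here; the two threads below stand in for two partition tasks that happen to land in the same executor JVM, and all names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Idempotent init guard: cheap to call once per partition, but only the
// first caller in a given JVM actually performs the setup.
class PartitionInit {
    static final AtomicInteger initRuns = new AtomicInteger();
    private static volatile boolean initialized = false;

    static void initIfNeeded(String params) {
        if (!initialized) {
            synchronized (PartitionInit.class) {
                if (!initialized) {
                    initRuns.incrementAndGet(); // real setup would use params
                    initialized = true;
                }
            }
        }
    }
}

public class PerPartitionGuardDemo {
    public static void main(String[] args) throws InterruptedException {
        // Two concurrent "partition tasks" in one JVM:
        Runnable task = () -> PartitionInit.initIfNeeded("dir=/tmp/job");
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(PartitionInit.initRuns.get()); // 1
    }
}
```

In a real job, `initIfNeeded` would be the first statement inside the `mapPartitions` function, with `params` carried in by the closure.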
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580055#comment-15580055 ] Michael Schmeißer commented on SPARK-650: - Ok, let me explain the specific problems that we have encountered, which might help to understand the issue and possible solutions: We need to run some code on the executors before anything gets processed, e.g. initialization of the log system or context setup. To do this, we need information that is present on the driver, but not on the executors. Our current solution is to provide a base class for Spark function implementations which contains the information from the driver and initializes everything in its readObject method. Since multiple narrow-dependent functions may be executed on the same executor JVM subsequently, this class needs to make sure that initialization doesn't run multiple times. Sure, that's not hard to do, but if you mix setup and cleanup logic for functions, partitions and/or the JVM itself, it can get quite confusing without explicit hooks. So, our solution basically works, but with that approach, you can't use lambdas for Spark functions, which is quite inconvenient, especially for simple map operations. Even worse, if you use a lambda or otherwise forget to extend the required base class, the initialization doesn't occur and very weird exceptions follow, depending on which resource your function tries to access during its execution. Or if you have very bad luck, no exception will occur, but the log messages will get logged to an incorrect destination. It's very hard to prevent such cases without an explicit initialization mechanism and in a team with several developers, you can't expect everyone to know what is going on there. 
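The readObject-based base class described above can be sketched as follows. This is a simplification under assumptions: class and field names are hypothetical, and two plain serialization round-trips stand in for two tasks being deserialized in one executor JVM:

```java
import java.io.*;

// Base class whose readObject runs on the executor at deserialization time,
// so initialization can be hooked there, guarded against repeats per JVM.
abstract class InitializingFunction implements Serializable {
    static int initRuns = 0;              // static: per-JVM, not serialized
    private final String driverInfo;      // travels from driver to executor

    InitializingFunction(String driverInfo) { this.driverInfo = driverInfo; }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        synchronized (InitializingFunction.class) {
            if (initRuns == 0) {
                initRuns++; // e.g. configure logging using driverInfo
            }
        }
    }
}

class MyMapFunction extends InitializingFunction {
    MyMapFunction(String info) { super(info); }
}

public class ReadObjectInitDemo {
    public static void main(String[] args) throws Exception {
        // Two round-trips simulate two tasks arriving in one executor JVM.
        for (int i = 0; i < 2; i++) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(new MyMapFunction("workflow-instance-info"));
            oos.flush();
            new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
        }
        System.out.println(InitializingFunction.initRuns); // 1
    }
}
```

This also makes the failure mode described above concrete: a lambda never extends `InitializingFunction`, so its deserialization silently skips the hook.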
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579928#comment-15579928 ] Sean Owen commented on SPARK-650: - Yeah that's a decent use case, because latency is an issue (streaming) and you potentially have time to set up before latency matters. You can still use this approach because empty RDDs arrive if no data has, and empty RDDs can still be repartitioned. Here's a way, if the first RDD has no data, to do something once per partition, which ought to amount to at least once per executor:
{code}
var first = true
lines.foreachRDD { rdd =>
  if (first) {
    if (rdd.isEmpty) {
      rdd.repartition(sc.defaultParallelism).foreachPartition(_ => Thing.initOnce())
    }
    first = false
  }
}
{code}
"ought", because there isn't actually a guarantee that it will put the empty partitions on different executors. In practice, it seems to, when I just tried it. That's a partial solution, but it's an optimization anyway, and maybe it helps you right now. I am still not sure it means this needs a whole mechanism, if this is the only type of use case. Maybe there are others.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579737#comment-15579737 ] Olivier Armand commented on SPARK-650: -- Data doesn't necessarily arrive immediately, but we need to ensure that when it arrives, lazy initialization doesn't introduce latency.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579720#comment-15579720 ] Sean Owen commented on SPARK-650: - It would work in this case to immediately schedule initialization on the executors because it sounds like data arrives immediately in your case. The part I am missing is how it can occur faster than this with another mechanism.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579710#comment-15579710 ] Olivier Armand commented on SPARK-650: --
> "just run a dummy mapPartitions at the outset on the same data that the first job would touch"
But this wouldn't work for Spark Streaming? (our case).
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579630#comment-15579630 ] Sean Owen commented on SPARK-650: - Reopening doesn't do anything by itself, or cause anyone to consider this. If this just sits for another year, it will have been a tiny part of a larger problem. I would ask those asking to keep this open to advance the discussion, or else I think you'd agree it eventually should be closed. (Here, I'm really speaking about hundreds of issues like this here, not so much this one.) Part of the problem is that I don't think the details of this feature request were ever elaborated. I think that if you dig into what it would mean, you'd find that a) it's kind of tricky to define and then implement all the right semantics, and b) almost any use case along these lines in my experience is resolved as I suggest, with a simple per-JVM initialization. If the response lately here is, well, we're not quite sure how that works, then we need to get to the bottom of that, not just insisting an issue stay open. To your points: - The executor is going to load user code into one classloader, so we do have that an executor = JVM = classloader. - You can fail things as fast as you like by invoking this init as soon as like in your app. - It's clear where things execute, or else, we must assume app developers understand this or else all bets are off. The driver program executes things in the driver unless they're part of a distributed map() etc operation, which clearly execute on the executor. These IMHO aren't reasons to design a new, different, bespoke mechanism. That has a cost too, if you're positing that it's hard to understand when things run where. The one catch I see is that, by design, we don't control which tasks run on what executors. We can't guarantee init code runs on all executors this way. But, is it meaningful to initialize an executor that never sees an app's tasks? it can't be. 
Lazy init is a good thing and compatible with the Spark model. If startup time is an issue (and I'm still not clear on the latency problem mentioned above), then it gets a little more complicated, but, that's also a little more niche: just run a dummy mapPartitions at the outset on the same data that the first job would touch, even asynchronously if you like with other driver activities. No need to wait; it just gives the init a head-start on the executors that will need it straight away. That's just my opinion of course, but I think those are the questions that would need to be answered to argue something happens here.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578914#comment-15578914 ] Lars Francke commented on SPARK-650: I can only come up with three reasons at the moment. I hope they all make sense. 1) Singletons/Static Initialisers run once per Class Loader where this class is being loaded/used. I haven't actually seen this being a problem (and it might actually be desired behaviour in this case) but making the init step explicit would prevent this from ever becoming one. 2) I'd like to fail fast for some things and not upon first access (which might be behind a conditional somewhere) 3) It is hard enough to reason about where some piece of code is running as it is (Driver or Task/Executor), adding Singletons to the mix makes it even more confusing.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578875#comment-15578875 ] Lars Francke commented on SPARK-650: I also have to disagree with this being a duplicate or obsolete. [~oarmand] and [~Skamandros] already mentioned reasons regarding the duplication. About it being obsolete: I have seen multiple clients facing this problem, finding this issue and hoping it'd get fixed some day. I would hazard a guess and say that most _users_ of Spark have no JIRA account here and do not register or log in just to vote for this issue. That said: This issue is (with six votes) in the top 150 out of almost 17k total issues in the Spark project. As it happens this is a non-trivial thing to implement in Spark (as far as I can tell from my limited knowledge of the inner workings) so it's pretty hard for a "drive by" contributor to help here. You had the discussion about community perception on the mailing list (re: Spark Improvement Proposals) and this issue happens to be one of those that at least I see popping up every once in a while in discussions with clients. I would love to see this issue stay open as a feature request and have some hope that someone someday will implement it.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578535#comment-15578535 ] Sean Owen commented on SPARK-650: If you need init to happen ASAP when the driver starts, isn't any similar mechanism going to be about the same in this regard? This cost is paid just once, and I don't think startup is generally very low-latency for any Spark app.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578487#comment-15578487 ] Olivier Armand commented on SPARK-650: Sean, a singleton is not the best option in our case. Our Spark Streaming executors write to HBase, so we need to initialize the HBase connection. The singleton seems (or seemed when we tested it for our customer a few months after this issue was raised) to be created when the first RDD is processed by the executor, not when the driver starts. This leads to very high processing times for the first events.
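One user-side mitigation for this first-event latency (a workaround, not a Spark API) is to run a cheap warm-up job right after the driver starts, so every executor opens its connection before real data arrives; on a cluster that would look roughly like `sc.parallelize(1 to n, n).foreachPartition(_ => Connection.get())` in Scala, with `n` oversized to hit every executor (all names here are illustrative). The sketch below simulates the idea on a plain JVM: several concurrent "partition" tasks all touch a lazily created stand-in connection, and the expensive initialisation still runs only once:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-JVM simulation of the warm-up trick: many "partition" tasks all
// touch the lazy connection holder; the expensive init still runs only once.
public class WarmUp {
    static int opens = 0;             // counts expensive initialisations
    private static String connection; // stand-in for e.g. an HBase Connection

    static synchronized String getConnection() {
        if (connection == null) {
            opens++;                  // expensive part: open client, load config...
            connection = "hbase-connection";
        }
        return connection;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> { getConnection(); }); // the warm-up "tasks"
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("opened " + opens + " connection(s)"); // opened 1 connection(s)
    }
}
```

The point of the warm-up is simply to pay this one-time cost before the first real batch, rather than adding it to the latency of the first events.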
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578378#comment-15578378 ] Sean Owen commented on SPARK-650: Sorry, I mean the _status_ doesn't matter. Most issues this old are obsolete or de facto won't-fix; resolving it or not doesn't matter. I would even say this is 'not a problem', because a simple singleton provides once-per-executor execution of whatever you like. It's more complex to add a custom mechanism that makes you route this through Spark. That's probably why this hasn't proved necessary.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578361#comment-15578361 ] Michael Schmeißer commented on SPARK-650: Then somebody should please explain to me how this doesn't matter, or rather how certain use cases are supposed to be solved. We need to initialize each JVM and connect it to our logging system, set correlation IDs, initialize contexts and so on. I guess that most users have just implemented work-arounds as we did, but in an enterprise environment this is really not the preferable long-term solution to me. Plus, I think it would really not be hard to implement this feature for someone who has knowledge of the Spark executor setup.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578338#comment-15578338 ] Sean Owen commented on SPARK-650: In practice, these should probably all be WontFix, as this hasn't mattered enough to implement in almost four years. It really doesn't matter.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578289#comment-15578289 ] Michael Schmeißer commented on SPARK-650: I disagree that those issues are duplicates. SPARK-636 asks for a generic way to execute code on the executors, but not for a reliable and easy mechanism to execute code during executor initialization.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573360#comment-15573360 ] holdenk commented on SPARK-650: Would people feel OK if we marked this as a duplicate of SPARK-636, since this does seem to be a subset of SPARK-636?
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561958#comment-15561958 ] Michael Schmeißer commented on SPARK-650: To me, the two seem related, but not exact duplicates. SPARK-636 seems to aim for a more generic mechanism.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556324#comment-15556324 ] holdenk commented on SPARK-650: I think this is a duplicate of SPARK-636, yes?
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554861#comment-15554861 ] Luis Ramos commented on SPARK-650: I have similar requirements to Michael's – this would be a very useful feature to have.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14916264#comment-14916264 ] Michael Schmeißer commented on SPARK-650: I would need this feature as well to perform some initialization of the logging system (which reads its configuration from an external source rather than just a file).
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645919#comment-14645919 ] Lars Francke commented on SPARK-650: Not [~matei] but I think this would be a good idea to have. Abusing another undocumented concept doesn't seem like a nice way to treat a useful and common use-case.
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206182#comment-14206182 ] Andrew Ash commented on SPARK-650: As mentioned in SPARK-572, static classes' initialization methods are being abused to perform this functionality. [~matei], do you still feel that a per-executor initialization function is a hook that Spark should expose in its public API?