We have a somewhat complex pipeline which has multiple output files on HDFS, and we'd like the materialization of those outputs to happen concurrently.
Internal to Spark, any "save" call creates a new "job", which runs synchronously -- that is, the line of code after your save() executes once the job completes, executing the entire dependency DAG to produce it. Same with foreach, collect, count, etc. The files we want to save have overlapping dependencies. For us to create multiple outputs concurrently, we have a few options that I can see: - Spawn a thread for each file we want to save, letting Spark run the jobs somewhat independently. This has the downside of various concurrency bugs (e.g. SPARK-4454 <https://issues.apache.org/jira/browse/SPARK-4454>, and more recently SPARK-13631 <https://issues.apache.org/jira/browse/SPARK-13631>) and also causes RDDs up the dependency graph to get independently, uselessly recomputed. - Analyze our own dependency graph, materialize (by checkpointing or saving) common dependencies, and then executing the two saves in threads. - Implement our own saves to HDFS as side-effects inside mapPartitions (which is how save actually works internally anyway, modulo committing logic to handle speculative execution), yielding an empty dummy RDD for each thing we want to save, and then run foreach or count on the union of all the dummy RDDs, which causes Spark to schedule the entire DAG we're interested in. Currently we are doing a little of #1 and a little of #3, depending on who originally wrote the code. #2 is probably closer to what we're supposed to be doing, but IMO Spark is already able to produce a good execution plan and we shouldn't have to do that. AFAIK, there's no way to do what I *actually* want in Spark, which is to have some control over which saves go into which jobs, and then execute the jobs directly. I can envision a new version of the various save functions which take an extra job argument, or something, or some way to defer and unblock job creation in the spark context. Ideas?