Hello fellow Sparkians. In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ, Matei suggested that Spark might get deferred grouping and forced execution of multiple jobs in an efficient way. His code sample:
    rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
    rdd.reduceLater(reduceFunction2) // returns Future[ResultType2]
    SparkContext.force() // executes all the "later" operations as part of a single optimized job

This would be immensely useful. Today, if you want to make two passes over the data and save two different results to disk, you either have to cache the RDD (which can be slow, or can starve the processing code of memory) or recompute the whole lineage twice. If Spark were smart enough to let you group these operations together and "fork" an RDD (say, via an RDD.partition method), you could easily implement these n-pass operations and have Spark execute them efficiently.

Our use case for a feature like this: we process many records, attaching metadata during processing about our confidence in each data point, and then write the data to one location and the metadata to another.

I've also wanted this for taking a dataset, profiling it for partition sizes (or anomalously sized partitions), and then using the profiling result to repartition the data before saving it to disk, which I think is impossible to do right now without caching. This use case is more interesting because information from earlier in the DAG needs to influence later stages, so I suspect the answer will be "cache the thing". I explicitly don't want to cache it: I'm not running an iterative algorithm where I'm willing to pay the heap and time penalties, I'm just doing an operation that needs run-time information without a collect call. This suggests that something like a repartition driven by a lazily evaluated accumulator might work as well, but I haven't been able to figure out a solution even with that primitive and the current APIs.

So, does anyone know if this feature might land, and if not, where to start implementing it? And what would the Python story for Futures be?
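To make the proposed semantics concrete, here is a minimal sketch in plain Python (no Spark) of what a deferred-reduce API and its Futures could look like. The class and method names (DeferredReducer, reduce_later, force) are hypothetical, not an existing Spark API; the point is just that force() makes a single pass over the data and completes every registered Future at once, instead of one pass per reduce:

```python
from concurrent.futures import Future

# Hypothetical sketch, not Spark code: reduce_later registers a reduce
# function and returns a concurrent.futures.Future; force() traverses the
# data once, updating one accumulator per registered reduce, then
# completes all the futures together.
class DeferredReducer:
    def __init__(self, data):
        self._data = data
        self._pending = []  # list of (reduce_fn, future) pairs

    def reduce_later(self, fn):
        fut = Future()
        self._pending.append((fn, fut))
        return fut

    def force(self):
        # Single traversal; each element feeds every accumulator.
        accs = [None] * len(self._pending)
        seen_first = False
        for x in self._data:
            for i, (fn, _) in enumerate(self._pending):
                accs[i] = x if not seen_first else fn(accs[i], x)
            seen_first = True
        for (_, fut), acc in zip(self._pending, accs):
            fut.set_result(acc)


# Usage: two reduces, one pass over the data.
r = DeferredReducer([1, 2, 3, 4])
total = r.reduce_later(lambda a, b: a + b)
biggest = r.reduce_later(max)
r.force()
print(total.result())    # 10
print(biggest.result())  # 4
```

In a real implementation the interesting part would of course be the scheduler fusing the pending operations into one job over the RDD's partitions; this sketch only illustrates the user-facing contract.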