Hi folks,
We have written a Spark job that scans multiple HDFS directories and performs transformations on them. For now, this is done with a simple for loop that starts one job per iteration, along the lines of:

    dirs.foreach { case (src, dest) =>
      sc.textFile(src).process.saveAsTextFile(dest)
    }

However, each iteration is independent, and we would like to optimize this by running the jobs simultaneously (or in a chained fashion), so that we don't end up with idle executors at the end of each iteration (some directories only require a single partition).

Has anyone already done such a thing? How would you suggest we go about it?

Cheers,
Anselme
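P.S. For concreteness, one direction we were considering is submitting the jobs from separate threads, since SparkContext is thread-safe for concurrent job submission. A rough sketch (here `process` stands in for our actual transformation chain, and `dirs` is the same (src, dest) collection as above):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Kick off one Spark job per directory pair; each Future blocks
    // on its action, but the jobs themselves run concurrently, so the
    // scheduler can fill idle executors with tasks from other jobs.
    val jobs = dirs.map { case (src, dest) =>
      Future {
        sc.textFile(src).process.saveAsTextFile(dest)
      }
    }

    // Wait for all jobs to finish before shutting down the context.
    Await.result(Future.sequence(jobs), Duration.Inf)

Two caveats with this sketch: the global ExecutionContext caps its thread count at the number of cores on the driver, which limits how many jobs are in flight at once (a dedicated thread pool would lift that), and with the default FIFO scheduler later jobs may still queue behind earlier ones, so setting spark.scheduler.mode=FAIR is probably needed for the jobs to actually share executors. Would something like this be the recommended approach?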