Hi there, I'd like to write an iterative computation, i.e., one that can be expressed as a for loop. I understand that in Spark, foreach is often the better choice. However, foreach and foreachPartition seem designed for self-contained computation that involves only the corresponding Row or partition, respectively. In my application, each computational task involves not just one partition but the other partitions as well; every task simply has a specific way of combining its own partition with the rest. An example is cross-validation in machine learning, where each fold corresponds to a partition. Say the data is divided into 5 folds: task 1 uses fold 1 for testing and folds 2, 3, 4, 5 for training; task 2 uses fold 2 for testing and folds 1, 3, 4, 5 for training; and so on.
In this case, if I were to use foreachPartition, it seems I would need to duplicate the data as many times as there are folds (i.e., iterations of my for loop). More generally, I would have to prepare a partition for every distributed task, and that partition would need to contain all the data the task needs, which could be a huge waste of space. Are there any other solutions? Thanks.

f.
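One way to avoid the duplication described above is to tag each record once with a fold id and derive every task's train/test sets by filtering, rather than materializing a separate copy per task. The sketch below is plain Python (not actual Spark API) and only illustrates the idea; in Spark the analogous approach would be to add a fold column to the dataset (e.g. from a random number) and filter it per iteration, so the full data is stored only once. The function names `assign_folds` and `cross_validation_tasks` are hypothetical helpers for this sketch.

```python
import random

def assign_folds(records, k, seed=42):
    """Attach a fold id in [0, k) to every record, once."""
    rng = random.Random(seed)
    return [(rng.randrange(k), r) for r in records]

def cross_validation_tasks(tagged, k):
    """Yield (test, train) record lists for each of the k tasks,
    built by filtering the single tagged dataset -- the data is
    never copied per task, only selected from."""
    for fold in range(k):
        test = [r for f, r in tagged if f == fold]
        train = [r for f, r in tagged if f != fold]
        yield test, train

data = list(range(100))
tagged = assign_folds(data, k=5)
for test, train in cross_validation_tasks(tagged, k=5):
    # each task sees all the data, split disjointly, with no duplication
    assert len(test) + len(train) == len(data)
```

The same pattern in Spark would keep one DataFrame with a fold column and run `df.filter(col("fold") == i)` / `df.filter(col("fold") != i)` inside the driver-side for loop, caching the tagged DataFrame so each iteration reuses it.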