Hi there, I'd like to write an iterative computation, i.e., one that can be expressed as a for loop. I understand that in Spark, foreach is often the better choice. However, foreach and foreachPartition seem designed for self-contained computation that involves only the corresponding Row or partition, respectively. In my application, each computational task involves not just one partition but the other partitions as well; every task simply has a specific way of combining its own partition with the rest. An example is cross-validation in machine learning, where each fold corresponds to a partition. Say the data is divided into 5 folds: task 1 uses fold 1 for testing and folds 2, 3, 4, 5 for training; task 2 uses fold 2 for testing and folds 1, 3, 4, 5 for training; and so on.
In this case, if I were to use foreachPartition, it seems I would need to duplicate the data as many times as there are folds (i.e., iterations of my for loop). More generally, I would have to prepare a partition for every distributed task, and that partition would need to contain all the data the task needs, which could be a huge waste of space. Are there any other solutions? Thanks.

f.
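One way to avoid the duplication described above is to tag each record once with a fold id and derive every task's train/test sets by filtering, rather than materializing a separate copy per task. The sketch below is plain Python (not actual Spark API) and only illustrates the idea; in Spark the analogous approach would be to add a fold column to the dataset (e.g. from a random number) and filter it per iteration, so the full data is stored only once. The function names `assign_folds` and `cross_validation_tasks` are hypothetical helpers for this sketch.

```python
import random

def assign_folds(records, k, seed=42):
    """Attach a fold id in [0, k) to every record, once."""
    rng = random.Random(seed)
    return [(rng.randrange(k), r) for r in records]

def cross_validation_tasks(tagged, k):
    """Yield (test, train) record lists for each of the k tasks,
    built by filtering the single tagged dataset -- the data is
    never copied per task, only selected from."""
    for fold in range(k):
        test = [r for f, r in tagged if f == fold]
        train = [r for f, r in tagged if f != fold]
        yield test, train

data = list(range(100))
tagged = assign_folds(data, k=5)
for test, train in cross_validation_tasks(tagged, k=5):
    # each task sees all the data, split disjointly, with no duplication
    assert len(test) + len(train) == len(data)
```

The same pattern in Spark would keep one DataFrame with a fold column and run `df.filter(col("fold") == i)` / `df.filter(col("fold") != i)` inside the driver-side for loop, caching the tagged DataFrame so each iteration reuses it.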