This is from a separate thread with a different title.
Why can't you modify the actual contents of an RDD using forEach? It appears to
be working for me. What I'm doing is updating each data item's cluster
assignment and distance on every iteration of the clustering algorithm. The
algorithm is massive and iterates thousands of times. As I understand it now,
you are supposed to create a new RDD on each pass. This is a hierarchical
k-means, so it consists of many small iterations rather than a few large ones.
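For context, here is a minimal plain-Python sketch of the two update styles in question. This is not Spark code; lists stand in for an RDD, and the names (`nearest_centroid`, `assign_in_place`, `assign_functional`) are my own, not from any Spark API. The point is that foreach-style in-place mutation runs on executor-side copies in real Spark, while a map-style pass builds a new collection:

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# Style 1: mutate records in place (what foreach-based mutation attempts).
# In real Spark there is no guarantee these mutations survive, because
# foreach runs against deserialized copies of the data on the executors.
def assign_in_place(records, centroids):
    for rec in records:  # rec is a dict: {"point": ..., "cluster": ...}
        rec["cluster"] = nearest_centroid(rec["point"], centroids)

# Style 2: produce a new collection each pass (what rdd.map would do).
def assign_functional(records, centroids):
    return [{"point": r["point"],
             "cluster": nearest_centroid(r["point"], centroids)}
            for r in records]

data = [{"point": (0.0, 0.0), "cluster": -1},
        {"point": (9.0, 9.0), "cluster": -1}]
centroids = [(0.0, 1.0), (10.0, 10.0)]

new_data = assign_functional(data, centroids)
# new_data carries the fresh assignments; the original `data` is untouched.
```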
I understand why operations used in aggregations, reductions, etc. need to be
associative. However, forEach operates on a single item at a time. So given
that Spark is advertised as great for iterative algorithms because it operates
in-memory, how can it be good to create thousands upon thousands of RDDs over
the course of an iterative algorithm? Does Spark have some kind of trick
behind the scenes, such as structural sharing or fully persistent data
structures? How can it possibly be efficient for 'iterative' algorithms when
it creates so many RDDs as opposed to one?
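To make concrete what "thousands of RDDs" would look like, here is a plain-Python sketch of the loop shape (no Spark; `one_pass` is a dummy stand-in for one clustering pass). Each iteration creates a brand-new structure, but only the latest reference is retained, so earlier passes become garbage; my understanding is that `rdd.cache()`/`rdd.unpersist()` plays the analogous role for materialized RDDs in Spark:

```python
def one_pass(assignments):
    """Dummy stand-in for one clustering pass that derives NEW
    assignments from the old ones instead of mutating them."""
    return [a + 1 for a in assignments]

assignments = [0, 0, 0]
for _ in range(1000):                 # thousands of passes...
    new_assignments = one_pass(assignments)
    assignments = new_assignments     # ...but only one live copy at a time;
                                      # the previous list is now unreferenced
```

The creation of each new object is cheap; what matters for memory is how many are kept alive at once, not how many are created in total.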
Or is the answer that I should keep doing what I'm doing because it works,
even though it is not theoretically sound or aligned with functional ideas? I
just want it to be fast and able to handle up to 500 million data items.