You're certainly not iterating on the driver. The Iterable you process in your function stays on the cluster, and the work is done in parallel.
On Fri, Aug 1, 2014 at 8:36 PM, Kristopher Kalish <k...@kalish.net> wrote:
> The reason I want an RDD is because I'm assuming that iterating the
> individual elements of an RDD on the driver of the cluster is much slower
> than coming up with the mean and standard deviation using a
> map-reduce-based algorithm.
>
> I don't know the intimate details of Spark's implementation, but it seems
> like each iterable element would need to be serialized and sent to the
> driver, which would maintain the state (count, sum, total deviation from
> mean, etc.), which is a lot of network traffic.
>
> -Kris
>
>
> On Fri, Aug 1, 2014 at 2:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> On Fri, Aug 1, 2014 at 7:55 PM, kriskalish <k...@kalish.net> wrote:
>> > I have what seems like a relatively straightforward task to
>> > accomplish, but I cannot seem to figure it out from the Spark
>> > documentation or searching the mailing list.
>> >
>> > I have an RDD[(String, MyClass)] that I would like to group by the
>> > key, and calculate the mean and standard deviation of the "foo" field
>> > of MyClass. It "feels" like I should be able to use groupBy to get an
>> > RDD for each unique key, but it gives me an Iterable.
>>
>> Hm, why would you expect or want that? An RDD is a large distributed
>> data set. It's much easier to compute a mean and stdev over an
>> Iterable of numbers than an RDD.
>>
>> You can map your class to its double field and use anything that
>> operates on doubles.
>
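For the archives, a sketch of what this might look like. The names `MyClass` and `foo` come from the original question, but the exact class shape and the helper name `perKeyStats` are assumptions; this computes a population mean and standard deviation per key:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical class from the question; only the numeric `foo` field matters here.
case class MyClass(foo: Double)

def perKeyStats(rdd: RDD[(String, MyClass)]): RDD[(String, (Double, Double))] =
  rdd.groupByKey().mapValues { vs =>
    // `vs` is a local Iterable held on an executor, not an RDD, so plain
    // Scala collection operations apply; nothing is shipped to the driver.
    val xs = vs.map(_.foo).toSeq
    val n = xs.size.toDouble
    val mean = xs.sum / n
    val stdev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
    (mean, stdev)
  }
```

If individual groups are very large, buffering each group as an Iterable can be costly; an alternative along the same lines would be to fold each group into a (count, sum, sum-of-squares) triple with aggregateByKey, which streams through the values in a single pass instead of materializing them.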