You can break this down into a two-stage pipeline like the following:

  rdd.map(r => (r._1._1, r._2)).reduceByKey(_ + _).map(_._2).reduce(_ * _)

The first map keys each ((i, j), value) row by i, and the reduceByKey then
sums all the values within each i group. The second map extracts the sum of
each i group, and the final reduce multiplies all of those sums together.
There are optimizations that can be made here if your data gets really big
or really dense, but that's the basic idea.

- Evan


On Tue, Jan 21, 2014 at 11:29 AM, Aureliano Buendia <[email protected]> wrote:

> Surprisingly, this turned out to be more complicated than what I expected.
>
> I had the impression that this would be trivial in Spark. Am I missing
> something here?
>
>
> On Tue, Jan 21, 2014 at 5:42 AM, Aureliano Buendia <[email protected]> wrote:
>
>> Hi,
>>
>> It seems Spark does not support nested RDDs, so I was wondering how
>> Spark can handle multi-dimensional reductions.
>>
>> As an example, consider a dataset with these rows:
>>
>>   ((i, j), value)
>>
>> where i and j are long indexes, and value is a double.
>>
>> How is it possible to first reduce the above RDD over j, and then reduce
>> the results over i?
>>
>> Just to clarify, a Scala equivalent would look like this:
>>
>>   var results = 1.0
>>   for (i <- 0 until I) {
>>     var jReduction = 0.0
>>     for (j <- 0 until J) {
>>       // Reduce over j
>>       jReduction = jReduction + rdd(i, j)
>>     }
>>     // Reduce over i
>>     results = results * jReduction
>>   }
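[Editor's note: below is a minimal, self-contained sketch of the two-stage
pipeline Evan describes. The SparkContext setup, the object/app names, and
the sample data are assumptions added for illustration; only the chained
map/reduceByKey/map/reduce pipeline comes from the thread.]

  import org.apache.spark.{SparkConf, SparkContext}

  object TwoStageReduction {
    def main(args: Array[String]): Unit = {
      // Local SparkContext purely for demonstration purposes.
      val conf = new SparkConf().setAppName("TwoStageReduction").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // Rows of the form ((i, j), value), as in the original question.
      val rdd = sc.parallelize(Seq(
        ((0L, 0L), 1.0), ((0L, 1L), 2.0),   // group i = 0 sums to 3.0
        ((1L, 0L), 3.0), ((1L, 1L), 4.0)    // group i = 1 sums to 7.0
      ))

      val result = rdd
        .map(r => (r._1._1, r._2))   // key each row by i, dropping j
        .reduceByKey(_ + _)          // reduce over j: sum values within each i group
        .map(_._2)                   // keep only each group's sum
        .reduce(_ * _)               // reduce over i: multiply the per-group sums

      println(result)                // 3.0 * 7.0 = 21.0
      sc.stop()
    }
  }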
