It wasn't so much the cogroup itself that was optimized here, but what is done with the result of the cogroup. Yes, it was a matter of not materializing the entire result of the flatMap-like function that runs after the cogroup, since that function only needs to return an Iterator (actually, a TraversableOnce).
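
To make that concrete, here's a rough sketch of the pattern (the RDD names and element types are mine for illustration, not taken from the commit itself):

    import org.apache.spark.rdd.RDD

    def joinLike(a: RDD[(Int, String)], b: RDD[(Int, String)])
        : RDD[(Int, (String, String))] =
      a.cogroup(b).flatMapValues { case (vs, ws) =>
        // Strict version: builds the whole per-key cross product in
        // memory before Spark sees the first pair:
        //   for (v <- vs; w <- ws) yield (v, w)
        // Iterator version: flatMapValues only needs a TraversableOnce,
        // so the pairs can be produced lazily, one at a time:
        vs.iterator.flatMap(v => ws.iterator.map(w => (v, w)))
      }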
I'd say that wherever you flatMap a large-ish value to another one, you should consider this pattern, yes. I think this may also be a case where Scala's lazy collections (with .view) could be useful? (There's a rough sketch of that below the quoted message.)

On Mon, Dec 29, 2014 at 4:28 AM, Stephen Haberman
<stephen.haber...@gmail.com> wrote:

> Hey,
>
> I saw this commit go by, and find it fairly fascinating:
>
> https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305
>
> For two reasons: 1) we have a report that is bogging down exactly in
> a .join with lots of elements, so, glad to see the fix, but, more
> interesting I think:
>
> 2) If such a subtle bug was lurking in spark-core, it leaves me worried
> that every time we use .map in our own cogroup code, that we'll be
> committing the same perf error.
>
> Has anyone thought more deeply about whether this is a big deal or not?
> Should ".iterator.map" vs. ".map" be strongly preferred/best practice
> for cogroup code?
>
> Thanks,
> Stephen
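
P.S. On the .view point, here's a tiny, Spark-free sketch of the same idea with plain Scala collections (the numbers are arbitrary):

    val values = 1 to 1000000

    // Strict: each step materializes a full intermediate collection.
    val strict = values.map(_ * 2).filter(_ % 3 == 0).take(5)

    // Lazy: .view fuses the steps, so only as many elements as take(5)
    // needs are ever computed; toList forces the result.
    val lazily = values.view.map(_ * 2).filter(_ % 3 == 0).take(5).toList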