It wasn't so much the cogroup itself that was optimized here, but what is done with the result of the cogroup. Yes, it was a matter of not materializing the entire result of the flatMap-like function that runs after the cogroup, since that function only needs to return an Iterator (actually, a TraversableOnce).
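
To make that concrete, here's a rough sketch of the pattern (the RDD names and element types are mine for illustration, not taken from the commit itself):

    import org.apache.spark.rdd.RDD

    def joinLike(a: RDD[(Int, String)], b: RDD[(Int, String)])
        : RDD[(Int, (String, String))] =
      a.cogroup(b).flatMapValues { case (vs, ws) =>
        // Strict version: builds the whole per-key cross product in
        // memory before Spark sees the first pair:
        //   for (v <- vs; w <- ws) yield (v, w)
        // Iterator version: flatMapValues only needs a TraversableOnce,
        // so the pairs can be produced lazily, one at a time:
        vs.iterator.flatMap(v => ws.iterator.map(w => (v, w)))
      }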
I'd say that wherever you flatMap a large-ish value to another one, you should consider this pattern, yes. I think this may also be a case where Scala's lazy collections (with .view) could be useful? (There's a rough sketch of that below the quoted message.)

On Mon, Dec 29, 2014 at 4:28 AM, Stephen Haberman
<stephen.haber...@gmail.com> wrote:

> Hey,
>
> I saw this commit go by, and find it fairly fascinating:
>
> https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305
>
> For two reasons: 1) we have a report that is bogging down exactly in
> a .join with lots of elements, so, glad to see the fix, but, more
> interesting I think:
>
> 2) If such a subtle bug was lurking in spark-core, it leaves me worried
> that every time we use .map in our own cogroup code, that we'll be
> committing the same perf error.
>
> Has anyone thought more deeply about whether this is a big deal or not?
> Should ".iterator.map" vs. ".map" be strongly preferred/best practice
> for cogroup code?
>
> Thanks,
> Stephen
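
P.S. On the .view point, here's a tiny, Spark-free sketch of the same idea with plain Scala collections (the numbers are arbitrary):

    val values = 1 to 1000000

    // Strict: each step materializes a full intermediate collection.
    val strict = values.map(_ * 2).filter(_ % 3 == 0).take(5)

    // Lazy: .view fuses the steps, so only as many elements as take(5)
    // needs are ever computed; toList forces the result.
    val lazily = values.view.map(_ * 2).filter(_ % 3 == 0).take(5).toList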