Re: Map and Reduce cycle

Fabian Hueske Fri, 17 Jul 2015 07:20:09 -0700

Hi Bill,

Flink uses pipelined data shipping by default. If you have a program like:
source -> map -> reduce -> sink, the mapper will immediately start to
shuffle data over the network to the reducer. The reducer collects that
data and starts to sort it in batches. When the mapper is done and the
reducer is done with the sorting, the sorted batches are merged and a
sorted stream is fed into the reduce function.
So the shuffling and sorting happen while the mapper is running, but the
reduce function can only be applied after the last mapper has finished.
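
To make the sequence above concrete, here is a small sketch in plain Python
(not Flink code, and not how Flink implements it internally). The names
mapper, sorted_batches and reduce_side, as well as the batch size, are made
up for illustration. It only mimics the order of events: records arrive one
by one, the receiver sorts them in batches as they come in, and the merge
plus the reduce function run once the mapper has produced its last record.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def mapper(words):
    """Emit (key, 1) pairs one at a time, like a pipelined mapper."""
    for w in words:
        yield (w, 1)

def sorted_batches(records, batch_size):
    """Receiver side: buffer incoming records and sort each batch
    as soon as it is full, while the mapper is still producing."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield sorted(batch, key=itemgetter(0))
            batch = []
    if batch:  # last, possibly smaller batch
        yield sorted(batch, key=itemgetter(0))

def reduce_side(records, batch_size=3):
    # Consuming all batches exhausts the mapper, so the merge below
    # can only start after the mapper has finished.
    runs = list(sorted_batches(records, batch_size))
    # Merge the sorted batches into one sorted stream.
    merged = heapq.merge(*runs, key=itemgetter(0))
    # Apply the reduce function (here: a sum) per key on the sorted stream.
    return [(key, sum(v for _, v in group))
            for key, group in groupby(merged, key=itemgetter(0))]

result = reduce_side(mapper(["b", "a", "c", "a", "b", "a"]))
print(result)
```

The generator stands in for the network transfer: the sorting of early
batches overlaps with the mapper, but the per-key sums need the fully merged
stream.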


I am not aware of benchmarks that compare Flink's and Spark's in-memory
operations, but this blog post compares the performance of sorting binary
data in managed memory (like Flink) to naive sorting on the heap (Spark
might do it differently!).

-->
http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html

Fabian


2015-07-17 16:07 GMT+02:00 Bill Sparks <jspa...@cray.com>:

>  Does Flink require all the map tasks to finish before the reducers can
> proceed, like Spark, or can the reducer operations start before all the
> mappers have finished, like the older Hadoop MapReduce?
>
>  Also, my understanding is that Flink manages its own heap. Do you/we
> have a sense of the performance impact of this as compared to say …. Spark
> where it's all in the JVM?
>
>  Regards,
>    Bill.
>
>  --
>  Jonathan (Bill) Sparks
> Software Architecture
> Cray Inc.
>
