In batch / DataSet programs, groupBy() is executed by partitioning the data
(usually hash partitioning) and sorting each partition to group all
elements with the same key.
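To make that concrete, here is a minimal Python sketch of the idea (not Flink code; the function name `group_by` and the `hash(key) % num_partitions` assignment are simplifications of what Flink actually does internally):

```python
# Conceptual sketch of DataSet-style groupBy(): hash-partition the
# records by key, then sort each partition so records with equal
# keys become adjacent. Illustration only; Flink's real hash
# function and sort implementation differ.

def group_by(records, key_fn, num_partitions):
    # 1. Hash partitioning: every record with the same key is sent
    #    to the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(key_fn(rec)) % num_partitions].append(rec)
    # 2. Sort each partition by key so all records with the same
    #    key form one contiguous group.
    for p in partitions:
        p.sort(key=key_fn)
    return partitions

words = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
parts = group_by(words, key_fn=lambda r: r[0], num_partitions=2)
# Every occurrence of a key lands in exactly one partition, and
# within that partition its records are adjacent.
```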
keyBy() in DataStream programs also partitions the data and results in a
KeyedStream. The KeyedStream has information about the
Hi Fabian,
Thank you for the explanation. Could you also explain how keyBy() would
work? I assume it works the same as groupBy(), but in streaming mode,
since the data is unbounded, all elements that arrive in the first window
are grouped/partitioned by key and aggregated, and so on until no more
Hi Ravinder,
your drawing is pretty much correct (Flink will inject a combiner between
flat map and reduce which locally combines records with the same key).
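Roughly, the combiner pre-aggregates on the sender side before the shuffle. A small Python sketch of that idea (illustration only; the function names `local_combine`, `hash_partition`, and `final_reduce` are made up, not Flink API):

```python
# Sketch of a combiner between flatMap and reduce: each upstream
# subtask pre-aggregates its local (word, count) records before
# shuffling, so fewer records cross the network.

def local_combine(records):
    # Sender-side pre-aggregation of records with the same key.
    acc = {}
    for word, count in records:
        acc[word] = acc.get(word, 0) + count
    return list(acc.items())

def hash_partition(records, num_reducers):
    # Default shuffle: route each combined record by hash of its key.
    parts = [[] for _ in range(num_reducers)]
    for word, count in records:
        parts[hash(word) % num_reducers].append((word, count))
    return parts

def final_reduce(partition):
    # Each reduce subtask merges the partial counts it receives.
    return local_combine(partition)

mapper_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = local_combine(mapper_output)
# The combiner shrinks six records down to four partial sums before
# they are hash-partitioned to the reducers.
```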
The partitioning between flat map and reduce is done with hash partitioning
by default. However, you can also define a custom partitioner to
The picture you reference does not really show how dataflows are connected.
For a better picture, visit this link:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows
Let me know if this doesn't answer your question.
On 19.04.2016 14:22, Ravind
Hello All,
Considering the following streaming dataflow of the WordCount example, I
want to understand how the Sink is parallelised.
Source --> flatMap --> groupBy(), sum() --> Sink
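The dataflow above can be sketched end to end in a few lines of Python to show what parallelism would mean for each step (illustration only, not Flink internals; `run_pipeline` is a made-up name, and the assumption that sink subtask i consumes the output of sum subtask i reflects the common case where both operators run at the same parallelism):

```python
# Sketch of Source --> flatMap --> groupBy()/sum() --> Sink with
# parallelism p: sum() runs p subtasks, each word is routed to
# exactly one of them by hash partitioning, and each sum subtask
# hands its results to its own sink subtask.

def run_pipeline(lines, parallelism):
    # flatMap: split each line into (word, 1) records.
    records = [(w, 1) for line in lines for w in line.split()]
    # groupBy/keyBy: hash-partition by word, so every word is owned
    # by exactly one sum() subtask.
    sum_inputs = [[] for _ in range(parallelism)]
    for word, one in records:
        sum_inputs[hash(word) % parallelism].append((word, one))
    # sum(): each subtask aggregates its partition independently and
    # forwards to its local sink subtask (no further re-shuffle).
    sinks = []
    for subtask_records in sum_inputs:
        counts = {}
        for word, one in subtask_records:
            counts[word] = counts.get(word, 0) + one
        sinks.append(counts)
    return sinks  # sinks[i] = what sink subtask i would write

sinks = run_pipeline(["to be or not to be"], parallelism=3)
```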
If I set the parallelism at runtime using -p, as shown here
https://ci.apache.org/projects/flink/flink-docs-release-1