In batch / DataSet programs, groupBy() is executed by partitioning the data
(usually hash partitioning) and sorting each partition to group all
elements with the same key.
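To make that concrete, here is a minimal Python sketch of the idea (not Flink code; the function name `group_by` and the `hash(key) % num_partitions` assignment are simplifications of what Flink actually does internally):

```python
# Conceptual sketch of DataSet-style groupBy(): hash-partition the
# records by key, then sort each partition so records with equal
# keys become adjacent. Illustration only; Flink's real hash
# function and sort implementation differ.

def group_by(records, key_fn, num_partitions):
    # 1. Hash partitioning: every record with the same key is sent
    #    to the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(key_fn(rec)) % num_partitions].append(rec)
    # 2. Sort each partition by key so all records with the same
    #    key form one contiguous group.
    for p in partitions:
        p.sort(key=key_fn)
    return partitions

words = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
parts = group_by(words, key_fn=lambda r: r[0], num_partitions=2)
# Every occurrence of a key lands in exactly one partition, and
# within that partition its records are adjacent.
```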
keyBy() in DataStream programs also partitions the data and results in a
KeyedStream. The KeyedStream has information about the
Hi Fabian,
Thank you for the explanation. Could you also explain how keyBy() would
work? I assume it works the same as groupBy(), but in streaming mode,
since the data is unbounded, all elements that arrive in the first window
are grouped/partitioned by key and aggregated, and so on until no more
Hi Ravinder,
your drawing is pretty much correct (Flink will inject a combiner between
flat map and reduce which locally combines records with the same key).
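Roughly, the combiner pre-aggregates on the sender side before the shuffle. A small Python sketch of that idea (illustration only; the function names `local_combine`, `hash_partition`, and `final_reduce` are made up, not Flink API):

```python
# Sketch of a combiner between flatMap and reduce: each upstream
# subtask pre-aggregates its local (word, count) records before
# shuffling, so fewer records cross the network.

def local_combine(records):
    # Sender-side pre-aggregation of records with the same key.
    acc = {}
    for word, count in records:
        acc[word] = acc.get(word, 0) + count
    return list(acc.items())

def hash_partition(records, num_reducers):
    # Default shuffle: route each combined record by hash of its key.
    parts = [[] for _ in range(num_reducers)]
    for word, count in records:
        parts[hash(word) % num_reducers].append((word, count))
    return parts

def final_reduce(partition):
    # Each reduce subtask merges the partial counts it receives.
    return local_combine(partition)

mapper_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = local_combine(mapper_output)
# The combiner shrinks six records down to four partial sums before
# they are hash-partitioned to the reducers.
```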
The partitioning between flat map and reduce is done with hash partitioning
by default. However, you can also define a custom partitioner to
The picture you reference does not really show how dataflows are connected.
For a better picture, visit this link:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows
Let me know if this doesn't answer your question.
On 19.04.2016 14:22, Ravind
Hello All,
Considering the following streaming dataflow of the WordCount example, I
want to understand how the Sink is parallelised.
Source --> flatMap --> groupBy(), sum() --> Sink
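The dataflow above can be sketched end to end in a few lines of Python to show what parallelism would mean for each step (illustration only, not Flink internals; `run_pipeline` is a made-up name, and the assumption that sink subtask i consumes the output of sum subtask i reflects the common case where both operators run at the same parallelism):

```python
# Sketch of Source --> flatMap --> groupBy()/sum() --> Sink with
# parallelism p: sum() runs p subtasks, each word is routed to
# exactly one of them by hash partitioning, and each sum subtask
# hands its results to its own sink subtask.

def run_pipeline(lines, parallelism):
    # flatMap: split each line into (word, 1) records.
    records = [(w, 1) for line in lines for w in line.split()]
    # groupBy/keyBy: hash-partition by word, so every word is owned
    # by exactly one sum() subtask.
    sum_inputs = [[] for _ in range(parallelism)]
    for word, one in records:
        sum_inputs[hash(word) % parallelism].append((word, one))
    # sum(): each subtask aggregates its partition independently and
    # forwards to its local sink subtask (no further re-shuffle).
    sinks = []
    for subtask_records in sum_inputs:
        counts = {}
        for word, one in subtask_records:
            counts[word] = counts.get(word, 0) + one
        sinks.append(counts)
    return sinks  # sinks[i] = what sink subtask i would write

sinks = run_pipeline(["to be or not to be"], parallelism=3)
```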
If I set the parallelism at runtime using -p, as shown here
https://ci.apache.org/projects/flink/flink-docs-release-1