Hi,

The equivalent would be setting the parallelism on your sink operator, e.g. 
`stream.addSink(…).setParallelism(…)`.
By default, the parallelism of all operators in the pipeline will be whatever 
parallelism was set for the whole job, unless the parallelism is explicitly set 
for a specific operator. For more details on the distributed runtime concepts, 
you can take a look at [1].
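
To make that concrete, a pipeline along the lines of what you describe 
(Kafka -> Flink -> Elasticsearch) could look roughly like the sketch below. 
The connector constructor arguments are elided, and the parallelism values 
are just placeholders you would tune for your workload:

```java
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Fewer Kafka consumers, since reading from Kafka is comparatively fast
DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer010<>(…))
    .setParallelism(2);

// More Elasticsearch writers, since writes to ES are the bottleneck
stream
    .addSink(new ElasticsearchSink<>(…))
    .setParallelism(8);

env.execute();
```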

        I saw the implementation of the Elasticsearch sink in Flink, which can 
do batching of messages before writes. How can I batch data based on custom 
logic? E.g. batch writes grouped on one of the message keys. This is possible 
in Storm via FieldGrouping.
The equivalent of partitioning streams in Flink is `stream.keyBy(…)`. All 
messages with the same key will then go to the same parallel downstream 
operator instance. If it's an `ElasticsearchSink`, then following a `keyBy` 
all messages with the same key will be batched by the same Elasticsearch 
writer.
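
As a simplified illustration of that routing: `keyBy` effectively hashes each 
key and maps it onto one of the parallel downstream instances. Flink's actual 
assignment goes through key groups and is more involved than the sketch below, 
and the class and method names here are made up for illustration, but the 
guarantee is the same: identical keys always land on the same instance.

```java
public class KeyRouting {

    // Simplified model of keyBy routing: hash the key and map it onto one
    // of the parallel downstream instances. Identical keys always map to
    // the same instance, so their messages are batched by the same writer.
    public static int assignToInstance(String key, int parallelism) {
        // floorMod avoids negative results for negative hash codes
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 8;
        // The same key always maps to the same writer instance
        System.out.println(assignToInstance("user-42", parallelism)
                == assignToInstance("user-42", parallelism)); // true
    }
}
```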

Hope this helps!

Cheers,
Gordon

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html


On 8 August 2017 at 8:58:30 AM, Jakes John (jakesjohn12...@gmail.com) wrote:

       I am coming from the Apache Storm world. I am planning to switch from 
Storm to Flink. I was reading the Flink documentation but couldn't find 
equivalents for some features that are present in Storm.

I need to have a streaming pipeline Kafka -> Flink -> Elasticsearch. In Storm, 
I have seen that I can specify the number of tasks per bolt. Typically, 
databases are slow in writes, and hence I need more writers to the database. 
Reading from Kafka is pretty fast compared to ES writes. This means that I 
need to have more ES writer tasks than Kafka consumers. How can I achieve this 
in Flink? What are the concepts in Flink similar to Storm parallelism concepts 
like workers, executors, and tasks?
        I saw the implementation of the Elasticsearch sink in Flink, which can 
do batching of messages before writes. How can I batch data based on custom 
logic? E.g. batch writes grouped on one of the message keys. This is possible 
in Storm via FieldGrouping, but I couldn't find an equivalent way to do 
grouping in Flink and control the overall number of writes to ES.

Please help me with the above questions and some pointers to Flink parallelism.


