Re: Dynamic partitioner for Flink based on incoming load

2020-07-03 Thread Robert Metzger
> This will mean 2 shuffles, and 1 node might bottleneck if 1 topic has too much data?

Yes

> Is there a way to avoid shuffle at all (or do only 1) and avoid a situation when 1 node will become a hotspot?

Do you know the amount of data per kafka topic beforehand, or does this have to be dynamic?

Re: Dynamic partitioner for Flink based on incoming load

2020-06-25 Thread Alexander Filipchik
This will mean 2 shuffles, and 1 node might bottleneck if 1 topic has too much data? Is there a way to avoid shuffle at all (or do only 1) and avoid a situation when 1 node will become a hotspot?

Alex

On Thu, Jun 25, 2020 at 8:05 AM Kostas Kloudas wrote:
> Hi Alexander,
>
> Routing of input dat…

Re: Dynamic partitioner for Flink based on incoming load

2020-06-25 Thread Kostas Kloudas
Hi Alexander,

Routing of input data in Flink can be done through keying, and this can guarantee collocation constraints. This means that you can send two records to the same node by giving them the same key, e.g. the topic name. Keep in mind that elements with different keys do not necessarily go t…
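The collocation guarantee described above can be sketched in plain Java. This is a simplified stand-in for Flink's key-to-subtask mapping (the real implementation uses murmur hashing and key groups in `KeyGroupRangeAssignment`); the class and method names here are illustrative, not Flink API:

```java
// Sketch: keyed routing sends records with equal keys to the same parallel
// subtask. Simplified model of Flink's behavior, not its actual hashing.
public class KeyedRoutingSketch {

    // Map a key to one of `parallelism` subtasks via its hash code.
    // (Flink really uses murmurHash(key.hashCode()) and key groups.)
    static int subtaskFor(String key, int parallelism) {
        return ((key.hashCode() % parallelism) + parallelism) % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Two records keyed by the same Kafka topic name land on the
        // same subtask, which is the collocation guarantee...
        System.out.println(
            subtaskFor("orders", parallelism) == subtaskFor("orders", parallelism)); // prints true
        // ...but different keys may still hash to the same subtask, and a
        // single very hot key (topic) cannot be split: that one subtask
        // becomes the hotspot the thread is worried about.
    }
}
```

This also illustrates the trade-off raised later in the thread: keying by topic collocates data, but a skewed topic concentrates load on one node.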

Re: Dynamic partitioner for Flink based on incoming load

2020-06-24 Thread Alexander Filipchik
Maybe I am misreading the documentation, but: "Data within the partition directories are split into part files. Each partition will contain at least one part file for each subtask of the sink that has received data for that partition." So, it is 1 partition per subtask. I'm trying to figure out how t…

Re: Dynamic partitioner for Flink based on incoming load

2020-06-24 Thread Seth Wiesman
You can achieve this in Flink 1.10 using the StreamingFileSink. I'd also like to note that Flink 1.11 (which is currently going through release testing and should be available imminently) has support for exactly this functionality in the Table API. https://ci.apache.org/projects/flink/flink-docs-…
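The StreamingFileSink behavior quoted above (one partition directory per bucket, at least one part file per subtask) can be sketched in plain Java. The real extension point is the `BucketAssigner` interface (`org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner`); this standalone class only mimics its `getBucketId` contract, and the paths are illustrative:

```java
// Sketch of per-topic bucketing, as a custom BucketAssigner would do it.
// Not Flink API: a plain-Java model of the getBucketId contract.
public class TopicBucketingSketch {

    // Route each record to a bucket (directory) named after its Kafka topic.
    static String getBucketId(String topic) {
        return "topic=" + topic;
    }

    // Each sink subtask writes its own part file inside the bucket, so a
    // bucket that receives data from N subtasks holds at least N part files.
    // Naming mirrors the default "part-<subtaskIndex>-<partCounter>" scheme.
    static String partFilePath(String topic, int subtaskIndex, int partCounter) {
        return getBucketId(topic) + "/part-" + subtaskIndex + "-" + partCounter;
    }

    public static void main(String[] args) {
        System.out.println(partFilePath("orders", 2, 0)); // prints topic=orders/part-2-0
    }
}
```

This is why file counts multiply: every subtask that sees data for a given topic opens its own part file under that topic's directory.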

Dynamic partitioner for Flink based on incoming load

2020-06-24 Thread Alexander Filipchik
Hello!

We are working on a Flink Streaming job that reads data from multiple Kafka topics and writes them to DFS. We are using StreamingFileSink with a custom implementation for GCS FS, and it generates a lot of files as streams are partitioned among multiple JMs. In the ideal case we should have at…
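The file explosion described in this message follows from simple arithmetic: when every topic's data is spread across every sink subtask, each (topic, subtask) pair produces its own part files. A worst-case sketch, with entirely illustrative numbers:

```java
// Sketch: worst-case part-file count when topics are spread across all
// sink subtasks. Numbers below are hypothetical, for illustration only.
public class FileCountSketch {

    // Every subtask that receives data for a topic opens its own part file,
    // and rolling policies (size/time) multiply that further.
    static long partFiles(int topics, int parallelism, int rolloversPerSubtask) {
        return (long) topics * parallelism * rolloversPerSubtask;
    }

    public static void main(String[] args) {
        // e.g. 50 topics x parallelism 32 x 4 rollovers each
        System.out.println(partFiles(50, 32, 4)); // prints 6400
        // versus the ideal the thread is after: roughly one writer per
        // topic, i.e. closer to topics x rollovers = 200 files.
        System.out.println(partFiles(50, 1, 4)); // prints 200
    }
}
```

Collapsing the `parallelism` factor per topic is exactly what keying by topic name (suggested later in the thread) achieves, at the cost of possible hotspots.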