You can achieve this in Flink 1.10 using the StreamingFileSink.
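For example, a rough sketch (the Record POJO, its getTopic() accessor, and
the paths are made up for illustration) that keys the stream by topic and
uses a custom BucketAssigner, so all records for a topic are written by a
single subtask into that topic's bucket:

  import org.apache.flink.api.common.serialization.SimpleStringEncoder;
  import org.apache.flink.core.fs.Path;
  import org.apache.flink.core.io.SimpleVersionedSerializer;
  import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
  import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
  import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
  import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;
  import java.util.concurrent.TimeUnit;

  // One bucket (directory) per source Kafka topic.
  class TopicBucketAssigner implements BucketAssigner<Record, String> {
      @Override
      public String getBucketId(Record element, Context context) {
          return element.getTopic();
      }

      @Override
      public SimpleVersionedSerializer<String> getSerializer() {
          return SimpleVersionedStringSerializer.INSTANCE;
      }
  }

  StreamingFileSink<Record> sink = StreamingFileSink
      .forRowFormat(new Path("gs://bucket/output"),
                    new SimpleStringEncoder<Record>("UTF-8"))
      .withBucketAssigner(new TopicBucketAssigner())
      .withRollingPolicy(DefaultRollingPolicy.builder()
          .withRolloverInterval(TimeUnit.MINUTES.toMillis(1))
          .build())
      .build();

  // Keying by topic routes each topic to one subtask, so each bucket has
  // at most one open part file per rolling interval.
  stream.keyBy(Record::getTopic).addSink(sink);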

I’d also like to note that Flink 1.11 (which is currently going through
release testing and should be available imminently) has support for exactly
this functionality in the Table API.

https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/filesystem.html
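
For reference, the 1.11 filesystem connector is configured with DDL; here
is a rough sketch via the Table API (the table name, schema, and path are
made up):

  import org.apache.flink.table.api.EnvironmentSettings;
  import org.apache.flink.table.api.TableEnvironment;

  TableEnvironment tEnv = TableEnvironment.create(
      EnvironmentSettings.newInstance().inStreamingMode().build());

  // The rolling-policy options bound the size and age of each part file.
  tEnv.executeSql(
      "CREATE TABLE fs_sink (" +
      "  topic STRING," +
      "  payload STRING," +
      "  dt STRING" +
      ") PARTITIONED BY (dt) WITH (" +
      "  'connector' = 'filesystem'," +
      "  'path' = 'gs://bucket/output'," +
      "  'format' = 'json'," +
      "  'sink.rolling-policy.file-size' = '128MB'," +
      "  'sink.rolling-policy.rollover-interval' = '15 min'" +
      ")");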


On Wed, Jun 24, 2020 at 1:53 PM Alexander Filipchik <afilipc...@gmail.com>
wrote:

> Hello!
>
> We are working on a Flink streaming job that reads data from multiple
> Kafka topics and writes it to DFS. We are using StreamingFileSink with
> a custom implementation for the GCS FS, and it generates a lot of
> files because the streams are partitioned among multiple JMs. Ideally
> we should have at most one file per Kafka topic per interval. We also
> have some heavy topics and some pretty light ones, so the solution
> should be smart enough to utilize resources efficiently.
>
> I was thinking we could partition based on how much data was ingested
> in the last minute or so, to make sure messages from the same topic
> are routed to the same file (or a minimal number of files) if there
> are enough resources to do so. Think bin packing.
>
> Is this a good idea? Is there a built-in way to achieve it? If not,
> is there a way to push state into the partitioner (or even into the
> Kafka client, to repartition in the source)? I was thinking I could
> run a side stream that calculates data volumes and then broadcast it
> into the main stream so the partitioner can make a decision, but that
> feels a bit complex.
>
> Another way would be to modify the Kafka client to track messages per
> topic and make the decision at that layer.
>
> Am I on the right path?
>
> Thank you
>
