Hello Flink Community,

When designing our data pipelines, we very often encounter the requirement
to stream traffic (usually from Kafka) to an external distributed file
system (usually HDFS or S3). This data is typically meant to be queried
from Hive/Presto or similar tools, and preferably it sits in a columnar
format like Parquet.

Currently, Flink's StreamingFileSink lets us achieve what we want to some
extent. It satisfies our requirements to partition data by event time and
to ensure exactly-once semantics and fault tolerance with checkpointing.
Unfortunately, using a bulk writer like ParquetWriter comes at the price of
producing a large number of small files, which degrades query performance.

I believe that many companies struggle with similar use cases, and I know
that some of them have already approached this problem; solutions like
Alibaba's Hologres or Netflix's Iceberg-based solution described during
Flink Forward 2019 emerged. Given that a full transition to a real-time
data warehouse may take a significant amount of time and effort, I would
like to primarily focus on solutions for tools like Hive/Presto backed by a
distributed file system, since those are usually the systems we integrate
with.

So what options do we have? Maybe I have missed some existing open-source
tool?

Currently, I can come up with two approaches using Flink exclusively:
1. Cache incoming traffic in Flink state until a trigger fires according to
the rolling strategy, probably with some special handling for late events,
and then output the data with StreamingFileSink (see the first sketch after
this list). This solution is not perfect, as it may introduce additional
latency, and queries will still be less performant than over fully
compacted files (the late-events problem). The biggest issue I am afraid
of, though, is a performance drop while releasing data from Flink state,
and the peaky character of that release.
2. Focus on implementing a batch rewrite job that compacts data offline
(see the second sketch after this list). The source for the job could be
either Kafka or the small files produced by another job that uses a plain
StreamingFileSink. The drawback is that the whole system gets more complex
and needs additional maintenance, and, maybe more troubling, we enter the
batch world again (how do we know that no more late data will arrive and we
can safely run the job?).
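For option 1, a rough sketch of what I have in mind: buffer per partition
key in an event-time window so that each firing releases one large batch
(the key selector and the lateness bound are illustrative):

import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

events
    .keyBy(e -> e.getPartitionKey())
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .allowedLateness(Time.minutes(30))  // crude late-event handling
    .process(new ProcessWindowFunction<Event, Event, String, TimeWindow>() {
        @Override
        public void process(String key, Context ctx,
                            Iterable<Event> buffered, Collector<Event> out) {
            // the whole buffer is released at once when the window fires --
            // this is the peaky state-release behaviour I worry about
            for (Event e : buffered) {
                out.collect(e);
            }
        }
    })
    .addSink(sink);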
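And for option 2, a hypothetical compactor for one closed hourly partition,
reading all the small files and rewriting them as one large file with plain
parquet-avro (the paths and the part-file filter are illustrative; this
could equally be wrapped in a scheduled batch job):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

Configuration conf = new Configuration();
Path partition = new Path("hdfs:///warehouse/events/2019-12-01--13");
FileSystem fs = partition.getFileSystem(conf);

try (ParquetWriter<GenericRecord> writer =
         AvroParquetWriter.<GenericRecord>builder(
                 new Path(partition, "compacted.parquet"))
             .withSchema(Event.getClassSchema())  // Avro-generated schema
             .withConf(conf)
             .build()) {
    for (FileStatus status :
             fs.listStatus(partition, p -> p.getName().startsWith("part-"))) {
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(status.getPath())
                     .withConf(conf)
                     .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                writer.write(record);
            }
        }
    }
}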

I would really love to hear the community's thoughts on this.

Kind regards
Marek
