Flink's FileSource enumerates the files and tracks reading progress for
each individual file in parallel. Depending on the format you use,
progress is tracked at a different level of granularity (TextLine is the
simplest one: it tracks progress by the number of lines already processed
in each file). In case of failure, the source picks up where it left off.
File removal is trickier - the easiest way to achieve it would be to place
tombstones at the end of the files and handle them in user code; both
ideas are sketched below.
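For illustration, here is a minimal sketch of the continuously-monitoring
pipeline (assuming a recent Flink, 1.15 or later, with flink-connector-files
and flink-connector-kafka on the classpath; the directory, topic name, and
broker address are placeholders you would replace with your own):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.time.Duration;

public class FilesToKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what lets the source resume from the tracked
        // per-file progress after a restart.
        env.enableCheckpointing(30_000);

        // Continuously watch the directory for new files.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(
                        new TextLineInputFormat(), new Path("/data/incoming"))
                .monitorContinuously(Duration.ofSeconds(10))
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("file-contents")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
           .sinkTo(sink);

        env.execute("files-to-kafka");
    }
}

This also answers the parallelism question: file splits are distributed
across the source subtasks, and each subtask checkpoints its own offsets
within the files it was assigned.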
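And a rough sketch of the tombstone idea. It assumes a convention of your
own making - say, the producer appends a final line "__EOF__:<fileName>" to
each file before dropping it into the watched directory (the prefix and
paths here are hypothetical). TextLineInputFormat is not splittable, so a
file is read end to end by one subtask and the tombstone arrives after all
of the file's payload lines:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class TombstoneFilter implements FlatMapFunction<String, String> {
    // Hypothetical convention: the last line of each file is
    // "__EOF__:<fileName>".
    private static final String TOMBSTONE_PREFIX = "__EOF__:";

    @Override
    public void flatMap(String line, Collector<String> out) throws Exception {
        if (line.startsWith(TOMBSTONE_PREFIX)) {
            String fileName = line.substring(TOMBSTONE_PREFIX.length());
            Path src = Paths.get("/data/incoming", fileName);
            // The tombstone may be replayed after a restart, so tolerate
            // a file that was already moved.
            if (Files.exists(src)) {
                Files.move(src, Paths.get("/data/processed", fileName),
                        StandardCopyOption.ATOMIC_MOVE);
            }
        } else {
            // Ordinary payload line - forward it to the Kafka sink.
            out.collect(line);
        }
    }
}

You would apply this between the source and the sink
(stream.flatMap(new TombstoneFilter())). Note that with parallelism > 1
the subtask that sees the tombstone must be able to reach the file, so
this only works on a filesystem shared by all TaskManagers.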

Best,
Alexander Fedulov

On Thu, 26 Oct 2023 at 18:17, arjun s <arjunjoice...@gmail.com> wrote:

> Hello team,
> I'm currently in the process of configuring a Flink job. This job entails
> reading files from a specified directory and then transmitting the data to
> a Kafka sink. I've already successfully designed a Flink job that reads the
> file contents in a streaming manner and effectively sends them to Kafka.
> However, my specific requirement is a bit more intricate. I need the job to
> not only read these files and push the data to Kafka but also relocate the
> processed file to a different directory once all of its contents have been
> processed. Following this, the job should seamlessly transition to
> processing the next file in the source directory. Additionally, I have some
> concerns regarding how the job will behave if it encounters a restart.
> Could you please advise if this is achievable, and if so, provide guidance
> or code to implement it?
>
> I'm also quite interested in how the job will handle situations where the
> source has a parallelism greater than 2 or 3, and how it can accurately
> monitor the completion of reading all contents in each file.
>
> Thanks and Regards,
> Arjun
>
