Hi,
I have a Beam pipeline on the Flink runner that reads from Kafka, applies a 10-minute
fixed window, and writes the data to S3 in Parquet format. It runs fine when everything
goes well. However, if the pipeline stops running for an hour or two due to some issue,
then when it resumes from the latest Flink checkpoint it reads from Kafka at a very high
rate to catch up and tries to dump all of that data to S3 in Parquet. Because the amount
of data processed in a single 10-minute window becomes huge while catching up on the lag,
the job fails with an out-of-memory error and never manages to run successfully with my
current resources.
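
For context, here is roughly the shape of the pipeline as a simplified sketch; the broker,
topic, schema, S3 path, and the record-conversion step are placeholders, not my real code:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class KafkaToParquetSketch {

  // Placeholder schema; the real job uses the schema of the Kafka payload.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Event\","
          + "\"fields\":[{\"name\":\"payload\",\"type\":\"string\"}]}";

  /** Wraps each Kafka value in a GenericRecord (placeholder conversion logic). */
  static class ToGenericRecordFn extends DoFn<KV<String, String>, GenericRecord> {
    private transient Schema schema;

    @Setup
    public void setup() {
      schema = new Schema.Parser().parse(SCHEMA_JSON);
    }

    @ProcessElement
    public void processElement(@Element KV<String, String> kv, OutputReceiver<GenericRecord> out) {
      GenericData.Record record = new GenericData.Record(schema);
      record.put("payload", kv.getValue());
      out.output(record);
    }
  }

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // Unbounded read from Kafka (broker and topic are placeholders).
        .apply("ReadKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")
                .withTopic("events")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        // 10-minute fixed windows, as in the real job.
        .apply("Window10Min",
            Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply("ToGenericRecord", ParDo.of(new ToGenericRecordFn()))
        .setCoder(AvroCoder.of(GenericRecord.class, schema))
        // Windowed Parquet write to S3 (bucket/prefix is a placeholder).
        .apply("WriteParquet",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to("s3://my-bucket/output/")
                .withSuffix(".parquet")
                .withNumShards(5));

    pipeline.run();
  }
}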
I checked and found that there is a Beam property called maxBundleSize (on the Flink
runner options) that controls the maximum size of a bundle, but I did not find any
property to control the number of bundles processed within the window interval.
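
For reference, my understanding is that maxBundleSize can be set on the Flink runner
options like this (the value here is just an example):

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BundleOptionsSketch {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    // Caps the number of elements per bundle; can also be passed as --maxBundleSize=1000.
    options.setMaxBundleSize(1000L);
    // I did not find a corresponding option that caps the total number of
    // bundles/records processed within one window interval.
  }
}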
I wanted to check whether there is any way to limit the number of records processed
within a window interval.
Thanks,
Sandeep