It looks like Flink's default behavior is to restart all operators when a single
operator fails - in my case, a Kafka producer timing out. When this
happens, the logs show that all operators are restarted. This essentially leads to
data loss. In my case the volume of data is so high that it is becoming very
expensive to checkpoint. I was wondering whether Flink has a lifecycle hook that
would let me force a checkpoint before the operators are restarted. That would
solve a dire production issue for us.