It looks like Flink's default behavior is to restart all operators when a
single operator fails; in my case, the failure is a Kafka producer timing out.
When this happens, the logs show every operator being restarted, which
essentially leads to data loss. Our data volume is so high that checkpointing
is becoming very expensive. Does Flink have a lifecycle hook where I could
trigger a forced checkpoint before the operators are restarted? That would
solve a dire production issue for us.

-- Ashish
