Hi Stephan - I agree that the savepoint-shutdown-restart model is nominally the same as a rolling restart, with one notable exception: a lack of atomicity. There is a gap between invoking the savepoint command and the shutdown command. My application isn't fortunate enough to have idempotent operations: replaying events ends up double-counting. With the current model (or at least as far as I can tell from the documentation you linked), I will double-process any events that are handled shortly after the savepoint but before the shutdown.
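To make the race window concrete, here is a rough sketch using the Flink CLI (the job ID and savepoint directory are placeholders, and the exact flags may differ by Flink version - this is illustrative, not a prescription):

```shell
# Step 1: trigger a savepoint; suppose the job has processed up to event N
flink savepoint <jobId> hdfs:///flink/savepoints

# ...the job keeps running and processes events N+1, N+2, ... in this gap...

# Step 2: shut the job down; by now it has reached some offset M > N
flink cancel <jobId>

# On restart from the savepoint, events N+1 .. M are replayed.
# With non-idempotent operations, those events are double-counted.
```

An atomic cancel-with-savepoint would collapse these two steps into one, so the savepoint always reflects the exact last-processed state.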
One thing that could alleviate this is an atomic shutdown-with-savepoint (or savepoint-with-shutdown; I'm not picky about the order, I only want it to be atomic). With this, I can be assured that the savepoint matches the actual last-processed state. My understanding of the processing within Flink is that this could be modeled by injecting a "savepoint" event followed by a "shutdown" event into the event stream, but my understanding is a bit cartoonish so I'm sure it's more involved.

Ron
—
Ron Crocker
Principal Engineer & Architect ( ( •)) New Relic
rcroc...@newrelic.com
M: +1 630 363 8835

> On Dec 20, 2016, at 12:40 PM, Stephan Ewen <se...@apache.org> wrote:
>
> Hi Andrew!
>
> Would be great to know if what Aljoscha described works for you. Ideally,
> this costs no more than a failure/recovery cycle, which one typically also
> gets with rolling upgrades.
>
> Best,
> Stephan