The issue is that we are trying to avoid checkpointing: our datasets are heavy, and in a few of our apps all of the state is transient (flushed within a few seconds). The high volume/velocity and the transient nature of the state make those apps good candidates for running without checkpoints at all.
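With checkpointing disabled, the Flink Kafka consumer falls back to the Kafka client's own periodic auto-commit for offsets, driven by consumer properties along these lines (the interval value here is illustrative, not our actual setting):

```properties
# Commit consumed offsets back to Kafka on the client's own timer,
# since there is no checkpoint to hook offset commits into.
enable.auto.commit=true
auto.commit.interval.ms=5000
```

Those committed offsets are only a best-effort restore position, which is consistent with our tolerance for small gaps/duplicates.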
We do have offsets committed to Kafka, AND we have "some" tolerance for gaps/duplicates. However, we do want to handle "graceful" restarts/shutdowns. For shutdown, we have been taking savepoints (which works great), but for restart we just can't find a way.
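For reference, the shutdown path that works for us is the CLI's cancel-with-savepoint; the job id, jar name, and savepoint directory below are placeholders:

```
# Trigger a savepoint and cancel the job in one step
flink cancel -s hdfs:///savepoints/my-app <jobId>

# On the next deploy, resume from the savepoint that was written
flink run -s <savepointPath> my-app.jar
```

The missing piece is the equivalent for an unplanned restart, where there is no opportunity to trigger a savepoint first.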
Bottom line - we are trading off resiliency for resource utilization and performance, but we would like to harden the apps for production deployments as much as possible.

Hope that makes sense.
> On Mar 6, 2018, at 10:19 PM, Tzu-Li Tai <tzuli...@gmail.com> wrote:
> Hi Ashish,
> Could you elaborate a bit more on why you think the restart of all operators
> lead to data loss?
> When a restart occurs, Flink will restart the job from the latest completed
> checkpoint. All operator state will be reloaded from the state written in that
> checkpoint, and the position of the input stream will also be rewound.
> I don't think there is a way to force a checkpoint before restarting occurs,
> but as I mentioned, that should not be required, because the last complete
> checkpoint will be used.
> Am I missing something in your particular setup?