Thanks for sharing, Kien. That sounds like the logical behavior, but it is good to hear it confirmed by your experience.
--
Christophe

On Sat, Feb 3, 2018 at 7:25 AM, Kien Truong <duckientru...@gmail.com> wrote:
> On Feb 3, 2018, at 10:48, Kien Truong <duckientru...@gmail.com> wrote:
>>
>> Hi,
>>
>> Speaking from my experience, if the distributed disk fails, the checkpoint
>> will fail as well, but the job will continue running. The checkpoint
>> scheduler will keep running, so the first scheduled checkpoint after you
>> repair your disk should succeed.
>>
>> Of course, if you also write to the distributed disk inside your job,
>> then your job may crash too, but this is unrelated to the checkpoint
>> process.
>>
>> Best regards,
>> Kien
>>
>> On Feb 2, 2018, at 23:30, Christophe Jolif <cjo...@gmail.com> wrote:
>>>
>>> If I understand well, RocksDB uses two disks: the TaskManager's local
>>> disk for "local storage" of the state, and the distributed disk for
>>> checkpointing.
>>>
>>> Two questions:
>>>
>>> - if I have 3 TaskManagers, should I expect to find roughly (depending on
>>> how the tasks are balanced) a third of my overall state stored on disk
>>> on each of these TaskManager nodes?
>>>
>>> - if the local node/disk fails, I will get the state back from the
>>> distributed disk, things will start again, and all is fine. However, what
>>> happens if the distributed disk fails? Will Flink continue processing,
>>> waiting for me to mount a new distributed disk? Or will it stop? Could I
>>> lose data or reprocess things under that condition?
>>>
>>> --
>>> Christophe Jolif
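For readers finding this thread later: the two storage locations discussed above correspond to settings on Flink's RocksDB state backend. A minimal configuration sketch (the HDFS and local paths below are hypothetical placeholders, not from this thread):

```yaml
# flink-conf.yaml (sketch)

# Use RocksDB for keyed/operator state.
state.backend: rocksdb

# The "distributed disk": durable storage where checkpoints are written.
# If this becomes unavailable, checkpoints fail but the job keeps running,
# as Kien describes above.
state.checkpoints.dir: hdfs:///flink/checkpoints

# The "local storage": directory on each TaskManager where RocksDB keeps
# its working state files.
state.backend.rocksdb.localdir: /mnt/local-ssd/flink/rocksdb
```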