Sent from TypeApp

On Feb 3, 2018, 10:48, Kien Truong <duckientru...@gmail.com> wrote:
>Hi,
>
>Speaking from my experience, if the distributed disk fails, the
>checkpoint will fail as well, but the job will continue running. The
>checkpoint scheduler will keep running, so the first scheduled
>checkpoint after you repair your disk should succeed.
>
>Of course, if you also write to the distributed disk inside your job,
>then your job may crash too, but this is unrelated to the checkpoint
>process.
>
>Best regards,
>Kien
>
>On Feb 2, 2018, 23:30, Christophe Jolif <cjo...@gmail.com> wrote:
>>If I understand correctly, RocksDB uses two disks: the TaskManager's
>>local disk for "local storage" of the state, and the distributed disk
>>for checkpointing.
>>
>>Two questions:
>>
>>- If I have 3 TaskManagers, should I expect to find roughly a third of
>>my overall state stored on the local disk of each TaskManager node
>>(depending on how the tasks are balanced)?
>>
>>- If the local node/disk fails, I will get the state back from the
>>distributed disk, processing will restart, and all is fine. But what
>>happens if the distributed disk fails? Will Flink continue processing
>>while waiting for me to mount a new distributed disk, or will it stop?
>>Might I lose data or reprocess things under that condition?
>>
>>--
>>Christophe Jolif
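For reference, the setup discussed above corresponds to configuring Flink's RocksDB state backend with a local working directory on each TaskManager and a checkpoint directory on distributed storage. A minimal `flink-conf.yaml` sketch; the paths are placeholders, not values from this thread:

```yaml
# Keep operator state in RocksDB on each TaskManager's local disk
state.backend: rocksdb

# Local directory where RocksDB stores its working files
# (hypothetical path; adjust to your environment)
state.backend.rocksdb.localdir: /mnt/local-ssd/flink/rocksdb

# Distributed storage (e.g. HDFS) that receives checkpoints; as Kien
# notes, if this file system is unavailable the checkpoints fail but
# the job itself keeps running
state.checkpoints.dir: hdfs://namenode:8020/flink/checkpoints
```

With this layout, losing a TaskManager's local disk means the state is restored from the latest completed checkpoint on the distributed storage, while losing the distributed storage only fails subsequent checkpoints until it is repaired.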