> On 15 Feb 2016, at 13:40, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
>
> Hi Ufuk, thanks for replying.
>
> Regarding the masters file: yes, I've specified all the masters and checked
> out that they were actually running after the start-cluster.sh. I'll gladly
> share the logs as soon as I get to see them.
>
> Regarding the state backend: how does having a non-distributed storage as the
> state backend influence the HA features? I thought it would have meant that
> the job state couldn't be restored but the job itself could've been started
> after the backup job manager started. Does not having a reliable distributed
> storage service as the state backend mean that the HA features don't work?

No, the submitted job is also stored in the state backend and is recovered from there. ZooKeeper holds a pointer to the state handle in the configured backend. Since all job managers run on the same host, it should work as you expected. The requirement is that all job managers can access the state backend.

Recovery from a job manager failure is actually independent of the execution retries right now.

I think as soon as we have a look at the logs, we will figure it out. ;)

– Ufuk