Hi Gwenhaël,

Good to hear that you could resolve the problem.

When you run multiple HA Flink jobs in the same cluster, you don’t
have to adjust the configuration of Flink. It should work out of the box.

However, if you run multiple HA Flink clusters, then you have to set a
distinct ZooKeeper root path for each cluster via the option
recovery.zookeeper.path.root in the Flink configuration. This is necessary
because otherwise all JobManagers (the ones of the different clusters) will
compete for a single leadership. Furthermore, all TaskManagers will only
see the one and only leader and connect to it. The reason is that the
TaskManagers will look up their leader at a ZNode below the ZooKeeper root
path.
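
As a rough sketch, two clusters sharing the same ZooKeeper quorum could be
configured as follows. The root path values /flink-cluster-a and
/flink-cluster-b are made up for illustration, and the quorum hosts are
simply the ones from your configuration quoted below. Any two distinct paths
should do.

```yaml
# flink configuration of the first cluster (root path is a hypothetical example)
recovery.mode: zookeeper
recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
recovery.zookeeper.path.root: /flink-cluster-a

# flink configuration of the second cluster (root path is a hypothetical example)
recovery.mode: zookeeper
recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
recovery.zookeeper.path.root: /flink-cluster-b
```

With distinct root paths, each cluster's JobManagers and TaskManagers only
look at the ZNodes below their own root, so the leader election of one
cluster does not interfere with the other.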

If you have other questions, don’t hesitate to ask me.

Cheers,
Till

On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <
gwenhael.pasqui...@ericsson.com> wrote:

> Never mind,
>
>
>
> Looking at the logs I saw that it was having issues trying to connect to
> ZK.
>
> To make it short, I had the wrong port.
>
>
>
> It is now starting.
>
>
>
> Tomorrow I’ll try to kill some JobManagers *evil*.
>
>
>
> Another question: if I have multiple HA Flink jobs, are there some points
> to check in order to be sure that they won’t collide on HDFS or ZK?
>
>
>
> B.R.
>
>
>
> Gwenhaël PASQUIERS
>
>
>
> *From:* Till Rohrmann [mailto:till.rohrm...@gmail.com]
> *Sent:* Wednesday, November 18, 2015 18:01
> *To:* user@flink.apache.org
> *Subject:* Re: YARN High Availability
>
>
>
> Hi Gwenhaël,
>
>
>
> do you have access to the yarn logs?
>
>
>
> Cheers,
>
> Till
>
>
>
> On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <
> gwenhael.pasqui...@ericsson.com> wrote:
>
> Hello,
>
>
>
> We’re trying to set up high availability using an existing zookeeper
> quorum already running in our Cloudera cluster.
>
>
>
> So, as per the docs, we’ve changed the max attempts in YARN’s config as well
> as the flink.yaml.
>
>
>
> recovery.mode: zookeeper
>
> recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
>
> state.backend: filesystem
>
> state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
>
> recovery.zookeeper.storageDir: hdfs:///flink/recovery/
>
> yarn.application-attempts: 1000
>
>
>
> Everything is OK as long as recovery.mode is commented out.
>
> As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
>
>
>
> “Deploying cluster, current state ACCEPTED”.
>
> “Deployment took more than 60 seconds….”
>
> Every second.
>
>
>
> And I have more than enough resources available on my yarn cluster.
>
>
>
> Do you have any idea of what could cause this, and/or what logs I should
> look at in order to understand?
>
>
>
> B.R.
>
>
>
> Gwenhaël PASQUIERS
>
>
>
