Maybe we could add a user parameter to specify a cluster name that is used
to make the paths unique.
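
Just to sketch what I mean (the key name below is made up purely for
illustration, it does not exist today; currently only
recovery.zookeeper.path.root is configurable):

# hypothetical parameter: a user-chosen cluster name that is appended to the
# ZooKeeper root path (and the HDFS recovery paths) so that two clusters
# never share ZNodes or state handles
recovery.cluster-id: my-streaming-cluster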

On Thu, Nov 19, 2015, 11:24 Till Rohrmann <trohrm...@apache.org> wrote:

> I agree that this would make the configuration easier. However, it also
> entails that the user has to retrieve the randomized path from the logs if
> he wants to restart jobs after the cluster has crashed or been intentionally
> restarted. Furthermore, the system won't be able to clean up old checkpoint
> and job handles in case the cluster stop was intentional.
>
> Thus, the question is how we define the behaviour for retrieving handles
> and for cleaning up old ones so that ZooKeeper won't be cluttered with
> stale handles.
>
> There are basically two modes (sketched below):
>
> 1. Keep state handles when shutting down the cluster. Provide a means to
> define a fixed path when starting the cluster and also a means to purge old
> state handles. Furthermore, add a shutdown mode where the handles under the
> current path are removed directly. This mode would guarantee that the state
> handles are always available unless explicitly told otherwise. However,
> the downside is that ZooKeeper will most certainly become cluttered.
>
> 2. Remove the state handles when shutting down the cluster. Provide a
> shutdown mode where we keep the state handles. This keeps ZooKeeper
> clean but still gives you the possibility to keep a checkpoint around if
> necessary. However, the user is more likely to lose his state when shutting
> down the cluster.
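>
> To make the two modes concrete, here is a rough, purely illustrative view
> of what lives under the ZooKeeper root path (the exact ZNode names may
> differ from the actual implementation):
>
> /flink                     # recovery.zookeeper.path.root
>   /leader                  # leader election / leader address
>   /jobgraphs/<job-id>      # handles to submitted job graphs
>   /checkpoints/<job-id>    # handles to completed checkpoints
>
> Mode 1 would keep the /jobgraphs and /checkpoints subtrees across a normal
> shutdown unless a purge is requested, while mode 2 would delete them on
> shutdown unless the user explicitly asks to keep them.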
>
> On Thu, Nov 19, 2015 at 10:55 AM, Robert Metzger <rmetz...@apache.org>
> wrote:
>
>> I agree with Aljoscha. Many companies install Flink (and its config) in a
>> central directory and users share that installation.
>>
>> On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljos...@apache.org>
>> wrote:
>>
>>> I think we should find a way to randomize the paths where the HA stuff
>>> stores data. If users don’t realize that they store data in the same
>>> paths, this could lead to problems.
>>>
>>> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrm...@apache.org> wrote:
>>> >
>>> > Hi Gwenhaël,
>>> >
>>> > good to hear that you could resolve the problem.
>>> >
>>> > When you run multiple HA Flink jobs in the same cluster, you don’t
>>> have to adjust the configuration of Flink. It should work out of the
>>> box.
>>> >
>>> > However, if you run multiple HA Flink clusters, then you have to set a
>>> distinct ZooKeeper root path for each cluster via the option
>>> recovery.zookeeper.path.root in the Flink configuration. This is necessary
>>> because otherwise all JobManagers (those of the different clusters) will
>>> compete for a single leadership. Furthermore, all TaskManagers will only
>>> see that one leader and connect to it. The reason is that the
>>> TaskManagers look up their leader at a ZNode below the ZooKeeper root
>>> path.
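>>> >
>>> > For example (the root paths below are freely chosen, purely for
>>> illustration):
>>> >
>>> > # flink-conf.yaml of cluster A
>>> > recovery.zookeeper.path.root: /flink-cluster-a
>>> >
>>> > # flink-conf.yaml of cluster B
>>> > recovery.zookeeper.path.root: /flink-cluster-b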
>>> >
>>> > If you have other questions, then don’t hesitate to ask.
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> >
>>> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <
>>> gwenhael.pasqui...@ericsson.com> wrote:
>>> > Nevermind,
>>> >
>>> >
>>> >
>>> > Looking at the logs I saw that it was having issues trying to connect
>>> to ZK.
>>> >
>>> > To make it short, it had the wrong port.
>>> >
>>> >
>>> >
>>> > It is now starting.
>>> >
>>> >
>>> >
>>> > Tomorrow I’ll try to kill some JobManagers *evil*.
>>> >
>>> >
>>> >
>>> > Another question: if I have multiple HA Flink jobs, are there some
>>> points to check in order to be sure that they won’t collide on HDFS or ZK?
>>> >
>>> >
>>> >
>>> > B.R.
>>> >
>>> >
>>> >
>>> > Gwenhaël PASQUIERS
>>> >
>>> >
>>> >
>>> > From: Till Rohrmann [mailto:till.rohrm...@gmail.com]
>>> > Sent: mercredi 18 novembre 2015 18:01
>>> > To: user@flink.apache.org
>>> > Subject: Re: YARN High Availability
>>> >
>>> >
>>> >
>>> > Hi Gwenhaël,
>>> >
>>> >
>>> >
>>> > do you have access to the yarn logs?
>>> >
>>> >
>>> >
>>> > Cheers,
>>> >
>>> > Till
>>> >
>>> >
>>> >
>>> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <
>>> gwenhael.pasqui...@ericsson.com> wrote:
>>> >
>>> > Hello,
>>> >
>>> >
>>> >
>>> > We’re trying to set up high availability using an existing zookeeper
>>> quorum already running in our Cloudera cluster.
>>> >
>>> >
>>> >
>>> > So, as per the doc, we’ve changed the max attempts in YARN’s config as
>>> well as in the flink.yaml.
>>> >
>>> >
>>> >
>>> > recovery.mode: zookeeper
>>> >
>>> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
>>> >
>>> > state.backend: filesystem
>>> >
>>> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
>>> >
>>> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
>>> >
>>> > yarn.application-attempts: 1000
>>> >
>>> >
>>> >
>>> > Everything is OK as long as recovery.mode is commented out.
>>> >
>>> > As soon as I uncomment recovery.mode, the deployment on YARN is stuck
>>> on:
>>> >
>>> >
>>> >
>>> > “Deploying cluster, current state ACCEPTED”.
>>> >
>>> > “Deployment took more than 60 seconds….”
>>> >
>>> > Every second.
>>> >
>>> >
>>> >
>>> > And I have more than enough resources available on my YARN cluster.
>>> >
>>> >
>>> >
>>> > Do you have any idea of what could cause this, and/or which logs I
>>> should look at in order to understand?
>>> >
>>> >
>>> >
>>> > B.R.
>>> >
>>> >
>>> >
>>> > Gwenhaël PASQUIERS
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>
>
