Re: Flink Job cluster in HA mode - recovery vs upgrade

Chesnay Schepler Fri, 21 Aug 2020 12:16:41 -0700

The HaServices are only given the JobGraph, so this is not possible.

Actually I have to correct myself. For a job cluster the state in HA should be irrelevant when you're submitting another jar. Flink has no way of knowing that this jar is in any way connected to the previous job; they will be treated as separate things.

However, you will likely end up with stale data in ZooKeeper (the JobGraph of the failed job).


On 21/08/2020 17:51, Alexey Trenikhun wrote:

Is it feasible to override ZooKeeperHaServices to recreate the JobGraph from the jar instead of reading it from ZK state? Any hints? I have a feeling that reading the JobGraph from the jar is a more resilient approach, with less chance of mistakes during an upgrade.


Thanks,
Alexey

------------------------------------------------------------------------
*From:* Piotr Nowojski <[email protected]>
*Sent:* Thursday, August 20, 2020 7:04 AM
*To:* Chesnay Schepler <[email protected]>

*Cc:* Alexey Trenikhun <[email protected]>; Flink User Mail List <[email protected]>

*Subject:* Re: Flink Job cluster in HA mode - recovery vs upgrade

Thank you for the clarification, Chesnay, and sorry for the incorrect previous answer.


Piotrek

On Thu, 20 Aug 2020 at 15:59, Chesnay Schepler <[email protected]> wrote:


    This is incorrect; we do store the JobGraph in ZooKeeper. If you
    just delete the deployment the cluster will recover the previous
    JobGraph (assuming you aren't changing the Zookeeper configuration).

    If you wish to update the job, then you should cancel it (along
    with creating a savepoint), which will clear the Zookeeper state,
    and then create a new deployment.
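
A minimal sketch of the cancel-with-savepoint step described above, assuming the JobManager's REST endpoint is reachable and that the savepoint trigger (`POST /jobs/:jobid/savepoints` with `cancel-job` set) and its status response have the shape shown here. The base URL, job ID, and target directory are hypothetical placeholders, and the helper names are made up for illustration; this is not taken from the thread.

```python
import json
import urllib.request


def build_cancel_with_savepoint_request(base_url, job_id, target_dir):
    """Build the POST request that asks Flink to take a savepoint and cancel the job."""
    url = f"{base_url.rstrip('/')}/jobs/{job_id}/savepoints"
    body = json.dumps({"target-directory": target_dir, "cancel-job": True})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_cancel_with_savepoint(base_url, job_id, target_dir):
    """Send the request and return the trigger ID used to poll for completion."""
    request = build_cancel_with_savepoint_request(base_url, job_id, target_dir)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["request-id"]


def savepoint_location(status_response):
    """Return the savepoint path once the operation has completed, else None.

    `status_response` is the parsed JSON from
    GET /jobs/:jobid/savepoints/:triggerid (assumed response shape).
    """
    if status_response.get("status", {}).get("id") != "COMPLETED":
        return None
    return status_response.get("operation", {}).get("location")
```

Once a savepoint location is returned, the new deployment would be created from the newer image and pointed at that savepoint, for example through the job cluster entrypoint's savepoint argument.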

    On 20/08/2020 15:43, Piotr Nowojski wrote:

    Hi Alexey,

    I might be wrong (I don't know this side of Flink very well), but
    as far as I know JobGraph is never stored in the ZK. It's always
    recreated from the job's JAR. So you should be able to upgrade
    the job by replacing the JAR with a newer version, as long as the
    operator UIDs are the same before and after the upgrade (for
    operator state to match before and after the upgrade).

    Best, Piotrek

    On Thu, 20 Aug 2020 at 06:34, Alexey Trenikhun <[email protected]> wrote:

        Hello,

        Let's say I run a Flink job cluster with persistent storage and
        ZooKeeper HA on k8s with a single JobManager and use
        externalized checkpoints. When the JM crashes, k8s will restart
        the JM pod, and the JM will read the JobId and JobGraph from ZK
        and restore from the latest checkpoint. Now let's say I want to
        upgrade the job binary: I delete the deployments and create new
        deployments referring to a newer image. Will the JM still read
        the JobGraph from ZK, or will it create a new one from the new
        job jar?

        Thanks,
        Alexey
