AFAIK there's currently nothing implemented to solve this problem, but working on a possible fix can be implemented on top of https://github.com/lyft/flinkk8soperator which already has a pretty fancy state machine for rolling upgrades. I'd love to be involved as this is an issue I've been thinking about as well.
Yuval On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com> wrote: > hi all--we've run into a gap (knowledge? design? tbd?) for our use cases > when deploying Flink jobs to start from savepoints using the job-cluster > mode in Kubernetes. > > we're running a ~15 different jobs, all in job-cluster mode, using a mix > of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are > all long-running streaming jobs, all essentially acting as microservices. > we're using Helm charts to configure all of our deployments. > > we have a number of use cases where we want to restart jobs from a > savepoint to replay recent events, i.e. when we've enhanced the job logic > or fixed a bug. but after the deployment we want to have the job resume > it's "long-running" behavior, where any unplanned restarts resume from the > latest checkpoint. > > the issue we run into is that any obvious/standard/idiomatic Kubernetes > deployment includes the savepoint argument in the configuration. if the Job > Manager container(s) have an unplanned restart, when they come back up they > will start from the savepoint instead of resuming from the latest > checkpoint. everything is working as configured, but that's not exactly > what we want. we want the savepoint argument to be transient somehow (only > used during the initial deployment), but Kubernetes doesn't really support > the concept of transient configuration. > > i can see a couple of potential solutions that either involve custom code > in the jobs or custom logic in the container (i.e. a custom entrypoint > script that records that the configured savepoint has already been used in > a file on a persistent volume or GCS, and potentially when/why/by which > deployment). but these seem like unexpected and hacky solutions. before we > head down that road i wanted to ask: > > - is this is already a solved problem that i've missed? > - is this issue already on the community's radar? > > thanks in advance! > > -- > *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865 > 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305 > <http://www.bettercloud.com> <http://www.bettercloud.com> > *Altitude 2019 in San Francisco | Sept. 23 - 25* > It’s not just an IT conference, it’s “a complete learning and networking > experience” > <https://altitude.bettercloud.com/?utm_source=gmail&utm_medium=signature&utm_campaign=2019-altitude> > > -- Best Regards, Yuval Itzchakov.