AFAIK there's currently nothing implemented to solve this problem, but a
possible fix could be implemented on top of
https://github.com/lyft/flinkk8soperator, which already has a pretty fancy
state machine for rolling upgrades. I'd love to be involved, as this is an
issue I've been thinking about as well.

Yuval

On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com>
wrote:

> hi all--we've run into a gap (knowledge? design? tbd?) for our use cases
> when deploying Flink jobs to start from savepoints using the job-cluster
> mode in Kubernetes.
>
> we're running ~15 different jobs, all in job-cluster mode, using a mix
> of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are
> all long-running streaming jobs, all essentially acting as microservices.
> we're using Helm charts to configure all of our deployments.
>
> we have a number of use cases where we want to restart jobs from a
> savepoint to replay recent events, e.g. when we've enhanced the job logic
> or fixed a bug. but after the deployment we want to have the job resume
> its "long-running" behavior, where any unplanned restarts resume from the
> latest checkpoint.
>
> the issue we run into is that any obvious/standard/idiomatic Kubernetes
> deployment includes the savepoint argument in the configuration. if the Job
> Manager container(s) have an unplanned restart, when they come back up they
> will start from the savepoint instead of resuming from the latest
> checkpoint. everything is working as configured, but that's not exactly
> what we want. we want the savepoint argument to be transient somehow (only
> used during the initial deployment), but Kubernetes doesn't really support
> the concept of transient configuration.
>
> i can see a couple of potential solutions that either involve custom code
> in the jobs or custom logic in the container (e.g. a custom entrypoint
> script that records that the configured savepoint has already been used in
> a file on a persistent volume or GCS, and potentially when/why/by which
> deployment). but these seem like unexpected and hacky solutions. before we
> head down that road i wanted to ask:
>
>    - is this already a solved problem that i've missed?
>    - is this issue already on the community's radar?
>
> thanks in advance!
>
> --
> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
> <http://www.bettercloud.com>
>
>
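
For reference, the "custom entrypoint" workaround described in the quoted
message could look roughly like the sketch below. It is only an illustration
under assumptions, not something the operator or Flink ships: the helper
names (marker_path, build_job_args), the marker directory, and the use of a
--fromSavepoint flag on the job-cluster entrypoint are all assumed here. The
idea is to pass the savepoint argument only the first time a given savepoint
path is seen, and to record that fact in a marker file on a persistent
volume so later restarts fall back to normal checkpoint recovery.

```python
import hashlib
from pathlib import Path
from typing import List, Optional


def marker_path(marker_dir: Path, savepoint: str) -> Path:
    """Return the marker file for one savepoint path (hypothetical naming scheme)."""
    digest = hashlib.sha256(savepoint.encode("utf-8")).hexdigest()[:16]
    return marker_dir / f"savepoint-used-{digest}"


def build_job_args(
    base_args: List[str], savepoint: Optional[str], marker_dir: Path
) -> List[str]:
    """Append the savepoint argument only on the first start for that savepoint.

    marker_dir is assumed to live on storage that survives container
    restarts (for example a persistent volume mount).
    """
    if not savepoint:
        return list(base_args)

    marker = marker_path(marker_dir, savepoint)
    if marker.exists():
        # This savepoint was already consumed by an earlier start, so leave
        # the argument out and let the job recover from its checkpoints.
        return list(base_args)

    # Caveat: the marker is written before the job has actually restored.
    # A crash between this write and the restore would skip the savepoint.
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(savepoint + "\n")
    # Flag name is an assumption; use whatever your entrypoint accepts.
    return list(base_args) + ["--fromSavepoint", savepoint]
```

A wrapper script would call build_job_args with the arguments from the Helm
chart and then exec the real entrypoint with the result. Deploying with a
new savepoint path produces a new marker name, so the new savepoint is
honoured once and then ignored on later restarts.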

-- 
Best Regards,
Yuval Itzchakov.
