Hi,

One way to do it would be to rely on the Flink checkpointing metrics [1]:
have something like Prometheus scrape them and create alerts on top of them.

Thanks,

Martijn

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#checkpointing
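
If setting up Prometheus is more than you need right now, a rough
alternative is to poll the JobManager REST API, which reports checkpoint
counts per job under /jobs/<jobid>/checkpoints, and to post to a Slack
incoming webhook whenever the failed count goes up. The sketch below is only
an illustration of that idea, not a tested setup: the REST address, the
webhook URL, the polling interval and all function names are assumptions to
adapt to your cluster.

```python
import json
import time
import urllib.request

# Assumptions: adjust both to your environment.
FLINK_REST = "http://localhost:8081"                         # JobManager REST address
SLACK_WEBHOOK = "https://hooks.slack.com/services/REPLACE"   # placeholder webhook URL


def failed_checkpoint_count(stats: dict) -> int:
    """Extract the failed-checkpoint counter from a checkpoint statistics response."""
    return int(stats.get("counts", {}).get("failed", 0))


def new_failures(previous: int, current: int) -> int:
    """Failures since the last poll; a counter reset (job restart) yields 0."""
    return max(current - previous, 0)


def slack_payload(job_id: str, failures: int) -> dict:
    """Build the JSON body expected by a Slack incoming webhook."""
    return {"text": f"Flink job {job_id}: {failures} new failed checkpoint(s)"}


def fetch_checkpoint_stats(job_id: str) -> dict:
    """GET the checkpoint statistics of one job from the Flink REST API."""
    url = f"{FLINK_REST}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def post_to_slack(payload: dict) -> None:
    """POST the payload to the Slack webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()


def poll(job_id: str, interval_s: int = 30) -> None:
    """Poll forever and notify Slack each time the failed counter increases."""
    previous = failed_checkpoint_count(fetch_checkpoint_stats(job_id))
    while True:
        time.sleep(interval_s)
        current = failed_checkpoint_count(fetch_checkpoint_stats(job_id))
        failures = new_failures(previous, current)
        if failures:
            post_to_slack(slack_payload(job_id, failures))
        previous = current
```

You would start it with poll("<your job id>") from a small script running
next to the cluster. Polling only notices a failure at the next interval, so
a metrics-based alert remains the more robust option in the long run.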

On Thu, 14 Oct 2021 at 14:45, Mathieu D <matd...@gmail.com> wrote:

> Hey there,
>
> We have some instabilities around checkpointing that we don't quite
> understand.
> In general, as soon as a checkpoint fails, our cluster does not recover
> to a proper state.
> But to better understand the mechanism, we'd like to be notified as soon
> as this happens, so we can jump on our console and try to understand the
> problem.
>
> So, in my mind, we'd simply send a Slack notification to some ops as soon
> as a checkpoint fails.
>
> Is there a way to register a callback in the checkpointing system, and get
> called as soon as one fails?
>
> [FWIW our config: Flink 1.12 on YARN/EMR, checkpointing on S3,
> RocksDB state backend]
>
> Thanks.
> Mathieu
>
>
