Hi,

One way to do this would be to use the Flink checkpointing metrics [1] (for example numberOfFailedCheckpoints), scrape them with something like Prometheus, and create alerts from them.
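If a full Prometheus/Alertmanager setup is more than you need, a small poller could do the same job. Below is a minimal sketch in Python, not a tested integration. It assumes the Prometheus reporter is enabled and exposes the failed-checkpoint counter as flink_jobmanager_job_numberOfFailedCheckpoints with a job_name label (the exact name depends on your reporter and scope configuration). The helper names and the webhook URL parameter are hypothetical.

```python
import json
import re
import urllib.request

# Assumption: metric name as typically exposed by Flink's Prometheus reporter.
FAILED_METRIC = "flink_jobmanager_job_numberOfFailedCheckpoints"

# One sample line of the Prometheus text format: name{labels} value
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def failed_checkpoints(exposition_text):
    """Return {job_name: failed_count} parsed from Prometheus text output."""
    counts = {}
    for line in exposition_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = _SAMPLE.match(line)
        if not m or m.group("name") != FAILED_METRIC:
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels") or ""))
        # Assumption: the reporter attaches a job_name label.
        job = labels.get("job_name", "unknown")
        counts[job] = int(float(m.group("value")))
    return counts


def newly_failed(previous, current):
    """Jobs whose failed-checkpoint count grew since the previous poll.

    A job missing from `previous` is compared against 0, so failures that
    already exist on the first poll are reported too.
    """
    return {
        job: n - previous.get(job, 0)
        for job, n in current.items()
        if n > previous.get(job, 0)
    }


def notify_slack(webhook_url, failures):
    """Post a short message to a Slack incoming webhook (URL is yours to supply)."""
    text = ", ".join(f"{job}: +{n} failed checkpoint(s)" for job, n in failures.items())
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

You would fetch the reporter's metrics endpoint on a timer, pass the body to failed_checkpoints, compare it with the previous result via newly_failed, and call notify_slack when the result is non-empty.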
Thanks,
Martijn

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#checkpointing

On Thu, 14 Oct 2021 at 14:45, Mathieu D <matd...@gmail.com> wrote:

> Hey there,
>
> We have some instabilities around checkpointing that we don't quite
> understand.
> In general, as soon as a checkpoint fails, our cluster does not recover
> back to a proper state.
> But to better understand the mechanism, we'd like to be notified as soon
> as this happens, so we can jump on our console and try to understand the
> problem.
>
> So, in my mind, we'd simply send a Slack notification to some ops as soon
> as a checkpoint fails.
>
> Is there a way to register a callback in the checkpointing system and get
> called as soon as one fails?
>
> [FWIW our config: Flink 1.12 on Yarn/EMR, checkpointing on S3,
> RocksDB state backend]
>
> Thanks.
> Mathieu