Hi Flinksters,

Scenario: We have CDC messages from our RDBMS (various tables) flowing into Kafka. Our Flink job reads the CDC messages and creates events based on certain rules.
I am using Prometheus and Grafana. The following are the three metrics that I need to calculate:

A) Number of CDC messages per table.
B) Number of events created per event type.
C) Average/P99/P95 latency (event created ts - CDC operation ts).

For A and B, I created counters and am able to see the metrics flowing into Prometheus. A few questions I have here:

1) How do I create labels for counters in Flink? I did not find an easy way to do it. Right now it looks like I need to create a separate counter for each table and each event type. I referred to one of the community discussions [1]. Is there any way apart from this?
2) When the job gets restarted, the counters go back to 0. How can I prevent that and get continuity across restarts?

For C, I calculated the latency in code for each event and assigned it to a histogram. A few questions I have here:

3) I read in a few blogs [2] that a histogram is the best way to measure latencies. Is there any better idea?
4) How do I create buckets for various ranges? I also read in a community email that Flink exposes histograms as summaries. I also need to be able to see the latencies across timelines.

[1] https://stackoverflow.com/questions/58456830/how-to-use-multiple-counters-in-flink
[2] https://povilasv.me/prometheus-tracking-request-duration/

Thanks,
Prasanna.
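P.S. To make question C concrete, this is roughly how I compute the latency and the numbers I expect the histogram to report. It is a standalone sketch, not the actual job code: the timestamp pairs are made up, and the nearest-rank percentile here just stands in for whatever the histogram implementation does internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Standalone sketch: given per-event latencies
// (event created ts - CDC operation ts, in milliseconds), compute the
// average, P95 and P99 that I expect to see in Grafana.
public class LatencySketch {

    // Nearest-rank percentile over a sorted copy of the samples.
    static long percentile(List<Long> latencies, double p) {
        List<Long> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    static double average(List<Long> latencies) {
        long sum = 0;
        for (long l : latencies) sum += l;
        return (double) sum / latencies.size();
    }

    public static void main(String[] args) {
        // Hypothetical (eventCreatedTs, cdcOperationTs) pairs in epoch millis.
        long[][] pairs = {
            {1_000_120, 1_000_000},  // 120 ms
            {1_000_250, 1_000_000},  // 250 ms
            {1_000_980, 1_000_000},  // 980 ms
            {1_000_075, 1_000_000},  //  75 ms
        };
        List<Long> latencies = new ArrayList<>();
        for (long[] p : pairs) {
            latencies.add(p[0] - p[1]);  // event created ts - CDC operation ts
        }
        System.out.println("avg=" + average(latencies)
                + " p95=" + percentile(latencies, 95)
                + " p99=" + percentile(latencies, 99));
        // prints: avg=356.25 p95=980 p99=980
    }
}
```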