Hi Gordon,

We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.

There are a bunch of flink_taskmanager_job_task_operator_* metrics, including some about the committed offset for each partition. It seems I have 4 different records_lag_max series with different attempt_id values, though: 3 with -Inf and 1 with an actual value, which means I'll need a better understanding of Prometheus to extract this properly.

I was also checking our Grafana, and the metric we were using was actually "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max". "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).

On the "KafkaConsumer" front, the scope is there, but it only has the "commited_offset" for each partition.

On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:

> Hi,
>
> Which Kafka version are you using?
>
> AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x
> series would be FLINK-8419 [1].
> The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded
> from the internally used Kafka client, so nothing should have been affected.
>
> Do you see other metrics under the pattern of
> ‘flink_taskmanager_job_task_operator_*’?
> All Kafka-shipped metrics should still follow this pattern.
> If not, could you find the ‘records_lag_max’ metric (or any other
> Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?
>
> The above should provide more insight into what may be wrong here.
>
> - Gordon
>
> [1] https://issues.apache.org/jira/browse/FLINK-8419
> [2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
>
> On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com)
> wrote:
>
> Hey guys,
>
> I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus
> monitoring is not getting the current Kafka lag.
>
> After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar
> into lib/, I got the metrics back on Prometheus, but the most important one,
> flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.
>
> Did I miss something?
>
> --
> *Julio Biason*, Software Engineer
> *AZION* | Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*