Hi guys,

We are in the final stages of moving our Flink pipeline from staging to
production, but I just found something kind of weird:

We are graphing some Flink metrics, like
flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max. If I got
this right, that's "Kafka head offset - Flink consumer offset", i.e., the
number of records Flink still has to read to reach the most recent one in
the partition. Is that right?
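
To make my reading concrete, here is a minimal sketch of how I understand
the computation. The function names and the offsets are made up for
illustration (they are not Flink or Kafka API), and I am assuming the
"max" in the metric name means the largest lag over the partitions a
consumer reads:

```python
def partition_lag(log_end_offset, consumer_offset):
    """Records the consumer still has to read in one partition."""
    # Clamp at 0 so a stale head offset never yields a negative lag.
    return max(log_end_offset - consumer_offset, 0)


def records_lag_max(log_end_offsets, consumer_offsets):
    """Largest per-partition lag (assumed meaning of the metric)."""
    return max(
        partition_lag(log_end_offsets[p], consumer_offsets[p])
        for p in log_end_offsets
    )


# Hypothetical offsets for three partitions.
log_end = {0: 1500, 1: 1200, 2: 900}
consumed = {0: 1450, 1: 1000, 2: 900}

print(records_lag_max(log_end, consumed))  # 200 (partition 1 is furthest behind)
```

If that reading is correct, the graph should only go to 0 when the consumer
has caught up on every partition it reads.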

If that's the case, I saw another weird thing: at some points this lag
drops to 0 and then slowly climbs back up (attached Grafana graph for
reference). Remember, this is a staging environment, not production, so we
are using smaller machines with few cores (2) and low memory (8 GB). I don't
see any checkpoint errors or TaskManager failures, so I don't think it
simply dropped everything and started over.

Any ideas what's going on here?

-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
