Re: Trying to understand KafkaConsumer_records_lag_max
Hi Gordon (and list),

Yes, that's probably what's going on. I got another message from 徐骁 telling me almost the same thing -- something I had completely forgotten (he also mentioned auto.offset.reset, which could be forcing Flink to keep reading from the head of Kafka instead of going back to read older entries).

Now I need to figure out how to make my pipeline consume entries faster than (or at least on par with) the speed at which they arrive in Kafka -- but that's a discussion for another email. ;)

On Mon, Apr 16, 2018 at 1:29 AM, Tzu-Li (Gordon) Tai wrote:

> Hi Julio,
>
> I'm not really sure, but do you think it is possible that there could be
> some hard data retention setting for your Kafka topics in the staging
> environment?
> As in, at some point in time, and maybe periodically, all data in the Kafka
> topics is dropped, so the consumers effectively jump straight back to the
> head again.
>
> Cheers,
> Gordon

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*
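For context, auto.offset.reset is a plain Kafka consumer property that Flink's Kafka source passes through to the underlying client. A minimal properties fragment (server, group, and topic names here are illustrative, not from the thread) might look like:

```
# Kafka consumer properties handed to Flink's Kafka source (values illustrative)
bootstrap.servers=kafka-staging:9092
group.id=flink-pipeline
# Applies only when the group has no valid committed offset for a partition:
# "latest" jumps to the head of the partition, "earliest" replays from the
# oldest retained record.
auto.offset.reset=latest
```

With "latest", a consumer whose committed offsets have become invalid (e.g. because the records they pointed at were deleted) snaps forward to the head of the log, which would show up as the lag collapsing to zero.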
Re: Trying to understand KafkaConsumer_records_lag_max
Hi Julio,

I'm not really sure, but do you think it is possible that there could be some hard data retention setting for your Kafka topics in the staging environment? As in, at some point in time, and maybe periodically, all data in the Kafka topics is dropped, so the consumers effectively jump straight back to the head again.

Cheers,
Gordon

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
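The scenario described above can be produced by topic-level retention overrides. A hypothetical fragment of topic configuration (values illustrative, not taken from the thread) that would make Kafka discard data aggressively:

```
# Hypothetical topic-level overrides that would periodically drop data
retention.ms=3600000        # delete log segments older than 1 hour
retention.bytes=104857600   # cap each partition at roughly 100 MB
cleanup.policy=delete       # delete old segments rather than compacting
```

When segments holding the consumer's committed offsets are deleted, the consumer's position becomes invalid and it resets according to auto.offset.reset, so the observed lag drops sharply.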
Trying to understand KafkaConsumer_records_lag_max
Hi guys,

We are in the final stages of moving our Flink pipeline from staging to production, but I just found something kind of weird.

We are graphing some Flink metrics, like flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max. If I understand it correctly, that's "Kafka head offset - Flink consumer offset", i.e., the number of records Flink still needs to process to catch up to the most recent record in the partition. Is that right?

If so, I saw another weird thing: at some points the lag falls back to 0 and then slowly climbs again (remember, this is a staging environment, not production, so we are using smaller machines with few cores [2] and little memory [8 GB]) -- Grafana graph attached for reference.

I don't see any checkpoint errors or taskmanager failures, so I don't think it simply dropped everything and started over. Any ideas what's going on here?

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*
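The metric reading above is essentially right: the Kafka consumer reports a per-partition lag of "log-end (head) offset minus consumer position", and records-lag-max is the maximum of those. A minimal sketch with made-up offsets (the numbers are hypothetical, purely for illustration):

```python
# Hypothetical offsets: {partition: (log_end_offset, consumer_position)}
offsets = {
    0: (1000, 400),   # lag 600
    1: (1000, 950),   # lag 50
    2: (1000, 1000),  # lag 0 -- fully caught up
}

def records_lag_max(offsets):
    """Per-partition lag is the head (log-end) offset minus the consumer's
    position; records-lag-max is the maximum over assigned partitions."""
    return max(end - pos for end, pos in offsets.values())

print(records_lag_max(offsets))  # prints 600
```

A sudden drop of this value to 0 therefore means the consumer's position jumped to (or very near) the head on every partition, which is consistent with an offset reset rather than a burst of processing.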