Re: Trying to understand KafkaConsumer_records_lag_max

2018-04-16 Thread Julio Biason
Hi Gordon (and list),

Yes, that's probably what's going on. I got another message from 徐骁 that
said almost the same thing -- something I had completely forgotten (he also
mentioned auto.offset.reset, which could be forcing Flink to keep reading
from the top of Kafka instead of going back to read older entries).
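For context, auto.offset.reset only takes effect when the consumer group has
no valid committed offset for a partition (or the committed offset is out of
range, e.g. because retention deleted it). A sketch of the relevant Kafka
consumer properties -- the values here are illustrative, not taken from this
thread:

```properties
# Illustrative consumer properties (example values, not from the thread):
bootstrap.servers=kafka:9092
group.id=my-flink-pipeline
# "latest" jumps to the head of the partition when no valid offset exists;
# "earliest" rewinds to the oldest retained record instead.
auto.offset.reset=latest
```

With "latest", a consumer whose committed offset has been deleted by retention
resumes at the head, which would make the lag metric drop to 0.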

Now I need to figure out how to make my pipeline consume entries faster
than (or at least on par with) the rate at which they arrive in Kafka --
but that's a discussion for another email. ;)

On Mon, Apr 16, 2018 at 1:29 AM, Tzu-Li (Gordon) Tai 
wrote:

> Hi Julio,
>
> I'm not really sure, but is it possible that there is some hard data
> retention setting for your Kafka topics in the staging environment?
> As in, at some point (and maybe periodically), all data in the Kafka
> topics is dropped, so the consumers effectively jump directly to the
> head again.
>
> Cheers,
> Gordon
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>



-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*


Re: Trying to understand KafkaConsumer_records_lag_max

2018-04-15 Thread Tzu-Li (Gordon) Tai
Hi Julio,

I'm not really sure, but is it possible that there is some hard data
retention setting for your Kafka topics in the staging environment?
As in, at some point (and maybe periodically), all data in the Kafka
topics is dropped, so the consumers effectively jump directly to the
head again.

Cheers,
Gordon
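
The retention behavior described above is governed by topic-level (or broker
default) settings. A sketch of the relevant configuration -- the values are
hypothetical, not taken from this thread:

```properties
# Illustrative topic-level retention settings (example values):
retention.ms=604800000        # delete log segments older than 7 days
retention.bytes=1073741824    # ...or once a partition exceeds ~1 GiB
```

If a lagging consumer's offset falls into a deleted segment, its position is
out of range and the auto.offset.reset policy decides where it resumes.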



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Trying to understand KafkaConsumer_records_lag_max

2018-04-13 Thread Julio Biason
Hi guys,

We are in the final stages of moving our Flink pipeline from staging to
production, but I just found something kinda weird:

We are graphing some Flink metrics, like
flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max. If I got
this right, that's "Kafka head offset - Flink consumer offset", i.e., the
number of records Flink still needs to read to reach the most recent one in
the partition. Is that right?
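As a sanity check of that reading: per partition, the lag is the head
(log-end) offset minus the consumer's current position, and records-lag-max
is the maximum of those lags across the consumer's partitions. A minimal
sketch with made-up offsets (the function name and numbers are illustrative):

```python
def records_lag_max(head_offsets, consumer_positions):
    """Max per-partition lag; both arguments are dicts keyed by partition id.

    lag(p) = head (log-end) offset of p - consumer position in p
    """
    lags = (head_offsets[p] - consumer_positions[p] for p in head_offsets)
    return max(lags)

# Partition 1 is 150 records behind, partition 0 only 20 -> metric is 150.
print(records_lag_max({0: 1020, 1: 5150}, {0: 1000, 1: 5000}))  # -> 150
```

Under this reading, a sudden drop to 0 means the consumer's position caught
up to (or was reset to) the head on every partition.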

If that's the case, I saw another weird thing: it seems that, at some
points, this lag falls back to 0 and then slowly climbs back up (remember,
this is a staging environment, not production, so we are using smaller
machines with few cores [2] and little memory [8 GB]) -- Grafana graph
attached for reference. I don't see any checkpoint errors or taskmanager
failures, so I don't think it simply dropped everything and started over.

Any ideas what's going on here?

-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*