Hi All,

Does anyone have pointers on the checkpoint failure scenario below? I would
appreciate any help.

Thanks

On Sun, Jul 7, 2019 at 9:23 PM Navneeth Krishnan <reachnavnee...@gmail.com>
wrote:

> Hi All,
>
> Occasionally I run into failed checkpoints: 2 or 3 consecutive checkpoints
> fail after running for a minute each, and then checkpointing recovers. This
> delays processing of the incoming data, since a large amount of data is
> buffered during the failed checkpoints. I don't see any errors in the
> taskmanager logs, but the jobmanager log shows the messages below. The state
> size is around 100 MB.
>
> *Checkpoint configuration:*
>
> Option                              Value
> Checkpointing Mode                  Exactly Once
> Interval                            1m 0s
> Timeout                             1m 0s
> Minimum Pause Between Checkpoints   5s
> Maximum Concurrent Checkpoints      1
> Persist Checkpoints Externally      Enabled (retain on cancellation)
>
> *Jobmanager Log:*
>
> 2019-07-05 17:53:54,125 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 79515b6550d2c223701be0a9c870995f of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,141 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 630984cdd5e66b4d9ea95a91cb4d23f6 of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,168 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from e12ed2e185a37559f93181905a52ebeb of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,215 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 1fede192e2ff11e0905d98ff5ff6f9ce of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,223 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from d4e895eb20cc259c95b249cd0252930f of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,310 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from be5c711d7b37ed6d8022224dc447db91 of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,351 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 1ed52695cc407f2f143d2bb5d23cbdbb of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO
> o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042
> bytes in 50707 ms).
> 2019-07-05 17:54:04,174 [Checkpoint Timer] INFO
> o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9869 @ 1562349244171
> 2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO
> o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470
> bytes in 6430 ms).
> 2019-07-05 17:55:04,174 [Checkpoint Timer] INFO
> o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9870 @ 1562349304171
> 2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO
> o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543
> bytes in 5551 ms).
> 2019-07-05 17:56:04,174 [Checkpoint Timer] INFO
> o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9871 @ 1562349364171
>
> Thanks
>

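For anyone triaging a log like the one quoted above, here is a minimal Python
sketch that pulls out which checkpoint attempts expired and how close each
completed checkpoint came to the configured timeout. It is only an
illustration, not a Flink tool: the names LOG, TIMEOUT_MS and summarize are
made up for this example, the regular expressions assume the exact message
wording seen in the quoted log, and the sample lines are copied from that log
with the mail client's line wrapping joined back together.

```python
import re

# Sample entries copied from the quoted jobmanager log, one entry per line
# (the email wrapping has been undone; nothing else was changed).
LOG = """\
2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042 bytes in 50707 ms).
2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470 bytes in 6430 ms).
2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543 bytes in 5551 ms).
"""

# "Timeout 1m 0s" from the checkpoint configuration in the quoted message.
TIMEOUT_MS = 60_000

# Patterns assume the message wording shown in the quoted log.
COMPLETED = re.compile(r"Completed checkpoint (\d+) \((\d+) bytes in (\d+) ms\)")
EXPIRED = re.compile(r"expired checkpoint attempt (\d+) from")


def summarize(log_text, timeout_ms=TIMEOUT_MS):
    """Return (completed, expired).

    completed maps checkpoint id -> dict with size in bytes, duration in ms,
    and the duration as a fraction of the timeout.
    expired is the set of checkpoint ids that received late acknowledgements.
    """
    completed = {}
    expired = set()
    for line in log_text.splitlines():
        match = COMPLETED.search(line)
        if match:
            cp_id, size, duration = map(int, match.groups())
            completed[cp_id] = {
                "bytes": size,
                "ms": duration,
                "timeout_fraction": duration / timeout_ms,
            }
            continue
        match = EXPIRED.search(line)
        if match:
            expired.add(int(match.group(1)))
    return completed, expired


if __name__ == "__main__":
    completed, expired = summarize(LOG)
    print("expired attempts:", sorted(expired))
    for cp_id in sorted(completed):
        info = completed[cp_id]
        print(cp_id, info["bytes"], "bytes,", info["ms"], "ms,",
              f"{info['timeout_fraction']:.0%} of timeout")
```

On the quoted log this flags attempt 9867 as expired and shows that checkpoint
9868 needed 50707 ms, roughly 85% of the 1m timeout, while 9869 and 9870
finished in about 6 seconds each. Running the same summary over a longer
stretch of the log may help show whether the slow checkpoints cluster around
particular times.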