[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408801#comment-17408801 ]

Till Rohrmann commented on FLINK-19816:
---

Hi [~Paul Lin], I think you are not running into the concrete issue that is fixed with this ticket. Instead, I believe that you are running into FLINK-11813, which will be fixed with the next release. I think the following is happening: the job reaches a globally terminal state (FAILED) and notifies the {{Dispatcher}}, which triggers the clean-up of the HA information. After the clean-up has happened, the {{Dispatcher}} process loses leadership and is restarted. Since you seem to use application mode/per-job mode, Flink is started with the same job but without any checkpoint information, because that information has already been cleaned up. This ultimately results in a restart from scratch. Part of the problem has been solved via FLINK-21979, but the last remaining piece is FLINK-11813, which will most likely introduce a {{JobResultStore}} that can outlive Flink. Only by persisting job status information that can survive a cluster failure can we properly resolve the situation. As a consequence, there will be some bookkeeping information that the user/operator of Flink needs to take care of. (A rough sketch of this idea follows the quoted issue description below.)

> Flink restored from a wrong checkpoint (a very old one and not the last
> completed one)
> --
>
>                 Key: FLINK-19816
>                 URL: https://issues.apache.org/jira/browse/FLINK-19816
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Steven Zhen Wu
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.11.3, 1.12.0
>
>         Attachments: jm.log
>
>
> h2. Summary
> Upon failure, it seems that Flink didn't restore from the last completed
> checkpoint. Instead, it restored from a very old checkpoint. As a result,
> the Kafka offsets were invalid, which caused the job to replay from the
> beginning because the Kafka consumer "auto.offset.reset" was set to "EARLIEST".
> This is an embarrassingly parallel stateless job. Parallelism is over 1,000.
> I have the full log file from the jobmanager at INFO level available upon request.
>
> h2. Sequence of events from the logs
> Just before the failure, checkpoint *210768* completed.
> {code}
> 2020-10-25 02:35:05,970 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [jobmanager-future-thread-5] - Completed checkpoint 210768 for job 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
> {code}
> During the restart, it somehow decided to restore from a very old checkpoint *203531*.
> {code:java}
> 2020-10-25 02:36:03,301 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
> 2020-10-25 02:36:03,302 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5] - Recover all persisted job graphs.
> 2020-10-25 02:36:03,304 INFO com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3:///checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
> 2020-10-25 02:36:03,307 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5] - Trying to recover job with job id 233b4938179c06974e4535ac8a868675.
> 2020-10-25 02:36:03,381 INFO com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3:///checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
> ...
> 2020-10-25 02:36:03,427 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Recovering checkpoints from ZooKeeper.
> 2020-10-25 02:36:03,432 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Found 0 checkpoints in ZooKeeper.
> 2020-10-25 02:36:03,432 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Trying to fetch 0 checkpoints from storage.
> 2020-10-25 02:36:03,432 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [flink-akka.actor.default-dispatcher-82003] - Starting job 233b4938179c06974e4535ac8a868675 from savepoint s3:///checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata ()
> {code}
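To make the {{JobResultStore}} idea concrete, here is a rough sketch of the bookkeeping it would add. All class and method names below are hypothetical illustrations, not Flink's actual API; the point is only that a terminal job result persisted outside the cluster lets a freshly restarted {{Dispatcher}} refuse to re-run a job whose checkpoints have already been cleaned up.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical store of terminal job results that outlives the Flink cluster,
 *  e.g. backed by a marker file on S3/HDFS. Not Flink's actual API. */
interface JobResultStore {
    Optional<String> getTerminalResult(String jobId);         // e.g. "FINISHED", "FAILED"
    void persistTerminalResult(String jobId, String result);  // written before HA clean-up
}

/** Sketch of recovery guarded by the persisted job result. */
class GuardedRecovery {
    private final JobResultStore jobResultStore;

    GuardedRecovery(JobResultStore jobResultStore) {
        this.jobResultStore = jobResultStore;
    }

    /** Called for every persisted job graph found after a leadership change. */
    void maybeRecover(String jobId, Runnable startJob) {
        Optional<String> result = jobResultStore.getTerminalResult(jobId);
        if (result.isPresent()) {
            // The job already reached a globally terminal state; its checkpoints
            // may be gone, so re-running it would start from nothing.
            System.out.println("Skipping recovery of " + jobId + ": already " + result.get());
            return;
        }
        startJob.run();
    }

    public static void main(String[] args) {
        Map<String, String> backing = new HashMap<>(); // stands in for durable storage
        JobResultStore store = new JobResultStore() {
            public Optional<String> getTerminalResult(String jobId) {
                return Optional.ofNullable(backing.get(jobId));
            }
            public void persistTerminalResult(String jobId, String result) {
                backing.put(jobId, result);
            }
        };
        store.persistTerminalResult("233b4938179c06974e4535ac8a868675", "FAILED");
        new GuardedRecovery(store).maybeRecover(
                "233b4938179c06974e4535ac8a868675",
                () -> System.out.println("re-running job without checkpoints"));
    }
}
{code}

With such a store, the restarted {{Dispatcher}} in the scenario above would find the persisted FAILED result and skip the job instead of re-running it from scratch; the price is exactly the extra bookkeeping mentioned in the comment, which the operator eventually has to clean up.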
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408693#comment-17408693 ]

Paul Lin commented on FLINK-19816:
--

Still getting this error with Flink 1.12.3. The jobmanager logs are attached; please take a look. [~trohrmann] [^jm.log]
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228601#comment-17228601 ]

Till Rohrmann commented on FLINK-19816:
---

After fixing FLINK-20033, I've downgraded this issue to critical because the problem should now be very unlikely to happen. Still, we should fix this problem ASAP.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227458#comment-17227458 ]

Till Rohrmann commented on FLINK-19816:
---

The new ticket for tracking the second problem is FLINK-20033.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227446#comment-17227446 ]

Till Rohrmann commented on FLINK-19816:
---

It turns out that we actually have two problems:

1) What we have already described: a job reaching a globally terminal state while being suspended, and the resulting race condition.

2) Stopping the JobMaster causes the job to fail and thereby exceed the maximum allowed number of restarts, which sends it to {{FAILED}}. That's also what happened in your case, [~stevenz3wu]. This problem has been unintentionally fixed via FLINK-19237 for {{1.12}}. I will create a new ticket to also fix it for {{1.10}} and {{1.11}}.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227425#comment-17227425 ]

Xintong Song commented on FLINK-19816:
--

Alright, thanks for the explanation.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227412#comment-17227412 ]

Till Rohrmann commented on FLINK-19816:
---

This problem is not new. However, it affects all HA setups and can effectively cause data duplication in exactly-once sinks, because this bug might make us simply run a job twice. Hence, I believe that it is a true blocker which we must fix before releasing.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227266#comment-17227266 ]

Xintong Song commented on FLINK-19816:
--

Thanks for the discussion, [~stevenz3wu] & [~trohrmann]. [~trohrmann], is this a new problem introduced after 1.11.0? If not, maybe we can downgrade it to a critical issue, given that the problem only manifests under ZooKeeper failures. WDYT?
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226740#comment-17226740 ]

Steven Zhen Wu commented on FLINK-19816:
--

[~trohrmann] Thanks a lot for taking a look. I forwarded the logs to you, but it seems that we may not need them.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226609#comment-17226609 ]

Till Rohrmann commented on FLINK-19816:
---

Looking at the code, I believe that the scenario you are describing can actually happen. If the job is about to complete successfully, it will notify the {{Dispatcher}} and then remove all checkpoints from the {{CompletedCheckpointStore}}. If, at the same time, the {{Dispatcher}} loses leadership, it will stop all running jobs without cleaning up the persisted job graphs. Depending on which action is executed first (losing leadership and stopping jobs without cleaning up the persisted {{JobGraph}}, or completing successfully and cleaning up the persisted {{JobGraph}}), one can end up with a cluster that has deleted all checkpoint data but not the {{JobGraph}} itself.

I think the problem is that we don't look at the final job status when deciding whether to clean up the persisted {{JobGraph}}. Instead, we decide on {{cleanupHA}} when the {{Dispatcher}} is being stopped or when the job reaches a globally terminal state. In the former case, this can lead to ignoring a successful termination of the job. A toy model of this race follows below.
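To illustrate the ordering dependency described above, here is a deliberately simplified toy model of the two clean-up paths. The class and flags are made up for illustration and do not correspond to Flink's real classes; they only show how one interleaving deletes the checkpoints but leaves the {{JobGraph}} behind.

{code:java}
/** Toy model of the race between job completion and Dispatcher leadership loss. */
class HaCleanupRace {
    volatile boolean jobGraphPersisted = true;    // JobGraph kept in HA storage (e.g. ZooKeeper)
    volatile boolean checkpointsPersisted = true; // entries in the CompletedCheckpointStore

    /** Path 1: the job reaches a globally terminal state (e.g. FINISHED). */
    void onGloballyTerminalState() {
        checkpointsPersisted = false; // checkpoints are discarded first...
        jobGraphPersisted = false;    // ...then the persisted JobGraph is removed
    }

    /** Path 2: the Dispatcher loses leadership and merely suspends the job.
     *  The JobGraph is deliberately kept so the next leader can recover it. */
    void onLeadershipLoss() {
        // no HA clean-up on purpose
    }

    public static void main(String[] args) {
        HaCleanupRace state = new HaCleanupRace();
        // Bad interleaving: path 1 has already discarded the checkpoints...
        state.checkpointsPersisted = false;
        // ...but leadership is lost before the JobGraph removal runs.
        state.onLeadershipLoss();
        // The next leader now recovers a JobGraph whose checkpoint store is empty
        // and restarts the job from scratch (or from a stale savepoint path).
        System.out.println("jobGraphPersisted=" + state.jobGraphPersisted
                + ", checkpointsPersisted=" + state.checkpointsPersisted);
    }
}
{code}

Making the clean-up decision depend on the final job status, rather than on which code path happens to stop the {{Dispatcher}}, removes this ambiguity.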
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226595#comment-17226595 ]

Till Rohrmann commented on FLINK-19816:
---

Hi [~stevenz3wu], thanks a lot for this analysis. Have you gotten hold of the complete logs? I will go through the code to see whether I can spot a problem, and the logs might speed this process up a bit. In any case, this sounds like a bug in Flink to me.
[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)
[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226460#comment-17226460 ]

Steven Zhen Wu commented on FLINK-19816:
--

This happened again for the same job in production. I noticed that both failures started with a ZooKeeper failure. Here are some observations on the sequence of events. My hypothesis is that the race condition / interaction between recovering from the ZooKeeper failure and the failure-rate restart strategy caused this problem of restoring a wrong and very old checkpoint. [~trohrmann]

1. Initially, there were some problems with ZooKeeper that caused the job to fail:
{code}
2020-10-25 02:35:59,266 WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [main-SendThread(ip-100-81-150-64.us-west-2.compute.internal:2181)] - Client session timed out, have not heard from server in 26676 ms for sessionid 0x363850d0d034d9d
2020-10-25 02:35:59,268 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [main-SendThread(ip-100-81-150-64.us-west-2.compute.internal:2181)] - Client session timed out, have not heard from server in 26676 ms for sessionid 0x363850d0d034d9d, closing socket connection and attempting reconnect
2020-10-25 02:35:59,282 INFO com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3://us-west-2.spaas.prod/checkpoints/r7E1/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/16d4e138-4199-4fd2-a014-4b394189f72b
2020-10-25 02:35:59,369 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [main-EventThread] - State change: SUSPENDED
{code}

2. This job is configured with {{restart-strategy=failure-rate}}, and there were enough task restarts to make the terminal condition canRestart() return false. This should eventually lead the Flink job to a halted/terminal state.
{code}
2020-10-25 02:35:59,641 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-81991] - Job clapp-avro-nontvui (233b4938179c06974e4535ac8a868675) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=180,backoffTimeMS=3,maxFailuresPerInterval=20)
	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:203)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
	at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:508)
	at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1725)
	at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
	at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
	at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:954)
	at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
	at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
	at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
	at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
	at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
	at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:818)
	at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:777)
	at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:435)
	at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:352)
	at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:216)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:514)
{code}
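To see why a burst of near-simultaneous task failures flips the job to a terminal {{FAILED}} state, here is a simplified sketch of the failure-rate decision. It mirrors the semantics suggested by the exception above (at most {{maxFailuresPerInterval}} failures within the failures interval), but it is not Flink's actual implementation, and the parameter values are illustrative:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified failure-rate restart decision: allow at most maxFailuresPerInterval
 *  failures within failuresIntervalMs; once exceeded, no further restart is granted
 *  and the job transitions to a terminal FAILED state. */
class FailureRateRestartDecision {
    private final long failuresIntervalMs;
    private final int maxFailuresPerInterval;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateRestartDecision(long failuresIntervalMs, int maxFailuresPerInterval) {
        this.failuresIntervalMs = failuresIntervalMs;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
    }

    /** Record a task failure and decide whether another restart is allowed. */
    boolean canRestart(long nowMs) {
        failureTimestamps.addLast(nowMs);
        // Drop failures that fell out of the sliding interval.
        while (!failureTimestamps.isEmpty()
                && nowMs - failureTimestamps.peekFirst() > failuresIntervalMs) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Illustrative values: 3-minute interval, at most 20 failures.
        FailureRateRestartDecision strategy = new FailureRateRestartDecision(180_000, 20);
        long now = System.currentTimeMillis();
        // A ZooKeeper hiccup failing many tasks at nearly the same instant
        // exhausts the budget immediately.
        for (int i = 0; i < 21; i++) {
            if (!strategy.canRestart(now + i)) {
                System.out.println("Restart suppressed after failure #" + (i + 1)
                        + " -> job switches to FAILED");
            }
        }
    }
}
{code}

Under this rule, a single ZooKeeper outage that fails hundreds of parallel tasks at once exhausts the restart budget within milliseconds; the job then goes to {{FAILED}}, which is exactly the globally terminal state that feeds the HA clean-up race discussed earlier in this thread.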