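Hi,

Scaling up or down restarts the job. While the job is restarting, the JobManager comes up first. If some TaskManagers have not started yet, a checkpoint trigger can fail with the "Not all required tasks are currently running.." message. The message stops appearing once all of the job's tasks are fully started.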
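To confirm that the message only appears inside the restart window, one option is to check whether every task is running before treating a failed checkpoint trigger as a real problem. The snippet below is a minimal sketch, not an official Flink utility. It assumes the job-details JSON returned by the JobManager REST endpoint `GET /jobs/<job-id>` has a top-level `state` field and a `vertices` list whose entries carry `parallelism` and a per-state `tasks` counter. The function name and the sample payloads are hypothetical.

```python
from typing import Any, Dict


def all_tasks_running(job_details: Dict[str, Any]) -> bool:
    """Return True only if the job and every subtask of every vertex are RUNNING.

    `job_details` is assumed to be the parsed JSON body of GET /jobs/<job-id>.
    """
    if job_details.get("state") != "RUNNING":
        return False
    vertices = job_details.get("vertices", [])
    if not vertices:
        # No vertices reported yet: treat the job as not fully started.
        return False
    for vertex in vertices:
        running = vertex.get("tasks", {}).get("RUNNING", 0)
        if running != vertex.get("parallelism"):
            return False
    return True


# Hypothetical payloads: a rescale in progress vs. a fully started job.
restarting = {
    "state": "RUNNING",
    "vertices": [
        {
            "name": "Source: Custom Source -> Sink: Unnamed",
            "parallelism": 79,
            "tasks": {"RUNNING": 60, "DEPLOYING": 19},
        }
    ],
}
steady = {
    "state": "RUNNING",
    "vertices": [
        {
            "name": "Source: Custom Source -> Sink: Unnamed",
            "parallelism": 79,
            "tasks": {"RUNNING": 79},
        }
    ],
}

print(all_tasks_running(restarting))
print(all_tasks_running(steady))
```

If an alert fires while this check returns False, it most likely falls inside the restart window described above.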
Best,
Yanfei

Chen Yang <chen.y...@doordash.com.invalid> wrote on Thu, May 4, 2023 at 09:44:
>
> Hello,
>
> My Flink job runs in reactive mode, and I use Kubernetes HPA to scale the
> TaskManagers up and down automatically. Whenever the TaskManagers are
> scaled up or down, Flink logs an error saying that the checkpoint failed
> because the TaskManagers from before the scaling event are not running,
> and a checkpoint-failure alert fires as well.
> In practice, though, checkpoints still complete and the job shows no
> runtime errors. The error disappears after I restart the job. How can I
> fix this?
>
> The detailed logs are below:
> 2022-12-13 05:08:22.339 [jobmanager-io-thread-1] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
> checkpoint 40393 for job 00000000000000000000000000000000 (488170 bytes,
> checkpointDuration=2582 ms, finalizationTime=322 ms).
> 2022-12-13 05:08:28.083 [Checkpoint Timer] INFO
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager - Failed to
> trigger checkpoint for job 00000000000000000000000000000000 since
> Checkpoint triggering task Source: Custom Source -> Sink: Unnamed (1/79) of
> job 00000000000000000000000000000000 is not being executed at the moment.
> Aborting checkpoint. Failure reason: Not all required tasks are currently
> running..
> 2022-12-13 05:09:19.437 [Checkpoint Timer] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
> checkpoint 40394 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1670908159435 for job
> 00000000000000000000000000000000.
> 2022-12-13 05:09:25.208 [jobmanager-io-thread-1] INFO
> org.apache.flink.fs.s3.common.writer.S3Committer - Committing
> flink-ingest-sps-nv-consumer/2022-11-15T01:10:30Z/00000000000000000000000000000000/chk-40394/_metadata
> with MPU ID
> _3vKXSVBMuBM7207EpGvCXOTRQskAiPPj88DSTTn55Uzuc_76dnubmTAPBovyWbKBKU8Wxqz6SuFBJ8cZnAOH_PkGEP36KJzMFYYPmT.xZvmLnM.YX1oJSHN3VP1TXpJECY8y80psYvRWvbt2e8CMeoa9JiOWiGYGRmqLGRdlQA-
> 2022-12-13 05:09:25.747 [jobmanager-io-thread-1] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
> checkpoint 40394 for job 00000000000000000000000000000000 (482850 bytes,
> checkpointDuration=5982 ms, finalizationTime=330 ms).
>
> Thanks,
> Chen