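Hi, scaling up or down restarts the job. During the restart, the JobManager
starts first, and if some TaskManagers have not started yet, the "Not all
required tasks are currently running.." error may be reported. The error goes
away once all of the job's tasks have fully started.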
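
To check that these aborts really are transient, one option is to scan the JobManager log and confirm that each abort with this reason is followed by a completed checkpoint. The snippet below is only a rough sketch: the function name `transient_aborts` is hypothetical, the matching relies on the message text from the log quoted below (joined onto single lines), and the wording may differ in other Flink versions.

```python
import re

# Message fragments taken from the log quoted in this thread (assumed stable).
ABORT_REASON = "Not all required tasks are currently running"
COMPLETED = re.compile(r"Completed checkpoint (\d+) for job")


def transient_aborts(lines):
    """Return (aborts, recovered).

    aborts    -- number of lines carrying the abort reason above
    recovered -- how many of those were later followed by a completed checkpoint
    """
    aborts = 0
    recovered = 0
    pending = 0  # aborts seen since the last completed checkpoint
    for line in lines:
        if ABORT_REASON in line:
            aborts += 1
            pending += 1
        elif COMPLETED.search(line):
            recovered += pending
            pending = 0
    return aborts, recovered


# Three entries from the quoted log, each joined onto one line.
sample = [
    "2022-12-13 05:08:22.339 [jobmanager-io-thread-1] INFO "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed "
    "checkpoint 40393 for job 00000000000000000000000000000000 (488170 bytes, "
    "checkpointDuration=2582 ms, finalizationTime=322 ms).",
    "2022-12-13 05:08:28.083 [Checkpoint Timer] INFO "
    "org.apache.flink.runtime.checkpoint.CheckpointFailureManager - Failed to "
    "trigger checkpoint for job 00000000000000000000000000000000 since "
    "Checkpoint triggering task Source: Custom Source -> Sink: Unnamed (1/79) of "
    "job 00000000000000000000000000000000 is not being executed at the moment. "
    "Aborting checkpoint. Failure reason: Not all required tasks are currently "
    "running..",
    "2022-12-13 05:09:25.747 [jobmanager-io-thread-1] INFO "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed "
    "checkpoint 40394 for job 00000000000000000000000000000000 (482850 bytes, "
    "checkpointDuration=5982 ms, finalizationTime=330 ms).",
]

aborts, recovered = transient_aborts(sample)
print(f"aborts={aborts}, recovered={recovered}")
```

If `recovered` equals `aborts`, every abort was followed by a successful checkpoint, which matches the restart-window explanation above. An abort with no later completion would deserve a closer look.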

Best,
Yanfei
On Thu, May 4, 2023 at 09:44, Chen Yang <chen.y...@doordash.com.invalid> wrote:
>
> Hello,
>
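> My Flink job runs in reactive mode, and I use the Kubernetes HPA to scale the
> TaskManagers up and down automatically. Whenever the TaskManagers are scaled
> up or down, Flink logs an error saying that the checkpoint failed because the
> TaskManager from before the scaling is not running, and a checkpoint-failure
> alert fires as well. In practice, though, checkpoints still complete
> successfully and the job has no runtime errors. The error disappears after
> the job is restarted. How can I fix this?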
>
> The detailed logs are as follows:
> 2022-12-13 05:08:22.339 [jobmanager-io-thread-1] INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
> checkpoint 40393 for job 00000000000000000000000000000000 (488170 bytes,
> checkpointDuration=2582 ms, finalizationTime=322 ms).
> 2022-12-13 05:08:28.083 [Checkpoint Timer] INFO
>  org.apache.flink.runtime.checkpoint.CheckpointFailureManager  - Failed to
> trigger checkpoint for job 00000000000000000000000000000000 since
> Checkpoint triggering task Source: Custom Source -> Sink: Unnamed (1/79) of
> job 00000000000000000000000000000000 is not being executed at the moment.
> Aborting checkpoint. Failure reason: Not all required tasks are currently
> running..
> 2022-12-13 05:09:19.437 [Checkpoint Timer] INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering
> checkpoint 40394 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1670908159435 for job
> 00000000000000000000000000000000.
> 2022-12-13 05:09:25.208 [jobmanager-io-thread-1] INFO
>  org.apache.flink.fs.s3.common.writer.S3Committer  - Committing
> flink-ingest-sps-nv-consumer/2022-11-15T01:10:30Z/00000000000000000000000000000000/chk-40394/_metadata
> with MPU ID
> _3vKXSVBMuBM7207EpGvCXOTRQskAiPPj88DSTTn55Uzuc_76dnubmTAPBovyWbKBKU8Wxqz6SuFBJ8cZnAOH_PkGEP36KJzMFYYPmT.xZvmLnM.YX1oJSHN3VP1TXpJECY8y80psYvRWvbt2e8CMeoa9JiOWiGYGRmqLGRdlQA-
> 2022-12-13 05:09:25.747 [jobmanager-io-thread-1] INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
> checkpoint 40394 for job 00000000000000000000000000000000 (482850 bytes,
> checkpointDuration=5982 ms, finalizationTime=330 ms).
>
> Thanks,
> Chen
