[jira] [Updated] (FLINK-10617) Restoring job fails because of slot allocation timeout
[ https://issues.apache.org/jira/browse/FLINK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elias Levy updated FLINK-10617:
---
    Affects Version/s: 1.6.2

> Restoring job fails because of slot allocation timeout
> --
>
> Key: FLINK-10617
> URL: https://issues.apache.org/jira/browse/FLINK-10617
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager, TaskManager
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Elias Levy
> Priority: Major
>
> The following may be related to FLINK-9932, but I am unsure. If you believe it is, go ahead and close this issue as a duplicate.
> While trying to test local state recovery on a job with large state, the job failed to be restored because slot allocation timed out.
> The job is running on a standalone cluster with 12 nodes and 96 task slots (8 per node). The job has parallelism of 96, so it consumes all of the slots, and has ~200 GB of state in RocksDB.
> To test local state recovery I decided to kill one of the TMs. The TM immediately restarted and re-registered with the JM. I confirmed the JM showed 96 registered task slots.
> {noformat}
> 21:35:44,616 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Resolved ResourceManager address, beginning registration
> 21:35:44,616 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager attempt 1 (timeout=100ms)
> 21:35:44,640 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Successful registration at resource manager akka.tcp://flink@172.31.18.172:6123/user/resourcemanager under registration id 302988dea6afbd613bb2f96429b65d18.
> 21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Receive slot request AllocationID{4274d96a59d370305520876f5b84fb9f} for job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Allocated slot for AllocationID{4274d96a59d370305520876f5b84fb9f}.
> 21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,668 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,671 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Try to register at job manager akka.tcp://flink@172.31.18.172:6123/user/jobmanager_3 with leader id f85f6f9b-7713-4be3-a8f0-8443d91e5e6d.
> 21:36:49,681 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Receive slot request AllocationID{3a64e2c8c5b22adbcfd3ffcd2b49e7f9} for job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,681 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Allocated slot for AllocationID{3a64e2c8c5b22adbcfd3ffcd2b49e7f9}.
> 21:36:49,681 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,681 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,681 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,683 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Try to register at job manager akka.tcp://flink@172.31.18.172:6123/user/jobmanager_3 with leader id f85f6f9b-7713-4be3-a8f0-8443d91e5e6d.
> 21:36:49,687 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Resolved JobManager address, beginning registration
> 21:36:49,687 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Resolved JobManager address, beginning registration
> 21:36:49,687 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Receive slot request AllocationID{740caf20a5f7f767864122dc9a7444d9} for job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,688 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Registration at JobManager attempt 1 (timeout=100ms)
> 21:36:49,688 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Allocated slot for AllocationID{740caf20a5f7f767864122dc9a7444d9}.
> 21:36:49,688 INFO org
[jira] [Updated] (FLINK-10617) Restoring job fails because of slot allocation timeout
[ https://issues.apache.org/jira/browse/FLINK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elias Levy updated FLINK-10617:
---
    Priority: Critical (was: Major)

> Restoring job fails because of slot allocation timeout
> --
>
> Key: FLINK-10617
> URL: https://issues.apache.org/jira/browse/FLINK-10617
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager, TaskManager
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Elias Levy
> Priority: Critical
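
The TaskExecutor log quoted in the issue description shows a delay between the restarted TM registering with the ResourceManager and the first slot request reaching it. The following is a minimal Python sketch for measuring that delay from the two relevant log lines. The helper names (`parse_time`, `first_time`, `registration_to_request_gap`) are hypothetical and not part of Flink; the sketch assumes the `HH:MM:SS,mmm` timestamp prefix seen in the excerpt and that both events fall on the same day.

```python
from datetime import datetime, timedelta

# Two lines taken from the TaskExecutor log quoted in the issue description.
LOG = """\
21:35:44,640 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Successful registration at resource manager akka.tcp://flink@172.31.18.172:6123/user/resourcemanager under registration id 302988dea6afbd613bb2f96429b65d18.
21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Receive slot request AllocationID{4274d96a59d370305520876f5b84fb9f} for job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id 8e06aa64d5f8961809da38fe7f224cc1.
"""


def parse_time(line: str) -> datetime:
    """Parse the leading HH:MM:SS,mmm timestamp of a log line (date is a placeholder)."""
    return datetime.strptime(line.split(" ", 1)[0], "%H:%M:%S,%f")


def first_time(lines: list[str], marker: str) -> datetime:
    """Return the timestamp of the first line containing the given marker text."""
    for line in lines:
        if marker in line:
            return parse_time(line)
    raise ValueError(f"no log line contains {marker!r}")


def registration_to_request_gap(log_text: str) -> timedelta:
    """Time between RM registration and the first slot request (same-day logs only)."""
    lines = log_text.splitlines()
    registered = first_time(lines, "Successful registration at resource manager")
    requested = first_time(lines, "Receive slot request")
    return requested - registered


if __name__ == "__main__":
    gap = registration_to_request_gap(LOG)
    print(f"{gap.total_seconds():.3f} s")  # prints "65.027 s"
```

For the quoted excerpt this gives roughly 65 seconds between registration and the first slot request. The sketch only measures the interval; it says nothing about the cause of the allocation timeout.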