[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336302#comment-17336302 ] Flink Jira Bot commented on FLINK-16215: This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue, or revive the public discussion. > Start redundant TaskExecutor when JM failed > --- > > Key: FLINK-16215 > URL: https://issues.apache.org/jira/browse/FLINK-16215 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.10.0, 1.11.3, 1.12.0 > Reporter: YufeiLiu > Priority: Major > Labels: stale-major > > The TaskExecutor will reconnect to the new ResourceManager leader when the JM fails, and the JobMaster will restart and reschedule the job. If a job's slot request arrives earlier than the TM registration, the RM will start new workers rather than reuse the existing TMs. > It's hard to reproduce because the TM registration usually comes first, and the timeout check will stop redundant TMs. > But I think it would be better to make {{recoverWokerNode}} an interface and put recovered slots in {{pendingSlots}} to wait for TM reconnection. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328033#comment-17328033 ] Flink Jira Bot commented on FLINK-16215: This major issue is unassigned, and it and all of its sub-tasks have not been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed major, please either assign yourself or give an update, and afterwards remove the label. In 7 days the issue will be deprioritized.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050678#comment-17050678 ] Yangze Guo commented on FLINK-16215: [~liuyufei] Hi, after a deeper investigation, we believe FLINK-16299 is actually not a problem. Feel free to continue your work :).
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046515#comment-17046515 ] YufeiLiu commented on FLINK-16215: [~xintongsong] LGTM. I think it can handle almost all scenarios, and I can continue my work after [FLINK-16299|https://issues.apache.org/jira/browse/FLINK-16299].
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046417#comment-17046417 ] Xintong Song commented on FLINK-16215: [~liuyufei], Is it possible that we first assume all the recovered containers have a TM process started? That means we might request fewer new workers than needed, and we can always request more when we find out that a recovered container does not have a TM process started. E.g., the min slots is 4 and num-slots-per-tm is 1, so we need at least 4 TMs. Say we recovered 2 containers: one has a TM process started inside and one doesn't (but we don't know that yet). In that case we can assume the TM will register for both recovered containers, so we request 2 more containers. At the same time, we request the container status for the recovered containers. When the status of a container in state NEW is received, we release it and request one more container to meet the min workers. I would not consider the overhead of waiting for all recovered containers' statuses to be trivial, especially in large production scenarios where you may have thousands of containers. On the other hand, recovering a container without a TM process started should be a rare case (as you said, hard to reproduce), so I think we can afford to start a new TM for it a bit later (i.e. after the status is received).
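The optimistic accounting described above can be sketched as a small self-contained model (all class and method names below are illustrative stand-ins, not Flink's actual code): assume every recovered container will register a TM, request only the shortfall up front, and top up by one whenever a recovered container turns out to be in state NEW.

```java
import java.util.HashSet;
import java.util.Set;

/** Models the optimistic recovery accounting; names are illustrative, not Flink's. */
public class OptimisticRecovery {
    enum ContainerState { NEW, RUNNING }

    private final Set<String> recovered = new HashSet<>();
    private int requestedNew; // containers newly requested from Yarn

    OptimisticRecovery(int requiredWorkers, Set<String> recoveredContainers) {
        this.recovered.addAll(recoveredContainers);
        // Optimistically assume every recovered container hosts a TM,
        // so only request the shortfall against the required minimum.
        this.requestedNew = Math.max(0, requiredWorkers - recovered.size());
    }

    /** Called when Yarn reports the status of a recovered container. */
    void onContainerStatusReceived(String id, ContainerState state) {
        if (state == ContainerState.NEW && recovered.remove(id)) {
            // No TM process inside: release the container, request a replacement.
            requestedNew++;
        }
    }

    int requestedNewContainers() { return requestedNew; }

    int expectedWorkers() { return recovered.size() + requestedNew; }
}
```

With the numbers from the example (min 4 workers, 2 recovered containers of which one is later reported NEW): 2 containers are requested immediately, and one more after the NEW status arrives, so the expected worker count stays at 4.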
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046335#comment-17046335 ] YufeiLiu commented on FLINK-16215: [~xintongsong] I'm thinking about reusing the slots that will be recovered, and about the {{cluster.slots-number.min}} configuration. If we use async execution, there may be a risk of allocating more resources than we expect. If the {{evenly-spread-out-slots}} config is set, it will try to use every TM rather than fill one of them up. Here is a case: if source parallelism is 2 and sink parallelism is 4, we need 2 TMs if they have 2 slots each. When the JM fails over, if the SlotRequest of the source comes first, it will start a new worker and wait for the slot report, so 3 TMs will be registered at that moment; when the SlotRequest of the sink arrives, all of the TMs will be used and none can be released in the timeout check. Maybe this case is too extreme, but I think putting this in the initialization stage would be better if it doesn't take a long time. Here is the approach I'm thinking of:
* In {{getContainersFromPreviousAttempts}}, add all recovered containers to the {{workerNodeMap}}, bind a {{CompletableFuture}} to each {{YarnWorkerNode}}, then start an async status query for each container.
* In {{onContainerStatusReceived}}, complete the future of this container.
* In {{prepareLeadershipAsync}}, wait for the container status reports; if a returned container's state is {{NEW}}, release it and remove it from the {{workerNodeMap}}; if it's {{RUNNING}}, put it into {{pendingSlots}}.
* Once all container statuses are confirmed, initialize the minimum number of workers if {{cluster.slots-number.min}} is set, which should account for the pending slots.
* The preparation work is done and leadership can be confirmed.
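The future-gated recovery proposed above can be modeled with a minimal self-contained sketch (hypothetical names; not Flink's actual {{YarnResourceManager}}): one {{CompletableFuture}} per recovered container, with leadership confirmed only after all of them complete.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Sketch of gating leadership on recovered-container status; names are illustrative. */
public class GatedRecovery {
    enum ContainerState { NEW, RUNNING }

    private final Map<String, CompletableFuture<ContainerState>> pendingStatus = new HashMap<>();
    private final Map<String, ContainerState> workerNodeMap = new HashMap<>();
    boolean leadershipConfirmed;

    /** getContainersFromPreviousAttempts: bind a future to each recovered container. */
    void recoverContainer(String id) {
        pendingStatus.put(id, new CompletableFuture<>());
    }

    /** onContainerStatusReceived: complete the container's future. */
    void onContainerStatusReceived(String id, ContainerState state) {
        pendingStatus.get(id).complete(state);
    }

    /** prepareLeadershipAsync: confirm leadership only after every status arrives. */
    CompletableFuture<Void> prepareLeadershipAsync() {
        return CompletableFuture
                .allOf(pendingStatus.values().toArray(new CompletableFuture[0]))
                .thenRun(() -> {
                    pendingStatus.forEach((id, f) -> {
                        if (f.join() == ContainerState.RUNNING) {
                            workerNodeMap.put(id, ContainerState.RUNNING); // keep; slots become pending
                        } // NEW containers are released, i.e. simply not kept
                    });
                    leadershipConfirmed = true;
                });
    }

    int keptWorkers() { return workerNodeMap.size(); }
}
```

The {{allOf(...).thenRun(...)}} chain is what makes "don't confirm leadership until recovery is done" explicit: the confirmation callback cannot run before every recovered container has reported a status.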
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046241#comment-17046241 ] Xintong Song commented on FLINK-16215: I believe the problem of identifying recovered containers in which no TM process is started should be a separate issue. I created another ticket, FLINK-16299, for it. I also linked this ticket to FLINK-13554 and FLINK-15959, because I believe the former fixes this issue and the latter is somehow related to it.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046207#comment-17046207 ] Xintong Song commented on FLINK-16215: Just trying to understand: why do we need to block confirming leadership until the recovery is done? I was thinking about the following approach:
* In {{getContainersFromPreviousAttempts}}, add all recovered containers to the {{workerNodeMap}}, and start an async status query for each container.
* In {{onContainerStatusReceived}}, if the returned container's state is {{NEW}}, release it and remove it from the {{workerNodeMap}}.
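The non-blocking variant described above is even simpler to model (again with illustrative names, not Flink's actual code): keep every recovered container optimistically, and evict it only when a status query reports it as NEW.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the non-blocking variant: track everything, evict on NEW. Names illustrative. */
public class AsyncRecovery {
    enum ContainerState { NEW, RUNNING }

    final Map<String, ContainerState> workerNodeMap = new HashMap<>();

    /** getContainersFromPreviousAttempts: optimistically track every recovered container. */
    void recoverContainer(String id) {
        // Assumed RUNNING until the async status query says otherwise;
        // a real implementation would fire the status query for `id` here.
        workerNodeMap.put(id, ContainerState.RUNNING);
    }

    /** onContainerStatusReceived: drop containers that never started a TM. */
    void onContainerStatusReceived(String id, ContainerState actual) {
        if (actual == ContainerState.NEW) {
            workerNodeMap.remove(id); // release the container back to Yarn
        }
    }
}
```

Compared with the future-gated proposal, nothing here blocks leadership confirmation; the trade-off, as discussed in the thread, is that slot requests arriving before an eviction may briefly over-allocate.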
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046198#comment-17046198 ] Yangze Guo commented on FLINK-16215: [~liuyufei] The SlotManager will decide what resources TMs should be started with, and how many. For more details, you could take a look at the design doc of FLINK-14106.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046195#comment-17046195 ] YufeiLiu commented on FLINK-16215: [~xintongsong] This sounds good. We can do some actual recovery work, like releasing unused containers, rather than just putting all previous containers into the {{workerNodeMap}}. It needs some changes in {{YarnResourceManager}}; it's kind of like the {{MesosResourceManager}} recovery process. I think putting the recovery work in {{prepareLeadershipAsync}} would be nice, so that leadership is not confirmed until the recovery is done. Do we have plans to improve on this? Besides, I have a question about "RM not assuming all TMs have the same resource": what then decides the specification of a TM?
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046062#comment-17046062 ] Xintong Song commented on FLINK-16215: [~liuyufei] I see your point. You mean that for FLINK-15959, in case of failover it's hard to know how many new TMs we need to start to meet the min slots? I did some investigation and came up with another potential solution for identifying the allocated-but-not-started containers among all recovered containers. We can request container status from Yarn by calling {{NMClientAsync#getContainerStatusAsync}}; the result will be returned in {{onContainerStatusReceived}}. The {{ContainerStatus#getState}} should be {{RUNNING}} for containers in which the {{TaskExecutor}} process is already started, and {{NEW}} for containers allocated but not started yet. In this way, we can release the containers that were allocated but not started, and also decide how many new TMs we need to start. An alternative is to directly start the TM process inside those allocated-but-not-started containers. But we are working on another effort to make the RM not assume all TMs have the same resource, which means the RM cannot decide the resource configurations to start the new TMs with.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045565#comment-17045565 ] YufeiLiu commented on FLINK-16215: [~xintongsong] I understand your concern: we can't know how many slots will be recovered when the JM fails over. I came up with this because of the issue we discussed before, [FLINK-15959|https://issues.apache.org/jira/browse/FLINK-15959]; it's hard to know exactly how many slots are missing at startup.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043137#comment-17043137 ] Xintong Song commented on FLINK-16215: I share [~trohrmann]'s concern. On a Yarn deployment, {{YarnResourceManager}} starts a {{TaskExecutor}} in two steps: 1. Request a container from Yarn. 2. Launch the {{TaskExecutor}} process inside the allocated container. If the JM failover happens between the two steps, the container will be recovered but no {{TaskExecutor}} will have been started inside it. I think it is a problem that for such a container, neither will a {{TaskExecutor}} be started in it, nor will it be released. This might be solved by FLINK-13554, with a timeout for starting new {{TaskExecutor}}s; we can apply this timeout to recovered containers as well. FYI, the Kubernetes deployment does not have this problem, because the pod/container is allocated and the {{TaskExecutor}} is started in one step.
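The failover window between the two steps can be illustrated with a toy model (all names hypothetical; the real flow goes through Yarn's AM/NM clients): a container allocated in step 1 but not launched in step 2 is still recovered after failover, with no TM inside.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two-step Yarn TM startup and the failover window; names illustrative. */
public class TwoStepStartup {
    static class Container {
        final String id;
        boolean tmStarted; // has step 2 (process launch) happened?
        Container(String id) { this.id = id; }
    }

    final List<Container> allocated = new ArrayList<>();

    /** Step 1: a container is granted by Yarn. */
    Container requestContainer(String id) {
        Container c = new Container(id);
        allocated.add(c);
        return c;
    }

    /** Step 2: launch the TaskExecutor process inside the container. */
    void launchTaskExecutor(Container c) {
        c.tmStarted = true;
    }

    /** After a JM failover, every allocated container is recovered, started or not. */
    long recoveredWithoutTM() {
        return allocated.stream().filter(c -> !c.tmStarted).count();
    }
}
```

If the failover interrupts the sequence after step 1 for one container, that container shows up in the recovered set with no TM, which is exactly the neither-started-nor-released state described above.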
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043134#comment-17043134 ] Yang Wang commented on FLINK-16215: I think even if we make {{recoverWokerNode}} an interface and do the recovery before slot requests come, we still could not completely avoid this problem, since there is no guarantee that we could get all the previous containers from the recovery process; some other containers may also be returned via subsequent heartbeats. Maybe the {{JobMaster}} should be aware of the failover and could recover the running tasks from the {{TaskManager}}; if that fails with a timeout, then allocate a new slot from the {{ResourceManager}}. It is just a rough thought; please correct me if I am wrong.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042036#comment-17042036 ] Till Rohrmann commented on FLINK-16215: I think at the moment I would not recommend doing it, because when recovering the previous containers we assume that a {{TaskExecutor}} has already been started in the container. I think this is not necessarily true; it could simply mean that we have obtained a container which we don't use. If this gets fixed, and if we can guarantee that in every container there is a {{TaskExecutor}} running, then I believe that we could treat slots belonging to these "not yet" registered {{TaskExecutors}} as pending slots.
[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041869#comment-17041869 ] Andrey Zagrebin commented on FLINK-16215: I assume this is about some active RM integration, e.g. Yarn. I agree that reusing existing TMs is better, but if it happens rarely, as it is hard to reproduce, why is it a problem? The failover should also be a rare case, so starting a new TM should not be a big penalty, and the existing TMs will just disappear after the timeout, as already pointed out. cc [~trohrmann]