[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2021-04-29 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336302#comment-17336302
 ] 

Flink Jira Bot commented on FLINK-16215:


This issue was labeled "stale-major" 7 days ago and has not received any 
updates, so it is being deprioritized. If this ticket is actually Major, please 
raise the priority and ask a committer to assign you the issue or revive the 
public discussion.


> Start redundant TaskExecutor when JM failed
> ---
>
> Key: FLINK-16215
> URL: https://issues.apache.org/jira/browse/FLINK-16215
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0, 1.11.3, 1.12.0
>Reporter: YufeiLiu
>Priority: Major
>  Labels: stale-major
>
> TaskExecutor will reconnect to the new ResourceManager leader when the JM 
> fails, and the JobMaster will restart and reschedule the job. If the job's 
> slot requests arrive earlier than the TM registrations, the RM will start new 
> workers rather than reuse the existing TMs.
> It's hard to reproduce because TM registration usually comes first, and the 
> timeout check will stop redundant TMs.
> But I think it would be better if we turn {{recoverWokerNode}} into an 
> interface, and put recovered slots in {{pendingSlots}} to wait for TM 
> reconnection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328033#comment-17328033
 ] 

Flink Jira Bot commented on FLINK-16215:


This major issue is unassigned, and neither it nor any of its Sub-Tasks have 
been updated for 30 days, so it has been labeled "stale-major". If this ticket 
is indeed major, please either assign yourself or give an update, then remove 
the label. In 7 days the issue will be deprioritized.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-03-03 Thread Yangze Guo (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050678#comment-17050678
 ] 

Yangze Guo commented on FLINK-16215:


[~liuyufei] Hi, after a deeper investigation, we believe FLINK-16299 is 
actually not a problem. Feel free to continue your work :).



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-27 Thread YufeiLiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046515#comment-17046515
 ] 

YufeiLiu commented on FLINK-16215:
--

[~xintongsong] LGTM. I think it can handle almost all scenarios, and I can 
continue my work after 
[FLINK-16299|https://issues.apache.org/jira/browse/FLINK-16299].



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046417#comment-17046417
 ] 

Xintong Song commented on FLINK-16215:
--

[~liuyufei],

Is it possible that we first assume all the recovered containers have a TM 
process started? That means we might request fewer new workers than needed, and 
we can always request more when we find out that a recovered container does not 
have a TM process started.

E.g., the min slots is 4 and num-slots-per-tm is 1, so we need at least 4 TMs. 
Say we recovered 2 containers, one with a TM process started inside and one 
without (but we don't know that yet). In that case we can assume the TM will 
register for both recovered containers, so we request 2 more containers. At the 
same time, we request the container status for the recovered containers. When 
the status of the container in state NEW is received, we release it and request 
one more container to meet the min workers.

I would not consider the overhead of waiting for all recovered containers' 
status to be trivial, especially for large production scenarios where you may 
have thousands of containers. On the other hand, recovering a container without 
a TM process started should be a rare case (as you said, hard to reproduce), so 
I think we can afford to start a new TM for it a bit later (i.e. after the 
status is received).
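The optimistic accounting above can be sketched with plain integers. This is a 
minimal illustration with hypothetical helper names, not Flink's actual 
{{YarnResourceManager}} API:

```java
/** Sketch of optimistic worker accounting during JM failover (hypothetical names). */
class OptimisticRecovery {
    enum ContainerState { NEW, RUNNING }

    /** Initially assume every recovered container will register a TM,
     *  so only request the difference up to the required worker count. */
    static int initialRequests(int minWorkers, int recoveredContainers) {
        return Math.max(0, minWorkers - recoveredContainers);
    }

    /** When a status report arrives: a NEW container never started a TM,
     *  so it must be released and replaced by one fresh worker. */
    static int onStatusReceived(ContainerState state) {
        return state == ContainerState.NEW ? 1 : 0;
    }

    public static void main(String[] args) {
        // The example from the comment: min slots 4, one slot per TM -> 4 TMs
        // needed; 2 containers recovered, one running and one only allocated.
        int upFront = initialRequests(4, 2);
        int later = onStatusReceived(ContainerState.NEW)
                  + onStatusReceived(ContainerState.RUNNING);
        System.out.println(upFront + " " + later); // prints "2 1"
    }
}
```

So the cluster transiently runs one worker short (until the NEW container's 
status report arrives) instead of transiently running one worker over.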



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-27 Thread YufeiLiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046335#comment-17046335
 ] 

YufeiLiu commented on FLINK-16215:
--

[~xintongsong] I'm thinking about reusing the slots that will be recovered, and 
about the {{cluster.slots-number.min}} configuration. If we use async 
execution, we risk allocating more resources than we expect.


 If the {{evenly-spread-out-slots}} config is set, it will try to use every TM 
rather than fill up one of them. Here is a case: if source parallelism is 2 and 
sink parallelism is 4, we need 2 TMs if they have 2 slots each. When the JM 
fails over, if the SlotRequest of the source comes first, it will start a new 
worker and wait for the slot report, so 3 TMs will be registered by then; when 
the SlotRequest of the sink arrives, all of the TMs will be used and none can 
be released by the timeout check.


 Maybe this case is too extreme, but I think putting this work in the 
initialization stage would be better if it doesn't take a long time.
 Here is the approach I'm thinking of:
 * In {{getContainersFromPreviousAttempts}}, add all recovered containers to 
the {{workerNodeMap}}, bind a {{CompletableFuture}} to each 
{{YarnWorkerNode}}, then start an async status query for each container.
 * In {{onContainerStatusReceived}}, complete the future of this container.
 * In {{prepareLeadershipAsync}}, wait for the container status reports; if a 
returned container's state is {{NEW}}, release it and remove it from the 
{{workerNodeMap}}; if it is {{RUNNING}}, put it into {{pendingSlots}}.
 * Once all container statuses are confirmed, initialize the minimum number of 
workers if {{cluster.slots-number.min}} is set, which should reduce the 
{{pendingSlots}}.
 * The preparation work is done and leadership can be confirmed.
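The steps above can be sketched with {{CompletableFuture}}. The types below are 
simplified stand-ins (strings for container IDs, an enum for the status), not 
the real {{YarnResourceManager}} members:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Sketch of the proposed recovery flow: one future per recovered container,
 *  completed when its status report arrives; only RUNNING containers are kept
 *  as pending slots before leadership is confirmed. Hypothetical types. */
class RecoveryFlow {
    enum State { NEW, RUNNING }

    final Map<String, CompletableFuture<State>> workerNodeMap = new LinkedHashMap<>();
    final List<String> pendingSlots = new ArrayList<>();
    final List<String> released = new ArrayList<>();

    /** getContainersFromPreviousAttempts: register each container with a future. */
    void recover(String containerId) {
        workerNodeMap.put(containerId, new CompletableFuture<>());
    }

    /** onContainerStatusReceived: complete this container's future. */
    void onContainerStatusReceived(String containerId, State state) {
        workerNodeMap.get(containerId).complete(state);
    }

    /** prepareLeadershipAsync: wait for all status reports, then classify.
     *  NEW containers never started a TM, so release them; RUNNING ones are
     *  expected to re-register, so treat their slots as pending. */
    void awaitAllAndClassify() {
        for (Map.Entry<String, CompletableFuture<State>> e : workerNodeMap.entrySet()) {
            State s = e.getValue().join(); // blocks until the report arrived
            if (s == State.NEW) released.add(e.getKey());
            else pendingSlots.add(e.getKey());
        }
        workerNodeMap.keySet().removeAll(released);
    }
}
```

In the real code the blocking {{join}} would instead be a 
{{CompletableFuture.allOf(...)}} continuation, so the main thread is never 
blocked and leadership is confirmed in the callback.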



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046241#comment-17046241
 ] 

Xintong Song commented on FLINK-16215:
--

I believe the problem of identifying recovered containers in which no TM 
process is started should be a separate issue. I created another ticket, 
FLINK-16299, for it.

I also linked this ticket to FLINK-13554 and FLINK-15959, because I believe the 
former fixes this issue, and the latter is somehow related to it.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046207#comment-17046207
 ] 

Xintong Song commented on FLINK-16215:
--

Just trying to understand: why do we need to block confirming leadership until 
the recovery is done?

I was thinking about the following approach.
 * In {{getContainersFromPreviousAttempts}}, add all recovered containers to 
the {{workerNodeMap}}, and start an async status query for each container.
 * In {{onContainerStatusReceived}}, if the returned container's state is 
{{NEW}}, release it and remove it from the {{workerNodeMap}}.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-26 Thread Yangze Guo (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046198#comment-17046198
 ] 

Yangze Guo commented on FLINK-16215:


[~liuyufei]

The SlotManager will decide with what resources, and how many, the TMs should 
be started. For more details, you could take a look at the design doc of 
FLINK-14106.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-26 Thread YufeiLiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046195#comment-17046195
 ] 

YufeiLiu commented on FLINK-16215:
--

[~xintongsong]
This sounds good. We can do some actual recovery work, like releasing unused 
containers, rather than just putting all previous containers into the 
{{workerNodeMap}}.

It needs some changes in {{YarnResourceManager}}; it's kind of like the 
{{MesosResourceManager}} recovery process. I think putting the recovery work in 
{{prepareLeadershipAsync}} would be nice, which doesn't confirm leadership 
until the recovery is done. Do we have plans to improve on this?

Besides, I have a question about "RM not assuming all TMs have the same 
resource": what, then, decides the specification of a TM?



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046062#comment-17046062
 ] 

Xintong Song commented on FLINK-16215:
--

[~liuyufei] 
I see your point. You mean that for FLINK-15959, in case of a failover it's 
hard to know how many new TMs we need to start to meet the min slots?

I did some investigation and came up with another potential solution for 
identifying the allocated-but-not-started containers among all recovered 
containers. We can request the container status from Yarn by calling 
{{NMClientAsync#getContainerStatusAsync}}; the result will be returned in 
{{onContainerStatusReceived}}. The {{ContainerStatus#getState}} should be 
{{RUNNING}} for containers in which the {{TaskExecutor}} process is already 
started, and {{NEW}} for containers that were allocated but not started yet.

In this way, we can release the containers that were allocated but not 
started, and also decide how many new TMs we need to start. An alternative is 
to directly start a TM process inside those allocated-but-not-started 
containers. But we are working on another effort to make the RM not assume 
that all TMs have the same resources, which means the RM cannot decide the 
resource configurations to start the new TM with.
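The triage this describes can be sketched as follows. In the real code the 
states would come from Yarn's {{ContainerStatus}} via 
{{NMClientAsync#getContainerStatusAsync}}; here they are plain enums so the 
decision logic is self-contained (all names besides those two are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: classify recovered containers by reported state. NEW containers
 *  (allocated, but no TaskExecutor ever launched) are released and replaced;
 *  RUNNING containers are expected to re-register their TM. */
class ContainerTriage {
    enum State { NEW, RUNNING }

    /** Returns how many replacement TMs to request; collects the container
     *  IDs that should be released into {@code toRelease}. */
    static int triage(Map<String, State> recovered, List<String> toRelease) {
        int replacements = 0;
        for (Map.Entry<String, State> e : recovered.entrySet()) {
            if (e.getValue() == State.NEW) {
                toRelease.add(e.getKey()); // TM never started in this container
                replacements++;            // request a fresh worker in its place
            }
        }
        return replacements;
    }

    public static void main(String[] args) {
        Map<String, State> recovered = new LinkedHashMap<>();
        recovered.put("container_01", State.RUNNING);
        recovered.put("container_02", State.NEW);
        List<String> release = new ArrayList<>();
        System.out.println(triage(recovered, release) + " " + release);
        // prints "1 [container_02]"
    }
}
```

Releasing rather than launching into the NEW container matters because, once 
the RM no longer assumes uniform TM resources, it has no resource profile with 
which to launch a TM into a bare recovered container.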



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-26 Thread YufeiLiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045565#comment-17045565
 ] 

YufeiLiu commented on FLINK-16215:
--

[~xintongsong] I understand your concern: we can't know how many slots will be 
recovered when the JM fails over. I came up with this because of the issue we 
discussed before, [FLINK-15959|https://issues.apache.org/jira/browse/FLINK-15959]; 
it's hard to know exactly how many slots are missing at startup.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-23 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043137#comment-17043137
 ] 

Xintong Song commented on FLINK-16215:
--

I share [~trohrmann]'s concern.

On a Yarn deployment, {{YarnResourceManager}} starts a {{TaskExecutor}} in two 
steps:
1. Request a container from Yarn.
2. Launch the {{TaskExecutor}} process inside the allocated container.
If the JM failover happens between the two steps, the container will be 
recovered but no {{TaskExecutor}} will be started inside it.

I think it is a problem that for such a container, neither will a 
{{TaskExecutor}} be started in it, nor will it be released. This might be 
solved by FLINK-13554, with a timeout for starting new {{TaskExecutor}}s. We 
can apply this timeout to recovered containers as well.

FYI, the Kubernetes deployment does not have this problem, because allocating 
the pod/container and starting the {{TaskExecutor}} happen in one step.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-23 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043134#comment-17043134
 ] 

Yang Wang commented on FLINK-16215:
---

I think even if we make {{recoverWokerNode}} an interface and do the recovery 
before the slot requests come in, we still could not completely avoid this 
problem, since there is no guarantee that we can get all the previous 
containers from the recovery process. Some other containers may also be 
returned via subsequent heartbeats.

Maybe the {{JobMaster}} should be aware of the failover and could recover the 
running tasks from the {{TaskManager}}. If that fails with a timeout, it would 
then allocate a new slot from the {{ResourceManager}}. This is just a rough 
thought; please correct me if I am wrong.

 



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042036#comment-17042036
 ] 

Till Rohrmann commented on FLINK-16215:
---

I think at the moment I would not recommend doing it, because when recovering 
the previous containers we assume that a {{TaskExecutor}} has already been 
started in each container. This is not necessarily true; it could simply mean 
that we have obtained a container which we don't use. If this gets fixed, and 
if we can guarantee that in every container there is a {{TaskExecutor}} 
running, then I believe we could treat the slots belonging to these not-yet 
registered {{TaskExecutors}} as pending slots.



[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

2020-02-21 Thread Andrey Zagrebin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041869#comment-17041869
 ] 

Andrey Zagrebin commented on FLINK-16215:
-

I assume this is about an active RM integration, e.g. Yarn.

I agree that reusing existing TMs is better, but if this happens rarely, given 
that it is hard to reproduce, why is it a problem? The failover should also be 
a rare case, so starting a new TM should not be a big penalty, and the existing 
TMs will just disappear after the timeout, as already pointed out.

cc [~trohrmann]

> Start redundant TaskExecutor when JM failed
> ---
>
> Key: FLINK-16215
> URL: https://issues.apache.org/jira/browse/FLINK-16215
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: YufeiLiu
>Priority: Major