[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-9567:
--
Labels: pull-request-available  (was: )

> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Assignee: Shimin Yang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.6.0
>
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does 
> not release task manager containers in specific cases. In the worst case, I 
> had a job configured for 5 task managers that ended up holding more than 100 
> containers. Although the task did not fail, it affected other jobs in the 
> Yarn Cluster.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources. The container was killed before the restart, but 
> the *onContainerComplete* callback in *YarnResourceManager*, which should be 
> invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, 
> as we can see in line 347 of the FlinkYarnProblem log, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> Flink lost the connection to container 24, which runs on the bd-r1hdp69 
> machine. When it tried to call *closeTaskManagerConnection* in 
> *onContainerComplete*, it no longer had a connection to the TaskManager on 
> container 24, so it simply ignored the TaskManager close.
> 2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
> open TaskExecutor connection container_1528707394163_29461_02_24. 
> Ignoring close TaskExecutor connection.
> However, before calling *closeTaskManagerConnection*, it had already called 
> *requestYarnContainer*, which increased the *numPendingContainerRequests* 
> variable in *YarnResourceManager* by 1.
> As the return of excess containers is governed by the 
> *numPendingContainerRequests* variable in *YarnResourceManager*, Flink 
> cannot return this container even though it is not required. Meanwhile, 
> since the restart logic has already allocated enough containers for the 
> Task Managers, Flink holds the extra container for a long time for nothing.
> In the full log, the job ended with 7 containers while only 3 were running 
> TaskManagers.
> ps: Another strange thing I found is that a request for a yarn container 
> sometimes returns many more containers than requested. Is this normal 
> behavior for AMRMAsyncClient?
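
The accounting problem described above can be sketched in a few lines. This is a hypothetical simplification, not Flink's actual YarnResourceManager code; the field and method names merely mirror the ones mentioned in the report, and the callback bodies are reduced to the counter arithmetic the report describes.

```java
// Hypothetical sketch of the pending-request accounting bug described above.
// Names mirror those in the report; this is NOT Flink's actual implementation.
public class PendingRequestSketch {
    int numPendingContainerRequests = 0;
    int heldContainers = 0;

    // A completed (lost) container triggers a replacement request,
    // as the report says onContainersCompleted does via requestYarnContainer.
    void onContainerCompleted() {
        requestYarnContainer();
    }

    void requestYarnContainer() {
        numPendingContainerRequests++;
    }

    // An allocated container is returned to YARN immediately only when no
    // requests are pending; otherwise it is kept.
    void onContainerAllocated() {
        if (numPendingContainerRequests > 0) {
            numPendingContainerRequests--;
            heldContainers++; // kept, even if restart already allocated enough
        }
        // else: the excess container would be returned to YARN
    }

    public static void main(String[] args) {
        PendingRequestSketch rm = new PendingRequestSketch();
        // Restart logic requests and receives 5 containers for 5 task managers:
        for (int i = 0; i < 5; i++) { rm.requestYarnContainer(); }
        for (int i = 0; i < 5; i++) { rm.onContainerAllocated(); }
        // Late completion callback for the pre-restart container 24 arrives:
        rm.onContainerCompleted();
        // YARN satisfies the spurious request; the extra container is kept:
        rm.onContainerAllocated();
        System.out.println(rm.heldContainers); // prints 6 for a 5-TM job
    }
}
```

Because the spurious request raised the pending count back above zero, the sixth container is not recognized as excess and stays allocated, which matches the "5 task managers, 100+ containers" symptom when this repeats.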



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-18 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9567:
-
Fix Version/s: 1.6.0

> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Critical
> Fix For: 1.6.0
>
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does 
> not release task manager containers in specific cases. In the worst case, I 
> had a job configured for 5 task managers that ended up holding more than 100 
> containers. Although the task did not fail, it affected other jobs in the 
> Yarn Cluster.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources. The container was killed before the restart, but 
> the *onContainerComplete* callback in *YarnResourceManager*, which should be 
> invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, 
> as we can see in line 347 of the FlinkYarnProblem log, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> Flink lost the connection to container 24, which runs on the bd-r1hdp69 
> machine. When it tried to call *closeTaskManagerConnection* in 
> *onContainerComplete*, it no longer had a connection to the TaskManager on 
> container 24, so it simply ignored the TaskManager close.
> 2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
> open TaskExecutor connection container_1528707394163_29461_02_24. 
> Ignoring close TaskExecutor connection.
> However, before calling *closeTaskManagerConnection*, it had already called 
> *requestYarnContainer*, which increased the *numPendingContainerRequests* 
> variable in *YarnResourceManager* by 1.
> As the return of excess containers is governed by the 
> *numPendingContainerRequests* variable in *YarnResourceManager*, Flink 
> cannot return this container even though it is not required. Meanwhile, 
> since the restart logic has already allocated enough containers for the 
> Task Managers, Flink holds the extra container for a long time for nothing.
> In the full log, the job ended with 7 containers while only 3 were running 
> TaskManagers.
> ps: Another strange thing I found is that a request for a yarn container 
> sometimes returns many more containers than requested. Is this normal 
> behavior for AMRMAsyncClient?





[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Description: 
After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not 
release task manager containers in specific cases. In the worst case, I had a 
job configured for 5 task managers that ended up holding more than 100 
containers. Although the task did not fail, it affected other jobs in the Yarn 
Cluster.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources. The container was killed before the restart, but the 
*onContainerComplete* callback in *YarnResourceManager*, which should be 
invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, as 
we can see in line 347 of the FlinkYarnProblem log, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated]

Flink lost the connection to container 24, which runs on the bd-r1hdp69 
machine. When it tried to call *closeTaskManagerConnection* in 
*onContainerComplete*, it no longer had a connection to the TaskManager on 
container 24, so it simply ignored the TaskManager close.

2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
open TaskExecutor connection container_1528707394163_29461_02_24. Ignoring 
close TaskExecutor connection.

However, before calling *closeTaskManagerConnection*, it had already called 
*requestYarnContainer*, which increased the *numPendingContainerRequests* 
variable in *YarnResourceManager* by 1.

As the return of excess containers is governed by the 
*numPendingContainerRequests* variable in *YarnResourceManager*, Flink cannot 
return this container even though it is not required. Meanwhile, since the 
restart logic has already allocated enough containers for the Task Managers, 
Flink holds the extra container for a long time for nothing. 

In the full log, the job ended with 7 containers while only 3 were running 
TaskManagers.

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?

  was:
After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not 
release task manager containers in specific cases. In the worst case, I had a 
job configured for 5 task managers that ended up holding more than 100 
containers. Although the task did not fail, it affected other jobs in the Yarn 
Cluster.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources. The container was killed before the restart, but the 
*onContainerComplete* callback in *YarnResourceManager*, which should be 
invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, as 
we can see in line 347 of the FlinkYarnProblem log, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated]

Flink lost the connection to container 24, which runs on the bd-r1hdp69 
machine. When it tried to call *closeTaskManagerConnection* in 
*onContainerComplete*, it no longer had a connection to the TaskManager on 
container 24, so it simply ignored the TaskManager close.

2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
open TaskExecutor connection container_1528707394163_29461_02_24. Ignoring 
close TaskExecutor connection.

However, before calling *closeTaskManagerConnection*, it had already called 
*requestYarnContainer*, which increased the *numPendingContainerRequests* 
variable in *YarnResourceManager* by 1.

As the return of excess containers is governed by the 
*numPendingContainerRequests* variable in *YarnResourceManager*, Flink cannot 
return this container even though it is not required. Meanwhile, since the 
restart logic has already allocated enough containers for the Task Managers, 
Flink holds the extra container for a long time for nothing. 

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?


> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Critical
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does 
> not release task manager 

[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Description: 
After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not 
release task manager containers in specific cases. In the worst case, I had a 
job configured for 5 task managers that ended up holding more than 100 
containers. Although the task did not fail, it affected other jobs in the Yarn 
Cluster.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources. The container was killed before the restart, but the 
*onContainerComplete* callback in *YarnResourceManager*, which should be 
invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, as 
we can see in line 347 of the FlinkYarnProblem log, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated]

Flink lost the connection to container 24, which runs on the bd-r1hdp69 
machine. When it tried to call *closeTaskManagerConnection* in 
*onContainerComplete*, it no longer had a connection to the TaskManager on 
container 24, so it simply ignored the TaskManager close.

2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
open TaskExecutor connection container_1528707394163_29461_02_24. Ignoring 
close TaskExecutor connection.

However, before calling *closeTaskManagerConnection*, it had already called 
*requestYarnContainer*, which increased the *numPendingContainerRequests* 
variable in *YarnResourceManager* by 1.

As the return of excess containers is governed by the 
*numPendingContainerRequests* variable in *YarnResourceManager*, Flink cannot 
return this container even though it is not required. Meanwhile, since the 
restart logic has already allocated enough containers for the Task Managers, 
Flink holds the extra container for a long time for nothing. 

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?

  was:
After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not 
release task manager containers in specific cases. In the worst case, I had a 
job configured for 5 task managers that ended up holding more than 100 
containers. Although the task did not fail, it affected other jobs in the Yarn 
Cluster.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources. The container was killed before the restart, but the 
*onContainerComplete* callback in *YarnResourceManager*, which should be 
invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, as 
we can see in line 347 of the FlinkYarnProblem log, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated]

Flink lost the connection to container 24, which runs on the bd-r1hdp69 
machine. When it tried to call *closeTaskManagerConnection* in 
*onContainerComplete*, it no longer had a connection to the TaskManager on 
container 24, so it simply ignored the TaskManager close.

2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
open TaskExecutor connection container_1528707394163_29461_02_24. Ignoring 
close TaskExecutor connection.

However, before calling *closeTaskManagerConnection*, it had already called 
*requestYarnContainer*, which increased the *numPendingContainerRequests* 
variable in *YarnResourceManager* by 1.

As the return of excess containers is governed by the 
*numPendingContainerRequests* variable in *YarnResourceManager*, Flink cannot 
return this container even though it is not required. Meanwhile, since the 
restart logic has already allocated enough containers for the Task Managers, 
Flink holds the extra container for a long time for nothing. 

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?


> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Critical
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does 
> not release task manager containers in specific cases. In the worst case, I 
> had a job configured for 5 

[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Description: 
After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not 
release task manager containers in specific cases. In the worst case, I had a 
job configured for 5 task managers that ended up holding more than 100 
containers. Although the task did not fail, it affected other jobs in the Yarn 
Cluster.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources. The container was killed before the restart, but the 
*onContainerComplete* callback in *YarnResourceManager*, which should be 
invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, as 
we can see in line 347 of the FlinkYarnProblem log, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated]

Flink lost the connection to container 24, which runs on the bd-r1hdp69 
machine. When it tried to call *closeTaskManagerConnection* in 
*onContainerComplete*, it no longer had a connection to the TaskManager on 
container 24, so it simply ignored the TaskManager close.

2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
open TaskExecutor connection container_1528707394163_29461_02_24. Ignoring 
close TaskExecutor connection.

However, before calling *closeTaskManagerConnection*, it had already called 
*requestYarnContainer*, which increased the *numPendingContainerRequests* 
variable in *YarnResourceManager* by 1.

As the return of excess containers is governed by the 
*numPendingContainerRequests* variable in *YarnResourceManager*, Flink cannot 
return this container even though it is not required. Meanwhile, since the 
restart logic has already allocated enough containers for the Task Managers, 
Flink holds the extra container for a long time for nothing. 

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?

  was:
After restarting the Job Manager in Yarn Cluster mode, Flink does not release 
task manager containers in some specific cases.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources, although the Task Manager in the container with id 24 
was released before the restart. 

But in line 347, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated] 

this problem caused Flink to request one more container than needed. As the 
return of excess containers is determined by the *numPendingContainerRequests* 
variable in *YarnResourceManager*, I think it is *onContainersCompleted* in 
*YarnResourceManager* that called the method *requestYarnContainer*, which 
leads to the increase of *numPendingContainerRequests*. However, the restart 
logic has already allocated enough containers for the Task Managers, so Flink 
holds the extra container for a long time for nothing. In the worst case, I had 
a job configured for 5 task managers that ended up holding more than 100 
containers.

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?


> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Major
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does 
> not release task manager containers in specific cases. In the worst case, I 
> had a job configured for 5 task managers that ended up holding more than 100 
> containers. Although the task did not fail, it affected other jobs in the 
> Yarn Cluster.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources. The container was killed before the restart, but 
> the *onContainerComplete* callback in *YarnResourceManager*, which should be 
> invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, 
> as we can see in line 347 of the FlinkYarnProblem log, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] 

[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Priority: Critical  (was: Major)

> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Critical
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does 
> not release task manager containers in specific cases. In the worst case, I 
> had a job configured for 5 task managers that ended up holding more than 100 
> containers. Although the task did not fail, it affected other jobs in the 
> Yarn Cluster.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources. The container was killed before the restart, but 
> the *onContainerComplete* callback in *YarnResourceManager*, which should be 
> invoked by Yarn's *AMRMAsyncClient*, was never received. After the restart, 
> as we can see in line 347 of the FlinkYarnProblem log, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> Flink lost the connection to container 24, which runs on the bd-r1hdp69 
> machine. When it tried to call *closeTaskManagerConnection* in 
> *onContainerComplete*, it no longer had a connection to the TaskManager on 
> container 24, so it simply ignored the TaskManager close.
> 2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No 
> open TaskExecutor connection container_1528707394163_29461_02_24. 
> Ignoring close TaskExecutor connection.
> However, before calling *closeTaskManagerConnection*, it had already called 
> *requestYarnContainer*, which increased the *numPendingContainerRequests* 
> variable in *YarnResourceManager* by 1.
> As the return of excess containers is governed by the 
> *numPendingContainerRequests* variable in *YarnResourceManager*, Flink 
> cannot return this container even though it is not required. Meanwhile, 
> since the restart logic has already allocated enough containers for the 
> Task Managers, Flink holds the extra container for a long time for nothing.
> ps: Another strange thing I found is that a request for a yarn container 
> sometimes returns many more containers than requested. Is this normal 
> behavior for AMRMAsyncClient?





[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Attachment: fulllog.txt

> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Major
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink does not 
> release task manager containers in some specific cases.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources, although the Task Manager in the container with 
> id 24 was released before the restart. 
> But in line 347, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] ms. Reason: [Disassociated] 
> this problem caused Flink to request one more container than needed. As the 
> return of excess containers is determined by the 
> *numPendingContainerRequests* variable in *YarnResourceManager*, I think it 
> is *onContainersCompleted* in *YarnResourceManager* that called the method 
> *requestYarnContainer*, which leads to the increase of 
> *numPendingContainerRequests*. However, the restart logic has already 
> allocated enough containers for the Task Managers, so Flink holds the extra 
> container for a long time for nothing. In the worst case, I had a job 
> configured for 5 task managers that ended up holding more than 100 
> containers.
> ps: Another strange thing I found is that a request for a yarn container 
> sometimes returns many more containers than requested. Is this normal 
> behavior for AMRMAsyncClient?





[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Attachment: FlinkYarnProblem

> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Major
> Attachments: FlinkYarnProblem
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink does not 
> release task manager containers in some specific cases.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources, although the Task Manager in the container with 
> id 24 was released before the restart. 
> But in line 347, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] ms. Reason: [Disassociated] 
> this problem caused Flink to request one more container than needed. As the 
> return of excess containers is determined by the 
> *numPendingContainerRequests* variable in *YarnResourceManager*, I think it 
> is *onContainersCompleted* in *YarnResourceManager* that called the method 
> *requestYarnContainer*, which leads to the increase of 
> *numPendingContainerRequests*. However, the restart logic has already 
> allocated enough containers for the Task Managers, so Flink holds the extra 
> container for a long time for nothing. In the worst case, I had a job 
> configured for 5 task managers that ended up holding more than 100 
> containers.
> ps: Another strange thing I found is that a request for a yarn container 
> sometimes returns many more containers than requested. Is this normal 
> behavior for AMRMAsyncClient?





[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Attachment: (was: jobmanager.log)

> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Major
> Attachments: FlinkYarnProblem
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink does not 
> release task manager containers in some specific cases.
> In the first log I posted, the container with id 24 is the reason why Yarn 
> did not release resources, although the Task Manager in the container with 
> id 24 was released before the restart. 
> But in line 347, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has 
> failed, address is now gated for [50] ms. Reason: [Disassociated] 
> this problem caused Flink to request one more container than needed. As the 
> return of excess containers is determined by the 
> *numPendingContainerRequests* variable in *YarnResourceManager*, I think it 
> is *onContainersCompleted* in *YarnResourceManager* that called the method 
> *requestYarnContainer*, which leads to the increase of 
> *numPendingContainerRequests*. However, the restart logic has already 
> allocated enough containers for the Task Managers, so Flink holds the extra 
> container for a long time for nothing. In the worst case, I had a job 
> configured for 5 task managers that ended up holding more than 100 
> containers.
> ps: Another strange thing I found is that a request for a yarn container 
> sometimes returns many more containers than requested. Is this normal 
> behavior for AMRMAsyncClient?





[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-14 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Description: 
After restarting the Job Manager in Yarn Cluster mode, Flink does not release 
task manager containers in some specific cases.

In the first log I posted, the container with id 24 is the reason why Yarn did 
not release resources, although the Task Manager in the container with id 24 
was released before the restart. 

But in line 347, 

2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - 
Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, 
address is now gated for [50] ms. Reason: [Disassociated] 

this problem caused Flink to request one more container than needed. As the 
return of excess containers is determined by the *numPendingContainerRequests* 
variable in *YarnResourceManager*, I think it is *onContainersCompleted* in 
*YarnResourceManager* that called the method *requestYarnContainer*, which 
leads to the increase of *numPendingContainerRequests*. However, the restart 
logic has already allocated enough containers for the Task Managers, so Flink 
holds the extra container for a long time for nothing. In the worst case, I had 
a job configured for 5 task managers that ended up holding more than 100 
containers.

ps: Another strange thing I found is that a request for a yarn container 
sometimes returns many more containers than requested. Is this normal behavior 
for AMRMAsyncClient?

  was:
After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not 
release Task Manager containers. According to my observation, the reason is 
that the instance variable *numPendingContainerRequests* in the 
*YarnResourceManager* class does not decrease, since the requested containers 
were never received. And after the restart of the Job Manager, 
*numPendingContainerRequests* is increased again by the number of task 
executors.

Since the callback *onContainersAllocated* returns an excess container 
immediately only if *numPendingContainerRequests* <= 0, the number of 
containers grows bigger and bigger while only a few of them act as Task 
Managers.

I think it is important to clear the *numPendingContainerRequests* variable 
after restarting the Job Manager, but I am not very clear on how to do that. 
There is no way to decrease *numPendingContainerRequests* other than 
*onContainersAllocated*. Is it fine to add a method that operates on the 
*numPendingContainerRequests* variable? Meanwhile, there is no handle to the 
YarnResourceManager in the *ExecutionGraph* restart logic.

ps: Another strange thing I found is that a request for a Yarn container 
sometimes returns many more containers than requested. Is this a normal 
scenario for AMRMAsyncClient?
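A minimal sketch of the clearing idea suggested above (purely hypothetical: 
`resetPendingRequests` is an assumed new method, not an existing Flink API, and 
the class is a simplified stand-in for *YarnResourceManager*):

```java
// Toy counter with an explicit reset. On Job Manager restart, stale pending
// requests from the previous attempt are discarded before the restart logic
// re-requests containers, so no excess containers are kept afterwards.
public class ResettablePendingRequests {
    private int numPendingContainerRequests = 0;

    public void requestContainer() {
        numPendingContainerRequests++;
    }

    // Returns false for an excess container that should go back to YARN.
    public boolean onContainersAllocated() {
        if (numPendingContainerRequests <= 0) {
            return false;
        }
        numPendingContainerRequests--;
        return true;
    }

    // Assumed new method: clear stale requests on Job Manager restart.
    public void resetPendingRequests() {
        numPendingContainerRequests = 0;
    }

    public int pending() {
        return numPendingContainerRequests;
    }
}
```

After a reset, only the containers re-requested by the restart logic are 
accepted; any further allocation is treated as excess and returned.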


> Flink does not release resource in Yarn Cluster mode
> 
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, YARN
>Affects Versions: 1.5.0
>Reporter: Shimin Yang
>Priority: Major
> Attachments: jobmanager.log
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)



[jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

2018-06-11 Thread Shimin Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shimin Yang updated FLINK-9567:
---
Attachment: jobmanager.log
