[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-25 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Attachment: YARN-7790.003.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch, YARN-7790.002.patch, 
> YARN-7790.003.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the AM launcher then connects to that NM to 
> launch the container. The NM can still crash right after the heartbeat, which 
> leaves the AM hanging for a while, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
>  - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by setting the following 
> configs so that a container launch fails earlier on an NM with connectivity issues.
> {code:java}
> RetryPolicy retryPolicy =
> createRetryPolicy(conf,
>   YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.
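
For illustration, a minimal sketch (not taken from the attached patches) of how
those two client-side settings could be tightened programmatically; the 30s/5s
values are hypothetical and would need tuning per cluster:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TightenNmConnectRetry {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Hypothetical values: give up on an unreachable NM after ~30s,
    // retrying every 5s, instead of the much longer defaults.
    conf.setLong(YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS, 30 * 1000L);
    conf.setLong(YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS, 5 * 1000L);

    System.out.println("NM connect max wait ms = "
        + conf.getLong(YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS, -1));
  }
}
{code}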






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Attachment: YARN-7790.002.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch, YARN-7790.002.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the AM launcher then connects to that NM to 
> launch the container. The NM can still crash right after the heartbeat, which 
> leaves the AM hanging for a while, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
>  - Skip any node that has missed X node heartbeats (sketched below).
> In addition, it is better to reduce the wait time by setting the following 
> configs so that a container launch fails earlier on an NM with connectivity issues.
> {code:java}
> RetryPolicy retryPolicy =
> createRetryPolicy(conf,
>   YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.
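
For the first part, a rough, self-contained sketch of the kind of staleness
check a scheduler could apply before allocating on a node; the class name,
threshold, and interval below are hypothetical and not taken from the attached
patches:
{code:java}
// Hypothetical helper: skip a node in the async allocation loop if it has
// missed more than MAX_MISSED_HEARTBEATS expected heartbeats.
final class NodeHeartbeatCheck {
  private static final int MAX_MISSED_HEARTBEATS = 2; // "X", illustrative only

  static boolean shouldSkip(long lastHeartbeatTimeMs, long heartbeatIntervalMs,
                            long nowMs) {
    long elapsedMs = nowMs - lastHeartbeatTimeMs;
    return elapsedMs > (long) MAX_MISSED_HEARTBEATS * heartbeatIntervalMs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Node last heartbeated 10s ago with a 3s heartbeat interval: skip it.
    System.out.println(shouldSkip(now - 10_000, 3_000, now)); // true
    // Node heartbeated 2s ago: still considered healthy.
    System.out.println(shouldSkip(now - 2_000, 3_000, now));  // false
  }
}
{code}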






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Description: 
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
 - Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}

The second part is not covered by the patch.

  was:
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
 - Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the AM launcher then connects to that NM to 
> launch the container. The NM can still crash right after the heartbeat, which 
> leaves the AM hanging for a while, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
>  - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by setting the following 
> configs so that a container launch fails earlier on an NM with connectivity issues.
> {code:java}
> RetryPolicy retryPolicy =
> createRetryPolicy(conf,
>   YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Description: 
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
 - Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}

  was:
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
- Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the AM launcher then connects to that NM to 
> launch the container. The NM can still crash right after the heartbeat, which 
> leaves the AM hanging for a while, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
>  - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by setting the following 
> configs so that a container launch fails earlier on an NM with connectivity issues.
> {code:java}
> RetryPolicy retryPolicy =
> createRetryPolicy(conf,
>   YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Description: 
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
1) Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}

  was:
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the container is sent back to the NM in that same 
heartbeat response. The NM can still crash right after the heartbeat, which leaves 
the AM hanging for about 10 minutes, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can cause applications to hang for a long time. After 
discussing with [~sunilg], we need at least two fixes:

When async scheduling is enabled:
1) Skip any node that has missed X node heartbeats.
2) Kill an AM container in ALLOCATED state on a node that has missed Y node heartbeats.

Credit to [~ssath...@hortonworks.com] who found the issue.


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the AM launcher then connects to that NM to 
> launch the container. The NM can still crash right after the heartbeat, which 
> leaves the AM hanging for a while, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
> 1) Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by setting the following 
> configs so that a container launch fails earlier on an NM with connectivity issues.
> {code:java}
> RetryPolicy retryPolicy =
> createRetryPolicy(conf,
>   YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Description: 
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
- Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}

  was:
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the AM launcher then connects to that NM to launch the 
container. The NM can still crash right after the heartbeat, which leaves the AM 
hanging for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can easily cause applications to hang. After discussing with 
[~sunilg] and [~jianhe], we need one fix:

When async scheduling is enabled:
1) Skip any node that has missed X node heartbeats.

In addition, it is better to reduce the wait time by setting the following configs 
so that a container launch fails earlier on an NM with connectivity issues.
{code:java}
RetryPolicy retryPolicy =
createRetryPolicy(conf,
  YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
  YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
  YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the AM launcher then connects to that NM to 
> launch the container. The NM can still crash right after the heartbeat, which 
> leaves the AM hanging for a while, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
> - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by setting the following 
> configs so that a container launch fails earlier on an NM with connectivity issues.
> {code:java}
> RetryPolicy retryPolicy =
> createRetryPolicy(conf,
>   YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>   YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>   YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Reporter: Sumana Sathish  (was: Wangda Tan)

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the container is sent back to the NM in that 
> same heartbeat response. The NM can still crash right after the heartbeat, 
> which leaves the AM hanging for about 10 minutes, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can cause applications to hang for a long time. After 
> discussing with [~sunilg], we need at least two fixes:
> When async scheduling is enabled:
> 1) Skip any node that has missed X node heartbeats.
> 2) Kill an AM container in ALLOCATED state on a node that has missed Y node 
> heartbeats.
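
A purely illustrative sketch of the second proposed fix's decision (hypothetical
names and threshold, not the actual RM logic):
{code:java}
// Hypothetical check: release an AM container that has been sitting in
// ALLOCATED state while its node missed more than "Y" heartbeats, so the
// scheduler can place the AM elsewhere. All names/values are illustrative.
final class StaleAllocatedAmCheck {
  private static final int MAX_MISSED_HEARTBEATS_BEFORE_KILL = 5; // "Y"

  static boolean shouldKill(boolean isAmContainer, boolean inAllocatedState,
                            long missedHeartbeats) {
    return isAmContainer && inAllocatedState
        && missedHeartbeats > MAX_MISSED_HEARTBEATS_BEFORE_KILL;
  }

  public static void main(String[] args) {
    System.out.println(shouldKill(true, true, 6)); // true  -> kill and reschedule
    System.out.println(shouldKill(true, true, 2)); // false -> keep waiting
  }
}
{code}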






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Description: 
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the container is sent back to the NM in that same 
heartbeat response. The NM can still crash right after the heartbeat, which leaves 
the AM hanging for about 10 minutes, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can cause applications to hang for a long time. After 
discussing with [~sunilg], we need at least two fixes:

When async scheduling is enabled:
1) Skip any node that has missed X node heartbeats.
2) Kill an AM container in ALLOCATED state on a node that has missed Y node heartbeats.

Credit to [~ssath...@hortonworks.com] who found the issue.

  was:
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node has just 
heartbeated to the RM, and the container is sent back to the NM in that same 
heartbeat response. The NM can still crash right after the heartbeat, which leaves 
the AM hanging for about 10 minutes, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can cause applications to hang for a long time. After 
discussing with [~sunilg], we need at least two fixes:

When async scheduling is enabled:
1) Skip any node that has missed X node heartbeats.
2) Kill an AM container in ALLOCATED state on a node that has missed Y node heartbeats.


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the container is sent back to the NM in that 
> same heartbeat response. The NM can still crash right after the heartbeat, 
> which leaves the AM hanging for about 10 minutes, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can cause applications to hang for a long time. After 
> discussing with [~sunilg], we need at least two fixes:
> When async scheduling is enabled:
> 1) Skip any node that has missed X node heartbeats.
> 2) Kill an AM container in ALLOCATED state on a node that has missed Y node 
> heartbeats.
> Credit to [~ssath...@hortonworks.com] who found the issue.






[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-
Attachment: YARN-7790.001.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the node has 
> just heartbeated to the RM, and the container is sent back to the NM in that 
> same heartbeat response. The NM can still crash right after the heartbeat, 
> which leaves the AM hanging for about 10 minutes, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can cause applications to hang for a long time. After 
> discussing with [~sunilg], we need at least two fixes:
> When async scheduling is enabled:
> 1) Skip any node that has missed X node heartbeats.
> 2) Kill an AM container in ALLOCATED state on a node that has missed Y node 
> heartbeats.


