[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Trying the same patch again. No failures were actually found in the Jenkins 
console log.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
> YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch, YARN-2001.5.patch
>
>
> After failover, the RM may need a threshold to determine whether it’s safe 
> to make scheduling decisions and start accepting new container requests from 
> AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
> until that many nodes have joined before accepting new container requests. 
> Or it could simply be a timeout: the RM accepts new requests only after the 
> timeout expires. 
> NMs that join after the threshold can be treated as new NMs and instructed to 
> kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

The test passes locally; resubmitting the same patch.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
> YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Fixed logging to add the "wait" message.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
> YARN-2001.4.patch, YARN-2001.5.patch





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-16 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.4.patch

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
> YARN-2001.4.patch





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-11 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.3.patch

Rebased the patch.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Attachment: YARN-2001.2.patch

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch, YARN-2001.2.patch



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-06-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Attachment: YARN-2001.1.patch

Preliminary patch to demonstrate the idea:

1. Add a new timeout config so that schedulers do not allocate new containers 
until the timeout expires.
2. This config is used only if there are applications in the state store to 
recover. (The amount of time could also be proportional to the number of apps.)
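
The two points above could be sketched roughly as follows. This is only an 
illustration of the idea, not the code in the attached patch: the class name 
RecoveryWaitGate, its fields, and the explicit nowMs parameter are all 
hypothetical, and the clock is passed in by the caller to keep the sketch 
self-contained.

```java
/**
 * Hypothetical sketch: holds back new container allocations for a
 * configured wait after the RM has recovered its state.
 */
final class RecoveryWaitGate {
    private final long recoveryStartMs; // when the RM finished loading state
    private final long waitMs;          // effective wait before allocating

    RecoveryWaitGate(long recoveryStartMs, long configuredWaitMs, int appsToRecover) {
        this.recoveryStartMs = recoveryStartMs;
        // Point 2: the wait applies only if the state store holds
        // applications to recover; otherwise there is nothing to wait for.
        this.waitMs = appsToRecover > 0 ? configuredWaitMs : 0L;
    }

    /** Point 1: schedulers may allocate new containers only after the wait. */
    boolean canAllocate(long nowMs) {
        return nowMs - recoveryStartMs >= waitMs;
    }

    /** Time left before allocation resumes; useful for a "wait" log line. */
    long remainingWaitMs(long nowMs) {
        return Math.max(0L, waitMs - (nowMs - recoveryStartMs));
    }
}
```

A scheduler would check canAllocate(now) before handing out new containers 
and skip allocation while it returns false.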

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2001.1.patch





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: 
After failover, the RM may need a threshold to determine whether it’s safe to 
make scheduling decisions and start accepting new container requests from AMs. 
The threshold could be a certain number of nodes, i.e. the RM waits until that 
many nodes have joined before accepting new container requests. Or it could 
simply be a timeout: the RM accepts new requests only after the timeout expires. 
NMs that join after the threshold can be treated as new NMs and instructed to 
kill all their containers.

  was: After failover, the RM may need a threshold to determine whether it’s 
safe to make scheduling decisions and start accepting new container requests 
from AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
until that many nodes have joined before accepting new container requests. Or 
it could simply be a timeout: the RM accepts new requests only after the 
timeout expires.
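
For illustration, the threshold in the description could be sketched as 
below. The description offers a node count and a timeout as alternatives; 
this sketch assumes they are combined so that whichever is reached first 
opens the gate. The class FailoverThreshold, its method names, and the 
string node IDs are hypothetical and not taken from any patch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a post-failover threshold for accepting AM requests. */
final class FailoverThreshold {
    private final int requiredNodes;   // node-count threshold
    private final long timeoutMs;      // timeout threshold
    private final long failoverTimeMs; // when the new RM became active
    private final Set<String> resyncedNodes = new HashSet<>();
    private boolean open;              // once true, stays true

    FailoverThreshold(int requiredNodes, long timeoutMs, long failoverTimeMs) {
        this.requiredNodes = requiredNodes;
        this.timeoutMs = timeoutMs;
        this.failoverTimeMs = failoverTimeMs;
    }

    /** True once enough nodes have joined or the timeout has passed. */
    boolean isOpen(long nowMs) {
        if (!open && (resyncedNodes.size() >= requiredNodes
                || nowMs - failoverTimeMs >= timeoutMs)) {
            open = true;
        }
        return open;
    }

    /**
     * Records an NM registration. Returns true if the NM joined before the
     * threshold (its containers can be recovered), false if it joined after
     * and should be treated as a new NM and told to kill its containers.
     */
    boolean onNodeRegistered(String nodeId, long nowMs) {
        if (isOpen(nowMs)) {
            return false;
        }
        resyncedNodes.add(nodeId);
        return true;
    }
}
```

The RM would reject or defer new container requests from AMs while 
isOpen(now) is false.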


> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: After failover, the RM may need a threshold to determine 
whether it’s safe to make scheduling decisions and start accepting new 
container requests from AMs. The threshold could be a certain number of nodes, 
i.e. the RM waits until that many nodes have joined before accepting new 
container requests. Or it could simply be a timeout: the RM accepts new 
requests only after the timeout expires.  (was: RM may not accept allocate requests 
from AMs until all the NMs have re-synced back with RM. This is to eliminate 
some race conditions like containerIds overlapping between 
)

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> After failover, the RM may need a threshold to determine whether it’s safe 
> to make scheduling decisions and start accepting new container requests from 
> AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
> until that many nodes have joined before accepting new container requests. 
> Or it could simply be a timeout: the RM accepts new requests only after the 
> timeout expires.





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: 
RM may not accept allocate requests from AMs until all the NMs have re-synced 
back with RM. This is to eliminate some race conditions like containerIds 
overlapping between 


  was:
RM should not accept allocate requests from AMs until all the NMs have 
registered with RM. For that, RM needs to remember the previous NMs and wait 
for all the NMs to register.
This is also useful for remembering decommissioned nodes across restarts.


> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM may not accept allocate requests from AMs until all the NMs have re-synced 
> back with RM. This is to eliminate some race conditions like containerIds 
> overlapping between 





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Summary: Threshold for RM to accept requests from AM after failover  (was: 
Persist NMs info for RM restart)

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.


