[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Fixed logging to add the wait msg.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch


 After failover, the RM may need a threshold to determine when it is safe to 
 make scheduling decisions and start accepting new container requests from 
 AMs. The threshold could be a number of nodes, i.e. the RM waits until a 
 certain number of nodes have joined before accepting new container requests. 
 Or it could simply be a timeout, with the RM accepting new requests only 
 after the timeout expires. 
 NMs that join after the threshold can be treated as new NMs and instructed 
 to kill all their containers.
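
The threshold described above can be sketched roughly as follows. This is an
illustration only, not the code in the attached patches: the class
FailoverAdmissionGate and all of its members are hypothetical names, and the
sketch assumes the gate opens on whichever comes first, the node count or the
timeout, whereas the description offers them as alternatives. Timestamps are
passed in explicitly so the behaviour is easy to follow.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical gate deciding when a restarted RM may accept container requests.
 * None of these names come from the YARN code base or the attached patches.
 */
final class FailoverAdmissionGate {
    private final int requiredNodes;   // node-count threshold
    private final long timeoutMillis;  // time-based threshold
    private final long startMillis;    // when the RM became active
    private final Set<String> joinedNodes = new HashSet<>();
    private boolean open = false;

    FailoverAdmissionGate(int requiredNodes, long timeoutMillis, long startMillis) {
        this.requiredNodes = requiredNodes;
        this.timeoutMillis = timeoutMillis;
        this.startMillis = startMillis;
    }

    /** Re-evaluates the threshold; once open, the gate stays open. */
    synchronized boolean isOpen(long nowMillis) {
        if (!open && (joinedNodes.size() >= requiredNodes
                || nowMillis - startMillis >= timeoutMillis)) {
            open = true;
        }
        return open;
    }

    /**
     * Records an NM registration. Returns true if the NM joined before the
     * threshold was reached and may keep its containers, false if it joined
     * afterwards and should be treated as a new NM (told to kill its containers).
     */
    synchronized boolean onNodeJoined(String nodeId, long nowMillis) {
        if (isOpen(nowMillis)) {
            return false;
        }
        joinedNodes.add(nodeId);
        isOpen(nowMillis); // this registration may be the one that opens the gate
        return true;
    }
}
```

In this sketch an allocate request from an AM would be served only once
isOpen(...) returns true; how requests are treated before that point is left
open here.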



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Tests pass locally; resubmitting the same patch.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch




[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Trying the same patch again; no failures were actually found in the Jenkins 
console log.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch, YARN-2001.5.patch




[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-16 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.4.patch

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch




[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-11 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.3.patch

Rebased the patch.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch




[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-07-07 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Attachment: YARN-2001.2.patch

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-06-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Attachment: YARN-2001.1.patch

Preliminary patch to demonstrate the idea:

1. Add a new timeout config so that schedulers do not allocate new containers 
until the timeout expires.
2. The config is used only if there are applications in the state store to 
recover. (The wait time could also be proportional to the number of apps.)
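
A minimal sketch of how such a wait could be computed, under stated
assumptions: the helper RecoveryWait, its parameter names, the linear per-app
scaling, and the upper cap are all hypothetical and are not taken from the
patch or from any YARN configuration key.

```java
/** Hypothetical helper; the names and the per-app scaling are illustrative only. */
final class RecoveryWait {
    private RecoveryWait() {}

    /**
     * Returns how long schedulers should hold back new allocations.
     * No wait applies when the state store held no applications to recover.
     * Otherwise the wait grows linearly with the number of recovered apps
     * and is capped at maxMillis (the cap is an assumption of this sketch).
     */
    static long waitMillis(int recoveredApps, long baseMillis,
                           long perAppMillis, long maxMillis) {
        if (recoveredApps <= 0) {
            return 0L;
        }
        long scaled = baseMillis + perAppMillis * (long) recoveredApps;
        return Math.min(scaled, maxMillis);
    }

    /** True while the scheduler should still refuse to allocate new containers. */
    static boolean shouldHoldAllocations(long elapsedMillis, long waitMillis) {
        return elapsedMillis < waitMillis;
    }
}
```

With a base of 10000 ms and 100 ms per app, five recovered applications would
give a wait of 10500 ms, while an empty state store would skip the wait
entirely.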

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch




[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Summary: Threshold for RM to accept requests from AM after failover  (was: 
Persist NMs info for RM restart)

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 RM should not accept allocate requests from AMs until all the NMs have 
 registered with RM. For that, RM needs to remember the previous NMs and wait 
 for all the NMs to register.
 This is also useful for remembering decommissioned nodes across restarts.





[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: 
RM may not accept allocate requests from AMs until all the NMs have re-synced 
back with RM. This is to eliminate some race conditions like containerIds 
overlapping between 


  was:
RM should not accept allocate requests from AMs until all the NMs have 
registered with RM. For that, RM needs to remember the previous NMs and wait 
for all the NMs to register.
This is also useful for remembering decommissioned nodes across restarts.


 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He



[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: After failover, the RM may need a threshold to determine when 
it is safe to make scheduling decisions and start accepting new container 
requests from AMs. The threshold could be a number of nodes, i.e. the RM waits 
until a certain number of nodes have joined before accepting new container 
requests. Or it could simply be a timeout, with the RM accepting new requests 
only after the timeout expires.  (was: RM may not accept allocate requests 
from AMs until all the NMs have re-synced back with RM. This is to eliminate 
some race conditions like containerIds overlapping between 
)

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He



[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: 
After failover, the RM may need a threshold to determine when it is safe to 
make scheduling decisions and start accepting new container requests from 
AMs. The threshold could be a number of nodes, i.e. the RM waits until a 
certain number of nodes have joined before accepting new container requests. 
Or it could simply be a timeout, with the RM accepting new requests only after 
the timeout expires. 
NMs that join after the threshold can be treated as new NMs and instructed to 
kill all their containers.

  was:After failover, RM may require a certain threshold to determine whether 
it’s safe to make scheduling decisions and start accepting new container 
requests from AMs. The threshold could be a certain amount of nodes. i.e. RM 
waits until a certain amount of nodes joining before accepting new container 
requests.  Or it could simply be a timeout, only after the timeout RM accepts 
new requests.


 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
