[jira] [Assigned] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-2175:
---

Assignee: (was: Anubhav Dhoot)

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it is stuck in these 
> states. These delays may have nothing to do with the task itself and could be 
> an issue within the platform.
> Ideally there should be configurable time limits for the various states within 
> the NodeManager. The RM does not care about most of these states; they only 
> concern the AM and the NM. We can start by making these global configurable 
> defaults and later make it fancier by letting the AM override them in the 
> start-container request. 
> This jira will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.
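As a sketch of the kind of guard a configurable limit would enable: the snippet below is illustrative only; the class name, the killAction callback, and any configuration key behind timeoutMs are assumptions, not existing NodeManager code.

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative guard: start a timer when a container enters LOCALIZING and
// fire a kill/cleanup action if localization does not finish in time.
public class LocalizationTimeoutGuard {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> pending;

  public synchronized void onLocalizationStarted(long timeoutMs, Runnable killAction) {
    // killAction is a placeholder for whatever the NM would do to a stuck container.
    pending = scheduler.schedule(killAction, timeoutMs, TimeUnit.MILLISECONDS);
  }

  public synchronized void onLocalizationFinished() {
    if (pending != null) {
      pending.cancel(false);  // localization completed in time; disarm the guard
    }
  }
}
{code}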






[jira] [Assigned] (YARN-2661) Container Localization is not resource limited

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-2661:
---

Assignee: (was: Anubhav Dhoot)

> Container Localization is not resource limited
> --
>
> Key: YARN-2661
> URL: https://issues.apache.org/jira/browse/YARN-2661
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>
> Container localization itself can take up a lot of resources. Today it is 
> not resource-limited in any way and can adversely affect the actual containers 
> running on the node.
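One way such a limit could look at the OS level, as a rough sketch only (the cgroup v1 paths, the quota values, and the idea of a dedicated localizer cgroup are illustrative assumptions, not NM behavior):

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: place a localizer process into its own cgroup and cap its CPU.
// Requires appropriate permissions on the cgroup filesystem.
public class LocalizerCpuCap {
  public static void capLocalizer(long localizerPid) throws IOException {
    Path cg = Paths.get("/sys/fs/cgroup/cpu/hadoop-yarn/localizers");  // example path
    Files.createDirectories(cg);
    // Allow at most half a core: 50000us of CPU per 100000us period.
    Files.write(cg.resolve("cpu.cfs_period_us"), "100000".getBytes(StandardCharsets.UTF_8));
    Files.write(cg.resolve("cpu.cfs_quota_us"), "50000".getBytes(StandardCharsets.UTF_8));
    // Move the localizer process into the capped cgroup.
    Files.write(cg.resolve("cgroup.procs"),
        Long.toString(localizerPid).getBytes(StandardCharsets.UTF_8));
  }
}
{code}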






[jira] [Assigned] (YARN-3119) Memory limit check need not be enforced unless aggregate usage of all containers is near limit

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-3119:
---

Assignee: (was: Anubhav Dhoot)

> Memory limit check need not be enforced unless aggregate usage of all 
> containers is near limit
> --
>
> Key: YARN-3119
> URL: https://issues.apache.org/jira/browse/YARN-3119
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Anubhav Dhoot
> Attachments: YARN-3119.prelim.patch
>
>
> Today we kill any over-limit container preemptively even if the total usage of 
> all containers on that node is well within the limit for YARN. Instead, if we 
> enforce the memory limit only when the total usage of all containers is close 
> to some configurable ratio of the overall memory assigned to containers, we can 
> allow for flexibility in container memory usage without adverse effects. This is 
> similar in principle to how cgroups uses soft_limit_in_bytes.
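A minimal sketch of the proposed check, with all names and the ratio value invented for illustration (this is not the containers monitor code):

{code}
// Illustrative only: enforce a container's memory limit only when the node's
// aggregate container usage is near the total memory handed out to containers.
public class SoftMemoryEnforcement {
  private final long totalAllocatedBytes;  // memory assigned to all containers
  private final double enforcementRatio;   // e.g. 0.95, assumed configurable

  public SoftMemoryEnforcement(long totalAllocatedBytes, double enforcementRatio) {
    this.totalAllocatedBytes = totalAllocatedBytes;
    this.enforcementRatio = enforcementRatio;
  }

  public boolean shouldKill(long aggregateUsageBytes,
                            long containerUsageBytes, long containerLimitBytes) {
    boolean containerOverLimit = containerUsageBytes > containerLimitBytes;
    boolean nodeNearLimit =
        aggregateUsageBytes >= enforcementRatio * totalAllocatedBytes;
    // Tolerate individual overage while the node as a whole has headroom.
    return containerOverLimit && nodeNearLimit;
  }
}
{code}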






[jira] [Assigned] (YARN-3229) Incorrect processing of container as LOST on Interruption during NM shutdown

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-3229:
---

Assignee: (was: Anubhav Dhoot)

> Incorrect processing of container as LOST on Interruption during NM shutdown
> 
>
> Key: YARN-3229
> URL: https://issues.apache.org/jira/browse/YARN-3229
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>
> YARN-2846 fixed the issue of incorrectly recording in the state store that the 
> process is LOST. But even after that we still process the ContainerExitEvent. 
> If notInterrupted is false in RecoveredContainerLaunch#call we should skip 
> the following:
> {noformat}
>  if (retCode != 0) {
>   LOG.warn("Recovered container exited with a non-zero exit code "
>   + retCode);
>   this.dispatcher.getEventHandler().handle(new ContainerExitEvent(
>   containerId,
>   ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode,
>   "Container exited with a non-zero exit code " + retCode));
>   return retCode;
> }
> {noformat}
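A sketch of the skip being proposed, reusing the fields from the snippet above (illustrative, not the committed change):

{code}
// Only raise the failure event when the launch was not interrupted by NM shutdown.
if (notInterrupted && retCode != 0) {
  LOG.warn("Recovered container exited with a non-zero exit code " + retCode);
  this.dispatcher.getEventHandler().handle(new ContainerExitEvent(
      containerId,
      ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode,
      "Container exited with a non-zero exit code " + retCode));
  return retCode;
}
{code}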






[jira] [Assigned] (YARN-3257) FairScheduler: MaxAm may be set too low preventing apps from starting

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-3257:
---

Assignee: (was: Anubhav Dhoot)

> FairScheduler: MaxAm may be set too low preventing apps from starting
> -
>
> Key: YARN-3257
> URL: https://issues.apache.org/jira/browse/YARN-3257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
> Attachments: YARN-3257.001.patch
>
>
> Since YARN-2637, CapacityScheduler#LeafQueue does not enforce the max AM share 
> if the limit would prevent the first application from starting. This would be 
> good to add to FSLeafQueue as well.
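A simplified sketch of the intended behavior (memory-only and with invented names; FSLeafQueue itself works on Resource objects):

{code}
// Illustrative only: always let the first AM in a queue start, so a small
// maxAMShare cannot keep the queue permanently empty.
public final class AmShareCheck {
  private AmShareCheck() {}

  public static boolean canRunAppAM(long amMemUsed, long amMemRequested,
                                    long maxAmShareMem, int runningAmCount) {
    if (runningAmCount == 0) {
      return true;  // the first application is exempt from the maxAMShare limit
    }
    return amMemUsed + amMemRequested <= maxAmShareMem;
  }
}
{code}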






[jira] [Assigned] (YARN-3994) RM should respect AM resource/placement constraints

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-3994:
---

Assignee: (was: Anubhav Dhoot)

> RM should respect AM resource/placement constraints
> ---
>
> Key: YARN-3994
> URL: https://issues.apache.org/jira/browse/YARN-3994
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bikas Saha
>
> Today, locality and cpu for the AM can be specified in the AM launch 
> container request but are ignored at the RM. Locality is assumed to be ANY 
> and cpu is dropped. There may be other things too that are ignored. This 
> should be fixed so that the user gets what is specified in their code to 
> launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot]






[jira] [Assigned] (YARN-4021) RuntimeException/YarnRuntimeException sent over to the client can cause client to assume a local fatal failure

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-4021:
---

Assignee: (was: Anubhav Dhoot)

> RuntimeException/YarnRuntimeException sent over to the client can cause 
> client to assume a local fatal failure 
> ---
>
> Key: YARN-4021
> URL: https://issues.apache.org/jira/browse/YARN-4021
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>
> Currently, after YARN-731, RuntimeException and its derived types such as 
> YarnRuntimeException are serialized over to the client and rethrown at the 
> client. This can cause issues like MAPREDUCE-6439, where we assume a local 
> fatal exception has happened. 
> Instead we should have a way to distinguish a local RuntimeException from a 
> remote RuntimeException to avoid these issues. We need to go over all the 
> current client-side code that expects a remote RuntimeException in order 
> to make it work with this change.






[jira] [Assigned] (YARN-4076) FairScheduler does not allow AM to choose which containers to preempt

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-4076:
---

Assignee: (was: Anubhav Dhoot)

> FairScheduler does not allow AM to choose which containers to preempt
> -
>
> Key: YARN-4076
> URL: https://issues.apache.org/jira/browse/YARN-4076
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>
> CapacityScheduler allows the AM to choose which containers will be 
> preempted. See the comment about the corresponding work pending for FairScheduler: 
> https://issues.apache.org/jira/browse/YARN-568?focusedCommentId=13649126=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13649126






[jira] [Assigned] (YARN-4030) Make Nodemanager cgroup usage for container easier to use when its running inside a cgroup

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-4030:
---

Assignee: (was: Anubhav Dhoot)

> Make Nodemanager cgroup usage for container easier to use when its running 
> inside a cgroup 
> ---
>
> Key: YARN-4030
> URL: https://issues.apache.org/jira/browse/YARN-4030
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>
> Today the nodemanager uses the cgroup prefix pointed to by 
> yarn.nodemanager.linux-container-executor.cgroups.hierarchy (default value 
> /hadoop-yarn) directly under the controller path, say 
> /sys/fs/cgroup/cpu/hadoop-yarn. 
> If there are nodemanagers running inside docker containers on a host, each 
> would typically be separated by a cgroup under the controller path, say 
> /sys/fs/cgroup/cpu/docker//nmcgroup for NM1 and 
> /sys/fs/cgroup/cpu/docker//nmcgroup for NM2. 
> In this case the correct behavior would be to use the docker cgroup paths, i.e. 
> /sys/fs/cgroup/cpu/docker//hadoop-yarn for NM1 and 
> /sys/fs/cgroup/cpu/docker//hadoop-yarn for NM2.
> But the default behavior makes both NMs try to use 
> /sys/fs/cgroup/cpu/hadoop-yarn, which is incorrect and would usually fail 
> depending on the permissions setup.
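A rough sketch of how an NM could discover its own cgroup and nest the yarn hierarchy under it; the class and method names are invented for illustration and only show the general idea:

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Illustrative only: read /proc/self/cgroup to find this process's cgroup path
// for a controller, so /hadoop-yarn can be created under it rather than at the root.
public class OwnCgroupPath {
  public static String forController(String controller) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get("/proc/self/cgroup"));
    for (String line : lines) {
      // Each line looks like "3:cpu,cpuacct:/docker/<id>".
      String[] parts = line.split(":", 3);
      if (parts.length == 3
          && Arrays.asList(parts[1].split(",")).contains(controller)) {
        return parts[2];
      }
    }
    return "/";
  }

  public static void main(String[] args) throws IOException {
    // e.g. /sys/fs/cgroup/cpu + /docker/<id> + /hadoop-yarn
    System.out.println("/sys/fs/cgroup/cpu" + forController("cpu") + "/hadoop-yarn");
  }
}
{code}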






[jira] [Assigned] (YARN-4144) Add NM that causes LaunchFailedTransition to blacklist

2017-06-02 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-4144:
---

Assignee: (was: Anubhav Dhoot)

> Add NM that causes LaunchFailedTransition to blacklist
> --
>
> Key: YARN-4144
> URL: https://issues.apache.org/jira/browse/YARN-4144
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Anubhav Dhoot
>
> During the discussion of YARN-2005 it was noted that we need to add more cases 
> where blacklisting can occur. This jira tracks making launch failures that go 
> through LaunchFailedTransition also contribute to blacklisting.






[jira] [Updated] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834

2015-10-31 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4032:

Assignee: (was: Anubhav Dhoot)

Unassigning this from myself since I am not going to have time to work on this. 
 Please feel free to take this up. 

> Corrupted state from a previous version can still cause RM to fail with NPE 
> due to same reasons as YARN-2834
> 
>
> Key: YARN-4032
> URL: https://issues.apache.org/jira/browse/YARN-4032
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4032.prelim.patch
>
>
> YARN-2834 ensures that in 2.6.0 no inconsistent state is written. But if 
> someone is upgrading from a previous version, the stored state can still be 
> inconsistent, and the RM will then still fail with an NPE after the upgrade 
> to 2.6.0.





[jira] [Updated] (YARN-3738) Add support for recovery of reserved apps (running under dynamic queues) to Capacity Scheduler

2015-10-23 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3738:

Attachment: YARN-3738-v3.patch

Retriggering jenkins with same patch

> Add support for recovery of reserved apps (running under dynamic queues) to 
> Capacity Scheduler
> --
>
> Key: YARN-3738
> URL: https://issues.apache.org/jira/browse/YARN-3738
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-3738-v2.patch, YARN-3738-v3.patch, 
> YARN-3738-v3.patch, YARN-3738.patch
>
>
> YARN-3736 persists the current state of the Plan to the RMStateStore. This 
> JIRA covers recovery of the Plan, i.e. dynamic reservation queues with 
> associated apps, as part of the Capacity Scheduler failover mechanism.





[jira] [Commented] (YARN-3738) Add support for recovery of reserved apps (running under dynamic queues) to Capacity Scheduler

2015-10-23 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971817#comment-14971817
 ] 

Anubhav Dhoot commented on YARN-3738:
-

+1 pending jenkins

> Add support for recovery of reserved apps (running under dynamic queues) to 
> Capacity Scheduler
> --
>
> Key: YARN-3738
> URL: https://issues.apache.org/jira/browse/YARN-3738
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-3738-v2.patch, YARN-3738-v3.patch, 
> YARN-3738-v3.patch, YARN-3738-v4.patch, YARN-3738.patch
>
>
> YARN-3736 persists the current state of the Plan to the RMStateStore. This 
> JIRA covers recovery of the Plan, i.e. dynamic reservation queues with 
> associated apps, as part of the Capacity Scheduler failover mechanism.





[jira] [Updated] (YARN-3739) Add reservation system recovery to RM recovery process

2015-10-22 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3739:

Fix Version/s: 2.8.0

> Add reservation system recovery to RM recovery process
> --
>
> Key: YARN-3739
> URL: https://issues.apache.org/jira/browse/YARN-3739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Fix For: 2.8.0
>
> Attachments: YARN-3739-v1.patch, YARN-3739-v2.patch, 
> YARN-3739-v3.patch
>
>
> YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks 
> the recovery of the reservation system in case of a RM failover.





[jira] [Commented] (YARN-3739) Add recovery of reservation system to RM failover process

2015-10-22 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969137#comment-14969137
 ] 

Anubhav Dhoot commented on YARN-3739:
-

+1

> Add recovery of reservation system to RM failover process
> -
>
> Key: YARN-3739
> URL: https://issues.apache.org/jira/browse/YARN-3739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-3739-v1.patch, YARN-3739-v2.patch, 
> YARN-3739-v3.patch
>
>
> YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks 
> the recovery of the reservation system in case of a RM failover.





[jira] [Updated] (YARN-3739) Add reservation system recovery to RM recovery process

2015-10-22 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3739:

Summary: Add reservation system recovery to RM recovery process  (was: Add 
recovery of reservation system to RM failover process)

> Add reservation system recovery to RM recovery process
> --
>
> Key: YARN-3739
> URL: https://issues.apache.org/jira/browse/YARN-3739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-3739-v1.patch, YARN-3739-v2.patch, 
> YARN-3739-v3.patch
>
>
> YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks 
> the recovery of the reservation system in case of a RM failover.





[jira] [Updated] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem

2015-10-21 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4184:

Assignee: Subru Krishnan  (was: Anubhav Dhoot)

> Remove update reservation state api from state store as its not used by 
> ReservationSystem
> -
>
> Key: YARN-4184
> URL: https://issues.apache.org/jira/browse/YARN-4184
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Subru Krishnan
>
> ReservationSystem uses remove/add for updates, so the update API in the state 
> store is not needed.





[jira] [Commented] (YARN-3739) Add recovery of reservation system to RM failover process

2015-10-21 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968136#comment-14968136
 ] 

Anubhav Dhoot commented on YARN-3739:
-

Minor comment on loadState: you can enumerate reservations.entrySet() 
instead of iterating keySet() and then doing a get.
Looks good otherwise.
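For reference, the suggested idiom in generic form (the types and method names are placeholders, not the actual loadState signature):

{code}
import java.util.Map;

class EntrySetIdiom {
  // Suggested: one pass over entrySet(), no extra get() per key.
  static <K, V> void recoverAll(Map<K, V> reservations) {
    for (Map.Entry<K, V> e : reservations.entrySet()) {
      recover(e.getKey(), e.getValue());
    }
    // Instead of: for (K key : reservations.keySet()) { recover(key, reservations.get(key)); }
  }

  static <K, V> void recover(K key, V value) {
    // placeholder for the real per-reservation recovery
  }
}
{code}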

> Add recovery of reservation system to RM failover process
> -
>
> Key: YARN-3739
> URL: https://issues.apache.org/jira/browse/YARN-3739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-3739-v1.patch
>
>
> YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks 
> the recovery of the reservation system in case of a RM failover.





[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-10-20 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965804#comment-14965804
 ] 

Anubhav Dhoot commented on YARN-3985:
-

Reran the test multiple times without failure

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, 
> YARN-3985.004.patch, YARN-3985.005.patch, YARN-3985.005.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.





[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-10-20 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.005.patch

Added a retry since there are multiple events that we need to wait for. 
DrainDispatcher#await can return once we have drained the first event from the 
queue but before the next event has been added, so a single await does not 
work reliably. 
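The retry has roughly this shape (an illustrative helper, not the actual test code):

{code}
import java.util.function.BooleanSupplier;

// Illustrative only: poll a condition until it holds or a deadline passes,
// instead of relying on a single DrainDispatcher#await() call.
final class WaitFor {
  static void condition(BooleanSupplier check, long timeoutMs, long intervalMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }
}
{code}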

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, 
> YARN-3985.004.patch, YARN-3985.005.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.





[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-10-20 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.005.patch

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, 
> YARN-3985.004.patch, YARN-3985.005.patch, YARN-3985.005.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.





[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-10-19 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.004.patch

Thanks for the review [~asuresh]. Attached patch removes sleep by adding the 
plan synchronization and a DrainDispatcher await needed to ensure the node 
capacity is added to the scheduler and to the plan capacity.

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, 
> YARN-3985.004.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.





[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

2015-10-14 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957372#comment-14957372
 ] 

Anubhav Dhoot commented on YARN-4227:
-

The previous statement should also be updated to handle a null node to avoid an 
NPE inside it:
{noformat}application.unreserve(rmContainer.getReservedPriority(), 
node);{noformat}
It may still need to run some portion of FSAppAttempt#unreserveInternal 
instead of skipping the entire processing.

The test seems ok. Should we rename blacklist -> remove?

Overall the fix looks ok. Just another bug indicating that until we restructure 
the code we will have to keep adding band-aids.
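A sketch of the null guard suggested above, reusing the names from that statement (illustrative, not the final patch):

{code}
if (node != null) {
  application.unreserve(rmContainer.getReservedPriority(), node);
} else {
  // The node is already gone; still do the app-side cleanup (some portion of
  // FSAppAttempt#unreserveInternal) rather than skipping the processing entirely.
  LOG.info("Node already removed; skipping node-side unreserve for "
      + rmContainer.getContainerId());
}
{code}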

> FairScheduler: RM quits processing expired container from a removed node
> 
>
> Key: YARN-4227
> URL: https://issues.apache.org/jira/browse/YARN-4227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.5.0, 2.7.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_12 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_12 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_12
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.





[jira] [Commented] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834

2015-10-12 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953924#comment-14953924
 ] 

Anubhav Dhoot commented on YARN-4032:
-

This is a sample log

{noformat}
2015-10-10 04:35:32,486 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1441905716013_43686_01 State change from NEW to FINISHED
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:642)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1219)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1008)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:760)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:841)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:103)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:856)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:846)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:721)
:
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:642)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1219)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1008)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:760)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:841)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:103)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:856)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:846)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 

[jira] [Updated] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834

2015-10-12 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4032:

Attachment: YARN-4032.prelim.patch

Prelim patch based on the discussion

> Corrupted state from a previous version can still cause RM to fail with NPE 
> due to same reasons as YARN-2834
> 
>
> Key: YARN-4032
> URL: https://issues.apache.org/jira/browse/YARN-4032
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4032.prelim.patch
>
>
> YARN-2834 ensures that in 2.6.0 no inconsistent state is written. But if 
> someone is upgrading from a previous version, the stored state can still be 
> inconsistent, and the RM will then still fail with an NPE after the upgrade 
> to 2.6.0.





[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952545#comment-14952545
 ] 

Anubhav Dhoot commented on YARN-4247:
-

Yup. I had tested without that change. Resolving this as not needed.

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> -
>
> Key: YARN-4247
> URL: https://issues.apache.org/jira/browse/YARN-4247
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Blocker
> Attachments: YARN-4247.001.patch, YARN-4247.001.patch
>
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM {noformat} 2015-10-08 
> 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of 
> event-queue is 1488000 2015-10-08 04:48:01,918 INFO 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}





[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-10 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952032#comment-14952032
 ] 

Anubhav Dhoot commented on YARN-4247:
-

Tested this in a cluster. Before this fix the cluster would fall over after 
around 3 to 4 hours. After this fix the cluster has been going strong beyond 
24 hours.

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> -
>
> Key: YARN-4247
> URL: https://issues.apache.org/jira/browse/YARN-4247
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Blocker
> Attachments: YARN-4247.001.patch, YARN-4247.001.patch
>
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM {noformat} 2015-10-08 
> 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of 
> event-queue is 1488000 2015-10-08 04:48:01,918 INFO 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}





[jira] [Updated] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-09 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4247:

Attachment: YARN-4247.001.patch

The fix removes the need to take the RMAppAttemptImpl lock while holding the FSAppAttempt lock.
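A toy illustration of the idea (stand-in classes, not the RM code): the deadlock comes from taking the two locks in opposite orders, so the fix is to stop acquiring the attempt-side lock while the scheduler-side lock is held, for example by reading the needed state before entering the synchronized section.

{code}
// Toy sketch only; the real classes are FSAppAttempt and RMAppAttemptImpl.
class LockOrderSketch {
  private final Object schedulerSideLock = new Object();
  private volatile Object cachedAttemptState;  // refreshed outside any lock

  void allocate() {
    // Read attempt-side state first (or from a volatile cache) ...
    Object attemptState = cachedAttemptState;
    synchronized (schedulerSideLock) {
      // ... so no attempt-side lock is needed while this lock is held,
      // which breaks the lock-ordering cycle.
      use(attemptState);
    }
  }

  void use(Object state) {
    // placeholder
  }
}
{code}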

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> -
>
> Key: YARN-4247
> URL: https://issues.apache.org/jira/browse/YARN-4247
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Blocker
> Attachments: YARN-4247.001.patch
>
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM {noformat} 2015-10-08 
> 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of 
> event-queue is 1488000 2015-10-08 04:48:01,918 INFO 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}





[jira] [Updated] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-09 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4247:

Attachment: YARN-4247.001.patch

retrigger jenkins

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> -
>
> Key: YARN-4247
> URL: https://issues.apache.org/jira/browse/YARN-4247
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Blocker
> Attachments: YARN-4247.001.patch, YARN-4247.001.patch
>
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM {noformat} 2015-10-08 
> 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of 
> event-queue is 1488000 2015-10-08 04:48:01,918 INFO 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}





[jira] [Commented] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user

2015-10-09 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951331#comment-14951331
 ] 

Anubhav Dhoot commented on YARN-4235:
-

Thanks [~rohithsharma] for review and commit!

> FairScheduler PrimaryGroup does not handle empty groups returned for a user 
> 
>
> Key: YARN-4235
> URL: https://issues.apache.org/jira/browse/YARN-4235
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Fix For: 2.8.0
>
> Attachments: YARN-4235.001.patch
>
>
> We see an NPE if an empty group list is returned for a user. This causes the 
> RM to crash as below:
> {noformat}
> 2015-09-22 16:51:52,780  FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ADDED to the scheduler
> java.lang.IndexOutOfBoundsException: Index: 0
>   at java.util.Collections$EmptyList.get(Collections.java:3212)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-09-22 16:51:52,797  INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}





[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-09 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951161#comment-14951161
 ] 

Anubhav Dhoot commented on YARN-4247:
-

Looking at the jstack, here is the deadlock between FSAppAttempt and 
RMAppAttemptImpl:
The first thread holds the FSAppAttempt lock and is waiting on the 
RMAppAttemptImpl lock.
The second thread, in RMAppAttemptImpl.getApplicationResourceUsageReport, has 
taken the read lock and is waiting on the FSAppAttempt lock.
This causes other threads (e.g. the third thread below), such as the 
AsyncDispatcher threads, to get blocked, so the RM stops processing events and 
then crashes with an OOM because of the backlog of events.

{noformat}
"IPC Server handler 49 on 8030" #239 daemon prio=5 os_prio=0 
tid=0x01093000 nid=0x8206 waiting on condition [0x7f930b2da000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00071719e0f0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.*RMAppAttemptImpl*.getMasterContainer(RMAppAttemptImpl.java:747)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.isWaitingForAMContainer(SchedulerApplicationAttempt.java:482)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:938)
- locked <0x000715932d98> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.*FSAppAttempt*)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:529)
- locked <0x0007171a5328> (a 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)


"IPC Server handler 9 on 8032" #253 daemon prio=5 os_prio=0 
tid=0x00e2e800 nid=0x8214 waiting for monitor entry [0x7f930a4cd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:570)
- waiting to lock <0x000715932d98> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:241)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.*RMAppAttemptImpl*.getApplicationResourceUsageReport(RMAppAttemptImpl.java:798)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:655)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:330)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at 

[jira] [Moved] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-09 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot moved MAPREDUCE-6509 to YARN-4247:


Component/s: (was: resourcemanager)
 resourcemanager
 fairscheduler
Key: YARN-4247  (was: MAPREDUCE-6509)
Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> -
>
> Key: YARN-4247
> URL: https://issues.apache.org/jira/browse/YARN-4247
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM {noformat} 2015-10-08 
> 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of 
> event-queue is 1488000 2015-10-08 04:48:01,918 INFO 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}





[jira] [Updated] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user

2015-10-07 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4235:

Attachment: YARN-4235.001.patch

Handle empty groups
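For context, a guard of roughly this shape is what is needed; the group lookup and the empty-string "no assignment" return are placeholders for illustration, not the actual QueuePlacementRule code (see the attached patch for the real change):

{code}
List<String> groups = groupProvider.getGroups(user);  // groupProvider is a placeholder
if (groups == null || groups.isEmpty()) {
  // No primary group for this user: let placement fall through to the next rule
  // instead of calling groups.get(0) and killing the scheduler event thread.
  return "";
}
return "root." + groups.get(0);
{code}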

> FairScheduler PrimaryGroup does not handle empty groups returned for a user 
> 
>
> Key: YARN-4235
> URL: https://issues.apache.org/jira/browse/YARN-4235
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4235.001.patch
>
>
> We see an NPE if an empty group list is returned for a user. This causes the 
> RM to crash as below:
> {noformat}
> 2015-09-22 16:51:52,780  FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ADDED to the scheduler
> java.lang.IndexOutOfBoundsException: Index: 0
>   at java.util.Collections$EmptyList.get(Collections.java:3212)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-09-22 16:51:52,797  INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}





[jira] [Created] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user

2015-10-07 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4235:
---

 Summary: FairScheduler PrimaryGroup does not handle empty groups 
returned for a user 
 Key: YARN-4235
 URL: https://issues.apache.org/jira/browse/YARN-4235
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We see an NPE if an empty group list is returned for a user. This causes the RM 
to crash as below:

{noformat}
2015-09-22 16:51:52,780  FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type APP_ADDED to the scheduler
java.lang.IndexOutOfBoundsException: Index: 0
at java.util.Collections$EmptyList.get(Collections.java:3212)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
at java.lang.Thread.run(Thread.java:745)
2015-09-22 16:51:52,797  INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{noformat}





[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-10-05 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943809#comment-14943809
 ] 

Anubhav Dhoot commented on YARN-4185:
-

I don't think option 2, where you restart from 1, makes sense. It's also not a 
goal to minimize the total wait time. The goal should be to minimize the time 
to recover from short intermittent failures while also waiting long enough for 
long failures before giving up. Would it be better for us to ramp up to 10 sec 
exponentially and then do the n retries at 10 sec, or do n retries in total 
including the ramp-up?
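For concreteness, the ramp-up under discussion would produce a schedule like the one below (the starting interval and retry count are made-up numbers; 10 sec is today's fixed interval used as the cap):

{code}
// Illustrative only: exponential ramp-up capped at the current 10s interval.
public class BackoffSchedule {
  public static void main(String[] args) {
    long intervalMs = 100;             // assumed starting interval
    final long maxIntervalMs = 10_000; // today's fixed 10 sec becomes the cap
    for (int attempt = 1; attempt <= 10; attempt++) {
      System.out.println("attempt " + attempt + ": wait " + intervalMs + " ms");
      intervalMs = Math.min(intervalMs * 2, maxIntervalMs);
    }
    // Prints 100, 200, 400, ..., 6400, then 10000 ms thereafter: quick recovery
    // for short blips, the same steady-state wait for long outages.
  }
}
{code}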

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff with the same fixed 
> max limit. Today the retry interval is fixed at 10 sec, which can be 
> unnecessarily high, especially when NMs could complete a rolling restart 
> within a second.





[jira] [Commented] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305

2015-10-02 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941443#comment-14941443
 ] 

Anubhav Dhoot commented on YARN-3996:
-

Approach looks ok

> YARN-789 (Support for zero capabilities in fairscheduler) is broken after 
> YARN-3305
> ---
>
> Key: YARN-3996
> URL: https://issues.apache.org/jira/browse/YARN-3996
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>Priority: Critical
> Attachments: YARN-3996.prelim.patch
>
>
> RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest 
> with minimumResource passed as the incrementResource. This causes normalize to 
> return zero if the minimum is set to zero as allowed by YARN-789.





[jira] [Updated] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-30 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4185:

Assignee: Neelesh Srinivas Salian  (was: Anubhav Dhoot)

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff with the same fixed 
> max limit. Today the retry interval is fixed at 10 sec, which can be 
> unnecessarily high, especially when NMs could complete a rolling restart 
> within a second.





[jira] [Commented] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305

2015-09-30 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938225#comment-14938225
 ] 

Anubhav Dhoot commented on YARN-3996:
-

SchedulerUtils has multiple overloads of normalizeRequests. The ones that take 
in both the increment and the minimum handle what you are looking to do. 
Fair has support for the increment and Fifo/Capacity do not. So Fair rounds to a 
multiple of the increment and uses the minimum as the minimum, while 
Fifo/Capacity round to a multiple of the minimum and use the minimum as the 
minimum. Basically Capacity/Fifo set the increment equal to the minimum. 
We need to do the same in RMAppManager.
That way Fair can continue supporting a zero minimum and round to a multiple of 
the increment, and Fifo/Capacity can choose not to support a zero minimum and 
round to a multiple of the minimum. 
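A worked example of the difference, using a simplified rounding rule (normalized = max(min, roundUp(requested, step))); the numbers are illustrative and this is not the SchedulerUtils code:

{code}
public class NormalizeExample {
  // Simplified rounding; a zero step degenerates to zero, mirroring the reported symptom.
  static long roundUp(long value, long step) {
    return step == 0 ? 0 : ((value + step - 1) / step) * step;
  }

  public static void main(String[] args) {
    long requested = 1000;  // MB
    // Fair-style: min = 0, step = increment (say 512 MB) -> 1024 MB
    System.out.println(Math.max(0, roundUp(requested, 512)));
    // Capacity/Fifo-style: step = minimum (say 1024 MB)  -> 1024 MB
    System.out.println(Math.max(1024, roundUp(requested, 1024)));
    // The regression: Fair's zero minimum passed as the step -> 0 MB
    System.out.println(Math.max(0, roundUp(requested, 0)));
  }
}
{code}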

> YARN-789 (Support for zero capabilities in fairscheduler) is broken after 
> YARN-3305
> ---
>
> Key: YARN-3996
> URL: https://issues.apache.org/jira/browse/YARN-3996
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>Priority: Critical
>
> RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest 
> with minimumResource passed as the incrementResource. This causes normalize to 
> return zero if the minimum is set to zero as allowed by YARN-789.





[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-30 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938190#comment-14938190
 ] 

Anubhav Dhoot commented on YARN-4185:
-

Can we try to reuse the existing retry values (the yarn.client.nodemanager-connect. 
settings) and see if we can stay mostly compatible? I think it's fine if the 
behavior is not exactly the same.
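As a rough illustration of the capped exponential backoff idea (plain Java with 
hypothetical names and hard-coded values; the real change would presumably wire 
the base and maximum to the existing nodemanager-connect retry settings):
{noformat}
public class BackoffSketch {
  // Double the delay per attempt, but never exceed the same maximum that the
  // fixed 10-second interval uses today.
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    long delay = baseMillis << Math.min(attempt, 20);  // cap the shift to avoid overflow
    return Math.min(delay, maxMillis);
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt < 8; attempt++) {
      // Prints 100, 200, 400, 800, 1600, 3200, 6400, 10000 (ms)
      System.out.println(backoffMillis(attempt, 100, 10_000));
    }
  }
}
{noformat}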

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> maximum limit. Today the retry interval is fixed at 10 seconds, which can be 
> unnecessarily high, especially when NMs could complete a rolling restart 
> within a second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305

2015-09-30 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3996:

Assignee: Neelesh Srinivas Salian  (was: Anubhav Dhoot)

> YARN-789 (Support for zero capabilities in fairscheduler) is broken after 
> YARN-3305
> ---
>
> Key: YARN-3996
> URL: https://issues.apache.org/jira/browse/YARN-3996
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>Priority: Critical
>
> RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest 
> with minimumResource for the incrementResource. This causes normalize to 
> return zero if the minimum is set to zero (as allowed by YARN-789).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-28 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933547#comment-14933547
 ] 

Anubhav Dhoot commented on YARN-4204:
-

Committed to trunk and branch-2. Thanks [~kasha] for the review.

> ConcurrentModificationException in FairSchedulerQueueInfo
> -
>
> Key: YARN-4204
> URL: https://issues.apache.org/jira/browse/YARN-4204
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Fix For: 2.8.0
>
> Attachments: YARN-4204.001.patch, YARN-4204.002.patch
>
>
> Saw this exception which caused RM to go down
> {noformat}
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
>   at 
> 

[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-28 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4180:

Attachment: YARN-4180-branch-2.7.2.txt

Minor conflicts in backporting changes to branch 2.7

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4180-branch-2.7.2.txt, YARN-4180.001.patch, 
> YARN-4180.002.patch, YARN-4180.002.patch, YARN-4180.002.patch
>
>
> We see issues with the RM trying to launch a container while a NM is 
> restarting, and we get exceptions like NMNotReadyException. While YARN-3842 
> added retries for other clients of the NM (mainly AMs), they are not used by 
> the AMLauncher in the RM, so these intermittent errors cause job failures. 
> This can manifest during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-24 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4204:

Description: 
Saw this exception which caused RM to go down
{noformat}
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at 
java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
at 
org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1279)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 

[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-24 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4180:

Attachment: YARN-4180.002.patch

Try triggering jenkins again

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4180.001.patch, YARN-4180.002.patch, 
> YARN-4180.002.patch
>
>
> We see issues with the RM trying to launch a container while a NM is 
> restarting, and we get exceptions like NMNotReadyException. While YARN-3842 
> added retries for other clients of the NM (mainly AMs), they are not used by 
> the AMLauncher in the RM, so these intermittent errors cause job failures. 
> This can manifest during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-24 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4204:

Attachment: YARN-4204.002.patch

Add unit test to repro ConcurrentModificationException

> ConcurrentModificationException in FairSchedulerQueueInfo
> -
>
> Key: YARN-4204
> URL: https://issues.apache.org/jira/browse/YARN-4204
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4204.001.patch, YARN-4204.002.patch
>
>
> Saw this exception which caused RM to go down
> {noformat}
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
>   at 
> 

[jira] [Commented] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-23 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905621#comment-14905621
 ] 

Anubhav Dhoot commented on YARN-4204:
-

The issue is that getChildQueues returns an unmodifiable list wrapper over the 
childQueues list, and that backing list can still be modified. So even though 
callers cannot modify the list, they can still get a 
ConcurrentModificationException while iterating over it if another thread 
modifies the underlying list.
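A self-contained demonstration of that failure mode (illustrative names only, 
not the FairScheduler code): an unmodifiable wrapper prevents writes through 
the view, but it does not protect an iterator from changes to the backing list.
{noformat}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CmeDemo {
  public static void main(String[] args) {
    List<String> childQueues = new ArrayList<>();
    childQueues.add("root.a");
    // A wrapper over the live list, not a copy.
    List<String> view = Collections.unmodifiableList(childQueues);

    for (String q : view) {           // iterating the wrapper...
      System.out.println(q);
      childQueues.add("root.b");      // ...while the backing list changes
    }                                 // -> ConcurrentModificationException on the next step
  }
}
{noformat}
One way to avoid it is to hand callers a snapshot (for example, new 
ArrayList<>(childQueues) taken under the scheduler lock) so web requests never 
iterate the live list.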

> ConcurrentModificationException in FairSchedulerQueueInfo
> -
>
> Key: YARN-4204
> URL: https://issues.apache.org/jira/browse/YARN-4204
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Saw this exception
> {noformat}
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
>   

[jira] [Updated] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-23 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4204:

Attachment: YARN-4204.001.patch

> ConcurrentModificationException in FairSchedulerQueueInfo
> -
>
> Key: YARN-4204
> URL: https://issues.apache.org/jira/browse/YARN-4204
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4204.001.patch
>
>
> Saw this exception
> {noformat}
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
>   at 
> org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84)
>   at 
> 

[jira] [Created] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-23 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4204:
---

 Summary: ConcurrentModificationException in FairSchedulerQueueInfo
 Key: YARN-4204
 URL: https://issues.apache.org/jira/browse/YARN-4204
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Saw this exception
{noformat}
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at 
java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
at 
org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1279)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 

[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-22 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903217#comment-14903217
 ] 

Anubhav Dhoot commented on YARN-4180:
-

The test failure looks unrelated. 

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4180.001.patch
>
>
> We see issues with the RM trying to launch a container while a NM is 
> restarting, and we get exceptions like NMNotReadyException. While YARN-3842 
> added retries for other clients of the NM (mainly AMs), they are not used by 
> the AMLauncher in the RM, so these intermittent errors cause job failures. 
> This can manifest during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-22 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4180:

Attachment: YARN-4180.002.patch

Addressed feedback

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4180.001.patch, YARN-4180.002.patch
>
>
> We see issues with the RM trying to launch a container while a NM is 
> restarting, and we get exceptions like NMNotReadyException. While YARN-3842 
> added retries for other clients of the NM (mainly AMs), they are not used by 
> the AMLauncher in the RM, so these intermittent errors cause job failures. 
> This can manifest during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-18 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4180:

Attachment: YARN-4180.001.patch

Reuse the same retry proxy used by the AM client for the RM client. Also opened 
YARN-4185 to improve this retry mechanism.
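As a hedged sketch of the general retry-proxy idea (the interface and factory 
below are illustrative stand-ins, not the actual NM client classes, and the 
policy values are arbitrary examples), Hadoop's generic RetryProxy from 
hadoop-common can wrap a protocol so transient failures such as a restarting NM 
are retried instead of failing the launch:
{noformat}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;

public class RetryingLauncherSketch {
  // Stand-in for the container management protocol used at AM launch time.
  interface ContainerLauncher {
    void startContainer(String containerId) throws Exception;
  }

  static ContainerLauncher withRetries(ContainerLauncher raw) {
    // Example policy only: up to 5 attempts, 1 second apart. The actual patch
    // reuses the retry behavior the AM-side NM client already has.
    RetryPolicy policy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(5, 1, TimeUnit.SECONDS);
    return (ContainerLauncher) RetryProxy.create(ContainerLauncher.class, raw, policy);
  }
}
{noformat}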

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4180.001.patch
>
>
> We see issues with the RM trying to launch a container while a NM is 
> restarting, and we get exceptions like NMNotReadyException. While YARN-3842 
> added retries for other clients of the NM (mainly AMs), they are not used by 
> the AMLauncher in the RM, so these intermittent errors cause job failures. 
> This can manifest during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-18 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4185:
---

 Summary: Retry interval delay for NM client can be improved from 
the fixed static retry 
 Key: YARN-4185
 URL: https://issues.apache.org/jira/browse/YARN-4185
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Instead of having a fixed retry interval that starts off very high and stays 
there, we are better off using an exponential backoff that has the same fixed 
maximum limit. Today the retry interval is fixed at 10 seconds, which can be 
unnecessarily high, especially when NMs could complete a rolling restart within 
a second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-18 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.003.patch

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-18 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876038#comment-14876038
 ] 

Anubhav Dhoot commented on YARN-2005:
-

Thanks [~jianhe], [~sunilg], [~jlowe] for the reviews and [~kasha] for the 
review and commit!

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Fix For: 2.8.0
>
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch, YARN-2005.009.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId

2015-09-18 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876669#comment-14876669
 ] 

Anubhav Dhoot commented on YARN-4131:
-

Can you please add [~kasha] and me to this? We are interested in this effort.

> Add API and CLI to kill container on given containerId
> --
>
> Key: YARN-4131
> URL: https://issues.apache.org/jira/browse/YARN-4131
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch, 
> YARN-4131-v1.1.patch, YARN-4131-v1.2.patch, YARN-4131-v1.patch
>
>
> Per YARN-3337, we need a handy tool to kill containers in some scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3920) FairScheduler container reservation on a node should be configurable to limit it to large containers

2015-09-18 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876473#comment-14876473
 ] 

Anubhav Dhoot commented on YARN-3920:
-

Thx [~asuresh] for review and commit!

> FairScheduler container reservation on a node should be configurable to limit 
> it to large containers
> 
>
> Key: YARN-3920
> URL: https://issues.apache.org/jira/browse/YARN-3920
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3920.004.patch, YARN-3920.004.patch, 
> YARN-3920.004.patch, YARN-3920.004.patch, YARN-3920.005.patch, 
> yARN-3920.001.patch, yARN-3920.002.patch, yARN-3920.003.patch
>
>
> Reserving a node for a container was designed to prevent large containers 
> from being starved by small requests that keep landing on a node. Today we 
> allow this even for small container requests. This has a huge impact on 
> scheduling since we block other scheduling requests until that reservation is 
> fulfilled. We should make this configurable so its impact can be minimized by 
> limiting it to large container requests, as originally intended. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-18 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876494#comment-14876494
 ] 

Anubhav Dhoot commented on YARN-4143:
-

I cannot think of an API to add to the scheduler that would be called by 
RMAppAttempt. We could add an event to SchedulerEventType, such as 
AM_CONTAINER_ALLOCATED, fired in the two places you mentioned, but that seems 
like overkill to me. Let me know if you have any alternative ways of doing this.

> Optimize the check for AMContainer allocation needed by blacklisting and 
> ContainerType
> --
>
> Key: YARN-4143
> URL: https://issues.apache.org/jira/browse/YARN-4143
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4143.001.patch
>
>
> In YARN-2005 there are checks made to determine if the allocation is for an 
> AM container. This happens in every allocate call and should be optimized 
> away since it changes only once per SchedulerApplicationAttempt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem

2015-09-18 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4184:
---

 Summary: Remove update reservation state api from state store as 
its not used by ReservationSystem
 Key: YARN-4184
 URL: https://issues.apache.org/jira/browse/YARN-4184
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


ReservationSystem uses remove/add for updates, so the update API in the state 
store is not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-18 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876049#comment-14876049
 ] 

Anubhav Dhoot commented on YARN-3985:
-

Addressed feedback and opened YARN-4184 for removing updateReservation

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-17 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804344#comment-14804344
 ] 

Anubhav Dhoot commented on YARN-4143:
-

I think we can minimize the impact of checking on every allocate by limiting 
the check to allocates that happen before the AM is assigned, which should be 
only a few times until the AM itself gets launched. This avoids adding an API 
to the scheduler.

> Optimize the check for AMContainer allocation needed by blacklisting and 
> ContainerType
> --
>
> Key: YARN-4143
> URL: https://issues.apache.org/jira/browse/YARN-4143
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> In YARN-2005 there are checks made to determine if the allocation is for an 
> AM container. This happens in every allocate call and should be optimized 
> away since it changes only once per SchedulerApplicationAttempt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-17 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4143:

Attachment: YARN-4143.001.patch

The attached patch ensures the checks are done only while the AM is not yet 
allocated; once it is allocated, the method simply returns. It also removes 
passing the applicationId, which is redundant since we are checking only for 
this app.
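A minimal sketch of that short-circuit pattern (hypothetical field and method 
names, not the actual SchedulerApplicationAttempt code): once the AM container 
has been seen, the per-allocate check becomes a cheap flag read.
{noformat}
public class AttemptAmTracker {
  private volatile boolean amContainerAllocated = false;

  boolean isWaitingForAMContainer() {
    if (amContainerAllocated) {
      return false;                                      // fast path after the AM is allocated
    }
    boolean waiting = computeWaitingForAMContainer();    // the expensive check
    if (!waiting) {
      amContainerAllocated = true;                       // remember the answer; it never flips back
    }
    return waiting;
  }

  private boolean computeWaitingForAMContainer() {
    // Placeholder for the real check against the app attempt state.
    return false;
  }
}
{noformat}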

> Optimize the check for AMContainer allocation needed by blacklisting and 
> ContainerType
> --
>
> Key: YARN-4143
> URL: https://issues.apache.org/jira/browse/YARN-4143
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4143.001.patch
>
>
> In YARN-2005 there are checks made to determine if the allocation is for an 
> AM container. This happens in every allocate call and should be optimized 
> away since it changes only once per SchedulerApplicationAttempt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-17 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4180:
---

 Summary: AMLauncher does not retry on failures when talking to NM 
 Key: YARN-4180
 URL: https://issues.apache.org/jira/browse/YARN-4180
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We see issues with the RM trying to launch a container while a NM is restarting 
and we get exceptions like NMNotReadyException. While YARN-3842 added retries 
for other clients of the NM (mainly AMs), they are not used by the AMLauncher in 
the RM, so these intermittent errors cause job failures. This can manifest 
during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-17 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804619#comment-14804619
 ] 

Anubhav Dhoot commented on YARN-4180:
-

I propose using retries in the ContainerManagement proxy obtained via 
AMLauncher#getContainerMgrProxy.

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> We see issues with the RM trying to launch a container while a NM is 
> restarting, and we get exceptions like NMNotReadyException. While YARN-3842 
> added retries for other clients of the NM (mainly AMs), they are not used by 
> the AMLauncher in the RM, so these intermittent errors cause job failures. 
> This can manifest during rolling restarts of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-16 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.002.patch

All the failed tests passed locally for me. Rerunning the tests

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch, YARN-3985.002.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-16 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791259#comment-14791259
 ] 

Anubhav Dhoot commented on YARN-4143:
-

Copying comments from YARN-2005
  
Sunil G added a comment - 03/Sep/15 07:17
Hi Anubhav Dhoot
Thank you for updating the patch. I have a comment here.
isWaitingForAMContainer is now used in two cases: to set the ContainerType and 
also in the blacklist case. And this check is now hit on every heartbeat from 
the AM.
I think it's better to set a state called amIsStarted in 
SchedulerApplicationAttempt. This can be set from two places:
1. RMAppAttemptImpl#AMContainerAllocatedTransition can call a new scheduler API 
to set the amIsStarted flag when the AM container is launched and registered. 
We need to pass the ContainerId to this new API to get the attempt object and 
set the flag.
2. AbstractYarnScheduler#recoverContainersOnNode can also invoke this API to 
set this flag.
So then we can read directly from SchedulerApplicationAttempt every time a 
heartbeat call comes from the AM. If we are not doing this in this ticket, I 
can open another ticket for this optimization. Please suggest your thoughts.

> Optimize the check for AMContainer allocation needed by blacklisting and 
> ContainerType
> --
>
> Key: YARN-4143
> URL: https://issues.apache.org/jira/browse/YARN-4143
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> In YARN-2005 there are checks made to determine if the allocation is for an 
> AM container. This happens in every allocate call and should be optimized 
> away since it changes only once per SchedulerApplicationAttempt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-15 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.002.patch

Retriggering jenkins as failures seem unrelated

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch, 
> YARN-3985.002.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4130) Duplicate declaration of ApplicationId in RMAppManager

2015-09-15 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745782#comment-14745782
 ] 

Anubhav Dhoot commented on YARN-4130:
-

LGTM; the failures look unrelated.

> Duplicate declaration of ApplicationId in RMAppManager
> --
>
> Key: YARN-4130
> URL: https://issues.apache.org/jira/browse/YARN-4130
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Kai Sasaki
>Assignee: Kai Sasaki
>Priority: Trivial
>  Labels: resourcemanager
> Attachments: YARN-4130.00.patch
>
>
> ApplicationId is declared twice in {{RMAppManager}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-15 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745893#comment-14745893
 ] 

Anubhav Dhoot commented on YARN-3985:
-

The failures are due to rebasing on YARN-3656. Will update the new tests and 
submit a new patch.

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-15 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.002.patch

Updated the patch to modify the new tests to add a valid ReservationDefinition, 
which is required by the state store processing.

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch, YARN-3985.002.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4135) Improve the assertion message in MockRM while failing after waiting for the state.

2015-09-15 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745775#comment-14745775
 ] 

Anubhav Dhoot commented on YARN-4135:
-

The patch looks fine except for the missing spaces before and after the + at 
"+appId

> Improve the assertion message in MockRM while failing after waiting for the 
> state.
> --
>
> Key: YARN-4135
> URL: https://issues.apache.org/jira/browse/YARN-4135
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: nijel
>Assignee: nijel
>Priority: Minor
>  Labels: test
> Attachments: YARN-4135_1.patch
>
>
> In MockRM, when the test fails after waiting for the given state, the 
> application id or the attempt id can be printed for easier debugging.
> As of now it is hard to track the test failure in the log since there is no 
> link between the test case and the application id.
> Any thoughts? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4156) testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler

2015-09-14 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4156:
---

 Summary: testAMBlacklistPreventsRestartOnSameNode assumes 
CapacityScheduler
 Key: YARN-4156
 URL: https://issues.apache.org/jira/browse/YARN-4156
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


The test assumes the scheduler is the CapacityScheduler without configuring it 
as such. This causes it to fail if the default is something else, such as the 
FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4156) testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler

2015-09-14 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4156:

Attachment: YARN-4156.001.patch

Uploading a patch that configures the scheduler to be CapacityScheduler.
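
For reference, a sketch of the common idiom for pinning the scheduler in a test configuration (the exact place the patch applies this may differ):

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

// Force the CapacityScheduler regardless of the build/cluster default, so the test
// no longer breaks when the default scheduler is, e.g., the FairScheduler.
YarnConfiguration conf = new YarnConfiguration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
// pass 'conf' to the MockRM / mini cluster the test starts
{code}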

> testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler
> --
>
> Key: YARN-4156
> URL: https://issues.apache.org/jira/browse/YARN-4156
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4156.001.patch
>
>
> The test assumes the scheduler is CapacityScheduler without configuring it as 
> such. This causes it to fail if the default is something else, such as the 
> FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)

2015-09-14 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744006#comment-14744006
 ] 

Anubhav Dhoot commented on YARN-3784:
-

Hi [~sunilg], this does not include support for the FairScheduler. Are we planning 
to add that here, or will there be a separate jira to track that work? Thanks.

> Indicate preemption timeout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch
>
>
> Currently during preemption, the AM is notified with a list of containers that 
> are marked for preemption. Introduce a timeout duration along with this 
> container list so that the AM knows how much time it will get to do a graceful 
> shutdown of its containers (assuming a preemption policy is loaded in the AM).
> This will help in NM decommissioning scenarios, where the NM will be 
> decommissioned after a timeout (also killing the containers on it). The timeout 
> indicates to the AM that those containers can be killed forcefully by the RM 
> after it expires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741314#comment-14741314
 ] 

Anubhav Dhoot commented on YARN-4115:
-

The test failure looks unrelated. 

> Reduce loglevel of ContainerManagementProtocolProxy to Debug
> 
>
> Key: YARN-4115
> URL: https://issues.apache.org/jira/browse/YARN-4115
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4115.001.patch
>
>
> We see log spam like: Aug 28, 1:57:52.441 PM INFO 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
> Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging

2015-09-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-3273:
---

Assignee: Anubhav Dhoot  (was: Rohith Sharma K S)

> Improve web UI to facilitate scheduling analysis and debugging
> --
>
> Key: YARN-3273
> URL: https://issues.apache.org/jira/browse/YARN-3273
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Anubhav Dhoot
> Fix For: 2.7.0
>
> Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, 
> 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, 
> 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, 
> YARN-3273-am-resource-used-AND-User-limit.PNG, 
> YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG
>
>
> A job may be stuck for reasons such as:
> - hitting queue capacity
> - hitting the user-limit
> - hitting the AM-resource-percentage
> The first of these, queue capacity, is already shown on the UI.
> We may surface things like:
> - what the user's current usage and user-limit are;
> - what the AM resource usage and limit are;
> - what the application's current headroom is;
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4145) Make RMHATestBase abstract so it's not run when running all tests under that namespace

2015-09-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741309#comment-14741309
 ] 

Anubhav Dhoot commented on YARN-4145:
-

The timed out tests are not related to this base class.

> Make RMHATestBase abstract so it's not run when running all tests under that 
> namespace
> -
>
> Key: YARN-4145
> URL: https://issues.apache.org/jira/browse/YARN-4145
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4145.001.patch
>
>
> Make it abstract to avoid running it as a test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4150) Failure in TestNMClient because nodereports were not available

2015-09-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4150:

Description: 
Saw a failure in a test run
https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at 
org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244)
at 
org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210)

  was:
Saw a failure in a test run



> Failure in TestNMClient because nodereports were not available
> --
>
> Key: YARN-4150
> URL: https://issues.apache.org/jira/browse/YARN-4150
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Saw a failure in a test run
> https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4150) Failure in TestNMClient because nodereports were not available

2015-09-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4150:

Attachment: YARN-4150.001.patch

Simple fix to wait for nodemanagers to be up before trying to get the 
nodereports.
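
A minimal sketch of the kind of wait being added (assuming the test already has a started YarnClient and knows the expected number of NodeManagers; this is not the literal patch code):

{code}
// Poll until the expected NodeManagers have registered and are RUNNING, so that
// getNodeReports() does not return an empty list and index 0 is safe to read.
int expectedNMs = 3;  // assumed; should match the mini cluster size used by the test
long deadline = System.currentTimeMillis() + 60_000;
List<NodeReport> reports = yarnClient.getNodeReports(NodeState.RUNNING);
while (reports.size() < expectedNMs && System.currentTimeMillis() < deadline) {
  Thread.sleep(100);
  reports = yarnClient.getNodeReports(NodeState.RUNNING);
}
Assert.assertEquals("NodeManagers did not come up in time", expectedNMs, reports.size());
{code}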

> Failure in TestNMClient because nodereports were not available
> --
>
> Key: YARN-4150
> URL: https://issues.apache.org/jira/browse/YARN-4150
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4150.001.patch
>
>
> Saw a failure in a test run
> https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4150) Failure in TestNMClient because nodereports were not available

2015-09-11 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4150:
---

 Summary: Failure in TestNMClient because nodereports were not 
available
 Key: YARN-4150
 URL: https://issues.apache.org/jira/browse/YARN-4150
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Saw a failure in a test run




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741334#comment-14741334
 ] 

Anubhav Dhoot commented on YARN-4115:
-

The test passes for me locally. Opened YARN-4150 to fix the test, which 
seems to have a race.

> Reduce loglevel of ContainerManagementProtocolProxy to Debug
> 
>
> Key: YARN-4115
> URL: https://issues.apache.org/jira/browse/YARN-4115
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4115.001.patch
>
>
> We see log spam like: Aug 28, 1:57:52.441 PM INFO 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
> Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4150) Failure in TestNMClient because nodereports were not available

2015-09-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741338#comment-14741338
 ] 

Anubhav Dhoot commented on YARN-4150:
-

This is most likely due to the test reading the node reports before the 
nodemanagers are ready.

> Failure in TestNMClient because nodereports were not available
> --
>
> Key: YARN-4150
> URL: https://issues.apache.org/jira/browse/YARN-4150
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Saw a failure in a test run
> https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging

2015-09-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3273:

Assignee: Rohith Sharma K S  (was: Anubhav Dhoot)

> Improve web UI to facilitate scheduling analysis and debugging
> --
>
> Key: YARN-3273
> URL: https://issues.apache.org/jira/browse/YARN-3273
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Rohith Sharma K S
> Fix For: 2.7.0
>
> Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, 
> 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, 
> 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, 
> YARN-3273-am-resource-used-AND-User-limit.PNG, 
> YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG
>
>
> A job may be stuck for reasons such as:
> - hitting queue capacity
> - hitting the user-limit
> - hitting the AM-resource-percentage
> The first of these, queue capacity, is already shown on the UI.
> We may surface things like:
> - what the user's current usage and user-limit are;
> - what the AM resource usage and limit are;
> - what the application's current headroom is;
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-10 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Attachment: YARN-3985.001.patch

Added a patch that calls into the state store, and a unit test that verifies that 
after recovery the new RM gets the reservations saved by the previous RM.

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-10 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739939#comment-14739939
 ] 

Anubhav Dhoot commented on YARN-3985:
-

Since updateReservation does an add and a remove, we do not need to update the 
reservation state in the state store separately. I can remove it if needed, in 
either this patch or a separate one.
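
Put differently, an update can be expressed with only the add and remove store operations, so a dedicated update API in the store is redundant. A hedged sketch of the idea; the store method names and the toProto() helper below are assumptions, not the actual YARN API:

{code}
// Hypothetical sketch: an updated reservation is persisted by removing the old entry
// and storing the new one, so the state store needs no separate "update" entry point.
void updateReservationState(ReservationAllocation updated) throws Exception {
  String idName = updated.getReservationId().toString();
  stateStore.removeReservationState(planName, idName);                // assumed method name
  stateStore.storeNewReservation(toProto(updated), planName, idName); // assumed method name
}
{code}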

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3985.001.patch
>
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-10 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739700#comment-14739700
 ] 

Anubhav Dhoot commented on YARN-2005:
-

[~sunilg], that's a good suggestion. Added a follow-up for this: YARN-4143.

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4145) Make RMHATestBase abstract so it's not run when running all tests under that namespace

2015-09-10 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4145:

Attachment: YARN-4145.001.patch

> Make RMHATestBase abstract so it's not run when running all tests under that 
> namespace
> -
>
> Key: YARN-4145
> URL: https://issues.apache.org/jira/browse/YARN-4145
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4145.001.patch
>
>
> Trivial patch to make it abstract



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-10 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2005:

Attachment: YARN-2005.009.patch

Addressed feedback

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch, YARN-2005.009.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4144) Add NM that causes LaunchFailedTransition to blacklist

2015-09-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4144:
---

 Summary: Add NM that causes LaunchFailedTransition to blacklist
 Key: YARN-4144
 URL: https://issues.apache.org/jira/browse/YARN-4144
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


During the discussion of YARN-2005 it was noted that we need to add more cases where 
blacklisting can occur. This tracks making any launch failures that go through 
LaunchFailedTransition also contribute to blacklisting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-10 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739708#comment-14739708
 ] 

Anubhav Dhoot commented on YARN-2005:
-

Added YARN-4144 to also add the node that causes a LaunchFailedTransition to the 
AM blacklist.

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4145) Make RMHATestBase abstract so it's not run when running all tests under that namespace

2015-09-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4145:
---

 Summary: Make RMHATestBase abstract so it's not run when running 
all tests under that namespace
 Key: YARN-4145
 URL: https://issues.apache.org/jira/browse/YARN-4145
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor


Trivial patch to make it abstract
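
The change itself is essentially a one-word modifier on the class declaration (sketch; the superclass and class body are unchanged and omitted here):

{code}
// Marking the shared fixture abstract stops the test runner from instantiating and
// executing it directly; only the concrete subclasses run as tests.
public abstract class RMHATestBase {
  // shared RM HA setup/teardown and helpers live here, unchanged
}
{code}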



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4145) Make RMHATestBase abstract so it's not run when running all tests under that namespace

2015-09-10 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4145:

Description: Make it abstract to avoid running it as a test  (was: Trivial 
patch to make it abstract)

> Make RMHATestBase abstract so it's not run when running all tests under that 
> namespace
> -
>
> Key: YARN-4145
> URL: https://issues.apache.org/jira/browse/YARN-4145
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4145.001.patch
>
>
> Make it abstract to avoid running it as a test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4143:
---

 Summary: Optimize the check for AMContainer allocation needed by 
blacklisting and ContainerType
 Key: YARN-4143
 URL: https://issues.apache.org/jira/browse/YARN-4143
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


In YARN-2005 there are checks made to determine if the allocation is for an AM 
container. This happens in every allocate call and should be optimized away 
since it changes only once per SchedulerApplicationAttempt
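
One possible shape of the optimization, purely illustrative (the field and callback names below are assumptions, not the YARN-2005 code):

{code}
// Hypothetical sketch: cache "is this attempt still waiting for its AM container"
// on the SchedulerApplicationAttempt and flip it once, instead of re-deriving the
// answer on every allocate call.
private volatile boolean waitingForAMContainer = true;  // assumed field, set once per attempt

public boolean isWaitingForAMContainer() {
  return waitingForAMContainer;                          // cheap check on the allocate hot path
}

public void markAMContainerAllocated() {                 // assumed callback name
  waitingForAMContainer = false;
}
{code}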





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-10 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-4143:
---

Assignee: Anubhav Dhoot

> Optimize the check for AMContainer allocation needed by blacklisting and 
> ContainerType
> --
>
> Key: YARN-4143
> URL: https://issues.apache.org/jira/browse/YARN-4143
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> In YARN-2005 there are checks made to determine if the allocation is for an 
> AM container. This happens in every allocate call and should be optimized 
> away since it changes only once per SchedulerApplicationAttempt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-10 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739614#comment-14739614
 ] 

Anubhav Dhoot commented on YARN-2005:
-

Hi [~kasha], thanks for your comments.

2.4 - We do not need to update the systemBlacklist, as it is updated to the complete 
list by the RMAppAttemptImpl#ScheduleTransition call every time.
11, 12 - The changes were needed because we now need a valid submission context 
for isWaitingForAMContainer.
9 - This is needed by the new test added in TestAMRestart.
8.3 - Yes, I can file a follow-up for that.
Addressed the rest of them.

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-10 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739618#comment-14739618
 ] 

Anubhav Dhoot commented on YARN-2005:
-

[~He Tianyi], yes, we are using the ContainerExitStatus in this. We can refine 
the conditions in a follow-up if needed.
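
For context, the kind of refinement being deferred might look like this (hedged sketch; the exact statuses the patch consults may differ):

{code}
// Hypothetical sketch: only count an AM container exit toward blacklisting its node
// when the exit status suggests a node-side problem rather than a deliberate kill.
static boolean shouldCountTowardsNodeBlacklisting(int exitStatus) {
  switch (exitStatus) {
    case ContainerExitStatus.PREEMPTED:
    case ContainerExitStatus.ABORTED:
    case ContainerExitStatus.KILLED_BY_APPMASTER:
    case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
      return false;  // not an indication that the node is bad
    case ContainerExitStatus.DISKS_FAILED:
      return true;   // clearly a node-local failure
    default:
      return true;   // conservative default, for illustration only
  }
}
{code}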

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-09-08 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3985:

Component/s: (was: fairscheduler)
 (was: capacityscheduler)

> Make ReservationSystem persist state using RMStateStore reservation APIs 
> -
>
> Key: YARN-3985
> URL: https://issues.apache.org/jira/browse/YARN-3985
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> YARN-3736 adds the RMStateStore apis to store and load reservation state. 
> This jira adds the actual storing of state from ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-04 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4115:

Attachment: YARN-4115.001.patch

Change the default log level to Debug
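
The change is the standard guard-and-downgrade pattern (sketch of the pattern; the surrounding variable name is paraphrased from memory, not quoted from the diff):

{code}
// Downgrade the per-container "Opening proxy" message from INFO to DEBUG so it no
// longer floods client logs, while keeping it available for debugging.
if (LOG.isDebugEnabled()) {
  LOG.debug("Opening proxy : " + containerManagerBindAddr);
}
// previously: LOG.info("Opening proxy : " + containerManagerBindAddr);
{code}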

> Reduce loglevel of ContainerManagementProtocolProxy to Debug
> 
>
> Key: YARN-4115
> URL: https://issues.apache.org/jira/browse/YARN-4115
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4115.001.patch
>
>
> We see log spam like: Aug 28, 1:57:52.441 PM INFO 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
> Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-04 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4115:
---

 Summary: Reduce loglevel of ContainerManagementProtocolProxy to 
Debug
 Key: YARN-4115
 URL: https://issues.apache.org/jira/browse/YARN-4115
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor


We see log spam like: Aug 28, 1:57:52.441 PM INFO 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3676) Disregard 'assignMultiple' directive while scheduling apps with NODE_LOCAL resource requests

2015-09-04 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731244#comment-14731244
 ] 

Anubhav Dhoot commented on YARN-3676:
-

Thanks [~asuresh] for working on this. I see the patch continues assigning on 
the node if you have *any* app which has a specific request on that node. But 
the scheduling attempt (via queueMgr.getRootQueue().assignContainer(node)) does 
not restrict which apps will get an allocation on that node. So one could end up 
assigning the next container on the node to an app which may not have a 
specific request for that node.
I see two choices:
a) Smaller change - allow subsequent assignments only for the node-local-only 
apps; you already have that list in the map. That can end up prioritizing an 
application's node-local requests over other applications. See the sketch after 
this list.
b) Bigger change - once we have picked the app based on priority, allow it to 
assign multiple containers if there are multiple node-local requests for that 
node.
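
A rough sketch of option (a), with invented helper names, just to make the idea concrete (this is not FairScheduler code and not part of the patch):

{code}
// Hypothetical sketch of option (a): after the first, normal assignment on a node
// update, any extra assignments beyond assignMultiple=false go only to apps that
// actually have a NODE_LOCAL request for this node.
Resource assigned = queueMgr.getRootQueue().assignContainer(node);   // normal path
if (!assignMultiple && !assigned.equals(Resources.none())) {
  for (FSAppAttempt app : appsWithNodeLocalRequestsOn(node)) {       // invented helper
    Resource extra = app.assignContainer(node);                      // restrict extras to these apps
    if (extra.equals(Resources.none())) {
      break;                                                         // node is full or nothing fits
    }
  }
}
{code}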
 

> Disregard 'assignMultiple' directive while scheduling apps with NODE_LOCAL 
> resource requests
> 
>
> Key: YARN-3676
> URL: https://issues.apache.org/jira/browse/YARN-3676
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: YARN-3676.1.patch, YARN-3676.2.patch, YARN-3676.3.patch, 
> YARN-3676.4.patch, YARN-3676.5.patch
>
>
> AssignMultiple is generally set to false to prevent overloading a node (e.g., 
> new NMs that have just joined).
> A possible scheduling optimization would be to disregard this directive for 
> apps whose allowed locality is NODE_LOCAL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Set YARN_FAIL_FAST to be false by default

2015-09-03 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729422#comment-14729422
 ] 

Anubhav Dhoot commented on YARN-4087:
-

In general, if we are not failing the daemon when the fail-fast flag is false, we 
still need to ensure we are not leaving inconsistent state in the RM, e.g. in 
YARN-4032. YARN-2019 is the other case, where we did not need to do anything. 
This means that every patch from now on that relies on fail-fast not crashing the 
daemon should consider taking corrective action to ensure correctness. Does 
that make sense?
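
Concretely, the pattern being asked for looks roughly like this (hedged sketch; the operation and the compensation step are examples for illustration, not code from any of the patches):

{code}
// Hypothetical sketch: when a state-store write fails and fail-fast is disabled,
// the RM must repair or roll back its in-memory state rather than ignore the error.
try {
  stateStore.storeNewReservation(proto, planName, reservationId.toString());  // example operation
} catch (Exception e) {
  if (YarnConfiguration.shouldRMFailFast(conf)) {
    throw new YarnRuntimeException("State store failure", e);                 // crash the daemon
  }
  // fail-fast disabled: compensate so in-memory and persisted state stay consistent
  plan.deleteReservation(reservationId);                                      // assumed compensation step
  LOG.warn("Could not persist reservation " + reservationId + "; rolled back", e);
}
{code}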

> Set YARN_FAIL_FAST to be false by default
> -
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch
>
>
> Increasingly, I feel setting this property to false makes more sense, 
> especially in production environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

