[jira] [Assigned] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-2175: --- Assignee: (was: Anubhav Dhoot) > Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization, for example, could take a long time > and there is no automated way to kill a task if it is stuck in these states. > These may have nothing to do with the task itself and could be an issue > within the platform. > Ideally there should be configurable time limits for the various states within the > NodeManager. The RM does not care about most of these > and it is only between the AM and the NM. We can start by making these global > configurable defaults and in the future we can make it fancier by letting the AM > override them in the start container request. > This jira will be used to limit localization time and we can open others if > we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
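The description above asks for configurable per-state time limits in the NodeManager. A minimal sketch of what such a localization time limit could look like; the class and the property name are hypothetical, not existing YARN code or configuration:
{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical localization-timeout monitor; class, method and config names are illustrative only. */
public class LocalizationTimeoutMonitor {
  // Not an existing YARN key; shown only to illustrate a global configurable default.
  public static final String NM_LOCALIZATION_TIMEOUT_MS =
      "yarn.nodemanager.localizer.timeout-ms";

  private final long limitMs;
  private final Map<String, Long> localizingSince = new ConcurrentHashMap<>();

  public LocalizationTimeoutMonitor(long limitMs) {
    this.limitMs = limitMs;
  }

  public void containerStartedLocalizing(String containerId) {
    localizingSince.put(containerId, System.currentTimeMillis());
  }

  public void containerFinishedLocalizing(String containerId) {
    localizingSince.remove(containerId);
  }

  /** Called periodically; returns ids that exceeded the limit so the caller can kill them. */
  public List<String> findExpired() {
    long now = System.currentTimeMillis();
    List<String> expired = new ArrayList<>();
    localizingSince.forEach((id, since) -> {
      if (now - since > limitMs) {
        expired.add(id);
      }
    });
    return expired;
  }
}
{noformat}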
[jira] [Assigned] (YARN-2661) Container Localization is not resource limited
[ https://issues.apache.org/jira/browse/YARN-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-2661: --- Assignee: (was: Anubhav Dhoot) > Container Localization is not resource limited > -- > > Key: YARN-2661 > URL: https://issues.apache.org/jira/browse/YARN-2661 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Anubhav Dhoot > > Container localization itself can take up a lot of resources. Today this is > not resource limited in any way and can adversely affect actual containers > running on the node -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-3119) Memory limit check need not be enforced unless aggregate usage of all containers is near limit
[ https://issues.apache.org/jira/browse/YARN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-3119: --- Assignee: (was: Anubhav Dhoot) > Memory limit check need not be enforced unless aggregate usage of all > containers is near limit > -- > > Key: YARN-3119 > URL: https://issues.apache.org/jira/browse/YARN-3119 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Anubhav Dhoot > Attachments: YARN-3119.prelim.patch > > > Today we kill any container that exceeds its memory limit even if the total usage of > all containers on that node is well within the limit for YARN. Instead, if we enforce the > per-container memory limit only when the total usage of all containers is close to some > configurable ratio of the overall memory assigned to containers, we can allow for > flexibility in container memory usage without adverse effects. This is > similar in principle to how cgroups uses soft_limit_in_bytes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
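A minimal, self-contained sketch of the proposed check; the enforcement ratio and how it would be configured are assumptions, not an existing YARN setting:
{noformat}
/**
 * Skip the per-container memory-limit kill unless aggregate usage across all
 * containers is close to the node's allocation, similar in spirit to how
 * cgroups uses soft_limit_in_bytes. Illustrative only.
 */
public class SoftMemoryLimitCheck {
  private final long nodeMemoryForContainersBytes;
  private final double enforceRatio; // e.g. 0.9, from a hypothetical config key

  public SoftMemoryLimitCheck(long nodeMemoryForContainersBytes, double enforceRatio) {
    this.nodeMemoryForContainersBytes = nodeMemoryForContainersBytes;
    this.enforceRatio = enforceRatio;
  }

  /** @return true if a container over its individual limit should actually be killed. */
  public boolean shouldEnforce(long aggregateUsageBytes) {
    return aggregateUsageBytes >= enforceRatio * nodeMemoryForContainersBytes;
  }
}
{noformat}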
[jira] [Assigned] (YARN-3229) Incorrect processing of container as LOST on Interruption during NM shutdown
[ https://issues.apache.org/jira/browse/YARN-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-3229: --- Assignee: (was: Anubhav Dhoot) > Incorrect processing of container as LOST on Interruption during NM shutdown > > > Key: YARN-3229 > URL: https://issues.apache.org/jira/browse/YARN-3229 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot > > YARN-2846 fixed the issue of incorrectly writing to the state store that the > process is LOST. But even after that we still process the ContainerExitEvent. > If notInterrupted is false in RecoveredContainerLaunch#call, we should skip > the following > {noformat} > if (retCode != 0) { > LOG.warn("Recovered container exited with a non-zero exit code " > + retCode); > this.dispatcher.getEventHandler().handle(new ContainerExitEvent( > containerId, > ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode, > "Container exited with a non-zero exit code " + retCode)); > return retCode; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
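A sketch of the guard the description implies, folded into the quoted fragment from RecoveredContainerLaunch#call; only the added notInterrupted condition is new, the rest follows the quoted snippet:
{noformat}
// Sketch only: dispatch the failure event solely when the wait was not interrupted
// by NM shutdown; otherwise skip it so the recovered container is not marked failed.
if (notInterrupted && retCode != 0) {
  LOG.warn("Recovered container exited with a non-zero exit code " + retCode);
  this.dispatcher.getEventHandler().handle(new ContainerExitEvent(
      containerId,
      ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode,
      "Container exited with a non-zero exit code " + retCode));
  return retCode;
}
{noformat}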
[jira] [Assigned] (YARN-3257) FairScheduler: MaxAm may be set too low preventing apps from starting
[ https://issues.apache.org/jira/browse/YARN-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-3257: --- Assignee: (was: Anubhav Dhoot) > FairScheduler: MaxAm may be set too low preventing apps from starting > - > > Key: YARN-3257 > URL: https://issues.apache.org/jira/browse/YARN-3257 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Anubhav Dhoot > Attachments: YARN-3257.001.patch > > > In YARN-2637 CapacityScheduler#LeafQueue does not enforce max am share if the > limit prevents the first application from starting. This would be good to add > to FSLeafQueue as well -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-3994) RM should respect AM resource/placement constraints
[ https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-3994: --- Assignee: (was: Anubhav Dhoot) > RM should respect AM resource/placement constraints > --- > > Key: YARN-3994 > URL: https://issues.apache.org/jira/browse/YARN-3994 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bikas Saha > > Today, locality and cpu for the AM can be specified in the AM launch > container request but are ignored at the RM. Locality is assumed to be ANY > and cpu is dropped. There may be other things too that are ignored. This > should be fixed so that the user gets what is specified in their code to > launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-4021) RuntimeException/YarnRuntimeException sent over to the client can cause client to assume a local fatal failure
[ https://issues.apache.org/jira/browse/YARN-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-4021: --- Assignee: (was: Anubhav Dhoot) > RuntimeException/YarnRuntimeException sent over to the client can cause > client to assume a local fatal failure > --- > > Key: YARN-4021 > URL: https://issues.apache.org/jira/browse/YARN-4021 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot > > Currently RuntimeException and its derived types such as > YarnRuntimeException are serialized over to the client and thrown at the > client after YARN-731. This can cause issues like MAPREDUCE-6439 where we > assume a local fatal exception has happened. > Instead we should have a way to distinguish a local RuntimeException versus a > remote RuntimeException to avoid these issues. We need to go over all the > current client-side code that is expecting a remote RuntimeException in order > to make it work with this change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-4076) FairScheduler does not allow AM to choose which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-4076: --- Assignee: (was: Anubhav Dhoot) > FairScheduler does not allow AM to choose which containers to preempt > - > > Key: YARN-4076 > URL: https://issues.apache.org/jira/browse/YARN-4076 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Anubhav Dhoot > > Capacity scheduler allows for AM to choose which containers will be > preempted. See comment about corresponding work pending for FairScheduler > https://issues.apache.org/jira/browse/YARN-568?focusedCommentId=13649126=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13649126 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-4030) Make Nodemanager cgroup usage for container easier to use when its running inside a cgroup
[ https://issues.apache.org/jira/browse/YARN-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-4030: --- Assignee: (was: Anubhav Dhoot) > Make Nodemanager cgroup usage for container easier to use when its running > inside a cgroup > --- > > Key: YARN-4030 > URL: https://issues.apache.org/jira/browse/YARN-4030 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Anubhav Dhoot > > Today nodemanager uses the cgroup prefix pointed by > yarn.nodemanager.linux-container-executor.cgroups.hierarchy (default value > /hadoop-yarn) directly at the path of the controller say > /sys/fs/cgroup/cpu/hadoop-yarn. > If there are nodemanagers running inside docker containers on a host, each > would typically be separated by a cgroup under the controller path say > /sys/fs/cgroup/cpu/docker//nmcgroup for NM1 and > /sys/fs/cgroup/cpu/docker//nmcgroup for NM2. > In this case the correct behavior should be to use the docker cgroup paths as > /sys/fs/cgroup/cpu/docker//hadoop-yarn for NM1 > /sys/fs/cgroup/cpu/docker//hadoop-yarn for NM2. > But the default behavior would make both NMs try to use > /sys/fs/cgroup/cpu/hadoop-yarn which is incorrect and would usually fail > based on the permissions setup. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
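A sketch of one way an NM could discover the cgroup it is itself running in and nest the configured hierarchy under it; this is illustrative only, not the NodeManager implementation, and the parsing assumes the cgroup v1 layout of /proc/self/cgroup:
{noformat}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch: derive the cgroup path the NM process runs under from
 * /proc/self/cgroup and append the configured hierarchy, so an NM inside a
 * Docker container nests its container cgroups under its own cgroup (e.g.
 * /docker/<id>/hadoop-yarn) instead of the global /hadoop-yarn.
 */
public class NestedCgroupPath {
  /** Lines in /proc/self/cgroup look like "3:cpu,cpuacct:/docker/<id>". */
  public static String yarnHierarchyFor(String controller, String configuredHierarchy)
      throws IOException {
    List<String> lines = Files.readAllLines(Paths.get("/proc/self/cgroup"));
    for (String line : lines) {
      String[] parts = line.split(":", 3);
      if (parts.length == 3 && Arrays.asList(parts[1].split(",")).contains(controller)) {
        String own = parts[2];                    // e.g. "/docker/<id>" or "/"
        String base = "/".equals(own) ? "" : own; // avoid a double slash on the host case
        return base + "/" + configuredHierarchy;  // e.g. "/docker/<id>/hadoop-yarn"
      }
    }
    return "/" + configuredHierarchy;             // fall back to the configured prefix
  }
}
{noformat}
For example, yarnHierarchyFor("cpu", "hadoop-yarn") would return the nested "/docker/<id>/hadoop-yarn" path when called inside a container and plain "/hadoop-yarn" on a bare host.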
[jira] [Assigned] (YARN-4144) Add NM that causes LaunchFailedTransition to blacklist
[ https://issues.apache.org/jira/browse/YARN-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-4144: --- Assignee: (was: Anubhav Dhoot) > Add NM that causes LaunchFailedTransition to blacklist > -- > > Key: YARN-4144 > URL: https://issues.apache.org/jira/browse/YARN-4144 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Anubhav Dhoot > > During discussion of YARN-2005 we need to add more cases where blacklisting > can occur. This tracks adding any failures in launch via > LaunchFailedTransition to also contribute to blacklisting -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834
[ https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4032: Assignee: (was: Anubhav Dhoot) Unassigning this from myself since I am not going to have time to work on this. Please feel free to take this up. > Corrupted state from a previous version can still cause RM to fail with NPE > due to same reasons as YARN-2834 > > > Key: YARN-4032 > URL: https://issues.apache.org/jira/browse/YARN-4032 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4032.prelim.patch > > > YARN-2834 ensures in 2.6.0 there will not be any inconsistent state. But if > someone is upgrading from a previous version, the state can still be > inconsistent and then RM will still fail with NPE after upgrade to 2.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3738) Add support for recovery of reserved apps (running under dynamic queues) to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3738: Attachment: YARN-3738-v3.patch Retriggering jenkins with same patch > Add support for recovery of reserved apps (running under dynamic queues) to > Capacity Scheduler > -- > > Key: YARN-3738 > URL: https://issues.apache.org/jira/browse/YARN-3738 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Attachments: YARN-3738-v2.patch, YARN-3738-v3.patch, > YARN-3738-v3.patch, YARN-3738.patch > > > YARN-3736 persists the current state of the Plan to the RMStateStore. This > JIRA covers recovery of the Plan, i.e. dynamic reservation queues with > associated apps as part Capacity Scheduler failover mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3738) Add support for recovery of reserved apps (running under dynamic queues) to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971817#comment-14971817 ] Anubhav Dhoot commented on YARN-3738: - +1 pending jenkins > Add support for recovery of reserved apps (running under dynamic queues) to > Capacity Scheduler > -- > > Key: YARN-3738 > URL: https://issues.apache.org/jira/browse/YARN-3738 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Attachments: YARN-3738-v2.patch, YARN-3738-v3.patch, > YARN-3738-v3.patch, YARN-3738-v4.patch, YARN-3738.patch > > > YARN-3736 persists the current state of the Plan to the RMStateStore. This > JIRA covers recovery of the Plan, i.e. dynamic reservation queues with > associated apps as part Capacity Scheduler failover mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3739) Add reservation system recovery to RM recovery process
[ https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3739: Fix Version/s: 2.8.0 > Add reservation system recovery to RM recovery process > -- > > Key: YARN-3739 > URL: https://issues.apache.org/jira/browse/YARN-3739 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Fix For: 2.8.0 > > Attachments: YARN-3739-v1.patch, YARN-3739-v2.patch, > YARN-3739-v3.patch > > > YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks > the recovery of the reservation system in case of a RM failover. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3739) Add recovery of reservation system to RM failover process
[ https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969137#comment-14969137 ] Anubhav Dhoot commented on YARN-3739: - +1 > Add recovery of reservation system to RM failover process > - > > Key: YARN-3739 > URL: https://issues.apache.org/jira/browse/YARN-3739 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Attachments: YARN-3739-v1.patch, YARN-3739-v2.patch, > YARN-3739-v3.patch > > > YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks > the recovery of the reservation system in case of a RM failover. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3739) Add reservation system recovery to RM recovery process
[ https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3739: Summary: Add reservation system recovery to RM recovery process (was: Add recovery of reservation system to RM failover process) > Add reservation system recovery to RM recovery process > -- > > Key: YARN-3739 > URL: https://issues.apache.org/jira/browse/YARN-3739 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Attachments: YARN-3739-v1.patch, YARN-3739-v2.patch, > YARN-3739-v3.patch > > > YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks > the recovery of the reservation system in case of a RM failover. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4184: Assignee: Subru Krishnan (was: Anubhav Dhoot) > Remove update reservation state api from state store as its not used by > ReservationSystem > - > > Key: YARN-4184 > URL: https://issues.apache.org/jira/browse/YARN-4184 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Subru Krishnan > > ReservationSystem uses remove/add for updates and thus update api in state > store is not needed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3739) Add recovery of reservation system to RM failover process
[ https://issues.apache.org/jira/browse/YARN-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968136#comment-14968136 ] Anubhav Dhoot commented on YARN-3739: - Minor comment on loadState you can enumerate on reservations.entrySet() instead of keySet and then doing a get Looks good otherwise > Add recovery of reservation system to RM failover process > - > > Key: YARN-3739 > URL: https://issues.apache.org/jira/browse/YARN-3739 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Attachments: YARN-3739-v1.patch > > > YARN-1051 introduced a reservation system in the YARN RM. This JIRA tracks > the recovery of the reservation system in case of a RM failover. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965804#comment-14965804 ] Anubhav Dhoot commented on YARN-3985: - Reran the test multiple times without failure > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, > YARN-3985.004.patch, YARN-3985.005.patch, YARN-3985.005.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.005.patch Added a retry since there are multiple events that we need to wait for. DrainDispatcher#await can return once we have drained the first event from the queue while the next event has not yet been added, so a simple await does not work reliably. > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, > YARN-3985.004.patch, YARN-3985.005.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
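An illustration of the retry pattern described above, used instead of relying on a single DrainDispatcher#await; the helper is generic test code, not the actual patch:
{noformat}
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Poll an end condition with a bounded retry; illustrative test helper only. */
final class TestRetry {
  static void waitFor(BooleanSupplier condition, long checkEveryMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(checkEveryMs);
    }
  }
}
{noformat}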
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.005.patch > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, > YARN-3985.004.patch, YARN-3985.005.patch, YARN-3985.005.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.004.patch Thanks for the review [~asuresh]. Attached patch removes sleep by adding the plan synchronization and a DrainDispatcher await needed to ensure the node capacity is added to the scheduler and to the plan capacity. > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch, > YARN-3985.004.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957372#comment-14957372 ] Anubhav Dhoot commented on YARN-4227: - The previous statement should also be updated to handle a null node to avoid a NPE inside it {noformat}application.unreserve(rmContainer.getReservedPriority(), node);{noformat} This may need to still process some portion FSAppAttempt#unreserveInternal instead of skipping the entire processing. The test seems ok. Should we rename blacklist -> remove? Overall the fix looks ok. Just another bug which indicates until we restructure the code we will have to keep adding bandaids. > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
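A minimal sketch of the null-node guard suggested in the first review point above; the surrounding FairScheduler#completedContainer context is abbreviated, and everything except the quoted unreserve call is illustrative:
{noformat}
// Sketch only: bail out when the node has already been removed so an expired
// container from a lost NM does not NPE further down.
if (node == null) {
  LOG.info("Container " + rmContainer.getContainerId()
      + " completed on a node that is no longer tracked; skipping node updates");
  return;
}
application.unreserve(rmContainer.getReservedPriority(), node);
{noformat}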
[jira] [Commented] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834
[ https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953924#comment-14953924 ] Anubhav Dhoot commented on YARN-4032: - This is a sample log {noformat} 2015-10-10 04:35:32,486 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1441905716013_43686_01 State change from NEW to FINISHED java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:642) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1219) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1008) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:760) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:107) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:841) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:856) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:846) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:721) : java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:642) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1219) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1008) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:760) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:107) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:841) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:856) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:846) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at
[jira] [Updated] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834
[ https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4032: Attachment: YARN-4032.prelim.patch Prelim patch based on the discussion > Corrupted state from a previous version can still cause RM to fail with NPE > due to same reasons as YARN-2834 > > > Key: YARN-4032 > URL: https://issues.apache.org/jira/browse/YARN-4032 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4032.prelim.patch > > > YARN-2834 ensures in 2.6.0 there will not be any inconsistent state. But if > someone is upgrading from a previous version, the state can still be > inconsistent and then RM will still fail with NPE after upgrade to 2.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952545#comment-14952545 ] Anubhav Dhoot commented on YARN-4247: - Yup. I had tested without that change. Resolving this as not needed. > Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing > events > - > > Key: YARN-4247 > URL: https://issues.apache.org/jira/browse/YARN-4247 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Blocker > Attachments: YARN-4247.001.patch, YARN-4247.001.patch > > > We see this deadlock in our testing where events do not get processed and we > see this in the logs before the RM dies of OOM {noformat} 2015-10-08 > 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of > event-queue is 1488000 2015-10-08 04:48:01,918 INFO > org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952032#comment-14952032 ] Anubhav Dhoot commented on YARN-4247: - Tested this in a cluster. Before this fix the cluster would fall over around 3 to 4 hours. After this fix the cluster going strong beyond 24 hours. > Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing > events > - > > Key: YARN-4247 > URL: https://issues.apache.org/jira/browse/YARN-4247 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Blocker > Attachments: YARN-4247.001.patch, YARN-4247.001.patch > > > We see this deadlock in our testing where events do not get processed and we > see this in the logs before the RM dies of OOM {noformat} 2015-10-08 > 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of > event-queue is 1488000 2015-10-08 04:48:01,918 INFO > org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4247: Attachment: YARN-4247.001.patch Fix removes need for locking from FSAppAttempt to RMAppAttemptImpl. > Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing > events > - > > Key: YARN-4247 > URL: https://issues.apache.org/jira/browse/YARN-4247 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Blocker > Attachments: YARN-4247.001.patch > > > We see this deadlock in our testing where events do not get processed and we > see this in the logs before the RM dies of OOM {noformat} 2015-10-08 > 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of > event-queue is 1488000 2015-10-08 04:48:01,918 INFO > org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4247: Attachment: YARN-4247.001.patch retrigger jenkins > Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing > events > - > > Key: YARN-4247 > URL: https://issues.apache.org/jira/browse/YARN-4247 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Blocker > Attachments: YARN-4247.001.patch, YARN-4247.001.patch > > > We see this deadlock in our testing where events do not get processed and we > see this in the logs before the RM dies of OOM {noformat} 2015-10-08 > 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of > event-queue is 1488000 2015-10-08 04:48:01,918 INFO > org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user
[ https://issues.apache.org/jira/browse/YARN-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951331#comment-14951331 ] Anubhav Dhoot commented on YARN-4235: - Thanks [~rohithsharma] for review and commit! > FairScheduler PrimaryGroup does not handle empty groups returned for a user > > > Key: YARN-4235 > URL: https://issues.apache.org/jira/browse/YARN-4235 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Fix For: 2.8.0 > > Attachments: YARN-4235.001.patch > > > We see NPE if empty groups are returned for a user. This causes a NPE and > cause RM to crash as below > {noformat} > 2015-09-22 16:51:52,780 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ADDED to the scheduler > java.lang.IndexOutOfBoundsException: Index: 0 > at java.util.Collections$EmptyList.get(Collections.java:3212) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684) > at java.lang.Thread.run(Thread.java:745) > 2015-09-22 16:51:52,797 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951161#comment-14951161 ] Anubhav Dhoot commented on YARN-4247: - Looking at the jstack here is the deadlock between FS and RMAppAttemptImpl The first thread has a lock on FSAppAttempt and is waiting on the RMAppAttemptImpl lock The second thread RMAppAttemptImpl.getApplicationResourceUsageReport has taken a readlock and waiting on FSAppAttempt This causes other threads (eg. third thread) such as the AsyncDispatcher threads to get blocked causing RM to stop processing events and then crash with OOM because of the backlog of events. {noformat} "IPC Server handler 49 on 8030" #239 daemon prio=5 os_prio=0 tid=0x01093000 nid=0x8206 waiting on condition [0x7f930b2da000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00071719e0f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.*RMAppAttemptImpl*.getMasterContainer(RMAppAttemptImpl.java:747) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.isWaitingForAMContainer(SchedulerApplicationAttempt.java:482) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:938) - locked <0x000715932d98> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.*FSAppAttempt*) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:529) - locked <0x0007171a5328> (a org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) "IPC Server handler 9 on 8032" #253 daemon prio=5 os_prio=0 tid=0x00e2e800 nid=0x8214 waiting for monitor entry [0x7f930a4cd000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:570) - waiting to lock <0x000715932d98> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:241) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:114) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.*RMAppAttemptImpl*.getApplicationResourceUsageReport(RMAppAttemptImpl.java:798) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:655) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:330) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at
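Reduced to its essentials, the jstack above is a lock-ordering cycle; the following self-contained illustration (class and field names are stand-ins, not the YARN code) shows the cycle and the direction of the fix, taking the RM-side snapshot before entering the scheduler lock:
{noformat}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Thread A: scheduler-attempt lock -> RM-attempt read lock (allocate path).
 * Thread B: RM-attempt read lock -> scheduler-attempt lock (usage-report path).
 * Breaking the nesting on one side removes the cycle.
 */
class DeadlockSketch {
  private final Object schedulerAttemptLock = new Object();
  private final ReadWriteLock rmAttemptLock = new ReentrantReadWriteLock();
  private volatile Object masterContainer;

  void allocateSafely() {
    Object master;
    rmAttemptLock.readLock().lock();
    try {
      master = masterContainer;   // read the RM-side state outside the scheduler lock
    } finally {
      rmAttemptLock.readLock().unlock();
    }
    synchronized (schedulerAttemptLock) {
      if (master == null) {
        // scheduler bookkeeping that only needs the snapshot taken above,
        // e.g. treat the attempt as still waiting for its AM container
      }
    }
  }
}
{noformat}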
[jira] [Moved] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot moved MAPREDUCE-6509 to YARN-4247: Component/s: (was: resourcemanager) resourcemanager fairscheduler Key: YARN-4247 (was: MAPREDUCE-6509) Project: Hadoop YARN (was: Hadoop Map/Reduce) > Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing > events > - > > Key: YARN-4247 > URL: https://issues.apache.org/jira/browse/YARN-4247 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > We see this deadlock in our testing where events do not get processed and we > see this in the logs before the RM dies of OOM {noformat} 2015-10-08 > 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of > event-queue is 1488000 2015-10-08 04:48:01,918 INFO > org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user
[ https://issues.apache.org/jira/browse/YARN-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4235: Attachment: YARN-4235.001.patch Handle empty groups > FairScheduler PrimaryGroup does not handle empty groups returned for a user > > > Key: YARN-4235 > URL: https://issues.apache.org/jira/browse/YARN-4235 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4235.001.patch > > > We see NPE if empty groups are returned for a user. This causes a NPE and > cause RM to crash as below > {noformat} > 2015-09-22 16:51:52,780 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ADDED to the scheduler > java.lang.IndexOutOfBoundsException: Index: 0 > at java.util.Collections$EmptyList.get(Collections.java:3212) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684) > at java.lang.Thread.run(Thread.java:745) > 2015-09-22 16:51:52,797 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
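A minimal sketch of the kind of guard "Handle empty groups" implies for QueuePlacementRule.PrimaryGroup#getQueueForApp; the helper names and the empty-string "continue to the next rule" convention are assumptions, not verbatim from the patch:
{noformat}
// Illustrative fragment only.
List<String> groups = groupProvider.getGroups(user);
if (groups == null || groups.isEmpty()) {
  return "";   // let the next configured placement rule handle this app
}
return "root." + groups.get(0);
{noformat}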
[jira] [Created] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user
Anubhav Dhoot created YARN-4235: --- Summary: FairScheduler PrimaryGroup does not handle empty groups returned for a user Key: YARN-4235 URL: https://issues.apache.org/jira/browse/YARN-4235 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We see NPE if empty groups are returned for a user. This causes a NPE and cause RM to crash as below {noformat} 2015-09-22 16:51:52,780 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:3212) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684) at java.lang.Thread.run(Thread.java:745) 2015-09-22 16:51:52,797 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
[ https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943809#comment-14943809 ] Anubhav Dhoot commented on YARN-4185: - I don't think option 2, where you restart from 1, makes sense. It's also not a goal to minimize the total wait time. The goal should be to minimize the time to recover from short intermittent failures while also waiting long enough for long failures before giving up. Would it be better for us to ramp up to 10 sec exponentially and then do the n retries at 10 sec, or to do a total of n retries including the ramp-up? > Retry interval delay for NM client can be improved from the fixed static > retry > --- > > Key: YARN-4185 > URL: https://issues.apache.org/jira/browse/YARN-4185 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian > > Instead of having a fixed retry interval that starts off very high and stays > there, we are better off using an exponential backoff that has the same fixed > max limit. Today the retry interval is fixed at 10 sec, which can be > unnecessarily high especially when an NM rolling restart can complete within a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
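One possible shape for the backoff discussed above, capped at the existing 10 second interval; this only illustrates the schedule and is not the actual retry-policy wiring in the NM client:
{noformat}
/** Start small, double each attempt, and cap at the configured maximum. Illustrative only. */
final class CappedExponentialBackoff {
  private final long initialMs;
  private final long maxMs;

  CappedExponentialBackoff(long initialMs, long maxMs) {
    this.initialMs = initialMs;
    this.maxMs = maxMs;
  }

  /** Delay before the given 1-based attempt number. */
  long delayMs(int attempt) {
    long delay = initialMs * (1L << Math.min(attempt - 1, 20));
    return Math.min(delay, maxMs);
  }
}
{noformat}
With initialMs = 100 and maxMs = 10000 this yields 100, 200, 400, ... up to 10 sec, after which every retry waits the full 10 sec until the overall retry budget is exhausted.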
[jira] [Commented] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305
[ https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941443#comment-14941443 ] Anubhav Dhoot commented on YARN-3996: - Approach looks ok > YARN-789 (Support for zero capabilities in fairscheduler) is broken after > YARN-3305 > --- > > Key: YARN-3996 > URL: https://issues.apache.org/jira/browse/YARN-3996 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, fairscheduler >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian >Priority: Critical > Attachments: YARN-3996.prelim.patch > > > RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest > with mininumResource for the incrementResource. This causes normalize to > return zero if minimum is set to zero as per YARN-789 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
[ https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4185: Assignee: Neelesh Srinivas Salian (was: Anubhav Dhoot) > Retry interval delay for NM client can be improved from the fixed static > retry > --- > > Key: YARN-4185 > URL: https://issues.apache.org/jira/browse/YARN-4185 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian > > Instead of having a fixed retry interval that starts off very high and stays > there, we are better off using an exponential backoff that has the same fixed > max limit. Today the retry interval is fixed at 10 sec that can be > unnecessarily high especially when NMs could rolling restart within a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305
[ https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938225#comment-14938225 ] Anubhav Dhoot commented on YARN-3996: - SchedulerUtils has multiple overloads of normalizeRequests. The ones that take in both increment and min handle what you are looking to do. Fair has support for the increment while Fifo/Capacity do not. So Fair rounds to a multiple of the increment and uses min as the floor, while Fifo/Capacity round to a multiple of min and use min as the floor. Basically Capacity/Fifo are setting incr to min as well. We need to do the same in RMAppManager. That way Fair can continue supporting a zero min and round to multiples of incr, and Fifo/Capacity can choose to not support a zero min and round to multiples of min. > YARN-789 (Support for zero capabilities in fairscheduler) is broken after > YARN-3305 > --- > > Key: YARN-3996 > URL: https://issues.apache.org/jira/browse/YARN-3996 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, fairscheduler >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian >Priority: Critical > > RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest > with minimumResource for the incrementResource. This causes normalize to > return zero if minimum is set to zero as per YARN-789 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
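A simplified, memory-only sketch of the normalization behavior described above (the real SchedulerUtils overloads operate on Resource objects): rounding is done on the increment while the minimum is only a floor, so a zero minimum with a non-zero increment no longer normalizes requests to zero. Fair would pass its configured increment; Fifo/Capacity would pass increment == minimum.
{noformat}
/** Illustrative only; simplified to a single long-valued dimension. */
static long normalizeMemory(long requested, long minimum, long increment, long maximum) {
  long rounded = increment <= 0
      ? requested
      : ((requested + increment - 1) / increment) * increment;  // round up to a multiple of increment
  return Math.min(Math.max(rounded, minimum), maximum);
}
{noformat}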
[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
[ https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938190#comment-14938190 ] Anubhav Dhoot commented on YARN-4185: - can we try to reuse the existing values for retries (yarn.client.nodemanager-connect. ) and see if we can be mostly compatible? I am thinking its fine if its not exactly the same behavior > Retry interval delay for NM client can be improved from the fixed static > retry > --- > > Key: YARN-4185 > URL: https://issues.apache.org/jira/browse/YARN-4185 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > Instead of having a fixed retry interval that starts off very high and stays > there, we are better off using an exponential backoff that has the same fixed > max limit. Today the retry interval is fixed at 10 sec that can be > unnecessarily high especially when NMs could rolling restart within a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305
[ https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3996: Assignee: Neelesh Srinivas Salian (was: Anubhav Dhoot) > YARN-789 (Support for zero capabilities in fairscheduler) is broken after > YARN-3305 > --- > > Key: YARN-3996 > URL: https://issues.apache.org/jira/browse/YARN-3996 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, fairscheduler >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian >Priority: Critical > > RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest > with mininumResource for the incrementResource. This causes normalize to > return zero if minimum is set to zero as per YARN-789 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
[ https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933547#comment-14933547 ] Anubhav Dhoot commented on YARN-4204: - Committed to trunk and branch-2. Thanks [~kasha] for the review. > ConcurrentModificationException in FairSchedulerQueueInfo > - > > Key: YARN-4204 > URL: https://issues.apache.org/jira/browse/YARN-4204 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Fix For: 2.8.0 > > Attachments: YARN-4204.001.patch, YARN-4204.002.patch > > > Saw this exception which caused RM to go down > {noformat} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) > at >
[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4180: Attachment: YARN-4180-branch-2.7.2.txt Minor conflicts in backporting changes to branch 2.7 > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180-branch-2.7.2.txt, YARN-4180.001.patch, > YARN-4180.002.patch, YARN-4180.002.patch, YARN-4180.002.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
[ https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4204: Description: Saw this exception which caused RM to go down {noformat} java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1279) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at
[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4180: Attachment: YARN-4180.002.patch Try triggering jenkins again > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch, YARN-4180.002.patch, > YARN-4180.002.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
[ https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4204: Attachment: YARN-4204.002.patch Add unit test to repro ConcurrentModificationException > ConcurrentModificationException in FairSchedulerQueueInfo > - > > Key: YARN-4204 > URL: https://issues.apache.org/jira/browse/YARN-4204 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4204.001.patch, YARN-4204.002.patch > > > Saw this exception which caused RM to go down > {noformat} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) > at >
[jira] [Commented] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
[ https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905621#comment-14905621 ] Anubhav Dhoot commented on YARN-4204: - Issue is getChildQueues is returning an unmodifiable list wrapper over the list childQueues that itself can get modified. So even though the callers cannot modify it, they can still get CME while iterating over it if someone modifies the underlying list. > ConcurrentModificationException in FairSchedulerQueueInfo > - > > Key: YARN-4204 > URL: https://issues.apache.org/jira/browse/YARN-4204 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > Saw this exception > {noformat} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) >
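To illustrate the root cause described in the comment above: Collections.unmodifiableList only wraps the live list, so a reader iterating the wrapper can still hit ConcurrentModificationException if the scheduler mutates the underlying childQueues list concurrently. The sketch below is illustrative only (the class and field names are not the actual scheduler or FairSchedulerQueueInfo code); one possible fix is to hand out a snapshot copy instead of a live view.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of the failure mode; names are illustrative, not the real classes.
class QueueSketch {
  private final List<String> childQueues = new ArrayList<String>();

  // An unmodifiable wrapper is only a view: callers cannot add or remove
  // elements, but they still iterate over the live backing list.
  List<String> getChildQueues() {
    return Collections.unmodifiableList(childQueues);
  }

  // If this runs while another thread is iterating the view above,
  // that iterator throws ConcurrentModificationException.
  void addQueue(String name) {
    synchronized (childQueues) {
      childQueues.add(name);
    }
  }

  // One possible fix: hand out a snapshot copy instead of a live view,
  // so web-service readers iterate over data that can no longer change.
  List<String> getChildQueuesSnapshot() {
    synchronized (childQueues) {
      return new ArrayList<String>(childQueues);
    }
  }
}
{code}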
[jira] [Updated] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
[ https://issues.apache.org/jira/browse/YARN-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4204: Attachment: YARN-4204.001.patch > ConcurrentModificationException in FairSchedulerQueueInfo > - > > Key: YARN-4204 > URL: https://issues.apache.org/jira/browse/YARN-4204 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4204.001.patch > > > Saw this exception > {noformat} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) > at > org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84) > at >
[jira] [Created] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
Anubhav Dhoot created YARN-4204: --- Summary: ConcurrentModificationException in FairSchedulerQueueInfo Key: YARN-4204 URL: https://issues.apache.org/jira/browse/YARN-4204 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Saw this exception {noformat} java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.(FairSchedulerQueueInfo.java:100) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.(FairSchedulerInfo.java:46) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1279) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at
[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903217#comment-14903217 ] Anubhav Dhoot commented on YARN-4180: - The test failure looks unrelated. > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4180: Attachment: YARN-4180.002.patch Addressed feedback > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch, YARN-4180.002.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4180: Attachment: YARN-4180.001.patch reuse the same retry proxy used by AM client for RM client. Also opened YARN-4185 to improve this retry mechanism > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
Anubhav Dhoot created YARN-4185: --- Summary: Retry interval delay for NM client can be improved from the fixed static retry Key: YARN-4185 URL: https://issues.apache.org/jira/browse/YARN-4185 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Instead of having a fixed retry interval that starts off very high and stays there, we are better off using an exponential backoff with the same fixed maximum limit. Today the retry interval is fixed at 10 seconds, which can be unnecessarily high, especially when NMs can complete a rolling restart within a second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
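A minimal sketch of the kind of policy this issue proposes, assuming Hadoop's org.apache.hadoop.io.retry.RetryPolicies utilities; the retry counts and sleep times below are illustrative, not the values any eventual patch would use.

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class NmClientRetryPolicySketch {
  // Proposed shape: start with a short sleep that doubles on each attempt,
  // bounded by a maximum number of retries. Values are illustrative.
  public static RetryPolicy exponentialPolicy() {
    return RetryPolicies.exponentialBackoffRetry(10, 100, TimeUnit.MILLISECONDS);
  }

  // The behaviour being criticised, for comparison: every retry sleeps the
  // same fixed interval regardless of how quickly the NM comes back.
  public static RetryPolicy fixedPolicy() {
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 10, TimeUnit.SECONDS);
  }
}
{code}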
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.003.patch > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876038#comment-14876038 ] Anubhav Dhoot commented on YARN-2005: - Thanks [~jianhe], [~sunilg], [~jlowe] for the reviews and [~kasha] for the review and commit ! > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Fix For: 2.8.0 > > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch, YARN-2005.009.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876669#comment-14876669 ] Anubhav Dhoot commented on YARN-4131: - Can you please add [~kasha] and me to this? We are interested in this effort. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch, > YARN-4131-v1.1.patch, YARN-4131-v1.2.patch, YARN-4131-v1.patch > > > Per YARN-3337, we need a handy tools to kill container in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3920) FairScheduler container reservation on a node should be configurable to limit it to large containers
[ https://issues.apache.org/jira/browse/YARN-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876473#comment-14876473 ] Anubhav Dhoot commented on YARN-3920: - Thx [~asuresh] for review and commit! > FairScheduler container reservation on a node should be configurable to limit > it to large containers > > > Key: YARN-3920 > URL: https://issues.apache.org/jira/browse/YARN-3920 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3920.004.patch, YARN-3920.004.patch, > YARN-3920.004.patch, YARN-3920.004.patch, YARN-3920.005.patch, > yARN-3920.001.patch, yARN-3920.002.patch, yARN-3920.003.patch > > > Reserving a node for a container was designed for preventing large containers > from starvation from small requests that keep getting into a node. Today we > let this be used even for a small container request. This has a huge impact > on scheduling since we block other scheduling requests until that reservation > is fulfilled. We should make this configurable so its impact can be minimized > by limiting it for large container requests as originally intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
[ https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876494#comment-14876494 ] Anubhav Dhoot commented on YARN-4143: - Cannot think of an API to add to the scheduler that will be called by RMAppAttempt. We can add an event to SchedulerEventType such as AM_CONTAINER_ALLOCATED that happens in the 2 places you mentioned. Seems overkill to me. Lemme know if you have any alternate ways of doing this. > Optimize the check for AMContainer allocation needed by blacklisting and > ContainerType > -- > > Key: YARN-4143 > URL: https://issues.apache.org/jira/browse/YARN-4143 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4143.001.patch > > > In YARN-2005 there are checks made to determine if the allocation is for an > AM container. This happens in every allocate call and should be optimized > away since it changes only once per SchedulerApplicationAttempt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem
Anubhav Dhoot created YARN-4184: --- Summary: Remove update reservation state api from state store as its not used by ReservationSystem Key: YARN-4184 URL: https://issues.apache.org/jira/browse/YARN-4184 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot ReservationSystem uses remove/add for updates and thus update api in state store is not needed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876049#comment-14876049 ] Anubhav Dhoot commented on YARN-3985: - Addressed feedback and opened YARN-4184 for removing updateReservation > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch, YARN-3985.003.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
[ https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804344#comment-14804344 ] Anubhav Dhoot commented on YARN-4143: - I think we can minimize the impact of checking on every allocate by limiting the check to allocates that happen before the AM is assigned, which should be only a few calls until the AM itself gets launched. This avoids adding an API to the scheduler. > Optimize the check for AMContainer allocation needed by blacklisting and > ContainerType > -- > > Key: YARN-4143 > URL: https://issues.apache.org/jira/browse/YARN-4143 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > In YARN-2005 there are checks made to determine if the allocation is for an > AM container. This happens in every allocate call and should be optimized > away since it changes only once per SchedulerApplicationAttempt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
[ https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4143: Attachment: YARN-4143.001.patch The attached patch ensures the checks are done only while the AM is not yet allocated. Once it is allocated, the method simply returns. It also removes passing the applicationId, which is redundant since we are checking only for this app. > Optimize the check for AMContainer allocation needed by blacklisting and > ContainerType > -- > > Key: YARN-4143 > URL: https://issues.apache.org/jira/browse/YARN-4143 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4143.001.patch > > > In YARN-2005 there are checks made to determine if the allocation is for an > AM container. This happens in every allocate call and should be optimized > away since it changes only once per SchedulerApplicationAttempt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
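A minimal sketch of the short-circuit described above, under the assumption that the attempt object caches a flag once the AM container has been seen; the class, field, and helper names here are hypothetical, not the actual SchedulerApplicationAttempt code.

{code:java}
// Hypothetical names throughout; this only shows the shape of the check.
class AttemptSketch {
  private volatile boolean amContainerAllocated = false;

  boolean isWaitingForAMContainer() {
    // Once the AM container has been seen, skip the lookup on every
    // subsequent allocate call and return immediately.
    if (amContainerAllocated) {
      return false;
    }
    boolean stillWaiting = checkWithRMAppAttempt(); // hypothetical helper
    if (!stillWaiting) {
      amContainerAllocated = true;
    }
    return stillWaiting;
  }

  // Stand-in for the real check against the RM app attempt state.
  private boolean checkWithRMAppAttempt() {
    return false;
  }
}
{code}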
[jira] [Created] (YARN-4180) AMLauncher does not retry on failures when talking to NM
Anubhav Dhoot created YARN-4180: --- Summary: AMLauncher does not retry on failures when talking to NM Key: YARN-4180 URL: https://issues.apache.org/jira/browse/YARN-4180 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We see issues with the RM trying to launch a container while an NM is restarting, and we get exceptions like NMNotReadyException. While YARN-3842 added retries for other clients of the NM (mainly AMs), they are not used by the AMLauncher in the RM, so these intermittent errors cause job failures. This can manifest during a rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804619#comment-14804619 ] Anubhav Dhoot commented on YARN-4180: - Propose using retries in the ContainerManagement proxy used by the AMLauncher#getContainerMgrProxy > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
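A rough sketch of what retrying the NM proxy could look like, assuming Hadoop's generic org.apache.hadoop.io.retry.RetryProxy utilities; how the raw proxy is obtained and the policy values are illustrative, and this is not necessarily how the eventual patch wires it up.

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;

public class RetryingCmProxySketch {
  // Wrap an existing ContainerManagementProtocol proxy so transient failures
  // (e.g. the NM restarting) are retried instead of failing the AM launch.
  // Policy values are illustrative.
  public static ContainerManagementProtocol withRetries(
      ContainerManagementProtocol rawProxy) {
    RetryPolicy policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        30, 1, TimeUnit.SECONDS);
    return (ContainerManagementProtocol) RetryProxy.create(
        ContainerManagementProtocol.class, rawProxy, policy);
  }
}
{code}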
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.002.patch All the failed tests passed locally for me. Rerunning the tests > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch, YARN-3985.002.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
[ https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791259#comment-14791259 ] Anubhav Dhoot commented on YARN-4143: - Copying comments from YARN-2005 Sunil G added a comment - 03/Sep/15 07:17 Hi Anubhav Dhoot Thank you for updating the patch. I have a comment here. isWaitingForAMContainer is now used in 2 cases. To set the ContainerType and also in blacklist case. And this check is now hitting in every heartbeat from AM. I think its better to set a state called amIsStarted in SchedulerApplicationAttempt. And this can be set from 2 places. 1. RMAppAttemptImpl#AMContainerAllocatedTransition can call a new scheduler api to set amIsStarted flag when AM Container is launched and registered. We need to pass ContainerId to this new api to get attempt object and to set the flag. 2. AbstrctYarnScheduler#recoverContainersOnNode can also invoke this api to set this flag. So now we can directly read from SchedulerApplicationAttempt everytime when heartbeat call comes from AM. If we are not doing this in this ticket, I can open another ticket for this optimization. Please suggest your thoughts. > Optimize the check for AMContainer allocation needed by blacklisting and > ContainerType > -- > > Key: YARN-4143 > URL: https://issues.apache.org/jira/browse/YARN-4143 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > In YARN-2005 there are checks made to determine if the allocation is for an > AM container. This happens in every allocate call and should be optimized > away since it changes only once per SchedulerApplicationAttempt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.002.patch Retriggering jenkins as failures seem unrelated > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch, > YARN-3985.002.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4130) Duplicate declaration of ApplicationId in RMAppManager
[ https://issues.apache.org/jira/browse/YARN-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745782#comment-14745782 ] Anubhav Dhoot commented on YARN-4130: - LGTM the failures look unrelated > Duplicate declaration of ApplicationId in RMAppManager > -- > > Key: YARN-4130 > URL: https://issues.apache.org/jira/browse/YARN-4130 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Kai Sasaki >Assignee: Kai Sasaki >Priority: Trivial > Labels: resourcemanager > Attachments: YARN-4130.00.patch > > > ApplicationId is declared double in {{RMAppManager}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745893#comment-14745893 ] Anubhav Dhoot commented on YARN-3985: - failures are due to rebase with YARN-3656. Will update those new tests and submit a new patch > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.002.patch Updated the patch to modify the new tests to add a valid ReservationDefinition that is required by the state store processing. > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch, YARN-3985.002.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4135) Improve the assertion message in MockRM while failing after waiting for the state.
[ https://issues.apache.org/jira/browse/YARN-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745775#comment-14745775 ] Anubhav Dhoot commented on YARN-4135: - The patch looks fine except for the missing spaces before and after the + at "+appId > Improve the assertion message in MockRM while failing after waiting for the > state. > -- > > Key: YARN-4135 > URL: https://issues.apache.org/jira/browse/YARN-4135 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: nijel >Assignee: nijel >Priority: Minor > Labels: test > Attachments: YARN-4135_1.patch > > > In MockRM when the test is failed after waiting for the given state, the > application id or the attempt id can be printed for easy debugging. > As of now it is hard to track the test failure in the log since there is no relation > between the test case and the application id. > Any thoughts ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4156) testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler
Anubhav Dhoot created YARN-4156: --- Summary: testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler Key: YARN-4156 URL: https://issues.apache.org/jira/browse/YARN-4156 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot The test assumes the scheduler is CapacityScheduler without configuring it as such. This causes it to fail if the default is something else, such as the FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4156) testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4156: Attachment: YARN-4156.001.patch Uploading a patch that configures the scheduler to be CapacityScheduler > testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler > -- > > Key: YARN-4156 > URL: https://issues.apache.org/jira/browse/YARN-4156 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4156.001.patch > > > The test assumes the scheduler is CapacityScheduler without configuring it as > such. This causes it to fail if the default is something else, such as the > FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
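A small sketch of how a test can pin the scheduler explicitly rather than relying on the build's default; the exact change in YARN-4156.001.patch may differ.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

public class SchedulerTestConfSketch {
  // Pin the scheduler class explicitly so the test does not depend on
  // whatever default (e.g. FairScheduler) the build happens to use.
  public static YarnConfiguration capacitySchedulerConf() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setClass(YarnConfiguration.RM_SCHEDULER,
        CapacityScheduler.class, ResourceScheduler.class);
    return conf;
  }
}
{code}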
[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744006#comment-14744006 ] Anubhav Dhoot commented on YARN-3784: - Hi [~sunilg] this does not include support for FairScheduler. Are we planning to add that here or have a separate jira to track that work? Thx > Indicate preemption timout along with the list of containers to AM > (preemption message) > --- > > Key: YARN-3784 > URL: https://issues.apache.org/jira/browse/YARN-3784 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch > > > Currently during preemption, AM is notified with a list of containers which > are marked for preemption. Introducing a timeout duration also along with > this container list so that AM can know how much time it will get to do a > graceful shutdown to its containers (assuming one of preemption policy is > loaded in AM). > This will help in decommissioning NM scenarios, where NM will be > decommissioned after a timeout (also killing containers on it). This timeout > will be helpful to indicate AM that those containers can be killed by RM > forcefully after the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug
[ https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741314#comment-14741314 ] Anubhav Dhoot commented on YARN-4115: - The test failure looks unrelated. > Reduce loglevel of ContainerManagementProtocolProxy to Debug > > > Key: YARN-4115 > URL: https://issues.apache.org/jira/browse/YARN-4115 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Minor > Attachments: YARN-4115.001.patch > > > We see log spams of Aug 28, 1:57:52.441 PMINFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy > Opening proxy : :8041 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
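As a sketch of the change the issue title describes, the per-call "Opening proxy" message could be logged at DEBUG and guarded; this is illustrative, not the exact ContainerManagementProtocolProxy diff.

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ProxyLoggingSketch {
  private static final Log LOG = LogFactory.getLog(ProxyLoggingSketch.class);

  // Illustrative change only: log the per-call "Opening proxy" message at
  // DEBUG instead of INFO so it no longer floods the log on busy clusters.
  void logOpeningProxy(String nodeAddress) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Opening proxy : " + nodeAddress);
    }
  }
}
{code}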
[jira] [Assigned] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging
[ https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-3273: --- Assignee: Anubhav Dhoot (was: Rohith Sharma K S) > Improve web UI to facilitate scheduling analysis and debugging > -- > > Key: YARN-3273 > URL: https://issues.apache.org/jira/browse/YARN-3273 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Anubhav Dhoot > Fix For: 2.7.0 > > Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, > 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, > 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, > YARN-3273-am-resource-used-AND-User-limit.PNG, > YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG > > > Job may be stuck for reasons such as: > - hitting queue capacity > - hitting user-limit, > - hitting AM-resource-percentage > The first queueCapacity is already shown on the UI. > We may surface things like: > - what is user's current usage and user-limit; > - what is the AM resource usage and limit; > - what is the application's current HeadRoom; > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4145) Make RMHATestBase abstract so its not run when running all tests under that namespace
[ https://issues.apache.org/jira/browse/YARN-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741309#comment-14741309 ] Anubhav Dhoot commented on YARN-4145: - The timed out tests are not related to this base class. > Make RMHATestBase abstract so its not run when running all tests under that > namespace > - > > Key: YARN-4145 > URL: https://issues.apache.org/jira/browse/YARN-4145 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Minor > Attachments: YARN-4145.001.patch > > > Make it abstract to avoid running it as a test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4150) Failure in TestNMClient because nodereports were not available
[ https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4150: Description: Saw a failure in a test run https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/ java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210) was: Saw a failure in a test run > Failure in TestNMClient because nodereports were not available > -- > > Key: YARN-4150 > URL: https://issues.apache.org/jira/browse/YARN-4150 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > Saw a failure in a test run > https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/ > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4150) Failure in TestNMClient because nodereports were not available
[ https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4150: Attachment: YARN-4150.001.patch Simple fix to wait for nodemanagers to be up before trying to get the nodereports. > Failure in TestNMClient because nodereports were not available > -- > > Key: YARN-4150 > URL: https://issues.apache.org/jira/browse/YARN-4150 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4150.001.patch > > > Saw a failure in a test run > https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/ > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
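A sketch of the kind of wait the patch describes, polling the client until the expected number of node managers report RUNNING; the helper name and timeouts are illustrative, not the exact TestNMClient change.

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class WaitForNodeManagersSketch {
  // Poll until the expected number of node managers report RUNNING, so the
  // test never indexes into an empty node-report list. Timeouts are illustrative.
  public static void waitForNodeManagers(YarnClient yarnClient, int expected)
      throws IOException, YarnException, InterruptedException {
    long deadline = System.currentTimeMillis() + 60000L;
    while (System.currentTimeMillis() < deadline) {
      List<NodeReport> reports = yarnClient.getNodeReports(NodeState.RUNNING);
      if (reports.size() >= expected) {
        return;
      }
      Thread.sleep(100);
    }
    throw new AssertionError("Node managers did not become RUNNING in time");
  }
}
{code}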
[jira] [Created] (YARN-4150) Failure in TestNMClient because nodereports were not available
Anubhav Dhoot created YARN-4150: --- Summary: Failure in TestNMClient because nodereports were not available Key: YARN-4150 URL: https://issues.apache.org/jira/browse/YARN-4150 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Saw a failure in a test run -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug
[ https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741334#comment-14741334 ] Anubhav Dhoot commented on YARN-4115: - The test passes for me locally. Opened YARN-4150 to fix the test, which appears to have a race. > Reduce loglevel of ContainerManagementProtocolProxy to Debug > > > Key: YARN-4115 > URL: https://issues.apache.org/jira/browse/YARN-4115 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Minor > Attachments: YARN-4115.001.patch > > > We see log spams of Aug 28, 1:57:52.441 PMINFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy > Opening proxy : :8041 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4150) Failure in TestNMClient because nodereports were not available
[ https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741338#comment-14741338 ] Anubhav Dhoot commented on YARN-4150: - This is most likely due to the test reading the node reports before the nodemanagers are ready > Failure in TestNMClient because nodereports were not available > -- > > Key: YARN-4150 > URL: https://issues.apache.org/jira/browse/YARN-4150 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > Saw a failure in a test run > https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/ > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging
[ https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3273: Assignee: Rohith Sharma K S (was: Anubhav Dhoot) > Improve web UI to facilitate scheduling analysis and debugging > -- > > Key: YARN-3273 > URL: https://issues.apache.org/jira/browse/YARN-3273 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Rohith Sharma K S > Fix For: 2.7.0 > > Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, > 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, > 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, > YARN-3273-am-resource-used-AND-User-limit.PNG, > YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG > > > Job may be stuck for reasons such as: > - hitting queue capacity > - hitting user-limit, > - hitting AM-resource-percentage > The first queueCapacity is already shown on the UI. > We may surface things like: > - what is user's current usage and user-limit; > - what is the AM resource usage and limit; > - what is the application's current HeadRoom; > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Attachment: YARN-3985.001.patch Added a patch that calls into the state store, along with a unit test verifying that after recovery the new RM gets the reservations saved by the previous RM. > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739939#comment-14739939 ] Anubhav Dhoot commented on YARN-3985: - Since updateReservation does an add and a remove, we do not need to update reservation state in the state store separately. I can remove it if needed, either in this patch or a separate one. > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3985.001.patch > > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
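To make the add/remove reasoning in the comment above concrete, here is a toy sketch: if update is implemented as remove-then-add, persisting only on add and remove keeps the store consistent without a dedicated update API. The PlanSketch and ReservationStore names are hypothetical stand-ins, not the actual ReservationSystem or RMStateStore APIs from YARN-3736.
{code:java}
// Hypothetical sketch only; all names below are illustrative.
import java.util.HashMap;
import java.util.Map;

public class PlanSketch {
  /** Stand-in for the state-store reservation calls added by YARN-3736. */
  interface ReservationStore {
    void storeReservation(String planName, String reservationId, String allocation);
    void removeReservation(String planName, String reservationId);
  }

  private final String planName;
  private final ReservationStore store;
  private final Map<String, String> reservations = new HashMap<>();

  PlanSketch(String planName, ReservationStore store) {
    this.planName = planName;
    this.store = store;
  }

  void addReservation(String id, String allocation) {
    reservations.put(id, allocation);
    store.storeReservation(planName, id, allocation); // persist on add
  }

  void removeReservation(String id) {
    reservations.remove(id);
    store.removeReservation(planName, id);            // persist on remove
  }

  void updateReservation(String id, String newAllocation) {
    // Remove + add, so the store never needs a separate update operation.
    removeReservation(id);
    addReservation(id, newAllocation);
  }
}
{code}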
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739700#comment-14739700 ] Anubhav Dhoot commented on YARN-2005: - [~sunilg] that's a good suggestion. Added a follow-up for this: YARN-4143 > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4145) Make RMHATestBase abstract so its not run when running all tests under that namespace
[ https://issues.apache.org/jira/browse/YARN-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4145: Attachment: YARN-4145.001.patch > Make RMHATestBase abstract so its not run when running all tests under that > namespace > - > > Key: YARN-4145 > URL: https://issues.apache.org/jira/browse/YARN-4145 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Minor > Attachments: YARN-4145.001.patch > > > Trivial patch to make it abstract -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2005: Attachment: YARN-2005.009.patch Addressed feedback > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch, YARN-2005.009.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4144) Add NM that causes LaunchFailedTransition to blacklist
Anubhav Dhoot created YARN-4144: --- Summary: Add NM that causes LaunchFailedTransition to blacklist Key: YARN-4144 URL: https://issues.apache.org/jira/browse/YARN-4144 Project: Hadoop YARN Issue Type: Improvement Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot During discussion of YARN-2005 we need to add more cases where blacklisting can occur. This tracks adding any failures in launch via LaunchFailedTransition to also contribute to blacklisting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739708#comment-14739708 ] Anubhav Dhoot commented on YARN-2005: - Added YARN-4144 so that the node causing LaunchFailedTransition is also added to the AM blacklist. > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4145) Make RMHATestBase abstract so its not run when running all tests under that namespace
Anubhav Dhoot created YARN-4145: --- Summary: Make RMHATestBase abstract so its not run when running all tests under that namespace Key: YARN-4145 URL: https://issues.apache.org/jira/browse/YARN-4145 Project: Hadoop YARN Issue Type: Improvement Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Minor Trivial patch to make it abstract -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4145) Make RMHATestBase abstract so its not run when running all tests under that namespace
[ https://issues.apache.org/jira/browse/YARN-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4145: Description: Make it abstract to avoid running it as a test (was: Trivial patch to make it abstract) > Make RMHATestBase abstract so its not run when running all tests under that > namespace > - > > Key: YARN-4145 > URL: https://issues.apache.org/jira/browse/YARN-4145 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Minor > Attachments: YARN-4145.001.patch > > > Make it abstract to avoid running it as a test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
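The change described in this issue is small; a sketch of the idea follows, with the real base class's superclass and shared helpers omitted. This is only the shape of the fix, not the attached patch.
{code:java}
// Marking a shared test base class abstract keeps JUnit/Surefire from trying
// to instantiate and run it on its own when the whole package is executed;
// only the concrete subclasses run.
public abstract class RMHATestBase {
  // common HA setup/teardown and helpers shared by the subclasses live here
}
{code}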
[jira] [Created] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
Anubhav Dhoot created YARN-4143: --- Summary: Optimize the check for AMContainer allocation needed by blacklisting and ContainerType Key: YARN-4143 URL: https://issues.apache.org/jira/browse/YARN-4143 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot In YARN-2005 there are checks made to determine if the allocation is for an AM container. This happens in every allocate call and should be optimized away since it changes only once per SchedulerApplicationAttempt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
[ https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-4143: --- Assignee: Anubhav Dhoot > Optimize the check for AMContainer allocation needed by blacklisting and > ContainerType > -- > > Key: YARN-4143 > URL: https://issues.apache.org/jira/browse/YARN-4143 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > In YARN-2005 there are checks made to determine if the allocation is for an > AM container. This happens in every allocate call and should be optimized > away since it changes only once per SchedulerApplicationAttempt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739614#comment-14739614 ] Anubhav Dhoot commented on YARN-2005: - Hi [~kasha], thanks for your comments. 2.4 - We do not need to update the systemBlacklist, as it is updated to the complete list every time by the RMAppAttemptImpl#ScheduleTransition call. 11, 12 - The changes were needed because we now need a valid submission context for isWaitingForAMContainer. 9 - Needed by the new test added in TestAMRestart. 8.3 - Yes, I can file a follow-up for that. Addressed the rest of the comments. > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
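For context on the feature under review in this thread, a toy sketch of the core mechanism follows: count AM launch failures per node and blacklist a node once a configurable threshold is reached. The class name, threshold wiring, and String node ids are illustrative assumptions and do not reflect the structure of the actual YARN-2005 patches.
{code:java}
// Illustrative only: a per-application tracker that blacklists nodes for AM
// placement after a configurable number of failed AM attempts on that node.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AmBlacklistTracker {
  private final int failuresBeforeBlacklist;
  private final Map<String, Integer> amFailuresPerNode = new HashMap<>();
  private final Set<String> blacklistedNodes = new HashSet<>();

  public AmBlacklistTracker(int failuresBeforeBlacklist) {
    this.failuresBeforeBlacklist = failuresBeforeBlacklist;
  }

  /** Record an AM attempt failure on the given node. */
  public void onAmAttemptFailed(String nodeId) {
    int failures = amFailuresPerNode.merge(nodeId, 1, Integer::sum);
    if (failures >= failuresBeforeBlacklist) {
      blacklistedNodes.add(nodeId);
    }
  }

  /** Nodes the scheduler should avoid when placing this app's next AM attempt. */
  public Set<String> getBlacklistForAmPlacement() {
    return new HashSet<>(blacklistedNodes);
  }
}
{code}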
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739618#comment-14739618 ] Anubhav Dhoot commented on YARN-2005: - [~He Tianyi] yes we are using the ContainerExitStatus in this. We can refine the conditions in a followup if needed. > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Component/s: (was: fairscheduler) (was: capacityscheduler) > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug
[ https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4115: Attachment: YARN-4115.001.patch Change the default log level to Debug > Reduce loglevel of ContainerManagementProtocolProxy to Debug > > > Key: YARN-4115 > URL: https://issues.apache.org/jira/browse/YARN-4115 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Minor > Attachments: YARN-4115.001.patch > > > We see log spams of Aug 28, 1:57:52.441 PMINFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy > Opening proxy : :8041 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
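The attached change is essentially a log-level demotion; a hedged sketch of the pattern follows, using a stand-in class and variable name rather than the real ContainerManagementProtocolProxy method.
{code:java}
// Sketch of the change: demote the per-container "Opening proxy" message from
// INFO to DEBUG so busy clients do not flood logs at the default level.
// The variable name cmAddr stands in for the real bind-address value.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ProxyLoggingSketch {
  private static final Log LOG = LogFactory.getLog(ProxyLoggingSketch.class);

  void logProxyOpen(String cmAddr) {
    // Before: LOG.info("Opening proxy : " + cmAddr);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Opening proxy : " + cmAddr);
    }
  }
}
{code}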
[jira] [Created] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug
Anubhav Dhoot created YARN-4115: --- Summary: Reduce loglevel of ContainerManagementProtocolProxy to Debug Key: YARN-4115 URL: https://issues.apache.org/jira/browse/YARN-4115 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Minor We see log spams of Aug 28, 1:57:52.441 PM INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy Opening proxy : :8041 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3676) Disregard 'assignMultiple' directive while scheduling apps with NODE_LOCAL resource requests
[ https://issues.apache.org/jira/browse/YARN-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731244#comment-14731244 ] Anubhav Dhoot commented on YARN-3676: - Thanks [~asuresh] for working on this. I see the patch continues assigning on the node if *any* app has a specific request on that node. But the scheduling attempt (via queueMgr.getRootQueue().assignContainer(node)) does not restrict which apps will get an allocation on that node. So one could end up assigning the next container on the node to an app that does not have a specific request for that node. I see two choices. a) Smaller change - allow subsequent assignments only for apps with node-local requests on that node; you already have that list in the map. This can end up prioritizing an application's node-local request over other applications. b) Bigger change - once we have picked the app based on priority, allow it to assign multiple containers if it has multiple node-local requests for that node. > Disregard 'assignMultiple' directive while scheduling apps with NODE_LOCAL > resource requests > > > Key: YARN-3676 > URL: https://issues.apache.org/jira/browse/YARN-3676 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Arun Suresh >Assignee: Arun Suresh > Attachments: YARN-3676.1.patch, YARN-3676.2.patch, YARN-3676.3.patch, > YARN-3676.4.patch, YARN-3676.5.patch > > > AssignMultiple is generally set to false to prevent overloading a Node (for > eg, new NMs that have just joined) > A possible scheduling optimization would be to disregard this directive for > apps whose allowed locality is NODE_LOCAL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
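To make option (a) from the comment above concrete, a simplified sketch follows; the App interface and method names are placeholders, not FairScheduler internals.
{code:java}
// Illustrative sketch of option (a): once assignMultiple is being overridden
// for a node, only apps with a pending NODE_LOCAL request on that node are
// considered for the extra assignments. All types are stand-ins.
import java.util.List;

public class NodeLocalAssignSketch {
  interface App {
    boolean hasPendingNodeLocalRequest(String nodeId);
    boolean assignContainerOn(String nodeId); // true if something was assigned
  }

  /** Keep assigning on this node, but only to apps with node-local demand there. */
  static int assignNodeLocalOnly(String nodeId, List<App> appsByPriority, int maxAssign) {
    int assigned = 0;
    for (App app : appsByPriority) {
      if (assigned >= maxAssign) {
        break;
      }
      if (app.hasPendingNodeLocalRequest(nodeId) && app.assignContainerOn(nodeId)) {
        assigned++;
      }
    }
    return assigned;
  }
}
{code}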
[jira] [Commented] (YARN-4087) Set YARN_FAIL_FAST to be false by default
[ https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729422#comment-14729422 ] Anubhav Dhoot commented on YARN-4087: - In general, if we are not failing the daemon when the fail-fast flag is false, we still need to ensure we are not leaving inconsistent state in the RM. For example, in YARN-4032. YARN-2019 is the other case, where we did not need to do anything. This means every patch from now on that relies on fail-fast being off to avoid crashing the daemon should consider taking corrective action to ensure correctness. Does that make sense? > Set YARN_FAIL_FAST to be false by default > - > > Key: YARN-4087 > URL: https://issues.apache.org/jira/browse/YARN-4087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-4087.1.patch, YARN-4087.2.patch > > > Increasingly, I feel setting this property to be false makes more sense > especially in production environment, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
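A sketch of the pattern the comment above argues for: when fail-fast is off, the code still has to take a corrective action rather than merely skipping the crash. The configuration key string and method names here are assumptions for illustration, not code from the attached patches.
{code:java}
// Illustrative sketch: when a state-store operation fails and fail-fast is
// disabled, the RM must still reconcile its state so it does not keep running
// with an inconsistent view.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ExitUtil;

public class FailFastSketch {
  static final String RM_FAIL_FAST = "yarn.resourcemanager.fail-fast"; // assumed key

  void onStateStoreError(Configuration conf, Exception cause) {
    if (conf.getBoolean(RM_FAIL_FAST, false)) {
      // Old behavior: crash the daemon and let an operator / HA failover recover.
      ExitUtil.terminate(1, "State store operation failed: " + cause);
    } else {
      // New default: stay up, but actively repair in-memory state (retry, roll
      // back the partial update, or mark the app for re-recovery) rather than
      // silently continuing.
      reconcileStateAfterFailure(cause);
    }
  }

  private void reconcileStateAfterFailure(Exception cause) {
    // Placeholder for the corrective action; the right fix depends on the
    // operation that failed (e.g. the case discussed in YARN-4032).
  }
}
{code}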