[jira] [Comment Edited] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label
[ https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938302#comment-16938302 ] Bibin Chundatt edited comment on YARN-9730 at 9/26/19 5:58 AM: --- [~jhung] Thank you for working on this. Sorry to come in really late too. {quote} 240 if (ResourceRequest.ANY.equals(req.getResourceName())) { 241 SchedulerUtils.enforcePartitionExclusivity(req, 242 getRmContext().getExclusiveEnforcedPartitions(), 243 asc.getNodeLabelExpression()); 244 } {quote} The configuration query on the AM allocation flow is going to be costly, which I observed while evaluating performance. Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since this is going to be invoked for every *request*? was (Author: bibinchundatt): [~jhung] Thank you for working on this. Sorry to come in really late too. {quote} 240 if (ResourceRequest.ANY.equals(req.getResourceName())) { 241 SchedulerUtils.enforcePartitionExclusivity(req, 242 getRmContext().getExclusiveEnforcedPartitions(), 243 asc.getNodeLabelExpression()); 244 } {quote} The configuration query on the AM allocation flow is going to be costly, which I observed while evaluating performance. Could you optimize {getRmContext().getExclusiveEnforcedPartitions()}, since this is going to be invoked for every *request*? > Support forcing configured partitions to be exclusive based on app node label > - > > Key: YARN-9730 > URL: https://issues.apache.org/jira/browse/YARN-9730 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.addendum, > YARN-9730.001.patch, YARN-9730.002.addendum, YARN-9730.002.patch, > YARN-9730.003.patch > > > Use case: queue X has all of its workload in non-default (exclusive) > partition P (by setting app submission context's node label set to P). Node > in partition Q != P heartbeats to RM. Capacity scheduler loops through every > application in X, and every scheduler key in this application, and fails to > allocate each time since the app's requested label and the node's label don't > match. This causes huge performance degradation when the number of apps in X is > large. > To fix the issue, allow RM to configure partitions as "forced-exclusive". If > partition P is "forced-exclusive", then: > * 1a. If app sets its submission context's node label to P, all its resource > requests will be overridden to P > * 1b. If app sets its submission context's node label Q, any of its resource > requests whose labels are P will be overridden to Q > * 2. In the scheduler, we add apps with node label expression P to a > separate data structure. When a node in partition P heartbeats to scheduler, > we only try to schedule apps in this data structure. When a node in partition > Q heartbeats to scheduler, we schedule the rest of the apps as normal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label
[ https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938302#comment-16938302 ] Bibin Chundatt commented on YARN-9730: -- [~jhung] Thank you for working on this. Sorry to come in really late too. {quote} 240 if (ResourceRequest.ANY.equals(req.getResourceName())) { 241 SchedulerUtils.enforcePartitionExclusivity(req, 242 getRmContext().getExclusiveEnforcedPartitions(), 243 asc.getNodeLabelExpression()); 244 } {quote} The configuration query on the AM allocation flow is going to be costly, which I observed while evaluating performance. Could you optimize {getRmContext().getExclusiveEnforcedPartitions()}, since this is going to be invoked for every *request*? > Support forcing configured partitions to be exclusive based on app node label > - > > Key: YARN-9730 > URL: https://issues.apache.org/jira/browse/YARN-9730 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.addendum, > YARN-9730.001.patch, YARN-9730.002.addendum, YARN-9730.002.patch, > YARN-9730.003.patch > > > Use case: queue X has all of its workload in non-default (exclusive) > partition P (by setting app submission context's node label set to P). Node > in partition Q != P heartbeats to RM. Capacity scheduler loops through every > application in X, and every scheduler key in this application, and fails to > allocate each time since the app's requested label and the node's label don't > match. This causes huge performance degradation when the number of apps in X is > large. > To fix the issue, allow RM to configure partitions as "forced-exclusive". If > partition P is "forced-exclusive", then: > * 1a. If app sets its submission context's node label to P, all its resource > requests will be overridden to P > * 1b. If app sets its submission context's node label Q, any of its resource > requests whose labels are P will be overridden to Q > * 2. In the scheduler, we add apps with node label expression P to a > separate data structure. When a node in partition P heartbeats to scheduler, > we only try to schedule apps in this data structure. When a node in partition > Q heartbeats to scheduler, we schedule the rest of the apps as normal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9858: - Description: Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a hot code path; we need to optimize it. Since AMS allocate is invoked by multiple handlers, locking on the conf will occur: {code} java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) - waiting to lock <0x7f1f8107c748> (a org.apache.hadoop.yarn.conf.YarnConfiguration) at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) {code} was:Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a hot code path, need to optimize it. > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path; we need to optimize it. > Since AMS allocate is invoked by multiple handlers, locking on the conf will occur: > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
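To illustrate the optimization discussed in this issue: a minimal sketch of the caching idea, in which the partition set is parsed from the configuration once at init time and the hot allocate path only performs a volatile read. The class, field, and configuration key names below are hypothetical stand-ins, not the actual RMContextImpl code.
{code}
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: cache the parsed partition set so the per-request
// hot path never calls into the synchronized Configuration getters.
public class ExclusivePartitionsCache {
  // Illustrative key; the real YarnConfiguration key may differ.
  static final String KEY = "yarn.scheduler.exclusive-enforced-partitions";

  private volatile Set<String> cached = Collections.emptySet();

  // Called once from service init, before multiple handlers start running.
  public void init(Configuration conf) {
    String[] parts = conf.getTrimmedStrings(KEY); // single conf read
    cached = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(parts)));
  }

  // Hot path: a plain volatile read, no lock contention on the conf.
  public Set<String> getExclusiveEnforcedPartitions() {
    return cached;
  }
}
{code}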
[jira] [Comment Edited] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label
[ https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938302#comment-16938302 ] Bibin Chundatt edited comment on YARN-9730 at 9/26/19 6:00 AM: --- [~jhung] Thank you for working on this. Sorry to come in really late too. {code} 240 if (ResourceRequest.ANY.equals(req.getResourceName())) { 241 SchedulerUtils.enforcePartitionExclusivity(req, 242 getRmContext().getExclusiveEnforcedPartitions(), 243 asc.getNodeLabelExpression()); 244 } {code} The configuration query on the AM allocation flow is going to be costly, which I observed while evaluating performance. Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since this is going to be invoked for every *request*? was (Author: bibinchundatt): [~jhung] Thank you for working on this. Sorry to come in really late too. {quote} 240 if (ResourceRequest.ANY.equals(req.getResourceName())) { 241 SchedulerUtils.enforcePartitionExclusivity(req, 242 getRmContext().getExclusiveEnforcedPartitions(), 243 asc.getNodeLabelExpression()); 244 } {quote} The configuration query on the AM allocation flow is going to be costly, which I observed while evaluating performance. Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since this is going to be invoked for every *request*? > Support forcing configured partitions to be exclusive based on app node label > - > > Key: YARN-9730 > URL: https://issues.apache.org/jira/browse/YARN-9730 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.addendum, > YARN-9730.001.patch, YARN-9730.002.addendum, YARN-9730.002.patch, > YARN-9730.003.patch > > > Use case: queue X has all of its workload in non-default (exclusive) > partition P (by setting app submission context's node label set to P). Node > in partition Q != P heartbeats to RM. Capacity scheduler loops through every > application in X, and every scheduler key in this application, and fails to > allocate each time since the app's requested label and the node's label don't > match. This causes huge performance degradation when the number of apps in X is > large. > To fix the issue, allow RM to configure partitions as "forced-exclusive". If > partition P is "forced-exclusive", then: > * 1a. If app sets its submission context's node label to P, all its resource > requests will be overridden to P > * 1b. If app sets its submission context's node label Q, any of its resource > requests whose labels are P will be overridden to Q > * 2. In the scheduler, we add apps with node label expression P to a > separate data structure. When a node in partition P heartbeats to scheduler, > we only try to schedule apps in this data structure. When a node in partition > Q heartbeats to scheduler, we schedule the rest of the apps as normal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission
[ https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9738: - Parent: YARN-9871 Issue Type: Sub-task (was: Bug) > Remove lock on ClusterNodeTracker#getNodeReport as it blocks application > submission > --- > > Key: YARN-9738 > URL: https://issues.apache.org/jira/browse/YARN-9738 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-9738-001.patch, YARN-9738-002.patch, > YARN-9738-003.patch > > > *Env :* > Server OS :- UBUNTU > No. of Cluster Node:- 9120 NMs > Env Mode:- [Secure / Non secure]Secure > *Preconditions:* > ~9120 NMs were running > ~1250 applications were in running state > 35K applications were in pending state > *Test Steps:* > 1. Submit the application from 5 clients, each client 2 threads and total 10 > queues > 2. Once application submission increases (each application of > distributed shell will call getClusterNodes) > *ClientRMService#getClusterNodes tries to get > ClusterNodeTracker#getNodeReport where the map nodes is locked.* > {quote} > "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 > tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f759f6d8858> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at > java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792) > {quote} > *Instead we can make nodes a ConcurrentHashMap and remove the read lock* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
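A rough sketch of the direction proposed in that description, with the read/write-locked map replaced by a ConcurrentHashMap so report reads never block; the generic type stands in for the real SchedulerNode, and this is not the actual ClusterNodeTracker code.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the proposal: track nodes in a ConcurrentHashMap so that
// getNode()/getAllNodes() need no read lock and client RPC handlers
// never park behind scheduler updates. N stands in for SchedulerNode.
class NodeTrackerSketch<N> {
  private final ConcurrentMap<String, N> nodes = new ConcurrentHashMap<>();

  void addNode(String nodeId, N node) {
    nodes.put(nodeId, node);
  }

  void removeNode(String nodeId) {
    nodes.remove(nodeId);
  }

  // Lock-free read for a single node report.
  N getNode(String nodeId) {
    return nodes.get(nodeId);
  }

  // Weakly consistent snapshot of all nodes.
  List<N> getAllNodes() {
    return new ArrayList<>(nodes.values());
  }
}
{code}
A weakly consistent snapshot is acceptable here because a node report is a point-in-time view anyway.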
[jira] [Updated] (YARN-9872) DecommissioningNodesWatcher#update blocks the heartbeat processing
[ https://issues.apache.org/jira/browse/YARN-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9872: - Parent: YARN-9871 Issue Type: Sub-task (was: Bug) > DecommissioningNodesWatcher#update blocks the heartbeat processing > -- > > Key: YARN-9872 > URL: https://issues.apache.org/jira/browse/YARN-9872 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Priority: Major > > ResourceTrackerService handlers are getting blocked due to the synchronisation > at DecommissioningNodesWatcher#update -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9872) DecommissioningNodesWatcher#update blocks the heartbeat processing
Bibin Chundatt created YARN-9872: Summary: DecommissioningNodesWatcher#update blocks the heartbeat processing Key: YARN-9872 URL: https://issues.apache.org/jira/browse/YARN-9872 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin Chundatt ResourceTrackerService handlers are getting blocked due to the synchronisation at DecommissioningNodesWatcher#update -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
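One possible shape of a fix, sketched with per-node entries in a ConcurrentHashMap so heartbeats for different nodes do not serialize on a single synchronized update(); the state fields and method shape are simplified assumptions, not the real DecommissioningNodesWatcher internals.
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: per-node state in a ConcurrentHashMap so ResourceTrackerService
// handlers for different nodes contend only on the map segment for their
// own node instead of one watcher-wide lock. Fields are illustrative.
class DecommissioningWatcherSketch {
  static final class NodeState {
    volatile long lastHeartbeatMillis;
    volatile int runningContainers;
  }

  private final ConcurrentMap<String, NodeState> tracked =
      new ConcurrentHashMap<>();

  // Called from heartbeat handlers; no synchronized method needed.
  void update(String nodeId, int runningContainers) {
    NodeState state = tracked.computeIfAbsent(nodeId, id -> new NodeState());
    state.lastHeartbeatMillis = System.currentTimeMillis();
    state.runningContainers = runningContainers;
  }

  void remove(String nodeId) {
    tracked.remove(nodeId);
  }
}
{code}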
[jira] [Created] (YARN-9871) Miscellaneous scalability improvement
Bibin Chundatt created YARN-9871: Summary: Miscellaneous scalability improvement Key: YARN-9871 URL: https://issues.apache.org/jira/browse/YARN-9871 Project: Hadoop YARN Issue Type: Improvement Reporter: Bibin Chundatt This Jira is to group the issues observed during SLS testing and the improvements required -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9618: - Parent: YARN-9871 Issue Type: Sub-task (was: Improvement) > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Priority: Critical > > In the current implementation, NodeListManager events block the async dispatcher and can > cause an RM crash and slow down event processing. > # Cluster restart with 1K running apps: each usable event will create 1K > events; overall it could be 5K*1K events for a 5K-node cluster > # Event processing is blocked till new events are added to the queue. > Solution: > # Add another async event handler similar to the scheduler's. > # Instead of adding events to the dispatcher directly, call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
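A sketch of the "another async event handler" idea from that description: the per-app fan-out work is pushed onto a dedicated queue and thread so the main dispatcher is never blocked while thousands of events are generated. All names here are illustrative, not the actual NodeListManager code.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: drain node-usable/unusable fan-out on a dedicated thread so the
// main AsyncDispatcher only pays for cheap queue adds per node event.
class NodeEventFanOutSketch {
  interface AppEventHandler { void handle(String appId, boolean nodeUsable); }

  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread worker;

  NodeEventFanOutSketch() {
    worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          queue.take().run(); // fan-out happens off the main dispatcher
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "node-event-fanout");
    worker.setDaemon(true);
    worker.start();
  }

  // Called from the main dispatcher: enqueue work, never dispatch inline.
  void nodeUsabilityChanged(Iterable<String> runningApps, boolean usable,
      AppEventHandler handler) {
    for (String appId : runningApps) {
      queue.add(() -> handler.handle(appId, usable));
    }
  }
}
{code}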
[jira] [Updated] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling
[ https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9830: - Parent: YARN-9871 Issue Type: Sub-task (was: Bug) > Improve ContainerAllocationExpirer it blocks scheduling > --- > > Key: YARN-9830 > URL: https://issues.apache.org/jira/browse/YARN-9830 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Priority: Critical > Labels: perfomance > > {quote} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106) > - waiting to lock <0x7fa348749550> (a > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > - locked <0x7fc8852f8200> (a > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
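The contention in this stack trace comes from the monitor-wide lock taken by register(). A rough sketch of a lower-contention variant keeps the expiry timestamps in a ConcurrentHashMap; this is an assumption-laden simplification, not the actual AbstractLivelinessMonitor code.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: lock-free register/unregister for a liveliness monitor so state
// machine transitions never block on one synchronized monitor. The periodic
// sweep iterates the map weakly-consistently; the real code would dispatch
// an expiry event for each removed entry. O is the tracked object type.
class LivelinessMonitorSketch<O> {
  private final Map<O, Long> runningObjects = new ConcurrentHashMap<>();
  private final long expireIntervalMillis;

  LivelinessMonitorSketch(long expireIntervalMillis) {
    this.expireIntervalMillis = expireIntervalMillis;
  }

  void register(O obj) {
    runningObjects.put(obj, System.currentTimeMillis()); // no synchronized
  }

  void unregister(O obj) {
    runningObjects.remove(obj);
  }

  void receivedPing(O obj) {
    runningObjects.replace(obj, System.currentTimeMillis());
  }

  // Invoked from a timer thread.
  void sweep() {
    long now = System.currentTimeMillis();
    runningObjects.entrySet().removeIf(
        e -> now - e.getValue() > expireIntervalMillis);
  }
}
{code}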
[jira] [Updated] (YARN-9831) NMTokenSecretManagerInRM#createNMToken blocks ApplicationMasterService allocate flow
[ https://issues.apache.org/jira/browse/YARN-9831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9831: - Parent: YARN-9871 Issue Type: Sub-task (was: Improvement) > NMTokenSecretManagerInRM#createNMToken blocks ApplicationMasterService > allocate flow > > > Key: YARN-9831 > URL: https://issues.apache.org/jira/browse/YARN-9831 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Priority: Critical > > Currently an attempt's NMToken cannot be generated independently. > Each attempt's allocate flow blocks the others. We should improve this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
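The per-attempt independence asked for here could look roughly like the following: track the nodes seen by each attempt in its own concurrent set, so two attempts' allocate flows only contend when they touch the same attempt. The types and key handling are simplified assumptions, not the real secret manager code.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: per-attempt node sets in a ConcurrentHashMap so deciding whether
// a new NMToken is needed for one attempt never blocks another attempt.
class NMTokenTrackerSketch {
  private final ConcurrentHashMap<String, Set<String>> attemptToNodes =
      new ConcurrentHashMap<>();

  // Returns true if an NMToken must be created for this attempt/node pair.
  boolean recordNodeForAttempt(String appAttemptId, String nodeId) {
    Set<String> nodes = attemptToNodes.computeIfAbsent(
        appAttemptId, id -> ConcurrentHashMap.newKeySet());
    return nodes.add(nodeId); // contention scoped to this attempt's set
  }

  void attemptFinished(String appAttemptId) {
    attemptToNodes.remove(appAttemptId);
  }
}
{code}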
[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939136#comment-16939136 ] Bibin Chundatt commented on YARN-9858: -- [~jhung] The patch could cause *exclusiveEnforcedPartitions* to get set multiple times in case of concurrent execution. It's a possibility since it's invoked by multiple handlers. An alternative could be to set *exclusiveEnforcedPartitions* after the creation of the RMContext at # ResourceManager#serviceInit # ResourceManager#resetRMContext All the active services would be in the STOPPED state when we set it too. Thoughts? > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858.001.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path; we need to optimize it. > Since AMS allocate is invoked by multiple handlers, locking on the conf will occur: > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939161#comment-16939161 ] Bibin Chundatt commented on YARN-9858: -- +1 for approach. > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858.001.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path, need to optimize it . > Since AMS allocate invoked by multiple handlers locking on conf will occur > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939136#comment-16939136 ] Bibin Chundatt edited comment on YARN-9858 at 9/27/19 5:46 AM: --- [~jhung] The patch could cause *exclusiveEnforcedPartitions* to get set multiple times in case of concurrent execution. It's a possibility since it's invoked by multiple handlers. An alternative could be to set *exclusiveEnforcedPartitions* after the creation of the RMContext at # ResourceManager#serviceInit # ResourceManager#resetRMContext All active services would be in the NEW state when we set it too. Thoughts? was (Author: bibinchundatt): [~jhung] The patch could cause *exclusiveEnforcedPartitions* to get set multiple times in case of concurrent execution. It's a possibility since it's invoked by multiple handlers. An alternative could be to set *exclusiveEnforcedPartitions* after the creation of the RMContext at # ResourceManager#serviceInit # ResourceManager#resetRMContext All the active services would be in the STOPPED state when we set it too. Thoughts? > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858.001.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path; we need to optimize it. > Since AMS allocate is invoked by multiple handlers, locking on the conf will occur: > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939177#comment-16939177 ] Bibin Chundatt commented on YARN-9858: -- Overall the patch looks good to me. Minor query: {code} 3803 if (conf == null) { 3804 return new HashSet<>(); 3805 } {code} Is the check really required? > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858.001.patch, YARN-9858.002.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path; we need to optimize it. > Since AMS allocate is invoked by multiple handlers, locking on the conf will occur: > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling
[ https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9830: - Attachment: YARN-9830.001.patch > Improve ContainerAllocationExpirer it blocks scheduling > --- > > Key: YARN-9830 > URL: https://issues.apache.org/jira/browse/YARN-9830 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Priority: Critical > Labels: perfomance > Attachments: YARN-9830.001.patch > > > {quote} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106) > - waiting to lock <0x7fa348749550> (a > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > - locked <0x7fc8852f8200> (a > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940655#comment-16940655 ] Bibin Chundatt edited comment on YARN-9858 at 9/30/19 5:03 AM: --- Thank you [~jhung] +1 LGTM for latest patches. I will commit it by EOD if no objections. was (Author: bibinchundatt): Thank yoy [~jhung] +1 LGTM for latest patches. I will commit it by EOD if no objections. > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858-branch-2.001.patch, > YARN-9858-branch-3.1.001.patch, YARN-9858-branch-3.2.001.patch, > YARN-9858.001.patch, YARN-9858.002.patch, YARN-9858.003.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path, need to optimize it . > Since AMS allocate invoked by multiple handlers locking on conf will occur > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940655#comment-16940655 ] Bibin Chundatt commented on YARN-9858: -- Thank yoy [~jhung] +1 LGTM for latest patches. I will commit it by EOD if no objections. > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858-branch-2.001.patch, > YARN-9858-branch-3.1.001.patch, YARN-9858-branch-3.2.001.patch, > YARN-9858.001.patch, YARN-9858.002.patch, YARN-9858.003.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path, need to optimize it . > Since AMS allocate invoked by multiple handlers locking on conf will occur > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions
[ https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939245#comment-16939245 ] Bibin Chundatt commented on YARN-9858: -- I think we should fix the test case. Setting the conf on the RMContext should solve it: {code} RMContext rmContext = mockRMContext(10, now - 2); Configuration conf = new YarnConfiguration(); ((RMContextImpl)rmContext).setYarnConfiguration(conf); {code} Also, please attach a patch for branch-2 too, to trigger Jenkins. > Optimize RMContext getExclusiveEnforcedPartitions > -- > > Key: YARN-9858 > URL: https://issues.apache.org/jira/browse/YARN-9858 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Labels: release-blocker > Attachments: YARN-9858.001.patch, YARN-9858.002.patch > > > Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a > hot code path; we need to optimize it. > Since AMS allocate is invoked by multiple handlers, locking on the conf will occur: > {code} > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940667#comment-16940667 ] Bibin Chundatt commented on YARN-2368: -- [~zhuqi] In case you want to set "jute.maxbuffer", you could probably make use of *YARN_RESOURCEMANAGER_OPTS*. At application submission time the znode size is limited by YARN-5006. IIUC, YARN-2962 helps in limiting the number of nodes under one znode hierarchy. For attempt-level updates, some discussion is already happening in YARN-9847. > ResourceManager failed when ZKRMStateStore tries to update znode data larger > than 1MB > - > > Key: YARN-2368 > URL: https://issues.apache.org/jira/browse/YARN-2368 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo >Assignee: zhuqi >Priority: Critical > Attachments: YARN-2368.patch > > > Both ResourceManagers throw out STATE_STORE_OP_FAILED events and failed > finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode > larger than 1MB, which is the default configuration of ZooKeeper server and > client in 'jute.maxbuffer'. > ResourceManager (ip addr: 10.153.80.8) log shows the following: > {code} > 2014-07-25 22:33:11,078 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2014-07-25 22:33:11,078 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2014-07-25 22:33:11,214 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for > /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} > Meanwhile, ZooKeeper's log shows the following: > {code} 
> 2014-07-25 22:10:09,728 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - > Accepted socket connection from /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client > attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating > client: 0x247684586e70006 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session > 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth > packet /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth > success /10.153.80.8:58890 > 2014-07-25 22:10:09,742
[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940679#comment-16940679 ] Bibin Chundatt commented on YARN-9847: -- [~suxingfate] Does the issue still persist after configuring YARN-6125 & YARN-6967? The way I understand it, the diagnostics size is limited by the above two at the attempt level. > ZKRMStateStore will cause zk connection loss when writing huge data into znode > -- > > Key: YARN-9847 > URL: https://issues.apache.org/jira/browse/YARN-9847 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wang, Xinglong >Assignee: Wang, Xinglong >Priority: Minor > Attachments: YARN-9847.001.patch, YARN-9847.002.patch > > > Recently, we encountered an RM ZK connection issue because the RM was trying to write > huge data into a znode. This behavior will make zk report a Len error and then cause > zk session connection loss. Eventually the RM would crash due to the zk > connection issue. > *The fix* > In order to protect the ResourceManager from crashing due to this, > this fix limits the size of data per attempt by limiting the > diagnostic info when writing ApplicationAttemptStateData into the znode. The size > will be regulated by -Djute.maxbuffer set in yarn-env.sh. The same value will > also be used by the zookeeper server. > *The story* > ResourceManager Log > {code:java} > 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session > 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, > unexpected error, closing socket connection and attempting reconnect > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) > at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) > at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > 2019-07-29 04:27:35,459 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at >
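The diagnostics-limiting idea referenced here (the territory of YARN-6125/YARN-6967) boils down to capping the diagnostic string before it is serialized into the attempt znode. A simplified sketch follows, with an illustrative margin and truncation policy; this is not the actual ZKRMStateStore code.
{code}
import java.nio.charset.StandardCharsets;

// Sketch: cap diagnostic info before it goes into the attempt znode,
// keeping the tail (usually the most relevant part of a failure).
// The 1 MB figure mirrors ZooKeeper's default jute.maxbuffer; the margin
// for other serialized fields is an illustrative assumption.
final class DiagnosticsTruncator {
  static final int ZNODE_LIMIT_BYTES = 1024 * 1024; // default jute.maxbuffer
  static final int MARGIN_BYTES = 64 * 1024;        // room for other fields

  static String truncate(String diagnostics) {
    byte[] bytes = diagnostics.getBytes(StandardCharsets.UTF_8);
    int maxBytes = ZNODE_LIMIT_BYTES - MARGIN_BYTES;
    if (bytes.length <= maxBytes) {
      return diagnostics;
    }
    String marker = "...(truncated)...";
    int keep = maxBytes - marker.getBytes(StandardCharsets.UTF_8).length;
    // May split a multi-byte character at the boundary; acceptable for logs.
    String tail = new String(bytes, bytes.length - keep, keep,
        StandardCharsets.UTF_8);
    return marker + tail;
  }
}
{code}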
[jira] [Resolved] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt resolved YARN-9847. -- Resolution: Invalid As confirmed by [~suxingfate], the issue is already fixed by YARN-6967. Closing as invalid. Please do reopen if required. > ZKRMStateStore will cause zk connection loss when writing huge data into znode > -- > > Key: YARN-9847 > URL: https://issues.apache.org/jira/browse/YARN-9847 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wang, Xinglong >Assignee: Wang, Xinglong >Priority: Minor > Attachments: YARN-9847.001.patch, YARN-9847.002.patch > > > Recently, we encountered an RM ZK connection issue because the RM was trying to write > huge data into a znode. This behavior will make zk report a Len error and then cause > zk session connection loss. Eventually the RM would crash due to the zk > connection issue. > *The fix* > In order to protect the ResourceManager from crashing due to this, > this fix limits the size of data per attempt by limiting the > diagnostic info when writing ApplicationAttemptStateData into the znode. The size > will be regulated by -Djute.maxbuffer set in yarn-env.sh. The same value will > also be used by the zookeeper server. > *The story* > ResourceManager Log > {code:java} > 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session > 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, > unexpected error, closing socket connection and attempting reconnect > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) > at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) > at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > 2019-07-29 04:27:35,459 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at >
[jira] [Commented] (YARN-9768) RM Renew Delegation token thread should timeout and retry
[ https://issues.apache.org/jira/browse/YARN-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940899#comment-16940899 ] Bibin Chundatt commented on YARN-9768: -- [~maniraj...@gmail.com] Thank you for working on this. Major comment: {code} 215 future = renewerService.submit(new DelegationTokenRenewerRunnable(evt)); 216 future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS); {code} IIUC the above implementation would reduce the multi-threaded renewal to a single thread, since get() is going to be a blocking call. > RM Renew Delegation token thread should timeout and retry > - > > Key: YARN-9768 > URL: https://issues.apache.org/jira/browse/YARN-9768 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: CR Hota >Priority: Major > Attachments: YARN-9768.001.patch, YARN-9768.002.patch, > YARN-9768.003.patch > > > Delegation token renewer thread in RM (DelegationTokenRenewer.java) renews > HDFS tokens received to check for validity and expiration time. > This call is made to an underlying HDFS NN or Router Node (which has exact > APIs as HDFS NN). If one of the nodes is bad and the renew call is stuck, the > thread remains stuck indefinitely. The thread should ideally time out the > renewToken call and retry from the client's perspective. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
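To keep renewal multi-threaded while still enforcing a timeout, the blocking future.get(...) can be replaced with a scheduled cancel of the future, as sketched below; the pool sizes and names are illustrative assumptions, not the actual DelegationTokenRenewer code.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: enforce a renew timeout without serializing all renewals.
// The submitting thread never blocks; a scheduler cancels stragglers.
class TimedRenewerSketch {
  private final ExecutorService renewerPool = Executors.newFixedThreadPool(8);
  private final ScheduledExecutorService timeoutScheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long timeoutMillis;

  TimedRenewerSketch(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  void submitRenewal(Runnable renewTask) {
    Future<?> future = renewerPool.submit(renewTask);
    // Interrupt the renewal if it exceeds the timeout, instead of calling
    // future.get(timeout, ...) on the dispatching thread.
    timeoutScheduler.schedule(() -> {
      if (!future.isDone()) {
        future.cancel(true);
      }
    }, timeoutMillis, TimeUnit.MILLISECONDS);
  }
}
{code}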
[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState
[ https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944333#comment-16944333 ] Bibin Chundatt commented on YARN-9624: -- Thank you [~BilwaST] for the patch. A few comments: * Remove the changes to ContainerPBImpl. * Add a test case to make sure that, for any new field addition, the TC fails if the util is not updated. > Use switch case for ProtoUtils#convertFromProtoFormat containerState > > > Key: YARN-9624 > URL: https://issues.apache.org/jira/browse/YARN-9624 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Major > Labels: performance > Attachments: YARN-9624.001.patch > > > On a large cluster with 100K+ containers, on every heartbeat > {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} will > be too costly. Update it with a switch case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
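The requested switch-based conversion, sketched against simplified enums; the real code maps YarnProtos.ContainerStateProto to ContainerState, but the shape is the same and it avoids the per-call string allocation of name().replace(...) plus valueOf(...).
{code}
// Sketch on simplified enums; the enum values here mirror the proto-style
// "C_" prefix but are stand-ins for the generated YarnProtos types.
class ContainerStateConvertSketch {
  enum ContainerStateProto { C_NEW, C_RUNNING, C_COMPLETE }
  enum ContainerState { NEW, RUNNING, COMPLETE }

  static ContainerState convertFromProtoFormat(ContainerStateProto e) {
    switch (e) {
    case C_NEW:
      return ContainerState.NEW;
    case C_RUNNING:
      return ContainerState.RUNNING;
    case C_COMPLETE:
      return ContainerState.COMPLETE;
    default:
      throw new IllegalArgumentException("Unmapped container state: " + e);
    }
  }
}
{code}
The default branch also gives the test hook asked for above: a test iterating all proto values will fail fast when a new state is added without updating the util.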
[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission
[ https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944336#comment-16944336 ] Bibin Chundatt commented on YARN-9738: -- [~sunilg] Could you please take a look? > Remove lock on ClusterNodeTracker#getNodeReport as it blocks application > submission > --- > > Key: YARN-9738 > URL: https://issues.apache.org/jira/browse/YARN-9738 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-9738-001.patch, YARN-9738-002.patch, > YARN-9738-003.patch > > > *Env :* > Server OS :- UBUNTU > No. of Cluster Node:- 9120 NMs > Env Mode:- [Secure / Non secure]Secure > *Preconditions:* > ~9120 NMs were running > ~1250 applications were in running state > 35K applications were in pending state > *Test Steps:* > 1. Submit the application from 5 clients, each client 2 threads and total 10 > queues > 2. Once application submission increases (each application of > distributed shell will call getClusterNodes) > *ClientRMService#getClusterNodes tries to get > ClusterNodeTracker#getNodeReport where the map nodes is locked.* > {quote} > "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 > tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f759f6d8858> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at > java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792) > {quote} > *Instead we can make nodes a ConcurrentHashMap and remove the read lock* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962888#comment-16962888 ] Bibin Chundatt commented on YARN-9940: -- [~kailiu_dev] Apologies, I thought the issue was a duplicate of YARN-8436 and that you had closed it because of that. Fix Version and Resolved are set only if the changes have gone into 3.2.0. Since that is not the case, we have to keep the issue open. Please refer to: https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute Reopening the issue. > avoid continuous scheduling thread crashes while sorting nodes get > 'Comparison method violates its general contract' > > > Key: YARN-9940 > URL: https://issues.apache.org/jira/browse/YARN-9940 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: kailiu_dev >Priority: Major > Fix For: 3.2.0 > > Attachments: 0001.patch > > > 2019-10-16 09:14:51,215 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception. > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeForceCollapse(TimSort.java:426) > at java.util.TimSort.sort(TimSort.java:223) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
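The usual cause of this TimSort failure is a comparator reading mutable node state that changes mid-sort, which makes the ordering intransitive. A sketch of the snapshot approach follows, with an illustrative NodeView type standing in for the real FSSchedulerNode; it is one known remedy, not necessarily the fix taken in the attached patch.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: sort over an immutable snapshot of each node's sort key so that
// concurrent resource updates cannot violate the comparator contract.
class NodeSortSketch {
  interface NodeView { long getAvailableMemory(); }

  static final class Snapshot {
    final NodeView node;
    final long availableMemory; // captured once, stable during the sort

    Snapshot(NodeView node) {
      this.node = node;
      this.availableMemory = node.getAvailableMemory();
    }
  }

  static List<NodeView> sortByAvailableMemoryDesc(
      List<? extends NodeView> nodes) {
    List<Snapshot> snapshots = new ArrayList<>(nodes.size());
    for (NodeView n : nodes) {
      snapshots.add(new Snapshot(n));
    }
    snapshots.sort(Comparator.comparingLong(
        (Snapshot s) -> s.availableMemory).reversed());
    List<NodeView> sorted = new ArrayList<>(snapshots.size());
    for (Snapshot s : snapshots) {
      sorted.add(s.node);
    }
    return sorted;
  }
}
{code}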
[jira] [Updated] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9940: - Fix Version/s: (was: 3.2.0) > avoid continuous scheduling thread crashes while sorting nodes get > 'Comparison method violates its general contract' > > > Key: YARN-9940 > URL: https://issues.apache.org/jira/browse/YARN-9940 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: kailiu_dev >Priority: Major > Attachments: 0001.patch > > > 2019-10-16 09:14:51,215 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception. > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeForceCollapse(TimSort.java:426) > at java.util.TimSort.sort(TimSort.java:223) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reopened YARN-9940: -- > avoid continuous scheduling thread crashes while sorting nodes get > 'Comparison method violates its general contract' > > > Key: YARN-9940 > URL: https://issues.apache.org/jira/browse/YARN-9940 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: kailiu_dev >Priority: Major > Fix For: 3.2.0 > > Attachments: 0001.patch > > > 2019-10-16 09:14:51,215 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception. > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeForceCollapse(TimSort.java:426) > at java.util.TimSort.sort(TimSort.java:223) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9697) Efficient allocation of Opportunistic containers.
[ https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971342#comment-16971342 ] Bibin Chundatt edited comment on YARN-9697 at 11/11/19 7:03 AM: [~abmodi] A few minor nits: # NodeQueueLoadMonitor: the following assignment is not required, as it is already set in the constructor. {code} private int numNodesForAnyAllocation = DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED; {code} Setting it to zero should be fine. # EnrichedResourceRequest: rename the methods since we are returning maps now. Improvement: # CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can you maintain a running metric to avoid iterating through the allocations for each scheduler key? {noformat} 152 for (List allocs : allocations.values()) { 153 totalAllocated += allocs.size(); 154 } {noformat} was (Author: bibinchundatt): [~abmodi] Few minor Nits: # NodeQueueLoadMonitor following set is not required , already getting set in constructor {code} private int numNodesForAnyAllocation = DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED; {code} # EnrichedResourceRequest : rename methods since we are returning maps now. Improvement: # CentralizedOpportunisticContainerAllocator # allocatePerSchedulerKey : Can you maintain a metrics to avoid iterating through allocations for each scheduler key {noformat} 152 for (List allocs : allocations.values()) { 153 totalAllocated += allocs.size(); 154 } {noformat} > Efficient allocation of Opportunistic containers. > - > > Key: YARN-9697 > URL: https://issues.apache.org/jira/browse/YARN-9697 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9697.001.patch, YARN-9697.002.patch, > YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, > YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, > YARN-9697.ut.patch, YARN-9697.ut2.patch, YARN-9697.wip1.patch, > YARN-9697.wip2.patch > > > In the current implementation, opportunistic containers are allocated based > on the number of queued opportunistic container information received in node > heartbeat. This information becomes stale as soon as more opportunistic > containers are allocated on that node. > Allocation of opportunistic containers happens on the same heartbeat in which > AM asks for the containers. When multiple applications request for > Opportunistic containers, containers might get allocated on the same set of > nodes as already allocated containers on the node are not considered while > serving requests from different applications. This can lead to uneven > allocation of Opportunistic containers across the cluster leading to > increased queuing time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
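A minimal sketch of the improvement suggested in the comment above (the key and container types are placeholders, not the actual CentralizedOpportunisticContainerAllocator code): maintain the total incrementally at allocation time so the quoted per-key summation loop disappears.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: keep a running total instead of summing allocs.size()
// across allocations.values() for every scheduler key.
public class AllocationTally<K, C> {
  private final Map<K, List<C>> allocations = new HashMap<>();
  private int totalAllocated = 0; // updated on each allocation, read in O(1)

  public void recordAllocation(K schedulerKey, C container) {
    allocations.computeIfAbsent(schedulerKey, k -> new ArrayList<>()).add(container);
    totalAllocated++;
  }

  public int getTotalAllocated() {
    return totalAllocated;
  }
}
{code}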
[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.
[ https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971342#comment-16971342 ] Bibin Chundatt commented on YARN-9697: -- [~abmodi] A few minor nits: # NodeQueueLoadMonitor: the following assignment is not required, as it is already set in the constructor. {code} private int numNodesForAnyAllocation = DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED; {code} # EnrichedResourceRequest: rename the methods since we are returning maps now. Improvement: # CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can you maintain a running metric to avoid iterating through the allocations for each scheduler key? {noformat} 152 for (List allocs : allocations.values()) { 153 totalAllocated += allocs.size(); 154 } {noformat} > Efficient allocation of Opportunistic containers. > - > > Key: YARN-9697 > URL: https://issues.apache.org/jira/browse/YARN-9697 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9697.001.patch, YARN-9697.002.patch, > YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, > YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, > YARN-9697.ut.patch, YARN-9697.ut2.patch, YARN-9697.wip1.patch, > YARN-9697.wip2.patch > > > In the current implementation, opportunistic containers are allocated based > on the number of queued opportunistic container information received in node > heartbeat. This information becomes stale as soon as more opportunistic > containers are allocated on that node. > Allocation of opportunistic containers happens on the same heartbeat in which > AM asks for the containers. When multiple applications request for > Opportunistic containers, containers might get allocated on the same set of > nodes as already allocated containers on the node are not considered while > serving requests from different applications. This can lead to uneven > allocation of Opportunistic containers across the cluster leading to > increased queuing time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962888#comment-16962888 ] Bibin Chundatt edited comment on YARN-9940 at 10/30/19 2:04 PM: [~kailiu_dev] Apologies, I thought this issue was a duplicate of YARN-8436 and that you had closed it based on that. The Fixed and Resolved states are set only if the changes have gone into 3.2.0. If that is not the case, we have to keep the issue open. Please refer to: https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute Reopening the issue was (Author: bibinchundatt): [~kailiu_dev] Apologies i thought issue is duplicate of YARN-8436 and you have close due to that. Fixed and resolved is only if the changes has gone into 3.2.0. Its that is not the case we have to keep the issue open. Please refer : https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute Reopening the issue > avoid continuous scheduling thread crashes while sorting nodes get > 'Comparison method violates its general contract' > > > Key: YARN-9940 > URL: https://issues.apache.org/jira/browse/YARN-9940 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: kailiu_dev >Priority: Major > Attachments: 0001.patch > > > 2019-10-16 09:14:51,215 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception. > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeForceCollapse(TimSort.java:426) > at java.util.TimSort.sort(TimSort.java:223) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961033#comment-16961033 ] Bibin Chundatt commented on YARN-2442: -- Thank you [~cyrusjackson25] for the updated patch. Overall the patch looks good to me. +1 for YARN-2442.003.patch. Will wait for a day for others to take a look. cc:// [~rohithsharma] > ResourceManager JMX UI does not give HA State > - > > Key: YARN-2442 > URL: https://issues.apache.org/jira/browse/YARN-2442 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0, 2.6.0, 2.7.0 >Reporter: Nishan Shetty >Assignee: Rohith Sharma K S >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, > YARN-2442.004.patch, YARN-2442.02.patch > > > ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, > STOPPED) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState
[ https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954285#comment-16954285 ] Bibin Chundatt commented on YARN-9624: -- [~BilwaST] Could you please update the patch? > Use switch case for ProtoUtils#convertFromProtoFormat containerState > > > Key: YARN-9624 > URL: https://issues.apache.org/jira/browse/YARN-9624 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Major > Labels: performance > Attachments: YARN-9624.001.patch > > > On a large cluster with 100K+ containers, calling > {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} on > every heartbeat will be too costly. Update it with a switch case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
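A minimal sketch of the switch-based conversion the issue asks for, assuming the proto enum constants carry the C_ prefix; the exact constant set in ProtoUtils may differ, so this is illustrative rather than the attached patch:
{code}
// Hedged sketch of ProtoUtils#convertFromProtoFormat with a switch: no string
// allocation from replace() and no valueOf() lookup on the heartbeat path.
public static ContainerState convertFromProtoFormat(ContainerStateProto e) {
  switch (e) {
  case C_NEW:
    return ContainerState.NEW;
  case C_RUNNING:
    return ContainerState.RUNNING;
  case C_COMPLETE:
    return ContainerState.COMPLETE;
  default:
    throw new IllegalArgumentException("Unknown container state: " + e);
  }
}
{code}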
[jira] [Comment Edited] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961033#comment-16961033 ] Bibin Chundatt edited comment on YARN-2442 at 10/28/19 8:32 PM: Thank you [~cyrusjackson25] for the updated patch. Overall the patch looks good to me. +1 for YARN-2442.004.patch. Will wait for a day for others to take a look. cc:// [~rohithsharma] was (Author: bibinchundatt): Thank you [~cyrusjackson25] for updated patch. Over all the patch looks good to me . +1 for YARN-2443.003.patch . Will wait for a day for others to take a look. cc:// [~rohithsharma] > ResourceManager JMX UI does not give HA State > - > > Key: YARN-2442 > URL: https://issues.apache.org/jira/browse/YARN-2442 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0, 2.6.0, 2.7.0 >Reporter: Nishan Shetty >Assignee: Rohith Sharma K S >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, > YARN-2442.004.patch, YARN-2442.02.patch > > > ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, > STOPPED) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reopened YARN-9940: -- > avoid continuous scheduling thread crashes while sorting nodes get > 'Comparison method violates its general contract' > > > Key: YARN-9940 > URL: https://issues.apache.org/jira/browse/YARN-9940 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: kailiu_dev >Priority: Major > Fix For: 3.2.0 > > Attachments: 0001.patch > > > 2019-10-16 09:14:51,215 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception. > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeForceCollapse(TimSort.java:426) > at java.util.TimSort.sort(TimSort.java:223) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt resolved YARN-9940. -- Target Version/s: (was: 2.7.2) Resolution: Duplicate > avoid continuous scheduling thread crashes while sorting nodes get > 'Comparison method violates its general contract' > > > Key: YARN-9940 > URL: https://issues.apache.org/jira/browse/YARN-9940 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: kailiu_dev >Priority: Major > Fix For: 3.2.0 > > Attachments: 0001.patch > > > 2019-10-16 09:14:51,215 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception. > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeForceCollapse(TimSort.java:426) > at java.util.TimSort.sort(TimSort.java:223) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.
[ https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956962#comment-16956962 ] Bibin Chundatt commented on YARN-9697: -- Thank you [~abmodi] for updating the patch. A few comments and suggestions: # OpportunisticContainerAllocatorAMService -> NodeQueueLoadMonitor init could be moved to AbstractService#serviceInit # NodeQueueLoadMonitor: ScheduledExecutorService#scheduledExecutor shutdown is not done # NodeQueueLoadMonitor#nodeIdsByRack: do we need the NodeIds to be sorted? # Thoughts on replacing NodeQueueLoadMonitor#addIntoNodeIdsByRack as follows {code} private void addIntoNodeIdsByRack(RMNode addedNode) { nodeIdsByRack.compute(addedNode.getRackName(), (k, v) -> v == null ? ConcurrentHashMap.newKeySet() : v).add(addedNode.getNodeID()); } {code} # We could think of replacing NodeQueueLoadMonitor#removeFromNodeIdsByRack too with computeIfPresent Not related to the patch: # OpportunisticSchedulerMetrics: shouldn't we have a destroy() method to reset the counters? During switchover I think we should reset the counters. > Efficient allocation of Opportunistic containers. > - > > Key: YARN-9697 > URL: https://issues.apache.org/jira/browse/YARN-9697 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9697.001.patch, YARN-9697.002.patch, > YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, > YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.ut.patch, > YARN-9697.ut2.patch, YARN-9697.wip1.patch, YARN-9697.wip2.patch > > > In the current implementation, opportunistic containers are allocated based > on the number of queued opportunistic container information received in node > heartbeat. This information becomes stale as soon as more opportunistic > containers are allocated on that node. > Allocation of opportunistic containers happens on the same heartbeat in which > AM asks for the containers. When multiple applications request for > Opportunistic containers, containers might get allocated on the same set of > nodes as already allocated containers on the node are not considered while > serving requests from different applications. This can lead to uneven > allocation of Opportunistic containers across the cluster leading to > increased queuing time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
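A minimal sketch of the computeIfPresent suggestion for removeFromNodeIdsByRack, mirroring the addIntoNodeIdsByRack snippet in the comment above; the surrounding field declarations are assumed to match that snippet:
{code}
// Hedged sketch: remove the node id and drop the rack entry atomically once
// its key set becomes empty; returning null from the remapping function
// removes the mapping from the ConcurrentHashMap.
private void removeFromNodeIdsByRack(RMNode removedNode) {
  nodeIdsByRack.computeIfPresent(removedNode.getRackName(), (rack, nodeIds) -> {
    nodeIds.remove(removedNode.getNodeID());
    return nodeIds.isEmpty() ? null : nodeIds;
  });
}
{code}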
[jira] [Updated] (YARN-9935) SSLHandshakeException thrown when HTTPS is enabled in AM web server in one certain condition
[ https://issues.apache.org/jira/browse/YARN-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-9935: - Component/s: (was: amrmproxy) > SSLHandshakeException thrown when HTTPS is enabled in AM web server in one > certain condition > > > Key: YARN-9935 > URL: https://issues.apache.org/jira/browse/YARN-9935 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sushanta Sen >Priority: Major > > 【Precondition】: > 1. Install the cluster > 2. *{color:#4C9AFF}WebAppProxyServer service installed in 1 VM and RMs > installed in 2 VMs{color}* > 3. Enable all the required HTTPS configuration: > yarn.resourcemanager.application-https.policy = STRICT > yarn.app.mapreduce.am.webapp.https.enabled = true > yarn.app.mapreduce.am.webapp.https.client.auth = true > 4. RM HA enabled > 5. *{color:#4C9AFF}Active RM is running in VM2, standby in VM1{color}* > 6. Cluster should be up and running > 【Test step】: > 1. Submit an application > 2. Open the ApplicationMaster link for the application ID from the RM UI > 【Expect Output】: > No error should be thrown and the job should be successful > 【Actual Output】: > SSLHandshakeException is thrown, although the job is successful. > "javax.net.ssl.SSLHandshakeException: > sun.security.validator.ValidatorException: PKIX path building failed: > sun.security.provider.certpath.SunCertPathBuilderException: unable to find > valid certification path to requested target" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9926) RM multi-thread event processing mechanism
[ https://issues.apache.org/jira/browse/YARN-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt resolved YARN-9926. -- Resolution: Duplicate > RM multi-thread event processing mechanism > -- > > Key: YARN-9926 > URL: https://issues.apache.org/jira/browse/YARN-9926 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: hcarrot >Priority: Minor > Attachments: RM multi-thread event processing mechanism.pdf > > > Recently, we have observed serious event blocking in the RM event dispatcher > queue. After analysis of RM event monitoring data and RM event processing > logic, we found that the proportion of RMNodeStatusEvent is lower than that of > other events, but its overall processing time is higher. > Meanwhile, RM event processing runs in single-thread mode, and this reduces > RM's performance. So we propose an RM multi-thread event > processing mechanism to improve RM performance. Is this mechanism feasible? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957644#comment-16957644 ] Bibin Chundatt commented on YARN-2442: -- Thank you [~cyrusjackson25] for working on the patch. Currently RMInfo holds a reference to RMContext, which could lead to a memory leak on switchover. Instead we could use the ResourceManager object directly. > ResourceManager JMX UI does not give HA State > - > > Key: YARN-2442 > URL: https://issues.apache.org/jira/browse/YARN-2442 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0, 2.6.0, 2.7.0 >Reporter: Nishan Shetty >Assignee: Rohith Sharma K S >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, > YARN-2442.02.patch > > > ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, > STOPPED) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
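A minimal sketch of the direction suggested in the comment above; the RMInfoMXBean interface and getHAState method are assumptions for illustration, not the actual patch. The idea is to hold the long-lived ResourceManager and fetch the current RMContext per call, so a context replaced during failover is never retained.
{code}
// Hedged sketch: no cached RMContext field, hence no stale reference after a
// switchover; the context is looked up from the ResourceManager on each read.
public class RMInfo implements RMInfoMXBean {
  private final ResourceManager rm; // stable across HA transitions

  RMInfo(ResourceManager rm) {
    this.rm = rm;
  }

  @Override
  public String getHAState() {
    return rm.getRMContext().getHAServiceState().toString();
  }
}
{code}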
[jira] [Comment Edited] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957644#comment-16957644 ] Bibin Chundatt edited comment on YARN-2442 at 10/23/19 8:16 AM: Thank you [~cyrusjackson25] for working on the patch. # Currently RMInfo holds a reference to RMContext, which could lead to a memory leak on switchover. Instead we could use the ResourceManager instance directly. # Fix the checkstyle issues. # The findbugs issue seems to be already fixed. was (Author: bibinchundatt): Thank you [~cyrusjackson25] for working on the patch Currently RMInfo is holding the reference of RMContext which could lead to memory leak on switch over. Instead we could use ResourceManager object directly. > ResourceManager JMX UI does not give HA State > - > > Key: YARN-2442 > URL: https://issues.apache.org/jira/browse/YARN-2442 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0, 2.6.0, 2.7.0 >Reporter: Nishan Shetty >Assignee: Rohith Sharma K S >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, > YARN-2442.02.patch > > > ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, > STOPPED) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.
[ https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972097#comment-16972097 ] Bibin Chundatt commented on YARN-9697: -- Thank you [~abmodi]. Overall the patch looks good to me. > Efficient allocation of Opportunistic containers. > - > > Key: YARN-9697 > URL: https://issues.apache.org/jira/browse/YARN-9697 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9697.001.patch, YARN-9697.002.patch, > YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, > YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, > YARN-9697.009.patch, YARN-9697.ut.patch, YARN-9697.ut2.patch, > YARN-9697.wip1.patch, YARN-9697.wip2.patch > > > In the current implementation, opportunistic containers are allocated based > on the number of queued opportunistic container information received in node > heartbeat. This information becomes stale as soon as more opportunistic > containers are allocated on that node. > Allocation of opportunistic containers happens on the same heartbeat in which > AM asks for the containers. When multiple applications request for > Opportunistic containers, containers might get allocated on the same set of > nodes as already allocated containers on the node are not considered while > serving requests from different applications. This can lead to uneven > allocation of Opportunistic containers across the cluster leading to > increased queuing time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling
[ https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reassigned YARN-9830: Assignee: Bibin Chundatt > Improve ContainerAllocationExpirer it blocks scheduling > --- > > Key: YARN-9830 > URL: https://issues.apache.org/jira/browse/YARN-9830 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Bibin Chundatt >Priority: Critical > Labels: perfomance > Attachments: YARN-9830.001.patch > > > {quote} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106) > - waiting to lock <0x7fa348749550> (a > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > - locked <0x7fc8852f8200> (a > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling
[ https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949904#comment-16949904 ] Bibin Chundatt commented on YARN-9830: -- [~sunil.gov...@gmail.com] Could you take a look? > Improve ContainerAllocationExpirer it blocks scheduling > --- > > Key: YARN-9830 > URL: https://issues.apache.org/jira/browse/YARN-9830 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Bibin Chundatt >Priority: Critical > Labels: perfomance > Attachments: YARN-9830.001.patch > > > {quote} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106) > - waiting to lock <0x7fa348749550> (a > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > - locked <0x7fc8852f8200> (a > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
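The trace in the issue above shows registrations serializing on the expirer's intrinsic lock. As a hedged sketch of one way such contention can be relieved (illustrative only, not the attached YARN-9830.001.patch), the expiry bookkeeping can move to a ConcurrentHashMap:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hedged sketch: track expiry timestamps in a ConcurrentHashMap so that
// register/unregister no longer block on a synchronized monitor method.
public class ConcurrentExpiryTracker<O> {
  private final ConcurrentHashMap<O, Long> running = new ConcurrentHashMap<>();

  public void register(O ob, long monotonicNow) {
    running.put(ob, monotonicNow); // lock-free per-key update
  }

  public void unregister(O ob) {
    running.remove(ob);
  }

  // The expiry sweep iterates weakly consistently without blocking registers.
  public void expireOlderThan(long cutoff, Consumer<O> onExpire) {
    running.forEach((ob, ts) -> {
      if (ts < cutoff && running.remove(ob, ts)) { // remove only if unchanged
        onExpire.accept(ob);
      }
    });
  }
}
{code}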
[jira] [Commented] (YARN-6924) Metrics for Federation AMRMProxy
[ https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046785#comment-17046785 ] Bibin Chundatt commented on YARN-6924: -- [~youchen] Overall the patch looks good. Minor nits: * The annotation and the method signature should be on different lines * The same applies to the variables in AMRMProxyMetrics. * Since the test cases are in the same package, the visibility of the get methods could be package-private. > Metrics for Federation AMRMProxy > > > Key: YARN-6924 > URL: https://issues.apache.org/jira/browse/YARN-6924 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Young Chen >Priority: Major > Attachments: YARN-6924.01.patch, YARN-6924.01.patch, > YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, YARN-6924.04.patch > > > This JIRA proposes addition of metrics for Federation AMRMProxy -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
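A minimal illustration of the layout nits above; the metric name and field are made up, not actual AMRMProxyMetrics members. It shows the annotation on its own line and a package-private getter, which same-package tests can reach:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

// Hedged sketch of the requested style, not the actual patch contents.
@Metrics(about = "AMRMProxy metrics sketch", context = "yarn")
public class AMRMProxyMetricsSketch {

  @Metric("Number of failed application start requests")
  MutableCounterLong failedAppStartRequests; // annotation on its own line

  long getFailedAppStartRequests() { // package-private is enough for tests
    return failedAppStartRequests.value();
  }
}
{code}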
[jira] [Comment Edited] (YARN-6924) Metrics for Federation AMRMProxy
[ https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046785#comment-17046785 ] Bibin Chundatt edited comment on YARN-6924 at 2/27/20 4:33 PM: --- [~youchen] Overall the patch looks good. Minor nits: * The annotation and the method signature should be on different lines * The same applies to the variables in AMRMProxyMetrics. * Since the test cases are in the same package, the visibility of the get methods could be package-private. * Correct the Apache source file copyright headers too. was (Author: bibinchundatt): [~youchen] Over all the patch looks good.. Minor nits : * Annotation and the method signature to be in different lines * Same applies for the variables too in AMRMProxyMetrics. * Since the testcase are in same package the visibility for get methods could be package private. > Metrics for Federation AMRMProxy > > > Key: YARN-6924 > URL: https://issues.apache.org/jira/browse/YARN-6924 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Young Chen >Priority: Major > Attachments: YARN-6924.01.patch, YARN-6924.01.patch, > YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, YARN-6924.04.patch > > > This JIRA proposes addition of metrics for Federation AMRMProxy -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6924) Metrics for Federation AMRMProxy
[ https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049835#comment-17049835 ] Bibin Chundatt commented on YARN-6924: -- [~youchen] Overall the patch looks good to me. I will wait for a day before committing. > Metrics for Federation AMRMProxy > > > Key: YARN-6924 > URL: https://issues.apache.org/jira/browse/YARN-6924 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Young Chen >Priority: Major > Attachments: YARN-6924.01.patch, YARN-6924.01.patch, > YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, > YARN-6924.04.patch, YARN-6924.05.patch > > > This JIRA proposes addition of metrics for Federation AMRMProxy -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10098) Add interface to get node iterators by scheduler key for AppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt resolved YARN-10098. --- Resolution: Invalid > Add interface to get node iterators by scheduler key for AppPlacementAllocator > -- > > Key: YARN-10098 > URL: https://issues.apache.org/jira/browse/YARN-10098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10110) In Federation Secure cluster Application submission fails when authorization is enabled
[ https://issues.apache.org/jira/browse/YARN-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039688#comment-17039688 ] Bibin Chundatt commented on YARN-10110: --- [~BilwaST] Could you point me to the JIRA which adds Federation security support? Also, it's better to group all the security-related federation work under one subtask and link it to YARN-5597 > In Federation Secure cluster Application submission fails when authorization > is enabled > --- > > Key: YARN-10110 > URL: https://issues.apache.org/jira/browse/YARN-10110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sushanta Sen >Assignee: Bilwa S T >Priority: Blocker > Attachments: YARN-10110.001.patch, YARN-10110.002.patch > > > 【Precondition】: > 1. Secure Federated cluster is available > 2. Add the below configuration in Router and client core-site.xml > hadoop.security.authorization= true > 3. Restart the router service > 【Test step】: > 1. Go to router client bin path and submit a MR PI job > 2. Observe the client console screen > 【Expect Output】: > No error should be thrown and the job should be successful > 【Actual Output】: > Job failed prompting "Protocol interface > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB is not known.," > 【Additional Note】: > But on setting the parameter as false, the job is submitted and succeeds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reassigned YARN-4575: Assignee: (was: Bibin Chundatt) > ApplicationResourceUsageReport should return ALL reserved resource > --- > > Key: YARN-4575 > URL: https://issues.apache.org/jira/browse/YARN-4575 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin Chundatt >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch > > > The ApplicationResourceUsageReport reserved resource report currently covers only the > default partition; it should cover all partitions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10098) AppPlacementAllocator get getPreferredNodeIterator based on scheduler key
Bibin Chundatt created YARN-10098: - Summary: AppPlacementAllocator get getPreferredNodeIterator based on scheduler key Key: YARN-10098 URL: https://issues.apache.org/jira/browse/YARN-10098 Project: Hadoop YARN Issue Type: Improvement Reporter: Bibin Chundatt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10098) AppPlacementAllocator getPreferredNodeIterator based on scheduler key
[ https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10098: -- Summary: AppPlacementAllocator getPreferredNodeIterator based on scheduler key (was: AppPlacementAllocator get getPreferredNodeIterator based on scheduler key) > AppPlacementAllocator getPreferredNodeIterator based on scheduler key > -- > > Key: YARN-10098 > URL: https://issues.apache.org/jira/browse/YARN-10098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10098) Add interface to get node iterators by scheduler key for AppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10098: -- Summary: Add interface to get node iterators by scheduler key for AppPlacementAllocator (was: AppPlacementAllocator getPreferredNodeIterator based on scheduler key) > Add interface to get node iterators by scheduler key for AppPlacementAllocator > -- > > Key: YARN-10098 > URL: https://issues.apache.org/jira/browse/YARN-10098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState
[ https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009536#comment-17009536 ] Bibin Chundatt commented on YARN-9624: -- [~BilwaST] Could you update the patch? > Use switch case for ProtoUtils#convertFromProtoFormat containerState > > > Key: YARN-9624 > URL: https://issues.apache.org/jira/browse/YARN-9624 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Major > Labels: performance > Attachments: YARN-9624.001.patch, YARN-9624.002.patch > > > On a large cluster with 100K+ containers, calling > {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} on > every heartbeat will be too costly. Update it with a switch case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076865#comment-17076865 ] Bibin Chundatt commented on YARN-10208: --- Thank you [~adam.antal] for the additional review. Will wait for a day before committing. > Add metric in CapacityScheduler for evaluating the time difference between > node heartbeats > -- > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
[ https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reopened YARN-10182: --- > SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM > --- > > Key: YARN-10182 > URL: https://issues.apache.org/jira/browse/YARN-10182 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn > Environment: Cloudera Express 6.0.0 > RM1: active, RM2: standby > Kerberos is on > yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml > keytab: /etc/krb5.keytab ===> the keytab of yarn > when I run slsrun.sh, I get an error: > Exception in thread "main" org.apache.hadoop.service.ServiceStateException: > java.io.IOException: Couldn't create /yarn-leader-election/yarnRM > If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login > failure for user: yarn from keytab /etc/krb5.keytab > javax.security.auth.login.LoginException: Unable to obtain password from user" > How can this be resolved? > >Reporter: zhangyu >Priority: Major > Attachments: slsrun.log.txt > > > RM1: active, RM2: standby > Kerberos is on > yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml > keytab: /etc/krb5.keytab ===> the keytab of yarn > when I run slsrun.sh on RM1, I get an error: > Exception in thread "main" org.apache.hadoop.service.ServiceStateException: > java.io.IOException: Couldn't create /yarn-leader-election/yarnRM > If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login > failure for user: yarn from keytab /etc/krb5.keytab > javax.security.auth.login.LoginException: Unable to obtain password from user" > How can this be resolved? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
[ https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt resolved YARN-10182. --- Resolution: Not A Problem > SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM > --- > > Key: YARN-10182 > URL: https://issues.apache.org/jira/browse/YARN-10182 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn > Environment: Cloudera Express 6.0.0 > RM1: active, RM2: standby > Kerberos is on > yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml > keytab: /etc/krb5.keytab ===> the keytab of yarn > when I run slsrun.sh, I get an error: > Exception in thread "main" org.apache.hadoop.service.ServiceStateException: > java.io.IOException: Couldn't create /yarn-leader-election/yarnRM > If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login > failure for user: yarn from keytab /etc/krb5.keytab > javax.security.auth.login.LoginException: Unable to obtain password from user" > How can this be resolved? > >Reporter: zhangyu >Priority: Major > Attachments: slsrun.log.txt > > > RM1: active, RM2: standby > Kerberos is on > yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml > keytab: /etc/krb5.keytab ===> the keytab of yarn > when I run slsrun.sh on RM1, I get an error: > Exception in thread "main" org.apache.hadoop.service.ServiceStateException: > java.io.IOException: Couldn't create /yarn-leader-election/yarnRM > If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login > failure for user: yarn from keytab /etc/krb5.keytab > javax.security.auth.login.LoginException: Unable to obtain password from user" > How can this be resolved? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9627) DelegationTokenRenewer could block transitionToStandy
[ https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reassigned YARN-9627: Assignee: (was: Bibin Chundatt) > DelegationTokenRenewer could block transitionToStandy > - > > Key: YARN-9627 > URL: https://issues.apache.org/jira/browse/YARN-9627 > Project: Hadoop YARN > Issue Type: Bug >Reporter: krishna reddy >Priority: Critical > Attachments: YARN-9627.001.patch, YARN-9627.002.patch, > YARN-9627.003.patch > > > Cluster size: 5K > Running containers: 55K > *Scenario*: Largenumber of pending applications (around 50K) and performing > RM switch over > Below exception : > {noformat} > 2019-06-13 17:39:27,594 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: > Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token > for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, > realUser=, issueDate=1560361265181, maxDate=1560966065181, > sequenceNumber=104708, masterKeyId=3);exp=1560533965360; > apps=[application_1560346941775_20702] in 86397766 ms, appId = > [application_1560346941775_20702] > 2019-06-13 17:39:27,609 WARN > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: > Unable to add the application to the delegation token renewer on recovery. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error > occurred for the packet 'clientPath:null serverPath:null finished:false > header:: 27,4 replyHeader:: 27,4295687588,0 request:: > '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F > response:: > #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577} > '. > 2019-06-13 17:58:20,877 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: > Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: > X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN > owner=root/had...@hadoop.com, renewer=yarn, realUser=, > issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, > masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]] > 2019-06-13 17:58:20,924 WARN > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: > Unable to add the application to the delegation token renewer on recovery. > java.lang.IllegalStateException: Timer already cancelled. 
> at java.util.Timer.sched(Timer.java:397) > at java.util.Timer.schedule(Timer.java:208) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068603#comment-17068603 ] Bibin Chundatt commented on YARN-10208: --- [~lapjarn] Minor comment. {code} 1834 // Add metrics for evaluating the time difference between heartbeats. 1835 SchedulerNode node = 1836 nodeTracker.getNode(nodeUpdatedEvent.getRMNode().getNodeID()); 1837 if (node != null) { 1838long lastInterval = 1839Time.monotonicNow() - node.getLastHeartbeatMonotonicTime(); 1840CapacitySchedulerMetrics.getMetrics() 1841.addSchedulerHeartBeatIntervalAverage(lastInterval); 1842 } {code} Refactor this into a method and invoke it before the node update call. > Add metric in CapacityScheduler for evaluating the time difference between > node heartbeats > -- > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
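A minimal sketch of the refactor requested above, lifting the quoted block into a helper; the method name restates the reviewer's suggestion rather than committed code, and it reuses the names visible in the quoted patch:
{code}
// Hedged sketch: extracted helper, to be invoked before the node update call.
private void updateSchedulerNodeHBIntervalMetrics(RMNode rmNode) {
  SchedulerNode node = nodeTracker.getNode(rmNode.getNodeID());
  if (node != null) {
    long lastInterval =
        Time.monotonicNow() - node.getLastHeartbeatMonotonicTime();
    CapacitySchedulerMetrics.getMetrics()
        .addSchedulerHeartBeatIntervalAverage(lastInterval);
  }
}
{code}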
[jira] [Commented] (YARN-10172) Default ApplicationPlacementType class should be configurable
[ https://issues.apache.org/jira/browse/YARN-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068619#comment-17068619 ] Bibin Chundatt commented on YARN-10172: --- [~cyrusjackson25] Please check the checkstyle issues. Apart from that, the changes look good to me. {code} 184 String DEFAULT_APPLICATION_PLACEMENT_TYPE_CLASS = "org.apache.hadoop.yarn." 185 + "server.resourcemanager.scheduler.capacity." 186 + "yarnpp.YarnppLocalityAppPlacementAllocator"; {code} # Rename YarnppLocalityAppPlacementAllocator -> DummyLocalityAppPlacementAllocator # The package name could also be shorter. [~sunil.gov...@gmail.com] Would you like to take a look? > Default ApplicationPlacementType class should be configurable > - > > Key: YARN-10172 > URL: https://issues.apache.org/jira/browse/YARN-10172 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Cyrus Jackson >Assignee: Cyrus Jackson >Priority: Minor > Attachments: YARN-10172.001.patch > > > This can be useful in scheduling apps based on the configured placement type > class rather than resorting to LocalityAppPlacementAllocator -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071476#comment-17071476 ] Bibin Chundatt commented on YARN-10208: --- +1 looks good to me. > Add metric in CapacityScheduler for evaluating the time difference between > node heartbeats > -- > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch, YARN-10208.004.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070748#comment-17070748 ] Bibin Chundatt commented on YARN-10208: --- [~lapjarn] Minor nit: rename the schedulerHeartBeatIntervalAverage variable and method to schedulerNodeHBInterval > Add metric in CapacityScheduler for evaluating the time difference between > node heartbeats > -- > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10229) [Federation] Client should be able to submit application to RM directly using normal client conf
[ https://issues.apache.org/jira/browse/YARN-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098655#comment-17098655 ] Bibin Chundatt commented on YARN-10229: --- [~BilwaST] / [~122512...@qq.com] NodeManagers need to stay independent of the applications. Parsing application-specific details on the NodeManager side is not recommended. Alternate solution: currently AMRMProxyService always overrides the AMRMToken. If the interceptors could indicate whether the AMRMToken needs to be overridden, then we should be able to submit. In this case the FederationInterceptor could check whether the home-application entry is available in the federation state store. Thoughts? > [Federation] Client should be able to submit application to RM directly using > normal client conf > > > Key: YARN-10229 > URL: https://issues.apache.org/jira/browse/YARN-10229 > Project: Hadoop YARN > Issue Type: Wish > Components: amrmproxy, federation >Affects Versions: 3.1.1 >Reporter: JohnsonGuo >Assignee: Bilwa S T >Priority: Major > > Scenario: When enabling the yarn federation feature with multiple yarn clusters, > one can submit their job to yarn-router by *modifying* their client > configuration with the yarn router address. > But if one still wants to submit their jobs via the original client (before > enabling federation) to RM directly, it will encounter the AMRMToken exception. > That means once federation is enabled, if someone wants to submit a job, they have > to modify the client conf. > > One possible solution for this scenario is: > In NodeManager, when the client ApplicationMaster request comes: > * get the client job.xml from HDFS "". > * parse the "yarn.resourcemanager.scheduler.address" parameter in job.xml > * if the value of the parameter is "localhost:8049" (AMRM address), then do > the AMRMToken validation process > * if the value of the parameter is "rm:port" (RM address), then skip the > AMRMToken validation process > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
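A hypothetical sketch of the alternate solution above: let each interceptor in the AMRMProxy chain declare whether the AMRMToken must be overridden. The interface and method are illustrative assumptions, not existing YARN APIs.

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Sketch only: interceptors implementing this could opt out of the token
// override that AMRMProxyService applies unconditionally today.
public interface AMRMTokenOverridePolicy {
  // A FederationInterceptor could return false when no home-application
  // entry exists in the federation state store, letting the request pass
  // through to the RM with the client's original token.
  boolean requiresAMRMTokenOverride(ApplicationId appId);
}
{code}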
[jira] [Commented] (YARN-10246) Enable YARN Router to have a dedicated Zookeeper
[ https://issues.apache.org/jira/browse/YARN-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099735#comment-17099735 ] Bibin Chundatt commented on YARN-10246: --- [~dmmkr] In the case of a non-secure cluster this could work with a different configuration file. But I am not sure how this would work in a secure cluster. Does Curator support multiple Kerberos configurations in the same process (the RM is the process here)? The RM has to connect to the Federation state store and also the RM state store. IIRC the version of Curator doesn't support the same. > Enable YARN Router to have a dedicated Zookeeper > > > Key: YARN-10246 > URL: https://issues.apache.org/jira/browse/YARN-10246 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation, router >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Attachments: YARN-10246.001.patch, YARN-10246.002.patch > > > Currently, we have a single parameter hadoop.zk.address for Router and > Resourcemanager. Due to this we need to have FederationStateStore and > RMStateStore on the same Zookeeper instance. > With the above topology there can be a load on ZooKeeper, since all > subcluster RMs will write to a single ZooKeeper. > So, if we introduce a new configuration such as hadoop.federation.zk.address > we can have FederationStateStore on a dedicated Zookeeper. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
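For the non-secure case only, a minimal sketch of building a second Curator client from a dedicated federation ZK address. The key name follows the proposal in this JIRA; the secure (Kerberos) case is exactly what is questioned above and is not covered here.

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.conf.Configuration;

// Sketch only: a dedicated ZK connection for the federation state store,
// falling back to the shared hadoop.zk.address when unset.
public final class FederationZKClient {
  public static CuratorFramework create(Configuration conf) {
    String connectString = conf.get("hadoop.federation.zk.address",
        conf.get("hadoop.zk.address"));
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        connectString, new ExponentialBackoffRetry(1000, 3));
    client.start();
    return client;
  }
}
{code}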
[jira] [Comment Edited] (YARN-10246) Enable YARN Router to have a dedicated Zookeeper
[ https://issues.apache.org/jira/browse/YARN-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099735#comment-17099735 ] Bibin Chundatt edited comment on YARN-10246 at 5/5/20, 9:48 AM: [~dmmkr] In the case of a non-secure cluster this could work with different property names. But I am not sure how this would work in a secure cluster. Does Curator support multiple Kerberos configurations in the same process (the RM is the process here)? The RM has to connect to the Federation state store and also the RM state store. IIRC the version of Curator doesn't support the same. was (Author: bibinchundatt): [~dmmkr] In case of non secure cluster this could work with a different configuration file.. But i am not sure how this could work in secure cluster. Does curator support multiple kerboros configuration in same process (RM is the process here.) RM has to connect to Federation state store and also RM State Store.. IIRC the version of curator doesnt support the same. > Enable YARN Router to have a dedicated Zookeeper > > > Key: YARN-10246 > URL: https://issues.apache.org/jira/browse/YARN-10246 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation, router >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Attachments: YARN-10246.001.patch, YARN-10246.002.patch > > > Currently, we have a single parameter hadoop.zk.address for Router and > Resourcemanager. Due to this we need to have FederationStateStore and > RMStateStore on the same Zookeeper instance. > With the above topology there can be a load on ZooKeeper, since all > subcluster RMs will write to a single ZooKeeper. > So, if we introduce a new configuration such as hadoop.federation.zk.address > we can have FederationStateStore on a dedicated Zookeeper. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement
[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101380#comment-17101380 ] Bibin Chundatt commented on YARN-10259: --- In addition to the above, I think the issue also exists in *LeafQueue#allocateFromReservedContainer*. We try the container allocation from the first node we get while iterating through the whole candidate set, a change from the previous logic. Issue: a container gets unreserved on node1, then we reserve on node1 again during allocation. The nodes at the end of the list with reserved containers might never get a chance to do allocation/unreservation. This impacts the performance of multi-node lookup too. AsyncSchedulerThread gives a fair chance to each node to do unreserve/allocate from a reserved container. Attempt allocation if a reserved container exists with a single-candidate nodeset. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement > --- > > Key: YARN-10259 > URL: https://issues.apache.org/jira/browse/YARN-10259 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 3.3.0 >Reporter: Prabhu Joseph >Priority: Major > Attachments: REPRO_TEST.patch > > > Reserved Containers are not allocated from the available space of other nodes > in CandidateNodeSet in MultiNodePlacement. > *Repro:* > 1. MultiNode Placement Enabled. > 2. Two nodes h1 and h2 with 8GB > 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets > placed in h2. > 4. Submit app3 AM which is reserved in h1 > 5. Kill app2 which frees space in h2. > 6. app3 AM never gets ALLOCATED > RM logs show YARN-8127 fix rejecting the allocation proposal for app3 AM on > h2 as it expects the assignment to be on same node where reservation has > happened. > {code} > 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt > appattempt_1588684773609_0003_01 reserved container > container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 > available= used=. This attempt > currently has 1 reserved containers at priority 0; currentReservation > > 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler] > fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved > container=container_1588684773609_0003_01_01, on node=host: h1:1234 > #containers=1 available= used= > with resource= >RESERVED=[(Application=appattempt_1588684773609_0003_01; > Node=h1:1234; Resource=)] > > 2020-05-05 18:49:38,283 DEBUG [Time-limited test] > allocator.RegularContainerAllocator > (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: > node=h2 application=application_1588684773609_0003 priority=0 > pendingAsk=,repeat=1> > type=OFF_SWITCH > 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate > from reserved container container_1588684773609_0003_01_01, but node is > not reserved >ALLOCATED=[(Application=appattempt_1588684773609_0003_01; > Node=h2:1234; Resource=)] > {code} > After reverting fix of YARN-8127, it works. Attached testcase which > reproduces the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement
[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101380#comment-17101380 ] Bibin Chundatt edited comment on YARN-10259 at 5/7/20, 6:00 AM: In addition to the above, I think the issue also exists in *LeafQueue#allocateFromReservedContainer*. We try the container allocation from the first node we get while iterating through the whole candidate set, a change from the previous logic. Issue: a container gets unreserved on node1, then we reserve on node1 again during allocation. The nodes at the end of the list with reserved containers might never get a chance to do allocation/unreservation. This impacts the performance of multi-node lookup too. *AsyncSchedulerThread* gives a fair chance to all nodes to do unreserve/allocate for reserved containers. Attempt allocation if a reserved container exists with a single-candidate nodeset. was (Author: bibinchundatt): In addition to the above.. I think the issue exists in the *LeafQueue#allocateFromReservedContainer* .. We do try the container allocation from first node we get iterating through all the candidate set. Change to previous logic. Issue -. container gets unreserved on node1. then again we reserve on node 1 during allocation .. The nodes in the last in list with reserved containers might never get a chance to do allocation./ unreservation. This impacts performance of multiNodelookup too. AsyncSchedulerThread give a fair change to each node to do unreserve/allocate from reserved container. Attempt allocation if reserved container exists with a single candidate nodeset. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement > --- > > Key: YARN-10259 > URL: https://issues.apache.org/jira/browse/YARN-10259 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: REPRO_TEST.patch > > > Reserved Containers are not allocated from the available space of other nodes > in CandidateNodeSet in MultiNodePlacement. > *Repro:* > 1. MultiNode Placement Enabled. > 2. Two nodes h1 and h2 with 8GB > 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets > placed in h2. > 4. Submit app3 AM which is reserved in h1 > 5. Kill app2 which frees space in h2. > 6. app3 AM never gets ALLOCATED > RM logs show YARN-8127 fix rejecting the allocation proposal for app3 AM on > h2 as it expects the assignment to be on same node where reservation has > happened. > {code} > 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt > appattempt_1588684773609_0003_01 reserved container > container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 > available= used=. 
This attempt > currently has 1 reserved containers at priority 0; currentReservation > > 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler] > fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved > container=container_1588684773609_0003_01_01, on node=host: h1:1234 > #containers=1 available= used= > with resource= >RESERVED=[(Application=appattempt_1588684773609_0003_01; > Node=h1:1234; Resource=)] > > 2020-05-05 18:49:38,283 DEBUG [Time-limited test] > allocator.RegularContainerAllocator > (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: > node=h2 application=application_1588684773609_0003 priority=0 > pendingAsk=,repeat=1> > type=OFF_SWITCH > 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate > from reserved container container_1588684773609_0003_01_01, but node is > not reserved >ALLOCATED=[(Application=appattempt_1588684773609_0003_01; > Node=h2:1234; Resource=)] > {code} > After reverting fix of YARN-8127, it works. Attached testcase which > reproduces the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10181) Managing Centralized Node Attribute via RMWebServices.
[ https://issues.apache.org/jira/browse/YARN-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060849#comment-17060849 ] Bibin Chundatt commented on YARN-10181: --- Could you move this JIRA under YARN-8766? > Managing Centralized Node Attribute via RMWebServices. > -- > > Key: YARN-10181 > URL: https://issues.apache.org/jira/browse/YARN-10181 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodeattibute >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Priority: Major > > Currently Centralized NodeAttributes can be managed only through the Yarn > NodeAttribute CLI. This is to add support via RMWebServices. > {code} > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeAttributes.html#Centralised_Node_Attributes_mapping. > Centralised : Node to attributes mapping can be done through RM exposed CLI > or RPC (REST is yet to be supported). > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement
[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100769#comment-17100769 ] Bibin Chundatt commented on YARN-10259: --- [~prabhujoseph] I think we have a few issues in RegularContainerAllocator#allocate: # Only when the preCheckForNodeCandidateSet check fails for *appInfo.precheckNode* should we continue the iteration over the next set of nodes. # If preCheckForNodeCandidateSet returns null, try allocation. # In all other cases, return preCheckForNodeCandidateSet(..). # If we have a reserved container and the pending ask for the scheduler key is zero, unreserve the container: {code} if (application.getOutstandingAsksCount(schedulerKey) == 0) { // Release return new ContainerAllocation(reservedContainer, null, AllocationState.QUEUE_SKIPPED); } {code} # In *schedulingPS.getPreferredNodeIterator* I think we should filter out all the nodes with reserved containers (see the sketch after this mail). This should reduce reservations. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement > --- > > Key: YARN-10259 > URL: https://issues.apache.org/jira/browse/YARN-10259 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 3.3.0 >Reporter: Prabhu Joseph >Priority: Major > Attachments: REPRO_TEST.patch > > > Reserved Containers are not allocated from the available space of other nodes > in CandidateNodeSet in MultiNodePlacement. > *Repro:* > 1. MultiNode Placement Enabled. > 2. Two nodes h1 and h2 with 8GB > 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets > placed in h2. > 4. Submit app3 AM which is reserved in h1 > 5. Kill app2 which frees space in h2. > 6. app3 AM never gets ALLOCATED > RM logs show YARN-8127 fix rejecting the allocation proposal for app3 AM on > h2 as it expects the assignment to be on same node where reservation has > happened. > {code} > 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt > appattempt_1588684773609_0003_01 reserved container > container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 > available= used=. This attempt > currently has 1 reserved containers at priority 0; currentReservation > > 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler] > fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved > container=container_1588684773609_0003_01_01, on node=host: h1:1234 > #containers=1 available= used= > with resource= >RESERVED=[(Application=appattempt_1588684773609_0003_01; > Node=h1:1234; Resource=)] > > 2020-05-05 18:49:38,283 DEBUG [Time-limited test] > allocator.RegularContainerAllocator > (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: > node=h2 application=application_1588684773609_0003 priority=0 > pendingAsk=,repeat=1> > type=OFF_SWITCH > 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate > from reserved container container_1588684773609_0003_01_01, but node is > not reserved >ALLOCATED=[(Application=appattempt_1588684773609_0003_01; > Node=h2:1234; Resource=)] > {code} > After reverting fix of YARN-8127, it works. Attached testcase which > reproduces the issue. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
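To make comment 5 above concrete, here is a minimal sketch of filtering reserved nodes out of the preferred-node iterator. SchedulerNode#getReservedContainer is an existing accessor; the wrapper class and its wiring into schedulingPS.getPreferredNodeIterator are illustrative assumptions.

{code}
import java.util.Iterator;
import com.google.common.collect.Iterators;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;

// Sketch only: wrap the preferred-node iterator so nodes that already hold
// a reserved container are skipped during regular allocation.
public final class ReservedNodeFilter {
  private ReservedNodeFilter() {
  }

  public static <N extends SchedulerNode> Iterator<N> withoutReserved(
      Iterator<N> preferredNodes) {
    // A node with a non-null reserved container is left to the
    // reserved-container allocation path instead.
    return Iterators.filter(preferredNodes,
        node -> node.getReservedContainer() == null);
  }
}
{code}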
[jira] [Reopened] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reopened YARN-10395: --- Reopening since it's not committed to any branch. > ReservedContainer Node is added to blackList of application due to this node > can not allocate other container > - > > Key: YARN-10395 > URL: https://issues.apache.org/jira/browse/YARN-10395 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.2 >Reporter: chan >Priority: Major > Fix For: 2.9.2 > > Attachments: Yarn-10395-001.patch > > > Now, if an app reserved a node, but the node is added to the app's blacklist, > then when this node sends a heartbeat to the resourcemanager, the reserved > container allocation fails. This makes the node unable to allocate other > containers even though it has enough memory or vcores. So I think we can > release this reserved container when the reserved node is in the blacklist of > this app. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10359) Log container report only if list is not empty
[ https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169243#comment-17169243 ] Bibin Chundatt commented on YARN-10359: --- +1, committing shortly. > Log container report only if list is not empty > -- > > Key: YARN-10359 > URL: https://issues.apache.org/jira/browse/YARN-10359 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10359.001.patch, YARN-10359.002.patch > > > In NodeStatusUpdaterImpl, print the log only if the containerReports list is not empty: > {code:java} > if (containerReports != null) { > LOG.info("Registering with RM using containers :" + containerReports); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
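A minimal sketch of the fix being committed here: guard the (potentially large) log line behind an emptiness check in addition to the null check, as a drop-in replacement for the snippet quoted in the description.

{code:java}
// Sketch of the fix: log only when the list is both non-null and non-empty.
if (containerReports != null && !containerReports.isEmpty()) {
  LOG.info("Registering with RM using containers :" + containerReports);
}
{code}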
[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169246#comment-17169246 ] Bibin Chundatt commented on YARN-10335: --- Thank you [~cyrusjackson25] for the patch. I just took a brief look at it. A few comments: # Could be assigned to *NodeHealthCheckerService*: {noformat} 107 NodeHealthCheckerServiceImpl healthChecker = 108 createNodeHealthCheckerService(); {noformat} # Use a read lock for the get API (see the sketch after this mail): {noformat} 528 public NodeHealthDetails getNodeHealthDetails() { 529 this.writeLock.lock(); 530 531 try { 532 return this.nodeHealthDetails; 533 } finally { 534 this.writeLock.unlock(); 535 } 536 } {noformat} # Fix all the Jenkins errors. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > Attachments: YARN-10335.001.patch, YARN-10335.002.patch, > YARN-10335.003.patch, YARN-10335.004.patch > > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
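The read-lock comment above, sketched out. NodeHealthDetails is the type from the patch under review; the holder class is illustrative.

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: a plain getter needs just the read lock, so concurrent
// readers are not serialized behind the write lock.
public class NodeHealthDetailsHolder {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private NodeHealthDetails nodeHealthDetails; // type proposed in this JIRA

  public NodeHealthDetails getNodeHealthDetails() {
    lock.readLock().lock();
    try {
      return nodeHealthDetails;
    } finally {
      lock.readLock().unlock();
    }
  }

  public void setNodeHealthDetails(NodeHealthDetails details) {
    lock.writeLock().lock();
    try {
      this.nodeHealthDetails = details;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}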
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170656#comment-17170656 ] Bibin Chundatt commented on YARN-10352: --- Thank you [~prabhujoseph] for the patch. Just a few queries/comments: # The custom iterator: how much improvement do we have against *Iterators.filter*? # Can we avoid the multiplier * 2 and make it configurable (see the sketch after this mail)? The multiplier could go wrong when the dispatcher is overloaded; processing events for large clusters could be slow, and events could stay in the async dispatcher for >2 seconds. # In MultiNodeSortingManager the imports could be ordered. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which have not heartbeated for the configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
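An illustrative sketch of the skip logic discussed above, assuming the SchedulerNode tracks its last heartbeat time the way the async-scheduling check (CapacityScheduler#shouldSkipNodeSchedule) does. The hard-coded "* 2" is exactly the multiplier questioned in comment 2; making it configurable would replace that constant.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;

// Sketch only: skip a node during multi-node lookup when it has not
// heartbeated within a threshold derived from the heartbeat interval.
public final class HeartbeatStalenessCheck {
  public static boolean shouldSkipNodeSchedule(SchedulerNode node,
      Configuration conf, long nowMonotonicMillis) {
    long heartbeatIntervalMs = conf.getLong(
        YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS);
    // Missing two consecutive heartbeats marks the node as stale.
    return nowMonotonicMillis - node.getLastHeartbeatMonotonicTime()
        > heartbeatIntervalMs * 2;
  }
}
{code}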
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172949#comment-17172949 ] Bibin Chundatt commented on YARN-10352: --- +1 for the latest patch. Will commit the same by EOD, if no objections. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which have not heartbeated for the configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172042#comment-17172042 ] Bibin Chundatt commented on YARN-10335: --- [~sunilg] Could you take a look? > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > Attachments: YARN-10335.001.patch, YARN-10335.002.patch, > YARN-10335.003.patch, YARN-10335.004.patch > > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10388) RMNode updatedCapability flag not set while RecommissionNodeTransition
[ https://issues.apache.org/jira/browse/YARN-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172043#comment-17172043 ] Bibin Chundatt commented on YARN-10388: --- Good catch, [~lapjarn]. > RMNode updatedCapability flag not set while RecommissionNodeTransition > -- > > Key: YARN-10388 > URL: https://issues.apache.org/jira/browse/YARN-10388 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Major > > RMNode updatedCapability flag is not set while RecommissionNodeTransition > happens. RM gets updated with the new totalcapability when recommissioning of a node > happens. But the nodemanager still has the old totalcapability and is not aware > of the change. Setting this flag while RecommissionNodeTransition would > update the nodemanager of the totalcapability change as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10388) RMNode updatedCapability flag not set while RecommissionNodeTransition
[ https://issues.apache.org/jira/browse/YARN-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172128#comment-17172128 ] Bibin Chundatt commented on YARN-10388: --- Overall the patch looks good to me. +1. Will wait for the Jenkins result. cc: [~inigoiri] > RMNode updatedCapability flag not set while RecommissionNodeTransition > -- > > Key: YARN-10388 > URL: https://issues.apache.org/jira/browse/YARN-10388 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Major > Attachments: YARN-10388.001.patch > > > RMNode updatedCapability flag is not set while RecommissionNodeTransition > happens. RM gets updated with the new totalcapability when recommissioning of a node > happens. But the nodemanager still has the old totalcapability and is not aware > of the change. Setting this flag while RecommissionNodeTransition would > update the nodemanager of the totalcapability change as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153234#comment-17153234 ] Bibin Chundatt commented on YARN-10335: --- Thank you [~cyrusjackson25] for working on this. A few comments: # Refer to NodeHealthStatus for how the records need to be implemented. Define them as abstract and also add comments. # setNodeResources -> setNodeResourceScore; rename the variables too. # Add more description detail on why we added this: {noformat} optional string node_health_description = 4; {noformat} # In NodeHealthService, instead of *getNodeHealthDetails* we could add updateNodeHealthDetails. # Add the visibility annotation as private. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > Attachments: YARN-10335.001.patch > > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151290#comment-17151290 ] Bibin Chundatt commented on YARN-10332: --- [~yehuanhuan] My bad. The state transition is defined twice, so it makes sense to remove it. I misunderstood the JIRA as YARN-10315. +1 for the change. > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state > > > Key: YARN-10332 > URL: https://issues.apache.org/jira/browse/YARN-10332 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: yehuanhuan >Priority: Minor > Attachments: YARN-10332.001.patch > > > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843 ] Bibin Chundatt edited comment on YARN-10335 at 7/5/20, 6:19 AM: Thank you for showing interest in the JIRA [~cyrusjackson25]. Adding what I have in mind about the health detail. NodeManager has a node health service which returns a boolean value. It sends UNHEALTHY if the node health script returns an error or if we don't have any healthy local directories. We will introduce field/fields which return a detailed node health value about the node along with the NodeHealthStatus. Example: {quote} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; optional StringIntMapProto nodeHealthdetail=5; } message StringStringMapProto { optional string key = 1; optional int32 value = 2; } keys could be - ssd, non ssd, etc.. {quote} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthService. was (Author: bibinchundatt): Thank you for showing interest in the JIRA [~cyrusjackson25] Adding what i have in mind about the health detail. Node manager has node health service which returns a boolean value .Sends UNHEALTHY if the node health script return error / If we don't have any healthy local directories. We will introduce field/fields which returns detailed node health value about the node along with the NodeHealthStatus. Example: {quote} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; optional StringIntMapProto nodeHealthdetail=5; } message StringStringMapProto { optional string key = 1; optional int32 value = 2; } keys could be - overall , ssd, non ssd, etc.. {quote} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthServices. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843 ] Bibin Chundatt edited comment on YARN-10335 at 7/5/20, 6:22 AM: Thank you for showing interest in the JIRA [~cyrusjackson25]. Adding what I have in mind about the health detail. NodeManager has a node health service which returns a boolean value. It sends UNHEALTHY if the node health script returns an error or if we don't have any healthy local directories. We will introduce field/fields which return a detailed node health value about the node along with the NodeHealthStatus. Example: {noformat} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; } message NodeHealthDetail{ optional int32 overallscore=1; optional StringIntMapProto nodeResources =2 ; } message StringIntMapProto { optional string key = 1; optional int32 value = 2; } keys could be - ssd, non ssd, etc.. {noformat} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthService. was (Author: bibinchundatt): Thank you for showing interest in the JIRA [~cyrusjackson25] Adding what i have in mind about the health detail. Node manager has node health service which returns a boolean value .Sends UNHEALTHY if the node health script return error / If we don't have any healthy local directories. We will introduce field/fields which returns detailed node health value about the node along with the NodeHealthStatus. Example: {quote} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; optional StringIntMapProto nodeHealthdetail=5; } message StringStringMapProto { optional string key = 1; optional int32 value = 2; } keys could be - ssd, non ssd, etc.. {quote} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthServices. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
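To illustrate the pluggability idea above, a sketch with a plain value type mirroring the proposed NodeHealthDetail proto. The interface and config key are illustrative assumptions, not existing YARN APIs.

{code}
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only: a pluggable node health service loaded by class name.
public interface PluggableNodeHealthService {

  final class NodeHealthDetail {
    public final int overallScore;
    public final Map<String, Integer> nodeResources; // e.g. "ssd" -> score

    public NodeHealthDetail(int overallScore,
        Map<String, Integer> nodeResources) {
      this.overallScore = overallScore;
      this.nodeResources = nodeResources;
    }
  }

  // Detailed health reported alongside the existing boolean health status.
  NodeHealthDetail getNodeHealthDetail();

  static PluggableNodeHealthService load(Configuration conf,
      Class<? extends PluggableNodeHealthService> defaultImpl) {
    Class<? extends PluggableNodeHealthService> clazz = conf.getClass(
        "yarn.nodemanager.node-health-service.class", // made-up key
        defaultImpl, PluggableNodeHealthService.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}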
[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151483#comment-17151483 ] Bibin Chundatt commented on YARN-10335: --- [~subru]/[~sunilg] Does the proto structure look good? > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166 ] Bibin Chundatt commented on YARN-10332: --- [~yehuanhuan] Looks like a duplicate of YARN-10315. > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state > > > Key: YARN-10332 > URL: https://issues.apache.org/jira/browse/YARN-10332 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: yehuanhuan >Priority: Minor > Attachments: YARN-10332.001.patch > > > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166 ] Bibin Chundatt edited comment on YARN-10332 at 7/1/20, 6:56 AM: [~yehuanhuan] Looks like a duplicate of YARN-10315. The current change will cause an InvalidStateTransitionException when the node is in DECOMMISSIONING state and the admin calls a node resource update, and also during node update. was (Author: bibinchundatt): [~yehuanhuan] looks like duplicate of YARN-10315. Current change is got in create InvalidStateTransitionException when Node is in decommissioning state and admin is calling node resource update.. Also during node update.. > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state > > > Key: YARN-10332 > URL: https://issues.apache.org/jira/browse/YARN-10332 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: yehuanhuan >Priority: Minor > Attachments: YARN-10332.001.patch > > > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166 ] Bibin Chundatt edited comment on YARN-10332 at 7/1/20, 6:56 AM: [~yehuanhuan] Looks like a duplicate of YARN-10315. The current change will cause an InvalidStateTransitionException when the node is in DECOMMISSIONING state and the admin calls a node resource update, and also during node update. was (Author: bibinchundatt): [~yehuanhuan] looks like duplicate of YARN-10315. > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state > > > Key: YARN-10332 > URL: https://issues.apache.org/jira/browse/YARN-10332 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: yehuanhuan >Priority: Minor > Attachments: YARN-10332.001.patch > > > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843 ] Bibin Chundatt edited comment on YARN-10335 at 7/2/20, 4:45 AM: Thank you for showing interest in the JIRA [~cyrusjackson25]. Adding what I have in mind about the health detail. NodeManager has a node health service which returns a boolean value. It sends UNHEALTHY if the node health script returns an error or if we don't have any healthy local directories. We will introduce field/fields which return a detailed node health value about the node along with the NodeHealthStatus. Example: {quote} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; optional StringIntMapProto nodeHealthdetail=5; } message StringStringMapProto { optional string key = 1; optional int32 value = 2; } keys could be - overall , ssd, non ssd, etc.. {quote} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthService. was (Author: bibinchundatt): Thank you for showing interest in the JIRA [~cyrusjackson25] Adding the thought what i have in mind about the health value. Node manager has node health service which returns a boolean value . Sends UNHEALTHY if the node health script return error / If we don't have any healthy local directories. We want to introduce field/fields which returns detailed node health value about the node along with the NodeHealthStatus. Example: {quote} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; optional StringIntMapProto nodeHealthdetail=5; } message StringStringMapProto { optional string key = 1; optional int32 value = 2; } keys could be - overall , ssd, non ssd, etc.. {quote} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthServices. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843 ] Bibin Chundatt commented on YARN-10335: --- Thank you for showing interest in the JIRA [~cyrusjackson25]. Adding the thoughts I have in mind about the health value. NodeManager has a node health service which returns a boolean value. It sends UNHEALTHY if the node health script returns an error or if we don't have any healthy local directories. We want to introduce field/fields which return a detailed node health value about the node along with the NodeHealthStatus. Example: {quote} message NodeHealthStatusProto { optional bool isHealthy = 1; optional string nodeHealthDescription = 2; optional string exceptionString = 3; optional NodeHealthDetail nodehealthDetail=4; optional StringIntMapProto nodeHealthdetail=5; } message StringStringMapProto { optional string key = 1; optional int32 value = 2; } keys could be - overall , ssd, non ssd, etc.. {quote} Also make the NodeHealthService pluggable to support custom implementations of NodeHealthService. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Cyrus Jackson >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value sent from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10335) Improve scheduling of containers based on node health
Bibin Chundatt created YARN-10335: - Summary: Improve scheduling of containers based on node health Key: YARN-10335 URL: https://issues.apache.org/jira/browse/YARN-10335 Project: Hadoop YARN Issue Type: Improvement Reporter: Bibin Chundatt YARN-7494 supports providing interface to choose nodeset for scheduler allocation. We could leverage the same to support allocation of containers based on nodehealth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10335: -- Description: YARN-7494 supports providing interface to choose nodeset for scheduler allocation. We could leverage the same to support allocation of containers based on nodehealth value was: YARN-7494 supports providing interface to choose nodeset for scheduler allocation. We could leverage the same to support allocation of containers based on nodehealth. > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on > nodehealth value -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10335) Improve scheduling of containers based on node health
[ https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10335: -- Description: YARN-7494 supports providing interface to choose nodeset for scheduler allocation. We could leverage the same to support allocation of containers based on node health value send from nodemanagers was: YARN-7494 supports providing interface to choose nodeset for scheduler allocation. We could leverage the same to support allocation of containers based on nodehealth value > Improve scheduling of containers based on node health > - > > Key: YARN-10335 > URL: https://issues.apache.org/jira/browse/YARN-10335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Priority: Major > > YARN-7494 supports providing interface to choose nodeset for scheduler > allocation. > We could leverage the same to support allocation of containers based on node > health value send from nodemanagers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
[ https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt resolved YARN-10307. --- Fix Version/s: (was: 3.1.2) Resolution: Invalid > /leveldb-timeline-store.ldb/LOCK not exist > -- > > Key: YARN-10307 > URL: https://issues.apache.org/jira/browse/YARN-10307 > Project: Hadoop YARN > Issue Type: Bug > Environment: Ubuntu 19.10 > Hadoop 3.1.2 > Tez 0.9.2 > Hbase 2.2.4 >Reporter: appleyuchi >Priority: Blocker > > $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver > > in hadoop-appleyuchi-timelineserver-Desktop.out I get > > org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: > /home/appleyuchi/[file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:|file:///home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:] > 沒有此一檔案或目錄 > at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) > at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > 2020-06-04 17:48:21,525 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state > INITED > java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > 2020-06-04 17:48:21,526 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer > failed in state INITED > org.apache.hadoop.service.ServiceStateException: > 
java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > Caused by: java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
[jira] [Reopened] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
[ https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt reopened YARN-10307: --- Reopening to set the correct resolution > /leveldb-timeline-store.ldb/LOCK not exist > -- > > Key: YARN-10307 > URL: https://issues.apache.org/jira/browse/YARN-10307 > Project: Hadoop YARN > Issue Type: Bug > Environment: Ubuntu 19.10 > Hadoop 3.1.2 > Tez 0.9.2 > Hbase 2.2.4 >Reporter: appleyuchi >Priority: Blocker > Fix For: 3.1.2 > > > $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver > > in hadoop-appleyuchi-timelineserver-Desktop.out I get > > org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: > /home/appleyuchi/[file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:|file:///home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:] > 沒有此一檔案或目錄 > at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) > at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > 2020-06-04 17:48:21,525 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state > INITED > java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > 2020-06-04 17:48:21,526 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer > failed in state INITED > 
org.apache.hadoop.service.ServiceStateException: > java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > Caused by: java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) > at
[jira] [Created] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same
Bibin Chundatt created YARN-10315: - Summary: Avoid sending RMNodeResourceupdate event if resource is same Key: YARN-10315 URL: https://issues.apache.org/jira/browse/YARN-10315 Project: Hadoop YARN Issue Type: Improvement Reporter: Bibin Chundatt When a node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is sent for every heartbeat, which results in a scheduler resource update. Avoid sending the event when the resource is unchanged. The scheduler node resource update iterates through all the queues, which is costly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
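A minimal sketch of the guard this issue proposes, assuming the event is dispatched from the node's heartbeat handling; the method shape, the {{reported}} parameter, and the {{rmNode.context}} dispatcher wiring are illustrative, not taken from an actual patch:
{code:java}
// Sketch under assumed names: dispatch RMNodeResourceUpdateEvent only when
// the reported resource differs from the node's current total capability,
// so an unchanged resource no longer triggers the costly per-queue update.
private void maybeSendResourceUpdate(RMNodeImpl rmNode, Resource reported) {
  if (reported != null && !reported.equals(rmNode.getTotalCapability())) {
    rmNode.context.getDispatcher().getEventHandler().handle(
        new RMNodeResourceUpdateEvent(rmNode.getNodeID(),
            ResourceOption.newInstance(reported,
                RMNode.OVER_COMMIT_TIMEOUT_MILLIS_DEFAULT)));
  }
}
{code}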
[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same
[ https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163246#comment-17163246 ] Bibin Chundatt commented on YARN-10315: --- +1, looks good to me. [~adam.antal] Will wait a few days before committing. > Avoid sending RMNodeResourceupdate event if resource is same > > > Key: YARN-10315 > URL: https://issues.apache.org/jira/browse/YARN-10315 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-10315.001.patch, YARN-10315.002.patch > > > When a node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is > sent for every heartbeat, which results in a scheduler resource update. > Avoid sending the event when the resource is unchanged. > The scheduler node resource update iterates through all the queues, which is > costly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10350) TestUserGroupMappingPlacementRule fails
[ https://issues.apache.org/jira/browse/YARN-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10350: -- Fix Version/s: 3.4.0 > TestUserGroupMappingPlacementRule fails > --- > > Key: YARN-10350 > URL: https://issues.apache.org/jira/browse/YARN-10350 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Akira Ajisaka >Assignee: Bilwa S T >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10350.001.patch, YARN-10350.002.patch > > > TestUserGroupMappingPlacementRule fails on trunk: > {noformat} > [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule > [ERROR] Tests run: 31, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: > 2.662 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule > [ERROR] > testResolvedQueueIsNotManaged(org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule) > Time elapsed: 0.03 s <<< ERROR! > java.lang.Exception: Unexpected exception, > expected but > was > at > org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:28) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Caused by: java.lang.AssertionError: Queue expected: but was: > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:118) > at > org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.verifyQueueMapping(TestUserGroupMappingPlacementRule.java:236) > at > org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.testResolvedQueueIsNotManaged(TestUserGroupMappingPlacementRule.java:516) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:19) > ... 18 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10369) Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG
[ https://issues.apache.org/jira/browse/YARN-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166332#comment-17166332 ] Bibin Chundatt commented on YARN-10369: --- [~Jim_Brennan] In addition to the above comment, please use {}-placeholders for logging as well. > Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG > -- > > Key: YARN-10369 > URL: https://issues.apache.org/jira/browse/YARN-10369 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: YARN-10369.001.patch > > > This message is logged at the info level, but it doesn't really add much > information. > We changed this to DEBUG internally years ago and haven't missed it. > {noformat} > 2020-07-27 21:51:29,027 INFO [RM Event dispatcher] > security.NMTokenSecretManagerInRM > (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken > for nodeId : localhost.localdomain:45454 for container : > container_1595886659189_0001_01_01 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
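For reference, a small before/after sketch of the {}-placeholder style with SLF4J; the variable names are taken loosely from the quoted log line, not from the patch:
{code:java}
// String concatenation builds the message even when DEBUG is disabled:
LOG.debug("Sending NMToken for nodeId : " + nodeId
    + " for container : " + containerId);

// {}-placeholders defer message construction until the level check passes:
LOG.debug("Sending NMToken for nodeId : {} for container : {}",
    nodeId, containerId);
{code}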
[jira] [Updated] (YARN-10208) Add capacityScheduler metric for NODE_UPDATE interval
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10208: -- Summary: Add capacityScheduler metric for NODE_UPDATE interval (was: CapacityScheduler metric for evaluating the time difference between node heartbeats) > Add capacityScheduler metric for NODE_UPDATE interval > - > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, > YARN-10208.006.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
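One possible shape for such a metric, sketched with assumed names (the per-node timestamp map and the {{addNodeHeartbeatInterval}} mutator are illustrative; the committed patch may record it differently):
{code:java}
// Sketch: on each NODE_UPDATE, record the gap since that node's previous
// update and feed it into a scheduler metric (e.g. a MutableQuantiles).
private final Map<NodeId, Long> lastNodeUpdate = new ConcurrentHashMap<>();

private void recordNodeUpdateInterval(NodeId nodeId) {
  long now = Time.monotonicNow();
  Long previous = lastNodeUpdate.put(nodeId, now);
  if (previous != null) {
    schedulerMetrics.addNodeHeartbeatInterval(now - previous);
  }
}
{code}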
[jira] [Updated] (YARN-10208) CapacityScheduler metric for evaluating the time difference between node heartbeats
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin Chundatt updated YARN-10208: -- Summary: CapacityScheduler metric for evaluating the time difference between node heartbeats (was: Add metric in CapacityScheduler for evaluating the time difference between node heartbeats) > CapacityScheduler metric for evaluating the time difference between node > heartbeats > --- > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, > YARN-10208.006.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10208) Add capacityScheduler metric for NODE_UPDATE interval
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166219#comment-17166219 ] Bibin Chundatt commented on YARN-10208: --- Missed committing this JIRA. The test case failures are not related to the attached patch. Committing shortly. > Add capacityScheduler metric for NODE_UPDATE interval > - > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, > YARN-10208.006.patch, YARN-10208.007.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160936#comment-17160936 ] Bibin Chundatt commented on YARN-10352: --- [~prabhujoseph] With the current approach we iterate through all the nodes in the partition twice. We could filter out the nodes during the {{reSortClusterNodes}} iteration instead of creating a list and then iterating over it all over again. Thoughts? One more additional filter on {{preferrednodeIterator}} while querying nodes per schedulerKey would reduce the node selection done during the 5 second sorting interval. Iterators.filter(iterator, > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch > > > When Node Recovery is Enabled, stopping a NM won't unregister it from the RM. So RM > Active Nodes will still include those stopped nodes until the NM Liveliness > Monitor expires them after the configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, > Multi Node Placement assigns containers on those nodes. It needs to > exclude the nodes which have not heartbeated for the configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to > the Asynchronous Capacity Scheduler Threads > (CapacityScheduler#shouldSkipNodeSchedule). > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running, say worker0 > 3. Stop worker0 and start any other NM, say worker1 > 4. Submit a sleep job. The containers will time out as they are assigned to the stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
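A hedged completion of the {{Iterators.filter}} idea from the comment above, using Guava; the predicate body and the exact {{shouldSkipNodeSchedule}} signature are assumptions based on the issue description:
{code:java}
import com.google.common.collect.Iterators;

// Wrap the multi-node iterator so that nodes which have not heartbeated
// within the configured interval are skipped at selection time, mirroring
// what CapacityScheduler#shouldSkipNodeSchedule does for async scheduling.
Iterator<FiCaSchedulerNode> live = Iterators.filter(nodeIterator,
    node -> !shouldSkipNodeSchedule(node, scheduler, false));
{code}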