[jira] [Comment Edited] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label

2019-09-25 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938302#comment-16938302
 ] 

Bibin Chundatt edited comment on YARN-9730 at 9/26/19 5:58 AM:
---

[~jhung]

Thank you for working on this. Sorry to come in really late too.

{quote}
240   if (ResourceRequest.ANY.equals(req.getResourceName())) {
241 SchedulerUtils.enforcePartitionExclusivity(req,
242 getRmContext().getExclusiveEnforcedPartitions(),
243 asc.getNodeLabelExpression());
244   }
{quote}

A configuration query on the AM allocation flow is going to be costly, which I 
observed while evaluating the performance.
Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since 
this is going to be invoked for every *request*?






was (Author: bibinchundatt):
[~jhung]

Thank you for working on this. Sorry to come in really late too ..

{quote}
240   if (ResourceRequest.ANY.equals(req.getResourceName())) {
241 SchedulerUtils.enforcePartitionExclusivity(req,
242 getRmContext().getExclusiveEnforcedPartitions(),
243 asc.getNodeLabelExpression());
244   }
{quote}

Configuration query on the AM allocation flow is going to be costly which i 
observed while evaluating the performance..
Could you optimize {getRmContext().getExclusiveEnforcedPartitions()} ,since 
this is going to be invoked for every *request*





> Support forcing configured partitions to be exclusive based on app node label
> -
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.addendum, 
> YARN-9730.001.patch, YARN-9730.002.addendum, YARN-9730.002.patch, 
> YARN-9730.003.patch
>
>
> Use case: queue X has all of its workload in non-default (exclusive) 
> partition P (by setting app submission context's node label set to P). Node 
> in partition Q != P heartbeats to RM. Capacity scheduler loops through every 
> application in X, and every scheduler key in this application, and fails to 
> allocate each time since the app's requested label and the node's label don't 
> match. This causes huge performance degradation when number of apps in X is 
> large.
> To fix the issue, allow RM to configure partitions as "forced-exclusive". If 
> partition P is "forced-exclusive", then:
>  * 1a. If app sets its submission context's node label to P, all its resource 
> requests will be overridden to P
>  * 1b. If app sets its submission context's node label Q, any of its resource 
> requests whose labels are P will be overridden to Q
>  * 2. In the scheduler, we add apps with node label expression P to a 
> separate data structure. When a node in partition P heartbeats to scheduler, 
> we only try to schedule apps in this data structure. When a node in partition 
> Q heartbeats to scheduler, we schedule the rest of the apps as normal.






[jira] [Commented] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label

2019-09-25 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938302#comment-16938302
 ] 

Bibin Chundatt commented on YARN-9730:
--

[~jhung]

Thank you for working on this. Sorry to come in really late too.

{quote}
240   if (ResourceRequest.ANY.equals(req.getResourceName())) {
241 SchedulerUtils.enforcePartitionExclusivity(req,
242 getRmContext().getExclusiveEnforcedPartitions(),
243 asc.getNodeLabelExpression());
244   }
{quote}

A configuration query on the AM allocation flow is going to be costly, which I 
observed while evaluating the performance.
Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since 
this is going to be invoked for every *request*?





> Support forcing configured partitions to be exclusive based on app node label
> -
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.addendum, 
> YARN-9730.001.patch, YARN-9730.002.addendum, YARN-9730.002.patch, 
> YARN-9730.003.patch
>
>
> Use case: queue X has all of its workload in non-default (exclusive) 
> partition P (by setting app submission context's node label set to P). Node 
> in partition Q != P heartbeats to RM. Capacity scheduler loops through every 
> application in X, and every scheduler key in this application, and fails to 
> allocate each time since the app's requested label and the node's label don't 
> match. This causes huge performance degradation when number of apps in X is 
> large.
> To fix the issue, allow RM to configure partitions as "forced-exclusive". If 
> partition P is "forced-exclusive", then:
>  * 1a. If app sets its submission context's node label to P, all its resource 
> requests will be overridden to P
>  * 1b. If app sets its submission context's node label Q, any of its resource 
> requests whose labels are P will be overridden to Q
>  * 2. In the scheduler, we add apps with node label expression P to a 
> separate data structure. When a node in partition P heartbeats to scheduler, 
> we only try to schedule apps in this data structure. When a node in partition 
> Q heartbeats to scheduler, we schedule the rest of the apps as normal.






[jira] [Updated] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-26 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9858:
-
Description: 
Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a hot 
code path; we need to optimize it.

Since AMS allocate is invoked by multiple handlers, locking on conf will occur:

{code}
java.lang.Thread.State: BLOCKED (on object monitor)
 at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
 - waiting to lock <0x7f1f8107c748> (a 
org.apache.hadoop.yarn.conf.YarnConfiguration)
 at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
 at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
{code}

  was:Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is 
a hot code path, need to optimize it.


> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Comment Edited] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label

2019-09-26 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938302#comment-16938302
 ] 

Bibin Chundatt edited comment on YARN-9730 at 9/26/19 6:00 AM:
---

[~jhung]

Thank you for working on this. Sorry to come in really late too.

{code}
240   if (ResourceRequest.ANY.equals(req.getResourceName())) {
241 SchedulerUtils.enforcePartitionExclusivity(req,
242 getRmContext().getExclusiveEnforcedPartitions(),
243 asc.getNodeLabelExpression());
244   }
{code}

A configuration query on the AM allocation flow is going to be costly, which I 
observed while evaluating the performance.
Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since 
this is going to be invoked for every *request*?






was (Author: bibinchundatt):
[~jhung]

Thank you for working on this. Sorry to come in really late too ..

{quote}
240   if (ResourceRequest.ANY.equals(req.getResourceName())) {
241 SchedulerUtils.enforcePartitionExclusivity(req,
242 getRmContext().getExclusiveEnforcedPartitions(),
243 asc.getNodeLabelExpression());
244   }
{quote}

Configuration query on the AM allocation flow is going to be costly which i 
observed while evaluating the performance..
Could you optimize {{getRmContext().getExclusiveEnforcedPartitions()}}, since 
this is going to be invoked for every *request*





> Support forcing configured partitions to be exclusive based on app node label
> -
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.addendum, 
> YARN-9730.001.patch, YARN-9730.002.addendum, YARN-9730.002.patch, 
> YARN-9730.003.patch
>
>
> Use case: queue X has all of its workload in non-default (exclusive) 
> partition P (by setting app submission context's node label set to P). Node 
> in partition Q != P heartbeats to RM. Capacity scheduler loops through every 
> application in X, and every scheduler key in this application, and fails to 
> allocate each time since the app's requested label and the node's label don't 
> match. This causes huge performance degradation when number of apps in X is 
> large.
> To fix the issue, allow RM to configure partitions as "forced-exclusive". If 
> partition P is "forced-exclusive", then:
>  * 1a. If app sets its submission context's node label to P, all its resource 
> requests will be overridden to P
>  * 1b. If app sets its submission context's node label Q, any of its resource 
> requests whose labels are P will be overridden to Q
>  * 2. In the scheduler, we add apps with node label expression P to a 
> separate data structure. When a node in partition P heartbeats to scheduler, 
> we only try to schedule apps in this data structure. When a node in partition 
> Q heartbeats to scheduler, we schedule the rest of the apps as normal.






[jira] [Updated] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-10-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9738:
-
Parent: YARN-9871
Issue Type: Sub-task  (was: Bug)

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch, 
> YARN-9738-003.patch
>
>
> *Env :*
> Server OS :- UBUNTU
> No. of Cluster Node:- 9120 NMs
> Env Mode:- [Secure / Non secure]Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state 
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a total of 10 
> queues
> 2. Once application submission increases (each distributed shell application 
> will call getClusterNodes)
> *ClientRMService#getClusterNodes tries to get 
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead, we can make nodes a ConcurrentHashMap and remove the read lock (see the sketch below).*
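
A minimal sketch of that idea (not the committed patch), assuming the tracker's generic 
type N extends SchedulerNode: with a ConcurrentHashMap the report path can read the map 
without taking the read lock, since completed put/remove operations from the heartbeat 
path remain visible to readers.

{code}
private final Map<NodeId, N> nodes = new ConcurrentHashMap<>();

public SchedulerNodeReport getNodeReport(NodeId nodeId) {
  // Lock-free read of the concurrent map; no ReentrantReadWriteLock needed.
  N node = nodes.get(nodeId);
  return node == null ? null : new SchedulerNodeReport(node);
}
{code}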






[jira] [Updated] (YARN-9872) DecommissioningNodesWatcher#update blocks the heartbeat processing

2019-10-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9872:
-
Parent: YARN-9871
Issue Type: Sub-task  (was: Bug)

> DecommissioningNodesWatcher#update blocks the heartbeat processing
> --
>
> Key: YARN-9872
> URL: https://issues.apache.org/jira/browse/YARN-9872
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Priority: Major
>
> ResourceTrackerService handlers getting blocked due to the synchronisation 
> at DecommissioningNodesWatcher#update






[jira] [Created] (YARN-9872) DecommissioningNodesWatcher#update blocks the heartbeat processing

2019-10-01 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-9872:


 Summary: DecommissioningNodesWatcher#update blocks the heartbeat 
processing
 Key: YARN-9872
 URL: https://issues.apache.org/jira/browse/YARN-9872
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin Chundatt


ResourceTrackerService handlers getting blocked due to the synchronisation at 
DecommissioningNodesWatcher#update






[jira] [Created] (YARN-9871) Miscellaneous scalability improvement

2019-10-01 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-9871:


 Summary: Miscellaneous scalability improvement
 Key: YARN-9871
 URL: https://issues.apache.org/jira/browse/YARN-9871
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt


This Jira is to group the issues observed during SLS tests and the improvements required.






[jira] [Updated] (YARN-9618) NodeListManager event improvement

2019-10-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9618:
-
Parent: YARN-9871
Issue Type: Sub-task  (was: Improvement)

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Priority: Critical
>
> In the current implementation, nodelistmanager events block the async dispatcher and can 
> cause an RM crash and slow down event processing.
> # Cluster restart with 1K running apps: each node-usable event will create 1K 
> events, so overall it could be 5k*1k events for a 5K-node cluster.
> # Event processing is blocked till the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler (see the sketch below).
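
A hedged sketch of that solution, for illustration only (the dedicated dispatcher name 
and wiring are assumptions, not the committed change): a separate AsyncDispatcher takes 
the per-application fan-out, and its handler invokes the RMApp state machine directly 
instead of going back through the central dispatcher.

{code}
AsyncDispatcher nodesListManagerDispatcher = new AsyncDispatcher();
nodesListManagerDispatcher.register(RMAppEventType.class,
    new EventHandler<RMAppEvent>() {
      @Override
      public void handle(RMAppEvent event) {
        RMApp app = rmContext.getRMApps().get(event.getApplicationId());
        if (app != null) {
          // Call the RMApp event handler directly, bypassing the main dispatcher.
          app.handle(event);
        }
      }
    });
nodesListManagerDispatcher.init(conf);
nodesListManagerDispatcher.start();
{code}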






[jira] [Updated] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-10-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9830:
-
Parent: YARN-9871
Issue Type: Sub-task  (was: Bug)

> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Priority: Critical
>  Labels: perfomance
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}






[jira] [Updated] (YARN-9831) NMTokenSecretManagerInRM#createNMToken blocks ApplicationMasterService allocate flow

2019-10-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9831:
-
Parent: YARN-9871
Issue Type: Sub-task  (was: Improvement)

> NMTokenSecretManagerInRM#createNMToken blocks ApplicationMasterService 
> allocate flow
> 
>
> Key: YARN-9831
> URL: https://issues.apache.org/jira/browse/YARN-9831
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Priority: Critical
>
> Currently an attempt's NMToken cannot be generated independently. 
> Each attempt's allocate flow blocks the others. We should improve this.






[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-26 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939136#comment-16939136
 ] 

Bibin Chundatt commented on YARN-9858:
--

[~jhung]

The patch could cause *exclusiveEnforcedPartitions* to get set multiple times 
in case of concurrent execution.
 It's a possibility since it is invoked by multiple handlers.

An alternative could be to set *exclusiveEnforcedPartitions* after the 
creation of the RMContext at
 # ResourceManager#serviceInit
 # ResourceManager#resetRMContext

All the active services would be stopped when we set it there, too. Thoughts?
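
A hedged sketch of that alternative, just to make the shape concrete (the configuration 
key string and the setter on RMContextImpl are assumptions for illustration): parse the 
configured partitions once, after the RMContext exists, and cache the set so the AM 
allocate hot path never touches the shared Configuration.

{code}
// In ResourceManager#serviceInit / resetRMContext, after the RMContext is created:
Set<String> exclusiveEnforcedPartitions = new HashSet<>(conf.getTrimmedStringCollection(
    "yarn.resourcemanager.exclusive-enforced-partitions")); // key name is illustrative
rmContext.setExclusiveEnforcedPartitions(exclusiveEnforcedPartitions);

// RMContextImpl#getExclusiveEnforcedPartitions() then just returns the cached set
// instead of re-reading the YarnConfiguration under its monitor lock.
{code}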

> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858.001.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939161#comment-16939161
 ] 

Bibin Chundatt commented on YARN-9858:
--

+1 for approach.

> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858.001.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Comment Edited] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-26 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939136#comment-16939136
 ] 

Bibin Chundatt edited comment on YARN-9858 at 9/27/19 5:46 AM:
---

[~jhung]

The patch could cause *exclusiveEnforcedPartitions* to get set multiple times 
in case of concurrent execution.
 It's a possibility since it is invoked by multiple handlers.

An alternative could be to set *exclusiveEnforcedPartitions* after the 
creation of the RMContext at
 # ResourceManager#serviceInit
 # ResourceManager#resetRMContext

All the active services would be in the NEW state when we set it there, too. Thoughts?


was (Author: bibinchundatt):
[~jhung]

Patch could cause  *exclusiveEnforcedPartitions* getting set multiple times 
incase of concurrent execution.
 Its a possibility since its invoked by multiple handler.

Alternative could be to set the *exclusiveEnforcedPartitions* after the 
creation of RMContext at
 # Resourcemanager#serviceInit
 # Resourcemanager#resetRMContext

All the activeservices would be in stopped when we set it too.. Thoughts?

> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858.001.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939177#comment-16939177
 ] 

Bibin Chundatt commented on YARN-9858:
--

Overall the patch looks good to me.

Minor query:

{code}
3803if (conf == null) {
3804  return new HashSet<>();
3805}
{code}

Is the check really required?

> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858.001.patch, YARN-9858.002.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Updated] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-10-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9830:
-
Attachment: YARN-9830.001.patch

> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Priority: Critical
>  Labels: perfomance
> Attachments: YARN-9830.001.patch
>
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}






[jira] [Comment Edited] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-29 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940655#comment-16940655
 ] 

Bibin Chundatt edited comment on YARN-9858 at 9/30/19 5:03 AM:
---

Thank you [~jhung]

+1, LGTM for the latest patches. I will commit it by EOD if there are no objections.


was (Author: bibinchundatt):
Thank yoy [~jhung]

+1 LGTM  for latest patches.  I will commit it by EOD if no objections.

> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858-branch-2.001.patch, 
> YARN-9858-branch-3.1.001.patch, YARN-9858-branch-3.2.001.patch, 
> YARN-9858.001.patch, YARN-9858.002.patch, YARN-9858.003.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-29 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940655#comment-16940655
 ] 

Bibin Chundatt commented on YARN-9858:
--

Thank you [~jhung]

+1, LGTM for the latest patches. I will commit it by EOD if there are no objections.

> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858-branch-2.001.patch, 
> YARN-9858-branch-3.1.001.patch, YARN-9858-branch-3.2.001.patch, 
> YARN-9858.001.patch, YARN-9858.002.patch, YARN-9858.003.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Commented] (YARN-9858) Optimize RMContext getExclusiveEnforcedPartitions

2019-09-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939245#comment-16939245
 ] 

Bibin Chundatt commented on YARN-9858:
--

I think we should fix the testcase. Setting the conf on the rmContext should solve it:
{code}
RMContext rmContext = mockRMContext(10, now - 2);
Configuration conf = new YarnConfiguration();
((RMContextImpl)rmContext).setYarnConfiguration(conf);
{code}
Also, please attach a patch for branch-2 too, to trigger Jenkins.


> Optimize RMContext getExclusiveEnforcedPartitions 
> --
>
> Key: YARN-9858
> URL: https://issues.apache.org/jira/browse/YARN-9858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9858.001.patch, YARN-9858.002.patch
>
>
> Follow-up from YARN-9730. RMContextImpl#getExclusiveEnforcedPartitions is a 
> hot code path; we need to optimize it.
> Since AMS allocate is invoked by multiple handlers, locking on conf will occur:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>  at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841)
>  - waiting to lock <0x7f1f8107c748> (a 
> org.apache.hadoop.yarn.conf.YarnConfiguration)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268)
> {code}






[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2019-09-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940667#comment-16940667
 ] 

Bibin Chundatt commented on YARN-2368:
--

[~zhuqi]

In case you want to set "jute.maxbuffer", you could probably make use of 
*YARN_RESOURCEMANAGER_OPTS*.
At application submission time the znode size is limited by YARN-5006.
IIUC, YARN-2962 helps in limiting the number of nodes under one znode hierarchy.
For attempt-level updates, some discussion is already happening in YARN-9847.



> ResourceManager failed when ZKRMStateStore tries to update znode data larger 
> than 1MB
> -
>
> Key: YARN-2368
> URL: https://issues.apache.org/jira/browse/YARN-2368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-2368.patch
>
>
> Both ResourceManagers throw out STATE_STORE_OP_FAILED events and finally 
> failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode 
> larger than 1MB, which is the default configuration of ZooKeeper server and 
> client in 'jute.maxbuffer'.
> ResourceManager (ip addr: 10.153.80.8) log shows as the following:
> {code}
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Meanwhile, ZooKeeps log shows as the following:
> {code}
> 2014-07-25 22:10:09,728 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
> Accepted socket connection from /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
> attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating 
> client: 0x247684586e70006
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 
> 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
> packet /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth 
> success /10.153.80.8:58890
> 2014-07-25 22:10:09,742 

[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940679#comment-16940679
 ] 

Bibin Chundatt commented on YARN-9847:
--

[~suxingfate]

Does the issue still persist after configuring YARN-6125 & YARN-6967?
The way I understand it, the diagnostics size is limited by the above two at the attempt 
level. 


> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to write 
> huge data into a znode. This behavior makes zk report a Len error and then causes 
> zk session connection loss. Eventually the RM would crash due to the zk 
> connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing due to this, 
> the fix limits the size of data per attempt by limiting the 
> diagnostic info when writing ApplicationAttemptStateData into the znode. The size 
> will be regulated by -Djute.maxbuffer set in yarn-env.sh. The same value will 
> be also used by the zookeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> 

[jira] [Resolved] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-30 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-9847.
--
Resolution: Invalid

As confirmed by [~suxingfate], the issue is already fixed by YARN-6967.
Closing as invalid. Please reopen if required.

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to write 
> huge data into a znode. This behavior makes zk report a Len error and then causes 
> zk session connection loss. Eventually the RM would crash due to the zk 
> connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing due to this, 
> the fix limits the size of data per attempt by limiting the 
> diagnostic info when writing ApplicationAttemptStateData into the znode. The size 
> will be regulated by -Djute.maxbuffer set in yarn-env.sh. The same value will 
> be also used by the zookeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> 

[jira] [Commented] (YARN-9768) RM Renew Delegation token thread should timeout and retry

2019-09-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940899#comment-16940899
 ] 

Bibin Chundatt commented on YARN-9768:
--

[~maniraj...@gmail.com] Thank you for working on this

Major comment

{code}
215 future = renewerService.submit(new 
DelegationTokenRenewerRunnable(evt));
216 future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
{code}

IIUC, the above implementation would reduce the multi-threaded renewal to a single 
thread, since get() is going to be a blocking call.
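
One hedged way to keep the timeout without blocking the submitting thread 
(timeoutScheduler and the surrounding names are illustrative, not part of the patch): 
submit the renewal as before, then let a scheduled task cancel it if it overruns.

{code}
Future<?> future = renewerService.submit(new DelegationTokenRenewerRunnable(evt));
timeoutScheduler.schedule(() -> {
  if (!future.isDone()) {
    // Interrupt a stuck renew call instead of making the caller wait on get().
    future.cancel(true);
  }
}, tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
{code}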

> RM Renew Delegation token thread should timeout and retry
> -
>
> Key: YARN-9768
> URL: https://issues.apache.org/jira/browse/YARN-9768
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: CR Hota
>Priority: Major
> Attachments: YARN-9768.001.patch, YARN-9768.002.patch, 
> YARN-9768.003.patch
>
>
> Delegation token renewer thread in RM (DelegationTokenRenewer.java) renews 
> HDFS tokens received to check for validity and expiration time.
> This call is made to an underlying HDFS NN or Router Node (which has exact 
> APIs as HDFS NN). If one of the nodes is bad and the renew call is stuck the 
> thread remains stuck indefinitely. The thread should ideally timeout the 
> renewToken and retry from the client's perspective.
>  






[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState

2019-10-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944333#comment-16944333
 ] 

Bibin Chundatt commented on YARN-9624:
--

Thank you [~BilwaST] for the patch.

A few comments:
 * Remove the changes to ContainerPBImpl.
 * Add a testcase to make sure that for any new field addition, the test fails if 
the util is not updated.

> Use switch case for ProtoUtils#convertFromProtoFormat containerState
> 
>
> Key: YARN-9624
> URL: https://issues.apache.org/jira/browse/YARN-9624
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Bilwa S T
>Priority: Major
>  Labels: performance
> Attachments: YARN-9624.001.patch
>
>
> On a large cluster with 100K+ containers, calling 
> {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} on every 
> heartbeat will be too costly. Update it with a switch case.
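
A minimal sketch of the suggested switch-based conversion (the proto constants C_NEW, 
C_RUNNING, C_COMPLETE are assumed from the existing prefix-stripping logic; the fallback 
keeps behaviour for any state not listed):

{code}
public static ContainerState convertFromProtoFormat(ContainerStateProto e) {
  switch (e) {
  case C_NEW:
    return ContainerState.NEW;
  case C_RUNNING:
    return ContainerState.RUNNING;
  case C_COMPLETE:
    return ContainerState.COMPLETE;
  default:
    // Fall back to the existing string-based conversion for unlisted states.
    return ContainerState.valueOf(
        e.name().replace(CONTAINER_STATE_PREFIX, ""));
  }
}
{code}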






[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-10-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944336#comment-16944336
 ] 

Bibin Chundatt commented on YARN-9738:
--

[~sunilg] Could you please take a look?

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch, 
> YARN-9738-003.patch
>
>
> *Env :*
> Server OS :- UBUNTU
> No. of Cluster Node:- 9120 NMs
> Env Mode:- [Secure / Non secure]Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state 
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across a total of 10 
> queues
> 2. Once application submission increases (each distributed shell application 
> will call getClusterNodes)
> *ClientRMService#getClusterNodes tries to get 
> ClusterNodeTracker#getNodeReport, where the nodes map is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead, we can make the nodes map a ConcurrentHashMap and remove the read lock.*
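
A rough sketch of that idea follows; the field and method shapes here are illustrative, not the actual YARN-9738 patch:
{code}
// With a ConcurrentHashMap, a single-entry lookup is already thread-safe, so
// getNodeReport() no longer needs the shared read lock that was blocking
// application submission.
private final ConcurrentHashMap<NodeId, N> nodes = new ConcurrentHashMap<>();

public SchedulerNodeReport getNodeReport(NodeId nodeId) {
  N node = nodes.get(nodeId);
  return node == null ? null : new SchedulerNodeReport(node);
}
{code}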



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962888#comment-16962888
 ] 

Bibin Chundatt commented on YARN-9940:
--

[~kailiu_dev]

Apologies, I thought this issue was a duplicate of YARN-8436 and that you had closed it 
because of that.
The Fixed/Resolved state is set only if the changes have gone into 3.2.0. Since that is not 
the case, we have to keep the issue open.

Please refer to: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
Reopening the issue.









> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9940:
-
Fix Version/s: (was: 3.2.0)

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-9940:
--

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-11-10 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971342#comment-16971342
 ] 

Bibin Chundatt edited comment on YARN-9697 at 11/11/19 7:03 AM:


[~abmodi]

Few minor nits:

# NodeQueueLoadMonitor: the following initialization is not required, as it is already set in 
the constructor.
{code}
  private int numNodesForAnyAllocation =
  DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED;
{code}
Initializing it to zero should be fine.
# EnrichedResourceRequest: rename the methods since we are returning maps now.
Improvement:
# CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can 
you maintain a counter to avoid iterating through the allocations for each 
scheduler key?
{noformat}
152 for (List allocs : allocations.values()) {
153   totalAllocated += allocs.size();
154 }
{noformat}


was (Author: bibinchundatt):
[~abmodi]

Few minor Nits:

# NodeQueueLoadMonitor  following set is not required , already getting set in 
constructor
{code}
  private int numNodesForAnyAllocation =
  DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED;
{code}
# EnrichedResourceRequest : rename methods since we are returning maps now.
Improvement:
# CentralizedOpportunisticContainerAllocator # allocatePerSchedulerKey :  Can 
you maintain a metrics to avoid iterating through allocations for each 
scheduler key
{noformat}
152 for (List allocs : allocations.values()) {
153   totalAllocated += allocs.size();
154 }
{noformat}

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, 
> YARN-9697.ut.patch, YARN-9697.ut2.patch, YARN-9697.wip1.patch, 
> YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-11-10 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971342#comment-16971342
 ] 

Bibin Chundatt commented on YARN-9697:
--

[~abmodi]

Few minor nits:

# NodeQueueLoadMonitor: the following initialization is not required, as it is already set in 
the constructor.
{code}
  private int numNodesForAnyAllocation =
  DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED;
{code}
# EnrichedResourceRequest: rename the methods since we are returning maps now.
Improvement:
# CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can 
you maintain a counter to avoid iterating through the allocations for each 
scheduler key (an illustrative sketch follows the quoted snippet below)?
{noformat}
152 for (List allocs : allocations.values()) {
153   totalAllocated += allocs.size();
154 }
{noformat}
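
For illustration, a self-contained sketch of the incremental-counting idea. The class and method names are generic placeholders, not the YARN classes:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maintain the total as allocations are added, instead of re-iterating
// allocations.values() for every scheduler key.
class AllocationCounter<K, T> {
  private final Map<K, List<T>> allocations = new HashMap<>();
  private int totalAllocated = 0;

  void add(K schedulerKey, List<T> allocs) {
    allocations.computeIfAbsent(schedulerKey, k -> new ArrayList<>())
        .addAll(allocs);
    totalAllocated += allocs.size();   // updated incrementally
  }

  int getTotalAllocated() {
    return totalAllocated;
  }
}
{code}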

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, 
> YARN-9697.ut.patch, YARN-9697.ut2.patch, YARN-9697.wip1.patch, 
> YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962888#comment-16962888
 ] 

Bibin Chundatt edited comment on YARN-9940 at 10/30/19 2:04 PM:


[~kailiu_dev]

Apologies, I thought this issue was a duplicate of YARN-8436 and that you had closed it based 
on that.
The Fixed/Resolved state is set only if the changes have gone into 3.2.0.

If that is not the case, we have to keep the issue open.

Please refer to: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute

Reopening the issue.










was (Author: bibinchundatt):
[~kailiu_dev]

Apologies i thought issue is duplicate of YARN-8436 and you have close due to 
that.
Fixed and resolved is only if the changes has gone into 3.2.0.  Its that is not 
the case we have to keep the issue open.

Please refer : 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
Reopening the issue 









> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961033#comment-16961033
 ] 

Bibin Chundatt commented on YARN-2442:
--

Thank you [~cyrusjackson25] for the updated patch.

Overall the patch looks good to me. +1 for YARN-2442.003.patch. Will wait 
for a day for others to take a look.

cc:// [~rohithsharma]



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.004.patch, YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState

2019-10-17 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954285#comment-16954285
 ] 

Bibin Chundatt commented on YARN-9624:
--

[~BilwaST] Could you please update the patch?

> Use switch case for ProtoUtils#convertFromProtoFormat containerState
> 
>
> Key: YARN-9624
> URL: https://issues.apache.org/jira/browse/YARN-9624
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Bilwa S T
>Priority: Major
>  Labels: performance
> Attachments: YARN-9624.001.patch
>
>
> On a large cluster with 100K+ containers, calling 
> {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} on 
> every heartbeat is too costly. Update it to use a switch case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961033#comment-16961033
 ] 

Bibin Chundatt edited comment on YARN-2442 at 10/28/19 8:32 PM:


Thank you [~cyrusjackson25] for the updated patch.

Overall the patch looks good to me. +1 for YARN-2442.004.patch. Will wait 
for a day for others to take a look.

cc:// [~rohithsharma]




was (Author: bibinchundatt):
Thank you [~cyrusjackson25] for updated patch.

Over all the patch looks good to me . +1  for YARN-2443.003.patch . Will wait 
for a day for others to take a look.

cc:// [~rohithsharma]



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.004.patch, YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-29 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-9940:
--

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-29 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-9940.
--
Target Version/s:   (was: 2.7.2)
  Resolution: Duplicate

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-10-22 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956962#comment-16956962
 ] 

Bibin Chundatt commented on YARN-9697:
--

Thank you [~abmodi] for updating the patch.

Few comments and suggestions:

# OpportunisticContainerAllocatorAMService -> NodeQueueLoadMonitor init could 
be moved to AbstractService#serviceInit
# NodeQueueLoadMonitor: ScheduledExecutorService#scheduledExecutor shutdown is not 
done
# NodeQueueLoadMonitor#nodeIdsByRack: do we need the NodeIds to be sorted?
# Thoughts on replacing NodeQueueLoadMonitor#addIntoNodeIdsByRack as follows:
{code}
  private void addIntoNodeIdsByRack(RMNode addedNode) {
    nodeIdsByRack.compute(addedNode.getRackName(), (k, v) -> v == null ?
        new ConcurrentHashMap().newKeySet() :
        v).add(addedNode.getNodeID());
  }
{code}
# We could think of replacing NodeQueueLoadMonitor#removeFromNodeIdsByRack too, 
using computeIfPresent (see the sketch below).

Not related to the patch:

# OpportunisticSchedulerMetrics: shouldn't we have a destroy() method to 
reset the counters? During switchover I think we should reset the counters.
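
A possible shape for the computeIfPresent-based removal mentioned in point 5, mirroring the add() snippet above. This is only an illustrative sketch, not the actual patch:
{code}
  private void removeFromNodeIdsByRack(RMNode removedNode) {
    nodeIdsByRack.computeIfPresent(removedNode.getRackName(),
        (rack, nodeIds) -> {
          nodeIds.remove(removedNode.getNodeID());
          // Returning null drops the rack entry once its last node is gone.
          return nodeIds.isEmpty() ? null : nodeIds;
        });
  }
{code}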

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.ut.patch, 
> YARN-9697.ut2.patch, YARN-9697.wip1.patch, YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9935) SSLHandshakeException thrown when HTTPS is enabled in AM web server in one certain condition

2019-10-26 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9935:
-
Component/s: (was: amrmproxy)

> SSLHandshakeException thrown when HTTPS is enabled in AM web server in one 
> certain condition
> 
>
> Key: YARN-9935
> URL: https://issues.apache.org/jira/browse/YARN-9935
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sushanta Sen
>Priority: Major
>
> 【Precondition】:
> 1. Install the cluster
> 2. *{color:#4C9AFF}WebAppProxyServer service installed in 1 VM and RMs 
> installed in 2 VMs{color}*
> 3. Enable all the required HTTPS configuration 
> yarn.resourcemanager.application-https.policy
> STRICT
> yarn.app.mapreduce.am.webapp.https.enabled
> true
> yarn.app.mapreduce.am.webapp.https.client.auth
> true
> 4. RM HA enabled
> 5. *{color:#4C9AFF}Active RM is running in VM2, standby in VM1{color}*
> 6. Cluster should be up and running
> 【Test step】:
> 1.Submit an application
> 2. Open Application Master link from the applicationID from RM UI
> 【Expect Output】:
> No error should be thrown and the Job should be successful
> 【Actual Output】:
> SSLHandshakeException is thrown, although the Job is successful.
> "javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9926) RM multi-thread event processing mechanism

2019-10-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-9926.
--
Resolution: Duplicate

> RM multi-thread event processing mechanism
> --
>
> Key: YARN-9926
> URL: https://issues.apache.org/jira/browse/YARN-9926
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: hcarrot
>Priority: Minor
> Attachments: RM multi-thread event processing mechanism.pdf
>
>
> Recently, we have observed serious event blocking in the RM event dispatcher 
> queue. After analyzing RM event monitoring data and the RM event processing 
> logic, we found that RMNodeStatusEvents make up a smaller proportion of events 
> than other types, but their overall processing time is higher. 
> Meanwhile, RM event processing runs in single-thread mode, which degrades 
> the RM's performance. So we propose an RM multi-thread event 
> processing mechanism to improve RM performance. Is this mechanism feasible?
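
For readers unfamiliar with the mechanism, a generic sketch of hash-partitioned multi-threaded event dispatch. This is only an illustration of the general idea, not the design in the attached PDF:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Fan events out to several single-threaded workers, hashing on a key so
// that events for the same entity are still handled in order.
class MultiThreadedDispatcher<E> {
  private final ExecutorService[] workers;

  MultiThreadedDispatcher(int threads) {
    workers = new ExecutorService[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = Executors.newSingleThreadExecutor();
    }
  }

  void dispatch(Object key, E event, Consumer<E> handler) {
    int idx = Math.floorMod(key.hashCode(), workers.length);
    workers[idx].execute(() -> handler.accept(event));
  }
}
{code}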



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957644#comment-16957644
 ] 

Bibin Chundatt commented on YARN-2442:
--

Thank you [~cyrusjackson25] for working on the patch.

Currently RMInfo holds a reference to RMContext, which could lead to a 
memory leak on switchover. Instead, we could use the ResourceManager object 
directly.
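
A rough sketch of what that could look like; the bean and interface names here are assumptions for illustration, not the actual patch:
{code}
public class RMInfo implements RMInfoMXBean {   // RMInfoMXBean is assumed here
  private final ResourceManager rm;             // hold the RM, not the RMContext

  RMInfo(ResourceManager rm) {
    this.rm = rm;
  }

  @Override
  public String getHAState() {
    // Read the current context on every call, so nothing from a previous
    // active/standby cycle is retained by the bean.
    return rm.getRMContext().getHAServiceState().name();
  }
}
{code}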



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957644#comment-16957644
 ] 

Bibin Chundatt edited comment on YARN-2442 at 10/23/19 8:16 AM:


Thank you [~cyrusjackson25] for working on the patch.

# Currently RMInfo holds a reference to RMContext, which could lead to a 
memory leak on switchover. Instead, we could use the ResourceManager instance 
directly.
# Fix the checkstyle issues.
# The findbugs issue seems to be already fixed.




was (Author: bibinchundatt):
Thank you  [~cyrusjackson25] for working on the patch

Currently RMInfo is holding the reference of RMContext which could lead to 
memory leak on switch over. Instead we could use ResourceManager object 
directly.



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-11-11 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972097#comment-16972097
 ] 

Bibin Chundatt commented on YARN-9697:
--

Thank you [~abmodi]

Overall the patch looks good to me.

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, 
> YARN-9697.009.patch, YARN-9697.ut.patch, YARN-9697.ut2.patch, 
> YARN-9697.wip1.patch, YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-10-03 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reassigned YARN-9830:


Assignee: Bibin Chundatt

> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Critical
>  Labels: perfomance
> Attachments: YARN-9830.001.patch
>
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9830) Improve ContainerAllocationExpirer it blocks scheduling

2019-10-11 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949904#comment-16949904
 ] 

Bibin Chundatt commented on YARN-9830:
--

[~sunil.gov...@gmail.com] Could you take a look?



> Improve ContainerAllocationExpirer it blocks scheduling
> ---
>
> Key: YARN-9830
> URL: https://issues.apache.org/jira/browse/YARN-9830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Critical
>  Labels: perfomance
> Attachments: YARN-9830.001.patch
>
>
> {quote}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor.register(AbstractLivelinessMonitor.java:106)
> - waiting to lock <0x7fa348749550> (a 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:601)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$AcquiredTransition.transition(RMContainerImpl.java:592)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x7fc8852f8200> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6924) Metrics for Federation AMRMProxy

2020-02-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046785#comment-17046785
 ] 

Bibin Chundatt commented on YARN-6924:
--

[~youchen]

Overall the patch looks good.

Minor nits:

* The annotation and the method signature should be on different lines.
* The same applies to the variables in AMRMProxyMetrics.
* Since the test cases are in the same package, the visibility of the get methods could 
be package-private.

> Metrics for Federation AMRMProxy
> 
>
> Key: YARN-6924
> URL: https://issues.apache.org/jira/browse/YARN-6924
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-6924.01.patch, YARN-6924.01.patch, 
> YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, YARN-6924.04.patch
>
>
> This JIRA proposes addition of metrics for Federation AMRMProxy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6924) Metrics for Federation AMRMProxy

2020-02-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046785#comment-17046785
 ] 

Bibin Chundatt edited comment on YARN-6924 at 2/27/20 4:33 PM:
---

[~youchen]

Overall the patch looks good.

Minor nits:

* The annotation and the method signature should be on different lines.
* The same applies to the variables in AMRMProxyMetrics.
* Since the test cases are in the same package, the visibility of the get methods could 
be package-private.
* Correct the Apache source file copyright headers too.


was (Author: bibinchundatt):
[~youchen]

Over all the patch looks good.. 

Minor nits : 

* Annotation and the method signature to be in different lines
* Same applies for the variables too in AMRMProxyMetrics.
* Since the testcase are in same package the visibility for get methods could 
be package private.

> Metrics for Federation AMRMProxy
> 
>
> Key: YARN-6924
> URL: https://issues.apache.org/jira/browse/YARN-6924
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-6924.01.patch, YARN-6924.01.patch, 
> YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, YARN-6924.04.patch
>
>
> This JIRA proposes addition of metrics for Federation AMRMProxy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6924) Metrics for Federation AMRMProxy

2020-03-02 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049835#comment-17049835
 ] 

Bibin Chundatt commented on YARN-6924:
--

[~youchen]

Overall the patch looks good to me. Will wait for a day before committing.

> Metrics for Federation AMRMProxy
> 
>
> Key: YARN-6924
> URL: https://issues.apache.org/jira/browse/YARN-6924
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-6924.01.patch, YARN-6924.01.patch, 
> YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, 
> YARN-6924.04.patch, YARN-6924.05.patch
>
>
> This JIRA proposes addition of metrics for Federation AMRMProxy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10098) Add interface to get node iterators by scheduler key for AppPlacementAllocator

2020-01-23 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10098.
---
Resolution: Invalid

> Add interface to get node iterators by scheduler key for AppPlacementAllocator
> --
>
> Key: YARN-10098
> URL: https://issues.apache.org/jira/browse/YARN-10098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10110) In Federation Secure cluster Application submission fails when authorization is enabled

2020-02-18 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039688#comment-17039688
 ] 

Bibin Chundatt commented on YARN-10110:
---

[~BilwaST] Could you point me to the JIRA that adds Federation security support?

Also, it is better to group all the security-related Federation work under one 
subtask and link it to YARN-5597.

> In Federation Secure cluster Application submission fails when authorization 
> is enabled
> ---
>
> Key: YARN-10110
> URL: https://issues.apache.org/jira/browse/YARN-10110
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sushanta Sen
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-10110.001.patch, YARN-10110.002.patch
>
>
> 【Precondition】:
> 1. Secure Federated cluster is available
> 2. Add the below configuration in Router and client core-site.xml
> hadoop.security.authorization= true 
> 3. Restart the router service
> 【Test step】:
> 1. Go to router client bin path and submit a MR PI job
> 2. Observe the client console screen
> 【Expect Output】:
> No error should be thrown and Job should be successful
> 【Actual Output】:
> Job failed with "Protocol interface 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB is not known.,"
> 【Additional Note】:
>  But on setting the parameter to false, the job is submitted and succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-01-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reassigned YARN-4575:


Assignee: (was: Bibin Chundatt)

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch
>
>
> ApplicationResourceUsageReport currently reports reserved resources only for the default 
> partition; it should report them for all partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10098) AppPlacementAllocator get getPreferredNodeIterator based on scheduler key

2020-01-22 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10098:
-

 Summary:  AppPlacementAllocator get getPreferredNodeIterator based 
on scheduler key
 Key: YARN-10098
 URL: https://issues.apache.org/jira/browse/YARN-10098
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10098) AppPlacementAllocator getPreferredNodeIterator based on scheduler key

2020-01-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10098:
--
Summary:  AppPlacementAllocator getPreferredNodeIterator based on scheduler 
key  (was:  AppPlacementAllocator get getPreferredNodeIterator based on 
scheduler key)

>  AppPlacementAllocator getPreferredNodeIterator based on scheduler key
> --
>
> Key: YARN-10098
> URL: https://issues.apache.org/jira/browse/YARN-10098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10098) Add interface to get node iterators by scheduler key for AppPlacementAllocator

2020-01-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10098:
--
Summary: Add interface to get node iterators by scheduler key for 
AppPlacementAllocator  (was:  AppPlacementAllocator getPreferredNodeIterator 
based on scheduler key)

> Add interface to get node iterators by scheduler key for AppPlacementAllocator
> --
>
> Key: YARN-10098
> URL: https://issues.apache.org/jira/browse/YARN-10098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState

2020-01-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009536#comment-17009536
 ] 

Bibin Chundatt commented on YARN-9624:
--

[~BilwaST] Could you update the patch?

> Use switch case for ProtoUtils#convertFromProtoFormat containerState
> 
>
> Key: YARN-9624
> URL: https://issues.apache.org/jira/browse/YARN-9624
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Bilwa S T
>Priority: Major
>  Labels: performance
> Attachments: YARN-9624.001.patch, YARN-9624.002.patch
>
>
> On a large cluster with 100K+ containers, calling 
> {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} on 
> every heartbeat is too costly. Update it to use a switch case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-04-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076865#comment-17076865
 ] 

Bibin Chundatt commented on YARN-10208:
---

Thank you [~adam.antal] for the additional review. Will wait for a day before 
committing.

> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-10182) SLS run fails with: Couldn't create /yarn-leader-election/yarnRM

2020-03-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10182:
---

> SLS run fails with: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh ,I will get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml ,I will get  "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How to resolve it ?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh on RM1 ,I will get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml ,I will get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How to resolve it ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10182) SLS run fails with: Couldn't create /yarn-leader-election/yarnRM

2020-03-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10182.
---
Resolution: Not A Problem

> SLS run fails with: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh ,I will get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml ,I will get  "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How to resolve it ?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh on RM1 ,I will get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml ,I will get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How to resolve it ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9627) DelegationTokenRenewer could block transitionToStandy

2020-03-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reassigned YARN-9627:


Assignee: (was: Bibin Chundatt)

> DelegationTokenRenewer could block transitionToStandy
> -
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Priority: Critical
> Attachments: YARN-9627.001.patch, YARN-9627.002.patch, 
> YARN-9627.003.patch
>
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) while performing 
> RM switchover
> Exception below:
> {noformat}
> 2019-06-13 17:39:27,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token 
> for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, 
> realUser=, issueDate=1560361265181, maxDate=1560966065181, 
> sequenceNumber=104708, masterKeyId=3);exp=1560533965360; 
> apps=[application_1560346941775_20702] in 86397766 ms, appId = 
> [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error 
> occurred for the packet 'clientPath:null serverPath:null finished:false 
> header:: 27,4  replyHeader:: 27,4295687588,0  request:: 
> '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F
>   response:: 
> #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577}
>  '.
> 2019-06-13 17:58:20,877 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 
> X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN 
> owner=root/had...@hadoop.com, renewer=yarn, realUser=, 
> issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, 
> masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:397)
> at java.util.Timer.schedule(Timer.java:208)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-03-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068603#comment-17068603
 ] 

Bibin Chundatt commented on YARN-10208:
---

[~lapjarn]
Minor comment.

{code}
1834    // Add metrics for evaluating the time difference between heartbeats.
1835    SchedulerNode node =
1836        nodeTracker.getNode(nodeUpdatedEvent.getRMNode().getNodeID());
1837    if (node != null) {
1838      long lastInterval =
1839          Time.monotonicNow() - node.getLastHeartbeatMonotonicTime();
1840      CapacitySchedulerMetrics.getMetrics()
1841          .addSchedulerHeartBeatIntervalAverage(lastInterval);
1842    }
{code}
Refactor this block into a helper method and update the metric before the node update call (a sketch follows).
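
For reference, a minimal sketch of the extracted helper, assuming it lives in CapacityScheduler next to the node update handling; the method name is illustrative, the calls mirror the quoted patch:
{code}
// Hypothetical helper extracted from the quoted block; invoke it just before
// the node update is processed so the interval reflects the gap between
// consecutive heartbeats from the same node.
private void updateNodeHeartbeatIntervalMetric(RMNode rmNode) {
  SchedulerNode node = nodeTracker.getNode(rmNode.getNodeID());
  if (node != null) {
    long lastInterval =
        Time.monotonicNow() - node.getLastHeartbeatMonotonicTime();
    CapacitySchedulerMetrics.getMetrics()
        .addSchedulerHeartBeatIntervalAverage(lastInterval);
  }
}
{code}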


> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10172) Default ApplicationPlacementType class should be configurable

2020-03-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068619#comment-17068619
 ] 

Bibin Chundatt commented on YARN-10172:
---

[~cyrusjackson25]

Please check the checkstyle issues. Apart from that, the changes look good to me.

{code}
184 String DEFAULT_APPLICATION_PLACEMENT_TYPE_CLASS = "org.apache.hadoop.yarn."
185     + "server.resourcemanager.scheduler.capacity."
186     + "yarnpp.YarnppLocalityAppPlacementAllocator";
{code}
# Rename YarnppLocalityAppPlacementAllocator -> DummyLocalityAppPlacementAllocator
# The package name could also be shorter.


[~sunil.gov...@gmail.com] Would you like to take a look?
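
As a side note, a rough sketch of how a configurable default placement allocator could be resolved; the configuration key and the surrounding method are illustrative assumptions, not the patch's actual names:
{code}
// Illustrative only: resolve the placement allocator class from configuration
// and fall back to the existing LocalityAppPlacementAllocator.
private AppPlacementAllocator createAppPlacementAllocator(Configuration conf) {
  Class<? extends AppPlacementAllocator> clazz = conf.getClass(
      "yarn.scheduler.app-placement-allocator.class",   // assumed key name
      LocalityAppPlacementAllocator.class, AppPlacementAllocator.class);
  return ReflectionUtils.newInstance(clazz, conf);
}
{code}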

> Default ApplicationPlacementType class should be configurable
> -
>
> Key: YARN-10172
> URL: https://issues.apache.org/jira/browse/YARN-10172
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Cyrus Jackson
>Assignee: Cyrus Jackson
>Priority: Minor
> Attachments: YARN-10172.001.patch
>
>
> This can be useful in scheduling apps based on the configured placement type 
> class rather than resorting to LocalityAppPlacementAllocator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-03-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071476#comment-17071476
 ] 

Bibin Chundatt commented on YARN-10208:
---

+1 looks good to me. 

> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-03-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070748#comment-17070748
 ] 

Bibin Chundatt commented on YARN-10208:
---

[~lapjarn]

Minor nit:

Rename the schedulerHeartBeatIntervalAverage variable and method to 
schedulerNodeHBInterval.

> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10229) [Federation] Client should be able to submit application to RM directly using normal client conf

2020-05-03 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098655#comment-17098655
 ] 

Bibin Chundatt commented on YARN-10229:
---

[~BilwaST] /[~122512...@qq.com]

NodeManagers need to stay independent of the applications; parsing 
application-specific details on the NodeManager side is not recommended.

Alternate solution:

Currently AMRMProxyService always overrides the AMRMToken. If the interceptors 
could signal whether the AMRMToken needs to be overridden, then the submission 
should work. In this case the FederationInterceptor could check whether the 
home application entry is available in the federation state store.

Thoughts?
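
To make the idea concrete, a hedged sketch of the kind of check the FederationInterceptor could expose; the method name and the facade lookup are assumptions for illustration, not existing API:
{code}
// Illustrative only: the interceptor tells AMRMProxyService whether the
// AMRMToken should be overridden for this application.
public boolean shouldOverrideAMRMToken(ApplicationId appId) {
  try {
    // Assumed facade lookup: an entry exists only for apps routed through the
    // federation Router, so direct RM submissions keep the client's own token.
    SubClusterId home = federationFacade.getApplicationHomeSubCluster(appId);
    return home != null;
  } catch (YarnException e) {
    // No home sub-cluster recorded; skip the override.
    return false;
  }
}
{code}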

> [Federation] Client should be able to submit application to RM directly using 
> normal client conf
> 
>
> Key: YARN-10229
> URL: https://issues.apache.org/jira/browse/YARN-10229
> Project: Hadoop YARN
>  Issue Type: Wish
>  Components: amrmproxy, federation
>Affects Versions: 3.1.1
>Reporter: JohnsonGuo
>Assignee: Bilwa S T
>Priority: Major
>
> Scenario: When enable the yarn federation feature with multi yarn clusters, 
> one can submit their job to yarn-router by *modified* their client 
> configuration with yarn router address.
> But if one still wants to submit their jobs via the original client (before 
> enable federation) to RM directly, it will encounter the AMRMToken exception. 
>  That means once enable federation ,if some one want to submit job, they have 
> to  modify the client conf.
>  
> one possible solution for this Scenario is:
> In NodeManger, when the client ApplicationMaster request comes:
>  * get the client job.xml  from HDFS "".
>  * parse the "yarn.resourcemanager.scheduler.address" parameter in job.xml
>  * if the value of the parameter is "localhost:8049"(AMRM address),then do 
> the AMRMToken valid process
>  * if the value of the parameter is "rm:port"(rm address),then skip the 
> AMRMToken valid process
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10246) Enable YARN Router to have a dedicated Zookeeper

2020-05-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099735#comment-17099735
 ] 

Bibin Chundatt commented on YARN-10246:
---

[~dmmkr]

In a non-secure cluster this could work with a different configuration file, 
but I am not sure how it would work in a secure cluster.
Does Curator support multiple Kerberos configurations in the same process (the 
RM is the process here)? The RM has to connect to both the federation state 
store and the RM state store.

IIRC the Curator version we use doesn't support that.
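
For the non-secure case, a minimal sketch of what a dedicated federation ZK client could look like; the hadoop.federation.zk.address key comes from the issue description, the rest is illustrative. The secure-cluster concern stands because the ZooKeeper client's SASL/JAAS settings are effectively JVM-wide:
{code}
// Non-secure sketch only: a second Curator client pointed at a dedicated
// federation ZooKeeper ensemble.
CuratorFramework createFederationCurator(Configuration conf) {
  String zkAddress = conf.get("hadoop.federation.zk.address",
      conf.get("hadoop.zk.address"));
  CuratorFramework client = CuratorFrameworkFactory.builder()
      .connectString(zkAddress)
      .retryPolicy(new ExponentialBackoffRetry(1000, 3))
      .build();
  client.start();
  return client;
}
{code}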

> Enable YARN Router to have a dedicated Zookeeper
> 
>
> Key: YARN-10246
> URL: https://issues.apache.org/jira/browse/YARN-10246
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation, router
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10246.001.patch, YARN-10246.002.patch
>
>
> Currently, we have a single parameter hadoop.zk.address for Router and 
> Resourcemanager, Due to this we need have FederationStateStore and 
> RMStateStore on the same Zookeeper instance. 
> With the above topology there can be a load on ZooKeeper, since all 
> subcluster RMs will write to single ZooKeeper.
> So, If we Introduce a new configuration such as hadoop.federation.zk.address 
> we can have FederationStateStore on a dedicated Zookeeper.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10246) Enable YARN Router to have a dedicated Zookeeper

2020-05-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099735#comment-17099735
 ] 

Bibin Chundatt edited comment on YARN-10246 at 5/5/20, 9:48 AM:


[~dmmkr]

In a non-secure cluster this could work with different property names, but I 
am not sure how it would work in a secure cluster.
Does Curator support multiple Kerberos configurations in the same process (the 
RM is the process here)? The RM has to connect to both the federation state 
store and the RM state store.

IIRC the Curator version we use doesn't support that.


was (Author: bibinchundatt):
[~dmmkr] 

In case of non secure cluster this could work with a different configuration 
file.. But i am not sure how this could work in secure cluster.
Does curator support multiple kerboros configuration  in same process (RM is 
the process here.) RM has to connect to Federation state store and also RM 
State Store..

IIRC the version of curator doesnt support the same. 

> Enable YARN Router to have a dedicated Zookeeper
> 
>
> Key: YARN-10246
> URL: https://issues.apache.org/jira/browse/YARN-10246
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation, router
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10246.001.patch, YARN-10246.002.patch
>
>
> Currently, we have a single parameter hadoop.zk.address for Router and 
> Resourcemanager, Due to this we need have FederationStateStore and 
> RMStateStore on the same Zookeeper instance. 
> With the above topology there can be a load on ZooKeeper, since all 
> subcluster RMs will write to single ZooKeeper.
> So, If we Introduce a new configuration such as hadoop.federation.zk.address 
> we can have FederationStateStore on a dedicated Zookeeper.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101380#comment-17101380
 ] 

Bibin Chundatt commented on YARN-10259:
---

In addition to the above:

I think the issue also exists in *LeafQueue#allocateFromReservedContainer*. We 
only try the container allocation from the first node we get while iterating 
through the whole candidate set.
Change to the previous logic.

Issue: a container gets unreserved on node1, and then we reserve on node1 again 
during allocation. Nodes at the end of the list that hold reserved containers 
might never get a chance to allocate or unreserve.

This impacts the performance of multi-node lookup too. The AsyncSchedulerThread 
gives each node a fair chance to unreserve/allocate from its reserved container.
Suggestion: if a reserved container exists, attempt the allocation with a single-node candidate set (a rough sketch follows).

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: REPRO_TEST.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101380#comment-17101380
 ] 

Bibin Chundatt edited comment on YARN-10259 at 5/7/20, 6:00 AM:


In addition to the above:

I think the issue also exists in *LeafQueue#allocateFromReservedContainer*. We 
only try the container allocation from the first node we get while iterating 
through the whole candidate set.
Change to the previous logic.

Issue: a container gets unreserved on node1, and then we reserve on node1 again 
during allocation. Nodes at the end of the list that hold reserved containers 
might never get a chance to allocate or unreserve.

This impacts the performance of multi-node lookup too. The *AsyncSchedulerThread* 
gives all nodes a fair chance to unreserve/allocate for reserved containers.
Suggestion: if a reserved container exists, attempt the allocation with a single-node candidate set.



was (Author: bibinchundatt):
In addition to the above.. 

I think the issue exists in the *LeafQueue#allocateFromReservedContainer* .. We 
do try the container allocation from first node we get iterating through all 
the candidate set.
Change to previous logic. 

 Issue -. container gets unreserved on node1. then again we reserve on node 1 
during allocation .. The nodes in the last in list with reserved containers  
might never get a chance to do allocation./ unreservation.

This impacts performance of multiNodelookup too. AsyncSchedulerThread give a 
fair change to each node to do unreserve/allocate from reserved container.
Attempt allocation if reserved container exists with a single candidate nodeset.


> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: REPRO_TEST.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10181) Managing Centralized Node Attribute via RMWebServices.

2020-03-17 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060849#comment-17060849
 ] 

Bibin Chundatt commented on YARN-10181:
---

Could you move this jira as part of YARN-8766 ? 

> Managing Centralized Node Attribute via RMWebServices.
> --
>
> Key: YARN-10181
> URL: https://issues.apache.org/jira/browse/YARN-10181
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodeattibute
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> Currently Centralized NodeAttributes can be managed only through Yarn 
> NodeAttribute CLI. This is to support via RMWebServices.
> {code}
> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeAttributes.html#Centralised_Node_Attributes_mapping.
> Centralised : Node to attributes mapping can be done through RM exposed CLI 
> or RPC (REST is yet to be supported).
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100769#comment-17100769
 ] 

Bibin Chundatt commented on YARN-10259:
---

[~prabhujoseph]

I think we have a few issues in RegularContainerAllocator#allocate:

 # Only when the preCheckForNodeCandidateSet check fails for *appInfo.precheckNode* should we continue the iteration over the next set of nodes.
 # If preCheckForNodeCandidateSet returns null, try the allocation.
 # In all other cases, return the result of preCheckForNodeCandidateSet(..).
 # If we have a reserved container and the pending ask for its scheduler key is zero, unreserve the container:
{code}
if (application.getOutstandingAsksCount(schedulerKey) == 0) {
  // Release
  return new ContainerAllocation(reservedContainer, null,
      AllocationState.QUEUE_SKIPPED);
}
{code}
 # In *schedulingPS.getPreferredNodeIterator* I think we should filter out all the nodes with reserved containers; this should reduce the reservations (see the sketch below).
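A possible shape for point 5, assuming a Guava-based filter over the preferred node iterator; this is illustrative, not the patch code:
{code}
// Skip nodes that already hold a reservation when iterating preferred nodes,
// so new reservations are not piled onto the same nodes.
Iterator<FiCaSchedulerNode> preferred =
    schedulingPS.getPreferredNodeIterator(candidates);
Iterator<FiCaSchedulerNode> withoutReserved =
    Iterators.filter(preferred, node -> node.getReservedContainer() == null);
{code}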
 


> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: REPRO_TEST.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-31 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10395:
---

Reopening since it's not committed to any branch.

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: Yarn-10395-001.patch
>
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10359) Log container report only if list is not empty

2020-08-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169243#comment-17169243
 ] 

Bibin Chundatt commented on YARN-10359:
---

+1 committing shortly

> Log container report only if list is not empty
> --
>
> Key: YARN-10359
> URL: https://issues.apache.org/jira/browse/YARN-10359
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10359.001.patch, YARN-10359.002.patch
>
>
> In NodeStatusUpdaterImpl print log only if containerReports list is  not empty
> {code:java}
> if (containerReports != null) {
> LOG.info("Registering with RM using containers :" + containerReports);
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-08-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169246#comment-17169246
 ] 

Bibin Chundatt commented on YARN-10335:
---

Thank you [~cyrusjackson25] for the patch.

I just did a brief pass over the patch; a few comments:

# This could be assigned to *NodeHealthCheckerService*:
{noformat}
107 NodeHealthCheckerServiceImpl healthChecker =
108     createNodeHealthCheckerService();
{noformat}
# Use the read lock in the get API (see the sketch below):
{noformat}
528   public NodeHealthDetails getNodeHealthDetails() {
529     this.writeLock.lock();
530 
531     try {
532       return this.nodeHealthDetails;
533     } finally {
534       this.writeLock.unlock();
535     }
536   }
{noformat}
# Fix all the Jenkins errors.
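
For comment 2, the getter would then look roughly like this, assuming the class keeps a read/write lock pair named readLock/writeLock as in the quoted patch:
{code}
// A plain getter only needs the read lock.
public NodeHealthDetails getNodeHealthDetails() {
  this.readLock.lock();
  try {
    return this.nodeHealthDetails;
  } finally {
    this.readLock.unlock();
  }
}
{code}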

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
> Attachments: YARN-10335.001.patch, YARN-10335.002.patch, 
> YARN-10335.003.patch, YARN-10335.004.patch
>
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170656#comment-17170656
 ] 

Bibin Chundatt commented on YARN-10352:
---

Thank you [~prabhujoseph] for the patch.

 Just a few queries/comments:
 # The custom iterator: how much improvement do we get compared to *Iterators.filter*?
 # Can we avoid the hard-coded multiplier of 2 and make it configurable (see the sketch below)? The multiplier could go wrong when the dispatcher is overloaded; event processing for large clusters could be slow, and events could stay in the async dispatcher for more than 2 seconds.
 # In MultiNodeSortingManager the imports could be ordered.
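
For comment 2, a sketch of a configurable multiplier; the property name below is illustrative only, while the heartbeat-interval keys are the existing YarnConfiguration ones:
{code}
long heartbeatIntervalMs = conf.getLong(
    YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS,
    YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS);
int multiplier = conf.getInt(
    "yarn.scheduler.capacity.multi-node-placement.hb-interval-multiplier", 2);
// Skip nodes whose last heartbeat is older than interval * multiplier.
boolean skipNode = (Time.monotonicNow()
    - node.getLastHeartbeatMonotonicTime()) > heartbeatIntervalMs * multiplier;
{code}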

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172949#comment-17172949
 ] 

Bibin Chundatt commented on YARN-10352:
---

+1 for the latest patch. Will commit it by EOD if there are no objections.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-08-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172042#comment-17172042
 ] 

Bibin Chundatt commented on YARN-10335:
---

[~sunilg] Could you take a look?

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
> Attachments: YARN-10335.001.patch, YARN-10335.002.patch, 
> YARN-10335.003.patch, YARN-10335.004.patch
>
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10388) RMNode updatedCapability flag not set while RecommissionNodeTransition

2020-08-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172043#comment-17172043
 ] 

Bibin Chundatt commented on YARN-10388:
---

Good catch, [~lapjarn].

> RMNode updatedCapability flag not set while RecommissionNodeTransition
> --
>
> Key: YARN-10388
> URL: https://issues.apache.org/jira/browse/YARN-10388
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Major
>
> RMNode updatedCapability flag is not set while RecommissionNodeTransition 
> happens. RM gets updated of new totalcapability when recommissioning of node 
> happens. But the nodemanager still has old totalcapability and is not aware 
> of the change. Setting this flag while RecommissionNodeTransition  would 
> update nodemanager of totalcapability change as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10388) RMNode updatedCapability flag not set while RecommissionNodeTransition

2020-08-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172128#comment-17172128
 ] 

Bibin Chundatt commented on YARN-10388:
---

Overall the patch looks good to me. +1. Will wait for the Jenkins result.
cc: [~inigoiri]

> RMNode updatedCapability flag not set while RecommissionNodeTransition
> --
>
> Key: YARN-10388
> URL: https://issues.apache.org/jira/browse/YARN-10388
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Major
> Attachments: YARN-10388.001.patch
>
>
> RMNode updatedCapability flag is not set while RecommissionNodeTransition 
> happens. RM gets updated of new totalcapability when recommissioning of node 
> happens. But the nodemanager still has old totalcapability and is not aware 
> of the change. Setting this flag while RecommissionNodeTransition  would 
> update nodemanager of totalcapability change as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-07-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153234#comment-17153234
 ] 

Bibin Chundatt commented on YARN-10335:
---

Thank you [~cyrusjackson25] for working on this.

A few comments:


# Refer to NodeHealthStatus for how the records need to be implemented. Define it as abstract and also add comments.
# Rename setNodeResources -> setNodeResourceScore, and rename the variables too.
# Regarding the additional description field below, why did we add this?
 {noformat}
  optional string node_health_description = 4;
 {noformat}
# In NodeHealthService, instead of *getNodeHealthDetails* we could add updateNodeHealthDetails.
# Add the visibility annotation as Private.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
> Attachments: YARN-10335.001.patch
>
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151290#comment-17151290
 ] 

Bibin Chundatt commented on YARN-10332:
---

[~yehuanhuan] My bad.

The state transition is defined twice, so it makes sense to remove it. I had 
misunderstood this JIRA as YARN-10315.
+1 for the change.


> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health

2020-07-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt edited comment on YARN-10335 at 7/5/20, 6:19 AM:


Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what i have in mind about the health detail. Node manager  has node 
health service which returns a boolean value .Sends UNHEALTHY if the node 
health script return error / If  we don't have any healthy local  directories. 

We will introduce field/fields which returns detailed node health value about 
the node along with the NodeHealthStatus.  

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be -  ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.


was (Author: bibinchundatt):
Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what i have in mind about the health detail. Node manager  has node 
health service which returns a boolean value .Sends UNHEALTHY if the node 
health script return error / If  we don't have any healthy local  directories. 

We will introduce field/fields which returns detailed node health value about 
the node along with the NodeHealthStatus.  

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health

2020-07-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt edited comment on YARN-10335 at 7/5/20, 6:22 AM:


Thank you for showing interest in the JIRA, [~cyrusjackson25].

Adding what I have in mind about the health detail. The NodeManager has a node 
health service which returns a boolean value; it sends UNHEALTHY if the node 
health script returns an error or if we don't have any healthy local directories.

We will introduce a field (or fields) that returns a detailed node health value 
for the node along with the NodeHealthStatus.

Example:
{noformat}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
}

message NodeHealthDetail{
 optional int32 overallscore=1;
 optional StringIntMapProto nodeResources =2 ;
}
message StringIntMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be -  ssd, non ssd, etc.. 
{noformat}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.
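
A hypothetical shape for the pluggable service; the class and method names are illustrative, not a committed API:
{code}
// Custom implementations would be selected via configuration and fill in the
// per-resource scores (e.g. "ssd", "non-ssd") and an overall score that is
// reported with each node heartbeat.
public abstract class NodeHealthService extends AbstractService {

  protected NodeHealthService(String name) {
    super(name);
  }

  public abstract void updateNodeHealthDetails(NodeHealthDetails healthDetails);
}
{code}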


was (Author: bibinchundatt):
Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what i have in mind about the health detail. Node manager  has node 
health service which returns a boolean value .Sends UNHEALTHY if the node 
health script return error / If  we don't have any healthy local  directories. 

We will introduce field/fields which returns detailed node health value about 
the node along with the NodeHealthStatus.  

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be -  ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-07-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151483#comment-17151483
 ] 

Bibin Chundatt commented on YARN-10335:
---

[~subru] / [~sunilg] Does the proto structure look good?

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166
 ] 

Bibin Chundatt commented on YARN-10332:
---

[~yehuanhuan] This looks like a duplicate of YARN-10315.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166
 ] 

Bibin Chundatt edited comment on YARN-10332 at 7/1/20, 6:56 AM:


[~yehuanhuan] This looks like a duplicate of YARN-10315.

The current change will cause an InvalidStateTransitionException when the node 
is in DECOMMISSIONING state and the admin triggers a node resource update, and 
also during node update.



was (Author: bibinchundatt):
[~yehuanhuan] looks like duplicate of YARN-10315. 

Current change is got in create InvalidStateTransitionException when Node is in 
decommissioning state and admin is calling node resource update.. Also during 
node update..


> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166
 ] 

Bibin Chundatt edited comment on YARN-10332 at 7/1/20, 6:56 AM:


[~yehuanhuan] looks like duplicate of YARN-10315. 

Current change is got in create InvalidStateTransitionException when Node is in 
decommissioning state and admin is calling node resource update.. Also during 
node update..



was (Author: bibinchundatt):
[~yehuanhuan] looks like duplicate of YARN-10315. 

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt edited comment on YARN-10335 at 7/2/20, 4:45 AM:


Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what i have in mind about the health detail. Node manager  has node 
health service which returns a boolean value .Sends UNHEALTHY if the node 
health script return error / If  we don't have any healthy local  directories. 

We will introduce field/fields which returns detailed node health value about 
the node along with the NodeHealthStatus.  

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.


was (Author: bibinchundatt):
Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding the thought what i have in mind about the health value. Node manager  
has node health service which returns a boolean value . 
Sends UNHEALTHY if the node health script return error / If  we don't have any 
healthy local  directories. 

We want to introduce field/fields which returns detailed node health value 
about the node along with the NodeHealthStatus.  

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt commented on YARN-10335:
---

Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding the thought what i have in mind about the health value. Node manager  
has node health service which returns a boolean value . 
Sends UNHEALTHY if the node health script return error / If  we don't have any 
healthy local  directories. 

We want to introduce field/fields which returns detailed node health value 
about the node along with the NodeHealthStatus.  

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10335:
-

 Summary: Improve scheduling of containers based on node health
 Key: YARN-10335
 URL: https://issues.apache.org/jira/browse/YARN-10335
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt


YARN-7494 supports providing an interface to choose the nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on 
node health.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10335:
--
Description: 
YARN-7494 supports providing an interface to choose the nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on the 
node health value

  was:
YARN-7494 supports providing an interface to choose the nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on 
node health.


> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Major
>
> YARN-7494 supports providing an interface to choose the nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on the 
> node health value



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10335:
--
Description: 
YARN-7494 supports providing an interface to choose the nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on the node 
health value sent from nodemanagers

  was:
YARN-7494 supports providing an interface to choose the nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on the 
node health value


> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Major
>
> YARN-7494 supports providing an interface to choose the nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on the node 
> health value sent from nodemanagers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-07 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10307.
---
Fix Version/s: (was: 3.1.2)
   Resolution: Invalid

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/[file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:|file:///home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:]
>  No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)

[jira] [Reopened] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-07 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10307:
---

Reopening to set the correct resolution

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
> Fix For: 3.1.2
>
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/[file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:|file:///home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:]
>  No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at 

[jira] [Created] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-06-14 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10315:
-

 Summary: Avoid sending RMNodeResoureupdate event if resource is 
same
 Key: YARN-10315
 URL: https://issues.apache.org/jira/browse/YARN-10315
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt


When a node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is sent 
for every heartbeat, which results in a scheduler resource update.

Avoid sending the event when the resource is unchanged.

The scheduler node resource update iterates through all the queues, which is 
costly.
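
A minimal sketch of the idea, with illustrative stand-in types rather than the 
actual YARN classes: remember the last resource reported per node and only emit 
the update event when it actually changes.
{noformat}
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative only: skip the resource-update event when nothing changed. */
class NodeResourceUpdateFilter {

  /** Tiny stand-in for YARN's Resource (memory in MB, vcores). */
  static final class Res {
    final long memoryMb;
    final int vcores;
    Res(long memoryMb, int vcores) { this.memoryMb = memoryMb; this.vcores = vcores; }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Res)) return false;
      Res r = (Res) o;
      return memoryMb == r.memoryMb && vcores == r.vcores;
    }
    @Override public int hashCode() { return Objects.hash(memoryMb, vcores); }
  }

  private final ConcurrentMap<String, Res> lastReported = new ConcurrentHashMap<>();

  /** Returns true only when the reported resource differs from the previous report. */
  boolean shouldSendUpdate(String nodeId, Res reported) {
    Res previous = lastReported.put(nodeId, reported);
    return !reported.equals(previous);
  }
}
{noformat}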



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163246#comment-17163246
 ] 

Bibin Chundatt commented on YARN-10315:
---

+1, looks good to me.

[~adam.antal] I will wait for a few days before committing.

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When a node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is sent 
> for every heartbeat, which results in a scheduler resource update.
> Avoid sending the event when the resource is unchanged.
>  The scheduler node resource update iterates through all the queues, which is 
> costly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10350) TestUserGroupMappingPlacementRule fails

2020-07-16 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10350:
--
Fix Version/s: 3.4.0

> TestUserGroupMappingPlacementRule fails
> ---
>
> Key: YARN-10350
> URL: https://issues.apache.org/jira/browse/YARN-10350
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10350.001.patch, YARN-10350.002.patch
>
>
> TestUserGroupMappingPlacementRule fails on trunk:
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule
> [ERROR] Tests run: 31, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 
> 2.662 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule
> [ERROR] 
> testResolvedQueueIsNotManaged(org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule)
>   Time elapsed: 0.03 s  <<< ERROR!
> java.lang.Exception: Unexpected exception, 
> expected but 
> was
>   at 
> org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:28)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: java.lang.AssertionError: Queue expected: but was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.verifyQueueMapping(TestUserGroupMappingPlacementRule.java:236)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.testResolvedQueueIsNotManaged(TestUserGroupMappingPlacementRule.java:516)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:19)
>   ... 18 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10369) Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG

2020-07-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166332#comment-17166332
 ] 

Bibin Chundatt commented on YARN-10369:
---

[~Jim_Brennan]

In addition to the above comment, please also use {} placeholders for logging.
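
For reference, a minimal example of DEBUG logging with {} placeholders (class and 
method names are illustrative):
{noformat}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class NmTokenLoggingExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(NmTokenLoggingExample.class);

  void logTokenSent(String nodeId, String containerId) {
    // DEBUG level plus {} placeholders: no string concatenation happens
    // unless debug logging is actually enabled.
    LOG.debug("Sending NMToken for nodeId : {} for container : {}",
        nodeId, containerId);
  }
}
{noformat}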

> Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG
> --
>
> Key: YARN-10369
> URL: https://issues.apache.org/jira/browse/YARN-10369
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10369.001.patch
>
>
> This message is logged at the info level, but it doesn't really add much 
> information.
> We changed this to DEBUG internally years ago and haven't missed it.
> {noformat}
> 2020-07-27 21:51:29,027 INFO  [RM Event dispatcher] 
> security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken 
> for nodeId : localhost.localdomain:45454 for container : 
> container_1595886659189_0001_01_01
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10208) Add capacityScheduler metric for NODE_UPDATE interval

2020-07-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10208:
--
Summary: Add capacityScheduler metric for NODE_UPDATE interval  (was: 
CapacityScheduler metric for evaluating the time difference between node 
heartbeats)

> Add capacityScheduler metric for NODE_UPDATE interval
> -
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, 
> YARN-10208.006.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.
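
For illustration, the metric described above amounts to something like the 
following sketch (names are illustrative, not the attached patch):
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch of the metric: track the time between successive
 * NODE_UPDATE events per node and keep a running average that the scheduler
 * metrics could expose.
 */
class NodeUpdateIntervalMetric {
  private final ConcurrentMap<String, Long> lastUpdateMillis = new ConcurrentHashMap<>();
  private final AtomicLong totalIntervalMs = new AtomicLong();
  private final AtomicLong intervalCount = new AtomicLong();

  /** Call on every NODE_UPDATE event for the given node. */
  void recordNodeUpdate(String nodeId, long nowMillis) {
    Long previous = lastUpdateMillis.put(nodeId, nowMillis);
    if (previous != null) {
      totalIntervalMs.addAndGet(nowMillis - previous);
      intervalCount.incrementAndGet();
    }
  }

  /** Average interval between node heartbeats, in milliseconds. */
  double averageIntervalMs() {
    long count = intervalCount.get();
    return count == 0 ? 0.0 : (double) totalIntervalMs.get() / count;
  }
}
{noformat}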



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10208) CapacityScheduler metric for evaluating the time difference between node heartbeats

2020-07-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10208:
--
Summary: CapacityScheduler metric for evaluating the time difference 
between node heartbeats  (was: Add metric in CapacityScheduler for evaluating 
the time difference between node heartbeats)

> CapacityScheduler metric for evaluating the time difference between node 
> heartbeats
> ---
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, 
> YARN-10208.006.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10208) Add capacityScheduler metric for NODE_UPDATE interval

2020-07-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166219#comment-17166219
 ] 

Bibin Chundatt commented on YARN-10208:
---

Missed committing this JIRA. The test case failures are not related to the 
attached patch.
Committing shortly.

> Add capacityScheduler metric for NODE_UPDATE interval
> -
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, 
> YARN-10208.006.patch, YARN-10208.007.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160936#comment-17160936
 ] 

Bibin Chundatt commented on YARN-10352:
---

[~prabhujoseph]

With the current approach we iterate through all the nodes in the partition 
twice.

We could filter out the nodes during the {{reSortClusterNodes}} iteration 
instead of creating a list and then iterating over it again. Thoughts?
One additional filter in {{preferrednodeIterator}} while querying nodes per 
schedulerKey would reduce the node selection done within the 5 second sorting 
interval.

Something along the lines of {{Iterators.filter(iterator, ...)}}; a rough 
sketch follows.
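
A rough sketch of such a filter, using illustrative field names rather than the 
actual scheduler internals:
{noformat}
import com.google.common.collect.Iterators;
import java.util.Iterator;
import java.util.Map;

/**
 * Rough sketch (illustrative only): wrap the candidate node iterator so that
 * nodes which have not heartbeated recently are skipped during multi-node
 * placement, similar to CapacityScheduler#shouldSkipNodeSchedule.
 */
class StaleNodeFilter {
  private final Map<String, Long> lastHeartbeatMillis; // nodeId -> last heartbeat time
  private final long heartbeatIntervalMs;

  StaleNodeFilter(Map<String, Long> lastHeartbeatMillis, long heartbeatIntervalMs) {
    this.lastHeartbeatMillis = lastHeartbeatMillis;
    this.heartbeatIntervalMs = heartbeatIntervalMs;
  }

  /** Nodes whose last heartbeat is older than two intervals are filtered out. */
  Iterator<String> filterStale(Iterator<String> candidateNodeIds) {
    long cutoff = System.currentTimeMillis() - 2 * heartbeatIntervalMs;
    return Iterators.filter(candidateNodeIds,
        nodeId -> lastHeartbeatMillis.getOrDefault(nodeId, 0L) >= cutoff);
  }
}
{noformat}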

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch
>
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the RM, 
> so the RM's active node list still contains those stopped nodes until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 minutes, 
> Multi Node Placement assigns containers to those nodes. We need to exclude nodes 
> which have not heartbeated for the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to 
> the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


