[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lindongdong updated YARN-9738:
------------------------------
    Description: 
*Env :*
 Server OS :- UBUNTU
 No. of Cluster Node:- 9120 NMs
 Env Mode:- [Secure / Non secure]Secure

*Preconditions:*
 ~9120 NM's was running
 ~1250 applications was in running state 
 35K applications was in pending state

*Test Steps:*
 1. Submit the application from 5 clients, each client 2 threads and total 10 
queues
 2. Once application submittion increases (for each application of distributted 
shell will call getClusterNodes)

*ClientRMservice#getClusterNodes tries to get ClusterNodeTracker#getNodeReport 
where map nodes is locked.*
{quote}"IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
tid=0x00007f75095de000 nid=0x1949c waiting on condition [0x00007f74cff78000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x00007f759f6d8858> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
 at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
 at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
 at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
 at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
 at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792){quote}
*Instead we can make nodes as concurrentHashMap and remove readlock*

  was:
*Env :*
Server OS :- UBUNTU
No. of Cluster Node:- 9120 NMs
Env Mode:- [Secure / Non secure]Secure

*Preconditions:*
~9120 NM's was running
~1250 applications was in running state 
35K applications was in pending state

*Test Steps:*
1. Submit the application from 5 clients, each client 2 threads and total 10 
queues
2. Once application submittion increases (for each application of distributted 
shell will call getClusterNodes)

*ClientRMservice#getClusterNodes tries to get ClusterNodeTracker#getNodeReport 
where map nodes is locked.*

{quote}
"IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
tid=0x00007f75095de000 nid=0x1949c waiting on condition [0x00007f74cff78000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f759f6d8858> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
        at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
{quote}

*Instead we can make nodes as concurrentHashMap and remove readlock*






> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-9738
>                 URL: https://issues.apache.org/jira/browse/YARN-9738
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bilwa S T
>            Assignee: Bilwa S T
>            Priority: Major
>         Attachments: YARN-9738-001.patch, YARN-9738-002.patch, 
> YARN-9738-003.patch
>
>
> *Env :*
>  Server OS :- UBUNTU
>  No. of Cluster Node:- 9120 NMs
>  Env Mode:- [Secure / Non secure]Secure
> *Preconditions:*
>  ~9120 NM's was running
>  ~1250 applications was in running state 
>  35K applications was in pending state
> *Test Steps:*
>  1. Submit the application from 5 clients, each client 2 threads and total 10 
> queues
>  2. Once application submittion increases (for each application of 
> distributted shell will call getClusterNodes)
> *ClientRMservice#getClusterNodes tries to get 
> ClusterNodeTracker#getNodeReport where map nodes is locked.*
> {quote}"IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x00007f75095de000 nid=0x1949c waiting on condition [0x00007f74cff78000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x00007f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>  at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792){quote}
> *Instead we can make nodes as concurrentHashMap and remove readlock*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to