[jira] [Commented] (YARN-10080) Support showing the app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090195#comment-17090195 ]

zhoukang commented on YARN-10080:
---------------------------------

How can we push this forward, [~adam.antal] [~abmodi]? Thanks

> Support showing the app id on localizer thread pool
> ----------------------------------------------------
>
>                 Key: YARN-10080
>                 URL: https://issues.apache.org/jira/browse/YARN-10080
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10080-001.patch, YARN-10080.002.patch
>
>
> Currently, when we are troubleshooting a container localizer issue and want
> to analyze a jstack with thread details, we cannot figure out which thread
> is processing the given container. So I want to add the app id to the
> thread name.
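(For context, a minimal sketch of the idea above: embedding the app id into pool thread names via a custom ThreadFactory so jstack output becomes attributable. The class name and thread-name prefix are illustrative assumptions, not the actual YARN-10080 patch.)

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a ThreadFactory that bakes the application id into the
// names of localizer pool threads.
public class AppIdThreadFactory implements ThreadFactory {
  private final String appId;   // e.g. "application_1577000000000_0001"
  private final AtomicInteger count = new AtomicInteger(0);

  public AppIdThreadFactory(String appId) {
    this.appId = appId;
  }

  @Override
  public Thread newThread(Runnable r) {
    // The name appears verbatim in jstack output, e.g.
    // "Localizer-application_..._0001-3", so a stuck thread can be mapped
    // back to the app it is localizing for.
    return new Thread(r, "Localizer-" + appId + "-" + count.incrementAndGet());
  }

  public static ExecutorService newLocalizerPool(String appId, int poolSize) {
    return Executors.newFixedThreadPool(poolSize, new AppIdThreadFactory(appId));
  }
}
{code}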
[jira] [Created] (YARN-10242) CapacityScheduler may call updateClusterResource for every node register event, which can make node registration too slow
zhoukang created YARN-10242:
-------------------------------

             Summary: CapacityScheduler may call updateClusterResource for every node register event, which can make node registration too slow
                 Key: YARN-10242
                 URL: https://issues.apache.org/jira/browse/YARN-10242
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: capacityscheduler, resourcemanager
            Reporter: zhoukang
[jira] [Commented] (YARN-10204) ResContainer may be unreserved while processing outstanding containers
[ https://issues.apache.org/jira/browse/YARN-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063907#comment-17063907 ]

zhoukang commented on YARN-10204:
---------------------------------

We can add double-check logic as below:

{code:java}
private void completeOustandingUpdatesWhichAreReserved(
    RMContainer rmContainer, ContainerStatus containerStatus,
    RMContainerEventType event) {
  N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
  if (schedulerNode != null &&
      schedulerNode.getReservedContainer() != null) {
    RMContainer resContainer = schedulerNode.getReservedContainer();
    // Double check here, since the container may be unreserved between the
    // two getReservedContainer() calls, which can make resContainer null.
    if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {
      ContainerId containerToUpdate = resContainer
          .getReservedSchedulerKey().getContainerToUpdate();
{code}

> ResContainer may be unreserved while processing outstanding containers
> -----------------------------------------------------------------------
>
>                 Key: YARN-10204
>                 URL: https://issues.apache.org/jira/browse/YARN-10204
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> ResContainer may be unreserved while processing outstanding containers,
> which may cause the RM to exit with a failure:
> {code:java}
> 2020-03-21,13:13:36,569 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type CONTAINER_EXPIRED to the Event Dispatcher
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completeOustandingUpdatesWhichAreReserved(AbstractYarnScheduler.java:719)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:678)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1952)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:168)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
[jira] [Created] (YARN-10204) ResContainer may be unreserved while processing outstanding containers
zhoukang created YARN-10204:
-------------------------------

             Summary: ResContainer may be unreserved while processing outstanding containers
                 Key: YARN-10204
                 URL: https://issues.apache.org/jira/browse/YARN-10204
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang


ResContainer may be unreserved while processing outstanding containers, which
may cause the RM to exit with a failure:

{code:java}
2020-03-21,13:13:36,569 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type CONTAINER_EXPIRED to the Event Dispatcher
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completeOustandingUpdatesWhichAreReserved(AbstractYarnScheduler.java:719)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:678)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1952)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:168)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
        at java.lang.Thread.run(Thread.java:748)
{code}
[jira] [Created] (YARN-10118) Use ZK to store node info for the RM so that lost-node information can be shown in the RM UI after failover
zhoukang created YARN-10118:
-------------------------------

             Summary: Use ZK to store node info for the RM so that lost-node information can be shown in the RM UI after failover
                 Key: YARN-10118
                 URL: https://issues.apache.org/jira/browse/YARN-10118
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang


When maintaining a large cluster we may have some lost nodes. If we fail over
before dealing with these nodes, their information is lost in the new active
RM. We can use ZK to store the nodes' information.
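(A minimal sketch of the idea, assuming Apache Curator for ZK access; the znode path, serialization, and class name are illustrative assumptions, not the actual patch.)

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch only: persist lost-node records in ZK so the next active RM can
// rebuild its lost-nodes view after failover.
public class LostNodeStore {
  // Illustrative znode root; a real patch would reuse the RM's ZK store paths.
  private static final String ROOT = "/rmstore/lost-nodes";
  private final CuratorFramework zk;

  public LostNodeStore(String connectString) {
    zk = CuratorFrameworkFactory.newClient(
        connectString, new ExponentialBackoffRetry(1000, 3));
    zk.start();
  }

  // Record a lost node; overwrite if it was already recorded.
  public void recordLostNode(String nodeId, String info) throws Exception {
    String path = ROOT + "/" + nodeId;
    byte[] data = info.getBytes(StandardCharsets.UTF_8);
    if (zk.checkExists().forPath(path) == null) {
      zk.create().creatingParentsIfNeeded().forPath(path, data);
    } else {
      zk.setData().forPath(path, data);
    }
  }
}
{code}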
[jira] [Created] (YARN-10115) Use interceptor pipeline for app submit to make app submit checker policy pluggable
zhoukang created YARN-10115:
-------------------------------

             Summary: Use interceptor pipeline for app submit to make app submit checker policy pluggable
                 Key: YARN-10115
                 URL: https://issues.apache.org/jira/browse/YARN-10115
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: capacityscheduler, resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10115) Use interceptor pipeline for app submit to make app submit check policy pluggable
[ https://issues.apache.org/jira/browse/YARN-10115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10115:
----------------------------
    Summary: Use interceptor pipeline for app submit to make app submit check policy pluggable  (was: Use interceptor pipeline for app submit to make app submit checker policy pluggable)

> Use interceptor pipeline for app submit to make app submit check policy
> pluggable
> ------------------------------------------------------------------------
>
>                 Key: YARN-10115
>                 URL: https://issues.apache.org/jira/browse/YARN-10115
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler, resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
[jira] [Updated] (YARN-10060) Historyserver may recover too slowly since JobHistory init is too slow when there exist too many jobs
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10060:
----------------------------
    Attachment:     (was: YARN-10060.001.patch)

> Historyserver may recover too slowly since JobHistory init is too slow
> when there exist too many jobs
> -----------------------------------------------------------------------
>
>                 Key: YARN-10060
>                 URL: https://issues.apache.org/jira/browse/YARN-10060
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10060-001.patch
>
>
> As shown below, it took more than 7 minutes before the service port was listening:
> {code:java}
> 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs...
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session
> 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init
> 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started
> 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
> 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined
> 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/*
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
> 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901
> 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901
> 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server
> 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx
> 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log
[jira] [Updated] (YARN-10060) Historyserver may recover too slowly since JobHistory init is too slow when there exist too many jobs
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10060:
----------------------------
    Attachment: YARN-10060-001.patch

> Historyserver may recover too slowly since JobHistory init is too slow
> when there exist too many jobs
> -----------------------------------------------------------------------
>
>                 Key: YARN-10060
>                 URL: https://issues.apache.org/jira/browse/YARN-10060
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10060-001.patch
>
>
> As shown below, it took more than 7 minutes before the service port was listening:
> {code:java}
> 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs...
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session
> 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init
> 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started
> 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
> 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined
> 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/*
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
> 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901
> 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901
> 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server
> 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx
> 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion
[jira] [Commented] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028457#comment-17028457 ]

zhoukang commented on YARN-10011:
---------------------------------

[~adam.antal] Could you help review this?

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011-001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
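(A minimal sketch of the defensive pattern the issue asks for; the wrapper class and names are illustrative, not the actual YARN-10011 patch. The point is that a failure while initializing one app's log aggregation, like the IllegalStateException above, must not kill the shared dispatcher thread.)

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: catch-all around per-app init so one bad app cannot take the
// whole NodeManager down via the AsyncDispatcher thread.
public final class SafeInit {
  private static final Logger LOG = LoggerFactory.getLogger(SafeInit.class);

  public static void initAppSafely(Runnable initApp, String appId) {
    try {
      initApp.run();   // e.g. verifyAndCreateRemoteLogDir + per-app setup
    } catch (Throwable t) {
      // Log and continue: the NM stays up and only this app's log
      // aggregation is marked failed, instead of the dispatcher dying.
      LOG.error("Failed to initialize log aggregation for " + appId, t);
    }
  }
}
{code}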
[jira] [Updated] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10011:
----------------------------
    Attachment: YARN-10011-001.patch

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011-001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Updated] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10011:
----------------------------
    Attachment:     (was: YARN-10011.001.patch)

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011-001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Commented] (YARN-10080) Support showing the app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028455#comment-17028455 ]

zhoukang commented on YARN-10080:
---------------------------------

Ping [~abmodi] [~tangzhankun]

> Support showing the app id on localizer thread pool
> ----------------------------------------------------
>
>                 Key: YARN-10080
>                 URL: https://issues.apache.org/jira/browse/YARN-10080
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10080-001.patch, YARN-10080.002.patch
>
>
> Currently, when we are troubleshooting a container localizer issue and want
> to analyze a jstack with thread details, we cannot figure out which thread
> is processing the given container. So I want to add the app id to the
> thread name.
[jira] [Created] (YARN-10096) Add ZK-based configuration provider for router
zhoukang created YARN-10096:
-------------------------------

             Summary: Add ZK-based configuration provider for router
                 Key: YARN-10096
                 URL: https://issues.apache.org/jira/browse/YARN-10096
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: router
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10094) Add a configuration to support NM overuse in RM
[ https://issues.apache.org/jira/browse/YARN-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10094:
----------------------------
    Attachment: YARN-10094.001.patch

> Add a configuration to support NM overuse in RM
> ------------------------------------------------
>
>                 Key: YARN-10094
>                 URL: https://issues.apache.org/jira/browse/YARN-10094
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10094.001.patch
>
>
> In a large cluster, upgrading NMs costs too much time. Sometimes we want to
> support memory or CPU overuse from the RM's view.
[jira] [Updated] (YARN-10094) Add configuration to support NM overuse in RM
[ https://issues.apache.org/jira/browse/YARN-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10094:
----------------------------
    Summary: Add configuration to support NM overuse in RM  (was: Add a configuration to support NM overuse in RM)

> Add configuration to support NM overuse in RM
> ----------------------------------------------
>
>                 Key: YARN-10094
>                 URL: https://issues.apache.org/jira/browse/YARN-10094
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10094.001.patch
>
>
> In a large cluster, upgrading NMs costs too much time. Sometimes we want to
> support memory or CPU overuse from the RM's view.
[jira] [Updated] (YARN-10094) Add a configuration to support NM overuse in RM
[ https://issues.apache.org/jira/browse/YARN-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10094:
----------------------------
    Description: 
In a large cluster, upgrading NMs costs too much time. Sometimes we want to
support memory or CPU overuse from the RM's view.

> Add a configuration to support NM overuse in RM
> ------------------------------------------------
>
>                 Key: YARN-10094
>                 URL: https://issues.apache.org/jira/browse/YARN-10094
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In a large cluster, upgrading NMs costs too much time. Sometimes we want to
> support memory or CPU overuse from the RM's view.
[jira] [Created] (YARN-10094) Add a configuration to support NM overuse in RM
zhoukang created YARN-10094:
-------------------------------

             Summary: Add a configuration to support NM overuse in RM
                 Key: YARN-10094
                 URL: https://issues.apache.org/jira/browse/YARN-10094
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10093) Support list applications by queue name
[ https://issues.apache.org/jira/browse/YARN-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10093:
----------------------------
    Summary: Support list applications by queue name  (was: Support get applications by queue)

> Support list applications by queue name
> ----------------------------------------
>
>                 Key: YARN-10093
>                 URL: https://issues.apache.org/jira/browse/YARN-10093
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
[jira] [Created] (YARN-10093) Support get applications by queue
zhoukang created YARN-10093:
-------------------------------

             Summary: Support get applications by queue
                 Key: YARN-10093
                 URL: https://issues.apache.org/jira/browse/YARN-10093
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10092) Support config special log retain time for given user
[ https://issues.apache.org/jira/browse/YARN-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10092:
----------------------------
    Summary: Support config special log retain time for given user  (was: Support log retain time for give user)

> Support config special log retain time for given user
> ------------------------------------------------------
>
>                 Key: YARN-10092
>                 URL: https://issues.apache.org/jira/browse/YARN-10092
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
[jira] [Created] (YARN-10092) Support log retain time for give user
zhoukang created YARN-10092:
-------------------------------

             Summary: Support log retain time for give user
                 Key: YARN-10092
                 URL: https://issues.apache.org/jira/browse/YARN-10092
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Created] (YARN-10091) Support cleaning up orphan apps' logs in LogAggService
zhoukang created YARN-10091:
-------------------------------

             Summary: Support cleaning up orphan apps' logs in LogAggService
                 Key: YARN-10091
                 URL: https://issues.apache.org/jira/browse/YARN-10091
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
            Reporter: zhoukang
            Assignee: zhoukang


In a large cluster there can exist orphan app log directories, which cause a
disk-space leak. We should support cleaning up the log directories of such
apps.
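(A minimal sketch of the cleanup idea. The class, the liveAppIds source, and the log-root layout are illustrative assumptions; a real patch would hook into the set of apps LogAggregationService actually tracks.)

{code:java}
import java.io.File;
import java.util.Set;
import org.apache.hadoop.fs.FileUtil;

// Sketch only: scan an NM log root and delete directories whose app id is no
// longer tracked by the NM (an "orphan" left behind by a crash or restart).
public final class OrphanLogCleaner {
  public static void clean(File logRoot, Set<String> liveAppIds) {
    File[] appDirs = logRoot.listFiles();
    if (appDirs == null) {
      return;                     // log root missing or unreadable
    }
    for (File dir : appDirs) {
      String name = dir.getName();   // e.g. application_1577000000000_0007
      if (dir.isDirectory() && name.startsWith("application_")
          && !liveAppIds.contains(name)) {
        FileUtil.fullyDelete(dir);   // recursive delete of the orphan dir
      }
    }
  }
}
{code}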
[jira] [Updated] (YARN-10062) Support deploying multiple historyservers in case of SPOF
[ https://issues.apache.org/jira/browse/YARN-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10062:
----------------------------
    Attachment: YARN-10062.001.patch

> Support deploying multiple historyservers in case of SPOF
> ----------------------------------------------------------
>
>                 Key: YARN-10062
>                 URL: https://issues.apache.org/jira/browse/YARN-10062
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10062.001.patch
>
>
> In this jira I want to implement a patch to support HistoryServer HA.
> We can deploy two historyservers and use a load balancer like LVS to
> support HA. But errors like the below exist in our production cluster:
> {code:java}
> 19/12/13/00 does not exist.
> 2019-12-21,13:25:06,822 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:07,530 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:09,910 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:44:29,044 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:47:08,154 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> {code}
[jira] [Updated] (YARN-10062) Support deploying multiple historyservers in case of SPOF
[ https://issues.apache.org/jira/browse/YARN-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10062:
----------------------------
    Description: 
In this jira I want to implement a patch to support HistoryServer HA.
We can deploy two historyservers and use a load balancer like LVS to support
HA. But errors like the below exist in our production cluster:
{code:java}
19/12/13/00 does not exist.
2019-12-21,13:25:06,822 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:25:07,530 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:25:09,910 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:44:29,044 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:47:08,154 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
{code}

  was: In this jira, i want to implement a patch to support history ha

> Support deploying multiple historyservers in case of SPOF
> ----------------------------------------------------------
>
>                 Key: YARN-10062
>                 URL: https://issues.apache.org/jira/browse/YARN-10062
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In this jira I want to implement a patch to support HistoryServer HA.
> We can deploy two historyservers and use a load balancer like LVS to
> support HA. But errors like the below exist in our production cluster:
> {code:java}
> 19/12/13/00 does not exist.
> 2019-12-21,13:25:06,822 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:07,530 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:09,910 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:44:29,044 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:47:08,154 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> {code}
[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019209#comment-17019209 ]

zhoukang commented on YARN-9605:
--------------------------------

Ping [~prabhujoseph] [~subru] [~tangzhankun], could you help push this jira? Thanks

> Add ZkConfiguredFailoverProxyProvider for RM HA
> -----------------------------------------------
>
>                 Key: YARN-9605
>                 URL: https://issues.apache.org/jira/browse/YARN-9605
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>             Fix For: 3.2.0, 3.1.2
>
>         Attachments: YARN-9605.001.patch, YARN-9605.002.patch, YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, YARN-9605.006.patch
>
>
> In this issue I will track a new feature to support
> ZkConfiguredFailoverProxyProvider for RM HA.
[jira] [Commented] (YARN-10069) Showing jstack on UI for containers
[ https://issues.apache.org/jira/browse/YARN-10069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019207#comment-17019207 ]

zhoukang commented on YARN-10069:
---------------------------------

[~akhilpb] Showing jstack for a running container.

> Showing jstack on UI for containers
> ------------------------------------
>
>                 Key: YARN-10069
>                 URL: https://issues.apache.org/jira/browse/YARN-10069
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In this jira I want to post a patch to support showing jstack on the
> container UI.
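(A minimal sketch of the capture side, assuming the `jstack` CLI is on the PATH of the NM host; the helper class is illustrative and says nothing about how the actual patch wires the output into the web UI.)

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch only: capture `jstack <pid>` output for a running container process
// so a web page could render it on the container view.
public final class JstackCapture {
  public static String capture(String pid)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder("jstack", pid)
        .redirectErrorStream(true)   // merge stderr so errors are visible too
        .start();
    StringBuilder out = new StringBuilder();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        out.append(line).append('\n');
      }
    }
    p.waitFor();
    return out.toString();
  }
}
{code}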
[jira] [Updated] (YARN-9979) When an app expires with many containers, the scheduler event queue size will be huge
[ https://issues.apache.org/jira/browse/YARN-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-9979:
---------------------------
    Attachment: YARN-9979.001.patch

> When an app expires with many containers, the scheduler event queue size
> will be huge
> -------------------------------------------------------------------------
>
>                 Key: YARN-9979
>                 URL: https://issues.apache.org/jira/browse/YARN-9979
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-9979.001.patch
>
>
> When an app expires with many containers, the scheduler event queue becomes
> huge:
> {code:java}
> 2019-11-11,21:39:49,690 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 9000
> 2019-11-11,21:39:49,695 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 10000
> 2019-11-11,21:39:49,700 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 11000
> 2019-11-11,21:39:49,705 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 12000
> 2019-11-11,21:39:49,710 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 13000
> 2019-11-11,21:39:49,715 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 14000
> 2019-11-11,21:39:49,720 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Discarded 1 messages due to full event buffer including: Size of scheduler event-queue is 15000
> 2019-11-11,21:39:49,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 16000
> 2019-11-11,21:39:49,729 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 17000
> 2019-11-11,21:39:49,733 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 18000
> 2019-11-11,21:40:14,953 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000
> 2019-11-11,21:43:09,743 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000
> 2019-11-11,21:43:09,750 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 20000
> 2019-11-11,21:43:09,758 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 21000
> 2019-11-11,21:43:09,766 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 22000
> 2019-11-11,21:43:09,775 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 23000
> 2019-11-11,21:43:09,783 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 24000
> 2019-11-11,21:43:09,792 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 25000
> 2019-11-11,21:43:09,800 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 26000
> 2019-11-11,21:43:09,807 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 27000
> 2019-11-11,21:43:09,814 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 28000
> 2019-11-11,21:46:29,830 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 29000
> 2019-11-11,21:46:29,841 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 30000
> 2019-11-11,21:46:29,850 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 31000
> 2019-11-11,21:46:29,862 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 32000
> 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 33000
> 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 34000
> 2019-11-11,21:49:49,876 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 35000
> 2019-11-11,21:49:49,882 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 36000
> 2019-11-11,21:49:49,887 INFO
[jira] [Commented] (YARN-10010) NM log upload costs too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019205#comment-17019205 ]

zhoukang commented on YARN-10010:
---------------------------------

I posted a patch in YARN-10056, [~wilfreds]. I will close this one as a dupe.

> NM log upload costs too much time
> ----------------------------------
>
>                 Key: YARN-10010
>                 URL: https://issues.apache.org/jira/browse/YARN-10010
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, sometimes the log
> uploading service is delayed for some apps, like below:
> !notfound.png!
[jira] [Resolved] (YARN-10010) NM log upload costs too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang resolved YARN-10010.
-----------------------------
    Resolution: Duplicate

> NM log upload costs too much time
> ----------------------------------
>
>                 Key: YARN-10010
>                 URL: https://issues.apache.org/jira/browse/YARN-10010
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, sometimes the log
> uploading service is delayed for some apps, like below:
> !notfound.png!
[jira] [Updated] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10011:
----------------------------
    Attachment: YARN-10011.001.patch

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011.001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019204#comment-17019204 ]

zhoukang commented on YARN-9930:
--------------------------------

Sorry for the late reply; I will post a patch later. Thanks [~sunilg]

> Support max running app logic for CapacityScheduler
> ----------------------------------------------------
>
>                 Key: YARN-9930
>                 URL: https://issues.apache.org/jira/browse/YARN-9930
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In FairScheduler there is a max-running-apps limitation which lets excess
> applications stay pending. But CapacityScheduler has no such
> max-running-app feature; it only has max apps, and excess jobs are rejected
> directly at the client. In this jira I want to implement this semantic for
> CapacityScheduler.
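(A minimal sketch of the requested semantic, with entirely hypothetical names and none of the actual CapacityScheduler wiring: accepted apps above the limit are parked as pending instead of being rejected at the client, and are activated as running apps finish.)

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only: per-queue gate that mimics FairScheduler's maxRunningApps.
public final class MaxRunningAppsGate {
  private final int maxRunningApps;
  private final Queue<String> pending = new ArrayDeque<>();  // app ids waiting
  private int running = 0;

  public MaxRunningAppsGate(int maxRunningApps) {
    this.maxRunningApps = maxRunningApps;
  }

  // Called on submission: activate immediately if below the cap, else pend.
  public synchronized boolean trySubmit(String appId) {
    if (running < maxRunningApps) {
      running++;
      return true;            // activate now
    }
    pending.add(appId);       // pend instead of rejecting at the client
    return false;
  }

  // Called when a running app finishes: activate the oldest pending app.
  public synchronized String onAppFinished() {
    running--;
    String next = pending.poll();
    if (next != null) {
      running++;
    }
    return next;              // app id to activate, or null if none pending
  }
}
{code}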
[jira] [Updated] (YARN-10056) Logservice may encounter NM FGC since the filesystem will only close when the app finishes
[ https://issues.apache.org/jira/browse/YARN-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10056:
----------------------------
    Attachment: YARN-10056.001.patch

> Logservice may encounter NM FGC since the filesystem will only close when
> the app finishes
> --------------------------------------------------------------------------
>
>                 Key: YARN-10056
>                 URL: https://issues.apache.org/jira/browse/YARN-10056
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10056.001.patch
>
>
> Currently, the filesystem is only closed when the app finishes, which may
> cause memory overhead.
[jira] [Updated] (YARN-10056) Logservice may encounter NM FGC since the filesystem will only close when the app finishes
[ https://issues.apache.org/jira/browse/YARN-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10056:
----------------------------
    Description: Currently, the filesystem is only closed when the app finishes, which may cause memory overhead.

> Logservice may encounter NM FGC since the filesystem will only close when
> the app finishes
> --------------------------------------------------------------------------
>
>                 Key: YARN-10056
>                 URL: https://issues.apache.org/jira/browse/YARN-10056
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> Currently, the filesystem is only closed when the app finishes, which may
> cause memory overhead.
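(A minimal sketch of the idea using the standard Hadoop FileSystem API; the helper class is illustrative. The point is to use a private FileSystem instance per upload cycle and close it promptly, instead of caching one per app until the app finishes.)

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: close the FileSystem as soon as an upload cycle completes,
// so long-running apps do not pin client state in the NM heap.
public final class ShortLivedFsUpload {
  public static void upload(Configuration conf, Path src, Path dst)
      throws IOException {
    // newInstance() bypasses the FileSystem cache, so close() here cannot
    // break other users of the shared cached FileSystem object.
    FileSystem fs = FileSystem.newInstance(dst.toUri(), conf);
    try {
      fs.copyFromLocalFile(src, dst);
    } finally {
      fs.close();   // release sockets/buffers now, not at app finish
    }
  }
}
{code}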
[jira] [Updated] (YARN-10060) Historyserver may recover too slowly since JobHistory init is too slow when there exist too many jobs
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10060:
----------------------------
    Attachment: YARN-10060.001.patch

> Historyserver may recover too slowly since JobHistory init is too slow
> when there exist too many jobs
> -----------------------------------------------------------------------
>
>                 Key: YARN-10060
>                 URL: https://issues.apache.org/jira/browse/YARN-10060
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10060.001.patch
>
>
> As shown below, it took more than 7 minutes before the service port was listening:
> {code:java}
> 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs...
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session
> 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init
> 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started
> 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
> 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined
> 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/*
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
> 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901
> 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901
> 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server
> 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx
> 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion
[jira] [Commented] (YARN-10060) Historyserver may recover too slow since JobHistory init too slow when there exist too many job
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019193#comment-17019193 ] zhoukang commented on YARN-10060: - I will submit a patch to skip load file older than max history age > Historyserver may recover too slow since JobHistory init too slow when there > exist too many job > --- > > Key: YARN-10060 > URL: https://issues.apache.org/jira/browse/YARN-10060 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Like below it cost >7min to listen to the service port > {code:java} > 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2019-12-24,20:01:47,354 INFO > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing > Jobs... > 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket > connection to server xxx. Will not attempt to authenticate using SASL > (unknown error) > 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to xxx, initiating session > 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, > negotiated timeout = 5000 > 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x66d1a13e596ddc9 closed > 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2019-12-24,20:08:29,655 INFO > org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage > Init > 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: > loaded properties from hadoop-metrics2.properties > 2019-12-24,20:08:29,943 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period > at 10 second(s). 
> 2019-12-24,20:08:29,943 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics > system started > 2019-12-24,20:08:29,950 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Updating the current master key for generating delegation tokens > 2019-12-24,20:08:29,951 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Starting expired delegation token remover thread, > tokenRemoverScanInterval=60 min(s) > 2019-12-24,20:08:29,952 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Updating the current master key for generating delegation tokens > 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http > request log for http.requests.jobhistory is not defined > 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global > filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) > 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter > static_user_filter > (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to > context jobhistory > 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter > static_user_filter > (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to > context static > 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path > spec: /jobhistory/* > 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path > spec: /ws/* > 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound > to port 20901 > 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app > /jobhistory started at 20901 > 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: > Registered webapp guice modules > 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:31,189 INFO > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding > protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server > 2019-12-24,20:08:31,216 INFO > org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated > HistoryClientService at xxx > 2019-12-24,20:08:31,344 INFO > org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: >
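The fix the comment above proposes — skipping history files older than the retention window during recovery — could look roughly like the sketch below. This is a minimal illustration, not the attached patch: the helper name shouldSkipOnRecovery is hypothetical, while the JHAdminConfig constants are the existing retention settings (mapreduce.jobhistory.max-age-ms and its default) that the history cleaner already uses.
{code:java}
// Minimal sketch (not the attached patch): during initExisting(), skip any
// history file whose modification time is older than the configured max
// history age, since the cleaner thread would delete it shortly anyway.
// shouldSkipOnRecovery is a hypothetical helper name.
private boolean shouldSkipOnRecovery(FileStatus fileStatus, Configuration conf) {
  long maxHistoryAgeMs = conf.getLong(JHAdminConfig.MR_HISTORY_MAX_AGE_MS,
      JHAdminConfig.DEFAULT_MR_HISTORY_MAX_AGE);
  long cutoffMs = System.currentTimeMillis() - maxHistoryAgeMs;
  return fileStatus.getModificationTime() < cutoffMs;
}
{code}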
[jira] [Commented] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019188#comment-17019188 ] zhoukang commented on YARN-10080: - [~abmodi] thanks for the review, I have submitted a new patch to show the container id > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-10080-001.patch, YARN-10080.002.patch > > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
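As a rough sketch of what the patch aims for (the names below are illustrative, not the attached patch itself): wrap each localization task so the pool's worker thread is renamed with the container id for the duration of the task and restored afterwards. A jstack taken mid-localization then shows which container each worker is serving.
{code:java}
// Hypothetical wrapper: temporarily rename the pool thread so a jstack taken
// while localization is running shows which container the thread serves.
class LocalizerWorker implements Runnable {
  private final String containerIdStr;
  private final Runnable delegate;

  LocalizerWorker(String containerIdStr, Runnable delegate) {
    this.containerIdStr = containerIdStr;
    this.delegate = delegate;
  }

  @Override
  public void run() {
    Thread current = Thread.currentThread();
    String oldName = current.getName();
    current.setName(oldName + " for " + containerIdStr);
    try {
      delegate.run();
    } finally {
      current.setName(oldName); // restore so pooled threads stay generic
    }
  }
}
{code}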
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Attachment: YARN-10080.002.patch > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-10080-001.patch, YARN-10080.002.patch > > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Attachment: YARN-10080-001.patch > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-10080-001.patch > > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Summary: Support show app id on localizer thread pool (was: Support show container id on localizer thread pool) > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add container id on the > thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Description: Currently when we are troubleshooting a container localizer issue, if we want to analyze the jstack with thread detail, we can not figure out which thread is processing the given container. So i want to add app id on the thread name (was: Currently when we are troubleshooting a container localizer issue, if we want to analyze the jstack with thread detail, we can not figure out which thread is processing the given container. So i want to add container id on the thread name) > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10080) Support show container id on localizer thread pool
zhoukang created YARN-10080: --- Summary: Support show container id on localizer thread pool Key: YARN-10080 URL: https://issues.apache.org/jira/browse/YARN-10080 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhoukang Assignee: zhoukang Currently when we are troubleshooting a container localizer issue, if we want to analyze the jstack with thread detail, we can not figure out which thread is processing the given container. So i want to add container id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10069) Showing jstack on UI for containers
zhoukang created YARN-10069: --- Summary: Showing jstack on UI for containers Key: YARN-10069 URL: https://issues.apache.org/jira/browse/YARN-10069 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhoukang Assignee: zhoukang In this jira, i want to post a patch to support showing jstack on the container ui -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
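The issue text carries no implementation detail; purely as a sketch of one possible building block (all names here are assumptions), the NM could shell out to jstack for the container's root process and render the output on the container page. This requires jstack on the NM's PATH and the container pid being known to the caller.
{code:java}
// Assumption-laden sketch: run `jstack <pid>` for a container's root process
// and return the output as a String for rendering on the container UI page.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class JstackRunner {
  public static String dump(String pid) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("jstack", pid).redirectErrorStream(true).start();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    try (InputStream in = p.getInputStream()) {
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
    p.waitFor();
    return out.toString("UTF-8");
  }
}
{code}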
[jira] [Created] (YARN-10066) Support showing nm version distribution on rm UI
zhoukang created YARN-10066: --- Summary: Support showing nm version distribution on rm UI Key: YARN-10066 URL: https://issues.apache.org/jira/browse/YARN-10066 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhoukang Assignee: zhoukang In this jira, i will post a patch to support showing nm version distribution on rm ui. which is useful for large cluster maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
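A minimal sketch of the aggregation such a page would need, assuming the RM-side code can reach the RMContext; the RM already tracks each registered node's reported NM version, so the distribution is a simple group-by (RMNode#getNodeManagerVersion is assumed to be the accessor for that field):
{code:java}
// Sketch only: group registered nodes by their reported NM version so the
// RM UI can render a version distribution for cluster-upgrade tracking.
Map<String, Integer> nmVersionCounts(RMContext rmContext) {
  Map<String, Integer> counts = new TreeMap<>();
  for (RMNode node : rmContext.getRMNodes().values()) {
    String version = node.getNodeManagerVersion();
    counts.merge(version == null ? "unknown" : version, 1, Integer::sum);
  }
  return counts;
}
{code}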
[jira] [Created] (YARN-10062) Support deploy multiple historyserver in case of sp
zhoukang created YARN-10062: --- Summary: Support deploy multiple historyserver in case of sp Key: YARN-10062 URL: https://issues.apache.org/jira/browse/YARN-10062 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: zhoukang Assignee: zhoukang In this jira, i want to implement a patch to support history ha -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10061) job historyserver old gen may be 100% when too many jobs load history
[ https://issues.apache.org/jira/browse/YARN-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003561#comment-17003561 ] zhoukang commented on YARN-10061: - job below generated 170+ requests, we should add a filter for the same job which is replaying {code:java} /jobhistory/job/job_1576831312050_362973/mapreduce/job/job_1576831312050_362973 {code} come from the same browser {code:java} GET /jobhistory/job/job_1576831312050_362973/mapreduce/job/job_1576831312050_362973 HTTP/1.1..Connection: upgrade..X-Real-IP: 10.232.22.174..X-Forwarded-For: 10.232.22.174..Host: zjy-hadoop-prc-ct11.bj:20901..User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0..Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8..Accept-Language: en-US,en;q=0.5..Accept-Encoding: gzip, deflate..Referer: http://zjy-hadoop-prc-ct11.bj:21001/proxy/application_1576831312050_362973/?proxyapproved=true..Upgrade-Insecure-Requests: 1.. {code} > job historyserver old gen may be 100% when too many jobs load history > - > > Key: YARN-10061 > URL: https://issues.apache.org/jira/browse/YARN-10061 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png, 002.png, 003.png > > > !003.png! > {code:java} > [work@zjy-hadoop-prc-ct11 log]$ jstat -gcutil 26774 > S0 S1 E O M CCSYGC YGCTFGCFGCT GCT > 0.00 99.99 100.00 100.00 98.13 96.35 10999 2786.664 497 989.782 > 3776.446 > {code} > {code:java} > hread 1058215567@qtp-1107509430-6121 > Thread Properties > Object / Stack Frame org.mortbay.thread.QueuedThreadPool$PoolThread @ > 0x7606db678 > Name 1058215567@qtp-1107509430-6121 > Shallow Heap 0.00 MB > Retained Heap 0.17 MB > Context Class Loader jobhistory > Is Daemon true > Total: 6 entries > Thread Stack > 1058215567@qtp-1107509430-6121 > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(Lorg/apache/hadoop/fs/FileStatus;)V > (HistoryFileManager.java:278) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory()V > (HistoryFileManager.java:798) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/hs/HistoryFileManager$HistoryFileInfo; > (HistoryFileManager.java:948) > at > org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; > (CachedHistoryStorage.java:135) > at > org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; > (JobHistory.java:221) > at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.requireJob()V > (AppController.java:382) > at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.job()V > (AppController.java:109) > at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.job()V > (HsController.java:104) > at > sun.reflect.GeneratedMethodAccessor30.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; > (Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; > (DelegatingMethodAccessorImpl.java:43) > at > java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; > (Method.java:498) > at > 
org.apache.hadoop.yarn.webapp.Dispatcher.service(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V > (Dispatcher.java:153) > at > javax.servlet.http.HttpServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V > (HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V > (ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z > (ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z >
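The filter proposed in the comment could be as simple as collapsing concurrent loads of the same job into a single replay. The sketch below is illustrative only — lookupFromCache and doLoadFromHistoryFile are hypothetical stand-ins for the existing cache and history-replay paths:
{code:java}
// Illustrative sketch: allow only one thread at a time to replay a given
// job's history file; concurrent requests for the same job wait for the
// in-flight load instead of each parsing the file and filling the old gen.
private final ConcurrentHashMap<String, CountDownLatch> inFlight =
    new ConcurrentHashMap<>();

Job loadJobOnce(String jobId) throws InterruptedException {
  CountDownLatch latch = new CountDownLatch(1);
  CountDownLatch existing = inFlight.putIfAbsent(jobId, latch);
  if (existing != null) {
    existing.await();              // someone else is replaying this job
    return lookupFromCache(jobId); // hypothetical cache lookup
  }
  try {
    return doLoadFromHistoryFile(jobId); // hypothetical expensive replay
  } finally {
    latch.countDown();
    inFlight.remove(jobId);
  }
}
{code}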
[jira] [Commented] (YARN-10061) job historyserver old gen may be 100% when too many jobs load history
[ https://issues.apache.org/jira/browse/YARN-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003558#comment-17003558 ] zhoukang commented on YARN-10061: - Threads from the heap-dump report and the HTTP request each one is serving (the paste is truncated at the end; "job page" stands for the repeated URI /jobhistory/job/job_1576831312050_362973/mapreduce/job/job_1576831312050_362973):
{code:java}
1058215567@qtp-1107509430-6121   job page
1333459277@qtp-1107509430-6120   job page
582278496@qtp-1107509430-6119    /jobhistory/attempts/job_1576831312050_347354/r/KILLED
1010032540@qtp-1107509430-6118   job page
2057128499@qtp-1107509430-6117   job page
1230320783@qtp-1107509430-6116   job page
787412968@qtp-1107509430-6115    job page
328094070@qtp-1107509430-6114    job page
284896606@qtp-1107509430-6113    job page
1764513565@qtp-1107509430-6112   /jobhistory/attempts/job_1576831312050_347314/m/KILLED
1013350884@qtp-1107509430-6111   job page
2030348115@qtp-1107509430-6110   job page
1609530906@qtp-1107509430-6109   job page
2112512892@qtp-1107509430-6108   job page
1508380482@qtp-1107509430-6107   job page
2066251373@qtp-1107509430-6106   job page
367949850@qtp-1107509430-6105    job page
626277387@qtp-1107509430-6104    job page
515689957@qtp-1107509430-6103    /jobhistory/attempts/job_1576831312050_347325/r/KILLED
2097370166@qtp-1107509430-6102   /jobhistory/attempts/job_1576831312050_347313/m/KILLED
1680793908@qtp-1107509430-6101   job page
1425331186@qtp-1107509430-6100   /jobhistory/attempts/job_1576831312050_349232/r/KILLED
1797324868@qtp-1107509430-6099   (request URI truncated in the original paste)
{code}
[jira] [Created] (YARN-10061) job historyserver old gen may be 100% when too many jobs load history
zhoukang created YARN-10061: --- Summary: job historyserver old gen may be 100% when too many jobs load history Key: YARN-10061 URL: https://issues.apache.org/jira/browse/YARN-10061 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: zhoukang Assignee: zhoukang Attachments: 001.png, 002.png, 003.png !003.png! {code:java} [work@zjy-hadoop-prc-ct11 log]$ jstat -gcutil 26774 S0 S1 E O M CCSYGC YGCTFGCFGCT GCT 0.00 99.99 100.00 100.00 98.13 96.35 10999 2786.664 497 989.782 3776.446 {code} {code:java} hread 1058215567@qtp-1107509430-6121 Thread Properties Object / Stack Frameorg.mortbay.thread.QueuedThreadPool$PoolThread @ 0x7606db678 Name1058215567@qtp-1107509430-6121 Shallow Heap0.00 MB Retained Heap 0.17 MB Context Class Loaderjobhistory Is Daemon true Total: 6 entries Thread Stack 1058215567@qtp-1107509430-6121 at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(Lorg/apache/hadoop/fs/FileStatus;)V (HistoryFileManager.java:278) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory()V (HistoryFileManager.java:798) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/hs/HistoryFileManager$HistoryFileInfo; (HistoryFileManager.java:948) at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; (CachedHistoryStorage.java:135) at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; (JobHistory.java:221) at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.requireJob()V (AppController.java:382) at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.job()V (AppController.java:109) at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.job()V (HsController.java:104) at sun.reflect.GeneratedMethodAccessor30.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:498) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z (ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z (ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljavax/servlet/FilterChain;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)V (ServletContainer.java:900) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljavax/servlet/FilterChain;)V (ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V (ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Lcom/google/inject/servlet/FilterChainInvocation;)V (FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V (ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V (GuiceFilter.java:113) at
[jira] [Updated] (YARN-10060) Historyserver may recover too slow since JobHistory init too slow when there exist too many job
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10060: Description: Like below it cost >7min to listen to the service port {code:java} 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs... 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error) 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/* 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/* 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules 2019-12-24,20:08:31,187 INFO 
org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion started. 2019-12-24,20:08:31,690 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=xxx sessionTimeout=5000 watcher=org {code} {code:java} protected void serviceInit(Configuration conf) throws Exception { LOG.info("JobHistory Init"); this.conf = conf; this.appID = ApplicationId.newInstance(0, 0); this.appAttemptID = RecordFactoryProvider.getRecordFactory(conf) .newRecordInstance(ApplicationAttemptId.class); moveThreadInterval = conf.getLong( JHAdminConfig.MR_HISTORY_MOVE_INTERVAL_MS, JHAdminConfig.DEFAULT_MR_HISTORY_MOVE_INTERVAL_MS); hsManager = createHistoryFileManager(); hsManager.init(conf); try { hsManager.initExisting();
[jira] [Created] (YARN-10060) Historyserver may recover too slow since JobHistory init too slow when there exist too many job
zhoukang created YARN-10060: --- Summary: Historyserver may recover too slow since JobHistory init too slow when there exist too many job Key: YARN-10060 URL: https://issues.apache.org/jira/browse/YARN-10060 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: zhoukang Assignee: zhoukang Like below it cost >7min to listen to the service port {code:java} 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs... 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zjy-hadoop-prc-ct07.bj/10.152.50.2:11000. Will not attempt to authenticate using SASL (unknown error) 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zjy-hadoop-prc-ct07.bj/10.152.50.2:11000, initiating session 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zjy-hadoop-prc-ct07.bj/10.152.50.2:11000, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 
2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/* 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/* 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at zjy-hadoop-prc-ct11.bj/10.152.50.42:20900 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion started. 2019-12-24,20:08:31,690 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zjyprc.observer.zk.hadoop.srv:11000 sessionTimeout=5000 watcher=org {code} {code:java} protected void serviceInit(Configuration conf) throws Exception { LOG.info("JobHistory Init"); this.conf = conf; this.appID =
[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN
[ https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002610#comment-17002610 ] zhoukang commented on YARN-7672: Thanks for the patch [~yufeigu]. Do we have a patch for the metrics? Thanks > hadoop-sls can not simulate huge scale of YARN > -- > > Key: YARN-7672 > URL: https://issues.apache.org/jira/browse/YARN-7672 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhangshilong >Assignee: zhangshilong >Priority: Major > Attachments: YARN-7672.patch > > > Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do > scheduler pressure tests. Using SLS, we start 2000+ threads to simulate NMs > and AMs, but the CPU load rises to 100+, which I thought would affect the > performance evaluation of the scheduler. So I thought to separate the > scheduler from the simulator: I start a real RM, then SLS registers nodes to > the RM and submits apps to the RM using RM RPC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10047) Memory consume of process tree will consider subprocess which may make container exit unexcepted
[ https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002273#comment-17002273 ] zhoukang commented on YARN-10047: - [~wilfreds] thanks for your reply. In some cases the memory accounting includes subprocesses, which yields an incorrect memory usage for the container. > Memory consume of process tree will consider subprocess which may make > container exit unexcepted > - > > Key: YARN-10047 > URL: https://issues.apache.org/jira/browse/YARN-10047 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As below, we have a case where a Spark driver executes some scripts. Then > sometimes the driver will be killed. > {code:java} > yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container > [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is > running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB > physical memory used; xxx. Killing container. > {code} > {code:java} > boolean isProcessTreeOverLimit(String containerId, > long currentMemUsage, > long curMemUsageOfAgedProcesses, > long vmemLimit) { > boolean isOverLimit = false; > > /** > if (currentMemUsage > (2 * vmemLimit)) { > LOG.warn("Process tree for container: " + containerId > + " running over twice " + "the configured limit. Limit=" + > vmemLimit > + ", current usage = " + currentMemUsage); > isOverLimit = true; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
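For context: the snippet quoted in the issue is cut off and contains a stray /** from the paste. The complete check in ContainersMonitorImpl has two branches — kill immediately if the whole process tree exceeds twice the limit, otherwise kill only if the processes aged past one monitoring interval exceed the limit — which is exactly why a memory-hungry subprocess forked by the container can trip the first branch. A paraphrase of the upstream method, with the log messages abbreviated into comments:
{code:java}
// Paraphrase of ContainersMonitorImpl#isProcessTreeOverLimit: the first
// branch fires on the total usage of the whole process tree, including
// freshly forked subprocesses.
boolean isProcessTreeOverLimit(String containerId, long currentMemUsage,
    long curMemUsageOfAgedProcesses, long vmemLimit) {
  boolean isOverLimit = false;
  if (currentMemUsage > (2 * vmemLimit)) {
    // whole tree (all subprocesses included) over twice the limit: kill now
    isOverLimit = true;
  } else if (curMemUsageOfAgedProcesses > vmemLimit) {
    // only processes older than one monitoring interval count in this branch
    isOverLimit = true;
  }
  return isOverLimit;
}
{code}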
[jira] [Updated] (YARN-10047) Memory consume of process tree will consider subprocess which may make container exit unexcepted
[ https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10047: Summary: Memory consume of process tree will consider subprocess which may make container exit unexcepted (was: Process tree will consider memory consume of subprocess which may make container exit unexcepted) > Memory consume of process tree will consider subprocess which may make > container exit unexcepted > - > > Key: YARN-10047 > URL: https://issues.apache.org/jira/browse/YARN-10047 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As below, we have a case which spark driver execute some scripts.Then > sometimes the driver will be killed. > {code:java} > yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container > [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is > running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB > physical memory used; xxx. Killing container. > {code} > {code:java} > boolean isProcessTreeOverLimit(String containerId, > long currentMemUsage, > long curMemUsageOfAgedProcesses, > long vmemLimit) { > boolean isOverLimit = false; > > /** > if (currentMemUsage > (2 * vmemLimit)) { > LOG.warn("Process tree for container: " + containerId > + " running over twice " + "the configured limit. Limit=" + > vmemLimit > + ", current usage = " + currentMemUsage); > isOverLimit = true; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10047) Process tree will consider memory consume of subprocess which may make container exit unexcepted
[ https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10047: Summary: Process tree will consider memory consume of subprocess which may make container exit unexcepted (was: container memory monitor may make container exit) > Process tree will consider memory consume of subprocess which may make > container exit unexcepted > > > Key: YARN-10047 > URL: https://issues.apache.org/jira/browse/YARN-10047 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As below, we have a case which spark driver execute some scripts.Then > sometimes the driver will be killed. > {code:java} > yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container > [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is > running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB > physical memory used; xxx. Killing container. > {code} > {code:java} > boolean isProcessTreeOverLimit(String containerId, > long currentMemUsage, > long curMemUsageOfAgedProcesses, > long vmemLimit) { > boolean isOverLimit = false; > > /** > if (currentMemUsage > (2 * vmemLimit)) { > LOG.warn("Process tree for container: " + containerId > + " running over twice " + "the configured limit. Limit=" + > vmemLimit > + ", current usage = " + currentMemUsage); > isOverLimit = true; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10056) Logservice may encounter NM full GC since filesystem will only close when app finished
zhoukang created YARN-10056: --- Summary: Logservice may encounter NM full GC since filesystem will only close when app finished Key: YARN-10056 URL: https://issues.apache.org/jira/browse/YARN-10056 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhoukang Assignee: zhoukang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
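The jira carries no description, so the following is only a guess at the shape of a fix, under the assumption that the aggregation code currently caches one FileSystem per app until the app finishes: open a dedicated, non-cached instance per upload cycle and close it eagerly. FileSystem.newInstance bypasses the JVM-wide FileSystem cache, which is what makes the close() safe here.
{code:java}
// Assumption-laden sketch: close the per-app FileSystem after each upload
// cycle instead of holding it until the app finishes, so idle long-running
// apps do not pin client-side buffers in the NM heap.
void uploadCycle(Configuration conf, Path remoteAppLogDir) throws IOException {
  try (FileSystem fs = FileSystem.newInstance(remoteAppLogDir.toUri(), conf)) {
    // ... write this cycle's aggregated log file under remoteAppLogDir ...
  } // closed here; the shared FileSystem cache is untouched
}
{code}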
[jira] [Created] (YARN-10047) container memory monitor may make container exit
zhoukang created YARN-10047: --- Summary: container memory monitor may make container exit Key: YARN-10047 URL: https://issues.apache.org/jira/browse/YARN-10047 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhoukang Assignee: zhoukang As below, we have a case which spark driver execute some scripts.Then sometimes the driver will be killed. {code:java} yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB physical memory used; xxx. Killing container. {code} {code:java} boolean isProcessTreeOverLimit(String containerId, long currentMemUsage, long curMemUsageOfAgedProcesses, long vmemLimit) { boolean isOverLimit = false; /** if (currentMemUsage > (2 * vmemLimit)) { LOG.warn("Process tree for container: " + containerId + " running over twice " + "the configured limit. Limit=" + vmemLimit + ", current usage = " + currentMemUsage); isOverLimit = true; } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10011) Catch all exception during init app in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10011: Component/s: nodemanager > Catch all exception during init app in LogAggregationService > -- > > Key: YARN-10011 > URL: https://issues.apache.org/jira/browse/YARN-10011 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > we should catch all exception during init app in LogAggregationService in > case of nm exit > {code:java} > 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
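The shape of the guard the summary asks for, sketched with a hypothetical failApp helper — the point is that a Throwable thrown while initializing one app's aggregation must not escape into the AsyncDispatcher thread, which is what brings the NM down in the quoted trace:
{code:java}
// Sketch only: contain per-app init failures. failApp(...) is hypothetical;
// the existing init steps are elided as comments.
private void initApp(ApplicationEvent event) {
  try {
    // existing logic: verifyAndCreateRemoteLogDir(), create the app-level
    // remote dir, start the per-app log aggregator ...
  } catch (Throwable t) {
    LOG.error("Failed to initialize log aggregation for "
        + event.getApplicationID(), t);
    failApp(event.getApplicationID(), t); // hypothetical: fail this app only
  }
}
{code}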
[jira] [Created] (YARN-10011) Catch all exception during init app in LogAggregationService
zhoukang created YARN-10011: --- Summary: Catch all exception during init app in LogAggregationService Key: YARN-10011 URL: https://issues.apache.org/jira/browse/YARN-10011 Project: Hadoop YARN Issue Type: Bug Reporter: zhoukang Assignee: zhoukang we should catch all exception during init app in LogAggregationService in case of nm exit {code:java} 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.IllegalStateException at com.google.common.base.Preconditions.checkState(Preconditions.java:129) at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115) at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300) at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10010) NM upload log cost too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10010: Attachment: (was: 选区_002.png) > NM upload log cost too much time > > > Key: YARN-10010 > URL: https://issues.apache.org/jira/browse/YARN-10010 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: notfound.png > > > Since thread pool size of log service is 100. > Some times the log uploading service will delay for some apps.like below > !选区_002.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10010) NM upload log cost too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10010: Description: Since thread pool size of log service is 100. Some times the log uploading service will delay for some apps.like below !notfound.png! was: Since thread pool size of log service is 100. Some times the log uploading service will delay for some apps.like below !选区_002.png! > NM upload log cost too much time > > > Key: YARN-10010 > URL: https://issues.apache.org/jira/browse/YARN-10010 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: notfound.png > > > Since thread pool size of log service is 100. > Some times the log uploading service will delay for some apps.like below > !notfound.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10010) NM upload log cost too much time
zhoukang created YARN-10010: --- Summary: NM upload log cost too much time Key: YARN-10010 URL: https://issues.apache.org/jira/browse/YARN-10010 Project: Hadoop YARN Issue Type: Improvement Reporter: zhoukang Assignee: zhoukang Attachments: notfound.png Since thread pool size of log service is 100. Some times the log uploading service will delay for some apps.like below !选区_002.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10010) NM upload log cost too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10010: Attachment: notfound.png > NM upload log cost too much time > > > Key: YARN-10010 > URL: https://issues.apache.org/jira/browse/YARN-10010 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: notfound.png > > > Since thread pool size of log service is 100. > Some times the log uploading service will delay for some apps.like below > !notfound.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8364) NM aggregation thread should be able to exempt pool
[ https://issues.apache.org/jira/browse/YARN-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986613#comment-16986613 ] zhoukang commented on YARN-8364: I will work on this > NM aggregation thread should be able to exempt pool > --- > > Key: YARN-8364 > URL: https://issues.apache.org/jira/browse/YARN-8364 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Reporter: Oleksandr Shevchenko >Priority: Major > > For now, we have limited NM aggregation thread pool that can be configured by > the property yarn.nodemanager.logaggregation.threadpool-size-max=100. > When some application is starting it use one unit of the pool. And locks this > unit until the application is finished. As the result, another application > can aggregate their logs only when the previous application is finished. > Just for example: > yarn.nodemanager.logaggregation.threadpool-size-max=1 > 1. Start long-running application app1 > 2. Start short application app2 > 3. Finished app2 > 4. Finished app1 > 5. Aggregating logs of app1 > 6. Aggregating logs of app2 > In the real cluster, we can have many long running jobs (for example Spark > streaming), therefore short-running application do not aggregate their logs a > long time. It problem appears if the average number of jobs exceeds thread > pool size. All threads occupied by some applications, as the result we have > the huge delay between application finishing and logs uploading. > Will be good if we improve this behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
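One way to realize the "exempt pool" described above, sketched with assumed pool sizes and an assumed way of tagging long-running apps: give such apps their own executor so they cannot occupy all the slots of the pool bounded by yarn.nodemanager.logaggregation.threadpool-size-max that short jobs depend on.
{code:java}
// Sketch: two executors so long-running apps cannot starve short ones.
// Pool sizes and the longRunning flag are assumptions, not actual config.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AggregatorPools {
  private final ExecutorService defaultPool = Executors.newFixedThreadPool(100);
  private final ExecutorService exemptPool = Executors.newFixedThreadPool(20);

  void schedule(Runnable appLogAggregator, boolean longRunning) {
    (longRunning ? exemptPool : defaultPool).execute(appLogAggregator);
  }
}
{code}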
[jira] [Comment Edited] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974020#comment-16974020 ] zhoukang edited comment on YARN-9605 at 11/14/19 8:33 AM: -- Sorry for bother [~tangzhankun][~prabhujoseph] but i really can not figure out the cause of warning below during 'cc phase': I think the patch i post has no relation with hdfs? {code:java} WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/IpcConnectionContext.pb.cc:129:13: warning: 'dynamic_init_dummy_IpcConnectionContext_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/HAServiceProtocol.pb.cc:404:13: warning: 'dynamic_init_dummy_HAServiceProtocol_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/Security.pb.cc:349:13: warning: 'dynamic_init_dummy_Security_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/acl.pb.cc:533:13: warning: 'dynamic_init_dummy_acl_2eproto' defined but not used [-Wunused-variable] {code} was (Author: cane): Sorry for bother [~tangzhankun][~prabhujoseph] but i really can not figure out the cause of warning below during 'cc phase': {code:java} WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/IpcConnectionContext.pb.cc:129:13: warning: 'dynamic_init_dummy_IpcConnectionContext_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/HAServiceProtocol.pb.cc:404:13: warning: 'dynamic_init_dummy_HAServiceProtocol_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/Security.pb.cc:349:13: warning: 'dynamic_init_dummy_Security_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/acl.pb.cc:533:13: warning: 'dynamic_init_dummy_acl_2eproto' defined but not used [-Wunused-variable] {code} > Add ZkConfiguredFailoverProxyProvider for RM HA > --- > > Key: YARN-9605 > URL: https://issues.apache.org/jira/browse/YARN-9605 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-9605.001.patch, YARN-9605.002.patch, > YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, > YARN-9605.006.patch > > > In this issue, i will track a new feature to support > ZkConfiguredFailoverProxyProvider for RM HA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974020#comment-16974020 ] zhoukang commented on YARN-9605: Sorry to bother you [~tangzhankun][~prabhujoseph], but I really cannot figure out the cause of the warnings below during the 'cc phase': {code:java} [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/IpcConnectionContext.pb.cc:129:13: warning: 'dynamic_init_dummy_IpcConnectionContext_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/HAServiceProtocol.pb.cc:404:13: warning: 'dynamic_init_dummy_HAServiceProtocol_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/Security.pb.cc:349:13: warning: 'dynamic_init_dummy_Security_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/acl.pb.cc:533:13: warning: 'dynamic_init_dummy_acl_2eproto' defined but not used [-Wunused-variable] {code} > Add ZkConfiguredFailoverProxyProvider for RM HA > --- > > Key: YARN-9605 > URL: https://issues.apache.org/jira/browse/YARN-9605 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-9605.001.patch, YARN-9605.002.patch, > YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, > YARN-9605.006.patch > > > In this issue, I will track a new feature to support > ZkConfiguredFailoverProxyProvider for RM HA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9979) When an app expires with many containers, the scheduler event queue size will be huge
[ https://issues.apache.org/jira/browse/YARN-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973932#comment-16973932 ] zhoukang commented on YARN-9979: I think we can add throttling logic to ContainerAllocationExpirer > When an app expires with many containers, the scheduler event queue size will be huge > --- > > Key: YARN-9979 > URL: https://issues.apache.org/jira/browse/YARN-9979 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > When an app expires with many containers, the scheduler event queue grows > huge. > {code:java} > 2019-11-11,21:39:49,690 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 9000 > 2019-11-11,21:39:49,695 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 10000 > 2019-11-11,21:39:49,700 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 11000 > 2019-11-11,21:39:49,705 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 12000 > 2019-11-11,21:39:49,710 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 13000 > 2019-11-11,21:39:49,715 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 14000 > 2019-11-11,21:39:49,720 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Discarded 1 > messages due to full event buffer including: Size of scheduler event-queue is > 15000 > 2019-11-11,21:39:49,724 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 16000 > 2019-11-11,21:39:49,729 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 17000 > 2019-11-11,21:39:49,733 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 18000 > 2019-11-11,21:40:14,953 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 19000 > 2019-11-11,21:43:09,743 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 19000 > 2019-11-11,21:43:09,750 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 20000 > 2019-11-11,21:43:09,758 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 21000 > 2019-11-11,21:43:09,766 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 22000 > 2019-11-11,21:43:09,775 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 23000 > 2019-11-11,21:43:09,783 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 24000 > 2019-11-11,21:43:09,792 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 25000 > 2019-11-11,21:43:09,800 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 26000 > 2019-11-11,21:43:09,807 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 27000 > 2019-11-11,21:43:09,814 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 28000 > 2019-11-11,21:46:29,830 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 29000 > 2019-11-11,21:46:29,841 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 30000 > 2019-11-11,21:46:29,850 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 31000 > 2019-11-11,21:46:29,862 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 32000 > 2019-11-11,21:49:49,875 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 33000 > 2019-11-11,21:49:49,875 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 34000 > 2019-11-11,21:49:49,876 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 35000 > 2019-11-11,21:49:49,882 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 36000 > 2019-11-11,21:49:49,887 INFO >
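A minimal sketch of the throttling idea proposed in the comment above, under stated assumptions (this is a hypothetical helper, not existing YARN code): instead of dispatching one CONTAINER_EXPIRED scheduler event per container as soon as an app expires, expired container ids are queued and drained in bounded batches, so a single expiring app cannot flood the scheduler event queue.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical throttler: drains expired container ids in bounded batches. */
public class ExpiryThrottler {
  private final BlockingQueue<String> expired = new LinkedBlockingQueue<>();
  private final int maxBatch;
  private final long pauseMs;

  public ExpiryThrottler(int maxBatch, long pauseMs) {
    this.maxBatch = maxBatch;
    this.pauseMs = pauseMs;
  }

  /** Called by the expiry checker for every expired container. */
  public void onExpired(String containerId) {
    expired.add(containerId);
  }

  /** Runs on a dedicated thread; emits at most maxBatch events per pause. */
  public void drain(Consumer<String> emit) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      List<String> batch = new ArrayList<>(maxBatch);
      batch.add(expired.take());            // block until at least one item
      expired.drainTo(batch, maxBatch - 1); // take up to maxBatch in total
      batch.forEach(emit);                  // e.g. dispatch CONTAINER_EXPIRED events
      Thread.sleep(pauseMs);                // throttle the dispatch rate
    }
  }
}
{code}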
[jira] [Created] (YARN-9979) When an app expires with many containers, the scheduler event queue size will be huge
zhoukang created YARN-9979: -- Summary: When an app expires with many containers, the scheduler event queue size will be huge Key: YARN-9979 URL: https://issues.apache.org/jira/browse/YARN-9979 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Reporter: zhoukang Assignee: zhoukang When an app expires with many containers, the scheduler event queue grows huge. {code:java} 2019-11-11,21:39:49,690 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 9000 2019-11-11,21:39:49,695 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 10000 2019-11-11,21:39:49,700 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 11000 2019-11-11,21:39:49,705 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 12000 2019-11-11,21:39:49,710 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 13000 2019-11-11,21:39:49,715 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 14000 2019-11-11,21:39:49,720 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Discarded 1 messages due to full event buffer including: Size of scheduler event-queue is 15000 2019-11-11,21:39:49,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 16000 2019-11-11,21:39:49,729 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 17000 2019-11-11,21:39:49,733 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 18000 2019-11-11,21:40:14,953 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000 2019-11-11,21:43:09,743 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000 2019-11-11,21:43:09,750 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 20000 2019-11-11,21:43:09,758 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 21000 2019-11-11,21:43:09,766 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 22000 2019-11-11,21:43:09,775 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 23000 2019-11-11,21:43:09,783 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 24000 2019-11-11,21:43:09,792 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 25000 2019-11-11,21:43:09,800 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 26000 2019-11-11,21:43:09,807 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 27000 2019-11-11,21:43:09,814 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 28000 2019-11-11,21:46:29,830 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 29000 2019-11-11,21:46:29,841 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 30000 2019-11-11,21:46:29,850 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 31000 2019-11-11,21:46:29,862 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 32000 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 33000 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 34000 2019-11-11,21:49:49,876 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 35000 2019-11-11,21:49:49,882 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 36000 2019-11-11,21:49:49,887 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 37000 2019-11-11,21:49:49,891 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 38000 2019-11-11,21:49:49,896 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 39000 2019-11-11,21:49:49,900 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 40000
[jira] [Updated] (YARN-9709) When expanding the queue list, the scheduler page will not show any applications
[ https://issues.apache.org/jira/browse/YARN-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9709: --- Attachment: YARN-9709.001.patch > When expanding the queue list, the scheduler page will not show any applications > -- > > Key: YARN-9709 > URL: https://issues.apache.org/jira/browse/YARN-9709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9709.001.patch, list1.png, list3.png > > > When expanding the queue list, the scheduler page will not show any > applications. But it works well in FairScheduler. > !list1.png! > !list3.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: YARN-9978.001.patch > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png, YARN-9978.001.patch > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: (was: YARN-9978.001.patch) > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: YARN-9978.001.patch > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png, YARN-9978.001.patch > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Component/s: capacityscheduler capacity scheduler > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: 001.png > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: (was: 001.png) > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Description: Support showing the submit ACL and admin ACL on the UI !001.png! was: Support showing the submit ACL and admin ACL on the UI > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9978) Support showing ACLs on CapacityScheduler page
zhoukang created YARN-9978: -- Summary: Support showing ACLs on CapacityScheduler page Key: YARN-9978 URL: https://issues.apache.org/jira/browse/YARN-9978 Project: Hadoop YARN Issue Type: Improvement Reporter: zhoukang Assignee: zhoukang Attachments: 001.png Support showing the submit ACL and admin ACL on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: 001.png > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9605: --- Attachment: YARN-9605.006.patch > Add ZkConfiguredFailoverProxyProvider for RM HA > --- > > Key: YARN-9605 > URL: https://issues.apache.org/jira/browse/YARN-9605 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-9605.001.patch, YARN-9605.002.patch, > YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, > YARN-9605.006.patch > > > In this issue, I will track a new feature to support > ZkConfiguredFailoverProxyProvider for RM HA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9977) Support monitoring thread count in ContainersMonitorImpl
zhoukang created YARN-9977: -- Summary: Support monitoring thread count in ContainersMonitorImpl Key: YARN-9977 URL: https://issues.apache.org/jira/browse/YARN-9977 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: zhoukang Assignee: zhoukang In this JIRA, we want to add a feature to monitor the thread count of a given container. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
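A minimal sketch of one way such monitoring could work on Linux, assuming the container's root process pid is already known (hypothetical helper, not the actual ContainersMonitorImpl change): read the Threads field from /proc/<pid>/status.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ThreadCounter {
  /** Returns the thread count of a process on Linux, or -1 if unknown. */
  public static int threadCount(int pid) {
    try {
      for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/status"))) {
        if (line.startsWith("Threads:")) {
          return Integer.parseInt(line.substring("Threads:".length()).trim());
        }
      }
    } catch (IOException | NumberFormatException e) {
      // the process may have exited, or the field may be malformed
    }
    return -1;
  }
}
{code}

A periodic monitor could sample this value for each tracked container pid and expose it as a per-container metric, analogous to how memory usage is sampled today.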
[jira] [Created] (YARN-9976) Application rejected by CapacityScheduler cannot be searched on the UI
zhoukang created YARN-9976: -- Summary: Application rejected by CapacityScheduler cannot be searched on the UI Key: YARN-9976 URL: https://issues.apache.org/jira/browse/YARN-9976 Project: Hadoop YARN Issue Type: Bug Reporter: zhoukang In JIRA https://issues.apache.org/jira/browse/YARN-4522 the submission ACL check is done at RMAppManager. But this means users cannot find their rejected apps on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9976) Application rejected by CapacityScheduler cannot be searched on the UI
[ https://issues.apache.org/jira/browse/YARN-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reassigned YARN-9976: -- Assignee: zhoukang > Application rejected by CapacityScheduler cannot be searched on the UI > --- > > Key: YARN-9976 > URL: https://issues.apache.org/jira/browse/YARN-9976 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > In JIRA https://issues.apache.org/jira/browse/YARN-4522 the submission ACL check > is done at RMAppManager. But this means users cannot find their rejected apps > on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9975) Support proxy acl user for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9975: --- Parent: YARN-9698 Issue Type: Sub-task (was: Improvement) > Support proxy acl user for CapacityScheduler > > > Key: YARN-9975 > URL: https://issues.apache.org/jira/browse/YARN-9975 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As commented in https://issues.apache.org/jira/browse/YARN-9698, > I will open a new JIRA for the proxy user feature. > The background is that we have long-running SQL thriftservers for many users: > {quote}{{user->sql proxy-> sql thriftserver}}{quote} > But we do not have keytabs for all users on the 'sql proxy'. We just use a super > user like 'sql_prc' to submit the 'sql thriftserver' application. To support > this, we should change the scheduler to support a proxy user ACL -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9975) Support proxy acl user for CapacityScheduler
zhoukang created YARN-9975: -- Summary: Support proxy acl user for CapacityScheduler Key: YARN-9975 URL: https://issues.apache.org/jira/browse/YARN-9975 Project: Hadoop YARN Issue Type: Improvement Reporter: zhoukang Assignee: zhoukang As commented in https://issues.apache.org/jira/browse/YARN-9698, I will open a new JIRA for the proxy user feature. The background is that we have long-running SQL thriftservers for many users: {quote}{{user->sql proxy-> sql thriftserver}}{quote} But we do not have keytabs for all users on the 'sql proxy'. We just use a super user like 'sql_prc' to submit the 'sql thriftserver' application. To support this, we should change the scheduler to support a proxy user ACL -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
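A minimal sketch of the proxy-user semantics being requested, under stated assumptions (the class, the super-user set, and the user names are hypothetical, not existing CapacityScheduler API): when the submitter is a configured super user such as 'sql_prc', the queue's submit ACL is also checked against the real, proxied user.

{code:java}
import java.util.Set;

/** Hypothetical ACL check: a super user may submit on behalf of proxied users. */
public class ProxyAclChecker {
  private final Set<String> superUsers;     // e.g. {"sql_prc"}
  private final Set<String> queueSubmitAcl; // users allowed to submit to the queue

  public ProxyAclChecker(Set<String> superUsers, Set<String> queueSubmitAcl) {
    this.superUsers = superUsers;
    this.queueSubmitAcl = queueSubmitAcl;
  }

  public boolean canSubmit(String submitter, String proxiedUser) {
    if (queueSubmitAcl.contains(submitter)) {
      return true; // normal path: the submitter itself is in the ACL
    }
    // proxy path: a super user submits for a user who is in the ACL
    return superUsers.contains(submitter) && queueSubmitAcl.contains(proxiedUser);
  }
}
{code}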
[jira] [Assigned] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reassigned YARN-7621: -- Assignee: zhoukang (was: Tao Yang) > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: zhoukang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path while CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler, to make the interface clearer > and the scheduler switch smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973888#comment-16973888 ] zhoukang commented on YARN-7621: [~Tao Yang] I will work on this, thanks! > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path while CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler, to make the interface clearer > and the scheduler switch smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
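To make the queue-name vs queue-path difference concrete, here is a hedged illustration (ApplicationSubmissionContext and Records are standard YARN client API; the queue layout "root.engineering.etl" is made up for the example):

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class QueueNamingExample {
  public static void main(String[] args) {
    ApplicationSubmissionContext ctx =
        Records.newRecord(ApplicationSubmissionContext.class);

    // FairScheduler identifies queues by their full hierarchical path.
    ctx.setQueue("root.engineering.etl");

    // CapacityScheduler (without this change) expects only the leaf name,
    // which is unambiguous only because duplicate leaf names are disallowed.
    ctx.setQueue("etl");

    System.out.println("queue = " + ctx.getQueue());
  }
}
{code}

Accepting the full path form in CapacityScheduler would let the same submission code work unchanged across both schedulers.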
[jira] [Comment Edited] (YARN-9693) When AMRMProxyService is enabled, RMCommunicator fails to register
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973877#comment-16973877 ] zhoukang edited comment on YARN-9693 at 11/14/19 2:44 AM: -- An initial patch has been posted; I will optimize it. [~subru] Could you help review this idea? [~botong][~giovanni.fumarola] Thanks was (Author: cane): An initial patch has been posted; I will optimize it. [~subru] Could you help review this idea? > When AMRMProxyService is enabled, RMCommunicator fails to register > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable the AMRM proxy service, RMCommunicator fails to register with the > error below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at >
[jira] [Commented] (YARN-9693) When AMRMProxyService is enabled, RMCommunicator fails to register
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973877#comment-16973877 ] zhoukang commented on YARN-9693: An initial patch has been posted; I will optimize it. [~subru] Could you help review this idea? > When AMRMProxyService is enabled, RMCommunicator fails to register > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable the AMRM proxy service, RMCommunicator fails to register with the > error below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170) > ... 14 more > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >
[jira] [Resolved] (YARN-9974) Large diagnostics may cause RM recovery to fail
[ https://issues.apache.org/jira/browse/YARN-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang resolved YARN-9974. Resolution: Duplicate > Large diagnostics may cause RM recovery to fail > - > > Key: YARN-9974 > URL: https://issues.apache.org/jira/browse/YARN-9974 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Critical > > {code:java} > 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply > sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null > finished:false header:: 1659,4 replyHeader:: 1659,27117069873,0 request:: > '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F > response:: >
[jira] [Assigned] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reassigned YARN-9973: -- Assignee: zhoukang > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9973.001.patch > > > When we get the exception below, the scanning thread in the job history server > will exit; we should catch the RuntimeException > {code:java} > xxx 2019-06-30,17:45:52,386 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: > Fail to get initial active Namenode information java.lang.RuntimeException: > Fail to get active namenode from zookeeper > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) > at $Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) > at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) > at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9974) Large diagnostics may cause RM recovery to fail
[ https://issues.apache.org/jira/browse/YARN-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973331#comment-16973331 ] zhoukang commented on YARN-9974: I will post a patch later > Large diagnostics may cause RM recovery to fail > - > > Key: YARN-9974 > URL: https://issues.apache.org/jira/browse/YARN-9974 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Critical > > {code:java} > 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply > sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null > finished:false header:: 1659,4 replyHeader:: 1659,27117069873,0 request:: > '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F > response:: >
[jira] [Created] (YARN-9974) Large diagnostics may cause RM recovery to fail
zhoukang created YARN-9974: -- Summary: Large diagnostics may cause RM recovery to fail Key: YARN-9974 URL: https://issues.apache.org/jira/browse/YARN-9974 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhoukang Assignee: zhoukang {code:java} 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null finished:false header:: 1659,4 replyHeader:: 1659,27117069873,0 request:: '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F response::
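A minimal sketch of one plausible fix for this class of problem, under stated assumptions (the class and the limit are hypothetical, not the actual patch): cap the diagnostics string before persisting application state, so one huge diagnostics blob cannot push the znode past ZooKeeper's size limit and break store or recovery.

{code:java}
/** Hypothetical guard: cap diagnostics before writing app state to the RM state store. */
public final class DiagnosticsLimiter {
  // ZooKeeper's default jute.maxbuffer is about 1 MB; stay well under it.
  private static final int MAX_DIAG_CHARS = 64 * 1024;

  public static String truncate(String diagnostics) {
    if (diagnostics == null || diagnostics.length() <= MAX_DIAG_CHARS) {
      return diagnostics;
    }
    // Keep the head of the message, which usually carries the root cause.
    return diagnostics.substring(0, MAX_DIAG_CHARS)
        + "\n...[diagnostics truncated before persisting to the state store]";
  }
}
{code}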
[jira] [Updated] (YARN-9693) When AMRMProxyService is enabled, RMCommunicator fails to register
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9693: --- Attachment: YARN-9693.001.patch > When AMRMProxyService is enabled, RMCommunicator fails to register > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable the AMRM proxy service, RMCommunicator fails to register with the > error below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170) > ... 14 more > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at
[jira] [Updated] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9973: --- Description: When we get the exception below, the scanning thread in the job history server will exit; we should catch the RuntimeException {code:java} xxx 2019-06-30,17:45:52,386 ERROR org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: Fail to get initial active Namenode information java.lang.RuntimeException: Fail to get active namenode from zookeeper at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) at $Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) {code} was: When we get the exception below, the cleaner thread will exit; we should catch the RuntimeException {code:java} xxx 2019-06-30,17:45:52,386 ERROR org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: Fail to get initial active Namenode information java.lang.RuntimeException: Fail to get active namenode from zookeeper at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) at $Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) {code} > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Priority: Major > Attachments: YARN-9973.001.patch > > > When we get the exception below, the scanning thread in the job history server will exit; we should > catch the RuntimeException >
[jira] [Updated] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9973: --- Attachment: YARN-9973.001.patch > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Priority: Major > Attachments: YARN-9973.001.patch > > > When we get the exception below, the cleaner thread will exit; we should catch > the RuntimeException > {code:java} > xxx 2019-06-30,17:45:52,386 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: > Fail to get initial active Namenode information java.lang.RuntimeException: > Fail to get active namenode from zookeeper > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) > at $Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) > at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) > at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9973: --- Component/s: yarn > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Priority: Major > > When we get the exception below, the cleaner thread will exit; we should catch > the RuntimeException > {code:java} > xxx 2019-06-30,17:45:52,386 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: > Fail to get initial active Namenode information java.lang.RuntimeException: > Fail to get active namenode from zookeeper > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) > at $Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) > at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) > at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9973) Catch RuntimeException in yarn historyserver
zhoukang created YARN-9973: -- Summary: Catch RuntimeException in yarn historyserver Key: YARN-9973 URL: https://issues.apache.org/jira/browse/YARN-9973 Project: Hadoop YARN Issue Type: Bug Reporter: zhoukang When we get the exception below, the cleaner thread will exit; we should catch the RuntimeException {code:java} xxx 2019-06-30,17:45:52,386 ERROR org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: Fail to get initial active Namenode information java.lang.RuntimeException: Fail to get active namenode from zookeeper at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) at $Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
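A minimal sketch of the proposed fix, assuming the scan-loop shape implied by the stack trace (a hypothetical simplification, not the actual HistoryFileManager patch): wrap each per-directory scan so a RuntimeException from a transient failover error is logged and skipped instead of killing the scanning thread.

{code:java}
import java.util.List;
import java.util.function.Consumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical wrapper: keep the history scan loop alive across bad directories. */
public class SafeHistoryScanner {
  private static final Logger LOG = LoggerFactory.getLogger(SafeHistoryScanner.class);

  public void scanAll(List<String> userLogDirs, Consumer<String> scanOneDir) {
    for (String dir : userLogDirs) {
      try {
        scanOneDir.accept(dir);
      } catch (RuntimeException e) {
        // e.g. "Fail to get active namenode from zookeeper": log and move on
        // instead of letting the exception propagate and kill the thread.
        LOG.error("Error while scanning intermediate dir " + dir, e);
      }
    }
  }
}
{code}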
[jira] [Assigned] (YARN-9972) Do not kill am container when node is unhealthy
[ https://issues.apache.org/jira/browse/YARN-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang reassigned YARN-9972:
------------------------------
    Assignee: zhoukang

> Do not kill am container when node is unhealthy
> ------------------------------------------------
>
>                 Key: YARN-9972
>                 URL: https://issues.apache.org/jira/browse/YARN-9972
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In this patch, we want to add a configuration that disables killing the AM container when its node becomes unhealthy, since killing it causes some applications to exit with failure.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9972) Do not kill am container when node is unhealthy
zhoukang created YARN-9972:
------------------------------
             Summary: Do not kill am container when node is unhealthy
                 Key: YARN-9972
                 URL: https://issues.apache.org/jira/browse/YARN-9972
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
            Reporter: zhoukang

In this patch, we want to add a configuration that disables killing the AM container when its node becomes unhealthy, since killing it causes some applications to exit with failure.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
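A rough sketch of what such a switch could look like. The property name below is purely hypothetical (no such key exists in yarn-site.xml), and the helper class only illustrates the intended branch, not actual ResourceManager code.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical policy object consulted when a node is reported unhealthy:
// with the switch enabled, AM containers are spared so the application
// does not fail along with the node.
public class UnhealthyNodePolicy {
  // Hypothetical key; not an existing YARN configuration property.
  public static final String KEEP_AM_ON_UNHEALTHY_NODE =
      "yarn.resourcemanager.keep-am-container-on-unhealthy-node";

  private final boolean keepAm;

  public UnhealthyNodePolicy(Configuration conf) {
    this.keepAm = conf.getBoolean(KEEP_AM_ON_UNHEALTHY_NODE, false);
  }

  /** Whether a container should be killed when its node turns unhealthy. */
  public boolean shouldKill(boolean isAmContainer) {
    // Default (false) preserves today's behavior: kill every container.
    return !(keepAm && isAmContainer);
  }
}
{code}
Defaulting the switch to false keeps the current semantics, so clusters that rely on unhealthy nodes being fully drained are unaffected.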
[jira] [Commented] (YARN-9931) Support run script before kill container
[ https://issues.apache.org/jira/browse/YARN-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973282#comment-16973282 ]

zhoukang commented on YARN-9931:
--------------------------------
The background is that our production cluster runs many applications and frameworks. We often hit the problem that a container was killed and we had no information about what it was doing at the time. Asking users to add a shutdown hook may be unfriendly to them, while adding this feature in YARN would make troubleshooting more efficient. [~epayne] Thanks!

> Support run script before kill container
> -----------------------------------------
>
>                 Key: YARN-9931
>                 URL: https://issues.apache.org/jira/browse/YARN-9931
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> Like the node health check script, we can add a pre-kill script which runs before a container is killed. For example, we can save a thread dump before killing the container, which is helpful for troubleshooting.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
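To illustrate the idea, a minimal sketch of a pre-kill hook, assuming the script path comes from a node-manager configuration key analogous to the health check script's; the class name, parameters, and timeout below are all hypothetical.
{code:java}
import java.io.File;
import java.util.concurrent.TimeUnit;

// Hypothetical pre-kill hook: run an admin-supplied script (for example,
// one that takes a jstack of the container's JVM) just before the kill
// signal is delivered.
public class PreKillHook {
  private final String scriptPath; // from a (hypothetical) NM config key

  public PreKillHook(String scriptPath) {
    this.scriptPath = scriptPath;
  }

  public void runBeforeKill(String containerId, String pid) {
    if (scriptPath == null || !new File(scriptPath).canExecute()) {
      return; // hook not configured: proceed straight to the normal kill
    }
    try {
      Process p = new ProcessBuilder(scriptPath, containerId, pid)
          .inheritIO()
          .start();
      // Bound the hook so a hung script cannot delay the kill indefinitely.
      if (!p.waitFor(10, TimeUnit.SECONDS)) {
        p.destroyForcibly();
      }
    } catch (Exception e) {
      // Best-effort: a failing hook must never block container shutdown.
    }
  }
}
{code}
The timeout is the important design choice: without it, a misbehaving script would hold up container shutdown, turning a troubleshooting aid into an availability problem.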
[jira] [Comment Edited] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973280#comment-16973280 ]

zhoukang edited comment on YARN-9930 at 11/13/19 11:55 AM:
-----------------------------------------------------------
[~pbacsko] Thanks. The background is that we want to upgrade our production cluster to Hadoop 3.x. We used FairScheduler before, and in 3.x we want to move to CapacityScheduler. If we migrate from FS to CS, this behavior will be confusing to users. [~epayne] [~pbacsko] I agree with the point. Add a config like
bq. "yarn.scheduler.capacity.maxrunningapps.reject"

was (Author: cane):
[~pbacsko] Thanks. The background is that we want to upgrade our production cluster to Hadoop 3.x. We used FairScheduler before, and in 3.x we want to move to CapacityScheduler. If we migrate from FS to CS, this behavior will be confusing. [~epayne] [~pbacsko] I agree with the point. Add a config like
bq. "yarn.scheduler.capacity.maxrunningapps.reject"

> Support max running app logic for CapacityScheduler
> ----------------------------------------------------
>
>                 Key: YARN-9930
>                 URL: https://issues.apache.org/jira/browse/YARN-9930
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In FairScheduler, there is a max-running-apps limit that leaves excess applications pending. CapacityScheduler has no such feature; it only has a max-applications limit, and jobs beyond it are rejected directly on the client. In this jira I want to implement the same semantics for CapacityScheduler.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
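For clarity, a sketch of the pending-instead-of-reject semantics being proposed, in the style of FairScheduler's maxRunningApps. The class is illustrative only; it is not CapacityScheduler code, and in practice the limit would be read from a per-queue property such as the hypothetical key quoted in the comment above.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative per-queue gate: submissions beyond the running limit are
// queued as pending (FairScheduler-style) rather than rejected at the
// client (CapacityScheduler's max-applications behavior).
public class MaxRunningAppsGate {
  private final int maxRunningApps;
  private int runningApps = 0;
  private final Queue<String> pendingApps = new ArrayDeque<>();

  public MaxRunningAppsGate(int maxRunningApps) {
    this.maxRunningApps = maxRunningApps;
  }

  /** Returns true if the app can run now; otherwise it is left pending. */
  public synchronized boolean submit(String appId) {
    if (runningApps < maxRunningApps) {
      runningApps++;
      return true;
    }
    pendingApps.add(appId);
    return false;
  }

  /** On completion, returns the next pending app to activate, or null. */
  public synchronized String onAppFinished() {
    runningApps--;
    String next = pendingApps.poll();
    if (next != null) {
      runningApps++;
    }
    return next;
  }
}
{code}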