[jira] [Commented] (YARN-10080) Support showing the app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090195#comment-17090195 ]

zhoukang commented on YARN-10080:
---------------------------------

How can we push this forward, [~adam.antal] [~abmodi]? Thanks

> Support showing the app id on localizer thread pool
> ----------------------------------------------------
>
>                 Key: YARN-10080
>                 URL: https://issues.apache.org/jira/browse/YARN-10080
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10080-001.patch, YARN-10080.002.patch
>
>
> Currently, when we are troubleshooting a container localizer issue and want
> to analyze a jstack with thread details, we cannot figure out which thread
> is processing the given container. So I want to add the app id to the
> thread name.
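(For context, a minimal sketch of the idea above: embedding the app id into pool thread names via a custom ThreadFactory so jstack output becomes attributable. The class name and thread-name prefix are illustrative assumptions, not the actual YARN-10080 patch.)

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a ThreadFactory that bakes the application id into the
// names of localizer pool threads.
public class AppIdThreadFactory implements ThreadFactory {
  private final String appId;   // e.g. "application_1577000000000_0001"
  private final AtomicInteger count = new AtomicInteger(0);

  public AppIdThreadFactory(String appId) {
    this.appId = appId;
  }

  @Override
  public Thread newThread(Runnable r) {
    // The name appears verbatim in jstack output, e.g.
    // "Localizer-application_..._0001-3", so a stuck thread can be mapped
    // back to the app it is localizing for.
    return new Thread(r, "Localizer-" + appId + "-" + count.incrementAndGet());
  }

  public static ExecutorService newLocalizerPool(String appId, int poolSize) {
    return Executors.newFixedThreadPool(poolSize, new AppIdThreadFactory(appId));
  }
}
{code}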
[jira] [Created] (YARN-10242) CapacityScheduler may call updateClusterResource for every node register event, which can make node registration too slow
zhoukang created YARN-10242:
-------------------------------

             Summary: CapacityScheduler may call updateClusterResource for every node register event, which can make node registration too slow
                 Key: YARN-10242
                 URL: https://issues.apache.org/jira/browse/YARN-10242
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: capacityscheduler, resourcemanager
            Reporter: zhoukang
[jira] [Commented] (YARN-10204) ResContainer may be unreserved while processing outstanding containers
[ https://issues.apache.org/jira/browse/YARN-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063907#comment-17063907 ]

zhoukang commented on YARN-10204:
---------------------------------

We can add double-check logic as below:

{code:java}
private void completeOustandingUpdatesWhichAreReserved(
    RMContainer rmContainer, ContainerStatus containerStatus,
    RMContainerEventType event) {
  N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
  if (schedulerNode != null &&
      schedulerNode.getReservedContainer() != null) {
    RMContainer resContainer = schedulerNode.getReservedContainer();
    // Double check here, since the container may be unreserved between the
    // two getReservedContainer() calls, which can make resContainer null.
    if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {
      ContainerId containerToUpdate = resContainer
          .getReservedSchedulerKey().getContainerToUpdate();
{code}

> ResContainer may be unreserved while processing outstanding containers
> -----------------------------------------------------------------------
>
>                 Key: YARN-10204
>                 URL: https://issues.apache.org/jira/browse/YARN-10204
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> ResContainer may be unreserved while processing outstanding containers,
> which may cause the RM to exit with a failure:
> {code:java}
> 2020-03-21,13:13:36,569 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type CONTAINER_EXPIRED to the Event Dispatcher
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completeOustandingUpdatesWhichAreReserved(AbstractYarnScheduler.java:719)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:678)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1952)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:168)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
[jira] [Created] (YARN-10204) ResContainer may be unreserved while processing outstanding containers
zhoukang created YARN-10204:
-------------------------------

             Summary: ResContainer may be unreserved while processing outstanding containers
                 Key: YARN-10204
                 URL: https://issues.apache.org/jira/browse/YARN-10204
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang


ResContainer may be unreserved while processing outstanding containers, which
may cause the RM to exit with a failure:

{code:java}
2020-03-21,13:13:36,569 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type CONTAINER_EXPIRED to the Event Dispatcher
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completeOustandingUpdatesWhichAreReserved(AbstractYarnScheduler.java:719)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:678)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1952)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:168)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
        at java.lang.Thread.run(Thread.java:748)
{code}
[jira] [Created] (YARN-10118) Use ZK to store node info for the RM so that lost-node information can be shown in the RM UI after failover
zhoukang created YARN-10118:
-------------------------------

             Summary: Use ZK to store node info for the RM so that lost-node information can be shown in the RM UI after failover
                 Key: YARN-10118
                 URL: https://issues.apache.org/jira/browse/YARN-10118
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang


When maintaining a large cluster we may have some lost nodes. If we fail over
before dealing with these nodes, their information is lost in the new active
RM. We can use ZK to store the nodes' information.
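(A minimal sketch of the idea, assuming Apache Curator for ZK access; the znode path, serialization, and class name are illustrative assumptions, not the actual patch.)

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch only: persist lost-node records in ZK so the next active RM can
// rebuild its lost-nodes view after failover.
public class LostNodeStore {
  // Illustrative znode root; a real patch would reuse the RM's ZK store paths.
  private static final String ROOT = "/rmstore/lost-nodes";
  private final CuratorFramework zk;

  public LostNodeStore(String connectString) {
    zk = CuratorFrameworkFactory.newClient(
        connectString, new ExponentialBackoffRetry(1000, 3));
    zk.start();
  }

  // Record a lost node; overwrite if it was already recorded.
  public void recordLostNode(String nodeId, String info) throws Exception {
    String path = ROOT + "/" + nodeId;
    byte[] data = info.getBytes(StandardCharsets.UTF_8);
    if (zk.checkExists().forPath(path) == null) {
      zk.create().creatingParentsIfNeeded().forPath(path, data);
    } else {
      zk.setData().forPath(path, data);
    }
  }
}
{code}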
[jira] [Created] (YARN-10115) Use interceptor pipeline for app submit to make app submit checker policy pluggable
zhoukang created YARN-10115:
-------------------------------

             Summary: Use interceptor pipeline for app submit to make app submit checker policy pluggable
                 Key: YARN-10115
                 URL: https://issues.apache.org/jira/browse/YARN-10115
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: capacityscheduler, resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10115) Use interceptor pipeline for app submit to make app submit check policy pluggable
[ https://issues.apache.org/jira/browse/YARN-10115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10115:
----------------------------
    Summary: Use interceptor pipeline for app submit to make app submit check policy pluggable  (was: Use interceptor pipeline for app submit to make app submit checker policy pluggable)

> Use interceptor pipeline for app submit to make app submit check policy
> pluggable
> ------------------------------------------------------------------------
>
>                 Key: YARN-10115
>                 URL: https://issues.apache.org/jira/browse/YARN-10115
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler, resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
[jira] [Updated] (YARN-10060) Historyserver may recover too slowly since JobHistory init is too slow when there exist too many jobs
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10060:
----------------------------
    Attachment:     (was: YARN-10060.001.patch)

> Historyserver may recover too slowly since JobHistory init is too slow
> when there exist too many jobs
> -----------------------------------------------------------------------
>
>                 Key: YARN-10060
>                 URL: https://issues.apache.org/jira/browse/YARN-10060
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10060-001.patch
>
>
> As shown below, it took more than 7 minutes before the service port was listening:
> {code:java}
> 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs...
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session
> 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init
> 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started
> 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
> 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined
> 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/*
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
> 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901
> 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901
> 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server
> 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx
> 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log
[jira] [Updated] (YARN-10060) Historyserver may recover too slowly since JobHistory init is too slow when there exist too many jobs
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10060:
----------------------------
    Attachment: YARN-10060-001.patch

> Historyserver may recover too slowly since JobHistory init is too slow
> when there exist too many jobs
> -----------------------------------------------------------------------
>
>                 Key: YARN-10060
>                 URL: https://issues.apache.org/jira/browse/YARN-10060
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10060-001.patch
>
>
> As shown below, it took more than 7 minutes before the service port was listening:
> {code:java}
> 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs...
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session
> 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init
> 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started
> 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
> 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined
> 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/*
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
> 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901
> 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901
> 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server
> 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx
> 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion
[jira] [Commented] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028457#comment-17028457 ]

zhoukang commented on YARN-10011:
---------------------------------

[~adam.antal] Could you help review this?

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011-001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
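(A minimal sketch of the defensive pattern the issue asks for; the wrapper class and names are illustrative, not the actual YARN-10011 patch. The point is that a failure while initializing one app's log aggregation, like the IllegalStateException above, must not kill the shared dispatcher thread.)

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: catch-all around per-app init so one bad app cannot take the
// whole NodeManager down via the AsyncDispatcher thread.
public final class SafeInit {
  private static final Logger LOG = LoggerFactory.getLogger(SafeInit.class);

  public static void initAppSafely(Runnable initApp, String appId) {
    try {
      initApp.run();   // e.g. verifyAndCreateRemoteLogDir + per-app setup
    } catch (Throwable t) {
      // Log and continue: the NM stays up and only this app's log
      // aggregation is marked failed, instead of the dispatcher dying.
      LOG.error("Failed to initialize log aggregation for " + appId, t);
    }
  }
}
{code}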
[jira] [Updated] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10011:
----------------------------
    Attachment: YARN-10011-001.patch

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011-001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Updated] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10011:
----------------------------
    Attachment:     (was: YARN-10011.001.patch)

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011-001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Commented] (YARN-10080) Support showing the app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028455#comment-17028455 ]

zhoukang commented on YARN-10080:
---------------------------------

Ping [~abmodi] [~tangzhankun]

> Support showing the app id on localizer thread pool
> ----------------------------------------------------
>
>                 Key: YARN-10080
>                 URL: https://issues.apache.org/jira/browse/YARN-10080
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10080-001.patch, YARN-10080.002.patch
>
>
> Currently, when we are troubleshooting a container localizer issue and want
> to analyze a jstack with thread details, we cannot figure out which thread
> is processing the given container. So I want to add the app id to the
> thread name.
[jira] [Created] (YARN-10096) Add ZK-based configuration provider for router
zhoukang created YARN-10096:
-------------------------------

             Summary: Add ZK-based configuration provider for router
                 Key: YARN-10096
                 URL: https://issues.apache.org/jira/browse/YARN-10096
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: router
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10094) Add a configuration to support NM overuse in RM
[ https://issues.apache.org/jira/browse/YARN-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10094:
----------------------------
    Attachment: YARN-10094.001.patch

> Add a configuration to support NM overuse in RM
> ------------------------------------------------
>
>                 Key: YARN-10094
>                 URL: https://issues.apache.org/jira/browse/YARN-10094
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10094.001.patch
>
>
> In a large cluster, upgrading NMs costs too much time. Sometimes we want to
> support memory or CPU overuse from the RM's view.
[jira] [Updated] (YARN-10094) Add configuration to support NM overuse in RM
[ https://issues.apache.org/jira/browse/YARN-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10094:
----------------------------
    Summary: Add configuration to support NM overuse in RM  (was: Add a configuration to support NM overuse in RM)

> Add configuration to support NM overuse in RM
> ----------------------------------------------
>
>                 Key: YARN-10094
>                 URL: https://issues.apache.org/jira/browse/YARN-10094
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10094.001.patch
>
>
> In a large cluster, upgrading NMs costs too much time. Sometimes we want to
> support memory or CPU overuse from the RM's view.
[jira] [Updated] (YARN-10094) Add a configuration to support NM overuse in RM
[ https://issues.apache.org/jira/browse/YARN-10094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10094:
----------------------------
    Description: 
In a large cluster, upgrading NMs costs too much time. Sometimes we want to
support memory or CPU overuse from the RM's view.

> Add a configuration to support NM overuse in RM
> ------------------------------------------------
>
>                 Key: YARN-10094
>                 URL: https://issues.apache.org/jira/browse/YARN-10094
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In a large cluster, upgrading NMs costs too much time. Sometimes we want to
> support memory or CPU overuse from the RM's view.
[jira] [Created] (YARN-10094) Add a configuration to support NM overuse in RM
zhoukang created YARN-10094:
-------------------------------

             Summary: Add a configuration to support NM overuse in RM
                 Key: YARN-10094
                 URL: https://issues.apache.org/jira/browse/YARN-10094
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10093) Support list applications by queue name
[ https://issues.apache.org/jira/browse/YARN-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10093:
----------------------------
    Summary: Support list applications by queue name  (was: Support get applications by queue)

> Support list applications by queue name
> ----------------------------------------
>
>                 Key: YARN-10093
>                 URL: https://issues.apache.org/jira/browse/YARN-10093
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
[jira] [Created] (YARN-10093) Support get applications by queue
zhoukang created YARN-10093:
-------------------------------

             Summary: Support get applications by queue
                 Key: YARN-10093
                 URL: https://issues.apache.org/jira/browse/YARN-10093
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Updated] (YARN-10092) Support config special log retain time for given user
[ https://issues.apache.org/jira/browse/YARN-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10092:
----------------------------
    Summary: Support config special log retain time for given user  (was: Support log retain time for give user)

> Support config special log retain time for given user
> ------------------------------------------------------
>
>                 Key: YARN-10092
>                 URL: https://issues.apache.org/jira/browse/YARN-10092
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
[jira] [Created] (YARN-10092) Support log retain time for give user
zhoukang created YARN-10092:
-------------------------------

             Summary: Support log retain time for give user
                 Key: YARN-10092
                 URL: https://issues.apache.org/jira/browse/YARN-10092
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
            Reporter: zhoukang
            Assignee: zhoukang
[jira] [Created] (YARN-10091) Support cleaning up orphan apps' logs in LogAggService
zhoukang created YARN-10091:
-------------------------------

             Summary: Support cleaning up orphan apps' logs in LogAggService
                 Key: YARN-10091
                 URL: https://issues.apache.org/jira/browse/YARN-10091
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
            Reporter: zhoukang
            Assignee: zhoukang


In a large cluster there can exist orphan app log directories, which cause a
disk-space leak. We should support cleaning up the log directories of such
apps.
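(A minimal sketch of the cleanup idea. The class, the liveAppIds source, and the log-root layout are illustrative assumptions; a real patch would hook into the set of apps LogAggregationService actually tracks.)

{code:java}
import java.io.File;
import java.util.Set;
import org.apache.hadoop.fs.FileUtil;

// Sketch only: scan an NM log root and delete directories whose app id is no
// longer tracked by the NM (an "orphan" left behind by a crash or restart).
public final class OrphanLogCleaner {
  public static void clean(File logRoot, Set<String> liveAppIds) {
    File[] appDirs = logRoot.listFiles();
    if (appDirs == null) {
      return;                     // log root missing or unreadable
    }
    for (File dir : appDirs) {
      String name = dir.getName();   // e.g. application_1577000000000_0007
      if (dir.isDirectory() && name.startsWith("application_")
          && !liveAppIds.contains(name)) {
        FileUtil.fullyDelete(dir);   // recursive delete of the orphan dir
      }
    }
  }
}
{code}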
[jira] [Updated] (YARN-10062) Support deploying multiple historyservers in case of SPOF
[ https://issues.apache.org/jira/browse/YARN-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10062:
----------------------------
    Attachment: YARN-10062.001.patch

> Support deploying multiple historyservers in case of SPOF
> ----------------------------------------------------------
>
>                 Key: YARN-10062
>                 URL: https://issues.apache.org/jira/browse/YARN-10062
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10062.001.patch
>
>
> In this jira I want to implement a patch to support HistoryServer HA.
> We can deploy two historyservers and use a load balancer like LVS to
> support HA. But errors like the below exist in our production cluster:
> {code:java}
> 19/12/13/00 does not exist.
> 2019-12-21,13:25:06,822 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:07,530 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:09,910 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:44:29,044 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:47:08,154 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> {code}
[jira] [Updated] (YARN-10062) Support deploying multiple historyservers in case of SPOF
[ https://issues.apache.org/jira/browse/YARN-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10062:
----------------------------
    Description: 
In this jira I want to implement a patch to support HistoryServer HA.
We can deploy two historyservers and use a load balancer like LVS to support
HA. But errors like the below exist in our production cluster:
{code:java}
19/12/13/00 does not exist.
2019-12-21,13:25:06,822 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:25:07,530 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:25:09,910 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:44:29,044 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
2019-12-21,13:47:08,154 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
{code}

  was: In this jira, i want to implement a patch to support history ha

> Support deploying multiple historyservers in case of SPOF
> ----------------------------------------------------------
>
>                 Key: YARN-10062
>                 URL: https://issues.apache.org/jira/browse/YARN-10062
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In this jira I want to implement a patch to support HistoryServer HA.
> We can deploy two historyservers and use a load balancer like LVS to
> support HA. But errors like the below exist in our production cluster:
> {code:java}
> 19/12/13/00 does not exist.
> 2019-12-21,13:25:06,822 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:07,530 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:25:09,910 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:44:29,044 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> 2019-12-21,13:47:08,154 DEBUG org.apache.hadoop.yarn.webapp.Controller: text/plain; charset=UTF-8: java.io.FileNotFoundException: File /yarn/xxx/staging/history/done/2019/12/13/00 does not exist.
> {code}
[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019209#comment-17019209 ]

zhoukang commented on YARN-9605:
--------------------------------

Ping [~prabhujoseph] [~subru] [~tangzhankun], could you help push this jira? Thanks

> Add ZkConfiguredFailoverProxyProvider for RM HA
> -----------------------------------------------
>
>                 Key: YARN-9605
>                 URL: https://issues.apache.org/jira/browse/YARN-9605
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>             Fix For: 3.2.0, 3.1.2
>
>         Attachments: YARN-9605.001.patch, YARN-9605.002.patch, YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, YARN-9605.006.patch
>
>
> In this issue I will track a new feature to support
> ZkConfiguredFailoverProxyProvider for RM HA.
[jira] [Commented] (YARN-10069) Showing jstack on UI for containers
[ https://issues.apache.org/jira/browse/YARN-10069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019207#comment-17019207 ]

zhoukang commented on YARN-10069:
---------------------------------

[~akhilpb] Showing jstack for a running container.

> Showing jstack on UI for containers
> ------------------------------------
>
>                 Key: YARN-10069
>                 URL: https://issues.apache.org/jira/browse/YARN-10069
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In this jira I want to post a patch to support showing jstack on the
> container UI.
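(A minimal sketch of the capture side, assuming the `jstack` CLI is on the PATH of the NM host; the helper class is illustrative and says nothing about how the actual patch wires the output into the web UI.)

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch only: capture `jstack <pid>` output for a running container process
// so a web page could render it on the container view.
public final class JstackCapture {
  public static String capture(String pid)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder("jstack", pid)
        .redirectErrorStream(true)   // merge stderr so errors are visible too
        .start();
    StringBuilder out = new StringBuilder();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        out.append(line).append('\n');
      }
    }
    p.waitFor();
    return out.toString();
  }
}
{code}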
[jira] [Updated] (YARN-9979) When an app expires with many containers, the scheduler event queue size will be huge
[ https://issues.apache.org/jira/browse/YARN-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-9979:
---------------------------
    Attachment: YARN-9979.001.patch

> When an app expires with many containers, the scheduler event queue size
> will be huge
> -------------------------------------------------------------------------
>
>                 Key: YARN-9979
>                 URL: https://issues.apache.org/jira/browse/YARN-9979
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-9979.001.patch
>
>
> When an app expires with many containers, the scheduler event queue becomes
> huge:
> {code:java}
> 2019-11-11,21:39:49,690 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 9000
> 2019-11-11,21:39:49,695 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 10000
> 2019-11-11,21:39:49,700 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 11000
> 2019-11-11,21:39:49,705 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 12000
> 2019-11-11,21:39:49,710 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 13000
> 2019-11-11,21:39:49,715 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 14000
> 2019-11-11,21:39:49,720 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Discarded 1 messages due to full event buffer including: Size of scheduler event-queue is 15000
> 2019-11-11,21:39:49,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 16000
> 2019-11-11,21:39:49,729 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 17000
> 2019-11-11,21:39:49,733 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 18000
> 2019-11-11,21:40:14,953 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000
> 2019-11-11,21:43:09,743 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000
> 2019-11-11,21:43:09,750 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 20000
> 2019-11-11,21:43:09,758 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 21000
> 2019-11-11,21:43:09,766 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 22000
> 2019-11-11,21:43:09,775 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 23000
> 2019-11-11,21:43:09,783 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 24000
> 2019-11-11,21:43:09,792 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 25000
> 2019-11-11,21:43:09,800 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 26000
> 2019-11-11,21:43:09,807 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 27000
> 2019-11-11,21:43:09,814 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 28000
> 2019-11-11,21:46:29,830 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 29000
> 2019-11-11,21:46:29,841 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 30000
> 2019-11-11,21:46:29,850 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 31000
> 2019-11-11,21:46:29,862 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 32000
> 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 33000
> 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 34000
> 2019-11-11,21:49:49,876 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 35000
> 2019-11-11,21:49:49,882 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 36000
> 2019-11-11,21:49:49,887 INFO
[jira] [Commented] (YARN-10010) NM log upload costs too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019205#comment-17019205 ]

zhoukang commented on YARN-10010:
---------------------------------

I posted a patch in YARN-10056, [~wilfreds]. I will close this one as a dupe.

> NM log upload costs too much time
> ----------------------------------
>
>                 Key: YARN-10010
>                 URL: https://issues.apache.org/jira/browse/YARN-10010
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, sometimes the log
> uploading service is delayed for some apps, like below:
> !notfound.png!
[jira] [Resolved] (YARN-10010) NM log upload costs too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang resolved YARN-10010.
-----------------------------
    Resolution: Duplicate

> NM log upload costs too much time
> ----------------------------------
>
>                 Key: YARN-10010
>                 URL: https://issues.apache.org/jira/browse/YARN-10010
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, sometimes the log
> uploading service is delayed for some apps, like below:
> !notfound.png!
[jira] [Updated] (YARN-10011) Catch all exceptions during initApp in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10011:
----------------------------
    Attachment: YARN-10011.001.patch

> Catch all exceptions during initApp in LogAggregationService
> -------------------------------------------------------------
>
>                 Key: YARN-10011
>                 URL: https://issues.apache.org/jira/browse/YARN-10011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10011.001.patch
>
>
> We should catch all exceptions during initApp in LogAggregationService to
> prevent the NM from exiting:
> {code:java}
> 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.IllegalStateException
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>         at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019204#comment-17019204 ]

zhoukang commented on YARN-9930:
--------------------------------

Sorry for the late reply; I will post a patch later. Thanks [~sunilg]

> Support max running app logic for CapacityScheduler
> ----------------------------------------------------
>
>                 Key: YARN-9930
>                 URL: https://issues.apache.org/jira/browse/YARN-9930
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In FairScheduler there is a max-running-apps limitation which lets excess
> applications stay pending. But CapacityScheduler has no such
> max-running-app feature; it only has max apps, and excess jobs are rejected
> directly at the client. In this jira I want to implement this semantic for
> CapacityScheduler.
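(A minimal sketch of the requested semantic, with entirely hypothetical names and none of the actual CapacityScheduler wiring: accepted apps above the limit are parked as pending instead of being rejected at the client, and are activated as running apps finish.)

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only: per-queue gate that mimics FairScheduler's maxRunningApps.
public final class MaxRunningAppsGate {
  private final int maxRunningApps;
  private final Queue<String> pending = new ArrayDeque<>();  // app ids waiting
  private int running = 0;

  public MaxRunningAppsGate(int maxRunningApps) {
    this.maxRunningApps = maxRunningApps;
  }

  // Called on submission: activate immediately if below the cap, else pend.
  public synchronized boolean trySubmit(String appId) {
    if (running < maxRunningApps) {
      running++;
      return true;            // activate now
    }
    pending.add(appId);       // pend instead of rejecting at the client
    return false;
  }

  // Called when a running app finishes: activate the oldest pending app.
  public synchronized String onAppFinished() {
    running--;
    String next = pending.poll();
    if (next != null) {
      running++;
    }
    return next;              // app id to activate, or null if none pending
  }
}
{code}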
[jira] [Updated] (YARN-10056) Logservice may encounter NM FGC since the filesystem will only close when the app finishes
[ https://issues.apache.org/jira/browse/YARN-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10056:
----------------------------
    Attachment: YARN-10056.001.patch

> Logservice may encounter NM FGC since the filesystem will only close when
> the app finishes
> --------------------------------------------------------------------------
>
>                 Key: YARN-10056
>                 URL: https://issues.apache.org/jira/browse/YARN-10056
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10056.001.patch
>
>
> Currently, the filesystem is only closed when the app finishes, which may
> cause memory overhead.
[jira] [Updated] (YARN-10056) Logservice may encounter NM FGC since the filesystem will only close when the app finishes
[ https://issues.apache.org/jira/browse/YARN-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10056:
----------------------------
    Description: Currently, the filesystem is only closed when the app finishes, which may cause memory overhead.

> Logservice may encounter NM FGC since the filesystem will only close when
> the app finishes
> --------------------------------------------------------------------------
>
>                 Key: YARN-10056
>                 URL: https://issues.apache.org/jira/browse/YARN-10056
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> Currently, the filesystem is only closed when the app finishes, which may
> cause memory overhead.
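(A minimal sketch of the idea using the standard Hadoop FileSystem API; the helper class is illustrative. The point is to use a private FileSystem instance per upload cycle and close it promptly, instead of caching one per app until the app finishes.)

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: close the FileSystem as soon as an upload cycle completes,
// so long-running apps do not pin client state in the NM heap.
public final class ShortLivedFsUpload {
  public static void upload(Configuration conf, Path src, Path dst)
      throws IOException {
    // newInstance() bypasses the FileSystem cache, so close() here cannot
    // break other users of the shared cached FileSystem object.
    FileSystem fs = FileSystem.newInstance(dst.toUri(), conf);
    try {
      fs.copyFromLocalFile(src, dst);
    } finally {
      fs.close();   // release sockets/buffers now, not at app finish
    }
  }
}
{code}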
[jira] [Updated] (YARN-10060) Historyserver may recover too slowly since JobHistory init is too slow when there exist too many jobs
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang updated YARN-10060:
----------------------------
    Attachment: YARN-10060.001.patch

> Historyserver may recover too slowly since JobHistory init is too slow
> when there exist too many jobs
> -----------------------------------------------------------------------
>
>                 Key: YARN-10060
>                 URL: https://issues.apache.org/jira/browse/YARN-10060
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>         Attachments: YARN-10060.001.patch
>
>
> As shown below, it took more than 7 minutes before the service port was listening:
> {code:java}
> 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs...
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
> 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session
> 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed
> 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init
> 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started
> 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
> 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
> 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined
> 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory
> 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/*
> 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
> 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901
> 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901
> 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server
> 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx
> 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion
[jira] [Commented] (YARN-10060) Historyserver may recover too slow since JobHistory init too slow when there exist too many job
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019193#comment-17019193 ] zhoukang commented on YARN-10060: - I will submit a patch to skip load file older than max history age > Historyserver may recover too slow since JobHistory init too slow when there > exist too many job > --- > > Key: YARN-10060 > URL: https://issues.apache.org/jira/browse/YARN-10060 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Like below it cost >7min to listen to the service port > {code:java} > 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2019-12-24,20:01:47,354 INFO > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing > Jobs... > 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket > connection to server xxx. Will not attempt to authenticate using SASL > (unknown error) > 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to xxx, initiating session > 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, > negotiated timeout = 5000 > 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x66d1a13e596ddc9 closed > 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2019-12-24,20:08:29,655 INFO > org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage > Init > 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: > loaded properties from hadoop-metrics2.properties > 2019-12-24,20:08:29,943 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period > at 10 second(s). 
> 2019-12-24,20:08:29,943 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics > system started > 2019-12-24,20:08:29,950 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Updating the current master key for generating delegation tokens > 2019-12-24,20:08:29,951 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Starting expired delegation token remover thread, > tokenRemoverScanInterval=60 min(s) > 2019-12-24,20:08:29,952 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Updating the current master key for generating delegation tokens > 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http > request log for http.requests.jobhistory is not defined > 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global > filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) > 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter > static_user_filter > (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to > context jobhistory > 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter > static_user_filter > (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to > context static > 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path > spec: /jobhistory/* > 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path > spec: /ws/* > 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound > to port 20901 > 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app > /jobhistory started at 20901 > 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: > Registered webapp guice modules > 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue class java.util.concurrent.LinkedBlockingQueue > 2019-12-24,20:08:31,189 INFO > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding > protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server > 2019-12-24,20:08:31,216 INFO > org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated > HistoryClientService at xxx > 2019-12-24,20:08:31,344 INFO > org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: >
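The fix the comment above proposes — skipping history files older than the retention window during recovery — could look roughly like the sketch below. This is a minimal illustration, not the attached patch: the helper name shouldSkipOnRecovery is hypothetical, while the JHAdminConfig constants are the existing retention settings (mapreduce.jobhistory.max-age-ms and its default) that the history cleaner already uses.
{code:java}
// Minimal sketch (not the attached patch): during initExisting(), skip any
// history file whose modification time is older than the configured max
// history age, since the cleaner thread would delete it shortly anyway.
// shouldSkipOnRecovery is a hypothetical helper name.
private boolean shouldSkipOnRecovery(FileStatus fileStatus, Configuration conf) {
  long maxHistoryAgeMs = conf.getLong(JHAdminConfig.MR_HISTORY_MAX_AGE_MS,
      JHAdminConfig.DEFAULT_MR_HISTORY_MAX_AGE);
  long cutoffMs = System.currentTimeMillis() - maxHistoryAgeMs;
  return fileStatus.getModificationTime() < cutoffMs;
}
{code}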
[jira] [Commented] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019188#comment-17019188 ] zhoukang commented on YARN-10080: - [~abmodi] thanks for the review, I have submitted a new patch to show the container id > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-10080-001.patch, YARN-10080.002.patch > > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
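As a rough sketch of what the patch aims for (the names below are illustrative, not the attached patch itself): wrap each localization task so the pool's worker thread is renamed with the container id for the duration of the task and restored afterwards. A jstack taken mid-localization then shows which container each worker is serving.
{code:java}
// Hypothetical wrapper: temporarily rename the pool thread so a jstack taken
// while localization is running shows which container the thread serves.
class LocalizerWorker implements Runnable {
  private final String containerIdStr;
  private final Runnable delegate;

  LocalizerWorker(String containerIdStr, Runnable delegate) {
    this.containerIdStr = containerIdStr;
    this.delegate = delegate;
  }

  @Override
  public void run() {
    Thread current = Thread.currentThread();
    String oldName = current.getName();
    current.setName(oldName + " for " + containerIdStr);
    try {
      delegate.run();
    } finally {
      current.setName(oldName); // restore so pooled threads stay generic
    }
  }
}
{code}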
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Attachment: YARN-10080.002.patch > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-10080-001.patch, YARN-10080.002.patch > > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Attachment: YARN-10080-001.patch > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-10080-001.patch > > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Summary: Support show app id on localizer thread pool (was: Support show container id on localizer thread pool) > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add container id on the > thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool
[ https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10080: Description: Currently when we are troubleshooting a container localizer issue, if we want to analyze the jstack with thread detail, we can not figure out which thread is processing the given container. So i want to add app id on the thread name (was: Currently when we are troubleshooting a container localizer issue, if we want to analyze the jstack with thread detail, we can not figure out which thread is processing the given container. So i want to add container id on the thread name) > Support show app id on localizer thread pool > > > Key: YARN-10080 > URL: https://issues.apache.org/jira/browse/YARN-10080 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Currently when we are troubleshooting a container localizer issue, if we want > to analyze the jstack with thread detail, we can not figure out which thread > is processing the given container. So i want to add app id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10080) Support show container id on localizer thread pool
zhoukang created YARN-10080: --- Summary: Support show container id on localizer thread pool Key: YARN-10080 URL: https://issues.apache.org/jira/browse/YARN-10080 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhoukang Assignee: zhoukang Currently when we are troubleshooting a container localizer issue, if we want to analyze the jstack with thread detail, we can not figure out which thread is processing the given container. So i want to add container id on the thread name -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10069) Showing jstack on UI for containers
zhoukang created YARN-10069: --- Summary: Showing jstack on UI for containers Key: YARN-10069 URL: https://issues.apache.org/jira/browse/YARN-10069 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhoukang Assignee: zhoukang In this jira, i want to post a patch to support showing jstack on the container ui -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
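The issue text carries no implementation detail; purely as a sketch of one possible building block (all names here are assumptions), the NM could shell out to jstack for the container's root process and render the output on the container page. This requires jstack on the NM's PATH and the container pid being known to the caller.
{code:java}
// Assumption-laden sketch: run `jstack <pid>` for a container's root process
// and return the output as a String for rendering on the container UI page.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class JstackRunner {
  public static String dump(String pid) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("jstack", pid).redirectErrorStream(true).start();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    try (InputStream in = p.getInputStream()) {
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
    p.waitFor();
    return out.toString("UTF-8");
  }
}
{code}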
[jira] [Created] (YARN-10066) Support showing nm version distribution on rm UI
zhoukang created YARN-10066: --- Summary: Support showing nm version distribution on rm UI Key: YARN-10066 URL: https://issues.apache.org/jira/browse/YARN-10066 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhoukang Assignee: zhoukang In this jira, i will post a patch to support showing nm version distribution on rm ui. which is useful for large cluster maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
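A minimal sketch of the aggregation such a page would need, assuming the RM-side code can reach the RMContext; the RM already tracks each registered node's reported NM version, so the distribution is a simple group-by (RMNode#getNodeManagerVersion is assumed to be the accessor for that field):
{code:java}
// Sketch only: group registered nodes by their reported NM version so the
// RM UI can render a version distribution for cluster-upgrade tracking.
Map<String, Integer> nmVersionCounts(RMContext rmContext) {
  Map<String, Integer> counts = new TreeMap<>();
  for (RMNode node : rmContext.getRMNodes().values()) {
    String version = node.getNodeManagerVersion();
    counts.merge(version == null ? "unknown" : version, 1, Integer::sum);
  }
  return counts;
}
{code}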
[jira] [Created] (YARN-10062) Support deploy multiple historyserver in case of sp
zhoukang created YARN-10062: --- Summary: Support deploy multiple historyserver in case of sp Key: YARN-10062 URL: https://issues.apache.org/jira/browse/YARN-10062 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: zhoukang Assignee: zhoukang In this jira, i want to implement a patch to support history ha -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10061) job historyserver old gen may be 100% when too many jobs load history
[ https://issues.apache.org/jira/browse/YARN-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003561#comment-17003561 ] zhoukang commented on YARN-10061: - job below generated 170+ requests, we should add a filter for the same job which is replaying {code:java} /jobhistory/job/job_1576831312050_362973/mapreduce/job/job_1576831312050_362973 {code} come from the same browser {code:java} GET /jobhistory/job/job_1576831312050_362973/mapreduce/job/job_1576831312050_362973 HTTP/1.1..Connection: upgrade..X-Real-IP: 10.232.22.174..X-Forwarded-For: 10.232.22.174..Host: zjy-hadoop-prc-ct11.bj:20901..User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0..Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8..Accept-Language: en-US,en;q=0.5..Accept-Encoding: gzip, deflate..Referer: http://zjy-hadoop-prc-ct11.bj:21001/proxy/application_1576831312050_362973/?proxyapproved=true..Upgrade-Insecure-Requests: 1.. {code} > job historyserver old gen may be 100% when too many jobs load history > - > > Key: YARN-10061 > URL: https://issues.apache.org/jira/browse/YARN-10061 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png, 002.png, 003.png > > > !003.png! > {code:java} > [work@zjy-hadoop-prc-ct11 log]$ jstat -gcutil 26774 > S0 S1 E O M CCSYGC YGCTFGCFGCT GCT > 0.00 99.99 100.00 100.00 98.13 96.35 10999 2786.664 497 989.782 > 3776.446 > {code} > {code:java} > hread 1058215567@qtp-1107509430-6121 > Thread Properties > Object / Stack Frame org.mortbay.thread.QueuedThreadPool$PoolThread @ > 0x7606db678 > Name 1058215567@qtp-1107509430-6121 > Shallow Heap 0.00 MB > Retained Heap 0.17 MB > Context Class Loader jobhistory > Is Daemon true > Total: 6 entries > Thread Stack > 1058215567@qtp-1107509430-6121 > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(Lorg/apache/hadoop/fs/FileStatus;)V > (HistoryFileManager.java:278) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory()V > (HistoryFileManager.java:798) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/hs/HistoryFileManager$HistoryFileInfo; > (HistoryFileManager.java:948) > at > org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; > (CachedHistoryStorage.java:135) > at > org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; > (JobHistory.java:221) > at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.requireJob()V > (AppController.java:382) > at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.job()V > (AppController.java:109) > at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.job()V > (HsController.java:104) > at > sun.reflect.GeneratedMethodAccessor30.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; > (Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; > (DelegatingMethodAccessorImpl.java:43) > at > java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; > (Method.java:498) > at > 
org.apache.hadoop.yarn.webapp.Dispatcher.service(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V > (Dispatcher.java:153) > at > javax.servlet.http.HttpServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V > (HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V > (ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z > (ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z >
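The filter proposed in the comment could be as simple as collapsing concurrent loads of the same job into a single replay. The sketch below is illustrative only — lookupFromCache and doLoadFromHistoryFile are hypothetical stand-ins for the existing cache and history-replay paths:
{code:java}
// Illustrative sketch: allow only one thread at a time to replay a given
// job's history file; concurrent requests for the same job wait for the
// in-flight load instead of each parsing the file and filling the old gen.
private final ConcurrentHashMap<String, CountDownLatch> inFlight =
    new ConcurrentHashMap<>();

Job loadJobOnce(String jobId) throws InterruptedException {
  CountDownLatch latch = new CountDownLatch(1);
  CountDownLatch existing = inFlight.putIfAbsent(jobId, latch);
  if (existing != null) {
    existing.await();              // someone else is replaying this job
    return lookupFromCache(jobId); // hypothetical cache lookup
  }
  try {
    return doLoadFromHistoryFile(jobId); // hypothetical expensive replay
  } finally {
    latch.countDown();
    inFlight.remove(jobId);
  }
}
{code}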
[jira] [Commented] (YARN-10061) job historyserver old gen may be 100% when too many jobs load history
[ https://issues.apache.org/jira/browse/YARN-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003558#comment-17003558 ] zhoukang commented on YARN-10061: - Threads from the heap-dump report and the HTTP request each one is serving (the paste is truncated at the end; "job page" stands for the repeated URI /jobhistory/job/job_1576831312050_362973/mapreduce/job/job_1576831312050_362973):
{code:java}
1058215567@qtp-1107509430-6121   job page
1333459277@qtp-1107509430-6120   job page
582278496@qtp-1107509430-6119    /jobhistory/attempts/job_1576831312050_347354/r/KILLED
1010032540@qtp-1107509430-6118   job page
2057128499@qtp-1107509430-6117   job page
1230320783@qtp-1107509430-6116   job page
787412968@qtp-1107509430-6115    job page
328094070@qtp-1107509430-6114    job page
284896606@qtp-1107509430-6113    job page
1764513565@qtp-1107509430-6112   /jobhistory/attempts/job_1576831312050_347314/m/KILLED
1013350884@qtp-1107509430-6111   job page
2030348115@qtp-1107509430-6110   job page
1609530906@qtp-1107509430-6109   job page
2112512892@qtp-1107509430-6108   job page
1508380482@qtp-1107509430-6107   job page
2066251373@qtp-1107509430-6106   job page
367949850@qtp-1107509430-6105    job page
626277387@qtp-1107509430-6104    job page
515689957@qtp-1107509430-6103    /jobhistory/attempts/job_1576831312050_347325/r/KILLED
2097370166@qtp-1107509430-6102   /jobhistory/attempts/job_1576831312050_347313/m/KILLED
1680793908@qtp-1107509430-6101   job page
1425331186@qtp-1107509430-6100   /jobhistory/attempts/job_1576831312050_349232/r/KILLED
1797324868@qtp-1107509430-6099   (request URI truncated in the original paste)
{code}
[jira] [Created] (YARN-10061) job historyserver old gen may be 100% when too many jobs load history
zhoukang created YARN-10061: --- Summary: job historyserver old gen may be 100% when too many jobs load history Key: YARN-10061 URL: https://issues.apache.org/jira/browse/YARN-10061 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: zhoukang Assignee: zhoukang Attachments: 001.png, 002.png, 003.png !003.png! {code:java} [work@zjy-hadoop-prc-ct11 log]$ jstat -gcutil 26774 S0 S1 E O M CCSYGC YGCTFGCFGCT GCT 0.00 99.99 100.00 100.00 98.13 96.35 10999 2786.664 497 989.782 3776.446 {code} {code:java} hread 1058215567@qtp-1107509430-6121 Thread Properties Object / Stack Frameorg.mortbay.thread.QueuedThreadPool$PoolThread @ 0x7606db678 Name1058215567@qtp-1107509430-6121 Shallow Heap0.00 MB Retained Heap 0.17 MB Context Class Loaderjobhistory Is Daemon true Total: 6 entries Thread Stack 1058215567@qtp-1107509430-6121 at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(Lorg/apache/hadoop/fs/FileStatus;)V (HistoryFileManager.java:278) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory()V (HistoryFileManager.java:798) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/hs/HistoryFileManager$HistoryFileInfo; (HistoryFileManager.java:948) at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; (CachedHistoryStorage.java:135) at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(Lorg/apache/hadoop/mapreduce/v2/api/records/JobId;)Lorg/apache/hadoop/mapreduce/v2/app/job/Job; (JobHistory.java:221) at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.requireJob()V (AppController.java:382) at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.job()V (AppController.java:109) at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.job()V (HsController.java:104) at sun.reflect.GeneratedMethodAccessor30.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:498) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z (ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)Z (ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljavax/servlet/FilterChain;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)V (ServletContainer.java:900) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljavax/servlet/FilterChain;)V (ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V (ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Lcom/google/inject/servlet/FilterChainInvocation;)V (FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V (FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V (ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V (GuiceFilter.java:113) at
[jira] [Updated] (YARN-10060) Historyserver may recover too slow since JobHistory init too slow when there exist too many job
[ https://issues.apache.org/jira/browse/YARN-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10060: Description: Like below it cost >7min to listen to the service port {code:java} 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs... 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error) 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to xxx, initiating session 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server xxx, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/* 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/* 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules 2019-12-24,20:08:31,187 INFO 
org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at xxx 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion started. 2019-12-24,20:08:31,690 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=xxx sessionTimeout=5000 watcher=org {code} {code:java} protected void serviceInit(Configuration conf) throws Exception { LOG.info("JobHistory Init"); this.conf = conf; this.appID = ApplicationId.newInstance(0, 0); this.appAttemptID = RecordFactoryProvider.getRecordFactory(conf) .newRecordInstance(ApplicationAttemptId.class); moveThreadInterval = conf.getLong( JHAdminConfig.MR_HISTORY_MOVE_INTERVAL_MS, JHAdminConfig.DEFAULT_MR_HISTORY_MOVE_INTERVAL_MS); hsManager = createHistoryFileManager(); hsManager.init(conf); try { hsManager.initExisting();
[jira] [Created] (YARN-10060) Historyserver may recover too slow since JobHistory init too slow when there exist too many job
zhoukang created YARN-10060: --- Summary: Historyserver may recover too slow since JobHistory init too slow when there exist too many job Key: YARN-10060 URL: https://issues.apache.org/jira/browse/YARN-10060 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: zhoukang Assignee: zhoukang Like below it cost >7min to listen to the service port {code:java} 2019-12-24,20:01:37,272 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:01:47,354 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Initializing Existing Jobs... 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zjy-hadoop-prc-ct07.bj/10.152.50.2:11000. Will not attempt to authenticate using SASL (unknown error) 2019-12-24,20:08:29,589 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zjy-hadoop-prc-ct07.bj/10.152.50.2:11000, initiating session 2019-12-24,20:08:29,590 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zjy-hadoop-prc-ct07.bj/10.152.50.2:11000, sessionid = 0x66d1a13e596ddc9, negotiated timeout = 5000 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ZooKeeper: Session: 0x66d1a13e596ddc9 closed 2019-12-24,20:08:29,593 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2019-12-24,20:08:29,655 INFO org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage: CachedHistoryStorage Init 2019-12-24,20:08:29,681 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,715 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:29,800 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 
2019-12-24,20:08:29,943 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobHistoryServer metrics system started 2019-12-24,20:08:29,950 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:29,951 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2019-12-24,20:08:29,952 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2019-12-24,20:08:30,015 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.jobhistory is not defined 2019-12-24,20:08:30,025 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context jobhistory 2019-12-24,20:08:30,027 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /jobhistory/* 2019-12-24,20:08:30,030 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/* 2019-12-24,20:08:30,057 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 20901 2019-12-24,20:08:30,939 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 20901 2019-12-24,20:08:31,177 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,187 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2019-12-24,20:08:31,189 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.HSClientProtocolPB to the server 2019-12-24,20:08:31,216 INFO org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated HistoryClientService at zjy-hadoop-prc-ct11.bj/10.152.50.42:20900 2019-12-24,20:08:31,344 INFO org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: aggregated log deletion started. 2019-12-24,20:08:31,690 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zjyprc.observer.zk.hadoop.srv:11000 sessionTimeout=5000 watcher=org {code} {code:java} protected void serviceInit(Configuration conf) throws Exception { LOG.info("JobHistory Init"); this.conf = conf; this.appID =
[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN
[ https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002610#comment-17002610 ] zhoukang commented on YARN-7672: Thanks for the patch [~yufeigu]. Do we have a patch for the metrics? Thanks > hadoop-sls can not simulate huge scale of YARN > -- > > Key: YARN-7672 > URL: https://issues.apache.org/jira/browse/YARN-7672 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhangshilong >Assignee: zhangshilong >Priority: Major > Attachments: YARN-7672.patch > > > Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do > scheduler pressure tests. Using SLS, we start 2000+ threads to simulate NMs > and AMs, but the CPU load rises to 100+, which I thought would affect the > performance evaluation of the scheduler. So I thought to separate the > scheduler from the simulator: I start a real RM, then SLS registers nodes to > the RM and submits apps to the RM using RM RPC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10047) Memory consume of process tree will consider subprocess which may make container exit unexcepted
[ https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002273#comment-17002273 ] zhoukang commented on YARN-10047: - [~wilfreds] thanks for your reply. In some cases the memory accounting includes subprocesses, which yields an incorrect memory usage for the container. > Memory consume of process tree will consider subprocess which may make > container exit unexcepted > - > > Key: YARN-10047 > URL: https://issues.apache.org/jira/browse/YARN-10047 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As below, we have a case where a Spark driver executes some scripts. Then > sometimes the driver will be killed. > {code:java} > yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container > [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is > running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB > physical memory used; xxx. Killing container. > {code} > {code:java} > boolean isProcessTreeOverLimit(String containerId, > long currentMemUsage, > long curMemUsageOfAgedProcesses, > long vmemLimit) { > boolean isOverLimit = false; > > /** > if (currentMemUsage > (2 * vmemLimit)) { > LOG.warn("Process tree for container: " + containerId > + " running over twice " + "the configured limit. Limit=" + > vmemLimit > + ", current usage = " + currentMemUsage); > isOverLimit = true; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
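For context: the snippet quoted in the issue is cut off and contains a stray /** from the paste. The complete check in ContainersMonitorImpl has two branches — kill immediately if the whole process tree exceeds twice the limit, otherwise kill only if the processes aged past one monitoring interval exceed the limit — which is exactly why a memory-hungry subprocess forked by the container can trip the first branch. A paraphrase of the upstream method, with the log messages abbreviated into comments:
{code:java}
// Paraphrase of ContainersMonitorImpl#isProcessTreeOverLimit: the first
// branch fires on the total usage of the whole process tree, including
// freshly forked subprocesses.
boolean isProcessTreeOverLimit(String containerId, long currentMemUsage,
    long curMemUsageOfAgedProcesses, long vmemLimit) {
  boolean isOverLimit = false;
  if (currentMemUsage > (2 * vmemLimit)) {
    // whole tree (all subprocesses included) over twice the limit: kill now
    isOverLimit = true;
  } else if (curMemUsageOfAgedProcesses > vmemLimit) {
    // only processes older than one monitoring interval count in this branch
    isOverLimit = true;
  }
  return isOverLimit;
}
{code}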
[jira] [Updated] (YARN-10047) Memory consume of process tree will consider subprocess which may make container exit unexcepted
[ https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10047: Summary: Memory consume of process tree will consider subprocess which may make container exit unexcepted (was: Process tree will consider memory consume of subprocess which may make container exit unexcepted) > Memory consume of process tree will consider subprocess which may make > container exit unexcepted > - > > Key: YARN-10047 > URL: https://issues.apache.org/jira/browse/YARN-10047 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As below, we have a case which spark driver execute some scripts.Then > sometimes the driver will be killed. > {code:java} > yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container > [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is > running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB > physical memory used; xxx. Killing container. > {code} > {code:java} > boolean isProcessTreeOverLimit(String containerId, > long currentMemUsage, > long curMemUsageOfAgedProcesses, > long vmemLimit) { > boolean isOverLimit = false; > > /** > if (currentMemUsage > (2 * vmemLimit)) { > LOG.warn("Process tree for container: " + containerId > + " running over twice " + "the configured limit. Limit=" + > vmemLimit > + ", current usage = " + currentMemUsage); > isOverLimit = true; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10047) Process tree will consider memory consume of subprocess which may make container exit unexcepted
[ https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10047: Summary: Process tree will consider memory consume of subprocess which may make container exit unexcepted (was: container memory monitor may make container exit) > Process tree will consider memory consume of subprocess which may make > container exit unexcepted > > > Key: YARN-10047 > URL: https://issues.apache.org/jira/browse/YARN-10047 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As below, we have a case which spark driver execute some scripts.Then > sometimes the driver will be killed. > {code:java} > yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container > [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is > running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB > physical memory used; xxx. Killing container. > {code} > {code:java} > boolean isProcessTreeOverLimit(String containerId, > long currentMemUsage, > long curMemUsageOfAgedProcesses, > long vmemLimit) { > boolean isOverLimit = false; > > /** > if (currentMemUsage > (2 * vmemLimit)) { > LOG.warn("Process tree for container: " + containerId > + " running over twice " + "the configured limit. Limit=" + > vmemLimit > + ", current usage = " + currentMemUsage); > isOverLimit = true; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10056) Logservice may encounter NM full GC since filesystem will only close when app finished
zhoukang created YARN-10056: --- Summary: Logservice may encounter NM full GC since filesystem will only close when app finished Key: YARN-10056 URL: https://issues.apache.org/jira/browse/YARN-10056 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhoukang Assignee: zhoukang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
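The jira carries no description, so the following is only a guess at the shape of a fix, under the assumption that the aggregation code currently caches one FileSystem per app until the app finishes: open a dedicated, non-cached instance per upload cycle and close it eagerly. FileSystem.newInstance bypasses the JVM-wide FileSystem cache, which is what makes the close() safe here.
{code:java}
// Assumption-laden sketch: close the per-app FileSystem after each upload
// cycle instead of holding it until the app finishes, so idle long-running
// apps do not pin client-side buffers in the NM heap.
void uploadCycle(Configuration conf, Path remoteAppLogDir) throws IOException {
  try (FileSystem fs = FileSystem.newInstance(remoteAppLogDir.toUri(), conf)) {
    // ... write this cycle's aggregated log file under remoteAppLogDir ...
  } // closed here; the shared FileSystem cache is untouched
}
{code}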
[jira] [Created] (YARN-10047) container memory monitor may make container exit
zhoukang created YARN-10047: --- Summary: container memory monitor may make container exit Key: YARN-10047 URL: https://issues.apache.org/jira/browse/YARN-10047 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhoukang Assignee: zhoukang As below, we have a case which spark driver execute some scripts.Then sometimes the driver will be killed. {code:java} yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB physical memory used; xxx. Killing container. {code} {code:java} boolean isProcessTreeOverLimit(String containerId, long currentMemUsage, long curMemUsageOfAgedProcesses, long vmemLimit) { boolean isOverLimit = false; /** if (currentMemUsage > (2 * vmemLimit)) { LOG.warn("Process tree for container: " + containerId + " running over twice " + "the configured limit. Limit=" + vmemLimit + ", current usage = " + currentMemUsage); isOverLimit = true; } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10011) Catch all exception during init app in LogAggregationService
[ https://issues.apache.org/jira/browse/YARN-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10011: Component/s: nodemanager > Catch all exception during init app in LogAggregationService > -- > > Key: YARN-10011 > URL: https://issues.apache.org/jira/browse/YARN-10011 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > we should catch all exception during init app in LogAggregationService in > case of nm exit > {code:java} > 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
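The shape of the guard the summary asks for, sketched with a hypothetical failApp helper — the point is that a Throwable thrown while initializing one app's aggregation must not escape into the AsyncDispatcher thread, which is what brings the NM down in the quoted trace:
{code:java}
// Sketch only: contain per-app init failures. failApp(...) is hypothetical;
// the existing init steps are elided as comments.
private void initApp(ApplicationEvent event) {
  try {
    // existing logic: verifyAndCreateRemoteLogDir(), create the app-level
    // remote dir, start the per-app log aggregator ...
  } catch (Throwable t) {
    LOG.error("Failed to initialize log aggregation for "
        + event.getApplicationID(), t);
    failApp(event.getApplicationID(), t); // hypothetical: fail this app only
  }
}
{code}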
[jira] [Created] (YARN-10011) Catch all exception during init app in LogAggregationService
zhoukang created YARN-10011: --- Summary: Catch all exception during init app in LogAggregationService Key: YARN-10011 URL: https://issues.apache.org/jira/browse/YARN-10011 Project: Hadoop YARN Issue Type: Bug Reporter: zhoukang Assignee: zhoukang we should catch all exception during init app in LogAggregationService in case of nm exit {code:java} 2019-06-12,09:36:03,652 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.IllegalStateException at com.google.common.base.Preconditions.checkState(Preconditions.java:129) at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2115) at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1300) at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1296) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1312) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:193) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:116) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10010) NM upload log cost too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10010: Attachment: (was: 选区_002.png) > NM upload log cost too much time > > > Key: YARN-10010 > URL: https://issues.apache.org/jira/browse/YARN-10010 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: notfound.png > > > Since thread pool size of log service is 100. > Some times the log uploading service will delay for some apps.like below > !选区_002.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10010) NM upload log cost too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10010: Description: Since thread pool size of log service is 100. Some times the log uploading service will delay for some apps.like below !notfound.png! was: Since thread pool size of log service is 100. Some times the log uploading service will delay for some apps.like below !选区_002.png! > NM upload log cost too much time > > > Key: YARN-10010 > URL: https://issues.apache.org/jira/browse/YARN-10010 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: notfound.png > > > Since thread pool size of log service is 100. > Some times the log uploading service will delay for some apps.like below > !notfound.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10010) NM upload log cost too much time
zhoukang created YARN-10010: --- Summary: NM upload log cost too much time Key: YARN-10010 URL: https://issues.apache.org/jira/browse/YARN-10010 Project: Hadoop YARN Issue Type: Improvement Reporter: zhoukang Assignee: zhoukang Attachments: notfound.png Since thread pool size of log service is 100. Some times the log uploading service will delay for some apps.like below !选区_002.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10010) NM upload log cost too much time
[ https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-10010: Attachment: notfound.png > NM upload log cost too much time > > > Key: YARN-10010 > URL: https://issues.apache.org/jira/browse/YARN-10010 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: notfound.png > > > Since thread pool size of log service is 100. > Some times the log uploading service will delay for some apps.like below > !notfound.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8364) NM aggregation thread should be able to exempt pool
[ https://issues.apache.org/jira/browse/YARN-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986613#comment-16986613 ] zhoukang commented on YARN-8364: I will work on this > NM aggregation thread should be able to exempt pool > --- > > Key: YARN-8364 > URL: https://issues.apache.org/jira/browse/YARN-8364 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Reporter: Oleksandr Shevchenko >Priority: Major > > For now, we have limited NM aggregation thread pool that can be configured by > the property yarn.nodemanager.logaggregation.threadpool-size-max=100. > When some application is starting it use one unit of the pool. And locks this > unit until the application is finished. As the result, another application > can aggregate their logs only when the previous application is finished. > Just for example: > yarn.nodemanager.logaggregation.threadpool-size-max=1 > 1. Start long-running application app1 > 2. Start short application app2 > 3. Finished app2 > 4. Finished app1 > 5. Aggregating logs of app1 > 6. Aggregating logs of app2 > In the real cluster, we can have many long running jobs (for example Spark > streaming), therefore short-running application do not aggregate their logs a > long time. It problem appears if the average number of jobs exceeds thread > pool size. All threads occupied by some applications, as the result we have > the huge delay between application finishing and logs uploading. > Will be good if we improve this behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
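One way to realize the "exempt pool" described above, sketched with assumed pool sizes and an assumed way of tagging long-running apps: give such apps their own executor so they cannot occupy all the slots of the pool bounded by yarn.nodemanager.logaggregation.threadpool-size-max that short jobs depend on.
{code:java}
// Sketch: two executors so long-running apps cannot starve short ones.
// Pool sizes and the longRunning flag are assumptions, not actual config.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AggregatorPools {
  private final ExecutorService defaultPool = Executors.newFixedThreadPool(100);
  private final ExecutorService exemptPool = Executors.newFixedThreadPool(20);

  void schedule(Runnable appLogAggregator, boolean longRunning) {
    (longRunning ? exemptPool : defaultPool).execute(appLogAggregator);
  }
}
{code}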
[jira] [Comment Edited] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974020#comment-16974020 ] zhoukang edited comment on YARN-9605 at 11/14/19 8:33 AM: -- Sorry for bother [~tangzhankun][~prabhujoseph] but i really can not figure out the cause of warning below during 'cc phase': I think the patch i post has no relation with hdfs? {code:java} WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/IpcConnectionContext.pb.cc:129:13: warning: 'dynamic_init_dummy_IpcConnectionContext_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/HAServiceProtocol.pb.cc:404:13: warning: 'dynamic_init_dummy_HAServiceProtocol_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/Security.pb.cc:349:13: warning: 'dynamic_init_dummy_Security_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/acl.pb.cc:533:13: warning: 'dynamic_init_dummy_acl_2eproto' defined but not used [-Wunused-variable] {code} was (Author: cane): Sorry for bother [~tangzhankun][~prabhujoseph] but i really can not figure out the cause of warning below during 'cc phase': {code:java} WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/IpcConnectionContext.pb.cc:129:13: warning: 'dynamic_init_dummy_IpcConnectionContext_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/HAServiceProtocol.pb.cc:404:13: warning: 'dynamic_init_dummy_HAServiceProtocol_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/Security.pb.cc:349:13: warning: 'dynamic_init_dummy_Security_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/acl.pb.cc:533:13: warning: 'dynamic_init_dummy_acl_2eproto' defined but not used [-Wunused-variable] {code} > Add ZkConfiguredFailoverProxyProvider for RM HA > --- > > Key: YARN-9605 > URL: https://issues.apache.org/jira/browse/YARN-9605 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-9605.001.patch, YARN-9605.002.patch, > YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, > YARN-9605.006.patch > > > In this issue, i will track a new feature to support > ZkConfiguredFailoverProxyProvider for RM HA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974020#comment-16974020 ] zhoukang commented on YARN-9605: Sorry to bother you [~tangzhankun][~prabhujoseph], but I really cannot figure out the cause of the warnings below during the 'cc phase': {code:java} [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/IpcConnectionContext.pb.cc:129:13: warning: 'dynamic_init_dummy_IpcConnectionContext_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/HAServiceProtocol.pb.cc:404:13: warning: 'dynamic_init_dummy_HAServiceProtocol_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/Security.pb.cc:349:13: warning: 'dynamic_init_dummy_Security_2eproto' defined but not used [-Wunused-variable] [WARNING] /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/libhdfspp/lib/proto/acl.pb.cc:533:13: warning: 'dynamic_init_dummy_acl_2eproto' defined but not used [-Wunused-variable] {code} > Add ZkConfiguredFailoverProxyProvider for RM HA > --- > > Key: YARN-9605 > URL: https://issues.apache.org/jira/browse/YARN-9605 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-9605.001.patch, YARN-9605.002.patch, > YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, > YARN-9605.006.patch > > > In this issue, I will track a new feature to support > ZkConfiguredFailoverProxyProvider for RM HA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9979) When an app expires with many containers, the scheduler event queue size will be huge
[ https://issues.apache.org/jira/browse/YARN-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973932#comment-16973932 ] zhoukang commented on YARN-9979: I think we can add throttling logic to ContainerAllocationExpirer > When an app expires with many containers, the scheduler event queue size will be huge > --- > > Key: YARN-9979 > URL: https://issues.apache.org/jira/browse/YARN-9979 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > When an app expires with many containers, the scheduler event queue grows > huge. > {code:java} > 2019-11-11,21:39:49,690 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 9000 > 2019-11-11,21:39:49,695 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 10000 > 2019-11-11,21:39:49,700 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 11000 > 2019-11-11,21:39:49,705 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 12000 > 2019-11-11,21:39:49,710 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 13000 > 2019-11-11,21:39:49,715 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 14000 > 2019-11-11,21:39:49,720 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Discarded 1 > messages due to full event buffer including: Size of scheduler event-queue is > 15000 > 2019-11-11,21:39:49,724 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 16000 > 2019-11-11,21:39:49,729 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 17000 > 2019-11-11,21:39:49,733 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 18000 > 2019-11-11,21:40:14,953 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 19000 > 2019-11-11,21:43:09,743 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 19000 > 2019-11-11,21:43:09,750 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 20000 > 2019-11-11,21:43:09,758 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 21000 > 2019-11-11,21:43:09,766 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 22000 > 2019-11-11,21:43:09,775 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 23000 > 2019-11-11,21:43:09,783 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 24000 > 2019-11-11,21:43:09,792 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 25000 > 2019-11-11,21:43:09,800 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 26000 > 2019-11-11,21:43:09,807 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 27000 > 2019-11-11,21:43:09,814 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 28000 > 2019-11-11,21:46:29,830 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 29000 > 2019-11-11,21:46:29,841 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 30000 > 2019-11-11,21:46:29,850 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 31000 > 2019-11-11,21:46:29,862 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 32000 > 2019-11-11,21:49:49,875 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 33000 > 2019-11-11,21:49:49,875 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 34000 > 2019-11-11,21:49:49,876 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 35000 > 2019-11-11,21:49:49,882 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of > scheduler event-queue is 36000 > 2019-11-11,21:49:49,887 INFO >
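A minimal sketch of the throttling idea proposed in the comment above, under stated assumptions (this is a hypothetical helper, not existing YARN code): instead of dispatching one CONTAINER_EXPIRED scheduler event per container as soon as an app expires, expired container ids are queued and drained in bounded batches, so a single expiring app cannot flood the scheduler event queue.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical throttler: drains expired container ids in bounded batches. */
public class ExpiryThrottler {
  private final BlockingQueue<String> expired = new LinkedBlockingQueue<>();
  private final int maxBatch;
  private final long pauseMs;

  public ExpiryThrottler(int maxBatch, long pauseMs) {
    this.maxBatch = maxBatch;
    this.pauseMs = pauseMs;
  }

  /** Called by the expiry checker for every expired container. */
  public void onExpired(String containerId) {
    expired.add(containerId);
  }

  /** Runs on a dedicated thread; emits at most maxBatch events per pause. */
  public void drain(Consumer<String> emit) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      List<String> batch = new ArrayList<>(maxBatch);
      batch.add(expired.take());            // block until at least one item
      expired.drainTo(batch, maxBatch - 1); // take up to maxBatch in total
      batch.forEach(emit);                  // e.g. dispatch CONTAINER_EXPIRED events
      Thread.sleep(pauseMs);                // throttle the dispatch rate
    }
  }
}
{code}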
[jira] [Created] (YARN-9979) When an app expires with many containers, the scheduler event queue size will be huge
zhoukang created YARN-9979: -- Summary: When an app expires with many containers, the scheduler event queue size will be huge Key: YARN-9979 URL: https://issues.apache.org/jira/browse/YARN-9979 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Reporter: zhoukang Assignee: zhoukang When an app expires with many containers, the scheduler event queue grows huge. {code:java} 2019-11-11,21:39:49,690 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 9000 2019-11-11,21:39:49,695 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 10000 2019-11-11,21:39:49,700 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 11000 2019-11-11,21:39:49,705 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 12000 2019-11-11,21:39:49,710 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 13000 2019-11-11,21:39:49,715 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 14000 2019-11-11,21:39:49,720 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Discarded 1 messages due to full event buffer including: Size of scheduler event-queue is 15000 2019-11-11,21:39:49,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 16000 2019-11-11,21:39:49,729 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 17000 2019-11-11,21:39:49,733 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 18000 2019-11-11,21:40:14,953 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000 2019-11-11,21:43:09,743 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 19000 2019-11-11,21:43:09,750 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 20000 2019-11-11,21:43:09,758 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 21000 2019-11-11,21:43:09,766 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 22000 2019-11-11,21:43:09,775 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 23000 2019-11-11,21:43:09,783 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 24000 2019-11-11,21:43:09,792 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 25000 2019-11-11,21:43:09,800 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 26000 2019-11-11,21:43:09,807 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 27000 2019-11-11,21:43:09,814 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 28000 2019-11-11,21:46:29,830 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 29000 2019-11-11,21:46:29,841 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 30000 2019-11-11,21:46:29,850 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 31000 2019-11-11,21:46:29,862 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 32000 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 33000 2019-11-11,21:49:49,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 34000 2019-11-11,21:49:49,876 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 35000 2019-11-11,21:49:49,882 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 36000 2019-11-11,21:49:49,887 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 37000 2019-11-11,21:49:49,891 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 38000 2019-11-11,21:49:49,896 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 39000 2019-11-11,21:49:49,900 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Size of scheduler event-queue is 40000
[jira] [Updated] (YARN-9709) When expanding the queue list, the scheduler page will not show any applications
[ https://issues.apache.org/jira/browse/YARN-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9709: --- Attachment: YARN-9709.001.patch > When expanding the queue list, the scheduler page will not show any applications > -- > > Key: YARN-9709 > URL: https://issues.apache.org/jira/browse/YARN-9709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9709.001.patch, list1.png, list3.png > > > When expanding the queue list, the scheduler page will not show any > applications. But it works well in FairScheduler. > !list1.png! > !list3.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: YARN-9978.001.patch > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png, YARN-9978.001.patch > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: (was: YARN-9978.001.patch) > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: YARN-9978.001.patch > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png, YARN-9978.001.patch > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Component/s: capacityscheduler capacity scheduler > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: 001.png > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: (was: 001.png) > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Description: Support showing the submit ACL and admin ACL on the UI !001.png! was: Support showing the submit ACL and admin ACL on the UI > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI > !001.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9978) Support showing ACLs on CapacityScheduler page
zhoukang created YARN-9978: -- Summary: Support showing ACLs on CapacityScheduler page Key: YARN-9978 URL: https://issues.apache.org/jira/browse/YARN-9978 Project: Hadoop YARN Issue Type: Improvement Reporter: zhoukang Assignee: zhoukang Attachments: 001.png Support showing the submit ACL and admin ACL on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9978) Support showing ACLs on CapacityScheduler page
[ https://issues.apache.org/jira/browse/YARN-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9978: --- Attachment: 001.png > Support showing ACLs on CapacityScheduler page > - > > Key: YARN-9978 > URL: https://issues.apache.org/jira/browse/YARN-9978 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: 001.png > > > Support showing the submit ACL and admin ACL on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA
[ https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9605: --- Attachment: YARN-9605.006.patch > Add ZkConfiguredFailoverProxyProvider for RM HA > --- > > Key: YARN-9605 > URL: https://issues.apache.org/jira/browse/YARN-9605 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-9605.001.patch, YARN-9605.002.patch, > YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, > YARN-9605.006.patch > > > In this issue, I will track a new feature to support > ZkConfiguredFailoverProxyProvider for RM HA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9977) Support monitoring thread count in ContainersMonitorImpl
zhoukang created YARN-9977: -- Summary: Support monitoring thread count in ContainersMonitorImpl Key: YARN-9977 URL: https://issues.apache.org/jira/browse/YARN-9977 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: zhoukang Assignee: zhoukang In this JIRA, we want to add a feature to monitor the thread count of a given container. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
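A minimal sketch of one way such monitoring could work on Linux, assuming the container's root process pid is already known (hypothetical helper, not the actual ContainersMonitorImpl change): read the Threads field from /proc/<pid>/status.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ThreadCounter {
  /** Returns the thread count of a process on Linux, or -1 if unknown. */
  public static int threadCount(int pid) {
    try {
      for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/status"))) {
        if (line.startsWith("Threads:")) {
          return Integer.parseInt(line.substring("Threads:".length()).trim());
        }
      }
    } catch (IOException | NumberFormatException e) {
      // the process may have exited, or the field may be malformed
    }
    return -1;
  }
}
{code}

A periodic monitor could sample this value for each tracked container pid and expose it as a per-container metric, analogous to how memory usage is sampled today.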
[jira] [Created] (YARN-9976) Application rejected by CapacityScheduler cannot be searched on the UI
zhoukang created YARN-9976: -- Summary: Application rejected by CapacityScheduler cannot be searched on the UI Key: YARN-9976 URL: https://issues.apache.org/jira/browse/YARN-9976 Project: Hadoop YARN Issue Type: Bug Reporter: zhoukang In JIRA https://issues.apache.org/jira/browse/YARN-4522 the submission ACL check is done at RMAppManager. But this means users cannot find their rejected apps on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9976) Application rejected by CapacityScheduler cannot be searched on the UI
[ https://issues.apache.org/jira/browse/YARN-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reassigned YARN-9976: -- Assignee: zhoukang > Application rejected by CapacityScheduler cannot be searched on the UI > --- > > Key: YARN-9976 > URL: https://issues.apache.org/jira/browse/YARN-9976 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > In JIRA https://issues.apache.org/jira/browse/YARN-4522 the submission ACL check > is done at RMAppManager. But this means users cannot find their rejected apps > on the UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9975) Support proxy acl user for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9975: --- Parent: YARN-9698 Issue Type: Sub-task (was: Improvement) > Support proxy acl user for CapacityScheduler > > > Key: YARN-9975 > URL: https://issues.apache.org/jira/browse/YARN-9975 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As commented in https://issues.apache.org/jira/browse/YARN-9698, > I will open a new JIRA for the proxy user feature. > The background is that we have long-running SQL thriftservers for many users: > {quote}{{user->sql proxy-> sql thriftserver}}{quote} > But we do not have keytabs for all users on the 'sql proxy'. We just use a super > user like 'sql_prc' to submit the 'sql thriftserver' application. To support > this, we should change the scheduler to support a proxy user ACL -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9975) Support proxy acl user for CapacityScheduler
zhoukang created YARN-9975: -- Summary: Support proxy acl user for CapacityScheduler Key: YARN-9975 URL: https://issues.apache.org/jira/browse/YARN-9975 Project: Hadoop YARN Issue Type: Improvement Reporter: zhoukang Assignee: zhoukang As commented in https://issues.apache.org/jira/browse/YARN-9698, I will open a new JIRA for the proxy user feature. The background is that we have long-running SQL thriftservers for many users: {quote}{{user->sql proxy-> sql thriftserver}}{quote} But we do not have keytabs for all users on the 'sql proxy'. We just use a super user like 'sql_prc' to submit the 'sql thriftserver' application. To support this, we should change the scheduler to support a proxy user ACL -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
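A minimal sketch of the proxy-user semantics being requested, under stated assumptions (the class, the super-user set, and the user names are hypothetical, not existing CapacityScheduler API): when the submitter is a configured super user such as 'sql_prc', the queue's submit ACL is also checked against the real, proxied user.

{code:java}
import java.util.Set;

/** Hypothetical ACL check: a super user may submit on behalf of proxied users. */
public class ProxyAclChecker {
  private final Set<String> superUsers;     // e.g. {"sql_prc"}
  private final Set<String> queueSubmitAcl; // users allowed to submit to the queue

  public ProxyAclChecker(Set<String> superUsers, Set<String> queueSubmitAcl) {
    this.superUsers = superUsers;
    this.queueSubmitAcl = queueSubmitAcl;
  }

  public boolean canSubmit(String submitter, String proxiedUser) {
    if (queueSubmitAcl.contains(submitter)) {
      return true; // normal path: the submitter itself is in the ACL
    }
    // proxy path: a super user submits for a user who is in the ACL
    return superUsers.contains(submitter) && queueSubmitAcl.contains(proxiedUser);
  }
}
{code}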
[jira] [Assigned] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reassigned YARN-7621: -- Assignee: zhoukang (was: Tao Yang) > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: zhoukang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path while CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler, to make the interface clearer > and the scheduler switch smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973888#comment-16973888 ] zhoukang commented on YARN-7621: [~Tao Yang] I will work on this, thanks! > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path while CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler, to make the interface clearer > and the scheduler switch smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
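To make the queue-name vs queue-path difference concrete, here is a hedged illustration (ApplicationSubmissionContext and Records are standard YARN client API; the queue layout "root.engineering.etl" is made up for the example):

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class QueueNamingExample {
  public static void main(String[] args) {
    ApplicationSubmissionContext ctx =
        Records.newRecord(ApplicationSubmissionContext.class);

    // FairScheduler identifies queues by their full hierarchical path.
    ctx.setQueue("root.engineering.etl");

    // CapacityScheduler (without this change) expects only the leaf name,
    // which is unambiguous only because duplicate leaf names are disallowed.
    ctx.setQueue("etl");

    System.out.println("queue = " + ctx.getQueue());
  }
}
{code}

Accepting the full path form in CapacityScheduler would let the same submission code work unchanged across both schedulers.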
[jira] [Comment Edited] (YARN-9693) When AMRMProxyService is enabled, RMCommunicator fails to register
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973877#comment-16973877 ] zhoukang edited comment on YARN-9693 at 11/14/19 2:44 AM: -- An initial patch has been posted; I will optimize it. [~subru] Could you help review this idea? [~botong][~giovanni.fumarola] Thanks was (Author: cane): An initial patch has been posted; I will optimize it. [~subru] Could you help review this idea? > When AMRMProxyService is enabled, RMCommunicator fails to register > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable the AMRM proxy service, RMCommunicator fails to register with the > error below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at >
[jira] [Commented] (YARN-9693) When AMRMProxyService is enabled, RMCommunicator fails to register
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973877#comment-16973877 ] zhoukang commented on YARN-9693: An initial patch has been posted; I will optimize it. [~subru] Could you help review this idea? > When AMRMProxyService is enabled, RMCommunicator fails to register > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable the AMRM proxy service, RMCommunicator fails to register with the > error below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170) > ... 14 more > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >
[jira] [Resolved] (YARN-9974) Large diagnostics may cause RM recovery to fail
[ https://issues.apache.org/jira/browse/YARN-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang resolved YARN-9974. Resolution: Duplicate > Large diagnostics may cause RM recovery to fail > - > > Key: YARN-9974 > URL: https://issues.apache.org/jira/browse/YARN-9974 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Critical > > {code:java} > 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply > sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null > finished:false header:: 1659,4 replyHeader:: 1659,27117069873,0 request:: > '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F > response:: >
[jira] [Assigned] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reassigned YARN-9973: -- Assignee: zhoukang > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9973.001.patch > > > When we get the exception below, the scanning thread in the job history server > will exit; we should catch the RuntimeException > {code:java} > xxx 2019-06-30,17:45:52,386 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: > Fail to get initial active Namenode information java.lang.RuntimeException: > Fail to get active namenode from zookeeper > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) > at $Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) > at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) > at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9974) Large diagnostics may cause RM recovery to fail
[ https://issues.apache.org/jira/browse/YARN-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973331#comment-16973331 ] zhoukang commented on YARN-9974: I will post a patch later > Large diagnostics may cause RM recovery to fail > - > > Key: YARN-9974 > URL: https://issues.apache.org/jira/browse/YARN-9974 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhoukang >Assignee: zhoukang >Priority: Critical > > {code:java} > 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply > sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null > finished:false header:: 1659,4 replyHeader:: 1659,27117069873,0 request:: > '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F > response:: >
[jira] [Created] (YARN-9974) Large diagnostics may cause RM recovery to fail
zhoukang created YARN-9974: -- Summary: Large diagnostics may cause RM recovery to fail Key: YARN-9974 URL: https://issues.apache.org/jira/browse/YARN-9974 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhoukang Assignee: zhoukang {code:java} 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null finished:false header:: 1659,4 replyHeader:: 1659,27117069873,0 request:: '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F response::
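A minimal sketch of one plausible fix for this class of problem, under stated assumptions (the class and the limit are hypothetical, not the actual patch): cap the diagnostics string before persisting application state, so one huge diagnostics blob cannot push the znode past ZooKeeper's size limit and break store or recovery.

{code:java}
/** Hypothetical guard: cap diagnostics before writing app state to the RM state store. */
public final class DiagnosticsLimiter {
  // ZooKeeper's default jute.maxbuffer is about 1 MB; stay well under it.
  private static final int MAX_DIAG_CHARS = 64 * 1024;

  public static String truncate(String diagnostics) {
    if (diagnostics == null || diagnostics.length() <= MAX_DIAG_CHARS) {
      return diagnostics;
    }
    // Keep the head of the message, which usually carries the root cause.
    return diagnostics.substring(0, MAX_DIAG_CHARS)
        + "\n...[diagnostics truncated before persisting to the state store]";
  }
}
{code}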
[jira] [Updated] (YARN-9693) When AMRMProxyService is enabled, RMCommunicator fails to register
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9693: --- Attachment: YARN-9693.001.patch > When AMRMProxyService is enabled, RMCommunicator fails to register > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable the AMRM proxy service, RMCommunicator fails to register with the > error below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170) > ... 14 more > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at
[jira] [Updated] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9973: --- Description: When we get the exception below, the scanning thread in the job history server will exit; we should catch the RuntimeException {code:java} xxx 2019-06-30,17:45:52,386 ERROR org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: Fail to get initial active Namenode information java.lang.RuntimeException: Fail to get active namenode from zookeeper at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) at $Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) {code} was: When we get the exception below, the cleaner thread will exit; we should catch the RuntimeException {code:java} xxx 2019-06-30,17:45:52,386 ERROR org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: Fail to get initial active Namenode information java.lang.RuntimeException: Fail to get active namenode from zookeeper at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) at $Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) {code} > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Priority: Major > Attachments: YARN-9973.001.patch > > > When we get the exception below, the scanning thread in the job history server will exit; we should > catch the RuntimeException >
[jira] [Updated] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9973: --- Attachment: YARN-9973.001.patch > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Priority: Major > Attachments: YARN-9973.001.patch > > > When we get the exception below, the cleaner thread will exit; we should catch > the RuntimeException > {code:java} > xxx 2019-06-30,17:45:52,386 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: > Fail to get initial active Namenode information java.lang.RuntimeException: > Fail to get active namenode from zookeeper > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) > at $Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) > at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) > at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9973) Catch RuntimeException in yarn historyserver
[ https://issues.apache.org/jira/browse/YARN-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated YARN-9973: --- Component/s: yarn > Catch RuntimeException in yarn historyserver > - > > Key: YARN-9973 > URL: https://issues.apache.org/jira/browse/YARN-9973 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: zhoukang >Priority: Major > > When we get the exception below, the cleaner thread will exit; we should catch > the RuntimeException > {code:java} > xxx 2019-06-30,17:45:52,386 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: > Fail to get initial active Namenode information java.lang.RuntimeException: > Fail to get active namenode from zookeeper > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) > at > org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) > at $Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) > at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) > at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) > at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) > at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) > at > org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9973) Catch RuntimeException in yarn historyserver
zhoukang created YARN-9973: -- Summary: Catch RuntimeException in yarn historyserver Key: YARN-9973 URL: https://issues.apache.org/jira/browse/YARN-9973 Project: Hadoop YARN Issue Type: Bug Reporter: zhoukang When we get the exception below, the cleaner thread will exit; we should catch the RuntimeException {code:java} xxx 2019-06-30,17:45:52,386 ERROR org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider: Fail to get initial active Namenode information java.lang.RuntimeException: Fail to get active namenode from zookeeper at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.getActiveNNIndex(ZkConfiguredFailoverProxyProvider.java:149) at org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider.performFailover(ZkConfiguredFailoverProxyProvider.java:176) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) at $Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1996) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211) at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:198) at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:180) at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:180) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1445) at org.apache.hadoop.fs.FileContext$21.next(FileContext.java:1440) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1440) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:739) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:752) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:806) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
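A minimal sketch of the proposed fix, assuming the scan-loop shape implied by the stack trace (a hypothetical simplification, not the actual HistoryFileManager patch): wrap each per-directory scan so a RuntimeException from a transient failover error is logged and skipped instead of killing the scanning thread.

{code:java}
import java.util.List;
import java.util.function.Consumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical wrapper: keep the history scan loop alive across bad directories. */
public class SafeHistoryScanner {
  private static final Logger LOG = LoggerFactory.getLogger(SafeHistoryScanner.class);

  public void scanAll(List<String> userLogDirs, Consumer<String> scanOneDir) {
    for (String dir : userLogDirs) {
      try {
        scanOneDir.accept(dir);
      } catch (RuntimeException e) {
        // e.g. "Fail to get active namenode from zookeeper": log and move on
        // instead of letting the exception propagate and kill the thread.
        LOG.error("Error while scanning intermediate dir " + dir, e);
      }
    }
  }
}
{code}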
[jira] [Assigned] (YARN-9972) Do not kill am container when node is unhealthy
[ https://issues.apache.org/jira/browse/YARN-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang reassigned YARN-9972:
------------------------------
    Assignee: zhoukang

> Do not kill am container when node is unhealthy
> ------------------------------------------------
>
>                 Key: YARN-9972
>                 URL: https://issues.apache.org/jira/browse/YARN-9972
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In this patch, we want to add a configuration that disables killing the AM container when its node becomes unhealthy, since killing it causes some applications to exit with failure.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9972) Do not kill am container when node is unhealthy
zhoukang created YARN-9972:
------------------------------
             Summary: Do not kill am container when node is unhealthy
                 Key: YARN-9972
                 URL: https://issues.apache.org/jira/browse/YARN-9972
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
            Reporter: zhoukang

In this patch, we want to add a configuration that disables killing the AM container when its node becomes unhealthy, since killing it causes some applications to exit with failure.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
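A rough sketch of what such a switch could look like. The property name below is purely hypothetical (no such key exists in yarn-site.xml), and the helper class only illustrates the intended branch, not actual ResourceManager code.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical policy object consulted when a node is reported unhealthy:
// with the switch enabled, AM containers are spared so the application
// does not fail along with the node.
public class UnhealthyNodePolicy {
  // Hypothetical key; not an existing YARN configuration property.
  public static final String KEEP_AM_ON_UNHEALTHY_NODE =
      "yarn.resourcemanager.keep-am-container-on-unhealthy-node";

  private final boolean keepAm;

  public UnhealthyNodePolicy(Configuration conf) {
    this.keepAm = conf.getBoolean(KEEP_AM_ON_UNHEALTHY_NODE, false);
  }

  /** Whether a container should be killed when its node turns unhealthy. */
  public boolean shouldKill(boolean isAmContainer) {
    // Default (false) preserves today's behavior: kill every container.
    return !(keepAm && isAmContainer);
  }
}
{code}
Defaulting the switch to false keeps the current semantics, so clusters that rely on unhealthy nodes being fully drained are unaffected.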
[jira] [Commented] (YARN-9931) Support run script before kill container
[ https://issues.apache.org/jira/browse/YARN-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973282#comment-16973282 ]

zhoukang commented on YARN-9931:
--------------------------------
The background is that our production cluster runs many applications and frameworks. We often hit the problem that a container was killed and we had no information about what it was doing at the time. Asking users to add a shutdown hook may be unfriendly to them, while adding this feature in YARN would make troubleshooting more efficient. [~epayne] Thanks!

> Support run script before kill container
> -----------------------------------------
>
>                 Key: YARN-9931
>                 URL: https://issues.apache.org/jira/browse/YARN-9931
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> Like the node health check script, we can add a pre-kill script which runs before a container is killed. For example, we can save a thread dump before killing the container, which is helpful for troubleshooting.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
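To illustrate the idea, a minimal sketch of a pre-kill hook, assuming the script path comes from a node-manager configuration key analogous to the health check script's; the class name, parameters, and timeout below are all hypothetical.
{code:java}
import java.io.File;
import java.util.concurrent.TimeUnit;

// Hypothetical pre-kill hook: run an admin-supplied script (for example,
// one that takes a jstack of the container's JVM) just before the kill
// signal is delivered.
public class PreKillHook {
  private final String scriptPath; // from a (hypothetical) NM config key

  public PreKillHook(String scriptPath) {
    this.scriptPath = scriptPath;
  }

  public void runBeforeKill(String containerId, String pid) {
    if (scriptPath == null || !new File(scriptPath).canExecute()) {
      return; // hook not configured: proceed straight to the normal kill
    }
    try {
      Process p = new ProcessBuilder(scriptPath, containerId, pid)
          .inheritIO()
          .start();
      // Bound the hook so a hung script cannot delay the kill indefinitely.
      if (!p.waitFor(10, TimeUnit.SECONDS)) {
        p.destroyForcibly();
      }
    } catch (Exception e) {
      // Best-effort: a failing hook must never block container shutdown.
    }
  }
}
{code}
The timeout is the important design choice: without it, a misbehaving script would hold up container shutdown, turning a troubleshooting aid into an availability problem.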
[jira] [Comment Edited] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973280#comment-16973280 ]

zhoukang edited comment on YARN-9930 at 11/13/19 11:55 AM:
-----------------------------------------------------------
[~pbacsko] Thanks. The background is that we want to upgrade our production cluster to Hadoop 3.x. We used FairScheduler before, and in 3.x we want to move to CapacityScheduler. If we migrate from FS to CS, this behavior will be confusing to users. [~epayne] [~pbacsko] I agree with the point. Add a config like
bq. "yarn.scheduler.capacity.maxrunningapps.reject"

was (Author: cane):
[~pbacsko] Thanks. The background is that we want to upgrade our production cluster to Hadoop 3.x. We used FairScheduler before, and in 3.x we want to move to CapacityScheduler. If we migrate from FS to CS, this behavior will be confusing. [~epayne] [~pbacsko] I agree with the point. Add a config like
bq. "yarn.scheduler.capacity.maxrunningapps.reject"

> Support max running app logic for CapacityScheduler
> ----------------------------------------------------
>
>                 Key: YARN-9930
>                 URL: https://issues.apache.org/jira/browse/YARN-9930
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>
> In FairScheduler, there is a max-running-apps limit that leaves excess applications pending. CapacityScheduler has no such feature; it only has a max-applications limit, and jobs beyond it are rejected directly on the client. In this jira I want to implement the same semantics for CapacityScheduler.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
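For clarity, a sketch of the pending-instead-of-reject semantics being proposed, in the style of FairScheduler's maxRunningApps. The class is illustrative only; it is not CapacityScheduler code, and in practice the limit would be read from a per-queue property such as the hypothetical key quoted in the comment above.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative per-queue gate: submissions beyond the running limit are
// queued as pending (FairScheduler-style) rather than rejected at the
// client (CapacityScheduler's max-applications behavior).
public class MaxRunningAppsGate {
  private final int maxRunningApps;
  private int runningApps = 0;
  private final Queue<String> pendingApps = new ArrayDeque<>();

  public MaxRunningAppsGate(int maxRunningApps) {
    this.maxRunningApps = maxRunningApps;
  }

  /** Returns true if the app can run now; otherwise it is left pending. */
  public synchronized boolean submit(String appId) {
    if (runningApps < maxRunningApps) {
      runningApps++;
      return true;
    }
    pendingApps.add(appId);
    return false;
  }

  /** On completion, returns the next pending app to activate, or null. */
  public synchronized String onAppFinished() {
    runningApps--;
    String next = pendingApps.poll();
    if (next != null) {
      runningApps++;
    }
    return next;
  }
}
{code}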