[jira] [Created] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
Wang, Xinglong created YARN-10735:

Summary: Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
Key: YARN-10735
URL: https://issues.apache.org/jira/browse/YARN-10735
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong

With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher, because no AMRMToken is returned in the ApplicationReport. After some investigation, it turns out that RMAppImpl has a bad if condition inside createAndGetApplicationReport.

{code:java}
21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing Client
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up application submission context for ASM
21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting unmanaged AM
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting application to ASM
21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application application_1618393442264_0002
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application report from ASM for, appId=2, appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service: }, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=hdmi-default, appMasterRpcPort=-1, appStartTime=1618393562917, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, appUser=b_carmel
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM with application attempt id appattempt_1618393442264_0002_01
21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error running Client
java.lang.NullPointerException
        at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
        at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
        at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
{code}

{code:java}
public ApplicationReport createAndGetApplicationReport(String clientUserName,
    boolean allowAccess) {
  ..
  if (currentAttempt != null &&
      currentAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
    if (getApplicationSubmissionContext().getUnmanagedAM() &&
        clientUserName != null && getUser().equals(clientUserName)) {
      Token<AMRMTokenIdentifier> token = currentAttempt.getAMRMToken();
      if (token != null) {
        amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
            token.getKind().toString(), token.getPassword(),
            token.getService().toString());
      }
    }
  }
{code}

clientUserName is the full name of a Kerberos principal, like a...@domain.com, whereas getUser() returns the user recorded in RMAppImpl, which is the short name. The equality check therefore never matches, and the AMRMToken is never populated. A sketch of one possible fix follows below.
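A minimal sketch of one possible fix, assuming the root cause above (illustrative, not the committed patch): translate the caller's full Kerberos principal to its short name before comparing it with the user recorded in RMAppImpl.

{code:java}
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative only: resolve the short name of the client's principal
// (e.g. "alice" from "alice@DOMAIN.COM", per the configured auth_to_local
// rules) so it can be compared with getUser(), which stores the short name.
UserGroupInformation clientUgi =
    UserGroupInformation.createRemoteUser(clientUserName);
if (getApplicationSubmissionContext().getUnmanagedAM()
    && clientUserName != null
    && getUser().equals(clientUgi.getShortUserName())) {
  // populate amrmToken as before
}
{code}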
[jira] [Created] (YARN-9980) App hangs in ACCEPTED when moved from DEFAULT_PARTITION queue to an exclusive partition queue
Wang, Xinglong created YARN-9980:

Summary: App hangs in ACCEPTED when moved from DEFAULT_PARTITION queue to an exclusive partition queue
Key: YARN-9980
URL: https://issues.apache.org/jira/browse/YARN-9980
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong
Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png

An app hangs in ACCEPTED when moved from a DEFAULT_PARTITION queue to an exclusive partition queue.

queue_root
- queue_a: default partition
- queue_b: exclusive partition x, default partition is x

When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, the RM assigns DEFAULT_PARTITION as the app's AM_LABEL_EXPRESSION; the app then gets an attempt am1 and runs. If the app is later moved to queue_b and am1 is preempted/killed/failed, the RM schedules another attempt am2 if the AM retry count allows. But the resource request for am2 still carries AM_LABEL_EXPRESSION = DEFAULT_PARTITION, and queue_b has no resources in DEFAULT_PARTITION, so the app stays in ACCEPTED forever in the RM UI.

My understanding is that, since the app was submitted with no AM_LABEL_EXPRESSION, and the code base already allows such an app to run with the current queue's default partition, the move-queue scenario should also let the app run successfully. That means am2 should get resources from queue_b's default partition x instead of pending forever; a sketch of the idea follows below.

In our production, we have a landing queue on DEFAULT_PARTITION and a routing mechanism that moves apps from this queue to other queues, including queues with exclusive partitions.

!Screen Shot 2019-11-14 at 5.11.39 PM.png!
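A hypothetical sketch of the proposed behavior (the handles amRequest and destQueue are illustrative assumptions, and this is not an actual patch): when an app that never set an AM label expression is moved, re-resolve the AM request's node label expression against the destination queue's default node label expression so the next attempt can be scheduled there.

{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager;

// Illustrative only: amRequest is the pending AM ResourceRequest and
// destQueue is the destination CSQueue after the move.
String amLabel = amRequest.getNodeLabelExpression();
if (amLabel == null || amLabel.isEmpty()
    || amLabel.equals(RMNodeLabelsManager.NO_LABEL)) {
  // Fall back to the destination queue's default partition (x for queue_b)
  // instead of keeping the stale DEFAULT_PARTITION from the old queue.
  amRequest.setNodeLabelExpression(
      destQueue.getDefaultNodeLabelExpression());
}
{code}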
[jira] [Created] (YARN-9854) RM jetty hangs because WebAppProxyServlet lacks a timeout while doing proxyLink
Wang, Xinglong created YARN-9854:

Summary: RM jetty hangs because WebAppProxyServlet lacks a timeout while doing proxyLink
Key: YARN-9854
URL: https://issues.apache.org/jira/browse/YARN-9854
Project: Hadoop YARN
Issue Type: Improvement
Components: amrmproxy, resourcemanager, webapp
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong

The RM proxies url requests like [http://rm:port/proxy/application_x] to the AM or the related history server.

Recently we met https://issues.apache.org/jira/browse/SPARK-26961, which can cause a Spark AM to hang forever. We also have a monitor tool that accesses [http://rm:port/proxy/application_x] periodically. As a result, every proxied connection to the hung Spark AM also hangs forever, because WebAppProxyServlet sets no socket connection timeout when it initializes the HttpClient toward the Spark AM.

The jetty server hosting the RM servlets has a limited thread pool. Each such request ties up one jetty thread waiting for a response from the Spark AM. Eventually all jetty threads serving HTTP traffic are stuck, and none of the RM web links respond. Giving the HttpClient a timeout configuration would avoid this issue; a sketch follows below.
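A hedged sketch of the kind of timeout configuration described above, assuming Apache HttpClient 4.x; the exact code path inside WebAppProxyServlet and the 60-second value are illustrative.

{code:java}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Illustrative values: fail fast instead of letting a jetty thread
// wait forever on a hung AM.
RequestConfig config = RequestConfig.custom()
    .setConnectTimeout(60 * 1000)   // max time to establish the TCP connection
    .setSocketTimeout(60 * 1000)    // max idle time waiting for response data
    .build();
CloseableHttpClient client = HttpClients.custom()
    .setDefaultRequestConfig(config)
    .build();
{code}

With either timeout in place, a request to a hung AM fails with a timeout exception after a bounded wait, freeing the jetty thread to serve other traffic.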
[jira] [Created] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
Wang, Xinglong created YARN-9847:

Summary: ZKRMStateStore will cause ZK connection loss when writing huge data into znode
Key: YARN-9847
URL: https://issues.apache.org/jira/browse/YARN-9847
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong

Recently, we encountered an RM ZK connection issue because the RM was trying to write huge data into a znode. This makes ZooKeeper report a Len error and then lose the session connection, and eventually the RM crashes because of the ZK connection issue.

*The fix*

To protect the ResourceManager from crashing because of this, the fix limits the size of the data written per attempt by truncating the diagnostic info before writing ApplicationAttemptStateData into the znode. The size limit is taken from -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the ZooKeeper server. A sketch of the idea follows the logs below.

*The story*

ResourceManager log:

{code:java}
2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 0x36ab902369100a0 for server abc-zk-5.vip.ebay.com/10.210.82.29:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2019-07-29 04:27:35,459 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
        at java.lang.Thread.run(Thread.java:745)
{code}

The ResourceManager retries the ZK connection until it exhausts its retry count, then gives up.

{code:java}
2019-07-29 02:25:06,404 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 999
2019-07-29 02:25:06,718 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2019-07-29 02:25:06,718 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server
2019-07-29 02:25:06,404 INFO
{code}
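A hedged sketch of the mitigation described under *The fix* (illustrative only; attemptStateData and its accessors are assumptions, not the committed patch):

{code:java}
// Illustrative only: cap the diagnostics so the serialized
// ApplicationAttemptStateData stays under ZooKeeper's jute.maxbuffer.
// 0xfffff (just under 1 MB) is ZooKeeper's default for jute.maxbuffer.
// Note the real limit applies to serialized bytes, so comparing the
// string length is an approximation.
int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff);
String diagnostics = attemptStateData.getDiagnostics();
if (diagnostics != null && diagnostics.length() > maxBuffer) {
  // Keep the tail of the diagnostics, which usually holds the most
  // recent (and most useful) failure information.
  attemptStateData.setDiagnostics(
      diagnostics.substring(diagnostics.length() - maxBuffer));
}
{code}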
[jira] [Created] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested
Wang, Xinglong created YARN-9494:

Summary: ApplicationHistoryServer endpoint access wrongly requested
Key: YARN-9494
URL: https://issues.apache.org/jira/browse/YARN-9494
Project: Hadoop YARN
Issue Type: Bug
Components: ATSv2
Reporter: Wang, Xinglong

With the following configuration, the resource manager redirects https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/ to 0.0.0.0:10200 when it can't find application_1553677175329_47053 in the application manager.

{code:java}
yarn.timeline-service.enabled = false
yarn.timeline-service.generic-application-history.enabled = true
{code}

However, in this case no timeline service is enabled, so yarn.timeline-service.address is not defined and the default 0.0.0.0:10200 is used as the timeline server access point. This combination of configuration is valid for us: we have an in-house tool that analyzes the generic-application-history files generated by the resource manager, while we don't enable the timeline service. A sketch of a possible guard follows the stack trace below.

{code:java}
HTTP ERROR 500
Problem accessing /proxy/application_1553677175329_47053/. Reason:
Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Caused by: java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
        at org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
        at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
        at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
        at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
        at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
        at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
        at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
        at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
        at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
        at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
        at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:617)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:576)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at
{code}
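A hedged sketch of one possible guard (illustrative only, not a committed patch; the YarnConfiguration constants exist, but conf and this exact check are assumptions): skip the history-server fallback unless the timeline service is actually enabled, instead of dialing the default 0.0.0.0:10200 address.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative only: decide whether the AHS fallback should be attempted.
boolean historyEnabled = conf.getBoolean(
    YarnConfiguration.APPLICATION_HISTORY_ENABLED,
    YarnConfiguration.DEFAULT_APPLICATION_HISTORY_ENABLED);
boolean timelineEnabled = conf.getBoolean(
    YarnConfiguration.TIMELINE_SERVICE_ENABLED,
    YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
if (!(historyEnabled && timelineEnabled)) {
  // Don't fall back to the (default 0.0.0.0:10200) history address;
  // surface an "application not found" error to the client instead.
}
{code}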