[jira] [Created] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster

2021-04-14 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-10735:
-

 Summary: Unmanaged AM won't populate AMRMToken to 
ApplicationReport in secure cluster
 Key: YARN-10735
 URL: https://issues.apache.org/jira/browse/YARN-10735
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


With Kerberos enabled, an NPE is thrown when launching UnmanagedAMLauncher.
This happens because no AMRMToken is returned in the ApplicationReport. After
some investigation, it turns out that RMAppImpl has a faulty if condition
inside createAndGetApplicationReport.

{code:java}
21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing 
Client
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History 
server at /0.0.0.0:10200
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up 
application submission context for ASM
21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting 
unmanaged AM
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting 
application to ASM
21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application 
application_1618393442264_0002
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application 
report from ASM for, appId=2, 
appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { 
kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=AM container is launched, 
waiting for AM container to Register with RM, appMasterHost=N/A, 
appQueue=hdmi-default, appMasterRpcPort=-1, appStartTime=1618393562917, 
yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, 
appUser=b_carmel
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM 
with application attempt id appattempt_1618393442264_0002_01
21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error running 
Client
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
{code}

 

{code:java}
 public ApplicationReport createAndGetApplicationReport(String clientUserName,
  boolean allowAccess) {
..
if (currentAttempt != null && 
currentAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
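  // BUG: getUser() returns a short name, so the comparison below fails when
  // clientUserName is a full Kerberos principal (see explanation below).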
  if (getApplicationSubmissionContext().getUnmanagedAM() &&
  clientUserName != null && getUser().equals(clientUserName)) {
Token token = currentAttempt.getAMRMToken();
if (token != null) {
  amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
  token.getKind().toString(), token.getPassword(),
  token.getService().toString());
}
  }
}
{code}

clientUserName is the full name of a Kerberos principal, like a...@domain.com,
whereas getUser() returns the username recorded in RMAppImpl, which is the
short name, so the equality check never succeeds in a secure cluster.
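
A minimal sketch of one possible fix (illustrative, not the committed patch):
normalize clientUserName to its short name before comparing.
UserGroupInformation.createRemoteUser(...).getShortUserName() applies the
configured auth_to_local rules, mapping "user@REALM" to its short name.

{code:java}
// Sketch only, inside RMAppImpl#createAndGetApplicationReport():
String shortClientUserName = (clientUserName == null) ? null
    : UserGroupInformation.createRemoteUser(clientUserName).getShortUserName();
if (getApplicationSubmissionContext().getUnmanagedAM()
    && shortClientUserName != null
    && getUser().equals(shortClientUserName)) {
  // ... populate amrmToken from currentAttempt.getAMRMToken() as before ...
}
{code}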



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Created] (YARN-9980) App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue

2019-11-14 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9980:


 Summary: App hangs in accepted when moved from DEFAULT_PARTITION 
queue to an exclusive partition queue
 Key: YARN-9980
 URL: https://issues.apache.org/jira/browse/YARN-9980
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong
 Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png

The app hangs in ACCEPTED state when moved from a DEFAULT_PARTITION queue to
an exclusive partition queue.

queue_root
  queue_a - default_partition
  queue_b - exclusive partition x, default partition is x

When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, the RM
assigns default_partition as the app's AM_LABEL_EXPRESSION, and the app gets
an am1 and runs. If the app is later moved to queue_b and am1 is
preempted/killed/failed, the RM schedules another attempt, am2, if the AM
retry count allows. But the resource request for am2 still carries
AM_LABEL_EXPRESSION = default_partition, and since queue_b has no resource in
default_partition, the app stays in ACCEPTED state forever in the RM UI.

My understanding is that, since the app was submitted with no
AM_LABEL_EXPRESSION, the code base already allows such an app to run in its
current queue's default partition. For the move-queue scenario, we should
likewise let the app run successfully: am2 should get resources from queue_b's
default partition x instead of pending forever.
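
A hypothetical sketch of that fallback (the helper and the userSetAmLabel flag
are illustrative assumptions, not the committed patch; the CSQueue and
ResourceRequest methods are real YARN APIs):

{code:java}
// Sketch only: when building the resource request for a new AM attempt, fall
// back to the app's *current* queue's default partition if the user never set
// AM_LABEL_EXPRESSION explicitly at submission time.
static void relabelAmRequest(ResourceRequest amRequest, CSQueue currentQueue,
    boolean userSetAmLabel) {
  if (!userSetAmLabel) {
    // After the move this is queue_b, whose default partition is "x".
    amRequest.setNodeLabelExpression(
        currentQueue.getDefaultNodeLabelExpression());
  }
}
{code}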

In our production environment we have a landing queue on default_partition,
and a routing mechanism that moves apps from this queue to other queues,
including queues with exclusive partitions.

 !Screen Shot 2019-11-14 at 5.11.39 PM.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Created] (YARN-9854) RM jetty hangs because WebAppProxyServlet lacks a timeout while doing proxyLink

2019-09-24 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9854:


 Summary: RM jetty hangs because WebAppProxyServlet lacks a timeout 
while doing proxyLink
 Key: YARN-9854
 URL: https://issues.apache.org/jira/browse/YARN-9854
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: amrmproxy, resourcemanager, webapp
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


The RM proxies requests for [http://rm:port/proxy/application_x] on to the AM
or the related history server.

Recently we hit https://issues.apache.org/jira/browse/SPARK-26961, which
causes a Spark AM to hang forever.

We also have a monitoring tool that accesses
[http://rm:port/proxy/application_x] periodically. Every proxied connection to
the hung Spark AM therefore also hangs forever, because WebAppProxyServlet
sets no socket connection timeout when it initializes the httpclient toward
the Spark AM.

 

The jetty server hosting the RM servlets has a limited thread pool. Each such
request hangs one thread waiting for the Spark AM's response, and eventually
all jetty threads serving HTTP traffic hang, leaving every RM web link
unresponsive.

 

If we configure a timeout on the httpclient, we will be free of this issue.
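
A hedged sketch of the idea (the 60-second values and the builder wiring are
illustrative, not the committed patch), using the Apache HttpClient 4.x
request-config API:

{code:java}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Give the proxy's client explicit timeouts so a hung AM cannot pin a jetty
// worker thread forever.
RequestConfig requestConfig = RequestConfig.custom()
    .setConnectTimeout(60_000)  // fail fast if the AM never accepts the socket
    .setSocketTimeout(60_000)   // fail if the AM accepts but never responds
    .build();
CloseableHttpClient client = HttpClients.custom()
    .setDefaultRequestConfig(requestConfig)
    .build();
{code}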



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Created] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-19 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9847:


 Summary: ZKRMStateStore will cause zk connection loss when writing 
huge data into znode
 Key: YARN-9847
 URL: https://issues.apache.org/jira/browse/YARN-9847
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


Recently, we encountered an RM ZK connection issue caused by the RM trying to
write huge data into a znode. This makes ZooKeeper report a Len error and then
drop the session connection, and eventually the RM crashes because of the ZK
connection issue.

*The fix*

To protect the ResourceManager from crashing because of this, the fix limits
the size of the data stored per attempt by capping the diagnostic info when
writing ApplicationAttemptStateData into the znode. The cap is regulated by
-Djute.maxbuffer set in yarn-env.sh; the same value is also used by the
zookeeper server.
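
A minimal sketch of the capping idea (the attemptState accessors and the
tail-truncation choice are illustrative assumptions, not the committed patch):

{code:java}
// ZooKeeper's jute.maxbuffer defaults to 1 MB; read the same system property
// the ZK client uses so the cap agrees with the server-side limit.
int maxBuffer = Integer.getInteger("jute.maxbuffer", 1024 * 1024);

String diagnostics = attemptState.getDiagnostics();
if (diagnostics != null && diagnostics.length() > maxBuffer) {
  // Keep the tail: the most recent diagnostics are usually the most useful.
  attemptState.setDiagnostics(
      diagnostics.substring(diagnostics.length() - maxBuffer));
}
{code}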

*The story*

ResourceManager Log
{code:java}
2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
0x36ab902369100a0 for server abc-zk-5.vip.ebay.com/10.210.82.29:2181, unexpected 
error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

2019-07-29 04:27:35,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
{code}


The ResourceManager retries the ZooKeeper operation until the retry count is
exhausted and then gives up.

{code:java}
2019-07-29 02:25:06,404 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 999


2019-07-29 02:25:06,718 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: 
Client will use GSSAPI as SASL mechanism.
2019-07-29 02:25:06,718 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server
{code}

[jira] [Created] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested

2019-04-17 Thread Wang, Xinglong (JIRA)
Wang, Xinglong created YARN-9494:


 Summary: ApplicationHistoryServer endpoint access wrongly requested
 Key: YARN-9494
 URL: https://issues.apache.org/jira/browse/YARN-9494
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Reporter: Wang, Xinglong


With the following configuration, the resource manager redirects
https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/ 
to 0.0.0.0:10200 when it can't find application_1553677175329_47053 in
applicationManager.

{code:java}
yarn.timeline-service.enabled = false
yarn.timeline-service.generic-application-history.enabled = true
{code}

However, in this case no timeline service is enabled, so no
yarn.timeline-service.address is defined, and the default 0.0.0.0:10200 is
used as the timeline server access point.
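
As a workaround on our side (an assumption, not a fix proposed in this issue),
explicitly defining yarn.timeline-service.address would point the redirect at
a real history server host instead of the 0.0.0.0:10200 default; the hostname
below is hypothetical:

{code:java}
yarn.timeline-service.address = historyserver.example.com:10200
{code}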

This combination of settings is a valid configuration: we have an in-house
tool to analyze the generic-application-history files generated by the
resource manager, while we don't enable the timeline service. The resulting
proxy error is shown below.

{code:java}
HTTP ERROR 500

Problem accessing /proxy/application_1553677175329_47053/. Reason:

Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused

Caused by:

java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 failed 
on connection exception: java.net.ConnectException: Connection refused; For 
more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
at org.apache.hadoop.ipc.Client.call(Client.java:1498)
at org.apache.hadoop.ipc.Client.call(Client.java:1398)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
at 
org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:617)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:576)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at